[PATCH 0/2] publish: Always render nar/narinfo during backing.

  • Open
  • quality assurance status badge
Details
2 participants
  • Ludovic Courtès
  • Mathieu Othacehe
Owner
unassigned
Submitted by
Mathieu Othacehe
Severity
normal
M
M
Mathieu Othacehe wrote on 13 Aug 2021 12:28
(address . guix-patches@gnu.org)(name . Mathieu Othacehe)(address . othacehe@gnu.org)
20210813102800.805-1-othacehe@gnu.org
Hello,

This patchset does two things:

* It ensures that the narinfo creation is always done in a dedicated thread to
prevent blocking the main publish thread and cause the timeout connection
errors that some users experimented.

* It removes the cache bypass support and always renders temporary narinfo and
nar files during baking. This way, the first user asking for a substitute
won't get a 404 error. It also removes the need for Cuirass to pre-bake nar
files. Finally it will make way easier to use a publish server with caching
in a CI context.

The downside is that it makes the publish server code harder to understand and
that it consumes more server resources by duplicating the nar/narinfo file
creation.

In the long run, spawning new threads constantly won't scale. I think it would
be nice to rewrite the publish server using Fiber or an HTTP server framework.

Thanks,

Mathieu

Mathieu Othacehe (2):
publish: Defer narinfo string creation to the http-write.
publish: Remove cache bypass support.

doc/guix.texi | 15 -----
gnu/services/base.scm | 11 +---
guix/scripts/publish.scm | 136 +++++++++++++++++++++++----------------
tests/publish.scm | 88 +++----------------------
4 files changed, 93 insertions(+), 157 deletions(-)

--
2.32.0
M
M
Mathieu Othacehe wrote on 13 Aug 2021 12:30
[PATCH 1/2] publish: Defer narinfo string creation to the http-write.
(address . 50040@debbugs.gnu.org)(name . Mathieu Othacehe)(address . othacehe@gnu.org)
20210813103030.1017-1-othacehe@gnu.org
The "narinfo-string" procedure is expensive in term of IO operations and can
take a while under IO pressure, such a GC collecting. Defer its call to a new
thread created in the http-write procedure.


* guix/scripts/publish.scm (render-narinfo): Defer the narinfo string creation
to the http-write procedure.
(compression->sexp, sexp->compression): New procedures.
("X-Nar-Compression"): Use them.
("X-Narinfo-Compressions"): New custom header.
(strip-headers): Add the x-nar-path header.
(http-write): Add narinfo on-the-fly creation support. It happens in a
separated thread to prevent blocking the main thread.
---
guix/scripts/publish.scm | 82 +++++++++++++++++++++++++++++++++-------
1 file changed, 69 insertions(+), 13 deletions(-)

Toggle diff (133 lines)
diff --git a/guix/scripts/publish.scm b/guix/scripts/publish.scm
index 913cbd4fda..981ef8d267 100644
--- a/guix/scripts/publish.scm
+++ b/guix/scripts/publish.scm
@@ -24,6 +24,7 @@
#:use-module ((system repl server) #:prefix repl:)
#:use-module (ice-9 binary-ports)
#:use-module (ice-9 format)
+ #:use-module (ice-9 iconv)
#:use-module (ice-9 match)
#:use-module (ice-9 poll)
#:use-module (ice-9 regex)
@@ -409,15 +410,18 @@ appropriate duration. NAR-PATH specifies the prefix for nar URLs."
(let ((store-path (hash-part->path store hash)))
(if (string-null? store-path)
(not-found request #:phrase "" #:ttl negative-ttl)
- (values `((content-type . (application/x-nix-narinfo))
+ (values `((content-type . (application/x-nix-narinfo
+ (charset . "UTF-8")))
+ (x-nar-path . ,nar-path)
+ (x-narinfo-compressions . ,compressions)
,@(if ttl
`((cache-control (max-age . ,ttl)))
'()))
- (cut display
- (narinfo-string store store-path
- #:nar-path nar-path
- #:compressions compressions)
- <>)))))
+ ;; Do not call narinfo-string directly here as it is an
+ ;; expensive call that could potentially block the main
+ ;; thread. Instead, create the narinfo string in the
+ ;; http-write procedure.
+ store-path))))
(define* (nar-cache-file directory item
#:key (compression %no-compression))
@@ -672,19 +676,38 @@ requested using POOL."
(link narinfo other)))
others))))))
+(define (compression->sexp compression)
+ "Return the SEXP representation of COMPRESSION."
+ (match compression
+ (($ <compression> type level)
+ `(compression ,type ,level))))
+
+(define (sexp->compression sexp)
+ "Turn the given SEXP into a <compression> record and return it."
+ (match sexp
+ (('compression type level)
+ (compression type level))))
+
;; XXX: Declare the 'X-Nar-Compression' HTTP header, which is in fact for
;; internal consumption: it allows us to pass the compression info to
;; 'http-write', as part of the workaround to <http://bugs.gnu.org/21093>.
(declare-header! "X-Nar-Compression"
(lambda (str)
- (match (call-with-input-string str read)
- (('compression type level)
- (compression type level))))
+ (sexp->compression
+ (call-with-input-string str read)))
compression?
(lambda (compression port)
- (match compression
- (($ <compression> type level)
- (write `(compression ,type ,level) port)))))
+ (write (compression->sexp compression) port)))
+
+;; This header is used to pass the supported compressions to http-write in
+;; order to format on-the-fly narinfo responses.
+(declare-header! "X-Narinfo-Compressions"
+ (lambda (str)
+ (map sexp->compression
+ (call-with-input-string str read)))
+ (cut every compression? <>)
+ (lambda (compressions port)
+ (write (map compression->sexp compressions) port)))
(define* (render-nar store request store-item
#:key (compression %no-compression))
@@ -839,7 +862,8 @@ example: \"/foo/bar\" yields '(\"foo\" \"bar\")."
"Return RESPONSE's headers minus 'Content-Length' and our internal headers."
(fold alist-delete
(response-headers response)
- '(content-length x-raw-file x-nar-compression)))
+ '(content-length x-raw-file x-nar-compression
+ x-narinfo-compressions x-nar-path)))
(define (sans-content-length response)
"Return RESPONSE without its 'content-length' header."
@@ -973,6 +997,38 @@ blocking."
(unless keep-alive?
(close-port client)))
(values))))))
+ (('application/x-nix-narinfo . _)
+ (let ((compressions (assoc-ref (response-headers response)
+ 'x-narinfo-compressions))
+ (nar-path (assoc-ref (response-headers response)
+ 'x-nar-path)))
+ (if nar-path
+ (begin
+ (when (keep-alive? response)
+ (keep-alive client))
+ (call-with-new-thread
+ (lambda ()
+ (set-thread-name "publish narinfo")
+ (let* ((narinfo
+ (with-store store
+ (narinfo-string store (utf8->string body)
+ #:nar-path nar-path
+ #:compressions compressions)))
+ (narinfo-bv (string->bytevector narinfo "UTF-8"))
+ (narinfo-length
+ (bytevector-length narinfo-bv))
+ (response (write-response
+ (with-content-length response
+ narinfo-length)
+ client))
+ (output (response-port response)))
+ (configure-socket client)
+ (put-bytevector output narinfo-bv)
+ (force-output output)
+ (unless (keep-alive? response)
+ (close-port output))
+ (values)))))
+ (%http-write server client response body))))
(_
(match (assoc-ref (response-headers response) 'x-raw-file)
((? string? file)
--
2.32.0
M
M
Mathieu Othacehe wrote on 13 Aug 2021 12:30
[PATCH 2/2] publish: Remove cache bypass support.
(address . 50040@debbugs.gnu.org)(name . Mathieu Othacehe)(address . othacehe@gnu.org)
20210813103030.1017-2-othacehe@gnu.org
Unconditionally render narinfo and nar files during baking. This avoids the
need for pre-baking nar with a CI system. It also prevents receiving 404
errors during baking.

* guix/scripts/publish.scm (%options): Remove the cache-bypass-threshold
option.
(show-help): Ditto.
(cache-bypass-threshold, bypass-cache?): Remove them.
(render-narinfo/cached): Render a temporary narinfo unconditionally during
baking.
(render-nar/cached): Render a live nar unconditionally during baking.
(guix-publish): Remove cache bypass option.
* gnu/services/base.scm (<guix-publish-configuration>)
[cache-bypass-threshold]: Remove it
(guix-publish-shepherd-service): Adapt it.
* doc/guix.texi (Base services, Invoking guix publish): Document it.
* tests/publish.scm ("with cache", "with cache, lzip + gzip", "with cache,
uncompressed"): Adapt them.
("with cache, cache bypass", "with cache, cache bypass, unmapped hash part"):
Remove them.
---
doc/guix.texi | 28 -------------
gnu/services/base.scm | 11 +----
guix/scripts/publish.scm | 54 ++++++------------------
tests/publish.scm | 88 ++++------------------------------------
4 files changed, 24 insertions(+), 157 deletions(-)

Toggle diff (351 lines)
diff --git a/doc/guix.texi b/doc/guix.texi
index 78c1c09858..adb2ada4e8 100644
--- a/doc/guix.texi
+++ b/doc/guix.texi
@@ -12752,13 +12752,6 @@ archive is cached in @var{directory}, subsequent requests succeed and
are served directly from the cache, which guarantees that clients get
the best possible bandwidth.
-That first @code{.narinfo} request nonetheless returns 200, provided the
-requested store item is ``small enough'', below the cache bypass
-threshold---see @option{--cache-bypass-threshold} below. That way,
-clients do not have to wait until the archive is baked. For larger
-store items, the first @code{.narinfo} request returns 404, meaning that
-clients have to wait until the archive is baked.
-
The ``baking'' process is performed by worker threads. By default, one
thread per CPU core is created, but this can be customized. See
@option{--workers} below.
@@ -12794,21 +12787,6 @@ This parameter can help adjust server load and substitute latency by
instructing cooperating clients to be more or less patient when a store
item is missing.
-@item --cache-bypass-threshold=@var{size}
-When used in conjunction with @option{--cache}, store items smaller than
-@var{size} are immediately available, even when they are not yet in
-cache. @var{size} is a size in bytes, or it can be suffixed by @code{M}
-for megabytes and so on. The default is @code{10M}.
-
-``Cache bypass'' allows you to reduce the publication delay for clients
-at the expense of possibly additional I/O and CPU use on the server
-side: depending on the client access patterns, those store items can end
-up being baked several times until a copy is available in cache.
-
-Increasing the threshold may be useful for sites that have few users, or
-to guarantee that users get substitutes even for store items that are
-not popular.
-
@item --nar-path=@var{path}
Use @var{path} as the prefix for the URLs of ``nar'' files
(@pxref{Invoking guix archive, normalized archives}).
@@ -15818,12 +15796,6 @@ When it is an integer, this is the number of worker threads used for
caching; when @code{#f}, the number of processors is used.
@xref{Invoking guix publish, @option{--workers}}, for more information.
-@item @code{cache-bypass-threshold} (default: 10 MiB)
-When @code{cache} is true, this is the maximum size in bytes of a store
-item for which @command{guix publish} may bypass its cache in case of a
-cache miss. @xref{Invoking guix publish,
-@option{--cache-bypass-threshold}}, for more information.
-
@item @code{ttl} (default: @code{#f})
When it is an integer, this denotes the @dfn{time-to-live} in seconds
of the published archives. @xref{Invoking guix publish, @option{--ttl}},
diff --git a/gnu/services/base.scm b/gnu/services/base.scm
index c784d312b1..ac469abf8c 100644
--- a/gnu/services/base.scm
+++ b/gnu/services/base.scm
@@ -1775,8 +1775,6 @@ proxy of 'guix-daemon'...~%")
(default "nar"))
(cache guix-publish-configuration-cache ;#f | string
(default #f))
- (cache-bypass-threshold guix-publish-configuration-cache-bypass-threshold
- (default (* 10 (expt 2 20)))) ;integer
(workers guix-publish-configuration-workers ;#f | integer
(default #f))
(ttl guix-publish-configuration-ttl ;#f | integer
@@ -1815,8 +1813,7 @@ raise a deprecation warning if the 'compression-level' field was used."
lst))))
(match-record config <guix-publish-configuration>
- (guix port host nar-path cache workers ttl cache-bypass-threshold
- advertise?)
+ (guix port host nar-path cache workers ttl advertise?)
(list (shepherd-service
(provision '(guix-publish))
(requirement `(user-processes
@@ -1843,11 +1840,7 @@ raise a deprecation warning if the 'compression-level' field was used."
"s"))
#~())
#$@(if cache
- #~((string-append "--cache=" #$cache)
- #$(string-append
- "--cache-bypass-threshold="
- (number->string
- cache-bypass-threshold)))
+ #~((string-append "--cache=" #$cache))
#~()))
;; Make sure we run in a UTF-8 locale so we can produce
diff --git a/guix/scripts/publish.scm b/guix/scripts/publish.scm
index 981ef8d267..64f444e901 100644
--- a/guix/scripts/publish.scm
+++ b/guix/scripts/publish.scm
@@ -97,9 +97,6 @@ Publish ~a over HTTP.\n") %store-directory)
compress archives with METHOD at LEVEL"))
(display (G_ "
-c, --cache=DIRECTORY cache published items to DIRECTORY"))
- (display (G_ "
- --cache-bypass-threshold=SIZE
- serve store items below SIZE even when not cached"))
(display (G_ "
--workers=N use N workers to bake items"))
(display (G_ "
@@ -214,10 +211,6 @@ usage."
(option '(#\c "cache") #t #f
(lambda (opt name arg result)
(alist-cons 'cache arg result)))
- (option '("cache-bypass-threshold") #t #f
- (lambda (opt name arg result)
- (alist-cons 'cache-bypass-threshold (size->number arg)
- result)))
(option '("workers") #t #f
(lambda (opt name arg result)
(alist-cons 'workers (string->number* arg)
@@ -513,18 +506,6 @@ vanished from the store in the meantime."
result))
(apply throw args))))))
-(define cache-bypass-threshold
- ;; Maximum size of a store item that may be served by the '/cached' handlers
- ;; below even when not in cache.
- (make-parameter (* 10 (expt 2 20))))
-
-(define (bypass-cache? store item)
- "Return true if we allow ITEM to be downloaded before it is cached. ITEM is
-interpreted as the basename of a store item."
- (guard (c ((store-error? c) #f))
- (< (path-info-nar-size (query-path-info store item))
- (cache-bypass-threshold))))
-
(define* (render-narinfo/cached store request hash
#:key ttl (compressions (list %no-compression))
(nar-path "nar") negative-ttl
@@ -584,19 +565,14 @@ requested using POOL."
#:delete-entry delete-entry
#:cleanup-period ttl))))
- ;; If ITEM passes 'bypass-cache?', render a temporary narinfo right
- ;; away, with a short TTL. The narinfo is temporary because it
- ;; lacks 'FileSize', for instance, which the cached narinfo will
- ;; have. Chances are that the nar will be baked by the time the
- ;; client asks for it.
- (if (bypass-cache? store item)
- (render-narinfo store request hash
- #:ttl 300 ;temporary
- #:nar-path nar-path
- #:compressions compressions)
- (not-found request
- #:phrase "We're baking it"
- #:ttl 300))) ;should be available within 5m
+ ;; Render a temporary narinfo right away, with a short TTL. The
+ ;; narinfo is temporary because it lacks 'FileSize', for instance,
+ ;; which the cached narinfo will have. Chances are that the nar
+ ;; will be baked by the time the client asks for it.
+ (render-narinfo store request hash
+ #:ttl 300 ;temporary
+ #:nar-path nar-path
+ #:compressions compressions))
(else
(not-found request #:phrase "" #:ttl negative-ttl)))))
@@ -740,9 +716,9 @@ return it; otherwise, return 404. When TTL is true, use it as the
`((cache-control (max-age . ,ttl)))
'())
- ;; XXX: We're not returning the actual contents, deferring
- ;; instead to 'http-write'. This is a hack to work around
- ;; <http://bugs.gnu.org/21093>.
+ ;; XXX: We're not returning the actual contents,
+ ;; deferring instead to 'http-write'. This is a hack to
+ ;; work around <http://bugs.gnu.org/21093>.
(x-raw-file . ,cached))
#f))
((let* ((hash (and=> (string-index store-item #\-)
@@ -750,8 +726,7 @@ return it; otherwise, return 404. When TTL is true, use it as the
(item (and hash
(guard (c ((store-error? c) #f))
(hash-part->path store hash)))))
- (and item (not (string-null? item))
- (bypass-cache? store item)))
+ (and item (not (string-null? item))))
;; Render STORE-ITEM live. We reach this because STORE-ITEM is
;; being baked but clients are already asking for it. Thus, we're
;; duplicating work, but doing so allows us to reduce delays.
@@ -1306,10 +1281,7 @@ headers."
consider using the '--user' option!~%")))
(parameterize ((%public-key public-key)
- (%private-key private-key)
- (cache-bypass-threshold
- (or (assoc-ref opts 'cache-bypass-threshold)
- (cache-bypass-threshold))))
+ (%private-key private-key))
(info (G_ "publishing ~a on ~a, port ~d~%")
%store-directory
(inet-ntop (sockaddr:fam address) (sockaddr:addr address))
diff --git a/tests/publish.scm b/tests/publish.scm
index c3d086995a..43871a2a93 100644
--- a/tests/publish.scm
+++ b/tests/publish.scm
@@ -422,15 +422,14 @@ References: ~%"
200 ;nar/gzip/…
#t ;Content-Length
#t ;FileSize
- 404) ;nar/…
+ 200) ;nar/…
(call-with-temporary-directory
(lambda (cache)
(let ((thread (with-separate-output-ports
(call-with-new-thread
(lambda ()
(guix-publish "--port=6797" "-C2"
- (string-append "--cache=" cache)
- "--cache-bypass-threshold=0"))))))
+ (string-append "--cache=" cache)))))))
(wait-until-ready 6797)
(let* ((base "http://localhost:6797/")
(part (store-path-hash-part %item))
@@ -441,9 +440,9 @@ References: ~%"
(nar (string-append cache "/gzip/"
(basename %item) ".nar"))
(response (http-get url)))
- (and (= 404 (response-code response))
+ (and (= 200 (response-code response))
- ;; We should get an explicitly short TTL for 404 in this case
+ ;; We should get an explicitly short TTL for 200 in this case
;; because it's going to become 200 shortly.
(match (assq-ref (response-headers response) 'cache-control)
((('max-age . ttl))
@@ -477,15 +476,14 @@ References: ~%"
(response-code uncompressed)))))))))
(test-equal "with cache, lzip + gzip"
- '(200 200 404)
+ '(200 200 200)
(call-with-temporary-directory
(lambda (cache)
(let ((thread (with-separate-output-ports
(call-with-new-thread
(lambda ()
(guix-publish "--port=6794" "-Cgzip:2" "-Clzip:2"
- (string-append "--cache=" cache)
- "--cache-bypass-threshold=0"))))))
+ (string-append "--cache=" cache)))))))
(wait-until-ready 6794)
(let* ((base "http://localhost:6794/")
(part (store-path-hash-part %item))
@@ -533,15 +531,14 @@ References: ~%"
(* 42 3600) ;TTL on nar/…
(path-info-nar-size
(query-path-info %store item)) ;FileSize
- 404) ;nar/gzip/…
+ 200) ;nar/gzip/…
(call-with-temporary-directory
(lambda (cache)
(let ((thread (with-separate-output-ports
(call-with-new-thread
(lambda ()
(guix-publish "--port=6796" "-C2" "--ttl=42h"
- (string-append "--cache=" cache)
- "--cache-bypass-threshold=0"))))))
+ (string-append "--cache=" cache)))))))
(wait-until-ready 6796)
(let* ((base "http://localhost:6796/")
(part (store-path-hash-part item))
@@ -551,7 +548,7 @@ References: ~%"
(nar (string-append cache "/none/"
(basename item) ".nar"))
(response (http-get url)))
- (and (= 404 (response-code response))
+ (and (= 200 (response-code response))
(wait-for-file cached)
(let* ((response (http-get url))
@@ -611,73 +608,6 @@ References: ~%"
(delete-paths %store (list item))
(response-code (pk 'response (http-get url))))))))))
-(test-equal "with cache, cache bypass"
- 200
- (call-with-temporary-directory
- (lambda (cache)
- (let ((thread (with-separate-output-ports
- (call-with-new-thread
- (lambda ()
- (guix-publish "--port=6788" "-C" "gzip"
- (string-append "--cache=" cache)))))))
- (wait-until-ready 6788)
-
- (let* ((base "http://localhost:6788/")
- (item (add-text-to-store %store "random" (random-text)))
- (part (store-path-hash-part item))
- (narinfo (string-append base part ".narinfo"))
- (nar (string-append base "nar/gzip/" (basename item)))
- (cached (string-append cache "/gzip/" (basename item)
- ".narinfo")))
- ;; We're below the default cache bypass threshold, so NAR and NARINFO
- ;; should immediately return 200. The NARINFO request should trigger
- ;; caching, and the next request to NAR should return 200 as well.
- (and (let ((response (pk 'r1 (http-get nar))))
- (and (= 200 (response-code response))
- (not (response-content-length response)))) ;not known
- (= 200 (response-code (http-get narinfo)))
- (begin
- (wait-for-file cached)
- (let ((response (pk 'r2 (http-get nar))))
- (and (> (response-content-length response)
- (stat:size (stat item)))
- (response-code response))))))))))
-
-(test-equal "with cache, cache bypass, unmapped hash part"
- 200
-
- ;; This test reproduces the bug described in <https://bugs.gnu.org/44442>:
- ;; the daemon connection would be closed as a side effect of a nar request
- ;; for a non-existing file name.
- (call-with-temporary-directory
- (lambda (cache)
- (let ((thread (with-separate-output-ports
- (call-with-new-thread
- (lambda ()
- (guix-publish "--port=6787" "-C" "gzip"
- (string-append "--cache=" cache)))))))
- (wait-until-ready 6787)
-
- (let* ((base "http://localhost:6787/")
- (item (add-text-to-store %store "random" (random-text)))
- (part (store-path-hash-part item))
- (narinfo (string-append base part ".narinfo"))
- (nar (string-append base "nar/gzip/" (basename item)))
- (cached (string-append cache "/gzip/" (basename item)
- ".narinfo")))
- ;; The first response used to be 500 and to terminate the daemon
- ;; connection as a side effect.
- (and (= (response-code
- (http-get (string-append base "nar/gzip/"
- (make-string 32 #\e)
- "-does-not-exist")))
- 404)
- (= 200 (response-code (http-get nar)))
- (= 200 (response-code (http-get narinfo)))
- (begin
- (wait-for-file cached)
- (response-code (http-get nar)))))))))
-
(test-equal "/log/NAME"
`(200 #t application/x-bzip2)
(let ((drv (run-with-store %store
--
2.32.0
M
M
Mathieu Othacehe wrote on 22 Aug 2021 09:33
Re: bug#50040: [PATCH 0/2] publish: Always render nar/narinfo during backing.
(address . 50040@debbugs.gnu.org)
87sfz2j5ls.fsf@gnu.org
Hey,

I increased the bypass cache value on the Berlin publish server from 50
to 500 MiB to confirm that this patch will help. Turns out, there are
still some Cuirass errors.

Running this command:

Toggle snippet (6 lines)
sudo strace -f -s 10000 -e write -p 124867 -p 124868 -p 124869 -p
124870 -p 124871 -p 124872 -p 124873 -p 124874 -p 124875 -p 124876 -p
124877 -p 124878 -p 124879 -p 124880 -p 124881 -p 124882 -p 124883 2>&1
|grep narinfo

(The processes are all the Nginx workers)

Shows those errors:

Toggle snippet (20 lines)
[pid 124867] write(4, "2021/08/21 22:17:40 [error] 124867#0: *16550606 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.167, server: ci.guix.gnu.org, request: \"GET /s03ydmlv5p7v206hvj5z1d4hn9nzsv2s.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/s03ydmlv5p7v206hvj5z1d4hn9nzsv2s.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124869] write(4, "2021/08/21 22:17:40 [error] 124869#0: *16550621 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.175, server: ci.guix.gnu.org, request: \"GET /kfzipc676in0ff2im8zprbsgv9idrc9j.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/kfzipc676in0ff2im8zprbsgv9idrc9j.narinfo\", host: \"141.80.167.131\"\n", 355 <unfinished ...>
[pid 124867] write(4, "2021/08/21 22:17:41 [error] 124867#0: *16550771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.158, server: ci.guix.gnu.org, request: \"GET /0mpspvh9h208vhrd3wxhw1165rz8mslb.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/0mpspvh9h208vhrd3wxhw1165rz8mslb.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:41 [error] 124867#0: *16550774 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.163, server: ci.guix.gnu.org, request: \"GET /4hy6nms1lzgbdrwslw5dlilvvjlrz95g.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/4hy6nms1lzgbdrwslw5dlilvvjlrz95g.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:42 [error] 124867#0: *16550628 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.173, server: ci.guix.gnu.org, request: \"GET /y9adqfkz8h3if17q3w801wfr0r91xdds.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/y9adqfkz8h3if17q3w801wfr0r91xdds.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:42 [error] 124867#0: *16550471 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.178, server: ci.guix.gnu.org, request: \"GET /d6vvg1rsqya8cnhi2mzqqvin55qax3xy.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/d6vvg1rsqya8cnhi2mzqqvin55qax3xy.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:42 [error] 124867#0: *16550606 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.167, server: ci.guix.gnu.org, request: \"GET /h3bd6psd2ijj5d9a8q46w2ldqsi135j0.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/h3bd6psd2ijj5d9a8q46w2ldqsi135j0.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:42 [error] 124867#0: *16550780 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.184, server: ci.guix.gnu.org, request: \"GET /8jvzhavq2zkychyjhs0ywnz2mpgp73qz.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/8jvzhavq2zkychyjhs0ywnz2mpgp73qz.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124868] write(4, "2021/08/21 22:17:43 [error] 124868#0: *16550782 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ci.guix.gnu.org, request: \"GET /3wl6j74i18wgcyqwrvfjjcyag4w3ip3d.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/3wl6j74i18wgcyqwrvfjjcyag4w3ip3d.narinfo\", host: \"127.0.0.1\"\n", 345) = 345
[pid 124868] write(4, "2021/08/21 22:17:43 [error] 124868#0: *16550784 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.161, server: ci.guix.gnu.org, request: \"GET /lf8lczjd7k55x2byql0rn5qpgibvl6kh.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/lf8lczjd7k55x2byql0rn5qpgibvl6kh.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:43 [error] 124867#0: *16550786 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.177, server: ci.guix.gnu.org, request: \"GET /fvddmn01gx8amxlh6kvv8j7pcm6dxddn.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/fvddmn01gx8amxlh6kvv8j7pcm6dxddn.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:44 [error] 124867#0: *16550788 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.183, server: ci.guix.gnu.org, request: \"GET /wcl4mlxpnn20q2f2mgc2hj7w21jf03iz.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/wcl4mlxpnn20q2f2mgc2hj7w21jf03iz.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:44 [error] 124867#0: *16550471 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.178, server: ci.guix.gnu.org, request: \"GET /hakhzv36jl50mas5am4pwk89p428wnrl.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/hakhzv36jl50mas5am4pwk89p428wnrl.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:44 [error] 124867#0: *16550606 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.167, server: ci.guix.gnu.org, request: \"GET /64fsv1vjf1mxkv6wv2wh3vkffzrgalyq.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/64fsv1vjf1mxkv6wv2wh3vkffzrgalyq.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:44 [error] 124867#0: *16550796 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.159, server: ci.guix.gnu.org, request: \"GET /skbvrqdl64lph6k1ba5gay31spdbkm04.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/skbvrqdl64lph6k1ba5gay31spdbkm04.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:45 [error] 124867#0: *16550799 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.164, server: ci.guix.gnu.org, request: \"GET /b9lrrh5zj7wjhgs93m1aaz0chbp7awzk.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/b9lrrh5zj7wjhgs93m1aaz0chbp7awzk.narinfo\", host: \"141.80.167.131\"\n", 355) = 355
[pid 124867] write(4, "2021/08/21 22:17:45 [error] 124867#0: *16550802 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ci.guix.gnu.org, request: \"GET /3sqsphfshnkpxk91zi38hh11ifq144n3.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/3sqsphfshnkpxk91zi38hh11ifq144n3.narinfo\", host: \"127.0.0.1\"\n", 345) = 345
[pid 124867] write(4, "2021/08/21 22:17:45 [error] 124867#0: *16550804 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.185, server: ci.guix.gnu.org, request: \"GET /f8ld4yvmsp5z1850gppy6sv0g59yamh8.narinfo HTTP/1.1\", upstream: \"http://127.0.0.1:3000/f8ld4yvmsp5z1850gppy6sv0g59yamh8.narinfo\", host: \"141.80.167.131\"\n", 355) = 355

I think this is caused by the following lines in the Berlin Nginx
configuration:

Toggle snippet (7 lines)
;; Do not tolerate slowness of hydra.gnu.org when fetching
;; narinfos: better return 504 quickly than wait forever.
"proxy_connect_timeout 2s;"
"proxy_read_timeout 2s;"
"proxy_send_timeout 2s;"

The narinfos that are not in cache yet are computed in the main thread,
slowing down the publish process. Under pressure, computing those
narinfo can exceed the 2s timeout resulting in the failures above.

Our best course of action here could be to merge the proposed patches,
extend the Nginx timeout to 10s and see if all the requests can be
handled quickly enough.

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 31 Aug 2021 00:31
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 50040@debbugs.gnu.org)
87sfyqa7je.fsf_-_@gnu.org
Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (4 lines)
> The "narinfo-string" procedure is expensive in term of IO operations and can
> take a while under IO pressure, such a GC collecting. Defer its call to a new
> thread created in the http-write procedure.

I don’t fully understand where the performance problem comes from.

‘narinfo-string’ makes RPCs, formats a string, and computes a signature
over a few hundred bytes at most. It does almost no I/O per se (it gets
the nar hash and size via an RPC), and does little computation.

I get this:

Toggle snippet (9 lines)
scheme@(guix scripts publish)> (%public-key (read-file-sexp "tests/signing-key.pub"))
$8 = #f
scheme@(guix scripts publish)> (%private-key (read-file-sexp "tests/signing-key.sec"))
$9 = #f
scheme@(guix scripts publish)> ,t (narinfo-string s "/gnu/store/0ff19dxsjvhf4hjnw9ixlqpwdmwfpbyp-libreoffice-7.1.4.2")
$11 = "StorePath: /gnu/store/0ff19dxsjvhf4hjnw9ixlqpwdmwfpbyp-libreoffice-7.1.4.2\nURL: nar/0ff19dxsjvhf4hjnw9ixlqpwdmwfpbyp-libreoffice-7.1.4.2\nCompression: none\nFileSize: 441389400\n[…]"
;; 0.006714s real time, 0.006460s run time. 0.000000s spent in GC.

Off the top of my head, scenarios that can make ‘narinfo-string’ slow:

0. The store’s big GC lock is taken so RPCs don’t complete until it’s
over.

GC runs at most twice a day on berlin. Do you have logs showing at
what time those timeouts occur? We could see if there’s a
correlation.

1. Contention on the daemon database (“database is locked”) that makes
RPCs take seconds to complete. Right now,

sudo guix processes|grep ^Session|wc -l

returns 6, which is not a lot, and I suspect that’s typical. There
could still be contention, but that doesn’t sound very likely.

2. It does I/O when it calls ‘read-derivation-from-file’. Under high
I/O load, that could be relatively expensive, though I’d expect it
to be measured in tenths of a second at worst?

But look, ‘read-derivation-from-file’ is called just to fill in the
“System” field, which is not used anywhere (not a single caller of
‘narinfo-system’), so we could just as well remove it and see how
it behaves.

Did you manage to come up with a way to synthetically reproduce the
problem?

Anyway, that the main thread is blocking while this happens is certainly
a problem, so this patch looks like an improvement. That we have to use
the ‘http-write’ hack isn’t great, but I think it’s OK, unless we want
to switch to Fibers.

Ludo’.
L
L
Ludovic Courtès wrote on 31 Aug 2021 00:33
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 50040@debbugs.gnu.org)
87r1eaa7gu.fsf_-_@gnu.org
Hi!

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (17 lines)
> I think this is caused by the following lines in the Berlin Nginx
> configuration:
>
> ;; Do not tolerate slowness of hydra.gnu.org when fetching
> ;; narinfos: better return 504 quickly than wait forever.
> "proxy_connect_timeout 2s;"
> "proxy_read_timeout 2s;"
> "proxy_send_timeout 2s;"
>
> The narinfos that are not in cache yet are computed in the main thread,
> slowing down the publish process. Under pressure, computing those
> narinfo can exceed the 2s timeout resulting in the failures above.
>
> Our best course of action here could be to merge the proposed patches,
> extend the Nginx timeout to 10s and see if all the requests can be
> handled quickly enough.

Yeah, increasing these timeouts a little bit (to 5–7s? making it too
long could lead to annoyances) sounds reasonable.

Ludo’.
M
M
Mathieu Othacehe wrote on 31 Aug 2021 11:08
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 50040@debbugs.gnu.org)
875yvm6kxf.fsf@gnu.org
Hey Ludo!

Toggle quote (3 lines)
> Did you manage to come up with a way to synthetically reproduce the
> problem?

Yes but it is rather complex. I managed to restart a lot of builds at
the same time by running SQL queries on the Cuirass database.

Toggle quote (4 lines)
> 2. It does I/O when it calls ‘read-derivation-from-file’. Under high
> I/O load, that could be relatively expensive, though I’d expect it
> to be measured in tenths of a second at worst?

It looks like read-derivation-from-file is indeed quite
expensive. There's an attached strace log that shows a bunch of
derivation files reading, caused by read-derivation-from-file. I can
count 336 derivation file reads in 30 seconds, which is not much but
could get worst I think.

Toggle quote (5 lines)
> But look, ‘read-derivation-from-file’ is called just to fill in the
> “System” field, which is not used anywhere (not a single caller of
> ‘narinfo-system’), so we could just as well remove it and see how
> it behaves.

Yes, I'll propose a patch to remove it.

Toggle quote (5 lines)
> Anyway, that the main thread is blocking while this happens is certainly
> a problem, so this patch looks like an improvement. That we have to use
> the ‘http-write’ hack isn’t great, but I think it’s OK, unless we want
> to switch to Fibers.

I think that applying this patchset + removing read-derivation-from-file
call + increasing Nginx timeouts could be a good start. However, I will
be mostly afk for about 3 weeks so unable to monitor the publish server
on Berlin and fix potential regressions. Maybe we should wait until
then.

Thanks,

Mathieu
Attachment: strace_publish.txt
L
L
Ludovic Courtès wrote on 1 Sep 2021 22:48
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 50040@debbugs.gnu.org)
87k0k06mz9.fsf@gnu.org
Hello!

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (6 lines)
>> Did you manage to come up with a way to synthetically reproduce the
>> problem?
>
> Yes but it is rather complex. I managed to restart a lot of builds at
> the same time by running SQL queries on the Cuirass database.

OK.

[...]

Toggle quote (7 lines)
>> But look, ‘read-derivation-from-file’ is called just to fill in the
>> “System” field, which is not used anywhere (not a single caller of
>> ‘narinfo-system’), so we could just as well remove it and see how
>> it behaves.
>
> Yes, I'll propose a patch to remove it.

Neat.

Toggle quote (11 lines)
>> Anyway, that the main thread is blocking while this happens is certainly
>> a problem, so this patch looks like an improvement. That we have to use
>> the ‘http-write’ hack isn’t great, but I think it’s OK, unless we want
>> to switch to Fibers.
>
> I think that applying this patchset + removing read-derivation-from-file
> call + increasing Nginx timeouts could be a good start. However, I will
> be mostly afk for about 3 weeks so unable to monitor the publish server
> on Berlin and fix potential regressions. Maybe we should wait until
> then.

Yes, sounds safer. Removing the “System” bit from narinfos produced by
‘guix publish’ should be fine, though.

Thanks!

Ludo’.
L
L
Ludovic Courtès wrote on 6 Sep 2021 15:54
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 50040@debbugs.gnu.org)
87tuix7qsw.fsf_-_@gnu.org
Hi Mathieu,

An additional data point: while stracing ‘guix publish’ on berlin just
now, I saw read(2) calls taking several seconds (!) to complete:

Toggle snippet (8 lines)
14:22:07 openat(AT_FDCWD, "/var/cache/guix/publish/hashes/wcd9bci97i2zi52ay4vddvr1jx9w50kz", O_RDONLY) = 33 <0.000032>
14:22:07 fstat(33, {st_mode=S_IFREG|0666, st_size=60, ...}) = 0 <0.000007>
14:22:07 lseek(33, 0, SEEK_CUR) = 0 <0.000006>
14:22:07 read(33, "/gnu/store/wcd9bci97i2zi52ay4vdd"..., 4096) = 60 <2.540400>
14:22:10 read(33, "", 4096) = 0 <0.000014>
14:22:10 close(33) = 0 <0.000008>

The load is indeed high right now:

Toggle snippet (4 lines)
ludo@berlin ~$ uptime
14:24:18 up 265 days 21:09, 1 user, load average: 10.61, 8.42, 8.00

There are a few ongoing builds, which could partly explain the high load
(‘guix-daemon’ shows up at the top in iotop):

Toggle snippet (30 lines)
ludo@berlin ~$ sudo guix processes|grep -C3 LockHeld
SessionPID: 43868
ClientPID: 43515
ClientCommand: /gnu/store/6l9rix46ydxyldf74dvpgr60rf5ily0c-guile-3.0.7/bin/guile --no-auto-compile /gnu/store/7rqj61dy660k0nw97w7657zbxnbdvg1g-update-guix.gnu.org
LockHeld: /gnu/store/srim97jxlxgyiday0j3hbq2z0w03iy3k-guix-web-site.lock
ChildPID: 49648
ChildCommand: : /gnu/store/p9di786py35m4kxszf466wcla6jvq96h-guix-1.3.0-3.50dfbbf/libexec/guix/guile \ /gnu/store/p9di786py35m4kxszf466wcla6jvq96h-guix-1.3.0-3.50dfbbf/bin/guix substitute --query

--
SessionPID: 61372
ClientPID: 58620
ClientCommand: /gnu/store/hkyxill7pdyyrrqfxl4sydq5db0ljw17-guile-3.0.7/bin/guile --no-auto-compile -e main -s /gnu/store/cm7hy2mpvk3av2gfg0gpfli11fgrpvrs-cuirass-1.1.0-3.91e8b2e/bin/.cuirass-real evaluate dbname=cuirass host=/tmp 14259
LockHeld: /gnu/store/xdvzddg84by15s1c17k543b39c03jp8i-isl-0.23.drv.lock

SessionPID: 61373
ClientPID: 58620
ClientCommand: /gnu/store/hkyxill7pdyyrrqfxl4sydq5db0ljw17-guile-3.0.7/bin/guile --no-auto-compile -e main -s /gnu/store/cm7hy2mpvk3av2gfg0gpfli11fgrpvrs-cuirass-1.1.0-3.91e8b2e/bin/.cuirass-real evaluate dbname=cuirass host=/tmp 14259
LockHeld: /gnu/store/2wm7b9g07nxxhrnfcggw6ibdry65fpwb-xfce4-panel-4.16.3-builder.lock

SessionPID: 61374
ClientPID: 58620
ClientCommand: /gnu/store/hkyxill7pdyyrrqfxl4sydq5db0ljw17-guile-3.0.7/bin/guile --no-auto-compile -e main -s /gnu/store/cm7hy2mpvk3av2gfg0gpfli11fgrpvrs-cuirass-1.1.0-3.91e8b2e/bin/.cuirass-real evaluate dbname=cuirass host=/tmp 14259
LockHeld: /gnu/store/1jmidz4i70w3lczn99wgn19dksb0pzhv-pkg-config-0.29.2-builder.lock

SessionPID: 61375
ClientPID: 58620
ClientCommand: /gnu/store/hkyxill7pdyyrrqfxl4sydq5db0ljw17-guile-3.0.7/bin/guile --no-auto-compile -e main -s /gnu/store/cm7hy2mpvk3av2gfg0gpfli11fgrpvrs-cuirass-1.1.0-3.91e8b2e/bin/.cuirass-real evaluate dbname=cuirass host=/tmp 14259
LockHeld: /gnu/store/4j7s86xp7w5jgm84lr18szln1zpw36zd-static-binaries-tarball-0.drv.lock

The ‘mumi worker’ process also seems to be doing a lot of I/O and using
quite a bit of CPU (presumably allocating a lot). Its ‘perf’ profile is
like this:

Toggle snippet (11 lines)
2.62% .mumi-real libguile-3.0.so.1.1.1 [.] scm_ilength
1.20% .mumi-real libguile-3.0.so.1.1.1 [.] scm_getc
1.12% .mumi-real libgc.so.1.4.3 [.] GC_add_to_black_list_normal.part.4
1.06% .mumi-real libgc.so.1.4.3 [.] GC_build_fl
0.98% .mumi-real libgc.so.1.4.3 [.] GC_reclaim_clear
0.97% .mumi-real libguile-3.0.so.1.1.1 [.] get_callee_vcode
0.91% .mumi-real libgc.so.1.4.3 [.] GC_generic_malloc_many
0.90% .mumi-real libguile-3.0.so.1.1.1 [.] peek_byte_or_eof
0.78% .mumi-real [JIT] tid 62601 [.] 0x00007f886964804d

Food for thought…

Ludo’.
M
M
Mathieu Othacehe wrote on 17 Sep 2021 17:27
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 50040@debbugs.gnu.org)
87zgsbuss7.fsf@gnu.org
Hey Ludo,

Toggle quote (16 lines)
> The ‘mumi worker’ process also seems to be doing a lot of I/O and using
> quite a bit of CPU (presumably allocating a lot). Its ‘perf’ profile is
> like this:
>
> 2.62% .mumi-real libguile-3.0.so.1.1.1 [.] scm_ilength
> 1.20% .mumi-real libguile-3.0.so.1.1.1 [.] scm_getc
> 1.12% .mumi-real libgc.so.1.4.3 [.] GC_add_to_black_list_normal.part.4
> 1.06% .mumi-real libgc.so.1.4.3 [.] GC_build_fl
> 0.98% .mumi-real libgc.so.1.4.3 [.] GC_reclaim_clear
> 0.97% .mumi-real libguile-3.0.so.1.1.1 [.] get_callee_vcode
> 0.91% .mumi-real libgc.so.1.4.3 [.] GC_generic_malloc_many
> 0.90% .mumi-real libguile-3.0.so.1.1.1 [.] peek_byte_or_eof
> 0.78% .mumi-real [JIT] tid 62601 [.] 0x00007f886964804d
>
> Food for thought…

Yep, interesting findings, thanks for sharing. I just discovered that
the Nginx server is logging in /var/log/nginx/error.log the errors we
are interested in:

Toggle snippet (10 lines)
2021/09/17 14:13:37 [error] 129925#0: *3727055 upstream timed out (110: Connection timed out) while reading response header from upstream, client: XXX, server: ci.guix.gnu.org, request: "GET /3m1j4ddx11
prp6azw3rjdhljg5vchf1s.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/3m1j4ddx11prp6azw3rjdhljg5vchf1s.narinfo", host: "ci.guix.gnu.org"
2021/09/17 14:13:37 [error] 129925#0: *3727020 upstream timed out (110: Connection timed out) while reading response header from upstream, client: XXX, server: ci.guix.gnu.org, request: "GET /pdhr7fgql3
qm1x24yxkj4c2f6s7vffi5.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/pdhr7fgql3qm1x24yxkj4c2f6s7vffi5.narinfo", host: "ci.guix.gnu.org"
2021/09/17 14:13:37 [error] 129925#0: *3726717 upstream timed out (110: Connection timed out) while reading response header from upstream, client: XXX, server: ci.guix.gnu.org, request: "GET /19gv6sq
qcmn8q020bpm9jyl6n9bjpg05.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/19gv6sqqcmn8q020bpm9jyl6n9bjpg05.narinfo", host: "ci.guix.gnu.org"
2021/09/17 14:13:37 [error] 129925#0: *3725361 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.169, server: ci.guix.gnu.org, request: "GET /y8dp69gk
y4dn5vnn7wnx04j20q0yrdja.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/y8dp69gky4dn5vnn7wnx04j20q0yrdja.narinfo", host: "141.80.167.131"

Despite the following diff, extending the timeout delays to 10 seconds,
we still have a lot of those errors. Some happening on the build farm
(141.80.167.X), some directly on user machines.

Toggle snippet (16 lines)
diff --git a/hydra/nginx/berlin.scm b/hydra/nginx/berlin.scm
index 44ff28e..7a085e5 100644
--- a/hydra/nginx/berlin.scm
+++ b/hydra/nginx/berlin.scm
@@ -97,9 +97,9 @@
;; Do not tolerate slowness of hydra.gnu.org when fetching
;; narinfos: better return 504 quickly than wait forever.
- "proxy_connect_timeout 2s;"
- "proxy_read_timeout 2s;"
- "proxy_send_timeout 2s;"
+ "proxy_connect_timeout 10s;"
+ "proxy_read_timeout 10s;"
+ "proxy_send_timeout 10s;"

I'll see if I can reproduce the I/O pressure on demand, to be able to
have more accurate strace/perf investigations.

Thanks,

Mathieu
M
M
Mathieu Othacehe wrote on 5 Oct 2021 12:07
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 50040@debbugs.gnu.org)
87h7dveqeu.fsf_-_@gnu.org
Hey,

The guix-publish daemon has been updated on berlin to include this fix:
222f4661ed11b225f458cbe495a296f233129bec. Sadly, it looks like it
doesn't improve the situation.

I observed that the builds of the following evaluation failed because of
the typical: "cannot build missing derivation
?/gnu/store/4p6hgfkj11q5mqk39lb3kxczwqswzh97-iso-image-installer.drv?".


When looking at the Nginx logs, we have those concomitant errors:

Toggle snippet (24 lines)
021/10/05 10:52:44 [info] 129925#0: *19790054 client 94.237.57.97 closed keepalive connection
2021/10/05 10:52:44 [error] 129925#0: *19787848 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.181, server: ci.guix.gnu.org, request: "GET /bhjk59akv02qibag0pxx6cjlrlgs1gjm.narinfo HTTP/1.1", upstream: "http://127.0.0.
1:3000/bhjk59akv02qibag0pxx6cjlrlgs1gjm.narinfo", host: "141.80.167.131"
2021/10/05 10:52:44 [error] 129925#0: *19788215 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.183, server: ci.guix.gnu.org, request: "GET /g243k85y7z7j6dq7nx74gp3d28hhad71.narinfo HTTP/1.1", upstream: "http://127.0.0.
1:3000/g243k85y7z7j6dq7nx74gp3d28hhad71.narinfo", host: "141.80.167.131"
2021/10/05 10:52:44 [error] 129925#0: *19787890 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.160, server: ci.guix.gnu.org, request: "GET /gxp6fhj7mszs90b24k630wprrkwsm5vb.narinfo HTTP/1.1", upstream: "http://127.0.0.
1:3000/gxp6fhj7mszs90b24k630wprrkwsm5vb.narinfo", host: "141.80.167.131"
2021/10/05 10:52:44 [error] 129925#0: *19788116 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.163, server: ci.guix.gnu.org, request: "GET /hyjiw3cnvsn2k5mpfafmczy68jmpb71x.narinfo HTTP/1.1", upstream: "http://127.0.0.
1:3000/hyjiw3cnvsn2k5mpfafmczy68jmpb71x.narinfo", host: "141.80.167.131"
2021/10/05 10:52:44 [error] 129925#0: *19788032 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.167, server: ci.guix.gnu.org, request: "GET /ngnnppl9ncixxjid0crfhwjn4h0xyg8c.narinfo HTTP/1.1", upstream: "http://127.0.0.
1:3000/ngnnppl9ncixxjid0crfhwjn4h0xyg8c.narinfo", host: "141.80.167.131"
2021/10/05 10:52:44 [error] 129925#0: *19787887 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.181, server: ci.guix.gnu.org, request: "GET /lvpmiq4p8wqyxi34q99qm7d0xfwq484a.narinfo HTTP/1.1", upstream: "http://127.0.0.
1:3000/lvpmiq4p8wqyxi34q99qm7d0xfwq484a.narinfo", host: "141.80.167.131"
2021/10/05 10:52:44 [error] 129925#0: *19788074 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.186, server: ci.guix.gnu.org, request: "GET /wxv8a2845d5zxq0yaxd57arjvdvhppib.narinfo HTTP/1.1", upstream: "http://127.0.0.
1:3000/wxv8a2845d5zxq0yaxd57arjvdvhppib.narinfo", host: "141.80.167.131"
2021/10/05 10:52:44 [error] 129925#0: *19787681 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.165, server: ci.guix.gnu.org, request: "GET /8kmc1fznb29rmbjiyvkbkixhzm2287al.narinfo HTTP/1.1", upstream: "http://127.0.0.
1:3000/8kmc1fznb29rmbjiyvkbkixhzm2287al.narinfo", host: "141.80.167.131"
2021/10/05 10:52:44 [error] 129925#0: *19787714 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.180, server: ci.guix.gnu.org, request: "GET /795mhps0sdli2ndvcv1aq54b7fg59d97.narinfo HTTP/1.1", upstream: "http://127.0.0.
1:3000/795mhps0sdli2ndvcv1aq54b7fg59d97.narinfo", host: "141.80.167.131"
2021/10/05 10:52:44 [error] 129925#0: *19788327 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.165, server: ci.guix.gnu.org, request: "GET /zsv7zg4xxr2hlirsbifka5y62394zzl8.narinfo HTTP/1.1", upstream: "http://127.0.0.
1:3000/zsv7zg4xxr2hlirsbifka5y62394zzl8.narinfo", host: "141.80.167.131"
2021/10/05 10:52:44 [error] 129925#0: *19787809 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.169, server: ci.guix.gnu.org, request: "GET /a8npwin71gbn9fprn18i5j26cl91l3rn.narinfo HTTP/1.1", upstream: "http://127.0.0.

Thanks,

Mathieu
M
M
Mathieu Othacehe wrote on 6 Oct 2021 10:58
(address . 50040@debbugs.gnu.org)
87o882edhd.fsf_-_@gnu.org
Hello,

Toggle quote (9 lines)
> * guix/scripts/publish.scm (render-narinfo): Defer the narinfo string creation
> to the http-write procedure.
> (compression->sexp, sexp->compression): New procedures.
> ("X-Nar-Compression"): Use them.
> ("X-Narinfo-Compressions"): New custom header.
> (strip-headers): Add the x-nar-path header.
> (http-write): Add narinfo on-the-fly creation support. It happens in a
> separated thread to prevent blocking the main thread.

I applied this one and updated the Guix package revision. Let's see if
it brings some improvements.

Thanks,

Mathieu
M
M
Mathieu Othacehe wrote on 8 Oct 2021 09:04
(address . 50040@debbugs.gnu.org)
871r4wyp2t.fsf_-_@gnu.org
Hello,

Toggle quote (3 lines)
> I applied this one and updated the Guix package revision. Let's see if
> it brings some improvements.

Sadly, there are still some failures:

Toggle snippet (5 lines)
2021/10/08 00:32:38 [error] 129925#0: *22076376 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.168, server: ci.guix.gnu.org, request: "GET /z6mk9m115h39cy8krryca7snimh9z5zq.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/z6mk9m115h39cy8krryca7snimh9z5zq.narinfo", host: "141.80.167.131"
2021/10/08 00:32:38 [error] 129925#0: *22076361 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.164, server: ci.guix.gnu.org, request: "GET /2l4cxz4dlr39hr7s52i9khxbj89d83qz.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/2l4cxz4dlr39hr7s52i9khxbj89d83qz.narinfo", host: "141.80.167.131"
2021/10/08 00:32:38 [error] 129925#0: *22076367 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.178, server: ci.guix.gnu.org, request: "GET /i4nwwda7mgk8n7qir59wmwppjjf216yz.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/i4nwwda7mgk8n7qir59wmwppjjf216yz.narinfo", host: "141.80.167.131"

I guess we need better profiling tools to understand what's going on
here.

Thanks,

Mathieu
?