Sporadic guix-offload crashes due to EOF errors

OpenSubmitted by Marius Bakke.
Details
3 participants
  • Ludovic Courtès
  • Marius Bakke
  • Maxim Cournoyer
Owner
unassigned
Severity
normal
M
M
Marius Bakke wrote on 31 May 2020 11:51
(address . bug-guix@gnu.org)
87mu5owhxh.fsf@gnu.org
Hello,
During 'guix build -s aarch64-linux dolphin' on Berlin, I got this crash:
Toggle snippet (36 lines)building /gnu/store/87655bh9rqcr29qasl1c4yj3skmxkyiz-kfilemetadata-5.70.0.drv...process 12989 acquired build slot '/var/guix/offload/overdrive1.guixsd.org:52522/1'process 12989 acquired build slot '/var/guix/offload/dover.guix.info:9023/1'process 12989 acquired build slot '/var/guix/offload/141.80.167.167:22/0'process 12989 acquired build slot '/var/guix/offload/141.80.167.163:22/0'process 12989 acquired build slot '/var/guix/offload/localhost:2223/1'process 12989 acquired build slot '/var/guix/offload/141.80.167.168:22/0'process 12989 acquired build slot '/var/guix/offload/141.80.167.173:22/0'process 12989 acquired build slot '/var/guix/offload/141.80.167.176:22/0'process 12989 acquired build slot '/var/guix/offload/localhost:2222/0'process 12989 acquired build slot '/var/guix/offload/141.80.167.165:22/0'process 12989 acquired build slot '/var/guix/offload/141.80.167.169:22/0'process 12989 acquired build slot '/var/guix/offload/141.80.167.181:22/0'process 12989 acquired build slot '/var/guix/offload/141.80.167.170:22/0'process 12989 acquired build slot '/var/guix/offload/141.80.167.174:22/0'process 12989 acquired build slot '/var/guix/offload/141.80.167.180:22/0'process 12989 acquired build slot '/var/guix/offload/141.80.167.161:22/0'Backtrace:In ice-9/boot-9.scm: 1736:10 5 (with-exception-handler _ _ #:unwind? _ # _)In unknown file: 4 (apply-smob/0 #<thunk 7f3344d296c0>)In ice-9/boot-9.scm: 718:2 3 (call-with-prompt _ _ #<procedure default-prompt-handle…>)In ice-9/eval.scm: 619:8 2 (_ #(#(#<directory (guile-user) 7f3344933f00>)))In guix/ui.scm: 1936:12 1 (run-guix-command _ . _)In guix/scripts/offload.scm: 742:22 0 (guix-offload . _)
guix/scripts/offload.scm:742:22: In procedure guix-offload:Throw to key `match-error' with args `("match" "no matching pattern" #<eof>)'.guix build: error: unexpected EOF reading a line
Which is strange because guix/scripts/offload.scm:742 is wrapped in a(unless (eof-object? ...)) block.
When this happens, the build command terminates, along with any otherbuilds that it had started concurrently. Builds from other clientswere unaffected, of course.
I have also seen this occur on my personal offloading setup once everyblue moon, but don't know what could have caused it.
-----BEGIN PGP SIGNATURE-----
iQEzBAEBCgAdFiEEu7At3yzq9qgNHeZDoqBt8qM6VPoFAl7TfjoACgkQoqBt8qM6VPokIQf+Lt6//6tAzD0+oGyAhc15p4LAqqkW5/lxQb/6nluNcifbjN/SSCgeFTzuk/Lti/6HEbjOqrUORtw9Yl50B3/5Is/qzuqFtmf7i++47r6QN2+dlwwK4vfePylcqy8caALlHDvC7rqrpVFq3MxStuW7QXBe/+jSjSORAh1KVh8TKHnhkdkRaazyZOP8oIl879u7txA9a1wqtkmgjX49QqkXrTZV8QK4PBCW+hiqobgIcxqDr9l21GMEWRYwssIxH0RphQlMrvhg2TPQhY2guNeLVy2eKwtr3VYe51RKWZTkoQ+ntQfOylltoypPQRJFs59QUW85DzbybcV+9I5Monr6Og===iT+K-----END PGP SIGNATURE-----
M
M
Marius Bakke wrote on 31 May 2020 12:12
(address . 41625@debbugs.gnu.org)
87k10swgyy.fsf@gnu.org
Marius Bakke <marius@gnu.org> writes:
Toggle quote (2 lines)> During 'guix build -s aarch64-linux dolphin' on Berlin, I got this crash:
Funny, I just got it _again_ building the same derivation. It seems tohappen while checking load on the selected offload machine, which is'overdrive1.guixsd.org' in this and the previous case:
Toggle snippet (36 lines)successfully built /gnu/store/qwscp6h2zsxr0knaizn1fn0saw1pfimi-kidletime-5.70.0.drvprocess 51308 acquired build slot '/var/guix/offload/141.80.167.174:22/0'process 51308 acquired build slot '/var/guix/offload/141.80.167.181:22/0'process 51308 acquired build slot '/var/guix/offload/overdrive1.guixsd.org:52522/1'process 51308 acquired build slot '/var/guix/offload/localhost:2223/1'process 51308 acquired build slot '/var/guix/offload/141.80.167.162:22/0'process 51308 acquired build slot '/var/guix/offload/141.80.167.180:22/0'process 51308 acquired build slot '/var/guix/offload/141.80.167.161:22/0'process 51308 acquired build slot '/var/guix/offload/141.80.167.170:22/0'process 51308 acquired build slot '/var/guix/offload/141.80.167.163:22/0'process 51308 acquired build slot '/var/guix/offload/141.80.167.166:22/0'process 51308 acquired build slot '/var/guix/offload/141.80.167.177:22/0'process 51308 acquired build slot '/var/guix/offload/localhost:2222/0'process 51308 acquired build slot '/var/guix/offload/141.80.167.173:22/0'process 51308 acquired build slot '/var/guix/offload/141.80.167.168:22/0'process 51308 acquired build slot '/var/guix/offload/141.80.167.169:22/0'process 51308 acquired build slot '/var/guix/offload/141.80.167.176:22/0'Backtrace:In ice-9/boot-9.scm: 1736:10 5 (with-exception-handler _ _ #:unwind? _ # _)In unknown file: 4 (apply-smob/0 #<thunk 7f3e4487b1c0>)In ice-9/boot-9.scm: 718:2 3 (call-with-prompt _ _ #<procedure default-prompt-handle…>)In ice-9/eval.scm: 619:8 2 (_ #(#(#<directory (guile-user) 7f3e4445ff00>)))In guix/ui.scm: 1936:12 1 (run-guix-command _ . _)In guix/scripts/offload.scm: 742:22 0 (guix-offload . _)
guix/scripts/offload.scm:742:22: In procedure guix-offload:Throw to key `match-error' with args `("match" "no matching pattern" #<eof>)'.guix build: error: unexpected EOF reading a line
'guix offload test' passes without problems.
-----BEGIN PGP SIGNATURE-----
iQEzBAEBCgAdFiEEu7At3yzq9qgNHeZDoqBt8qM6VPoFAl7TgxUACgkQoqBt8qM6VPrt3ggAt+nNax+K3pWTlItX1uy7Dv3Gehhb6BNpJA1PiDA1TTmlh4GG+TmXsp4vaPGheo0jnk2thwJ/AuaqYRu+Ec3mpMR1F/XaO/9ULWC0eeerlRhhnZJtnxopKnoB2BQmzS5loGWL8jmNL7bQ0+uESLZDsl5gbbWgTh0vAjneE97oaewTgYcBLF6dD7Mg9lDsuWTY9B5D+uoUGrpH0TRjydoNNCu0hjXisQ2mDKSh0KnohKuPHAkJ+0zMAJaGWrv8a9VqXbODDJrr/m3hdZt6PUA1iOh0HnYz7U8o6in84taHdhU3mlrlap7UlSmR5VcnbwolJfU9pzEYl2BfEBsagnE2AQ===W9oA-----END PGP SIGNATURE-----
M
M
Marius Bakke wrote on 31 May 2020 13:21
(address . 41625@debbugs.gnu.org)
87h7vwwdrs.fsf@gnu.org
Marius Bakke <marius@gnu.org> writes:
Toggle quote (2 lines)> 'guix offload test' passes without problems.
Not so fast, running it in a loop reveals the crash.
There is a trace file in /root/offloadtest.trace on Berlin with such anoccurence. It looks like a timeout is reached shortly before the EOFerror:
10139 poll([{fd=14, events=POLLIN|POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT}])10139 poll([{fd=14, events=POLLIN}], 1, 15000) = 0 (Timeout)10139 write(2, "Backtrace:\n", 11) = 11
This seems to be from a different node than the one reported previously,as the preceding connect() was to this machine:
10139 connect(44, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("141.80.167.186")}, 16) = -1 EINPROGRESS (Operation now in progress)
-----BEGIN PGP SIGNATURE-----
iQEzBAEBCgAdFiEEu7At3yzq9qgNHeZDoqBt8qM6VPoFAl7Tk0cACgkQoqBt8qM6VPpkSwgAuUZtpdzkttyqECe0BebW8+V/xeKAa06wXCk8rdVZhbAfmJ57MM4p6L/QBVn3Qy3LBrZJ3IskgIUZXIByOoIPK15L5hyMUMVV5GDQrU+UDIBJNKH8H0o9XsJY2DequT4SsQ7aF8njhA21xZ8XLvMk6pu7SPNZPLszvapDEM9kgA39DTx6lfxBLw0VD9v+pVQtuZqg37ge6jYBZrH1uoZm7vFbO1QXESVBUT4xmY5hjnsBpLUFcUOnuW1mC39lpgtBPwC1umLVgVoyCAB5XCb6i1LMyFQ1ysLLbkBXyCfDs0ZhHSZeZm/3hbAxn2POIAqxUseebhbqDjKHnYm57HckXA===z0CC-----END PGP SIGNATURE-----
L
L
Ludovic Courtès wrote on 4 Jun 2020 14:05
(name . Marius Bakke)(address . marius@gnu.org)(address . 41625@debbugs.gnu.org)
87a71jc9yi.fsf@gnu.org
Hi,
Marius Bakke <marius@gnu.org> skribis:
Toggle quote (19 lines)> Marius Bakke <marius@gnu.org> writes:>>> 'guix offload test' passes without problems.>> Not so fast, running it in a loop reveals the crash.>> There is a trace file in /root/offloadtest.trace on Berlin with such an> occurence. It looks like a timeout is reached shortly before the EOF> error:>> 10139 poll([{fd=14, events=POLLIN|POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT}])> 10139 poll([{fd=14, events=POLLIN}], 1, 15000) = 0 (Timeout)> 10139 write(2, "Backtrace:\n", 11) = 11>> This seems to be from a different node than the one reported previously,> as the preceding connect() was to this machine:>> 10139 connect(44, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("141.80.167.186")}, 16) = -1 EINPROGRESS (Operation now in progress)
So it looks like ‘connect’ fails and eventually we get an EOF object.However, I don’t see where that EOF comes from because the return valueof ‘connect!’ (the Guile-SSH procedure) is properly checked.
Ludo’.
M
M
Maxim Cournoyer wrote on 24 May 07:33 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)
87mtsky9um.fsf@gmail.com
Hi,
Ludovic Courtès <ludo@gnu.org> writes:
Toggle quote (31 lines)> Hi,>> Marius Bakke <marius@gnu.org> skribis:>>> Marius Bakke <marius@gnu.org> writes:>>>>> 'guix offload test' passes without problems.>>>> Not so fast, running it in a loop reveals the crash.>>>> There is a trace file in /root/offloadtest.trace on Berlin with such an>> occurence. It looks like a timeout is reached shortly before the EOF>> error:>>>> 10139 poll([{fd=14, events=POLLIN|POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT}])>> 10139 poll([{fd=14, events=POLLIN}], 1, 15000) = 0 (Timeout)>> 10139 write(2, "Backtrace:\n", 11) = 11>>>> This seems to be from a different node than the one reported previously,>> as the preceding connect() was to this machine:>>>> 10139 connect(44, {sa_family=AF_INET, sin_port=htons(22),>> sin_addr=inet_addr("141.80.167.186")}, 16) = -1 EINPROGRESS>> (Operation now in progress)>> So it looks like ‘connect’ fails and eventually we get an EOF object.> However, I don’t see where that EOF comes from because the return value> of ‘connect!’ (the Guile-SSH procedure) is properly checked.>> Ludo’.
I got a slightly different backtrace that suggests making the connectionis not at fault, rather it occurs during the read-repl-response call:
Toggle snippet (22 lines)guix offload: testing 1 build machines defined in '/etc/guix/machines.scm'...Backtrace: 8 (primitive-load "/home/maxim/.config/guix/current/bin/guix")In guix/ui.scm: 2165:12 7 (run-guix-command _ . _)In ice-9/boot-9.scm: 1752:10 6 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _) 1747:15 5 (with-exception-handler #<procedure 7f2caf885780 at ice-9/boot-9.scm:1831:7 (exn)> _ # _ # …)In guix/scripts/offload.scm: 704:21 4 (check-machine-availability _ _)In srfi/srfi-1.scm: 586:17 3 (map1 (#<session maxim@overdrive1.guix.gnu.org:52522 (connected) 7f2cae396fc0>))In guix/inferior.scm: 258:2 2 (port->inferior _ _) 240:2 1 (read-repl-response _ _)In ice-9/boot-9.scm: 1685:16 0 (raise-exception _ #:continuable? _)
ice-9/boot-9.scm:1685:16: In procedure raise-exception:Throw to key `match-error' with args `("match" "no matching pattern" #<eof>)'.
I seem to get this more often than not with the overdrive1 offloadmachine.
Maxim
M
M
Maxim Cournoyer wrote on 25 May 17:50 +0200
[PATCH] offload: Handle a possible EOF response from read-repl-response.
(address . 41625@debbugs.gnu.org)
20210525155003.27590-1-maxim.cournoyer@gmail.com
Fixes https://issues.guix.gnu.org/41625.
* guix/scripts/offload.scm (check-machine-availability): Refactor so that ittakes a single machine object, to allow for retrying a single machine. Handlethe case where the checks raised an exception due to the connection to thebuild machine having been lost, and retry up to 3 times. Ensure the cleanupcode is run in all situations.(check-machines-availability): New procedure. CallCHECK-MACHINES-AVAILABILITY in parallel, which improves performance (abouttwice as fast with 4 build machines, from ~30 s to ~15 s).* guix/inferior.scm (&inferior-connection-lost): New condition type.(read-repl-response): Raise a condition of the above type when reading EOFfrom the build machine's port.--- guix/inferior.scm | 9 ++++++++ guix/scripts/offload.scm | 50 +++++++++++++++++++++++++++++----------- 2 files changed, 45 insertions(+), 14 deletions(-)
Toggle diff (132 lines)diff --git a/guix/inferior.scm b/guix/inferior.scmindex 7c8e478f2a..4ac1ea3484 100644--- a/guix/inferior.scm+++ b/guix/inferior.scm@@ -1,5 +1,6 @@ ;;; GNU Guix --- Functional package management for GNU ;;; Copyright © 2018, 2019, 2020, 2021 Ludovic Courtès <ludo@gnu.org>+;;; Copyright © 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com> ;;; ;;; This file is part of GNU Guix. ;;;@@ -70,6 +71,7 @@ inferior-exception-arguments inferior-exception-inferior inferior-exception-stack+ inferior-connection-lost? read-repl-response inferior-packages@@ -228,6 +230,9 @@ equivalent. Return #f if the inferior could not be launched." (inferior inferior-exception-inferior) ;<inferior> | #f (stack inferior-exception-stack)) ;list of (FILE COLUMN LINE) +(define-condition-type &inferior-connection-lost &error+ inferior-connection-lost?)+ (define* (read-repl-response port #:optional inferior) "Read a (guix repl) response from PORT and return it as a Scheme object. Raise '&inferior-exception' when an exception is read from PORT."@@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT." (match (read port) (('values objects ...) (apply values (map sexp->object objects)))+ ;; Unexpectedly read EOF from the port. This can happen for example when+ ;; the underlying connection for PORT was lost with Guile-SSH.+ (? eof-object?+ (raise (condition (&inferior-connection-lost)))) (('exception ('arguments key objects ...) ('stack frames ...)) ;; Protocol (0 1 1) and later.diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scmindex 835078cb97..7c6d38b218 100644--- a/guix/scripts/offload.scm+++ b/guix/scripts/offload.scm@@ -1,7 +1,7 @@ ;;; GNU Guix --- Functional package management for GNU ;;; Copyright © 2014, 2015, 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org> ;;; Copyright © 2017 Ricardo Wurmus <rekado@elephly.net>-;;; Copyright © 2020 Maxim Cournoyer <maxim.cournoyer@gmail.com>+;;; Copyright © 2020, 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com> ;;; Copyright © 2020 Julien Lepiller <julien@lepiller.eu> ;;; ;;; This file is part of GNU Guix.@@ -53,6 +53,7 @@ #:use-module (ice-9 regex) #:use-module (ice-9 format) #:use-module (ice-9 binary-ports)+ #:use-module (ice-9 threads) #:export (build-machine build-machine? build-machine-name@@ -684,7 +685,7 @@ daemon is not running." (leave (G_ "failed to import '~a' from '~a'~%") item name))))) -(define (check-machine-availability machine-file pred)+(define (check-machines-availability machine-file pred) "Check that each machine matching PRED in MACHINE-FILE is usable as a build machine." (define (build-machine=? m1 m2)@@ -696,18 +697,39 @@ machine." (let ((machines (filter pred (delete-duplicates (build-machines machine-file) build-machine=?))))- (info (G_ "testing ~a build machines defined in '~a'...~%")+ (info (G_ "Testing ~a build machines defined in '~a'...~%") (length machines) machine-file)- (let* ((names (map build-machine-name machines))- (sockets (map build-machine-daemon-socket machines))- (sessions (map (cut open-ssh-session <> %short-timeout) machines))- (nodes (map remote-inferior sessions)))- (for-each assert-node-has-guix nodes names)- (for-each assert-node-repl nodes names)- (for-each assert-node-can-import sessions nodes names sockets)- (for-each assert-node-can-export sessions nodes names sockets)- (for-each close-inferior nodes)- (for-each disconnect! sessions))))+ (par-for-each check-machine-availability machines)))++(define (check-machine-availability machine)+ "Check whether MACHINE is available. Exit with an error upon failure."+ ;; Sometimes, the machine remote port may return EOF, presumably because the+ ;; connection was lost. Retry up to 3 times.+ (let loop ((retries 3))+ (guard (c ((inferior-connection-lost? c)+ (let ((retries-left (1- retries)))+ (if (> retries-left 0)+ (begin+ (format (current-error-port)+ (G_ "connection to machine ~s lost; retrying~%")+ (build-machine-name machine))+ (loop (retries-left)))+ (leave (G_ "connection repeatedly lost with machine '~a'~%")+ (build-machine-name machine))))))+ (let* ((name (build-machine-name machine))+ (socket (build-machine-daemon-socket machine))+ (session (open-ssh-session machine %short-timeout))+ (node (remote-inferior session)))+ (dynamic-wind+ (lambda () #t)+ (lambda ()+ (assert-node-has-guix node name)+ (assert-node-repl node name)+ (assert-node-can-import session node name socket)+ (assert-node-can-export session node name socket))+ (lambda ()+ (close-inferior node)+ (disconnect! session))))))) (define (check-machine-status machine-file pred) "Print the load of each machine matching PRED in MACHINE-FILE."@@ -824,7 +846,7 @@ machine." ((file) (values file (const #t))) (() (values %machine-file (const #t))) (x (leave (G_ "wrong number of arguments~%"))))))- (check-machine-availability (or file %machine-file) pred))))+ (check-machines-availability (or file %machine-file) pred)))) (("status" rest ...) (with-error-handling (let-values (((file pred)-- 2.31.1
L
L
Ludovic Courtès wrote on 25 May 22:27 +0200
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
875yz61rvt.fsf@gnu.org
Hi!
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
Toggle quote (14 lines)> Fixes https://issues.guix.gnu.org/41625.>> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it> takes a single machine object, to allow for retrying a single machine. Handle> the case where the checks raised an exception due to the connection to the> build machine having been lost, and retry up to 3 times. Ensure the cleanup> code is run in all situations.> (check-machines-availability): New procedure. Call> CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about> twice as fast with 4 build machines, from ~30 s to ~15 s).> * guix/inferior.scm (&inferior-connection-lost): New condition type.> (read-repl-response): Raise a condition of the above type when reading EOF> from the build machine's port.
[...]
Toggle quote (15 lines)> +(define-condition-type &inferior-connection-lost &error> + inferior-connection-lost?)> +> (define* (read-repl-response port #:optional inferior)> "Read a (guix repl) response from PORT and return it as a Scheme object.> Raise '&inferior-exception' when an exception is read from PORT."> @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."> (match (read port)> (('values objects ...)> (apply values (map sexp->object objects)))> + ;; Unexpectedly read EOF from the port. This can happen for example when> + ;; the underlying connection for PORT was lost with Guile-SSH.> + (? eof-object?> + (raise (condition (&inferior-connection-lost))))
The match clause syntax is incorrect; should be:
((? eof-object?) (raise …))
Toggle quote (14 lines)> + (info (G_ "Testing ~a build machines defined in '~a'...~%")> (length machines) machine-file)> - (let* ((names (map build-machine-name machines))> - (sockets (map build-machine-daemon-socket machines))> - (sessions (map (cut open-ssh-session <> %short-timeout) machines))> - (nodes (map remote-inferior sessions)))> - (for-each assert-node-has-guix nodes names)> - (for-each assert-node-repl nodes names)> - (for-each assert-node-can-import sessions nodes names sockets)> - (for-each assert-node-can-export sessions nodes names sockets)> - (for-each close-inferior nodes)> - (for-each disconnect! sessions))))> + (par-for-each check-machine-availability machines)))
Why not! IMO this should go in a separate patch, though, since it’s notrelated.
Toggle quote (16 lines)> +(define (check-machine-availability machine)> + "Check whether MACHINE is available. Exit with an error upon failure."> + ;; Sometimes, the machine remote port may return EOF, presumably because the> + ;; connection was lost. Retry up to 3 times.> + (let loop ((retries 3))> + (guard (c ((inferior-connection-lost? c)> + (let ((retries-left (1- retries)))> + (if (> retries-left 0)> + (begin> + (format (current-error-port)> + (G_ "connection to machine ~s lost; retrying~%")> + (build-machine-name machine))> + (loop (retries-left)))> + (leave (G_ "connection repeatedly lost with machine '~a'~%")> + (build-machine-name machine))))))
I’m afraid we’re papering over problems here.
Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ onberlin enough to reproduce the issue? If so, we could monitor/stracesshd on overdrive1 to get a better understanding of what’s going on.
WDYT?
Thanks,Ludo’.
M
M
Maxim Cournoyer wrote on 26 May 05:18 +0200
Re: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
(name . Ludovic Courtès)(address . ludo@gnu.org)
87mtsikwsm.fsf_-_@gmail.com
Hi Ludovic,
Ludovic Courtès <ludo@gnu.org> writes:
[...]
Toggle quote (17 lines)>> (define* (read-repl-response port #:optional inferior)>> "Read a (guix repl) response from PORT and return it as a Scheme object.>> Raise '&inferior-exception' when an exception is read from PORT.">> @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT.">> (match (read port)>> (('values objects ...)>> (apply values (map sexp->object objects)))>> + ;; Unexpectedly read EOF from the port. This can happen for example when>> + ;; the underlying connection for PORT was lost with Guile-SSH.>> + (? eof-object?>> + (raise (condition (&inferior-connection-lost))))>> The match clause syntax is incorrect; should be:>> ((? eof-object?)> (raise …))
Good catch, fixed.
Toggle quote (17 lines)>> + (info (G_ "Testing ~a build machines defined in '~a'...~%")>> (length machines) machine-file)>> - (let* ((names (map build-machine-name machines))>> - (sockets (map build-machine-daemon-socket machines))>> - (sessions (map (cut open-ssh-session <> %short-timeout) machines))>> - (nodes (map remote-inferior sessions)))>> - (for-each assert-node-has-guix nodes names)>> - (for-each assert-node-repl nodes names)>> - (for-each assert-node-can-import sessions nodes names sockets)>> - (for-each assert-node-can-export sessions nodes names sockets)>> - (for-each close-inferior nodes)>> - (for-each disconnect! sessions))))>> + (par-for-each check-machine-availability machines)))>> Why not! IMO this should go in a separate patch, though, since it’s not> related.
For me, it is related in that retrying all the checks of *every* buildoffload machine would be too expensive; it already takes 32 s for my 4offload machines; retrying this for up to 3 times would mean waiting fora minute and half, which I don't find reasonable (imagine on berlin!).
Toggle quote (18 lines)>> +(define (check-machine-availability machine)>> + "Check whether MACHINE is available. Exit with an error upon failure.">> + ;; Sometimes, the machine remote port may return EOF, presumably because the>> + ;; connection was lost. Retry up to 3 times.>> + (let loop ((retries 3))>> + (guard (c ((inferior-connection-lost? c)>> + (let ((retries-left (1- retries)))>> + (if (> retries-left 0)>> + (begin>> + (format (current-error-port)>> + (G_ "connection to machine ~s lost; retrying~%")>> + (build-machine-name machine))>> + (loop (retries-left)))>> + (leave (G_ "connection repeatedly lost with machine '~a'~%")>> + (build-machine-name machine))))))>> I’m afraid we’re papering over problems here.
I had that thought too, but then also realized that even if this waspapering over a problem, it'd be a good one to paper over as thisproblem can legitimately happen in practice, due to the network'sinherently shaky nature. It seems better to be ready for it. Also, myhopes in being able to troubleshoot such a difficult to reproducenetworking issue are rather low.
Toggle quote (4 lines)> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on> berlin enough to reproduce the issue? If so, we could monitor/strace> sshd on overdrive1 to get a better understanding of what’s going on.
It's actually difficult to trigger it; it seems to happen mostly on thefirst try after a long time without connecting to the machine; on the2nd and later tries, everything is smooth. Waiting a few minutes is notenough to re-trigger the problem.
I've managed to see the problem a few lucky times with:
Toggle snippet (3 lines)while true; do guix offload test /etc/guix/machines.scm overdrive1; done
I don't have a password set for my user on overdrive1, so can't attachstrace to sshd, but yeah, we could try to capture it and see if we canunderstand what's going on.
Attached is v2 of the patch, with the match clause fixed.
From c52172502749a4d194dc51db9d2c394cb15e8d07 Mon Sep 17 00:00:00 2001From: Maxim Cournoyer <maxim.cournoyer@gmail.com>Date: Tue, 25 May 2021 08:42:06 -0400Subject: [PATCH] offload: Handle a possible EOF response from read-repl-response.
Fixes https://issues.guix.gnu.org/41625.
* guix/scripts/offload.scm (check-machine-availability): Refactor so that ittakes a single machine object, to allow for retrying a single machine. Handlethe case where the checks raised an exception due to the connection to thebuild machine having been lost, and retry up to 3 times. Ensure the cleanupcode is run in all situations.(check-machines-availability): New procedure. CallCHECK-MACHINES-AVAILABILITY in parallel, which improves performance (abouttwice as fast with 4 build machines, from ~30 s to ~15 s).* guix/inferior.scm (&inferior-connection-lost): New condition type.(read-repl-response): Raise a condition of the above type when reading EOFfrom the build machine's port.--- guix/inferior.scm | 9 ++++++++ guix/scripts/offload.scm | 50 +++++++++++++++++++++++++++++----------- 2 files changed, 45 insertions(+), 14 deletions(-)
Toggle diff (132 lines)diff --git a/guix/inferior.scm b/guix/inferior.scmindex 7c8e478f2a..45d9d843e3 100644--- a/guix/inferior.scm+++ b/guix/inferior.scm@@ -1,5 +1,6 @@ ;;; GNU Guix --- Functional package management for GNU ;;; Copyright © 2018, 2019, 2020, 2021 Ludovic Courtès <ludo@gnu.org>+;;; Copyright © 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com> ;;; ;;; This file is part of GNU Guix. ;;;@@ -70,6 +71,7 @@ inferior-exception-arguments inferior-exception-inferior inferior-exception-stack+ inferior-connection-lost? read-repl-response inferior-packages@@ -228,6 +230,9 @@ equivalent. Return #f if the inferior could not be launched." (inferior inferior-exception-inferior) ;<inferior> | #f (stack inferior-exception-stack)) ;list of (FILE COLUMN LINE) +(define-condition-type &inferior-connection-lost &error+ inferior-connection-lost?)+ (define* (read-repl-response port #:optional inferior) "Read a (guix repl) response from PORT and return it as a Scheme object. Raise '&inferior-exception' when an exception is read from PORT."@@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT." (match (read port) (('values objects ...) (apply values (map sexp->object objects)))+ ;; Unexpectedly read EOF from the port. This can happen for example when+ ;; the underlying connection for PORT was lost with Guile-SSH.+ ((? eof-object?)+ (raise (condition (&inferior-connection-lost)))) (('exception ('arguments key objects ...) ('stack frames ...)) ;; Protocol (0 1 1) and later.diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scmindex 835078cb97..0271874f6a 100644--- a/guix/scripts/offload.scm+++ b/guix/scripts/offload.scm@@ -1,7 +1,7 @@ ;;; GNU Guix --- Functional package management for GNU ;;; Copyright © 2014, 2015, 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org> ;;; Copyright © 2017 Ricardo Wurmus <rekado@elephly.net>-;;; Copyright © 2020 Maxim Cournoyer <maxim.cournoyer@gmail.com>+;;; Copyright © 2020, 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com> ;;; Copyright © 2020 Julien Lepiller <julien@lepiller.eu> ;;; ;;; This file is part of GNU Guix.@@ -53,6 +53,7 @@ #:use-module (ice-9 regex) #:use-module (ice-9 format) #:use-module (ice-9 binary-ports)+ #:use-module (ice-9 threads) #:export (build-machine build-machine? build-machine-name@@ -684,7 +685,7 @@ daemon is not running." (leave (G_ "failed to import '~a' from '~a'~%") item name))))) -(define (check-machine-availability machine-file pred)+(define (check-machines-availability machine-file pred) "Check that each machine matching PRED in MACHINE-FILE is usable as a build machine." (define (build-machine=? m1 m2)@@ -696,18 +697,39 @@ machine." (let ((machines (filter pred (delete-duplicates (build-machines machine-file) build-machine=?))))- (info (G_ "testing ~a build machines defined in '~a'...~%")+ (info (G_ "Testing ~a build machines defined in '~a'...~%") (length machines) machine-file)- (let* ((names (map build-machine-name machines))- (sockets (map build-machine-daemon-socket machines))- (sessions (map (cut open-ssh-session <> %short-timeout) machines))- (nodes (map remote-inferior sessions)))- (for-each assert-node-has-guix nodes names)- (for-each assert-node-repl nodes names)- (for-each assert-node-can-import sessions nodes names sockets)- (for-each assert-node-can-export sessions nodes names sockets)- (for-each close-inferior nodes)- (for-each disconnect! sessions))))+ (par-for-each check-machine-availability machines)))++(define (check-machine-availability machine)+ "Check whether MACHINE is available. Exit with an error upon failure."+ ;; Sometimes, the machine remote port may return EOF, presumably because the+ ;; connection was lost. Retry up to 3 times.+ (let loop ((retries 3))+ (guard (c ((inferior-connection-lost? c)+ (let ((retries-left (1- retries)))+ (if (> retries-left 0)+ (begin+ (format (current-error-port)+ (G_ "connection to machine '~a' lost; retrying~%")+ (build-machine-name machine))+ (loop (retries-left)))+ (leave (G_ "connection repeatedly lost with machine '~a'~%")+ (build-machine-name machine))))))+ (let* ((name (build-machine-name machine))+ (socket (build-machine-daemon-socket machine))+ (session (open-ssh-session machine %short-timeout))+ (node (remote-inferior session)))+ (dynamic-wind+ (lambda () #t)+ (lambda ()+ (assert-node-has-guix node name)+ (assert-node-repl node name)+ (assert-node-can-import session node name socket)+ (assert-node-can-export session node name socket))+ (lambda ()+ (close-inferior node)+ (disconnect! session))))))) (define (check-machine-status machine-file pred) "Print the load of each machine matching PRED in MACHINE-FILE."@@ -824,7 +846,7 @@ machine." ((file) (values file (const #t))) (() (values %machine-file (const #t))) (x (leave (G_ "wrong number of arguments~%"))))))- (check-machine-availability (or file %machine-file) pred))))+ (check-machines-availability (or file %machine-file) pred)))) (("status" rest ...) (with-error-handling (let-values (((file pred)-- 2.31.1
Thanks!
Maxim
L
L
Ludovic Courtès wrote on 26 May 11:14 +0200
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87fsy9x3ev.fsf@gnu.org
Hi,
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
Toggle quote (22 lines)>>> + (info (G_ "Testing ~a build machines defined in '~a'...~%")>>> (length machines) machine-file)>>> - (let* ((names (map build-machine-name machines))>>> - (sockets (map build-machine-daemon-socket machines))>>> - (sessions (map (cut open-ssh-session <> %short-timeout) machines))>>> - (nodes (map remote-inferior sessions)))>>> - (for-each assert-node-has-guix nodes names)>>> - (for-each assert-node-repl nodes names)>>> - (for-each assert-node-can-import sessions nodes names sockets)>>> - (for-each assert-node-can-export sessions nodes names sockets)>>> - (for-each close-inferior nodes)>>> - (for-each disconnect! sessions))))>>> + (par-for-each check-machine-availability machines)))>>>> Why not! IMO this should go in a separate patch, though, since it’s not>> related.>> For me, it is related in that retrying all the checks of *every* build> offload machine would be too expensive; it already takes 32 s for my 4> offload machines; retrying this for up to 3 times would mean waiting for> a minute and half, which I don't find reasonable (imagine on berlin!).
I see. So I’d say it’s a prerequisite (a patch that must come before)but not entirely the same thing. I’m nitpicking!
We should make sure it doesn’t trigger thread-safety issues in libssh oranything like that (running it repeatedly on a large machines.scm shouldgive us some confidence).
Toggle quote (25 lines)>>> +(define (check-machine-availability machine)>>> + "Check whether MACHINE is available. Exit with an error upon failure.">>> + ;; Sometimes, the machine remote port may return EOF, presumably because the>>> + ;; connection was lost. Retry up to 3 times.>>> + (let loop ((retries 3))>>> + (guard (c ((inferior-connection-lost? c)>>> + (let ((retries-left (1- retries)))>>> + (if (> retries-left 0)>>> + (begin>>> + (format (current-error-port)>>> + (G_ "connection to machine ~s lost; retrying~%")>>> + (build-machine-name machine))>>> + (loop (retries-left)))>>> + (leave (G_ "connection repeatedly lost with machine '~a'~%")>>> + (build-machine-name machine))))))>>>> I’m afraid we’re papering over problems here.>> I had that thought too, but then also realized that even if this was> papering over a problem, it'd be a good one to paper over as this> problem can legitimately happen in practice, due to the network's> inherently shaky nature. It seems better to be ready for it. Also, my> hopes in being able to troubleshoot such a difficult to reproduce> networking issue are rather low.
Yes, but note that this is just for ‘guix offload test’. The actualcode run while offloading will still fail badly.
Toggle quote (17 lines)>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on>> berlin enough to reproduce the issue? If so, we could monitor/strace>> sshd on overdrive1 to get a better understanding of what’s going on.>> It's actually difficult to trigger it; it seems to happen mostly on the> first try after a long time without connecting to the machine; on the> 2nd and later tries, everything is smooth. Waiting a few minutes is not> enough to re-trigger the problem.>> I've managed to see the problem a few lucky times with:>> while true; do guix offload test /etc/guix/machines.scm overdrive1; done>> I don't have a password set for my user on overdrive1, so can't attach> strace to sshd, but yeah, we could try to capture it and see if we can> understand what's going on.
OK.
Toggle quote (20 lines)> From c52172502749a4d194dc51db9d2c394cb15e8d07 Mon Sep 17 00:00:00 2001> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>> Date: Tue, 25 May 2021 08:42:06 -0400> Subject: [PATCH] offload: Handle a possible EOF response from> read-repl-response.>> Fixes <https://issues.guix.gnu.org/41625>.>> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it> takes a single machine object, to allow for retrying a single machine. Handle> the case where the checks raised an exception due to the connection to the> build machine having been lost, and retry up to 3 times. Ensure the cleanup> code is run in all situations.> (check-machines-availability): New procedure. Call> CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about> twice as fast with 4 build machines, from ~30 s to ~15 s).> * guix/inferior.scm (&inferior-connection-lost): New condition type.> (read-repl-response): Raise a condition of the above type when reading EOF> from the build machine's port.
[...]
Toggle quote (3 lines)> +(define-condition-type &inferior-connection-lost &error> + inferior-connection-lost?)
Perhaps worth adding an ‘inferior’ and/or ‘port’ field. That wouldallow the handler to present more information as to which inferior isfailing.
Maybe ‘premature-eof’ would be more accurate than ‘connection-lost’.
Toggle quote (4 lines)> + (format (current-error-port)> + (G_ "connection to machine '~a' lost; retrying~%")> + (build-machine-name machine))
You can use ‘info’ instead of ‘format’.
Otherwise LGTM, thanks!
Ludo’.
M
M
Marius Bakke wrote on 26 May 17:48 +0200
(address . 41625@debbugs.gnu.org)
87h7ipa433.fsf@gnu.org
Maxim Cournoyer <maxim.cournoyer@gmail.com> skriver:
Toggle quote (15 lines)>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on>> berlin enough to reproduce the issue? If so, we could monitor/strace>> sshd on overdrive1 to get a better understanding of what’s going on.>> It's actually difficult to trigger it; it seems to happen mostly on the> first try after a long time without connecting to the machine; on the> 2nd and later tries, everything is smooth. Waiting a few minutes is not> enough to re-trigger the problem.>> I've managed to see the problem a few lucky times with:>> --8<---------------cut here---------------start------------->8---> while true; do guix offload test /etc/guix/machines.scm overdrive1; done> --8<---------------cut here---------------end--------------->8---
I used to be able to reproduce it by inducing a high load on the targetmachine and just let Guix keep trying to connect. But now I did that,and set overload threshold to 0.0 for good measure, and Guix has beenwaiting patiently for two hours without failure.
So AFAICT this bug has been fixed. Perhaps Berlin or the Overdrivesimply needs to be updated?
-----BEGIN PGP SIGNATURE-----
iIUEARYKAC0WIQRNTknu3zbaMQ2ddzTocYulkRQQdwUCYK5t0A8cbWFyaXVzQGdudS5vcmcACgkQ6HGLpZEUEHcy7QEAlX2skErbZBDILToutLkOSP6mLf3xK2tdqBcOttgbpX0BANb7F5k14C9HhyQBYoexMCS9Mydjfx43TLIeAuKw2gkL=41ZX-----END PGP SIGNATURE-----
M
M
Maxim Cournoyer wrote on 27 May 13:49 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)
877djkl7lp.fsf@gmail.com
Hi Ludovic,
Ludovic Courtès <ludo@gnu.org> writes:
[...]
Toggle quote (3 lines)> I see. So I’d say it’s a prerequisite (a patch that must come before)> but not entirely the same thing. I’m nitpicking!
Eh, it's okay :-). Splitting changes into the right unit is a problemthat is akin to naming things; it's hard! I welcome your suggestion.
Toggle quote (4 lines)> We should make sure it doesn’t trigger thread-safety issues in libssh or> anything like that (running it repeatedly on a large machines.scm should> give us some confidence).
It seems fine so far, but I've only tested in a loop with 4 buildmachines. When it nears completion I'll give it a shot on berlin.
[...]
Toggle quote (3 lines)> Yes, but note that this is just for ‘guix offload test’. The actual> code run while offloading will still fail badly.
Ah, thanks for pointing that; I somehow thought that this machine statuschecking code was a prelude to every offloaded build.
[...]
Toggle quote (6 lines)>> I don't have a password set for my user on overdrive1, so can't attach>> strace to sshd, but yeah, we could try to capture it and see if we can>> understand what's going on.>> OK.
I'd be happy to try strace when your are available. You can ping me onthe chat. It's been more than 8 hours since I tried, so I should beable to trigger the problem :-).
[...]
Toggle quote (6 lines)> Perhaps worth adding an ‘inferior’ and/or ‘port’ field. That would> allow the handler to present more information as to which inferior is> failing.>> Maybe ‘premature-eof’ would be more accurate than ‘connection-lost’.
Good suggestions. I'll implement them.
Toggle quote (6 lines)>> + (format (current-error-port)>> + (G_ "connection to machine '~a' lost; retrying~%")>> + (build-machine-name machine))>> You can use ‘info’ instead of ‘format’.
That also. Thanks!
On another note, I was able to 'exercise' the fix, and the exception israised but something fails with the following backtrace instead of beingretried:
Toggle snippet (27 lines)guix offload: Testing 1 build machines defined in '/etc/guix/machines.scm'...connection to machine 'overdrive1.guix.gnu.org' lost; retryingBacktrace:In ice-9/boot-9.scm: 1752:10 10 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)In unknown file: 9 (apply-smob/0 #<thunk 7f915c028f60>)In ice-9/boot-9.scm: 724:2 8 (call-with-prompt _ _ #<procedure default-prompt-handler (k proc)>)In ice-9/eval.scm: 619:8 7 (_ #(#(#<directory (guile-user) 7f915c022c80>)))In guix/ui.scm: 2161:12 6 (run-guix-command _ . _)In ice-9/boot-9.scm: 1752:10 5 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _) 1747:15 4 (with-exception-handler #<procedure 7f91576bf0c0 at ice-9/boot-9.scm:1831:7 (exn)> _ # _ # …)In srfi/srfi-1.scm: 634:9 3 (for-each #<procedure check-machine-availability (a)> (#<<build-machine> name: "overdriv…>))In ice-9/eval.scm: 191:35 2 (_ #(#(#(#<directory (guix scripts offload) 7f9159852780> 3 #<<build-machine> na…> …) …) …))Exception thrown while printing backtrace:In procedure frame-local-ref: Argument 2 out of range: 1
ice-9/boot-9.scm:1685:16: In procedure raise-exception:Wrong type to apply: 2
I haven't been able to pinpoint what yet. Notice that in the above codeI've changed par-for-each by just for-each, doubting it might havesomething to do with it, but it appears unrelated.
Thanks,
Maxim
M
M
Maxim Cournoyer wrote on 27 May 13:51 +0200
(name . Marius Bakke)(address . marius@gnu.org)
8735u8l7iy.fsf@gmail.com
Hi Marius,
Marius Bakke <marius@gnu.org> writes:
Toggle quote (25 lines)> Maxim Cournoyer <maxim.cournoyer@gmail.com> skriver:>>>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on>>> berlin enough to reproduce the issue? If so, we could monitor/strace>>> sshd on overdrive1 to get a better understanding of what’s going on.>>>> It's actually difficult to trigger it; it seems to happen mostly on the>> first try after a long time without connecting to the machine; on the>> 2nd and later tries, everything is smooth. Waiting a few minutes is not>> enough to re-trigger the problem.>>>> I've managed to see the problem a few lucky times with:>>>> --8<---------------cut here---------------start------------->8--->> while true; do guix offload test /etc/guix/machines.scm overdrive1; done>> --8<---------------cut here---------------end--------------->8--->> I used to be able to reproduce it by inducing a high load on the target> machine and just let Guix keep trying to connect. But now I did that,> and set overload threshold to 0.0 for good measure, and Guix has been> waiting patiently for two hours without failure.>> So AFAICT this bug has been fixed. Perhaps Berlin or the Overdrive> simply needs to be updated?
Ah! Do you have root access to overdrive1? It'd be interesting toreconfigure it to update the guix-daemon and see if the problemvanishes.
Thanks for the information!
Maxim
M
M
Maxim Cournoyer wrote on 27 May 16:57 +0200
Re: [PATCH v3] offload: Handle a possible EOF response from read-repl-response.
(name . Ludovic Courtès)(address . ludo@gnu.org)
87r1hsjkbv.fsf_-_@gmail.com
Hi Ludo,
Now that I have root access to overdrive1, I could strace the sshdprocess (I just did 'strace -p340', noting the process of sshd displayedwith 'herd status sshd'):
Toggle snippet (32 lines)pselect6(87, [3 4], NULL, NULL, NULL, NULL) = 1 (in [3])accept(3, {sa_family=AF_INET, sin_port=htons(33262), sin_addr=inet_addr("66.158.152.121")}, [128->16]) = 5fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)pipe2([6, 7], 0) = 0socketpair(AF_UNIX, SOCK_STREAM, 0, [8, 9]) = 0clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xffff8e0ef0e0) = 644close(7) = 0close(9) = 0write(8, "\0\0\1\245\0", 5) = 5write(8, "\0\0\1\234\nPort 22\nPermitRootLogin no\n"..., 420) = 420close(8) = 0close(5) = 0getpid() = 340getpid() = 340getpid() = 340getpid() = 340getpid() = 340getpid() = 340getpid() = 340pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])read(6, "\0", 1) = 1pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])read(6, "", 1) = 0close(6) = 0pselect6(87, [3 4], NULL, NULL, NULL, NULL) = ? ERESTARTNOHAND (To be restarted if no handler)--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=644, si_uid=0, si_status=255, si_utime=1, si_stime=0} ---wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 255}], WNOHANG, NULL) = 644wait4(-1, 0xfffffa4d90e4, WNOHANG, NULL) = 0rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)pselect6(87, [3 4], NULL, NULL, NULL, NULL <detached ...>
With the attached v3 patch, the corresponding output (still problematic)was:
Toggle snippet (29 lines)$ ./pre-inst-env guix offload test /etc/guix/machines.scm overdrive1guix offload: Testing 1 build machines defined in '/etc/guix/machines.scm'...guix offload: got premature EOF from machine 'overdrive1.guix.gnu.org' from inferior '#<inferior pipe (0 1 1) 7f6ee10d3400>' on port '#<input-output: channel (open) 7f6ee10c5500>'; retrying connectionBacktrace:In ice-9/boot-9.scm: 1752:10 10 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)In unknown file: 9 (apply-smob/0 #<thunk 7f6ee5ae9f60>)In ice-9/boot-9.scm: 724:2 8 (call-with-prompt _ _ #<procedure default-prompt-handler (k proc)>)In ice-9/eval.scm: 619:8 7 (_ #(#(#<directory (guile-user) 7f6ee5ae3c80>)))In guix/ui.scm: 2161:12 6 (run-guix-command _ . _)In ice-9/boot-9.scm: 1752:10 5 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _) 1747:15 4 (with-exception-handler #<procedure 7f6ee25e1480 at ice-9/boot-9.scm:1831:7 (exn)> _ # _ # …)In ice-9/threads.scm: 288:21 3 (loop _)In guix/scripts/offload.scm: 719:29 2 (_ _) 719:29 1 (_ _)In ice-9/boot-9.scm: 1685:16 0 (raise-exception _ #:continuable? _)
ice-9/boot-9.scm:1685:16: In procedure raise-exception:Wrong type to apply: 2
I haven't tried analyzing the strace output yet.
Maxim
From c7b2ec1c58adf8c795df0a6aaf075dbc331f41e8 Mon Sep 17 00:00:00 2001From: Maxim Cournoyer <maxim.cournoyer@gmail.com>Date: Thu, 27 May 2021 08:44:44 -0400Subject: [PATCH 1/2] offload: Parallelize machine check in offload test.
* guix/scripts/offload.scm (check-machine-availability): Refactor so that ittakes a single machine object. Ensure the cleanup code is always run.(check-machines-availability): New procedure. CallCHECK-MACHINES-AVAILABILITY in parallel, which improves performance (abouttwice as fast with 4 build machines, from ~30 s to ~15 s).--- guix/scripts/offload.scm | 39 +++++++++++++++++++++++++-------------- 1 file changed, 25 insertions(+), 14 deletions(-)
Toggle diff (81 lines)diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scmindex 835078cb97..b0fd20e158 100644--- a/guix/scripts/offload.scm+++ b/guix/scripts/offload.scm@@ -1,7 +1,7 @@ ;;; GNU Guix --- Functional package management for GNU ;;; Copyright © 2014, 2015, 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org> ;;; Copyright © 2017 Ricardo Wurmus <rekado@elephly.net>-;;; Copyright © 2020 Maxim Cournoyer <maxim.cournoyer@gmail.com>+;;; Copyright © 2020, 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com> ;;; Copyright © 2020 Julien Lepiller <julien@lepiller.eu> ;;; ;;; This file is part of GNU Guix.@@ -53,6 +53,7 @@ #:use-module (ice-9 regex) #:use-module (ice-9 format) #:use-module (ice-9 binary-ports)+ #:use-module (ice-9 threads) #:export (build-machine build-machine? build-machine-name@@ -684,7 +685,7 @@ daemon is not running." (leave (G_ "failed to import '~a' from '~a'~%") item name))))) -(define (check-machine-availability machine-file pred)+(define (check-machines-availability machine-file pred) "Check that each machine matching PRED in MACHINE-FILE is usable as a build machine." (define (build-machine=? m1 m2)@@ -696,18 +697,28 @@ machine." (let ((machines (filter pred (delete-duplicates (build-machines machine-file) build-machine=?))))- (info (G_ "testing ~a build machines defined in '~a'...~%")+ (info (G_ "Testing ~a build machines defined in '~a'...~%") (length machines) machine-file)- (let* ((names (map build-machine-name machines))- (sockets (map build-machine-daemon-socket machines))- (sessions (map (cut open-ssh-session <> %short-timeout) machines))- (nodes (map remote-inferior sessions)))- (for-each assert-node-has-guix nodes names)- (for-each assert-node-repl nodes names)- (for-each assert-node-can-import sessions nodes names sockets)- (for-each assert-node-can-export sessions nodes names sockets)- (for-each close-inferior nodes)- (for-each disconnect! sessions))))+ (par-for-each check-machine-availability machines)))++(define (check-machine-availability machine)+ "Check whether MACHINE is available. Exit with an error upon failure."+ ;; Sometimes, the machine remote port may return EOF, presumably because the+ ;; connection was lost. Retry up to 3 times.+ (let* ((name (build-machine-name machine))+ (socket (build-machine-daemon-socket machine))+ (session (open-ssh-session machine %short-timeout))+ (node (remote-inferior session)))+ (dynamic-wind+ (lambda () #t)+ (lambda ()+ (assert-node-has-guix node name)+ (assert-node-repl node name)+ (assert-node-can-import session node name socket)+ (assert-node-can-export session node name socket))+ (lambda ()+ (close-inferior node)+ (disconnect! session))))) (define (check-machine-status machine-file pred) "Print the load of each machine matching PRED in MACHINE-FILE."@@ -824,7 +835,7 @@ machine." ((file) (values file (const #t))) (() (values %machine-file (const #t))) (x (leave (G_ "wrong number of arguments~%"))))))- (check-machine-availability (or file %machine-file) pred))))+ (check-machines-availability (or file %machine-file) pred)))) (("status" rest ...) (with-error-handling (let-values (((file pred)-- 2.31.1
From b5558777617e4674a150895458d57d202de56120 Mon Sep 17 00:00:00 2001From: Maxim Cournoyer <maxim.cournoyer@gmail.com>Date: Tue, 25 May 2021 08:42:06 -0400Subject: [PATCH 2/2] offload: Handle a possible EOF response from read-repl-response.
Partially fixes https://issues.guix.gnu.org/41625.
* guix/scripts/offload.scm (check-machine-availability): Handle the case wherethe checks raised an exception due to receiving EOF prematurely, and retry upto 3 times.* guix/inferior.scm (&inferior-premature-eof): New condition type.(read-repl-response): Raise a condition of the above type when reading EOFfrom the build machine's port.--- guix/inferior.scm | 15 ++++++++++++++ guix/scripts/offload.scm | 42 ++++++++++++++++++++++++++-------------- 2 files changed, 43 insertions(+), 14 deletions(-)
Toggle diff (101 lines)diff --git a/guix/inferior.scm b/guix/inferior.scmindex 7c8e478f2a..e63b37a7dd 100644--- a/guix/inferior.scm+++ b/guix/inferior.scm@@ -1,5 +1,6 @@ ;;; GNU Guix --- Functional package management for GNU ;;; Copyright © 2018, 2019, 2020, 2021 Ludovic Courtès <ludo@gnu.org>+;;; Copyright © 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com> ;;; ;;; This file is part of GNU Guix. ;;;@@ -70,6 +71,9 @@ inferior-exception-arguments inferior-exception-inferior inferior-exception-stack+ inferior-premature-eof?+ inferior-premature-eof-port+ inferior-premature-eof-inferior read-repl-response inferior-packages@@ -228,6 +232,11 @@ equivalent. Return #f if the inferior could not be launched." (inferior inferior-exception-inferior) ;<inferior> | #f (stack inferior-exception-stack)) ;list of (FILE COLUMN LINE) +(define-condition-type &inferior-premature-eof &error+ inferior-premature-eof?+ (port inferior-premature-eof-port)+ (inferior inferior-premature-eof-inferior))+ (define* (read-repl-response port #:optional inferior) "Read a (guix repl) response from PORT and return it as a Scheme object. Raise '&inferior-exception' when an exception is read from PORT."@@ -241,6 +250,12 @@ Raise '&inferior-exception' when an exception is read from PORT." (match (read port) (('values objects ...) (apply values (map sexp->object objects)))+ ((? eof-object?)+ ;; Unexpectedly read EOF from the port. This can happen for example when+ ;; the underlying connection for PORT was lost with Guile-SSH.+ (raise (condition (&inferior-premature-eof+ (inferior inferior)+ (port port))))) (('exception ('arguments key objects ...) ('stack frames ...)) ;; Protocol (0 1 1) and later.diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scmindex b0fd20e158..4312eb4e22 100644--- a/guix/scripts/offload.scm+++ b/guix/scripts/offload.scm@@ -705,20 +705,34 @@ machine." "Check whether MACHINE is available. Exit with an error upon failure." ;; Sometimes, the machine remote port may return EOF, presumably because the ;; connection was lost. Retry up to 3 times.- (let* ((name (build-machine-name machine))- (socket (build-machine-daemon-socket machine))- (session (open-ssh-session machine %short-timeout))- (node (remote-inferior session)))- (dynamic-wind- (lambda () #t)- (lambda ()- (assert-node-has-guix node name)- (assert-node-repl node name)- (assert-node-can-import session node name socket)- (assert-node-can-export session node name socket))- (lambda ()- (close-inferior node)- (disconnect! session)))))+ (let loop ((retries 3))+ (guard (c ((inferior-premature-eof? c)+ (let ((retries-left (1- retries))+ (inferior (inferior-premature-eof-inferior c)))+ (if (> retries-left 0)+ (begin+ (info (G_ "got premature EOF from machine '~a' from \+inferior '~a' on port '~a'; retrying connection~%")+ (build-machine-name machine)+ inferior+ (inferior-premature-eof-port c))+ (loop (retries-left)))+ (leave (G_ "connection repeatedly lost with machine '~a'~%")+ (build-machine-name machine))))))+ (let* ((name (build-machine-name machine))+ (socket (build-machine-daemon-socket machine))+ (session (open-ssh-session machine %short-timeout))+ (node (remote-inferior session)))+ (dynamic-wind+ (lambda () #t)+ (lambda ()+ (assert-node-has-guix node name)+ (assert-node-repl node name)+ (assert-node-can-import session node name socket)+ (assert-node-can-export session node name socket))+ (lambda ()+ (close-inferior node)+ (disconnect! session))))))) (define (check-machine-status machine-file pred) "Print the load of each machine matching PRED in MACHINE-FILE."-- 2.31.1
M
M
Maxim Cournoyer wrote on 27 May 19:20 +0200
Re: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
(name . Ludovic Courtès)(address . ludo@gnu.org)
87mtsgjdoz.fsf@gmail.com
Hello again,
My previous strace output didn't follow the 'clone' syscall; the oneattached does. It's compressed with lzip.
Attachment: sshd-strace.out.lz
Thanks!
Maxim
L
L
Ludovic Courtès wrote on 29 May 21:24 +0200
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87tuml5onf.fsf@gnu.org
Hi Maxim,
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
Toggle quote (3 lines)> My previous strace output didn't follow the 'clone' syscall; the one> attached does. It's compressed with lzip.
Toggle snippet (52 lines)11457 sendto(3, "<86>May 27 19:03:45 sshd[11457]: pam_unix(sshd:session): session opened for user maxim by (uid=0)", 97, MSG_NOSIGNAL, NULL, 0) = 9711457 socketpair(AF_UNIX, SOCK_STREAM, 0, [5, 7]) = 011457 fcntl(5, F_SETFD, FD_CLOEXEC) = 011457 fcntl(7, F_SETFD, FD_CLOEXEC) = 011457 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD <unfinished ...>11459 set_robust_list(0xffff9d2e10f0, 24 <unfinished ...>11457 <... clone resumed>, child_tidptr=0xffff9d2e10e0) = 11459
[…]
11459 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD <unfinished ...>11460 set_robust_list(0xffff9d2e10f0, 24 <unfinished ...>11459 <... clone resumed>, child_tidptr=0xffff9d2e10e0) = 11460
[…]
11460 execve("/gnu/store/6b5gnpnbi73l0pasan0fip4w3f1afizi-bash-5.0.16/bin/bash", ["bash", "-c", "guix repl -t machine"], 0xaaaacfb78470 /* 16 vars */) = 0
[…]
11460 <... read resumed>"(use-modules (gnu))\n", 4096) = 20
[…]
11459 read(4, "", 16384) = 011459 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 011459 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 011459 close(11) = 011459 close(10) = 011459 close(13) = 011459 close(4) = 011459 brk(0xaaaacfba7000) = 0xaaaacfba700011459 write(5, "\0\0\0\0012", 5) = 511457 <... ppoll resumed>) = 1 ([{fd=7, revents=POLLIN}])11459 exit_group(0 <unfinished ...>
[…]
11457 sendto(3, "<86>May 27 19:04:08 sshd[11457]: pam_unix(sshd:session): session closed for user maxim", 86, MSG_NOSIGNAL, NULL, 0) = 86
[…]
11460 openat(AT_FDCWD, "/gnu/store/83vzpl9pl5v86ncb8jy2gbfgsac0amjv-guix-module-union/lib/guile/3.0/site-ccache/gnu/packages/mtools.go", O_RDONLY|O_CLOEXEC) = 1411460 lseek(14, 0, SEEK_END) = 7270711460 mmap(NULL, 72707, PROT_READ, MAP_PRIVATE, 14, 0) = 0xffff9019400011460 close(14) = 011460 mprotect(0xffff901a4000, 3432, PROT_READ|PROT_WRITE) = 011460 newfstatat(AT_FDCWD, "/gnu/store/83vzpl9pl5v86ncb8jy2gbfgsac0amjv-guix-module-union/share/guile/site/3.0/gnu/packages/patches/u-boot-rockchip-inno-usb.patch", {st_mode=S_IFREG|0444, st_size=10450, ...}, 0) = 011460 write(1, "(values (non-self-quoting 2052 \"#<unspecified>\"))\n", 50) = -1 EPIPE (Broken pipe)11460 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=11460, si_uid=30013} ---
(‘guix repl’ is PID 11460.)
It looks like the SSH session gets closed while ‘guix repl’ is stillrunning, and thus the client never sees the reply from ‘guix repl’(‘guix repl’ itself eventually gets EPIPE because its parent processterminated and there’s no one listening to it.)
So the bug may be on the client side, in ‘guix offload’, which for somereason drops the session too early. Can you enable libssh debuggingthere, reproduce the problem, and share the libssh debug info from thatrun?
Do enable libssh debugging, you can drop something like this in (guixscripts offload) for instance:
(use-modules (ssh log)) (set-log-verbosity! 'protocol)
The available logging levels, from Guile-SSH’s log.c, are:
Toggle snippet (15 lines)struct symbol_mapping log_verbosity[] = { /* 0, No logging at all */ { "nolog", SSH_LOG_NOLOG }, /* 1, Only rare and noteworthy events */ { "rare", SSH_LOG_RARE }, /* 2, High level protocol information */ { "protocol", SSH_LOG_PROTOCOL }, /* 3, Lower level protocol infomations, packet level */ { "packet", SSH_LOG_PACKET }, /* 4, Every function path */ { "functions", SSH_LOG_FUNCTIONS }, { NULL, -1 }};
HTH!
Ludo’.
L
L
Ludovic Courtès wrote on 5 Jul 10:57 +0200
Re: bug#41625: Sporadic guix-offload crashes due to EOF errors
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
878s2lw330.fsf_-_@gnu.org
Hi,
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
Toggle quote (28 lines)> Now that I have root access to overdrive1, I could strace the sshd> process (I just did 'strace -p340', noting the process of sshd displayed> with 'herd status sshd'):>> pselect6(87, [3 4], NULL, NULL, NULL, NULL) = 1 (in [3])> accept(3, {sa_family=AF_INET, sin_port=htons(33262), sin_addr=inet_addr("66.158.152.121")}, [128->16]) = 5> fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)> pipe2([6, 7], 0) = 0> socketpair(AF_UNIX, SOCK_STREAM, 0, [8, 9]) = 0> clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xffff8e0ef0e0) = 644> close(7) = 0> close(9) = 0> write(8, "\0\0\1\245\0", 5) = 5> write(8, "\0\0\1\234\nPort 22\nPermitRootLogin no\n"..., 420) = 420> close(8) = 0> close(5) = 0> getpid() = 340> getpid() = 340> getpid() = 340> getpid() = 340> getpid() = 340> getpid() = 340> getpid() = 340> pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])> read(6, "\0", 1) = 1> pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])> read(6, "", 1) = 0
OK, so it looks as if the client disconnected right away. Hard to tellexactly what that happened. :-/ Perhaps turning libssh debugging on onthe client side could help (by uncommenting “#:log-verbosity 'protocol”in (guix ssh)).
Toggle quote (11 lines)>>From c7b2ec1c58adf8c795df0a6aaf075dbc331f41e8 Mon Sep 17 00:00:00 2001> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>> Date: Thu, 27 May 2021 08:44:44 -0400> Subject: [PATCH 1/2] offload: Parallelize machine check in offload test.>> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it> takes a single machine object. Ensure the cleanup code is always run.> (check-machines-availability): New procedure. Call> CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about> twice as fast with 4 build machines, from ~30 s to ~15 s).
I remain wary of this change, because that could lead to subtlenon-deterministic bugs (of the kind that keeps you busy for weeks) andbecause I personally don’t mind if ‘guix offload test’ takes 30s onberlin, and because the intermingled output may make diagnostics lessclear.
Toggle quote (15 lines)>>From b5558777617e4674a150895458d57d202de56120 Mon Sep 17 00:00:00 2001> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>> Date: Tue, 25 May 2021 08:42:06 -0400> Subject: [PATCH 2/2] offload: Handle a possible EOF response from> read-repl-response.>> Partially fixes <https://issues.guix.gnu.org/41625>.>> * guix/scripts/offload.scm (check-machine-availability): Handle the case where> the checks raised an exception due to receiving EOF prematurely, and retry up> to 3 times.> * guix/inferior.scm (&inferior-premature-eof): New condition type.> (read-repl-response): Raise a condition of the above type when reading EOF> from the build machine's port.
LGTM!
Thanks,Ludo’.
?