Sporadic guix-offload crashes due to EOF errors

  • Done
  • quality assurance status badge
Details
3 participants
  • Ludovic Courtès
  • Marius Bakke
  • Maxim Cournoyer
Owner
unassigned
Submitted by
Marius Bakke
Severity
normal
M
M
Marius Bakke wrote on 31 May 2020 11:51
(address . bug-guix@gnu.org)
87mu5owhxh.fsf@gnu.org
Hello,

During 'guix build -s aarch64-linux dolphin' on Berlin, I got this crash:

Toggle snippet (36 lines)
building /gnu/store/87655bh9rqcr29qasl1c4yj3skmxkyiz-kfilemetadata-5.70.0.drv...
process 12989 acquired build slot '/var/guix/offload/overdrive1.guixsd.org:52522/1'
process 12989 acquired build slot '/var/guix/offload/dover.guix.info:9023/1'
process 12989 acquired build slot '/var/guix/offload/141.80.167.167:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.163:22/0'
process 12989 acquired build slot '/var/guix/offload/localhost:2223/1'
process 12989 acquired build slot '/var/guix/offload/141.80.167.168:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.173:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.176:22/0'
process 12989 acquired build slot '/var/guix/offload/localhost:2222/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.165:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.169:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.181:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.170:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.174:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.180:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.161:22/0'
Backtrace:
In ice-9/boot-9.scm:
1736:10 5 (with-exception-handler _ _ #:unwind? _ # _)
In unknown file:
4 (apply-smob/0 #<thunk 7f3344d296c0>)
In ice-9/boot-9.scm:
718:2 3 (call-with-prompt _ _ #<procedure default-prompt-handle…>)
In ice-9/eval.scm:
619:8 2 (_ #(#(#<directory (guile-user) 7f3344933f00>)))
In guix/ui.scm:
1936:12 1 (run-guix-command _ . _)
In guix/scripts/offload.scm:
742:22 0 (guix-offload . _)

guix/scripts/offload.scm:742:22: In procedure guix-offload:
Throw to key `match-error' with args `("match" "no matching pattern" #<eof>)'.
guix build: error: unexpected EOF reading a line

Which is strange because guix/scripts/offload.scm:742 is wrapped in a
(unless (eof-object? ...)) block.

When this happens, the build command terminates, along with any other
builds that it had started concurrently. Builds from other clients
were unaffected, of course.

I have also seen this occur on my personal offloading setup once every
blue moon, but don't know what could have caused it.
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCgAdFiEEu7At3yzq9qgNHeZDoqBt8qM6VPoFAl7TfjoACgkQoqBt8qM6
VPokIQf+Lt6//6tAzD0+oGyAhc15p4LAqqkW5/lxQb/6nluNcifbjN/SSCgeFTzu
k/Lti/6HEbjOqrUORtw9Yl50B3/5Is/qzuqFtmf7i++47r6QN2+dlwwK4vfePylc
qy8caALlHDvC7rqrpVFq3MxStuW7QXBe/+jSjSORAh1KVh8TKHnhkdkRaazyZOP8
oIl879u7txA9a1wqtkmgjX49QqkXrTZV8QK4PBCW+hiqobgIcxqDr9l21GMEWRYw
ssIxH0RphQlMrvhg2TPQhY2guNeLVy2eKwtr3VYe51RKWZTkoQ+ntQfOylltoypP
QRJFs59QUW85DzbybcV+9I5Monr6Og==
=iT+K
-----END PGP SIGNATURE-----

M
M
Marius Bakke wrote on 31 May 2020 12:12
(address . 41625@debbugs.gnu.org)
87k10swgyy.fsf@gnu.org
Marius Bakke <marius@gnu.org> writes:

Toggle quote (2 lines)
> During 'guix build -s aarch64-linux dolphin' on Berlin, I got this crash:

Funny, I just got it _again_ building the same derivation. It seems to
happen while checking load on the selected offload machine, which is
'overdrive1.guixsd.org' in this and the previous case:

Toggle snippet (36 lines)
successfully built /gnu/store/qwscp6h2zsxr0knaizn1fn0saw1pfimi-kidletime-5.70.0.drv
process 51308 acquired build slot '/var/guix/offload/141.80.167.174:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.181:22/0'
process 51308 acquired build slot '/var/guix/offload/overdrive1.guixsd.org:52522/1'
process 51308 acquired build slot '/var/guix/offload/localhost:2223/1'
process 51308 acquired build slot '/var/guix/offload/141.80.167.162:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.180:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.161:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.170:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.163:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.166:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.177:22/0'
process 51308 acquired build slot '/var/guix/offload/localhost:2222/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.173:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.168:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.169:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.176:22/0'
Backtrace:
In ice-9/boot-9.scm:
1736:10 5 (with-exception-handler _ _ #:unwind? _ # _)
In unknown file:
4 (apply-smob/0 #<thunk 7f3e4487b1c0>)
In ice-9/boot-9.scm:
718:2 3 (call-with-prompt _ _ #<procedure default-prompt-handle…>)
In ice-9/eval.scm:
619:8 2 (_ #(#(#<directory (guile-user) 7f3e4445ff00>)))
In guix/ui.scm:
1936:12 1 (run-guix-command _ . _)
In guix/scripts/offload.scm:
742:22 0 (guix-offload . _)

guix/scripts/offload.scm:742:22: In procedure guix-offload:
Throw to key `match-error' with args `("match" "no matching pattern" #<eof>)'.
guix build: error: unexpected EOF reading a line

'guix offload test' passes without problems.
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCgAdFiEEu7At3yzq9qgNHeZDoqBt8qM6VPoFAl7TgxUACgkQoqBt8qM6
VPrt3ggAt+nNax+K3pWTlItX1uy7Dv3Gehhb6BNpJA1PiDA1TTmlh4GG+TmXsp4v
aPGheo0jnk2thwJ/AuaqYRu+Ec3mpMR1F/XaO/9ULWC0eeerlRhhnZJtnxopKnoB
2BQmzS5loGWL8jmNL7bQ0+uESLZDsl5gbbWgTh0vAjneE97oaewTgYcBLF6dD7Mg
9lDsuWTY9B5D+uoUGrpH0TRjydoNNCu0hjXisQ2mDKSh0KnohKuPHAkJ+0zMAJaG
Wrv8a9VqXbODDJrr/m3hdZt6PUA1iOh0HnYz7U8o6in84taHdhU3mlrlap7UlSmR
5VcnbwolJfU9pzEYl2BfEBsagnE2AQ==
=W9oA
-----END PGP SIGNATURE-----

M
M
Marius Bakke wrote on 31 May 2020 13:21
(address . 41625@debbugs.gnu.org)
87h7vwwdrs.fsf@gnu.org
Marius Bakke <marius@gnu.org> writes:

Toggle quote (2 lines)
> 'guix offload test' passes without problems.

Not so fast, running it in a loop reveals the crash.

There is a trace file in /root/offloadtest.trace on Berlin with such an
occurence. It looks like a timeout is reached shortly before the EOF
error:

10139 poll([{fd=14, events=POLLIN|POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT}])
10139 poll([{fd=14, events=POLLIN}], 1, 15000) = 0 (Timeout)
10139 write(2, "Backtrace:\n", 11) = 11

This seems to be from a different node than the one reported previously,
as the preceding connect() was to this machine:

10139 connect(44, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("141.80.167.186")}, 16) = -1 EINPROGRESS (Operation now in progress)
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCgAdFiEEu7At3yzq9qgNHeZDoqBt8qM6VPoFAl7Tk0cACgkQoqBt8qM6
VPpkSwgAuUZtpdzkttyqECe0BebW8+V/xeKAa06wXCk8rdVZhbAfmJ57MM4p6L/Q
BVn3Qy3LBrZJ3IskgIUZXIByOoIPK15L5hyMUMVV5GDQrU+UDIBJNKH8H0o9XsJY
2DequT4SsQ7aF8njhA21xZ8XLvMk6pu7SPNZPLszvapDEM9kgA39DTx6lfxBLw0V
D9v+pVQtuZqg37ge6jYBZrH1uoZm7vFbO1QXESVBUT4xmY5hjnsBpLUFcUOnuW1m
C39lpgtBPwC1umLVgVoyCAB5XCb6i1LMyFQ1ysLLbkBXyCfDs0ZhHSZeZm/3hbAx
n2POIAqxUseebhbqDjKHnYm57HckXA==
=z0CC
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 4 Jun 2020 14:05
(name . Marius Bakke)(address . marius@gnu.org)(address . 41625@debbugs.gnu.org)
87a71jc9yi.fsf@gnu.org
Hi,

Marius Bakke <marius@gnu.org> skribis:

Toggle quote (19 lines)
> Marius Bakke <marius@gnu.org> writes:
>
>> 'guix offload test' passes without problems.
>
> Not so fast, running it in a loop reveals the crash.
>
> There is a trace file in /root/offloadtest.trace on Berlin with such an
> occurence. It looks like a timeout is reached shortly before the EOF
> error:
>
> 10139 poll([{fd=14, events=POLLIN|POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT}])
> 10139 poll([{fd=14, events=POLLIN}], 1, 15000) = 0 (Timeout)
> 10139 write(2, "Backtrace:\n", 11) = 11
>
> This seems to be from a different node than the one reported previously,
> as the preceding connect() was to this machine:
>
> 10139 connect(44, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("141.80.167.186")}, 16) = -1 EINPROGRESS (Operation now in progress)

So it looks like ‘connect’ fails and eventually we get an EOF object.
However, I don’t see where that EOF comes from because the return value
of ‘connect!’ (the Guile-SSH procedure) is properly checked.

Ludo’.
M
M
Maxim Cournoyer wrote on 24 May 2021 07:33
(name . Ludovic Courtès)(address . ludo@gnu.org)
87mtsky9um.fsf@gmail.com
Hi,

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (31 lines)
> Hi,
>
> Marius Bakke <marius@gnu.org> skribis:
>
>> Marius Bakke <marius@gnu.org> writes:
>>
>>> 'guix offload test' passes without problems.
>>
>> Not so fast, running it in a loop reveals the crash.
>>
>> There is a trace file in /root/offloadtest.trace on Berlin with such an
>> occurence. It looks like a timeout is reached shortly before the EOF
>> error:
>>
>> 10139 poll([{fd=14, events=POLLIN|POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT}])
>> 10139 poll([{fd=14, events=POLLIN}], 1, 15000) = 0 (Timeout)
>> 10139 write(2, "Backtrace:\n", 11) = 11
>>
>> This seems to be from a different node than the one reported previously,
>> as the preceding connect() was to this machine:
>>
>> 10139 connect(44, {sa_family=AF_INET, sin_port=htons(22),
>> sin_addr=inet_addr("141.80.167.186")}, 16) = -1 EINPROGRESS
>> (Operation now in progress)
>
> So it looks like ‘connect’ fails and eventually we get an EOF object.
> However, I don’t see where that EOF comes from because the return value
> of ‘connect!’ (the Guile-SSH procedure) is properly checked.
>
> Ludo’.

I got a slightly different backtrace that suggests making the connection
is not at fault, rather it occurs during the read-repl-response call:

Toggle snippet (22 lines)
guix offload: testing 1 build machines defined in '/etc/guix/machines.scm'...
Backtrace:
8 (primitive-load "/home/maxim/.config/guix/current/bin/guix")
In guix/ui.scm:
2165:12 7 (run-guix-command _ . _)
In ice-9/boot-9.scm:
1752:10 6 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
1747:15 5 (with-exception-handler #<procedure 7f2caf885780 at ice-9/boot-9.scm:1831:7 (exn)> _ # _ # …)
In guix/scripts/offload.scm:
704:21 4 (check-machine-availability _ _)
In srfi/srfi-1.scm:
586:17 3 (map1 (#<session maxim@overdrive1.guix.gnu.org:52522 (connected) 7f2cae396fc0>))
In guix/inferior.scm:
258:2 2 (port->inferior _ _)
240:2 1 (read-repl-response _ _)
In ice-9/boot-9.scm:
1685:16 0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
Throw to key `match-error' with args `("match" "no matching pattern" #<eof>)'.

I seem to get this more often than not with the overdrive1 offload
machine.

Maxim
M
M
Maxim Cournoyer wrote on 25 May 2021 17:50
[PATCH] offload: Handle a possible EOF response from read-repl-response.
(address . 41625@debbugs.gnu.org)
20210525155003.27590-1-maxim.cournoyer@gmail.com

* guix/scripts/offload.scm (check-machine-availability): Refactor so that it
takes a single machine object, to allow for retrying a single machine. Handle
the case where the checks raised an exception due to the connection to the
build machine having been lost, and retry up to 3 times. Ensure the cleanup
code is run in all situations.
(check-machines-availability): New procedure. Call
CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
twice as fast with 4 build machines, from ~30 s to ~15 s).
* guix/inferior.scm (&inferior-connection-lost): New condition type.
(read-repl-response): Raise a condition of the above type when reading EOF
from the build machine's port.
---
guix/inferior.scm | 9 ++++++++
guix/scripts/offload.scm | 50 +++++++++++++++++++++++++++++-----------
2 files changed, 45 insertions(+), 14 deletions(-)

Toggle diff (132 lines)
diff --git a/guix/inferior.scm b/guix/inferior.scm
index 7c8e478f2a..4ac1ea3484 100644
--- a/guix/inferior.scm
+++ b/guix/inferior.scm
@@ -1,5 +1,6 @@
;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2018, 2019, 2020, 2021 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com>
;;;
;;; This file is part of GNU Guix.
;;;
@@ -70,6 +71,7 @@
inferior-exception-arguments
inferior-exception-inferior
inferior-exception-stack
+ inferior-connection-lost?
read-repl-response
inferior-packages
@@ -228,6 +230,9 @@ equivalent. Return #f if the inferior could not be launched."
(inferior inferior-exception-inferior) ;<inferior> | #f
(stack inferior-exception-stack)) ;list of (FILE COLUMN LINE)
+(define-condition-type &inferior-connection-lost &error
+ inferior-connection-lost?)
+
(define* (read-repl-response port #:optional inferior)
"Read a (guix repl) response from PORT and return it as a Scheme object.
Raise '&inferior-exception' when an exception is read from PORT."
@@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
(match (read port)
(('values objects ...)
(apply values (map sexp->object objects)))
+ ;; Unexpectedly read EOF from the port. This can happen for example when
+ ;; the underlying connection for PORT was lost with Guile-SSH.
+ (? eof-object?
+ (raise (condition (&inferior-connection-lost))))
(('exception ('arguments key objects ...)
('stack frames ...))
;; Protocol (0 1 1) and later.
diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scm
index 835078cb97..7c6d38b218 100644
--- a/guix/scripts/offload.scm
+++ b/guix/scripts/offload.scm
@@ -1,7 +1,7 @@
;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2014, 2015, 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
;;; Copyright © 2017 Ricardo Wurmus <rekado@elephly.net>
-;;; Copyright © 2020 Maxim Cournoyer <maxim.cournoyer@gmail.com>
+;;; Copyright © 2020, 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com>
;;; Copyright © 2020 Julien Lepiller <julien@lepiller.eu>
;;;
;;; This file is part of GNU Guix.
@@ -53,6 +53,7 @@
#:use-module (ice-9 regex)
#:use-module (ice-9 format)
#:use-module (ice-9 binary-ports)
+ #:use-module (ice-9 threads)
#:export (build-machine
build-machine?
build-machine-name
@@ -684,7 +685,7 @@ daemon is not running."
(leave (G_ "failed to import '~a' from '~a'~%")
item name)))))
-(define (check-machine-availability machine-file pred)
+(define (check-machines-availability machine-file pred)
"Check that each machine matching PRED in MACHINE-FILE is usable as a build
machine."
(define (build-machine=? m1 m2)
@@ -696,18 +697,39 @@ machine."
(let ((machines (filter pred
(delete-duplicates (build-machines machine-file)
build-machine=?))))
- (info (G_ "testing ~a build machines defined in '~a'...~%")
+ (info (G_ "Testing ~a build machines defined in '~a'...~%")
(length machines) machine-file)
- (let* ((names (map build-machine-name machines))
- (sockets (map build-machine-daemon-socket machines))
- (sessions (map (cut open-ssh-session <> %short-timeout) machines))
- (nodes (map remote-inferior sessions)))
- (for-each assert-node-has-guix nodes names)
- (for-each assert-node-repl nodes names)
- (for-each assert-node-can-import sessions nodes names sockets)
- (for-each assert-node-can-export sessions nodes names sockets)
- (for-each close-inferior nodes)
- (for-each disconnect! sessions))))
+ (par-for-each check-machine-availability machines)))
+
+(define (check-machine-availability machine)
+ "Check whether MACHINE is available. Exit with an error upon failure."
+ ;; Sometimes, the machine remote port may return EOF, presumably because the
+ ;; connection was lost. Retry up to 3 times.
+ (let loop ((retries 3))
+ (guard (c ((inferior-connection-lost? c)
+ (let ((retries-left (1- retries)))
+ (if (> retries-left 0)
+ (begin
+ (format (current-error-port)
+ (G_ "connection to machine ~s lost; retrying~%")
+ (build-machine-name machine))
+ (loop (retries-left)))
+ (leave (G_ "connection repeatedly lost with machine '~a'~%")
+ (build-machine-name machine))))))
+ (let* ((name (build-machine-name machine))
+ (socket (build-machine-daemon-socket machine))
+ (session (open-ssh-session machine %short-timeout))
+ (node (remote-inferior session)))
+ (dynamic-wind
+ (lambda () #t)
+ (lambda ()
+ (assert-node-has-guix node name)
+ (assert-node-repl node name)
+ (assert-node-can-import session node name socket)
+ (assert-node-can-export session node name socket))
+ (lambda ()
+ (close-inferior node)
+ (disconnect! session)))))))
(define (check-machine-status machine-file pred)
"Print the load of each machine matching PRED in MACHINE-FILE."
@@ -824,7 +846,7 @@ machine."
((file) (values file (const #t)))
(() (values %machine-file (const #t)))
(x (leave (G_ "wrong number of arguments~%"))))))
- (check-machine-availability (or file %machine-file) pred))))
+ (check-machines-availability (or file %machine-file) pred))))
(("status" rest ...)
(with-error-handling
(let-values (((file pred)
--
2.31.1
L
L
Ludovic Courtès wrote on 25 May 2021 22:27
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
875yz61rvt.fsf@gnu.org
Hi!

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (14 lines)
>
> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it
> takes a single machine object, to allow for retrying a single machine. Handle
> the case where the checks raised an exception due to the connection to the
> build machine having been lost, and retry up to 3 times. Ensure the cleanup
> code is run in all situations.
> (check-machines-availability): New procedure. Call
> CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
> twice as fast with 4 build machines, from ~30 s to ~15 s).
> * guix/inferior.scm (&inferior-connection-lost): New condition type.
> (read-repl-response): Raise a condition of the above type when reading EOF
> from the build machine's port.

[...]

Toggle quote (15 lines)
> +(define-condition-type &inferior-connection-lost &error
> + inferior-connection-lost?)
> +
> (define* (read-repl-response port #:optional inferior)
> "Read a (guix repl) response from PORT and return it as a Scheme object.
> Raise '&inferior-exception' when an exception is read from PORT."
> @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
> (match (read port)
> (('values objects ...)
> (apply values (map sexp->object objects)))
> + ;; Unexpectedly read EOF from the port. This can happen for example when
> + ;; the underlying connection for PORT was lost with Guile-SSH.
> + (? eof-object?
> + (raise (condition (&inferior-connection-lost))))

The match clause syntax is incorrect; should be:

((? eof-object?)
(raise …))

Toggle quote (14 lines)
> + (info (G_ "Testing ~a build machines defined in '~a'...~%")
> (length machines) machine-file)
> - (let* ((names (map build-machine-name machines))
> - (sockets (map build-machine-daemon-socket machines))
> - (sessions (map (cut open-ssh-session <> %short-timeout) machines))
> - (nodes (map remote-inferior sessions)))
> - (for-each assert-node-has-guix nodes names)
> - (for-each assert-node-repl nodes names)
> - (for-each assert-node-can-import sessions nodes names sockets)
> - (for-each assert-node-can-export sessions nodes names sockets)
> - (for-each close-inferior nodes)
> - (for-each disconnect! sessions))))
> + (par-for-each check-machine-availability machines)))

Why not! IMO this should go in a separate patch, though, since it’s not
related.

Toggle quote (16 lines)
> +(define (check-machine-availability machine)
> + "Check whether MACHINE is available. Exit with an error upon failure."
> + ;; Sometimes, the machine remote port may return EOF, presumably because the
> + ;; connection was lost. Retry up to 3 times.
> + (let loop ((retries 3))
> + (guard (c ((inferior-connection-lost? c)
> + (let ((retries-left (1- retries)))
> + (if (> retries-left 0)
> + (begin
> + (format (current-error-port)
> + (G_ "connection to machine ~s lost; retrying~%")
> + (build-machine-name machine))
> + (loop (retries-left)))
> + (leave (G_ "connection repeatedly lost with machine '~a'~%")
> + (build-machine-name machine))))))

I’m afraid we’re papering over problems here.

Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
berlin enough to reproduce the issue? If so, we could monitor/strace
sshd on overdrive1 to get a better understanding of what’s going on.

WDYT?

Thanks,
Ludo’.
M
M
Maxim Cournoyer wrote on 26 May 2021 05:18
Re: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
(name . Ludovic Courtès)(address . ludo@gnu.org)
87mtsikwsm.fsf_-_@gmail.com
Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

[...]

Toggle quote (17 lines)
>> (define* (read-repl-response port #:optional inferior)
>> "Read a (guix repl) response from PORT and return it as a Scheme object.
>> Raise '&inferior-exception' when an exception is read from PORT."
>> @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
>> (match (read port)
>> (('values objects ...)
>> (apply values (map sexp->object objects)))
>> + ;; Unexpectedly read EOF from the port. This can happen for example when
>> + ;; the underlying connection for PORT was lost with Guile-SSH.
>> + (? eof-object?
>> + (raise (condition (&inferior-connection-lost))))
>
> The match clause syntax is incorrect; should be:
>
> ((? eof-object?)
> (raise …))

Good catch, fixed.

Toggle quote (17 lines)
>> + (info (G_ "Testing ~a build machines defined in '~a'...~%")
>> (length machines) machine-file)
>> - (let* ((names (map build-machine-name machines))
>> - (sockets (map build-machine-daemon-socket machines))
>> - (sessions (map (cut open-ssh-session <> %short-timeout) machines))
>> - (nodes (map remote-inferior sessions)))
>> - (for-each assert-node-has-guix nodes names)
>> - (for-each assert-node-repl nodes names)
>> - (for-each assert-node-can-import sessions nodes names sockets)
>> - (for-each assert-node-can-export sessions nodes names sockets)
>> - (for-each close-inferior nodes)
>> - (for-each disconnect! sessions))))
>> + (par-for-each check-machine-availability machines)))
>
> Why not! IMO this should go in a separate patch, though, since it’s not
> related.

For me, it is related in that retrying all the checks of *every* build
offload machine would be too expensive; it already takes 32 s for my 4
offload machines; retrying this for up to 3 times would mean waiting for
a minute and half, which I don't find reasonable (imagine on berlin!).

Toggle quote (18 lines)
>> +(define (check-machine-availability machine)
>> + "Check whether MACHINE is available. Exit with an error upon failure."
>> + ;; Sometimes, the machine remote port may return EOF, presumably because the
>> + ;; connection was lost. Retry up to 3 times.
>> + (let loop ((retries 3))
>> + (guard (c ((inferior-connection-lost? c)
>> + (let ((retries-left (1- retries)))
>> + (if (> retries-left 0)
>> + (begin
>> + (format (current-error-port)
>> + (G_ "connection to machine ~s lost; retrying~%")
>> + (build-machine-name machine))
>> + (loop (retries-left)))
>> + (leave (G_ "connection repeatedly lost with machine '~a'~%")
>> + (build-machine-name machine))))))
>
> I’m afraid we’re papering over problems here.

I had that thought too, but then also realized that even if this was
papering over a problem, it'd be a good one to paper over as this
problem can legitimately happen in practice, due to the network's
inherently shaky nature. It seems better to be ready for it. Also, my
hopes in being able to troubleshoot such a difficult to reproduce
networking issue are rather low.

Toggle quote (4 lines)
> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
> berlin enough to reproduce the issue? If so, we could monitor/strace
> sshd on overdrive1 to get a better understanding of what’s going on.

It's actually difficult to trigger it; it seems to happen mostly on the
first try after a long time without connecting to the machine; on the
2nd and later tries, everything is smooth. Waiting a few minutes is not
enough to re-trigger the problem.

I've managed to see the problem a few lucky times with:

Toggle snippet (3 lines)
while true; do guix offload test /etc/guix/machines.scm overdrive1; done

I don't have a password set for my user on overdrive1, so can't attach
strace to sshd, but yeah, we could try to capture it and see if we can
understand what's going on.

Attached is v2 of the patch, with the match clause fixed.
Thanks!

Maxim
L
L
Ludovic Courtès wrote on 26 May 2021 11:14
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87fsy9x3ev.fsf@gnu.org
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (22 lines)
>>> + (info (G_ "Testing ~a build machines defined in '~a'...~%")
>>> (length machines) machine-file)
>>> - (let* ((names (map build-machine-name machines))
>>> - (sockets (map build-machine-daemon-socket machines))
>>> - (sessions (map (cut open-ssh-session <> %short-timeout) machines))
>>> - (nodes (map remote-inferior sessions)))
>>> - (for-each assert-node-has-guix nodes names)
>>> - (for-each assert-node-repl nodes names)
>>> - (for-each assert-node-can-import sessions nodes names sockets)
>>> - (for-each assert-node-can-export sessions nodes names sockets)
>>> - (for-each close-inferior nodes)
>>> - (for-each disconnect! sessions))))
>>> + (par-for-each check-machine-availability machines)))
>>
>> Why not! IMO this should go in a separate patch, though, since it’s not
>> related.
>
> For me, it is related in that retrying all the checks of *every* build
> offload machine would be too expensive; it already takes 32 s for my 4
> offload machines; retrying this for up to 3 times would mean waiting for
> a minute and half, which I don't find reasonable (imagine on berlin!).

I see. So I’d say it’s a prerequisite (a patch that must come before)
but not entirely the same thing. I’m nitpicking!

We should make sure it doesn’t trigger thread-safety issues in libssh or
anything like that (running it repeatedly on a large machines.scm should
give us some confidence).

Toggle quote (25 lines)
>>> +(define (check-machine-availability machine)
>>> + "Check whether MACHINE is available. Exit with an error upon failure."
>>> + ;; Sometimes, the machine remote port may return EOF, presumably because the
>>> + ;; connection was lost. Retry up to 3 times.
>>> + (let loop ((retries 3))
>>> + (guard (c ((inferior-connection-lost? c)
>>> + (let ((retries-left (1- retries)))
>>> + (if (> retries-left 0)
>>> + (begin
>>> + (format (current-error-port)
>>> + (G_ "connection to machine ~s lost; retrying~%")
>>> + (build-machine-name machine))
>>> + (loop (retries-left)))
>>> + (leave (G_ "connection repeatedly lost with machine '~a'~%")
>>> + (build-machine-name machine))))))
>>
>> I’m afraid we’re papering over problems here.
>
> I had that thought too, but then also realized that even if this was
> papering over a problem, it'd be a good one to paper over as this
> problem can legitimately happen in practice, due to the network's
> inherently shaky nature. It seems better to be ready for it. Also, my
> hopes in being able to troubleshoot such a difficult to reproduce
> networking issue are rather low.

Yes, but note that this is just for ‘guix offload test’. The actual
code run while offloading will still fail badly.

Toggle quote (17 lines)
>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
>> berlin enough to reproduce the issue? If so, we could monitor/strace
>> sshd on overdrive1 to get a better understanding of what’s going on.
>
> It's actually difficult to trigger it; it seems to happen mostly on the
> first try after a long time without connecting to the machine; on the
> 2nd and later tries, everything is smooth. Waiting a few minutes is not
> enough to re-trigger the problem.
>
> I've managed to see the problem a few lucky times with:
>
> while true; do guix offload test /etc/guix/machines.scm overdrive1; done
>
> I don't have a password set for my user on overdrive1, so can't attach
> strace to sshd, but yeah, we could try to capture it and see if we can
> understand what's going on.

OK.

Toggle quote (20 lines)
> From c52172502749a4d194dc51db9d2c394cb15e8d07 Mon Sep 17 00:00:00 2001
> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
> Date: Tue, 25 May 2021 08:42:06 -0400
> Subject: [PATCH] offload: Handle a possible EOF response from
> read-repl-response.
>
> Fixes <https://issues.guix.gnu.org/41625>.
>
> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it
> takes a single machine object, to allow for retrying a single machine. Handle
> the case where the checks raised an exception due to the connection to the
> build machine having been lost, and retry up to 3 times. Ensure the cleanup
> code is run in all situations.
> (check-machines-availability): New procedure. Call
> CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
> twice as fast with 4 build machines, from ~30 s to ~15 s).
> * guix/inferior.scm (&inferior-connection-lost): New condition type.
> (read-repl-response): Raise a condition of the above type when reading EOF
> from the build machine's port.

[...]

Toggle quote (3 lines)
> +(define-condition-type &inferior-connection-lost &error
> + inferior-connection-lost?)

Perhaps worth adding an ‘inferior’ and/or ‘port’ field. That would
allow the handler to present more information as to which inferior is
failing.

Maybe ‘premature-eof’ would be more accurate than ‘connection-lost’.

Toggle quote (4 lines)
> + (format (current-error-port)
> + (G_ "connection to machine '~a' lost; retrying~%")
> + (build-machine-name machine))

You can use ‘info’ instead of ‘format’.

Otherwise LGTM, thanks!

Ludo’.
M
M
Marius Bakke wrote on 26 May 2021 17:48
(address . 41625@debbugs.gnu.org)
87h7ipa433.fsf@gnu.org
Maxim Cournoyer <maxim.cournoyer@gmail.com> skriver:

Toggle quote (15 lines)
>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
>> berlin enough to reproduce the issue? If so, we could monitor/strace
>> sshd on overdrive1 to get a better understanding of what’s going on.
>
> It's actually difficult to trigger it; it seems to happen mostly on the
> first try after a long time without connecting to the machine; on the
> 2nd and later tries, everything is smooth. Waiting a few minutes is not
> enough to re-trigger the problem.
>
> I've managed to see the problem a few lucky times with:
>
> --8<---------------cut here---------------start------------->8---
> while true; do guix offload test /etc/guix/machines.scm overdrive1; done
> --8<---------------cut here---------------end--------------->8---

I used to be able to reproduce it by inducing a high load on the target
machine and just let Guix keep trying to connect. But now I did that,
and set overload threshold to 0.0 for good measure, and Guix has been
waiting patiently for two hours without failure.

So AFAICT this bug has been fixed. Perhaps Berlin or the Overdrive
simply needs to be updated?
-----BEGIN PGP SIGNATURE-----

iIUEARYKAC0WIQRNTknu3zbaMQ2ddzTocYulkRQQdwUCYK5t0A8cbWFyaXVzQGdu
dS5vcmcACgkQ6HGLpZEUEHcy7QEAlX2skErbZBDILToutLkOSP6mLf3xK2tdqBcO
ttgbpX0BANb7F5k14C9HhyQBYoexMCS9Mydjfx43TLIeAuKw2gkL
=41ZX
-----END PGP SIGNATURE-----

M
M
Maxim Cournoyer wrote on 27 May 2021 13:49
(name . Ludovic Courtès)(address . ludo@gnu.org)
877djkl7lp.fsf@gmail.com
Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

[...]

Toggle quote (3 lines)
> I see. So I’d say it’s a prerequisite (a patch that must come before)
> but not entirely the same thing. I’m nitpicking!

Eh, it's okay :-). Splitting changes into the right unit is a problem
that is akin to naming things; it's hard! I welcome your suggestion.

Toggle quote (4 lines)
> We should make sure it doesn’t trigger thread-safety issues in libssh or
> anything like that (running it repeatedly on a large machines.scm should
> give us some confidence).

It seems fine so far, but I've only tested in a loop with 4 build
machines. When it nears completion I'll give it a shot on berlin.

[...]

Toggle quote (3 lines)
> Yes, but note that this is just for ‘guix offload test’. The actual
> code run while offloading will still fail badly.

Ah, thanks for pointing that; I somehow thought that this machine status
checking code was a prelude to every offloaded build.

[...]

Toggle quote (6 lines)
>> I don't have a password set for my user on overdrive1, so can't attach
>> strace to sshd, but yeah, we could try to capture it and see if we can
>> understand what's going on.
>
> OK.

I'd be happy to try strace when your are available. You can ping me on
the chat. It's been more than 8 hours since I tried, so I should be
able to trigger the problem :-).

[...]

Toggle quote (6 lines)
> Perhaps worth adding an ‘inferior’ and/or ‘port’ field. That would
> allow the handler to present more information as to which inferior is
> failing.
>
> Maybe ‘premature-eof’ would be more accurate than ‘connection-lost’.

Good suggestions. I'll implement them.

Toggle quote (6 lines)
>> + (format (current-error-port)
>> + (G_ "connection to machine '~a' lost; retrying~%")
>> + (build-machine-name machine))
>
> You can use ‘info’ instead of ‘format’.

That also. Thanks!

On another note, I was able to 'exercise' the fix, and the exception is
raised but something fails with the following backtrace instead of being
retried:

Toggle snippet (27 lines)
guix offload: Testing 1 build machines defined in '/etc/guix/machines.scm'...
connection to machine 'overdrive1.guix.gnu.org' lost; retrying
Backtrace:
In ice-9/boot-9.scm:
1752:10 10 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In unknown file:
9 (apply-smob/0 #<thunk 7f915c028f60>)
In ice-9/boot-9.scm:
724:2 8 (call-with-prompt _ _ #<procedure default-prompt-handler (k proc)>)
In ice-9/eval.scm:
619:8 7 (_ #(#(#<directory (guile-user) 7f915c022c80>)))
In guix/ui.scm:
2161:12 6 (run-guix-command _ . _)
In ice-9/boot-9.scm:
1752:10 5 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
1747:15 4 (with-exception-handler #<procedure 7f91576bf0c0 at ice-9/boot-9.scm:1831:7 (exn)> _ # _ # …)
In srfi/srfi-1.scm:
634:9 3 (for-each #<procedure check-machine-availability (a)> (#<<build-machine> name: "overdriv…>))
In ice-9/eval.scm:
191:35 2 (_ #(#(#(#<directory (guix scripts offload) 7f9159852780> 3 #<<build-machine> na…> …) …) …))
Exception thrown while printing backtrace:
In procedure frame-local-ref: Argument 2 out of range: 1

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
Wrong type to apply: 2

I haven't been able to pinpoint what yet. Notice that in the above code
I've changed par-for-each by just for-each, doubting it might have
something to do with it, but it appears unrelated.

Thanks,

Maxim
M
M
Maxim Cournoyer wrote on 27 May 2021 13:51
(name . Marius Bakke)(address . marius@gnu.org)
8735u8l7iy.fsf@gmail.com
Hi Marius,

Marius Bakke <marius@gnu.org> writes:

Toggle quote (25 lines)
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skriver:
>
>>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
>>> berlin enough to reproduce the issue? If so, we could monitor/strace
>>> sshd on overdrive1 to get a better understanding of what’s going on.
>>
>> It's actually difficult to trigger it; it seems to happen mostly on the
>> first try after a long time without connecting to the machine; on the
>> 2nd and later tries, everything is smooth. Waiting a few minutes is not
>> enough to re-trigger the problem.
>>
>> I've managed to see the problem a few lucky times with:
>>
>> --8<---------------cut here---------------start------------->8---
>> while true; do guix offload test /etc/guix/machines.scm overdrive1; done
>> --8<---------------cut here---------------end--------------->8---
>
> I used to be able to reproduce it by inducing a high load on the target
> machine and just let Guix keep trying to connect. But now I did that,
> and set overload threshold to 0.0 for good measure, and Guix has been
> waiting patiently for two hours without failure.
>
> So AFAICT this bug has been fixed. Perhaps Berlin or the Overdrive
> simply needs to be updated?

Ah! Do you have root access to overdrive1? It'd be interesting to
reconfigure it to update the guix-daemon and see if the problem
vanishes.

Thanks for the information!

Maxim
M
M
Maxim Cournoyer wrote on 27 May 2021 16:57
Re: [PATCH v3] offload: Handle a possible EOF response from read-repl-response.
(name . Ludovic Courtès)(address . ludo@gnu.org)
87r1hsjkbv.fsf_-_@gmail.com
Hi Ludo,

Now that I have root access to overdrive1, I could strace the sshd
process (I just did 'strace -p340', noting the process of sshd displayed
with 'herd status sshd'):

Toggle snippet (32 lines)
pselect6(87, [3 4], NULL, NULL, NULL, NULL) = 1 (in [3])
accept(3, {sa_family=AF_INET, sin_port=htons(33262), sin_addr=inet_addr("66.158.152.121")}, [128->16]) = 5
fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
pipe2([6, 7], 0) = 0
socketpair(AF_UNIX, SOCK_STREAM, 0, [8, 9]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xffff8e0ef0e0) = 644
close(7) = 0
close(9) = 0
write(8, "\0\0\1\245\0", 5) = 5
write(8, "\0\0\1\234\nPort 22\nPermitRootLogin no\n"..., 420) = 420
close(8) = 0
close(5) = 0
getpid() = 340
getpid() = 340
getpid() = 340
getpid() = 340
getpid() = 340
getpid() = 340
getpid() = 340
pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])
read(6, "\0", 1) = 1
pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])
read(6, "", 1) = 0
close(6) = 0
pselect6(87, [3 4], NULL, NULL, NULL, NULL) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=644, si_uid=0, si_status=255, si_utime=1, si_stime=0} ---
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 255}], WNOHANG, NULL) = 644
wait4(-1, 0xfffffa4d90e4, WNOHANG, NULL) = 0
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
pselect6(87, [3 4], NULL, NULL, NULL, NULL <detached ...>

With the attached v3 patch, the corresponding output (still problematic)
was:

Toggle snippet (29 lines)
$ ./pre-inst-env guix offload test /etc/guix/machines.scm overdrive1
guix offload: Testing 1 build machines defined in '/etc/guix/machines.scm'...
guix offload: got premature EOF from machine 'overdrive1.guix.gnu.org' from inferior '#<inferior pipe (0 1 1) 7f6ee10d3400>' on port '#<input-output: channel (open) 7f6ee10c5500>'; retrying connection
Backtrace:
In ice-9/boot-9.scm:
1752:10 10 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In unknown file:
9 (apply-smob/0 #<thunk 7f6ee5ae9f60>)
In ice-9/boot-9.scm:
724:2 8 (call-with-prompt _ _ #<procedure default-prompt-handler (k proc)>)
In ice-9/eval.scm:
619:8 7 (_ #(#(#<directory (guile-user) 7f6ee5ae3c80>)))
In guix/ui.scm:
2161:12 6 (run-guix-command _ . _)
In ice-9/boot-9.scm:
1752:10 5 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
1747:15 4 (with-exception-handler #<procedure 7f6ee25e1480 at ice-9/boot-9.scm:1831:7 (exn)> _ # _ # …)
In ice-9/threads.scm:
288:21 3 (loop _)
In guix/scripts/offload.scm:
719:29 2 (_ _)
719:29 1 (_ _)
In ice-9/boot-9.scm:
1685:16 0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
Wrong type to apply: 2

I haven't tried analyzing the strace output yet.

Maxim
From c7b2ec1c58adf8c795df0a6aaf075dbc331f41e8 Mon Sep 17 00:00:00 2001
From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Date: Thu, 27 May 2021 08:44:44 -0400
Subject: [PATCH 1/2] offload: Parallelize machine check in offload test.

* guix/scripts/offload.scm (check-machine-availability): Refactor so that it
takes a single machine object. Ensure the cleanup code is always run.
(check-machines-availability): New procedure. Call
CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
twice as fast with 4 build machines, from ~30 s to ~15 s).
---
guix/scripts/offload.scm | 39 +++++++++++++++++++++++++--------------
1 file changed, 25 insertions(+), 14 deletions(-)

Toggle diff (81 lines)
diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scm
index 835078cb97..b0fd20e158 100644
--- a/guix/scripts/offload.scm
+++ b/guix/scripts/offload.scm
@@ -1,7 +1,7 @@
;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2014, 2015, 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
;;; Copyright © 2017 Ricardo Wurmus <rekado@elephly.net>
-;;; Copyright © 2020 Maxim Cournoyer <maxim.cournoyer@gmail.com>
+;;; Copyright © 2020, 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com>
;;; Copyright © 2020 Julien Lepiller <julien@lepiller.eu>
;;;
;;; This file is part of GNU Guix.
@@ -53,6 +53,7 @@
#:use-module (ice-9 regex)
#:use-module (ice-9 format)
#:use-module (ice-9 binary-ports)
+ #:use-module (ice-9 threads)
#:export (build-machine
build-machine?
build-machine-name
@@ -684,7 +685,7 @@ daemon is not running."
(leave (G_ "failed to import '~a' from '~a'~%")
item name)))))
-(define (check-machine-availability machine-file pred)
+(define (check-machines-availability machine-file pred)
"Check that each machine matching PRED in MACHINE-FILE is usable as a build
machine."
(define (build-machine=? m1 m2)
@@ -696,18 +697,28 @@ machine."
(let ((machines (filter pred
(delete-duplicates (build-machines machine-file)
build-machine=?))))
- (info (G_ "testing ~a build machines defined in '~a'...~%")
+ (info (G_ "Testing ~a build machines defined in '~a'...~%")
(length machines) machine-file)
- (let* ((names (map build-machine-name machines))
- (sockets (map build-machine-daemon-socket machines))
- (sessions (map (cut open-ssh-session <> %short-timeout) machines))
- (nodes (map remote-inferior sessions)))
- (for-each assert-node-has-guix nodes names)
- (for-each assert-node-repl nodes names)
- (for-each assert-node-can-import sessions nodes names sockets)
- (for-each assert-node-can-export sessions nodes names sockets)
- (for-each close-inferior nodes)
- (for-each disconnect! sessions))))
+ (par-for-each check-machine-availability machines)))
+
+(define (check-machine-availability machine)
+ "Check whether MACHINE is available. Exit with an error upon failure."
+ ;; Sometimes, the machine remote port may return EOF, presumably because the
+ ;; connection was lost. Retry up to 3 times.
+ (let* ((name (build-machine-name machine))
+ (socket (build-machine-daemon-socket machine))
+ (session (open-ssh-session machine %short-timeout))
+ (node (remote-inferior session)))
+ (dynamic-wind
+ (lambda () #t)
+ (lambda ()
+ (assert-node-has-guix node name)
+ (assert-node-repl node name)
+ (assert-node-can-import session node name socket)
+ (assert-node-can-export session node name socket))
+ (lambda ()
+ (close-inferior node)
+ (disconnect! session)))))
(define (check-machine-status machine-file pred)
"Print the load of each machine matching PRED in MACHINE-FILE."
@@ -824,7 +835,7 @@ machine."
((file) (values file (const #t)))
(() (values %machine-file (const #t)))
(x (leave (G_ "wrong number of arguments~%"))))))
- (check-machine-availability (or file %machine-file) pred))))
+ (check-machines-availability (or file %machine-file) pred))))
(("status" rest ...)
(with-error-handling
(let-values (((file pred)
--
2.31.1
M
M
Maxim Cournoyer wrote on 27 May 2021 19:20
Re: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
(name . Ludovic Courtès)(address . ludo@gnu.org)
87mtsgjdoz.fsf@gmail.com
Hello again,

My previous strace output didn't follow the 'clone' syscall; the one
attached does. It's compressed with lzip.
Attachment: sshd-strace.out.lz
Thanks!

Maxim
L
L
Ludovic Courtès wrote on 29 May 2021 21:24
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87tuml5onf.fsf@gnu.org
Hi Maxim,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (3 lines)
> My previous strace output didn't follow the 'clone' syscall; the one
> attached does. It's compressed with lzip.

Toggle snippet (52 lines)
11457 sendto(3, "<86>May 27 19:03:45 sshd[11457]: pam_unix(sshd:session): session opened for user maxim by (uid=0)", 97, MSG_NOSIGNAL, NULL, 0) = 97
11457 socketpair(AF_UNIX, SOCK_STREAM, 0, [5, 7]) = 0
11457 fcntl(5, F_SETFD, FD_CLOEXEC) = 0
11457 fcntl(7, F_SETFD, FD_CLOEXEC) = 0
11457 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD <unfinished ...>
11459 set_robust_list(0xffff9d2e10f0, 24 <unfinished ...>
11457 <... clone resumed>, child_tidptr=0xffff9d2e10e0) = 11459

[…]

11459 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD <unfinished ...>
11460 set_robust_list(0xffff9d2e10f0, 24 <unfinished ...>
11459 <... clone resumed>, child_tidptr=0xffff9d2e10e0) = 11460

[…]

11460 execve("/gnu/store/6b5gnpnbi73l0pasan0fip4w3f1afizi-bash-5.0.16/bin/bash", ["bash", "-c", "guix repl -t machine"], 0xaaaacfb78470 /* 16 vars */) = 0

[…]

11460 <... read resumed>"(use-modules (gnu))\n", 4096) = 20

[…]

11459 read(4, "", 16384) = 0
11459 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
11459 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
11459 close(11) = 0
11459 close(10) = 0
11459 close(13) = 0
11459 close(4) = 0
11459 brk(0xaaaacfba7000) = 0xaaaacfba7000
11459 write(5, "\0\0\0\0012", 5) = 5
11457 <... ppoll resumed>) = 1 ([{fd=7, revents=POLLIN}])
11459 exit_group(0 <unfinished ...>

[…]

11457 sendto(3, "<86>May 27 19:04:08 sshd[11457]: pam_unix(sshd:session): session closed for user maxim", 86, MSG_NOSIGNAL, NULL, 0) = 86

[…]

11460 openat(AT_FDCWD, "/gnu/store/83vzpl9pl5v86ncb8jy2gbfgsac0amjv-guix-module-union/lib/guile/3.0/site-ccache/gnu/packages/mtools.go", O_RDONLY|O_CLOEXEC) = 14
11460 lseek(14, 0, SEEK_END) = 72707
11460 mmap(NULL, 72707, PROT_READ, MAP_PRIVATE, 14, 0) = 0xffff90194000
11460 close(14) = 0
11460 mprotect(0xffff901a4000, 3432, PROT_READ|PROT_WRITE) = 0
11460 newfstatat(AT_FDCWD, "/gnu/store/83vzpl9pl5v86ncb8jy2gbfgsac0amjv-guix-module-union/share/guile/site/3.0/gnu/packages/patches/u-boot-rockchip-inno-usb.patch", {st_mode=S_IFREG|0444, st_size=10450, ...}, 0) = 0
11460 write(1, "(values (non-self-quoting 2052 \"#<unspecified>\"))\n", 50) = -1 EPIPE (Broken pipe)
11460 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=11460, si_uid=30013} ---

(‘guix repl’ is PID 11460.)

It looks like the SSH session gets closed while ‘guix repl’ is still
running, and thus the client never sees the reply from ‘guix repl’
(‘guix repl’ itself eventually gets EPIPE because its parent process
terminated and there’s no one listening to it.)

So the bug may be on the client side, in ‘guix offload’, which for some
reason drops the session too early. Can you enable libssh debugging
there, reproduce the problem, and share the libssh debug info from that
run?

Do enable libssh debugging, you can drop something like this in (guix
scripts offload) for instance:

(use-modules (ssh log))
(set-log-verbosity! 'protocol)

The available logging levels, from Guile-SSH’s log.c, are:

Toggle snippet (15 lines)
struct symbol_mapping log_verbosity[] = {
/* 0, No logging at all */
{ "nolog", SSH_LOG_NOLOG },
/* 1, Only rare and noteworthy events */
{ "rare", SSH_LOG_RARE },
/* 2, High level protocol information */
{ "protocol", SSH_LOG_PROTOCOL },
/* 3, Lower level protocol infomations, packet level */
{ "packet", SSH_LOG_PACKET },
/* 4, Every function path */
{ "functions", SSH_LOG_FUNCTIONS },
{ NULL, -1 }
};

HTH!

Ludo’.
L
L
Ludovic Courtès wrote on 5 Jul 2021 10:57
Re: bug#41625: Sporadic guix-offload crashes due to EOF errors
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
878s2lw330.fsf_-_@gnu.org
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (28 lines)
> Now that I have root access to overdrive1, I could strace the sshd
> process (I just did 'strace -p340', noting the process of sshd displayed
> with 'herd status sshd'):
>
> pselect6(87, [3 4], NULL, NULL, NULL, NULL) = 1 (in [3])
> accept(3, {sa_family=AF_INET, sin_port=htons(33262), sin_addr=inet_addr("66.158.152.121")}, [128->16]) = 5
> fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
> pipe2([6, 7], 0) = 0
> socketpair(AF_UNIX, SOCK_STREAM, 0, [8, 9]) = 0
> clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xffff8e0ef0e0) = 644
> close(7) = 0
> close(9) = 0
> write(8, "\0\0\1\245\0", 5) = 5
> write(8, "\0\0\1\234\nPort 22\nPermitRootLogin no\n"..., 420) = 420
> close(8) = 0
> close(5) = 0
> getpid() = 340
> getpid() = 340
> getpid() = 340
> getpid() = 340
> getpid() = 340
> getpid() = 340
> getpid() = 340
> pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])
> read(6, "\0", 1) = 1
> pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])
> read(6, "", 1) = 0

OK, so it looks as if the client disconnected right away. Hard to tell
exactly what that happened. :-/ Perhaps turning libssh debugging on on
the client side could help (by uncommenting “#:log-verbosity 'protocol”
in (guix ssh)).

Toggle quote (11 lines)
>>From c7b2ec1c58adf8c795df0a6aaf075dbc331f41e8 Mon Sep 17 00:00:00 2001
> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
> Date: Thu, 27 May 2021 08:44:44 -0400
> Subject: [PATCH 1/2] offload: Parallelize machine check in offload test.
>
> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it
> takes a single machine object. Ensure the cleanup code is always run.
> (check-machines-availability): New procedure. Call
> CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
> twice as fast with 4 build machines, from ~30 s to ~15 s).

I remain wary of this change, because that could lead to subtle
non-deterministic bugs (of the kind that keeps you busy for weeks) and
because I personally don’t mind if ‘guix offload test’ takes 30s on
berlin, and because the intermingled output may make diagnostics less
clear.

Toggle quote (15 lines)
>>From b5558777617e4674a150895458d57d202de56120 Mon Sep 17 00:00:00 2001
> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
> Date: Tue, 25 May 2021 08:42:06 -0400
> Subject: [PATCH 2/2] offload: Handle a possible EOF response from
> read-repl-response.
>
> Partially fixes <https://issues.guix.gnu.org/41625>.
>
> * guix/scripts/offload.scm (check-machine-availability): Handle the case where
> the checks raised an exception due to receiving EOF prematurely, and retry up
> to 3 times.
> * guix/inferior.scm (&inferior-premature-eof): New condition type.
> (read-repl-response): Raise a condition of the above type when reading EOF
> from the build machine's port.

LGTM!

Thanks,
Ludo’.
M
M
Maxim Cournoyer wrote on 24 Sep 2021 06:53
(name . Ludovic Courtès)(address . ludo@gnu.org)
87czoyk20t.fsf@gmail.com
Hello!

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (37 lines)
> Hi,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> Now that I have root access to overdrive1, I could strace the sshd
>> process (I just did 'strace -p340', noting the process of sshd displayed
>> with 'herd status sshd'):
>>
>> pselect6(87, [3 4], NULL, NULL, NULL, NULL) = 1 (in [3])
>> accept(3, {sa_family=AF_INET, sin_port=htons(33262), sin_addr=inet_addr("66.158.152.121")}, [128->16]) = 5
>> fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
>> pipe2([6, 7], 0) = 0
>> socketpair(AF_UNIX, SOCK_STREAM, 0, [8, 9]) = 0
>> clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xffff8e0ef0e0) = 644
>> close(7) = 0
>> close(9) = 0
>> write(8, "\0\0\1\245\0", 5) = 5
>> write(8, "\0\0\1\234\nPort 22\nPermitRootLogin no\n"..., 420) = 420
>> close(8) = 0
>> close(5) = 0
>> getpid() = 340
>> getpid() = 340
>> getpid() = 340
>> getpid() = 340
>> getpid() = 340
>> getpid() = 340
>> getpid() = 340
>> pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])
>> read(6, "\0", 1) = 1
>> pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])
>> read(6, "", 1) = 0
>
> OK, so it looks as if the client disconnected right away. Hard to tell
> exactly what that happened. :-/ Perhaps turning libssh debugging on on
> the client side could help (by uncommenting “#:log-verbosity 'protocol”
> in (guix ssh)).

I was able to better understand the problem after encountering it on
another low power ARM board. It's about the guile-ssh/libssh timeout
causing a channel read to return EOF.

I have one example here where it hangs at the (inferior-eval
'(use-modules (gnu)) result)' step; Guix runs for about 1m30s,
apparently loading all the package modules. Perhaps my
GUILE_COMPILED_PATH is not set correctly and things are slower than
expected. Not sure.

But what happens is that there's no output in the 15 s timeout that we
set for the SSH session elapses, and libssh's ssh_channel_read returns
0, which is the same value it returns when it encounters EOF. Guile's
peek_byte_or_eof turn that zero into an EOF. I've shared my analysis on
the guile-ssh bug tracker [0]

So information is lost at libssh's level, which is not so nice. Knowing
exactly how that EOF come into the picture, we can handle it and produce
better diagnostic though. I'll try reworking my original patch in that
direction.

Thanks,

Maxim

M
M
Maxim Cournoyer wrote on 24 Sep 2021 06:55
(name . Ludovic Courtès)(address . ludo@gnu.org)
878rzmk1x9.fsf@gmail.com
Oh, and for completion, here's a reproducer I used here to debug with my
low power offload machine:

Toggle snippet (65 lines)
x(use-modules (guix scripts offload)
(guix inferior)
((guix ssh) #:hide (open-ssh-session))

(ice-9 match)
(ssh popen))

(define machine (build-machine
(name "tm")
(port 22)
(user "root")
(systems (list "armhf-linux"))
;; ecdsa-sha2-nistp256
(host-key "ssh-ecdsa-p256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBIx32IqtlUr2VMypLItGSnmKzd0eJkgnzPkVbOZbn4HFQD5KvWB98t4IkT0PESjs+DDQfI/PwRENo3yqd5GjukA=")
(parallel-builds 2)
(compression "none")
(private-key "/home/maxim/.ssh/id_rsa.1")))

(define open-ssh-session (@@ (guix scripts offload) open-ssh-session))

(define session (open-ssh-session machine 15))

(define repl-command '("guix" "repl" "-t" "machine"))

(define pipe (apply open-remote-pipe* session OPEN_BOTH repl-command))

;;; Hangs
;;;(port->inferior pipe)

(define rest (match (read pipe)
(('repl-version 0 rest ...)
rest)))

(define %inferior-packages (@@ (guix inferior) %inferior-packages))

(define %inferior-package-table (@@ (guix inferior) %inferior-package-table))

(define inferior (@@ (guix inferior) inferior))
(define inferior-socket (@@ (guix inferior) inferior-socket))

(define result (inferior 'pipe pipe close (cons 0 rest)
(delay (%inferior-packages result))
(delay (%inferior-package-table result))))

(define send-inferior-request (@@ (guix inferior) send-inferior-request))

(send-inferior-request '(() repl-version 0 1 1) result)

(inferior-eval '(use-modules (guix)) result)

;;; Hangs!
;;t(inferior-eval '(use-modules (gnu)) result)

(send-inferior-request '(use-modules (gnu)) result)

(define socket (inferior-socket result))

;;Throw to key `match-error' with args `("match" "no matching pattern" #<eof>)'
;;(read-repl-response socket result)
;; -> guix using 100% cpu, reading package modules for ~ 1m30s

;;; guile-ssh returns <eof> on timeout? (15 s) -> yes
(read socket)

I wish the Guile debugger was more capable; it'd have spared me the
above manual labor :-)

Maxim
M
M
Maxim Cournoyer wrote on 26 Mar 2022 06:03
(name . Marius Bakke)(address . marius@gnu.org)
87v8w1wbiu.fsf_-_@gmail.com
Hello,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (33 lines)
> Hi Marius,
>
> Marius Bakke <marius@gnu.org> writes:
>
>> Maxim Cournoyer <maxim.cournoyer@gmail.com> skriver:
>>
>>>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
>>>> berlin enough to reproduce the issue? If so, we could monitor/strace
>>>> sshd on overdrive1 to get a better understanding of what’s going on.
>>>
>>> It's actually difficult to trigger it; it seems to happen mostly on the
>>> first try after a long time without connecting to the machine; on the
>>> 2nd and later tries, everything is smooth. Waiting a few minutes is not
>>> enough to re-trigger the problem.
>>>
>>> I've managed to see the problem a few lucky times with:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> while true; do guix offload test /etc/guix/machines.scm overdrive1; done
>>> --8<---------------cut here---------------end--------------->8---
>>
>> I used to be able to reproduce it by inducing a high load on the target
>> machine and just let Guix keep trying to connect. But now I did that,
>> and set overload threshold to 0.0 for good measure, and Guix has been
>> waiting patiently for two hours without failure.
>>
>> So AFAICT this bug has been fixed. Perhaps Berlin or the Overdrive
>> simply needs to be updated?
>
> Ah! Do you have root access to overdrive1? It'd be interesting to
> reconfigure it to update the guix-daemon and see if the problem
> vanishes.

Good news, this seems resolved with the newer Guile-SSH 0.15.1, where
long delays to return some output no longer triggers an EOF response
(instead now the client waits still). I believe it was fixed by this
commit [0].

Many thanks to Artyom Poptsov for fixing it!

Closing.

Maxim

Closed
?