(address . bug-guix@gnu.org)
Hello,
So there’s another situation where offloading regularly hangs on
berlin. The ‘guix offload’ process looks like this:
(gdb) bt
#0 0x00007f1f715686a1 in __GI___poll (fds=0x14e9b30, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007f1f673b94e7 in ssh_poll (timeout=<optimized out>, nfds=<optimized out>, fds=<optimized out>)
at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/poll.c:98
#2 ssh_poll_ctx_dopoll (ctx=ctx@entry=0x14ee2e0, timeout=timeout@entry=-1)
at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/poll.c:612
#3 0x00007f1f673ba449 in ssh_handle_packets (session=session@entry=0x2249360, timeout=timeout@entry=-1)
at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/session.c:634
#4 0x00007f1f673ba51d in ssh_handle_packets_termination (session=session@entry=0x2249360, timeout=<optimized out>,
timeout@entry=-3, fct=fct@entry=0x7f1f673a4430 <ssh_channel_read_termination>, user=user@entry=0x7ffce23953f0)
at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/session.c:696
#5 0x00007f1f673a6aaf in ssh_channel_read_timeout (channel=0x224e360, dest=dest@entry=0x18ef020,
count=count@entry=8, is_stderr=<optimized out>, timeout=-3, timeout@entry=-1)
at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/channels.c:2705
#6 0x00007f1f673a6bbb in ssh_channel_read (channel=<optimized out>, dest=dest@entry=0x18ef020, count=count@entry=8,
is_stderr=<optimized out>) at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/channels.c:2621
#7 0x00007f1f67413a23 in read_from_channel_port (
channel=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=<optimized out>, start=0, count=8) at channel-type.c:161
#8 0x00007f1f71b65287 in scm_i_read_bytes (
port=port@entry=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=dst@entry="#<vu8vector>" = {...}, start=start@entry=0, count=count@entry=8) at ports.c:1559
#9 0x00007f1f71b6996c in scm_c_read_bytes (
port=port@entry=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=dst@entry="#<vu8vector>" = {...}, start=start@entry=0, count=count@entry=8) at ports.c:1639
#10 0x00007f1f71b6fd80 in scm_get_bytevector_n (
port=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0,
count=<optimized out>) at r6rs-ports.c:421
#11 0x00007f1f71ba4715 in vm_regular_engine (thread=0x14e9b30, vp=0xc31f30, registers=0xffffffff, resume=1901495969)
at vm-engine.c:786
[...]
(gdb) p *fds
$1 = {fd = 15, events = 1, revents = 0}
(gdb) shell ls -l /proc/12185/fd
total 0
lr-x------ 1 root root 64 Jan 10 16:56 0 -> 'pipe:[76778016]'
l-wx------ 1 root root 64 Jan 10 16:56 1 -> 'pipe:[76778015]'
lr-x------ 1 root root 64 Jan 10 16:56 10 -> 'pipe:[76838317]'
l-wx------ 1 root root 64 Jan 10 16:56 11 -> 'pipe:[76838317]'
lr-x------ 1 root root 64 Jan 10 16:56 12 -> 'pipe:[76851360]'
l-wx------ 1 root root 64 Jan 10 16:56 13 -> 'pipe:[76851360]'
l-wx------ 1 root root 64 Jan 10 16:56 14 -> /var/guix/offload/overdrive1.guixsd.org/1
lrwx------ 1 root root 64 Jan 10 16:56 15 -> 'socket:[76860702]'
lr-x------ 1 root root 64 Jan 10 16:56 16 -> /dev/urandom
l-wx------ 1 root root 64 Jan 10 16:56 2 -> 'pipe:[76778015]'
lr-x------ 1 root root 64 Jan 10 16:56 3 -> 'pipe:[76838313]'
l-wx------ 1 root root 64 Jan 10 16:56 4 -> 'pipe:[76778017]'
l-wx------ 1 root root 64 Jan 10 16:56 5 -> 'pipe:[76838313]'
lr-x------ 1 root root 64 Jan 10 16:56 6 -> 'pipe:[76838316]'
l-wx------ 1 root root 64 Jan 10 16:56 7 -> 'pipe:[76838316]'
lr-x------ 1 root root 64 Jan 10 16:56 8 -> 'pipe:[76841414]'
l-wx------ 1 root root 64 Jan 10 16:56 9 -> 'pipe:[76841414]'
The process is blocked in poll(2), with no timeout, on file descriptor
15, which is the SSH socket. The read is a ‘get-bytevector-n’ for 8
bytes, so it looks like the daemon protocol. At that point the socket
is actually dead: if I log in to the remote machine
(overdrive1.guixsd.org), I can see that there are no other open SSH
sessions.
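For context on why an 8-byte read points at the daemon protocol: that protocol frames each integer as 8 bytes, little-endian. Below is a minimal sketch of that decoding in C. The function name ‘decode_u64_le’ is made up for illustration; this is not code from Guix, Guile-SSH, or libssh.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode an unsigned 64-bit little-endian integer from an 8-byte buffer.
   This is the kind of value the blocked 8-byte read would be waiting for.
   Hypothetical helper, for illustration only. */
uint64_t decode_u64_le(const unsigned char *buf)
{
    assert(buf != NULL);

    uint64_t value = 0;
    for (size_t i = 0; i < 8; i++)
        value |= (uint64_t)buf[i] << (8 * i);   /* byte 0 is least significant */
    return value;
}
```

The reader cannot decode anything until all 8 bytes have arrived, so a peer that goes away before sending them leaves the read blocked.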
A simple fix would be to get libssh to poll for POLLIN | POLLRDHUP
instead of just POLLIN.
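To make the idea concrete, here is a rough sketch in plain C of polling with POLLIN | POLLRDHUP; the ‘events = 1’ in the gdb output above is POLLIN alone. It uses a local socket pair rather than libssh, and the names ‘wait_readable_or_hangup’ and ‘peer_close_is_reported’ are invented for this example. Note that POLLRDHUP is only reported once the local kernel knows the peer has closed or shut down its end; a connection that disappears silently would still need a timeout.

```c
#define _GNU_SOURCE             /* glibc exposes POLLRDHUP under _GNU_SOURCE */
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef POLLRDHUP               /* fallback: the value on most Linux ports */
#define POLLRDHUP 0x2000
#endif

/* Wait until FD is readable or its peer has shut down.
   Returns the revents mask, 0 on timeout, or -1 on error. */
int wait_readable_or_hangup(int fd, int timeout_ms)
{
    assert(fd >= 0);

    struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLRDHUP, .revents = 0 };
    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;
    return n == 0 ? 0 : pfd.revents;
}

/* Demo: close one end of a socket pair and poll the other end.
   Returns 1 if the hang-up was reported, 0 otherwise. */
int peer_close_is_reported(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 0;

    close(sv[1]);                       /* the "remote" end goes away */
    int revents = wait_readable_or_hangup(sv[0], 1000);
    close(sv[0]);

    return revents > 0 && (revents & POLLRDHUP) != 0;
}
```

A caller that sees POLLRDHUP in the returned mask can treat the connection as finished instead of waiting for more bytes.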
Additionally, we could change Guile-SSH so that we can specify a timeout
when reading from a channel.
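As an illustration of what a read timeout buys us, here is a generic sketch on a plain file descriptor. It is not the Guile-SSH or libssh API: ‘read_with_timeout’ and its return convention are assumptions made up for this example. In the backtrace above, frame #0 shows poll being called with timeout=-1, which means block indefinitely; the sketch passes a finite value instead.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Read up to COUNT bytes from FD, waiting at most TIMEOUT_MS milliseconds
   for data to arrive.  Returns the number of bytes read, 0 on end-of-file,
   -1 on error, or -2 if the timeout expired first.
   Hypothetical helper, for illustration only. */
ssize_t read_with_timeout(int fd, void *buf, size_t count, int timeout_ms)
{
    assert(buf != NULL);

    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int n;

    /* Retry when interrupted by a signal.  For simplicity this restarts
       the full timeout rather than tracking the time already spent. */
    do
        n = poll(&pfd, 1, timeout_ms);
    while (n < 0 && errno == EINTR);

    if (n < 0)
        return -1;
    if (n == 0)
        return -2;                      /* nothing arrived in time */
    return read(fd, buf, count);
}
```

With something like this, the caller gets a distinct "timed out" result and can decide to abort the offload or reconnect rather than hang forever.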
Ludo’.