Offloading sometimes hangs

  • Open
  • quality assurance status badge
Details
3 participants
  • Ludovic Courtès
  • Maxim Cournoyer
  • Mathieu Othacehe
Owner
unassigned
Submitted by
Ludovic Courtès
Severity
normal
L
L
Ludovic Courtès wrote on 10 Jan 2019 17:09
(address . bug-guix@gnu.org)
87o98obikk.fsf@gnu.org
Hello,

So there’s another situation where offloading regularly hangs on
berlin. The ‘guix offload’ process looks like this:

Toggle snippet (53 lines)
(gdb) bt
#0 0x00007f1f715686a1 in __GI___poll (fds=0x14e9b30, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007f1f673b94e7 in ssh_poll (timeout=<optimized out>, nfds=<optimized out>, fds=<optimized out>)
at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/poll.c:98
#2 ssh_poll_ctx_dopoll (ctx=ctx@entry=0x14ee2e0, timeout=timeout@entry=-1)
at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/poll.c:612
#3 0x00007f1f673ba449 in ssh_handle_packets (session=session@entry=0x2249360, timeout=timeout@entry=-1)
at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/session.c:634
#4 0x00007f1f673ba51d in ssh_handle_packets_termination (session=session@entry=0x2249360, timeout=<optimized out>,
timeout@entry=-3, fct=fct@entry=0x7f1f673a4430 <ssh_channel_read_termination>, user=user@entry=0x7ffce23953f0)
at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/session.c:696
#5 0x00007f1f673a6aaf in ssh_channel_read_timeout (channel=0x224e360, dest=dest@entry=0x18ef020,
count=count@entry=8, is_stderr=<optimized out>, timeout=-3, timeout@entry=-1)
at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/channels.c:2705
#6 0x00007f1f673a6bbb in ssh_channel_read (channel=<optimized out>, dest=dest@entry=0x18ef020, count=count@entry=8,
is_stderr=<optimized out>) at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/channels.c:2621
#7 0x00007f1f67413a23 in read_from_channel_port (
channel=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=<optimized out>, start=0, count=8) at channel-type.c:161
#8 0x00007f1f71b65287 in scm_i_read_bytes (
port=port@entry=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=dst@entry="#<vu8vector>" = {...}, start=start@entry=0, count=count@entry=8) at ports.c:1559
#9 0x00007f1f71b6996c in scm_c_read_bytes (
port=port@entry=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=dst@entry="#<vu8vector>" = {...}, start=start@entry=0, count=count@entry=8) at ports.c:1639
#10 0x00007f1f71b6fd80 in scm_get_bytevector_n (
port=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0,
count=<optimized out>) at r6rs-ports.c:421
#11 0x00007f1f71ba4715 in vm_regular_engine (thread=0x14e9b30, vp=0xc31f30, registers=0xffffffff, resume=1901495969)
at vm-engine.c:786

[...]

(gdb) p *fds
$1 = {fd = 15, events = 1, revents = 0}
(gdb) shell ls -l /proc/12185/fd
total 0
lr-x------ 1 root root 64 Jan 10 16:56 0 -> 'pipe:[76778016]'
l-wx------ 1 root root 64 Jan 10 16:56 1 -> 'pipe:[76778015]'
lr-x------ 1 root root 64 Jan 10 16:56 10 -> 'pipe:[76838317]'
l-wx------ 1 root root 64 Jan 10 16:56 11 -> 'pipe:[76838317]'
lr-x------ 1 root root 64 Jan 10 16:56 12 -> 'pipe:[76851360]'
l-wx------ 1 root root 64 Jan 10 16:56 13 -> 'pipe:[76851360]'
l-wx------ 1 root root 64 Jan 10 16:56 14 -> /var/guix/offload/overdrive1.guixsd.org/1
lrwx------ 1 root root 64 Jan 10 16:56 15 -> 'socket:[76860702]'
lr-x------ 1 root root 64 Jan 10 16:56 16 -> /dev/urandom
l-wx------ 1 root root 64 Jan 10 16:56 2 -> 'pipe:[76778015]'
lr-x------ 1 root root 64 Jan 10 16:56 3 -> 'pipe:[76838313]'
l-wx------ 1 root root 64 Jan 10 16:56 4 -> 'pipe:[76778017]'
l-wx------ 1 root root 64 Jan 10 16:56 5 -> 'pipe:[76838313]'
lr-x------ 1 root root 64 Jan 10 16:56 6 -> 'pipe:[76838316]'
l-wx------ 1 root root 64 Jan 10 16:56 7 -> 'pipe:[76838316]'
lr-x------ 1 root root 64 Jan 10 16:56 8 -> 'pipe:[76841414]'
l-wx------ 1 root root 64 Jan 10 16:56 9 -> 'pipe:[76841414]'

It’s a ‘get-bytevector-n’ for 8 bytes, so it looks like the daemon
protocol. At that point the socket is actually dead: if I connect on
the remote machine (overdrive1.guixsd.org) I can see that there are no
other open SSH sessions.

A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP
instead of just POLLIN.

Additionally, we could change Guile-SSH so that we can specify a timeout
when reading from a channel.

Ludo’.
L
L
Ludovic Courtès wrote on 14 Jan 2019 23:45
(address . 34033@debbugs.gnu.org)
87fttuq2mz.fsf@gnu.org
Hello,

Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (3 lines)
> A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP
> instead of just POLLIN.

Reported here:


A fix has been proposed by upstream and should be committed shortly.

Toggle quote (3 lines)
> Additionally, we could change Guile-SSH so that we can specify a timeout
> when reading from a channel.

Turns out we can set a per-session timeout, which we already do (see
#:timeout in ‘open-ssh-session’ in (guix scripts offload)) but
‘ssh_channel_read’ would ignore it and instead pass an infinite timeout
to poll(2):


This issue happens to be fixed in libssh 0.8.x, so I upgraded our libssh
package in commit a8b0556ea1e439c89dc1ba33c8864e8b9b811f08.

(That still doesn’t tell us why our ‘guix offload’ processes would
occasionally be stuck but at least it ensures the build farm keeps
making progress even when that happens.)

Ludo’.
M
M
Maxim Cournoyer wrote on 22 Feb 2020 05:37
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 34033@debbugs.gnu.org)
87wo8fqlu5.fsf@apteryx.i-did-not-set--mail-host-address--so-tickle-me
Hello Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (32 lines)
> Hello,
>
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP
>> instead of just POLLIN.
>
> Reported here:
>
> https://www.libssh.org/archive/libssh/2019-01/0000000.html
>
> A fix has been proposed by upstream and should be committed shortly.
>
>> Additionally, we could change Guile-SSH so that we can specify a timeout
>> when reading from a channel.
>
> Turns out we can set a per-session timeout, which we already do (see
> #:timeout in ‘open-ssh-session’ in (guix scripts offload)) but
> ‘ssh_channel_read’ would ignore it and instead pass an infinite timeout
> to poll(2):
>
> https://www.libssh.org/archive/libssh/2019-01/0000001.html
>
> This issue happens to be fixed in libssh 0.8.x, so I upgraded our libssh
> package in commit a8b0556ea1e439c89dc1ba33c8864e8b9b811f08.
>
> (That still doesn’t tell us why our ‘guix offload’ processes would
> occasionally be stuck but at least it ensures the build farm keeps
> making progress even when that happens.)
>
> Ludo’.

Seems the patch in the response at the URL you linked is awaiting some
feedback/review. Is this the reason 'guix substitute' hangs for so long
when the substitute server is down? (like 1 minute or so).

Maxim
L
L
Ludovic Courtès wrote on 22 Feb 2020 21:35
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 34033@debbugs.gnu.org)
87v9nyuzq1.fsf@gnu.org
Hi Maxim,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (38 lines)
> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Hello,
>>
>> Ludovic Courtès <ludo@gnu.org> skribis:
>>
>>> A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP
>>> instead of just POLLIN.
>>
>> Reported here:
>>
>> https://www.libssh.org/archive/libssh/2019-01/0000000.html
>>
>> A fix has been proposed by upstream and should be committed shortly.
>>
>>> Additionally, we could change Guile-SSH so that we can specify a timeout
>>> when reading from a channel.
>>
>> Turns out we can set a per-session timeout, which we already do (see
>> #:timeout in ‘open-ssh-session’ in (guix scripts offload)) but
>> ‘ssh_channel_read’ would ignore it and instead pass an infinite timeout
>> to poll(2):
>>
>> https://www.libssh.org/archive/libssh/2019-01/0000001.html
>>
>> This issue happens to be fixed in libssh 0.8.x, so I upgraded our libssh
>> package in commit a8b0556ea1e439c89dc1ba33c8864e8b9b811f08.
>>
>> (That still doesn’t tell us why our ‘guix offload’ processes would
>> occasionally be stuck but at least it ensures the build farm keeps
>> making progress even when that happens.)
>>
>> Ludo’.
>
> Seems the patch in the response at the URL you linked is awaiting some
> feedback/review. Is this the reason 'guix substitute' hangs for so long
> when the substitute server is down? (like 1 minute or so).

The issues above are in libssh and were fixed a while ago. ‘guix
substitute’ doesn’t use Guile-SSH/libssh, so the problem you’re seeing
must be something different.

What do you mean by “the substitute server is down”? You mean ‘guix
publish’ is not running, or the machine is unavailable altogether?

Thanks,
Ludo’.
M
M
Maxim Cournoyer wrote on 24 Feb 2020 14:59
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 34033@debbugs.gnu.org)
87blpof5mq.fsf@gmail.com
Hello Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:


[...]

Toggle quote (4 lines)
> The issues above are in libssh and were fixed a while ago. ‘guix
> substitute’ doesn’t use Guile-SSH/libssh, so the problem you’re seeing
> must be something different.

OK, good to know!

Toggle quote (3 lines)
> What do you mean by “the substitute server is down”? You mean ‘guix
> publish’ is not running, or the machine is unavailable altogether?

The machine is turned off (i.e., the machine is unavailable altogether
:-). It doesn't hang forever, but the timeout is a rather long. I'm
using that machine as both a substitute and an offload server.

Maxim
L
L
Ludovic Courtès wrote on 24 Feb 2020 15:59
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 34033@debbugs.gnu.org)
87lfosm3p0.fsf@gnu.org
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (2 lines)
> Ludovic Courtès <ludo@gnu.org> writes:

[...]

Toggle quote (7 lines)
>> What do you mean by “the substitute server is down”? You mean ‘guix
>> publish’ is not running, or the machine is unavailable altogether?
>
> The machine is turned off (i.e., the machine is unavailable altogether
> :-). It doesn't hang forever, but the timeout is a rather long. I'm
> using that machine as both a substitute and an offload server.

I don’t think it’s Guix-specific though. Presumably, if you do:

telnet that-server https

you get the same timeout, right?

Thanks,
Ludo’.
M
M
Mathieu Othacehe wrote on 2 Jul 2020 16:20
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 34033@debbugs.gnu.org)
87pn9ec82g.fsf@gnu.org
Hello,

Toggle quote (4 lines)
> (That still doesn’t tell us why our ‘guix offload’ processes would
> occasionally be stuck but at least it ensures the build farm keeps
> making progress even when that happens.)

I'm still not sure it's directly related to this bug but I observed
several offloading hangs on Berlin today.

For instance, in Cuirass logs:

Toggle snippet (4 lines)
2020-07-02T09:59:45 '/gnu/store/rm8ndiichxhwybaizis5pgck77952ilp-halt.drv' offloaded to '141.80.167.164'
2020-07-02T09:54:30 '/gnu/store/dxczkbf5wa6qr37gm7wr995hcxs8s0ya-motion-4.2.2.drv' offloaded to '141.80.167.170'

those two builds were offloaded around 10:00 today and there's still no
report from them at 16:00.

On 141.80.167.164 there's a matching build log:

Toggle snippet (3 lines)
-rw-r--r-- 1 root root 1735 Jul 2 10:00 /var/log/guix/drvs/rm/8ndiichxhwybaizis5pgck77952ilp-halt.drv.bz2

same on 141.80.167.170,

Toggle snippet (3 lines)
-rw-r--r-- 1 root root 6344 Jul 2 09:56 /var/log/guix/drvs/dx/czkbf5wa6qr37gm7wr995hcxs8s0ya-motion-4.2.2.drv.bz2

Having those builds "unfinished" keeps the rest of the evaluation
hanging. Running this SQL command in Cuirass database:

Toggle snippet (13 lines)
sqlite> select derivation, datetime(starttime, 'unixepoch', 'localtime'),stoptime from Builds where status=-1 and evaluation=14771;
/gnu/store/ncp59nyidli4lm3ff2lkfjym25yb18j5-guix-1.1.0-14.5bd8033.drv|2020-07-02 09:33:04|0
/gnu/store/rm8ndiichxhwybaizis5pgck77952ilp-halt.drv|2020-07-02 09:59:28|0
/gnu/store/71wnjgm2waqgw3fqmxmc4r3f1ifd1l92-cups-test.drv|2020-07-02 10:00:26|0
/gnu/store/9qsqd7jfwnaw9sm323y45cwymn98kyjl-exim-test.drv|2020-07-02 10:00:51|0
/gnu/store/vhcww4fw4qxw0hl1009npd26b22gfj3c-bitlbee-test.drv|2020-07-02 10:00:24|0
/gnu/store/92jrd6dfzgdifr107hwi64s8hf4mls47-iptables.drv|2020-07-02 09:59:49|0
/gnu/store/380nq6sjphd0agrvl43sr6ypli1yraz4-gnunet-0.12.2.drv|2020-07-02 09:51:32|0
/gnu/store/lqs22nbc6vy2z2524rmkcsmbh5mllm62-cuirass-0.0.1-37.882393d.drv|2020-07-02 10:34:37|0
/gnu/store/dxczkbf5wa6qr37gm7wr995hcxs8s0ya-motion-4.2.2.drv|2020-07-02 09:54:02|0
/gnu/store/5ln3r997ycr7rd6fqahd2d426mjw0rxb-gzochi-0.12.drv|2020-07-02 09:53:51|0

shows that the evaluation is pretty much pending since 10:00.

According to Cuirass logs again, all those builds were offloaded,
"/gnu/store/380nq6sjphd0agrvl43sr6ypli1yraz4-gnunet-0.12.2.drv",
"/gnu/store/lqs22nbc6vy2z2524rmkcsmbh5mllm62-cuirass-0.0.1-37.882393d.drv"
and /gnu/store/5ln3r997ycr7rd6fqahd2d426mjw0rxb-gzochi-0.12.drv
are reported as failed, and all other are still hanging.

Something is going wrong here! I'll keep investigating.

Thanks,

Mathieu
M
M
Mathieu Othacehe wrote on 3 Jul 2020 09:05
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 34033@debbugs.gnu.org)
877dvlkriv.fsf@gnu.org
Hey,

Toggle quote (2 lines)
> Something is going wrong here! I'll keep investigating.

To help us investigate those issues I added a "/status" page, which is
also accessible from a new drop-down menu in the Cuirass navigation bar.


Hydra has the same interface, but also a "Machine status" page, that
breaks down the running builds machine per machine. I plan to implement
that one next. Reading Hydra code, I also discovered that some part of
the offloading is directly done from Hydra, which talks with the
nix-daemon of the connected build machines, interesting!

While I'm writing, we have 5 running builds for ~1 hour, and 76040 queued
builds. Given the computing power of Berlin, there must be a bottleneck
somewhere.

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 3 Jul 2020 15:58
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 34033@debbugs.gnu.org)
871rlsog2o.fsf@gnu.org
Hi!

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (7 lines)
>> Something is going wrong here! I'll keep investigating.
>
> To help us investigate those issues I added a "/status" page, which is
> also accessible from a new drop-down menu in the Cuirass navigation bar.
>
> See, https://ci.guix.gnu.org/status.

Nice! So it’s roughly like the info at /api/queue, but filtered to
running builds, right?

Toggle quote (6 lines)
> Hydra has the same interface, but also a "Machine status" page, that
> breaks down the running builds machine per machine. I plan to implement
> that one next. Reading Hydra code, I also discovered that some part of
> the offloading is directly done from Hydra, which talks with the
> nix-daemon of the connected build machines, interesting!

Yes, Hydra does most of the scheduling by itself. Since this is
redundant with what the daemon + offload do, I thought Cuirass shouldn’t
do any scheduling at all and instead let the daemon take care of it
all.

This has advantages (the daemon has a global view and can achieve better
scheduling), and drawbacks (the protocol requires us to wait for
‘build-things’ completion before we can queue more builds, and
scheduling decisions are almost invisible to Cuirass).

Toggle quote (4 lines)
> While I'm writing, we have 5 running builds for ~1 hour, and 76040 queued
> builds. Given the computing power of Berlin, there must be a bottleneck
> somewhere.

Yes! I’ve often run “guix processes” on berlin, then stracing the
‘SessionPID’ process. It’s insightful because you sometimes see the
daemon is stuck waiting for a machine to offload to, sometimes it’s
stuck waiting for a build that will perhaps just eventually timeout…

Ludo’.
?