Offloading sometimes hangs

OpenSubmitted by Ludovic Courtès.
Details
3 participants
  • Ludovic Courtès
  • Maxim Cournoyer
  • Mathieu Othacehe
Owner
unassigned
Severity
normal
L
L
Ludovic Courtès wrote on 10 Jan 2019 17:09
(address . bug-guix@gnu.org)
87o98obikk.fsf@gnu.org
Hello,
So there’s another situation where offloading regularly hangs onberlin. The ‘guix offload’ process looks like this:
Toggle snippet (53 lines)(gdb) bt#0 0x00007f1f715686a1 in __GI___poll (fds=0x14e9b30, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29#1 0x00007f1f673b94e7 in ssh_poll (timeout=<optimized out>, nfds=<optimized out>, fds=<optimized out>) at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/poll.c:98#2 ssh_poll_ctx_dopoll (ctx=ctx@entry=0x14ee2e0, timeout=timeout@entry=-1) at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/poll.c:612#3 0x00007f1f673ba449 in ssh_handle_packets (session=session@entry=0x2249360, timeout=timeout@entry=-1) at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/session.c:634#4 0x00007f1f673ba51d in ssh_handle_packets_termination (session=session@entry=0x2249360, timeout=<optimized out>, timeout@entry=-3, fct=fct@entry=0x7f1f673a4430 <ssh_channel_read_termination>, user=user@entry=0x7ffce23953f0) at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/session.c:696#5 0x00007f1f673a6aaf in ssh_channel_read_timeout (channel=0x224e360, dest=dest@entry=0x18ef020, count=count@entry=8, is_stderr=<optimized out>, timeout=-3, timeout@entry=-1) at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/channels.c:2705#6 0x00007f1f673a6bbb in ssh_channel_read (channel=<optimized out>, dest=dest@entry=0x18ef020, count=count@entry=8, is_stderr=<optimized out>) at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/channels.c:2621#7 0x00007f1f67413a23 in read_from_channel_port ( channel=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=<optimized out>, start=0, count=8) at channel-type.c:161#8 0x00007f1f71b65287 in scm_i_read_bytes ( port=port@entry=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=dst@entry="#<vu8vector>" = {...}, start=start@entry=0, count=count@entry=8) at ports.c:1559#9 0x00007f1f71b6996c in scm_c_read_bytes ( port=port@entry=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=dst@entry="#<vu8vector>" = {...}, start=start@entry=0, count=count@entry=8) at ports.c:1639#10 0x00007f1f71b6fd80 in scm_get_bytevector_n ( port=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, count=<optimized out>) at r6rs-ports.c:421#11 0x00007f1f71ba4715 in vm_regular_engine (thread=0x14e9b30, vp=0xc31f30, registers=0xffffffff, resume=1901495969) at vm-engine.c:786
[...]
(gdb) p *fds$1 = {fd = 15, events = 1, revents = 0}(gdb) shell ls -l /proc/12185/fdtotal 0lr-x------ 1 root root 64 Jan 10 16:56 0 -> 'pipe:[76778016]'l-wx------ 1 root root 64 Jan 10 16:56 1 -> 'pipe:[76778015]'lr-x------ 1 root root 64 Jan 10 16:56 10 -> 'pipe:[76838317]'l-wx------ 1 root root 64 Jan 10 16:56 11 -> 'pipe:[76838317]'lr-x------ 1 root root 64 Jan 10 16:56 12 -> 'pipe:[76851360]'l-wx------ 1 root root 64 Jan 10 16:56 13 -> 'pipe:[76851360]'l-wx------ 1 root root 64 Jan 10 16:56 14 -> /var/guix/offload/overdrive1.guixsd.org/1lrwx------ 1 root root 64 Jan 10 16:56 15 -> 'socket:[76860702]'lr-x------ 1 root root 64 Jan 10 16:56 16 -> /dev/urandoml-wx------ 1 root root 64 Jan 10 16:56 2 -> 'pipe:[76778015]'lr-x------ 1 root root 64 Jan 10 16:56 3 -> 'pipe:[76838313]'l-wx------ 1 root root 64 Jan 10 16:56 4 -> 'pipe:[76778017]'l-wx------ 1 root root 64 Jan 10 16:56 5 -> 'pipe:[76838313]'lr-x------ 1 root root 64 Jan 10 16:56 6 -> 'pipe:[76838316]'l-wx------ 1 root root 64 Jan 10 16:56 7 -> 'pipe:[76838316]'lr-x------ 1 root root 64 Jan 10 16:56 8 -> 'pipe:[76841414]'l-wx------ 1 root root 64 Jan 10 16:56 9 -> 'pipe:[76841414]'
It’s a ‘get-bytevector-n’ for 8 bytes, so it looks like the daemonprotocol. At that point the socket is actually dead: if I connect onthe remote machine (overdrive1.guixsd.org) I can see that there are noother open SSH sessions.
A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUPinstead of just POLLIN.
Additionally, we could change Guile-SSH so that we can specify a timeoutwhen reading from a channel.
Ludo’.
L
L
Ludovic Courtès wrote on 14 Jan 2019 23:45
(address . 34033@debbugs.gnu.org)
87fttuq2mz.fsf@gnu.org
Hello,
Ludovic Courtès <ludo@gnu.org> skribis:
Toggle quote (3 lines)> A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP> instead of just POLLIN.
Reported here:
https://www.libssh.org/archive/libssh/2019-01/0000000.html
A fix has been proposed by upstream and should be committed shortly.
Toggle quote (3 lines)> Additionally, we could change Guile-SSH so that we can specify a timeout> when reading from a channel.
Turns out we can set a per-session timeout, which we already do (see#:timeout in ‘open-ssh-session’ in (guix scripts offload)) but‘ssh_channel_read’ would ignore it and instead pass an infinite timeoutto poll(2):
https://www.libssh.org/archive/libssh/2019-01/0000001.html
This issue happens to be fixed in libssh 0.8.x, so I upgraded our libsshpackage in commit a8b0556ea1e439c89dc1ba33c8864e8b9b811f08.
(That still doesn’t tell us why our ‘guix offload’ processes wouldoccasionally be stuck but at least it ensures the build farm keepsmaking progress even when that happens.)
Ludo’.
M
M
Maxim Cournoyer wrote on 22 Feb 2020 05:37
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 34033@debbugs.gnu.org)
87wo8fqlu5.fsf@apteryx.i-did-not-set--mail-host-address--so-tickle-me
Hello Ludovic,
Ludovic Courtès <ludo@gnu.org> writes:
Toggle quote (32 lines)> Hello,>> Ludovic Courtès <ludo@gnu.org> skribis:>>> A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP>> instead of just POLLIN.>> Reported here:>> https://www.libssh.org/archive/libssh/2019-01/0000000.html>> A fix has been proposed by upstream and should be committed shortly.>>> Additionally, we could change Guile-SSH so that we can specify a timeout>> when reading from a channel.>> Turns out we can set a per-session timeout, which we already do (see> #:timeout in ‘open-ssh-session’ in (guix scripts offload)) but> ‘ssh_channel_read’ would ignore it and instead pass an infinite timeout> to poll(2):>> https://www.libssh.org/archive/libssh/2019-01/0000001.html>> This issue happens to be fixed in libssh 0.8.x, so I upgraded our libssh> package in commit a8b0556ea1e439c89dc1ba33c8864e8b9b811f08.>> (That still doesn’t tell us why our ‘guix offload’ processes would> occasionally be stuck but at least it ensures the build farm keeps> making progress even when that happens.)>> Ludo’.
Seems the patch in the response at the URL you linked is awaiting somefeedback/review. Is this the reason 'guix substitute' hangs for so longwhen the substitute server is down? (like 1 minute or so).
Maxim
L
L
Ludovic Courtès wrote on 22 Feb 2020 21:35
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 34033@debbugs.gnu.org)
87v9nyuzq1.fsf@gnu.org
Hi Maxim,
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
Toggle quote (38 lines)> Ludovic Courtès <ludo@gnu.org> writes:>>> Hello,>>>> Ludovic Courtès <ludo@gnu.org> skribis:>>>>> A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP>>> instead of just POLLIN.>>>> Reported here:>>>> https://www.libssh.org/archive/libssh/2019-01/0000000.html>>>> A fix has been proposed by upstream and should be committed shortly.>>>>> Additionally, we could change Guile-SSH so that we can specify a timeout>>> when reading from a channel.>>>> Turns out we can set a per-session timeout, which we already do (see>> #:timeout in ‘open-ssh-session’ in (guix scripts offload)) but>> ‘ssh_channel_read’ would ignore it and instead pass an infinite timeout>> to poll(2):>>>> https://www.libssh.org/archive/libssh/2019-01/0000001.html>>>> This issue happens to be fixed in libssh 0.8.x, so I upgraded our libssh>> package in commit a8b0556ea1e439c89dc1ba33c8864e8b9b811f08.>>>> (That still doesn’t tell us why our ‘guix offload’ processes would>> occasionally be stuck but at least it ensures the build farm keeps>> making progress even when that happens.)>>>> Ludo’.>> Seems the patch in the response at the URL you linked is awaiting some> feedback/review. Is this the reason 'guix substitute' hangs for so long> when the substitute server is down? (like 1 minute or so).
The issues above are in libssh and were fixed a while ago. ‘guixsubstitute’ doesn’t use Guile-SSH/libssh, so the problem you’re seeingmust be something different.
What do you mean by “the substitute server is down”? You mean ‘guixpublish’ is not running, or the machine is unavailable altogether?
Thanks,Ludo’.
M
M
Maxim Cournoyer wrote on 24 Feb 2020 14:59
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 34033@debbugs.gnu.org)
87blpof5mq.fsf@gmail.com
Hello Ludovic,
Ludovic Courtès <ludo@gnu.org> writes:

[...]
Toggle quote (4 lines)> The issues above are in libssh and were fixed a while ago. ‘guix> substitute’ doesn’t use Guile-SSH/libssh, so the problem you’re seeing> must be something different.
OK, good to know!
Toggle quote (3 lines)> What do you mean by “the substitute server is down”? You mean ‘guix> publish’ is not running, or the machine is unavailable altogether?
The machine is turned off (i.e., the machine is unavailable altogether:-). It doesn't hang forever, but the timeout is a rather long. I'musing that machine as both a substitute and an offload server.
Maxim
L
L
Ludovic Courtès wrote on 24 Feb 2020 15:59
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 34033@debbugs.gnu.org)
87lfosm3p0.fsf@gnu.org
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
Toggle quote (2 lines)> Ludovic Courtès <ludo@gnu.org> writes:
[...]
Toggle quote (7 lines)>> What do you mean by “the substitute server is down”? You mean ‘guix>> publish’ is not running, or the machine is unavailable altogether?>> The machine is turned off (i.e., the machine is unavailable altogether> :-). It doesn't hang forever, but the timeout is a rather long. I'm> using that machine as both a substitute and an offload server.
I don’t think it’s Guix-specific though. Presumably, if you do:
telnet that-server https
you get the same timeout, right?
Thanks,Ludo’.
M
M
Mathieu Othacehe wrote on 2 Jul 2020 16:20
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 34033@debbugs.gnu.org)
87pn9ec82g.fsf@gnu.org
Hello,
Toggle quote (4 lines)> (That still doesn’t tell us why our ‘guix offload’ processes would> occasionally be stuck but at least it ensures the build farm keeps> making progress even when that happens.)
I'm still not sure it's directly related to this bug but I observedseveral offloading hangs on Berlin today.
For instance, in Cuirass logs:
Toggle snippet (4 lines)2020-07-02T09:59:45 '/gnu/store/rm8ndiichxhwybaizis5pgck77952ilp-halt.drv' offloaded to '141.80.167.164'2020-07-02T09:54:30 '/gnu/store/dxczkbf5wa6qr37gm7wr995hcxs8s0ya-motion-4.2.2.drv' offloaded to '141.80.167.170'
those two builds were offloaded around 10:00 today and there's still noreport from them at 16:00.
On 141.80.167.164 there's a matching build log:
Toggle snippet (3 lines)-rw-r--r-- 1 root root 1735 Jul 2 10:00 /var/log/guix/drvs/rm/8ndiichxhwybaizis5pgck77952ilp-halt.drv.bz2
same on 141.80.167.170,
Toggle snippet (3 lines)-rw-r--r-- 1 root root 6344 Jul 2 09:56 /var/log/guix/drvs/dx/czkbf5wa6qr37gm7wr995hcxs8s0ya-motion-4.2.2.drv.bz2
Having those builds "unfinished" keeps the rest of the evaluationhanging. Running this SQL command in Cuirass database:
Toggle snippet (13 lines)sqlite> select derivation, datetime(starttime, 'unixepoch', 'localtime'),stoptime from Builds where status=-1 and evaluation=14771;/gnu/store/ncp59nyidli4lm3ff2lkfjym25yb18j5-guix-1.1.0-14.5bd8033.drv|2020-07-02 09:33:04|0/gnu/store/rm8ndiichxhwybaizis5pgck77952ilp-halt.drv|2020-07-02 09:59:28|0/gnu/store/71wnjgm2waqgw3fqmxmc4r3f1ifd1l92-cups-test.drv|2020-07-02 10:00:26|0/gnu/store/9qsqd7jfwnaw9sm323y45cwymn98kyjl-exim-test.drv|2020-07-02 10:00:51|0/gnu/store/vhcww4fw4qxw0hl1009npd26b22gfj3c-bitlbee-test.drv|2020-07-02 10:00:24|0/gnu/store/92jrd6dfzgdifr107hwi64s8hf4mls47-iptables.drv|2020-07-02 09:59:49|0/gnu/store/380nq6sjphd0agrvl43sr6ypli1yraz4-gnunet-0.12.2.drv|2020-07-02 09:51:32|0/gnu/store/lqs22nbc6vy2z2524rmkcsmbh5mllm62-cuirass-0.0.1-37.882393d.drv|2020-07-02 10:34:37|0/gnu/store/dxczkbf5wa6qr37gm7wr995hcxs8s0ya-motion-4.2.2.drv|2020-07-02 09:54:02|0/gnu/store/5ln3r997ycr7rd6fqahd2d426mjw0rxb-gzochi-0.12.drv|2020-07-02 09:53:51|0
shows that the evaluation is pretty much pending since 10:00.
According to Cuirass logs again, all those builds were offloaded,"/gnu/store/380nq6sjphd0agrvl43sr6ypli1yraz4-gnunet-0.12.2.drv","/gnu/store/lqs22nbc6vy2z2524rmkcsmbh5mllm62-cuirass-0.0.1-37.882393d.drv"and /gnu/store/5ln3r997ycr7rd6fqahd2d426mjw0rxb-gzochi-0.12.drvare reported as failed, and all other are still hanging.
Something is going wrong here! I'll keep investigating.
Thanks,
Mathieu
M
M
Mathieu Othacehe wrote on 3 Jul 2020 09:05
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 34033@debbugs.gnu.org)
877dvlkriv.fsf@gnu.org
Hey,
Toggle quote (2 lines)> Something is going wrong here! I'll keep investigating.
To help us investigate those issues I added a "/status" page, which isalso accessible from a new drop-down menu in the Cuirass navigation bar.
See, https://ci.guix.gnu.org/status.
Hydra has the same interface, but also a "Machine status" page, thatbreaks down the running builds machine per machine. I plan to implementthat one next. Reading Hydra code, I also discovered that some part ofthe offloading is directly done from Hydra, which talks with thenix-daemon of the connected build machines, interesting!
While I'm writing, we have 5 running builds for ~1 hour, and 76040 queuedbuilds. Given the computing power of Berlin, there must be a bottlenecksomewhere.
Thanks,
Mathieu
L
L
Ludovic Courtès wrote on 3 Jul 2020 15:58
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 34033@debbugs.gnu.org)
871rlsog2o.fsf@gnu.org
Hi!
Mathieu Othacehe <othacehe@gnu.org> skribis:
Toggle quote (7 lines)>> Something is going wrong here! I'll keep investigating.>> To help us investigate those issues I added a "/status" page, which is> also accessible from a new drop-down menu in the Cuirass navigation bar.>> See, https://ci.guix.gnu.org/status.
Nice! So it’s roughly like the info at /api/queue, but filtered torunning builds, right?
Toggle quote (6 lines)> Hydra has the same interface, but also a "Machine status" page, that> breaks down the running builds machine per machine. I plan to implement> that one next. Reading Hydra code, I also discovered that some part of> the offloading is directly done from Hydra, which talks with the> nix-daemon of the connected build machines, interesting!
Yes, Hydra does most of the scheduling by itself. Since this isredundant with what the daemon + offload do, I thought Cuirass shouldn’tdo any scheduling at all and instead let the daemon take care of itall.
This has advantages (the daemon has a global view and can achieve betterscheduling), and drawbacks (the protocol requires us to wait for‘build-things’ completion before we can queue more builds, andscheduling decisions are almost invisible to Cuirass).
Toggle quote (4 lines)> While I'm writing, we have 5 running builds for ~1 hour, and 76040 queued> builds. Given the computing power of Berlin, there must be a bottleneck> somewhere.
Yes! I’ve often run “guix processes” on berlin, then stracing the‘SessionPID’ process. It’s insightful because you sometimes see thedaemon is stuck waiting for a machine to offload to, sometimes it’sstuck waiting for a build that will perhaps just eventually timeout…
Ludo’.
?
Your comment

Commenting via the web interface is currently disabled.

To comment on this conversation send email to 34033@debbugs.gnu.org