Offloading sometimes hangs

Open

Details

3 participants

Ludovic Courtès
Maxim Cournoyer
Mathieu Othacehe

Owner: unassigned

Submitted by: Ludovic Courtès

Severity: normal

Ludovic Courtès wrote on 10 Jan 2019 17:09

Recipients:(address . bug-guix@gnu.org)

Message-ID:87o98obikk.fsf@gnu.org

Hello,

So there’s another situation where offloading regularly hangs on

berlin. The ‘guix offload’ process looks like this:

Toggle snippet (53 lines)(gdb) bt
#0  0x00007f1f715686a1 in __GI___poll (fds=0x14e9b30, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f1f673b94e7 in ssh_poll (timeout=<optimized out>, nfds=<optimized out>, fds=<optimized out>)
    at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/poll.c:98
#2  ssh_poll_ctx_dopoll (ctx=ctx@entry=0x14ee2e0, timeout=timeout@entry=-1)
    at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/poll.c:612
#3  0x00007f1f673ba449 in ssh_handle_packets (session=session@entry=0x2249360, timeout=timeout@entry=-1)
    at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/session.c:634
#4  0x00007f1f673ba51d in ssh_handle_packets_termination (session=session@entry=0x2249360, timeout=<optimized out>,
    timeout@entry=-3, fct=fct@entry=0x7f1f673a4430 <ssh_channel_read_termination>, user=user@entry=0x7ffce23953f0)
    at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/session.c:696
#5  0x00007f1f673a6aaf in ssh_channel_read_timeout (channel=0x224e360, dest=dest@entry=0x18ef020,
    count=count@entry=8, is_stderr=<optimized out>, timeout=-3, timeout@entry=-1)
    at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/channels.c:2705
#6  0x00007f1f673a6bbb in ssh_channel_read (channel=<optimized out>, dest=dest@entry=0x18ef020, count=count@entry=8,
    is_stderr=<optimized out>) at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/channels.c:2621
#7  0x00007f1f67413a23 in read_from_channel_port (
    channel=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=<optimized out>, start=0, count=8) at channel-type.c:161
#8  0x00007f1f71b65287 in scm_i_read_bytes (
    port=port@entry=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=dst@entry="#<vu8vector>" = {...}, start=start@entry=0, count=count@entry=8) at ports.c:1559
#9  0x00007f1f71b6996c in scm_c_read_bytes (
    port=port@entry=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=dst@entry="#<vu8vector>" = {...}, start=start@entry=0, count=count@entry=8) at ports.c:1639
#10 0x00007f1f71b6fd80 in scm_get_bytevector_n (
    port=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0,
    count=<optimized out>) at r6rs-ports.c:421
#11 0x00007f1f71ba4715 in vm_regular_engine (thread=0x14e9b30, vp=0xc31f30, registers=0xffffffff, resume=1901495969)
    at vm-engine.c:786

[...]

(gdb) p *fds
$1 = {fd = 15, events = 1, revents = 0}
(gdb) shell ls -l /proc/12185/fd
total 0
lr-x------ 1 root root 64 Jan 10 16:56 0 -> 'pipe:[76778016]'
l-wx------ 1 root root 64 Jan 10 16:56 1 -> 'pipe:[76778015]'
lr-x------ 1 root root 64 Jan 10 16:56 10 -> 'pipe:[76838317]'
l-wx------ 1 root root 64 Jan 10 16:56 11 -> 'pipe:[76838317]'
lr-x------ 1 root root 64 Jan 10 16:56 12 -> 'pipe:[76851360]'
l-wx------ 1 root root 64 Jan 10 16:56 13 -> 'pipe:[76851360]'
l-wx------ 1 root root 64 Jan 10 16:56 14 -> /var/guix/offload/overdrive1.guixsd.org/1
lrwx------ 1 root root 64 Jan 10 16:56 15 -> 'socket:[76860702]'
lr-x------ 1 root root 64 Jan 10 16:56 16 -> /dev/urandom
l-wx------ 1 root root 64 Jan 10 16:56 2 -> 'pipe:[76778015]'
lr-x------ 1 root root 64 Jan 10 16:56 3 -> 'pipe:[76838313]'
l-wx------ 1 root root 64 Jan 10 16:56 4 -> 'pipe:[76778017]'
l-wx------ 1 root root 64 Jan 10 16:56 5 -> 'pipe:[76838313]'
lr-x------ 1 root root 64 Jan 10 16:56 6 -> 'pipe:[76838316]'
l-wx------ 1 root root 64 Jan 10 16:56 7 -> 'pipe:[76838316]'
lr-x------ 1 root root 64 Jan 10 16:56 8 -> 'pipe:[76841414]'
l-wx------ 1 root root 64 Jan 10 16:56 9 -> 'pipe:[76841414]'

It’s a ‘get-bytevector-n’ for 8 bytes, so it looks like the daemon

protocol. At that point the socket is actually dead: if I connect on

the remote machine (overdrive1.guixsd.org) I can see that there are no

other open SSH sessions.

A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP

instead of just POLLIN.

Additionally, we could change Guile-SSH so that we can specify a timeout

when reading from a channel.

Ludo’.

Ludovic Courtès wrote on 14 Jan 2019 23:45

Recipients:(address . 34033@debbugs.gnu.org)

Message-ID:87fttuq2mz.fsf@gnu.org

Hello,

Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (3 lines)

> A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP

> instead of just POLLIN.

Reported here:

https://www.libssh.org/archive/libssh/2019-01/0000000.html

A fix has been proposed by upstream and should be committed shortly.

Toggle quote (3 lines)

> Additionally, we could change Guile-SSH so that we can specify a timeout

> when reading from a channel.

Turns out we can set a per-session timeout, which we already do (see

#:timeout in ‘open-ssh-session’ in (guix scripts offload)) but

‘ssh_channel_read’ would ignore it and instead pass an infinite timeout

to poll(2):

https://www.libssh.org/archive/libssh/2019-01/0000001.html

This issue happens to be fixed in libssh 0.8.x, so I upgraded our libssh

package in commit a8b0556ea1e439c89dc1ba33c8864e8b9b811f08.

(That still doesn’t tell us why our ‘guix offload’ processes would

occasionally be stuck but at least it ensures the build farm keeps

making progress even when that happens.)

Ludo’.

Maxim Cournoyer wrote on 22 Feb 2020 05:37

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 34033@debbugs.gnu.org)

Message-ID:87wo8fqlu5.fsf@apteryx.i-did-not-set--mail-host-address--so-tickle-me

Hello Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (32 lines)> Hello,
>
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP
>> instead of just POLLIN.
>
> Reported here:
>
>   https://www.libssh.org/archive/libssh/2019-01/0000000.html
>
> A fix has been proposed by upstream and should be committed shortly.
>
>> Additionally, we could change Guile-SSH so that we can specify a timeout
>> when reading from a channel.
>
> Turns out we can set a per-session timeout, which we already do (see
> #:timeout in ‘open-ssh-session’ in (guix scripts offload)) but
> ‘ssh_channel_read’ would ignore it and instead pass an infinite timeout
> to poll(2):
>
>   https://www.libssh.org/archive/libssh/2019-01/0000001.html
>
> This issue happens to be fixed in libssh 0.8.x, so I upgraded our libssh
> package in commit a8b0556ea1e439c89dc1ba33c8864e8b9b811f08.
>
> (That still doesn’t tell us why our ‘guix offload’ processes would
> occasionally be stuck but at least it ensures the build farm keeps
> making progress even when that happens.)
>
> Ludo’.

Seems the patch in the response at the URL you linked is awaiting some

feedback/review. Is this the reason 'guix substitute' hangs for so long

when the substitute server is down? (like 1 minute or so).

Maxim

Ludovic Courtès wrote on 22 Feb 2020 21:35

Recipients:(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 34033@debbugs.gnu.org)

Message-ID:87v9nyuzq1.fsf@gnu.org

Hi Maxim,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (38 lines)> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Hello,
>>
>> Ludovic Courtès <ludo@gnu.org> skribis:
>>
>>> A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP
>>> instead of just POLLIN.
>>
>> Reported here:
>>
>>   https://www.libssh.org/archive/libssh/2019-01/0000000.html
>>
>> A fix has been proposed by upstream and should be committed shortly.
>>
>>> Additionally, we could change Guile-SSH so that we can specify a timeout
>>> when reading from a channel.
>>
>> Turns out we can set a per-session timeout, which we already do (see
>> #:timeout in ‘open-ssh-session’ in (guix scripts offload)) but
>> ‘ssh_channel_read’ would ignore it and instead pass an infinite timeout
>> to poll(2):
>>
>>   https://www.libssh.org/archive/libssh/2019-01/0000001.html
>>
>> This issue happens to be fixed in libssh 0.8.x, so I upgraded our libssh
>> package in commit a8b0556ea1e439c89dc1ba33c8864e8b9b811f08.
>>
>> (That still doesn’t tell us why our ‘guix offload’ processes would
>> occasionally be stuck but at least it ensures the build farm keeps
>> making progress even when that happens.)
>>
>> Ludo’.
>
> Seems the patch in the response at the URL you linked is awaiting some
> feedback/review.  Is this the reason 'guix substitute' hangs for so long
> when the substitute server is down? (like 1 minute or so).

The issues above are in libssh and were fixed a while ago. ‘guix

substitute’ doesn’t use Guile-SSH/libssh, so the problem you’re seeing

must be something different.

What do you mean by “the substitute server is down”? You mean ‘guix

publish’ is not running, or the machine is unavailable altogether?

Thanks,

Ludo’.

Maxim Cournoyer wrote on 24 Feb 2020 14:59

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 34033@debbugs.gnu.org)

Message-ID:87blpof5mq.fsf@gmail.com

Hello Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

[...]

Toggle quote (4 lines)

> The issues above are in libssh and were fixed a while ago. ‘guix

> substitute’ doesn’t use Guile-SSH/libssh, so the problem you’re seeing

> must be something different.

OK, good to know!

Toggle quote (3 lines)

> What do you mean by “the substitute server is down”? You mean ‘guix

> publish’ is not running, or the machine is unavailable altogether?

The machine is turned off (i.e., the machine is unavailable altogether

:-). It doesn't hang forever, but the timeout is a rather long. I'm

using that machine as both a substitute and an offload server.

Maxim

Ludovic Courtès wrote on 24 Feb 2020 15:59

Recipients:(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 34033@debbugs.gnu.org)

Message-ID:87lfosm3p0.fsf@gnu.org

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (2 lines)

> Ludovic Courtès <ludo@gnu.org> writes:

[...]

Toggle quote (7 lines)

>> What do you mean by “the substitute server is down”? You mean ‘guix

>> publish’ is not running, or the machine is unavailable altogether?

> The machine is turned off (i.e., the machine is unavailable altogether

> :-). It doesn't hang forever, but the timeout is a rather long. I'm

> using that machine as both a substitute and an offload server.

I don’t think it’s Guix-specific though. Presumably, if you do:

telnet that-server https

you get the same timeout, right?

Thanks,

Ludo’.

Mathieu Othacehe wrote on 2 Jul 2020 16:20

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 34033@debbugs.gnu.org)

Message-ID:87pn9ec82g.fsf@gnu.org

Hello,

Toggle quote (4 lines)

> (That still doesn’t tell us why our ‘guix offload’ processes would

> occasionally be stuck but at least it ensures the build farm keeps

> making progress even when that happens.)

I'm still not sure it's directly related to this bug but I observed

several offloading hangs on Berlin today.

For instance, in Cuirass logs:

Toggle snippet (4 lines)

2020-07-02T09:59:45 '/gnu/store/rm8ndiichxhwybaizis5pgck77952ilp-halt.drv' offloaded to '141.80.167.164'

2020-07-02T09:54:30 '/gnu/store/dxczkbf5wa6qr37gm7wr995hcxs8s0ya-motion-4.2.2.drv' offloaded to '141.80.167.170'

those two builds were offloaded around 10:00 today and there's still no

report from them at 16:00.

On 141.80.167.164 there's a matching build log:

Toggle snippet (3 lines)

-rw-r--r-- 1 root root 1735 Jul 2 10:00 /var/log/guix/drvs/rm/8ndiichxhwybaizis5pgck77952ilp-halt.drv.bz2

same on 141.80.167.170,

Toggle snippet (3 lines)

-rw-r--r-- 1 root root 6344 Jul 2 09:56 /var/log/guix/drvs/dx/czkbf5wa6qr37gm7wr995hcxs8s0ya-motion-4.2.2.drv.bz2

Having those builds "unfinished" keeps the rest of the evaluation

hanging. Running this SQL command in Cuirass database:

Toggle snippet (13 lines)sqlite> select derivation, datetime(starttime, 'unixepoch', 'localtime'),stoptime from Builds where status=-1 and evaluation=14771;
/gnu/store/ncp59nyidli4lm3ff2lkfjym25yb18j5-guix-1.1.0-14.5bd8033.drv|2020-07-02 09:33:04|0
/gnu/store/rm8ndiichxhwybaizis5pgck77952ilp-halt.drv|2020-07-02 09:59:28|0
/gnu/store/71wnjgm2waqgw3fqmxmc4r3f1ifd1l92-cups-test.drv|2020-07-02 10:00:26|0
/gnu/store/9qsqd7jfwnaw9sm323y45cwymn98kyjl-exim-test.drv|2020-07-02 10:00:51|0
/gnu/store/vhcww4fw4qxw0hl1009npd26b22gfj3c-bitlbee-test.drv|2020-07-02 10:00:24|0
/gnu/store/92jrd6dfzgdifr107hwi64s8hf4mls47-iptables.drv|2020-07-02 09:59:49|0
/gnu/store/380nq6sjphd0agrvl43sr6ypli1yraz4-gnunet-0.12.2.drv|2020-07-02 09:51:32|0
/gnu/store/lqs22nbc6vy2z2524rmkcsmbh5mllm62-cuirass-0.0.1-37.882393d.drv|2020-07-02 10:34:37|0
/gnu/store/dxczkbf5wa6qr37gm7wr995hcxs8s0ya-motion-4.2.2.drv|2020-07-02 09:54:02|0
/gnu/store/5ln3r997ycr7rd6fqahd2d426mjw0rxb-gzochi-0.12.drv|2020-07-02 09:53:51|0

shows that the evaluation is pretty much pending since 10:00.

According to Cuirass logs again, all those builds were offloaded,

"/gnu/store/380nq6sjphd0agrvl43sr6ypli1yraz4-gnunet-0.12.2.drv",

"/gnu/store/lqs22nbc6vy2z2524rmkcsmbh5mllm62-cuirass-0.0.1-37.882393d.drv"

and /gnu/store/5ln3r997ycr7rd6fqahd2d426mjw0rxb-gzochi-0.12.drv

are reported as failed, and all other are still hanging.

Something is going wrong here! I'll keep investigating.

Thanks,

Mathieu

Mathieu Othacehe wrote on 3 Jul 2020 09:05

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 34033@debbugs.gnu.org)

Message-ID:877dvlkriv.fsf@gnu.org

Hey,

Toggle quote (2 lines)

> Something is going wrong here! I'll keep investigating.

To help us investigate those issues I added a "/status" page, which is

also accessible from a new drop-down menu in the Cuirass navigation bar.

See, https://ci.guix.gnu.org/status.

Hydra has the same interface, but also a "Machine status" page, that

breaks down the running builds machine per machine. I plan to implement

that one next. Reading Hydra code, I also discovered that some part of

the offloading is directly done from Hydra, which talks with the

nix-daemon of the connected build machines, interesting!

While I'm writing, we have 5 running builds for ~1 hour, and 76040 queued

builds. Given the computing power of Berlin, there must be a bottleneck

somewhere.

Thanks,

Mathieu

Ludovic Courtès wrote on 3 Jul 2020 15:58

Recipients:(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 34033@debbugs.gnu.org)

Message-ID:871rlsog2o.fsf@gnu.org

Hi!

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (7 lines)

>> Something is going wrong here! I'll keep investigating.

> To help us investigate those issues I added a "/status" page, which is

> also accessible from a new drop-down menu in the Cuirass navigation bar.

> See, https://ci.guix.gnu.org/status.

Nice! So it’s roughly like the info at /api/queue, but filtered to

running builds, right?

Toggle quote (6 lines)

> Hydra has the same interface, but also a "Machine status" page, that

> breaks down the running builds machine per machine. I plan to implement

> that one next. Reading Hydra code, I also discovered that some part of

> the offloading is directly done from Hydra, which talks with the

> nix-daemon of the connected build machines, interesting!

Yes, Hydra does most of the scheduling by itself. Since this is

redundant with what the daemon + offload do, I thought Cuirass shouldn’t

do any scheduling at all and instead let the daemon take care of it

all.

This has advantages (the daemon has a global view and can achieve better

scheduling), and drawbacks (the protocol requires us to wait for

‘build-things’ completion before we can queue more builds, and

scheduling decisions are almost invisible to Cuirass).

Toggle quote (4 lines)

> While I'm writing, we have 5 running builds for ~1 hour, and 76040 queued

> builds. Given the computing power of Berlin, there must be a bottleneck

> somewhere.

Yes!  I’ve often run “guix processes” on berlin, then stracing the
‘SessionPID’ process.  It’s insightful because you sometimes see the
daemon is stuck waiting for a machine to offload to, sometimes it’s
stuck waiting for a build that will perhaps just eventually timeout…

Ludo’.

Your comment

Commenting via the web interface is currently disabled.

To comment on this conversation send an email to 34033@debbugs.gnu.org

To respond to this issue using the mumi CLI, first switch to it

mumi current 34033

Then, you may apply the latest patchset in this issue (with sign off)

mumi am -- -s

Or, compose a reply to this issue

mumi compose

Or, send patches to this issue

mumi send-email *.patch

is:open	open issues
is:done	closed issues
submitter:<who>	search issue submitter
author:<who>	search by message author
date:yesterday..now	search by issue date
mdate:3m..2d	search by message date