'guix offload' occasionally hangs on berlin, hitting max-silent-time

  • Open
  • quality assurance status badge
Details
2 participants
  • Andreas Enge
  • Ludovic Courtès
Owner
unassigned
Submitted by
Ludovic Courtès
Severity
important
L
L
Ludovic Courtès wrote on 27 Feb 2023 11:37
[Cuirass] Build timeouts during evaluations
(address . bug-guix@gnu.org)
87a60zo0le.fsf@inria.fr
Hello Guix!

We often get evaluation errors for ‘core-updates’, where the build log
reveals that this is due to a build timeout while building the Guix
instance that will be used for the evaluation. Example:

Toggle snippet (18 lines)
building of `/gnu/store/x9dyzywwmsp8vmpvpx54yhcnz7rqin0l-asciidoc-9.1.0.drv' timed out after 3600 seconds of silence
cannot build derivation `/gnu/store/bl7aq8999lmp6flbyw3k656m522jmcl3-git-2.39.1.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/ldajjyvb4np1zq9l9s09j0dpsbpscpan-gnulib-2022-12-06-1.440b528.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/fdks470j7xjw3dc6fiyw15hdsxj6cdbc-guile-gnutls-3.7.11.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/71pl6a38bd0b81qgi45wz2ymvlv0x7xn-guix-cli-core.drv': 1 dependencies couldn't be built

[...]

cannot build derivation `/gnu/store/m7r73wypxhhmv8cqza1x8g2zyq9k1xpg-profile.drv': 1 dependencies couldn't be built

[...]

ERROR:
1. &store-protocol-error:
message: "build of `/gnu/store/m7r73wypxhhmv8cqza1x8g2zyq9k1xpg-profile.drv' failed"
status: 101


Likewise for the ‘guile’ jobset:

Toggle snippet (40 lines)
exporting path `/gnu/store/m2hyx7qk2j1bsvpj1qv89701klv8kqkk-guix-command'
exporting path `/gnu/store/vlm7401d5jvna6j7qdxhcr581cd8g8vh-guix-daemon-1.4.0-4.01fd830'
exporting path `/gnu/store/zv3c0anfj085d2lyxh4g4aahhczmcaj8-guix-daemon'
exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433'
exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm'
building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out after 3600 seconds of silence
cannot build derivation `/gnu/store/nrcsvww8j24rjzyin90w76qgbkgphhvp-profile.drv': 1 dependencies couldn't be built
Backtrace:
In ice-9/boot-9.scm:
1752:10 12 (with-exception-handler _ _ #:unwind? _ # _)
In unknown file:
11 (apply-smob/0 #<thunk 7fd8cbe13340>)
In ice-9/boot-9.scm:
724:2 10 (call-with-prompt _ _ #<procedure default-prompt-handle?>)
In ice-9/eval.scm:
619:8 9 (_ #(#(#<directory (guile-user) 7fd8cbe16c80>)))
In cuirass/ui.scm:
104:10 8 (run-cuirass-command _ . _)
In cuirass/scripts/evaluate.scm:
105:27 7 (cuirass-evaluate _)
In ice-9/boot-9.scm:
1752:10 6 (with-exception-handler _ _ #:unwind? _ # _)
In guix/store.scm:
658:37 5 (thunk)
2123:25 4 (run-with-store #<store-connection 256.99 7fd8cad09dc0> ?)
In cuirass/scripts/evaluate.scm:
87:8 3 (_ _)
In guix/store.scm:
1995:38 2 (_ #<store-connection 256.99 7fd8a33e6d70>)
1421:15 1 (_ #<store-connection 256.99 7fd8a33e6d70> _ _)
In ice-9/boot-9.scm:
1685:16 0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
ERROR:
1. &store-protocol-error:
message: "build of `/gnu/store/nrcsvww8j24rjzyin90w76qgbkgphhvp-profile.drv' failed"
status: 101

Building this particular derivation is quick though:

Toggle snippet (15 lines)
$ time guix build /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv --check
The following derivation will be built:
/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
building /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv...
(repl-version 0 1 1)
(values (value "/gnu/store/rgk1sc2zvllf7cgbyj5g28b9v2x6nvn3-guile/share/guile/site/3.0"))
successfully built /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
successfully built /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
/gnu/store/rgk1sc2zvllf7cgbyj5g28b9v2x6nvn3-guile

real 0m1.253s
user 0m0.751s
sys 0m0.034s

So the problem must be elsewhere.

One thing is that this particular .drv is quite big:

Toggle snippet (17 lines)
$ guix size /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
store item total self
/gnu/store/wjr1sqxci75jfkvwzmmvlgypkyp3fvqi-guix-17bd024 105.0 105.0 42.9%
/gnu/store/6b3fp2kwqxw7cvnh6jz9g49n9k2n5avy-guix-packages-base-source 36.2 36.2 14.8%
/gnu/store/h3hjni2bd8f5kbn8w0g2xx2jnagfynnr-guile-e2ed33e 29.0 29.0 11.8%
/gnu/store/gfb26xn5f2qyrxzv847cd6iql6h8igh0-packages 23.1 23.1 9.4%
/gnu/store/dj8rq4jd8qgk4k1hzg8djcvi6hlbdb7m-doc 22.3 22.3 9.1%
/gnu/store/a5z192qxni40ly92cknwxkpnxc8zxfgf-guix 7.8 7.8 3.2%
/gnu/store/f8wg2y3rrh4dgi6znjrv8a1l4ymkc7rb-guix-packages-source 2.8 2.8 1.2%
/gnu/store/8iwm8x6472xnf7g6hfxzw029gxm1qbkq-guix-system-source 2.5 2.5 1.0%
/gnu/store/2dxbf11shgvs1cgvzz6wpnfra97f0sfw-guix-extra-source 2.5 2.5 1.0%
/gnu/store/q57az26zdgafic5b8xsbxlh4iwddvr86-doc 2.1 2.1 0.9%
[…]
/gnu/store/0mxnx8l4fgigvd7gakwdk6hc6im4wnai-disarchive-mirrors 0.0 0.0 0.0%
total: 244.8 MiB

Merely transferring it for offloading takes time; depending on machine
load, perhaps it can hit the 1h max-silent timeout? That’d mean the
throughput is below 70 KiB/s. Plausible?

The closure of
/gnu/store/x9dyzywwmsp8vmpvpx54yhcnz7rqin0l-asciidoc-9.1.0.drv is
3.0 MiB and it builds in 4 seconds, so there must be something else
going on, perhaps a bug in libssh/Guile-SSH leading to stale
connections?

Thoughts?

Ludo’.
L
L
Ludovic Courtès wrote on 1 Mar 2023 16:30
control message for bug #61839
(address . control@debbugs.gnu.org)
877cw0cwv8.fsf@gnu.org
severity 61839 important
quit
L
L
Ludovic Courtès wrote on 3 Mar 2023 15:24
Re: bug#61839: [Cuirass] Build timeouts during evaluations
(address . 61839@debbugs.gnu.org)(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87sfelx69b.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (4 lines)
> exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433'
> exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm'
> building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out after 3600 seconds of silence

I experienced something similar while building from the command line on
berlin, so it looks like an offloading problem.

Ludo’.
A
A
Andreas Enge wrote on 17 Mar 2023 14:21
Re: Offloading problems on berlin
ZBRpP/cDAYMQ5220@jurong
Am Thu, Mar 16, 2023 at 02:50:53PM +0100 schrieb Ludovic Courtès:
Toggle quote (3 lines)
> I occasionally offload from my laptop to machines at work but didn’t hit
> this bug, so we’ll have to see if it’s specific to berlin or not.

My impression is that in my case, the problem is not with ssh, but somehow
with the guix daemon. I tend to observe this:

guix offload: sending 1 store item (6,490 MiB) to '141.80.167.169'...
exporting path `/gnu/store/y4ipvkapf1gninaabwdl6pcz46c1frak-texlive-texmf-20210325'

Then jnettop shows data transfer of about 100 to 120MB/s during a bit over
a minute, which makes me think that the actual data transfer succeeds, but
that somehow the data is not registered in the store.
Then nothing tangible happens and offloading times out. Since I am
apparently not able to connect to the machines via ssh, I cannot have a
closer look at the receiving end.

Andreas
L
L
Ludovic Courtès wrote on 7 Apr 2023 17:49
control message for bug #61839
(address . control@debbugs.gnu.org)
871qkvd75p.fsf@gnu.org
retitle 61839 'guix offload' occasionally hangs on berlin, hitting max-silent-time
quit
L
L
Ludovic Courtès wrote on 17 Apr 2023 18:37
Re: bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time
(address . 61839@debbugs.gnu.org)
87edoicvn3.fsf@gnu.org
Hi!

Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (4 lines)
> We often get evaluation errors for ‘core-updates’, where the build log
> reveals that this is due to a build timeout while building the Guix
> instance that will be used for the evaluation. Example:

[...]

Toggle quote (7 lines)
> exporting path `/gnu/store/m2hyx7qk2j1bsvpj1qv89701klv8kqkk-guix-command'
> exporting path `/gnu/store/vlm7401d5jvna6j7qdxhcr581cd8g8vh-guix-daemon-1.4.0-4.01fd830'
> exporting path `/gnu/store/zv3c0anfj085d2lyxh4g4aahhczmcaj8-guix-daemon'
> exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433'
> exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm'
> building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out after 3600 seconds of silence

With help from Andreas, I was able to reproduce it on berlin by running:

guix build /gnu/store/b22580wj63wk4a3i09n22wa60y36kaa5-openjdk-16.0.2.drv

Timeout happens while ‘guix offload’ is sending build requisites to the
target machine. At that point, the client ‘guix offload’ process is
stuck like so:

Toggle snippet (50 lines)
(gdb) bt
#0 0x00007f1fd7ad7d6f in __GI___poll (fds=0x766b00, nfds=1, timeout=3594784) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007f1fcc220de7 in ssh_poll_ctx_dopoll () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#2 0x00007f1fcc221f0c in ssh_handle_packets () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#3 0x00007f1fcc221fdb in ssh_handle_packets_termination ()
from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#4 0x00007f1fcc2034bb in channel_write_common () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#5 0x00007f1fcc28b15b in ?? () from target:/gnu/store/rjhq6cz7xbck4l7zar6ga3ygph2chwzr-guile-ssh-0.16.0/lib/libguile-ssh.so
#6 0x00007f1fd7cf4f34 in scm_i_write_bytes (port=0x7f1fc556ba00, src=0x7f1fc3c27000, start=0, count=65536) at ports.c:2858
#7 0x00007f1fd7cff546 in scm_put_bytevector (port=0x7f1fc556ba00, bv=0x7f1fc3c27000, start=<optimized out>, count=<optimized out>)
at r6rs-ports.c:676
#8 0x00007f1fc5ac5197 in ?? ()
#9 0x000000000069731c in ?? ()
#10 0x00007f1fc5acdf71 in ?? ()
#11 0x0000000000697318 in ?? ()
#12 0x00007f1fd7cd2ccc in scm_jit_enter_mcode (thread=0x7f1fd73b1d80, mcode=0x69bf14 "\034;\004") at jit.c:6038
#13 0x00007f1fd7d2874f in vm_regular_engine (thread=0x7f1fd73b1d80) at vm-engine.c:546
#14 0x00007f1fd7d355d9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=1) at vm.c:1610
#15 0x00007f1fd7c9d457 in scm_primitive_eval (exp=<optimized out>, exp@entry=0x7f1fcfab8cf0) at eval.c:671
#16 0x00007f1fd7ca34b6 in scm_eval (exp=0x7f1fcfab8cf0, module_or_state=0x7f1fcfaacc80) at eval.c:705
#17 0x00007f1fd7d073b6 in scm_shell (argc=11, argv=0x6a9fc0) at script.c:357
#18 0x00007f1fd7cb249c in invoke_main_func (body_data=0x7fffd2c0d370) at init.c:312
#19 0x00007f1fd7c9b85a in c_body (d=0x7fffd2c0d290) at continuations.c:430
#20 0x00007f1fd7d28326 in vm_regular_engine (thread=0x7f1fd73b1d80) at vm-engine.c:972
#21 0x00007f1fd7d355d9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=2) at vm.c:1610
#22 0x00007f1fd7c9d09a in scm_call_2 (proc=<optimized out>, arg1=<optimized out>, arg2=<optimized out>) at eval.c:503
#23 0x00007f1fd7d53742 in scm_c_with_exception_handler.constprop.0 (type=0x404, handler_data=handler_data@entry=0x7fffd2c0d220,
thunk_data=thunk_data@entry=0x7fffd2c0d220, thunk=<optimized out>, handler=<optimized out>) at exceptions.c:170
#24 0x00007f1fd7d2588f in scm_c_catch (tag=<optimized out>, body=<optimized out>, body_data=<optimized out>, handler=<optimized out>,
handler_data=<optimized out>, pre_unwind_handler=<optimized out>, pre_unwind_handler_data=0x7f1fcfa10040) at throw.c:168
#25 0x00007f1fd7c9de66 in scm_i_with_continuation_barrier (pre_unwind_handler=0x7f1fd7c9db80 <pre_unwind_handler>,
pre_unwind_handler_data=0x7f1fcfa10040, handler_data=0x7fffd2c0d290, handler=0x7f1fd7ca48b0 <c_handler>, body_data=0x7fffd2c0d290,
body=0x7f1fd7c9b850 <c_body>) at continuations.c:368
#26 scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:464
#27 0x00007f1fd7d24b39 in with_guile (base=0x7fffd2c0d318, data=0x7fffd2c0d340) at threads.c:645
#28 0x00007f1fd7bfb0ba in GC_call_with_stack_base () from target:/gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#29 0x00007f1fd7d1d8b8 in scm_i_with_guile (dynamic_state=<optimized out>, data=<optimized out>, func=<optimized out>) at threads.c:688
#30 scm_with_guile (func=<optimized out>, data=<optimized out>) at threads.c:694
#31 0x00007f1fd7cbb025 in scm_boot_guile (argc=<optimized out>, argv=<optimized out>, main_func=<optimized out>, closure=<optimized out>)
at init.c:295
#32 0x0000000000401191 in ?? ()
#33 0x00007f1fd7a117dd in __libc_start_main (main=0x401110, argc=8, argv=0x7fffd2c0d4a8, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffd2c0d498) at ../csu/libc-start.c:332
#34 0x00000000004011da in ?? ()
(gdb) p *fds
$1 = {fd = 15, events = 1, revents = 0}
(gdb) shell ls -l /proc/62345/fd/15
lrwx------ 1 root root 64 Apr 17 17:42 /proc/62345/fd/15 -> 'socket:[3238975699]'

The SSH connection is still live, and on the server side I see the ‘guix
repl -t machine’ process stuck in:

Toggle snippet (3 lines)
write(2, "linking \342\200\230/gnu/store/0gbj38wswwxq50i68ci973fgi2l531b4-openjdk-16.0.2-checkout/test/hotspot/jtreg/compiler/c2/cr6340864/TestLongVect.java\342\200\231 to \342\200\230/gnu/store/.links/03zscrls9431zipkpx1sn450llvf5hkhvbkpjk57j22f7d2nb3vs\342\200\231\n", 221

(This message comes from optimise-store.cc:192.)

This suggests that ‘guix repl -t machine’ has filled its output buffer;
its output isn’t being read so it gets stuck in that write(2) call. Its
parent ‘sshd’ process is stuck in:

Toggle snippet (3 lines)
ppoll([{fd=3, events=POLLIN}, {fd=4, events=0}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=11, events=POLLOUT}], 5, NULL, [], 8^C

One hypothesis is that when transferring store items with many files,
such as openjdk-checkout here, guix-daemon on the remote machine outputs
many such “linking … to …” messages; if the client doesn’t read them,
then we end up with a deadlock like this, with the server-side process
stuck writing more messages.

However, that “linking …” message is supposed to be written only at
non-default debugging levels:

printMsg(lvlTalkative, format("linking ‘%1%’ to ‘%2%’") % path % linkPath);

Turns out that guix-daemon on build machines is running with ‘--debug’
starting with maintenance.git commit
72d8a62466b71eda657b52ca17e99f7189ea18ad (“sysadmin: build-machines:
Enable guix-daemon debug output.”)

The immediate fix is to undo that—I’ll do that later.

The other one is to swallow ‘current-error-port’ in
‘store-import-channel’.

To be continued…

Ludo’.
L
L
Ludovic Courtès wrote on 18 Apr 2023 00:55
(address . 61839@debbugs.gnu.org)
871qkiazl1.fsf@gnu.org
Hi,

Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (7 lines)
> Turns out that guix-daemon on build machines is running with ‘--debug’
> starting with maintenance.git commit
> 72d8a62466b71eda657b52ca17e99f7189ea18ad (“sysadmin: build-machines:
> Enable guix-daemon debug output.”)
>
> The immediate fix is to undo that—I’ll do that later.

Done:


I redeployed the build machines (on the same Guix commit as before).
The 4 machines that host a childhurd are still being redeployed though.

Toggle quote (3 lines)
> The other one is to swallow ‘current-error-port’ in
> ‘store-import-channel’.

?