Hi! Ludovic Courtès skribis: > We often get evaluation errors for ‘core-updates’, where the build log > reveals that this is due to a build timeout while building the Guix > instance that will be used for the evaluation. Example: [...] > exporting path `/gnu/store/m2hyx7qk2j1bsvpj1qv89701klv8kqkk-guix-command' > exporting path `/gnu/store/vlm7401d5jvna6j7qdxhcr581cd8g8vh-guix-daemon-1.4.0-4.01fd830' > exporting path `/gnu/store/zv3c0anfj085d2lyxh4g4aahhczmcaj8-guix-daemon' > exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433' > exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm' > building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out after 3600 seconds of silence With help from Andreas, I was able to reproduce it on berlin by running: guix build /gnu/store/b22580wj63wk4a3i09n22wa60y36kaa5-openjdk-16.0.2.drv Timeout happens while ‘guix offload’ is sending build requisites to the target machine. At that point, the client ‘guix offload’ process is stuck like so: --8<---------------cut here---------------start------------->8--- (gdb) bt #0 0x00007f1fd7ad7d6f in __GI___poll (fds=0x766b00, nfds=1, timeout=3594784) at ../sysdeps/unix/sysv/linux/poll.c:29 #1 0x00007f1fcc220de7 in ssh_poll_ctx_dopoll () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4 #2 0x00007f1fcc221f0c in ssh_handle_packets () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4 #3 0x00007f1fcc221fdb in ssh_handle_packets_termination () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4 #4 0x00007f1fcc2034bb in channel_write_common () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4 #5 0x00007f1fcc28b15b in ?? () from target:/gnu/store/rjhq6cz7xbck4l7zar6ga3ygph2chwzr-guile-ssh-0.16.0/lib/libguile-ssh.so #6 0x00007f1fd7cf4f34 in scm_i_write_bytes (port=0x7f1fc556ba00, src=0x7f1fc3c27000, start=0, count=65536) at ports.c:2858 #7 0x00007f1fd7cff546 in scm_put_bytevector (port=0x7f1fc556ba00, bv=0x7f1fc3c27000, start=, count=) at r6rs-ports.c:676 #8 0x00007f1fc5ac5197 in ?? () #9 0x000000000069731c in ?? () #10 0x00007f1fc5acdf71 in ?? () #11 0x0000000000697318 in ?? () #12 0x00007f1fd7cd2ccc in scm_jit_enter_mcode (thread=0x7f1fd73b1d80, mcode=0x69bf14 "\034;\004") at jit.c:6038 #13 0x00007f1fd7d2874f in vm_regular_engine (thread=0x7f1fd73b1d80) at vm-engine.c:546 #14 0x00007f1fd7d355d9 in scm_call_n (proc=, argv=, nargs=1) at vm.c:1610 #15 0x00007f1fd7c9d457 in scm_primitive_eval (exp=, exp@entry=0x7f1fcfab8cf0) at eval.c:671 #16 0x00007f1fd7ca34b6 in scm_eval (exp=0x7f1fcfab8cf0, module_or_state=0x7f1fcfaacc80) at eval.c:705 #17 0x00007f1fd7d073b6 in scm_shell (argc=11, argv=0x6a9fc0) at script.c:357 #18 0x00007f1fd7cb249c in invoke_main_func (body_data=0x7fffd2c0d370) at init.c:312 #19 0x00007f1fd7c9b85a in c_body (d=0x7fffd2c0d290) at continuations.c:430 #20 0x00007f1fd7d28326 in vm_regular_engine (thread=0x7f1fd73b1d80) at vm-engine.c:972 #21 0x00007f1fd7d355d9 in scm_call_n (proc=, argv=, nargs=2) at vm.c:1610 #22 0x00007f1fd7c9d09a in scm_call_2 (proc=, arg1=, arg2=) at eval.c:503 #23 0x00007f1fd7d53742 in scm_c_with_exception_handler.constprop.0 (type=0x404, handler_data=handler_data@entry=0x7fffd2c0d220, thunk_data=thunk_data@entry=0x7fffd2c0d220, thunk=, handler=) at exceptions.c:170 #24 0x00007f1fd7d2588f in scm_c_catch (tag=, body=, body_data=, handler=, handler_data=, pre_unwind_handler=, pre_unwind_handler_data=0x7f1fcfa10040) at throw.c:168 #25 0x00007f1fd7c9de66 in scm_i_with_continuation_barrier (pre_unwind_handler=0x7f1fd7c9db80 , pre_unwind_handler_data=0x7f1fcfa10040, handler_data=0x7fffd2c0d290, handler=0x7f1fd7ca48b0 , body_data=0x7fffd2c0d290, body=0x7f1fd7c9b850 ) at continuations.c:368 #26 scm_c_with_continuation_barrier (func=, data=) at continuations.c:464 #27 0x00007f1fd7d24b39 in with_guile (base=0x7fffd2c0d318, data=0x7fffd2c0d340) at threads.c:645 #28 0x00007f1fd7bfb0ba in GC_call_with_stack_base () from target:/gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1 #29 0x00007f1fd7d1d8b8 in scm_i_with_guile (dynamic_state=, data=, func=) at threads.c:688 #30 scm_with_guile (func=, data=) at threads.c:694 #31 0x00007f1fd7cbb025 in scm_boot_guile (argc=, argv=, main_func=, closure=) at init.c:295 #32 0x0000000000401191 in ?? () #33 0x00007f1fd7a117dd in __libc_start_main (main=0x401110, argc=8, argv=0x7fffd2c0d4a8, init=, fini=, rtld_fini=, stack_end=0x7fffd2c0d498) at ../csu/libc-start.c:332 #34 0x00000000004011da in ?? () (gdb) p *fds $1 = {fd = 15, events = 1, revents = 0} (gdb) shell ls -l /proc/62345/fd/15 lrwx------ 1 root root 64 Apr 17 17:42 /proc/62345/fd/15 -> 'socket:[3238975699]' --8<---------------cut here---------------end--------------->8--- The SSH connection is still live, and on the server side I see the ‘guix repl -t machine’ process stuck in: --8<---------------cut here---------------start------------->8--- write(2, "linking \342\200\230/gnu/store/0gbj38wswwxq50i68ci973fgi2l531b4-openjdk-16.0.2-checkout/test/hotspot/jtreg/compiler/c2/cr6340864/TestLongVect.java\342\200\231 to \342\200\230/gnu/store/.links/03zscrls9431zipkpx1sn450llvf5hkhvbkpjk57j22f7d2nb3vs\342\200\231\n", 221 --8<---------------cut here---------------end--------------->8--- (This message comes from optimise-store.cc:192.) This suggests that ‘guix repl -t machine’ has filled its output buffer; its output isn’t being read so it gets stuck in that write(2) call. Its parent ‘sshd’ process is stuck in: --8<---------------cut here---------------start------------->8--- ppoll([{fd=3, events=POLLIN}, {fd=4, events=0}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=11, events=POLLOUT}], 5, NULL, [], 8^C --8<---------------cut here---------------end--------------->8--- One hypothesis is that when transferring store items with many files, such as openjdk-checkout here, guix-daemon on the remote machine outputs many such “linking … to …” messages; if the client doesn’t read them, then we end up with a deadlock like this, with the server-side process stuck writing more messages. However, that “linking …” message is supposed to be written only at non-default debugging levels: printMsg(lvlTalkative, format("linking ‘%1%’ to ‘%2%’") % path % linkPath); Turns out that guix-daemon on build machines is running with ‘--debug’ starting with maintenance.git commit 72d8a62466b71eda657b52ca17e99f7189ea18ad (“sysadmin: build-machines: Enable guix-daemon debug output.”) The immediate fix is to undo that—I’ll do that later. The other one is to swallow ‘current-error-port’ in ‘store-import-channel’. To be continued… Ludo’.