Hi!
Mathieu Othacehe <othacehe@gnu.org> skribis:
Toggle quote (12 lines)
>
> The following lines can be caused by having a multi-threaded program:
> (cuirass scripts evaluate) because of its par-for-each procedure, call
> "primitive-fork" in open-bidirectional-pipe of (guix inferior).
>
> warning: call to primitive-fork while multiple threads are running;
> further behavior unspecified. See "Processes" in the
> manual, for more information.
>
> Not sure that it could cause the observed hang, Ludo what do you think?
As you write, the warning comes from ‘open-bidirectional-pipe’, added in
commit bd86bbd300474204878e927f6cd3f0defa1662a5. I noticed it back then
but as far as I could see, it was harmless (but definitely suboptimal):
Indeed, “make cuirass-jobs” works for me.
I decided to go ahead and run ‘cuirass evaluate’ by hand to gather more
info:
Toggle snippet (18 lines)
ludo@berlin ~$ sudo su - cuirass -s /bin/sh -c "/gnu/store/qamrzl942khb7w0cb4csyflnxfkwvvxq-cuirass-1.1.0-11.9f08035/bin/cuirass evaluate dbname=cuirass 323183"
Computing Guix derivation for 'x86_64-linux'... |
warning: call to primitive-fork while multiple threads are running;
further behavior unspecified. See "Processes" in the
manual, for more information.
warning: call to primitive-fork while multiple threads are running;
further behavior unspecified. See "Processes" in the
manual, for more information.
warning: call to primitive-fork while multiple threads are running;
further behavior unspecified. See "Processes" in the
manual, for more information.
warning: call to primitive-fork while multiple threads are running;
further behavior unspecified. See "Processes" in the
manual, for more information.
2022-05-16T09:56:43 Registering builds for evaluation 323183.
2022-05-16T10:08:47 Registering builds for evaluation 323183.
I straced it in another terminal. It did its work for a long while, and
eventually activity stopped:
Toggle snippet (38 lines)
read(228, "nu/store/7gi4qsakz775z5wcmddg8xan7q806sj1-findutils-4.8.0\")) (#:nix-name . \"findutils-4.8.0\") (#:sys"..., 4096) = 4096
read(228, "nrd-gcc-cross-x86_64-w64-mingw32-10.3.0.drv\" \"/gnu/store/ykrksw421rd22zh4w3pc751r90wc881n-patch-2.7."..., 4096) = 4096
read(228, "wm-binutils-cross-x86_64-w64-mingw32-2.37.drv\" \"/gnu/store/xfh4965w2amrsl7x0h3dslrhndzn5bbz-linux-li"..., 4096) = 4096
read(228, "1ir-gawk-5.1.0.drv\" \"/gnu/store/p1d2w4fg1ljbyfs0ayv854snahcimamk-libunistring-0.9.10.drv\" \"/gnu/stor"..., 4096) = 4096
read(228, "3ysc-guile-2.2.7.drv\" \"/gnu/store/b0ma6l61ad4dfb3030sl1ngvdlicc523-glibc-utf8-locales-2.33.drv\" \"/gn"..., 4096) = 4096
read(228, "j6a5mc7w13xiw8136y4brv7w-xz-5.2.5.drv\" \"/gnu/store/w1bxsf02c4y0hj0nd39awv7i55ych641-libffi-3.3.drv\" "..., 4096) = 4096
read(228, "tore/qbqgamcpwgs2sw8lcr6pxl5y38fpyz0d-sed-4.8.drv\" \"/gnu/store/qmsbxbgb758h6f7018snxlwn69w9w9i4-diff"..., 4096) = 4096
read(228, "oconf-2.69.drv\" \"/gnu/store/6n5mn2vkl4x4fab37sx1rmqyqxqcwd4y-libidn2-2.3.1.drv\" \"/gnu/store/6q5cp0m2"..., 4096) = 4096
mmap(0x7fddd26af000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fddd268f000
read(228, "-binutils-cross-x86_64-w64-mingw32-2.37.drv\" \"/gnu/store/wdmk9llhdzx5y10mziss1i5z6q037a7z-guile-3.0."..., 4096) = 954
futex(0x7fdf015381f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdf01546b00, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fdf015381f4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdf01546b00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPWR {si_signo=SIGPWR, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigsuspend(~[INT QUIT ABRT BUS SEGV TERM XCPU RTMIN RT_1], 8) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGXCPU {si_signo=SIGXCPU, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigreturn({mask=~[INT QUIT ABRT BUS KILL SEGV TERM STOP RTMIN RT_1]}) = -1 EINTR (Interrupted system call)
rt_sigreturn({mask=[]}) = 202
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPWR {si_signo=SIGPWR, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigsuspend(~[INT QUIT ABRT BUS SEGV TERM XCPU RTMIN RT_1], 8) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGXCPU {si_signo=SIGXCPU, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigreturn({mask=~[INT QUIT ABRT BUS KILL SEGV TERM STOP RTMIN RT_1]}) = -1 EINTR (Interrupted system call)
rt_sigreturn({mask=[]}) = 202
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[…]
--- SIGXCPU {si_signo=SIGXCPU, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigreturn({mask=~[INT QUIT ABRT BUS KILL SEGV TERM STOP RTMIN RT_1]}) = -1 EINTR (Interrupted system call)
rt_sigreturn({mask=[]}) = 202
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdf01546c00, FUTEX_WAKE_PRIVATE, 1) = 0
close(224) = 0
futex(0x7fdf01538f70, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY
[stops here]
The ‘read’ calls are about getting the jobset alists; FD 228 is a socket
(to a ‘guix repl’ subprocess I presume). SIGPWR/SIGXCPU suggests
attempts to make a stop-the-world GC.
GDB shows that the process has 120 threads (it’s a 96-core machine, but
still…) and most threads are stuck in pthread_cond_wait:
Toggle snippet (58 lines)
(gdb) bt
#0 0x00007fdf01d62afa in __futex_abstimed_wait_common64 (futex_word=futex_word@entry=0x7fded20cd8b4,
expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0,
cancel=cancel@entry=true) at ../sysdeps/nptl/futex-internal.c:74
#1 0x00007fdf01d62b5b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fded20cd8b4,
expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0)
at ../sysdeps/nptl/futex-internal.c:123
#2 0x00007fdf01d5cab0 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fdef9c15fc0,
cond=cond@entry=0x7fded20cd888) at pthread_cond_wait.c:504
#3 __pthread_cond_wait (cond=cond@entry=0x7fded20cd888, mutex=mutex@entry=0x7fdef9c15fc0) at pthread_cond_wait.c:619
#4 0x00007fdf01ea2a55 in scm_pthread_cond_wait (cond=cond@entry=0x7fded20cd888, mutex=mutex@entry=0x7fdef9c15fc0)
at threads.c:1605
#5 0x00007fdf01ea7b6b in block_self (
queue=((#<smob thread 7fdedfd05eb0>) #<smob thread 7fdedfd05af0> #<smob thread 7fdedfd05950> #<smob thread 7fdedfd05870> #<smob thread 7fdedfd05790> #<smob thread 7fdedfd056b0> #<smob thread 7fdedfd05570> #<smob thread 7fdedfd05490> #<smob thread 7fdedfd05350> #<smob thread 7fdedfd05150> #<smob thread 7fdedfd05070> #<smob thread 7fdedfd18f50> #<smob thread 7fdedfd18d70> #<smob thread 7fdedfd189a0> #<smob thread 7fdedfd18730> #<smob thread 7fdedfd185e0> #<smob thread 7fdedfd182b0> #<smob thread 7fdedfd18160> #<smob thread 7fdedfd21f50> #<smob thread 7fdedfd21e00> #<smob thread 7fdedfd21cb0> #<smob thread 7fdedfd217a0> #<smob thread 7fdedfd21650> #<smob thread 7fdedfd21500> #<smob thread 7fdedfd213b0> #<smob thread 7fdedfd21260> #<smob thread 7fdedfd2ff90> #<smob thread 7fdedfd2fe60> #<smob thread 7fdedfd2fcf0> #<smob thread 7fdedfd2fba0> #<smob thread 7fdedfd2fa50> #<smob thread 7fdedfd2f900> #<smob thread 7fdedfd2f650> #<smob thread 7fdedfd2f570> #<smob thread 7fdedfd2f430> #<smob thread 7fdedfd2f310> #<smob thread 7fdedfd2f1a0> #<smob thread 7fdedfd2f0b0> #<smob thread 7fdedfd4efd0> #<smob thread 7fdedfd4ee90> #<smob thread 7fdedfd4ebd0> #<smob thread 7fdedfd4e940> #<smob thread 7fdedfd4e760> #<smob thread 7fdedfd4e660> #<smob thread 7fdedfd4e270> #<smob thread 7fdedfdbefc0> #<smob thread 7fdedfdbeec0> #<smob thread 7fdedfdbede0> #<smob thread 7fdedfdbed00> #<smob thread 7fdedfdbeb00> #<smob thread 7fdedfdbe8f0> #<smob thread 7fdedfdbe810> #<smob thread 7fdedfdbe730> #<smob thread 7fdedfdbe650> #<smob thread 7fdedfdbe3f0> #<smob thread 7fdedfdbe220> #<smob thread 7fdedfdbe120> #<smob thread 7fdedfdbe020> #<smob thread 7fdedfa34610> #<smob thread 7fdedfa344d0> #<smob thread 7fdedfa34110> #<smob thread 7fdedfa52e90> #<smob thread 7fdedfa52d30> #<smob thread 7fdedfa52c40> #<smob thread 7fdedfa737e0> #<smob thread 7fdedfa73400> #<smob thread 7fdedfa73000> #<smob thread 7fdedfa88ef0> #<smob thread 7fdedfa88cc0> #<smob thread 7fdedfa888e0> #<smob thread 7fdedfa88800> #<smob thread 7fdedfa88720> #<smob thread 7fdedfa88620> #<smob thread 7fdedfa884b0> #<smob thread 7fdedfa88370> #<smob thread 7fdedfa880b0> #<smob thread 7fdedfb06fa0> #<smob thread 7fdedfb06b90> #<smob thread 7fdedfb066a0> #<smob thread 7fdedfb06470> #<smob thread 7fdedfb062e0> #<smob thread 7fdedfb06140> #<smob thread 7fdedfb58f80> #<smob thread 7fdedfb58af0> #<smob thread 7fdedfb58830> #<smob thread 7fdedfb93cd0> #<smob thread 7fdedfb93ae0> #<smob thread 7fdedfb936c0> #<smob thread 7fdedfb935c0> #<smob thread 7fdedfb93400> #<smob thread 7fdedfb932c0> #<smob thread 7fdedfd05ff0> #<smob thread 7fdedfd05c20> #<smob thread 7fdedfd05eb0>), mutex=mutex@entry=0x7fdef9c15fc0, waittime=waittime@entry=0x0) at threads.c:312
#6 0x00007fdf01eaa1cd in timed_wait (waittime=0x0, current_thread=0x7fded20cd6c0, c=0x7fdef9c10d20,
m=0x7fdef9c15fc0, kind=SCM_MUTEX_STANDARD) at threads.c:1331
#7 scm_timed_wait_condition_variable (cond=#<smob condition-variable 7fdef9c122e0>,
mutex=#<smob mutex 7fdef9c12320>, timeout=#<undefined 904>) at threads.c:1424
#8 0x00007fded169a7f3 in ?? ()
#9 0x0000000000b83688 in ?? ()
#10 0x00007fded169a750 in ?? ()
#11 0x00007fdeec009e98 in ?? ()
#12 0x00007fdf01e57ccc in scm_jit_enter_mcode (thread=0x7fded20cd6c0, mcode=0xb8369c "\034\330\003") at jit.c:6038
#13 0x00007fdf01eacf3c in vm_regular_engine (thread=0x7fded20cd6c0) at vm-engine.c:360
#14 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#15 0x00007fdf01e1ea0e in scm_call_with_unblocked_asyncs (proc=#<program 7fdef732e880>) at async.c:406
#16 0x00007fdf01ead336 in vm_regular_engine (thread=0x7fded20cd6c0) at vm-engine.c:972
#17 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#18 0x00007fdf01ea9be6 in really_launch (d=0x7fdef730c860) at threads.c:778
#19 0x00007fdf01e2085a in c_body (d=0x7fdec8c44d80) at continuations.c:430
#20 0x00007fded2874532 in ?? ()
#21 0x00007fdef9a227e0 in ?? ()
#22 0x00007fdef7d8c370 in ?? ()
#23 0x0000000000000048 in ?? ()
#24 0x00007fdf01e57ccc in scm_jit_enter_mcode (thread=0x7fded20cd6c0, mcode=0xb81874 "\034<\003") at jit.c:6038
#25 0x00007fdf01eacf3c in vm_regular_engine (thread=0x7fded20cd6c0) at vm-engine.c:360
#26 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=2) at vm.c:1608
#27 0x00007fdf01e2209a in scm_call_2 (proc=<optimized out>, arg1=<optimized out>, arg2=<optimized out>) at eval.c:503
#28 0x00007fdf01ed8752 in scm_c_with_exception_handler.constprop.0 (type=#t,
handler_data=handler_data@entry=0x7fdec8c44d10, thunk_data=thunk_data@entry=0x7fdec8c44d10,
thunk=<optimized out>, handler=<optimized out>) at exceptions.c:170
#29 0x00007fdf01eaa88f in scm_c_catch (tag=<optimized out>, body=<optimized out>, body_data=<optimized out>,
handler=<optimized out>, handler_data=<optimized out>, pre_unwind_handler=<optimized out>,
pre_unwind_handler_data=0x7fdef9b97040) at throw.c:168
#30 0x00007fdf01e22e66 in scm_i_with_continuation_barrier (pre_unwind_handler=0x7fdf01e22b80 <pre_unwind_handler>,
pre_unwind_handler_data=0x7fdef9b97040, handler_data=0x7fdec8c44d80, handler=0x7fdf01e298b0 <c_handler>,
body_data=0x7fdec8c44d80, body=0x7fdf01e20850 <c_body>) at continuations.c:368
#31 scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:464
#32 0x00007fdf01ea9b39 in with_guile (base=0x7fdec8c44e08, data=0x7fdec8c44e30) at threads.c:645
#33 0x00007fdf01d800ba in GC_call_with_stack_base ()
from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#34 0x00007fdf01ea216d in scm_i_with_guile (dynamic_state=<optimized out>, data=0x7fdef730c860,
func=0x7fdf01ea9b70 <really_launch>) at threads.c:688
#35 launch_thread (d=0x7fdef730c860) at threads.c:787
#36 0x00007fdf01d56d7e in start_thread (arg=0x7fdec8c45640) at pthread_create.c:473
#37 0x00007fdf01954eff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
There’s also one thread stuck in read(2):
Toggle snippet (38 lines)
(gdb) bt
#0 __libc_read (nbytes=1, buf=0x7fdedf97bdd0, fd=7) at ../sysdeps/unix/sysv/linux/read.c:26
#1 __libc_read (fd=7, buf=buf@entry=0x7fdedf97bdd0, nbytes=nbytes@entry=1) at ../sysdeps/unix/sysv/linux/read.c:24
#2 0x00007fdf01e382e8 in fport_read (port=<optimized out>, dst=<optimized out>, start=<optimized out>, count=1)
at fports.c:597
#3 0x00007fdf01e78d22 in scm_i_read_bytes (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>,
dst="#<vu8vector>" = {...}, start=start@entry=0, count=1) at ports.c:1566
#4 0x00007fdf01e7b1c7 in scm_fill_input (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>,
minimum_size=1, minimum_size@entry=0, cur_out=cur_out@entry=0x7fde86fcc3e8,
avail_out=avail_out@entry=0x7fde86fcc3e0) at ports.c:2693
#5 0x00007fdf01e7b434 in peek_byte_or_eof (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>,
buf_out=buf_out@entry=0x7fde86fcc448, cur_out=cur_out@entry=0x7fde86fcc440) at ports.c:1526
#6 0x00007fdf01e7c1b5 in maybe_consume_bom (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>,
bom=bom@entry=0x7fdf01ee56ad <scm_utf8_bom> "\357\273\277scm_i_read_bytes", bom_len=bom_len@entry=3)
at ports.c:2468
#7 0x00007fdf01e7b324 in port_clear_stream_start_for_bom_read (io_mode=BOM_IO_TEXT,
port=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>) at ports.c:2504
#8 scm_fill_input (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>, minimum_size=1,
minimum_size@entry=0, cur_out=cur_out@entry=0x7fde86fcc4e8, avail_out=avail_out@entry=0x7fde86fcc4e0)
at ports.c:2648
#9 0x00007fdf01e7b434 in peek_byte_or_eof (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>,
buf_out=buf_out@entry=0x7fde86fcc548, cur_out=cur_out@entry=0x7fde86fcc540) at ports.c:1526
#10 0x00007fdf01e7b97c in peek_utf8_codepoint (port=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>,
buf_out=buf_out@entry=0x7fde86fcc5a8, cur_out=cur_out@entry=0x7fde86fcc5a0, len_out=len_out@entry=0x7fde86fcc598)
at ports.c:1790
#11 0x00007fdf01e7bee4 in peek_codepoint (len=0x7fde86fcc598, cur=0x7fde86fcc5a0, buf=0x7fde86fcc5a8,
port=<optimized out>) at ports.c:1984
#12 scm_peek_char (port=<optimized out>) at ports.c:2202
#13 0x00007fdef997b1ab in ?? ()
#14 0x0000000000b82ae0 in ?? ()
#15 0x00007fdef997b120 in ?? ()
#16 0x00007fdec4292c48 in ?? ()
#17 0x00007fdf01e57ccc in scm_jit_enter_mcode (thread=0x7fded1366000, mcode=0xb82af0 "\034\234\003") at jit.c:6038
#18 0x00007fdf01eacf3c in vm_regular_engine (thread=0x7fded1366000) at vm-engine.c:360
#19 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#20 0x00007fdf01e1ea0e in scm_call_with_unblocked_asyncs (proc=#<program 7fdef59a0f60>) at async.c:406
A plausible scenario is a synchronization issue: ‘cuirass evaluate’ got
all the data it asked for from ‘guix repl’, but it’s still waiting on
read(2) even though there’s nothing left. (There are two child ‘guix
repl’ processes and both are waiting on read(0, …).)
Thoughts?
Ludo’.