Hi, Mathieu Othacehe skribis: > When running "gui-installed-desktop-os-encrypted" test, Shepherd seems > to deadlock when restarting "guix-daemon". This can happen at different > stages: > > * In "umount-cow-store" procedure, just before finishing the install. > > * During "set-http-proxy" tests inside the marionette. > > This is not always reproducible. In order to gather some information, I > created a Shepherd "strace" service that logs what's happening in > Shepherd itself (patch attached). We should be able to reproduce it with much simpler tests then, right? Like maybe “while : ; do herd restart guix-daemon ; done” or similar? > It seems that, just after blocking signals, in "fork+exec-command", I > guess, Shepherd is taking a lock: > > 183553:1 chdir("/") = 0 > 183554:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=4387222}) = 0 > 183555:1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 > 183560:1 madvise(0x7f179782d000, 12288, MADV_DONTNEED) = 0 > 183561:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=13592357}) = 0 > 183562:1 clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271 > 183563:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0 > 183564:1 write(15, "shepherd[1]: changing HTTP/HTTPS"..., 86 > 183566:1 <... write resumed>) = 86 > 183575:1 getpgid(169) = 169 > 183576:1 getpgid(0) = 0 > 183577:1 kill(-169, SIGTERM > 183579:1 <... kill resumed>) = 0 > 183582:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0 > 183583:1 write(15, "shepherd[1]: Service guix-daemon"..., 51 > 183585:1 <... write resumed>) = 51 > 183594:1 getuid() = 0 > 183595:1 rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0 > 183596:1 write(6, "\1", 1 > 183598:1 <... write resumed>) = 1 > 183605:1 futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL > > and is then blocking forever. When that happens, we should check how many threads exist in PID 1. There should be the finalization thread and the main thread, plus the signal thread (because there are still ‘sigaction’ calls in the ‘main’ procedure), plus the GC marker threads. In , Andy suggests that the signal thread is not properly handled; indeed it takes locks and we don’t try to shut it down upon fork. However, when using signalfd, the signal thread must be stuck in its ‘read’ call in ‘read_signal_pipe_data’, so I don’t see how it could cause problems. The GC threads are presumably taken care of by the atfork handler in libgc. Thoughts? Ludo’.