Hello,

When running the "gui-installed-desktop-os-encrypted" test, Shepherd seems
to deadlock when restarting "guix-daemon". This can happen at different
stages:

* In the "umount-cow-store" procedure, just before finishing the install.

* During the "set-http-proxy" tests inside the marionette.

This is not always reproducible. In order to gather some information, I
created a Shepherd "strace" service that logs what's happening in
Shepherd itself (patch attached).

It seems that, just after blocking signals, in "fork+exec-command", I
guess, Shepherd is taking a lock:
I think this is caused by a "pthread_join", most probably the one in
"stop_finalization_thread" that is called right before forking a new
process. The fact that we hang here probably means that the finalizer
thread itself is hanging, not sure why.

It looks like what was reported by Ludo here:
https://issues.guix.info/31925.

Thanks,

Mathieu
> When running "gui-installed-desktop-os-encrypted" test, Shepherd seems
> to deadlock when restarting "guix-daemon". This can happen at different
> stages:
>
> * In "umount-cow-store" procedure, just before finishing the install.
>
> * During "set-http-proxy" tests inside the marionette.
>
> This is not always reproducible. In order to gather some information, I
> created a Shepherd "strace" service that logs what's happening in
> Shepherd itself (patch attached).
We should be able to reproduce it with much simpler tests then, right?
Like maybe “while : ; do herd restart guix-daemon ; done” or similar?
When that happens, we should check how many threads exist in PID 1.
There should be the finalization thread and the main thread, plus the
signal thread (because there are still ‘sigaction’ calls in the ‘main’
procedure), plus the GC marker threads.

In https://issues.guix.gnu.org/31925#6, Andy suggests that the signal
thread is not properly handled; indeed it takes locks and we don’t try
to shut it down upon fork. However, when using signalfd, the signal
thread must be stuck in its ‘read’ call in ‘read_signal_pipe_data’, so I
don’t see how it could cause problems.

The GC threads are presumably taken care of by the atfork handler in
libgc.

Thoughts?

Ludo’.
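
For the thread count suggested above, here is a minimal sketch in Python (not Guile) that lists the threads of a process by reading /proc on Linux. The helper name `list_threads` is hypothetical. The sketch only assumes the standard /proc/PID/task layout, and reading it for PID 1 may require root on some systems.

```python
import os


def list_threads(pid: int) -> list[tuple[int, str]]:
    """Return (tid, name) for every thread of `pid`, read from /proc (Linux only)."""
    task_dir = f"/proc/{pid}/task"
    threads = []
    for tid in sorted(int(entry) for entry in os.listdir(task_dir)):
        try:
            # Each thread exposes its name in /proc/PID/task/TID/comm.
            with open(f"{task_dir}/{tid}/comm") as comm:
                name = comm.read().strip()
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited between listdir() and open()
        threads.append((tid, name))
    return threads
```

Calling `list_threads(1)` while Shepherd hangs would show whether more threads than the expected ones (main, finalizer, signal, GC markers) are present.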
The first one is spawned from Shepherd directly. The other one is
spawned from the forked process in "marionette-shepherd-service".

Those two finalizer threads share the same pipe. When we try to
stop the finalizer thread in Shepherd, right before forking a new
process, we send a '\1' byte to the finalizer pipe.
the marionette finalizer thread. Then, we pthread_join the Shepherd
finalizer thread, which never stops! Quite unfortunate.

Here's a small reproducer attached. So unless I'm wrong this is a Guile
issue, that will cause any program that uses at least two primitive-fork
calls to possibly hang.

I'm quite convinced that those two bugs are directly related:

* https://issues.guix.info/31925
* https://issues.guix.gnu.org/42353

Now regarding the fix of this issue, I guess that a process forked with
"primitive-fork" in Guile should close its parent's finalizer pipe and
open a new one.

WDYT?

Thanks,

Mathieu
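
To illustrate the mechanism described above, here is a minimal sketch in Python. It is not the attached reproducer and not Guile's actual code: the pipe names and the `run_scenario` helper are hypothetical stand-ins. It assumes only what the message states, namely that a forked child keeps reading from the pipe it inherited, so a wake-up byte written by the parent can be consumed by the child. With `child_reopens_pipe=True` the child closes the inherited pipe and opens its own, which is the fix proposed above.

```python
import os
import select


def run_scenario(child_reopens_pipe: bool) -> bool:
    """Return True if the parent's wake-up byte is still readable by the parent."""
    wake_r, wake_w = os.pipe()  # stand-in for the finalizer pipe
    done_r, done_w = os.pipe()  # lets the parent ask the child to exit

    pid = os.fork()
    if pid == 0:
        if child_reopens_pipe:
            # Proposed fix: drop the inherited pipe and create a private one.
            os.close(wake_r)
            os.close(wake_w)
            wake_r, wake_w = os.pipe()
        # Child "finalizer": sleep until its wake pipe or the exit request fires.
        select.select([wake_r, done_r], [], [])
        # Re-check the wake pipe so the outcome does not depend on timing.
        if select.select([wake_r], [], [], 0)[0]:
            os.read(wake_r, 1)  # consumes the byte meant for the parent
        os._exit(0)

    os.write(wake_w, b"\x01")  # parent: "stop the finalizer thread"
    os.write(done_w, b"q")     # then let the child terminate
    os.waitpid(pid, 0)

    # A finalizer thread in the parent could only stop if the byte is still here.
    still_readable = bool(select.select([wake_r], [], [], 0)[0])
    for fd in (wake_r, wake_w, done_r, done_w):
        os.close(fd)
    return still_readable
```

In this sketch, `run_scenario(False)` returns `False` because the child took the byte, so a parent thread waiting on the pipe would never be woken and a join on it would hang. `run_scenario(True)` returns `True` because the child never touches the parent's pipe.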