Shepherd deadlocks

DoneSubmitted by Mathieu Othacehe.
Details
4 participants
  • Ludovic Courtès
  • Michael Rohleder
  • Mathieu Othacehe
  • Andrew Whatson
Owner
unassigned
Severity
important
M
M
Mathieu Othacehe wrote on 19 Jun 2020 10:41
(address . bug-guix@gnu.org)
87h7v75txx.fsf@gnu.org
Hello,
When running "gui-installed-desktop-os-encrypted" test, Shepherd seemsto deadlock when restarting "guix-daemon". This can happen at differentstages:
* In "umount-cow-store" procedure, just before finishing the install.
* During "set-http-proxy" tests inside the marionette. This is not always reproducible. In order to gather some information, Icreated a Shepherd "strace" service that logs what's happening inShepherd itself (patch attached).
It seems that, just after blocking signals, in "fork+exec-command", Iguess, Shepherd is taking a lock:
Toggle snippet (23 lines)183553:1 chdir("/") = 0183554:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=4387222}) = 0183555:1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0183560:1 madvise(0x7f179782d000, 12288, MADV_DONTNEED) = 0183561:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=13592357}) = 0183562:1 clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271183563:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0183564:1 write(15, "shepherd[1]: changing HTTP/HTTPS"..., 86 <unfinished ...>183566:1 <... write resumed>) = 86183575:1 getpgid(169) = 169183576:1 getpgid(0) = 0183577:1 kill(-169, SIGTERM <unfinished ...>183579:1 <... kill resumed>) = 0183582:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0183583:1 write(15, "shepherd[1]: Service guix-daemon"..., 51 <unfinished ...>183585:1 <... write resumed>) = 51183594:1 getuid() = 0183595:1 rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0183596:1 write(6, "\1", 1 <unfinished ...>183598:1 <... write resumed>) = 1183605:1 futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>
and is then blocking forever.
I cannot find any trace of locking in Shepherd, except for"without-automatic-finalization" procedure.
Thanks,
Mathieu
From 13ac475ad953509cbbf88547e4f72efc08d31055 Mon Sep 17 00:00:00 2001From: Mathieu Othacehe <othacehe@gnu.org>Date: Thu, 18 Jun 2020 14:13:15 +0200Subject: [PATCH] strace service
--- gnu/services/base.scm | 3 +++ gnu/services/linux.scm | 25 ++++++++++++++++++++++++- gnu/system/install.scm | 3 +++ 3 files changed, 30 insertions(+), 1 deletion(-)
Toggle diff (84 lines)diff --git a/gnu/services/base.scm b/gnu/services/base.scmindex 6ea7ef8e7e..d80c6cca6d 100644--- a/gnu/services/base.scm+++ b/gnu/services/base.scm@@ -34,6 +34,7 @@ #:use-module (guix deprecation) #:use-module (gnu services) #:use-module (gnu services admin)+ #:use-module (gnu services linux) #:use-module (gnu services shepherd) #:use-module (gnu system pam) #:use-module (gnu system shadow) ; 'user-account', etc.@@ -2386,6 +2387,8 @@ to handle." (service guix-service-type) (service nscd-service-type) + (service strace-service-type 1)+ (service rottlog-service-type) ;; The LVM2 rules are needed as soon as LVM2 or the device-mapper isdiff --git a/gnu/services/linux.scm b/gnu/services/linux.scmindex 12934c2084..96889dfb7f 100644--- a/gnu/services/linux.scm+++ b/gnu/services/linux.scm@@ -42,7 +42,9 @@ earlyoom-configuration-send-notification-command earlyoom-service-type - kernel-module-loader-service-type))+ kernel-module-loader-service-type++ strace-service-type)) ;;;@@ -177,3 +179,24 @@ representation." (compose concatenate) (extend append) (default-value '())))++ +;;;+;;; Strace service.+;;;++(define strace-service-type+ (shepherd-service-type+ 'strace+ (lambda (pid)+ (shepherd-service+ (requirement '(root-file-system user-processes))+ (provision (list (string->symbol+ (string-append "strace-" (number->string pid)))))+ (documentation "Strace a running process.")+ (start #~(make-forkexec-constructor+ (list #$(file-append strace "/bin/strace")+ "-f" "-p" (number->string #$pid)+ "-o" (string-append "/var/log/strace-"+ (number->string #$pid) ".log"))))+ (stop #~(make-kill-destructor))))))diff --git a/gnu/system/install.scm b/gnu/system/install.scmindex d0ff2e7c52..f2f1e9414c 100644--- a/gnu/system/install.scm+++ b/gnu/system/install.scm@@ -34,6 +34,7 @@ #:use-module (gnu installer) #:use-module (gnu system locale) #:use-module (gnu services dbus)+ #:use-module (gnu services linux) #:use-module (gnu services networking) #:use-module (gnu services shepherd) #:use-module (gnu services ssh)@@ -378,6 +379,8 @@ Access documentation at any time by pressing Alt-F2.\x1b[0m ;; regulations-compliant WiFi access. (udev-service #:rules (list lvm2 crda)) + (service strace-service-type 1)+ ;; Add the 'cow-store' service, which users have to start manually ;; since it takes the installation directory as an argument. (cow-store-service)-- 2.26.2
Attachment: strace-1.log.xz
M
M
Mathieu Othacehe wrote on 19 Jun 2020 14:10
(address . 41948@debbugs.gnu.org)(address . ludo@gnu.org)
87d05v5k8n.fsf@gnu.org
Toggle quote (2 lines)> 183605:1 futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>
I think this is caused by a "pthread_join", most probably the one in"stop_finalization_thread" that is called right before forking a newprocess. The fact that we hang here probably means that the finalizerthread itself is hanging, not sure why.
It looks like what was reported by Ludo here:https://issues.guix.info/31925.
Thanks,
Mathieu
M
M
Michael Rohleder wrote on 20 Jun 2020 02:16
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 41948@debbugs.gnu.org)
87y2oilhgm.fsf@rohleder.de
Hi Mathieu,
Mathieu Othacehe <othacehe@gnu.org> writes:
Toggle quote (20 lines)> This is not always reproducible. In order to gather some information, I> created a Shepherd "strace" service that logs what's happening in> Shepherd itself (patch attached).
> +(define strace-service-type> + (shepherd-service-type> + 'strace> + (lambda (pid)> + (shepherd-service> + (requirement '(root-file-system user-processes))> + (provision (list (string->symbol> + (string-append "strace-" (number->string pid)))))> + (documentation "Strace a running process.")> + (start #~(make-forkexec-constructor> + (list #$(file-append strace "/bin/strace")> + "-f" "-p" (number->string #$pid)> + "-o" (string-append "/var/log/strace-"> + (number->string #$pid) ".log"))))> + (stop #~(make-kill-destructor))))))
Thank you very much for this nice strace service!
I'am debugging a "shepherd hangs" problem in the context of lvm andlvmcache...This service helps me a lot, ty!

--
Toggle quote (1 lines)>Ever heard of .cshrc?
That's a city in Bosnia. Right?(Discussion in comp.os.linux.misc on the intuitiveness of commands.)
-----BEGIN PGP SIGNATURE-----
iQEzBAEBCAAdFiEEdV4t5dDVhcUueCgwfHr/vv7yyyUFAl7tVVkACgkQfHr/vv7yyyWfjQf+JiRafW3ojpVRpu3tojSjO969qSreYbHyL9aqCG9RWmlDAyVTHnIqsk8H+D1uZ11aGBLoYpbW+24JLrCyH0MPa019BXKzweuz15do/vkkqcHyCzsJT8sfGv5e1KTcaJyaEctDQlD3Rbh4rZr58xiKOxfGG9x6AFCg44ugbbEMDXKv08bQO+bxi3MxqOx0RlB40uj4jMvdLtjkEWEJYXyu0BXxCBr8Nw2DSeAL4c4ptTZdXZe+lS0IevhpHTzlrNd3euAGGqy+w6oi9dxJZealzl+F1OMyPlJRVUnnN+mSeyWhMhHsCnEWD71QxxoNH2CzC7dgvTMD63PplpKv46V0fQ===he4k-----END PGP SIGNATURE-----
L
L
Ludovic Courtès wrote on 20 Jun 2020 12:31
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 41948@debbugs.gnu.org)
87a70yc9kj.fsf@gnu.org
Hi,
Mathieu Othacehe <othacehe@gnu.org> skribis:
Toggle quote (12 lines)> When running "gui-installed-desktop-os-encrypted" test, Shepherd seems> to deadlock when restarting "guix-daemon". This can happen at different> stages:>> * In "umount-cow-store" procedure, just before finishing the install.>> * During "set-http-proxy" tests inside the marionette.> > This is not always reproducible. In order to gather some information, I> created a Shepherd "strace" service that logs what's happening in> Shepherd itself (patch attached).
We should be able to reproduce it with much simpler tests then, right?Like maybe “while : ; do herd restart guix-daemon ; done” or similar?
Toggle quote (27 lines)> It seems that, just after blocking signals, in "fork+exec-command", I> guess, Shepherd is taking a lock:>> 183553:1 chdir("/") = 0> 183554:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=4387222}) = 0> 183555:1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0> 183560:1 madvise(0x7f179782d000, 12288, MADV_DONTNEED) = 0> 183561:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=13592357}) = 0> 183562:1 clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271> 183563:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0> 183564:1 write(15, "shepherd[1]: changing HTTP/HTTPS"..., 86 <unfinished ...>> 183566:1 <... write resumed>) = 86> 183575:1 getpgid(169) = 169> 183576:1 getpgid(0) = 0> 183577:1 kill(-169, SIGTERM <unfinished ...>> 183579:1 <... kill resumed>) = 0> 183582:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0> 183583:1 write(15, "shepherd[1]: Service guix-daemon"..., 51 <unfinished ...>> 183585:1 <... write resumed>) = 51> 183594:1 getuid() = 0> 183595:1 rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0> 183596:1 write(6, "\1", 1 <unfinished ...>> 183598:1 <... write resumed>) = 1> 183605:1 futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>>> and is then blocking forever.
When that happens, we should check how many threads exist in PID 1.There should be the finalization thread and the main thread, plus thesignal thread (because there are still ‘sigaction’ calls in the ‘main’procedure), plus the GC marker threads.
In https://issues.guix.gnu.org/31925#6, Andy suggests that the signalthread is not properly handled; indeed it takes locks and we don’t tryto shut it down upon fork. However, when using signalfd, the signalthread must be stuck in its ‘read’ call in ‘read_signal_pipe_data’, so Idon’t see how it could cause problems.
The GC threads are presumably taken care of by the atfork handler inlibgc.
Thoughts?
Ludo’.
L
L
Ludovic Courtès wrote on 23 Jun 2020 11:28
control message for bug #41948
(address . control@debbugs.gnu.org)
871rm63zd7.fsf@gnu.org
severity 41948 importantquit
M
M
Mathieu Othacehe wrote on 16 Aug 2020 11:56
Re: bug#41948: Shepherd deadlocks
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 41948@debbugs.gnu.org)
87k0xyhq22.fsf@gnu.org
Hey Ludo,
Toggle quote (3 lines)> We should be able to reproduce it with much simpler tests then, right?> Like maybe “while : ; do herd restart guix-daemon ; done” or similar?
Well I tried that without success. Then I had a closer look to thestrace log.
Turns out there are two concurrent "finalizer" threads:
Toggle snippet (3 lines)1 clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271
and this one,
Toggle snippet (3 lines)217 <... clone resumed>, parent_tid=[253], tls=0x7f1799309700, child_tidptr=0x7f17993099d0) = 253
The first one is spawned from Shepherd directly. The other one isspawned from the forked process in "marionette-shepherd-service".
Those two finalizer threads share the same pipe. When we try tostop the finalizer thread in Shepherd, right before forking a newprocess, we send a '\1' byte to the finalizer pipe.
Toggle snippet (3 lines)1 write(6, "\1", 1 <unfinished ...>
which is received by (line 183597):
Toggle snippet (3 lines)253 <... read resumed>"\1", 1) = 1
the marionette finalizer thread. Then, we pthread_join the Shepherdfinalizer thread, which never stops! Quite unfortunate.
Here's a small reproducer attached. So unless I'm wrong this is a Guileissue, that will cause any program that uses at least two primitive-forkcalls to possibly hang.
I'm quite convinced that those two bugs are directly related:
* https://issues.guix.info/31925* https://issues.guix.gnu.org/42353
Now regarding the fix of this issue, I guess that a process forked with"primitive-fork" in Guile should close it's parent finalizer pipe andopen a new one.
WDYT?
Thanks,
Mathieu
Attachment: t.scm
L
L
Ludovic Courtès wrote on 7 May 23:49 +0200
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 41948@debbugs.gnu.org)
87v97u9pu1.fsf@gnu.org
Hi!
Mathieu Othacehe <othacehe@gnu.org> skribis:
Toggle quote (14 lines)> Those two finalizer threads share the same pipe. When we try to> stop the finalizer thread in Shepherd, right before forking a new> process, we send a '\1' byte to the finalizer pipe.>> 1 write(6, "\1", 1 <unfinished ...>>>> which is received by (line 183597): >> 253 <... read resumed>"\1", 1) = 1>> the marionette finalizer thread. Then, we pthread_join the Shepherd> finalizer thread, which never stops! Quite unfortunate.
While working on a fix for this issue (finalizer pipe shared betweenparent and child process), I found the ‘sleep_pipe’ of the main threadis also shared between the parent and its child.
Attached is a reproducer. It prints something like this before hanging:
Toggle snippet (14 lines)$ guile ~/src/guile-debugging/signal-pipe.scmparent: 25947child: 25953alarm in parent!alarm in parent!alarm in parent!
[...]
alarm in parent!alarm in parent!child woken up!
“child woken up” means that it’s the child process that won the racereading on the sleep pipe (from ‘scm_std_select’).
The parent process then hangs because, in ‘scm_std_select’, it did:
1. select(1), which returned due to available data on ‘wakeup_fd’;
2. ‘full_read (wakeup_fd, &dummy, 1)’ gets stuck forever in read(2) because the child process read that byte in the meantime so there’s nothing left to read.
Here’s the sequence:
Toggle snippet (23 lines)25947 select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)25947 getpid() = 2594725947 kill(25947, SIGALRM) = 025947 --- SIGALRM {si_signo=SIGALRM, si_code=SI_USER, si_pid=25947, si_uid=1000} ---25947 write(8, "\16", 1) = 125947 rt_sigreturn({mask=[]}) = 025952 <... read resumed>"\16", 1) = 125947 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...>25952 write(4, "\0", 1 <unfinished ...>25947 <... rt_sigprocmask resumed>[], 8) = 025953 <... select resumed>) = 1 (in [3], left {tv_sec=0, tv_usec=346370})25952 <... write resumed>) = 125947 select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=100000} <unfinished ...>25953 read(3, <unfinished ...>25952 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...>25947 <... select resumed>) = 1 (in [3], left {tv_sec=0, tv_usec=99999})25953 <... read resumed>"\0", 1) = 125947 read(3, <unfinished ...>25952 <... rt_sigprocmask resumed>~[KILL STOP PWR RTMIN RT_1], 8) = 025953 write(1, "child woken up!\n", 16 <unfinished ...>25952 read(7, <unfinished ...>
Notice how “\16” (= SIGALRM) is written by the parent’s main thread andread by the child’s main thread.
Thoughts?
Ludo’.
;; https://issues.guix.gnu.org/41948 (use-modules (ice-9 match)) (setvbuf (current-output-port) 'line) (sigaction SIGCHLD pk) ;start signal thread (match (primitive-fork) (0 (format #t "child: ~a~%" (getpid)) (let loop () (unless (zero? (usleep 500000)) ;; If this happens, it means the select(2) call in 'scm_std_select' ;; returned because one of our file descriptors had input data ;; available (which shouldn't happen). (format #t "child woken up!~%")) (loop))) (pid (format #t "parent: ~a~%" (getpid)) (sigaction SIGALRM (lambda _ (format #t "alarm in parent!~%"))) (let loop () (kill (getpid) SIGALRM) (usleep 100000) (loop))))
L
L
Ludovic Courtès wrote on 8 May 00:07 +0200
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 41948@debbugs.gnu.org)
87o8dm9p0n.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:
Toggle quote (4 lines)> While working on a fix for this issue (finalizer pipe shared between> parent and child process), I found the ‘sleep_pipe’ of the main thread> is also shared between the parent and its child.
Here’s a patch that fixes the problem as exposed by the reproducer.
Thoughts?
(We’d need these pipes to be “O_CLOFORK” rather than O_CLOEXEC.)
Ludo’.
Toggle diff (45 lines)diff --git a/libguile/posix.c b/libguile/posix.cindex eaf12de32..e0c157c20 100644--- a/libguile/posix.c+++ b/libguile/posix.c@@ -1217,6 +1217,29 @@ SCM_DEFINE (scm_execle, "execle", 2, 0, 1, #undef FUNC_NAME #ifdef HAVE_FORK+static void *+do_fork (void *ret)+{+ pid_t pid = fork ();++ if (pid == 0)+ {+ /* The child process must not share its sleep pipe with the+ parent. */+ int err;+ scm_thread *t = SCM_I_CURRENT_THREAD;++ close (t->sleep_pipe[0]);+ close (t->sleep_pipe[1]);+ err = pipe2 (t->sleep_pipe, O_CLOEXEC);+ if (err != 0)+ abort ();+ }++ * (pid_t *) ret = pid;+ return NULL;+}+ SCM_DEFINE (scm_fork, "primitive-fork", 0, 0, 0, (), "Creates a new \"child\" process by duplicating the current \"parent\" process.\n"@@ -1244,7 +1267,9 @@ SCM_DEFINE (scm_fork, "primitive-fork", 0, 0, 0, " further behavior unspecified. See \"Processes\" in the\n" " manual, for more information.\n"), scm_current_warning_port ());- pid = fork ();++ scm_without_guile (do_fork, &pid);+ if (pid == -1) SCM_SYSERROR; return scm_from_int (pid);
L
L
Ludovic Courtès wrote on 8 May 11:43 +0200
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 41948@debbugs.gnu.org)
87im3ta7dg.fsf@gnu.org
Hi,
Mathieu Othacehe <othacehe@gnu.org> skribis:
Toggle quote (14 lines)> Those two finalizer threads share the same pipe. When we try to> stop the finalizer thread in Shepherd, right before forking a new> process, we send a '\1' byte to the finalizer pipe.>> 1 write(6, "\1", 1 <unfinished ...>>>> which is received by (line 183597): >> 253 <... read resumed>"\1", 1) = 1>> the marionette finalizer thread. Then, we pthread_join the Shepherd> finalizer thread, which never stops! Quite unfortunate.
And here’s the patch for this pipe: it closes the finalizer pipepre-fork, and it gets reopened lazily.
Together with the ‘sleep_pipe’ patch, it appears to fix the problemsdescribed here.
Ludo’.
Toggle diff (87 lines)diff --git a/libguile/finalizers.c b/libguile/finalizers.cindex 0ae165fd1..5ddc7dbef 100644--- a/libguile/finalizers.c+++ b/libguile/finalizers.c@@ -24,6 +24,7 @@ # include <config.h> #endif +#include <assert.h> #include <errno.h> #include <fcntl.h> #include <full-write.h>@@ -170,7 +171,7 @@ queue_finalizer_async (void) #if SCM_USE_PTHREAD_THREADS -static int finalization_pipe[2];+static int finalization_pipe[2] = { -1, -1 }; static scm_i_pthread_mutex_t finalization_thread_lock = SCM_I_PTHREAD_MUTEX_INITIALIZER; static pthread_t finalization_thread;@@ -254,6 +255,13 @@ start_finalization_thread (void) scm_i_pthread_mutex_lock (&finalization_thread_lock); if (!finalization_thread_is_running) {+ assert (finalization_pipe[0] == -1);++ if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)+ scm_syserror (NULL);++ GC_set_finalizer_notifier (notify_finalizers_to_run);+ /* Use the raw pthread API and scm_with_guile, because we don't want to block on any lock that scm_spawn_thread might want to take, and we don't want to inherit the dynamic state (fluids) of the@@ -276,6 +284,12 @@ stop_finalization_thread (void) notify_about_to_fork (); if (pthread_join (finalization_thread, NULL)) perror ("joining finalization thread");++ close (finalization_pipe[0]);+ close (finalization_pipe[1]);+ finalization_pipe[0] = -1;+ finalization_pipe[1] = -1;+ finalization_thread_is_running = 0; } scm_i_pthread_mutex_unlock (&finalization_thread_lock);@@ -284,7 +298,6 @@ stop_finalization_thread (void) static void spawn_finalizer_thread (void) {- GC_set_finalizer_notifier (notify_finalizers_to_run); start_finalization_thread (); } @@ -368,8 +381,6 @@ scm_set_automatic_finalization_enabled (int enabled_p) if (enabled_p) { #if SCM_USE_PTHREAD_THREADS- if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)- scm_syserror (NULL); GC_set_finalizer_notifier (spawn_finalizer_thread); #else GC_set_finalizer_notifier (queue_finalizer_async);@@ -381,10 +392,6 @@ scm_set_automatic_finalization_enabled (int enabled_p) #if SCM_USE_PTHREAD_THREADS stop_finalization_thread ();- close (finalization_pipe[0]);- close (finalization_pipe[1]);- finalization_pipe[0] = -1;- finalization_pipe[1] = -1; #endif } @@ -423,10 +430,6 @@ scm_init_finalizer_thread (void) { #if SCM_USE_PTHREAD_THREADS if (automatic_finalization_p)- {- if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)- scm_syserror (NULL); GC_set_finalizer_notifier (spawn_finalizer_thread);- } #endif }
A
A
Andrew Whatson wrote on 8 May 15:49 +0200
(address . 41948@debbugs.gnu.org)
20210508134908.3133397-1-whatson@gmail.com
Hi,
I've reviewed the finalizer patch and made some changes to ensure thatit works correctly if pipe creation or thread creation fail.
Thread creation fails in an out-of-memory scenario, so this part can beverified by running Guile's test-out-of-memory test case. You'll need alibgc built with --disable-munmap for the test to survive long enough toretry launching the finalizer thread.
Cheers!
A
A
Andrew Whatson wrote on 8 May 15:49 +0200
[PATCH] Fix some finalizer thread race conditions
(address . 41948@debbugs.gnu.org)(name . Andrew Whatson)(address . whatson@gmail.com)
20210508134908.3133397-2-whatson@gmail.com
* libguile/finalizers.c (finalization_pipe): Initialize.(reset_finalization_pipe): Factored out.(start_finalization_thread): Create the pipe immediately beforelaunching the thread. Ensure the pipe is cleaned up if thread creationfails. Update the finalizer callback if thread creation succeeds.(stop_finalization_thread): Clean up the pipe after stopping the thread.(spawn_finalizer_thread): Remove finalizer callback logic.(scm_set_automatic_finalization_enabled): Remove pipe management.(scm_init_finalizer_thread): Remove pipe management.--- libguile/finalizers.c | 45 +++++++++++++++++++++++++++---------------- 1 file changed, 28 insertions(+), 17 deletions(-)
Toggle diff (118 lines)diff --git a/libguile/finalizers.c b/libguile/finalizers.cindex 0ae165fd1..5122e5fe3 100644--- a/libguile/finalizers.c+++ b/libguile/finalizers.c@@ -24,6 +24,7 @@ # include <config.h> #endif +#include <assert.h> #include <errno.h> #include <fcntl.h> #include <full-write.h>@@ -170,7 +171,7 @@ queue_finalizer_async (void) #if SCM_USE_PTHREAD_THREADS -static int finalization_pipe[2];+static int finalization_pipe[2] = { -1, -1 }; static scm_i_pthread_mutex_t finalization_thread_lock = SCM_I_PTHREAD_MUTEX_INITIALIZER; static pthread_t finalization_thread;@@ -190,6 +191,15 @@ notify_about_to_fork (void) full_write (finalization_pipe[1], &byte, 1); } +static void+reset_finalization_pipe (void)+{+ close (finalization_pipe[0]);+ close (finalization_pipe[1]);+ finalization_pipe[0] = -1;+ finalization_pipe[1] = -1;+}+ struct finalization_pipe_data { char byte;@@ -254,15 +264,25 @@ start_finalization_thread (void) scm_i_pthread_mutex_lock (&finalization_thread_lock); if (!finalization_thread_is_running) {+ assert (finalization_pipe[0] == -1);+ /* Use the raw pthread API and scm_with_guile, because we don't want to block on any lock that scm_spawn_thread might want to take, and we don't want to inherit the dynamic state (fluids) of the caller. */- if (pthread_create (&finalization_thread, NULL,- run_finalization_thread, NULL))- perror ("error creating finalization thread");+ if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)+ perror ("error creating finalization pipe");+ else if (pthread_create (&finalization_thread, NULL,+ run_finalization_thread, NULL))+ {+ reset_finalization_pipe ();+ perror ("error creating finalization thread");+ } else- finalization_thread_is_running = 1;+ {+ GC_set_finalizer_notifier (notify_finalizers_to_run);+ finalization_thread_is_running = 1;+ } } scm_i_pthread_mutex_unlock (&finalization_thread_lock); }@@ -276,6 +296,8 @@ stop_finalization_thread (void) notify_about_to_fork (); if (pthread_join (finalization_thread, NULL)) perror ("joining finalization thread");++ reset_finalization_pipe (); finalization_thread_is_running = 0; } scm_i_pthread_mutex_unlock (&finalization_thread_lock);@@ -284,7 +306,6 @@ stop_finalization_thread (void) static void spawn_finalizer_thread (void) {- GC_set_finalizer_notifier (notify_finalizers_to_run); start_finalization_thread (); } @@ -368,8 +389,6 @@ scm_set_automatic_finalization_enabled (int enabled_p) if (enabled_p) { #if SCM_USE_PTHREAD_THREADS- if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)- scm_syserror (NULL); GC_set_finalizer_notifier (spawn_finalizer_thread); #else GC_set_finalizer_notifier (queue_finalizer_async);@@ -381,10 +400,6 @@ scm_set_automatic_finalization_enabled (int enabled_p) #if SCM_USE_PTHREAD_THREADS stop_finalization_thread ();- close (finalization_pipe[0]);- close (finalization_pipe[1]);- finalization_pipe[0] = -1;- finalization_pipe[1] = -1; #endif } @@ -423,10 +438,6 @@ scm_init_finalizer_thread (void) { #if SCM_USE_PTHREAD_THREADS if (automatic_finalization_p)- {- if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)- scm_syserror (NULL);- GC_set_finalizer_notifier (spawn_finalizer_thread);- }+ GC_set_finalizer_notifier (spawn_finalizer_thread); #endif }-- 2.31.1
L
L
Ludovic Courtès wrote on 8 May 22:50 +0200
Re: bug#41948: Shepherd deadlocks
(name . Andrew Whatson)(address . whatson@gmail.com)(address . 41948@debbugs.gnu.org)
871rah7xw6.fsf_-_@gnu.org
Hi Andrew,
Andrew Whatson <whatson@gmail.com> skribis:
Toggle quote (10 lines)> * libguile/finalizers.c (finalization_pipe): Initialize.> (reset_finalization_pipe): Factored out.> (start_finalization_thread): Create the pipe immediately before> launching the thread. Ensure the pipe is cleaned up if thread creation> fails. Update the finalizer callback if thread creation succeeds.> (stop_finalization_thread): Clean up the pipe after stopping the thread.> (spawn_finalizer_thread): Remove finalizer callback logic.> (scm_set_automatic_finalization_enabled): Remove pipe management.> (scm_init_finalizer_thread): Remove pipe management.
I tweaked the commit log and pushed as5a281e35f4a5ae78fbcf10591d9358bec8f0bee0.
Thanks!
Ludo’.
L
L
Ludovic Courtès wrote on 8 May 22:52 +0200
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 41948-done@debbugs.gnu.org)
87wns96j93.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:
Toggle quote (8 lines)> Ludovic Courtès <ludo@gnu.org> skribis:>>> While working on a fix for this issue (finalizer pipe shared between>> parent and child process), I found the ‘sleep_pipe’ of the main thread>> is also shared between the parent and its child.>> Here’s a patch that fixes the problem as exposed by the reproducer.
I went ahead and pushed as 381291f5ff4480afbb197bf5e5a2272cfe54a386,together with a test derived from the one I posted earlier.
With this and Andrew’s finalizer pipe patch, we should be good!
This will be in the upcoming Guile 3.0.7 release.
Ludo’.
Closed
?