Shepherd deadlocks

  • Done
  • quality assurance status badge
Details
4 participants
  • Ludovic Courtès
  • Michael Rohleder
  • Mathieu Othacehe
  • Andrew Whatson
Owner
unassigned
Submitted by
Mathieu Othacehe
Severity
important
M
M
Mathieu Othacehe wrote on 19 Jun 2020 10:41
(address . bug-guix@gnu.org)
87h7v75txx.fsf@gnu.org
Hello,

When running "gui-installed-desktop-os-encrypted" test, Shepherd seems
to deadlock when restarting "guix-daemon". This can happen at different
stages:

* In "umount-cow-store" procedure, just before finishing the install.

* During "set-http-proxy" tests inside the marionette.
This is not always reproducible. In order to gather some information, I
created a Shepherd "strace" service that logs what's happening in
Shepherd itself (patch attached).

It seems that, just after blocking signals, in "fork+exec-command", I
guess, Shepherd is taking a lock:

Toggle snippet (23 lines)
183553:1 chdir("/") = 0
183554:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=4387222}) = 0
183555:1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
183560:1 madvise(0x7f179782d000, 12288, MADV_DONTNEED) = 0
183561:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=13592357}) = 0
183562:1 clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271
183563:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
183564:1 write(15, "shepherd[1]: changing HTTP/HTTPS"..., 86 <unfinished ...>
183566:1 <... write resumed>) = 86
183575:1 getpgid(169) = 169
183576:1 getpgid(0) = 0
183577:1 kill(-169, SIGTERM <unfinished ...>
183579:1 <... kill resumed>) = 0
183582:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
183583:1 write(15, "shepherd[1]: Service guix-daemon"..., 51 <unfinished ...>
183585:1 <... write resumed>) = 51
183594:1 getuid() = 0
183595:1 rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0
183596:1 write(6, "\1", 1 <unfinished ...>
183598:1 <... write resumed>) = 1
183605:1 futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>

and is then blocking forever.

I cannot find any trace of locking in Shepherd, except for
"without-automatic-finalization" procedure.

Thanks,

Mathieu
From 13ac475ad953509cbbf88547e4f72efc08d31055 Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <othacehe@gnu.org>
Date: Thu, 18 Jun 2020 14:13:15 +0200
Subject: [PATCH] strace service

---
gnu/services/base.scm | 3 +++
gnu/services/linux.scm | 25 ++++++++++++++++++++++++-
gnu/system/install.scm | 3 +++
3 files changed, 30 insertions(+), 1 deletion(-)

Toggle diff (84 lines)
diff --git a/gnu/services/base.scm b/gnu/services/base.scm
index 6ea7ef8e7e..d80c6cca6d 100644
--- a/gnu/services/base.scm
+++ b/gnu/services/base.scm
@@ -34,6 +34,7 @@
#:use-module (guix deprecation)
#:use-module (gnu services)
#:use-module (gnu services admin)
+ #:use-module (gnu services linux)
#:use-module (gnu services shepherd)
#:use-module (gnu system pam)
#:use-module (gnu system shadow) ; 'user-account', etc.
@@ -2386,6 +2387,8 @@ to handle."
(service guix-service-type)
(service nscd-service-type)
+ (service strace-service-type 1)
+
(service rottlog-service-type)
;; The LVM2 rules are needed as soon as LVM2 or the device-mapper is
diff --git a/gnu/services/linux.scm b/gnu/services/linux.scm
index 12934c2084..96889dfb7f 100644
--- a/gnu/services/linux.scm
+++ b/gnu/services/linux.scm
@@ -42,7 +42,9 @@
earlyoom-configuration-send-notification-command
earlyoom-service-type
- kernel-module-loader-service-type))
+ kernel-module-loader-service-type
+
+ strace-service-type))
;;;
@@ -177,3 +179,24 @@ representation."
(compose concatenate)
(extend append)
(default-value '())))
+
+
+;;;
+;;; Strace service.
+;;;
+
+(define strace-service-type
+ (shepherd-service-type
+ 'strace
+ (lambda (pid)
+ (shepherd-service
+ (requirement '(root-file-system user-processes))
+ (provision (list (string->symbol
+ (string-append "strace-" (number->string pid)))))
+ (documentation "Strace a running process.")
+ (start #~(make-forkexec-constructor
+ (list #$(file-append strace "/bin/strace")
+ "-f" "-p" (number->string #$pid)
+ "-o" (string-append "/var/log/strace-"
+ (number->string #$pid) ".log"))))
+ (stop #~(make-kill-destructor))))))
diff --git a/gnu/system/install.scm b/gnu/system/install.scm
index d0ff2e7c52..f2f1e9414c 100644
--- a/gnu/system/install.scm
+++ b/gnu/system/install.scm
@@ -34,6 +34,7 @@
#:use-module (gnu installer)
#:use-module (gnu system locale)
#:use-module (gnu services dbus)
+ #:use-module (gnu services linux)
#:use-module (gnu services networking)
#:use-module (gnu services shepherd)
#:use-module (gnu services ssh)
@@ -378,6 +379,8 @@ Access documentation at any time by pressing Alt-F2.\x1b[0m
;; regulations-compliant WiFi access.
(udev-service #:rules (list lvm2 crda))
+ (service strace-service-type 1)
+
;; Add the 'cow-store' service, which users have to start manually
;; since it takes the installation directory as an argument.
(cow-store-service)
--
2.26.2
Attachment: strace-1.log.xz
M
M
Mathieu Othacehe wrote on 19 Jun 2020 14:10
(address . 41948@debbugs.gnu.org)(address . ludo@gnu.org)
87d05v5k8n.fsf@gnu.org
Toggle quote (2 lines)
> 183605:1 futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>

I think this is caused by a "pthread_join", most probably the one in
"stop_finalization_thread" that is called right before forking a new
process. The fact that we hang here probably means that the finalizer
thread itself is hanging, not sure why.

It looks like what was reported by Ludo here:

Thanks,

Mathieu
M
M
Michael Rohleder wrote on 20 Jun 2020 02:16
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 41948@debbugs.gnu.org)
87y2oilhgm.fsf@rohleder.de
Hi Mathieu,

Mathieu Othacehe <othacehe@gnu.org> writes:
Toggle quote (20 lines)
> This is not always reproducible. In order to gather some information, I
> created a Shepherd "strace" service that logs what's happening in
> Shepherd itself (patch attached).

> +(define strace-service-type
> + (shepherd-service-type
> + 'strace
> + (lambda (pid)
> + (shepherd-service
> + (requirement '(root-file-system user-processes))
> + (provision (list (string->symbol
> + (string-append "strace-" (number->string pid)))))
> + (documentation "Strace a running process.")
> + (start #~(make-forkexec-constructor
> + (list #$(file-append strace "/bin/strace")
> + "-f" "-p" (number->string #$pid)
> + "-o" (string-append "/var/log/strace-"
> + (number->string #$pid) ".log"))))
> + (stop #~(make-kill-destructor))))))

Thank you very much for this nice strace service!

I'am debugging a "shepherd hangs" problem in the context of lvm and
lvmcache...
This service helps me a lot, ty!


--
Toggle quote (1 lines)
>Ever heard of .cshrc?
That's a city in Bosnia. Right?
(Discussion in comp.os.linux.misc on the intuitiveness of commands.)
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEdV4t5dDVhcUueCgwfHr/vv7yyyUFAl7tVVkACgkQfHr/vv7y
yyWfjQf+JiRafW3ojpVRpu3tojSjO969qSreYbHyL9aqCG9RWmlDAyVTHnIqsk8H
+D1uZ11aGBLoYpbW+24JLrCyH0MPa019BXKzweuz15do/vkkqcHyCzsJT8sfGv5e
1KTcaJyaEctDQlD3Rbh4rZr58xiKOxfGG9x6AFCg44ugbbEMDXKv08bQO+bxi3Mx
qOx0RlB40uj4jMvdLtjkEWEJYXyu0BXxCBr8Nw2DSeAL4c4ptTZdXZe+lS0Ievhp
HTzlrNd3euAGGqy+w6oi9dxJZealzl+F1OMyPlJRVUnnN+mSeyWhMhHsCnEWD71Q
xxoNH2CzC7dgvTMD63PplpKv46V0fQ==
=he4k
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 20 Jun 2020 12:31
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 41948@debbugs.gnu.org)
87a70yc9kj.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (12 lines)
> When running "gui-installed-desktop-os-encrypted" test, Shepherd seems
> to deadlock when restarting "guix-daemon". This can happen at different
> stages:
>
> * In "umount-cow-store" procedure, just before finishing the install.
>
> * During "set-http-proxy" tests inside the marionette.
>
> This is not always reproducible. In order to gather some information, I
> created a Shepherd "strace" service that logs what's happening in
> Shepherd itself (patch attached).

We should be able to reproduce it with much simpler tests then, right?
Like maybe “while : ; do herd restart guix-daemon ; done” or similar?

Toggle quote (27 lines)
> It seems that, just after blocking signals, in "fork+exec-command", I
> guess, Shepherd is taking a lock:
>
> 183553:1 chdir("/") = 0
> 183554:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=4387222}) = 0
> 183555:1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
> 183560:1 madvise(0x7f179782d000, 12288, MADV_DONTNEED) = 0
> 183561:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=13592357}) = 0
> 183562:1 clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271
> 183563:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
> 183564:1 write(15, "shepherd[1]: changing HTTP/HTTPS"..., 86 <unfinished ...>
> 183566:1 <... write resumed>) = 86
> 183575:1 getpgid(169) = 169
> 183576:1 getpgid(0) = 0
> 183577:1 kill(-169, SIGTERM <unfinished ...>
> 183579:1 <... kill resumed>) = 0
> 183582:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
> 183583:1 write(15, "shepherd[1]: Service guix-daemon"..., 51 <unfinished ...>
> 183585:1 <... write resumed>) = 51
> 183594:1 getuid() = 0
> 183595:1 rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0
> 183596:1 write(6, "\1", 1 <unfinished ...>
> 183598:1 <... write resumed>) = 1
> 183605:1 futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>
>
> and is then blocking forever.

When that happens, we should check how many threads exist in PID 1.
There should be the finalization thread and the main thread, plus the
signal thread (because there are still ‘sigaction’ calls in the ‘main’
procedure), plus the GC marker threads.

In https://issues.guix.gnu.org/31925#6, Andy suggests that the signal
thread is not properly handled; indeed it takes locks and we don’t try
to shut it down upon fork. However, when using signalfd, the signal
thread must be stuck in its ‘read’ call in ‘read_signal_pipe_data’, so I
don’t see how it could cause problems.

The GC threads are presumably taken care of by the atfork handler in
libgc.

Thoughts?

Ludo’.
L
L
Ludovic Courtès wrote on 23 Jun 2020 11:28
control message for bug #41948
(address . control@debbugs.gnu.org)
871rm63zd7.fsf@gnu.org
severity 41948 important
quit
M
M
Mathieu Othacehe wrote on 16 Aug 2020 11:56
Re: bug#41948: Shepherd deadlocks
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 41948@debbugs.gnu.org)
87k0xyhq22.fsf@gnu.org
Hey Ludo,

Toggle quote (3 lines)
> We should be able to reproduce it with much simpler tests then, right?
> Like maybe “while : ; do herd restart guix-daemon ; done” or similar?

Well I tried that without success. Then I had a closer look to the
strace log.

Turns out there are two concurrent "finalizer" threads:

Toggle snippet (3 lines)
1 clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271

and this one,

Toggle snippet (3 lines)
217 <... clone resumed>, parent_tid=[253], tls=0x7f1799309700, child_tidptr=0x7f17993099d0) = 253

The first one is spawned from Shepherd directly. The other one is
spawned from the forked process in "marionette-shepherd-service".

Those two finalizer threads share the same pipe. When we try to
stop the finalizer thread in Shepherd, right before forking a new
process, we send a '\1' byte to the finalizer pipe.

Toggle snippet (3 lines)
1 write(6, "\1", 1 <unfinished ...>

which is received by (line 183597):

Toggle snippet (3 lines)
253 <... read resumed>"\1", 1) = 1

the marionette finalizer thread. Then, we pthread_join the Shepherd
finalizer thread, which never stops! Quite unfortunate.

Here's a small reproducer attached. So unless I'm wrong this is a Guile
issue, that will cause any program that uses at least two primitive-fork
calls to possibly hang.

I'm quite convinced that those two bugs are directly related:


Now regarding the fix of this issue, I guess that a process forked with
"primitive-fork" in Guile should close it's parent finalizer pipe and
open a new one.

WDYT?

Thanks,

Mathieu
Attachment: t.scm
L
L
Ludovic Courtès wrote on 7 May 2021 23:49
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 41948@debbugs.gnu.org)
87v97u9pu1.fsf@gnu.org
Hi!

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (14 lines)
> Those two finalizer threads share the same pipe. When we try to
> stop the finalizer thread in Shepherd, right before forking a new
> process, we send a '\1' byte to the finalizer pipe.
>
> 1 write(6, "\1", 1 <unfinished ...>
>
>
> which is received by (line 183597):
>
> 253 <... read resumed>"\1", 1) = 1
>
> the marionette finalizer thread. Then, we pthread_join the Shepherd
> finalizer thread, which never stops! Quite unfortunate.

While working on a fix for this issue (finalizer pipe shared between
parent and child process), I found the ‘sleep_pipe’ of the main thread
is also shared between the parent and its child.

Attached is a reproducer. It prints something like this before hanging:

Toggle snippet (14 lines)
$ guile ~/src/guile-debugging/signal-pipe.scm
parent: 25947
child: 25953
alarm in parent!
alarm in parent!
alarm in parent!

[...]

alarm in parent!
alarm in parent!
child woken up!

“child woken up” means that it’s the child process that won the race
reading on the sleep pipe (from ‘scm_std_select’).

The parent process then hangs because, in ‘scm_std_select’, it did:

1. select(1), which returned due to available data on ‘wakeup_fd’;

2. ‘full_read (wakeup_fd, &dummy, 1)’ gets stuck forever in read(2)
because the child process read that byte in the meantime so
there’s nothing left to read.

Here’s the sequence:

Toggle snippet (23 lines)
25947 select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
25947 getpid() = 25947
25947 kill(25947, SIGALRM) = 0
25947 --- SIGALRM {si_signo=SIGALRM, si_code=SI_USER, si_pid=25947, si_uid=1000} ---
25947 write(8, "\16", 1) = 1
25947 rt_sigreturn({mask=[]}) = 0
25952 <... read resumed>"\16", 1) = 1
25947 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...>
25952 write(4, "\0", 1 <unfinished ...>
25947 <... rt_sigprocmask resumed>[], 8) = 0
25953 <... select resumed>) = 1 (in [3], left {tv_sec=0, tv_usec=346370})
25952 <... write resumed>) = 1
25947 select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=100000} <unfinished ...>
25953 read(3, <unfinished ...>
25952 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...>
25947 <... select resumed>) = 1 (in [3], left {tv_sec=0, tv_usec=99999})
25953 <... read resumed>"\0", 1) = 1
25947 read(3, <unfinished ...>
25952 <... rt_sigprocmask resumed>~[KILL STOP PWR RTMIN RT_1], 8) = 0
25953 write(1, "child woken up!\n", 16 <unfinished ...>
25952 read(7, <unfinished ...>

Notice how “\16” (= SIGALRM) is written by the parent’s main thread and
read by the child’s main thread.

Thoughts?

Ludo’.
;; https://issues.guix.gnu.org/41948 (use-modules (ice-9 match)) (setvbuf (current-output-port) 'line) (sigaction SIGCHLD pk) ;start signal thread (match (primitive-fork) (0 (format #t "child: ~a~%" (getpid)) (let loop () (unless (zero? (usleep 500000)) ;; If this happens, it means the select(2) call in 'scm_std_select' ;; returned because one of our file descriptors had input data ;; available (which shouldn't happen). (format #t "child woken up!~%")) (loop))) (pid (format #t "parent: ~a~%" (getpid)) (sigaction SIGALRM (lambda _ (format #t "alarm in parent!~%"))) (let loop () (kill (getpid) SIGALRM) (usleep 100000) (loop))))
L
L
Ludovic Courtès wrote on 8 May 2021 00:07
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 41948@debbugs.gnu.org)
87o8dm9p0n.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (4 lines)
> While working on a fix for this issue (finalizer pipe shared between
> parent and child process), I found the ‘sleep_pipe’ of the main thread
> is also shared between the parent and its child.

Here’s a patch that fixes the problem as exposed by the reproducer.

Thoughts?

(We’d need these pipes to be “O_CLOFORK” rather than O_CLOEXEC.)

Ludo’.
Toggle diff (45 lines)
diff --git a/libguile/posix.c b/libguile/posix.c
index eaf12de32..e0c157c20 100644
--- a/libguile/posix.c
+++ b/libguile/posix.c
@@ -1217,6 +1217,29 @@ SCM_DEFINE (scm_execle, "execle", 2, 0, 1,
#undef FUNC_NAME
#ifdef HAVE_FORK
+static void *
+do_fork (void *ret)
+{
+ pid_t pid = fork ();
+
+ if (pid == 0)
+ {
+ /* The child process must not share its sleep pipe with the
+ parent. */
+ int err;
+ scm_thread *t = SCM_I_CURRENT_THREAD;
+
+ close (t->sleep_pipe[0]);
+ close (t->sleep_pipe[1]);
+ err = pipe2 (t->sleep_pipe, O_CLOEXEC);
+ if (err != 0)
+ abort ();
+ }
+
+ * (pid_t *) ret = pid;
+ return NULL;
+}
+
SCM_DEFINE (scm_fork, "primitive-fork", 0, 0, 0,
(),
"Creates a new \"child\" process by duplicating the current \"parent\" process.\n"
@@ -1244,7 +1267,9 @@ SCM_DEFINE (scm_fork, "primitive-fork", 0, 0, 0,
" further behavior unspecified. See \"Processes\" in the\n"
" manual, for more information.\n"),
scm_current_warning_port ());
- pid = fork ();
+
+ scm_without_guile (do_fork, &pid);
+
if (pid == -1)
SCM_SYSERROR;
return scm_from_int (pid);
L
L
Ludovic Courtès wrote on 8 May 2021 11:43
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 41948@debbugs.gnu.org)
87im3ta7dg.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (14 lines)
> Those two finalizer threads share the same pipe. When we try to
> stop the finalizer thread in Shepherd, right before forking a new
> process, we send a '\1' byte to the finalizer pipe.
>
> 1 write(6, "\1", 1 <unfinished ...>
>
>
> which is received by (line 183597):
>
> 253 <... read resumed>"\1", 1) = 1
>
> the marionette finalizer thread. Then, we pthread_join the Shepherd
> finalizer thread, which never stops! Quite unfortunate.

And here’s the patch for this pipe: it closes the finalizer pipe
pre-fork, and it gets reopened lazily.

Together with the ‘sleep_pipe’ patch, it appears to fix the problems
described here.

Ludo’.
Toggle diff (87 lines)
diff --git a/libguile/finalizers.c b/libguile/finalizers.c
index 0ae165fd1..5ddc7dbef 100644
--- a/libguile/finalizers.c
+++ b/libguile/finalizers.c
@@ -24,6 +24,7 @@
# include <config.h>
#endif
+#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <full-write.h>
@@ -170,7 +171,7 @@ queue_finalizer_async (void)
#if SCM_USE_PTHREAD_THREADS
-static int finalization_pipe[2];
+static int finalization_pipe[2] = { -1, -1 };
static scm_i_pthread_mutex_t finalization_thread_lock =
SCM_I_PTHREAD_MUTEX_INITIALIZER;
static pthread_t finalization_thread;
@@ -254,6 +255,13 @@ start_finalization_thread (void)
scm_i_pthread_mutex_lock (&finalization_thread_lock);
if (!finalization_thread_is_running)
{
+ assert (finalization_pipe[0] == -1);
+
+ if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)
+ scm_syserror (NULL);
+
+ GC_set_finalizer_notifier (notify_finalizers_to_run);
+
/* Use the raw pthread API and scm_with_guile, because we don't want
to block on any lock that scm_spawn_thread might want to take,
and we don't want to inherit the dynamic state (fluids) of the
@@ -276,6 +284,12 @@ stop_finalization_thread (void)
notify_about_to_fork ();
if (pthread_join (finalization_thread, NULL))
perror ("joining finalization thread");
+
+ close (finalization_pipe[0]);
+ close (finalization_pipe[1]);
+ finalization_pipe[0] = -1;
+ finalization_pipe[1] = -1;
+
finalization_thread_is_running = 0;
}
scm_i_pthread_mutex_unlock (&finalization_thread_lock);
@@ -284,7 +298,6 @@ stop_finalization_thread (void)
static void
spawn_finalizer_thread (void)
{
- GC_set_finalizer_notifier (notify_finalizers_to_run);
start_finalization_thread ();
}
@@ -368,8 +381,6 @@ scm_set_automatic_finalization_enabled (int enabled_p)
if (enabled_p)
{
#if SCM_USE_PTHREAD_THREADS
- if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)
- scm_syserror (NULL);
GC_set_finalizer_notifier (spawn_finalizer_thread);
#else
GC_set_finalizer_notifier (queue_finalizer_async);
@@ -381,10 +392,6 @@ scm_set_automatic_finalization_enabled (int enabled_p)
#if SCM_USE_PTHREAD_THREADS
stop_finalization_thread ();
- close (finalization_pipe[0]);
- close (finalization_pipe[1]);
- finalization_pipe[0] = -1;
- finalization_pipe[1] = -1;
#endif
}
@@ -423,10 +430,6 @@ scm_init_finalizer_thread (void)
{
#if SCM_USE_PTHREAD_THREADS
if (automatic_finalization_p)
- {
- if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)
- scm_syserror (NULL);
GC_set_finalizer_notifier (spawn_finalizer_thread);
- }
#endif
}
A
A
Andrew Whatson wrote on 8 May 2021 15:49
(address . 41948@debbugs.gnu.org)
20210508134908.3133397-1-whatson@gmail.com
Hi,

I've reviewed the finalizer patch and made some changes to ensure that
it works correctly if pipe creation or thread creation fail.

Thread creation fails in an out-of-memory scenario, so this part can be
verified by running Guile's test-out-of-memory test case. You'll need a
libgc built with --disable-munmap for the test to survive long enough to
retry launching the finalizer thread.

Cheers!
A
A
Andrew Whatson wrote on 8 May 2021 15:49
[PATCH] Fix some finalizer thread race conditions
(address . 41948@debbugs.gnu.org)(name . Andrew Whatson)(address . whatson@gmail.com)
20210508134908.3133397-2-whatson@gmail.com
* libguile/finalizers.c (finalization_pipe): Initialize.
(reset_finalization_pipe): Factored out.
(start_finalization_thread): Create the pipe immediately before
launching the thread. Ensure the pipe is cleaned up if thread creation
fails. Update the finalizer callback if thread creation succeeds.
(stop_finalization_thread): Clean up the pipe after stopping the thread.
(spawn_finalizer_thread): Remove finalizer callback logic.
(scm_set_automatic_finalization_enabled): Remove pipe management.
(scm_init_finalizer_thread): Remove pipe management.
---
libguile/finalizers.c | 45 +++++++++++++++++++++++++++----------------
1 file changed, 28 insertions(+), 17 deletions(-)

Toggle diff (118 lines)
diff --git a/libguile/finalizers.c b/libguile/finalizers.c
index 0ae165fd1..5122e5fe3 100644
--- a/libguile/finalizers.c
+++ b/libguile/finalizers.c
@@ -24,6 +24,7 @@
# include <config.h>
#endif
+#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <full-write.h>
@@ -170,7 +171,7 @@ queue_finalizer_async (void)
#if SCM_USE_PTHREAD_THREADS
-static int finalization_pipe[2];
+static int finalization_pipe[2] = { -1, -1 };
static scm_i_pthread_mutex_t finalization_thread_lock =
SCM_I_PTHREAD_MUTEX_INITIALIZER;
static pthread_t finalization_thread;
@@ -190,6 +191,15 @@ notify_about_to_fork (void)
full_write (finalization_pipe[1], &byte, 1);
}
+static void
+reset_finalization_pipe (void)
+{
+ close (finalization_pipe[0]);
+ close (finalization_pipe[1]);
+ finalization_pipe[0] = -1;
+ finalization_pipe[1] = -1;
+}
+
struct finalization_pipe_data
{
char byte;
@@ -254,15 +264,25 @@ start_finalization_thread (void)
scm_i_pthread_mutex_lock (&finalization_thread_lock);
if (!finalization_thread_is_running)
{
+ assert (finalization_pipe[0] == -1);
+
/* Use the raw pthread API and scm_with_guile, because we don't want
to block on any lock that scm_spawn_thread might want to take,
and we don't want to inherit the dynamic state (fluids) of the
caller. */
- if (pthread_create (&finalization_thread, NULL,
- run_finalization_thread, NULL))
- perror ("error creating finalization thread");
+ if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)
+ perror ("error creating finalization pipe");
+ else if (pthread_create (&finalization_thread, NULL,
+ run_finalization_thread, NULL))
+ {
+ reset_finalization_pipe ();
+ perror ("error creating finalization thread");
+ }
else
- finalization_thread_is_running = 1;
+ {
+ GC_set_finalizer_notifier (notify_finalizers_to_run);
+ finalization_thread_is_running = 1;
+ }
}
scm_i_pthread_mutex_unlock (&finalization_thread_lock);
}
@@ -276,6 +296,8 @@ stop_finalization_thread (void)
notify_about_to_fork ();
if (pthread_join (finalization_thread, NULL))
perror ("joining finalization thread");
+
+ reset_finalization_pipe ();
finalization_thread_is_running = 0;
}
scm_i_pthread_mutex_unlock (&finalization_thread_lock);
@@ -284,7 +306,6 @@ stop_finalization_thread (void)
static void
spawn_finalizer_thread (void)
{
- GC_set_finalizer_notifier (notify_finalizers_to_run);
start_finalization_thread ();
}
@@ -368,8 +389,6 @@ scm_set_automatic_finalization_enabled (int enabled_p)
if (enabled_p)
{
#if SCM_USE_PTHREAD_THREADS
- if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)
- scm_syserror (NULL);
GC_set_finalizer_notifier (spawn_finalizer_thread);
#else
GC_set_finalizer_notifier (queue_finalizer_async);
@@ -381,10 +400,6 @@ scm_set_automatic_finalization_enabled (int enabled_p)
#if SCM_USE_PTHREAD_THREADS
stop_finalization_thread ();
- close (finalization_pipe[0]);
- close (finalization_pipe[1]);
- finalization_pipe[0] = -1;
- finalization_pipe[1] = -1;
#endif
}
@@ -423,10 +438,6 @@ scm_init_finalizer_thread (void)
{
#if SCM_USE_PTHREAD_THREADS
if (automatic_finalization_p)
- {
- if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)
- scm_syserror (NULL);
- GC_set_finalizer_notifier (spawn_finalizer_thread);
- }
+ GC_set_finalizer_notifier (spawn_finalizer_thread);
#endif
}
--
2.31.1
L
L
Ludovic Courtès wrote on 8 May 2021 22:50
Re: bug#41948: Shepherd deadlocks
(name . Andrew Whatson)(address . whatson@gmail.com)(address . 41948@debbugs.gnu.org)
871rah7xw6.fsf_-_@gnu.org
Hi Andrew,

Andrew Whatson <whatson@gmail.com> skribis:

Toggle quote (10 lines)
> * libguile/finalizers.c (finalization_pipe): Initialize.
> (reset_finalization_pipe): Factored out.
> (start_finalization_thread): Create the pipe immediately before
> launching the thread. Ensure the pipe is cleaned up if thread creation
> fails. Update the finalizer callback if thread creation succeeds.
> (stop_finalization_thread): Clean up the pipe after stopping the thread.
> (spawn_finalizer_thread): Remove finalizer callback logic.
> (scm_set_automatic_finalization_enabled): Remove pipe management.
> (scm_init_finalizer_thread): Remove pipe management.

I tweaked the commit log and pushed as
5a281e35f4a5ae78fbcf10591d9358bec8f0bee0.

Thanks!

Ludo’.
L
L
Ludovic Courtès wrote on 8 May 2021 22:52
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 41948-done@debbugs.gnu.org)
87wns96j93.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (8 lines)
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> While working on a fix for this issue (finalizer pipe shared between
>> parent and child process), I found the ‘sleep_pipe’ of the main thread
>> is also shared between the parent and its child.
>
> Here’s a patch that fixes the problem as exposed by the reproducer.

I went ahead and pushed as 381291f5ff4480afbb197bf5e5a2272cfe54a386,
together with a test derived from the one I posted earlier.

With this and Andrew’s finalizer pipe patch, we should be good!

This will be in the upcoming Guile 3.0.7 release.

Ludo’.
Closed
?