shepherd 0.8.0 race condition can lead to stopping itself

  • Done
  • quality assurance status badge
Details
4 participants
  • Ludovic Courtès
  • Mathieu Othacehe
  • Mathieu Othacehe
  • Mathieu Othacehe
Owner
unassigned
Submitted by
Mathieu Othacehe
Severity
important
Merged with
M
M
Mathieu Othacehe wrote on 30 Apr 2020 13:51
Graphical installer tests sometimes hang.
(address . bug-guix@gnu.org)
87o8r9w5re.fsf@gmail.com
Hello,

Graphical installer tests sometimes hang, before starting marionette
tests. See for instance:

Restarting tests just after a hang (on the same installed image),
sometimes work. I removed the "quiet" kernel argument to see what's
going on.

Toggle snippet (22 lines)
[ 0.862608] Freeing unused kernel image memory: 1964K
[ 0.863177] Run /init as init process
GC Warning: pthread_getattr_np or pthread_attr_getstack failed for main thread
GC Warning: Couldn't read /proc/stat
Welcome, this is GNU's early boot Guile.
Use '--repl' for an initrd REPL.

loading kernel modules...
[ 0.915640] usbcore: registered new interface driver usb-storage
[ 0.917800] usbcore: registered new interface driver uas
[ 0.919569] hidraw: raw HID events driver (C) Jiri Kosina
[ 0.920519] usbcore: registered new interface driver usbhid
[ 0.921177] usbhid: USB HID core driver
[ 0.933506] isci: Intel(R) C600 SAS Controller Driver - version 1.2.0
[ 0.951303] PCI Interrupt Link [LNKD] enabled at IRQ 11
[ 0.970144] PCI Interrupt Link [LNKA] enabled at IRQ 10
[ 0.974033] virtio_blk virtio1: [vda] 4505600 512-byte logical blocks (2.31 GB/2.15 GiB)
[ 0.976186] vda: vda1 vda2

;; hanging here.

It seems that the boot freezes, soon after the initrd is started, and
before loading the boot script.

Mathieu
M
M
Mathieu Othacehe wrote on 4 May 2020 13:42
(address . bug-guix@gnu.org)
87a72ox6ws.fsf@gmail.com
Hello,

Toggle quote (3 lines)
> It seems that the boot freezes, soon after the initrd is started, and
> before loading the boot script.

I made some progress on this one. Turns out, the system was started and
the tests are in progress when it hangs.

The problem occurs during the reset of the HTTP proxy ("guix-daemon
set-http-proxy action, clear" test). It seems that clearing the proxy
kills Shepherd.

Here's a screenshot of the console when it happens. It can also be
reproduced by spawning a VM and running this script:

Toggle snippet (11 lines)
(use-modules (gnu services herd))
(with-shepherd-action 'guix-daemon
('set-http-proxy "http://localhost:8118")
result
result)
(with-shepherd-action 'guix-daemon
('set-http-proxy)
result
result)

I'll keep looking!

Thanks,

Mathieu
Attachment: crash.ppm
M
M
Mathieu Othacehe wrote on 4 May 2020 14:50
(address . 40981@debbugs.gnu.org)
87zhanx3rw.fsf@gmail.com
Toggle quote (2 lines)
> I'll keep looking!

Ok, getting closer. Here's a suspect part of Shepherd strace log:

Toggle snippet (6 lines)
[pid 1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
[pid 1] write(9, "shepherd[1]: changing HTTP/HTTPS"..., 86) = 86
[pid 1] getpgid(194) = 194
[pid 1] kill(-194, SIGTERM) = 0

I think the problem is introduced by commit
1e7a91d21f1cc5d02697680e19e3878ff8565710 in Shepherd.

"(getpgid <guix-daemon-pid>") returns 0, and calling "(kill 0 SIGTERM)"
kills all processes.

Now, I really don't get how guix-daemon pgid could be zero. Man page of
setpgid(2) says:

Toggle snippet (4 lines)
A child created via fork(2) inherits its parent's process group ID. The PGID is
preserved across an execve(2).

WDYT?

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 5 May 2020 12:00
Re: bug#40981: Graphical installer tests sometimes hang.
(name . Mathieu Othacehe)(address . m.othacehe@gmail.com)(address . 40981@debbugs.gnu.org)
87a72mn1l2.fsf@gnu.org
Hi!

Mathieu Othacehe <m.othacehe@gmail.com> skribis:

Toggle quote (13 lines)
>> I'll keep looking!
>
> Ok, getting closer. Here's a suspect part of Shepherd strace log:
>
> [pid 1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
> [pid 1] write(9, "shepherd[1]: changing HTTP/HTTPS"..., 86) = 86
> [pid 1] getpgid(194) = 194
> [pid 1] kill(-194, SIGTERM) = 0
>
>
> I think the problem is introduced by commit
> 1e7a91d21f1cc5d02697680e19e3878ff8565710 in Shepherd.

OK, but the trace above is “as expected”, isn’t it?

Toggle quote (3 lines)
> "(getpgid <guix-daemon-pid>") returns 0, and calling "(kill 0 SIGTERM)"
> kills all processes.

What made you think of this scenario?

I don’t think getpgid(2) can return 0. Or am I missing something?
Since guix-dameon doesn’t actually daemonize, getpgid(pid) = pid.

Running this (in a VM) works fine:

while herd set-http-proxy guix-daemon foo ; do : ; done

Thanks for debugging!

Ludo’.
M
M
Mathieu Othacehe wrote on 5 May 2020 12:22
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 40981@debbugs.gnu.org)
875zdael5s.fsf@gmail.com
Hey!

Toggle quote (2 lines)
> OK, but the trace above is “as expected”, isn’t it?

Oh sorry, I pasted the wrong part! Here's the part where getpgid returns
zero and the full log in attachment:

Toggle snippet (5 lines)
[pid 1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
[pid 1] write(9, "shepherd[1]: clearing HTTP/HTTPS"..., 59) = 59
[pid 1] getpgid(266) = 0

Thanks,

Mathieu
Attachment: strace.log
M
M
Mathieu Othacehe wrote on 6 May 2020 12:02
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 40981@debbugs.gnu.org)
87v9l9bcu0.fsf@gmail.com
Hello,

I found what happened here. Turns out, when a process is forked and
before "setsid" or "setpgid" is called, it shares the parent PGID. In
that case the parent is Shepherd, with the PGID 0.

When doing the following actions:

* stop guix-daemon
* start guix-daemon

* stop guix-daemon
* start guix-daemon

If the second stop occurs after "fork" has been done, but before
"setsid", then "(getpgid)" returns 0. The naive patch attached could fix
the situation.

WDYT?

Mathieu
From 0e4167251a56d6baa4f51fe72250a6e3bffae8c3 Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <m.othacehe@gmail.com>
Date: Wed, 6 May 2020 11:48:26 +0200
Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.

When a process is forked, and before its GID is changed in "exec-command",
it will share the parent GID, which is 0 for Shepherd. In that case, use
the PID instead of the PGID.

* modules/shepherd/service.scm (make-kill-destructor): Handle the case when
PGID is zero, between the process fork and exec.
---
modules/shepherd/service.scm | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

Toggle diff (35 lines)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 8604d2f..258992c 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -5,6 +5,7 @@
;; Copyright (C) 2016 Alex Kost <alezost@gmail.com>
;; Copyright (C) 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
;; Copyright (C) 2019 Ricardo Wurmus <rekado@elephly.net>
+;; Copyright (C) 2020 Mathieu Othacehe <m.othacehe@gmail.com>
;;
;; This file is part of the GNU Shepherd.
;;
@@ -957,11 +958,15 @@ start."
"Return a procedure that sends SIGNAL to the process group of the PID given
as argument, where SIGNAL defaults to `SIGTERM'."
(lambda (pid . args)
- ;; Kill the whole process group PID belongs to. Don't assume that PID
- ;; is a process group ID: that's not the case when using #:pid-file,
- ;; where the process group ID is the PID of the process that
- ;; "daemonized".
- (kill (- (getpgid pid)) signal)
+ ;; Kill the whole process group PID belongs to. Don't assume that PID is
+ ;; a process group ID: that's not the case when using #:pid-file, where
+ ;; the process group ID is the PID of the process that "daemonized". If
+ ;; this procedure is called, between the process fork and exec, the PGID
+ ;; will still be zero (the Shepherd PGID). In that case, use the PID.
+ (let ((pgid (getpgid pid)))
+ (if pgid
+ (kill (- pgid) signal)
+ (kill pid signal)))
#f))
;; Produce a constructor that executes a command.
--
2.26.0
L
L
Ludovic Courtès wrote on 7 May 2020 12:51
control message for bug #40981
(address . control@debbugs.gnu.org)
878si4dnlh.fsf@gnu.org
severity 40981 important
quit
M
M
Mathieu Othacehe wrote on 7 May 2020 18:48
Re: bug#40981: Graphical installer tests sometimes hang.
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 40981@debbugs.gnu.org)
87mu6jelmx.fsf@gmail.com
Hello,

The previous patch was not working. The reason is that, when a process
is forked and before execv is called, it shares the parent signal
handler.

So when sending a SIGTERM to the child process, we are stopping
root-service, if the signal is received before the child has forked.

The work-around here is to save the installed SIGTERM handler and reset
it. Then, after forking, the parent can restore the SIGTERM handler. The
child will use the default SIGTERM handler that terminates the process.

WDYT?

Thanks,

Mathieu
From aa6f67068f1fdd622673ec0223f05fd8f8a96baa Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <m.othacehe@gmail.com>
Date: Thu, 7 May 2020 18:39:41 +0200
Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.

When a process is forked, and before its GID is changed in "exec-command",
it will share the parent GID, which is 0 for Shepherd. In that case, use
the PID instead of the PGID.

Also make sure to reset the SIGTERM handler before forking a process. Failing
to do so, will result in stopping Shepherd if a SIGTERM is received between
fork and execv calls. Restore the SIGTERM handler once the process has been
forked.

* modules/shepherd/service.scm (fork+exec-command): Save the current SIGTERM
handler and reset it before forking. Then, restore it on the parent after
forking.
(make-kill-destructor): Handle the case when PGID is zero, between the process
fork and exec.
---
modules/shepherd/service.scm | 36 +++++++++++++++++++++++++++++-------
tests/forking-service.sh | 18 ++++++++++++++++++
2 files changed, 47 insertions(+), 7 deletions(-)

Toggle diff (103 lines)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 8604d2f..c68d7e0 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -5,6 +5,7 @@
;; Copyright (C) 2016 Alex Kost <alezost@gmail.com>
;; Copyright (C) 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
;; Copyright (C) 2019 Ricardo Wurmus <rekado@elephly.net>
+;; Copyright (C) 2020 Mathieu Othacehe <m.othacehe@gmail.com>
;;
;; This file is part of the GNU Shepherd.
;;
@@ -884,7 +885,18 @@ its PID."
(unless %sigchld-handler-installed?
(sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
(set! %sigchld-handler-installed? #t))
- (let ((pid (primitive-fork)))
+
+ ;; When forking a process, the signal handlers are inherited, until it
+ ;; forks. If SIGTERM is received by the forked process, before it calls
+ ;; execv, the installed SIGTERM handler, stopping Shepherd will be called.
+ ;; To avoid this, save the SIGTERM handler, disable it, and restore it once,
+ ;; the process has been forked. This way, the forked process will use the
+ ;; default SIGTERM handler stopping the process.
+ (let ((term-handler (match (sigaction SIGTERM)
+ ((proc . _)
+ proc)))
+ (pid (and (sigaction SIGTERM SIG_DFL)
+ (primitive-fork))))
(if (zero? pid)
(exec-command command
#:user user
@@ -893,7 +905,10 @@ its PID."
#:directory directory
#:file-creation-mask file-creation-mask
#:environment-variables environment-variables)
- pid)))
+ (begin
+ ;; Restore the initial SIGTERM handler.
+ (sigaction SIGTERM term-handler)
+ pid))))
(define* (make-forkexec-constructor command
#:key
@@ -957,11 +972,18 @@ start."
"Return a procedure that sends SIGNAL to the process group of the PID given
as argument, where SIGNAL defaults to `SIGTERM'."
(lambda (pid . args)
- ;; Kill the whole process group PID belongs to. Don't assume that PID
- ;; is a process group ID: that's not the case when using #:pid-file,
- ;; where the process group ID is the PID of the process that
- ;; "daemonized".
- (kill (- (getpgid pid)) signal)
+ ;; Kill the whole process group PID belongs to. Don't assume that PID is
+ ;; a process group ID: that's not the case when using #:pid-file, where
+ ;; the process group ID is the PID of the process that "daemonized". If
+ ;; this procedure is called, between the process fork and exec, the PGID
+ ;; will still be zero (the Shepherd PGID). In that case, use the PID.
+ (let ((current-pgid (getpgid 0))
+ (pgid (getpgid pid)))
+ (if (eq? pgid current-pgid)
+ (begin
+ (kill pid signal))
+ (begin
+ (kill (- pgid) signal))))
#f))
;; Produce a constructor that executes a command.
diff --git a/tests/forking-service.sh b/tests/forking-service.sh
index 7126c3d..47b09a2 100644
--- a/tests/forking-service.sh
+++ b/tests/forking-service.sh
@@ -71,6 +71,17 @@ cat > "$conf"<<EOF
#:pid-file "$PWD/$service2_pid")
#:stop (make-kill-destructor)
#:respawn? #t))
+
+(define %command3
+ '("$SHELL" "-c" "sleep 600"))
+
+(register-services
+ (make <service>
+ ;; A service that forks into a different process.
+ #:provides '(test3)
+ #:start (make-forkexec-constructor %command3)
+ #:stop (make-kill-destructor)
+ #:respawn? #t))
EOF
cat $conf
@@ -113,3 +124,10 @@ sleep 1;
$herd status test2 | grep started
test "`cat $PWD/$service2_started`" = "started
started"
+
+# Try to trigger eventual race conditions, when killing a process between fork
+# and execv calls.
+for i in {1..50}
+do
+ $herd restart test3
+done
--
2.26.0
L
L
Ludovic Courtès wrote on 10 May 2020 12:32
(name . Mathieu Othacehe)(address . m.othacehe@gmail.com)(address . 40981@debbugs.gnu.org)
87tv0o9j33.fsf@gnu.org
Hi!

Mathieu Othacehe <m.othacehe@gmail.com> skribis:

Toggle quote (7 lines)
> The previous patch was not working. The reason is that, when a process
> is forked and before execv is called, it shares the parent signal
> handler.
>
> So when sending a SIGTERM to the child process, we are stopping
> root-service, if the signal is received before the child has forked.

Woow, good catch!

Toggle quote (4 lines)
> The work-around here is to save the installed SIGTERM handler and reset
> it. Then, after forking, the parent can restore the SIGTERM handler. The
> child will use the default SIGTERM handler that terminates the process.

OK, makes sense. (Another occasion to see how something like
‘posix_spawn’ would be more appropriate than fork + exec…)

Toggle quote (20 lines)
> From aa6f67068f1fdd622673ec0223f05fd8f8a96baa Mon Sep 17 00:00:00 2001
> From: Mathieu Othacehe <m.othacehe@gmail.com>
> Date: Thu, 7 May 2020 18:39:41 +0200
> Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.
>
> When a process is forked, and before its GID is changed in "exec-command",
> it will share the parent GID, which is 0 for Shepherd. In that case, use
> the PID instead of the PGID.
>
> Also make sure to reset the SIGTERM handler before forking a process. Failing
> to do so, will result in stopping Shepherd if a SIGTERM is received between
> fork and execv calls. Restore the SIGTERM handler once the process has been
> forked.
>
> * modules/shepherd/service.scm (fork+exec-command): Save the current SIGTERM
> handler and reset it before forking. Then, restore it on the parent after
> forking.
> (make-kill-destructor): Handle the case when PGID is zero, between the process
> fork and exec.

[...]

Toggle quote (13 lines)
> + ;; Kill the whole process group PID belongs to. Don't assume that PID is
> + ;; a process group ID: that's not the case when using #:pid-file, where
> + ;; the process group ID is the PID of the process that "daemonized". If
> + ;; this procedure is called, between the process fork and exec, the PGID
> + ;; will still be zero (the Shepherd PGID). In that case, use the PID.
> + (let ((current-pgid (getpgid 0))
> + (pgid (getpgid pid)))
> + (if (eq? pgid current-pgid)
> + (begin
> + (kill pid signal))
> + (begin
> + (kill (- pgid) signal))))

Shouldn’t it be:

(let ((pgid (getpgid pid)))
(if (= (getpgid 0) pgid)
(kill pid signal) ;don't kill ourself
(kill (-p pgid) signal)))

?

Note: Use the most specific comparison procedure, ‘=’ in this case,
because we know we’re dealing with numbers (it enables proper type
checking, better compiler optimizations, etc.). More importantly, ‘eq?’
performs pointer comparison, so it shouldn’t be used with numbers (in
practice it works with “fixnums” but not with bignums).

Toggle quote (7 lines)
> +# Try to trigger eventual race conditions, when killing a process between fork
> +# and execv calls.
> +for i in {1..50}
> +do
> + $herd restart test3
> +done

Would it reproduce the problem well enough?

Use `seq 1 50` to avoid relying on a Bash-specific construct (which I
think it is, right?).

Could you send an updated patch?

Thanks for the bug hunting and for the patch!

Ludo’.
L
L
Ludovic Courtès wrote on 10 May 2020 12:36
control message for bug #40981
(address . control@debbugs.gnu.org)
87r1vs9iw5.fsf@gnu.org
retitle 40981 shepherd 0.8.0 race condition can lead to stopping itself
quit
M
M
Mathieu Othacehe wrote on 10 May 2020 13:19
Re: bug#40981: Graphical installer tests sometimes hang.
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 40981@debbugs.gnu.org)
87eersm409.fsf@gnu.org
Hey Ludo,

Toggle quote (2 lines)
> Woow, good catch!

Thanks :)

Toggle quote (7 lines)
>> The work-around here is to save the installed SIGTERM handler and reset
>> it. Then, after forking, the parent can restore the SIGTERM handler. The
>> child will use the default SIGTERM handler that terminates the process.
>
> OK, makes sense. (Another occasion to see how something like
> ‘posix_spawn’ would be more appropriate than fork + exec…)

Didn't know about that function, but it seems way easier to manipulate
and less error prone indeed!

Toggle quote (7 lines)
> Shouldn’t it be:
>
> (let ((pgid (getpgid pid)))
> (if (= (getpgid 0) pgid)
> (kill pid signal) ;don't kill ourself
> (kill (-p pgid) signal)))

Yes, I changed it.

Toggle quote (6 lines)
> Note: Use the most specific comparison procedure, ‘=’ in this case,
> because we know we’re dealing with numbers (it enables proper type
> checking, better compiler optimizations, etc.). More importantly, ‘eq?’
> performs pointer comparison, so it shouldn’t be used with numbers (in
> practice it works with “fixnums” but not with bignums).

Ok, noted!

Toggle quote (9 lines)
>> +# Try to trigger eventual race conditions, when killing a process between fork
>> +# and execv calls.
>> +for i in {1..50}
>> +do
>> + $herd restart test3
>> +done
>
> Would it reproduce the problem well enough?

On a slow machine one time out of two, and on a faster machine,
never. The 'reproducer' I used, was to add a 'sleep' between fork and
exec, it works way better!

Tell me if you think its better to drop it.

Toggle quote (2 lines)
> Could you send an updated patch?

Here it is!

Toggle quote (2 lines)
> Thanks for the bug hunting and for the patch!

Thanks for the fast review :)

Mathieu
From 79d3603bf15b8f815136178be8c8a236734a7596 Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <m.othacehe@gmail.com>
Date: Thu, 7 May 2020 18:39:41 +0200
Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.

When a process is forked, and before its GID is changed in "exec-command",
it will share the parent GID, which is 0 for Shepherd. In that case, use
the PID instead of the PGID.

Also make sure to reset the SIGTERM handler before forking a process. Failing
to do so, will result in stopping Shepherd if a SIGTERM is received between
fork and execv calls. Restore the SIGTERM handler once the process has been
forked.

* modules/shepherd/service.scm (fork+exec-command): Save the current SIGTERM
handler and reset it before forking. Then, restore it on the parent after
forking.
(make-kill-destructor): Handle the case when PGID is zero, between the process
fork and exec.
---
modules/shepherd/service.scm | 33 ++++++++++++++++++++++++++-------
tests/forking-service.sh | 18 ++++++++++++++++++
2 files changed, 44 insertions(+), 7 deletions(-)

Toggle diff (100 lines)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 8604d2f..45fcf32 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -5,6 +5,7 @@
;; Copyright (C) 2016 Alex Kost <alezost@gmail.com>
;; Copyright (C) 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
;; Copyright (C) 2019 Ricardo Wurmus <rekado@elephly.net>
+;; Copyright (C) 2020 Mathieu Othacehe <m.othacehe@gmail.com>
;;
;; This file is part of the GNU Shepherd.
;;
@@ -884,7 +885,18 @@ its PID."
(unless %sigchld-handler-installed?
(sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
(set! %sigchld-handler-installed? #t))
- (let ((pid (primitive-fork)))
+
+ ;; When forking a process, the signal handlers are inherited, until it
+ ;; forks. If SIGTERM is received by the forked process, before it calls
+ ;; execv, the installed SIGTERM handler, stopping Shepherd will be called.
+ ;; To avoid this, save the SIGTERM handler, disable it, and restore it once,
+ ;; the process has been forked. This way, the forked process will use the
+ ;; default SIGTERM handler stopping the process.
+ (let ((term-handler (match (sigaction SIGTERM)
+ ((proc . _)
+ proc)))
+ (pid (and (sigaction SIGTERM SIG_DFL)
+ (primitive-fork))))
(if (zero? pid)
(exec-command command
#:user user
@@ -893,7 +905,10 @@ its PID."
#:directory directory
#:file-creation-mask file-creation-mask
#:environment-variables environment-variables)
- pid)))
+ (begin
+ ;; Restore the initial SIGTERM handler.
+ (sigaction SIGTERM term-handler)
+ pid))))
(define* (make-forkexec-constructor command
#:key
@@ -957,11 +972,15 @@ start."
"Return a procedure that sends SIGNAL to the process group of the PID given
as argument, where SIGNAL defaults to `SIGTERM'."
(lambda (pid . args)
- ;; Kill the whole process group PID belongs to. Don't assume that PID
- ;; is a process group ID: that's not the case when using #:pid-file,
- ;; where the process group ID is the PID of the process that
- ;; "daemonized".
- (kill (- (getpgid pid)) signal)
+ ;; Kill the whole process group PID belongs to. Don't assume that PID is
+ ;; a process group ID: that's not the case when using #:pid-file, where
+ ;; the process group ID is the PID of the process that "daemonized". If
+ ;; this procedure is called, between the process fork and exec, the PGID
+ ;; will still be zero (the Shepherd PGID). In that case, use the PID.
+ (let ((pgid (getpgid pid)))
+ (if (= (getpgid 0) pgid)
+ (kill pid signal) ;don't kill ourself
+ (kill (- pgid) signal)))
#f))
;; Produce a constructor that executes a command.
diff --git a/tests/forking-service.sh b/tests/forking-service.sh
index 7126c3d..bd9aac9 100644
--- a/tests/forking-service.sh
+++ b/tests/forking-service.sh
@@ -71,6 +71,17 @@ cat > "$conf"<<EOF
#:pid-file "$PWD/$service2_pid")
#:stop (make-kill-destructor)
#:respawn? #t))
+
+(define %command3
+ '("$SHELL" "-c" "sleep 600"))
+
+(register-services
+ (make <service>
+ ;; A service that forks into a different process.
+ #:provides '(test3)
+ #:start (make-forkexec-constructor %command3)
+ #:stop (make-kill-destructor)
+ #:respawn? #t))
EOF
cat $conf
@@ -113,3 +124,10 @@ sleep 1;
$herd status test2 | grep started
test "`cat $PWD/$service2_started`" = "started
started"
+
+# Try to trigger eventual race conditions, when killing a process between fork
+# and execv calls.
+for i in `seq 1 50`
+do
+ $herd restart test3
+done
--
2.26.2
L
L
Ludovic Courtès wrote on 11 May 2020 23:09
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 40981-done@debbugs.gnu.org)
87o8qudvri.fsf@gnu.org
Hello,

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (10 lines)
>>> The work-around here is to save the installed SIGTERM handler and reset
>>> it. Then, after forking, the parent can restore the SIGTERM handler. The
>>> child will use the default SIGTERM handler that terminates the process.
>>
>> OK, makes sense. (Another occasion to see how something like
>> ‘posix_spawn’ would be more appropriate than fork + exec…)
>
> Didn't know about that function, but it seems way easier to manipulate
> and less error prone indeed!

Make sure to read “A fork() in the Road” on that topic:


Toggle quote (15 lines)
>>> +# Try to trigger eventual race conditions, when killing a process between fork
>>> +# and execv calls.
>>> +for i in {1..50}
>>> +do
>>> + $herd restart test3
>>> +done
>>
>> Would it reproduce the problem well enough?
>
> On a slow machine one time out of two, and on a faster machine,
> never. The 'reproducer' I used, was to add a 'sleep' between fork and
> exec, it works way better!
>
> Tell me if you think its better to drop it.

It’s better than nothing, let’s keep it.

Toggle quote (20 lines)
>>From 79d3603bf15b8f815136178be8c8a236734a7596 Mon Sep 17 00:00:00 2001
> From: Mathieu Othacehe <m.othacehe@gmail.com>
> Date: Thu, 7 May 2020 18:39:41 +0200
> Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.
>
> When a process is forked, and before its GID is changed in "exec-command",
> it will share the parent GID, which is 0 for Shepherd. In that case, use
> the PID instead of the PGID.
>
> Also make sure to reset the SIGTERM handler before forking a process. Failing
> to do so, will result in stopping Shepherd if a SIGTERM is received between
> fork and execv calls. Restore the SIGTERM handler once the process has been
> forked.
>
> * modules/shepherd/service.scm (fork+exec-command): Save the current SIGTERM
> handler and reset it before forking. Then, restore it on the parent after
> forking.
> (make-kill-destructor): Handle the case when PGID is zero, between the process
> fork and exec.

I added a “Fixes” line and pushed it.

Thanks a lot!

I can roll a 0.8.1 release soonish (I’d like to add signalfd support
while at it, we’ll see.)

Ludo’.
Closed
L
L
Ludovic Courtès wrote on 22 May 2020 22:15
control message for bug #40981
(address . control@debbugs.gnu.org)
87wo539159.fsf@gnu.org
merge 40981 41429
quit
M
M
Mathieu Othacehe wrote on 20 Jun 2020 11:54
(address . control@debbugs.gnu.org)
877dw25age.fsf@meru.i-did-not-set--mail-host-address--so-tickle-me
close 40981
quit
?