[Shepherd] Handling process termination before service is running

  • Done
  • quality assurance status badge
Details
One participant
  • Ludovic Courtès
Owner
unassigned
Submitted by
Ludovic Courtès
Severity
normal

Debbugs page

Ludovic Courtès wrote 7 days ago
(address . bug-guix@gnu.org)
87ikolx4zf.fsf@inria.fr
While on a quest for flaky tests in the Shepherd, I found a genuine bug
that would manifest with this ‘tests/basic.sh’ failure:

Toggle snippet (16 lines)
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ sleep 0.5
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ grep 'exited with code 127'
+ sleep 0.5
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ grep 'exited with code 127'
[…]
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory started.
2025-03-06 14:06:36 Failed to run "/gnu/store/3bg5qfsmjw6p7bh1xadarbaq246zis0d-coreutils-9.1/bin/pwd": In procedure chdir: No such file or directory
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory running with value #<<process> id: 22431 command: ("/gnu/store/3bg5qfsmjw6p7bh1xadarbaq246zis0d-coreutils-9.1/bin/pwd")>.
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory has been started.
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory has been disabled.
2025-03-06 14:11:51 Stopping service root...

What happens is that the service is not marked as “exited with code
127”; instead, it is marked as having exited with code 0:

Toggle snippet (8 lines)
● Status of test-run-from-nonexistent-directory:
It is stopped since 14:06:36 (37 seconds ago).
Process exited successfully.
It is disabled.
Provides: test-run-from-nonexistent-directory
Will not be respawned.

This is due to a race condition: the process terminates before its
service goes from ‘starting’ to ‘running’.

By the time the service controller calls ‘monitor-service-process’, the
process has already terminated, so the process monitor replies 0 to the
'await request because that process no longer exists.

Attached is a test that reproduces the problem.

Ludo’.
# GNU Shepherd --- Handling termination of a process before 'start' completes.
# Copyright © 2025 Ludovic Courtès <ludo@gnu.org>
#
# This file is part of the GNU Shepherd.
#
# The GNU Shepherd is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or (at
# your option) any later version.
#
# The GNU Shepherd is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with the GNU Shepherd. If not, see http://www.gnu.org/licenses/.

shepherd --version
herd --version

socket="t-socket-$$"
conf="t-conf-$$"
log="t-log-$$"
pid="t-pid-$$"

herd="herd -s $socket"

trap "cat $log || true; rm -f $socket $conf $log;
test -f $pid && kill \`cat $pid\` || true; rm -f $pid" EXIT

cat > "$conf" <<EOF
(register-services
(list (service
'(stops-early)
#:start (lambda ()
(let ((pid (fork+exec-command
'("$SHELL" "-c" "echo done; exit 42"))))
(format #t "got PID ~a; sleeping~%" pid)

;; Artificially wait until PID is gone for sure.
(let loop ()
(when (false-if-exception (begin (kill pid 0) #t))
(sleep 0.5)
(loop)))
pid))
#:stop (make-kill-destructor)
#:respawn? #f)))
EOF

rm -f "$pid"
shepherd -I -s "$socket" -c "$conf" --pid="$pid" --log="$log" &

# Wait till it's ready.
until test -f "$pid"; do sleep 0.3; done

$herd status
$herd start stops-early
$herd status stops-early
$herd status stops-early | grep stopped
$herd status stops-early | grep 'exited with code 42'
Ludovic Courtès wrote 4 days ago
(address . 76790@debbugs.gnu.org)
87ldteqiye.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (7 lines)
> This is due to a race condition: the process terminates before its
> service goes from ‘starting’ to ‘running’.
>
> By the time the service controller calls ‘monitor-service-process’, the
> process has already terminated, so the process monitor replies 0 to the
> 'await request because that process no longer exists.

Fixed in 88cc5a43bf04c13b00b15a8d93cb635d9b64713c by introducing an
atomic fork + monitor primitive.

For the record, I also considered a hacky but less intrusive solution:
adding support for “zombies” in the process monitor, whereby it would
record the status of terminated processes and give that (instead of 0)
when somebody awaits them (patch attached).

It’s fragile though because unlike real zombies, we cannot guarantee
that the PID won’t be reused in the meantime; it’s just unlikely. (PID
FDs would help, but that’s not portable so best if we can avoid it.)

Ludo’.
Toggle diff (97 lines)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 221998b..fec773c 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -2345,25 +2345,35 @@ may never terminate, even after sending it SIGKILL---e.g., kthreadd on Linux."
(const #f)))
(const #f)))
+(define %max-zombie-processes
+ ;; Number of processes recorded as "zombies" by the process
+ ;; monitor--processes that had no waiters when they terminated.
+ 20)
+
(define (process-monitor channel)
"Run a process monitor that handles requests received over @var{channel}."
- (let loop ((waiters vlist-null))
+ (let loop ((waiters vlist-null)
+ (zombies (ring-buffer %max-zombie-processes)))
(match (get-message channel)
(('handle-process-termination pid status)
- ;; Notify any waiters.
- (vhash-foldv* (lambda (waiter _)
- (put-message waiter status)
- #t)
- #t pid waiters)
-
- ;; XXX: The call below is linear in the size of WAITERS, but WAITERS is
- ;; usually empty or small.
- (loop (vhash-fold (lambda (key value result)
- (if (= key pid)
- result
- (vhash-consv key value result)))
- vlist-null
- waiters)))
+ ;; Notify any waiters. When there are none, record PID and STATUS in
+ ;; the ZOMBIES buffer in case somebody asks for it later--
+ (let ((zombie? (vhash-foldv* (lambda (waiter _)
+ (put-message waiter status)
+ #f)
+ #t pid waiters)))
+
+ ;; XXX: The call below is linear in the size of WAITERS, but WAITERS
+ ;; is usually empty or small.
+ (loop (vhash-fold (lambda (key value result)
+ (if (= key pid)
+ result
+ (vhash-consv key value result)))
+ vlist-null
+ waiters)
+ (if zombie?
+ (ring-buffer-insert (cons pid status) zombies)
+ zombies))))
(('spawn arguments service reply)
;; Spawn the command as specified by ARGUMENTS; send the spawn result
@@ -2378,22 +2388,31 @@ may never terminate, even after sending it SIGKILL---e.g., kthreadd on Linux."
(put-message reply result)
(match result
(('exception . _)
- (loop waiters))
+ (loop waiters zombies))
(('success (pid))
- (loop (vhash-consv pid reply waiters))))))
+ (loop (vhash-consv pid reply waiters) zombies)))))
(('await pid reply)
;; Await the termination of PID and send its status on REPLY.
(if (and (catch-system-error (kill pid 0))
(not (pseudo-process? pid)))
- (loop (vhash-consv pid reply waiters))
- (begin ;PID is gone or a pseudo-process
- ;; This might be a race condition (the caller thinks PID is up
- ;; when it's already dead) so log it.
- (local-output (l10n "Awaiting PID ~a, which is already gone.")
- pid)
- (put-message reply 0)
- (loop waiters)))))))
+ (loop (vhash-consv pid reply waiters) zombies)
+ ;; PID is gone or a pseudo-process; in the former case, check if we
+ ;; have info about it in ZOMBIES (this is linear in the length of
+ ;; ZOMBIES, which is small.)
+ (let* ((zombies* (ring-buffer->list zombies))
+ (status (assoc-ref zombies* pid)))
+ (unless status
+ ;; This might be a race condition (the caller thinks PID is up
+ ;; when it's already dead) so log it.
+ (local-output (l10n "Awaiting PID ~a, which is already gone.")
+ pid))
+ (put-message reply (or status 0))
+ (loop waiters
+ (if status
+ (list->ring-buffer (alist-delete pid zombies*)
+ %max-zombie-processes)
+ zombies))))))))
(define spawn-process-monitor
(essential-task-launcher 'process-monitor process-monitor))
Ludovic Courtès wrote 4 days ago
control message for bug #76790
(address . control@debbugs.gnu.org)
87jz8yqiy4.fsf@gnu.org
close 76790
quit
?
Your comment

Commenting via the web interface is currently disabled.

To comment on this conversation send an email to 76790@debbugs.gnu.org

To respond to this issue using the mumi CLI, first switch to it
mumi current 76790
Then, you may apply the latest patchset in this issue (with sign off)
mumi am -- -s
Or, compose a reply to this issue
mumi compose
Or, send patches to this issue
mumi send-email *.patch
You may also tag this issue. See list of standard tags. For example, to set the confirmed and easy tags
mumi command -t +confirmed -t +easy
Or, remove the moreinfo tag and set the help tag
mumi command -t -moreinfo -t +help