Ludovic Courtès wrote 7 days ago
(address . bug-guix@gnu.org)
While on a quest for flaky tests in the Shepherd, I found a genuine bug
that would manifest with this ‘tests/basic.sh’ failure:
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ sleep 0.5
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ grep 'exited with code 127'
+ sleep 0.5
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ grep 'exited with code 127'
[…]
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory started.
2025-03-06 14:06:36 Failed to run "/gnu/store/3bg5qfsmjw6p7bh1xadarbaq246zis0d-coreutils-9.1/bin/pwd": In procedure chdir: No such file or directory
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory running with value #<<process> id: 22431 command: ("/gnu/store/3bg5qfsmjw6p7bh1xadarbaq246zis0d-coreutils-9.1/bin/pwd")>.
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory has been started.
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory has been disabled.
2025-03-06 14:11:51 Stopping service root...
What happens is that the service is not marked as “exited with code
127”; instead, it is marked as having exited with code 0:
● Status of test-run-from-nonexistent-directory:
It is stopped since 14:06:36 (37 seconds ago).
Process exited successfully.
It is disabled.
Provides: test-run-from-nonexistent-directory
Will not be respawned.
This is due to a race condition: the process terminates before its
service goes from ‘starting’ to ‘running’.
By the time the service controller calls ‘monitor-service-process’, the
process has already terminated, so the process monitor replies 0 to the
'await request because that process no longer exists.
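
To make the ordering concrete, here is a toy shell model of the race. It is not Shepherd code: every function and variable name below is made up for illustration, and it assumes the monitor simply drops the status of a process that terminates before anyone asked to watch it, then answers 0 when asked about a process it knows nothing about.

```shell
# Toy model of the race, NOT Shepherd code: every name below is hypothetical.
# A single tracked process is enough to show the two orderings.

recorded=""      # exit status the monitor has recorded, if any
watching=no      # whether the monitor was asked to watch the process

# The service controller asks the monitor to watch the process.
monitor_watch () { watching=yes; }

# Called when the process terminates with status $1.
process_exits () {
    if [ "$watching" = yes ]; then
        recorded=$1          # someone is watching: keep the real status
    fi                       # otherwise the status is dropped (assumption)
}

# Reply to an "await" request: the recorded status, or 0 when the monitor
# knows nothing about the process.
monitor_await () { echo "${recorded:-0}"; }

# Ordering 1: watch first, then the process exits with code 42.
recorded="" watching=no
monitor_watch; process_exits 42
echo "watch before exit: $(monitor_await)"

# Ordering 2 (the race): the process exits before anyone watches it.
recorded="" watching=no
process_exits 42; monitor_watch
echo "exit before watch: $(monitor_await)"
```

In this model the first ordering reports 42 and the second reports 0. Only the order of the two calls differs between the runs, which is why the failure is timing-dependent.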
Attached is a test that reproduces the problem.
Ludo’.
# GNU Shepherd --- Handling termination of a process before 'start' completes.
# Copyright © 2025 Ludovic Courtès <ludo@gnu.org>
#
# This file is part of the GNU Shepherd.
#
# The GNU Shepherd is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or (at
# your option) any later version.
#
# The GNU Shepherd is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with the GNU Shepherd. If not, see http://www.gnu.org/licenses/.
shepherd --version
herd --version
socket="t-socket-$$"
conf="t-conf-$$"
log="t-log-$$"
pid="t-pid-$$"
herd="herd -s $socket"
trap "cat $log || true; rm -f $socket $conf $log;
test -f $pid && kill \`cat $pid\` || true; rm -f $pid" EXIT
cat > "$conf" <<EOF
(register-services
 (list (service
         '(stops-early)
         #:start (lambda ()
                   (let ((pid (fork+exec-command
                               '("$SHELL" "-c" "echo done; exit 42"))))
                     (format #t "got PID ~a; sleeping~%" pid)
                     ;; Artificially wait until PID is gone for sure.
                     (let loop ()
                       (when (false-if-exception (begin (kill pid 0) #t))
                         (sleep 0.5)
                         (loop)))
                     pid))
         #:stop (make-kill-destructor)
         #:respawn? #f)))
EOF
rm -f "$pid"
shepherd -I -s "$socket" -c "$conf" --pid="$pid" --log="$log" &
# Wait till it's ready.
until test -f "$pid"; do sleep 0.3; done
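$herd status

# 'start' returns only once the process has already exited with code 42.
$herd start stops-early

# The service must be reported as stopped, with the actual exit code.
$herd status stops-early
$herd status stops-early | grep stopped
$herd status stops-early | grep 'exited with code 42'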