[Shepherd] Loss of SIGCHLD notifications

  • Done
Details
  • Participant: Ludovic Courtès
  • Owner: unassigned
  • Submitted by: Ludovic Courtès
  • Severity: important
Ludovic Courtès wrote on 26 May 2023 14:14
(address . bug-guix@gnu.org)
87353jb88e.fsf@inria.fr
With Shepherd 0.10.0, I ran into a situation where ‘herd restart’
would get stuck after the grace period expired (restarting nscd in
this case):

May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Status of nscd:
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] It is running since 08:43:57 (36 seconds ago).
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Running value is 23753.
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] It is enabled.
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Provides (nscd).
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Requires (user-processes syslogd).
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Will be respawned.
May 26 08:44:33 localhost shepherd[1]: Stopping service nscd...
[...]
May 26 08:44:38 localhost shepherd[1]: Grace period of 5 seconds is over; sending -23753 SIGKILL.
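
The ‘-23753’ means the SIGKILL goes to the whole process group.  As a
rough sketch in plain Guile (illustrative, not Shepherd’s actual
code), the grace-period logic amounts to:

(define (terminate-with-grace pid grace-seconds)
  ;; Politely ask PID to stop, then forcefully kill its whole process
  ;; group once the grace period is over.
  (kill pid SIGTERM)
  (sleep grace-seconds)
  (when (false-if-exception (begin (kill pid 0) #t)) ;still around?
    (kill (- pid) SIGKILL)))            ;negative PID = process group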

The ‘herd restart’ process is indeed waiting, and the nscd process is
still around as a zombie, meaning that shepherd didn’t call waitpid(2):

$ sudo herd status nscd
Status of nscd:
It is being stopped.
It is enabled.
Provides (nscd).
Requires (user-processes syslogd).
Will be respawned.
$ ps aux|grep nscd
root 23753 0.0 0.0 0 0 ? Zs 08:43 0:00 [nscd] <defunct>
root 23870 0.0 0.1 49968 18088 ? Sl 08:44 0:00 /gnu/store/4gvgcfdiz67wv04ihqfa8pqwzsb0qpv5-guile-3.0.9/bin/guile --no-auto-compile /run/current-system/profile/bin/herd restart nscd
ludo 25280 33.3 0.0 6112 2304 pts/0 S+ 09:32 0:00 grep --color=auto nscd
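
A zombie lingers exactly until its parent calls waitpid(2) on it.  As
a minimal sketch (plain Guile, not Shepherd’s actual code), the
non-blocking reaping loop a SIGCHLD handler is supposed to run looks
like this:

(define (reap-children)
  ;; Collect every terminated child so that none is left as a zombie.
  ;; ‘false-if-exception’ covers ECHILD, raised when there is no
  ;; child process at all.
  (let loop ()
    (let ((result (false-if-exception (waitpid WAIT_ANY WNOHANG))))
      (when (and result (not (zero? (car result))))
        (format #t "reaped child ~a~%" (car result))
        (loop)))))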

At that point, if I pick another service process and kill it, shepherd
again fails to react to SIGCHLD (or doesn’t receive it):

$ sudo herd status ntpd
Status of ntpd:
It is running since 08:44:31 AM (52 minutes ago).
Running value is 23814.
It is enabled.
Provides (ntpd).
Requires (user-processes networking).
Will be respawned.
$ sudo kill 23814
$ sudo herd status ntpd
Status of ntpd:
It is running since 08:44:31 AM (52 minutes ago).
Running value is 23814.
It is enabled.
Provides (ntpd).
Requires (user-processes networking).
Will be respawned.
ludo@ribbon ~/src/guix$ ps 23814
PID TTY STAT TIME COMMAND
23814 ? Zs 0:00 [ntpd] <defunct>
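
For comparison, a plain Guile program would typically react to child
termination with something like the handler below, reusing the
‘reap-children’ sketch above (hypothetical code; Shepherd instead
reads signals from a signal file descriptor in a dedicated fiber):

;; Run the reaping loop whenever a SIGCHLD is delivered.
(sigaction SIGCHLD
  (lambda (signum)
    (reap-children)))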

If I repeat the same experiment while stracing shepherd, here’s what I
see:

$ sudo herd status guix-publish
Status of guix-publish:
It is running since 08:44:32 AM (55 minutes ago).
Running value is 23822.
It is enabled.
Provides (guix-publish).
Requires (user-processes guix-daemon avahi-daemon).
Will be respawned.
$ sudo kill 23822
$ ps 23822
PID TTY STAT TIME COMMAND
23822 ? Zs 0:02 [guix publish] <defunct>

… and the strace:

1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1 epoll_wait(13, [{events=EPOLLIN, data={u32=14, u64=14}}], 8, -1) = 1
1 accept4(14, {sa_family=AF_UNIX}, [112 => 2], SOCK_CLOEXEC|SOCK_NONBLOCK) = 25
1 accept4(14, 0x7ffe6f8f0be0, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
1 epoll_ctl(13, EPOLL_CTL_MOD, 14, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=14, u64=14}}) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 read(25, "(shepherd-command (version 0) (action status) (service guix-publish) (arguments ()) (directory \"/home/ludo/src/guix\"))", 1024) = 118
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 write(25, "(reply (version 0) (result ((service (version 0) (provides (guix-publish)) (requires (user-processes guix-daemon avahi-daemon)) (respawn? #t) (docstring \"[No documentation.]\") (enabled? #t) (running 23822) (conflicts ()) (last-respawns ()) (status-changes ((running . 1685083472) (starting . 1685083472) (stopped . 1685083469) (stopping . 1685083469) (running . 1684924319) (starting . 1684924319) (stopped . 1684924319) (stopping . 1684924319) (running . 1684853459) (starting . 1684853459) (stopped"..., 686) = 686
1 close(25) = 0
1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1 epoll_wait(13, [{events=EPOLLHUP, data={u32=30, u64=30}}], 8, -1) = 1
1 read(30, "", 4096) = 0
1 close(30) = 0
1 close(36) = 0
1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1 epoll_wait(13,

The signal FD still exists but we didn’t see any activity on it:

$ sudo ls -lrt /proc/1/fd |grep signalfd
lrwx------ 1 root root 64 May 26 09:41 10 -> anon_inode:[signalfd]

Here’s the signal-handling status (notice the two queued signals):

$ sudo cat /proc/1/status|grep -i Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: 0000000000014003
SigIgn: 0000000000001000
SigCgt: 0000000120814423

With:

(define (signal-mask->list mask)
  ;; Return the list of signal numbers whose bit is set in MASK, the
  ;; integer form of the SigBlk/SigIgn/SigCgt fields shown above.
  (let loop ((sig 32) (lst '()))
    (if (zero? sig)
        lst
        (loop (- sig 1)
              (if (bit-set? (- sig 1) mask)
                  (cons sig lst)
                  lst)))))

… we see that the block mask is equal to ‘%precious-signals’ (good):

scheme@(shepherd system)> (signal-mask->list #x0000000000014003)
$33 = (1 2 15 17)

… while the ignored mask lists SIGPIPE:

scheme@(shepherd system)> (signal-mask->list #x0000000000001000)
$32 = (13)
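
For readability, a small helper (hypothetical, using Guile’s signal
constants) can map those numbers back to names:

(define signal-names
  ;; Only the signals relevant here.
  `((,SIGHUP . SIGHUP) (,SIGINT . SIGINT) (,SIGTERM . SIGTERM)
    (,SIGCHLD . SIGCHLD) (,SIGPIPE . SIGPIPE)))

(map (lambda (sig) (or (assv-ref signal-names sig) sig))
     (signal-mask->list #x0000000000014003))
;; => (SIGHUP SIGINT SIGTERM SIGCHLD)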

However, the process also has 3 GC marker threads and 1 finalization
thread:

(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7fdbd530f380 (LWP 1) "shepherd" 0x00007fdbd5415626 in epoll_wait (epfd=13,
events=0x7fdbd203a1a0, maxevents=8, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
2 Thread 0x7fdbd4cd9640 (LWP 118) "GC-marker-0" __futex_abstimed_wait_common64 (private=0,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
at futex-internal.c:57
3 Thread 0x7fdbd44d8640 (LWP 119) "GC-marker-1" __futex_abstimed_wait_common64 (private=0,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
at futex-internal.c:57
4 Thread 0x7fdbd3cd7640 (LWP 120) "GC-marker-2" __futex_abstimed_wait_common64 (private=0,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
at futex-internal.c:57
5 Thread 0x7fdbd3389640 (LWP 123) "shepherd" __GI___libc_read (nbytes=1,
buf=0x7fdbd3388620, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26

These threads seem to be blocking every signal, which is good:

$ sudo cat /proc/118/status |grep ^Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/119/status |grep ^Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/120/status |grep ^Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/123/status |grep ^Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5ffbfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423

Then all of a sudden, as I’m conducting this experiment, I see:

May 26 09:36:37 localhost ntpd[23814]: ntpd exiting on signal 15 (Terminated)
[…]
May 26 10:52:10 localhost shepherd[1]: Respawning nscd.
May 26 10:52:10 localhost shepherd[1]: Service guix-publish (PID 23822) terminated with signal 15.
May 26 10:52:10 localhost shepherd[1]: Respawning ntpd.
May 26 10:52:10 localhost shepherd[1]: Respawning guix-publish.
May 26 10:52:10 localhost shepherd[1]: Starting service nscd...
May 26 10:52:10 localhost shepherd[1]: Starting service ntpd...
May 26 10:52:10 localhost shepherd[1]: Service ntpd has been started.
May 26 10:52:10 localhost shepherd[1]: Service ntpd started.
May 26 10:52:10 localhost shepherd[1]: Service ntpd running with value 29148.

10:52 is about the time I attached gdb to shepherd…

Interestingly, we get a “terminated with signal 15” message from
shepherd for guix-publish, but not for ntpd.

Long story short: there seems to be a problem with signal delivery.
Most likely, the initial grace period expiration above when stopping
nscd is a symptom of shepherd no longer receiving/processing SIGCHLD
rather than the cause.

To be continued…

Ludo’.
Ludovic Courtès wrote on 27 May 2023 19:00
(address . 63736@debbugs.gnu.org)
87wn0thfp4.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> Long story short: there seems to be a problem with signal delivery.
> Most likely, the initial grace period expiration above when stopping
> nscd is a symptom of shepherd no longer receiving/processing SIGCHLD
> rather than the cause.

Another possibility is a lockup: one of the relevant fibers is either gone
or stuck in ‘put-message’ or ‘get-message’.

I did two things:

b9a37f3 shepherd: Make signal handling fiber an essential task.
8ae2780 service: Do not attempt to restart transient services.

Commit 8ae2780 fixes a bug whereby ‘herd restart’ could end up
attempting to restart a transient service, which would lock up the
calling fiber because the service’s controlling fiber would first
receive the 'terminate message, so it would return and nobody would be
reading further messages sent on its channel.
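
As a minimal sketch of that lockup pattern (illustrative Guile Fibers
code, not Shepherd’s): channels are rendezvous channels, so once the
receiving fiber has returned, any later ‘put-message’ on the same
channel blocks forever.

(use-modules (fibers) (fibers channels))

(run-fibers
 (lambda ()
   (let ((channel (make-channel)))
     ;; “Controlling” fiber: returns as soon as it sees 'terminate.
     (spawn-fiber
      (lambda ()
        (let loop ()
          (unless (eq? (get-message channel) 'terminate)
            (loop)))))
     (put-message channel 'terminate)   ;received; the fiber returns
     (put-message channel 'restart))))  ;no reader left: blocks forever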

Commit b9a37f3 allows us to ensure that the signal-handling fiber
never exits (and we’ll get a trace in the log if it tries to).
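
The idea of an “essential” task, roughly (hypothetical names, not
Shepherd’s actual API):

(define (spawn-essential-fiber name thunk)
  ;; Wrap THUNK so that, should it ever return or raise, the event is
  ;; logged loudly instead of signal handling silently disappearing.
  (spawn-fiber
   (lambda ()
     (catch #t
       (lambda ()
         (thunk)
         (format (current-error-port)
                 "essential task '~a' returned unexpectedly~%" name))
       (lambda args
         (format (current-error-port)
                 "essential task '~a' died: ~s~%" name args))))))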

Ludo’.
Ludovic Courtès wrote on 28 May 2023 11:25
control message for bug #63736
(address . control@debbugs.gnu.org)
87o7m4hkpc.fsf@gnu.org
severity 63736 important
quit
Ludovic Courtès wrote on 23 Nov 2023 21:46
Re: bug#63736: [Shepherd] Loss of SIGCHLD notifications
(address . 63736-done@debbugs.gnu.org)
87o7fk8afa.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> Another possibility is a lockup: one of the relevant fibers is either gone
> or stuck in ‘put-message’ or ‘get-message’.
>
> I did two things:
>
> b9a37f3 shepherd: Make signal handling fiber an essential task.
> 8ae2780 service: Do not attempt to restart transient services.
>
> Commit 8ae2780 fixes a bug whereby ‘herd restart’ could end up
> attempting to restart a transient service, which would lock up the
> calling fiber because the service’s controlling fiber would first
> receive the 'terminate message, so it would return and nobody would be
> reading further messages sent on its channel.
>
> Commit b9a37f3 allows us to ensure that the signal-handling fiber
> never exits (and we’ll get a trace in the log if it tries to).

Apparently these commits, which made it into 0.10.1 months ago, fixed this
particular bug. Closing!

Ludo’.
Closed