[Shepherd] Loss of SIGCHLD notifications

  • Done
Details
  • Participant: Ludovic Courtès
  • Owner: unassigned
  • Submitted by: Ludovic Courtès
  • Severity: important
Ludovic Courtès wrote on 26 May 2023 14:14
(address . bug-guix@gnu.org)
87353jb88e.fsf@inria.fr
With Shepherd 0.10.0, I ran into a situation where ‘herd restart’
would get stuck after the grace period expired (restarting nscd in
this case):

May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Status of nscd:
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] It is running since 08:43:57 (36 seconds ago).
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Running value is 23753.
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] It is enabled.
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Provides (nscd).
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Requires (user-processes syslogd).
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Will be respawned.
May 26 08:44:33 localhost shepherd[1]: Stopping service nscd...
[...]
May 26 08:44:38 localhost shepherd[1]: Grace period of 5 seconds is over; sending -23753 SIGKILL.
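
The ‘-23753’ means the SIGKILL goes to the whole process group.  As a
rough sketch in plain Guile (illustrative, not Shepherd’s actual
code), the grace-period logic amounts to:

(define (terminate-with-grace pid grace-seconds)
  ;; Politely ask PID to stop, then forcefully kill its whole process
  ;; group once the grace period is over.
  (kill pid SIGTERM)
  (sleep grace-seconds)
  (when (false-if-exception (begin (kill pid 0) #t)) ;still around?
    (kill (- pid) SIGKILL)))            ;negative PID = process group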

The ‘herd restart’ process is indeed waiting, and the nscd process is
still around as a zombie, meaning that shepherd didn’t call waitpid(2):

$ sudo herd status nscd
Status of nscd:
It is being stopped.
It is enabled.
Provides (nscd).
Requires (user-processes syslogd).
Will be respawned.
$ ps aux|grep nscd
root 23753 0.0 0.0 0 0 ? Zs 08:43 0:00 [nscd] <defunct>
root 23870 0.0 0.1 49968 18088 ? Sl 08:44 0:00 /gnu/store/4gvgcfdiz67wv04ihqfa8pqwzsb0qpv5-guile-3.0.9/bin/guile --no-auto-compile /run/current-system/profile/bin/herd restart nscd
ludo 25280 33.3 0.0 6112 2304 pts/0 S+ 09:32 0:00 grep --color=auto nscd
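
A zombie lingers exactly until its parent calls waitpid(2) on it.  As
a minimal sketch (plain Guile, not Shepherd’s actual code), the
non-blocking reaping loop a SIGCHLD handler is supposed to run looks
like this:

(define (reap-children)
  ;; Collect every terminated child so that none is left as a zombie.
  ;; ‘false-if-exception’ covers ECHILD, raised when there is no
  ;; child process at all.
  (let loop ()
    (let ((result (false-if-exception (waitpid WAIT_ANY WNOHANG))))
      (when (and result (not (zero? (car result))))
        (format #t "reaped child ~a~%" (car result))
        (loop)))))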

At that point, if I pick another service process and kill it, shepherd
again fails to react to SIGCHLD (or doesn’t receive it):

$ sudo herd status ntpd
Status of ntpd:
It is running since 08:44:31 AM (52 minutes ago).
Running value is 23814.
It is enabled.
Provides (ntpd).
Requires (user-processes networking).
Will be respawned.
$ sudo kill 23814
$ sudo herd status ntpd
Status of ntpd:
It is running since 08:44:31 AM (52 minutes ago).
Running value is 23814.
It is enabled.
Provides (ntpd).
Requires (user-processes networking).
Will be respawned.
ludo@ribbon ~/src/guix$ ps 23814
PID TTY STAT TIME COMMAND
23814 ? Zs 0:00 [ntpd] <defunct>
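
For comparison, a plain Guile program would typically react to child
termination with something like the handler below, reusing the
‘reap-children’ sketch above (hypothetical code; Shepherd instead
reads signals from a signal file descriptor in a dedicated fiber):

;; Run the reaping loop whenever a SIGCHLD is delivered.
(sigaction SIGCHLD
  (lambda (signum)
    (reap-children)))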

If I repeat the same experiment while stracing shepherd, here’s what I
see:

$ sudo herd status guix-publish
Status of guix-publish:
It is running since 08:44:32 AM (55 minutes ago).
Running value is 23822.
It is enabled.
Provides (guix-publish).
Requires (user-processes guix-daemon avahi-daemon).
Will be respawned.
$ sudo kill 23822
$ ps 23822
PID TTY STAT TIME COMMAND
23822 ? Zs 0:02 [guix publish] <defunct>

… and the strace:

1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1 epoll_wait(13, [{events=EPOLLIN, data={u32=14, u64=14}}], 8, -1) = 1
1 accept4(14, {sa_family=AF_UNIX}, [112 => 2], SOCK_CLOEXEC|SOCK_NONBLOCK) = 25
1 accept4(14, 0x7ffe6f8f0be0, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
1 epoll_ctl(13, EPOLL_CTL_MOD, 14, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=14, u64=14}}) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 read(25, "(shepherd-command (version 0) (action status) (service guix-publish) (arguments ()) (directory \"/home/ludo/src/guix\"))", 1024) = 118
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 write(25, "(reply (version 0) (result ((service (version 0) (provides (guix-publish)) (requires (user-processes guix-daemon avahi-daemon)) (respawn? #t) (docstring \"[No documentation.]\") (enabled? #t) (running 23822) (conflicts ()) (last-respawns ()) (status-changes ((running . 1685083472) (starting . 1685083472) (stopped . 1685083469) (stopping . 1685083469) (running . 1684924319) (starting . 1684924319) (stopped . 1684924319) (stopping . 1684924319) (running . 1684853459) (starting . 1684853459) (stopped"..., 686) = 686
1 close(25) = 0
1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1 epoll_wait(13, [{events=EPOLLHUP, data={u32=30, u64=30}}], 8, -1) = 1
1 read(30, "", 4096) = 0
1 close(30) = 0
1 close(36) = 0
1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1 epoll_wait(13,

The signal FD still exists but we didn’t see any activity on it:

$ sudo ls -lrt /proc/1/fd |grep signalfd
lrwx------ 1 root root 64 May 26 09:41 10 -> anon_inode:[signalfd]

Here’s the signal-handling status (notice the two queued signals):

$ sudo cat /proc/1/status|grep -i Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: 0000000000014003
SigIgn: 0000000000001000
SigCgt: 0000000120814423

With:

(define (signal-mask->list mask)
  ;; Return the list of signal numbers whose bit is set in MASK, the
  ;; integer form of the SigBlk/SigIgn/SigCgt fields shown above.
  (let loop ((sig 32) (lst '()))
    (if (zero? sig)
        lst
        (loop (- sig 1)
              (if (bit-set? (- sig 1) mask)
                  (cons sig lst)
                  lst)))))

… we see that the block mask is equal to ‘%precious-signals’ (good):

scheme@(shepherd system)> (signal-mask->list #x0000000000014003)
$33 = (1 2 15 17)

… while the ignored mask lists SIGPIPE:

scheme@(shepherd system)> (signal-mask->list #x0000000000001000)
$32 = (13)
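
For readability, a small helper (hypothetical, using Guile’s signal
constants) can map those numbers back to names:

(define signal-names
  ;; Only the signals relevant here.
  `((,SIGHUP . SIGHUP) (,SIGINT . SIGINT) (,SIGTERM . SIGTERM)
    (,SIGCHLD . SIGCHLD) (,SIGPIPE . SIGPIPE)))

(map (lambda (sig) (or (assv-ref signal-names sig) sig))
     (signal-mask->list #x0000000000014003))
;; => (SIGHUP SIGINT SIGTERM SIGCHLD)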

However, the process also has 3 GC marker threads and 1 finalization
thread:

(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7fdbd530f380 (LWP 1) "shepherd" 0x00007fdbd5415626 in epoll_wait (epfd=13,
events=0x7fdbd203a1a0, maxevents=8, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
2 Thread 0x7fdbd4cd9640 (LWP 118) "GC-marker-0" __futex_abstimed_wait_common64 (private=0,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
at futex-internal.c:57
3 Thread 0x7fdbd44d8640 (LWP 119) "GC-marker-1" __futex_abstimed_wait_common64 (private=0,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
at futex-internal.c:57
4 Thread 0x7fdbd3cd7640 (LWP 120) "GC-marker-2" __futex_abstimed_wait_common64 (private=0,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
at futex-internal.c:57
5 Thread 0x7fdbd3389640 (LWP 123) "shepherd" __GI___libc_read (nbytes=1,
buf=0x7fdbd3388620, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26

These threads seem to be blocking every signal, which is good:

$ sudo cat /proc/118/status |grep ^Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/119/status |grep ^Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/120/status |grep ^Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/123/status |grep ^Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5ffbfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423

Then all of a sudden, as I’m conducting this experiment, I see:

May 26 09:36:37 localhost ntpd[23814]: ntpd exiting on signal 15 (Terminated)
[…]
May 26 10:52:10 localhost shepherd[1]: Respawning nscd.
May 26 10:52:10 localhost shepherd[1]: Service guix-publish (PID 23822) terminated with signal 15.
May 26 10:52:10 localhost shepherd[1]: Respawning ntpd.
May 26 10:52:10 localhost shepherd[1]: Respawning guix-publish.
May 26 10:52:10 localhost shepherd[1]: Starting service nscd...
May 26 10:52:10 localhost shepherd[1]: Starting service ntpd...
May 26 10:52:10 localhost shepherd[1]: Service ntpd has been started.
May 26 10:52:10 localhost shepherd[1]: Service ntpd started.
May 26 10:52:10 localhost shepherd[1]: Service ntpd running with value 29148.

10:52 is about the time I attached gdb to shepherd…

Interestingly, we get a “terminated with signal 15” message from
shepherd for guix-publish, but not for ntpd.

Long story short: there seems to be a problem with signal delivery.
Most likely, the initial grace period expiration above when stopping
nscd is a symptom of shepherd no longer receiving/processing SIGCHLD
rather than the cause.

To be continued…

Ludo’.
Ludovic Courtès wrote on 27 May 2023 19:00
(address . 63736@debbugs.gnu.org)
87wn0thfp4.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> Long story short: there seems to be a problem with signal delivery.
> Most likely, the initial grace period expiration above when stopping
> nscd is a symptom of shepherd no longer receiving/processing SIGCHLD
> rather than the cause.

Another possibility is a lockup: one of the relevant fibers is either gone
or stuck in ‘put-message’ or ‘get-message’.

I did two things:

b9a37f3 shepherd: Make signal handling fiber an essential task.
8ae2780 service: Do not attempt to restart transient services.

Commit 8ae2780 fixes a bug whereby ‘herd restart’ could end up
attempting to restart a transient service, which would lock up the
calling fiber because the service’s controlling fiber would first
receive the 'terminate message, so it would return and nobody would be
reading further messages sent on its channel.
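
As a minimal sketch of that lockup pattern (illustrative Guile Fibers
code, not Shepherd’s): channels are rendezvous channels, so once the
receiving fiber has returned, any later ‘put-message’ on the same
channel blocks forever.

(use-modules (fibers) (fibers channels))

(run-fibers
 (lambda ()
   (let ((channel (make-channel)))
     ;; “Controlling” fiber: returns as soon as it sees 'terminate.
     (spawn-fiber
      (lambda ()
        (let loop ()
          (unless (eq? (get-message channel) 'terminate)
            (loop)))))
     (put-message channel 'terminate)   ;received; the fiber returns
     (put-message channel 'restart))))  ;no reader left: blocks forever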

Commit b9a37f3 allows us to ensure that the signal-handling fiber
never exits (and we’ll get a trace in the log if it tries to).
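
The idea of an “essential” task, roughly (hypothetical names, not
Shepherd’s actual API):

(define (spawn-essential-fiber name thunk)
  ;; Wrap THUNK so that, should it ever return or raise, the event is
  ;; logged loudly instead of signal handling silently disappearing.
  (spawn-fiber
   (lambda ()
     (catch #t
       (lambda ()
         (thunk)
         (format (current-error-port)
                 "essential task '~a' returned unexpectedly~%" name))
       (lambda args
         (format (current-error-port)
                 "essential task '~a' died: ~s~%" name args))))))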

Ludo’.
Ludovic Courtès wrote on 28 May 2023 11:25
control message for bug #63736
(address . control@debbugs.gnu.org)
87o7m4hkpc.fsf@gnu.org
severity 63736 important
quit
Ludovic Courtès wrote on 23 Nov 2023 21:46
Re: bug#63736: [Shepherd] Loss of SIGCHLD notifications
(address . 63736-done@debbugs.gnu.org)
87o7fk8afa.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> Another possibility is a lockup: one of the relevant fibers is either gone
> or stuck in ‘put-message’ or ‘get-message’.
>
> I did two things:
>
> b9a37f3 shepherd: Make signal handling fiber an essential task.
> 8ae2780 service: Do not attempt to restart transient services.
>
> Commit 8ae2780 fixes a bug whereby ‘herd restart’ could end up
> attempting to restart a transient service, which would lock up the
> calling fiber because the service’s controlling fiber would first
> receive the 'terminate message, so it would return and nobody would be
> reading further messages sent on its channel.
>
> Commit b9a37f3 allows us to ensure that the signal-handling fiber
> never exits (and we’ll get a trace in the log if it tries to).

Apparently these commits, which made it into 0.10.1 months ago, fixed this
particular bug. Closing!

Ludo’.
Closed