Ludovic Courtès wrote 2 years ago
(address . bug-guix@gnu.org)
I experienced, with the Shepherd 0.10.0, a situation where ‘herd
restart’ would get stuck after the end-of-grace-period expiration
(restarting nscd in this case):
Toggle snippet (12 lines)
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Status of nscd:
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] It is running since 08:43:57 (36 seconds ago).
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Running value is 23753.
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] It is enabled.
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Provides (nscd).
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Requires (user-processes syslogd).
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Will be respawned.
May 26 08:44:33 localhost shepherd[1]: Stopping service nscd...
[...]
May 26 08:44:38 localhost shepherd[1]: Grace period of 5 seconds is over; sending -23753 SIGKILL.
The ‘herd restart’ process is indeed waiting, and the nscd process is
still around as zombie, meaning that shepherd didn’t call waitpid(2):
Toggle snippet (13 lines)
$ sudo herd status nscd
Status of nscd:
It is being stopped.
It is enabled.
Provides (nscd).
Requires (user-processes syslogd).
Will be respawned.
$ ps aux|grep nscd
root 23753 0.0 0.0 0 0 ? Zs 08:43 0:00 [nscd] <defunct>
root 23870 0.0 0.1 49968 18088 ? Sl 08:44 0:00 /gnu/store/4gvgcfdiz67wv04ihqfa8pqwzsb0qpv5-guile-3.0.9/bin/guile --no-auto-compile /run/current-system/profile/bin/herd restart nscd
ludo 25280 33.3 0.0 6112 2304 pts/0 S+ 09:32 0:00 grep --color=auto nscd
At that point, if I pick another service process and kill it, shepherd
again fails to react to SIGCHLD (or doesn’t receive it):
Toggle snippet (22 lines)
$ sudo herd status ntpd
Status of ntpd:
It is running since 08:44:31 AM (52 minutes ago).
Running value is 23814.
It is enabled.
Provides (ntpd).
Requires (user-processes networking).
Will be respawned.
$ sudo kill 23814
$ sudo herd status ntpd
Status of ntpd:
It is running since 08:44:31 AM (52 minutes ago).
Running value is 23814.
It is enabled.
Provides (ntpd).
Requires (user-processes networking).
Will be respawned.
ludo@ribbon ~/src/guix$ ps 23814
PID TTY STAT TIME COMMAND
23814 ? Zs 0:00 [ntpd] <defunct>
If I repeat the same experiment while stracing shepherd, here’s what I
see:
Toggle snippet (14 lines)
$ sudo herd status guix-publish
Status of guix-publish:
It is running since 08:44:32 AM (55 minutes ago).
Running value is 23822.
It is enabled.
Provides (guix-publish).
Requires (user-processes guix-daemon avahi-daemon).
Will be respawned.
$ sudo kill 23822
$ ps 23822
PID TTY STAT TIME COMMAND
23822 ? Zs 0:02 [guix publish] <defunct>
… and the strace:
Toggle snippet (34 lines)
1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1 epoll_wait(13, [{events=EPOLLIN, data={u32=14, u64=14}}], 8, -1) = 1
1 accept4(14, {sa_family=AF_UNIX}, [112 => 2], SOCK_CLOEXEC|SOCK_NONBLOCK) = 25
1 accept4(14, 0x7ffe6f8f0be0, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
1 epoll_ctl(13, EPOLL_CTL_MOD, 14, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=14, u64=14}}) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 read(25, "(shepherd-command (version 0) (action status) (service guix-publish) (arguments ()) (directory \"/home/ludo/src/guix\"))", 1024) = 118
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 write(25, "(reply (version 0) (result ((service (version 0) (provides (guix-publish)) (requires (user-processes guix-daemon avahi-daemon)) (respawn? #t) (docstring \"[No documentation.]\") (enabled? #t) (running 23822) (conflicts ()) (last-respawns ()) (status-changes ((running . 1685083472) (starting . 1685083472) (stopped . 1685083469) (stopping . 1685083469) (running . 1684924319) (starting . 1684924319) (stopped . 1684924319) (stopping . 1684924319) (running . 1684853459) (starting . 1684853459) (stopped"..., 686) = 686
1 close(25) = 0
1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1 epoll_wait(13, [{events=EPOLLHUP, data={u32=30, u64=30}}], 8, -1) = 1
1 read(30, "", 4096) = 0
1 close(30) = 0
1 close(36) = 0
1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1 epoll_wait(13,
The signal FD still exists but we didn’t see any activity on it:
Toggle snippet (4 lines)
$ sudo ls -lrt /proc/1/fd |grep signalfd
lrwx------ 1 root root 64 May 26 09:41 10 -> anon_inode:[signalfd]
Here’s the signal handling status (notice two signals queued):
Toggle snippet (8 lines)
$ sudo cat /proc/1/status|grep -i Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: 0000000000014003
SigIgn: 0000000000001000
SigCgt: 0000000120814423
With:
(define (signal-mask->list mask)
(let loop ((sig 32) (lst '()))
(if (zero? sig)
lst
(loop (- sig 1)
(if (bit-set? (- sig 1) mask)
(cons sig lst)
lst)))))
… we see that the block mask is equal to ‘%precious-signals’ (good):
Toggle snippet (4 lines)
scheme@(shepherd system)> (signal-mask->list #x0000000000014003)
$33 = (1 2 15 17)
… while the ignored mask lists SIGPIPE:
Toggle snippet (4 lines)
scheme@(shepherd system)> (signal-mask->list #x0000000000001000)
$32 = (13)
However, the process also has 3 GC marker threads and 1 finalization
thread:
Toggle snippet (17 lines)
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7fdbd530f380 (LWP 1) "shepherd" 0x00007fdbd5415626 in epoll_wait (epfd=13,
events=0x7fdbd203a1a0, maxevents=8, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
2 Thread 0x7fdbd4cd9640 (LWP 118) "GC-marker-0" __futex_abstimed_wait_common64 (private=0,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
at futex-internal.c:57
3 Thread 0x7fdbd44d8640 (LWP 119) "GC-marker-1" __futex_abstimed_wait_common64 (private=0,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
at futex-internal.c:57
4 Thread 0x7fdbd3cd7640 (LWP 120) "GC-marker-2" __futex_abstimed_wait_common64 (private=0,
cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
at futex-internal.c:57
5 Thread 0x7fdbd3389640 (LWP 123) "shepherd" __GI___libc_read (nbytes=1,
buf=0x7fdbd3388620, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26
These threads seem to be blocking every signal, which is good:
Toggle snippet (26 lines)
$ sudo cat /proc/118/status |grep ^Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/119/status |grep ^Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/120/status |grep ^Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/123/status |grep ^Sig
SigQ: 2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5ffbfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
Then all of a sudden, as I’m conducting this experiment, I see:
Toggle snippet (13 lines)
May 26 09:36:37 localhost ntpd[23814]: ntpd exiting on signal 15 (Terminated)
[…]
May 26 10:52:10 localhost shepherd[1]: Respawning nscd.
May 26 10:52:10 localhost shepherd[1]: Service guix-publish (PID 23822) terminated with signal 15.
May 26 10:52:10 localhost shepherd[1]: Respawning ntpd.
May 26 10:52:10 localhost shepherd[1]: Respawning guix-publish.
May 26 10:52:10 localhost shepherd[1]: Starting service nscd...
May 26 10:52:10 localhost shepherd[1]: Starting service ntpd...
May 26 10:52:10 localhost shepherd[1]: Service ntpd has been started.
May 26 10:52:10 localhost shepherd[1]: Service ntpd started.
May 26 10:52:10 localhost shepherd[1]: Service ntpd running with value 29148.
10:52 is about the time I attached gdb to shepherd…
Interestingly, we get a “terminated with signal 15” message from
shepherd for guix-publish, but not for ntpd.
Long story short: there seems to be a problem with signal delivery.
Most likely, the initial grace period expiration above when stopping
nscd is a symptom of shepherd no longer receiving/processing SIGCHLD
rather than the cause.
To be continued…
Ludo’.