Shepherd periodically goes unresponsive on one of my machines

  • Open
  • quality assurance status badge
Details
2 participants
  • Jonathan Frederickson
  • Ludovic Courtès
Owner
unassigned
Submitted by
Jonathan Frederickson
Severity
normal
J
J
Jonathan Frederickson wrote on 18 Jul 02:43 +0200
(address . bug-guix@gnu.org)
df6e8894-fd84-446f-a67f-50cdcc9de5b3@app.fastmail.com
I've been running into an issue with Shepherd on one of my machines. Every so often (and I haven't figured out what conditions trigger it), my Shepherd instances (both home and PID 1) will go unresponsive. I thought I had tracked it down to a misbehaving home service that I had configured, but it's just happened again without that service running.

'herd status' hangs indefinitely:

jfred@terracard ~$ sudo herd status
Password:
<never returns>

...on both instances:

jfred@terracard ~$ herd status
<never returns>

The PID 1 shepherd instance isn't reaping defunct processes:

jfred@terracard ~$ ps aux | grep -i lock
jfred 541 0.0 0.0 3700 2304 ? S 18:30 0:00 swayidle -w timeout 300 swaylock -f -i ~/.wallpapers/user-manual.jpg timeout 10 if pgrep swaylock; then swaymsg "output * dpms off"; fi resume swaymsg "output * dpms on" before-sleep swaylock -f -i ~/.wallpapers/user-manual.jpg
jfred 3111 0.0 0.0 0 0 ? Z 18:53 0:00 [swaylock] <defunct>
jfred 3112 0.0 0.0 0 0 ? Zs 18:53 0:00 [swaylock] <defunct>

Some further troubleshooting... strace indicates that it's waiting on a read() on its fd 9:

jfred@terracard ~ [env]$ sudo strace -fp 1
Password:
strace: Process 1 attached with 5 threads
[pid 144] read(9, <unfinished ...>
[pid 142] futex(0x7fa43892abe8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 141] futex(0x7fa43892abe8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 140] futex(0x7fa43892abe8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY^

...which seems to be:

jfred@terracard ~ [env]$ sudo ls -l /proc/1/fd/9
lr-x------ 1 root root 64 Jul 17 20:39 /proc/1/fd/9 -> 'pipe:[4015]'
jfred@terracard ~ [env]$ sudo lsof -n | grep 4015
lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc
Output information may be incomplete.
shepherd 1 root 9r FIFO 0,15 0t0 4015 pipe
shepherd 1 root 11w FIFO 0,15 0t0 4015 pipe
shepherd 1 140 GC-marker root 9r FIFO 0,15 0t0 4015 pipe
shepherd 1 140 GC-marker root 11w FIFO 0,15 0t0 4015 pipe
shepherd 1 141 GC-marker root 9r FIFO 0,15 0t0 4015 pipe
shepherd 1 141 GC-marker root 11w FIFO 0,15 0t0 4015 pipe
shepherd 1 142 GC-marker root 9r FIFO 0,15 0t0 4015 pipe
shepherd 1 142 GC-marker root 11w FIFO 0,15 0t0 4015 pipe
shepherd 1 144 shepherd root 9r FIFO 0,15 0t0 4015 pipe
shepherd 1 144 shepherd root 11w FIFO 0,15 0t0 4015 pipe

My system configuration for this machine can be found here, and I last ran a 'guix pull' on June 21: https://github.com/jfrederickson/dotfiles/blob/master/guix/guix/system/machines/terracard/config.scm

Has anyone else run into this?
L
L
Ludovic Courtès wrote on 19 Jul 17:35 +0200
(name . Jonathan Frederickson)(address . jonathan@terracrypt.net)(address . 72166@debbugs.gnu.org)
878qxxtmwu.fsf@gnu.org
Hi Jonathan,

"Jonathan Frederickson" <jonathan@terracrypt.net> skribis:

Toggle quote (13 lines)
> I've been running into an issue with Shepherd on one of my machines. Every so often (and I haven't figured out what conditions trigger it), my Shepherd instances (both home and PID 1) will go unresponsive. I thought I had tracked it down to a misbehaving home service that I had configured, but it's just happened again without that service running.
>
> 'herd status' hangs indefinitely:
>
> jfred@terracard ~$ sudo herd status
> Password:
> <never returns>
>
> ...on both instances:
>
> jfred@terracard ~$ herd status
> <never returns>

Ouch. What version of shepherd is running? (You can view it with
“cat /proc/1/cmdline | xargs -0”.)

Toggle quote (9 lines)
> The PID 1 shepherd instance isn't reaping defunct processes:
>
> jfred@terracard ~$ ps aux | grep -i lock
> jfred 541 0.0 0.0 3700 2304 ? S 18:30 0:00 swayidle -w timeout 300 swaylock -f -i ~/.wallpapers/user-manual.jpg timeout 10 if pgrep swaylock; then swaymsg "output * dpms off"; fi resume swaymsg "output * dpms on" before-sleep swaylock -f -i ~/.wallpapers/user-manual.jpg
> jfred 3111 0.0 0.0 0 0 ? Z 18:53 0:00 [swaylock] <defunct>
> jfred 3112 0.0 0.0 0 0 ? Zs 18:53 0:00 [swaylock] <defunct>
>
> Some further troubleshooting... strace indicates that it's waiting on a read() on its fd 9:

Interesting. There were bugs in earlier 0.10.x version that could cause
this sort of thing; let’s see what version you have, first.

Ludo’.
J
J
Jonathan Frederickson wrote on 19 Jul 18:25 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 72166@debbugs.gnu.org)
7974c622-e7d8-48b3-9948-14e8d7654793@app.fastmail.com
On Fri, Jul 19, 2024, at 11:35 AM, Ludovic Courtès wrote:
Toggle quote (35 lines)
> Hi Jonathan,
>
> "Jonathan Frederickson" <jonathan@terracrypt.net> skribis:
>
> > I've been running into an issue with Shepherd on one of my machines. Every so often (and I haven't figured out what conditions trigger it), my Shepherd instances (both home and PID 1) will go unresponsive. I thought I had tracked it down to a misbehaving home service that I had configured, but it's just happened again without that service running.
> >
> > 'herd status' hangs indefinitely:
> >
> > jfred@terracard ~$ sudo herd status
> > Password:
> > <never returns>
> >
> > ...on both instances:
> >
> > jfred@terracard ~$ herd status
> > <never returns>
>
> Ouch. What version of shepherd is running? (You can view it with
> “cat /proc/1/cmdline | xargs -0”.)
>
> > The PID 1 shepherd instance isn't reaping defunct processes:
> >
> > jfred@terracard ~$ ps aux | grep -i lock
> > jfred 541 0.0 0.0 3700 2304 ? S 18:30 0:00 swayidle -w timeout 300 swaylock -f -i ~/.wallpapers/user-manual.jpg timeout 10 if pgrep swaylock; then swaymsg "output * dpms off"; fi resume swaymsg "output * dpms on" before-sleep swaylock -f -i ~/.wallpapers/user-manual.jpg
> > jfred 3111 0.0 0.0 0 0 ? Z 18:53 0:00 [swaylock] <defunct>
> > jfred 3112 0.0 0.0 0 0 ? Zs 18:53 0:00 [swaylock] <defunct>
> >
> > Some further troubleshooting... strace indicates that it's waiting on a read() on its fd 9:
>
> Interesting. There were bugs in earlier 0.10.x version that could cause
> this sort of thing; let’s see what version you have, first.
>
> Ludo’.
>

Hi Ludo, thanks for the troubleshooting help. Looks like I'm running 0.10.4:

jfred@terracard ~$ cat /proc/1/cmdline | xargs -0
/gnu/store/bhynhk0c6ssq3fqqc59fvhxjzwywsjbb-guile-3.0.9/bin/guile --no-auto-compile /gnu/store/39li5qpiaj1lx89xgahlbgvfnjhpcpwg-shepherd-0.10.4/bin/shepherd --config /gnu/store/hfyri6ygfdjq4w3nkha2ypa2k98hhfxj-shepherd.conf

I see now that 0.10.5 was released a few weeks ago, does that have a fix that could be related?
L
L
Ludovic Courtès wrote on 22 Jul 09:14 +0200
(name . Jonathan Frederickson)(address . jonathan@terracrypt.net)(address . 72166@debbugs.gnu.org)
87zfq9kiei.fsf@gnu.org
Hi,

"Jonathan Frederickson" <jonathan@terracrypt.net> skribis:

Toggle quote (7 lines)
> Hi Ludo, thanks for the troubleshooting help. Looks like I'm running 0.10.4:
>
> jfred@terracard ~$ cat /proc/1/cmdline | xargs -0
> /gnu/store/bhynhk0c6ssq3fqqc59fvhxjzwywsjbb-guile-3.0.9/bin/guile --no-auto-compile /gnu/store/39li5qpiaj1lx89xgahlbgvfnjhpcpwg-shepherd-0.10.4/bin/shepherd --config /gnu/store/hfyri6ygfdjq4w3nkha2ypa2k98hhfxj-shepherd.conf
>
> I see now that 0.10.5 was released a few weeks ago, does that have a fix that could be related?

Yes, it could be related. Per the ‘NEWS’ file of Shepherd:

** ‘herd unload root SERVICE’ no longer hands when there’s a replacement

It used to be that, for a running service S that has a replacement registered,
‘herd unload root S’ would hang shepherd, making it totally unresponsive—‘herd
status’, ‘halt’, etc. would hang forever, and inetd-style services would no
longer start, etc. This is now fixed.

Depending on previous ‘guix system reconfigure’ invocations on these
machines, it’s possible that you ended up in this state.

Would be great if you could upgrade and see if the problem still occurs.

Thanks,
Ludo’.
J
J
Jonathan Frederickson wrote on 25 Jul 02:08 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 72166@debbugs.gnu.org)
5477099d-fbc5-4acd-8320-f88ed3107de7@app.fastmail.com
On Mon, Jul 22, 2024, at 3:14 AM, Ludovic Courtès wrote:
Toggle quote (29 lines)
> Hi,
>
> "Jonathan Frederickson" <jonathan@terracrypt.net> skribis:
>
> > Hi Ludo, thanks for the troubleshooting help. Looks like I'm running 0.10.4:
> >
> > jfred@terracard ~$ cat /proc/1/cmdline | xargs -0
> > /gnu/store/bhynhk0c6ssq3fqqc59fvhxjzwywsjbb-guile-3.0.9/bin/guile --no-auto-compile /gnu/store/39li5qpiaj1lx89xgahlbgvfnjhpcpwg-shepherd-0.10.4/bin/shepherd --config /gnu/store/hfyri6ygfdjq4w3nkha2ypa2k98hhfxj-shepherd.conf
> >
> > I see now that 0.10.5 was released a few weeks ago, does that have a fix that could be related?
>
> Yes, it could be related. Per the ‘NEWS’ file of Shepherd:
>
> ** ‘herd unload root SERVICE’ no longer hands when there’s a replacement
> (<https://issues.guix.gnu.org/71478>)
>
> It used to be that, for a running service S that has a replacement registered,
> ‘herd unload root S’ would hang shepherd, making it totally unresponsive—‘herd
> status’, ‘halt’, etc. would hang forever, and inetd-style services would no
> longer start, etc. This is now fixed.
>
> Depending on previous ‘guix system reconfigure’ invocations on these
> machines, it’s possible that you ended up in this state.
>
> Would be great if you could upgrade and see if the problem still occurs.
>
> Thanks,
> Ludo’.

I've gotten this machine upgraded to 0.10.5 and just experienced the same thing again:

jfred@terracard ~$ ps aux | grep swaylo
jfred 544 0.0 0.0 3700 2432 ? S 19:02 0:00 swayidle -w timeout 300 swaylock -f -i ~/.wallpapers/user-manual.jpg timeout 10 if pgrep swaylock; then swaymsg "output * dpms off"; fi resume swaymsg "output * dpms on" before-sleep swaylock -f -i ~/.wallpapers/user-manual.jpg
jfred 1956 0.0 0.0 0 0 ? Z 19:22 0:00 [swaylock] <defunct>
jfred 1957 0.0 0.0 0 0 ? Zs 19:22 0:00 [swaylock] <defunct>
jfred 2162 0.0 0.0 0 0 ? Z 19:38 0:00 [swaylock] <defunct>
jfred 2163 0.0 0.0 0 0 ? Zs 19:38 0:00 [swaylock] <defunct>
jfred 2604 0.0 0.0 6116 2432 pts/2 S+ 20:04 0:00 grep --color=auto swaylo
jfred@terracard ~$ cat /proc/1/cmdline
/gnu/store/bhynhk0c6ssq3fqqc59fvhxjzwywsjbb-guile-3.0.9/bin/guile--no-auto-compile/gnu/store/wrmyav254ydjn9cad3q169fxg7x6p80b-shepherd-0.10.5/bin/shepherd--config/gnu/store/sfjww12mipyx4zxa6i9x8nxxfyb7h3y4-shepherd.conf

Of note, I haven't run 'guix system reconfigure' or any manual 'herd' commands on this machine since boot.
L
L
Ludovic Courtès wrote on 16 Aug 18:12 +0200
(name . Jonathan Frederickson)(address . jonathan@terracrypt.net)(address . 72166@debbugs.gnu.org)
877ccg78gy.fsf@gnu.org
Hi,

"Jonathan Frederickson" <jonathan@terracrypt.net> skribis:

Toggle quote (2 lines)
> I've gotten this machine upgraded to 0.10.5 and just experienced the same thing again:

Argh.

Toggle quote (12 lines)
> jfred@terracard ~$ ps aux | grep swaylo
> jfred 544 0.0 0.0 3700 2432 ? S 19:02 0:00 swayidle -w timeout 300 swaylock -f -i ~/.wallpapers/user-manual.jpg timeout 10 if pgrep swaylock; then swaymsg "output * dpms off"; fi resume swaymsg "output * dpms on" before-sleep swaylock -f -i ~/.wallpapers/user-manual.jpg
> jfred 1956 0.0 0.0 0 0 ? Z 19:22 0:00 [swaylock] <defunct>
> jfred 1957 0.0 0.0 0 0 ? Zs 19:22 0:00 [swaylock] <defunct>
> jfred 2162 0.0 0.0 0 0 ? Z 19:38 0:00 [swaylock] <defunct>
> jfred 2163 0.0 0.0 0 0 ? Zs 19:38 0:00 [swaylock] <defunct>
> jfred 2604 0.0 0.0 6116 2432 pts/2 S+ 20:04 0:00 grep --color=auto swaylo
> jfred@terracard ~$ cat /proc/1/cmdline
> /gnu/store/bhynhk0c6ssq3fqqc59fvhxjzwywsjbb-guile-3.0.9/bin/guile--no-auto-compile/gnu/store/wrmyav254ydjn9cad3q169fxg7x6p80b-shepherd-0.10.5/bin/shepherd--config/gnu/store/sfjww12mipyx4zxa6i9x8nxxfyb7h3y4-shepherd.conf
>
> Of note, I haven't run 'guix system reconfigure' or any manual 'herd' commands on this machine since boot.

Could you share (maybe privately) the relevant excerpt of
/var/log/messages?

Could you also share (ideally) a minimum Guix System config and a
sequence of commands to reproduce it?

Are you able to reproduce it in ‘guix system vm’?

Thanks,
Ludo’.
J
J
Jonathan Frederickson wrote on 19 Aug 00:54 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 72166@debbugs.gnu.org)
b559c5a9-1c87-43f7-a294-b5248e117b2d@app.fastmail.com
On Fri, Aug 16, 2024, at 12:12 PM, Ludovic Courtès wrote:
Toggle quote (8 lines)
> Could you share (maybe privately) the relevant excerpt of
> /var/log/messages?
>
> Could you also share (ideally) a minimum Guix System config and a
> sequence of commands to reproduce it?
>
> Are you able to reproduce it in ‘guix system vm’?

I'll send some logs to you privately; I don't actually see any logs that look relevant in /var/log/messages but maybe you'll spot something I haven't.

I'll try to come up with a reproduction and a minimal system config to reproduce the issue, though I will say that I haven't yet tracked down the exact situation that triggers it.

(Perhaps notably, both this machine running Guix System and another machine running Guix Home on a foreign distro where I've experienced the same issue are running Sway, and I think swaylock may be related somehow.)

I'll also try reproducing it with 'guix system vm' if I can, though that may be challenging as this machine only has 4 GB of RAM.
Attachment: file
J
J
Jonathan Frederickson wrote on 19 Aug 22:01 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 72166@debbugs.gnu.org)
8f52c269-dea8-4856-aa6e-740d99cc72e6@app.fastmail.com
No news yet on reproduction, but I did spot something else interesting when running 'guix home switch-generation' on one of my other machines that's been having similar issues:

Toggle quote (9 lines)
> SLoading /gnu/store/b7sbff937mm32lwv1a0dbh4mqhwkvpfn-shepherd.conf.
> herd: error: exception caught while executing 'load' on service 'root':
> In procedure fport_write: Input/output error
> Comparing /gnu/store/69shsgavrzkb408hvrfi4yjs26z7n112-home/profile/share/fonts and
> /gnu/store/dn7kk3d6hhmgjm89y2p3mrlwz0dgkzpy-home/profile/share/fonts... done (same)
> Evaluating on-change gexps.
>
> On-change gexps evaluation finished.

This is another machine whose user shepherd instance is currently in this unresponsive state.
L
L
Ludovic Courtès wrote on 22 Aug 11:35 +0200
(name . Jonathan Frederickson)(address . jonathan@terracrypt.net)(address . 72166@debbugs.gnu.org)
87zfp4nbn0.fsf@gnu.org
Hi,

You wrote:

Toggle quote (4 lines)
> The machine in question does not have a battery-backed RTC, so it
> loses time when it loses power, but notice the time changes once ntpd
> starts up.

So you’re hitting the Fibers bug described here:


There is still no fix upstream unfortunately, but I hope we’ll get there
in time for Shepherd 1.0.

The only workaround so far is to build shepherd against
‘guile-fibers-1.1’, which is what is done in Guix on AArch64 and RISC-V
since these platforms are likely to be used with single-board computers
lacking a battery-backed RTC.

Thanks,
Ludo’.
J
J
Jonathan Frederickson wrote on 22 Aug 19:52 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 72166@debbugs.gnu.org)
6f4711b4-3092-4bbe-91d9-8cb2ce44ed1b@terracrypt.net
Aug 22, 2024 05:35:47 Ludovic Courtès <ludo@gnu.org>:

Toggle quote (22 lines)
> Hi,
>
> You wrote:
>
>> The machine in question does not have a battery-backed RTC, so it
>> loses time when it loses power, but notice the time changes once ntpd
>> starts up.
>
> So you’re hitting the Fibers bug described here:
>
>   https://issues.guix.gnu.org/70848
>
> There is still no fix upstream unfortunately, but I hope we’ll get there
> in time for Shepherd 1.0.
>
> The only workaround so far is to build shepherd against
> ‘guile-fibers-1.1’, which is what is done in Guix on AArch64 and RISC-V
> since these platforms are likely to be used with single-board computers
> lacking a battery-backed RTC.
>
> Thanks,
> Ludo’.
Ah! Yes, that seems likely.

I've also run into what looks like the same thing on a laptop that does have an RTC, but it also has an NTP daemon running on it. (I have also noticed that this is most common on that laptop after a suspend/resume cycle, so maybe that's triggering the bug as well?)

I'll check to see if I've recently had an NTP sync the next time this happens, because that would be a definitive confirmation that I'm running into that bug.
Attachment: file
L
L
Ludovic Courtès wrote on 4 Sep 11:24 +0200
(name . Jonathan Frederickson)(address . jonathan@terracrypt.net)(address . 72166@debbugs.gnu.org)
87bk13g49z.fsf@gnu.org
Hi,

Jonathan Frederickson <jonathan@terracrypt.net> skribis:

Toggle quote (2 lines)
> I've also run into what looks like the same thing on a laptop that does have an RTC, but it also has an NTP daemon running on it. (I have also noticed that this is most common on that laptop after a suspend/resume cycle, so maybe that's triggering the bug as well?)

Suspend/resume does not trigger the bug no (fortunately), but a clock
jump due to NTP sync can cause it (specifically, shepherd will consume
CPU time proportional to the drift).

Ludo’.
?
Your comment

Commenting via the web interface is currently disabled.

To comment on this conversation send an email to 72166@debbugs.gnu.org

To respond to this issue using the mumi CLI, first switch to it
mumi current 72166
Then, you may apply the latest patchset in this issue (with sign off)
mumi am -- -s
Or, compose a reply to this issue
mumi compose
Or, send patches to this issue
mumi send-email *.patch