Shepherd periodically goes unresponsive on one of my machines

Open

Details

2 participants

Jonathan Frederickson
Ludovic Courtès

Owner: unassigned

Submitted by: Jonathan Frederickson

Severity: normal

Debbugs page

Jonathan Frederickson wrote 8 months ago

Recipients:(address . bug-guix@gnu.org)

Message-ID:df6e8894-fd84-446f-a67f-50cdcc9de5b3@app.fastmail.com

I've been running into an issue with Shepherd on one of my machines. Every so often (and I haven't figured out what conditions trigger it), my Shepherd instances (both home and PID 1) will go unresponsive. I thought I had tracked it down to a misbehaving home service that I had configured, but it's just happened again without that service running.

'herd status' hangs indefinitely:

jfred@terracard ~$ sudo herd status

Password:

...on both instances:

jfred@terracard ~$ herd status

The PID 1 shepherd instance isn't reaping defunct processes:

jfred@terracard ~$ ps aux | grep -i lock

jfred 541 0.0 0.0 3700 2304 ? S 18:30 0:00 swayidle -w timeout 300 swaylock -f -i ~/.wallpapers/user-manual.jpg timeout 10 if pgrep swaylock; then swaymsg "output * dpms off"; fi resume swaymsg "output * dpms on" before-sleep swaylock -f -i ~/.wallpapers/user-manual.jpg

jfred 3111 0.0 0.0 0 0 ? Z 18:53 0:00 [swaylock] <defunct>

jfred 3112 0.0 0.0 0 0 ? Zs 18:53 0:00 [swaylock] <defunct>

Some further troubleshooting... strace indicates that it's waiting on a read() on its fd 9:

jfred@terracard ~ [env]$ sudo strace -fp 1

Password:

strace: Process 1 attached with 5 threads

[pid 144] read(9, <unfinished ...>

[pid 142] futex(0x7fa43892abe8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>

[pid 141] futex(0x7fa43892abe8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>

[pid 140] futex(0x7fa43892abe8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY^

...which seems to be:

jfred@terracard ~ [env]$ sudo ls -l /proc/1/fd/9

lr-x------ 1 root root 64 Jul 17 20:39 /proc/1/fd/9 -> 'pipe:[4015]'

jfred@terracard ~ [env]$ sudo lsof -n | grep 4015

lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc

Output information may be incomplete.

shepherd 1 root 9r FIFO 0,15 0t0 4015 pipe

shepherd 1 root 11w FIFO 0,15 0t0 4015 pipe

shepherd 1 140 GC-marker root 9r FIFO 0,15 0t0 4015 pipe

shepherd 1 140 GC-marker root 11w FIFO 0,15 0t0 4015 pipe

shepherd 1 141 GC-marker root 9r FIFO 0,15 0t0 4015 pipe

shepherd 1 141 GC-marker root 11w FIFO 0,15 0t0 4015 pipe

shepherd 1 142 GC-marker root 9r FIFO 0,15 0t0 4015 pipe

shepherd 1 142 GC-marker root 11w FIFO 0,15 0t0 4015 pipe

shepherd 1 144 shepherd root 9r FIFO 0,15 0t0 4015 pipe

shepherd 1 144 shepherd root 11w FIFO 0,15 0t0 4015 pipe

My system configuration for this machine can be found here, and I last ran a 'guix pull' on June 21: https://github.com/jfrederickson/dotfiles/blob/master/guix/guix/system/machines/terracard/config.scm

Has anyone else run into this?

Ludovic Courtès wrote 8 months ago

Recipients:(name . Jonathan Frederickson)(address . jonathan@terracrypt.net)(address . 72166@debbugs.gnu.org)

Message-ID:878qxxtmwu.fsf@gnu.org

Hi Jonathan,

"Jonathan Frederickson" <jonathan@terracrypt.net> skribis:

Toggle quote (13 lines)> I've been running into an issue with Shepherd on one of my machines. Every so often (and I haven't figured out what conditions trigger it), my Shepherd instances (both home and PID 1) will go unresponsive. I thought I had tracked it down to a misbehaving home service that I had configured, but it's just happened again without that service running.
>
> 'herd status' hangs indefinitely:
>
> jfred@terracard ~$ sudo herd status
> Password: 
> <never returns>
>
> ...on both instances:
>
> jfred@terracard ~$ herd status
> <never returns>

Ouch. What version of shepherd is running? (You can view it with

“cat /proc/1/cmdline | xargs -0”.)

Toggle quote (9 lines)

> The PID 1 shepherd instance isn't reaping defunct processes:

> jfred@terracard ~$ ps aux | grep -i lock

> jfred 541 0.0 0.0 3700 2304 ? S 18:30 0:00 swayidle -w timeout 300 swaylock -f -i ~/.wallpapers/user-manual.jpg timeout 10 if pgrep swaylock; then swaymsg "output * dpms off"; fi resume swaymsg "output * dpms on" before-sleep swaylock -f -i ~/.wallpapers/user-manual.jpg

> jfred 3111 0.0 0.0 0 0 ? Z 18:53 0:00 [swaylock] <defunct>

> jfred 3112 0.0 0.0 0 0 ? Zs 18:53 0:00 [swaylock] <defunct>

> Some further troubleshooting... strace indicates that it's waiting on a read() on its fd 9:

Interesting. There were bugs in earlier 0.10.x version that could cause

this sort of thing; let’s see what version you have, first.

Ludo’.

Jonathan Frederickson wrote 8 months ago

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 72166@debbugs.gnu.org)

Message-ID:7974c622-e7d8-48b3-9948-14e8d7654793@app.fastmail.com

On Fri, Jul 19, 2024, at 11:35 AM, Ludovic Courtès wrote:

Toggle quote (35 lines)> Hi Jonathan,
> 
> "Jonathan Frederickson" <jonathan@terracrypt.net> skribis:
> 
> > I've been running into an issue with Shepherd on one of my machines. Every so often (and I haven't figured out what conditions trigger it), my Shepherd instances (both home and PID 1) will go unresponsive. I thought I had tracked it down to a misbehaving home service that I had configured, but it's just happened again without that service running.
> >
> > 'herd status' hangs indefinitely:
> >
> > jfred@terracard ~$ sudo herd status
> > Password: 
> > <never returns>
> >
> > ...on both instances:
> >
> > jfred@terracard ~$ herd status
> > <never returns>
> 
> Ouch.  What version of shepherd is running?  (You can view it with
> “cat /proc/1/cmdline | xargs -0”.)
> 
> > The PID 1 shepherd instance isn't reaping defunct processes:
> >
> > jfred@terracard ~$ ps aux | grep -i lock
> > jfred      541  0.0  0.0   3700  2304 ?        S    18:30   0:00 swayidle -w timeout 300 swaylock -f -i ~/.wallpapers/user-manual.jpg timeout 10 if pgrep swaylock; then swaymsg "output * dpms off"; fi resume swaymsg "output * dpms on" before-sleep swaylock -f -i ~/.wallpapers/user-manual.jpg
> > jfred     3111  0.0  0.0      0     0 ?        Z    18:53   0:00 [swaylock] <defunct>
> > jfred     3112  0.0  0.0      0     0 ?        Zs   18:53   0:00 [swaylock] <defunct>
> >
> > Some further troubleshooting... strace indicates that it's waiting on a read() on its fd 9:
> 
> Interesting.  There were bugs in earlier 0.10.x version that could cause
> this sort of thing; let’s see what version you have, first.
> 
> Ludo’.
> 

Hi Ludo, thanks for the troubleshooting help. Looks like I'm running 0.10.4:

jfred@terracard ~$ cat /proc/1/cmdline | xargs -0

/gnu/store/bhynhk0c6ssq3fqqc59fvhxjzwywsjbb-guile-3.0.9/bin/guile --no-auto-compile /gnu/store/39li5qpiaj1lx89xgahlbgvfnjhpcpwg-shepherd-0.10.4/bin/shepherd --config /gnu/store/hfyri6ygfdjq4w3nkha2ypa2k98hhfxj-shepherd.conf

I see now that 0.10.5 was released a few weeks ago, does that have a fix that could be related?

Ludovic Courtès wrote 8 months ago

Recipients:(name . Jonathan Frederickson)(address . jonathan@terracrypt.net)(address . 72166@debbugs.gnu.org)

Message-ID:87zfq9kiei.fsf@gnu.org

Hi,

"Jonathan Frederickson" <jonathan@terracrypt.net> skribis:

Toggle quote (7 lines)

> Hi Ludo, thanks for the troubleshooting help. Looks like I'm running 0.10.4:

> jfred@terracard ~$ cat /proc/1/cmdline | xargs -0

> /gnu/store/bhynhk0c6ssq3fqqc59fvhxjzwywsjbb-guile-3.0.9/bin/guile --no-auto-compile /gnu/store/39li5qpiaj1lx89xgahlbgvfnjhpcpwg-shepherd-0.10.4/bin/shepherd --config /gnu/store/hfyri6ygfdjq4w3nkha2ypa2k98hhfxj-shepherd.conf

> I see now that 0.10.5 was released a few weeks ago, does that have a fix that could be related?

Yes, it could be related. Per the ‘NEWS’ file of Shepherd:

** ‘herd unload root SERVICE’ no longer hands when there’s a replacement

(https://issues.guix.gnu.org/71478)

It used to be that, for a running service S that has a replacement registered,

‘herd unload root S’ would hang shepherd, making it totally unresponsive—‘herd

status’, ‘halt’, etc. would hang forever, and inetd-style services would no

longer start, etc. This is now fixed.

Depending on previous ‘guix system reconfigure’ invocations on these

machines, it’s possible that you ended up in this state.

Would be great if you could upgrade and see if the problem still occurs.

Thanks,

Ludo’.

Jonathan Frederickson wrote 8 months ago

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 72166@debbugs.gnu.org)

Message-ID:5477099d-fbc5-4acd-8320-f88ed3107de7@app.fastmail.com

On Mon, Jul 22, 2024, at 3:14 AM, Ludovic Courtès wrote:

Toggle quote (29 lines)> Hi,
> 
> "Jonathan Frederickson" <jonathan@terracrypt.net> skribis:
> 
> > Hi Ludo, thanks for the troubleshooting help. Looks like I'm running 0.10.4:
> >
> > jfred@terracard ~$ cat /proc/1/cmdline | xargs -0
> > /gnu/store/bhynhk0c6ssq3fqqc59fvhxjzwywsjbb-guile-3.0.9/bin/guile --no-auto-compile /gnu/store/39li5qpiaj1lx89xgahlbgvfnjhpcpwg-shepherd-0.10.4/bin/shepherd --config /gnu/store/hfyri6ygfdjq4w3nkha2ypa2k98hhfxj-shepherd.conf
> >
> > I see now that 0.10.5 was released a few weeks ago, does that have a fix that could be related?
> 
> Yes, it could be related.  Per the ‘NEWS’ file of Shepherd:
> 
>   ** ‘herd unload root SERVICE’ no longer hands when there’s a replacement
>      (<https://issues.guix.gnu.org/71478>)
> 
>   It used to be that, for a running service S that has a replacement registered,
>   ‘herd unload root S’ would hang shepherd, making it totally unresponsive—‘herd
>   status’, ‘halt’, etc. would hang forever, and inetd-style services would no
>   longer start, etc.  This is now fixed.
> 
> Depending on previous ‘guix system reconfigure’ invocations on these
> machines, it’s possible that you ended up in this state.
> 
> Would be great if you could upgrade and see if the problem still occurs.
> 
> Thanks,
> Ludo’.

I've gotten this machine upgraded to 0.10.5 and just experienced the same thing again:

jfred@terracard ~$ ps aux | grep swaylo
jfred      544  0.0  0.0   3700  2432 ?        S    19:02   0:00 swayidle -w timeout 300 swaylock -f -i ~/.wallpapers/user-manual.jpg timeout 10 if pgrep swaylock; then swaymsg "output * dpms off"; fi resume swaymsg "output * dpms on" before-sleep swaylock -f -i ~/.wallpapers/user-manual.jpg
jfred     1956  0.0  0.0      0     0 ?        Z    19:22   0:00 [swaylock] <defunct>
jfred     1957  0.0  0.0      0     0 ?        Zs   19:22   0:00 [swaylock] <defunct>
jfred     2162  0.0  0.0      0     0 ?        Z    19:38   0:00 [swaylock] <defunct>
jfred     2163  0.0  0.0      0     0 ?        Zs   19:38   0:00 [swaylock] <defunct>
jfred     2604  0.0  0.0   6116  2432 pts/2    S+   20:04   0:00 grep --color=auto swaylo
jfred@terracard ~$ cat /proc/1/cmdline 
/gnu/store/bhynhk0c6ssq3fqqc59fvhxjzwywsjbb-guile-3.0.9/bin/guile--no-auto-compile/gnu/store/wrmyav254ydjn9cad3q169fxg7x6p80b-shepherd-0.10.5/bin/shepherd--config/gnu/store/sfjww12mipyx4zxa6i9x8nxxfyb7h3y4-shepherd.conf

Of note, I haven't run 'guix system reconfigure' or any manual 'herd' commands on this machine since boot.

Ludovic Courtès wrote 7 months ago

Recipients:(name . Jonathan Frederickson)(address . jonathan@terracrypt.net)(address . 72166@debbugs.gnu.org)

Message-ID:877ccg78gy.fsf@gnu.org

Hi,

"Jonathan Frederickson" <jonathan@terracrypt.net> skribis:

Toggle quote (2 lines)

> I've gotten this machine upgraded to 0.10.5 and just experienced the same thing again:

Argh.

Toggle quote (12 lines)> jfred@terracard ~$ ps aux | grep swaylo
> jfred      544  0.0  0.0   3700  2432 ?        S    19:02   0:00 swayidle -w timeout 300 swaylock -f -i ~/.wallpapers/user-manual.jpg timeout 10 if pgrep swaylock; then swaymsg "output * dpms off"; fi resume swaymsg "output * dpms on" before-sleep swaylock -f -i ~/.wallpapers/user-manual.jpg
> jfred     1956  0.0  0.0      0     0 ?        Z    19:22   0:00 [swaylock] <defunct>
> jfred     1957  0.0  0.0      0     0 ?        Zs   19:22   0:00 [swaylock] <defunct>
> jfred     2162  0.0  0.0      0     0 ?        Z    19:38   0:00 [swaylock] <defunct>
> jfred     2163  0.0  0.0      0     0 ?        Zs   19:38   0:00 [swaylock] <defunct>
> jfred     2604  0.0  0.0   6116  2432 pts/2    S+   20:04   0:00 grep --color=auto swaylo
> jfred@terracard ~$ cat /proc/1/cmdline 
> /gnu/store/bhynhk0c6ssq3fqqc59fvhxjzwywsjbb-guile-3.0.9/bin/guile--no-auto-compile/gnu/store/wrmyav254ydjn9cad3q169fxg7x6p80b-shepherd-0.10.5/bin/shepherd--config/gnu/store/sfjww12mipyx4zxa6i9x8nxxfyb7h3y4-shepherd.conf
>
> Of note, I haven't run 'guix system reconfigure' or any manual 'herd' commands on this machine since boot.

Could you share (maybe privately) the relevant excerpt of

/var/log/messages?

Could you also share (ideally) a minimum Guix System config and a

sequence of commands to reproduce it?

Are you able to reproduce it in ‘guix system vm’?

Thanks,

Ludo’.

Jonathan Frederickson wrote 7 months ago

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 72166@debbugs.gnu.org)

Message-ID:b559c5a9-1c87-43f7-a294-b5248e117b2d@app.fastmail.com

On Fri, Aug 16, 2024, at 12:12 PM, Ludovic Courtès wrote:

Toggle quote (8 lines)

> Could you share (maybe privately) the relevant excerpt of

> /var/log/messages?

> Could you also share (ideally) a minimum Guix System config and a

> sequence of commands to reproduce it?

> Are you able to reproduce it in ‘guix system vm’?

I'll send some logs to you privately; I don't actually see any logs that look relevant in /var/log/messages but maybe you'll spot something I haven't.

I'll try to come up with a reproduction and a minimal system config to reproduce the issue, though I will say that I haven't yet tracked down the exact situation that triggers it.

(Perhaps notably, both this machine running Guix System and another machine running Guix Home on a foreign distro where I've experienced the same issue are running Sway, and I think swaylock may be related somehow.)

I'll also try reproducing it with 'guix system vm' if I can, though that may be challenging as this machine only has 4 GB of RAM.

Attachment: file

Jonathan Frederickson wrote 7 months ago

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 72166@debbugs.gnu.org)

Message-ID:8f52c269-dea8-4856-aa6e-740d99cc72e6@app.fastmail.com

No news yet on reproduction, but I did spot something else interesting when running 'guix home switch-generation' on one of my other machines that's been having similar issues:

Toggle quote (9 lines)

> SLoading /gnu/store/b7sbff937mm32lwv1a0dbh4mqhwkvpfn-shepherd.conf.

> herd: error: exception caught while executing 'load' on service 'root':

> In procedure fport_write: Input/output error

> Comparing /gnu/store/69shsgavrzkb408hvrfi4yjs26z7n112-home/profile/share/fonts and

> /gnu/store/dn7kk3d6hhmgjm89y2p3mrlwz0dgkzpy-home/profile/share/fonts... done (same)

> Evaluating on-change gexps.

> On-change gexps evaluation finished.

This is another machine whose user shepherd instance is currently in this unresponsive state.

Ludovic Courtès wrote 7 months ago

Recipients:(name . Jonathan Frederickson)(address . jonathan@terracrypt.net)(address . 72166@debbugs.gnu.org)

Message-ID:87zfp4nbn0.fsf@gnu.org

Hi,

You wrote:

Toggle quote (4 lines)

> The machine in question does not have a battery-backed RTC, so it

> loses time when it loses power, but notice the time changes once ntpd

> starts up.

So you’re hitting the Fibers bug described here:

https://issues.guix.gnu.org/70848

There is still no fix upstream unfortunately, but I hope we’ll get there

in time for Shepherd 1.0.

The only workaround so far is to build shepherd against

‘guile-fibers-1.1’, which is what is done in Guix on AArch64 and RISC-V

since these platforms are likely to be used with single-board computers

lacking a battery-backed RTC.

Thanks,

Ludo’.

Jonathan Frederickson wrote 7 months ago

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 72166@debbugs.gnu.org)

Message-ID:6f4711b4-3092-4bbe-91d9-8cb2ce44ed1b@terracrypt.net

Aug 22, 2024 05:35:47 Ludovic Courtès <ludo@gnu.org>:

Toggle quote (22 lines)> Hi,
> 
> You wrote:
> 
>> The machine in question does not have a battery-backed RTC, so it
>> loses time when it loses power, but notice the time changes once ntpd
>> starts up.
> 
> So you’re hitting the Fibers bug described here:
> 
>   https://issues.guix.gnu.org/70848
> 
> There is still no fix upstream unfortunately, but I hope we’ll get there
> in time for Shepherd 1.0.
> 
> The only workaround so far is to build shepherd against
> ‘guile-fibers-1.1’, which is what is done in Guix on AArch64 and RISC-V
> since these platforms are likely to be used with single-board computers
> lacking a battery-backed RTC.
> 
> Thanks,
> Ludo’.

Ah! Yes, that seems likely.

I've also run into what looks like the same thing on a laptop that does have an RTC, but it also has an NTP daemon running on it. (I have also noticed that this is most common on that laptop after a suspend/resume cycle, so maybe that's triggering the bug as well?)

I'll check to see if I've recently had an NTP sync the next time this happens, because that would be a definitive confirmation that I'm running into that bug.

Attachment: file

Ludovic Courtès wrote 6 months ago

Recipients:(name . Jonathan Frederickson)(address . jonathan@terracrypt.net)(address . 72166@debbugs.gnu.org)

Message-ID:87bk13g49z.fsf@gnu.org

Hi,

Jonathan Frederickson <jonathan@terracrypt.net> skribis:

Toggle quote (2 lines)

> I've also run into what looks like the same thing on a laptop that does have an RTC, but it also has an NTP daemon running on it. (I have also noticed that this is most common on that laptop after a suspend/resume cycle, so maybe that's triggering the bug as well?)

Suspend/resume does not trigger the bug no (fortunately), but a clock

jump due to NTP sync can cause it (specifically, shepherd will consume

CPU time proportional to the drift).

Ludo’.

Your comment

Commenting via the web interface is currently disabled.

To comment on this conversation send an email to 72166@debbugs.gnu.org

To respond to this issue using the mumi CLI, first switch to it

mumi current 72166

Then, you may apply the latest patchset in this issue (with sign off)

mumi am -- -s

Or, compose a reply to this issue

mumi compose

Or, send patches to this issue

mumi send-email *.patch

You may also tag this issue. See list of standard tags. For example, to set the confirmed and easy tags

mumi command -t +confirmed -t +easy

Or, remove the moreinfo tag and set the help tag

mumi command -t -moreinfo -t +help

is:open	open issues
is:done	closed issues
submitter:<who>	search issue submitter
author:<who>	search by message author
date:yesterday..now	search by issue date
mdate:3m..2d	search by message date