Shepherd becomes unresponsive after an interrupt

Details
3 participants
  • Ludovic Courtès
  • Mathieu Othacehe
Owner
unassigned
Submitted by
Mathieu Othacehe
Severity
important
Mathieu Othacehe wrote on 31 Oct 2022 13:44
(address . bug-guix@gnu.org)
87a65cgo1t.fsf@gnu.org
Hello,

When running the following command:

sudo herd restart service-that-hangs-upon-restart

then hitting C-c, Shepherd becomes totally unresponsive:

sudo herd status

and all further Shepherd commands hang forever. I was able to reproduce
it in two different configurations:

1. On my laptop with a Wireguard service trying to reach a non-existing
DNS server.

(service wireguard-service-type
         (wireguard-configuration
          (addresses (list "10.0.0.2/24"))
          (dns '("10.0.0.50"))))  ; does not exist

2. On Berlin, while trying to restart nginx.

In both situations, the "reboot" command was also hanging.

Thanks,

Mathieu
Mathieu Othacehe wrote on 31 Oct 2022 14:36
control message for bug #53214
(address . control@debbugs.gnu.org)
87tu3km7xf.fsf@meije.mail-host-address-is-not-set
block 53214 by 58926
quit
Mathieu Othacehe wrote on 31 Oct 2022 14:37
control message for bug #58926
(address . control@debbugs.gnu.org)
87r0yom7w9.fsf@meije.mail-host-address-is-not-set
severity 58926 important
quit
Ludovic Courtès wrote on 10 Nov 2022 10:59
Re: bug#58926: Shepherd becomes unresponsive after an interrupt
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 58926@debbugs.gnu.org)
87wn83jfk4.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> wrote:

> sudo herd restart service-that-hangs-upon-restart
>
>
> then hitting C-c, Shepherd becomes totally unresponsive:
>
> sudo herd status
>
>
> and all further Shepherd commands hang forever. I was able to reproduce
> it in two different configurations:
>
> 1. On my laptop with a Wireguard service trying to reach a non-existing
> DNS server.
>
> (service wireguard-service-type
>          (wireguard-configuration
>           (addresses (list "10.0.0.2/24"))
>           (dns '("10.0.0.50"))))  ; does not exist
>
> 2. On Berlin, while trying to restart nginx.

I experienced case #2: in that case ‘strace -p1’ showed that shepherd
was stuck on waitpid of the nginx process, which was not terminating.
Killing that process would unlock shepherd.
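
To make the symptom concrete, here is a minimal Python sketch of the same
situation. It is only an analogy: Shepherd is written in Guile, and the
function name, the "sleep" child and the timer that plays the role of the
administrator are all assumptions made up for illustration. A
single-threaded caller that sits in a blocking waitpid can do nothing else
until the child goes away, and killing the child is what releases it.

```python
import os
import signal
import threading
import time


def blocked_until_child_dies(child_seconds: int = 30, kill_after: float = 0.2):
    """Block in waitpid() on a long-running child until something kills it.

    Returns (seconds spent blocked, True if the child died from a signal).
    """
    # Hypothetical stand-in for a service process that does not terminate.
    pid = os.spawnvp(os.P_NOWAIT, "sleep", ["sleep", str(child_seconds)])

    # Stand-in for an administrator killing the stuck process by hand.
    killer = threading.Timer(kill_after, os.kill, args=(pid, signal.SIGKILL))
    killer.start()

    start = time.monotonic()
    # Blocking wait: this thread cannot answer any request until it returns.
    _, status = os.waitpid(pid, 0)
    blocked_for = time.monotonic() - start

    killer.join()
    return blocked_for, os.WIFSIGNALED(status)
```

Without the timer, the call would stay blocked for the full lifetime of the
child, which is the hang described above.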


Would be good to see what’s up with WireGuard.

Ludo’.
Mathieu Othacehe wrote on 12 Nov 2022 09:36
control message for bug #58926
(address . control@debbugs.gnu.org)
87leog603u.fsf@meije.mail-host-address-is-not-set
merge 58926 56674
quit
Ludovic Courtès wrote on 12 Nov 2022 19:10
Re: bug#58926: Shepherd becomes unresponsive after an interrupt
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
878rkgcabz.fsf@gnu.org
Mathieu Othacehe <othacehe@gnu.org> wrote:

> 1. On my laptop with a Wireguard service trying to reach a non-existing
> DNS server.
>
> (service wireguard-service-type
>          (wireguard-configuration
>           (addresses (list "10.0.0.2/24"))
>           (dns '("10.0.0.50"))))  ; does not exist

This one is similar to:

https://issues.guix.gnu.org/53225
https://issues.guix.gnu.org/53381

It has to do with the fact that “wg-quick up” blocks until it succeeds
and that ‘invoke’ gets stuck on ‘waitpid’ until the “wg-quick” process
terminates.

The solution will be to use something non-blocking instead of ‘invoke’;
I’m looking into it.
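
As a rough illustration of what "non-blocking" means here, the following
Python sketch polls the child with WNOHANG between units of other work
instead of sitting in a blocking wait. This is a generic technique and not
necessarily what Shepherd ends up doing; the function name, the "sleep"
child and the request counter are hypothetical.

```python
import os
import time


def poll_child_while_serving(child_seconds: int = 1, tick: float = 0.1):
    """Keep doing other work while a child runs, reaping it without blocking.

    Returns (number of work units done meanwhile, child's exit code).
    """
    # Hypothetical stand-in for a slow command such as a blocking "up" script.
    pid = os.spawnvp(os.P_NOWAIT, "sleep", ["sleep", str(child_seconds)])

    requests_served = 0
    while True:
        # WNOHANG makes waitpid return (0, 0) at once if the child still runs.
        done_pid, status = os.waitpid(pid, os.WNOHANG)
        if done_pid == pid:
            return requests_served, os.WEXITSTATUS(status)
        # Stand-in for answering a client request in the meantime.
        requests_served += 1
        time.sleep(tick)
```

The caller stays responsive for the whole lifetime of the child, at the
cost of checking on it periodically.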

Ludo’.
Ludovic Courtès wrote on 12 Nov 2022 19:28
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 58926@debbugs.gnu.org)
874jv4c9im.fsf@gnu.org
Mathieu Othacehe <othacehe@gnu.org> wrote:

> then hitting C-c, Shepherd becomes totally unresponsive:
>
> sudo herd status
>
>
> and all further Shepherd commands hang forever. I was able to reproduce
> it in two different configurations:

[...]

> 2. On Berlin, while trying to restart nginx.

I can’t reproduce it in a VM.

Before I try it on a production system :-), does anyone have a tip on
how to reproduce it? Or perhaps strace output from a system that
exhibits this bug?

TIA!

Ludo’.
Ludovic Courtès wrote on 14 Nov 2022 17:32
(address . 56674@debbugs.gnu.org)
87wn7xo5ss.fsf_-_@gnu.org
Hello!

Ludovic Courtès <ludo@gnu.org> wrote:

> These fresh Shepherd commits install a non-blocking ‘system*’ replacement:
>
> 975b0aa service: Provide a non-blocking replacement of 'system*'.
> 039c7a8 service: Spawn a fiber responsible for process monitoring.
>
> We’ll have to do more testing and probably go for a 0.9.3 release soon.

Shepherd commit ada88074f0ab7551fd0f3dce8bf06de971382e79 passes my
tests. It definitely solves the wireguard example and similar things
(uses of ‘system*’ in service constructors/destructors); I can’t tell
for sure about nginx because I haven’t been able to reproduce it in a
VM. I’m interested in ways to reproduce it.
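
For readers unfamiliar with the idea of a dedicated fiber for process
monitoring, here is a loose Python analogy using asyncio tasks in place of
Guile fibers. It is a sketch under stated assumptions and not the code from
the commits mentioned above: run_command and daemon_loop are hypothetical
names, and the "sleep" child stands in for a slow command run from a
service constructor or destructor.

```python
import asyncio


async def run_command(*argv: str) -> int:
    """Run a command and return its exit code without blocking the loop."""
    proc = await asyncio.create_subprocess_exec(*argv)
    # Awaiting suspends only this task; other tasks keep running.
    return await proc.wait()


async def daemon_loop(child_seconds: int = 1, tick: float = 0.1):
    """Serve requests while a separate task monitors the child process.

    Returns (number of requests served meanwhile, child's exit code).
    """
    monitor = asyncio.create_task(run_command("sleep", str(child_seconds)))

    requests_served = 0
    while not monitor.done():
        # Stand-in for answering a client request while the child runs.
        requests_served += 1
        await asyncio.sleep(tick)

    return requests_served, monitor.result()
```

Running it with asyncio.run(daemon_loop()) shows the main loop making
progress while the child is still alive, which is the property the blocking
variant lacked.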

It does look like we could go with 0.9.3 real soon now.

Ludo’.
Ludovic Courtès wrote on 17 Nov 2022 11:23
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87a64pkhgy.fsf@gnu.org
Hi,

Ludovic Courtès <ludo@gnu.org> wrote:

> Mathieu Othacehe <othacehe@gnu.org> wrote:
>
>> 1. On my laptop with a Wireguard service trying to reach a non-existing
>> DNS server.
>>
>> (service wireguard-service-type
>>          (wireguard-configuration
>>           (addresses (list "10.0.0.2/24"))
>>           (dns '("10.0.0.50"))))  ; does not exist
>
> This one is similar to:
>
> https://issues.guix.gnu.org/53225
> https://issues.guix.gnu.org/53381
>
> It has to do with the fact that “wg-quick up” blocks until it succeeds
> and that ‘invoke’ gets stuck on ‘waitpid’ until the “wg-quick” process
> terminates.
>
> The solution will be to use something non-blocking instead of ‘invoke’;
> I’m looking into it.

This is fixed in Shepherd 0.9.3, which landed in Guix commit
283d7318c5b312d7129adb6dbeea6ad205ce89d1.

As I wrote, I’m not sure whether it fixes the nginx situation since I
could not reproduce it. I’m closing this issue; let’s open a new one
specifically for nginx if it comes up again with 0.9.3.

Thanks,
Ludo’.
Closed