[Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks

Open. Submitted by Ludovic Courtès.
Details
2 participants
  • Ludovic Courtès
  • Maxime Devos
Owner
unassigned
Severity
important
Ludovic Courtès wrote on 20 Jul 23:39 +0200
[Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks
(address . bug-guix@gnu.org)
8735evpipv.fsf@inria.fr
Hi!

We’ve just had a bad experience with the nginx service on berlin, where
‘herd restart nginx’ would cause shepherd to get stuck forever in
‘waitpid’ on the process that was supposed to start nginx.

The details are unclear, but one thing is clear: using ‘waitpid’
(either directly or indirectly with ‘system*’, which is what
‘nginx-service-type’ does) is not great:

1. In the best case, shepherd (as of 0.9.1) is stuck while ‘system*’
is in ‘waitpid’ waiting for child process completion (“stuck” as
in: doesn’t do anything, not even answering ‘herd’ requests or
inetd connections.)

2. I don’t think that can happen with ‘system*’ (because it’s in C),
but generally speaking, there’s a possibility that shepherd’s event
loop will handle child process termination before some other
user-made ‘waitpid’ call does.
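
The race in item 2 can be reproduced outside Shepherd. The following is a minimal sketch in Python rather than Guile, and `reap_race` is a hypothetical name; the first `waitpid` call stands in for shepherd's event loop reaping any terminated child, and the second for service code that still expects to collect the same child.

```python
import os

def reap_race():
    """Show that a child reaped once cannot be waited for a second time."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately with a recognizable status.
        os._exit(7)

    # Stand-in for the event loop: reap whichever child terminated.
    reaped_pid, status = os.waitpid(-1, 0)

    # Stand-in for user code: the child is already gone, so the kernel
    # reports ECHILD, which Python raises as ChildProcessError.
    try:
        os.waitpid(pid, 0)
        lost = False
    except ChildProcessError:
        lost = True

    return reaped_pid == pid, os.WEXITSTATUS(status), lost
```

Whichever caller reaps first gets the exit status; the other one only gets an error and never learns how the process ended.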

Anyway, that’s a bad situation.

So I can think of several ways to address it:

1. Change the nginx service ‘stop’ method to just
(make-kill-destructor), which should work just as well as invoking
“nginx -s stop”.

2. Have Shepherd provide a replacement for ‘system*’.

Thoughts?

Ludo’.
Ludovic Courtès wrote on 20 Jul 23:43 +0200
control message for bug #56674
(address . control@debbugs.gnu.org)
87y1wno3yx.fsf@gnu.org
severity 56674 important
quit
Maxime Devos wrote on 21 Jul 01:48 +0200
Re: bug#56674: [Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks
c4045c06-2024-b49e-cee9-88dafd3612e6@telenet.be
On 20-07-2022 23:39, Ludovic Courtès wrote:
> Hi!
>
> We’ve just had a bad experience with the nginx service on berlin, where
> ‘herd restart nginx’ would cause shepherd to get stuck forever in
> ‘waitpid’ on the process that was supposed to start nginx.
>
> The details are unclear, but one thing is clear: using ‘waitpid’
> (either directly or indirectly with ‘system*’, which is what
> ‘nginx-service-type’ does) is not great:
>
> 1. In the best case, shepherd (as of 0.9.1) is stuck while ‘system*’
> is in ‘waitpid’ waiting for child process completion (“stuck” as
> in: doesn’t do anything, not even answering ‘herd’ requests or
> inetd connections.)
>
> 2. I don’t think that can happen with ‘system*’ (because it’s in C),
> but generally speaking, there’s a possibility that shepherd’s event
> loop will handle child process termination before some other
> user-made ‘waitpid’ call does.
>
> Anyway, that’s a bad situation.
>
> So I can think of several ways to address it:
>
> 1. Change the nginx service ‘stop’ method to just
> (make-kill-destructor), which should work just as well as invoking
> “nginx -s stop”.
>
> 2. Have Shepherd provide a replacement for ‘system*’.
Why Shepherd and not guile fibers? Is this a Shepherd-specific problem?
>
> Thoughts?
3. Make waitpid (or a variant that does what we need) interact well with
guile-fibers, like how 'accept' doesn't inhibit switching to another
fiber. There is some Linux API with signal handlers or pid fds or such that
might be useful here, though I don't recall the name. Presumably
something similar can be done for the Hurd, though some C glue may be
needed to access the right Hurd APIs if the signal handler API isn't
portable.
Alternatively:
4. Do the waitpid in a separate thread (needs work-around for the
multi-threaded fork problem, probably C things? Or modifying Guile and
maybe glibc to avoid async-unsafe things or make more things async-safe
or whatever the appropriate ...-safe is here.)
If not a Guile Fibers interaction problem, then the asynchronous signal
handler API might still be useful.
Greetings,
Maxime
Ludovic Courtès wrote on 21 Jul 17:39 +0200
Re: bug#56674: [Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks
(name . Maxime Devos)(address . maximedevos@telenet.be)(address . 56674@debbugs.gnu.org)
87fsiujwzo.fsf@gnu.org
Maxime Devos <maximedevos@telenet.be> skribis:

> Why Shepherd and not guile fibers? Is this a Shepherd-specific problem?

Blocking calls are a problem for Fibers in general, and ‘waitpid’ is no
exception.

The problem here is Shepherd-specific in the sense that we’re more
likely to use ‘system*’ and ‘waitpid’ in this context. It’s also
Shepherd-specific because shepherd already runs an event loop that
tracks signal FDs and will thus “see” SIGCHLD events.

> 3. Make waitpid (or a variant that does what we need) interact well
> with guile-fibers, like how 'accept' doesn't inhibit switching to
> another fiber. There is some Linux API with signal handlers or pid fds or
> such that might be useful here, though I don't recall the
> name. Presumably something similar can be done for the Hurd, though
> some C glue may be needed to access the right Hurd APIs if the signal
> handler API isn't portable.

Yes, that’s roughly what I had in mind when I mentioned providing a
replacement for ‘system*’ (but you’re right, it’s a replacement for
‘waitpid’ at its core).
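
To illustrate what such a replacement could look like, here is a rough sketch in Python, with asyncio standing in for Fibers; it is not Shepherd or Guile code. `run_command` is a hypothetical cooperative counterpart to ‘system*’, and `heartbeat` stands in for the rest of the event loop, which keeps running while the child process is alive.

```python
import asyncio
import sys

async def run_command(*argv):
    """Spawn ARGV and wait for it without blocking the event loop."""
    proc = await asyncio.create_subprocess_exec(*argv)
    return await proc.wait()  # suspends this task only

async def main():
    ticks = 0

    async def heartbeat():
        # Stand-in for other work the loop must keep doing, such as
        # answering client requests.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    beat = asyncio.create_task(heartbeat())
    # The child sleeps for 0.2 s; the heartbeat keeps ticking meanwhile.
    code = await run_command(
        sys.executable, "-c", "import time; time.sleep(0.2)")
    beat.cancel()
    return code, ticks

exit_code, ticks = asyncio.run(main())
```

With a blocking ‘waitpid’ in place of `proc.wait()`, the heartbeat would not advance until the child had exited, which is the "stuck" behavior described at the top of this report.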

> Alternatively:
>
> 4. Do the waitpid in a separate thread (needs work-around for the
> multi-threaded fork problem, probably C things? Or modifying Guile and
> maybe glibc to avoid async-unsafe things or make more things
> async-safe or whatever the appropriate ...-safe is here.)

For shepherd, multithreading is not an option due to the semantics of
fork in the presence of threads.
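
For readers unfamiliar with the issue: after fork, only the calling thread exists in the child, so any lock another thread held at that moment can never be released there. A small Python sketch of the first half of that statement follows; it is an illustration only, not Shepherd code, and `threads_seen_after_fork` is a hypothetical name.

```python
import os
import threading
import warnings

def threads_seen_after_fork():
    """Return (threads in parent, threads in child) around a fork."""
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait)  # idle second thread
    worker.start()
    in_parent = threading.active_count()

    read_fd, write_fd = os.pipe()
    with warnings.catch_warnings():
        # Python 3.12+ warns about exactly this hazard; silence it here.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: report how many threads survived the fork, then exit.
        os.write(write_fd, bytes([threading.active_count()]))
        os._exit(0)

    os.close(write_fd)
    in_child = os.read(read_fd, 1)[0]
    os.close(read_fd)
    os.waitpid(pid, 0)
    stop.set()
    worker.join()
    return in_parent, in_child
```

The parent reports at least two threads, while the child reports one: the worker thread is simply absent after the fork.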

Ludo’.