shepherd lost track of nginx

  • Open
Details
3 participants
  • Ludovic Courtès
  • Mark H Weaver
  • Robert Vollmert
Owner
unassigned
Submitted by
Robert Vollmert
Severity
normal
Robert Vollmert wrote on 19 Jul 2019 18:49
(address . bug-guix@gnu.org)
E22C505B-4E64-4489-AE9C-8B19254B2BBD@vllmrt.net
Not sure who’s at fault here, but without doing anything weird,
I ended up with a system where shepherd thought that nginx was
stopped, while there was still an nginx process around. I
certainly didn’t start it by hand.

The result was this:

$ sudo herd restart nginx
Service nginx is not running.
herd: exception caught while executing 'start' on service 'nginx':
Throw to key `srfi-34' with args `("#<condition &invoke-error [program: \"/gnu/store/mlg0xfbiq03s812rm3v7mrlhyngas4xp-nginx-1.17.1/sbin/nginx\" arguments: (\"-c\" \"/gnu/store/r6gl9n7pwf4npiri05qxr40vdihdm2yy-nginx.conf\" \"-p\" \"/var/run/nginx\") exit-status: 1 term-signal: #f stop-signal: #f] 147e000>")’.

That error message could also be clearer about what’s going on. At any
rate, after I killed the nginx process, “herd start nginx” worked fine.

I should add that nginx was still doing its job fine before I killed it.
Ludovic Courtès wrote on 20 Jul 2019 00:49
(name . Robert Vollmert)(address . rob@vllmrt.net)(address . 36731@debbugs.gnu.org)
87ef2labds.fsf@gnu.org
Hello,

Robert Vollmert <rob@vllmrt.net> skribis:

> Not sure who’s at fault here, but without doing anything weird,
> I ended up with a system where shepherd thought that nginx was
> stopped, while there was still an nginx process around. I
> certainly didn’t start it by hand.

Did you try “herd status nginx” to see shepherd’s notion of the nginx
process?

> The result was this:
>
> $ sudo herd restart nginx
> Service nginx is not running.
> herd: exception caught while executing 'start' on service 'nginx':
> Throw to key `srfi-34' with args `("#<condition &invoke-error [program: \"/gnu/store/mlg0xfbiq03s812rm3v7mrlhyngas4xp-nginx-1.17.1/sbin/nginx\" arguments: (\"-c\" \"/gnu/store/r6gl9n7pwf4npiri05qxr40vdihdm2yy-nginx.conf\" \"-p\" \"/var/run/nginx\") exit-status: 1 term-signal: #f stop-signal: #f] 147e000>")’.

Do you use an “opaque” nginx config file, or do you use <nginx-...>
records?

In the former case, the ‘start’ method won’t attempt to read the PID
file (because it cannot be sure it’ll exist), so it’s effectively unable
to track the process. See comment in ‘nginx-shepherd-service’.

> That error message could also be clearer about what’s going on. At any
> rate, after I killed the nginx process, “herd start nginx” worked fine.

I agree that we could and should improve the error message. Redirecting
nginx’s stderr so that shepherd clients can see it would be best.
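As a rough illustration of that idea (not how shepherd itself works), the sketch below runs a command and, only if it fails, reports the exit status together with whatever the command wrote to stderr. The helper name `run_reporting_stderr` is hypothetical and exists only for this example.

```shell
#!/bin/sh
# Hypothetical helper: run a command and, on failure, report its exit
# status along with everything it wrote to stderr.
run_reporting_stderr() {
    errfile=$(mktemp) || return 1
    # Capture the command's stderr in a temporary file.
    "$@" 2>"$errfile"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "command failed with exit status $status: $*" >&2
        # Prefix each captured line so it is clearly attributed.
        sed 's/^/  stderr: /' "$errfile" >&2
    fi
    rm -f "$errfile"
    return "$status"
}
```

Run as root with the program and arguments shown in the &invoke-error condition, this would print nginx's own diagnostic next to the bare "exit-status: 1".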

Thanks,
Ludo’.
Robert Vollmert wrote on 20 Jul 2019 09:42
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 36731@debbugs.gnu.org)
78D8D737-8D4D-485D-8388-AD014C669FB9@vllmrt.net
> On 20. Jul 2019, at 00:49, Ludovic Courtès <ludo@gnu.org> wrote:
>
> Hello,
>
> Robert Vollmert <rob@vllmrt.net> skribis:
>
>> Not sure who’s at fault here, but without doing anything weird,
>> I ended up with a system where shepherd thought that nginx was
>> stopped, while there was still an nginx process around. I
>> certainly didn’t start it by hand.
>
> Did you try “herd status nginx” to see shepherd’s notion of the nginx
> process?

Not at the time, no.

>
>> The result was this:
>>
>> $ sudo herd restart nginx
>> Service nginx is not running.
>> herd: exception caught while executing 'start' on service 'nginx':
>> Throw to key `srfi-34' with args `("#<condition &invoke-error [program: \"/gnu/store/mlg0xfbiq03s812rm3v7mrlhyngas4xp-nginx-1.17.1/sbin/nginx\" arguments: (\"-c\" \"/gnu/store/r6gl9n7pwf4npiri05qxr40vdihdm2yy-nginx.conf\" \"-p\" \"/var/run/nginx\") exit-status: 1 term-signal: #f stop-signal: #f] 147e000>")’.
>
> Do you use an “opaque” nginx config file, or do you use <nginx-...>
> records?

The latter I think:

(service nginx-service-type
         (nginx-configuration
          (extra-content "…")))
Ludovic Courtès wrote on 20 Jul 2019 15:51
(name . Robert Vollmert)(address . rob@vllmrt.net)(address . 36731@debbugs.gnu.org)
87d0i495l9.fsf@gnu.org
Hi,

Robert Vollmert <rob@vllmrt.net> skribis:

>>> $ sudo herd restart nginx
>>> Service nginx is not running.
>>> herd: exception caught while executing 'start' on service 'nginx':
>>> Throw to key `srfi-34' with args `("#<condition &invoke-error [program: \"/gnu/store/mlg0xfbiq03s812rm3v7mrlhyngas4xp-nginx-1.17.1/sbin/nginx\" arguments: (\"-c\" \"/gnu/store/r6gl9n7pwf4npiri05qxr40vdihdm2yy-nginx.conf\" \"-p\" \"/var/run/nginx\") exit-status: 1 term-signal: #f stop-signal: #f] 147e000>")’.
>>
>> Do you use an “opaque” nginx config file, or do you use <nginx-...>
>> records?
>
> The latter I think:
>
> (service nginx-service-type
>          (nginx-configuration
>           (extra-content "…")))

That’s actually the non-opaque variant, so shepherd should read the PID
file and it shouldn’t get it wrong. Not sure what happened.

If you can reproduce it, it would be great to gather the output of “herd
status nginx” at the time shepherd is confused.
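One way to collect that evidence is to compare shepherd's view with what the PID file and the process table say. The sketch below checks whether the PID recorded in a PID file belongs to a live process. The function name `pidfile_status` is hypothetical, and the path `/var/run/nginx/pid` in the usage note is an assumption based on the `-p /var/run/nginx` argument in the error above; check your generated nginx.conf for the actual `pid` directive.

```shell
#!/bin/sh
# Hypothetical helper: report whether the PID stored in a PID file
# refers to a live process.
# Prints "alive PID", "stale PID", or "no-pidfile".
pidfile_status() {
    pidfile=$1
    if [ ! -r "$pidfile" ]; then
        echo "no-pidfile"
        return 2
    fi
    pid=$(cat "$pidfile")
    # "kill -0" sends no signal; it only checks that the process exists
    # and that we may signal it.  Run this as root: without permission,
    # a live process owned by another user is also reported as stale.
    if kill -0 "$pid" 2>/dev/null; then
        echo "alive $pid"
    else
        echo "stale $pid"
        return 1
    fi
}
```

For example, running `pidfile_status /var/run/nginx/pid` as root and then `herd status nginx` shows both views side by side: an "alive" PID while shepherd reports the service as stopped would be the confused state described in this report.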

Thanks,
Ludo’.
Mark H Weaver wrote on 21 Jul 2019 01:07
(name . Ludovic Courtès)(address . ludo@gnu.org)
871rykcniu.fsf@netris.org
Hello,

Ludovic Courtès <ludo@gnu.org> writes:

> Robert Vollmert <rob@vllmrt.net> skribis:
>
>> The result was this:
>>
>> $ sudo herd restart nginx
>> Service nginx is not running.
>> herd: exception caught while executing 'start' on service 'nginx':
>> Throw to key `srfi-34' with args `("#<condition &invoke-error
>> [program:
>> \"/gnu/store/mlg0xfbiq03s812rm3v7mrlhyngas4xp-nginx-1.17.1/sbin/nginx\"
>> arguments: (\"-c\"
>> \"/gnu/store/r6gl9n7pwf4npiri05qxr40vdihdm2yy-nginx.conf\" \"-p\"
>> \"/var/run/nginx\") exit-status: 1 term-signal: #f stop-signal: #f]
>> 147e000>")’.

[…]

>> That error message could also be clearer about what’s going on. At any
>> rate, after I killed the nginx process, “herd start nginx” worked fine.
>
> I agree that we could and should improve the error message.

On the subject of this error message, why was the &invoke-error
condition serialized to a string before apparently being embedded within
another exception? In other words, why did it print:

Throw to key `srfi-34' with args `("#<condition &invoke-error [program: \"/gnu/store/mlg0xfbiq03s812rm3v7mrlhyngas4xp-nginx-1.17.1/sbin/nginx\" arguments: (\"-c\" \"/gnu/store/r6gl9n7pwf4npiri05qxr40vdihdm2yy-nginx.conf\" \"-p\" \"/var/run/nginx\") exit-status: 1 term-signal: #f stop-signal: #f] 147e000>")’.

instead of something closer to:

Throw to key `srfi-34' with args `(#<condition &invoke-error [program: "/gnu/store/mlg0xfbiq03s812rm3v7mrlhyngas4xp-nginx-1.17.1/sbin/nginx" arguments: ("-c" "/gnu/store/r6gl9n7pwf4npiri05qxr40vdihdm2yy-nginx.conf" "-p" "/var/run/nginx") exit-status: 1 term-signal: #f stop-signal: #f] 147e000>)’.

We may want to go further in this specific case to make a user-friendly
error message, but in the more general case of printing arbitrary
exceptions, eliminating that second layer of string serialization would
help make the error reports a bit nicer to read.

What do you think?

Mark
Ludovic Courtès wrote on 22 Jul 2019 12:31
(name . Mark H Weaver)(address . mhw@netris.org)
871ryi4az9.fsf@gnu.org
Hi Mark,

Mark H Weaver <mhw@netris.org> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Robert Vollmert <rob@vllmrt.net> skribis:
>>
>>> The result was this:
>>>
>>> $ sudo herd restart nginx
>>> Service nginx is not running.
>>> herd: exception caught while executing 'start' on service 'nginx':
>>> Throw to key `srfi-34' with args `("#<condition &invoke-error
>>> [program:
>>> \"/gnu/store/mlg0xfbiq03s812rm3v7mrlhyngas4xp-nginx-1.17.1/sbin/nginx\"
>>> arguments: (\"-c\"
>>> \"/gnu/store/r6gl9n7pwf4npiri05qxr40vdihdm2yy-nginx.conf\" \"-p\"
>>> \"/var/run/nginx\") exit-status: 1 term-signal: #f stop-signal: #f]
>>> 147e000>")’.
>
> […]
>
>>> That error message could also be clearer about what’s going on. At any
>>> rate, after I killed the nginx process, “herd start nginx” worked fine.
>>
>> I agree that we could and should improve the error message.
>
> On the subject of this error message, why was the &invoke-error
> condition serialized to a string before apparently being embedded within
> another exception?

That serialization comes from the Shepherd when it talks to its clients
(see ‘write-reply’ in (shepherd comm)).

Normally service methods should write a human-readable message instead
of throwing an exception, but when that happens, shepherd serializes
those things so that one can at least diagnose the problem.

In this case we could use ‘report-invoke-error’ from (guix build utils)
on ‘core-updates’.

Thanks,
Ludo’.