[Shepherd] Non-responding service control fiber

  • Done
Details
3 participants
  • Giovanni Biscuolo
  • Ludovic Courtès
Owner
unassigned
Submitted by
Ludovic Courtès
Severity
important
Ludovic Courtès wrote on 21 Aug 2023 11:38
[Shepherd] Non-responding service control fiber
(address . bug-guix@gnu.org)
87il98burf.fsf@inria.fr
Hello,

On milano-guix-1 (a build machine behind bayfront, running shepherd
0.10.2), ‘herd status’ and ‘herd status guix-build-coordinator-agent’
would hang (there’s no ‘guix-build-coordinator’ process running).

‘herd stop childhurd2’ hangs and has no effect.

Conversely, ‘herd status nscd’ and similar commands for most other
services work fine. When a service’s process is terminated, the service
gets respawned just fine.

The conclusion seems to be that the control fiber of the ‘root’ service
is not responding: is it blocked on a get/put? Did it exit?
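
A minimal sketch of how the checks above could be scripted so that a hung ‘herd’ invocation is reported instead of blocking the terminal. The ‘probe’ helper is hypothetical (not part of the Shepherd), and it assumes GNU coreutils ‘timeout’, which exits with status 124 when the deadline expires:

```shell
#!/bin/sh
# probe LIMIT COMMAND [ARGS...]
# Hypothetical helper: run COMMAND with a deadline of LIMIT seconds and
# print "ok", "hung", or "failed(STATUS)".
# Assumption: GNU coreutils `timeout`, which exits with 124 on expiry.
probe() {
    limit=$1
    shift
    status=0
    # Discard the command's output; only its exit status matters here.
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    case $status in
        0)   echo "ok" ;;
        124) echo "hung" ;;
        *)   echo "failed($status)" ;;
    esac
}
```

For instance, ‘probe 10 herd status’ and ‘probe 10 herd status nscd’ could be run side by side; given the symptoms described here, one would expect the former to report "hung" and the latter "ok", though that has not been checked on the affected machine.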

Unfortunately we don’t have data from the logs that would give clues as
to what went wrong.

Ludo’.
Giovanni Biscuolo wrote on 23 Aug 2023 10:00
(name . Christopher Baines)(address . mail@cbaines.net)
87pm3ejii8.fsf@xelera.eu
Hello,

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

[...]

> The conclusion seems to be that the control fiber of the ‘root’ service
> is not responding: it is blocked on a get/put? did it exit?
>
> Unfortunately we don’t have data from the logs that would give clues as
> to what went wrong.

I've had a look at /var/log/messages but nothing seems wrong except
messages like this one:

Aug 21 14:48:42 localhost shepherd[1]: 6 connections still in use after sshd-13752 termination.
Aug 21 14:48:42 localhost shepherd[1]: Service sshd-13752 (PID 29977) exited with 255.
Aug 21 14:48:42 localhost shepherd[1]: Service sshd-13752 has been disabled.
Aug 21 14:48:42 localhost shepherd[1]: Transient service sshd-13752 terminated, now unregistered.


Is it useful configuring the monitoring service [1] on milano-guix-1 to
have useful data in the logs in case we get a similar issue?

Thanks, Gio'



--
Giovanni Biscuolo

Xelera IT Infrastructures

Ludovic Courtès wrote on 24 Aug 2023 10:09
(name . Giovanni Biscuolo)(address . g@xelera.eu)
87jztk276p.fsf@inria.fr
Hi,

Giovanni Biscuolo <g@xelera.eu> skribis:

> I've had a look at /var/log/messages but nothing seems wrong except
> messages like this one:
>
>
> Aug 21 14:48:42 localhost shepherd[1]: 6 connections still in use after sshd-13752 termination.
> Aug 21 14:48:42 localhost shepherd[1]: Service sshd-13752 (PID 29977) exited with 255.
> Aug 21 14:48:42 localhost shepherd[1]: Service sshd-13752 has been disabled.
> Aug 21 14:48:42 localhost shepherd[1]: Transient service sshd-13752 terminated, now unregistered.

Yeah, I think it happened earlier, but unfortunately the previous logs
got deleted (rottlog is not behaving as expected).

> Is it useful configuring the monitoring service [1] on milano-guix-1 to
> have useful data in the logs in case we get a similar issue?

It wouldn’t help in this case, but it’s still interesting to have it
around.

sudo herd eval root '(begin (use-modules (shepherd service monitoring)) (register-services (list (monitoring-service))))'
sudo herd start monitoring

Ludo’.
Ludovic Courtès wrote on 3 Sep 2023 21:59
control message for bug #65419
(address . control@debbugs.gnu.org)
87v8crc9k2.fsf@gnu.org
merge 65419 65178
quit
Ludovic Courtès wrote on 3 Sep 2023 21:59
(address . control@debbugs.gnu.org)
87ttsbc9jn.fsf@gnu.org
severity 65419 important
quit
Ludovic Courtès wrote on 23 Nov 2023 21:42
(address . control@debbugs.gnu.org)
87sf4w8ana.fsf@gnu.org
retitle 65419 [Shepherd] Non-responding service control fiber
quit
Ludovic Courtès wrote on 20 Dec 2023 00:00
Re: bug#65419: [Shepherd] Non-responding service control fiber
(name . Attila Lendvai)(address . attila@lendvai.name)
87plz1hk6j.fsf_-_@gnu.org
Hello,

Attila Lendvai <attila@lendvai.name> skribis:

> i think i have found the root cause of this, as documented here: https://issues.guix.gnu.org/67839
>
> that issue contains patches for shepherd to reproduce it in its test suite.

Yes, it looks like this long-standing and hard-to-debug issue may well
be fixed now, thumbs up Attila!!

We have accumulated quite a few fixes by now so I think I’ll release
0.10.3 hopefully in 2023 and otherwise soon after.

Thanks,
Ludo’.
Ludovic Courtès wrote on 2 Jan 23:09 +0100
control message for bug #65419
(address . control@debbugs.gnu.org)
87mstngzff.fsf@gnu.org
close 65419
quit
This issue is archived.

To comment on this conversation send an email to 65419@debbugs.gnu.org
