[Shepherd] Non-responding service control fiber

  • Done
Details
3 participants
  • Giovanni Biscuolo
  • Ludovic Courtès
Owner
unassigned
Submitted by
Ludovic Courtès
Severity
important
Merged with
65178
Ludovic Courtès wrote on 21 Aug 2023 11:38
[Shepherd] Non-reponding service control fiber
(address . bug-guix@gnu.org)
87il98burf.fsf@inria.fr
Hello,

On milano-guix-1 (a build machine behind bayfront, running shepherd
0.10.2), ‘herd status’ and ‘herd status guix-build-coordinator-agent’
would hang (there’s no ‘guix-build-coordinator’ process running).

‘herd stop childhurd2’ hangs and has no effect.

Conversely, ‘herd status nscd’ and similar commands for most other
services work fine. When a service’s process is terminated, the service
gets respawned just fine.

The conclusion seems to be that the control fiber of the ‘root’ service
is not responding: is it blocked on a get/put? Did it exit?

Unfortunately we don’t have data from the logs that would give clues as
to what went wrong.

Ludo’.
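
The symptoms described above (some ‘herd’ requests hang while others answer) can be checked systematically. The following is a minimal sketch and not part of the original report: it assumes GNU coreutils ‘timeout’ is available, and the ‘probe’ function and the PROBE_TIMEOUT variable are hypothetical names introduced here for illustration. The service names are the ones mentioned in the report.

```shell
#!/bin/sh
# probe COMMAND...: run COMMAND with a time limit and report whether it
# answered, failed, or hung.  'probe' and PROBE_TIMEOUT are hypothetical
# names used for illustration; they are not part of the Shepherd.
probe() {
    if timeout "${PROBE_TIMEOUT:-10}" "$@" >/dev/null 2>&1; then
        echo "ok: $*"
    else
        status=$?
        if [ "$status" -eq 124 ]; then
            # GNU timeout exits with 124 when the time limit is reached.
            echo "hang: $*"
        else
            echo "error($status): $*"
        fi
    fi
}

# Only query the Shepherd when 'herd' is actually installed.
if command -v herd >/dev/null 2>&1; then
    for service in root nscd guix-build-coordinator-agent; do
        probe herd status "$service"
    done
fi
```

A "hang" line for ‘root’ next to "ok" lines for other services would match the behavior reported here, where only some control fibers stop answering.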
Giovanni Biscuolo wrote on 23 Aug 2023 10:00
(name . Christopher Baines)(address . mail@cbaines.net)
87pm3ejii8.fsf@xelera.eu
Hello,

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

[...]

> The conclusion seems to be that the control fiber of the ‘root’ service
> is not responding: is it blocked on a get/put? Did it exit?
>
> Unfortunately we don’t have data from the logs that would give clues as
> to what went wrong.

I've had a look at /var/log/messages but nothing seems wrong except
messages like this one:

Aug 21 14:48:42 localhost shepherd[1]: 6 connections still in use after sshd-13752 termination.
Aug 21 14:48:42 localhost shepherd[1]: Service sshd-13752 (PID 29977) exited with 255.
Aug 21 14:48:42 localhost shepherd[1]: Service sshd-13752 has been disabled.
Aug 21 14:48:42 localhost shepherd[1]: Transient service sshd-13752 terminated, now unregistered.


Would it be useful to configure the monitoring service [1] on
milano-guix-1 so that we have useful data in the logs in case we hit a
similar issue?

Thanks, Gio'



--
Giovanni Biscuolo

Xelera IT Infrastructures
Ludovic Courtès wrote on 24 Aug 2023 10:09
(name . Giovanni Biscuolo)(address . g@xelera.eu)
87jztk276p.fsf@inria.fr
Hi,

Giovanni Biscuolo <g@xelera.eu> skribis:

> I've had a look at /var/log/messages but nothing seems wrong except
> messages like this one:
>
>
> Aug 21 14:48:42 localhost shepherd[1]: 6 connections still in use after sshd-13752 termination.
> Aug 21 14:48:42 localhost shepherd[1]: Service sshd-13752 (PID 29977) exited with 255.
> Aug 21 14:48:42 localhost shepherd[1]: Service sshd-13752 has been disabled.
> Aug 21 14:48:42 localhost shepherd[1]: Transient service sshd-13752 terminated, now unregistered.

Yeah, I think it happened earlier but unfortunately the previous logs
got deleted (rottlog is not behaving as expected).

> Would it be useful to configure the monitoring service [1] on
> milano-guix-1 so that we have useful data in the logs in case we hit a
> similar issue?

It wouldn’t help in this case, but it’s still interesting to have it
around.

sudo herd eval root '(begin (use-modules (shepherd service monitoring)) (register-services (list (monitoring-service))))'
sudo herd start monitoring

Ludo’.
Ludovic Courtès wrote on 3 Sep 2023 21:59
control message for bug #65419
(address . control@debbugs.gnu.org)
87v8crc9k2.fsf@gnu.org
merge 65419 65178
quit
Ludovic Courtès wrote on 3 Sep 2023 21:59
(address . control@debbugs.gnu.org)
87ttsbc9jn.fsf@gnu.org
severity 65419 important
quit
Ludovic Courtès wrote on 23 Nov 2023 21:42
(address . control@debbugs.gnu.org)
87sf4w8ana.fsf@gnu.org
retitle 65419 [Shepherd] Non-responding service control fiber
quit
Ludovic Courtès wrote on 20 Dec 2023 00:00
Re: bug#65419: [Shepherd] Non-responding service control fiber
(name . Attila Lendvai)(address . attila@lendvai.name)
87plz1hk6j.fsf_-_@gnu.org
Hello,

Attila Lendvai <attila@lendvai.name> skribis:

> i think i have found the root cause of this, as documented here: https://issues.guix.gnu.org/67839
>
> that issue contains patches for shepherd to reproduce it in its test suite.

Yes, it looks like this long-standing and hard-to-debug issue may well
be fixed now, thumbs up Attila!!

We have accumulated quite a few fixes by now so I think I’ll release
0.10.3 hopefully in 2023 and otherwise soon after.

Thanks,
Ludo’.
Ludovic Courtès wrote on 2 Jan 23:09 +0100
control message for bug #65419
(address . control@debbugs.gnu.org)
87mstngzff.fsf@gnu.org
close 65419
quit