/var/run/shepherd/socket is missing on an otherwise functional system

  • Open
Details
5 participants
  • Attila Lendvai
  • Efraim Flashner
  • Ludovic Courtès
  • Maxim Cournoyer
  • Maxime Devos
Owner
unassigned
Submitted by
Attila Lendvai
Severity
important
Attila Lendvai wrote on 27 Jan 2022 12:32
(name . bug-guix@gnu.org)(address . bug-guix@gnu.org)
BNbiqHqu6jP5GgHIZ0AMLhBo1O4baZdy21bgcUmGS1GqxOKnCHG_5uzzedymxwQsSFtL5gSw9Bppr1FXPoHDqIiKfe5K720Wb9Jivbsr_z4=@lendvai.name
the system seems to work fine. Gnome is up, i can log in with my user, and everything seems to work, except herd.

i encounter this broken state every once in a while. IRC logs also mention this multiple times, but without many insights:

https://logs.guix.gnu.org/guix/search?query=%2Fvar%2Frun%2Fshepherd%2Fsocket

```
# herd status
error: connect: /var/run/shepherd/socket: No such file or directory

# ps afxu | grep shepherd
root 1 0.0 0.3 160788 43684 ? Sl 11:51 0:00 /gnu/store/cnfsv9ywaacyafkqdqsv2ry8f01yr7a9-guile-3.0.7/bin/guile --no-auto-compile /gnu/store/vza48khbaq0fdmcsrn27xj5y5yy76z6l-shepherd-0.8.1/bin/shepherd --config /gnu/store/q4nd803lxrlkr60s8sx88gvpb6c7lxyd-shepherd.conf

# uptime
12:26:44 up 0:34, 2 users, load average: 0.00, 0.01, 0.00
```

looking at shepherd's code:

```
(define (call-with-server-socket file-name proc)
  "Call PROC, passing it a listening socket at FILE-NAME and deleting the
socket file at FILE-NAME upon exit of PROC.  Return the values of PROC."
  (let ((sock (open-server-socket file-name)))
    (dynamic-wind
      noop
      (lambda () (proc sock))
      (lambda ()
        (close sock)
        (catch-system-error (delete-file file-name))))))
```

maybe this is caused by some call/cc magic that causes an unwind that deletes the file, but then continues?

--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“Above all, do not lose your desire to walk: Every day I walk myself into a state of well-being and walk away from every illness; I have walked myself into my best thoughts, and I know of no thought so burdensome that one cannot walk away from it.”
— Søren Kierkegaard (1813–1855)
Attila Lendvai wrote on 27 Jan 2022 13:13
(No Subject)
(name . 53580@debbugs.gnu.org)(address . 53580@debbugs.gnu.org)
64mPNb0u2KM14ObD5EvtwiyzLKVPWLeUoaPyu2PuGOvhqFTVrAdkHMZzxksSNaturHgJG5wz53ZsOa_mnGzUnA44E9kFJj_Zi9QUTdC2g-8=@lendvai.name
i forgot to add that i'm working on a shepherd service, and this may be due to errors in the service's user code, like the start gexp.
Attachment: file
Efraim Flashner wrote on 1 Feb 2022 12:06
(name . Attila Lendvai)(address . attila@lendvai.name)(name . 53580@debbugs.gnu.org)(address . 53580@debbugs.gnu.org)
YfkUGqsrXEwDDkH7@3900XT
On Thu, Jan 27, 2022 at 12:13:28PM +0000, Attila Lendvai wrote:
> i forgot to add that i'm working on a shepherd service, and this may be due to errors in the service's user code, like the start gexp.

This is generally when I see this type of error. I normally try to
create a minimal VM and launch that when I'm trying out a new service.

--
Efraim Flashner <efraim@flashner.co.il> רנשלפ םירפא
GPG key = A28B F40C 3E55 1372 662D 14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted


Maxime Devos wrote on 1 Feb 2022 20:28
Re: bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system
04e30e30595ba96786a78c1dbc1768636b5c71e9.camel@telenet.be
Attila Lendvai schreef op do 27-01-2022 om 11:32 [+0000]:
> (define (call-with-server-socket file-name proc)
>   "Call PROC, passing it a listening socket at FILE-NAME and deleting the
> socket file at FILE-NAME upon exit of PROC.  Return the values of PROC."
>   (let ((sock (open-server-socket file-name)))
>     (dynamic-wind
>       noop
>       (lambda () (proc sock))
>       (lambda ()
>         (close sock)
>         (catch-system-error (delete-file file-name))))))
> ```
>
> maybe this is caused by some call/cc magic that causes an unwind that deletes the file, but then continues?

Shepherd doesn't use call/cc anywhere. However, it does use
_delimited_ continuations, even though only through let/ec and
'guard'/'catch'/... More generally, call/cc is typically unused in
(Guile) Scheme code, and call-with-prompt / abort-to-prompt / shift /
reset / % are used instead.

My guess as to what happens: the start code of a shepherd service
fails between 'fork' and 'exec', with an exception. The exception
isn't caught (or is caught and reraised), so the 'out' guard of the
'dynamic-wind' is entered and the file representing the socket is
deleted.

If that's indeed the case, it might be a good idea to install
some exception handlers in fork+exec-command and friends (including
make-forkexec-constructor/container), to make shepherd more robust
w.r.t. services failing to start.
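
A minimal sketch of that failure mode, in plain Guile and not taken from
shepherd: `call-with-marker-file` and the /tmp file name are hypothetical
stand-ins for `call-with-server-socket` and the real socket. It illustrates
that an exception escaping the body of 'dynamic-wind' runs the 'out' guard,
whereas catching the exception inside the body leaves the file in place.

```scheme
;; Sketch only: `call-with-marker-file' is a hypothetical stand-in for
;; shepherd's `call-with-server-socket'; a regular file replaces the socket.
(define (call-with-marker-file file-name proc)
  ;; Create FILE-NAME, call PROC, and delete FILE-NAME on any exit from PROC.
  (call-with-output-file file-name (lambda (port) (display "" port)))
  (dynamic-wind
    (lambda () #t)
    proc
    (lambda ()
      (false-if-exception (delete-file file-name)))))

(define marker
  (string-append "/tmp/dynamic-wind-demo-" (number->string (getpid))))

;; The exception escapes PROC, so the 'out' guard runs during unwinding and
;; the file is already gone when the outer handler gets control.
(define unguarded-result
  (catch #t
    (lambda ()
      (call-with-marker-file marker
        (lambda () (error "service failed to start"))))
    (lambda (key . args)
      (file-exists? marker))))

;; The exception is handled inside PROC, so the dynamic extent is never
;; left and the file still exists afterwards.
(define guarded-result
  (call-with-marker-file marker
    (lambda ()
      (catch #t
        (lambda () (error "service failed to start"))
        (lambda (key . args) #f))
      (file-exists? marker))))
```

Under these assumptions `unguarded-result` is #f and `guarded-result` is #t.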

Greetings,
Maxime.


Attila Lendvai wrote on 4 Apr 2022 09:15
(name . Maxime Devos)(address . maximedevos@telenet.be)(address . 53580@debbugs.gnu.org)
6NiL_Ch8DBvICfU6SITfzvKgAYXSazejBXnarb4WkiHJIy5ueKpiqTd5Jwr5SJiyzUCDxhH-ebv_vUxI5Vf8jD484kNu3Ykc1f6f48nt1ZU=@lendvai.name
FTR,

the issue is that when Shepherd is booting up, i.e. starting from its config file, it calls the start forms without guarding for any possible exceptions. any error propagates up beyond the loop and up until an unwind protect that deletes the socket.

the reason my system seemed fully functional is that my service was pretty much the last one to be started.
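
a rough sketch of what guarding each start form could look like. this is not shepherd's code: modelling services as (NAME . START-THUNK) pairs and the procedure name `start-all` are assumptions made only for illustration.

```scheme
;; Sketch only: services are modelled as (NAME . START-THUNK) pairs, which is
;; not shepherd's actual representation.
(define (start-all services)
  ;; Start each service in turn; return the names of those that failed.
  (let loop ((services services) (failed '()))
    (if (null? services)
        (reverse failed)
        (let* ((name  (car (car services)))
               (start (cdr (car services)))
               (ok?   (catch #t
                        (lambda () (start) #t)
                        (lambda (key . args)
                          ;; Report the failure and keep going.
                          (format (current-error-port)
                                  "service ~a failed to start: ~s ~s~%"
                                  name key args)
                          #f))))
          (loop (cdr services)
                (if ok? failed (cons name failed)))))))

(define failed-services
  (start-all
   (list (cons 'networking (lambda () #t))
         (cons 'my-service (lambda () (error "bug in start gexp")))
         (cons 'sshd       (lambda () #t)))))
```

here `failed-services` is the list `(my-service)`, and the loop still reaches `sshd`, so nothing propagates up to an enclosing unwind handler.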

--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“I made up the term 'object-oriented', and I can tell you I didn't have C++ in mind.”
— Alan Kay, OOPSLA '97
Maxim Cournoyer wrote on 18 May 14:58 +0200
control message for bug #53580
(address . control@debbugs.gnu.org)
87o7mhn6ea.fsf@gmail.com
severity 53580 important
quit
Ludovic Courtès wrote on 18 May 22:12 +0200
Re: bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system
(name . Attila Lendvai)(address . attila@lendvai.name)(address . 53580@debbugs.gnu.org)
87jzx51jre.fsf@gnu.org
Hello Attila,

I had totally overlooked this bug report.

Attila Lendvai <attila@lendvai.name> skribis:

> the system seems to work fine. Gnome is up, i can log in with my user, and everything seems to work, except herd.
>
> i encounter this broken state every once in a while. IRC logs also mention this multiple times, but without many insights:
>
> https://logs.guix.gnu.org/guix/search?query=%2Fvar%2Frun%2Fshepherd%2Fsocket
>
> ```
> # herd status
> error: connect: /var/run/shepherd/socket: No such file or directory

[...]

> the issue is that when Shepherd is booting up, i.e. starting from its config file, it calls the start forms without guarding for any possible exceptions. any error propagates up beyond the loop and up until an unwind protect that deletes the socket.
>
> the reason my system seemed fully functional is that my service was pretty much the last one to be started.

Currently (in 0.10.0), the ‘run-daemon’ procedure loads the user’s
config file before listening on /var/run/shepherd/socket. However, if
an exception is thrown from the config file, it stops:

$ echo '(error "oops")' > /tmp/conf.scm
$ ./shepherd -I -s sock -c /tmp/conf.scm
Starting service root...
Service root started.
Service root running with value #t.
Service root has been started.
misc-error(#f "~A" ("oops") #f)

Some deprecated features have been used. Set the environment
variable GUILE_WARN_DEPRECATED to "detailed" and rerun the
program to get more information. Set it to "no" to suppress
this message.
$ echo $?
1

Now, while the config file is being evaluated, shepherd does not listen
on its socket, which isn’t great.

This is mitigated by the use of ‘start-in-the-background’ (introduced in
0.9.0) in the config file, which, as the name implies, doesn’t block
further operation.

So I *think* we’re mostly okay now. The one thing we could do is load
the whole config file in a separate fiber, and maybe it’s fine to keep
going even when there’s an error during config file evaluation?
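
A small sketch of the "keep going" half of that idea, in plain Guile. It
leaves the fiber out entirely, and `load-config-leniently` is a hypothetical
name, not an existing shepherd procedure. It reuses the `(error "oops")`
config from the transcript above, written to a throwaway file under /tmp.

```scheme
;; Sketch only: `load-config-leniently' is a hypothetical helper, not part
;; of shepherd.
(define (load-config-leniently file)
  ;; Return #t if FILE loaded cleanly, #f if evaluating it raised an exception.
  (catch #t
    (lambda () (primitive-load file) #t)
    (lambda (key . args)
      (format (current-error-port)
              "error while loading ~a: ~s ~s~%" file key args)
      #f)))

;; Write a config file that fails, as in the transcript above.
(define conf
  (string-append "/tmp/conf-demo-" (number->string (getpid)) ".scm"))
(call-with-output-file conf
  (lambda (port) (write '(error "oops") port)))

(define loaded? (load-config-leniently conf))
(delete-file conf)
```

Here `loaded?` is #f and the caller is free to carry on, for instance by
going on to listen on the socket.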

WDYT?

Thanks,
Ludo’.
Attila Lendvai wrote 30 hours ago
shepherd's architecture
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 53580@debbugs.gnu.org)
Jf0lcTW5Lw4gnNDSPsv037iYNAMvK28S6tL4Zh0FdGp7nnQCgCD_uITYxJ4PFxqKkaS5CUH_7mUucz2tvKVJKdQt2uhizTDQaiJ0Jup2wbs=@lendvai.name
[forked from: bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system]

> So I think we’re mostly okay now. The one thing we could do is load
> the whole config file in a separate fiber, and maybe it’s fine to keep
> going even when there’s an error during config file evaluation?
>
> WDYT?


i think there's a fundamental issue to be resolved here, and addressing that would implicitly resolve the entire class of issues that this one belongs to.

guile (shepherd) is run as the init process, and because of that it may not exit or be respawned. but at the same time when we reconfigure a guix system, then shepherd's config should not only be reloaded, but its internal state merged with the new config, and potentially even with an evolved shepherd codebase.

i still lack a proper mental model of all this to successfully predict what will happen when i `guix system reconfigure` after i `guix pull`-ed my service code, and/or changed the config of my services.

--------

this problem of migration is pretty much a CS research topic...

ideally, there should be a non-shepherd-specific protocol defined for such migrations, and the new shepherd codebase could migrate its state from the old to the new, with most of the migration code being automatic. some of it must be hand written as required by some semantic changes.

even more ideally, we should have reflexive systems; admit that source code is a graph, and store it as one (as opposed to a string of characters); and our systems should have orthogonal persistency, etc, etc... a far cry from what we have now.

Fare's excellent blog has some visionary thoughts on this, especially in:


but given that we will not have these any time soon... what can we do now?

--------

note: what follows are wild ideas, and i'm not sure i have the necessary understanding of the involved subsystems to properly judge their feasibility... so take them with a pinch of salt.

idea 1
--------

it doesn't seem to be an insurmountable task to make sure that guile can safely unlink a module from its heap, check if there are any references into the module to be dropped, and then reload this module from disk.

the already running fibers would keep the required code in the heap until after they are stopped/restarted. then the module would get GC'd eventually.

this would help solve the problem that a reconfigured service may have a completely different start/stop code. and by taking some careful shortcuts we may be able to make reloading work without having to stop the service process in question.

idea 2
--------

another, probably better idea:

split up shepherd's codebase into isolated parts:

1) the init process

2) the service runners, which are spawned by 1). let's call this part
'the runner'.

3) the CLI scripts that implement stuff like `reboot` by sending a
message to 1).

the runner would spawn and manage the actual daemon binaries/processes.

the init process would communicate with the runners through a channel/pipe that is created when the runners are spawned. i.e. here we wouldn't need an IPC socket file like we need for the communication between the scripts and the init process.

AFAIU the internal structure of shepherd is already turning into something like this with the use of fibers and channels. i suspect Ludo has something like this on his mind already.

in this setup most of the complexity and the evolution of the shepherd codebase would happen in the runner, and the other two parts could be kept minimal and would rarely need to change (and thus require a reboot).

the need for a reboot could be detected by noticing that the compiled binary of the init process has changed compared to what is currently running as PID 1.

the driver process of a service could be reloaded/respawned the next time when the daemon is stopped or it quits unexpectedly.
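
a minimal sketch of the pipe-based channel between the init process and a runner, assuming plain Guile with `(ice-9 popen)` and no fibers. `cat` is only a stand-in for a runner (it echoes whatever it is sent), and the message text and the `ask` helper are made up for illustration.

```scheme
(use-modules (ice-9 popen) (ice-9 rdelim))

;; Sketch only: `cat' stands in for a runner process; a real runner would
;; speak some agreed-upon protocol instead of echoing its input.
(define runner (open-pipe* OPEN_BOTH "cat"))

(define (ask runner message)
  ;; Send MESSAGE as one line over the pipe and wait for a one-line reply.
  (display message runner)
  (newline runner)
  (force-output runner)
  (read-line runner))

(define reply (ask runner "status my-service"))

;; Closing the pipe gives the child end-of-file and waits for it to exit.
(close-pipe runner)
```

no socket file is involved: the two ends exist only as file descriptors shared between parent and child, so there is nothing on disk for an unwind handler to delete.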

--------

recently i've successfully written a shepherd service that spawns a daemon, and from a fiber it does two way communication with the daemon using a pipe connected to the daemon's stdio. i guess that counts as a proof of concept for the second idea, but i'm not sure about its stability. a stuck/failing service is a different issue than a stuck/failing init process.

for reference, the spawning of the daemon:


the fiber's code that talks to it:


--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“We reject: kings, presidents and voting. We believe in: rough consensus and running code.”
— David Clark for the IETF