/var/run/shepherd/socket is missing on an otherwise functional system

  • Open
Details
8 participants
  • Attila Lendvai
  • Brian Cully
  • Efraim Flashner
  • Felix Lechner
  • Ludovic Courtès
  • Maxim Cournoyer
  • Maxime Devos
  • Csepp
Owner
unassigned
Submitted by
Attila Lendvai
Severity
important
Attila Lendvai wrote on 27 Jan 2022 12:32
(name . bug-guix@gnu.org)(address . bug-guix@gnu.org)
BNbiqHqu6jP5GgHIZ0AMLhBo1O4baZdy21bgcUmGS1GqxOKnCHG_5uzzedymxwQsSFtL5gSw9Bppr1FXPoHDqIiKfe5K720Wb9Jivbsr_z4=@lendvai.name
the system seems to work fine. Gnome is up, i can log in with my user, and everything seems to work, except herd.

i encounter this broken state every once in a while. IRC logs also mention this multiple times, but without many insights:

https://logs.guix.gnu.org/guix/search?query=%2Fvar%2Frun%2Fshepherd%2Fsocket

```
# herd status
error: connect: /var/run/shepherd/socket: No such file or directory

# ps afxu | grep shepherd
root 1 0.0 0.3 160788 43684 ? Sl 11:51 0:00 /gnu/store/cnfsv9ywaacyafkqdqsv2ry8f01yr7a9-guile-3.0.7/bin/guile --no-auto-compile /gnu/store/vza48khbaq0fdmcsrn27xj5y5yy76z6l-shepherd-0.8.1/bin/shepherd --config /gnu/store/q4nd803lxrlkr60s8sx88gvpb6c7lxyd-shepherd.conf

# uptime
12:26:44 up 0:34, 2 users, load average: 0.00, 0.01, 0.00
```

looking at shepherd's code:

```
(define (call-with-server-socket file-name proc)
  "Call PROC, passing it a listening socket at FILE-NAME and deleting the
socket file at FILE-NAME upon exit of PROC.  Return the values of PROC."
  (let ((sock (open-server-socket file-name)))
    (dynamic-wind
      noop
      (lambda () (proc sock))
      (lambda ()
        (close sock)
        (catch-system-error (delete-file file-name))))))
```

maybe this is caused by some call/cc magic that causes an unwind that deletes the file, but then continues?

--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“Above all, do not lose your desire to walk: Every day I walk myself into a state of well-being and walk away from every illness; I have walked myself into my best thoughts, and I know of no thought so burdensome that one cannot walk away from it.”
— Søren Kierkegaard (1813–1855)
Attila Lendvai wrote on 27 Jan 2022 13:13
(No Subject)
(name . 53580@debbugs.gnu.org)(address . 53580@debbugs.gnu.org)
64mPNb0u2KM14ObD5EvtwiyzLKVPWLeUoaPyu2PuGOvhqFTVrAdkHMZzxksSNaturHgJG5wz53ZsOa_mnGzUnA44E9kFJj_Zi9QUTdC2g-8=@lendvai.name
i forgot to add that i'm working on a shepherd service, and this may be due to errors in the service's user code, like the start gexp.
Attachment: file
Efraim Flashner wrote on 1 Feb 2022 12:06
(name . Attila Lendvai)(address . attila@lendvai.name)(name . 53580@debbugs.gnu.org)(address . 53580@debbugs.gnu.org)
YfkUGqsrXEwDDkH7@3900XT
On Thu, Jan 27, 2022 at 12:13:28PM +0000, Attila Lendvai wrote:
> i forgot to add that i'm working on a shepherd service, and this may be due to errors in the service's user code, like the start gexp.

This is generally when I see this type of error. I normally try to
create a minimal VM and launch that when I'm trying out a new service.

--
Efraim Flashner <efraim@flashner.co.il> אפרים פלשנר
GPG key = A28B F40C 3E55 1372 662D 14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted
Maxime Devos wrote on 1 Feb 2022 20:28
Re: bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system
04e30e30595ba96786a78c1dbc1768636b5c71e9.camel@telenet.be
Attila Lendvai schreef op do 27-01-2022 om 11:32 [+0000]:
> (define (call-with-server-socket file-name proc)
>   "Call PROC, passing it a listening socket at FILE-NAME and deleting the
> socket file at FILE-NAME upon exit of PROC.  Return the values of PROC."
>   (let ((sock (open-server-socket file-name)))
>     (dynamic-wind
>       noop
>       (lambda () (proc sock))
>       (lambda ()
>         (close sock)
>         (catch-system-error (delete-file file-name))))))
> ```
>
> maybe this is caused by some call/cc magic that causes an unwind that deletes the file, but then continues?

Shepherd doesn't use call/cc anywhere. However, it does use
_delimited_ continuations, even though only through let/ec and
'guard'/'catch'/... More generally, call/cc is typically unused in
(Guile) Scheme code, and call-with-prompt / abort-to-prompt / shift /
reset / % are used instead.

My guess as to what happens: the start code of a shepherd service
fails between 'fork' and 'exec', with an exception. The exception
isn't caught (or is caught and reraised), so the 'out' guard of the
'dynamic-wind' is entered and the file representing the socket is
deleted.
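
To make that guess concrete, here is a minimal Guile sketch. It is not Shepherd's code: `call-with-fake-socket`, `note!` and the `events` list are made up for illustration, and no real socket file is involved. It only shows that an exception escaping the body of 'dynamic-wind' runs the 'out' guard before an outer handler gets to see the exception.

```scheme
;; Hypothetical stand-in for call-with-server-socket: instead of touching
;; a real socket file, record what would happen in EVENTS.
(define events '())

(define (note! event)
  (set! events (cons event events)))

(define (call-with-fake-socket proc)
  (dynamic-wind
    (lambda () (note! 'socket-created))
    (lambda () (proc 'fake-socket))
    ;; The 'out' guard: runs on normal return *and* on non-local exit.
    (lambda () (note! 'socket-deleted))))

;; A "service" whose start code throws while the socket is in use.
(catch #t
  (lambda ()
    (call-with-fake-socket
     (lambda (sock) (error "service failed to start"))))
  (lambda (key . args)
    (note! 'exception-handled)))

(display (reverse events))
(newline)
;; prints: (socket-created socket-deleted exception-handled)
```

The "socket" is gone by the time the handler runs, so a process that carries on after handling the exception is left without it.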

If that's indeed the case, it might be a good idea to install
some exception handlers in fork+exec-command and friends (including
make-forkexec-constructor/container), to make shepherd more robust
w.r.t. services failing to start.

Greetings,
Maxime.
Attila Lendvai wrote on 4 Apr 2022 09:15
(name . Maxime Devos)(address . maximedevos@telenet.be)(address . 53580@debbugs.gnu.org)
6NiL_Ch8DBvICfU6SITfzvKgAYXSazejBXnarb4WkiHJIy5ueKpiqTd5Jwr5SJiyzUCDxhH-ebv_vUxI5Vf8jD484kNu3Ykc1f6f48nt1ZU=@lendvai.name
FTR,

the issue is that when Shepherd is booting up, i.e. starting from its config file, it calls the start forms without guarding for any possible exceptions. any error propagates up beyond the loop and up until an unwind protect that deletes the socket.
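
a rough sketch of what guarding the start forms could look like. this is not shepherd's actual loop: the `(NAME . START-THUNK)` representation and the names `start-service-safely` and `start-all` are assumptions made up for this example.

```scheme
(use-modules (srfi srfi-1))

;; SERVICES is a list of (NAME . START-THUNK) pairs; both the
;; representation and the procedure names are illustrative only.
(define (start-service-safely name start)
  "Call START; return #t on success, #f if it raised any exception."
  (catch #t
    (lambda () (start) #t)
    (lambda (key . args)
      (format (current-error-port)
              "service ~a failed to start: ~s ~s~%" name key args)
      #f)))

(define (start-all services)
  "Start every service in SERVICES; return the names of those that failed."
  (filter-map (lambda (entry)
                (let ((name (car entry))
                      (start (cdr entry)))
                  (if (start-service-safely name start) #f name)))
              services))

(define failed
  (start-all
   (list (cons 'networking (lambda () 'ok))
         (cons 'my-broken-service (lambda () (error "oops")))
         (cons 'sshd (lambda () 'ok)))))

(display failed)
(newline)
;; prints: (my-broken-service)
```

the failing service is reported, but the loop goes on to `sshd` and nothing unwinds past it.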

the reason my system seemed fully functional is that my service was pretty much the last one to be started.

--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“I made up the term 'object-oriented', and I can tell you I didn't have C++ in mind.”
— Alan Kay, OOPSLA '97
Maxim Cournoyer wrote on 18 May 2023 14:58
control message for bug #53580
(address . control@debbugs.gnu.org)
87o7mhn6ea.fsf@gmail.com
severity 53580 important
quit
Ludovic Courtès wrote on 18 May 2023 22:12
Re: bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system
(name . Attila Lendvai)(address . attila@lendvai.name)(address . 53580@debbugs.gnu.org)
87jzx51jre.fsf@gnu.org
Hello Attila,

I had totally overlooked this bug report.

Attila Lendvai <attila@lendvai.name> skribis:

> the system seems to work fine. Gnome is up, i can log in with my user, and everything seems to work, except herd.
>
> i encounter this broken state every once in a while. IRC logs also mention this multiple times, but without many insights:
>
> https://logs.guix.gnu.org/guix/search?query=%2Fvar%2Frun%2Fshepherd%2Fsocket
>
> ```
> # herd status
> error: connect: /var/run/shepherd/socket: No such file or directory

[...]

> the issue is that when Shepherd is booting up, i.e. starting from its config file, it calls the start forms without guarding for any possible exceptions. any error propagates up beyond the loop and up until an unwind protect that deletes the socket.
>
> the reason my system seemed fully functional is that my service was pretty much the last one to be started.

Currently (in 0.10.0), the ‘run-daemon’ procedure loads the user’s
config file before listening on /var/run/shepherd/socket. However, if
an exception is thrown from the config file, it stops:

$ echo '(error "oops")' > /tmp/conf.scm
$ ./shepherd -I -s sock -c /tmp/conf.scm
Starting service root...
Service root started.
Service root running with value #t.
Service root has been started.
misc-error(#f "~A" ("oops") #f)

Some deprecated features have been used. Set the environment
variable GUILE_WARN_DEPRECATED to "detailed" and rerun the
program to get more information. Set it to "no" to suppress
this message.
$ echo $?
1

Now, while the config file is being evaluated, shepherd does not listen
on its socket, which isn’t great.

This is mitigated by the use of ‘start-in-the-background’ (introduced in
0.9.0) in the config file, which, as the name implies, doesn’t block
further operation.

So I *think* we’re mostly okay now. The one thing we could do is load
the whole config file in a separate fiber, and maybe it’s fine to keep
going even when there’s an error during config file evaluation?
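
As a sketch of the "keep going" half of that idea only: the helper below is hypothetical (`load-config-safely` is not part of Shepherd), and running it in a separate fiber is deliberately left out so that the example depends on nothing but plain Guile.

```scheme
;; Hypothetical helper, not part of Shepherd: evaluate a configuration
;; file but report, rather than propagate, any exception it raises.
(define (load-config-safely file)
  "Load FILE; return #t on success, #f if evaluating it raised an exception."
  (catch #t
    (lambda () (primitive-load file) #t)
    (lambda (key . args)
      (format (current-error-port)
              "error in ~a: ~s ~s~%" file key args)
      #f)))

;; Write the same failing config as in the transcript to a temporary file.
(define port (mkstemp! (string-copy "/tmp/conf-XXXXXX")))
(define conf (port-filename port))
(write '(error "oops") port)
(close-port port)

(define loaded? (load-config-safely conf))
(delete-file conf)

(display (if loaded? "config loaded" "config failed; still running"))
(newline)
;; prints: config failed; still running
```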

WDYT?

Thanks,
Ludo’.
Attila Lendvai wrote on 27 May 2023 12:33
shepherd's architecture
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 53580@debbugs.gnu.org)
Jf0lcTW5Lw4gnNDSPsv037iYNAMvK28S6tL4Zh0FdGp7nnQCgCD_uITYxJ4PFxqKkaS5CUH_7mUucz2tvKVJKdQt2uhizTDQaiJ0Jup2wbs=@lendvai.name
[forked from: bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system]

> So I think we’re mostly okay now. The one thing we could do is load
> the whole config file in a separate fiber, and maybe it’s fine to keep
> going even when there’s an error during config file evaluation?
>
> WDYT?


i think there's a fundamental issue to be resolved here, and addressing that would implicitly resolve the entire class of issues that this one belongs to.

guile (shepherd) is run as the init process, and because of that it may not exit or be respawned. but at the same time when we reconfigure a guix system, then shepherd's config should not only be reloaded, but its internal state merged with the new config, and potentially even with an evolved shepherd codebase.

i still lack a proper mental model of all this to successfully predict what will happen when i `guix system reconfigure` after i `guix pull`-ed my service code, and/or changed the config of my services.

--------

this problem of migration is pretty much a CS research topic...

ideally, there should be a non-shepherd-specific protocol defined for such migrations, and the new shepherd codebase could migrate its state from the old to the new, with most of the migration code being automatic. some of it must be hand written as required by some semantic changes.

even more ideally, we should have reflexive systems; admit that source code is a graph, and store it as one (as opposed to a string of characters); and our systems should have orthogonal persistency, etc, etc... a far cry from what we have now.

Fare's excellent blog has some visionary thoughts on this, especially in:


but given that we will not have these any time soon... what can we do now?

--------

note: what follows are wild ideas, and i'm not sure i have the necessary understanding of the involved subsystems to properly judge their feasibility... so take them with a pinch of salt.

idea 1
--------

it doesn't seem to be an insurmountable task to make sure that guile can safely unlink a module from its heap, check if there are any references into the module to be dropped, and then reload this module from disk.

the already running fibers would keep the required code in the heap until after they are stopped/restarted. then the module would get GC'd eventually.

this would help solve the problem that a reconfigured service may have a completely different start/stop code. and by taking some careful shortcuts we may be able to make reloading work without having to stop the service process in question.

idea 2
--------

another, probably better idea:

split up shepherd's codebase into isolated parts:

1) the init process

2) the service runners, which are spawned by 1). let's call this part
'the runner'.

3) the CLI scripts that implement stuff like `reboot` by sending a
message to 1).

the runner would spawn and manage the actual daemon binaries/processes.

the init process would communicate with the runners through a channel/pipe that is created when the runners are spawned. i.e. here we wouldn't need an IPC socket file like we need for the communication between the scripts and the init process.

AFAIU the internal structure of shepherd is already turning into something like this with the use of fibers and channels. i suspect Ludo has something like this on his mind already.

in this setup most of the complexity and the evolution of the shepherd codebase would happen in the runner, and the other two parts could be kept minimal and would rarely need to change (and thus require a reboot).

the need for a reboot could be detected by noticing that the compiled binary of the init process has changed compared to what is currently running as PID 1.

the driver process of a service could be reloaded/respawned the next time when the daemon is stopped or it quits unexpectedly.

--------

recently i've successfully written a shepherd service that spawns a daemon, and from a fiber it does two way communication with the daemon using a pipe connected to the daemon's stdio. i guess that counts as a proof of concept for the second idea, but i'm not sure about its stability. a stuck/failing service is a different issue than a stuck/failing init process.
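
to illustrate just the stdio-pipe part (this is a standalone sketch, not the service code mentioned here): it uses `open-pipe*` from guile's `(ice-9 popen)`, with `cat` standing in for a daemon that answers on stdout, and it runs outside of any fiber. the `ask` helper is made up for the example.

```scheme
(use-modules (ice-9 popen)
             (ice-9 rdelim))

;; `cat` stands in for a daemon that answers on stdout what it reads on
;; stdin; a real service would spawn its own binary here.
(define daemon (open-pipe* OPEN_BOTH "cat"))

(define (ask line)
  "Send LINE to the child process and return the line it answers with."
  (write-line line daemon)
  (force-output daemon)     ; the port is buffered: flush before reading
  (read-line daemon))

(define reply (ask "ping"))

;; Closing the pipe gives the child EOF on stdin and waits for it to exit.
(close-pipe daemon)

(display reply)
(newline)
```

with larger messages both sides can block on a full pipe, which is one reason a stuck daemon conversation is better kept out of the init process itself.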

for reference, the spawning of the daemon:


the fiber's code that talks to it:


--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“We reject: kings, presidents and voting. We believe in: rough consensus and running code.”
— David Clark for the IETF
Attila Lendvai wrote on 29 May 2023 00:23
(name . Ludovic Courtès)(address . ludo@gnu.org)
Fe3OtPtYH2PHkXerCVPLsOIdvUw04jv5hYL_lms1t-V-JQ8vTvdHyu7Lk-PYfYIBns96NX34wxf8Yb3uNiiPGQzOMjB2hO7_yn3lrILV6fA=@lendvai.name
[resending to include the guix-devel list. apologies for everyone who receives this mail twice!]

--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“Dying societies accumulate laws like dying men accumulate remedies.”
— Nicolás Gómez Dávila (1913–1994), 'Escolios a un texto implicito: Seleccion'
Brian Cully wrote on 29 May 2023 16:46
(name . Attila Lendvai)(address . attila@lendvai.name)
87fs7fxjut.fsf@psyduck.jhoto.kublai.com
Attila Lendvai <attila@lendvai.name> writes:

> it doesn't seem to be an insurmountable task to make sure that
> guile
> can safely unlink a module from its heap, check if there are any
> references into the module to be dropped, and then reload this
> module
> from disk.
>
> the already running fibers would keep the required code in the
> heap
> until after they are stopped/restarted. then the module would
> get GC'd
> eventually.
>
> this would help solve the problem that a reconfigured service
> may have
> a completely different start/stop code. and by taking some
> careful
> shortcuts we may be able to make reloading work without having
> to stop
> the service process in question.

Erlang has had hot code reloading for decades, built around the
needs of 100% uptime systems. The problem is more complex than it
often appears to people who are used to how lisps traditionally do
it. I strongly recommend reading up on Erlang's migration
system. Briefly: you can't just swap out function definitions,
because they rely on non-function state which needs to be migrated
along with the function itself, and you can't do it whenever you
want, because external actors may be relying on a view of the
internal state. To accomplish this, Erlang has a lot of machinery,
and it fits into the core design of the language and runtime
which would be extremely difficult to port over to non-Erlang
languages. Doing it in Scheme is probably possible in an academic
sense, but not in a practical one.

OTOH, Lisp Flavoured Erlang exists if you want that syntax. There
would definitely be advantages to writing an init (and, indeed,
any service that needs 100% uptime) on top of the Erlang virtual
machine. But going the other way, by porting Erlang's
functionality into Scheme, is going to be a wash.

> in this setup most of the complexity and the evolution of the
> shepherd
> codebase would happen in the runner, and the other two parts
> could be
> kept minimal and would rarely need to change (and thus require a
> reboot).

Accepting that dramatic enough changes to PID 1 are going to
require a reboot seems reasonable to me. They should be even more
rare than kernel updates, and we accept rebooting there already.

-bjc
Felix Lechner wrote on 29 May 2023 17:18
(name . Brian Cully)(address . bjc@spork.org)
CAFHYt56fWdVFbd-BNX1TKqabLtHC5L7QwBY1YdXhsWw_4cp3Ng@mail.gmail.com
Hi Brian,

On Mon, May 29, 2023 at 8:02 AM Brian Cully via Development of GNU
Guix and the GNU System distribution. <guix-devel@gnu.org> wrote:
>
> Erlang has had hot code reloading for decades

Thank you for that pointer! I also had Erlang on my mind while reading
Attila's message.

> Lisp Flavoured Erlang exists if you want that syntax. There
> would definitely be advantages to writing an init (and, indeed,
> any service that needs 100% uptime) on top of the Erlang virtual
> machine.

“Twenty years from now you will be more disappointed by the things
that you didn't do than by the ones you did do. So throw off the
bowlines. Sail away from the safe harbor. Catch the trade winds in
your sails. Explore. Dream. Discover.” --- H. Jackson Brown Jr in
"P.S. I Love You"

Kind regards
Felix
Ludovic Courtès wrote on 6 Jun 2023 17:16
(name . Attila Lendvai)(address . attila@lendvai.name)(address . 53580@debbugs.gnu.org)
87v8g07ham.fsf@gnu.org
Hi Attila,

Attila Lendvai <attila@lendvai.name> skribis:

> [forked from: bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system]
>
>> So I think we’re mostly okay now. The one thing we could do is load
>> the whole config file in a separate fiber, and maybe it’s fine to keep
>> going even when there’s an error during config file evaluation?
>>
>> WDYT?
>
>
> i think there's a fundamental issue to be resolved here, and addressing that would implicitly resolve the entire class of issues that this one belongs to.
>
> guile (shepherd) is run as the init process, and because of that it may not exit or be respawned. but at the same time when we reconfigure a guix system, then shepherd's config should not only be reloaded, but its internal state merged with the new config, and potentially even with an evolved shepherd codebase.

Sorry to be direct: is there a concrete bug you’re reporting here?

> i still lack a proper mental model of all this to successfully predict what will happen when i `guix system reconfigure` after i `guix pull`-ed my service code, and/or changed the config of my services.

What happens is that ‘guix system reconfigure’ loads new services into
the running shepherd. New services simply get started; services for
which a same-named service is already running instead get registered as
a “replacement”, meaning that the new version of the service only gets
started when the user explicitly runs ‘herd restart SERVICE’.
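
A toy model of those semantics, in plain Guile. None of the names below (`registry`, `register!`, `restart!`, `running-version`) are Shepherd's API; they only mirror the behaviour described in the paragraph above.

```scheme
;; Toy model only; none of these names are Shepherd's API.
;; REGISTRY maps a service name to (RUNNING . REPLACEMENT), where each
;; slot holds a version label or #f.
(define registry (make-hash-table))

(define (register! name version)
  "Model what reconfigure does when it loads service NAME at VERSION."
  (let ((entry (hashq-ref registry name)))
    (if entry
        ;; A same-named service is already running: queue a replacement.
        (hashq-set! registry name (cons (car entry) version))
        ;; A new service simply gets started.
        (hashq-set! registry name (cons version #f)))))

(define (restart! name)
  "Model `herd restart NAME`: the queued replacement, if any, takes over."
  (let ((entry (hashq-ref registry name)))
    (when (and entry (cdr entry))
      (hashq-set! registry name (cons (cdr entry) #f)))))

(define (running-version name)
  (let ((entry (hashq-ref registry name)))
    (and entry (car entry))))

(register! 'nginx 'v1)          ; first load: started right away
(register! 'nginx 'v2)          ; reconfigure: v2 is only registered
(define before (running-version 'nginx))
(restart! 'nginx)
(define after (running-version 'nginx))

(display (list before after))
(newline)
;; prints: (v1 v2)
```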

Non-stop upgrades are ideal, but shepherd alone cannot do that. For
instance, nginx supports that, and no init system could implement that
on its behalf.

Ludo’.
Csepp wrote on 8 Jun 2023 14:54
Re: bug#53580: shepherd's architecture
(name . Ludovic Courtès)(address . ludo@gnu.org)
874jnidscx.fsf@riseup.net
Ludovic Courtès <ludo@gnu.org> writes:

> Hi Attila,
>
> Attila Lendvai <attila@lendvai.name> skribis:
>
>> [forked from: bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system]
>>
>>> So I think we’re mostly okay now. The one thing we could do is load
>>> the whole config file in a separate fiber, and maybe it’s fine to keep
>>> going even when there’s an error during config file evaluation?
>>>
>>> WDYT?
>>
>>
>> i think there's a fundamental issue to be resolved here, and
>> addressing that would implicitly resolve the entire class of issues
>> that this one belongs to.
>>
>> guile (shepherd) is run as the init process, and because of that it
>> may not exit or be respawned. but at the same time when we reconfigure
>> a guix system, then shepherd's config should not only be reloaded,
>> but its internal state merged with the new config, and potentially
>> even with an evolved shepherd codebase.
>
> Sorry to be direct: is there a concrete bug you’re reporting here?
>
>> i still lack a proper mental model of all this to successfully
>> predict what will happen when i `guix system reconfigure` after i
>> `guix pull`-ed my service code, and/or changed the config of my
>> services.
>
> What happens is that ‘guix system reconfigure’ loads new services into
> the running shepherd. New services simply get started; services for
> which a same-named service is already running instead get registered as
> a “replacement”, meaning that the new version of the service only gets
> started when the user explicitly runs ‘herd restart SERVICE’.
>
> Non-stop upgrades are ideal, but shepherd alone cannot do that. For
> instance, nginx supports that, and no init system could implement that
> on its behalf.
>
> Ludo’.

Do services get a reference to their previously running version?
The Minix project was experimenting with supporting something like
supervisor trees for high uptime, and one way they were trying to
achieve that was by giving services the memory of their previous
version, so they could read their state and migrate it to their own
memory.
Attila Lendvai wrote on 8 Jun 2023 22:56
Re: shepherd's architecture
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 53580@debbugs.gnu.org)
LobCX9UrNR84j_EQcEgHGmbNdHFgCQlPg4GFg-pt5fmdNGw7XTLv-3_6OgUnSePBAnQspps4icDGEvi5bd166sSH6Navy1J1c875tytgh18=@lendvai.name
> Sorry to be direct: is there a concrete bug you’re reporting here?


i didn't pay careful enough attention to report something specific, but one thing that pops to mind:

when i'm working on my service code, which is `guix pull`ed in from my channel, then after a reconfigure i seem to have to reboot for my new code to get activated. a simple `herd restart` on the service didn't seem to be enough. i.e. the guile modules that my service code is using did not get reloaded into the PID 1 guile.

keep in mind that this is a non-trivial service that e.g. spawns a long-lived fiber to talk to the daemon through its stdio while the daemon is running. IOW, its start GEXP is not just a simple forkexec, but something more complex that uses functions from guile modules that should be reloaded into PID 1 when the new version of the service is to be started.

--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“The unexamined life is not worth living for a human being.”
— Socrates (c. 470–399 BC, tried and executed), 'Apology' (399 BC)
Ludovic Courtès wrote on 11 Jun 2023 16:16
Re: bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system
(name . Attila Lendvai)(address . attila@lendvai.name)(address . 53580@debbugs.gnu.org)
87352yrsmy.fsf_-_@gnu.org
Hi,

Attila Lendvai <attila@lendvai.name> skribis:

> when i'm working on my service code, which is `guix pull`ed in from my channel, then after a reconfigure i seem to have to reboot for my new code to get activated. a simple `herd restart` on the service didn't seem to be enough. i.e. the guile modules that my service code is using did not get reloaded into the PID 1 guile.

Guile modules do not get reloaded; there’s no mechanism in place to
reload previously-loaded Guile modules.
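
A small plain-Guile illustration of the underlying behaviour, using `(srfi srfi-1)` only because it is always available: once a module is in the running process, asking for it again hands back the same in-memory object.

```scheme
;; Asking for an already-loaded module returns the object that is
;; already in memory; the file on disk is not consulted again.
(define first-lookup  (resolve-module '(srfi srfi-1)))
(define second-lookup (resolve-module '(srfi srfi-1)))

(display (eq? first-lookup second-lookup))
(newline)
;; prints: #t
```

So a new version of the module's file on disk has no effect on code that is already running in PID 1.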

> keep in mind that this is a non-trivial service that e.g. spawns a long-lived fiber to talk to the daemon through its stdio while the daemon is running. IOW, its start GEXP is not just a simple forkexec, but something more complex that uses functions from guile modules that should be reloaded into PID 1 when the new version of the service is to be started.

OK, got it. There’s not enough info here to be concrete, but I’d
recommend making it a separate process if you need to reliably
reload/replace the module. IOW, you’d make it a “regular” service
spawned with ‘make-forkexec-constructor’ or similar.
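
For illustration, such a “regular” service could look roughly like the fragment below. It assumes the `(shepherd service)` interface as of 0.10 (`service`, `make-forkexec-constructor`, `make-kill-destructor`, `register-services`); the service name, the path and the `--foreground` flag are placeholders, not taken from the report.

```scheme
;; Fragment for a Shepherd configuration file (assumes Shepherd 0.10's
;; (shepherd service) interface).  The command line is a placeholder.
(define my-daemon
  (service '(my-daemon)
           #:documentation "Run my-daemon as a separate process."
           #:start (make-forkexec-constructor
                    '("/path/to/my-daemon" "--foreground"))
           #:stop (make-kill-destructor)))

(register-services (list my-daemon))
```

Since the daemon is a separate process, restarting the service starts a fresh process with whatever code is on disk at that time.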

However this doesn’t have anything to do with the initial bug report and
the title of this message; for clarity, please move further discussion
to guix-devel.

Thanks,
Ludo’.
Ludovic Courtès wrote on 11 Jun 2023 16:18
(name . Attila Lendvai)(address . attila@lendvai.name)(address . 53580@debbugs.gnu.org)
87y1kqqdzl.fsf@gnu.org
Attila Lendvai <attila@lendvai.name> skribis:

> (define (call-with-server-socket file-name proc)
>   "Call PROC, passing it a listening socket at FILE-NAME and deleting the
> socket file at FILE-NAME upon exit of PROC.  Return the values of PROC."
>   (let ((sock (open-server-socket file-name)))
>     (dynamic-wind
>       noop
>       (lambda () (proc sock))
>       (lambda ()
>         (close sock)
>         (catch-system-error (delete-file file-name))))))

For the record, ‘dynamic-wind’ here was replaced by ‘catch’ in
46790f9d924af2a9521adccb9e6db6afd9c1a2e7, which corresponds to the
introduction of Fibers in 0.9.x.
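
The commit message is not quoted here, so take the following only as a general illustration of how ‘dynamic-wind’ interacts with code that can be suspended and resumed, which is what Fibers does via delimited continuations. It is a plain-Guile sketch with made-up names (`events`, `note!`, `tag`) and does not use Fibers or Shepherd.

```scheme
(define events '())

(define (note! event)
  (set! events (cons event events)))

(define tag (make-prompt-tag "suspend"))

;; Suspend in the middle of a dynamic-wind body by aborting to a prompt;
;; the handler returns the captured continuation K.
(define k
  (call-with-prompt tag
    (lambda ()
      (dynamic-wind
        (lambda () (note! 'enter))
        (lambda ()
          (abort-to-prompt tag)
          (note! 'resumed))
        (lambda () (note! 'cleanup))))
    (lambda (k) k)))

;; Merely suspending has already run the 'out' guard.
(define after-suspend (reverse events))

;; Resuming re-runs the 'in' guard, finishes the body, and runs the
;; 'out' guard a second time.
(k)
(define after-resume (reverse events))

(display after-suspend)
(newline)
(display after-resume)
(newline)
;; prints: (enter cleanup)
;;         (enter cleanup enter resumed cleanup)
```

With cleanup such as deleting a socket file in the 'out' guard, every suspension would trigger it, whereas a ‘catch’ handler only runs when an exception is actually thrown.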

Ludo’.