Date: Sat, 27 May 2023 10:33:41 +0000
To: Ludovic Courtès <ludo@gnu.org>
From: Attila Lendvai <attila@lendvai.name>
Subject: shepherd's architecture
Cc: 53580@debbugs.gnu.org

[forked from: bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system]

> So I think we’re mostly okay now. The one thing we could do is load
> the whole config file in a separate fiber, and maybe it’s fine to keep
> going even when there’s an error during config file evaluation?
>
> WDYT?


i think there's a fundamental issue to be resolved here, and addressing that would implicitly resolve the entire class of issues that this one belongs to.

guile (shepherd) is run as the init process, and because of that it may not exit or be respawned. but at the same time, when we reconfigure a guix system, shepherd's config should not only be reloaded, but its internal state should also be merged with the new config, and potentially even with an evolved shepherd codebase.

i still lack a proper mental model of all this to successfully predict what will happen when i `guix system reconfigure` after i `guix pull`-ed my service code, and/or changed the config of my services.

--------

this problem of migration is pretty much a CS research topic...

ideally, there should be a non-shepherd-specific protocol defined for such migrations, and the new shepherd codebase could migrate its state from the old to the new, with most of the migration code being automatic. some of it would have to be hand-written where semantic changes require it.
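
a minimal sketch of what such a migration could look like, in python purely for illustration (shepherd itself is guile). every name, field and version number below is hypothetical; the point is only the split between fields that are carried over automatically and the occasional hand-written step:

```python
# hypothetical sketch: versioned state migration, not shepherd's actual design.
from typing import Any, Callable, Dict

State = Dict[str, Any]

# hand-written migration steps, keyed by the version they upgrade FROM.
MIGRATIONS: Dict[int, Callable[[State], State]] = {}

def migration(from_version: int):
    """Register a hand-written step that upgrades state from `from_version`."""
    def register(fn):
        MIGRATIONS[from_version] = fn
        return fn
    return register

@migration(1)
def rename_pid_field(state: State) -> State:
    # a semantic change that cannot be derived automatically:
    # (hypothetical) version 2 calls the field 'running-pid' instead of 'pid'.
    new = dict(state)
    new["running-pid"] = new.pop("pid")
    return new

def migrate(state: State, target_version: int) -> State:
    """Upgrade a copy of `state` one version at a time up to `target_version`."""
    current = dict(state)
    while current["version"] < target_version:
        step = MIGRATIONS.get(current["version"])
        if step is not None:
            current = step(current)
        # automatic part: everything no step touched is carried over unchanged.
        current["version"] += 1
    return current

old = {"version": 1, "service": "sshd", "pid": 4242}
new = migrate(old, 2)
```

here `new` keeps the `service` field as it was, while `pid` is moved by the hand-written step; `old` is left untouched, so a failed migration could simply be abandoned.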

even more ideally, we should have reflexive systems; admit that source code is a graph, and store it as one (as opposed to a string of characters); and our systems should have orthogonal persistency, etc, etc... a far cry from what we have now.

Fare's excellent blog has some visionary thoughts on this, especially in:

https://ngnghm.github.io/blog/2015/09/08/chapter-5-non-stop-change/

but given that we will not have these any time soon... what can we do now?

--------

note: what follows are wild ideas, and i'm not sure i have the necessary understanding of the involved subsystems to properly judge their feasibility... so take them with a pinch of salt.

idea 1
--------

it doesn't seem to be an insurmountable task to make sure that guile can safely unlink a module from its heap, check if there are any references into the module to be dropped, and then reload this module from disk.

the already running fibers would keep the required code in the heap until after they are stopped/restarted. then the module would get GC'd eventually.

this would help solve the problem that a reconfigured service may have a completely different start/stop code. and by taking some careful shortcuts we may be able to make reloading work without having to stop the service process in question.
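
to make the lifetime rule concrete, here is a small analogy in python (not guile, and not how guile's module system works internally): an object stands in for one loaded version of a module, a generator stands in for a running fiber, and the `loaded` registry is a made-up name. the running task pins the old code; once it is stopped, the old version becomes collectable:

```python
# illustrative analogy only: python objects stand in for guile modules and fibers.
import gc
import weakref

class ServiceCode:
    """Stands in for one loaded version of a service's module."""
    def __init__(self, version):
        self.version = version

    def run(self):
        # a generator stands in for a running fiber; its frame references
        # `self`, so this code version stays on the heap while it runs.
        while True:
            yield f"tick from code v{self.version}"

# registry of currently loaded code, keyed by service name (hypothetical).
loaded = {"my-service": ServiceCode(1)}

fiber = loaded["my-service"].run()
first = next(fiber)
old_code = weakref.ref(loaded["my-service"])

# "reload": replace the registry entry; new lookups can no longer reach v1.
loaded["my-service"] = ServiceCode(2)

still_old = next(fiber)                  # the running fiber keeps executing v1
alive_while_running = old_code() is not None

# stop the old fiber and start a new one from the reloaded code.
fiber.close()
fiber = loaded["my-service"].run()
gc.collect()
after_restart = next(fiber)
collected = old_code() is None           # nothing references v1 any more
```

the hard part in the real system is of course the "check if there are any references" step, which this toy gets for free from python's reference counting.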

idea 2
--------

another, probably better idea:

split up shepherd's codebase into isolated parts:

 1) the init process

 2) the service runners, which are spawned by 1). let's call this part
    'the runner'.

 3) the CLI scripts that implement stuff like `reboot` by sending a
    message to 1).

the runner would spawn and manage the actual daemon binaries/processes.

the init process would communicate with the runners through a channel/pipe that is created when the runners are spawned. i.e. here we wouldn't need an IPC socket file like we need for the communication between the scripts and the init process.
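
a rough sketch of that channel, again in python only for illustration: the parent plays the init process, the child plays a runner, and the one-line text protocol (`start`, `status`, `quit`) is invented for this example. the pipes come into existence as part of spawning the child, so no socket file on disk is involved:

```python
# hypothetical sketch of the init <-> runner channel; not shepherd code.
import subprocess
import sys

# the runner: reads one command per line on stdin, answers on stdout.
RUNNER_SOURCE = r"""
import sys
for line in sys.stdin:
    command = line.strip()
    if command == "quit":
        print("bye", flush=True)
        break
    # a real runner would spawn/stop the actual daemon here.
    print("ok " + command, flush=True)
"""

def spawn_runner():
    """Spawn a runner process with a pipe pair attached to its stdio."""
    return subprocess.Popen(
        [sys.executable, "-c", RUNNER_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )

def request(runner, command):
    """Send one command to the runner and wait for its one-line reply."""
    runner.stdin.write(command + "\n")
    runner.stdin.flush()
    return runner.stdout.readline().strip()

runner = spawn_runner()
replies = [request(runner, c) for c in ("start", "status", "quit")]
runner.stdin.close()
exit_code = runner.wait()
runner.stdout.close()
```

note that `request` blocks until the runner answers; a real init process would need timeouts or non-blocking reads here, because a stuck runner must never be able to hang PID 1.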

AFAIU the internal structure of shepherd is already turning into something like this with the use of fibers and channels. i suspect Ludo has something like this on his mind already.

in this setup most of the complexity and the evolution of the shepherd codebase would happen in the runner, and the other two parts could be kept minimal and would rarely need to change (and thus require a reboot).

the need for a reboot could be detected by noticing that the compiled binary of the init process has changed compared to what is currently running as PID 1.
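
one possible way to do that check, sketched in python under the assumption that comparing file contents is good enough (the function names are made up, and on a guix system comparing store paths might be simpler):

```python
# hypothetical sketch; paths and function names are illustrative assumptions.
import hashlib
import os
import tempfile

def file_digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def needs_reboot(running_exe, new_exe):
    """True when the newly built binary differs from the one that is running.

    On Linux, `running_exe` could be "/proc/1/exe", which refers to the
    executable of PID 1 (reading it typically requires root)."""
    return file_digest(running_exe) != file_digest(new_exe)

# demo with throwaway files standing in for the two binaries.
with tempfile.TemporaryDirectory() as tmp:
    running = os.path.join(tmp, "init-running")
    same = os.path.join(tmp, "init-same")
    changed = os.path.join(tmp, "init-changed")
    for path, content in ((running, b"v1"), (same, b"v1"), (changed, b"v2")):
        with open(path, "wb") as f:
            f.write(content)
    unchanged_result = needs_reboot(running, same)
    changed_result = needs_reboot(running, changed)
```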

the driver process of a service could be reloaded/respawned the next time the daemon is stopped or quits unexpectedly.

--------

recently i've successfully written a shepherd service that spawns a daemon, and from a fiber it does two-way communication with the daemon using a pipe connected to the daemon's stdio. i guess that counts as a proof of concept for the second idea, but i'm not sure about its stability. a stuck/failing service is a different issue than a stuck/failing init process.

for reference, the spawning of the daemon:

https://github.com/attila-lendvai/guix-crypto/blob/8f996239bb8c2a1103c3e54605faf680fe1ed093/src/guix-crypto/services/swarm.scm#L315

the fiber's code that talks to it:

https://github.com/attila-lendvai/guix-crypto/blob/8f996239bb8c2a1103c3e54605faf680fe1ed093/src/guix-crypto/swarm-utils.scm#L133

--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“We reject: kings, presidents and voting. We believe in: rough consensus and running code.”
	- David Clark for the IETF