Shepherd desertion upon service canonical name change?

  • Open
  • quality assurance status badge
Details
2 participants
  • Ludovic Courtès
  • Bruno Victal
Owner
unassigned
Submitted by
Bruno Victal
Severity
normal
B
B
Bruno Victal wrote on 2 Apr 2023 18:34
(name . bug-guix)(address . bug-guix@gnu.org)
3a857c75-a871-e37d-4ab5-c5fec308a8b9@makinata.eu

Upon a guix system reconfigure, if a running service has undergone a canonical name change since the previous generation
then shutdown or reboot commands fail, with shepherd indicating itself (root service) as stopped.


Suspected commit range with the change that triggered this:
from: f01b5299db6031174f05124b843c936388cd872a --> to: 380faf265b0c3b231ab8b69597d161be5e704e18


Console messages:

Toggle snippet (41 lines)
mirai@guix-nuc ~$ sudo guix system reconfigure config.scm
building /gnu/store/72d7db3ckjswjdfv2w0a6wk2yspqakgc-upgrade-shepherd-services.scm.drv...
shepherd: Service host-name has been started.
shepherd: Service user-homes has been started.
shepherd: Service sysctl has been started.
shepherd: Service host-name has been started.
shepherd: Service term-console could not be started.
shepherd: Service x11-socket-directory has been started.
shepherd: Service NetworkManager conflicts with running services (networking).
To complete the upgrade, run 'herd restart SERVICE' to stop,
upgrade, and restart each service that was not automatically restarted.
Run 'herd status' to view the list of services on your system.
mirai@guix-nuc ~$ sudo reboot
Exiting shepherd...
Service nftables has been stopped.
Service console-font-tty2 has been stopped.
Service term-tty2 has been stopped.
Service avahi-daemon has been stopped.9
Service xorg-server has been stopped.
Service nscd has been stopped.
Service console-font-tty1 has been stopped.
Service term-tty1 has been stopped.
Service mcron has been stopped.
Service console-font-tty5 has been stopped.
Service term-tty5 has been stopped.
Service ntpd has been stopped.
Service tor has been stopped.
wrong-type-arg("for-each" "Wrong type argument: ~S" (4332) ())
Service xvnc has been stopped.
Service git-daemon has been stopped.
Assertion (eq? (canonical-name new) (canonical-name old)) failed.
assertion-failed()
Service root has been stopped.
mirai@guix-nuc ~$ sudo reboot
mirai@guix-nuc ~$ sudo shutdown
Service root is not running.
mirai@guix-nuc ~$


/var/log/messages excerpt:

Toggle snippet (161 lines)
Apr 2 15:12:03 localhost elogind[421]: New session 1 of user root.
Apr 2 15:12:33 localhost nscd: 11685 monitored file `/etc/hosts` was deleted, removing watch
Apr 2 15:12:33 localhost nscd: 11685 monitored file `/etc/hosts` was created, adding watch
Apr 2 15:12:33 localhost nscd: 11685 monitored file `/etc/hosts` was written to
Apr 2 15:12:33 localhost nscd: 11685 monitored file `/etc/nsswitch.conf` was deleted, removing watch
Apr 2 15:12:33 localhost nscd: 11685 monitored file `/etc/nsswitch.conf` was deleted, removing watch
Apr 2 15:12:33 localhost nscd: 11685 monitored file `/etc/nsswitch.conf` was created, adding watch
Apr 2 15:12:33 localhost nscd: 11685 monitored file `/etc/nsswitch.conf` was created, adding watch
Apr 2 15:12:33 localhost nscd: 11685 monitored file `/etc/nsswitch.conf` was written to
Apr 2 15:12:33 localhost nscd: 11685 monitored file `/etc/services` was deleted, removing watch
Apr 2 15:12:33 localhost nscd: 11685 monitored file `/etc/services` was created, adding watch
Apr 2 15:12:33 localhost nscd: 11685 monitored file `/etc/services` was written to
Apr 2 15:12:44 localhost shepherd[1]: Evaluating user expression (and (defined? (quote transient?)) (map (# ?) ?)).
Apr 2 15:12:44 localhost shepherd[1]: Evaluating user expression (let ((services (map primitive-load (# #)))) (?)).
Apr 2 15:12:44 localhost shepherd[1]: Service host-name has been started.
Apr 2 15:12:44 localhost shepherd[1]: Service user-homes has been started.
Apr 2 15:12:44 localhost shepherd[1]: [sysctl] net.ipv6.conf.all.temp_valid_lft = 5400
Apr 2 15:12:44 localhost shepherd[1]: [sysctl] net.ipv6.conf.all.temp_prefered_lft = 2700
Apr 2 15:12:44 localhost shepherd[1]: [sysctl] fs.protected_hardlinks = 1
Apr 2 15:12:44 localhost shepherd[1]: [sysctl] fs.protected_symlinks = 1
Apr 2 15:12:44 localhost shepherd[1]: Service sysctl has been started.
Apr 2 15:12:44 localhost shepherd[1]: Service host-name has been started.
Apr 2 15:12:44 localhost shepherd[1]: Service term-console could not be started.
Apr 2 15:12:44 localhost shepherd[1]: Service x11-socket-directory has been started.
Apr 2 15:12:44 localhost shepherd[1]: Service NetworkManager conflicts with running services (networking).
Apr 2 15:12:44 localhost elogind[421]: Removed session 1.
Apr 2 15:13:07 localhost elogind[421]: New session 1 of user root.
Apr 2 15:13:07 localhost shepherd[1]: Exiting shepherd...
Apr 2 15:13:07 localhost shepherd[1]: Service nftables has been stopped.
Apr 2 15:13:07 localhost shepherd[1]: Service console-font-tty2 has been stopped.
Apr 2 15:13:07 localhost shepherd[1]: Service term-tty2 has been stopped.
Apr 2 15:13:07 localhost avahi-daemon[427]: Got SIGTERM, quitting.
Apr 2 15:13:07 localhost shepherd[1]: Service avahi-daemon has been stopped.
Apr 2 15:13:07 localhost avahi-daemon[427]: Leaving mDNS multicast group on interface eno1.IPv6 with address fd40:cafe:babe:70:b728:c0ed:cc22:2291.
Apr 2 15:13:07 localhost avahi-daemon[427]: Leaving mDNS multicast group on interface eno1.IPv4 with address 192.168.1.201.
Apr 2 15:13:07 localhost avahi-daemon[427]: Leaving mDNS multicast group on interface lo.IPv6 with address ::1.
Apr 2 15:13:07 localhost avahi-daemon[427]: Leaving mDNS multicast group on interface lo.IPv4 with address 127.0.0.1.
Apr 2 15:13:07 localhost avahi-daemon[427]: avahi-daemon 0.8 exiting.
Apr 2 15:13:07 localhost elogind[421]: Removed session c6.
Apr 2 15:13:07 localhost gnome-session-binary[1454]: WARNING: Could not get session class: No such device or address
Apr 2 15:13:07 localhost gdm-session-worker: GLib-GObject: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
Apr 2 15:13:07 localhost elogind[421]: Removed session c1.
Apr 2 15:13:07 localhost gdm: Gdm: Child process 491 was already dead.
Apr 2 15:13:07 localhost gdm: Gdm: Unable to kill session worker process
Apr 2 15:13:07 localhost Xvnc[496]: VNCSConnST: closing 127.0.0.1::41636: Server shutdown
Apr 2 15:13:07 localhost Xvnc[496]: EncodeManager: Framebuffer updates: 19032
Apr 2 15:13:07 localhost Xvnc[496]: EncodeManager: Tight:
Apr 2 15:13:07 localhost Xvnc[496]: EncodeManager: Solid: 45.131 krects, 155.096 Mpixels
Apr 2 15:13:07 localhost Xvnc[496]: EncodeManager: 705.172 KiB (1:859.891 ratio)
Apr 2 15:13:07 localhost Xvnc[496]: EncodeManager: Bitmap RLE: 5.474 krects, 1.53354 Mpixels
Apr 2 15:13:07 localhost Xvnc[496]: EncodeManager: 161.436 KiB (1:37.5043 ratio)
Apr 2 15:13:07 localhost Xvnc[496]: EncodeManager: Indexed RLE: 76.27 krects, 211.555 Mpixels
Apr 2 15:13:07 localhost Xvnc[496]: EncodeManager: 45.3484 MiB (1:17.8152 ratio)
Apr 2 15:13:07 localhost Xvnc[496]: EncodeManager: Full Colour: 53.582 krects, 142.266 Mpixels
Apr 2 15:13:07 localhost Xvnc[496]: EncodeManager: 43.1725 MiB (1:12.5847 ratio)
Apr 2 15:13:07 localhost Xvnc[496]: EncodeManager: Total: 180.457 krects, 510.45 Mpixels
Apr 2 15:13:07 localhost Xvnc[496]: EncodeManager: 89.3672 MiB (1:21.812 ratio)
Apr 2 15:13:07 localhost vmunix: [2762204.195308] traps: .gdm-real[465] general protection fault ip:7f1d9ddbc3a5 sp:7ffff9dad0e0 error:0 in libgobject-2.0.so.0.7000.2[7f1d9dd96000+2e000]
Apr 2 15:13:07 localhost shepherd[1]: Service xorg-server has been stopped.
Apr 2 15:13:07 localhost Xvnc[496]: ComparingUpdateTracker: 5.27525 Gpixels in / 538.205 Mpixels out
Apr 2 15:13:07 localhost Xvnc[496]: ComparingUpdateTracker: (1:9.80156 ratio)
Apr 2 15:13:07 localhost shepherd[1]: Service nscd has been stopped.
Apr 2 15:13:07 localhost shepherd[1]: Service console-font-tty1 has been stopped.
Apr 2 15:13:07 localhost shepherd[1]: Service term-tty1 has been stopped.
Apr 2 15:13:07 localhost shepherd[1]: Service mcron has been stopped.
Apr 2 15:13:07 localhost shepherd[1]: Service console-font-tty5 has been stopped.
Apr 2 15:13:07 localhost shepherd[1]: Service term-tty5 has been stopped.
Apr 2 15:13:07 localhost ntpd[420]: ntpd exiting on signal 15 (Terminated)
Apr 2 15:13:07 localhost ntpd[420]: 194.117.47.42 local addr 192.168.1.201 -> <null>
Apr 2 15:13:07 localhost ntpd[420]: 194.117.47.44 local addr 192.168.1.201 -> <null>
Apr 2 15:13:07 localhost ntpd[420]: 193.136.164.4 local addr 192.168.1.201 -> <null>
Apr 2 15:13:07 localhost ntpd[420]: 178.17.161.12 local addr 192.168.1.201 -> <null>
Apr 2 15:13:07 localhost ntpd[420]: 51.77.89.239 local addr 192.168.1.201 -> <null>
Apr 2 15:13:07 localhost ntpd[420]: 200.25.3.17 local addr 192.168.1.201 -> <null>
Apr 2 15:13:07 localhost ntpd[420]: 51.77.89.237 local addr 192.168.1.201 -> <null>
Apr 2 15:13:07 localhost ntpd[420]: 51.77.89.236 local addr 192.168.1.201 -> <null>
Apr 2 15:13:07 localhost ntpd[420]: 193.136.152.71 local addr 192.168.1.201 -> <null>
Apr 2 15:13:07 localhost shepherd[1]: Service ntpd has been stopped.
Apr 2 15:13:07 localhost shepherd[1]: Respawning xvnc.
Apr 2 15:13:07 localhost shepherd[1]: Service xvnc has been started.
Apr 2 15:13:07 localhost Tor[2]: Catching signal TERM, exiting cleanly.
Apr 2 15:13:07 localhost shepherd[1]: [Xvnc]
Apr 2 15:13:07 localhost shepherd[1]: [Xvnc] Xvnc TigerVNC 1.12.80 - built Jan 1 1970 00:00:01
Apr 2 15:13:07 localhost Xvnc[4332]: vncext: VNC extension running!
Apr 2 15:13:07 localhost shepherd[1]: [Xvnc] Copyright (C) 1999-2022 TigerVNC Team and many others (see README.rst)
Apr 2 15:13:07 localhost Xvnc[4332]: vncext: Listening for VNC connections on local interface(s), port 5910
Apr 2 15:13:07 localhost shepherd[1]: [Xvnc] See https://www.tigervnc.org for information on TigerVNC.
Apr 2 15:13:07 localhost shepherd[1]: [Xvnc] Underlying X server release 12101004
Apr 2 15:13:07 localhost shepherd[1]: [Xvnc]
Apr 2 15:13:07 localhost Xvnc[4332]: vncext: created VNC server for screen 0
Apr 2 15:13:07 localhost shepherd[1]: Service tor has been stopped.
Apr 2 15:13:07 localhost shepherd[1]: wrong-type-arg("for-each" "Wrong type argument: ~S" (4332) ())
Apr 2 15:13:07 localhost shepherd[1]: Service xvnc has been stopped.
Apr 2 15:13:07 localhost shepherd[1]: Service git-daemon has been stopped.
Apr 2 15:13:07 localhost NetworkManager[419]: <info> [1680444787.5126] caught SIGTERM, shutting down normally.
Apr 2 15:13:07 localhost NetworkManager[419]: <info> [1680444787.5333] dhcp4 (eno1): canceled DHCP transaction
Apr 2 15:13:07 localhost NetworkManager[419]: <info> [1680444787.5348] dhcp6 (eno1): activation: beginning transaction (timeout in 45 seconds)
Apr 2 15:13:07 localhost NetworkManager[419]: <info> [1680444787.5348] dhcp6 (eno1): state changed no lease
Apr 2 15:13:07 localhost NetworkManager[419]: <info> [1680444787.5364] manager: NetworkManager state is now CONNECTED_SITE
Apr 2 15:13:07 localhost dbus-daemon[416]: [system] Activating service name='org.freedesktop.nm_dispatcher' requested by ':1.4' (uid=0 pid=419 comm="/gnu/store/aby04j5i1bzfgm6hzjb6qcqwz9f1lcfn-networ") (using servicehelper)
Apr 2 15:13:07 localhost dbus-daemon[416]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Apr 2 15:13:07 localhost NetworkManager[419]: <info> [1680444787.5543] exiting (success)
Apr 2 15:13:07 localhost shepherd[1]: Assertion (eq? (canonical-name new) (canonical-name old)) failed.
Apr 2 15:13:07 localhost shepherd[1]: assertion-failed()
Apr 2 15:13:07 localhost shepherd[1]: Service root has been stopped.
Apr 2 15:13:07 localhost shepherd[1]: [Xvnc] [mi] mieq: warning: overriding existing handler (nil) with 0x52f720 for event 2
Apr 2 15:13:07 localhost shepherd[1]: [Xvnc] [mi] mieq: warning: overriding existing handler (nil) with 0x52f720 for event 3
Apr 2 15:13:07 localhost elogind[421]: Removed session 1.
Apr 2 15:15:13 localhost shepherd[1]: [Xvnc] XDM: too many retransmissions
Apr 2 15:15:13 localhost Xvnc[4332]: ComparingUpdateTracker: 0 pixels in / 0 pixels out
Apr 2 15:15:13 localhost Xvnc[4332]: ComparingUpdateTracker: (1:-nan ratio)
Apr 2 15:15:38 localhost elogind[421]: New session 1 of user root.
Apr 2 15:15:38 localhost elogind[421]: Removed session 1.
Apr 2 15:16:05 localhost elogind[421]: New session 1 of user root.
Apr 2 15:16:05 localhost elogind[421]: Removed session 1.
Apr 2 15:16:09 localhost elogind[421]: New session 1 of user root.
Apr 2 15:16:09 localhost shepherd[1]: Service root is not running.
Apr 2 15:16:09 localhost elogind[421]: Removed session 1.
Apr 2 15:16:14 localhost elogind[421]: New session 1 of user root.
Apr 2 15:16:14 localhost elogind[421]: Removed session 1.
Apr 2 15:16:20 localhost elogind[421]: New session 1 of user root.
Apr 2 15:16:20 localhost elogind[421]: Removed session 1.
Apr 2 15:16:40 localhost elogind[421]: New session 1 of user root.
Apr 2 15:16:40 localhost elogind[421]: Removed session 1.
Apr 2 15:16:49 localhost elogind[421]: New session 1 of user root.
Apr 2 15:16:49 localhost elogind[421]: Removed session 1.
Apr 2 15:16:56 localhost elogind[421]: New session 1 of user root.
Apr 2 15:16:56 localhost elogind[421]: Removed session 1.
Apr 2 15:17:07 localhost elogind[421]: New session 1 of user root.
Apr 2 15:17:07 localhost shepherd[1]: The root service is used to operate on shepherd itself.
Apr 2 15:17:07 localhost elogind[421]: Removed session 1.
Apr 2 15:17:13 localhost elogind[421]: New session 1 of user root.
Apr 2 15:17:13 localhost elogind[421]: Removed session 1.
Apr 2 15:17:24 localhost elogind[421]: New session 1 of user root.
Apr 2 15:17:24 localhost shepherd[1]: root (help status halt power-off load eval unload reload daemonize persistency no-persistency cd restart)
Apr 2 15:17:24 localhost elogind[421]: Removed session 1.
Apr 2 15:17:38 localhost elogind[421]: New session 1 of user root.
Apr 2 15:17:38 localhost shepherd[1]: You must be kidding.
Apr 2 15:17:38 localhost elogind[421]: Removed session 1.
Apr 2 15:18:35 localhost elogind[421]: New session 1 of user root.
Apr 2 15:18:35 localhost elogind[421]: Removed session 1.
Apr 2 15:18:37 localhost elogind[421]: New session 1 of user root.
Apr 2 15:18:37 localhost shepherd[1]: You must be kidding.
Apr 2 15:18:37 localhost elogind[421]: Removed session 1.
Apr 2 15:22:58 localhost elogind[421]: Removed session c28.
Apr 2 15:22:58 localhost shepherd[1]: 1 connection still in use after sshd-32 termination.
Apr 2 15:22:58 localhost shepherd[1]: Service sshd-32 (PID 9577) exited with 255.
Apr 2 15:22:58 localhost shepherd[1]: Service sshd-32 has been disabled.
Apr 2 15:22:58 localhost shepherd[1]: Transient service sshd-32 terminated, now unregistered.
Apr 2 15:24:59 localhost elogind[421]: Power key pressed.
Apr 2 15:24:59 localhost elogind[421]: Powering Off...
Apr 2 15:24:59 localhost elogind[421]: System is powering down..
Apr 2 15:24:59 localhost shepherd[1]: Service root is not running.
Apr 2 15:24:59 localhost shepherd[1]: [elogind] Service root is not running.
Apr 2 15:29:10 localhost elogind[421]: Power key pressed.
Apr 2 15:41:22 localhost pulseaudio[1564]: [pulseaudio] sap.c: sendmsg() failed: Network is unreachable
Apr 2 15:41:57 localhost last message repeated 7 times
Apr 2 15:43:02 localhost last message repeated 13 times
L
L
Ludovic Courtès wrote on 27 Apr 2023 14:55
(name . Bruno Victal)(address . mirai@makinata.eu)(address . 62619@debbugs.gnu.org)
87wn1xwkl1.fsf@gnu.org
Hi Bruno,

Bruno Victal <mirai@makinata.eu> skribis:

Toggle quote (3 lines)
> Upon a guix system reconfigure, if a running service has undergone a canonical name change since the previous generation
> then shutdown or reboot commands fail, with shepherd indicating itself (root service) as stopped.

Oh, fun.

Toggle quote (2 lines)
> mirai@guix-nuc ~$ sudo reboot

[…]

Toggle quote (6 lines)
> Service xvnc has been stopped.
> Service git-daemon has been stopped.
> Assertion (eq? (canonical-name new) (canonical-name old)) failed.
> assertion-failed()
> Service root has been stopped.

I think this assertion failure is the root cause. It comes from
‘register-services’ in 0.9.3, which reads:

Toggle snippet (30 lines)
(define (register-services . new-services)
"Add NEW-SERVICES to the list of known services. If a service has already
been registered, arrange to have it replaced when it is next stopped. If it
is currently stopped, replace it immediately."
(define (register-single-service new)
;; Sanity-checks first.
(assert (list-of-symbols? (provided-by new)))
(assert (list-of-symbols? (required-by new)))
(assert (boolean? (respawn? new)))

;; FIXME: Just because we have a unique canonical name now doesn't mean it
;; will remain unique as other services are added. Whenever a service is
;; added it should check that it's not conflicting with any already
;; registered canonical names.
(match (lookup-services (canonical-name new))
(() ;; empty, so we can safely add ourselves
(for-each (lambda (name)
(let ((old (lookup-services name)))
(hashq-set! %services name (cons new old))))
(provided-by new)))
((old . rest) ;; one service registered, it may be an old version of us
(assert (null? rest))
(assert (eq? (canonical-name new) (canonical-name old)))
(if (running? old)
(slot-set! old 'replacement new)
(replace-service old new)))))

(for-each register-single-service new-services))

‘register-services’ was called from ‘replace-service’, itself called
from ‘stop’ (right after ‘networking’ had been actually stopped):

Toggle snippet (12 lines)
;; SERVICE is no longer running.
(slot-set! service 'running #f)

;; Reset the list of respawns.
(slot-set! service 'last-respawns '())

;; Replace the service with its replacement, if it has one
(let ((replacement (slot-ref service 'replacement)))
(when replacement
(replace-service service replacement))) ;<- here

The assertion failure was thrown here. ‘stop’ calls itself
recursively but it doesn’t catch exceptions from recursive calls:

Toggle snippet (8 lines)
(fold-services (lambda (other acc)
(if (and (running? other)
(required-by? service other))
(append (stop other) acc) ;<- here
acc))
'())

The problem is that ‘root-service’ marks itself as stopped before it has
effectively shut down the services:

Toggle snippet (8 lines)
#:stop (lambda (unused . args)
(local-output (l10n "Exiting shepherd..."))
;; Prevent that we try to stop ourself again.
(slot-set! root-service 'running #f)
(shutdown-services)
(quit))

So what happened is that ‘shutdown-services’ threw, the exception wasn’t
caught, and thus it never called ‘quit’. QED.


The service registry in the soon-to-be-released 0.10.0 no longer has
that assertion failure (it vanished in
08510a2a2aaab388c90dd402bd7506d33014454f). Instead, it registers a
replacement for the first service found to have one of the names of the
new service.

The problem of the #:stop method of ‘root-service’ still exists: if
‘shutdown-services’ throws, the ‘root’ service won’t terminate and it’ll
remain in ‘stopping’ status. We could wrap the ‘shutdown-services’ call
in ‘false-if-exception’, but I’d lean towards not doing it: it’s not
supposed to throw, so maybe it’s best not to swallow the exception.

To summarize, I believe the problem is solved in 0.10.

Thanks,
Ludo’.
?