Shepherd non-deterministically fails to load the "guix-daemon" service after the "user-processes" service has been started

  • Open
  • quality assurance status badge
Details
2 participants
  • Ludovic Courtès
  • Markus Nilsson
Owner
unassigned
Submitted by
Markus Nilsson
Severity
normal
M
M
Markus Nilsson wrote on 12 Jul 2022 15:07
(address . bug-guix@gnu.org)
CA+phM3JnBbA_0M16u-abNE_8n2dw3BssAca8igy2WF2fnqeMhg@mail.gmail.com
TITLE
Shepherd non-deterministically fails to load the "guix-daemon" service
after the "user-processes" service has been started

EXPECTED BEHAVIOUR
Shepherd will always start the "guix-daemon" service after the
"user-processes" service has been started.

OBSERVED BEHAVIOUR
Shepherd will sometimes start the "guix-daemon" service after the
"user-processes" service has been started and sometimes it won't.

HARDWARE ENVIRONMENT
Motherboard: Asus M5A88-M EVO
CPU: AMD Athlon II X4 640 Processor
SAS card: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev
02)
Hard drive: 9x assorted spinning rust sata hard drives

SOFTWARE ENVIRONMENT
See details below

STEPS TO REPRODUCE
Create the software environment detailed below on an equivalent hardware
environment to the one listed above

DETAILS
I have a media server running NFS to host videos to computers on my home
network. The media server has 9 hard drives in it. 5 hard drives are
connected to the motherboard's sata ports and another four are connected to
a HBA SAS controller card in IT mode sitting in the motherboard's pci-e
slot. The hard-drives are in an LVM configuration like below:

mark@streetkid ~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE
MOUNTPOINTS
sda 8:0 0 465.8G 0 disk
??sda1 8:1 0 2M 0 part
??sda2 8:2 0 100G 0 part
/gnu/store
? /
??sda3 8:3 0 365.8G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sdb 8:16 0 931.5G 0 disk
??sdb1 8:17 0 931.5G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sdc 8:32 0 149G 0 disk
??sdc1 8:33 0 149G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sdd 8:48 0 465.8G 0 disk
??sdd1 8:49 0 465.8G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sde 8:64 0 465.8G 0 disk
??sde1 8:65 0 465.8G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sdf 8:80 0 298.1G 0 disk
??sdf1 8:81 0 298.1G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sdg 8:96 0 931.5G 0 disk
??sdg1 8:97 0 931.5G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sdh 8:112 0 465.8G 0 disk
??sdh1 8:113 0 465.8G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sdi 8:128 0 298.1G 0 disk
??sdi1 8:129 0 298.1G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home



Recently I noticed that running guix commands would sometimes fail with the
following error:

mark@streetkid ~$ guix package -m ~/.config/guix/manifest.scm
guix package: error: failed to connect to `/var/guix/daemon-socket/socket':
Connection refused


The NFS service would be working fine but guix would refuse to work. Not
knowing what to do to fix guix I tried rolling back to previous system
generations using the GRUB menu on bootup. Sometimes this would result in
guix working but then on the next bootup - using the same system generation
- guix would fail again. I found it was easy to work around the issue by
manually getting shepherd to start the guix-daemon:

mark@streetkid ~$ sudo herd start guix-daemon
Service guix-daemon has been started.


I looked through the kernel log and found that the harddrives connected to
the SAS controller were taking a long while to be brought up which was
blocking services that depend on the "device-mapping" service from being
started. The following snippet from /var/log/messages.1.gz illustrates this:

Jul 1 21:36:58 localhost vmunix: [ 18.538151] shepherd[1]: Service
device-mapping-streetkid_VG_storage-streetkid_LV_home could not be started.
Jul 1 21:36:58 localhost vmunix: [ 18.540000] shepherd[1]: Service
file-system-/home depends on
device-mapping-streetkid_VG_storage-streetkid_LV_home.
Jul 1 21:36:58 localhost vmunix: [ 18.541809] shepherd[1]: Service
file-system-/home could not be started.
Jul 1 21:36:58 localhost vmunix: [ 18.543412] shepherd[1]: Service
file-systems depends on file-system-/home.
Jul 1 21:36:58 localhost vmunix: [ 18.545276] shepherd[1]: Service
file-systems could not be started.
Jul 1 21:36:58 localhost vmunix: [ 18.546921] shepherd[1]: Service
user-processes depends on file-systems.
Jul 1 21:36:58 localhost vmunix: [ 18.548774] shepherd[1]: Service
user-processes could not be started.
Jul 1 21:36:58 localhost vmunix: [ 18.550432] shepherd[1]: Service
guix-daemon depends on user-processes.
Jul 1 21:36:58 localhost vmunix: [ 18.552355] shepherd[1]: Service
guix-daemon could not be started.

Looking at the logs shows that during bootup the Shepherd will make a few
attempts at starting services while waiting for the hard drives connected
to the SAS card to become available. The "device-mapping" service needs the
SAS card hard drives to be up and will fail when they are not. When they
are not up the services that depend on the "device-mapping" service will
fail. After a while the hard drives come online and then the
"device-mapping" service starts, which then causes the services that depend
on it to start up as well.

I then noticed something very strange. On some boot ups the Shepherd would
try and fail a few times to start services that depend on "device-mapping"
and then when the SAS hard drives were up it would successfully load these
dependent services including the "guix-daemon" service. On other boot ups
it would try and fail a few times to start services that depend on
"device-mapping" and then when the SAS hard drives were up it would
successfully load these dependent services BUT WOULD NOT LOAD the
"guix-daemon" service.

To see this non-deterministic bug in the logs open the attached file
"messages.1". Go to line 21664 which is the start of the boot process
starting Jul 1 09:02:35. Search for the next string "Service
user-processes" and you can see this service failing to start on line 22558
and again on line 22581. The "user-processes" service then succeeds on line
22657 and on the next line (22658) you can see that the "guix-daemon"
starts. In summary:

21664 - System starts booting
22558 - "user-processes" service fails to start
22581 - "user-processes" service fails to start again
22657 - "user-processes" service starts successfully
22658 - "guix-daemon" service starts successfully
22837 - System shuts down

On the next boot "guix-daemon" fails to start after "user-processes" has
started:
22838 - System starts booting
23728 - "user-processes" service fails to start
23750 - "user-processes" service fails to start again
23762 - "user-processes" service fails to start a third time
23764 - "guix-daemon" service fails to start
23840 - "user-processes" service starts successfully
23938 - System shuts down

On the next boot "guix-daemon" again fails to start after "user-processes"
has started:
23939 - System starts booting
24826 - "user-processes" service fails to start
24853 - "user-processes" service fails to start again
24865 - "user-processes" service fails to start a third time
24867 - "guix-daemon" service fails to start
24941 - "user-processes" service starts successfully
25053 - System shuts down

On the next boot "guix-daemon" successfully starts after "user-processes"
has started:
25054 - System starts booting
25952 - "user-processes" service fails to start
25975 - "user-processes" service fails to start again
26064 - "user-processes" service starts successfully
26066 - "guix-daemon" service starts successfully
26169 - System shuts down

Note that these four boots are using the same system generation so the
software is the same on all boots.
The generation info is shown below:

Generation 27 Jun 29 2022 22:50:13
file name: /var/guix/profiles/system-27-link
canonical file name: /gnu/store/xrn0lz22gnhhq9ywk30q2nmim3fgx5sd-system
label: GNU with Linux 5.17.15
bootloader: grub
root device: UUID: f75c515d-0ef7-4af1-950e-5486408d2451
kernel: /gnu/store/xbcwcv2a024n0vmncvhgllmz52d4rprj-linux-5.17.15/bzImage
channels:
nonguix:
branch: master
commit: 1a122e06fe046caebf39395edc797515861acd3b
guix:
branch: master
commit: 0fe0c739f53bbb635eb5c4a3b172f4be45293c82
configuration file:
/gnu/store/sniryjir3bhlswyarmdmrwdw8zqh8zgp-configuration.scm

The config file for this generation is in the attachment
"system-config.scm".


I could find two clues which might be relevant:

1) It seems that on the occasions that the "guix-daemon" successfully
starts after "user-processes" has started the Shepherd hasn't tried and
then failed to start it earlier. Conversely, on the occasions when the
"guix-daemon" service fails to start after "user-processes" has started the
shepherd will have tried and failed to start the "guix-daemon" service.
Does the "guix-daemon" service definition state that it should not attempt
to start if it has failed to start before?

2) The earlier boots in the kernel log seem to not have the problem of the
SAS hard drives not coming up in time to cause services that depend on them
from failing to start. This might mean that the SAS card is starting to
fail. On the other hand once the hard drives are up the system is stable
and the NFS service works fine.

A screenshot of the lvm error messages from a recent boot up showing the
hard drives failing to be detected is shown in the "lvm-error-messages.jpg"
attachment.
Attachment: file
Attachment: messages.1
;; This is an operating system configuration generated ;; by the graphical installer. (use-modules (gnu) (nongnu packages linux) (nongnu system linux-initrd)) (use-service-modules desktop networking ssh xorg nfs) (define %my-wol-rule (udev-rule "90-my-wol.rules" (string-append "ACTION==\"add\", " "SUBSYSTEM==\"net\", " "NAME==\"yolo420\", " "RUN+=\"/run/current-system/profile/sbin/ethtool -s $name wol g\""))) (define %my-NIC-name-rule (udev-rule "60-my-NIC-name.rules" (string-append "ACTION==\"add\", " "SUBSYSTEM==\"net\", " "ATTR{address}==\"XX:XX:XX:XX:XX:XX\", " "NAME:=\"yolo420\""))) (operating-system (kernel linux) (initrd microcode-initrd) (firmware (list linux-firmware));; So onboard radeon card works (locale "en_AU.utf8") (timezone "Australia/Sydney") (keyboard-layout (keyboard-layout "us")) (host-name "streetkid") (users (cons* (user-account (name "mark") (comment "Mark") (group "users") (home-directory "/home/mark") (supplementary-groups '("wheel" "netdev" "audio" "video"))) %base-user-accounts)) (packages (append (list (specification->package "nss-certs") (specification->package "nfs-utils") (specification->package "hdparm") (specification->package "smartmontools") (specification->package "ethtool") (specification->package "lvm2") (specification->package "vim")) %base-packages)) (services (cons* (service static-networking-service-type ;; For static networking (list (static-networking (addresses (list (network-address (device "yolo420");; For streetkid NIC (value "192.168.0.6/24")))) (routes (list (network-route (destination "default") (gateway "192.168.0.1"))))))) (service openssh-service-type) (service ntp-service-type) (udev-rules-service 'my-wol-rule %my-wol-rule) (udev-rules-service 'my-NIC-name-rule %my-NIC-name-rule) (service nfs-service-type (nfs-configuration (exports '(("/home/mark/exports" "192.168.0.0/24(rw,fsid=0,no_subtree_check)") ("/home/mark/exports/pictures" "192.168.0.0/24(rw,no_subtree_check)") ("/home/mark/exports/videos" "192.168.0.0/24(rw,no_subtree_check)"))))) (modify-services %base-services (guix-service-type config => (guix-configuration (inherit config) (substitute-urls (append (list "https://substitutes.nonguix.org") %default-substitute-urls)) (authorized-keys (append (list (local-file "/home/mark/.config/guix/substitute-server-signing-keys/nonguix-signing-key.pub")) %default-authorized-guix-keys))))))) (bootloader (bootloader-configuration (bootloader grub-bootloader) (targets '("/dev/sda")) (keyboard-layout keyboard-layout))) (mapped-devices (list (mapped-device (source "streetkid_VG_storage") (target "streetkid_VG_storage-streetkid_LV_home") (type lvm-device-mapping)))) (file-systems (cons* (file-system (mount-point "/") (device (uuid "00000000-0000-0000-0000-000000000000")) (type "ext4")) (file-system (mount-point "/home") (device "/dev/mapper/streetkid_VG_storage-streetkid_LV_home") (type "ext4") (dependencies mapped-devices)) %base-file-systems)))
L
L
Ludovic Courtès wrote on 18 Jul 2022 14:54
(name . Markus Nilsson)(address . markusnilsson890@gmail.com)(address . 56519@debbugs.gnu.org)
87cze2zimu.fsf@gnu.org
Hi Markus,

From the /var/log/messages excerpt you sent, it would seem that the root
cause is failing to device mapping for the “streetkid” storage:

Markus Nilsson <markusnilsson890@gmail.com> skribis:

Toggle quote (12 lines)
> Jul 1 21:39:35 localhost vmunix: [ 16.150396] sd 6:0:1:0: Attached scsi generic sg6 type 0
> Jul 1 21:39:35 localhost vmunix: [ 16.157673] sd 6:0:0:0: [sdf] 625142448 512-byte logical blocks: (320 GB/298 GiB)
> Jul 1 21:39:35 localhost vmunix: [ 16.157689] sd 6:0:0:0: [sdf] 4096-byte physical blocks
> Jul 1 21:39:35 localhost vmunix: [ 16.195016] shepherd[1]: Service device-mapping-streetkid_VG_storage-streetkid_LV_home could not be started.
> Jul 1 21:39:35 localhost vmunix: [ 16.197048] shepherd[1]: Service file-system-/home depends on device-mapping-streetkid_VG_storage-streetkid_LV_home.
> Jul 1 21:39:35 localhost vmunix: [ 16.199307] shepherd[1]: Service file-system-/home could not be started.
> Jul 1 21:39:35 localhost vmunix: [ 16.203577] shepherd[1]: Service file-system-/dev/pts has been started.
> Jul 1 21:39:35 localhost vmunix: [ 16.206789] shepherd[1]: Service file-system-/sys/kernel/debug has been started.
> Jul 1 21:39:35 localhost vmunix: [ 16.209950] shepherd[1]: Service file-system-/dev/shm has been started.
> Jul 1 21:39:35 localhost vmunix: [ 16.212595] shepherd[1]: Service file-system-/sys/firmware/efi/efivars has been started.
> Jul 1 21:39:35 localhost vmunix: [ 16.215719] shepherd[1]: Service file-system-/gnu/store has been started.

[...]

Toggle quote (14 lines)
> Jul 1 21:39:35 localhost vmunix: [ 17.481611] shepherd[1]: Service nscd could not be started.
> Jul 1 21:39:35 localhost vmunix: [ 17.499327] ppdev: user-space parallel port driver
> Jul 1 21:39:35 localhost vmunix: [ 17.691044] sdi: sdi1
> Jul 1 21:39:35 localhost vmunix: [ 17.949918] sd 6:0:3:0: [sdi] Attached SCSI disk
> Jul 1 21:39:35 localhost vmunix: [ 17.956933] shepherd[1]: Service device-mapping-streetkid_VG_storage-streetkid_LV_home could not be started.
> Jul 1 21:39:35 localhost vmunix: [ 17.958868] shepherd[1]: Service file-system-/home depends on device-mapping-streetkid_VG_storage-streetkid_LV_home.
> Jul 1 21:39:35 localhost vmunix: [ 17.960699] shepherd[1]: Service file-system-/home could not be started.
> Jul 1 21:39:35 localhost vmunix: [ 17.962778] shepherd[1]: Service file-systems depends on file-system-/home.
> Jul 1 21:39:35 localhost vmunix: [ 17.964868] shepherd[1]: Service file-systems could not be started.
> Jul 1 21:39:35 localhost vmunix: [ 17.966751] shepherd[1]: Service user-processes depends on file-systems.
> Jul 1 21:39:35 localhost vmunix: [ 17.968643] shepherd[1]: Service user-processes could not be started.
> Jul 1 21:39:35 localhost vmunix: [ 17.970712] shepherd[1]: Service guix-daemon depends on user-processes.
> Jul 1 21:39:35 localhost vmunix: [ 17.971359] shepherd[1]: Service guix-daemon could not be started.

That is: ‘device-mapping-streetkid_VG_storage-streetkid_LV_home’ fails
to start; shepherd tries to start other services, but they depend on it
so every time it fails with “Service X depends on
device-mapping-streetkid_VG_storage-streetkid_LV_home” followed by
“Service X could not be started”. Eventually, that propagates down to
‘user-processes’ and ‘guix-daemon’, both of which can’t be started.

You can see the dependency graph of Shepherd services with:

guix system shepherd-graph …/config.scm | xdot -

So it might not be a Shepherd or Guix issue per se, but rather something
related to that “streetkid” device. I couldn’t find hints in the log
though.

Ideas?

Thanks,
Ludo’.
M
M
Markus Nilsson wrote on 26 Jul 2022 04:48
(address . 56519@debbugs.gnu.org)
CA+phM3+20kfGVinutRiqm_rJszFNgGozHOGnDYDkeqaVyX3u2Q@mail.gmail.com
Hi Ludo,

Thankyou for taking a look at this issue!

I think I didn't make it clear in the original bug report what the actual
problem was.

I understand that in the beginning of the boot process the
"device-mapping-streetkid_VG_storage-streetkid_LV_home" shepherd service is
failing and this is causing shepherd services that depend on it to fail to
start as a consequence. Later in the boot process the
"device-mapping-streetkid_VG_storage-streetkid_LV_home" does successfully
start though.

What happens is that after the
"device-mapping-streetkid_VG_storage-streetkid_LV_home" does start (after a
couple of tries), other shepherd services will then start including the
"user-processes" shepherd service. The problem is that the "guix-daemon"
shepherd service doesn't always start after that. Sometimes it will, but
sometimes it won't (see my first message for examples). Shouldn't the
"guix-daemon" shepherd service ALWAYS start after the "user-processes"
shepherd service starts?

I also found that the "nscd" shepherd service is failing to start even
though the "user-processes" shepherd service has started.

The following terminal output was taken after my streetkid server had
finished booting up. This should make things clearer:

1 mark@streetkid ~$ sudo herd status
2 Started:
3 + /proc/fs/nfsd
4 + console-font-tty1
5 + console-font-tty2
6 + console-font-tty3
7 + console-font-tty4
8 + console-font-tty5
9 + console-font-tty6
10 + device-mapping-streetkid_VG_storage-streetkid_LV_home
11 + file-system-/dev/pts
12 + file-system-/dev/shm
13 + file-system-/gnu/store
14 + file-system-/home
15 + file-system-/sys/firmware/efi/efivars
16 + file-system-/sys/kernel/debug
17 + file-systems
18 + idmap-daemon
19 + loopback
20 + mcron
21 + networking
22 + nfs
23 + ntpd
24 + root
25 + root-file-system
26 + rpc-pipefs
27 + rpc.mountd
28 + rpc.nfsd
29 + rpc.statd
30 + rpcbind-daemon
31 + ssh-daemon
32 + sshd-1
33 + syslogd
34 + term-tty1
35 + term-tty2
36 + term-tty3
37 + term-tty4
38 + term-tty5
39 + term-tty6
40 + udev
41 + urandom-seed
42 + user-file-systems
43 + user-processes
44 + virtual-terminal
45 Stopped:
46 - guix-daemon
47 - nscd
48 - term-console
49 One-shot:
50 * host-name
51 * sysctl
52 * user-homes

As you can see on line 10
"device-mapping-streetkid_VG_storage-streetkid_LV_home" has been started
and on line 43 "user-processes" has been started. The problem is that
"guix-daemon" hasn't been started (line 46) and "nscd" also hasn't been
started (line 47). Shouldn't Shepherd start these two services once
"user-processes" has been started?

I checked other services that depend on "user-processes" (see the attached
shepherd dependency graph for my system). The "mcron" service and "ntpd"
successfully start (lines 20 and 23). This still leaves the mystery of why
"guix-daemon" and "nscd" won't start even though "user-processes" HAS been
started.

I hope this makes things clearer.

Cheers
Markus
Attachment: file
L
L
Ludovic Courtès wrote on 1 Aug 2022 15:56
(name . Markus Nilsson)(address . markusnilsson890@gmail.com)(address . 56519@debbugs.gnu.org)
87sfmgvzin.fsf@gnu.org
Hi Markus,

Markus Nilsson <markusnilsson890@gmail.com> skribis:

Toggle quote (6 lines)
> I checked other services that depend on "user-processes" (see the attached
> shepherd dependency graph for my system). The "mcron" service and "ntpd"
> successfully start (lines 20 and 23). This still leaves the mystery of why
> "guix-daemon" and "nscd" won't start even though "user-processes" HAS been
> started.

Could you check what /var/log/messages says for this specific
occurrence?

It will shed some light as to what services were started, in what order,
how they failed, etc.

Thanks,
Ludo’.
M
M
Markus Nilsson wrote on 5 Aug 2022 08:23
(address . 56519@debbugs.gnu.org)
CA+phM3+_snFw1koh9f9uzYPjBi+TVmqrj=85Hkfc7JJ3oynGNw@mail.gmail.com
Hi Ludo,

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (3 lines)
>Could you check what /var/log/messages says for this specific occurrence?

>It will shed some light as to what services were started, in what order,
how they failed, etc.

I have attached the kernel log information for today (August 5 2022) as
well as the output of "herd status". You can see that on line 1020
"user-processes" successfully starts, but "guix-daemon" or "nscd" doesn't
attempt to start afterwards for some unknown reason. Below is a summary of
the relevant events from the log:

1 - System starts booting
873 - "user-processes" service fails to start
908 - "user-processes" service fails to start a 2nd time
910 - "nscd" service fails to start
920 - "user-processes" service fails to start a 3rd time
922 - "guix-daemon" service fails to start
943 - "user-processes" service fails to start a 4th time
1020 - "user-processes" service starts successfully
1071 - "ntpd" service starts successfully
1075 - "mcron" service starts successfully

This seems to indicate that the Shepherd will attempt to start
"user-processes" multiple times until it succeeds. But Shepherd won't try
to start "guix-daemon" or "nscd" if it has failed once.

Cheers
Markus
Attachment: file
Attachment: Aug_5_kernel_log
Attachment: Aug_5_herd_status
?