(address . bug-guix@gnu.org)
TITLE
Shepherd non-deterministically fails to load the "guix-daemon" service
after the "user-processes" service has been started
EXPECTED BEHAVIOUR
Shepherd will always start the "guix-daemon" service after the
"user-processes" service has been started.
OBSERVED BEHAVIOUR
Shepherd will sometimes start the "guix-daemon" service after the
"user-processes" service has been started and sometimes it won't.
HARDWARE ENVIRONMENT
Motherboard: Asus M5A88-M EVO
CPU: AMD Athlon II X4 640 Processor
SAS card: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev
02)
Hard drive: 9x assorted spinning rust sata hard drives
SOFTWARE ENVIRONMENT
See details below
STEPS TO REPRODUCE
Create the software environment detailed below on an equivalent hardware
environment to the one listed above
DETAILS
I have a media server running NFS to host videos to computers on my home
network. The media server has 9 hard drives in it. 5 hard drives are
connected to the motherboard's sata ports and another four are connected to
a HBA SAS controller card in IT mode sitting in the motherboard's pci-e
slot. The hard-drives are in an LVM configuration like below:
mark@streetkid ~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE
MOUNTPOINTS
sda 8:0 0 465.8G 0 disk
??sda1 8:1 0 2M 0 part
??sda2 8:2 0 100G 0 part
/gnu/store
? /
??sda3 8:3 0 365.8G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sdb 8:16 0 931.5G 0 disk
??sdb1 8:17 0 931.5G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sdc 8:32 0 149G 0 disk
??sdc1 8:33 0 149G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sdd 8:48 0 465.8G 0 disk
??sdd1 8:49 0 465.8G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sde 8:64 0 465.8G 0 disk
??sde1 8:65 0 465.8G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sdf 8:80 0 298.1G 0 disk
??sdf1 8:81 0 298.1G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sdg 8:96 0 931.5G 0 disk
??sdg1 8:97 0 931.5G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sdh 8:112 0 465.8G 0 disk
??sdh1 8:113 0 465.8G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
sdi 8:128 0 298.1G 0 disk
??sdi1 8:129 0 298.1G 0 part
??streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home
Recently I noticed that running guix commands would sometimes fail with the
following error:
mark@streetkid ~$ guix package -m ~/.config/guix/manifest.scm
guix package: error: failed to connect to `/var/guix/daemon-socket/socket':
Connection refused
The NFS service would be working fine but guix would refuse to work. Not
knowing what to do to fix guix I tried rolling back to previous system
generations using the GRUB menu on bootup. Sometimes this would result in
guix working but then on the next bootup - using the same system generation
- guix would fail again. I found it was easy to work around the issue by
manually getting shepherd to start the guix-daemon:
mark@streetkid ~$ sudo herd start guix-daemon
Service guix-daemon has been started.
I looked through the kernel log and found that the harddrives connected to
the SAS controller were taking a long while to be brought up which was
blocking services that depend on the "device-mapping" service from being
started. The following snippet from /var/log/messages.1.gz illustrates this:
Jul 1 21:36:58 localhost vmunix: [ 18.538151] shepherd[1]: Service
device-mapping-streetkid_VG_storage-streetkid_LV_home could not be started.
Jul 1 21:36:58 localhost vmunix: [ 18.540000] shepherd[1]: Service
file-system-/home depends on
device-mapping-streetkid_VG_storage-streetkid_LV_home.
Jul 1 21:36:58 localhost vmunix: [ 18.541809] shepherd[1]: Service
file-system-/home could not be started.
Jul 1 21:36:58 localhost vmunix: [ 18.543412] shepherd[1]: Service
file-systems depends on file-system-/home.
Jul 1 21:36:58 localhost vmunix: [ 18.545276] shepherd[1]: Service
file-systems could not be started.
Jul 1 21:36:58 localhost vmunix: [ 18.546921] shepherd[1]: Service
user-processes depends on file-systems.
Jul 1 21:36:58 localhost vmunix: [ 18.548774] shepherd[1]: Service
user-processes could not be started.
Jul 1 21:36:58 localhost vmunix: [ 18.550432] shepherd[1]: Service
guix-daemon depends on user-processes.
Jul 1 21:36:58 localhost vmunix: [ 18.552355] shepherd[1]: Service
guix-daemon could not be started.
Looking at the logs shows that during bootup the Shepherd will make a few
attempts at starting services while waiting for the hard drives connected
to the SAS card to become available. The "device-mapping" service needs the
SAS card hard drives to be up and will fail when they are not. When they
are not up the services that depend on the "device-mapping" service will
fail. After a while the hard drives come online and then the
"device-mapping" service starts, which then causes the services that depend
on it to start up as well.
I then noticed something very strange. On some boot ups the Shepherd would
try and fail a few times to start services that depend on "device-mapping"
and then when the SAS hard drives were up it would successfully load these
dependent services including the "guix-daemon" service. On other boot ups
it would try and fail a few times to start services that depend on
"device-mapping" and then when the SAS hard drives were up it would
successfully load these dependent services BUT WOULD NOT LOAD the
"guix-daemon" service.
To see this non-deterministic bug in the logs open the attached file
"messages.1". Go to line 21664 which is the start of the boot process
starting Jul 1 09:02:35. Search for the next string "Service
user-processes" and you can see this service failing to start on line 22558
and again on line 22581. The "user-processes" service then succeeds on line
22657 and on the next line (22658) you can see that the "guix-daemon"
starts. In summary:
21664 - System starts booting
22558 - "user-processes" service fails to start
22581 - "user-processes" service fails to start again
22657 - "user-processes" service starts successfully
22658 - "guix-daemon" service starts successfully
22837 - System shuts down
On the next boot "guix-daemon" fails to start after "user-processes" has
started:
22838 - System starts booting
23728 - "user-processes" service fails to start
23750 - "user-processes" service fails to start again
23762 - "user-processes" service fails to start a third time
23764 - "guix-daemon" service fails to start
23840 - "user-processes" service starts successfully
23938 - System shuts down
On the next boot "guix-daemon" again fails to start after "user-processes"
has started:
23939 - System starts booting
24826 - "user-processes" service fails to start
24853 - "user-processes" service fails to start again
24865 - "user-processes" service fails to start a third time
24867 - "guix-daemon" service fails to start
24941 - "user-processes" service starts successfully
25053 - System shuts down
On the next boot "guix-daemon" successfully starts after "user-processes"
has started:
25054 - System starts booting
25952 - "user-processes" service fails to start
25975 - "user-processes" service fails to start again
26064 - "user-processes" service starts successfully
26066 - "guix-daemon" service starts successfully
26169 - System shuts down
Note that these four boots are using the same system generation so the
software is the same on all boots.
The generation info is shown below:
Generation 27 Jun 29 2022 22:50:13
file name: /var/guix/profiles/system-27-link
canonical file name: /gnu/store/xrn0lz22gnhhq9ywk30q2nmim3fgx5sd-system
label: GNU with Linux 5.17.15
bootloader: grub
root device: UUID: f75c515d-0ef7-4af1-950e-5486408d2451
kernel: /gnu/store/xbcwcv2a024n0vmncvhgllmz52d4rprj-linux-5.17.15/bzImage
channels:
nonguix:
repository URL: https://gitlab.com/nonguix/nonguix
branch: master
commit: 1a122e06fe046caebf39395edc797515861acd3b
guix:
repository URL: https://git.savannah.gnu.org/git/guix.git
branch: master
commit: 0fe0c739f53bbb635eb5c4a3b172f4be45293c82
configuration file:
/gnu/store/sniryjir3bhlswyarmdmrwdw8zqh8zgp-configuration.scm
The config file for this generation is in the attachment
"system-config.scm".
I could find two clues which might be relevant:
1) It seems that on the occasions that the "guix-daemon" successfully
starts after "user-processes" has started the Shepherd hasn't tried and
then failed to start it earlier. Conversely, on the occasions when the
"guix-daemon" service fails to start after "user-processes" has started the
shepherd will have tried and failed to start the "guix-daemon" service.
Does the "guix-daemon" service definition state that it should not attempt
to start if it has failed to start before?
2) The earlier boots in the kernel log seem to not have the problem of the
SAS hard drives not coming up in time to cause services that depend on them
from failing to start. This might mean that the SAS card is starting to
fail. On the other hand once the hard drives are up the system is stable
and the NFS service works fine.
A screenshot of the lvm error messages from a recent boot up showing the
hard drives failing to be detected is shown in the "lvm-error-messages.jpg"
attachment.
;; This is an operating system configuration generated
;; by the graphical installer.
(use-modules (gnu)
(nongnu packages linux)
(nongnu system linux-initrd))
(use-service-modules desktop networking ssh xorg nfs)
(define %my-wol-rule
(udev-rule
"90-my-wol.rules"
(string-append "ACTION==\"add\", "
"SUBSYSTEM==\"net\", "
"NAME==\"yolo420\", "
"RUN+=\"/run/current-system/profile/sbin/ethtool -s $name wol g\"")))
(define %my-NIC-name-rule
(udev-rule
"60-my-NIC-name.rules"
(string-append "ACTION==\"add\", "
"SUBSYSTEM==\"net\", "
"ATTR{address}==\"XX:XX:XX:XX:XX:XX\", "
"NAME:=\"yolo420\"")))
(operating-system
(kernel linux)
(initrd microcode-initrd)
(firmware (list linux-firmware));; So onboard radeon card works
(locale "en_AU.utf8")
(timezone "Australia/Sydney")
(keyboard-layout (keyboard-layout "us"))
(host-name "streetkid")
(users (cons* (user-account
(name "mark")
(comment "Mark")
(group "users")
(home-directory "/home/mark")
(supplementary-groups
'("wheel" "netdev" "audio" "video")))
%base-user-accounts))
(packages
(append
(list (specification->package "nss-certs")
(specification->package "nfs-utils")
(specification->package "hdparm")
(specification->package "smartmontools")
(specification->package "ethtool")
(specification->package "lvm2")
(specification->package "vim"))
%base-packages))
(services
(cons* (service static-networking-service-type ;; For static networking
(list (static-networking
(addresses
(list (network-address
(device "yolo420");; For streetkid NIC
(value "192.168.0.6/24"))))
(routes
(list (network-route
(destination "default")
(gateway "192.168.0.1")))))))
(service openssh-service-type)
(service ntp-service-type)
(udev-rules-service 'my-wol-rule %my-wol-rule)
(udev-rules-service 'my-NIC-name-rule %my-NIC-name-rule)
(service nfs-service-type
(nfs-configuration
(exports
'(("/home/mark/exports"
"192.168.0.0/24(rw,fsid=0,no_subtree_check)")
("/home/mark/exports/pictures"
"192.168.0.0/24(rw,no_subtree_check)")
("/home/mark/exports/videos"
"192.168.0.0/24(rw,no_subtree_check)")))))
(modify-services %base-services
(guix-service-type config => (guix-configuration
(inherit config)
(substitute-urls
(append (list "https://substitutes.nonguix.org")
%default-substitute-urls))
(authorized-keys
(append (list (local-file "/home/mark/.config/guix/substitute-server-signing-keys/nonguix-signing-key.pub"))
%default-authorized-guix-keys)))))))
(bootloader
(bootloader-configuration
(bootloader grub-bootloader)
(targets '("/dev/sda"))
(keyboard-layout keyboard-layout)))
(mapped-devices
(list (mapped-device
(source "streetkid_VG_storage")
(target "streetkid_VG_storage-streetkid_LV_home")
(type lvm-device-mapping))))
(file-systems
(cons* (file-system
(mount-point "/")
(device (uuid "00000000-0000-0000-0000-000000000000"))
(type "ext4"))
(file-system
(mount-point "/home")
(device "/dev/mapper/streetkid_VG_storage-streetkid_LV_home")
(type "ext4")
(dependencies mapped-devices))
%base-file-systems)))