[core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)

  • Done
  • quality assurance status badge
Details
12 participants
  • Abhiseck Paira
  • Caleb Herbert
  • Josselin Poiret
  • Guillaume Le Vaillant
  • ison
  • Leo Famulari
  • Ludovic Courtès
  • Luis Felipe
  • Maxim Cournoyer
  • Maxime Devos
  • Michael Rohleder
  • Timothy Sample
Owner
unassigned
Submitted by
Maxim Cournoyer
Severity
important
M
M
Maxim Cournoyer wrote on 23 Nov 2021 07:47
(name . bug-guix)(address . bug-guix@gnu.org)
87pmqr1i8k.fsf@gmail.com
Hello,

I'm not 100% sure this is the cause, but these are the last messages I
have before I rebooted:

Toggle snippet (9 lines)
Nov 23 01:09:14 localhost dbus-daemon[383]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=370
comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
Nov 23 01:09:14 localhost elogind[1189]: elogind is already running as PID 390
Nov 23 01:09:20 localhost shepherd[1]: Respawning term-tty1.
Nov 23 01:09:20 localhost shepherd[1]: Service host-name has been started.
Nov 23 01:09:20 localhost shepherd[1]: Service term-tty1 has been started.
Nov 23 01:09:39 localhost dbus-daemon[383]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)

I don't remember if I saw the slim login screen; but in any case I
couldn't successfully login even via a ptty.

It may have to do with polkit.

To be investigated.

This happened on a system *not* using gdm (it uses slim) and with
ratpoison as the WM, on commit f42bc604547d9ee8e35fcd66d5db7786954cfac3
of the core-updates-frozen branch.

To be investigated.

Thanks,

Maxim
J
J
Josselin Poiret wrote on 24 Nov 2021 10:02
(address . dev@jpoiret.xyz)
87czmqorjb.fsf@jpoiret.xyz
Hello Maxim,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (23 lines)
> --8<---------------cut here---------------start------------->8---
> Nov 23 01:09:14 localhost dbus-daemon[383]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=370
> comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
> Nov 23 01:09:14 localhost elogind[1189]: elogind is already running as PID 390
> Nov 23 01:09:20 localhost shepherd[1]: Respawning term-tty1.
> Nov 23 01:09:20 localhost shepherd[1]: Service host-name has been started.
> Nov 23 01:09:20 localhost shepherd[1]: Service term-tty1 has been started.
> Nov 23 01:09:39 localhost dbus-daemon[383]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
> --8<---------------cut here---------------end--------------->8---
>
> I don't remember if I saw the slim login screen; but in any case I
> couldn't successfully login even via a ptty.
>
> It may have to do with polkit.
>
> To be investigated.
>
> This happened on a system *not* using gdm (it uses slim) and with
> ratpoison as the WM, on commit f42bc604547d9ee8e35fcd66d5db7786954cfac3
> of the core-updates-frozen branch.
>
> To be investigated.

I cannot reproduce in a fresh VM on commit
d5de4e163ccef80f78bc5fe330f568d8fe3a23ab, and can login just fine, with

(services (cons* (service slim-service-type (slim-configuration))
(modify-services %desktop-services
(delete gdm-service-type))))

Is this still affecting you?

Best,
Josselin Poiret
M
M
Maxim Cournoyer wrote on 25 Nov 2021 04:03
(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 52051@debbugs.gnu.org)
87sfvlos26.fsf@gmail.com
Hello Josselin,

Josselin Poiret <dev@jpoiret.xyz> writes:

Toggle quote (40 lines)
> Hello Maxim,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> --8<---------------cut here---------------start------------->8---
>> Nov 23 01:09:14 localhost dbus-daemon[383]: [system] Activating
>> service name='org.freedesktop.login1' requested by ':1.17' (uid=0
>> pid=370
>> comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
>> Nov 23 01:09:14 localhost elogind[1189]: elogind is already running as PID 390
>> Nov 23 01:09:20 localhost shepherd[1]: Respawning term-tty1.
>> Nov 23 01:09:20 localhost shepherd[1]: Service host-name has been started.
>> Nov 23 01:09:20 localhost shepherd[1]: Service term-tty1 has been started.
>> Nov 23 01:09:39 localhost dbus-daemon[383]: [system] Failed to
>> activate service 'org.freedesktop.login1': timed out
>> (service_start_timeout=25000ms)
>> --8<---------------cut here---------------end--------------->8---
>>
>> I don't remember if I saw the slim login screen; but in any case I
>> couldn't successfully login even via a ptty.
>>
>> It may have to do with polkit.
>>
>> To be investigated.
>>
>> This happened on a system *not* using gdm (it uses slim) and with
>> ratpoison as the WM, on commit f42bc604547d9ee8e35fcd66d5db7786954cfac3
>> of the core-updates-frozen branch.
>>
>> To be investigated.
>
> I cannot reproduce in a fresh VM on commit
> d5de4e163ccef80f78bc5fe330f568d8fe3a23ab, and can login just fine, with
>
> (services (cons* (service slim-service-type (slim-configuration))
> (modify-services %desktop-services
> (delete gdm-service-type))))
>
> Is this still affecting you?

Yes! It didn't occur in a 'guix system vm my-config.scm', but the exact
same config deployed on my machine fails at login.

Some symptoms:

1. Slim login screen comes up, but after entering credentials Xorg
resets (back to login screen)

2. going to a TTY and attempting to login there, it'd fail with a "Login
failed after 60 s timeout" or similar error.

3. I can login via SSH (thanks goodness!)

4. There are no errors (EE) in /var/log/Xorg.0.log

5. here's the tail of my /var/log/messages:

Toggle snippet (35 lines)
Nov 24 21:23:54 localhost ntpd[346]: Soliciting pool server 216.197.156.83
Nov 24 21:23:55 localhost ntpd[346]: Soliciting pool server 206.108.0.133
Nov 24 21:23:56 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS
Nov 24 21:23:56 localhost ntpd[346]: Soliciting pool server 98.143.85.249
Nov 24 21:23:57 localhost ntpd[346]: Soliciting pool server 192.95.27.155
Nov 24 21:23:58 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.16' (uid=0 pid=324 comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
Nov 24 21:23:58 localhost elogind[1114]: elogind is already running as PID 355
Nov 24 21:24:11 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS
Nov 24 21:24:21 localhost last message repeated 5 times
Nov 24 21:24:23 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
Nov 24 21:24:23 localhost shepherd[1]: Respawning term-tty2.
Nov 24 21:24:23 localhost shepherd[1]: Service host-name has been started.
Nov 24 21:24:23 localhost shepherd[1]: Service term-tty2 has been started.
Nov 24 21:24:23 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS
Nov 24 21:24:27 localhost last message repeated 3 times
Nov 24 21:26:04 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=429 comm="/gnu/store/nvvmksc9pvahqmypaz3h8mqya82vnga8-slim-1") (using servicehelper)
Nov 24 21:26:04 localhost elogind[1127]: elogind is already running as PID 355
Nov 24 21:26:29 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
Nov 24 21:26:29 localhost shepherd[1]: Respawning xorg-server.
Nov 24 21:26:29 localhost shepherd[1]: Service host-name has been started.
Nov 24 21:26:29 localhost shepherd[1]: Service xorg-server has been started.
Nov 24 21:27:23 localhost ntpd[346]: Soliciting pool server 209.115.181.108
Nov 24 21:27:24 localhost ntpd[346]: Soliciting pool server 138.197.153.200
Nov 24 21:27:25 localhost ntpd[346]: Soliciting pool server 162.159.200.123
Nov 24 21:27:26 localhost ntpd[346]: Soliciting pool server 162.159.200.1
Nov 24 21:29:09 localhost ntpd[346]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Nov 24 21:29:42 localhost ntpd[346]: Soliciting pool server 199.182.221.110
Nov 24 21:35:23 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=0 signal=-87 noise=-95 txrate=270000
Nov 24 21:35:32 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-67 noise=-95 txrate=270000
Nov 24 21:42:50 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.19' (uid=0 pid=1269 comm="sudo herd status dbus ") (using servicehelper)
Nov 24 21:42:50 localhost elogind[1273]: elogind is already running as PID 355
Nov 24 21:43:15 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
Nov 24 21:47:25 localhost last message repeated 3 times

So it seems the issue is with some D-Bus service
(org.freedesktop.login1) failing to be started. I'm not sure how it can
be explained that it works in a VM but not on the real machine...

Also, the dbus-system process (pid 341) as the following environment
variable (singular) defined:

Toggle snippet (4 lines)
# cat /proc/341/environ | xargs -0 -n1
PATH=/run/current-system/profile/bin

which seems like not much. Perhaps it needs some XDG_DATA_DIRS to find
that D-Bus service file? Where would it come from?

To be investigated...

Thanks,

Maxim
M
M
Maxim Cournoyer wrote on 25 Nov 2021 05:07
(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 52051@debbugs.gnu.org)
87tug0c1zi.fsf@gmail.com
Hello,

I've found a workaround: restarting elogind via SSH resolved the issue.

I guess itt may be a race between elogind and dbus-system (elogind gets
started before dbus-system is fully up, and the communication with the
session bus is somehow crippled from there?).

I'll experiment with the elogind service a bit, adding a dumb sleep to
its start slot to see if it prevents the race.

Thank you,

Maxim
M
M
Maxim Cournoyer wrote on 25 Nov 2021 23:07
(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 52051@debbugs.gnu.org)
87a6hranz1.fsf@gmail.com
Hello,

In case it is useful, here's what the shepherd-graph of my system looks
like:
Attachment: hurd.svg
My system configuration looks something like this (some parts redacted).

Toggle snippet (253 lines)
(use-modules (gnu)
(guix modules)
(srfi srfi-1)
;; for mcron experiment
(gnu packages autotools)
(gnu packages guile-xyz)
(gnu packages man)
(gnu packages texinfo)
(guix git)
(guix packages))

(use-service-modules admin audio desktop docker linux mail mcron networking nfs
nix spice ssh sysctl telephony virtualization vpn xorg)
(use-package-modules android bootloaders certs gnome linux
nfs ratpoison rsync tls)

(define updatedb-job
;; Run 'updatedb' at 3AM every day.
#~(job '(next-hour '(3))
(lambda ()
(execl #$(file-append findutils "/bin/updatedb")
"updatedb"
"--prunepaths=/gnu/store /tmp /var/tmp /var/lib"))
"updatedb"))

(define btrfs-balance-job
;; Re-allocate chunks which are using less than 5% of their chunk
;; space, to regain Btrfs 'unallocated' space. The usage is kept
;; low (5%) to minimize wear on the SSD. Runs at 5 AM every 3 days.
#~(job '(next-hour-from (next-day (range 1 31 3)) '(5))
(lambda ()
(execl #$(file-append btrfs-progs "/bin/btrfs")
"balance" "start" "-dusage=5" "/"))
"btrfs-balance"))

(define i2c-udev-rule
;; Sets i2c group to i2c devices, so that we can access it as a
;; simple user, e.g. using ddcutil.
(udev-rule
"90-i2c.rules"
(string-append "KERNEL==\"i2c-[0-9]*\", "
"GROUP=\"i2c\"")))

(define c920-webcam-udev-rule
(udev-rule "90-c920-webcam.rules"
(string-append "SUBSYSTEM==\"usb\", "
"ATTRS{idVendor}==\"046d\", "
"ATTRS{idProduct}==\"082d\", "
"GROUP=\"video\"")))

(define %my-desktop-services
(remove (lambda (service)
(eq? (service-kind service) gdm-service-type))
%desktop-services))

;; TODO: Honor rootflags in the initrd,
;; then remove degraded.
(define %common-btrfs-options '(("compress-force" . "zstd")
("space_cache" . "v2")
"degraded"))

(define %system
(operating-system
(host-name "hurd")
(timezone "America/Montreal")
(keyboard-layout (keyboard-layout "dvorak"))
(bootloader (bootloader-configuration
(bootloader grub-bootloader)
(targets (list "/dev/sda" "/dev/sdb" "/dev/sdc"))
(terminal-outputs '(console))
(keyboard-layout keyboard-layout)))
(kernel-arguments '("quiet" "snd_hda_intel.dmic_detect=0"
"modprobe.blacklist=rtl8187"))
(mapped-devices
(list (mapped-device
(source "/dev/sda2")
(target "cryptroot")
(type luks-device-mapping))
(mapped-device
(source "/dev/sdb2")
(target "cryptroot-mirror")
(type luks-device-mapping))
(mapped-device
(source "/dev/sdc2")
(target "cryptroot-mirror2")
(type luks-device-mapping))))

;; Note: Using any of the LUKS encrypted drives exposed under
;; /dev/mapper is enough to reference the Btrfs RAID-1 array,
;; since the 'btrfs device scan' command is executed in the init
;; RAM disk and takes care of assembling the array.
(file-systems (cons* (file-system
(mount-point "/")
(device "/dev/mapper/cryptroot")
(type "btrfs")
(options (alist->file-system-options
(cons '("subvol" . "@root")
%common-btrfs-options)))
(dependencies mapped-devices))
(file-system
(device "/dev/mapper/cryptroot")
(mount-point "/home")
(type "btrfs")
(options (alist->file-system-options
(cons '("subvol" . "@home")
%common-btrfs-options)))
(dependencies mapped-devices))
(file-system
(device "/dev/mapper/cryptroot")
(mount-point "/data")
(type "btrfs")
(options (alist->file-system-options
(cons '("subvol" . "@data")
%common-btrfs-options)))
(dependencies mapped-devices))
%base-file-systems))
(users (cons (user-account
(name "me")
(group "users")
(supplementary-groups
'("lp" ;for bluetooth
"dialout" ;for serial port
"adbusers" ;for adb
"i2c" ;for ddcutil
"realtime" ;for jackd
"kvm" "libvirt" "docker"
"wheel" "netdev" "audio" "video"))
(home-directory "/home/maxim"))
%base-user-accounts))
(groups
(cons* (user-group (system? #t) (name "adbusers"))
(user-group (system? #t) (name "i2c"))
(user-group (system? #t) (name "realtime"))
%base-groups))
(services
(cons*
(service docker-service-type (docker-configuration
(debug? #t)))
(service qemu-binfmt-service-type
(qemu-binfmt-configuration
(platforms (lookup-qemu-platforms "arm" "aarch64" "ppc64le"))))
(service earlyoom-service-type)
(service zram-device-service-type (zram-device-configuration
(size "6G")
(compression-algorithm 'ZSTD)))
(bluetooth-service #:auto-enable? #t)
(pam-limits-service
(list
;; Required in order for jackd to function properly.
(pam-limits-entry "@realtime" 'both 'rtprio 99)
(pam-limits-entry "@realtime" 'both 'memlock 'unlimited)))
(service mcron-service-type
(mcron-configuration
(mcron (package/inherit mcron
(source (git-checkout
(url "file:///home/maxim/src/mcron")))
(native-inputs (append
`(("autoconf" ,autoconf)
("automake" ,automake)
("help2man" ,help2man)
("texinfo" ,texinfo))
(package-native-inputs mcron)))))
(jobs (list updatedb-job
btrfs-balance-job))))
(service mpd-service-type
(mpd-configuration
(user "me")))
;; Virtualization services to use with GNOME Boxes.
(service libvirt-service-type
(libvirt-configuration
(unix-sock-group "libvirt")))
(service virtlog-service-type)
(service opendht-service-type (opendht-configuration
(enable-logging? #t)
(peer-discovery? #t)
(proxy-server-port 8282)))
(service openssh-service-type
(openssh-configuration
(port-number 22)
(permit-root-login #f)
(password-authentication? #f)
(use-pam? #f))))
(service slim-service-type
(slim-configuration
(auto-login? #f)
(default-user "me")
(xorg-configuration
(xorg-configuration
(keyboard-layout keyboard-layout)))))
(service nix-service-type)
(service nfs-service-type
(nfs-configuration
(nfs-versions '("4.2" "4.1" "4.0"
"3" "2")) ; for U-Boot
(exports
'(("/some/path/to/rootfs"
"*(rw,no_root_squash,no_subtree_check)")))
(nfsd-udp? #t))) ;for U-Boot
(service guix-publish-service-type
(guix-publish-configuration
(advertise? #t)
(host "0.0.0.0"))) ;listen on all interfaces
(modify-services %my-desktop-services
;; Ignore the silly HHKB2 power off key shortcut.
(elogind-service-type config =>
(elogind-configuration
(inherit config)
(handle-power-key 'ignore)))
;; Enable using adb as a simple user with a multitude of devices.
(udev-service-type config =>
(udev-configuration
(inherit config)
(rules (cons* android-udev-rules
i2c-udev-rule
c920-webcam-udev-rule
(udev-configuration-rules config)))))
(network-manager-service-type config =>
(network-manager-configuration
(inherit config)
(vpn-plugins
(list network-manager-openvpn
network-manager-openconnect))))
(guix-service-type config =>
(guix-configuration
(inherit config)
(substitute-urls '("http://127.0.0.1:8181"
"https://ci.guix.gnu.org"))
(discover? #t)
(authorized-keys
(cons*
keys...
%default-authorized-guix-keys))
(log-compression 'none)
(extra-options '("--max-jobs=4")))))))

;; Packages installed by default.
(packages (append (map specification->package
'("ratpoison"
"linux-pam" ;for the pam_limits manpage
"nss-certs" ;for HTTPS access
"rsync" ;for the backup script
"btrfs-progs"
"nix"
"nfs-utils"
"ntfs-3g")) ;for mount.ntfs
%base-packages))

;; Allow resolution of '.local' host names with mDNS.
(name-service-switch %mdns-host-lookup-nss)))

%system

But virtualizing it with `guix system vm` doesn't trigger the bug, I'm
guessing because the NFS export doesn't work (it fails quickly rather
than take time like it would on an actual boot).

Thanks,

Maxim
L
L
Ludovic Courtès wrote on 25 Nov 2021 23:13
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87sfvjdgtl.fsf@gnu.org
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (14 lines)
> Nov 24 21:23:58 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.16' (uid=0 pid=324 comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
> Nov 24 21:23:58 localhost elogind[1114]: elogind is already running as PID 355
> Nov 24 21:24:11 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS
> Nov 24 21:24:21 localhost last message repeated 5 times
> Nov 24 21:24:23 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
> Nov 24 21:24:23 localhost shepherd[1]: Respawning term-tty2.
> Nov 24 21:24:23 localhost shepherd[1]: Service host-name has been started.
> Nov 24 21:24:23 localhost shepherd[1]: Service term-tty2 has been started.
> Nov 24 21:24:23 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS
> Nov 24 21:24:27 localhost last message repeated 3 times
> Nov 24 21:26:04 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=429 comm="/gnu/store/nvvmksc9pvahqmypaz3h8mqya82vnga8-slim-1") (using servicehelper)
> Nov 24 21:26:04 localhost elogind[1127]: elogind is already running as PID 355
> Nov 24 21:26:29 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)

It looks like elogind things it’s already running as PID 355, where the
PID comes from its PID file; there’s indeed a process with this PID, but
is it really elogind? Or could it be that there’s a stale PID file?

Does /var/log/messages shows the very first time elogind is started at
boot time? With which PID?

We need to find out why dbus-daemon thinks it needs to restart elogind.

Ludo’.
M
M
Maxim Cournoyer wrote on 26 Nov 2021 05:00
control message for bug #52051
(address . control@debbugs.gnu.org)
871r33a7mb.fsf@gmail.com
severity 52051 important
quit
J
J
Josselin Poiret wrote on 26 Nov 2021 10:35
Re: bug#52051: [core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87wnkvp8cx.fsf@jpoiret.xyz
Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (8 lines)
> Hello,
>
> I've found a workaround: restarting elogind via SSH resolved the issue.
>
> I guess itt may be a race between elogind and dbus-system (elogind gets
> started before dbus-system is fully up, and the communication with the
> session bus is somehow crippled from there?).

Does this happen with other login managers on your system, like LightDM?
The thing is that elogind depends on dbus-system, so I'm not sure there
should be a race there.

I noticed though that none of our desktop managers's shepherd services
require elogind, and in the case of SLiM not even dbus-system. Maybe we
should add it there, since we want Shepherd to handle launching elogind,
and avoid dbus launching one by itself if the login1 service is used or
even by the PAM elogind module. Can you try adding that to
slim-shepherd-service?

Also, maybe patching the dbus service to Exec=false instead would be a
good safeguard against dbus launching elogind itself.

Tangentially: SLiM is very old and unmaintained, so it may one day
simply break with newer versions of systemd/elogind.

Best,
Josselin Poiret
M
M
Maxim Cournoyer wrote on 7 Dec 2021 22:23
(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 52051@debbugs.gnu.org)
87mtlcun1h.fsf@gmail.com
Hello!

Josselin Poiret <dev@jpoiret.xyz> writes:

Toggle quote (14 lines)
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> Hello,
>>
>> I've found a workaround: restarting elogind via SSH resolved the issue.
>>
>> I guess itt may be a race between elogind and dbus-system (elogind gets
>> started before dbus-system is fully up, and the communication with the
>> session bus is somehow crippled from there?).
>
> Does this happen with other login managers on your system, like LightDM?
> The thing is that elogind depends on dbus-system, so I'm not sure there
> should be a race there.

I've yet to try this, but it *doesn't* happen on another machine using
an almost-identical configuration (slim + ratpoison).

Toggle quote (7 lines)
> I noticed though that none of our desktop managers's shepherd services
> require elogind, and in the case of SLiM not even dbus-system. Maybe we
> should add it there, since we want Shepherd to handle launching elogind,
> and avoid dbus launching one by itself if the login1 service is used or
> even by the PAM elogind module. Can you try adding that to
> slim-shepherd-service?

I've tried adding elogind in the requirements of the slim service type
as well as the upower-daemon service type, but it didn't help.

Based on my observations (/var/log/messages), it seems that the dbus
session bus is not yet ready when elogin starts, and thus the various
dbus services are not yet published/registered when other dependents
need them.

I noticed that dbus system session times out on every boot on that
machine. I also notice that when the NTPD daemon starts, it rewinds
time by about 25 s on every boot

Toggle snippet (21 lines)
[...]
Dec 7 15:55:02 localhost shepherd[1]: Service dbus-system has been started.
[...]
Dec 7 15:55:10 localhost shepherd[1]: Service ntpd has been started.
[...]
Dec 7 15:54:46 localhost ntpd[341]: ntpd 4.2.8p15@1.3728-o Thu Jan 1 12:00:01 AM UTC 1970 (1): Starting
[...]
Dec 7 15:55:14 localhost shepherd[1]: Service elogind has been started.
[...]
Dec 7 15:55:21 localhost shepherd[1]: Service upower-daemon has been started.
[...]
Dec 7 15:54:51 localhost dbus-daemon[335]: [system] Activating service name='org.freedesktop.login1' requested by ':1.1' (uid=0 pid=345 comm="/gnu/store/g1qlpzcfnk2r6186al2hfqjmq9yl7qkk-upower") (using servicehelper)
[...]
Dec 7 15:55:02 localhost elogind[352]: elogind is already running as PID 343
[...]
Dec 7 15:55:16 localhost dbus-daemon[335]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
[...]
Dec 7 15:55:24 localhost dbus-daemon[335]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 36684ms)


Toggle quote (3 lines)
> Also, maybe patching the dbus service to Exec=false instead would be a
> good safeguard against dbus launching elogind itself.

Perhaps a good idea to try also!

Attached is the /var/log/messages of the problematic boot, nearly
unedited.
Attachment: messages.boot
Thanks,

Maxim
M
M
Maxim Cournoyer wrote on 7 Dec 2021 23:15
(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 52051@debbugs.gnu.org)
8735n46oz2.fsf@gmail.com
Hi again,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

[...]

Toggle quote (4 lines)
> I noticed that dbus system session times out on every boot on that
> machine. I also notice that when the NTPD daemon starts, it rewinds
> time by about 25 s on every boot

Not NTP related; the same occurs when removing the ntp-service-type from
%desktop-services.

Maxim
M
M
Maxim Cournoyer wrote on 8 Dec 2021 03:12
(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 52051@debbugs.gnu.org)
87ilvzvo7o.fsf@gmail.com
Hello,

Josselin Poiret <dev@jpoiret.xyz> writes:

Toggle quote (14 lines)
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> Hello,
>>
>> I've found a workaround: restarting elogind via SSH resolved the issue.
>>
>> I guess itt may be a race between elogind and dbus-system (elogind gets
>> started before dbus-system is fully up, and the communication with the
>> session bus is somehow crippled from there?).
>
> Does this happen with other login managers on your system, like LightDM?
> The thing is that elogind depends on dbus-system, so I'm not sure there
> should be a race there.

I have the same problem using GDM on that machine :-(.

I'll try poking the dbus session manually using dbus-send, or the Scheme
API for it under (guix build jami-service).

Thanks!

Maxim
L
L
Ludovic Courtès wrote on 8 Dec 2021 23:18
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87a6hahha0.fsf@gnu.org
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (12 lines)
> Dec 7 15:55:02 localhost shepherd[1]: Service dbus-system has been started.
> [...]
> Dec 7 15:55:10 localhost shepherd[1]: Service ntpd has been started.
> [...]
> Dec 7 15:54:46 localhost ntpd[341]: ntpd 4.2.8p15@1.3728-o Thu Jan 1 12:00:01 AM UTC 1970 (1): Starting
> [...]
> Dec 7 15:55:14 localhost shepherd[1]: Service elogind has been started.
> [...]
> Dec 7 15:55:21 localhost shepherd[1]: Service upower-daemon has been started.
> [...]
> Dec 7 15:54:51 localhost dbus-daemon[335]: [system] Activating service name='org.freedesktop.login1' requested by ':1.1' (uid=0 pid=345 comm="/gnu/store/g1qlpzcfnk2r6186al2hfqjmq9yl7qkk-upower") (using servicehelper)

The key thing here is that dbus-daemon things elogind is not running and
thus considers it has to start it, which is bound to fail.

You can display the list of services known to the DBus system bus with:

dbus-send --system --print-reply --dest=org.freedesktop.DBus \
/org/freedesktop/DBus org.freedesktop.DBus.ListNames

I tried in a VM that had booted fine and it shows
“org.freedesktop.login1” as expected. Could you check what that gives
for you when the bug above shows up?

I wonder if it’s possible for there to be a race, like elogind says
“hi!” to dbus-daemon but dbus-daemon is still sleepy and doesn’t notice.
Seems hard to believe though.

If it’s reproducible, could you instrument ‘elogind-service-type’ such
that elogind is started with “strace -f -o /elogind.log -s 700”?

HTH,
Ludo’.
M
M
Maxim Cournoyer wrote on 9 Dec 2021 15:18
(name . Ludovic Courtès)(address . ludo@gnu.org)
87bl1plv4q.fsf@gmail.com
Hi Ludovic!

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (27 lines)
> Hi,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> Dec 7 15:55:02 localhost shepherd[1]: Service dbus-system has been started.
>> [...]
>> Dec 7 15:55:10 localhost shepherd[1]: Service ntpd has been started.
>> [...]
>> Dec 7 15:54:46 localhost ntpd[341]: ntpd 4.2.8p15@1.3728-o Thu Jan 1 12:00:01 AM UTC 1970 (1): Starting
>> [...]
>> Dec 7 15:55:14 localhost shepherd[1]: Service elogind has been started.
>> [...]
>> Dec 7 15:55:21 localhost shepherd[1]: Service upower-daemon has been started.
>> [...]
>> Dec 7 15:54:51 localhost dbus-daemon[335]: [system] Activating
>> service name='org.freedesktop.login1' requested by ':1.1' (uid=0
>> pid=345 comm="/gnu/store/g1qlpzcfnk2r6186al2hfqjmq9yl7qkk-upower")
>> (using servicehelper)
>
> The key thing here is that dbus-daemon things elogind is not running and
> thus considers it has to start it, which is bound to fail.
>
> You can display the list of services known to the DBus system bus with:
>
> dbus-send --system --print-reply --dest=org.freedesktop.DBus \
> /org/freedesktop/DBus org.freedesktop.DBus.ListNames

After launching elogind (added a wait for it in its start slot), the
first time it prints just:

Dec 9 09:08:24 localhost shepherd[1]: ;;; (available-dbus-services
("org.freedesktop.DBus" "fi.w1.wpa_supplicant1"))

And on the last try (25 seconds of trying) it prints:
Dec 9 09:09:27 localhost shepherd[1]: ;;; (available-dbus-services ("org.freedesktop.DBus" "org.freedesktop.PolicyKit1" "org.freedesktop.ModemManager1" "fi.w1.wpa_supplicant1" "org.freedesktop.NetworkManager"))

login1 never shows up.

So elogind is essentially hung up when it is started that early during
boot.

Attached is the requested strace output attached to elogind.

I've tried adding a synchronization loop that tests for dbus-daemon
socket readiness (able to make a connection) to its shepherd service but
it didn't help.

If elogind is restarted post boot login1 shows up and all is happy.
Attachment: elogind.log
Thanks!

Maxim
L
L
Ludovic Courtès wrote on 9 Dec 2021 23:15
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87v8zxctn5.fsf@gnu.org
Hello!

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (2 lines)
> 374 connect(11, {sa_family=AF_UNIX, sun_path="/var/run/dbus/system_bus_socket"}, 34) = 0

[...]

Toggle quote (27 lines)
> 374 epoll_wait(5, [{events=EPOLLIN|EPOLLOUT|EPOLLHUP, data={u32=24802800, u64=24802800}}], 20, -1) = 1
> 374 sendmsg(11, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0", iov_len=128}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe)
> 374 gettid() = 374
> 374 epoll_ctl(5, EPOLL_CTL_MOD, 11, {events=0, data={u32=24802800, u64=24802800}}) = 0
> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8
> 374 gettid() = 374
> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8
> 374 gettid() = 374
> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8
> 374 epoll_ctl(5, EPOLL_CTL_DEL, 11, NULL) = 0
> 374 close(11) = 0
> 374 gettid() = 374
> 374 epoll_wait(5, <unfinished ...>
> 391 <... close resumed>) = 0
> 391 madvise(0x7fd6c83dc000, 8368128, MADV_DONTNEED) = 0
> 391 exit(0) = ?
> 391 +++ exited with 0 +++
> 374 <... epoll_wait resumed>[{events=EPOLLERR, data={u32=24768000, u64=24768000}}], 17, -1) = 1
> 374 lseek(7, 0, SEEK_SET) = 0
> 374 read(7, "tty7\n", 63) = 5

As you pointed out on IRC, the initially ‘Hello’ method call above leads
to EPIPE, and we can see that elogind eventually closes its socket to
dbus-daemon *but* keeps doing its thing.

Some interesting things to note…

First, to my surprise, elogind does not use the client library of the
‘dbus’ package:

Toggle snippet (5 lines)
$ guix gc --references $(./pre-inst-env guix build elogind)|grep dbus
$ echo $?
1

(This is already the case in ‘master’ with v243.7.) Instead, it has its
own implementation of the DBus protocol, in C, from systemd—we can’t
have enough sources of bugs and vulnerabilities.

Anyway, the “Hello” message is sent to the system bus asynchronously in
‘sd-bus.c’:

Toggle snippet (23 lines)
static int bus_send_hello(sd_bus *bus) {
_cleanup_(sd_bus_message_unrefp) sd_bus_message *m = NULL;
int r;

assert(bus);

if (!bus->bus_client)
return 0;

r = sd_bus_message_new_method_call(
bus,
&m,
"org.freedesktop.DBus",
"/org/freedesktop/DBus",
"org.freedesktop.DBus",
"Hello");
if (r < 0)
return r;

return sd_bus_call_async(bus, NULL, m, hello_callback, NULL, 0);
}

A callback is called when a reply is received or an error arises:

Toggle snippet (14 lines)
static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error *error) {

[...]

fail:
/* When Hello() failed, let's propagate this in two ways: first we return the error immediately here,
* which is the propagated up towards the event loop. Let's also invalidate the connection, so that
* if the user then calls back into us again we won't wait any longer. */

bus_set_state(bus, BUS_CLOSING);
return r;
}

It’s not clear from that whether the authors intended for the thing to
keep going in case of failure. In our case it’s not helpful.

But why does dbus-daemon drop the connection in the first place?

To know that, we could change ‘dbus-root-service-type’ to run
dbus-daemon from a ‘--enable-verbose-mode’ build, and with the
‘DBUS_VERBOSE’ environment set to 1.

Looking at ‘dbus-server-socket.c’ it would seem that the only sensible
way this can happen is if dbus-daemon doesn’t yet have a
‘new_connection_function’ set at the time it accepts the incoming
connection:

Toggle snippet (28 lines)
static dbus_bool_t
handle_new_client_fd_and_unlock (DBusServer *server,
DBusSocket client_fd)
{

[...]

/* See if someone wants to handle this new connection, self-referencing
* for paranoia.
*/
new_connection_function = server->new_connection_function;
new_connection_data = server->new_connection_data;

_dbus_server_ref_unlocked (server);
SERVER_UNLOCK (server);

if (new_connection_function)
{
(* new_connection_function) (server, connection,
new_connection_data);
}
dbus_server_unref (server);

/* If no one grabbed a reference, the connection will die. */
_dbus_connection_close_if_only_one_ref (connection);
dbus_connection_unref (connection);

We know that elogind is started after dbus-daemon has written its PID
file (there’s a Shepherd service dependency). Is there a possibility
that dbus-daemon writes its PID file before it has set
‘new_connection_function’?

It would seem that ‘bus_context_new’ writes the PID file after it has
called ‘dbus_server_set_new_connection_function’ via ‘setup_server’ via
‘process_config_first_time_only’.

So not sure what happens.

That’s it for today. Thoughts?

Ludo’.
M
M
Maxim Cournoyer wrote on 10 Dec 2021 06:09
(name . Ludovic Courtès)(address . ludo@gnu.org)
875yrxyrjr.fsf@gmail.com
Hello Ludovic!

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (44 lines)
> Hello!
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> 374 connect(11, {sa_family=AF_UNIX, sun_path="/var/run/dbus/system_bus_socket"}, 34) = 0
>
> [...]
>
>> 374 epoll_wait(5, [{events=EPOLLIN|EPOLLOUT|EPOLLHUP, data={u32=24802800, u64=24802800}}], 20, -1) = 1
>> 374 sendmsg(11, {msg_name=NULL, msg_namelen=0,
>> msg_iov=[{iov_base="l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0",
>> iov_len=128}], msg_iovlen=1, msg_controllen=0, msg_flags=0},
>> MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe)
>> 374 gettid() = 374
>> 374 epoll_ctl(5, EPOLL_CTL_MOD, 11, {events=0, data={u32=24802800, u64=24802800}}) = 0
>> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8
>> 374 gettid() = 374
>> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8
>> 374 gettid() = 374
>> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8
>> 374 epoll_ctl(5, EPOLL_CTL_DEL, 11, NULL) = 0
>> 374 close(11) = 0
>> 374 gettid() = 374
>> 374 epoll_wait(5, <unfinished ...>
>> 391 <... close resumed>) = 0
>> 391 madvise(0x7fd6c83dc000, 8368128, MADV_DONTNEED) = 0
>> 391 exit(0) = ?
>> 391 +++ exited with 0 +++
>> 374 <... epoll_wait resumed>[{events=EPOLLERR, data={u32=24768000, u64=24768000}}], 17, -1) = 1
>> 374 lseek(7, 0, SEEK_SET) = 0
>> 374 read(7, "tty7\n", 63) = 5
>
> As you pointed out on IRC, the initially ‘Hello’ method call above leads
> to EPIPE, and we can see that elogind eventually closes its socket to
> dbus-daemon *but* keeps doing its thing.
>
> Some interesting things to note…

Indeed, very interesting :-). Thanks for diving in!

Toggle quote (12 lines)
> First, to my surprise, elogind does not use the client library of the
> ‘dbus’ package:
>
> $ guix gc --references $(./pre-inst-env guix build elogind)|grep dbus
> $ echo $?
> 1
>
>
> (This is already the case in ‘master’ with v243.7.) Instead, it has its
> own implementation of the DBus protocol, in C, from systemd—we can’t
> have enough sources of bugs and vulnerabilities.

Oh! Marius found this interesting issue [0] that they shared on IRC
today; I wonder if it may be related. sd-bus apparently pipeline things
aggressively, which is not always handled well by other D-bus
participants.


Toggle quote (45 lines)
> Anyway, the “Hello” message is sent to the system bus asynchronously in
> ‘sd-bus.c’:
>
> static int bus_send_hello(sd_bus *bus) {
> _cleanup_(sd_bus_message_unrefp) sd_bus_message *m = NULL;
> int r;
>
> assert(bus);
>
> if (!bus->bus_client)
> return 0;
>
> r = sd_bus_message_new_method_call(
> bus,
> &m,
> "org.freedesktop.DBus",
> "/org/freedesktop/DBus",
> "org.freedesktop.DBus",
> "Hello");
> if (r < 0)
> return r;
>
> return sd_bus_call_async(bus, NULL, m, hello_callback, NULL, 0);
> }
>
>
> A callback is called when a reply is received or an error arises:
>
> static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error *error) {
>
> [...]
>
> fail:
> /* When Hello() failed, let's propagate this in two ways: first we return the error immediately here,
> * which is the propagated up towards the event loop. Let's also invalidate the connection, so that
> * if the user then calls back into us again we won't wait any longer. */
>
> bus_set_state(bus, BUS_CLOSING);
> return r;
> }
>
>
> It’s not clear from that whether the authors intended for the thing to
> keep going in case of failure. In our case it’s not helpful.

If we picture this in the systemd use case, I believe sd-bus must be
used as *the* d-bus daemon, perhaps? So it should never die, and expect
users to call back into it to retry things? In our case, it acts as a
D-Bus client, not a server (IIUC), so perhaps its behavior is not tuned
right for our use case.

Toggle quote (11 lines)
> But why does dbus-daemon drop the connection in the first place?
>
> To know that, we could change ‘dbus-root-service-type’ to run
> dbus-daemon from a ‘--enable-verbose-mode’ build, and with the
> ‘DBUS_VERBOSE’ environment set to 1.
>
> Looking at ‘dbus-server-socket.c’ it would seem that the only sensible
> way this can happen is if dbus-daemon doesn’t yet have a
> ‘new_connection_function’ set at the time it accepts the incoming
> connection:

Interesting that you mention this; that's what I worked on yesterday
(see attached patches). I managed to get elogind panicking the kernel
during a guix system reconfigure, which corrupted GRUB, so had to chroot
and reconfigure from there [1]. Not sure what happened, but it seems
that killing and restarting elogind is susceptible to cause hard locks.

From 690d96cba581f253761e6c3263262fbc899aa2a2 Mon Sep 17 00:00:00 2001
From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Date: Tue, 7 Dec 2021 22:20:02 -0500
Subject: [PATCH 1/5] gnu: Add dbus-verbose.

* gnu/packages/glib.scm (dbus-verbose): New variable.
---
gnu/packages/glib.scm | 15 +++++++++++++++
1 file changed, 15 insertions(+)

Toggle diff (28 lines)
diff --git a/gnu/packages/glib.scm b/gnu/packages/glib.scm
index 398d3943e5..2eead335f5 100644
--- a/gnu/packages/glib.scm
+++ b/gnu/packages/glib.scm
@@ -176,6 +176,21 @@ (define dbus
shared NFS home directories.")
(license license:gpl2+))) ; or Academic Free License 2.1
+;;; The reason this is not enabled in the regular dbus package is because it
+;;; impacts the performance of D-Bus (including its library) as a whole, even
+;;; when the DBUS_VERBOSE environment variable is not set.
+(define-public dbus-verbose
+ (package/inherit dbus
+ (name "dbus-verbose")
+ (arguments (substitute-keyword-arguments (package-arguments dbus)
+ ((#:configure-flags flags '())
+ `(cons "--enable-verbose-mode" ,flags))))
+ (synopsis "D-Bus with verbose mode enabled for debugging")
+ (description "This variant D-Bus package is built with verbose mode, which
+eases debugging of D-Bus services by printing various debug information when
+the @code{DBUS_VERBOSE} environment variable is set to @samp{1}. For more
+information, refer to the @samp{dbus-daemon(1)} man page.")))
+
(define glib
(package
(name "glib")
--
2.34.0
From c614159fbc539c8b1f556fe9ced2d13963b6d6f9 Mon Sep 17 00:00:00 2001
From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Date: Tue, 7 Dec 2021 22:48:39 -0500
Subject: [PATCH 2/5] services: dbus: Add a VERBOSE? configuration option.

* gnu/services/dbus.scm (<dbus-configuration>)[verbose?]: New field.
(dbus-shepherd-service): Use it.
(dbus-service)[verbose?]: Add argument and update doc.
* doc/guix.texi (Desktop Services): Document it.
---
doc/guix.texi | 9 +++++++--
gnu/services/dbus.scm | 20 +++++++++++++++-----
2 files changed, 22 insertions(+), 7 deletions(-)

Toggle diff (83 lines)
diff --git a/doc/guix.texi b/doc/guix.texi
index 58625666c1..4b1297e5de 100644
--- a/doc/guix.texi
+++ b/doc/guix.texi
@@ -20280,9 +20280,14 @@ The actual service definitions included in @code{%desktop-services} and
provided by @code{(gnu services dbus)} and @code{(gnu services desktop)}
are described below.
-@deffn {Scheme Procedure} dbus-service [#:dbus @var{dbus}] [#:services '()]
+@deffn {Scheme Procedure} dbus-service [#:dbus @var{dbus}] [#:services '()] @
+ [#:verbose?]
Return a service that runs the ``system bus'', using @var{dbus}, with
-support for @var{services}.
+support for @var{services}. When @var{verbose?} is true, it causes the
+@samp{DBUS_VERBOSE} environment variable to be set to @samp{1}; a
+verbose-enabled D-Bus package such as @code{dbus-verbose} should be
+provided as @var{dbus} in this scenario. The verbose output is logged
+to @file{/var/log/dbus-daemon.log}.
@uref{https://dbus.freedesktop.org/, D-Bus} is an inter-process communication
facility. Its system bus is used to allow system services to communicate
diff --git a/gnu/services/dbus.scm b/gnu/services/dbus.scm
index 85a4c3ec9a..da167f2e36 100644
--- a/gnu/services/dbus.scm
+++ b/gnu/services/dbus.scm
@@ -53,7 +53,9 @@ (define-record-type* <dbus-configuration>
(dbus dbus-configuration-dbus ;file-like
(default dbus))
(services dbus-configuration-services ;list of <package>
- (default '())))
+ (default '()))
+ (verbose? dbus-configuration-verbose? ;boolean
+ (default #f)))
(define (system-service-directory services)
"Return the system service directory, containing @code{.service} files for
@@ -187,7 +189,7 @@ (define (dbus-activation config)
(define dbus-shepherd-service
(match-lambda
- (($ <dbus-configuration> dbus)
+ (($ <dbus-configuration> dbus _ verbose?)
(list (shepherd-service
(documentation "Run the D-Bus system daemon.")
(provision '(dbus-system))
@@ -195,6 +197,10 @@ (define dbus-shepherd-service
(start #~(make-forkexec-constructor
(list (string-append #$dbus "/bin/dbus-daemon")
"--nofork" "--system" "--syslog-only")
+ #$@(if verbose?
+ '(#:environment-variables '("DBUS_VERBOSE=1")
+ #:log-file "/var/log/dbus-daemon.log")
+ '())
#:pid-file "/var/run/dbus/pid"))
(stop #~(make-kill-destructor)))))))
@@ -230,9 +236,12 @@ (define dbus-root-service-type
bus. It allows programs and daemons to communicate and is also responsible
for spawning (@dfn{activating}) D-Bus services on demand.")))
-(define* (dbus-service #:key (dbus dbus) (services '()))
+(define* (dbus-service #:key (dbus dbus) (services '()) verbose?)
"Return a service that runs the \"system bus\", using @var{dbus}, with
-support for @var{services}.
+support for @var{services}. When @var{verbose?} is true, it causes the
+@samp{DBUS_VERBOSE} environment variable to be set to @samp{1}; a
+verbose-enabled D-Bus package such as @code{dbus-verbose} should be provided
+as @var{dbus} in this scenario.
@uref{http://dbus.freedesktop.org/, D-Bus} is an inter-process communication
facility. Its system bus is used to allow system services to communicate and
@@ -244,7 +253,8 @@ (define* (dbus-service #:key (dbus dbus) (services '()))
@var{services} must be equal to @code{(list avahi)}."
(service dbus-root-service-type
(dbus-configuration (dbus dbus)
- (services services))))
+ (services services)
+ (verbose? verbose?))))
(define (wrapped-dbus-service service program variables)
"Return a wrapper for @var{service}, a package containing a D-Bus service,
--
2.34.0
Toggle quote (32 lines)
> static dbus_bool_t
> handle_new_client_fd_and_unlock (DBusServer *server,
> DBusSocket client_fd)
> {
>
> [...]
>
> /* See if someone wants to handle this new connection, self-referencing
> * for paranoia.
> */
> new_connection_function = server->new_connection_function;
> new_connection_data = server->new_connection_data;
>
> _dbus_server_ref_unlocked (server);
> SERVER_UNLOCK (server);
>
> if (new_connection_function)
> {
> (* new_connection_function) (server, connection,
> new_connection_data);
> }
> dbus_server_unref (server);
>
> /* If no one grabbed a reference, the connection will die. */
> _dbus_connection_close_if_only_one_ref (connection);
> dbus_connection_unref (connection);
>
> We know that elogind is started after dbus-daemon has written its PID
> file (there’s a Shepherd service dependency). Is there a possibility
> that dbus-daemon writes its PID file before it has set
> ‘new_connection_function’?

Are PID files conventionally agreed to be synchronization primitives?
In the case of systemd, I feel this may be more of an afterthought,
given the system is designed to rely on socket-based activation. Still
just guessing though, and your analysis above seems to suggest
otherwise.

Toggle quote (8 lines)
> It would seem that ‘bus_context_new’ writes the PID file after it has
> called ‘dbus_server_set_new_connection_function’ via ‘setup_server’ via
> ‘process_config_first_time_only’.
>
> So not sure what happens.
>
> That’s it for today. Thoughts?

Thank you!

I'm sorry for my lack of deep analysis here, I've spent most of my
evening attempting to fix my system just to boot ^^'. I've at least
managed to collect the following verbose D-Bus log (54 MiB uncompressed)
which hopefully can shed some light onto how this failure came to be.
Attachment: dbus-daemon.log.xz
Happy Friday!

Maxim
L
L
Ludovic Courtès wrote on 10 Dec 2021 11:17
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87k0gcbw79.fsf@gnu.org
Hi Maxim!

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

[...]

Toggle quote (7 lines)
> Oh! Marius found this interesting issue [0] that they shared on IRC
> today; I wonder if it may be related. sd-bus apparently pipeline things
> aggressively, which is not always handled well by other D-bus
> participants.
>
> [0] https://github.com/systemd/systemd/issues/16610

Interesting.

Toggle quote (3 lines)
>> Anyway, the “Hello” message is sent to the system bus asynchronously in
>> ‘sd-bus.c’:

[...]

Toggle quote (23 lines)
>> static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error *error) {
>>
>> [...]
>>
>> fail:
>> /* When Hello() failed, let's propagate this in two ways: first we return the error immediately here,
>> * which is the propagated up towards the event loop. Let's also invalidate the connection, so that
>> * if the user then calls back into us again we won't wait any longer. */
>>
>> bus_set_state(bus, BUS_CLOSING);
>> return r;
>> }
>>
>>
>> It’s not clear from that whether the authors intended for the thing to
>> keep going in case of failure. In our case it’s not helpful.
>
> If we picture this in the systemd use case, I believe sd-bus must be
> used as *the* d-bus daemon, perhaps? So it should never die, and expect
> users to call back into it to retry things? In our case, it acts as a
> D-Bus client, not a server (IIUC), so perhaps its behavior is not tuned
> right for our use case.

I think systemd-logind is a separate process, just like elogind, that
talks over D-Bus; I don’t think there’s any difference here. It just
seems that this corner case, where we don’t get a reply for ‘Hello’, is
not correctly handled.

Toggle quote (6 lines)
> Interesting that you mention this; that's what I worked on yesterday
> (see attached patches). I managed to get elogind panicking the kernel
> during a guix system reconfigure, which corrupted GRUB, so had to chroot
> and reconfigure from there [1]. Not sure what happened, but it seems
> that killing and restarting elogind is susceptible to cause hard locks.

Ouch! This is weird. Note that “everything” depends on elogind, as can
be seen with ‘guix system shepherd-graph’. So “herd stop elogind” is
not something you should try at home.

Incidentally, can you reproduce the bug in a VM? That would be much
nicer.

Toggle quote (7 lines)
>> We know that elogind is started after dbus-daemon has written its PID
>> file (there’s a Shepherd service dependency). Is there a possibility
>> that dbus-daemon writes its PID file before it has set
>> ‘new_connection_function’?
>
> Are PID files conventionally agreed to be synchronization primitives?

Yes. Daemons are supposed to write their PID file once they’re ready to
do their job.

Toggle quote (5 lines)
> I'm sorry for my lack of deep analysis here, I've spent most of my
> evening attempting to fix my system just to boot ^^'. I've at least
> managed to collect the following verbose D-Bus log (54 MiB uncompressed)
> which hopefully can shed some light onto how this failure came to be.

Excerpt:

Toggle snippet (24 lines)
366: 0x7f28f396e740: 1639108961.938358 [dbus-sysdeps-util-unix.c(237):_dbus_write_pid_to_file_and_pipe] writing pid file /var/run/dbus/pid
366: 0x7f28f396e740: 1639108961.938438 [dbus-sysdeps-util-unix.c(291):_dbus_write_pid_to_file_and_pipe] No pid pipe to write to
366: 0x7f28f396e740: 1639108961.938474 [dbus-userdb.c(177):_dbus_user_database_lookup] Using cache for UID 986 information
366: 0x7f28f396e740: 1639108961.938665 [dbus-sysdeps-unix.c(3514):_dbus_socketpair] full-duplex pipe 6 <-> 7
366: 0x7f28f396e740: 1639108961.938700 [main.c(719):main] We are on D-Bus...
366: 0x7f28f396e740: 1639108961.938725 [dbus-mainloop.c(884):_dbus_loop_run] Running main loop, depth 0 -> 1
366: 0x7f28f396e740: 1639108962.566557 [dbus-server-socket.c(182):socket_handle_watch] Handling client connection, flags 0x1
366: 0x7f28f396e740: 1639108962.566623 [dbus-sysdeps-unix.c(2344):_dbus_accept] client fd 8 accepted
366: 0x7f28f396e740: 1639108962.566665 [dbus-server-socket.c(94):handle_new_client_fd_and_unlock] Creating new client connection with fd 8
366: 0x7f28f396e740: 1639108962.566779 [dbus-connection.c(1360):_dbus_connection_new_for_transport] LOCK
366: 0x7f28f396e740: 1639108962.566824 [dbus-transport-socket.c(180):check_read_watch] fd = 8
366: 0x7f28f396e740: 1639108962.566869 [dbus-transport-socket.c(227):check_read_watch] setting read watch enabled = 1

[...]

366: 0x7f28f396e740: 1639108962.568765 [dbus-transport-socket.c(974):socket_handle_watch] handling read watch 0x23c2b00 flags = 9
366: 0x7f28f396e740: 1639108962.568807 [dbus-transport-socket.c(348):exchange_credentials] exchange_credentials: do_reading = 1, do_writing = 0
366: 0x7f28f396e740: 1639108962.568862 [dbus-transport-socket.c(383):exchange_credentials] Failed to read credentials Failed to read credentials byte (zero-length read)
366: 0x7f28f396e740: 1639108962.568904 [dbus-transport.c(510):_dbus_transport_disconnect] start
366: 0x7f28f396e740: 1639108962.568945 [dbus-transport-socket.c(1041):socket_disconnect]
366: 0x7f28f396e740: 1639108962.568986 [dbus-transport-socket.c(76):free_watches] start
366: 0x7f28f396e740: 1639108962.569029 [dbus-watch.c(423):_dbus_watch_list_remove_watch] Removing watch on fd 8

We see this sequence:

1. Writes PID file.

2. Accepts incoming connection.

3. “Failed to read credentials”.

4. Close connection.

So we need to understand #3. Perhaps going back to strace logs could
help this time…

Thanks,
Ludo’.
M
M
Michael Rohleder wrote on 15 Dec 2021 20:27
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 52051@debbugs.gnu.org)
87k0g5hdne.fsf@rohleder.de
Hi Maxim,

I currently have this (very annoying) issue on _one_ of my machines (two
others work with nearly the same config).
I can't login at all not via console nor ssh or sddm.
I spend some time to reproduce it in a vm, but no success so far.

These are the relevant messages from syslog:
Dec 15 18:15:52 micha dbus-daemon[470]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
Dec 15 18:16:47 micha dbus-daemon[470]: [system] Activating service name='org.freedesktop.login1' requested by ':1.8' (uid=0 pid=899 comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
Dec 15 18:16:47 micha elogind[935]: elogind is already running as PID 558
Dec 15 18:17:12 micha dbus-daemon[470]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)

--
How much does it cost to entice a dope-smoking UNIX system guru to Dayton?
-- Brian Boyle, UNIX/WORLD's First Annual Salary Survey
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEdV4t5dDVhcUueCgwfHr/vv7yyyUFAmG6QaYACgkQfHr/vv7y
yyW5zwf/WYlUhJlA2C3ZcT5KbY5tqYLzAkYLuEGq+Lpvn6tGv2CbjHTOkIhSztJ7
YfNyG4GowxFPLcMp2bhsT2D4DbgXrhKDSoqnglsSPCfQ+TJSm7Xy/G4kala4sHbm
X/+BoGFtybf2ysF48a+8GwS+D68lLQ3f7xhLLMOoY3HpdgtsdrM89YcNvzGkp7bD
KAFPIY4BPiWXm5iaqNwCI1xE7PIraRWa2/3Q8A8ohf4Xun1yaJBEH21xTvElbqvO
sHZnKKFwaZemOdWkoQKzw1yC9PGXym0Gl8hlDiwbKrmvpdJL3/Urfz2GQ7H/in5a
07gb6rxZiEXr33li9v+yjYyNBqwcZw==
=jESv
-----END PGP SIGNATURE-----

A
A
Abhiseck Paira wrote on 16 Dec 2021 04:36
[core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)
(address . 52051@debbugs.gnu.org)
874k795i1m.fsf@disroot.org
Hello,

I am getting the same errors others have mentioned. I can't login via
tty and GDM wouldn't launch either. This happened after I pulled this
commit 73eb941b8b3793aec6110a4ae28bdbfc3ab4f6c5 with guix pull and ran
system reconfigure.

--
Abhiseck Paira
E34E 825B 979E EB9F 8505 F80E E93D 353B 7740 0709
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEE406CW5ee65+FBfgO6T01O3dABwkFAmG6tmoACgkQ6T01O3dA
BwkJ4hAAo5F/xIkt6m57b5VKEnp4t6Cu7ZSI8nHEAJ/5FxdU0hlm9Qeoh+9bpS9N
SVh4hGa3rNN6AK5ylz5f4z2qfy67i4L2kk+zhphBzOXP2jp9NoAvG6fqBRczhmqM
LgKWULofkvu3Gd2mjsf98gnsv8+QLOLWklwGUTlUvy23Rz+lYwnEZpykDZXa3ElK
HVpKK7buCD+aFz/QAGxqYxiv5bxvIldcKj2gmbkpU3ZJK0w0VT19MLzfX+jZuj5Y
P7Du9sdt48qNqCXNK2hT1C2Uv2Nnx0UVpJJEt/V8Mk0r8XpMbfuqdOgFBx2V9vVV
xXRWmA73NQHvVkf/StLB9Kpg208Ka/bRP1Tyuha29eL7u4vk/psdUEaHnRF6W+UD
XnEaHaXFgITSlSK3oNZ91QxFtAOU8+8VyckwYIdBQvBK5TrW9IY7aWSOOCZNe5C3
o87s1j8VNhEpmNUe75ExL/SX8a4ldIJ4R/k8F1bO6cCdxcB4134dFY56bkXVa7m9
IukxJcBGcylZjxgc1b4D+fTfbOv8xI/givBMhHa+e+oBePC2MwZuxFigxKcgRm5A
ZN+dcyXYKExvjS4gmJiMUmk6rFpnje6+/LP0BYzGFmUdHy2qwjs3KsaJf078QHIy
4oHgXd3fyVc+fGJICqXh1Wg3A8qrc4mDUsdOU6ljSlH5zZhtgAY=
=Ip6w
-----END PGP SIGNATURE-----

I
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87sfusofjg.fsf@airmail.cc
I may have found a clue about what's happening.

After updating I ran into this same issue (I login from TTY), and I see
the same "elogind is already running" error.
However, if I boot into an older generation (before the
core-updates-frozen merge), and then again reboot back into the newest
generation (after core-updates-frozen merge) then I am able to login.

I was able to reproduce those exact steps too. Reboot to new generation,
can't login again. Reboot to old generation, then reboot back to new
generation and I can log in.

Note that even when I'm able to login on the new generation I still see
the "elogind is already running" error printed, but only before the
login prompt shows up, not after.

Also I'm not sure if this matters, but another difference I notice
during the attempts when I can successfully login on the new generation
is that I see Herd restart the `nscd` service.
Perhaps someone else experiencing this issue should try to restart nscd
(via SSH) and see if that works for you.

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (29 lines)
> Hello,
>
> I'm not 100% sure this is the cause, but these are the last messages I
> have before I rebooted:
>
> Nov 23 01:09:14 localhost dbus-daemon[383]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=370
> comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
> Nov 23 01:09:14 localhost elogind[1189]: elogind is already running as PID 390
> Nov 23 01:09:20 localhost shepherd[1]: Respawning term-tty1.
> Nov 23 01:09:20 localhost shepherd[1]: Service host-name has been started.
> Nov 23 01:09:20 localhost shepherd[1]: Service term-tty1 has been started.
> Nov 23 01:09:39 localhost dbus-daemon[383]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
>
> I don't remember if I saw the slim login screen; but in any case I
> couldn't successfully login even via a ptty.
>
> It may have to do with polkit.
>
> To be investigated.
>
> This happened on a system *not* using gdm (it uses slim) and with
> ratpoison as the WM, on commit f42bc604547d9ee8e35fcd66d5db7786954cfac3
> of the core-updates-frozen branch.
>
> To be investigated.
>
> Thanks,
>
> Maxim
G
G
Guillaume Le Vaillant wrote on 16 Dec 2021 18:59
Re: bug#52051: [core-updates-frozen] cannot login
(address . 52051@debbugs.gnu.org)
87mtl0774j.fsf@kitej
Hi,

I just got this login error when updating an old machine with a HDD as
storage. On some other faster machines using SSD or NVME storage this
issue never happened, so I thought the error might be triggered by
slow IO.

Do some of you also see the issue on fast machines/storage?
-----BEGIN PGP SIGNATURE-----

iIUEAREKAC0WIQTLxZxm7Ce5cXlAaz5r6CCK3yH+PwUCYbuBLA8cZ2x2QHBvc3Rl
by5uZXQACgkQa+ggit8h/j+S/AD/byi2hXPYz9OtuuHugDB/bhicLhXoQ5n7EHCV
MyHum/UBAJehTqeY9LE4+ddy8qgY+Fz19InvhLTL9o/UHrGW9J5F
=4YEE
-----END PGP SIGNATURE-----

J
J
Josselin Poiret wrote on 16 Dec 2021 19:24
Re: bug#52051: [core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)
8735msz9v8.fsf@jpoiret.xyz
Hello Maxim and Ludovic,

Seeing all the new activity on this bug, I decided to take a closer
look.

It doesn't seem to me that the credential byte read is the problem, as
can be seen by the elogind strace: it sends

Toggle snippet (3 lines)
374 sendmsg(11, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0AUTH EXTERNAL\r\nDATA\r\n", iov_len=22}, {iov_base="NEGOTIATE_UNIX_FD\r\n", iov_len=19}, {iov_base="BEGIN\r\n", iov_len=7}], msg_iovlen=3, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 48

to which dbus answers with

Toggle snippet (3 lines)
374 recvmsg(11, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="DATA\r\nOK ea94265b9365f7158f1d81ee61b20dab\r\nAGREE_UNIX_FD\r\n", iov_len=256}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 58

so dbus doesn't seem to choke on the initial authentication (where the
credentials read happens) and it's clear that the byte 0 is properly
sent at the beginning of the handshake procedure.

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (5 lines)
> I'm sorry for my lack of deep analysis here, I've spent most of my
> evening attempting to fix my system just to boot ^^'. I've at least
> managed to collect the following verbose D-Bus log (54 MiB uncompressed)
> which hopefully can shed some light onto how this failure came to be.

Would it be possible to get both the elogind strace and the dbus log for
the same session? We could then simply `grep` the authentication cookie
sent back by dbus to elogind to track the corresponding fd in the dbus
log.

Best,
Josselin Poiret
M
M
Michael Rohleder wrote on 16 Dec 2021 19:53
(address . 52051@debbugs.gnu.org)
87v8zo4c0l.fsf_-_@rohleder.de
Hi Guillaume,

Guillaume Le Vaillant <glv@posteo.net> writes:
Toggle quote (7 lines)
> I just got this login error when updating an old machine with a HDD as
> storage. On some other faster machines using SSD or NVME storage this
> issue never happened, so I thought the error might be triggered by
> slow IO.
>
> Do some of you also see the issue on fast machines/storage?

I can confirm this. For me, this happens on a machine with a HDD, but
not on another with a SSD nor my laptop with NVME.

(It also "feels" like, it's about timing or so)

--
Q: How does a Unix guru have sex?
A: unzip;strip;touch;finger;mount;fsck;more;yes;umount;sleep
-----BEGIN PGP SIGNATURE-----

iQFFBAEBCAAvFiEEdV4t5dDVhcUueCgwfHr/vv7yyyUFAmG7iyoRHG1pa2VAcm9o
bGVkZXIuZGUACgkQfHr/vv7yyyVTqAf/QIdicZyGBbv4TC4lUqJEDoUwg1ej1q9T
7OWLfn75SzL+rNJQJZT7QR/RhGQtth46utp34gZv1RGayoKTHXH6QEha1e6kbiLj
yyT1BB6/5Byg0TSAAnTHy92PR8E1taVstsM6M16Kk+YIe8aANRuQEzUgcXYvt6V+
vuC4m+7d3zliydkJsi2chPHdpMqaS/TZwR9tzV1mmcGpRCRpj6rjcykHAOPhJHPf
93W2l40WmBvAlgO6ymXRlZVK8CZgqk2xbMkLsIq/jhG+FHQTsjxg7HEEwFHkW1Dh
ggwGzxrqlbRxm2i1fc91qLIkAzkGnVWysx+7YZJAg5RDwoghjlsOUw==
=PB76
-----END PGP SIGNATURE-----

C
C
Caleb Herbert wrote on 20 Dec 2021 01:40
[core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)
(address . 52051@debbugs.gnu.org)
3a3f7cfa-9525-17a6-d53b-82d23dc9b70b@bluehome.net
I am using an X200 Tablet with an HDD. This happened on my system. I
am now on Fedora with Guix until this issue is fixed.
Attachment: OpenPGP_signature
T
T
Timothy Sample wrote on 21 Dec 2021 10:31
(name . Josselin Poiret)(address . dev@jpoiret.xyz)
87czlq1ez4.fsf@ngyro.com
Hello,

I dug out an old HDD, put it in my X200, and was able to reproduce this.
Eventually I was able to log in to the thing by fiddling with the
“auth_timeout” parameter in the D-Bus config.

Josselin Poiret <dev@jpoiret.xyz> writes:

Toggle quote (7 lines)
> [...]
>
> Would it be possible to get both the elogind strace and the dbus log for
> the same session? We could then simply `grep` the authentication cookie
> sent back by dbus to elogind to track the corresponding fd in the dbus
> log.

This is exactly what I did (thanks for posting the patches, Maxim).
While looking around in the D-Bus log, I noticed it complaining about
authentication expiry around where it mentions the PID of elogind. This
is in Maxim’s log, too. It counts down an authentication expiry with
messages like “Connection 0x23b9f50 authentication expires in ...”.
That brought me to the D-Bus NEWS file, which mentions adjusting
“auth_timeout” to fix a boot regression for users with older hardware
[1]. The NEWS file mentions a bug report [2] that discusses how this
might be related to hard disk speed.


After reading that I tried with the timeout bumped up to a minute, and
the X200 booted into GDM just fine, twice in row, and then failed again
when I removed the change. (I should add that it still printed the
“elogind is already running” messages, but it worked anyway.)

What’s weird is that this bug is very old (2014), and the default
timeout was increased at the time from 5s to 30s to solve the bug. So
it’s part of the story, but not the whole story.


-- Tim
L
L
Leo Famulari wrote on 21 Dec 2021 17:13
(name . Timothy Sample)(address . samplet@ngyro.com)
YcH9L9rNDrcosM20@jasmine.lan
On Tue, Dec 21, 2021 at 04:31:27AM -0500, Timothy Sample wrote:
Toggle quote (8 lines)
> [2] https://bugs.freedesktop.org/show_bug.cgi?id=86431#c3
>
> After reading that I tried with the timeout bumped up to a minute, and
> the X200 booted into GDM just fine, twice in row, and then failed again
> when I removed the change. (I should add that it still printed the
> “elogind is already running” messages, but it worked anyway.)

Okay, great. I don't "auth_timeout" in the Guix source tree. Can you say
where you adjusted it?
T
T
Timothy Sample wrote on 21 Dec 2021 17:36
(name . Leo Famulari)(address . leo@famulari.name)
87fsqlewzd.fsf@ngyro.com
Hi Leo,

Leo Famulari <leo@famulari.name> writes:

Toggle quote (12 lines)
> On Tue, Dec 21, 2021 at 04:31:27AM -0500, Timothy Sample wrote:
>> [1] https://gitlab.freedesktop.org/dbus/dbus/-/blob/master/NEWS#L2487
>> [2] https://bugs.freedesktop.org/show_bug.cgi?id=86431#c3
>>
>> After reading that I tried with the timeout bumped up to a minute, and
>> the X200 booted into GDM just fine, twice in row, and then failed again
>> when I removed the change. (I should add that it still printed the
>> “elogind is already running” messages, but it worked anyway.)
>
> Okay, great. I don't "auth_timeout" in the Guix source tree. Can you say
> where you adjusted it?

Good question! :) Here’s the patch:
Toggle diff (12 lines)
diff --git a/gnu/services/dbus.scm b/gnu/services/dbus.scm
index 85a4c3e..a680ed7 100644
--- a/gnu/services/dbus.scm
+++ b/gnu/services/dbus.scm
@@ -106,6 +106,7 @@ (define-syntax directives
(define (services->sxml services)
;; Return the SXML 'includedir' clauses for DIRS.
`(busconfig
+ (limit (@ (name "auth_timeout")) "60000")
(servicehelper "/run/setuid-programs/dbus-daemon-launch-helper")
;; First, the '.service' files of services subject to activation.
M
M
Maxime Devos wrote on 21 Dec 2021 17:52
596551103b98265f46051b7898f5ca38744c7401.camel@telenet.be
Timothy Sample schreef op di 21-12-2021 om 11:36 [-0500]:
Toggle quote (3 lines)
> +            (limit (@ (name "auth_timeout")) "60000")


Are there any good reasons for having a timeout at all?
(Except for the local-user denial of service, but local users can do
"guix build -f something-that-allocates-almost-all-memory-and-melts-
the-cpu.scm" anyway ...)

If not, can the timeout be disabled/set to infinity?

A comment like

Toggle quote (5 lines)
> ;; Set timeout to a huge number (16.6 hours), because
> ;; upstream often sets timeouts low for spinning disks,
> ;; slow CPUs, etc.
> (limit [...] "60000")

could be useful (I'm assuming the timeout is in seconds here).

Greetings,
Maxime.
L
L
Leo Famulari wrote on 21 Dec 2021 18:32
(name . Maxime Devos)(address . maximedevos@telenet.be)
YcIPpSte2D4e8b6R@jasmine.lan
On Tue, Dec 21, 2021 at 04:52:19PM +0000, Maxime Devos wrote:
Toggle quote (7 lines)
> Are there any good reasons for having a timeout at all?
> (Except for the local-user denial of service, but local users can do
> "guix build -f something-that-allocates-almost-all-memory-and-melts-
> the-cpu.scm" anyway ...)
>
> If not,�can the timeout be disabled/set to infinity?

Could be. But I think that, based on a several years of reports, the
X200 with an HDD is the slowest machine used with Guix System. On my
X200 with HDD, I have personally experienced similar race-y bugs that
seem to crop up after major upgrades --- I assume that it's a case of
bad luck, where important programs for booting move to distant parts of
the disk and seeking is too slow.

Toggle quote (9 lines)
> A comment like
>
> > ;; Set timeout to a huge number (16.6 hours), because
> > ;; upstream often sets timeouts low for spinning disks,
> > ;; slow CPUs, etc.
> > (limit [...] "60000")
>
> could be useful (I'm assuming the timeout is in seconds here).

I suggest we wait until such drastic action is necessary. Otherwise we
might be banging our heads against the wall in a few years, trying to
debug something. Let's not rush to extremes :)
L
L
Leo Famulari wrote on 21 Dec 2021 18:37
(name . Timothy Sample)(address . samplet@ngyro.com)
YcIQ2ufni8EmmoJl@jasmine.lan
On Tue, Dec 21, 2021 at 11:36:22AM -0500, Timothy Sample wrote:
Toggle quote (14 lines)
> diff --git a/gnu/services/dbus.scm b/gnu/services/dbus.scm
> index 85a4c3e..a680ed7 100644
> --- a/gnu/services/dbus.scm
> +++ b/gnu/services/dbus.scm
> @@ -106,6 +106,7 @@ (define-syntax directives
> (define (services->sxml services)
> ;; Return the SXML 'includedir' clauses for DIRS.
> `(busconfig
> + (limit (@ (name "auth_timeout")) "60000")
> (servicehelper "/run/setuid-programs/dbus-daemon-launch-helper")
>
> ;; First, the '.service' files of services subject to activation.


With this patch, I can boot and login to X on my X200s with HDD.
Obviously this failure and subsequent fix are not deterministic, but
it's a good sign! Let's get some more testing! And comrades, let's test
the xorg-server update while doing it:

L
L
Leo Famulari wrote on 21 Dec 2021 18:44
(name . Timothy Sample)(address . samplet@ngyro.com)
YcISjw7v7K0MqYv9@jasmine.lan
On Tue, Dec 21, 2021 at 12:37:30PM -0500, Leo Famulari wrote:
Toggle quote (4 lines)
> With this patch, I can boot and login to X on my X200s with HDD.
> Obviously this failure and subsequent fix are not deterministic, but
> it's a good sign! Let's get some more testing!

I pushed the patch to a Savannah branch to ease testing:


So, you can do:

`guix pull --branch=wip-fix-52051 && guix system reconfigure [...]`

Or similar with time-machine
-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEEsFFZSPHn08G5gDigJkb6MLrKfwgFAmHCEo8ACgkQJkb6MLrK
fwgyGw/+OwHyC/tnZbHPRNpVqgn83wtcvRf4Ud1Vzg1hN07Tmb0rVXaJAQFvNVpg
1Wn+bO0/Wo+PAMWtqnzfp3kkZu9kf0d9XlmNxxnTI6Dsj+dnFJB2Mp2NOfCM2sQU
5IRelOpSfHy7agEAbUoAF/K3ycJk/XaPnuKkV7CK87kPUQ8w9a3Hd8oibd+eZm4C
YEkeaGo1FRT2bv2rqWOPxp4HeLFSGJ91K4Q/TnAAmHsiKApdfhBwM7FqljGzkL4I
W8nTGqdCy577NVoiPQpGm+RTYksKjSv/h93H6HTGJ1EnSfFW/FpIGXNKjMfkF/A9
T5oxvYXNg/jaXYEmzNpNpqqerXaIiY89+9FZe3x88XTcp3fOrCUOsf++CLA4pIdV
cQ5nE+HD+fWZtTom3NhbpB26WcVNT4fqkpIANtvJEIQw+dcNjvN61B+3+/ox/G3x
JhFvVeB3gqOr3FlqppBjOKKcsp/1BTPggwyneOj5vcxjBWU84Zx8ciR9diq67ddp
gqMgTLcBGoQZZlBwzUkRD27K5KGvXAlSv27WcRlucZpNjXna/hOFc+KK/JCXhBZh
GimEOy8MtVAQiKL4grSgW/4MpmHghGm7hpilj8dUXRP46eYv5EKh59aGTw4tiRsz
sVICdg4YUWstCprS+bud2dsJEjwyLzkZRRLoX/m1BAIJVBCMVJc=
=g/8D
-----END PGP SIGNATURE-----


M
M
Maxime Devos wrote on 21 Dec 2021 18:51
(name . Leo Famulari)(address . leo@famulari.name)
260a106ae1127257b512f599282f9456df23b8dc.camel@telenet.be
Leo Famulari schreef op di 21-12-2021 om 12:32 [-0500]:
Toggle quote (14 lines)
> > A comment like
> >
> > > ;; Set timeout to a huge number (16.6 hours), because
> > > ;; upstream often sets timeouts low for spinning disks,
> > > ;; slow CPUs, etc.
> > > (limit [...] "60000")
> >
> > could be useful (I'm assuming the timeout is in seconds here).
>
> I suggest we wait until such drastic action is necessary. Otherwise
> we
> might be banging our heads against the wall in a few years, trying to
> debug something. Let's not rush to extremes :)

I don't think commenting new code is a drastic and extreme action.
I assume you were referring you were referring to the ‘16.6 hours’?
If so, just replace ‘16.6 hours’ by ‘one minute’, because it turns
out that 'auth_timeout' is in milliseconds and not seconds,

Greetings,
Maxime.
L
L
Leo Famulari wrote on 21 Dec 2021 19:11
(name . Maxime Devos)(address . maximedevos@telenet.be)
YcIYvEHqChOZ9Lja@jasmine.lan
On Tue, Dec 21, 2021 at 05:51:28PM +0000, Maxime Devos wrote:
Toggle quote (3 lines)
> I don't think commenting new code is a drastic and extreme action.
> I assume you were referring you were referring to the ‘16.6 hours’?

Right. We don't want to use a timeout of 16.6 hours. From the
perspective of a user, that's not meaningfully longer than, say, 30
minutes. And really, if the system can't bring up the login interface
for a laptop in 5 minutes, it's totally broken.

Toggle quote (4 lines)
> If so, just replace ‘16.6 hours’ by ‘one minute’, because it turns
> out that 'auth_timeout' is in milliseconds and not seconds,
> see <https://dbus.freedesktop.org/doc/dbus-daemon.1.html>.

I think I was confused about the units here.
L
L
Luis Felipe wrote on 22 Dec 2021 02:46
[core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)
(name . 52051@debbugs.gnu.org)(address . 52051@debbugs.gnu.org)
sHeDSIHyuoHesmvYdFh5Bpqpj6e4qmLCslkETTM1XkrH6y9NvpmgrbfKOwmkOZoDnngUDzQhoZqdr-5U4NGQsMDh7-QHHbdR51DYFZXFkAE=@protonmail.com
Hello,

Just wanted to inform that I ran

`guix pull --branch=wip-fix-52051 && guix system reconfigure [...]`

and I could log in without problems now. Thanks.
Attachment: signature.asc
L
L
Leo Famulari wrote on 22 Dec 2021 03:19
(name . Luis Felipe via Bug reports for GNU Guix)(address . bug-guix@gnu.org)(name . 52051@debbugs.gnu.org)(address . 52051@debbugs.gnu.org)
YcKLKNr0I2zdBlQ4@jasmine.lan
On Wed, Dec 22, 2021 at 01:46:41AM +0000, Luis Felipe via Bug reports for GNU Guix wrote:
Toggle quote (8 lines)
> Hello,
>
> Just wanted to inform that I ran
>
> `guix pull --branch=wip-fix-52051 && guix system reconfigure [...]`
>
> and I could log in without problems now. Thanks.

Awesome!

Would someone like to write a proper commit message and add a code
comment?
-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEEsFFZSPHn08G5gDigJkb6MLrKfwgFAmHCiyUACgkQJkb6MLrK
fwgXRw//fPOjjq1AzaoylkBF2WjF2RVeyH6g9l2+ptmxuBau+2dlt6nacCQQ75/Z
m9voJ8P7xtGG/hu5qkhIvgnzWmSJMJcGgLHTOluP1nnED0zhzCBK4MlulVFpcZUI
Jt1gQ66aLv3d8KzQElfaPAQ8C57g45eBajOvf8A24qpTD3lc4xYSH44uYeEvdSeR
7AGJbw8OvfcSWzcn1SnBBfRJlgDMAKSyZjREERJ/2mraf9s+D8x4qAGkCTINcEvJ
3vXM9v7DZoLUJEWyX8nKWeHic1eDWs+L2TbO7DB0hxDuvpWD0ixPx9w5b7z1Y3iR
h8wj0WYRtgrDLqN2zdE0ydukgamfwJAU2CfTv/GjuX8xUQb6/hmeAmWFIQBx/T3B
rd7LSAWWZ+DN9gqEVJRK7BrzWgAmNraje/i5KnHh0Xp++G1OAVN/xw9yfYSAqUza
xT8zg1fxAiFBzO/4v+Gmr2S9Ob56EwmZFcLr9Gc8TV2UdaIfDoRMlE9c5gUgfhYq
SB8NXRQmVTpVRpX0TbTT7CBpHOgKWH8QKqmBiS+W9cxuSEBEmEFSzxwcom3+A827
09keGpksrv4dPibPWZzS1oRAfeiwx3TticYjY0TlXHowWApyMaZWq7xZyu81jK0X
t1KRzHltnu5jD0DlPq2fCENmpyM+vKTbyaLFGPf0kCYR84/99/0=
=gIVl
-----END PGP SIGNATURE-----


L
L
Leo Famulari wrote on 22 Dec 2021 22:16
(name . Timothy Sample)(address . samplet@ngyro.com)(name . 52051@debbugs.gnu.org)(address . 52051@debbugs.gnu.org)
YcOVrk7teKyQp2LX@jasmine.lan
On Tue, Dec 21, 2021 at 09:19:20PM -0500, Leo Famulari wrote:
Toggle quote (3 lines)
> Would someone like to write a proper commit message and add a code
> comment?

How about the attached patch? I'd like to push this soon, because it's a
severe problem for some users.
From ab9c4de76dea889e5d05bcf7fa173868357d5f44 Mon Sep 17 00:00:00 2001
From: Timothy Sample <samplet@ngyro.com>
Date: Tue, 21 Dec 2021 11:52:34 -0500
Subject: [PATCH] services: dbus: Wait 1 minute for elogind to get ready.


* gnu/services/dbus.scm (dbus-configuration-directory): Set a 60 second timeout
in the dbus config.
---
gnu/services/dbus.scm | 4 ++++
1 file changed, 4 insertions(+)

Toggle diff (17 lines)
diff --git a/gnu/services/dbus.scm b/gnu/services/dbus.scm
index 85a4c3ec9a..d2daf60497 100644
--- a/gnu/services/dbus.scm
+++ b/gnu/services/dbus.scm
@@ -106,6 +106,10 @@ (define-syntax directives
(define (services->sxml services)
;; Return the SXML 'includedir' clauses for DIRS.
`(busconfig
+ ;; Increase this timeout to 60 seconds to work around race-y
+ ;; failures such as <https://issues.guix.gnu.org/52051> on slow
+ ;; computers with slow I/O.
+ (limit (@ (name "auth_timeout")) "60000")
(servicehelper "/run/setuid-programs/dbus-daemon-launch-helper")
;; First, the '.service' files of services subject to activation.
--
2.34.0
-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEEsFFZSPHn08G5gDigJkb6MLrKfwgFAmHDlasACgkQJkb6MLrK
fwhaWBAA3Gu1fyVpcrVq74zZQVt60UXfiyX1YqU6T4e8mJwtNojaP9T2qydbAEBe
ppPINK13RXECXCP1UcZM6iLd4pH7wqMZgdueFSWA59dlfIC78HHNUzRlaivyIjG/
MdizI0/OKbbK/ZWSfVfpoueOzJA66i1GOIa4ibj3vgt2dJSaYvsvz+jQ5OtuGFLr
oUqrc7W5nE9oOQJGY4Z4Q3AdRzgltYU2OEtavWzoqR1/IvJsRT2kBaHfO0o1gqgi
Z/tDcxqZCneLI385HCdxWuKuHQU30dU676h9Pk7rCAiccQfcg6B35lurwwnzemoS
L2ynzqloBj+wyFmrd0JKyZwfv36CTHWPHTCiLuKSrD4+vmGCNmfokQ6X5s8RXWyV
10dnTdeePNzi/8HilNh/nDQ6HrDX97GMFm4Qoe0VuI5PB0xw4UOy8J+cwJ/noxBb
Gr0L6bpljcTL0fIoe7p4PrcUVdbd6dDzg71VhFY90xu1NXV0nkgagEyFNVlf4cDE
p+taynbOmHT5kFX+7ooypNS6888qwHCslpB+7PvCoSVoxCIkkMdsBcJ4zyGpOnOb
tlOjXV/uEiJn4nb0b4eIFna0sbX//HJBqOgrI+E1Ie4+kTq+E3KesK+yFpD9Jltm
iSLBPLY6X1ROjGZ19BiMKEUZn1z1RJXiTRQtNnHKupNJFwyY4dk=
=MsCe
-----END PGP SIGNATURE-----


L
L
Leo Famulari wrote on 23 Dec 2021 00:38
(name . Timothy Sample)(address . samplet@ngyro.com)(name . 52051@debbugs.gnu.org)(address . 52051-done@debbugs.gnu.org)
YcO28rn9HnZBLxwQ@jasmine.lan
On Wed, Dec 22, 2021 at 04:16:30PM -0500, Leo Famulari wrote:
Toggle quote (17 lines)
> On Tue, Dec 21, 2021 at 09:19:20PM -0500, Leo Famulari wrote:
> > Would someone like to write a proper commit message and add a code
> > comment?
>
> How about the attached patch? I'd like to push this soon, because it's a
> severe problem for some users.

> From ab9c4de76dea889e5d05bcf7fa173868357d5f44 Mon Sep 17 00:00:00 2001
> From: Timothy Sample <samplet@ngyro.com>
> Date: Tue, 21 Dec 2021 11:52:34 -0500
> Subject: [PATCH] services: dbus: Wait 1 minute for elogind to get ready.
>
> Fixes <https://issues.guix.gnu.org/issue/52051>.
>
> * gnu/services/dbus.scm (dbus-configuration-directory): Set a 60 second timeout
> in the dbus config.

Pushed as 488f1c589df00e802163af534294d93372e5c025
-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEEsFFZSPHn08G5gDigJkb6MLrKfwgFAmHDtvIACgkQJkb6MLrK
fwjtCA//YG/AXQNOk8gZRbKB7TZjutPJVqs0T0Xkt24McWedlZ64loODfLbVKb79
cfiOKlXNocZ8G7Mf8TT6GE1ACfs/1cKcGrP0EvKZBnl+Y+gkqhv8EUkYfp7YOpPG
GGPJ/3OZrcPIL7r8jusm7EJ2L+3lO7S64Y7/MjHK9ILnQkzrX7NLl6rH7iqPJEwj
6IaVDBouZXfGNLtOGpwG0lnvsnxKoRj21EaLhBSKA5w0D4NPjBwbE4AbtytvDXj0
UZUcCpgen8TWAZC8HzQiMA965iKiEPAseFs1sIIWm5mXPCHWvzZIiPT5ZDP/BAua
SWJK2L/KiycOuRWi2tkpHXyqeXVW3E0Qk7rdK6Rcjtyvm17+SxSbNT5egnTYm07B
aPLeKM6SyeBt+gQSnf6A1pQcVdtFI2dOPSJFbtHkgepkoMNjLI06OM2NvS/VJsfA
UEgvzDy4lmz+qJeNQ+NVjbOo+i+gRpUURvOGPjQoGilnTt3VKRBAhZDOovsyanq6
VvRy0/VNKJa3Q9HOBfHLI2RBxkC46oTIQNcf/35ANhpPIEeAIYrjc1Bln/0+6SHG
BYY5odSMOE56AHp7i+NbhMBarQKTINMY+ZqJ4bUPqUXkRAMBUsPB/+Nz7CpOm3il
d1J/URV4fora3VxH7jpulFl9SHjX4NMeyGDCC9yxEw2ZEiXyLiA=
=/Poh
-----END PGP SIGNATURE-----


Closed
?