[core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)

Done

Details

12 participants

Abhiseck Paira
Caleb Herbert
Josselin Poiret
Guillaume Le Vaillant
ison
Leo Famulari
Ludovic Courtès
Luis Felipe
Maxim Cournoyer
Maxime Devos
Michael Rohleder
Timothy Sample

Owner: unassigned

Submitted by: Maxim Cournoyer

Severity: important

Debbugs page

Maxim Cournoyer wrote on 23 Nov 2021 07:47

Recipients:(name . bug-guix)(address . bug-guix@gnu.org)

Message-ID:87pmqr1i8k.fsf@gmail.com

Hello,

I'm not 100% sure this is the cause, but these are the last messages I

have before I rebooted:

Toggle snippet (9 lines)

Nov 23 01:09:14 localhost dbus-daemon[383]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=370

comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)

Nov 23 01:09:14 localhost elogind[1189]: elogind is already running as PID 390

Nov 23 01:09:20 localhost shepherd[1]: Respawning term-tty1.

Nov 23 01:09:20 localhost shepherd[1]: Service host-name has been started.

Nov 23 01:09:20 localhost shepherd[1]: Service term-tty1 has been started.

Nov 23 01:09:39 localhost dbus-daemon[383]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)

I don't remember if I saw the slim login screen; but in any case I

couldn't successfully login even via a ptty.

It may have to do with polkit.

To be investigated.

This happened on a system *not* using gdm (it uses slim) and with

ratpoison as the WM, on commit f42bc604547d9ee8e35fcd66d5db7786954cfac3

of the core-updates-frozen branch.

To be investigated.

Thanks,

Maxim

Josselin Poiret wrote on 24 Nov 2021 10:02

Recipients:(address . dev@jpoiret.xyz)

Message-ID:87czmqorjb.fsf@jpoiret.xyz

Hello Maxim,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (23 lines)> --8<---------------cut here---------------start------------->8---
> Nov 23 01:09:14 localhost dbus-daemon[383]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=370
>     comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
> Nov 23 01:09:14 localhost elogind[1189]: elogind is already running as PID 390
> Nov 23 01:09:20 localhost shepherd[1]: Respawning term-tty1. 
> Nov 23 01:09:20 localhost shepherd[1]: Service host-name has been started. 
> Nov 23 01:09:20 localhost shepherd[1]: Service term-tty1 has been started.
> Nov 23 01:09:39 localhost dbus-daemon[383]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
> --8<---------------cut here---------------end--------------->8---
>
> I don't remember if I saw the slim login screen; but in any case I
> couldn't successfully login even via a ptty.
>
> It may have to do with polkit.
>
> To be investigated.
>
> This happened on a system *not* using gdm (it uses slim) and with
> ratpoison as the WM, on commit f42bc604547d9ee8e35fcd66d5db7786954cfac3
> of the core-updates-frozen branch.
>
> To be investigated.

I cannot reproduce in a fresh VM on commit

d5de4e163ccef80f78bc5fe330f568d8fe3a23ab, and can login just fine, with

(services (cons* (service slim-service-type (slim-configuration))

(modify-services %desktop-services

(delete gdm-service-type))))

Is this still affecting you?

Best,

Josselin Poiret

Maxim Cournoyer wrote on 25 Nov 2021 04:03

Recipients:(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 52051@debbugs.gnu.org)

Message-ID:87sfvlos26.fsf@gmail.com

Hello Josselin,

Josselin Poiret <dev@jpoiret.xyz> writes:

Toggle quote (40 lines)> Hello Maxim,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> --8<---------------cut here---------------start------------->8---
>> Nov 23 01:09:14 localhost dbus-daemon[383]: [system] Activating
>> service name='org.freedesktop.login1' requested by ':1.17' (uid=0
>> pid=370
>>     comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
>> Nov 23 01:09:14 localhost elogind[1189]: elogind is already running as PID 390
>> Nov 23 01:09:20 localhost shepherd[1]: Respawning term-tty1. 
>> Nov 23 01:09:20 localhost shepherd[1]: Service host-name has been started. 
>> Nov 23 01:09:20 localhost shepherd[1]: Service term-tty1 has been started.
>> Nov 23 01:09:39 localhost dbus-daemon[383]: [system] Failed to
>> activate service 'org.freedesktop.login1': timed out
>> (service_start_timeout=25000ms)
>> --8<---------------cut here---------------end--------------->8---
>>
>> I don't remember if I saw the slim login screen; but in any case I
>> couldn't successfully login even via a ptty.
>>
>> It may have to do with polkit.
>>
>> To be investigated.
>>
>> This happened on a system *not* using gdm (it uses slim) and with
>> ratpoison as the WM, on commit f42bc604547d9ee8e35fcd66d5db7786954cfac3
>> of the core-updates-frozen branch.
>>
>> To be investigated.
>
> I cannot reproduce in a fresh VM on commit
> d5de4e163ccef80f78bc5fe330f568d8fe3a23ab, and can login just fine, with
>
>   (services (cons* (service slim-service-type (slim-configuration))
>                    (modify-services %desktop-services
>                      (delete gdm-service-type))))
>
> Is this still affecting you?

Yes! It didn't occur in a 'guix system vm my-config.scm', but the exact

same config deployed on my machine fails at login.

Some symptoms:

1. Slim login screen comes up, but after entering credentials Xorg

resets (back to login screen)

2. going to a TTY and attempting to login there, it'd fail with a "Login

failed after 60 s timeout" or similar error.

3. I can login via SSH (thanks goodness!)

4. There are no errors (EE) in /var/log/Xorg.0.log

5. here's the tail of my /var/log/messages:

Toggle snippet (35 lines)Nov 24 21:23:54 localhost ntpd[346]: Soliciting pool server 216.197.156.83
Nov 24 21:23:55 localhost ntpd[346]: Soliciting pool server 206.108.0.133
Nov 24 21:23:56 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS 
Nov 24 21:23:56 localhost ntpd[346]: Soliciting pool server 98.143.85.249
Nov 24 21:23:57 localhost ntpd[346]: Soliciting pool server 192.95.27.155
Nov 24 21:23:58 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.16' (uid=0 pid=324 comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
Nov 24 21:23:58 localhost elogind[1114]: elogind is already running as PID 355
Nov 24 21:24:11 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS 
Nov 24 21:24:21 localhost last message repeated 5 times
Nov 24 21:24:23 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
Nov 24 21:24:23 localhost shepherd[1]: Respawning term-tty2. 
Nov 24 21:24:23 localhost shepherd[1]: Service host-name has been started. 
Nov 24 21:24:23 localhost shepherd[1]: Service term-tty2 has been started. 
Nov 24 21:24:23 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS 
Nov 24 21:24:27 localhost last message repeated 3 times
Nov 24 21:26:04 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=429 comm="/gnu/store/nvvmksc9pvahqmypaz3h8mqya82vnga8-slim-1") (using servicehelper)
Nov 24 21:26:04 localhost elogind[1127]: elogind is already running as PID 355
Nov 24 21:26:29 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
Nov 24 21:26:29 localhost shepherd[1]: Respawning xorg-server. 
Nov 24 21:26:29 localhost shepherd[1]: Service host-name has been started. 
Nov 24 21:26:29 localhost shepherd[1]: Service xorg-server has been started. 
Nov 24 21:27:23 localhost ntpd[346]: Soliciting pool server 209.115.181.108
Nov 24 21:27:24 localhost ntpd[346]: Soliciting pool server 138.197.153.200
Nov 24 21:27:25 localhost ntpd[346]: Soliciting pool server 162.159.200.123
Nov 24 21:27:26 localhost ntpd[346]: Soliciting pool server 162.159.200.1
Nov 24 21:29:09 localhost ntpd[346]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Nov 24 21:29:42 localhost ntpd[346]: Soliciting pool server 199.182.221.110
Nov 24 21:35:23 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=0 signal=-87 noise=-95 txrate=270000
Nov 24 21:35:32 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-67 noise=-95 txrate=270000
Nov 24 21:42:50 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.19' (uid=0 pid=1269 comm="sudo herd status dbus ") (using servicehelper)
Nov 24 21:42:50 localhost elogind[1273]: elogind is already running as PID 355
Nov 24 21:43:15 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
Nov 24 21:47:25 localhost last message repeated 3 times

So it seems the issue is with some D-Bus service

(org.freedesktop.login1) failing to be started. I'm not sure how it can

be explained that it works in a VM but not on the real machine...

Also, the dbus-system process (pid 341) as the following environment

variable (singular) defined:

Toggle snippet (4 lines)

# cat /proc/341/environ | xargs -0 -n1

PATH=/run/current-system/profile/bin

which seems like not much. Perhaps it needs some XDG_DATA_DIRS to find

that D-Bus service file? Where would it come from?

To be investigated...

Thanks,

Maxim

Maxim Cournoyer wrote on 25 Nov 2021 05:07

Recipients:(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 52051@debbugs.gnu.org)

Message-ID:87tug0c1zi.fsf@gmail.com

Hello,

I've found a workaround: restarting elogind via SSH resolved the issue.

I guess itt may be a race between elogind and dbus-system (elogind gets

started before dbus-system is fully up, and the communication with the

session bus is somehow crippled from there?).

I'll experiment with the elogind service a bit, adding a dumb sleep to

its start slot to see if it prevents the race.

Thank you,

Maxim

Maxim Cournoyer wrote on 25 Nov 2021 23:07

Recipients:(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 52051@debbugs.gnu.org)

Message-ID:87a6hranz1.fsf@gmail.com

Hello,

In case it is useful, here's what the shepherd-graph of my system looks

like:

Attachment: hurd.svg

My system configuration looks something like this (some parts redacted).

Toggle snippet (253 lines)(use-modules (gnu)
             (guix modules)
             (srfi srfi-1)
             ;; for mcron experiment
             (gnu packages autotools)
             (gnu packages guile-xyz)
             (gnu packages man)
             (gnu packages texinfo)
             (guix git)
             (guix packages))

(use-service-modules admin audio desktop docker linux mail mcron networking nfs
                     nix spice ssh sysctl telephony virtualization vpn xorg)
(use-package-modules android bootloaders certs gnome linux
                     nfs ratpoison rsync tls)

(define updatedb-job
  ;; Run 'updatedb' at 3AM every day.
  #~(job '(next-hour '(3))
         (lambda ()
           (execl #$(file-append findutils "/bin/updatedb")
                  "updatedb"
                  "--prunepaths=/gnu/store /tmp /var/tmp /var/lib"))
         "updatedb"))

(define btrfs-balance-job
  ;; Re-allocate chunks which are using less than 5% of their chunk
  ;; space, to regain Btrfs 'unallocated' space.  The usage is kept
  ;; low (5%) to minimize wear on the SSD.  Runs at 5 AM every 3 days.
  #~(job '(next-hour-from (next-day (range 1 31 3)) '(5))
         (lambda ()
           (execl #$(file-append btrfs-progs "/bin/btrfs")
                  "balance" "start" "-dusage=5" "/"))
         "btrfs-balance"))

(define i2c-udev-rule
  ;; Sets i2c group to i2c devices, so that we can access it as a
  ;; simple user, e.g. using ddcutil.
  (udev-rule
   "90-i2c.rules"
   (string-append "KERNEL==\"i2c-[0-9]*\", "
		  "GROUP=\"i2c\"")))

(define c920-webcam-udev-rule
  (udev-rule "90-c920-webcam.rules"
             (string-append "SUBSYSTEM==\"usb\", "
                            "ATTRS{idVendor}==\"046d\", "
                            "ATTRS{idProduct}==\"082d\", "
                            "GROUP=\"video\"")))

(define %my-desktop-services
  (remove (lambda (service)
            (eq? (service-kind service) gdm-service-type))
          %desktop-services))

;; TODO: Honor rootflags in the initrd,
;; then remove degraded.
(define %common-btrfs-options '(("compress-force" . "zstd")
                                ("space_cache" . "v2")
                                "degraded"))

(define %system
  (operating-system
    (host-name "hurd")
    (timezone "America/Montreal")
    (keyboard-layout (keyboard-layout "dvorak"))
    (bootloader (bootloader-configuration
                 (bootloader grub-bootloader)
                 (targets (list "/dev/sda" "/dev/sdb" "/dev/sdc"))
                 (terminal-outputs '(console))
		 (keyboard-layout keyboard-layout)))
    (kernel-arguments '("quiet" "snd_hda_intel.dmic_detect=0"
                        "modprobe.blacklist=rtl8187"))
    (mapped-devices
     (list (mapped-device
            (source "/dev/sda2")
            (target "cryptroot")
            (type luks-device-mapping))
           (mapped-device
            (source "/dev/sdb2")
            (target "cryptroot-mirror")
            (type luks-device-mapping))
           (mapped-device
            (source "/dev/sdc2")
            (target "cryptroot-mirror2")
            (type luks-device-mapping))))

    ;; Note: Using any of the LUKS encrypted drives exposed under
    ;; /dev/mapper is enough to reference the Btrfs RAID-1 array,
    ;; since the 'btrfs device scan' command is executed in the init
    ;; RAM disk and takes care of assembling the array.
    (file-systems (cons* (file-system
                           (mount-point "/")
                           (device "/dev/mapper/cryptroot")
                           (type "btrfs")
			   (options (alist->file-system-options
                                     (cons '("subvol" . "@root")
                                           %common-btrfs-options)))
			   (dependencies mapped-devices))
                         (file-system
                           (device "/dev/mapper/cryptroot")
                           (mount-point "/home")
                           (type "btrfs")
                           (options (alist->file-system-options
                                     (cons '("subvol" . "@home")
                                           %common-btrfs-options)))
                           (dependencies mapped-devices))
                         (file-system
                           (device "/dev/mapper/cryptroot")
                           (mount-point "/data")
                           (type "btrfs")
                           (options (alist->file-system-options
                                     (cons '("subvol" . "@data")
                                           %common-btrfs-options)))
                           (dependencies mapped-devices))
                         %base-file-systems))
    (users (cons (user-account
                  (name "me")
                  (group "users")
                  (supplementary-groups
		   '("lp"               ;for bluetooth
		     "dialout"		;for serial port
		     "adbusers"         ;for adb
		     "i2c"              ;for ddcutil
		     "realtime"         ;for jackd
                     "kvm" "libvirt" "docker"
		     "wheel" "netdev" "audio" "video"))
                  (home-directory "/home/maxim"))
                 %base-user-accounts))
    (groups
     (cons* (user-group (system? #t) (name "adbusers"))
	    (user-group (system? #t) (name "i2c"))
            (user-group (system? #t) (name "realtime"))
	    %base-groups))
    (services
     (cons*
      (service docker-service-type (docker-configuration
                                    (debug? #t)))
      (service qemu-binfmt-service-type
               (qemu-binfmt-configuration
                (platforms (lookup-qemu-platforms "arm" "aarch64" "ppc64le"))))
      (service earlyoom-service-type)
      (service zram-device-service-type (zram-device-configuration
                                         (size "6G")
                                         (compression-algorithm 'ZSTD)))
      (bluetooth-service #:auto-enable? #t)
      (pam-limits-service
       (list
        ;; Required in order for jackd to function properly.
        (pam-limits-entry "@realtime" 'both 'rtprio 99)
        (pam-limits-entry "@realtime" 'both 'memlock 'unlimited)))
      (service mcron-service-type
               (mcron-configuration
                (mcron (package/inherit mcron
                         (source (git-checkout
                                  (url "file:///home/maxim/src/mcron")))
                         (native-inputs (append
                                            `(("autoconf" ,autoconf)
                                              ("automake" ,automake)
                                              ("help2man" ,help2man)
                                              ("texinfo" ,texinfo))
                                            (package-native-inputs mcron)))))
                (jobs (list updatedb-job
                            btrfs-balance-job))))
      (service mpd-service-type
               (mpd-configuration
                (user "me")))
      ;; Virtualization services to use with GNOME Boxes.
      (service libvirt-service-type
               (libvirt-configuration
                (unix-sock-group "libvirt")))
      (service virtlog-service-type)
      (service opendht-service-type (opendht-configuration
                                     (enable-logging? #t)
                                     (peer-discovery? #t)
                                     (proxy-server-port 8282)))
      (service openssh-service-type
	       (openssh-configuration
	        (port-number 22)
	        (permit-root-login #f)
	        (password-authentication? #f)
	        (use-pam? #f))))
      (service slim-service-type
	       (slim-configuration
	        (auto-login? #f)
	        (default-user "me")
	        (xorg-configuration
	         (xorg-configuration
	      	  (keyboard-layout keyboard-layout)))))
      (service nix-service-type)
      (service nfs-service-type
               (nfs-configuration
                (nfs-versions '("4.2" "4.1" "4.0"
                                "3" "2")) ; for U-Boot
                (exports
                 '(("/some/path/to/rootfs"
                    "*(rw,no_root_squash,no_subtree_check)")))
                (nfsd-udp? #t)))        ;for U-Boot
      (service guix-publish-service-type
               (guix-publish-configuration
                (advertise? #t)
                (host "0.0.0.0")))      ;listen on all interfaces
      (modify-services %my-desktop-services
        ;; Ignore the silly HHKB2 power off key shortcut.
        (elogind-service-type config =>
                              (elogind-configuration
                               (inherit config)
			       (handle-power-key 'ignore)))
        ;; Enable using adb as a simple user with a multitude of devices.
        (udev-service-type config =>
			   (udev-configuration
			    (inherit config)
			    (rules (cons* android-udev-rules
					  i2c-udev-rule
                                          c920-webcam-udev-rule
					  (udev-configuration-rules config)))))
        (network-manager-service-type config =>
				      (network-manager-configuration
				       (inherit config)
				       (vpn-plugins
				        (list network-manager-openvpn
                                              network-manager-openconnect))))
	(guix-service-type config =>
			   (guix-configuration
			    (inherit config)
                            (substitute-urls '("http://127.0.0.1:8181"
					       "https://ci.guix.gnu.org"))
                            (discover? #t)
                            (authorized-keys
                             (cons*
                                keys...
			      %default-authorized-guix-keys))
                            (log-compression 'none)
			    (extra-options '("--max-jobs=4")))))))

    ;; Packages installed by default.
    (packages (append (map specification->package
                           '("ratpoison"
                             "linux-pam" ;for the pam_limits manpage
                             "nss-certs" ;for HTTPS access
		             "rsync"     ;for the backup script
                             "btrfs-progs"
                             "nix"
                             "nfs-utils"
                             "ntfs-3g")) ;for mount.ntfs
                  %base-packages))

    ;; Allow resolution of '.local' host names with mDNS.
    (name-service-switch %mdns-host-lookup-nss)))

%system

But virtualizing it with `guix system vm` doesn't trigger the bug, I'm

guessing because the NFS export doesn't work (it fails quickly rather

than take time like it would on an actual boot).

Thanks,

Maxim

Ludovic Courtès wrote on 25 Nov 2021 23:13

Recipients:(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)

Message-ID:87sfvjdgtl.fsf@gnu.org

Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (14 lines)> Nov 24 21:23:58 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.16' (uid=0 pid=324 comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
> Nov 24 21:23:58 localhost elogind[1114]: elogind is already running as PID 355
> Nov 24 21:24:11 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS 
> Nov 24 21:24:21 localhost last message repeated 5 times
> Nov 24 21:24:23 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
> Nov 24 21:24:23 localhost shepherd[1]: Respawning term-tty2. 
> Nov 24 21:24:23 localhost shepherd[1]: Service host-name has been started. 
> Nov 24 21:24:23 localhost shepherd[1]: Service term-tty2 has been started. 
> Nov 24 21:24:23 localhost wpa_supplicant[343]: wlp4s0: CTRL-EVENT-BEACON-LOSS 
> Nov 24 21:24:27 localhost last message repeated 3 times
> Nov 24 21:26:04 localhost dbus-daemon[341]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=429 comm="/gnu/store/nvvmksc9pvahqmypaz3h8mqya82vnga8-slim-1") (using servicehelper)
> Nov 24 21:26:04 localhost elogind[1127]: elogind is already running as PID 355
> Nov 24 21:26:29 localhost dbus-daemon[341]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)

It looks like elogind things it’s already running as PID 355, where the

PID comes from its PID file; there’s indeed a process with this PID, but

is it really elogind? Or could it be that there’s a stale PID file?

Does /var/log/messages shows the very first time elogind is started at

boot time? With which PID?

We need to find out why dbus-daemon thinks it needs to restart elogind.

Ludo’.

Maxim Cournoyer wrote on 26 Nov 2021 05:00

control message for bug #52051

Recipients:(address . control@debbugs.gnu.org)

Message-ID:871r33a7mb.fsf@gmail.com

severity 52051 important

quit

Josselin Poiret wrote on 26 Nov 2021 10:35

Re: bug#52051: [core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)

Recipients:(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)

Message-ID:87wnkvp8cx.fsf@jpoiret.xyz

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (8 lines)

> Hello,

> I've found a workaround: restarting elogind via SSH resolved the issue.

> I guess itt may be a race between elogind and dbus-system (elogind gets

> started before dbus-system is fully up, and the communication with the

> session bus is somehow crippled from there?).

Does this happen with other login managers on your system, like LightDM?

The thing is that elogind depends on dbus-system, so I'm not sure there

should be a race there.

I noticed though that none of our desktop managers's shepherd services

require elogind, and in the case of SLiM not even dbus-system. Maybe we

should add it there, since we want Shepherd to handle launching elogind,

and avoid dbus launching one by itself if the login1 service is used or

even by the PAM elogind module. Can you try adding that to

slim-shepherd-service?

Also, maybe patching the dbus service to Exec=false instead would be a

good safeguard against dbus launching elogind itself.

Tangentially: SLiM is very old and unmaintained, so it may one day

simply break with newer versions of systemd/elogind.

Best,

Josselin Poiret

Maxim Cournoyer wrote on 7 Dec 2021 22:23

Recipients:(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 52051@debbugs.gnu.org)

Message-ID:87mtlcun1h.fsf@gmail.com

Hello!

Josselin Poiret <dev@jpoiret.xyz> writes:

Toggle quote (14 lines)> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> Hello,
>>
>> I've found a workaround: restarting elogind via SSH resolved the issue.
>>
>> I guess itt may be a race between elogind and dbus-system (elogind gets
>> started before dbus-system is fully up, and the communication with the
>> session bus is somehow crippled from there?).
>
> Does this happen with other login managers on your system, like LightDM?
> The thing is that elogind depends on dbus-system, so I'm not sure there
> should be a race there.

I've yet to try this, but it *doesn't* happen on another machine using

an almost-identical configuration (slim + ratpoison).

Toggle quote (7 lines)

> I noticed though that none of our desktop managers's shepherd services

> require elogind, and in the case of SLiM not even dbus-system. Maybe we

> should add it there, since we want Shepherd to handle launching elogind,

> and avoid dbus launching one by itself if the login1 service is used or

> even by the PAM elogind module. Can you try adding that to

> slim-shepherd-service?

I've tried adding elogind in the requirements of the slim service type

as well as the upower-daemon service type, but it didn't help.

Based on my observations (/var/log/messages), it seems that the dbus

session bus is not yet ready when elogin starts, and thus the various

dbus services are not yet published/registered when other dependents

need them.

I noticed that dbus system session times out on every boot on that

machine. I also notice that when the NTPD daemon starts, it rewinds

time by about 25 s on every boot

Toggle snippet (21 lines)[...]
Dec  7 15:55:02 localhost shepherd[1]: Service dbus-system has been started.
[...]
Dec  7 15:55:10 localhost shepherd[1]: Service ntpd has been started.
[...]
Dec  7 15:54:46 localhost ntpd[341]: ntpd 4.2.8p15@1.3728-o Thu Jan  1 12:00:01 AM UTC 1970 (1): Starting
[...]
Dec  7 15:55:14 localhost shepherd[1]: Service elogind has been started.
[...]
Dec  7 15:55:21 localhost shepherd[1]: Service upower-daemon has been started.
[...]
Dec  7 15:54:51 localhost dbus-daemon[335]: [system] Activating service name='org.freedesktop.login1' requested by ':1.1' (uid=0 pid=345 comm="/gnu/store/g1qlpzcfnk2r6186al2hfqjmq9yl7qkk-upower") (using servicehelper)
[...]
Dec  7 15:55:02 localhost elogind[352]: elogind is already running as PID 343
[...]
Dec  7 15:55:16 localhost dbus-daemon[335]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
[...]
Dec  7 15:55:24 localhost dbus-daemon[335]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 36684ms)

Toggle quote (3 lines)

> Also, maybe patching the dbus service to Exec=false instead would be a

> good safeguard against dbus launching elogind itself.

Perhaps a good idea to try also!

Attached is the /var/log/messages of the problematic boot, nearly

unedited.

Attachment: messages.boot

Thanks,

Maxim

Maxim Cournoyer wrote on 7 Dec 2021 23:15

Recipients:(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 52051@debbugs.gnu.org)

Message-ID:8735n46oz2.fsf@gmail.com

Hi again,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

[...]

Toggle quote (4 lines)

> I noticed that dbus system session times out on every boot on that

> machine. I also notice that when the NTPD daemon starts, it rewinds

> time by about 25 s on every boot

Not NTP related; the same occurs when removing the ntp-service-type from

%desktop-services.

Maxim

Maxim Cournoyer wrote on 8 Dec 2021 03:12

Recipients:(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 52051@debbugs.gnu.org)

Message-ID:87ilvzvo7o.fsf@gmail.com

Hello,

Josselin Poiret <dev@jpoiret.xyz> writes:

Toggle quote (14 lines)> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> Hello,
>>
>> I've found a workaround: restarting elogind via SSH resolved the issue.
>>
>> I guess itt may be a race between elogind and dbus-system (elogind gets
>> started before dbus-system is fully up, and the communication with the
>> session bus is somehow crippled from there?).
>
> Does this happen with other login managers on your system, like LightDM?
> The thing is that elogind depends on dbus-system, so I'm not sure there
> should be a race there.

I have the same problem using GDM on that machine :-(.

I'll try poking the dbus session manually using dbus-send, or the Scheme

API for it under (guix build jami-service).

Thanks!

Maxim

Ludovic Courtès wrote on 8 Dec 2021 23:18

Recipients:(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)

Message-ID:87a6hahha0.fsf@gnu.org

Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (12 lines)> Dec  7 15:55:02 localhost shepherd[1]: Service dbus-system has been started.
> [...]
> Dec  7 15:55:10 localhost shepherd[1]: Service ntpd has been started.
> [...]
> Dec  7 15:54:46 localhost ntpd[341]: ntpd 4.2.8p15@1.3728-o Thu Jan  1 12:00:01 AM UTC 1970 (1): Starting
> [...]
> Dec  7 15:55:14 localhost shepherd[1]: Service elogind has been started.
> [...]
> Dec  7 15:55:21 localhost shepherd[1]: Service upower-daemon has been started.
> [...]
> Dec  7 15:54:51 localhost dbus-daemon[335]: [system] Activating service name='org.freedesktop.login1' requested by ':1.1' (uid=0 pid=345 comm="/gnu/store/g1qlpzcfnk2r6186al2hfqjmq9yl7qkk-upower") (using servicehelper)

The key thing here is that dbus-daemon things elogind is not running and

thus considers it has to start it, which is bound to fail.

You can display the list of services known to the DBus system bus with:

dbus-send --system --print-reply --dest=org.freedesktop.DBus \

/org/freedesktop/DBus org.freedesktop.DBus.ListNames

I tried in a VM that had booted fine and it shows

“org.freedesktop.login1” as expected. Could you check what that gives

for you when the bug above shows up?

I wonder if it’s possible for there to be a race, like elogind says

“hi!” to dbus-daemon but dbus-daemon is still sleepy and doesn’t notice.

Seems hard to believe though.

If it’s reproducible, could you instrument ‘elogind-service-type’ such

that elogind is started with “strace -f -o /elogind.log -s 700”?

HTH,

Ludo’.

Maxim Cournoyer wrote on 9 Dec 2021 15:18

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)

Message-ID:87bl1plv4q.fsf@gmail.com

Hi Ludovic!

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (27 lines)> Hi,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> Dec  7 15:55:02 localhost shepherd[1]: Service dbus-system has been started.
>> [...]
>> Dec  7 15:55:10 localhost shepherd[1]: Service ntpd has been started.
>> [...]
>> Dec  7 15:54:46 localhost ntpd[341]: ntpd 4.2.8p15@1.3728-o Thu Jan  1 12:00:01 AM UTC 1970 (1): Starting
>> [...]
>> Dec  7 15:55:14 localhost shepherd[1]: Service elogind has been started.
>> [...]
>> Dec  7 15:55:21 localhost shepherd[1]: Service upower-daemon has been started.
>> [...]
>> Dec 7 15:54:51 localhost dbus-daemon[335]: [system] Activating
>> service name='org.freedesktop.login1' requested by ':1.1' (uid=0
>> pid=345 comm="/gnu/store/g1qlpzcfnk2r6186al2hfqjmq9yl7qkk-upower")
>> (using servicehelper)
>
> The key thing here is that dbus-daemon things elogind is not running and
> thus considers it has to start it, which is bound to fail.
>
> You can display the list of services known to the DBus system bus with:
>
>   dbus-send --system --print-reply --dest=org.freedesktop.DBus \
>     /org/freedesktop/DBus org.freedesktop.DBus.ListNames

After launching elogind (added a wait for it in its start slot), the

first time it prints just:

Dec 9 09:08:24 localhost shepherd[1]: ;;; (available-dbus-services

("org.freedesktop.DBus" "fi.w1.wpa_supplicant1"))

And on the last try (25 seconds of trying) it prints:

Dec 9 09:09:27 localhost shepherd[1]: ;;; (available-dbus-services ("org.freedesktop.DBus" "org.freedesktop.PolicyKit1" "org.freedesktop.ModemManager1" "fi.w1.wpa_supplicant1" "org.freedesktop.NetworkManager"))

login1 never shows up.

So elogind is essentially hung up when it is started that early during

boot.

Attached is the requested strace output attached to elogind.

I've tried adding a synchronization loop that tests for dbus-daemon

socket readiness (able to make a connection) to its shepherd service but

it didn't help.

If elogind is restarted post boot login1 shows up and all is happy.

Attachment: elogind.log

Thanks!

Maxim

Ludovic Courtès wrote on 9 Dec 2021 23:15

Recipients:(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)

Message-ID:87v8zxctn5.fsf@gnu.org

Hello!

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (2 lines)

> 374 connect(11, {sa_family=AF_UNIX, sun_path="/var/run/dbus/system_bus_socket"}, 34) = 0

[...]

Toggle quote (27 lines)> 374   epoll_wait(5, [{events=EPOLLIN|EPOLLOUT|EPOLLHUP, data={u32=24802800, u64=24802800}}], 20, -1) = 1
> 374   sendmsg(11, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0", iov_len=128}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe)
> 374   gettid()                          = 374
> 374   epoll_ctl(5, EPOLL_CTL_MOD, 11, {events=0, data={u32=24802800, u64=24802800}}) = 0
> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
> 374   gettid()                          = 374
> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
> 374   gettid()                          = 374
> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
> 374   epoll_ctl(5, EPOLL_CTL_DEL, 11, NULL) = 0
> 374   close(11)                         = 0
> 374   gettid()                          = 374
> 374   epoll_wait(5,  <unfinished ...>
> 391   <... close resumed>)              = 0
> 391   madvise(0x7fd6c83dc000, 8368128, MADV_DONTNEED) = 0
> 391   exit(0)                           = ?
> 391   +++ exited with 0 +++
> 374   <... epoll_wait resumed>[{events=EPOLLERR, data={u32=24768000, u64=24768000}}], 17, -1) = 1
> 374   lseek(7, 0, SEEK_SET)             = 0
> 374   read(7, "tty7\n", 63)             = 5

As you pointed out on IRC, the initially ‘Hello’ method call above leads

to EPIPE, and we can see that elogind eventually closes its socket to

dbus-daemon *but* keeps doing its thing.

Some interesting things to note…

First, to my surprise, elogind does not use the client library of the

‘dbus’ package:

Toggle snippet (5 lines)

$ guix gc --references $(./pre-inst-env guix build elogind)|grep dbus

$ echo $?

(This is already the case in ‘master’ with v243.7.) Instead, it has its

own implementation of the DBus protocol, in C, from systemd—we can’t

have enough sources of bugs and vulnerabilities.

Anyway, the “Hello” message is sent to the system bus asynchronously in

‘sd-bus.c’:

Toggle snippet (23 lines)

static int bus_send_hello(sd_bus *bus) {

_cleanup_(sd_bus_message_unrefp) sd_bus_message *m = NULL;

int r;

assert(bus);

if (!bus->bus_client)

return 0;

r = sd_bus_message_new_method_call(

bus,

&m,

"org.freedesktop.DBus",

"/org/freedesktop/DBus",

"org.freedesktop.DBus",

"Hello");

if (r < 0)

return r;

return sd_bus_call_async(bus, NULL, m, hello_callback, NULL, 0);

}

A callback is called when a reply is received or an error arises:

Toggle snippet (14 lines)

static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error *error) {

[...]

fail:

/* When Hello() failed, let's propagate this in two ways: first we return the error immediately here,

* which is the propagated up towards the event loop. Let's also invalidate the connection, so that

* if the user then calls back into us again we won't wait any longer. */

bus_set_state(bus, BUS_CLOSING);

return r;

}

It’s not clear from that whether the authors intended for the thing to

keep going in case of failure. In our case it’s not helpful.

But why does dbus-daemon drop the connection in the first place?

To know that, we could change ‘dbus-root-service-type’ to run

dbus-daemon from a ‘--enable-verbose-mode’ build, and with the

‘DBUS_VERBOSE’ environment set to 1.

Looking at ‘dbus-server-socket.c’ it would seem that the only sensible

way this can happen is if dbus-daemon doesn’t yet have a

‘new_connection_function’ set at the time it accepts the incoming

connection:

Toggle snippet (28 lines)

static dbus_bool_t

handle_new_client_fd_and_unlock (DBusServer *server,

DBusSocket client_fd)

{

[...]

/* See if someone wants to handle this new connection, self-referencing

* for paranoia.

new_connection_function = server->new_connection_function;

new_connection_data = server->new_connection_data;

_dbus_server_ref_unlocked (server);

SERVER_UNLOCK (server);

if (new_connection_function)

{

(* new_connection_function) (server, connection,

new_connection_data);

}

dbus_server_unref (server);

/* If no one grabbed a reference, the connection will die. */

_dbus_connection_close_if_only_one_ref (connection);

dbus_connection_unref (connection);

We know that elogind is started after dbus-daemon has written its PID

file (there’s a Shepherd service dependency). Is there a possibility

that dbus-daemon writes its PID file before it has set

‘new_connection_function’?

It would seem that ‘bus_context_new’ writes the PID file after it has

called ‘dbus_server_set_new_connection_function’ via ‘setup_server’ via

‘process_config_first_time_only’.

So not sure what happens.

That’s it for today. Thoughts?

Ludo’.

Maxim Cournoyer wrote on 10 Dec 2021 06:09

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)

Message-ID:875yrxyrjr.fsf@gmail.com

Hello Ludovic!

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (44 lines)> Hello!
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> 374   connect(11, {sa_family=AF_UNIX, sun_path="/var/run/dbus/system_bus_socket"}, 34) = 0
>
> [...]
>
>> 374   epoll_wait(5, [{events=EPOLLIN|EPOLLOUT|EPOLLHUP, data={u32=24802800, u64=24802800}}], 20, -1) = 1
>> 374 sendmsg(11, {msg_name=NULL, msg_namelen=0,
>> msg_iov=[{iov_base="l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0",
>> iov_len=128}], msg_iovlen=1, msg_controllen=0, msg_flags=0},
>> MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe)
>> 374   gettid()                          = 374
>> 374   epoll_ctl(5, EPOLL_CTL_MOD, 11, {events=0, data={u32=24802800, u64=24802800}}) = 0
>> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
>> 374   gettid()                          = 374
>> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
>> 374   gettid()                          = 374
>> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
>> 374   epoll_ctl(5, EPOLL_CTL_DEL, 11, NULL) = 0
>> 374   close(11)                         = 0
>> 374   gettid()                          = 374
>> 374   epoll_wait(5,  <unfinished ...>
>> 391   <... close resumed>)              = 0
>> 391   madvise(0x7fd6c83dc000, 8368128, MADV_DONTNEED) = 0
>> 391   exit(0)                           = ?
>> 391   +++ exited with 0 +++
>> 374   <... epoll_wait resumed>[{events=EPOLLERR, data={u32=24768000, u64=24768000}}], 17, -1) = 1
>> 374   lseek(7, 0, SEEK_SET)             = 0
>> 374   read(7, "tty7\n", 63)             = 5
>
> As you pointed out on IRC, the initially ‘Hello’ method call above leads
> to EPIPE, and we can see that elogind eventually closes its socket to
> dbus-daemon *but* keeps doing its thing.
>
> Some interesting things to note…

Indeed, very interesting :-). Thanks for diving in!

Toggle quote (12 lines)> First, to my surprise, elogind does not use the client library of the
> ‘dbus’ package:
>
> $ guix gc --references $(./pre-inst-env guix build elogind)|grep dbus
> $ echo $?
> 1
>
>
> (This is already the case in ‘master’ with v243.7.)  Instead, it has its
> own implementation of the DBus protocol, in C, from systemd—we can’t
> have enough sources of bugs and vulnerabilities.

Oh! Marius found this interesting issue [0] that they shared on IRC

today; I wonder if it may be related. sd-bus apparently pipeline things

aggressively, which is not always handled well by other D-bus

participants.

[0] https://github.com/systemd/systemd/issues/16610

Toggle quote (45 lines)> Anyway, the “Hello” message is sent to the system bus asynchronously in
> ‘sd-bus.c’:
>
> static int bus_send_hello(sd_bus *bus) {
>         _cleanup_(sd_bus_message_unrefp) sd_bus_message *m = NULL;
>         int r;
>
>         assert(bus);
>
>         if (!bus->bus_client)
>                 return 0;
>
>         r = sd_bus_message_new_method_call(
>                         bus,
>                         &m,
>                         "org.freedesktop.DBus",
>                         "/org/freedesktop/DBus",
>                         "org.freedesktop.DBus",
>                         "Hello");
>         if (r < 0)
>                 return r;
>
>         return sd_bus_call_async(bus, NULL, m, hello_callback, NULL, 0);
> }
>
>
> A callback is called when a reply is received or an error arises:
>
> static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error *error) {
>
> [...]
>
> fail:
>         /* When Hello() failed, let's propagate this in two ways: first we return the error immediately here,
>          * which is the propagated up towards the event loop. Let's also invalidate the connection, so that
>          * if the user then calls back into us again we won't wait any longer. */
>
>         bus_set_state(bus, BUS_CLOSING);
>         return r;
> }
>
>
> It’s not clear from that whether the authors intended for the thing to
> keep going in case of failure.  In our case it’s not helpful.

If we picture this in the systemd use case, I believe sd-bus must be
used as *the* d-bus daemon, perhaps?  So it should never die, and expect
users to call back into it to retry things?  In our case, it acts as a
D-Bus client, not a server (IIUC), so perhaps its behavior is not tuned
right for our use case.

Toggle quote (11 lines)> But why does dbus-daemon drop the connection in the first place?
>
> To know that, we could change ‘dbus-root-service-type’ to run
> dbus-daemon from a ‘--enable-verbose-mode’ build, and with the
> ‘DBUS_VERBOSE’ environment set to 1.
>
> Looking at ‘dbus-server-socket.c’ it would seem that the only sensible
> way this can happen is if dbus-daemon doesn’t yet have a
> ‘new_connection_function’ set at the time it accepts the incoming
> connection:

Interesting that you mention this; that's what I worked on yesterday
(see attached patches).  I managed to get elogind panicking the kernel
during a guix system reconfigure, which corrupted GRUB, so had to chroot
and reconfigure from there [1].  Not sure what happened, but it seems
that killing and restarting elogind is susceptible to cause hard locks.

[1]  https://lists.gnu.org/archive/html/guix-devel/2018-06/msg00202.html

From 690d96cba581f253761e6c3263262fbc899aa2a2 Mon Sep 17 00:00:00 2001

* gnu/packages/glib.scm (dbus-verbose): New variable.

---

gnu/packages/glib.scm | 15 +++++++++++++++

1 file changed, 15 insertions(+)

Toggle diff (28 lines)diff --git a/gnu/packages/glib.scm b/gnu/packages/glib.scm
index 398d3943e5..2eead335f5 100644
--- a/gnu/packages/glib.scm
+++ b/gnu/packages/glib.scm
@@ -176,6 +176,21 @@ (define dbus
 shared NFS home directories.")
     (license license:gpl2+)))                     ; or Academic Free License 2.1
 
+;;; The reason this is not enabled in the regular dbus package is because it
+;;; impacts the performance of D-Bus (including its library) as a whole, even
+;;; when the DBUS_VERBOSE environment variable is not set.
+(define-public dbus-verbose
+  (package/inherit dbus
+    (name "dbus-verbose")
+    (arguments (substitute-keyword-arguments (package-arguments dbus)
+                 ((#:configure-flags flags '())
+                  `(cons "--enable-verbose-mode" ,flags))))
+    (synopsis "D-Bus with verbose mode enabled for debugging")
+    (description "This variant D-Bus package is built with verbose mode, which
+eases debugging of D-Bus services by printing various debug information when
+the @code{DBUS_VERBOSE} environment variable is set to @samp{1}.  For more
+information, refer to the @samp{dbus-daemon(1)} man page.")))
+
 (define glib
   (package
     (name "glib")
-- 
2.34.0

From c614159fbc539c8b1f556fe9ced2d13963b6d6f9 Mon Sep 17 00:00:00 2001
From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Date: Tue, 7 Dec 2021 22:48:39 -0500
Subject: [PATCH 2/5] services: dbus: Add a VERBOSE? configuration option.

* gnu/services/dbus.scm (<dbus-configuration>)[verbose?]: New field.
(dbus-shepherd-service): Use it.
(dbus-service)[verbose?]: Add argument and update doc.
* doc/guix.texi (Desktop Services): Document it.
---
 doc/guix.texi         |  9 +++++++--
 gnu/services/dbus.scm | 20 +++++++++++++++-----
 2 files changed, 22 insertions(+), 7 deletions(-)

Toggle diff (83 lines)diff --git a/doc/guix.texi b/doc/guix.texi
index 58625666c1..4b1297e5de 100644
--- a/doc/guix.texi
+++ b/doc/guix.texi
@@ -20280,9 +20280,14 @@ The actual service definitions included in @code{%desktop-services} and
 provided by @code{(gnu services dbus)} and @code{(gnu services desktop)}
 are described below.
 
-@deffn {Scheme Procedure} dbus-service [#:dbus @var{dbus}] [#:services '()]
+@deffn {Scheme Procedure} dbus-service [#:dbus @var{dbus}] [#:services '()] @
+                                       [#:verbose?]
 Return a service that runs the ``system bus'', using @var{dbus}, with
-support for @var{services}.
+support for @var{services}.  When @var{verbose?} is true, it causes the
+@samp{DBUS_VERBOSE} environment variable to be set to @samp{1}; a
+verbose-enabled D-Bus package such as @code{dbus-verbose} should be
+provided as @var{dbus} in this scenario.  The verbose output is logged
+to @file{/var/log/dbus-daemon.log}.
 
 @uref{https://dbus.freedesktop.org/, D-Bus} is an inter-process communication
 facility.  Its system bus is used to allow system services to communicate
diff --git a/gnu/services/dbus.scm b/gnu/services/dbus.scm
index 85a4c3ec9a..da167f2e36 100644
--- a/gnu/services/dbus.scm
+++ b/gnu/services/dbus.scm
@@ -53,7 +53,9 @@ (define-record-type* <dbus-configuration>
   (dbus      dbus-configuration-dbus              ;file-like
              (default dbus))
   (services  dbus-configuration-services          ;list of <package>
-             (default '())))
+             (default '()))
+  (verbose?  dbus-configuration-verbose?          ;boolean
+             (default #f)))
 
 (define (system-service-directory services)
   "Return the system service directory, containing @code{.service} files for
@@ -187,7 +189,7 @@ (define (dbus-activation config)
 
 (define dbus-shepherd-service
   (match-lambda
-    (($ <dbus-configuration> dbus)
+    (($ <dbus-configuration> dbus _ verbose?)
      (list (shepherd-service
             (documentation "Run the D-Bus system daemon.")
             (provision '(dbus-system))
@@ -195,6 +197,10 @@ (define dbus-shepherd-service
             (start #~(make-forkexec-constructor
                       (list (string-append #$dbus "/bin/dbus-daemon")
                             "--nofork" "--system" "--syslog-only")
+                      #$@(if verbose?
+                             '(#:environment-variables '("DBUS_VERBOSE=1")
+                               #:log-file "/var/log/dbus-daemon.log")
+                             '())
                       #:pid-file "/var/run/dbus/pid"))
             (stop #~(make-kill-destructor)))))))
 
@@ -230,9 +236,12 @@ (define dbus-root-service-type
 bus.  It allows programs and daemons to communicate and is also responsible
 for spawning (@dfn{activating}) D-Bus services on demand.")))
 
-(define* (dbus-service #:key (dbus dbus) (services '()))
+(define* (dbus-service #:key (dbus dbus) (services '()) verbose?)
   "Return a service that runs the \"system bus\", using @var{dbus}, with
-support for @var{services}.
+support for @var{services}.  When @var{verbose?} is true, it causes the
+@samp{DBUS_VERBOSE} environment variable to be set to @samp{1}; a
+verbose-enabled D-Bus package such as @code{dbus-verbose} should be provided
+as @var{dbus} in this scenario.
 
 @uref{http://dbus.freedesktop.org/, D-Bus} is an inter-process communication
 facility.  Its system bus is used to allow system services to communicate and
@@ -244,7 +253,8 @@ (define* (dbus-service #:key (dbus dbus) (services '()))
 @var{services} must be equal to @code{(list avahi)}."
   (service dbus-root-service-type
            (dbus-configuration (dbus dbus)
-                               (services services))))
+                               (services services)
+                               (verbose? verbose?))))
 
 (define (wrapped-dbus-service service program variables)
   "Return a wrapper for @var{service}, a package containing a D-Bus service,
-- 
2.34.0

Attachment: 0003-build-Move-D-Bus-parser-to-its-own-build-module.patch

Attachment: 0004-services-telephony-Factorize-bus-polling-logic.patch

Attachment: 0005-services-dbus-Add-synchronization-primitives-to-dbus.patch

Toggle quote (32 lines)> static dbus_bool_t
> handle_new_client_fd_and_unlock (DBusServer *server,
>                                  DBusSocket  client_fd)
> {
>
> [...]
>
>   /* See if someone wants to handle this new connection, self-referencing
>    * for paranoia.
>    */
>   new_connection_function = server->new_connection_function;
>   new_connection_data = server->new_connection_data;
>
>   _dbus_server_ref_unlocked (server);
>   SERVER_UNLOCK (server);
>
>   if (new_connection_function)
>     {
>       (* new_connection_function) (server, connection,
>                                    new_connection_data);
>     }
>   dbus_server_unref (server);
>
>   /* If no one grabbed a reference, the connection will die. */
>   _dbus_connection_close_if_only_one_ref (connection);
>   dbus_connection_unref (connection);
>
> We know that elogind is started after dbus-daemon has written its PID
> file (there’s a Shepherd service dependency).  Is there a possibility
> that dbus-daemon writes its PID file before it has set
> ‘new_connection_function’?

Are PID files conventionally agreed to be synchronization primitives?
In the case of systemd, I feel this may be more of an afterthought,
given the system is designed to rely on socket-based activation.  Still
just guessing though, and your analysis above seems to suggest
otherwise.

Toggle quote (8 lines)

> It would seem that ‘bus_context_new’ writes the PID file after it has

> called ‘dbus_server_set_new_connection_function’ via ‘setup_server’ via

> ‘process_config_first_time_only’.

> So not sure what happens.

> That’s it for today. Thoughts?

Thank you!

I'm sorry for my lack of deep analysis here, I've spent most of my
evening attempting to fix my system just to boot ^^'.  I've at least
managed to collect the following verbose D-Bus log (54 MiB uncompressed)
which hopefully can shed some light onto how this failure came to be.

Attachment: dbus-daemon.log.xz

Happy Friday!

Maxim

Ludovic Courtès wrote on 10 Dec 2021 11:17

Recipients:(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)

Message-ID:87k0gcbw79.fsf@gnu.org

Hi Maxim!

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

[...]

Toggle quote (7 lines)

> Oh! Marius found this interesting issue [0] that they shared on IRC

> today; I wonder if it may be related. sd-bus apparently pipeline things

> aggressively, which is not always handled well by other D-bus

> participants.

> [0] https://github.com/systemd/systemd/issues/16610

Interesting.

Toggle quote (3 lines)

>> Anyway, the “Hello” message is sent to the system bus asynchronously in

>> ‘sd-bus.c’:

[...]

Toggle quote (23 lines)>> static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error *error) {
>>
>> [...]
>>
>> fail:
>>         /* When Hello() failed, let's propagate this in two ways: first we return the error immediately here,
>>          * which is the propagated up towards the event loop. Let's also invalidate the connection, so that
>>          * if the user then calls back into us again we won't wait any longer. */
>>
>>         bus_set_state(bus, BUS_CLOSING);
>>         return r;
>> }
>>
>>
>> It’s not clear from that whether the authors intended for the thing to
>> keep going in case of failure.  In our case it’s not helpful.
>
> If we picture this in the systemd use case, I believe sd-bus must be
> used as *the* d-bus daemon, perhaps?  So it should never die, and expect
> users to call back into it to retry things?  In our case, it acts as a
> D-Bus client, not a server (IIUC), so perhaps its behavior is not tuned
> right for our use case.

I think systemd-logind is a separate process, just like elogind, that

talks over D-Bus; I don’t think there’s any difference here. It just

seems that this corner case, where we don’t get a reply for ‘Hello’, is

not correctly handled.

Toggle quote (6 lines)

> Interesting that you mention this; that's what I worked on yesterday

> (see attached patches). I managed to get elogind panicking the kernel

> during a guix system reconfigure, which corrupted GRUB, so had to chroot

> and reconfigure from there [1]. Not sure what happened, but it seems

> that killing and restarting elogind is susceptible to cause hard locks.

Ouch! This is weird. Note that “everything” depends on elogind, as can

be seen with ‘guix system shepherd-graph’. So “herd stop elogind” is

not something you should try at home.

Incidentally, can you reproduce the bug in a VM? That would be much

nicer.

Toggle quote (7 lines)

>> We know that elogind is started after dbus-daemon has written its PID

>> file (there’s a Shepherd service dependency). Is there a possibility

>> that dbus-daemon writes its PID file before it has set

>> ‘new_connection_function’?

> Are PID files conventionally agreed to be synchronization primitives?

Yes. Daemons are supposed to write their PID file once they’re ready to

do their job.

Toggle quote (5 lines)

> I'm sorry for my lack of deep analysis here, I've spent most of my

> evening attempting to fix my system just to boot ^^'. I've at least

> managed to collect the following verbose D-Bus log (54 MiB uncompressed)

> which hopefully can shed some light onto how this failure came to be.

Excerpt:

Toggle snippet (24 lines)366: 0x7f28f396e740: 1639108961.938358 [dbus-sysdeps-util-unix.c(237):_dbus_write_pid_to_file_and_pipe] writing pid file /var/run/dbus/pid
0x7f28f396e740: 1639108961.938438 [dbus-sysdeps-util-unix.c(291):_dbus_write_pid_to_file_and_pipe] No pid pipe to write to
0x7f28f396e740: 1639108961.938474 [dbus-userdb.c(177):_dbus_user_database_lookup] Using cache for UID 986 information
0x7f28f396e740: 1639108961.938665 [dbus-sysdeps-unix.c(3514):_dbus_socketpair] full-duplex pipe 6 <-> 7
0x7f28f396e740: 1639108961.938700 [main.c(719):main] We are on D-Bus...
0x7f28f396e740: 1639108961.938725 [dbus-mainloop.c(884):_dbus_loop_run] Running main loop, depth 0 -> 1
0x7f28f396e740: 1639108962.566557 [dbus-server-socket.c(182):socket_handle_watch] Handling client connection, flags 0x1
0x7f28f396e740: 1639108962.566623 [dbus-sysdeps-unix.c(2344):_dbus_accept] client fd 8 accepted
0x7f28f396e740: 1639108962.566665 [dbus-server-socket.c(94):handle_new_client_fd_and_unlock] Creating new client connection with fd 8
0x7f28f396e740: 1639108962.566779 [dbus-connection.c(1360):_dbus_connection_new_for_transport] LOCK
0x7f28f396e740: 1639108962.566824 [dbus-transport-socket.c(180):check_read_watch] fd = 8
0x7f28f396e740: 1639108962.566869 [dbus-transport-socket.c(227):check_read_watch]   setting read watch enabled = 1

[...]

0x7f28f396e740: 1639108962.568765 [dbus-transport-socket.c(974):socket_handle_watch] handling read watch 0x23c2b00 flags = 9
0x7f28f396e740: 1639108962.568807 [dbus-transport-socket.c(348):exchange_credentials] exchange_credentials: do_reading = 1, do_writing = 0
0x7f28f396e740: 1639108962.568862 [dbus-transport-socket.c(383):exchange_credentials] Failed to read credentials Failed to read credentials byte (zero-length read)
0x7f28f396e740: 1639108962.568904 [dbus-transport.c(510):_dbus_transport_disconnect] start
0x7f28f396e740: 1639108962.568945 [dbus-transport-socket.c(1041):socket_disconnect] 
0x7f28f396e740: 1639108962.568986 [dbus-transport-socket.c(76):free_watches] start
0x7f28f396e740: 1639108962.569029 [dbus-watch.c(423):_dbus_watch_list_remove_watch] Removing watch on fd 8

We see this sequence:

1. Writes PID file.

2. Accepts incoming connection.

3. “Failed to read credentials”.

4. Close connection.

So we need to understand #3. Perhaps going back to strace logs could

help this time…

Thanks,

Ludo’.

Michael Rohleder wrote on 15 Dec 2021 20:27

Recipients:(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 52051@debbugs.gnu.org)

Message-ID:87k0g5hdne.fsf@rohleder.de

Hi Maxim,

I currently have this (very annoying) issue on _one_ of my machines (two
others work with nearly the same config).
I can't login at all not via console nor ssh or sddm.
I spend some time to reproduce it in a vm, but no success so far.

These are the relevant messages from syslog:
Dec 15 18:15:52 micha dbus-daemon[470]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
Dec 15 18:16:47 micha dbus-daemon[470]: [system] Activating service name='org.freedesktop.login1' requested by ':1.8' (uid=0 pid=899 comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
Dec 15 18:16:47 micha elogind[935]: elogind is already running as PID 558
Dec 15 18:17:12 micha dbus-daemon[470]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)

-- 
How much does it cost to entice a dope-smoking UNIX system guru to Dayton?
		-- Brian Boyle, UNIX/WORLD's First Annual Salary Survey

Abhiseck Paira wrote on 16 Dec 2021 04:36

[core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)

Recipients:(address . 52051@debbugs.gnu.org)

Message-ID:874k795i1m.fsf@disroot.org

Hello,

I am getting the same errors others have mentioned. I can't login via
tty and GDM wouldn't launch either. This happened after I pulled this
commit 73eb941b8b3793aec6110a4ae28bdbfc3ab4f6c5 with guix pull and ran
system reconfigure.

-- 
Abhiseck Paira
E34E 825B 979E EB9F 8505  F80E E93D 353B 7740 0709
https://abhiseck.neocities.org

ison wrote on 16 Dec 2021 14:09

Recipients:(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)

Message-ID:87sfusofjg.fsf@airmail.cc

I may have found a clue about what's happening.

After updating I ran into this same issue (I login from TTY), and I see

the same "elogind is already running" error.

However, if I boot into an older generation (before the

core-updates-frozen merge), and then again reboot back into the newest

generation (after core-updates-frozen merge) then I am able to login.

I was able to reproduce those exact steps too. Reboot to new generation,

can't login again. Reboot to old generation, then reboot back to new

generation and I can log in.

Note that even when I'm able to login on the new generation I still see

the "elogind is already running" error printed, but only before the

Also I'm not sure if this matters, but another difference I notice

during the attempts when I can successfully login on the new generation

is that I see Herd restart the `nscd` service.

Perhaps someone else experiencing this issue should try to restart nscd

(via SSH) and see if that works for you.

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (29 lines)> Hello,
>
> I'm not 100% sure this is the cause, but these are the last messages I
> have before I rebooted:
>
> Nov 23 01:09:14 localhost dbus-daemon[383]: [system] Activating service name='org.freedesktop.login1' requested by ':1.17' (uid=0 pid=370
>     comm="/gnu/store/ximad0zvg12r4x0x80mvym8hzg0n33jl-shadow") (using servicehelper)
> Nov 23 01:09:14 localhost elogind[1189]: elogind is already running as PID 390
> Nov 23 01:09:20 localhost shepherd[1]: Respawning term-tty1. 
> Nov 23 01:09:20 localhost shepherd[1]: Service host-name has been started. 
> Nov 23 01:09:20 localhost shepherd[1]: Service term-tty1 has been started.
> Nov 23 01:09:39 localhost dbus-daemon[383]: [system] Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)
>
> I don't remember if I saw the slim login screen; but in any case I
> couldn't successfully login even via a ptty.
>
> It may have to do with polkit.
>
> To be investigated.
>
> This happened on a system *not* using gdm (it uses slim) and with
> ratpoison as the WM, on commit f42bc604547d9ee8e35fcd66d5db7786954cfac3
> of the core-updates-frozen branch.
>
> To be investigated.
>
> Thanks,
>
> Maxim

Guillaume Le Vaillant wrote on 16 Dec 2021 18:59

Re: bug#52051: [core-updates-frozen] cannot login

Recipients:(address . 52051@debbugs.gnu.org)

Message-ID:87mtl0774j.fsf@kitej

Hi,

I just got this login error when updating an old machine with a HDD as

storage. On some other faster machines using SSD or NVME storage this

issue never happened, so I thought the error might be triggered by

slow IO.

Do some of you also see the issue on fast machines/storage?

-----BEGIN PGP SIGNATURE-----

iIUEAREKAC0WIQTLxZxm7Ce5cXlAaz5r6CCK3yH+PwUCYbuBLA8cZ2x2QHBvc3Rl

by5uZXQACgkQa+ggit8h/j+S/AD/byi2hXPYz9OtuuHugDB/bhicLhXoQ5n7EHCV

MyHum/UBAJehTqeY9LE4+ddy8qgY+Fz19InvhLTL9o/UHrGW9J5F

=4YEE

-----END PGP SIGNATURE-----

Josselin Poiret wrote on 16 Dec 2021 19:24

Re: bug#52051: [core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)

Recipients:

Message-ID:8735msz9v8.fsf@jpoiret.xyz

Hello Maxim and Ludovic,

Seeing all the new activity on this bug, I decided to take a closer

look.

It doesn't seem to me that the credential byte read is the problem, as

can be seen by the elogind strace: it sends

Toggle snippet (3 lines)

374 sendmsg(11, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0AUTH EXTERNAL\r\nDATA\r\n", iov_len=22}, {iov_base="NEGOTIATE_UNIX_FD\r\n", iov_len=19}, {iov_base="BEGIN\r\n", iov_len=7}], msg_iovlen=3, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 48

to which dbus answers with

Toggle snippet (3 lines)

374 recvmsg(11, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="DATA\r\nOK ea94265b9365f7158f1d81ee61b20dab\r\nAGREE_UNIX_FD\r\n", iov_len=256}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 58

so dbus doesn't seem to choke on the initial authentication (where the

credentials read happens) and it's clear that the byte 0 is properly

sent at the beginning of the handshake procedure.

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (5 lines)

> I'm sorry for my lack of deep analysis here, I've spent most of my

> evening attempting to fix my system just to boot ^^'. I've at least

> managed to collect the following verbose D-Bus log (54 MiB uncompressed)

> which hopefully can shed some light onto how this failure came to be.

Would it be possible to get both the elogind strace and the dbus log for
the same session?  We could then simply `grep` the authentication cookie
sent back by dbus to elogind to track the corresponding fd in the dbus
log.

Best,
Josselin Poiret

Michael Rohleder wrote on 16 Dec 2021 19:53

Recipients:(address . 52051@debbugs.gnu.org)

Message-ID:87v8zo4c0l.fsf_-_@rohleder.de

Hi Guillaume,

Guillaume Le Vaillant <glv@posteo.net> writes:

Toggle quote (7 lines)

> I just got this login error when updating an old machine with a HDD as

> storage. On some other faster machines using SSD or NVME storage this

> issue never happened, so I thought the error might be triggered by

> slow IO.

> Do some of you also see the issue on fast machines/storage?

I can confirm this. For me, this happens on a machine with a HDD, but

not on another with a SSD nor my laptop with NVME.

(It also "feels" like, it's about timing or so)

Q: How does a Unix guru have sex?

A: unzip;strip;touch;finger;mount;fsck;more;yes;umount;sleep

Caleb Herbert wrote on 20 Dec 2021 01:40

[core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)

Recipients:(address . 52051@debbugs.gnu.org)

Message-ID:3a3f7cfa-9525-17a6-d53b-82d23dc9b70b@bluehome.net

I am using an X200 Tablet with an HDD. This happened on my system. I

am now on Fedora with Guix until this issue is fixed.

Attachment: OpenPGP_0x42E4FDF80F03CA7C.asc

Attachment: OpenPGP_signature

Timothy Sample wrote on 21 Dec 2021 10:31

Recipients:(name . Josselin Poiret)(address . dev@jpoiret.xyz)

Message-ID:87czlq1ez4.fsf@ngyro.com

Hello,

I dug out an old HDD, put it in my X200, and was able to reproduce this.

Eventually I was able to log in to the thing by fiddling with the

“auth_timeout” parameter in the D-Bus config.

Josselin Poiret <dev@jpoiret.xyz> writes:

Toggle quote (7 lines)

> [...]

> Would it be possible to get both the elogind strace and the dbus log for

> the same session? We could then simply `grep` the authentication cookie

> sent back by dbus to elogind to track the corresponding fd in the dbus

> log.

This is exactly what I did (thanks for posting the patches, Maxim).

While looking around in the D-Bus log, I noticed it complaining about

authentication expiry around where it mentions the PID of elogind. This

is in Maxim’s log, too. It counts down an authentication expiry with

messages like “Connection 0x23b9f50 authentication expires in ...”.

That brought me to the D-Bus NEWS file, which mentions adjusting

“auth_timeout” to fix a boot regression for users with older hardware

[1]. The NEWS file mentions a bug report [2] that discusses how this

might be related to hard disk speed.

[1] https://gitlab.freedesktop.org/dbus/dbus/-/blob/master/NEWS#L2487

[2] https://bugs.freedesktop.org/show_bug.cgi?id=86431#c3

After reading that I tried with the timeout bumped up to a minute, and

the X200 booted into GDM just fine, twice in row, and then failed again

when I removed the change. (I should add that it still printed the

“elogind is already running” messages, but it worked anyway.)

What’s weird is that this bug is very old (2014), and the default

timeout was increased at the time from 5s to 30s to solve the bug. So

it’s part of the story, but not the whole story.

-- Tim

Leo Famulari wrote on 21 Dec 2021 17:13

Recipients:(name . Timothy Sample)(address . samplet@ngyro.com)

Message-ID:YcH9L9rNDrcosM20@jasmine.lan

On Tue, Dec 21, 2021 at 04:31:27AM -0500, Timothy Sample wrote:

Toggle quote (8 lines)

> [1] https://gitlab.freedesktop.org/dbus/dbus/-/blob/master/NEWS#L2487

> [2] https://bugs.freedesktop.org/show_bug.cgi?id=86431#c3

> After reading that I tried with the timeout bumped up to a minute, and

> the X200 booted into GDM just fine, twice in row, and then failed again

> when I removed the change. (I should add that it still printed the

> “elogind is already running” messages, but it worked anyway.)

Okay, great. I don't "auth_timeout" in the Guix source tree. Can you say

where you adjusted it?

Timothy Sample wrote on 21 Dec 2021 17:36

Recipients:(name . Leo Famulari)(address . leo@famulari.name)

Message-ID:87fsqlewzd.fsf@ngyro.com

Hi Leo,

Leo Famulari <leo@famulari.name> writes:

Toggle quote (12 lines)> On Tue, Dec 21, 2021 at 04:31:27AM -0500, Timothy Sample wrote:
>> [1] https://gitlab.freedesktop.org/dbus/dbus/-/blob/master/NEWS#L2487
>> [2] https://bugs.freedesktop.org/show_bug.cgi?id=86431#c3
>> 
>> After reading that I tried with the timeout bumped up to a minute, and
>> the X200 booted into GDM just fine, twice in row, and then failed again
>> when I removed the change.  (I should add that it still printed the
>> “elogind is already running” messages, but it worked anyway.)
>
> Okay, great. I don't "auth_timeout" in the Guix source tree. Can you say
> where you adjusted it?

Good question! :) Here’s the patch:

Toggle diff (12 lines)diff --git a/gnu/services/dbus.scm b/gnu/services/dbus.scm
index 85a4c3e..a680ed7 100644
--- a/gnu/services/dbus.scm
+++ b/gnu/services/dbus.scm
@@ -106,6 +106,7 @@ (define-syntax directives
         (define (services->sxml services)
           ;; Return the SXML 'includedir' clauses for DIRS.
           `(busconfig
+            (limit (@ (name "auth_timeout")) "60000")
             (servicehelper "/run/setuid-programs/dbus-daemon-launch-helper")
 
             ;; First, the '.service' files of services subject to activation.

Maxime Devos wrote on 21 Dec 2021 17:52

Recipients:

Message-ID:596551103b98265f46051b7898f5ca38744c7401.camel@telenet.be

Timothy Sample schreef op di 21-12-2021 om 11:36 [-0500]:

Toggle quote (3 lines)

> + (limit (@ (name "auth_timeout")) "60000")

Are there any good reasons for having a timeout at all?

(Except for the local-user denial of service, but local users can do

"guix build -f something-that-allocates-almost-all-memory-and-melts-

the-cpu.scm" anyway ...)

If not, can the timeout be disabled/set to infinity?

A comment like

Toggle quote (5 lines)

> ;; Set timeout to a huge number (16.6 hours), because

> ;; upstream often sets timeouts low for spinning disks,

> ;; slow CPUs, etc.

> (limit [...] "60000")

could be useful (I'm assuming the timeout is in seconds here).

Greetings,

Maxime.

Leo Famulari wrote on 21 Dec 2021 18:32

Recipients:(name . Maxime Devos)(address . maximedevos@telenet.be)

Message-ID:YcIPpSte2D4e8b6R@jasmine.lan

On Tue, Dec 21, 2021 at 04:52:19PM +0000, Maxime Devos wrote:

Toggle quote (7 lines)

> Are there any good reasons for having a timeout at all?

> (Except for the local-user denial of service, but local users can do

> "guix build -f something-that-allocates-almost-all-memory-and-melts-

> the-cpu.scm" anyway ...)

> If not,ï¿½can the timeout be disabled/set to infinity?

Could be. But I think that, based on a several years of reports, the
X200 with an HDD is the slowest machine used with Guix System. On my
X200 with HDD, I have personally experienced similar race-y bugs that
seem to crop up after major upgrades --- I assume that it's a case of
bad luck, where important programs for booting move to distant parts of
the disk and seeking is too slow.

Toggle quote (9 lines)

> A comment like

> > ;; Set timeout to a huge number (16.6 hours), because

> > ;; upstream often sets timeouts low for spinning disks,

> > ;; slow CPUs, etc.

> > (limit [...] "60000")

> could be useful (I'm assuming the timeout is in seconds here).

I suggest we wait until such drastic action is necessary. Otherwise we
might be banging our heads against the wall in a few years, trying to
debug something. Let's not rush to extremes :)

Leo Famulari wrote on 21 Dec 2021 18:37

Recipients:(name . Timothy Sample)(address . samplet@ngyro.com)

Message-ID:YcIQ2ufni8EmmoJl@jasmine.lan

On Tue, Dec 21, 2021 at 11:36:22AM -0500, Timothy Sample wrote:

Toggle quote (14 lines)

> diff --git a/gnu/services/dbus.scm b/gnu/services/dbus.scm

> index 85a4c3e..a680ed7 100644

> --- a/gnu/services/dbus.scm

> +++ b/gnu/services/dbus.scm

> @@ -106,6 +106,7 @@ (define-syntax directives

> (define (services->sxml services)

> ;; Return the SXML 'includedir' clauses for DIRS.

> `(busconfig

> + (limit (@ (name "auth_timeout")) "60000")

> (servicehelper "/run/setuid-programs/dbus-daemon-launch-helper")

> ;; First, the '.service' files of services subject to activation.

With this patch, I can boot and login to X on my X200s with HDD.
Obviously this failure and subsequent fix are not deterministic, but
it's a good sign! Let's get some more testing! And comrades, let's test
the xorg-server update while doing it:

https://issues.guix.gnu.org/issue/52562

Leo Famulari wrote on 21 Dec 2021 18:44

Recipients:(name . Timothy Sample)(address . samplet@ngyro.com)

Message-ID:YcISjw7v7K0MqYv9@jasmine.lan

On Tue, Dec 21, 2021 at 12:37:30PM -0500, Leo Famulari wrote:

Toggle quote (4 lines)

> With this patch, I can boot and login to X on my X200s with HDD.

> Obviously this failure and subsequent fix are not deterministic, but

> it's a good sign! Let's get some more testing!

I pushed the patch to a Savannah branch to ease testing:

https://git.savannah.gnu.org/cgit/guix.git/log/?h=wip-fix-52051

So, you can do:

`guix pull --branch=wip-fix-52051 && guix system reconfigure [...]`

Or similar with time-machine

Maxime Devos wrote on 21 Dec 2021 18:51

Recipients:(name . Leo Famulari)(address . leo@famulari.name)

Message-ID:260a106ae1127257b512f599282f9456df23b8dc.camel@telenet.be

Leo Famulari schreef op di 21-12-2021 om 12:32 [-0500]:

Toggle quote (14 lines)> > A comment like 
> > 
> > > ;; Set timeout to a huge number (16.6 hours), because
> > > ;; upstream often sets timeouts low for spinning disks,
> > > ;; slow CPUs, etc.
> > > (limit [...] "60000")
> > 
> > could be useful (I'm assuming the timeout is in seconds here).
> 
> I suggest we wait until such drastic action is necessary. Otherwise
> we
> might be banging our heads against the wall in a few years, trying to
> debug something. Let's not rush to extremes :)

I don't think commenting new code is a drastic and extreme action.
I assume you were referring you were referring to the ‘16.6 hours’?
If so, just replace ‘16.6 hours’ by ‘one minute’, because it turns
out that 'auth_timeout' is in milliseconds and not seconds,
see https://dbus.freedesktop.org/doc/dbus-daemon.1.html.

Greetings,
Maxime.

Leo Famulari wrote on 21 Dec 2021 19:11

Recipients:(name . Maxime Devos)(address . maximedevos@telenet.be)

Message-ID:YcIYvEHqChOZ9Lja@jasmine.lan

On Tue, Dec 21, 2021 at 05:51:28PM +0000, Maxime Devos wrote:

Toggle quote (3 lines)

> I don't think commenting new code is a drastic and extreme action.

> I assume you were referring you were referring to the ‘16.6 hours’?

Right. We don't want to use a timeout of 16.6 hours. From the

perspective of a user, that's not meaningfully longer than, say, 30

minutes. And really, if the system can't bring up the login interface

for a laptop in 5 minutes, it's totally broken.

Toggle quote (4 lines)

> If so, just replace ‘16.6 hours’ by ‘one minute’, because it turns

> out that 'auth_timeout' is in milliseconds and not seconds,

> see <https://dbus.freedesktop.org/doc/dbus-daemon.1.html>.

I think I was confused about the units here.

Luis Felipe wrote on 22 Dec 2021 02:46

[core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)

Recipients:(name . 52051@debbugs.gnu.org)(address . 52051@debbugs.gnu.org)

Message-ID:sHeDSIHyuoHesmvYdFh5Bpqpj6e4qmLCslkETTM1XkrH6y9NvpmgrbfKOwmkOZoDnngUDzQhoZqdr-5U4NGQsMDh7-QHHbdR51DYFZXFkAE=@protonmail.com

Hello,

Just wanted to inform that I ran

`guix pull --branch=wip-fix-52051 && guix system reconfigure [...]`

and I could log in without problems now. Thanks.

Attachment: publickey - luis.felipe.la@protonmail.com - 0x12DE1598.asc

Attachment: signature.asc

Leo Famulari wrote on 22 Dec 2021 03:19

Recipients:(name . Luis Felipe via Bug reports for GNU Guix)(address . bug-guix@gnu.org)(name . 52051@debbugs.gnu.org)(address . 52051@debbugs.gnu.org)

Message-ID:YcKLKNr0I2zdBlQ4@jasmine.lan

On Wed, Dec 22, 2021 at 01:46:41AM +0000, Luis Felipe via Bug reports for GNU Guix wrote:

Toggle quote (8 lines)

> Hello,

> Just wanted to inform that I ran

> `guix pull --branch=wip-fix-52051 && guix system reconfigure [...]`

> and I could log in without problems now. Thanks.

Awesome!

Would someone like to write a proper commit message and add a code

comment?

Leo Famulari wrote on 22 Dec 2021 22:16

Recipients:(name . Timothy Sample)(address . samplet@ngyro.com)(name . 52051@debbugs.gnu.org)(address . 52051@debbugs.gnu.org)

Message-ID:YcOVrk7teKyQp2LX@jasmine.lan

On Tue, Dec 21, 2021 at 09:19:20PM -0500, Leo Famulari wrote:

Toggle quote (3 lines)

> Would someone like to write a proper commit message and add a code

> comment?

How about the attached patch? I'd like to push this soon, because it's a

severe problem for some users.

From ab9c4de76dea889e5d05bcf7fa173868357d5f44 Mon Sep 17 00:00:00 2001

Fixes https://issues.guix.gnu.org/issue/52051.

* gnu/services/dbus.scm (dbus-configuration-directory): Set a 60 second timeout

in the dbus config.

---

gnu/services/dbus.scm | 4 ++++

1 file changed, 4 insertions(+)

Toggle diff (17 lines)diff --git a/gnu/services/dbus.scm b/gnu/services/dbus.scm
index 85a4c3ec9a..d2daf60497 100644
--- a/gnu/services/dbus.scm
+++ b/gnu/services/dbus.scm
@@ -106,6 +106,10 @@ (define-syntax directives
         (define (services->sxml services)
           ;; Return the SXML 'includedir' clauses for DIRS.
           `(busconfig
+             ;; Increase this timeout to 60 seconds to work around race-y
+             ;; failures such as <https://issues.guix.gnu.org/52051> on slow
+             ;; computers with slow I/O.
+            (limit (@ (name "auth_timeout")) "60000")
             (servicehelper "/run/setuid-programs/dbus-daemon-launch-helper")
 
             ;; First, the '.service' files of services subject to activation.
-- 
2.34.0

Leo Famulari wrote on 23 Dec 2021 00:38

Recipients:(name . Timothy Sample)(address . samplet@ngyro.com)(name . 52051@debbugs.gnu.org)(address . 52051-done@debbugs.gnu.org)

Message-ID:YcO28rn9HnZBLxwQ@jasmine.lan

On Wed, Dec 22, 2021 at 04:16:30PM -0500, Leo Famulari wrote:

Toggle quote (17 lines)> On Tue, Dec 21, 2021 at 09:19:20PM -0500, Leo Famulari wrote:
> > Would someone like to write a proper commit message and add a code
> > comment?
> 
> How about the attached patch? I'd like to push this soon, because it's a
> severe problem for some users.

> From ab9c4de76dea889e5d05bcf7fa173868357d5f44 Mon Sep 17 00:00:00 2001
> From: Timothy Sample <samplet@ngyro.com>
> Date: Tue, 21 Dec 2021 11:52:34 -0500
> Subject: [PATCH] services: dbus: Wait 1 minute for elogind to get ready.
> 
> Fixes <https://issues.guix.gnu.org/issue/52051>.
> 
> * gnu/services/dbus.scm (dbus-configuration-directory): Set a 60 second timeout
> in the dbus config.

Pushed as 488f1c589df00e802163af534294d93372e5c025

Closed

Your comment

This issue is archived.

To comment on this conversation send an email to 52051@debbugs.gnu.org

To respond to this issue using the mumi CLI, first switch to it

mumi current 52051

Then, you may apply the latest patchset in this issue (with sign off)

mumi am -- -s

Or, compose a reply to this issue

mumi compose

Or, send patches to this issue

mumi send-email *.patch

is:open	open issues
is:done	closed issues
submitter:<who>	search issue submitter
author:<who>	search by message author
date:yesterday..now	search by issue date
mdate:3m..2d	search by message date