elogind startup race between shepherd and dbus-daemon

DoneSubmitted by Ludovic Courtès.
Details
4 participants
  • Josselin Poiret
  • Liliana Marie Prikler
  • Ludovic Courtès
  • Maxim Cournoyer
Owner
unassigned
Severity
important
L
L
Ludovic Courtès wrote on 16 May 10:26 +0200
(address . bug-guix@gnu.org)
877d6lc28o.fsf@inria.fr
Hello!

Currently (40a729a0e6f1d660b942241416c1e2c567616d4d), shepherd and
dbus-daemon compete to start elogind: shepherd tries to start it
eagerly, and dbus-daemon starts it on-demand upon bus activation.

Sometimes dbus-daemon wins, and thus shepherd tries a few times to start
it anyway, leading to the infamous:

elogind is already running as PID 123

(elogind checks whether its PID file exists. Note that you may see that
message also when shepherd wins, because dbus-daemon tries to start it
anyway.) Eventually, shepherd considers that elogind cannot be started
and disables it.

In addition to being ridiculous, it’s harmful: the ‘xorg-server’ service
(from ‘gdm-service-type’ and ‘sddm-service-type’ depends on ‘elogind’),
so if shepherd loses the race, Xorg isn’t started (on my laptop,
shepherd never loses the race it seems, but i’ve seen it lose half of
the time on a slower machine).

The reason elogind is started by shepherd is explained in this comment:

;; Start elogind from the Shepherd rather than waiting
;; for bus activation. This ensures that it can handle
;; events like lid close, etc.

This comes from 94a881178af9a9a918ce6de55641daa245c92e73, which was a
fix for http://issues.guix.gnu.org/27580. I believe the justification
still holds.

So it would seem that the solution to this is to prevent dbus-daemon
from starting elogind. We can do that by changing
org.freedesktop.login1.service so that it has “Exec=true” instead of
“Exec=elogind --daemon”.

“Exec=true” is a bit crude because it doesn’t guarantee that elogind is
really started; if that isn’t good enough, we could instead wait for the
PID file or something (as of Shepherd 0.9.0, invoking ‘herd start
elogind’ potentially leads shepherd to start a second instance if the
first one is still being started, so we can’t really do that).

Depending on what we end up with, we might also revisit whether
xorg-server needs to explicitly depend on elogind.

Thoughts?

Ludo’.
L
L
Ludovic Courtès wrote on 16 May 10:40 +0200
control message for bug #55444
(address . control@debbugs.gnu.org)
875ym5c1lm.fsf@gnu.org
severity 55444 important
quit
M
M
Maxim Cournoyer wrote on 24 May 04:27 +0200
Re: bug#55444: elogind startup race between shepherd and dbus-daemon
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 55444@debbugs.gnu.org)
87y1yr4qd3.fsf@gmail.com
Hi,

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (11 lines)
> Hello!
>
> Currently (40a729a0e6f1d660b942241416c1e2c567616d4d), shepherd and
> dbus-daemon compete to start elogind: shepherd tries to start it
> eagerly, and dbus-daemon starts it on-demand upon bus activation.
>
> Sometimes dbus-daemon wins, and thus shepherd tries a few times to start
> it anyway, leading to the infamous:
>
> elogind is already running as PID 123

Do we have a system test that sometimes reproduce it, or at least the
above message? I have some branch where I introduce some D-Bus
synchronization primitives I had started to fix
differently (bumping the timeout value); perhaps it could be of use
here.

Thanks,

Maxim
L
L
Liliana Marie Prikler wrote on 24 May 21:25 +0200
Re: elogind startup race between shepherd and dbus-daemon
5a8b5f20da2676e03c696e33c029895292ed397e.camel@gmail.com
Hi,

Am Montag, dem 16.05.2022 um 10:26 +0200 schrieb Ludovic Courtès:
Toggle quote (11 lines)
> [...]
> So it would seem that the solution to this is to prevent dbus-daemon
> from starting elogind.  We can do that by changing
> org.freedesktop.login1.service so that it has “Exec=true” instead of
> “Exec=elogind --daemon”.
>
> “Exec=true” is a bit crude because it doesn’t guarantee that elogind
> is really started; if that isn’t good enough, we could instead wait
> for the PID file or something (as of Shepherd 0.9.0, invoking ‘herd
> start elogind’ potentially leads shepherd to start a second instance
> if the first one is still being started, so we can’t really do that).
Why does shepherd race with itself here? That sounds like a very evil
bug. Rather than waiting for a log file, I'd suggest writing an ad-hoc
Guile script that communicates with shepherd and blocks until shepherd
signals that elogind has been started, but this script too would have
to work around shepherd racing against itself.

Toggle quote (2 lines)
> Depending on what we end up with, we might also revisit whether
> xorg-server needs to explicitly depend on elogind.
At least in the case of GDM I think it does heavily depend on elogind.
For the future, I think we also should take over dbus-daemon's
autostart in the same way systemd already has.

Cheers
L
L
Ludovic Courtès wrote on 25 May 14:26 +0200
Re: bug#55444: elogind startup race between shepherd and dbus-daemon
(name . Liliana Marie Prikler)(address . liliana.prikler@gmail.com)(address . 55444@debbugs.gnu.org)
87zgj5hk7u.fsf@gnu.org
Hi,

Liliana Marie Prikler <liliana.prikler@gmail.com> skribis:

Toggle quote (18 lines)
> Am Montag, dem 16.05.2022 um 10:26 +0200 schrieb Ludovic Courtès:
>> [...]
>> So it would seem that the solution to this is to prevent dbus-daemon
>> from starting elogind.  We can do that by changing
>> org.freedesktop.login1.service so that it has “Exec=true” instead of
>> “Exec=elogind --daemon”.
>>
>> “Exec=true” is a bit crude because it doesn’t guarantee that elogind
>> is really started; if that isn’t good enough, we could instead wait
>> for the PID file or something (as of Shepherd 0.9.0, invoking ‘herd
>> start elogind’ potentially leads shepherd to start a second instance
>> if the first one is still being started, so we can’t really do that).
> Why does shepherd race with itself here? That sounds like a very evil
> bug. Rather than waiting for a log file, I'd suggest writing an ad-hoc
> Guile script that communicates with shepherd and blocks until shepherd
> signals that elogind has been started, but this script too would have
> to work around shepherd racing against itself.

Right. Currently services have two states: stopped, and started.
Fixing that needs non-trivial changes to how shepherd handles state.

We’ll have to do that (the way I see it, we’ll move state out of
<service> and have a fiber explicitly handle state, including
distinguishing between “stopped” and “starting”), but I think/hope we
can fix this bug without first addressing this issue.

Toggle quote (6 lines)
>> Depending on what we end up with, we might also revisit whether
>> xorg-server needs to explicitly depend on elogind.
> At least in the case of GDM I think it does heavily depend on elogind.
> For the future, I think we also should take over dbus-daemon's
> autostart in the same way systemd already has.

Agreed, though that one is trickier: we’d need an implementation of the
D-Bus protocol. There’s guile-ac-d-bus but it’s probably under-tested.

Thanks,
Ludo’.
L
L
Ludovic Courtès wrote on 25 May 14:45 +0200
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 55444@debbugs.gnu.org)
87tu9dhjb5.fsf@gnu.org
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (16 lines)
> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Hello!
>>
>> Currently (40a729a0e6f1d660b942241416c1e2c567616d4d), shepherd and
>> dbus-daemon compete to start elogind: shepherd tries to start it
>> eagerly, and dbus-daemon starts it on-demand upon bus activation.
>>
>> Sometimes dbus-daemon wins, and thus shepherd tries a few times to start
>> it anyway, leading to the infamous:
>>
>> elogind is already running as PID 123
>
> Do we have a system test that sometimes reproduce it, or at least the
> above message?

Any system along the lines of gnu/system/examples/desktop.tmpl gets this
message, though usually Xorg starts without any problem.

There’s an elogind system test but it doesn’t catch the problem because
it’s a non-deterministic and rather rare issue.

Toggle quote (6 lines)
> I have some branch where I introduce some D-Bus synchronization
> primitives I had started to fix
> https://issues.guix.gnu.org/issue/52051, which ended up being fixed
> differently (bumping the timeout value); perhaps it could be of use
> here.

Hmm I had forgotten about that bug, and I wonder if it’s the same bug I
was seeing, and that in fact the shepherd/dbus-daemon race isn’t the
root cause. The machine where I saw elogind startup failures was
extremely slow to start for other reasons.

Anyhow, getting rid of this race seems like the right thing to do.

Ludo’.
L
L
Ludovic Courtès wrote on 27 May 15:54 +0200
(address . 55444@debbugs.gnu.org)
87k0a7dqsb.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (7 lines)
> Currently (40a729a0e6f1d660b942241416c1e2c567616d4d), shepherd and
> dbus-daemon compete to start elogind: shepherd tries to start it
> eagerly, and dbus-daemon starts it on-demand upon bus activation.
>
> Sometimes dbus-daemon wins, and thus shepherd tries a few times to start
> it anyway, leading to the infamous:

Here’s an example where dbus-daemon wins:

Toggle snippet (46 lines)
$ sudo grep --color -E '^May 27 .*(dbus.*login|elogind)' /var/log/messages
May 27 15:06:36 localhost dbus-daemon[326]: [system] Activating service name='org.freedesktop.login1' requested by ':1.0' (uid=0 pid=307 comm="/gnu/store/2lis8khrdk0zzjzs5ydi8rs5h6f6wjr7-shadow") (using servicehelper)
May 27 15:06:37 localhost elogind-daemon[335]: New seat seat0.
May 27 15:06:37 localhost elogind-daemon[335]: Watching system buttons on /dev/input/event2 (Power Button)
May 27 15:06:37 localhost elogind-daemon[335]: Watching system buttons on /dev/input/event0 (Power Button)
May 27 15:06:37 localhost elogind-daemon[335]: Watching system buttons on /dev/input/event1 (Sleep Button)
May 27 15:06:41 localhost elogind-daemon[335]: Watching system buttons on /dev/input/event4 (Darfon HP USB Keyboard)
May 27 15:06:43 localhost elogind-daemon[335]: Watching system buttons on /dev/input/event5 (Darfon HP USB Keyboard Consumer Control)
May 27 15:06:48 localhost dbus-daemon[326]: [system] Successfully activated service 'org.freedesktop.login1'
May 27 15:06:54 localhost elogind-daemon[335]: New session c1 of user ludo.
May 27 15:07:45 localhost shepherd[1]: Service elogind has been started.
May 27 15:08:02 localhost shepherd[1]: Respawning elogind.
May 27 15:08:14 localhost shepherd[1]: Service elogind has been started.
May 27 15:08:32 localhost shepherd[1]: Respawning elogind.
May 27 15:08:43 localhost shepherd[1]: Service elogind has been started.
May 27 15:08:47 localhost vmunix: [ 25.123255] shepherd[1]: Service file-system-/sys/fs/cgroup/elogind has been started.
May 27 15:08:53 localhost shepherd[1]: Respawning elogind.
May 27 15:08:59 localhost shepherd[1]: Service elogind has been started.
May 27 15:09:00 localhost vmunix: [ 79.976531] elogind[348]: elogind is already running as PID 335
May 27 15:09:00 localhost vmunix: [ 107.880971] elogind[364]: elogind is already running as PID 335
May 27 15:09:00 localhost vmunix: [ 109.160864] elogind-daemon[335]: New session c2 of user ludo.
May 27 15:09:00 localhost vmunix: [ 135.017068] elogind[369]: elogind is already running as PID 335
May 27 15:09:00 localhost vmunix: [ 159.849027] elogind[370]: elogind is already running as PID 335
May 27 15:09:00 localhost vmunix: [ 181.608889] elogind[371]: elogind is already running as PID 335
May 27 15:09:00 localhost shepherd[1]: Respawning elogind.
May 27 15:09:00 localhost shepherd[1]: Service elogind has been started.
May 27 15:09:04 localhost shepherd[1]: Respawning elogind.
May 27 15:09:04 localhost shepherd[1]: Service elogind has been started.
May 27 15:09:04 localhost elogind[410]: elogind is already running as PID 335
May 27 15:09:04 localhost shepherd[1]: Respawning elogind.
May 27 15:09:04 localhost shepherd[1]: Service elogind has been started.
May 27 15:09:05 localhost elogind[411]: elogind is already running as PID 335
May 27 15:09:05 localhost shepherd[1]: Respawning elogind.
May 27 15:09:05 localhost shepherd[1]: Service elogind has been started.
May 27 15:09:05 localhost elogind[412]: elogind is already running as PID 335
May 27 15:09:05 localhost shepherd[1]: Respawning elogind.
May 27 15:09:05 localhost shepherd[1]: Service elogind has been started.
May 27 15:09:05 localhost elogind[416]: elogind is already running as PID 335
May 27 15:09:05 localhost shepherd[1]: Respawning elogind.
May 27 15:09:05 localhost shepherd[1]: Service elogind has been started.
May 27 15:09:05 localhost elogind[417]: elogind is already running as PID 335
May 27 15:09:05 localhost shepherd[1]: Service elogind has been disabled.
May 27 15:09:08 localhost elogind-daemon[335]: New session c3 of user gdm.
May 27 15:12:08 localhost elogind-daemon[335]: New session c4 of user ludo.

(In this case ‘xorg-server’ started but ‘herd status elogind’ shows it
is stopped and disabled.)

Contrast with a successful startup where shepherd wins:

Toggle snippet (15 lines)
May 27 11:03:55 localhost shepherd[1]: Service elogind has been started.
May 27 11:03:54 localhost elogind[476]: New seat seat0.
May 27 11:03:54 localhost elogind[476]: Watching system buttons on /dev/input/event3 (Power Button)
May 27 11:03:54 localhost elogind[476]: Watching system buttons on /dev/input/event1 (Power Button)
May 27 11:03:54 localhost elogind[476]: Watching system buttons on /dev/input/event0 (Lid Switch)
May 27 11:03:54 localhost elogind[476]: Watching system buttons on /dev/input/event2 (Sleep Button)
May 27 11:03:54 localhost elogind[476]: Watching system buttons on /dev/input/event5 (Dell Dell USB Keyboard)
May 27 11:03:54 localhost dbus-daemon[470]: [system] Activating service name='org.freedesktop.login1' requested by ':1.2' (uid=0 pid=477 comm="/gnu/store/qpaw2f734zlsq153fkn5afcv4k4fk63z-upower") (using servicehelper)
May 27 11:03:54 localhost elogind[476]: Watching system buttons on /dev/input/event4 (AT Translated Set 2 keyboard)
May 27 11:03:55 localhost elogind[496]: elogind is already running as PID 476
May 27 11:03:55 localhost dbus-daemon[470]: [system] Successfully activated service 'org.freedesktop.login1'
May 27 11:04:03 localhost vmunix: [ 2089.808033] shepherd[1]: Service file-system-/sys/fs/cgroup/elogind has been started.
May 27 11:04:04 localhost elogind[476]: New session c1 of user gdm.

Besides, Gentoo recommends starting it from runit rather than on-demand:


They don’t seem to be doing anything to avoid the race though.

Ludo’.
L
L
Ludovic Courtès wrote on 27 May 22:54 +0200
(address . 55444@debbugs.gnu.org)
878rqmelwm.fsf@gnu.org
Hey there,

Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (11 lines)
> So it would seem that the solution to this is to prevent dbus-daemon
> from starting elogind. We can do that by changing
> org.freedesktop.login1.service so that it has “Exec=true” instead of
> “Exec=elogind --daemon”.
>
> “Exec=true” is a bit crude because it doesn’t guarantee that elogind is
> really started; if that isn’t good enough, we could instead wait for the
> PID file or something (as of Shepherd 0.9.0, invoking ‘herd start
> elogind’ potentially leads shepherd to start a second instance if the
> first one is still being started, so we can’t really do that).

The patch below address that: it changes the “Exec=” line of
‘org.freedesktop.login1’ to refer to a wrapper. That wrapper connects
to shepherd and waits until ‘elogind’ is started.

That way, if dbus-daemon comes first, it won’t actually launch anything
and instead wait for the Shepherd ‘elogind’ service to be up. (And if
it comes second, dbus-daemon won’t try to launch anything, so no
spurious “already running” messages.)

I tested it in a ‘desktop.tmpl’ VM, quickly logging in on tty1. On
/var/log/messages, you can see the “Activating ….login1” message from
dbus-daemon, followed by “Service elogind started” from shepherd,
followed by “Successfully activated ….login1” from dbus-daemon.

The “elogind” system test passes too.

Thoughts? Objections?

Ludo’.
From 7ef63d7426677961afd2bd937af19b08209c5b70 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ludovic=20Court=C3=A8s?= <ludo@gnu.org>
Date: Fri, 27 May 2022 22:41:55 +0200
Subject: [PATCH] services: elogind: When started by dbus-daemon, wait for the
Shepherd service.


Previously shepherd and dbus-daemon would race to start elogind. In
some cases (for instance if one logs in quickly enough on the tty),
dbus-daemon would "win" and start elogind before shepherd has had a
chance to do it. Consequently, shepherd would fail to start elogind and
mark it as stopped and disabled, in turn preventing services that depend
on it such as 'xorg-server' from starting.

* gnu/services/desktop.scm (elogind-dbus-service): Rewrite to refer to a
wrapper that waits for the 'elogind' Shepherd service.
---
gnu/services/desktop.scm | 79 ++++++++++++++++++++++++++++++++++++++--
1 file changed, 75 insertions(+), 4 deletions(-)

Toggle diff (92 lines)
diff --git a/gnu/services/desktop.scm b/gnu/services/desktop.scm
index 24fd43a207..318107a2ca 100644
--- a/gnu/services/desktop.scm
+++ b/gnu/services/desktop.scm
@@ -1075,10 +1075,81 @@ (define-syntax-rule (ini-file config file clause ...)
    ("HybridSleepMode" (sleep-list elogind-hybrid-sleep-mode))))
 
 (define (elogind-dbus-service config)
-  (list (wrapped-dbus-service (elogind-package config)
-                              "libexec/elogind/elogind"
-                              `(("ELOGIND_CONF_FILE"
-                                 ,(elogind-configuration-file config))))))
+  "Return a @file{org.freedesktop.login1.service} file that tells D-Bus how to
+\"start\" elogind.  In practice though, our elogind is started when booting by
+shepherd.  Thus, the @code{Exec} line of this @file{.service} file does not
+explain how to start elogind; instead, it spawns a wrapper that waits for the
+@code{elogind} shepherd service.  This avoids a race condition where both
+@command{shepherd} and @command{dbus-daemon} would attempt to start elogind."
+  ;; For more info on the elogind startup race, see
+  ;; <https://issues.guix.gnu.org/55444>.
+
+  (define elogind
+    (elogind-package config))
+
+  (define wrapper
+    (program-file "elogind-dbus-shepherd-sync"
+                  (with-imported-modules '((gnu services herd))
+                    #~(begin
+                        (use-modules (gnu services herd)
+                                     (srfi srfi-1)
+                                     (ice-9 match))
+
+                        (define (elogind-service? service)
+                          (memq 'elogind (live-service-provision service)))
+
+                        (define max-attempts
+                          ;; Number of attempts before assuming elogind failed
+                          ;; to start.
+                          20)
+
+                        ;; Repeatedly check whether the 'elogind' shepherd
+                        ;; service is up and running.  (As of Shepherd 0.9.1,
+                        ;; we cannot just call the 'start' method and wait for
+                        ;; it: it would spawn an additional elogind process.)
+                        (let loop ((attempts 0))
+                          (define services
+                            (current-services))
+
+                          (when (>= attempts max-attempts)
+                            (format (current-error-port)
+                                    "elogind shepherd service not started~%")
+                            (exit 2))
+
+                          (match (find elogind-service? services)
+                            (#f
+                             (format (current-error-port)
+                                     "no elogind shepherd service~%")
+                             (exit 1))
+                            (service
+                             (unless (live-service-running service)
+                               (sleep 1)
+                               (loop (+ attempts 1))))))))))
+
+  (define build
+    (with-imported-modules '((guix build utils))
+      #~(begin
+          (use-modules (guix build utils)
+                       (ice-9 match))
+
+          (define service-directory
+            "/share/dbus-1/system-services")
+
+          (mkdir-p (dirname (string-append #$output service-directory)))
+          (copy-recursively (string-append #$elogind service-directory)
+                            (string-append #$output service-directory))
+          (symlink (string-append #$elogind "/etc") ;for etc/dbus-1
+                   (string-append #$output "/etc"))
+
+          ;; Replace the "Exec=" line of the 'org.freedesktop.login1.service'
+          ;; file with one that refers to WRAPPER instead of elogind.
+          (match (find-files #$output "\\.service$")
+            ((file)
+             (substitute* file
+               (("Exec[[:blank:]]*=.*" _)
+                (string-append "Exec=" #$wrapper "\n"))))))))
+
+  (list (computed-file "elogind-dbus-service-wrapper" build)))
 
 (define (pam-extension-procedure config)
   "Return an extension for PAM-ROOT-SERVICE-TYPE that ensures that all the PAM
-- 
2.36.0
J
J
Josselin Poiret wrote on 28 May 10:13 +0200
87leumgjmh.fsf@jpoiret.xyz
Hello Ludo,

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (20 lines)
> The patch below address that: it changes the “Exec=” line of
> ‘org.freedesktop.login1’ to refer to a wrapper. That wrapper connects
> to shepherd and waits until ‘elogind’ is started.
>
> That way, if dbus-daemon comes first, it won’t actually launch anything
> and instead wait for the Shepherd ‘elogind’ service to be up. (And if
> it comes second, dbus-daemon won’t try to launch anything, so no
> spurious “already running” messages.)
>
> I tested it in a ‘desktop.tmpl’ VM, quickly logging in on tty1. On
> /var/log/messages, you can see the “Activating ….login1” message from
> dbus-daemon, followed by “Service elogind started” from shepherd,
> followed by “Successfully activated ….login1” from dbus-daemon.
>
> The “elogind” system test passes too.
>
> Thoughts? Objections?
>
> Ludo’.

Great idea! The patch LGTM, although I'd argue that most of the wrapper
code could belong in (gnu services herd), in something like
(wait-for-services #:select? select? #:retries (n 20)) so that someone
else doesn't end up recoding this for another service.

Best,
--
Josselin Poiret
L
L
Ludovic Courtès wrote on 28 May 23:26 +0200
(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 55444-done@debbugs.gnu.org)
87czfx9wn4.fsf@gnu.org
Hello,

Josselin Poiret <dev@jpoiret.xyz> skribis:

Toggle quote (5 lines)
> Great idea! The patch LGTM, although I'd argue that most of the wrapper
> code could belong in (gnu services herd), in something like
> (wait-for-services #:select? select? #:retries (n 20)) so that someone
> else doesn't end up recoding this for another service.

Good idea, done here:

f383838a09 services: elogind: When started by dbus-daemon, wait for the Shepherd service.
b04ae71def services: herd: Add 'wait-for-service'.

Let me know what you think!

Thanks,
Ludo’.
Closed
?
Your comment

This issue is archived.

To comment on this conversation send email to 55444@debbugs.gnu.org