Installer: wpa_supplicant fails to start

DoneSubmitted by Ludovic Courtès.
L
Ludovic Courtès wrote on Fri May 03 21:31:40+0200 2019
(name . Bug Guix)(address . bug-guix@gnu.org)
87sgtv8hcz.fsf@gnu.org
Hello,

Ricardo and sirgazil reported on IRC that the ‘wpa-supplicant’ would
(sometimes?) fail to start for the in the installation image on their
laptop.

As a result the ‘networking’ service (Connman) would not start either;
thus, the installer would report that it failed to establish Internet
access, and would just stop there.

Sirgazil posted /var/log/messages, which shows this:

--8<---------------cut here---------------start------------->8---
667 May 3 16:03:40 localhost vmunix: [ 9.164363] ieee80211 phy0: Atheros AR9565 Rev:2 mem=0xffffa88240e80000, irq=19
668 May 3 16:03:40 localhost vmunix: [ 9.172514] ath9k 0000:02:00.0 wlp2s0: renamed from wlan0
669 May 3 16:03:40 localhost vmunix: [ 9.199748] shepherd[1]: Service udev has been started.
670 May 3 16:03:40 localhost vmunix: [ 9.200682] shepherd[1]: Service file-system-/tmp has been started.
671 May 3 16:03:40 localhost vmunix: [ 9.201560] shepherd[1]: Service file-system-/dev/pts has been started.
672 May 3 16:03:40 localhost vmunix: [ 9.202423] shepherd[1]: Service file-system-/dev/shm has been started.
673 May 3 16:03:40 localhost vmunix: [ 9.203232] shepherd[1]: Service file-system-/gnu/store has been started.
674 May 3 16:03:40 localhost vmunix: [ 9.203891] shepherd[1]: Service file-systems has been started.
675 May 3 16:03:40 localhost vmunix: [ 9.207605] shepherd[1]: Service user-processes has been started.
676 May 3 16:03:40 localhost vmunix: [ 9.208993] shepherd[1]: Service host-name has been started.
677 May 3 16:03:40 localhost vmunix: [ 9.217925] shepherd[1]: Service user-homes has been started.
678 May 3 16:03:40 localhost vmunix: [ 10.221211] shepherd[1]: Service dbus-system has been started.
679 May 3 16:03:40 localhost vmunix: [ 10.222093] shepherd[1]: Service loopback has been started.
680 May 3 16:03:40 localhost wpa_supplicant[398]: Successfully initialized wpa_supplicant
681 May 3 16:03:40 localhost shepherd[1]: Service wpa-supplicant could not be started.
682 May 3 16:03:40 localhost shepherd[1]: Service networking depends on wpa-supplicant.
683 May 3 16:03:40 localhost shepherd[1]: Service networking could not be started.
684 May 3 16:03:40 localhost wpa_supplicant[400]: dbus: Could not request service name: already registered
685 May 3 16:03:40 localhost wpa_supplicant[400]: Failed to initialize wpa_supplicant
686 May 3 16:03:45 localhost shepherd[1]: Service wpa-supplicant could not be started.
687 May 3 16:03:45 localhost nscd: 401 monitoring file `/etc/hosts` (1)
688 May 3 16:03:45 localhost nscd: 401 monitoring directory `/etc` (2)
689 May 3 16:03:45 localhost nscd: 401 monitoring file `/etc/resolv.conf` (3)
690 May 3 16:03:45 localhost nscd: 401 monitoring directory `/etc` (2)
691 May 3 16:03:46 localhost shepherd[1]: Service nscd has been started.
692 May 3 16:03:46 localhost /gnu/store/3dnww56dsrd6f9pf9p8ll3yawhfp50d8-gpm-1.20.7/sbin/gpm[409]: *** info [daemon/startup.c(136)]:
693 May 3 16:03:46 localhost /gnu/store/3dnww56dsrd6f9pf9p8ll3yawhfp50d8-gpm-1.20.7/sbin/gpm[409]: Started gpm successfully. Entered daemon mode.
694 May 3 16:03:46 localhost shepherd[1]: Service gpm has been started.
695 May 3 16:03:46 localhost shepherd[1]: Service term-tty1 has been started.
696 May 3 16:03:46 localhost shepherd[1]: Service console-font-tty1 has been started.
697 May 3 16:03:46 localhost shepherd[1]: Service virtual-terminal has been started.
698 May 3 16:03:46 localhost shepherd[1]: Service term-tty2 has been started.
699 May 3 16:03:46 localhost shepherd[1]: Service console-font-tty2 has been started.
700 May 3 16:03:46 localhost shepherd[1]: Service term-tty3 has been started.
701 May 3 16:03:46 localhost shepherd[1]: Service console-font-tty3 has been started.
702 May 3 16:03:46 localhost shepherd[1]: Service term-tty4 has been started.
703 May 3 16:03:46 localhost shepherd[1]: Service console-font-tty4 has been started.
704 May 3 16:03:46 localhost shepherd[1]: Service term-tty5 has been started.
705 May 3 16:03:46 localhost shepherd[1]: Service console-font-tty5 has been started.
706 May 3 16:03:46 localhost shepherd[1]: Service term-tty6 has been started.
707 May 3 16:03:46 localhost shepherd[1]: Service console-font-tty6 has been started.
708 May 3 16:03:46 localhost shepherd[1]: Service guix-daemon has been started.
--8<---------------cut here---------------end--------------->8---

Ludo’.
L
Ludovic Courtès wrote on Fri May 03 22:51:31+0200 2019
(address . 35550@debbugs.gnu.org)
875zqr8dnw.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> 678 May 3 16:03:40 localhost vmunix: [ 10.221211] shepherd[1]: Service dbus-system has been started.
> 679 May 3 16:03:40 localhost vmunix: [ 10.222093] shepherd[1]: Service loopback has been started.
> 680 May 3 16:03:40 localhost wpa_supplicant[398]: Successfully initialized wpa_supplicant
> 681 May 3 16:03:40 localhost shepherd[1]: Service wpa-supplicant could not be started.
> 682 May 3 16:03:40 localhost shepherd[1]: Service networking depends on wpa-supplicant.
> 683 May 3 16:03:40 localhost shepherd[1]: Service networking could not be started.
> 684 May 3 16:03:40 localhost wpa_supplicant[400]: dbus: Could not request service name: already registered
> 685 May 3 16:03:40 localhost wpa_supplicant[400]: Failed to initialize wpa_supplicant
> 686 May 3 16:03:45 localhost shepherd[1]: Service wpa-supplicant could not be started.

My guess is that it goes like this:

1. shepherd starts ‘networking’, which triggers the start of
‘wpa-supplicant’ (PID 398), which immediately “fails”. Thus
‘networking’ is not started.

2. shepherd continues and starts ‘wpa-supplicant’ directly (PID 400).
This time it fails for good; after 5 seconds, since the PID file
didn’t show up, shepherd says again that it could not be started.

Indeed, by looking at shepherd.conf from:

guix gc -R $(guix system build gnu/system/install.scm) | grep shepherd.conf

one can see that ‘networking’ comes before ‘wpa-supplicant’ in the expression:

(for-each start '(… networking … wpa-supplicant …))

So why is ‘wpa-supplicant’ marked as failing to start on the first
attempt?

The only reason I can think of is if ‘read-pid-file’ from (shepherd
service) returns immediately and returns #f instead of a number. That
can actually happen if the PID file exists but is empty (or contains
garbage).

You would expect wpa_supplicant to create its PID file atomically: write
it under a different name, then rename(2)… but no:

--8<---------------cut here---------------start------------->8---
int os_daemonize(const char *pid_file)
{
#if defined(__uClinux__) || defined(__sun__)
return -1;
#else /* defined(__uClinux__) || defined(__sun__) */
if (os_daemon(0, 0)) {
perror("daemon");
return -1;
}

if (pid_file) {
FILE *f = fopen(pid_file, "w");
if (f) {
fprintf(f, "%u\n", getpid());
fclose(f);
}
}

return -0;
#endif /* defined(__uClinux__) || defined(__sun__) */
}
--8<---------------cut here---------------end--------------->8---

So there is a possibility, albeit unlikely, for shepherd to see the PID
file after it’s been open but before it’s been written to. (This
problem is not limited to the installer.)

I’m not 100% convinced that this is what’s happening there but that’s
the only lead I have. I’m surprised we haven’t seen other reports
before.

Thoughts?

Ludo’.
R
Ricardo Wurmus wrote on Fri May 03 23:26:52+0200 2019
Re: Installer: wpa_supplicant fails to start
(name . Ludovic Courtès)(address . ludo@gnu.org)
877eb7grfn.fsf@elephly.net
Ludovic Courtès <ludo@gnu.org> writes:

> Ricardo and sirgazil reported on IRC that the ‘wpa-supplicant’ would
> (sometimes?) fail to start for the in the installation image on their
> laptop.

What I observed was that the wpa-supplicant *service* would not start
and starting it manually would fail because a wpa_supplicant *process*
would already be running.

Upon killing the stray process the service could be started.

--
Ricardo
S
Installer: wpa_supplicant fails to start
(name . 35550)(address . 35550@debbugs.gnu.org)
16a80e1461c.1292821a927951.2710363725930741694@zoho.com
In my case, "wpa-supplicant" would *always* fail to start (I rebooted and tried the graphical installer around 4 times).

Killing "wpa-supplicant" and then running "herd start networking" made it possible for the graphical installer to display the steps where you select Wireless or Wired, and then continue normally.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><meta content="text/html;charset=UTF-8" http-equiv="Content-Type"></head><body ><div style="font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 10pt;"><div><pre class="message"><span class="font" style="font-family: verdana, sans-serif;">In my case, "wpa-supplicant" would *always* fail to start (I rebooted and tried the graphical installer around 4 times).</span><br></pre><div>Killing "wpa-supplicant" and then running "herd start networking" made it possible for the graphical installer to display the steps where you select Wireless or Wired, and then continue normally.<br></div><div><br></div></div><div><br></div></div><br></body></html>
L
Ludovic Courtès wrote on Sat May 04 14:42:53+0200 2019
control message for bug #35550
(address . control@debbugs.gnu.org)
87sgtu4che.fsf@gnu.org
severity 35550 important
L
Ludovic Courtès wrote on Sat May 04 14:49:55+0200 2019
Re: bug#35550: Installer: wpa_supplicant fails to start
(address . sirgazil@zoho.com)
87k1f64c5o.fsf@gnu.org
Hi,

sirgazil@zoho.com skribis:

> In my case, "wpa-supplicant" would *always* fail to start (I rebooted and tried the graphical installer around 4 times).

Could you check whether the sequence of events in /var/log/messages
always looks like the snippet I showed?

Could you also:

1. Let the installation image boot;

2. Confirm with ‘herd status wpa-supplicant’ that the ‘wpa-supplicant’
service is marked as stopped (failed to start);

3. Show the contents /var/run/wpa_supplicant.pid, and check whether
they correspond to the PID of a running ‘wpa_supplicant’ process.

?

Besides, do you have a Guix(SD) installation, recent or not, that uses
wpa_supplicant? If so, does it suffer from the same problem
(‘wpa-supplicant’ service fails to start)?

Thanks,
Ludo’.
P
pelzflorian (Florian Pelz) wrote on Sat May 04 16:07:54+0200 2019
(name . Ludovic Courtès)(address . ludo@gnu.org)
20190504140753.6wjgctw6ird7xj2b@pelzflorian.localdomain
On Sat, May 04, 2019 at 02:49:55PM +0200, Ludovic Courtès wrote:
> Hi,
>
> sirgazil@zoho.com skribis:
>
> > In my case, "wpa-supplicant" would *always* fail to start (I rebooted and tried the graphical installer around 4 times).
>
> Could you check whether the sequence of events in /var/log/messages
> always looks like the snippet I showed?
>

So for me it sometimes fails on two different machines. Actually it
started failing four times in a row now that I test this.

When it fails, I also get

On Fri, May 03, 2019 at 09:31:40PM +0200, Ludovic Courtès wrote:
> 680 May 3 16:03:40 localhost wpa_supplicant[398]: Successfully initialized wpa_supplicant
> 681 May 3 16:03:40 localhost shepherd[1]: Service wpa-supplicant could not be started.
> 682 May 3 16:03:40 localhost shepherd[1]: Service networking depends on wpa-supplicant.
> 683 May 3 16:03:40 localhost shepherd[1]: Service networking could not be started.
> 684 May 3 16:03:40 localhost wpa_supplicant[400]: dbus: Could not request service name: already registered
> 685 May 3 16:03:40 localhost wpa_supplicant[400]: Failed to initialize wpa_supplicant
> 686 May 3 16:03:45 localhost shepherd[1]: Service wpa-supplicant could not be started.



> Could you also:
>
> 1. Let the installation image boot;
>
> 2. Confirm with ‘herd status wpa-supplicant’ that the ‘wpa-supplicant’
> service is marked as stopped (failed to start);
>

$ herd status wpa-supplicant
Status of wpa-supplicant:
It is stopped.
It is enabled.
Provides (wpa-supplicant).
Requires (user-processes dbus-system loopback syslogd).
Conflicts with ().
Will be respawned.



> 3. Show the contents /var/run/wpa_supplicant.pid, and check whether
> they correspond to the PID of a running ‘wpa_supplicant’ process.
>

There is no such file.

Regards,
Florian
L
Ludovic Courtès wrote on Sat May 04 17:28:04+0200 2019
(address . 35550@debbugs.gnu.org)
87a7g22q9n.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> You would expect wpa_supplicant to create its PID file atomically: write
> it under a different name, then rename(2)… but no:

Shepherd commit 72631752149d000c8c98ae0cc66e0b0c2eda94ef changes
‘read-pid-file’ to better deal with this.

Ludo’.
L
Ludovic Courtès wrote on Sat May 04 17:32:23+0200 2019
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
877eb62q2g.fsf@gnu.org
Hi Florian,

"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> skribis:

>> Could you also:
>>
>> 1. Let the installation image boot;
>>
>> 2. Confirm with ‘herd status wpa-supplicant’ that the ‘wpa-supplicant’
>> service is marked as stopped (failed to start);
>>
>
> $ herd status wpa-supplicant
> Status of wpa-supplicant:
> It is stopped.
> It is enabled.
> Provides (wpa-supplicant).
> Requires (user-processes dbus-system loopback syslogd).
> Conflicts with ().
> Will be respawned.
>
>
>
>> 3. Show the contents /var/run/wpa_supplicant.pid, and check whether
>> they correspond to the PID of a running ‘wpa_supplicant’ process.
>>
>
> There is no such file.

At this point, could you also check:

3b. The output of “pgrep -fa wpa_supplicant”.

The PID file was probably removed by the second attempt to start the
‘wpa-supplicant’ service.

Could you also try running (as root):

while herd restart wpa-supplicant; do : ; done

to see the frequency at which the service fails to start?

Thanks!

Ludo’.
P
pelzflorian (Florian Pelz) wrote on Sat May 04 17:47:44+0200 2019
(name . Ludovic Courtès)(address . ludo@gnu.org)
20190504154743.i5sv6spf4jsbxruu@pelzflorian.localdomain
On Sat, May 04, 2019 at 05:32:23PM +0200, Ludovic Courtès wrote:
> At this point, could you also check:
>
> 3b. The output of “pgrep -fa wpa_supplicant”.
>

root@gnu ~# pgrep -fa wpa_supplicant
481 /gnu/store/ifqy2dr2hbqplv14k50dzprw3k2bz9m9-wpa-supplicant-2.7/sbin/wpa_supplicant -P/var/run/wpa_supplicant.pid -B -s -u


> The PID file was probably removed by the second attempt to start the
> ‘wpa-supplicant’ service.
>
> Could you also try running (as root):
>
> while herd restart wpa-supplicant; do : ; done
>
> to see the frequency at which the service fails to start?
>

It displays

Service wpa-supplicant is not running.
Service wpa-supplicant could not be started.

after and again and again every approximately 5 seconds.

Regards,
Florian
L
Ludovic Courtès wrote on Sat May 04 23:14:29+0200 2019
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
878svl2a8a.fsf@gnu.org
Hi,

"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> skribis:

> On Sat, May 04, 2019 at 05:32:23PM +0200, Ludovic Courtès wrote:

[...]

>> Could you also try running (as root):
>>
>> while herd restart wpa-supplicant; do : ; done
>>
>> to see the frequency at which the service fails to start?
>>
>
> It displays
>
> Service wpa-supplicant is not running.
> Service wpa-supplicant could not be started.

Sorry, to be complete, I should have asked you to first kill the
“dangling” wpa_supplicant process:

pkill -f wpa_supplicant
pgrep -fa wpa_supplicant # make sure it’s really dead
while herd restart wpa-supplicant; do : ; done

Could you try this?

TIA,
Ludo’.
P
pelzflorian (Florian Pelz) wrote on Sat May 04 23:34:03+0200 2019
(name . Ludovic Courtès)(address . ludo@gnu.org)
20190504213402.2pdyieijtpejpi4p@pelzflorian.localdomain
On Sat, May 04, 2019 at 11:14:29PM +0200, Ludovic Courtès wrote:
> Hi,
>
> "pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> skribis:
>
> > On Sat, May 04, 2019 at 05:32:23PM +0200, Ludovic Courtès wrote:
>
> [...]
>
> >> Could you also try running (as root):
> >>
> >> while herd restart wpa-supplicant; do : ; done
> >>
> >> to see the frequency at which the service fails to start?
> >>
> >
> > It displays
> >
> > Service wpa-supplicant is not running.
> > Service wpa-supplicant could not be started.
>
> Sorry, to be complete, I should have asked you to first kill the
> “dangling” wpa_supplicant process:
>
> pkill -f wpa_supplicant
> pgrep -fa wpa_supplicant # make sure it’s really dead

pgrep does not display anything.


> while herd restart wpa-supplicant; do : ; done
>
> Could you try this?
>

Now it displays

Service wpa-supplicant has been stopped.
Service wpa-supplicant has been started.

again and again incessantly, one immediately after the other with no
time in between.

By the way, the network connection only failed sometimes when I tested
Guix 1.0.0, but since I started using a Guix disk image from a more
recent Guix, wpa-supplicant apparently always fails. Strange.

Regards,
Florian
L
Ludovic Courtès wrote on Sun May 05 18:29:46+0200 2019
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
87lfzkzwxx.fsf@gnu.org
Hello,

I’ve been able to test it on the bare metal as well and I confirm: I
booted the install image a few times and the ‘wpa-supplicant’ service
would be stopped every time, with a dangling ‘wpa_supplicant’ process
running. Given that this seems to be systematic, I’m not sure my PID
file atomicity hypothesis explains everything.

I’ve tried running:

pkill wpa_supplicant
while herd start wpa-supplicant ; do herd stop wpa-supplicant ; done

and that always works (i.e., no “failed to start ‘wpa-supplicant’”
error.)

If I run in parallel this:

while touch /var/run/wpa_supplicant.pid ; do : ; done

then it occasionally “fails to start”, but not very often.

To be continued…

Ludo’.
L
Ludovic Courtès wrote on Sun May 05 22:26:01+0200 2019
(address . 35550@debbugs.gnu.org)
87zho0y7fq.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> You would expect wpa_supplicant to create its PID file atomically: write
> it under a different name, then rename(2)… but no:

Regarding this PID file atomicity issue, a few other data points: Tor,
syslogd (GNU Inetutils), sshd (OpenSSH), and dbus-daemon all do the
exact same non-atomic PID file creation as wpa_supplicant.

Libdaemon (used by Avahi) does something more fancy: it acquires an
exclusive write lock (fcntl with F_SETLKW) on the PID file right after
opening it, but since shepherd does not attempt to acquire locks on PID
file, it has no effect.

systemd and GNU Pies don’t take any precautions while reading PID
files. The assumption seems to be that if the file exists, then it must
contain a valid PID.

So perhaps the problem we see with ‘wpa_supplicant’ is elsewhere.

Ludo’.
L
Ludovic Courtès wrote on Mon May 06 00:21:26+0200 2019
(address . 35550@debbugs.gnu.org)
87k1f4y23d.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> So why is ‘wpa-supplicant’ marked as failing to start on the first
> attempt?
>
> The only reason I can think of is if ‘read-pid-file’ from (shepherd
> service) returns immediately and returns #f instead of a number. That
> can actually happen if the PID file exists but is empty (or contains
> garbage).

I’ve produced an ISO with the patch below and ran it on the bare metal
to get confirmation (too bad the bug doesn’t show in QEMU :-/). Indeed,
‘read-pid-file’ for /var/run/wpa_supplicant.pid systematically reads the
empty string the first time the ‘wpa-supplicant’ service is started.

(After that, if we kill the process and try to restart the service, the
problem doesn’t show up.)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 53437b6..e21492e 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -717,9 +717,12 @@ otherwise return the number that was read (a PID)."
(let loop ()
(catch 'system-error
(lambda ()
- (string->number
- (string-trim-both
- (call-with-input-file file get-string-all))))
+ (define str
+ (call-with-input-file file get-string-all))
+
+ (local-output (l10n "read-pid-file ~s -> ~s")
+ file str)
+ (string->number (string-trim-both str)))
(lambda args
(let ((errno (system-error-errno args)))
(if (= ENOENT errno)
With the second patch below, I confirm that the ‘wpa-supplicant’ starts
correctly. We can see in /var/log/messages that ‘read-pid-file’ first
reads the empty string from /var/run/wpa_supplicant.pid, then tries
again, and gets a valid PID on the second attempt.

It’s surprising that the timing is always like that, and only on the
bare metal, but that’s the way it is.

It’d be great if you could do some testing with the patch below. Then I
guess we’ll push a Shepherd release with this fix.

I wonder if this could also explain
<https://issues.guix.info/issue/30993>.

Thanks,
Ludo’.
diff --git a/gnu/packages/admin.scm b/gnu/packages/admin.scm
index dfc3467bf8..e1dd248679 100644
--- a/gnu/packages/admin.scm
+++ b/gnu/packages/admin.scm
@@ -188,7 +188,8 @@ and provides a \"top-like\" mode (monitoring).")
version ".tar.gz"))
(sha256
(base32
- "1ys2w83vm62spr8bx38sccfdpy9fqmj7wfywm5k8ihsy2k61da2i"))))
+ "1ys2w83vm62spr8bx38sccfdpy9fqmj7wfywm5k8ihsy2k61da2i"))
+ (patches (search-patches "shepherd-debug.patch"))))
(build-system gnu-build-system)
(arguments
'(#:configure-flags '("--localstatedir=/var")))
diff --git a/gnu/packages/patches/shepherd-debug.patch b/gnu/packages/patches/shepherd-debug.patch
new file mode 100644
index 0000000000..2fd97cc578
--- /dev/null
+++ b/gnu/packages/patches/shepherd-debug.patch
@@ -0,0 +1,43 @@
+diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
+index 53437b6..bef8f42 100644
+--- a/modules/shepherd/service.scm
++++ b/modules/shepherd/service.scm
+@@ -715,21 +715,28 @@ number. Return #f if FILE was not created or does not contain a number;
+ otherwise return the number that was read (a PID)."
+ (define start (current-time))
+ (let loop ()
++ (define (retry)
++ (and (< (current-time) (+ start max-delay))
++ (begin
++ ;; FILE does not exist yet, so wait and try again.
++ ;; XXX: Ideally we would yield to the main event loop
++ ;; and/or use inotify.
++ (sleep 1)
++ (loop))))
++
+ (catch 'system-error
+ (lambda ()
+- (string->number
+- (string-trim-both
+- (call-with-input-file file get-string-all))))
++ (define str
++ (call-with-input-file file get-string-all))
++
++ (local-output (l10n "read-pid-file ~s -> ~s")
++ file str)
++ (or (string->number (string-trim-both str))
++ (retry)))
+ (lambda args
+ (let ((errno (system-error-errno args)))
+ (if (= ENOENT errno)
+- (and (< (current-time) (+ start max-delay))
+- (begin
+- ;; FILE does not exist yet, so wait and try again.
+- ;; XXX: Ideally we would yield to the main event loop
+- ;; and/or use inotify.
+- (sleep 1)
+- (loop)))
++ (retry)
+ (apply throw args)))))))
+
+ (define* (exec-command command
P
pelzflorian (Florian Pelz) wrote on Mon May 06 21:47:41+0200 2019
(name . Ludovic Courtès)(address . ludo@gnu.org)
20190506194741.qm7uha4fomsngdga@pelzflorian.localdomain
On Mon, May 06, 2019 at 12:21:26AM +0200, Ludovic Courtès wrote:
> It’d be great if you could do some testing with the patch below. Then I
> guess we’ll push a Shepherd release with this fix.
>

The second patch (the one for Guix is the only one I have tested)
works fine on the system which always failed before since sometime
after 1.0.

One time wpa-supplicant worked fine but it had an error establishing
the internet connection, but I am quite certain it was an unrelated
hardware issue.

Regards,
Florian
D
Danny Milosavljevic wrote on Mon May 06 22:00:30+0200 2019
(name . Ludovic Courtès)(address . ludo@gnu.org)
20190506220030.4a608b37@scratchpost.org
Hi Ludo,

what happens when the loop reads the pid file when it contains just half of a
numeral? It won't detect it, right?
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEds7GsXJ0tGXALbPZ5xo1VCwwuqUFAlzQkl4ACgkQ5xo1VCww
uqXR2QgAkyT7VO/EanKhg81l0bDe7A0rLjtgoR83/+ijYjHXcqwnKYqsixD+Eh7j
8sbxvvAW6rP0Th+ngbu02ZCxkn0vWmgQdpJGYMOXOUwCo010gK56w0S+LYW4PvYq
j3ppUP+x+QZz0AUvuio/GIiMc67ede1PaFpNqKGJdkgqMCX4Gn04xUQwOOILi6tD
8Sj8m6SAvhrmgKVt9+Qu7yjvR+Yd1k+CvZIVafcI9vOTG26P/xg7cnBxpwRne7wE
iDhr+rOE0K6++sQf48hl798zeve850Rk1/LZj4EzeJNUJRU5utzPzu8OiuPSysO9
KgkepMfrgi8d26kQ0WG0WG7pETICQQ==
=UahM
-----END PGP SIGNATURE-----

L
Ludovic Courtès wrote on Tue May 07 09:53:54+0200 2019
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
87r29aaeel.fsf@gnu.org
Hi Florian,

"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> skribis:

> On Mon, May 06, 2019 at 12:21:26AM +0200, Ludovic Courtès wrote:
>> It’d be great if you could do some testing with the patch below. Then I
>> guess we’ll push a Shepherd release with this fix.
>>
>
> The second patch (the one for Guix is the only one I have tested)
> works fine on the system which always failed before since sometime
> after 1.0.
>
> One time wpa-supplicant worked fine but it had an error establishing
> the internet connection, but I am quite certain it was an unrelated
> hardware issue.

Awesome, thanks for testing.

Note that I made similar changes in shepherd, which I’ll probably
release as 0.6.1 so we can use it in Guix 1.0.1:

https://git.savannah.gnu.org/cgit/shepherd.git/log

Ludo’.
L
Ludovic Courtès wrote on Tue May 07 10:05:08+0200 2019
(name . Danny Milosavljevic)(address . dannym@scratchpost.org)
87imumadvv.fsf@gnu.org
Hi Danny,

Danny Milosavljevic <dannym@scratchpost.org> skribis:

> what happens when the loop reads the pid file when it contains just half of a
> numeral? It won't detect it, right?

Correct.

I’m proposing the addition below to be on the verrrry safe side.
WDYT?

The weird thing, as I mentioned earlier, is that systemd and Pies do not
protect against truncated PID files, and I couldn’t find any
“documentation” of the problem on the intertubes. For systemd it’s
maybe less of a problem since services are started in a cgroup, I think.

Thanks,
Ludo’.
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index b937609..ef27d88 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -709,10 +709,16 @@ results."
set when starting a service."
(environ))

-(define* (read-pid-file file #:key (max-delay 5))
+(define* (read-pid-file file #:key (max-delay 5)
+ (validate-pid? #f))
"Wait for MAX-DELAY seconds for FILE to show up, and read its content as a
number. Return #f if FILE was not created or does not contain a number;
-otherwise return the number that was read (a PID)."
+otherwise return the number that was read (a PID).
+
+When VALIDATE-PID? is true, succeed if and only if the number that was read is
+the PID of an existing process in the current PID namespace. This test cannot
+be used if FILE might contain a PID from another PID namespace (i.e., the
+daemon writing FILE is running in a separate PID namespace.)"
(define start (current-time))

(let loop ()
@@ -736,11 +742,13 @@ otherwise return the number that was read (a PID)."
(try-again))
((? integer? pid)
;; It's possible, though unlikely, that PID is not a valid PID, for
- ;; instance because writes to FILE did not complete. However, we
- ;; don't do (kill pid 0) because if the process lives in a separate
- ;; PID namespace, then PID is probably invalid in our own
- ;; namespace.
- pid)))
+ ;; instance because writes to FILE did not complete. When
+ ;; VALIDATE-PID? is true, check that PID is valid in the current
+ ;; PID namespace.
+ (if (or (not validate-pid?)
+ (catch-system-error (kill pid 0) #t))
+ pid
+ (try-again)))))
(lambda args
(let ((errno (system-error-errno args)))
(if (= ENOENT errno)
@@ -931,7 +939,8 @@ start."
environment-variables)))
(if pid-file
(match (read-pid-file pid-file
- #:max-delay pid-file-timeout)
+ #:max-delay pid-file-timeout
+ #:validate-pid? #t)
(#f
(catch-system-error (kill pid SIGTERM))
#f)
G
Giovanni Biscuolo wrote on Wed May 08 13:32:59+0200 2019
Re: bug#35630: graphical install, no network device found (HP EliteDesk 800 G1)
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
87v9yl6v10.fsf@roquette.mug.biscuolo.net
Hi Florian,

thank you for pointing me to the right bug report!

I checked the bug reports list via web [1] but did not link
wpa-supplicant to my issue since I did not realized that such an issue
also affects wired connections

"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> writes:

> I hope this is an instance of <https://issues.guix.info/issue/35550>,
> that is “herd status wpa-supplicant” shows that wpa-supplicant is
> stopped instead of running? I hope there will be a new release soon
> where this is fixed.

yes, this is definitely an instance of #35550

I confirm I see the very same wpa_supplicant log messages in
/var/log/messages as reported by Sirgazil and pelzflorian

the following suggested workaround (in a text console activated via
CTRL-ALT-F3) works for me:

--8<---------------cut here---------------start------------->8---
pkill wpa_supplicant
herd start networking
--8<---------------cut here---------------end--------------->8---

after that I was able to restart the "Network interface" installer step
(by choosing "Exit" and selecting it again in the proposed dialog) and
was able to select the wired interface

Thanks! Gio'


[1] I'm going to subscribe bug-guix :-)

--
Giovanni Biscuolo

Xelera IT Infrastructures
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEERcxjuFJYydVfNLI5030Op87MORIFAlzSvmsACgkQ030Op87M
ORKJHg//Vt8TSxusa37j3Y22Pw2xYySCrW/sf9NBwDzoxa7eJKJBxMHeFhEIbSL3
vVe5WQ9eb+ZOeARacPBmWrR/JgLp1RTTtmgToUWmcRhT4gjB0NTt6H4nLonddSXZ
Z/2jG/w9yDwBdh6/74Y8kAVJvnld76eRpvLqmFbr3az5BOllXF5Zsk4tRft0SQ9t
C1A+cC/Wh19MKtvwQ+guqua9khgeMuCp3IKY9SvpPi5w+WxGAS4NSUQc6i9JC2Dv
xSEbMIhCJwHBw3yoZesM1yojYgIb/kXP9cueUtuh7TBQW6cX1rjCTkRNoF5cegUX
230BvqCr17DldU9Ew2VbRw2LwqQZRBqtvkxE29TwecrzHnbHZe1P4emIMHEusnOW
Vh87YuCfh6Oxz4o8EaYucHr8C8C3KaiHDyBj/yWmwcs+9sKXcLuIJOtqjrRdOxdL
dI2S7ZTMuefkyGPcxILXvaAow+UFnNjzgdHlD98YfRLBxdwadX4S1u9EqeAT5rHa
maIxHeO/fIvHp22MAG+wm/aGfndz8yvFIYEam4Xp/ADz1UGj+FVq0tUVrM1Rb02K
sDYerjPuRL4PodVxpcC41fl/tn8z/17JoS7/Qmeim3I94JIU+k/hL0Y2u9C8s5Sw
fXnMCvLzQyBAJZtKUC65Nt+1CaFB1sblTucVuRQHi7d8rCpeaME=
=NmY7
-----END PGP SIGNATURE-----
L
Ludovic Courtès wrote on Wed May 08 16:32:00+0200 2019
Re: bug#35550: Installer: wpa_supplicant fails to start
(name . Danny Milosavljevic)(address . dannym@scratchpost.org)
874l65vwyn.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> I’m proposing the addition below to be on the verrrry safe side.

Pushed as Shepherd commit 1ebea0a6e4c6ff11212eda348072acf9c379e7b2.

Unless something goes wrong, I’ll make Shepherd commit
fbb9c3fac745552eaf0f354bd0134cca7027bf17 version 0.6.1 in the coming
days (well, with an updated NEWS file.)

Thanks,
Ludo’.
L
Ludovic Courtès wrote on Thu May 09 00:25:59+0200 2019
(name . Danny Milosavljevic)(address . dannym@scratchpost.org)
87imuktwg8.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> I’m proposing the addition below to be on the verrrry safe side.
>
> Pushed as Shepherd commit 1ebea0a6e4c6ff11212eda348072acf9c379e7b2.
>
> Unless something goes wrong, I’ll make Shepherd commit
> fbb9c3fac745552eaf0f354bd0134cca7027bf17 version 0.6.1 in the coming
> days (well, with an updated NEWS file.)

If you want to re-test that wpa_supplicant does start correctly with the
latest Shepherd, you can do so by applying the patch below to Guix.

Ludo’.
diff --git a/gnu/packages/admin.scm b/gnu/packages/admin.scm
index dfc3467bf8..6cabea0fb9 100644
--- a/gnu/packages/admin.scm
+++ b/gnu/packages/admin.scm
@@ -181,22 +181,30 @@ and provides a \"top-like\" mode (monitoring).")
(define-public shepherd
(package
(name "shepherd")
- (version "0.6.0")
+ (version "0.6.1pre1")
(source (origin
- (method url-fetch)
- (uri (string-append "mirror://gnu/shepherd/shepherd-"
- version ".tar.gz"))
+ (method git-fetch)
+ (uri (git-reference
+ (url "https://git.savannah.gnu.org/git/shepherd.git")
+ (commit "fbb9c3fac745552eaf0f354bd0134cca7027bf17")))
(sha256
(base32
- "1ys2w83vm62spr8bx38sccfdpy9fqmj7wfywm5k8ihsy2k61da2i"))))
+ "0w7dx2i04qgykzhz4qir3rxnrdfza3q7kzh5z4mpmgjrp08pfnrn"))))
(build-system gnu-build-system)
(arguments
- '(#:configure-flags '("--localstatedir=/var")))
+ '(#:configure-flags '("--localstatedir=/var"
+ "ac_cv_path_MSGMERGE=true")))
(native-inputs
`(("pkg-config" ,pkg-config)

;; This is the Guile we use as a cross-compiler...
- ("guile" ,guile-2.2)))
+ ("guile" ,guile-2.2)
+
+ ("texinfo" ,texinfo)
+ ("help2man" ,help2man)
+ ("gettext" ,gnu-gettext)
+ ("autoconf" ,autoconf)
+ ("automake" ,automake)))
(inputs
;; ... and this is the one that appears in shebangs when cross-compiling.
`(("guile" ,guile-2.2)
P
pelzflorian (Florian Pelz) wrote on Thu May 09 08:46:13+0200 2019
(name . Ludovic Courtès)(address . ludo@gnu.org)
20190509064613.cci7yd44toggezln@pelzflorian.localdomain
On Thu, May 09, 2019 at 12:25:59AM +0200, Ludovic Courtès wrote:
> If you want to re-test that wpa_supplicant does start correctly with the
> latest Shepherd, you can do so by applying the patch below to Guix.
>

Works fine too on Macbook and PC (using a non-ISO disk image).

Regards,
Florian
L
Ludovic Courtès wrote on Thu May 09 12:18:12+0200 2019
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
87ef58exsr.fsf@gnu.org
"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> skribis:

> On Thu, May 09, 2019 at 12:25:59AM +0200, Ludovic Courtès wrote:
>> If you want to re-test that wpa_supplicant does start correctly with the
>> latest Shepherd, you can do so by applying the patch below to Guix.
>>
>
> Works fine too on Macbook and PC (using a non-ISO disk image).

Great, thanks for testing!
L
Ludovic Courtès wrote on Sat May 11 20:13:25+0200 2019
(name . Danny Milosavljevic)(address . dannym@scratchpost.org)
875zqgx3ju.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> I’m proposing the addition below to be on the verrrry safe side.
>
> Pushed as Shepherd commit 1ebea0a6e4c6ff11212eda348072acf9c379e7b2.
>
> Unless something goes wrong, I’ll make Shepherd commit
> fbb9c3fac745552eaf0f354bd0134cca7027bf17 version 0.6.1 in the coming
> days (well, with an updated NEWS file.)

Commit 3b8699f9c2a9f88d44a5320398b4c968c432429d upgrades the Shepherd to
0.6.1, which fixes the PID file race condition we’ve been talking about.

Closing!

Ludo’.
Closed
?
Your comment

Comments via the web interface are not currently supported. To comment on this conversation send email to 35550@debbugs.gnu.org

6 participants
  • Danny Milosavljevic
  • Giovanni Biscuolo
  • Ludovic Courtès
  • pelzflorian (Florian Pelz)
  • Ricardo Wurmus
  • sirgazil
Owner
unassigned
Severity
important
Status
Done