network-manager shepherd services does not wait to be online

  • Done
Details
4 participants
  • Bone Baboon
  • Mark H Weaver
  • Bruno Victal
  • raid5atemyhomework
Owner
unassigned
Submitted by
raid5atemyhomework
Severity
normal
raid5atemyhomework wrote on 19 Mar 2021 04:38
To: bug-guix@gnu.org
Message-ID: <PdivRceeZdWO61FRE9ZHSeRqlynUianTTWX-t15FaDI2eWjZ-wMnfl7mQwz6tfvCcky2lCugWsvlpFw9jRI1u0ZPOueJXGA-MxxrmX1SRLk=@protonmail.com>
I have a small number of daemons that need access to the network at startup. I have configured their Shepherd services to require `networking`.

However, to my puzzlement, I consistently find that they are unable to access the network at startup. One daemon dies (and gets respawned so often that it sometimes gets disabled by Shepherd), the other daemon just keeps running without having set up the server that I need it to expose.

Thus, in many cases whenever I reboot I have to manually `herd enable` and `herd restart` the first daemon and `herd restart` the second. This is fairly bad since I want to be able to leave this server alone and have it survive power interruptions etc.

Checking on other systems, I stumbled on this file on SystemD-based systems:

```systemd
[Unit]
Description=Network Manager Wait Online
Documentation=man:nm-online(1)
Requires=NetworkManager.service
After=NetworkManager.service
Before=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/bin/nm-online -s -q --timeout=30
RemainAfterExit=yes

[Install]
WantedBy=network-online.target
```

Searching the Guix source code, I can't find any use of the `nm-online` command. So I think that, when using the `network-manager` service type, the `networking` provision is considered started even though networking isn't actually online yet.

I would like to propose this change:

```diff
--- a/gnu/services/networking.scm
+++ b/gnu/services/networking.scm
@@ -1106,17 +1106,22 @@ and @command{wicd-curses} user interfaces."
            (documentation "Run the NetworkManager.")
            (provision '(networking))
            (requirement '(user-processes dbus-system wpa-supplicant loopback))
-           (start #~(make-forkexec-constructor
-                     (list (string-append #$network-manager
-                                          "/sbin/NetworkManager")
-                           (string-append "--config=" #$conf)
-                           "--no-daemon")
-                     #:environment-variables
-                     (list (string-append "NM_VPN_PLUGIN_DIR=" #$vpn
-                                          "/lib/NetworkManager/VPN")
-                           ;; Override non-existent default users
-                           "NM_OPENVPN_USER="
-                           "NM_OPENVPN_GROUP=")))
+           (start #~(let ((constructor (make-forkexec-constructor
+                                         (list (string-append #$network-manager
+                                                              "/sbin/NetworkManager")
+                                               (string-append "--config=" #$conf)
+                                               "--no-daemon")
+                                         #:environment-variables
+                                         (list (string-append "NM_VPN_PLUGIN_DIR=" #$vpn
+                                                              "/lib/NetworkManager/VPN")
+                                               ;; Override non-existent default users
+                                               "NM_OPENVPN_USER="
+                                               "NM_OPENVPN_GROUP="))))
+                      (lambda args
+                        (let ((pid (apply constructor args)))
+                          (invoke/quiet (string-append #$network-manager "/bin/nm-online")
+                                        "-s" "-q" "--timeout=30")
+                          pid))))
            (stop #~(make-kill-destructor))))))))

 (define network-manager-service-type
```


Of course, the big problem is that Shepherd is single-threaded and `nm-online` will block all other bootup.
Mark H Weaver wrote on 19 Mar 2021 13:07
Message-ID: <87r1kbmjmc.fsf@netris.org>
Hi,

raid5atemyhomework via Bug reports for GNU Guix <bug-guix@gnu.org>
writes:

> I have a small number of daemons that need access to the network at
> startup. I have configured their Shepherd services to require
> `networking`.
>
> However, to my puzzlement, I consistently find that they are unable to
> access the network at startup. One daemon dies (and gets respawned so
> often that it sometimes gets disabled by Shepherd), the other daemon
> just keeps running without having set up the server that I need it to
> expose.
>
> Thus, in many cases whenever I reboot I have to manually `herd enable`
> and `herd restart` the first daemon and `herd restart` the second.
> This is fairly bad since I want to be able to leave this server alone
> and have it survive power interruptions etc.
[...]
> Of course, the big problem is that Shepherd is single-threaded and
> `nm-online` will block all other bootup.

That's not good. For the sake of users who are not always connected to
the internet, I'd strongly prefer for the Guix boot process of a desktop
system to *not* be blocked for 30 seconds when there's no active
internet connection.

How about leaving "networking" as it is now, and instead adding a new
service called "network-online" or similar, that requires "networking"
and then waits until a network connection is established?

What do you think?

Mark
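
The separate service suggested above might look roughly like the following one-shot Shepherd service. This is only a sketch under stated assumptions, not what was eventually committed: the `network-online` provision name comes from this message, the `nm-online` flags are the ones from the systemd unit quoted earlier in the thread, and with a single-threaded Shepherd the `system*` call would still block other services while it waits.

```scheme
;; Hypothetical sketch of a separate "network-online" service.
(use-modules (gnu services)
             (gnu services shepherd)
             (gnu packages gnome)       ; provides the network-manager package
             (guix gexp))

(define network-online-service
  (simple-service 'network-online shepherd-root-service-type
    (list (shepherd-service
           (documentation "Wait for NetworkManager to finish starting up.")
           ;; 'network-online is an assumed name, not a stock provision.
           (provision '(network-online))
           (requirement '(networking))
           ;; Nothing keeps running after the wait, so mark it one-shot.
           (one-shot? #t)
           (start #~(lambda _
                      ;; nm-online exits non-zero on timeout; report that
                      ;; as a failed start rather than raising an error.
                      (zero? (system* #$(file-append network-manager
                                                     "/bin/nm-online")
                                      "-s" "-q" "--timeout=30"))))))))
```

Services that need the network would then add `network-online` to their `requirement` list, while everything else keeps depending on `networking` alone.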
raid5atemyhomework wrote on 19 Mar 2021 17:03
To: Mark H Weaver <mhw@netris.org>, 47253@debbugs.gnu.org
Message-ID: <Qw8LEYPetewimqDmATAmRMsL3NFdSqxQjyMwq6bvH9WEeAeLGcRFwDb5dZ6gGXHECMhFV6NcTkrRu9K_ZS5TgX8zIFr29swilZct5lOGVaw=@protonmail.com>
Hello Mark,

> > Of course, the big problem is that Shepherd is single-threaded and
> > `nm-online` will block all other bootup.
>
> That's not good. For the sake of users who are not always connected to
> the internet, I'd strongly prefer for the Guix boot process of a desktop
> system to not be blocked for 30 seconds when there's no active
> internet connection.
>
> How about leaving "networking" as it is now, and instead adding a new
> service called "network-online" or similar, that requires "networking"
> and then waits until a network connection is established?
>
> What do you think?


Ideally the `init` system should be multithreaded, such that anything that isn't dependent on `networking` does not get delayed but gets started as soon as its dependencies complete.

In particular, `transmission-daemon-service-type` creates a Shepherd service that depends on `networking`, and it is in fact the second daemon I mentioned: it fails to bind its command port (9091) properly, so the daemon has to be restarted each time. So if we use a separate `network-online` shepherd provision, `transmission-daemon-service-type` also needs to be modified. (On my system I have a separate provision similar to your `network-online` idea, and I wrote my own shepherd service for `transmission-daemon` just to add this requirement.)
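
As a rough sketch of that kind of workaround, a hand-written Shepherd service can list the extra provision in its `requirement` field. Everything specific here is an assumption for illustration: the `my-daemon` name, the `/path/to/my-daemon` path and its `--foreground` flag are hypothetical, and `network-online` is the provision name floated in this thread, which some other service would have to provide.

```scheme
;; Hypothetical sketch: a daemon that is only started once both
;; 'networking and an assumed 'network-online provision are up.
(use-modules (gnu services)
             (gnu services shepherd)
             (guix gexp))

(define my-daemon-service
  (simple-service 'my-daemon shepherd-root-service-type
    (list (shepherd-service
           (documentation "Run my-daemon once the network is online.")
           (provision '(my-daemon))
           ;; 'network-online is not a stock provision; it is assumed to
           ;; be provided by another service on the system.
           (requirement '(networking network-online))
           ;; Placeholder command line; substitute the real daemon.
           (start #~(make-forkexec-constructor
                     (list "/path/to/my-daemon" "--foreground")))
           (stop #~(make-kill-destructor))))))
```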

With a separate `network-online` shepherd provision we would also need to audit all the other network-requiring daemons to see if similar problems exist.

As well, `networking` is provided by multiple possible services, so if we add a separate `network-online` we also need to modify the other options.

* `network-manager-service-type`.
* `dhcp-client-service-type`.
* `wicd-service-type`.
* `connman-service-type`.

For that matter, we probably need to review the other options above, as they might just start the networking daemon without ensuring that network setup has actually completed. I use `network-manager-service-type` as part of `%desktop-services`, but if Guix doesn't handle this issue properly for NetworkManager then it probably doesn't handle it properly for the other options either; in all likelihood the network interfaces are not yet available just after the networking shepherd services are started.

In addition --- do we always have a `network-online` shepherd service, or not?

* Each of `network-manager-service-type`, `dhcp-client-service-type`, `wicd-service-type`, and `connman-service-type` instantiates both a `networking` and a `network-online` shepherd provision.
  * Then other network-requiring services can always assume that `network-online` exists.
  * However, not-always-online users would always find that `shepherd` completion is delayed.
    * This manifests as `herd` commands not responding until the wait-to-be-online timeout ends.
* We have separate `network-manager-online-service-type`, `dhcp-client-online-service-type`, `wicd-online-service-type`, and `connman-online-service-type` that provide the `network-online` shepherd provision for the corresponding `networking` backend.
  * Thus, not-always-online users would omit the `*-online-*` service type in order not to suffer the wait.
  * However, the user has to know to add the *corresponding* service type as well if they have to use a daemon like `transmission-daemon`.
  * Do we add `network-manager-online-service-type` to `%desktop-services`?
    * I think we should, as most `%desktop-services`-using users will be mostly online anyway, and they are the ones most likely to want to start other network-using services as well.
* We somehow implement polymorphic service types so that services like `transmission-daemon` have a `(service-extension i-need-network-online-service-type (const #f))`, which only instantiates the `network-online` provision, appropriately for the network backend, if at least one service requires it.
  * Probably a lot more code and design and nerd wars about the best possible design and delays and ...

What do *you* think?


Note as well that in the SystemD case, `NetworkManager-wait-online.service` is typically always enabled, and when I boot my system on SystemD-based OSs without any network available I don't experience any network-going-up delays during boot (at least in the last few years; I do remember Ubuntu having that problem in the 2000s).

In addition, `nm-online -s` has this:

> -s | --wait-for-startup
> Wait for NetworkManager startup to complete, rather than waiting for network connectivity specifically. Startup is considered complete once NetworkManager has activated (or
> attempted to activate) every auto-activate connection which is available given the current network state. (This is generally only useful at boot time; after startup has
> completed, nm-online -s will just return immediately, regardless of the current network state.)

My interpretation of the above is that `-s` means that NetworkManager has *tried* to activate, not necessarily that there is an actual network connection, but I am not an expert on NetworkManager.

Thanks
raid5atemyhomework
Mark H Weaver wrote on 20 Mar 2021 09:07
To: raid5atemyhomework <raid5atemyhomework@protonmail.com>, 47253@debbugs.gnu.org
Message-ID: <87h7l6l03c.fsf@netris.org>
Hi,

Earlier, I wrote:
>> How about leaving "networking" as it is now, and instead adding a new
>> service called "network-online" or similar, that requires "networking"
>> and then waits until a network connection is established?

I withdraw my proposal for a separate "network-online" service. It was
a half-baked idea made in haste. Now that I've looked, I see that
almost every service in Guix that requires 'networking' should
arguably[*] wait until the network comes up before starting up.
Moreover, now that I think about it, I'm not sure what the use case
would be for requiring 'networking', if not to wait for the network to
come up.

My immediate concern here is to avoid blocking the startup of a typical
Guix desktop or laptop system for 30 seconds if there's no network
connection, and more generally to keep Guix working well for users like
myself who are not "always online".

I haven't yet looked into the details, but at first glance, I'm inclined
to agree with you that the right place to fix this is in Shepherd.
Somehow, it ought to be possible to delay the startup of services that
require 'networking', without delaying anything else.

Mark


[*] I'll note, however, that merely waiting up to 30 seconds (or
whatever timeout you choose) is not, in itself, a robust solution. What
happens if the network is down for more than 30 seconds? What if it
goes down after 'nm-online' checks, but before the dependent service has
finished starting? Also, if a service fails to handle lack of network
when it starts, it makes me wonder whether it properly handles a
prolonged network failure while it's running. It seems to me that the
only fully satisfactory solution is for each service to robustly handle
network failures at any time, although I acknowledge that workarounds
are needed in the meantime.
raid5atemyhomework wrote on 20 Mar 2021 11:15
To: Mark H Weaver <mhw@netris.org>, 47253@debbugs.gnu.org
Message-ID: <zZOWw1MKUCcZb18KAwykQUH41yrZI78ONRlm3uWN-lBqtGSSaTTp_vLHvJL4mvke4_gHUAR5yG5UIN67eWHKam-MO0xB0u_XmJMWH_RbNmo=@protonmail.com>
Hello Mark,

> [*] I'll note, however, that merely waiting up to 30 seconds (or whatever timeout you choose) is not, in itself, a robust solution. What
> happens if the network is down for more than 30 seconds? What if it
> goes down after 'nm-online' checks, but before the dependent service has
> finished starting?

The sysad has to go look at what is wrong and fix it, then restart services manually as needed. Presumably the sysad is competent enough to care for the hardware so this doesn't occur (too often).

What this avoids is the case where everything in the hardware setup (cables, NIC, router, hub, router config, etc.) is 100% fine but a reboot of the system for any reason causes services starting at boot to fail to start properly. Competent sysads will set up alarm bells for when an important daemon is not running. But if such alarm bells keep getting set off during a server restart, it gets annoying and makes the sysad pay less attention to alarm bells that *are* important enough for them to check the hardware setup.

So the common 30-second timeout used in SystemD is a fairly good compromise anyway. Probably your alarm bells check things hourly or so, and exiting after 30 seconds allows other services (e.g. a direct X server on the server, perhaps?) to start as well, so a sysad can sit at the console and work the issue directly. It's not perfect, but it's good enough for most things.

> Also, if a service fails to handle lack of network
> when it starts, it makes me wonder whether it properly handles a
> prolonged network failure while it's running. It seems to me that the
> only fully satisfactory solution is for each service to robustly handle
> network failures at any time, although I acknowledge that workarounds
> are needed in the meantime.

Indeed, and the Guix substituter for example is fairly brittle against internet connectivity problems, not just at the local networking level, but anywhere from the local network connection all the way to ci.guix.gnu.org.

Thanks
raid5atemyhomework
raid5atemyhomework wrote on 23 Jul 2021 17:27
To: 47253@debbugs.gnu.org
Message-ID: <-7ecsGr8nG6wxU0N5Mlc6p9uS8imKF8ikIq3HQvrP9IUYDMggsFlAO9ZZuPwSVU5MVsefN4wFhw4nV47gjKzpE2bK6STGo_biKT22KUoXfw=@protonmail.com>
Is there any chance any thought will be given over to this, or am I stuck trying to work around a single-threaded "does the job, but not well" Shepherd?

I'm beginning to wonder if just using SystemD would work better, especially since it's so popular nearly every daemon package includes support for it anyway.
Bone Baboon wrote on 24 Jul 2021 13:56
To: raid5atemyhomework <raid5atemyhomework@protonmail.com>
Message-ID: <877dhgsz6v.fsf@disroot.org>
raid5atemyhomework via Bug reports for GNU Guix writes:

> Is there any chance any thought will be given over to this, or am I stuck trying to work around a single-threaded "does the job, but not well" Shepherd?
>
> I'm beginning to wonder if just using SystemD would work better, especially since it's so popular nearly every daemon package includes support for it anyway.

There appear to be previous email threads on the guix-devel mailing
list that you may find interesting. Just search for systemd and there
are several results.

Bruno Victal wrote on 11 Mar 2023 00:28
To: control@debbugs.gnu.org
Message-ID: <e856200f-9f2a-2248-db1f-40fa842f56d8@makinata.eu>
close 47253
close 60300

quit

---

Fixed with commit d04955972e42bd85ba6137625e09e9e31de52f72.