‘static-networking’ fails to start

  • Done
Details
6 participants
  • Felix Lechner
  • Leo Nikkilä
  • Ludovic Courtès
  • Ludovic Courtès
  • Matt Wette
  • Fabio Natali
Owner
unassigned
Submitted by
Ludovic Courtès
Severity
important
Ludovic Courtès wrote on 15 Jul 2023 22:04
(address . bug-guix@gnu.org)
87pm4tuej8.fsf@inria.fr
Hi!

On the machine that exhibited https://issues.guix.gnu.org/63516, I’m
now seeing this, with the fix from commit
26602f4063a6e0c626e8deb3423166bcd0abeb90:

[ 121.017522] shepherd[1]: Starting service user-homes...
[ 121.049038] tg3 0000:05:00.0 eth0: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address b8:cb:29:b5:1c:3a
[ 121.049042] tg3 0000:05:00.0 eth0: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[ 121.049044] tg3 0000:05:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[ 121.049045] tg3 0000:05:00.0 eth0: dma_rwctrl[00000001] dma_mask[64-bit]
[ 121.084342] tg3 0000:05:00.1 eth1: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address b8:cb:29:b5:1c:3b
[ 121.084355] tg3 0000:05:00.1 eth1: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[ 121.084363] tg3 0000:05:00.1 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[ 121.084370] tg3 0000:05:00.1 eth1: dma_rwctrl[00000001] dma_mask[64-bit]
[ 121.102367] iTCO_vendor_support: vendor-support=0
[ 121.103831] Error: Driver 'pcspkr' is already registered, aborting...
[ 121.108617] dcdbas dcdbas: Dell Systems Management Base Driver (version 5.6.0-3.4)
[ 121.113037] tg3 0000:05:00.1 eno2: renamed from eth1

[...]

[ 121.281600] shepherd[1]: Service user-homes has been started.
[ 121.282538] shepherd[1]: Service user-homes started.
[ 121.368316] ipmi_si IPI0001:00: Using irq 10
[ 121.405790] ipmi_si IPI0001:00: IPMI message handler: Found new BMC (man_id: 0x0002a2, prod_id: 0x0100, dev_id: 0x20)
[ 121.419871] shepherd[1]: Exception caught while starting #<<service> 7f19889012a0>: (wrong-type-arg "port-filename" "Wrong type argument in position ~A: ~S" (1 #<closed: file 7f1981887000>) (#<closed: file 7f1981887000>))
[ 121.420074] shepherd[1]: Service user-homes running with value #t.
[ 121.420218] shepherd[1]: Service networking failed to start.

The failure seems to happen after the whole static networking config has
been set up though (‘ip a’ shows that everything’s in place).

Problem is that at this point ‘networking’ cannot be started unless you
manually tear down everything with ‘ip’:

$ sudo herd start networking
herd: error: exception caught while executing 'start' on service 'networking':
Throw to key `%exception' with args `("#<&netlink-response-error errno: 17>")'.

(17 = EEXIST)

This makes me think we should make the set up phase idempotent or,
alternatively, add special actions to force a change.
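For illustration only: Guix's static-networking set-up talks to the kernel through Guile-Netlink rather than the ‘ip’ command, so what follows is an analogy, not a patch. It sketches what an idempotent set-up phase looks like with iproute2, whose "replace" verbs succeed whether or not the entry already exists. The function name, the IP variable, and the arguments are made up for this example.

```shell
#!/bin/sh
# Sketch only: idempotent static set-up with iproute2's "replace" verbs.
# The device, address and gateway passed in are placeholders.

# IP can be overridden (e.g. IP=echo) to preview the commands without root.
IP="${IP:-ip}"

set_up_static_network() {
    dev="$1"; addr="$2"; gateway="$3"
    # "replace" creates the entry if it is missing and updates it otherwise,
    # so a second run does not fail with EEXIST the way "add" does.
    "$IP" link set "$dev" up &&
    "$IP" address replace "$addr" dev "$dev" &&
    "$IP" route replace default via "$gateway" dev "$dev"
}
```

Calling it twice in a row with the same arguments should leave the interface in the same state both times, which is the property the set-up phase currently lacks.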

Thoughts?

Ludo’.
Matt Wette wrote on 17 Sep 2023 18:42
stopping ntp and dnsmasq
(address . 64653@debbugs.gnu.org)
a67e6fa6-31c3-4d00-add1-c3629d632a8a@gmail.com
Are there any workarounds for this?  I've been digging into anything that
might help.
I'm dead in the water trying to get ntpd and tftpd (dnsmasq) working.
They require this.
Or is there a way to get dnsmasq working on its own?

Matt
Matt Wette wrote on 17 Sep 2023 19:09
(address . 64653@debbugs.gnu.org)
af7eb6e7-faed-4f82-77f1-5a5708c0a571@gmail.com
On 9/17/23 9:42 AM, Matt Wette wrote:
> Are there any workarounds for this.   I've been digging into anything
> to help.
> I'm dead in the water trying to get ntpd and tftpd (dnsmasq) working. 
> They require this.
> Or, is there a way to get dnsmasq working itself?

I see there is atftp, so I'll try that.   Still no working ntpd.
Ludovic Courtès wrote on 2 Oct 2023 12:24
control message for bug #64653
(address . control@debbugs.gnu.org)
871qedl3tz.fsf@gnu.org
severity 64653 important
quit
Ludovic Courtès wrote on 2 Oct 2023 13:59
Re: bug#64653: ‘static-networking’ fails to start
(address . 64653@debbugs.gnu.org)
87msx1jkvi.fsf@gnu.org
Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> [ 121.281600] shepherd[1]: Service user-homes has been started.
> [ 121.282538] shepherd[1]: Service user-homes started.
> [ 121.368316] ipmi_si IPI0001:00: Using irq 10
> [ 121.405790] ipmi_si IPI0001:00: IPMI message handler: Found new BMC (man_id: 0x0002a2, prod_id: 0x0100, dev_id: 0x20)
> [ 121.419871] shepherd[1]: Exception caught while starting #<<service> 7f19889012a0>: (wrong-type-arg "port-filename" "Wrong type argument in position ~A: ~S" (1 #<closed: file 7f1981887000>) (#<closed: file 7f1981887000>))
> [ 121.420074] shepherd[1]: Service user-homes running with value #t.
> [ 121.420218] shepherd[1]: Service networking failed to start.
>
>
> The failure seems to happen after the whole static networking config has
> been set up though (‘ip a’ shows that everything’s in place).
>
> Problem is that at this point ‘networking’ cannot be started unless you
> manually tear down everything with ‘ip’:
>
> $ sudo herd start networking
> herd: error: exception caught while executing 'start' on service 'networking':
> Throw to key `%exception' with args `("#<&netlink-response-error errno: 17>")'.

Quick workaround if you encounter this bug:

1. Find the “tear-down” script of your system with:

guix gc -R /run/current-system |grep tear-down-network

2. In a ‘screen’ session, run this as root:

while true ; do herd enable networking; herd start networking; sleep 3; done

3. Run:

sudo guile --no-auto-compile TEAR_DOWN_SCRIPT_FROM_STEP_1

Beautiful, isn’t it?

(We’ll actually work on fixing the bug, too…)
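The manual steps above can be wrapped in a small helper. This is a minimal sketch, not something shipped with Guix: the function name is hypothetical, and it only adds a guard so that exactly one tear-down script is found before anything runs as root.

```shell
#!/bin/sh
# Sketch only: wraps step 1 and step 3 of the workaround above.

# Read store paths on stdin and print the tear-down script, insisting on
# exactly one match so that the wrong file is never run as root.
pick_tear_down_script() {
    matches=$(grep tear-down-network || true)
    count=$(printf '%s\n' "$matches" | grep -c . || true)
    if [ "$count" -ne 1 ]; then
        echo "expected one tear-down script, found $count" >&2
        return 1
    fi
    printf '%s\n' "$matches"
}

# Intended use, as root, while the loop from step 2 runs in 'screen':
#   script=$(guix gc -R /run/current-system | pick_tear_down_script) &&
#       guile --no-auto-compile "$script"
```

The loop from step 2 still has to be running, since tearing down the network is what lets the next ‘herd start networking’ attempt succeed.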

Ludo’.
Leo Nikkilä wrote on 11 Nov 2023 17:25
Re: bug#64653: ‘static-networking’ fails to start
(address . 64653@debbugs.gnu.org)
e5c80dd5-21dc-407c-a3c0-5d8746f8fbf1@betaapp.fastmail.com
I'm also seeing this issue on a headless RockPro64 system. Do you know anything I could change in the configuration to work around this during boot, e.g. patch a specific commit out?

Happy to provide further details or test things on my system.
Ludovic Courtès wrote on 4 Jan 00:42 +0100
Re: bug#64653: ‘static-networking’ fails to start
(address . 64653@debbugs.gnu.org)
87mstmf0g4.fsf@gnu.org
Hello!

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> [ 121.282538] shepherd[1]: Service user-homes started.
> [ 121.368316] ipmi_si IPI0001:00: Using irq 10
> [ 121.405790] ipmi_si IPI0001:00: IPMI message handler: Found new BMC (man_id: 0x0002a2, prod_id: 0x0100, dev_id: 0x20)
> [ 121.419871] shepherd[1]: Exception caught while starting #<<service> 7f19889012a0>: (wrong-type-arg "port-filename" "Wrong type argument in position ~A: ~S" (1 #<closed: file 7f1981887000>) (#<closed: file 7f1981887000>))
> [ 121.420074] shepherd[1]: Service user-homes running with value #t.
> [ 121.420218] shepherd[1]: Service networking failed to start.

I’m seeing a similar exception in a Hurd VM running shepherd 0.10.3rc1:

Jan 3 23:13:22 localhost shepherd[1]: Exception caught while starting networking: (wrong-type-arg "port-filename" "Wrong type argument in position ~A: ~S" (1 #<closed: file 207e498>) (#<closed: file 207e498>))
Jan 3 23:13:22 localhost shepherd[1]: Service networking failed to start.

It’s interesting because it suggests that the offending ‘port-filename’
call comes from ‘load’, not from the network-setup code being loaded
(here, the /hurd/pfinet translator has been properly set up).

Looking at the code in ‘boot-9.scm’, I *think* we end up calling
‘primitive-load’; ‘shepherd’ replaces it with its own (@ (shepherd
support) primitive-load*).

I managed to grab this backtrace:

Evaluating user expression (catch #t (lambda () (load "/gnu/store/64?")) # ?).
starting '/gnu/store/gn8q7p790a9zdnlciyp1vlncpin366r0-hurd-v0.9.git20230216/hurd/pfinet "--ipv6" "/servers/socket/26" "--interface" "/dev/eth0" "--address" "10.0.2.15" "--netmask" "255.255.255.0" "--gateway" "10.0.2.2"'
In ice-9/boot-9.scm:
142:2 7 (dynamic-wind #<procedure 20393a0 at ice-9/eval.scm:33?> ?)
In shepherd/support.scm:
486:15 6 (_ #<closed: file 50a7e38>)
In ice-9/read.scm:
859:19 5 (read _)
In unknown file:
4 (port-filename #<closed: file 50a7e38>)
In ice-9/boot-9.scm:
1685:16 3 (raise-exception _ #:continuable? _)
1780:13 2 (_ #<&compound-exception components: (#<&assertion-fail?>)
In ice-9/eval.scm:
159:9 1 (_ #(#(#<module (#{ g171}#) 3cd25f0>) (# "port-fil?" ?)))
In unknown file:
0 (make-stack #t)
#t

So it’s indeed ‘read’ as called from ‘primitive-load*’ that stumbles
upon a closed port. It also happens when loading a file that simply
suspends the current fiber via ‘sleep’ or similar, but only on the Hurd
though.

To be continued…

Ludo’.
Ludovic Courtès wrote on 5 Jan 17:32 +0100
(address . 64653-done@debbugs.gnu.org)
87r0iveo6l.fsf@gnu.org
Hi!

Ludovic Courtès <ludo@gnu.org> writes:

> Evaluating user expression (catch #t (lambda () (load "/gnu/store/64?")) # ?).
> starting '/gnu/store/gn8q7p790a9zdnlciyp1vlncpin366r0-hurd-v0.9.git20230216/hurd/pfinet "--ipv6" "/servers/socket/26" "--interface" "/dev/eth0" "--address" "10.0.2.15" "--netmask" "255.255.255.0" "--gateway" "10.0.2.2"'
> In ice-9/boot-9.scm:
> 142:2 7 (dynamic-wind #<procedure 20393a0 at ice-9/eval.scm:33?> ?)
> In shepherd/support.scm:
> 486:15 6 (_ #<closed: file 50a7e38>)
> In ice-9/read.scm:
> 859:19 5 (read _)
> In unknown file:
> 4 (port-filename #<closed: file 50a7e38>)
> In ice-9/boot-9.scm:
> 1685:16 3 (raise-exception _ #:continuable? _)
> 1780:13 2 (_ #<&compound-exception components: (#<&assertion-fail?>)
> In ice-9/eval.scm:
> 159:9 1 (_ #(#(#<module (#{ g171}#) 3cd25f0>) (# "port-fil?" ?)))
> In unknown file:
> 0 (make-stack #t)
> #t
>
> So it’s indeed ‘read’ as called from ‘primitive-load*’ that stumbles
> upon a closed port.

Good news: this is fixed by 4e431fda5f2ec76b6d6a271be7c30b1324431329!
Silly me had introduced a ‘dynamic-wind’ there.

(The funny thing with extensible systems like the Shepherd is that the
problem can be anywhere. :-))

Ludo’.
Closed
Matt Wette wrote on 20 Jan 22:14 +0100
works now
(address . 64653@debbugs.gnu.org)
9b5bcee0-3a77-4147-8f32-42b4720b250e@gmail.com
This bug no longer occurs on my system.   That change occurred over the
last week.
Felix Lechner wrote on 25 Mar 16:36 +0100
(no subject)
(address . control@patchwise.org)
87h6gu492h.fsf@lease-up.com
unarchive 64653
thanks
Fabio Natali wrote on 25 Mar 12:52 +0100
'static-networking' fails to start
(address . 64653@debbugs.gnu.org)
87zfumleab.fsf@fabionatali.com
Hi,

I've been trying to reconfigure a machine from static IPv4 to static
dual-stack or IPv6-only. I followed one of the examples in the manual,
so I'd think I got the syntax right.

Once the reconfiguration has taken place and when restarting the
networking service, I get this error:

,----
| herd: error: exception caught while executing 'start' on service 'networking':
| Throw to key `%exception' with args `("#<&netlink-response-error errno: 17>")'.
`----

This would seem to be relevant to this bug report 64653?

Do you know what this might be related to and what I can do to solve it?

This happens on an up-to-date Guix system.

Thanks, best wishes, Fabio.



--
Fabio Natali
Fabio Natali wrote on 25 Mar 19:43 +0100
(address . 64653@debbugs.gnu.org)
87il1akv9a.fsf@fabionatali.com
On 2024-03-25, 11:52 +0000, Fabio Natali <me@fabionatali.com> wrote:
> Once the reconfiguration has taken place and when restarting the
> networking service, I get this error:
>
> ,----
> | herd: error: exception caught while executing 'start' on service 'networking':
> | Throw to key `%exception' with args `("#<&netlink-response-error errno: 17>")'.
> `----

Ok, good news, thanks to Felix's advice[0] I was able to get this
sorted!

Apparently, specifying a default IPv6 gateway (as a link local address)
is what was causing the issue for me. Once the following bit was
commented out, everything started working again.

,----
| (static-networking
|  (addresses (list (network-address
|                    (device "eth0")
|                    (value "10.0.0.2/24"))
|                   (network-address
|                    (device "eth0")
|                    (value "2001:db8::1/64"))))
|  (routes (list (network-route
|                 (destination "default")
|                 (gateway "10.0.0.1"))))
|                ;; (network-route
|                ;;  (destination "default")
|                ;;  (gateway "fe80::"))))
|  (name-servers '("10.0.0.1" "2001:db8::")))
`----

("fe80::" and "2001:db8::" are just placeholders.)

I assume the router address gets retrieved automatically via Router
Advertisement (RA), so there is no need for that in my case.
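One way to check the RA assumption is to look at the kernel's IPv6 default routes: routes learned from Router Advertisements are normally tagged "proto ra" in the output of ‘ip -6 route show default’. The helper below is a minimal sketch with a made-up name. It only reports whether such a route is present and does not by itself establish what triggers the EEXIST error.

```shell
#!/bin/sh
# Sketch only: detect a default route learned from Router Advertisements.
# Reads the output of 'ip -6 route show default' on stdin and succeeds
# if some default route is tagged "proto ra".
has_ra_default_route() {
    grep -Eq '^default .* proto ra( |$)'
}

# Intended use:
#   ip -6 route show default | has_ra_default_route && echo "RA route present"
```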

Still, I'd expect it to be possible to indicate the router's link-local
address... Do you see a possible bug here, or is there anything else that
I might be missing?

Thanks, cheers, Fabio.




--
Fabio Natali