[shepherd] EADDRINUSE while restarting ssh-daemon

  • Done
  • quality assurance status badge
Details
4 participants
  • Lars-Dominik Braun
  • Lars-Dominik Braun
  • Liliana Marie Prikler
  • Ludovic Courtès
Owner
unassigned
Submitted by
Lars-Dominik Braun
Severity
important
L
L
Lars-Dominik Braun wrote on 13 Oct 2022 09:51
[shepherd] Restarting guix-publish fails
(address . bug-guix@gnu.org)(name . Ludovic Courtès)(address . ludo@gnu.org)
Y0fDnizcW8UsBxKh@noor.fritz.box
Hi,

it seems that `herd restart guix-publish` stopped working after the
introduction of socket activation into shepherd. This is a problem,
because I restart guix-publish automatically after unattended-upgrades. It
fails with the following error for me:

---snip---
Backtrace:
7 (primitive-load "/gnu/store/7xrg2sbb529ki6hv99n27svg0fi?")
In ice-9/boot-9.scm:
724:2 6 (call-with-prompt ("prompt") #<procedure 7f8173184940 ?> ?)
1752:10 5 (with-exception-handler _ _ #:unwind? _ # _)
In ice-9/eval.scm:
619:8 4 (_ #(#(#<directory (guile-user) 7f817318ac80>)))
In ice-9/boot-9.scm:
260:13 3 (for-each #<procedure restart-service (name)> _)
In gnu/services/herd.scm:
168:4 2 (invoke-action guix-publish restart () #<procedure 7f81?>)
176:7 1 (failure)
In ice-9/boot-9.scm:
1685:16 0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
ERROR:
1. &action-exception-error:
service: guix-publish
action: start
key: system-error
args: ("bind" "~A" ("Address already in use") (98))
---snap---

Note that due to the socket activation you must visit the URL at least
once to start up the guix-publish process. Otherwise a restart will
work fine. It also works fine the second time I invoke `herd restart
guix-publish`, because `guix-publish` is dead by that time.

Looking at an strace shepherd is indeed trying to kill `guix-publish`
and re-bind to the same address:

---snip---
1 read(23, "(shepherd-command (version 0) (action restart) (service guix-publish) (arguments ()) (directory \"/root\"))", 1024) = 105
1 getpgid(18096) = 18096
1 getpgid(0) = 0
1 kill(-18096, SIGTERM) = 0
1 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}, 0) = 0
1 write(17, "shepherd[1]: Service guix-publish has been stopped.\n", 52) = 52
1 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 36
1 setsockopt(36, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
1 bind(36, {sa_family=AF_INET, sin_port=htons(8082), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)
1 write(23, "(reply (version 0) (result #f) (error (error (version 0) action-exception start guix-publish system-error (\"bind\" \"~A\" (\"Address already in use\") (98)))) (messages (\"Service guix-publish has been stopped.\")))", 208) = 208
1 close(23)
---snap---

The obvious explanation would be that stopping does not wait for the
process to actually exit. make-kill-destructor does not waitpid it seems
and 'running is set unconditionally to #f after 'stop has finished.

Cheers,
Lars
L
L
Liliana Marie Prikler wrote on 13 Oct 2022 11:28
(name . Ludovic Courtès)(address . ludo@gnu.org)
7f56da61347bc101230cb65b17643b700464006a.camel@ist.tugraz.at
Am Donnerstag, dem 13.10.2022 um 09:51 +0200 schrieb Lars-Dominik
Braun:
Toggle quote (4 lines)
> The obvious explanation would be that stopping does not wait for the
> process to actually exit. make-kill-destructor does not waitpid it
> seems and 'running is set unconditionally to #f after 'stop has
> finished.
L
L
Lars-Dominik Braun wrote on 13 Oct 2022 13:35
(name . Liliana Marie Prikler)(address . liliana.prikler@ist.tugraz.at)
Y0f4G97mIZehPNlJ@noor.fritz.box
Hi Liliana,

Toggle quote (3 lines)
> Shouldn't [1] address this very issue?
> [1]
> http://git.savannah.gnu.org/cgit/guix.git/commit/?id=2a37f174becbafd70591f6eb1d98493c5c1df0e2
no, if the process is running make-systemd-destructor is just an alias
for make-kill-destructor. So it does not matter which one we use in this case.

Lars
L
L
Liliana Marie Prikler wrote on 13 Oct 2022 15:38
(name . Lars-Dominik Braun)(address . lars@6xq.net)
6449fe424fed49c89ff25e505607f8ea27739e5b.camel@ist.tugraz.at
Am Donnerstag, dem 13.10.2022 um 13:35 +0200 schrieb Lars-Dominik
Braun:
Toggle quote (8 lines)
> Hi Liliana,
>
> > Shouldn't [1] address this very issue?
> > [1]
> > http://git.savannah.gnu.org/cgit/guix.git/commit/?id=2a37f174becbafd70591f6eb1d98493c5c1df0e2
> no, if the process is running make-systemd-destructor is just an
> alias for make-kill-destructor. So it does not matter which one we
> use in this case.
Ahh, so the issue is that shepherd waits neither for the process to be
actually killed nor for the socket to become available, isn't it?
L
L
Lars-Dominik Braun wrote on 14 Oct 2022 08:18
(name . Liliana Marie Prikler)(address . liliana.prikler@ist.tugraz.at)
Y0j/J2ARzOYa2YUM@noor.fritz.box
Hi,

Toggle quote (2 lines)
> Ahh, so the issue is that shepherd waits neither for the process to be
> actually killed nor for the socket to become available, isn't it?
I would argue it’s the former, but having either of them would solve
the problem, I think.

Lars
L
L
Liliana Marie Prikler wrote on 14 Oct 2022 08:57
(name . Lars-Dominik Braun)(address . lars@6xq.net)
44ef002a53713d84cb57dbc7bc5f3eb9f8422290.camel@ist.tugraz.at
Am Freitag, dem 14.10.2022 um 08:18 +0200 schrieb Lars-Dominik Braun:
Toggle quote (7 lines)
> Hi,
>
> > Ahh, so the issue is that shepherd waits neither for the process to
> > be
> > actually killed nor for the socket to become available, isn't it?
> I would argue it’s the former, but having either of them would solve
> the problem, I think.
I think you need both: if the process is killed, but the socket
remains, you need to clean it up. As far as I'm aware, that does not
happen automatically.

Cheers
L
L
Ludovic Courtès wrote on 15 Oct 2022 22:17
control message for bug #58485
(address . control@debbugs.gnu.org)
87o7ucj0y2.fsf@gnu.org
severity 58485 important
quit
L
L
Ludovic Courtès wrote on 17 Nov 2022 09:24
Re: bug#58485: [shepherd] Restarting guix-publish fails
(name . Lars-Dominik Braun)(address . lars@6xq.net)(address . 58485@debbugs.gnu.org)
875yfekmze.fsf@gnu.org
Hi,

Lars-Dominik Braun <lars@6xq.net> skribis:

Toggle quote (14 lines)
> 1 kill(-18096, SIGTERM) = 0
> 1 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}, 0) = 0
> 1 write(17, "shepherd[1]: Service guix-publish has been stopped.\n", 52) = 52
> 1 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 36
> 1 setsockopt(36, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> 1 bind(36, {sa_family=AF_INET, sin_port=htons(8082), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)
> 1 write(23, "(reply (version 0) (result #f) (error (error (version 0) action-exception start guix-publish system-error (\"bind\" \"~A\" (\"Address already in use\") (98)))) (messages (\"Service guix-publish has been stopped.\")))", 208) = 208
> 1 close(23)
> ---snap---
>
> The obvious explanation would be that stopping does not wait for the
> process to actually exit. make-kill-destructor does not waitpid it seems
> and 'running is set unconditionally to #f after 'stop has finished.

Indeed. This is fixed by Shepherd commit
d97592f58603ff51cb280ae57d413c8731e601b3, which will be in the upcoming
0.9.3 release.

Thanks,
Ludo’.
L
L
Ludovic Courtès wrote on 17 Nov 2022 11:19
(name . Lars-Dominik Braun)(address . lars@6xq.net)(address . 58485-done@debbugs.gnu.org)
87h6yxkhnr.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (4 lines)
> Indeed. This is fixed by Shepherd commit
> d97592f58603ff51cb280ae57d413c8731e601b3, which will be in the upcoming
> 0.9.3 release.

The Shepherd 0.9.3 has landed in Guix commit
283d7318c5b312d7129adb6dbeea6ad205ce89d1.

Ludo’.
Closed
L
L
Lars-Dominik Braun wrote on 7 Feb 2023 09:38
(no subject)
(address . control@debbugs.gnu.org)
Y+IODheWWviSYi+P@zpidnb93
unarchive 58485
L
L
Lars-Dominik Braun wrote on 7 Feb 2023 09:39
Re: bug#58485: [shepherd] Restarting guix-publish fails
(address . 58485@debbugs.gnu.org)
Y+IONFTuvQMz0LAU@zpidnb93
Hi Ludo,

Toggle quote (3 lines)
> Indeed. This is fixed by Shepherd commit
> d97592f58603ff51cb280ae57d413c8731e601b3, which will be in the upcoming
> 0.9.3 release.
I’m on 0.9.3 and it works fine with `herd restart` now. But ssh-daemon
has the same issue when being restarted by unattended-upgrades (which
is fatal, because unable to use SSH I have to restart the entire box):

---snip---
shepherd: Service nginx has been stopped.
shepherd: Service nginx has been started.
shepherd: Service collectd has been stopped.
shepherd: Service collectd has been started.
shepherd: Service ntpd has been stopped.
shepherd: Service ntpd has been started.
shepherd: Service guix-publish has been stopped.
shepherd: Service guix-publish has been started.
shepherd: Service ssh-daemon has been stopped.
Backtrace:
7 (primitive-load "/gnu/store/ip5m1n8kb6p0rfglzpkk17k060a?")
In ice-9/boot-9.scm:
724:2 6 (call-with-prompt ("prompt") #<procedure 7f89a11f3840 ?> ?)
1752:10 5 (with-exception-handler _ _ #:unwind? _ # _)
In ice-9/eval.scm:
619:8 4 (_ #(#(#<directory (guile-user) 7f89a11ffc80>)))
In ice-9/boot-9.scm:
260:13 3 (for-each #<procedure restart-service (name)> _)
In gnu/services/herd.scm:
168:4 2 (invoke-action ssh-daemon restart () #<procedure 7f89a0?>)
176:7 1 (failure)
In ice-9/boot-9.scm:
1685:16 0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
ERROR:
1. &action-exception-error:
service: ssh-daemon
action: start
key: system-error
args: ("bind" "~A" ("Address already in use") (98)
---snap---

Maybe I can strace herd and see what happens exactly.

Thanks,
Lars

--
Lars-Dominik Braun
Wissenschaftlicher Mitarbeiter/Research Associate

www.leibniz-psychology.org
ZPID - Leibniz-Institut für Psychologie /
ZPID - Leibniz Institute for Psychology
Universitätsring 15
D-54296 Trier - Germany
Tel.: +49–651–201-4964
-----BEGIN PGP SIGNATURE-----

iQGzBAABCAAdFiEEyk+M9DfXR4/aBV/UQhN3ARo3hEYFAmPiDjEACgkQQhN3ARo3
hEaTXAwAs287VvnJyKW5DCE+ZcH2zMv7OxTdskznDCa4YP61wjL5fGeS+KdyX/GS
lqQ0AcnDC6UKuweh/JTFucVWkdY+N35lhAH27xkJDMQAXTk+CjuMSfiPv6asOQ4t
SIGX+yH5RVKTlhBAhDgd8csZipuoVjp6/vL3x3jo65/7sKN0oYJFpq0LqrT4oxbb
Gg9K+tGTqup5rdpjN/IciO8rjxDmncl5ubOJ1vFEX4ufyfw1Q99W0xEsL/mh764a
fBgP3LlhRdE2F8GZ/ZvdMw5TgxGEpWRPJepTzQiIMf6PC2ZRgsPrhGTWHUryAFZK
9AdBc7B5cxmvEPdb9UyYUNA2eFGiDPikGZ+Y66sf4rUY9ya+ryNONdUgepiGYp4v
p6PHW48GWoR4tsdhw6ibUax29k/LB6AEDPnpXEf3Veoncm/bbqOWS0Krw5RQZA5z
R7qSjKT6IgGQWabnktUibNsk+jO6PFsl1qFwcfDuhBpyZExtU7fLDPAYLvjVEmfY
jhsYiy+0
=/zFv
-----END PGP SIGNATURE-----


L
L
Ludovic Courtès wrote on 20 Feb 2023 11:20
(name . Lars-Dominik Braun)(address . ldb@leibniz-psychology.org)
878rgsljtx.fsf@gnu.org
Hi Lars,

Lars-Dominik Braun <ldb@leibniz-psychology.org> skribis:

Toggle quote (43 lines)
>> Indeed. This is fixed by Shepherd commit
>> d97592f58603ff51cb280ae57d413c8731e601b3, which will be in the upcoming
>> 0.9.3 release.
> I’m on 0.9.3 and it works fine with `herd restart` now. But ssh-daemon
> has the same issue when being restarted by unattended-upgrades (which
> is fatal, because unable to use SSH I have to restart the entire box):
>
> ---snip---
> shepherd: Service nginx has been stopped.
> shepherd: Service nginx has been started.
> shepherd: Service collectd has been stopped.
> shepherd: Service collectd has been started.
> shepherd: Service ntpd has been stopped.
> shepherd: Service ntpd has been started.
> shepherd: Service guix-publish has been stopped.
> shepherd: Service guix-publish has been started.
> shepherd: Service ssh-daemon has been stopped.
> Backtrace:
> 7 (primitive-load "/gnu/store/ip5m1n8kb6p0rfglzpkk17k060a?")
> In ice-9/boot-9.scm:
> 724:2 6 (call-with-prompt ("prompt") #<procedure 7f89a11f3840 ?> ?)
> 1752:10 5 (with-exception-handler _ _ #:unwind? _ # _)
> In ice-9/eval.scm:
> 619:8 4 (_ #(#(#<directory (guile-user) 7f89a11ffc80>)))
> In ice-9/boot-9.scm:
> 260:13 3 (for-each #<procedure restart-service (name)> _)
> In gnu/services/herd.scm:
> 168:4 2 (invoke-action ssh-daemon restart () #<procedure 7f89a0?>)
> 176:7 1 (failure)
> In ice-9/boot-9.scm:
> 1685:16 0 (raise-exception _ #:continuable? _)
>
> ice-9/boot-9.scm:1685:16: In procedure raise-exception:
> ERROR:
> 1. &action-exception-error:
> service: ssh-daemon
> action: start
> key: system-error
> args: ("bind" "~A" ("Address already in use") (98)
> ---snap---
>
> Maybe I can strace herd and see what happens exactly.

Can you confirm shepherd (PID 1) is 0.9.3?

‘sudo herd restart ssh-daemon’ works fine on my laptop FWIW.

Note that the situation is different from that of ‘guix publish’: here
it’s inetd style, as opposed to systemd style for ‘guix publish’.

Thanks,
Ludo’.
L
L
Lars-Dominik Braun wrote on 20 Feb 2023 14:25
(name . Ludovic Courtès)(address . ludo@gnu.org)
Y/N0wj0CQGRuxmwj@zpidnb93
Hi Ludo,

Toggle quote (1 lines)
> Can you confirm shepherd (PID 1) is 0.9.3?
it is:

root 1 0.2 0.2 308148 76816 ? Sl Feb07 52:08 /gnu/store/kphp5d85rrb3q1rdc2lfqc1mdklwh3qp-guile-3.0.9/bin/guile --no-auto-compile /gnu/store/4nw0zb4swga0cb8i35nvng3rg6z5qm8p-shepherd-0.9.3/bin/shepherd --config /gnu/store/cvrai6z8777jf7860rnvppfznl1lcxi1-shepherd.conf

Toggle quote (1 lines)
> ‘sudo herd restart ssh-daemon’ works fine on my laptop FWIW.
This works fine too. Only unattended-upgrades seems to have this issue :/

The strace looks unsuspicious right now:

---snip---
1 14:12:15.117035 read(21, "(shepherd-command (version 0) (action restart) (service ssh-daemon) (arguments ()) (directory \"/root\"))", 1024) = 103
1 14:12:15.117254 close(27) = 0
1 14:12:15.117283 close(30) = 0
1 14:12:15.117416 newfstatat(AT_FDCWD, "/etc/localtime", {st_dev=makedev(0x8, 0x2), st_ino=110100491, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, s
t_size=2298, st_atime=1676898665 /* 2023-02-20T14:11:05.338746772+0100 */, st_atime_nsec=338746772, st_mtime=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_mtime_nsec=874743456, st_c
time=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_ctime_nsec=874743456}, 0) = 0
1 14:12:15.117475 write(17, "shepherd[1]: Service ssh-daemon has been stopped.\n", 50) = 50
1 14:12:15.117524 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 26
1 14:12:15.117561 setsockopt(26, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
1 14:12:15.117598 bind(26, {sa_family=AF_INET, sin_port=htons(2222), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
1 14:12:15.117724 write(21, "(reply (version 0) (result #f) (error (error (version 0) action-exception start ssh-daemon system-error (\"bind\" \"~A\" (\"Address already in use\") (98)))) (messages (\"Service ssh-daemon has been stopped.\")))", 204) = 204
1 14:12:15.117754 close(21) = 0
---snap---

But nginx seems to have the same issue, except that it does not fail
entirely and succeeds after waiting a short period of time:

---snip---
2023/02/20 14:12:14 [notice] 7136#0: signal 15 (SIGTERM) received from 6644, exiting
2023/02/20 14:12:14 [notice] 7137#0: exiting
2023/02/20 14:12:14 [notice] 7137#0: exit
2023/02/20 14:12:14 [notice] 7136#0: signal 17 (SIGCHLD) received from 7137
2023/02/20 14:12:14 [notice] 7136#0: worker process 7137 exited with code 0
2023/02/20 14:12:14 [emerg] 6645#0: bind() to 0.0.0.0:443 failed (98: Address already in use)
2023/02/20 14:12:14 [emerg] 6645#0: bind() to 0.0.0.0:80 failed (98: Address already in use)
2023/02/20 14:12:14 [emerg] 6645#0: bind() to [::]:80 failed (98: Address already in use)
2023/02/20 14:12:14 [notice] 7136#0: exit
2023/02/20 14:12:14 [notice] 6645#0: try again to bind() after 500ms
2023/02/20 14:12:14 [notice] 6645#0: using the "epoll" event method
2023/02/20 14:12:14 [notice] 6645#0: nginx/1.23.3
2023/02/20 14:12:14 [notice] 6645#0: OS: Linux 6.1.9
2023/02/20 14:12:14 [notice] 6645#0: getrlimit(RLIMIT_NOFILE): 1024:4096
2023/02/20 14:12:14 [notice] 6648#0: start worker processes
2023/02/20 14:12:14 [notice] 6648#0: start worker process 6649
2023/02/20 14:12:32 [info] 6649#0: epoll_wait() failed (4: Interrupted system call)
---snap---

I see we’re already using SO_REUSEADDR, so all of this is a bit of a
mystery to me.

Thanks,
Lars

--
Lars-Dominik Braun
Wissenschaftlicher Mitarbeiter/Research Associate

www.leibniz-psychology.org
ZPID - Leibniz-Institut für Psychologie /
ZPID - Leibniz Institute for Psychology
Universitätsring 15
D-54296 Trier - Germany
Tel.: +49–651–201-4964
-----BEGIN PGP SIGNATURE-----

iQGzBAABCAAdFiEEyk+M9DfXR4/aBV/UQhN3ARo3hEYFAmPzdLcACgkQQhN3ARo3
hEaZsAv+LM5XHA/LZnBM5CxiZq39HJFExQ13RybQ41w/kpm8khpGXYbJWpS+U2t9
tTpbq7Mge9BPQT6uk9jb0qDBnGKRmKWS+LCkL7HQvzbmOi0sxV418nejZlRv5Igf
mxbtkD2aztIXfxqEa+ZO8sO7H16qoZHtcmwaEqpCU9+kAELq+FxO6L1RKvIKAS7S
dXuEgHCx92LlUbnxPYK2vbKn7Y93ruKukd+nIze9UhtcwAXrhiirUdgVcZ4oDmoK
cYmm5H1wJv21sPB0/GXD9KRBmXteRqYpYmi6fj/sFY9WH35ILvuqpDe7sS8GcEqF
db/jfHI5dNoNDSKQzO/bS/Copr0HSADWD6oLnSuab1UCyVLM+amekclEs+BaPg6f
3TMG+DPBboMf4oGr7cCyr0KWxHf58V9v36R30N8U1UkYzhIveovoWMOX2nb0lwZW
ypbCHjmpxvpGU7PBHZNOkcChLF/DkkCg22rwYP41CnxxrfYAjx5vh8Vca6WHAv9q
EraAcxx9
=JBDq
-----END PGP SIGNATURE-----


L
L
Ludovic Courtès wrote on 27 Apr 2023 22:35
control message for bug #58485
(address . control@debbugs.gnu.org)
871qk5vz9s.fsf@gnu.org
unarchive 58485
quit
L
L
Ludovic Courtès wrote on 27 Apr 2023 22:36
(address . control@debbugs.gnu.org)
87y1mdukoe.fsf@gnu.org
retitle 58485 [shepherd] EADDRINUSE while restarting ssh-daemon
quit
L
L
Ludovic Courtès wrote on 27 Apr 2023 23:23
Re: bug#58485: [shepherd] Restarting guix-publish fails
(name . Lars-Dominik Braun)(address . ldb@leibniz-psychology.org)
87pm7puihd.fsf@gnu.org
Hi,

Sorry for the late reply. I’m going through Shepherd bug reports and I
remembered this discussion…

Lars-Dominik Braun <ldb@leibniz-psychology.org> skribis:

Toggle quote (24 lines)
>> Can you confirm shepherd (PID 1) is 0.9.3?
> it is:
>
> root 1 0.2 0.2 308148 76816 ? Sl Feb07 52:08 /gnu/store/kphp5d85rrb3q1rdc2lfqc1mdklwh3qp-guile-3.0.9/bin/guile --no-auto-compile /gnu/store/4nw0zb4swga0cb8i35nvng3rg6z5qm8p-shepherd-0.9.3/bin/shepherd --config /gnu/store/cvrai6z8777jf7860rnvppfznl1lcxi1-shepherd.conf
>
>> ‘sudo herd restart ssh-daemon’ works fine on my laptop FWIW.
> This works fine too. Only unattended-upgrades seems to have this issue :/
>
> The strace looks unsuspicious right now:
>
> ---snip---
> 1 14:12:15.117035 read(21, "(shepherd-command (version 0) (action restart) (service ssh-daemon) (arguments ()) (directory \"/root\"))", 1024) = 103
> 1 14:12:15.117254 close(27) = 0
> 1 14:12:15.117283 close(30) = 0
> 1 14:12:15.117416 newfstatat(AT_FDCWD, "/etc/localtime", {st_dev=makedev(0x8, 0x2), st_ino=110100491, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, s
> t_size=2298, st_atime=1676898665 /* 2023-02-20T14:11:05.338746772+0100 */, st_atime_nsec=338746772, st_mtime=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_mtime_nsec=874743456, st_c
> time=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_ctime_nsec=874743456}, 0) = 0
> 1 14:12:15.117475 write(17, "shepherd[1]: Service ssh-daemon has been stopped.\n", 50) = 50
> 1 14:12:15.117524 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 26
> 1 14:12:15.117561 setsockopt(26, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> 1 14:12:15.117598 bind(26, {sa_family=AF_INET, sin_port=htons(2222), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
> 1 14:12:15.117724 write(21, "(reply (version 0) (result #f) (error (error (version 0) action-exception start ssh-daemon system-error (\"bind\" \"~A\" (\"Address already in use\") (98)))) (messages (\"Service ssh-daemon has been stopped.\")))", 204) = 204
> 1 14:12:15.117754 close(21) = 0

This suggests ‘bind’ can return EADDRINUSE even though the sockets have
been closed before (presumably file descriptors 27 and 30 above).

Can you confirm nothing else is competing to bind port 2222 on that
machine?

I tried to reproduce it with something as brutal as:

while sudo herd restart sshd ; do : ; done

… to no avail (I’m on current Shepherd ‘master’ though).

Maybe we should just have shepherd retry upon EADDRINUSE (like nginx
does, as you wrote), though I’d like to understand under what conditions
we can get EADDRINUSE in the first place.

Ludo’.
L
L
Lars-Dominik Braun wrote on 28 Apr 2023 14:31
(name . Ludovic Courtès)(address . ludo@gnu.org)
ZEu8jKAiumAuVMz0@zpidnb93
Hi,

Toggle quote (3 lines)
> Can you confirm nothing else is competing to bind port 2222 on that
> machine?

not sure how to confirm that with certainty (it’s hard to get an lsof
in the exact right moment), but according to the OS config only SSHd is
supposed to use port 2222, see [1].


Cheers,
Lars

--
Lars-Dominik Braun
Wissenschaftlicher Mitarbeiter/Research Associate

www.leibniz-psychology.org
ZPID - Leibniz-Institut für Psychologie /
ZPID - Leibniz Institute for Psychology
Universitätsring 15
D-54296 Trier - Germany
Tel.: +49–651–201-4964
-----BEGIN PGP SIGNATURE-----

iQGzBAABCAAdFiEEyk+M9DfXR4/aBV/UQhN3ARo3hEYFAmRLvIoACgkQQhN3ARo3
hEa3cwwAn4XB+7JSIhtyI4geOCDQS64cz473qezl63iYP0OxDugft5uvFYJI6b/V
Yy3JrwTbYlLecTEZDIp3C6rstFvmOZ4/qEjW+4ePIfhkftUVJIGRq7JTBWu0uFKu
Gg+g5HzS2kF3VLNyQaLfzC4zFbMh75GxqAuUSRBdSk4AlFg1GfaSpMb96eEJf1IT
E+yKTPd7GQFL3RgvtJps+RJienBQCXx3rMZiQyaz1BZvBESP79G2kd4lYWYGNSNx
sodznsNhFBb4qR73vKzeVc4WXD9n7zTrbOUQzX+cQGZ0ADcvEGtuHZ2fVCsUVYig
ZHNOjUvbA3F2AQRnBVl3Hz4n62APc//3IEoxbUBS+I+vhe66ZyNzbIbHsF8lFUwo
BnAB1y8RIVjZPcK/SPtJYiGikzoY9eH8Fi/BxVDefFMcNKGkCEEwF0WGXQg5QvE4
03KbXS6dBfjYva6fWo199T6uRPE6XTafgaGawa8PYFhIZi3KoF95deNiKrAswfdk
faD62/Lb
=AcDX
-----END PGP SIGNATURE-----


L
L
Ludovic Courtès wrote on 28 Apr 2023 15:09
Re: bug#58485: [shepherd] EADDRINUSE while restarting ssh-daemon
(name . Lars-Dominik Braun)(address . ldb@leibniz-psychology.org)
87h6t0up9t.fsf_-_@gnu.org
Hi,

Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (13 lines)
>> 1 14:12:15.117035 read(21, "(shepherd-command (version 0) (action restart) (service ssh-daemon) (arguments ()) (directory \"/root\"))", 1024) = 103
>> 1 14:12:15.117254 close(27) = 0
>> 1 14:12:15.117283 close(30) = 0
>> 1 14:12:15.117416 newfstatat(AT_FDCWD, "/etc/localtime", {st_dev=makedev(0x8, 0x2), st_ino=110100491, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, s
>> t_size=2298, st_atime=1676898665 /* 2023-02-20T14:11:05.338746772+0100 */, st_atime_nsec=338746772, st_mtime=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_mtime_nsec=874743456, st_c
>> time=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_ctime_nsec=874743456}, 0) = 0
>> 1 14:12:15.117475 write(17, "shepherd[1]: Service ssh-daemon has been stopped.\n", 50) = 50
>> 1 14:12:15.117524 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 26
>> 1 14:12:15.117561 setsockopt(26, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>> 1 14:12:15.117598 bind(26, {sa_family=AF_INET, sin_port=htons(2222), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
>> 1 14:12:15.117724 write(21, "(reply (version 0) (result #f) (error (error (version 0) action-exception start ssh-daemon system-error (\"bind\" \"~A\" (\"Address already in use\") (98)))) (messages (\"Service ssh-daemon has been stopped.\")))", 204) = 204
>> 1 14:12:15.117754 close(21) = 0

[...]

Toggle quote (4 lines)
> Maybe we should just have shepherd retry upon EADDRINUSE (like nginx
> does, as you wrote), though I’d like to understand under what conditions
> we can get EADDRINUSE in the first place.

L
L
Ludovic Courtès wrote on 11 Jun 2023 16:20
(name . Lars-Dominik Braun)(address . ldb@leibniz-psychology.org)
87ttveqdwa.fsf@gnu.org
Hi Lars,

Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (25 lines)
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>>> 1 14:12:15.117035 read(21, "(shepherd-command (version 0) (action restart) (service ssh-daemon) (arguments ()) (directory \"/root\"))", 1024) = 103
>>> 1 14:12:15.117254 close(27) = 0
>>> 1 14:12:15.117283 close(30) = 0
>>> 1 14:12:15.117416 newfstatat(AT_FDCWD, "/etc/localtime", {st_dev=makedev(0x8, 0x2), st_ino=110100491, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, s
>>> t_size=2298, st_atime=1676898665 /* 2023-02-20T14:11:05.338746772+0100 */, st_atime_nsec=338746772, st_mtime=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_mtime_nsec=874743456, st_c
>>> time=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_ctime_nsec=874743456}, 0) = 0
>>> 1 14:12:15.117475 write(17, "shepherd[1]: Service ssh-daemon has been stopped.\n", 50) = 50
>>> 1 14:12:15.117524 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 26
>>> 1 14:12:15.117561 setsockopt(26, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>>> 1 14:12:15.117598 bind(26, {sa_family=AF_INET, sin_port=htons(2222), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
>>> 1 14:12:15.117724 write(21, "(reply (version 0) (result #f) (error (error (version 0) action-exception start ssh-daemon system-error (\"bind\" \"~A\" (\"Address already in use\") (98)))) (messages (\"Service ssh-daemon has been stopped.\")))", 204) = 204
>>> 1 14:12:15.117754 close(21) = 0
>
> [...]
>
>> Maybe we should just have shepherd retry upon EADDRINUSE (like nginx
>> does, as you wrote), though I’d like to understand under what conditions
>> we can get EADDRINUSE in the first place.
>
> Done:
>
> https://git.savannah.gnu.org/cgit/shepherd.git/commit/?id=41789ee8d0e164967f9ca196db4e9601400a462e

I’m assuming that this is fixed in Shepherd 0.10.x. Please reopen if
you stumble upon this issue again.

Ludo’.
Closed
?