shepherd: sometimes hangs on `guix system reconfigure`

  • Done
  • quality assurance status badge
Details
4 participants
  • Attila Lendvai
  • Attila Lendvai
  • Timo Wilken
  • Ludovic Courtès
Owner
unassigned
Submitted by
Attila Lendvai
Severity
normal
A
A
Attila Lendvai wrote on 15 Dec 2023 20:20
(name . bug-guix@gnu.org)(address . bug-guix@gnu.org)
tiryjH24wix4bRsFPYYQqznvxQeUyfJGNBjPGw5igWnkeFQTppjX9hL7Ju-n9hsZENUbfi2Vxq_ikILA52-IgQYEo-sNzB2Di-CvYd_dykA=@lendvai.name
my fellow hackers,

i'm going to attach two patches that is essentially just adding a couple of asserts that trigger a test failure (tests/replacement.sh).

my current codebase (https://codeberg.org/attila-lendvai-patches/shepherd/commits/branch/attila)logs a whole lot more information, and has a more sophisticated error handling. triggering the same error on that codebase shows that the first assert is already failing (the one that is before spawning the new fiber for the controller of the replacement).

maybe the root cause is this: https://github.com/wingo/fibers/issues/29

HTH,

--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“Angry people want you to see how powerful they are… loving people want you to see how powerful You are.”
— Chief Red Eagle
A
A
Attila Lendvai wrote on 15 Dec 2023 20:37
[PATCH 1/2] shepherd: Move root-service start under with-process-monitor.
(address . 67839@debbugs.gnu.org)(name . Attila Lendvai)(address . attila@lendvai.name)
20231215193708.25169-2-attila@lendvai.name
* modules/shepherd.scm (main): move the (start-service root-service) under the
dynamic extent of with-process-monitor, so that (current-process-monitor) is
valid for the root-service, too.
---
modules/shepherd.scm | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

Toggle diff (23 lines)
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index efc5517..77c6d18 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -451,12 +451,12 @@ fork in the child process."
(run-fibers
(lambda ()
(with-service-registry
+ (with-process-monitor
- ;; Register and start the 'root' service.
- (register-services (list root-service))
- (start-service root-service)
+ ;; Register and start the 'root' service.
+ (register-services (list root-service))
+ (start-service root-service)
- (with-process-monitor
;; Replace the default 'system*' binding with one that
;; cooperates instead of blocking on 'waitpid'. Replace
;; 'primitive-load' (in C as of 3.0.9) with one that does
--
2.41.0
A
A
Attila Lendvai wrote on 15 Dec 2023 20:37
[PATCH 2/2] service: Add two asserts that will make tests/replacement.sh fail.
(address . 67839@debbugs.gnu.org)(name . Attila Lendvai)(address . attila@lendvai.name)
20231215193708.25169-4-attila@lendvai.name
* modules/shepherd/service.scm (spawn-service-controller): Add two asserts.
This is the bug that causes `guix system reconfigure ...` to sometimes hang,
and subsequently all shepherd commands, because a match-error flies out from
the service-controller of a replaced service, and thus its fiber dies.
---
modules/shepherd/service.scm | 2 ++
1 file changed, 2 insertions(+)

Toggle diff (18 lines)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index c3bdf44..0ee6929 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -382,9 +382,11 @@ denoting what the service provides."
(define (spawn-service-controller service)
"Return a channel over which @var{service} may be controlled."
+ (assert (current-process-monitor))
(let ((channel (make-channel)))
(spawn-fiber
(lambda ()
+ (assert (current-process-monitor))
;; The controller writes to its current output port via 'local-output'.
;; Make sure that goes to the right port. If the controller got a
;; wrong output port, it could crash and stop responding just because a
--
2.41.0
T
T
Timo Wilken wrote on 15 Dec 2023 21:33
Re: Shepherd stops responding during "guix system reconfigure"
(name . Attila Lendvai)(address . attila@lendvai.name)
CXP6VUA179NT.24MHZOOPR4XQN@lap.twilken.net
On Fri Dec 15, 2023 at 8:47 PM CET, Attila Lendvai wrote:
Toggle quote (4 lines)
> i think i have found the root cause of this, as documented here: https://issues.guix.gnu.org/67839
>
> that issue contains patches for shepherd to reproduce it in its test suite.

Thank you very much for this, Attila!

Are the patch in 67839 and/or your branch "attila" linked from there in a
state that I could test them locally? Would it be valuable to you if I ran a
patched Shepherd and sent logs and/or backtraces as I encountered them?
A
A
Attila Lendvai wrote on 15 Dec 2023 22:24
(name . Timo Wilken)(address . guix@twilken.net)
Q25YR0CIwjc3XL1sWDYgC7JrIarMKDw3wg326jlNiwx0Rtmj7e_kkoFiry6KGSbid8fBo8bqW5jiaOs8WOK43ijSvz8qWq-RPNoCjz26kus=@lendvai.name
Toggle quote (3 lines)
> Thank you very much for this, Attila!


you're welcome! :)


Toggle quote (5 lines)
> Are the patch in 67839 and/or your branch "attila" linked from there in a
> state that I could test them locally? Would it be valuable to you if I ran a
> patched Shepherd and sent logs and/or backtraces as I encountered them?


it's nice of you, but not really. now that we have a failing test case in shepherd's unit tests that can reproduce it much easier.

with #67839 you would only get you an extra "Assertion failed" message over master, without much useful output.

as for my branch, it would emit a lot of useful log, including backtraces, but i keep force-pushing into it. i'm running my servers with it, though, so if you feel really adventurous, and want to join the debugging, then you can try... otherwise it's too much in flux.

what we need to focus on now is making shepherd's test suite run clean again, one way or another. then i can test it in a real life environment, and report back with any possible findings.

--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“Ignorance might be bliss for the ignorant, but for the rest of us it's a fucking pain in the ass.”
— Ricky Gervais
A
A
Attila Lendvai wrote on 17 Dec 2023 01:06
[PATCH 3/2] shepherd: Fix tests/replacement.sh
(address . 67839@debbugs.gnu.org)(name . Attila Lendvai)(address . attila@lendvai.name)
20231217000623.1430-2-attila@lendvai.name
* modules/shepherd.scm (main): Switch with-service-registry and
with-process-monitor. Fix proposed by @emixa-d at
the parameterize of the process monitor covers everything else.
---
modules/shepherd.scm | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Toggle diff (17 lines)
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index 77c6d18..3303de3 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -450,8 +450,8 @@ fork in the child process."
;; because POSIX threads and 'fork' cannot be used together.
(run-fibers
(lambda ()
- (with-service-registry
- (with-process-monitor
+ (with-process-monitor
+ (with-service-registry
;; Register and start the 'root' service.
(register-services (list root-service))
--
2.41.0
A
A
Attila Lendvai wrote on 17 Dec 2023 01:44
[PATCH v2 1/2] shepherd: Make sure with-process-monitor covers everything needed.
(address . 67839@debbugs.gnu.org)(name . Attila Lendvai)(address . attila@lendvai.name)
20231217004424.28303-2-attila@lendvai.name
* modules/shepherd.scm (main): Switch with-service-registry and
with-process-monitor. This way the parameterize of the process monitor covers
everything else. This fixes the bug that caused `guix system reconfigure` to
hang in certain situations. Fix proposed by @emixa-d at:
---
modules/shepherd.scm | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

Toggle diff (25 lines)
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index efc5517..3303de3 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -450,13 +450,13 @@ fork in the child process."
;; because POSIX threads and 'fork' cannot be used together.
(run-fibers
(lambda ()
- (with-service-registry
+ (with-process-monitor
+ (with-service-registry
- ;; Register and start the 'root' service.
- (register-services (list root-service))
- (start-service root-service)
+ ;; Register and start the 'root' service.
+ (register-services (list root-service))
+ (start-service root-service)
- (with-process-monitor
;; Replace the default 'system*' binding with one that
;; cooperates instead of blocking on 'waitpid'. Replace
;; 'primitive-load' (in C as of 3.0.9) with one that does
--
2.41.0
A
A
Attila Lendvai wrote on 17 Dec 2023 01:44
[PATCH v2 2/2] service: Add asserts that used to make tests/replacement.sh fail.
(address . 67839@debbugs.gnu.org)(name . Attila Lendvai)(address . attila@lendvai.name)
20231217004424.28303-3-attila@lendvai.name
* modules/shepherd/service.scm (spawn-service-controller): Add two asserts.
This is the bug that causes `guix system reconfigure ...` to sometimes hang,
and subsequently all shepherd commands, because a match-error flies out from
the service-controller of a replaced service, and thus its fiber dies. These
asserts get triggered without the previous commit that fixes the issue.
---
modules/shepherd/service.scm | 2 ++
1 file changed, 2 insertions(+)

Toggle diff (18 lines)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index c3bdf44..0ee6929 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -382,9 +382,11 @@ denoting what the service provides."
(define (spawn-service-controller service)
"Return a channel over which @var{service} may be controlled."
+ (assert (current-process-monitor))
(let ((channel (make-channel)))
(spawn-fiber
(lambda ()
+ (assert (current-process-monitor))
;; The controller writes to its current output port via 'local-output'.
;; Make sure that goes to the right port. If the controller got a
;; wrong output port, it could crash and stop responding just because a
--
2.41.0
L
L
Ludovic Courtès wrote on 17 Dec 2023 22:59
Re: bug#67839: shepherd: sometimes hangs on `guix system reconfigure`
(name . Attila Lendvai)(address . attila.lendvai@gmail.com)
87sf40qylc.fsf_-_@gnu.org
Hi Attila,

Attila Lendvai <attila.lendvai@gmail.com> skribis:

Toggle quote (6 lines)
> * modules/shepherd.scm (main): Switch with-service-registry and
> with-process-monitor. This way the parameterize of the process monitor covers
> everything else. This fixes the bug that caused `guix system reconfigure` to
> hang in certain situations. Fix proposed by @emixa-d at:
> https://github.com/wingo/fibers/issues/29#issuecomment-1858922276.

[...]

Toggle quote (6 lines)
> * modules/shepherd/service.scm (spawn-service-controller): Add two asserts.
> This is the bug that causes `guix system reconfigure ...` to sometimes hang,
> and subsequently all shepherd commands, because a match-error flies out from
> the service-controller of a replaced service, and thus its fiber dies. These
> asserts get triggered without the previous commit that fixes the issue.

Good catch!!

I pushed these patches with small edits, in particular adding a test
that reproduces the bug without relying on assertion failures:

5dbde1c support: ‘assert’ logs source location information.
0bcf02a Update NEWS.
c07f0a8 service: Add asserts to ensure a process monitor is running.
9be0b7e shepherd: Make sure ‘with-process-monitor’ covers everything needed.

Thanks for the tedious but fruitful debugging work!

Ludo’.
L
L
Ludovic Courtès wrote on 17 Dec 2023 23:00
control message for bug #67839
(address . control@debbugs.gnu.org)
87r0jkqyl3.fsf@gnu.org
close 67839
quit
?