[Cuirass] Workers not waking up after server went away

  • Done
Details
2 participants
  • Ludovic Courtès
  • Ludovic Courtès
Owner: unassigned
Submitted by: Ludovic Courtès
Severity: normal
Ludovic Courtès wrote on 27 Nov 2023 14:28
(address . bug-guix@gnu.org)
875y1nxr3q.fsf@inria.fr
Hello,

The ‘cuirass remote-worker’ processes (1.2.0-1.bdc1f9f) didn’t wake up
after ‘cuirass remote-server’ stopped responding earlier today,
remaining stuck while waiting for a reply to their latest “request work”
message:

Nov 27 02:47:30 guixp9 cuirass[22122]: COhE8Mw6: derivation `/gnu/store/acljcvz7wb3pc9bxipkl1vf74ac7ns2z-calf-0.90.3.drv' build failed: build o
Nov 27 02:47:30 guixp9 cuirass[22122]: COhE8Mw6: request work.
Nov 27 02:47:30 guixp9 cuirass[22122]: HKCtyhxH: derivation `/gnu/store/z51fxy3j476136wcqd5gmy9v9r2vyqwn-csdr-0.18.2.drv' build failed: build o
Nov 27 02:47:30 guixp9 cuirass[22122]: HKCtyhxH: request work.
Nov 27 02:47:44 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:47:44 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:48:44 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:48:44 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:49:45 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:49:45 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:50:45 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:50:45 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:51:45 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:51:45 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:52:45 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:52:45 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:53:46 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:53:46 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:54:46 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:54:46 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:55:46 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:55:46 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:55:53 guixp9 cuirass[22122]: worker's alive
Nov 27 02:56:46 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:56:46 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:57:47 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.

They had to be manually restarted.

This shouldn’t be the case. Instead, they should say “received
bootstrap message” when the new ‘cuirass remote-server’ is spawned and
keep going.

Ludo’.
Ludovic Courtès wrote on 29 Aug 2024 11:38
(address . 67485-done@debbugs.gnu.org)
87ttf3fz3l.fsf@gnu.org
Ludovic Courtès <ludovic.courtes@inria.fr> skribis:

> The ‘cuirass remote-worker’ processes (1.2.0-1.bdc1f9f) didn’t wake up
> after ‘cuirass remote-server’ stopped responding earlier today,
> remaining stuck while waiting for a reply to their latest “request work”
> message:

I believe this is fixed. In particular, Cuirass commit
fdb6bdfa27d9da8d052ed76b6a05b3817ff19777 added a timeout waiting for
“request work” replies.

Ludo’.
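
For readers unfamiliar with the pattern, here is a minimal sketch of what "a timeout waiting for replies" can look like. It is not the code from the commit mentioned above (Cuirass is written in Guile Scheme); it is a Python illustration under stated assumptions. The names `request_work`, `ReplyTimeout`, `send` and `replies` are hypothetical, and a `queue.Queue` stands in for the network socket.

```python
import queue


class ReplyTimeout(Exception):
    """Raised when no reply arrives after all attempts (hypothetical name)."""


def request_work(send, replies, timeout=0.05, attempts=3):
    """Send "request work" and wait for a reply, re-sending on timeout.

    `send` is any callable that transmits a message; `replies` is a
    queue.Queue on which replies arrive.  Both are stand-ins for a real
    socket and are assumptions made for this illustration.
    """
    for _ in range(attempts):
        send("request work")
        try:
            # Block for at most `timeout` seconds instead of forever.
            return replies.get(timeout=timeout)
        except queue.Empty:
            # No reply: the server may have gone away, so ask again.
            continue
    raise ReplyTimeout(f"no reply after {attempts} attempts")
```

Without the `timeout` argument, `replies.get()` would block indefinitely, which is analogous to the stuck state shown in the log above. With it, the caller regains control and can re-send the request or reconnect once the server is back.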
Closed

This issue is archived.

To comment on this conversation send an email to 67485@debbugs.gnu.org