[Cuirass] Workers not waking up after server went away

Status: Open
Participants: Ludovic Courtès
Owner: unassigned
Submitted by: Ludovic Courtès
Severity: normal
Ludovic Courtès wrote on 27 Nov 2023 14:28
(address . bug-guix@gnu.org)
875y1nxr3q.fsf@inria.fr
Hello,

The ‘cuirass remote-worker’ processes (1.2.0-1.bdc1f9f) didn’t wake up
after ‘cuirass remote-server’ stopped responding earlier today,
remaining stuck while waiting for a reply to their latest “request work”
message:

Nov 27 02:47:30 guixp9 cuirass[22122]: COhE8Mw6: derivation `/gnu/store/acljcvz7wb3pc9bxipkl1vf74ac7ns2z-calf-0.90.3.drv' build failed: build o
Nov 27 02:47:30 guixp9 cuirass[22122]: COhE8Mw6: request work.
Nov 27 02:47:30 guixp9 cuirass[22122]: HKCtyhxH: derivation `/gnu/store/z51fxy3j476136wcqd5gmy9v9r2vyqwn-csdr-0.18.2.drv' build failed: build o
Nov 27 02:47:30 guixp9 cuirass[22122]: HKCtyhxH: request work.
Nov 27 02:47:44 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:47:44 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:48:44 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:48:44 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:49:45 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:49:45 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:50:45 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:50:45 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:51:45 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:51:45 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:52:45 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:52:45 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:53:46 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:53:46 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:54:46 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:54:46 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:55:46 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:55:46 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:55:53 guixp9 cuirass[22122]: worker's alive
Nov 27 02:56:46 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.
Nov 27 02:56:46 guixp9 cuirass[22122]: HKCtyhxH: ping tcp://10.0.0.1:5555.
Nov 27 02:57:47 guixp9 cuirass[22122]: COhE8Mw6: ping tcp://10.0.0.1:5555.

They had to be manually restarted.

This shouldn’t be the case. Instead, they should say “received
bootstrap message” when the new ‘cuirass remote-server’ is spawned and
keep going.
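
To illustrate the kind of behavior expected, here is a minimal sketch in Python (not Cuirass code, which is written in Guile Scheme) of a worker that bounds its wait for a reply and resends “request work” when nothing arrives. The `request_work` function, its parameters, and the use of `queue.Queue` objects as stand-ins for the worker-to-server connection are all assumptions made for illustration:

```python
import queue


def request_work(requests, replies, timeout=0.1, max_attempts=3):
    """Send "request work" and wait for a reply, resending on timeout.

    `requests` and `replies` are hypothetical stand-ins for the channel
    to and from the server.  Returns the reply, or None if every
    attempt timed out.
    """
    for _ in range(max_attempts):
        requests.put("request work")
        try:
            # Bounded wait: a server that went away cannot block the
            # worker forever.
            return replies.get(timeout=timeout)
        except queue.Empty:
            # No reply in time; assume the request was lost and resend.
            continue
    return None
```

With a bounded wait like this, a request lost when the server goes away is eventually resent, so a newly spawned server would see it instead of the worker waiting indefinitely.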

Ludo’.