[Cuirass] ‘request-work’ responses received by several workers

  • Done
Details
Participants: Ludovic Courtès
Owner: unassigned
Submitted by: Ludovic Courtès
Severity: normal
Ludovic Courtès wrote on 23 Dec 2023 10:13
(address . bug-guix@gnu.org)
87wmt5704i.fsf@inria.fr
Hello,

I’m under the impression that sometimes, when the server replies to
‘worker-request-work’ messages, its reply is received by more than just
the target worker, leading to builds being performed twice:

ludo@berlin ~$ sudo grep lyhz5d1jb396m32dy0fs9h8vqzw95ddp /var/log/cuirass-remote-server.log
2023-12-23 00:15:29 141.80.167.184 (0LFowqzr): build started: '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv'.
2023-12-23 00:18:41 fetching 1 outputs of '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv' from http://141.80.167.184:5558
2023-12-23 00:18:45 build succeeded: '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv'
2023-12-23 00:21:20 141.80.167.159 (oNzYXCv5): build started: '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv'.
2023-12-23 00:24:31 fetching 1 outputs of '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv' from http://141.80.167.159:5558
2023-12-23 00:24:32 build succeeded: '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv'
ludo@berlin ~$ sudo ssh root@141.80.167.184 grep lyhz5d1jb396m32dy0fs9h8vqzw95ddp /var/log/cuirass-remote-worker.log
2023-12-23 00:12:32 0LFowqzr: building derivation `/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv' (system: x86_64-linux)
2023-12-23 00:12:54 0LFowqzr: derivation /gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv build succeeded.
ludo@berlin ~$ sudo ssh root@141.80.167.159 grep lyhz5d1jb396m32dy0fs9h8vqzw95ddp /var/log/cuirass-remote-worker.log
2023-12-23 00:17:51 oNzYXCv5: building derivation `/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv' (system: x86_64-linux)
2023-12-23 00:18:17 oNzYXCv5: derivation /gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv build succeeded.

This is with Cuirass 1.2.0-1.bdc1f9f.

To be continued…

Ludo’.
Ludovic Courtès wrote on 28 May 23:50 +0200
(address . 67988@debbugs.gnu.org)
8734q1wqq8.fsf@gnu.org
Ludovic Courtès <ludovic.courtes@inria.fr> skribis:

> I’m under the impression that sometimes, when the server replies to
> ‘worker-request-work’ messages, its reply is received by more than just
> the target worker, leading to builds being performed twice:

Seen again:

ludo@guix-hpc4 ~/src/cuirass$ sudo grep nmhvrka9i4qng54w3d478j1lsp9dn7r7 /var/log/cuirass-remote-server.log
2024-05-28 21:31:43 194.199.1.26 (PajrOfGX): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.
2024-05-28 21:34:22 194.199.1.27 (exataaY9): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.
2024-05-28 21:38:32 194.199.1.17 (DIwFaVSn): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.
2024-05-28 22:16:13 fetching 1 outputs of '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' from http://194.199.1.26:5558
2024-05-28 22:16:18 build succeeded: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'
2024-05-28 22:53:49 fetching 1 outputs of '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' from http://194.199.1.27:5558
2024-05-28 22:53:49 build succeeded: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'
2024-05-28 23:03:50 fetching 1 outputs of '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' from http://194.199.1.17:5558
2024-05-28 23:03:50 build succeeded: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'

And on workers:

$ ssh root@guix-hpc3 grep nmhvrka9i4qng54w3d478j1lsp9dn7r7 /var/log/cuirass-remote-worker.log
2024-05-28 21:57:43 DIwFaVSn: building derivation `/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' (system: x86_64-linux)
2024-05-28 23:22:58 DIwFaVSn: derivation /gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv build succeeded.
$ ssh root@guix-hpc5 grep nmhvrka9i4qng54w3d478j1lsp9dn7r7 /var/log/cuirass-remote-worker.log
2024-05-28 21:34:13 PajrOfGX: building derivation `/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' (system: x86_64-linux)
2024-05-28 22:18:40 PajrOfGX: derivation /gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv build succeeded.
$ ssh root@guix-hpc7 grep nmhvrka9i4qng54w3d478j1lsp9dn7r7 /var/log/cuirass-remote-worker.log
2024-05-28 21:34:11 exataaY9: building derivation `/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' (system: x86_64-linux)
2024-05-28 22:53:35 exataaY9: derivation /gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv build succeeded.
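
Duplicate builds like this one can be spotted mechanically by grouping the "build started" lines of the server log by derivation. The following is a minimal Python sketch, not part of Cuirass: the regular expression only assumes the line layout visible in the excerpts above, and the names `STARTED` and `duplicate_builds` are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the excerpts above:
#   DATE TIME HOST (WORKER): build started: '/gnu/store/....drv'.
STARTED = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) \((?P<worker>\w+)\): build started: '(?P<drv>[^']+)'"
)

def duplicate_builds(lines):
    """Return {derivation: [(time, worker), ...]} for every derivation
    that was started by more than one distinct worker."""
    starts = defaultdict(list)
    for line in lines:
        match = STARTED.match(line)
        if match:
            starts[match["drv"]].append((match["time"], match["worker"]))
    return {
        drv: entries
        for drv, entries in starts.items()
        if len({worker for _, worker in entries}) > 1
    }

# Three "build started" lines copied from the server log above, plus one
# "fetching" line that the pattern is expected to ignore.
SAMPLE_LOG = """\
2024-05-28 21:31:43 194.199.1.26 (PajrOfGX): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.
2024-05-28 21:34:22 194.199.1.27 (exataaY9): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.
2024-05-28 21:38:32 194.199.1.17 (DIwFaVSn): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.
2024-05-28 22:16:13 fetching 1 outputs of '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' from http://194.199.1.26:5558
"""

dups = duplicate_builds(SAMPLE_LOG.splitlines())
for drv, entries in dups.items():
    print(drv, [worker for _, worker in entries])
```

On a real log, the same function could be fed the lines of /var/log/cuirass-remote-server.log instead of the sample string.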

Ludo’.
Ludovic Courtès wrote on 31 May 21:55 +0200
(address . 67988@debbugs.gnu.org)
87ttidrc2j.fsf@gnu.org
Ludovic Courtès <ludovic.courtes@inria.fr> skribis:

> I’m under the impression that sometimes, when the server replies to
> ‘worker-request-work’ messages, its reply is received by more than just
> the target worker, leading to builds being performed twice:

On closer inspection, the theory of the message being received by two
different peers doesn’t hold.

Instead, I believe ‘db-get-pending-build’ would return the same build at
two different points in time, typically while the first one is still
running.

That’s normally not possible because the build’s status is changed to
‘submitted’ once it’s been picked up. Turns out that, due to slowness
of the query in ‘db-get-pending-build’ (fixed in
17338588d4862b04e9e405c1244a2ea703b50d98), ‘remote-server’ would
sometimes fail to see worker pings in a timely fashion. Thus, it would
call ‘db-remove-unresponsive-workers’, which would reschedule builds
that were being carried out by said worker(s). And that’s how we would
end up with multiple concurrent builds of the same derivation.
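
The sequence described above can be illustrated with a toy model. This is a hedged Python sketch, not the Cuirass code (which is written in Guile Scheme): the class, the method names, and the `PING_TIMEOUT` value are hypothetical and only loosely mirror ‘db-get-pending-build’ and ‘db-remove-unresponsive-workers’. It shows how a worker whose pings go unprocessed for too long gets its build put back in the queue, after which a second worker picks up the same derivation.

```python
from dataclasses import dataclass, field
from typing import Optional

PING_TIMEOUT = 120  # seconds; hypothetical value chosen for the example

@dataclass
class Build:
    drv: str
    status: str = "scheduled"      # "scheduled" or "submitted"
    worker: Optional[str] = None

@dataclass
class Server:
    builds: list
    last_seen: dict = field(default_factory=dict)   # worker -> last ping time
    started: list = field(default_factory=list)     # (time, worker, drv)

    def handle_ping(self, worker, now):
        self.last_seen[worker] = now

    def get_pending_build(self):
        # Stand-in for the database query: first build still "scheduled".
        return next((b for b in self.builds if b.status == "scheduled"), None)

    def request_work(self, worker, now):
        self.handle_ping(worker, now)
        build = self.get_pending_build()
        if build is not None:
            build.status, build.worker = "submitted", worker
            self.started.append((now, worker, build.drv))
        return build

    def remove_unresponsive_workers(self, now):
        # Workers not seen recently are dropped and their builds rescheduled.
        for worker, seen in list(self.last_seen.items()):
            if now - seen > PING_TIMEOUT:
                del self.last_seen[worker]
                for b in self.builds:
                    if b.worker == worker and b.status == "submitted":
                        b.status, b.worker = "scheduled", None

server = Server([Build("firefox.drv")])
server.request_work("worker-A", now=0)        # A starts building

# A keeps pinging at t=60 and t=120, but the server is assumed to be stuck
# in a slow query until t=200, so handle_ping is never called for them.
server.remove_unresponsive_workers(now=200)   # A wrongly deemed unresponsive
server.request_work("worker-B", now=201)      # B gets the same derivation

print(server.started)
```

In this model, processing A's pings on time (calling `handle_ping("worker-A", 120)` before the sweep) keeps the build in the ‘submitted’ state, which is consistent with the slow query being the trigger.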

I added logging in c2061ca845d05694ebeb88935a6ff2254711beb2, which
should give a hint, should that happen again.

Ludo’.
Ludovic Courtès wrote on 4 Jun 15:56 +0200
control message for bug #67988
(address . control@debbugs.gnu.org)
87jzj47qw7.fsf@gnu.org
close 67988
quit
This issue is archived.

To comment on this conversation send an email to 67988@debbugs.gnu.org
