cuirass-remote-worker crash

  • Done
Details
3 participants
  • Ludovic Courtès
  • Ludovic Courtès
  • Mathieu Othacehe
Owner
unassigned
Submitted by
Ludovic Courtès
Severity
normal
Ludovic Courtès wrote on 22 Nov 2022 23:14
(address . bug-guix@gnu.org)
87ilj6hc2a.fsf@inria.fr
Hi,

In /var/log/cuirass-remote-worker.log on overdrive1.guix, I found this:

2022-11-21 14:27:24 Backtrace:
2022-11-21 14:27:24 Backtrace:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 1752:10 10 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24 In unknown file:
2022-11-21 14:27:24 9 (apply-smob/0 #<thunk 3903a300>)
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 724:2 8 (call-with-prompt _ _ #<procedure default-prompt-handle?>)
2022-11-21 14:27:24 In ice-9/eval.scm:
2022-11-21 14:27:24 1752:10 10 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24 619:8 7 (_ #(#(#<directory (guile-user) 3903dc80>)))
2022-11-21 14:27:24 In cuirass/ui.scm:
2022-11-21 14:27:24 In unknown file:
2022-11-21 14:27:24 9 (apply-smob/0 #<thunk 3903a300>)
2022-11-21 14:27:24 104:10 6 (run-cuirass-command _ . _)
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 724:2 8 (call-with-prompt _ _ #<procedure default-prompt-handle?>)
2022-11-21 14:27:24 1752:10 5 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24 In ice-9/eval.scm:
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24 619:8 7 (_ #(#(#<directory (guile-user) 3903dc80>)))
2022-11-21 14:27:24 In cuirass/ui.scm:
2022-11-21 14:27:24 104:10 6 (run-cuirass-command _ . _)
2022-11-21 14:27:24 435:12 4 (_)
2022-11-21 14:27:24 In srfi/srfi-1.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 1752:10 5 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24 634:9 3 (for-each #<procedure 398a3510 at cuirass/scripts/remo?> ?)
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24 448:18 2 (_ _)
2022-11-21 14:27:24 435:12 4 (_)
2022-11-21 14:27:24 In srfi/srfi-1.scm:
2022-11-21 14:27:24 634:9 3 (for-each #<procedure 398a3510 at cuirass/scripts/remo?> ?)
2022-11-21 14:27:24 356:11 1 (start-worker _ _)
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 448:18 2 (_ _)
2022-11-21 14:27:24 1685:16 0 (raise-exception _ #:continuable? _)
2022-11-21 14:27:24
2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.
2022-11-21 14:27:24 356:11 1 (start-worker _ _)
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 1685:16 0 (raise-exception _ #:continuable? _)
2022-11-21 14:27:24
2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.

(Stuttering is due to the unprotected use of ‘primitive-fork’: a
non-local exit in the child leads it to execute the same code as its
parent. We should fix that, but should we really fork in the first
place? :-))
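
A minimal sketch of how the child side of a fork can be guarded, assuming a hypothetical helper named `fork-and-run` (it is not part of Cuirass, and this is not necessarily how the issue was eventually fixed). The child body runs inside `dynamic-wind`, so a non-local exit ends in `primitive-exit` instead of unwinding back into code meant for the parent:

```scheme
(use-modules (ice-9 match))

;; Hypothetical helper, not Cuirass code: run THUNK in a child process
;; and return the child's PID in the parent.
(define (fork-and-run thunk)
  (match (primitive-fork)
    (0
     ;; Child: whatever happens in THUNK, never return to the caller.
     (dynamic-wind
       (const #t)
       (lambda ()
         (thunk)
         (primitive-exit 0))
       (lambda ()
         ;; Reached only on a non-local exit, such as an exception
         ;; that unwinds out of THUNK.
         (primitive-exit 1))))
    (pid pid)))
```

With such a guard, an exception raised in the child would still be reported by whatever handler catches it, but the child would exit right after instead of continuing with the parent's code.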

This comes from here:

(define (read-server-info socket)
  (request-info socket)
  (match (zmq-get-msg-parts-bytevector socket '()) ;<-- here
    ((empty info)
     (match (zmq-read-message (bv->string info))
       (('server-info
         ('worker-address worker-address)
         ('log-port log-port)
         ('publish-port publish-port))
        (list worker-address log-port publish-port))))))

This is the version being used:

ludo@overdrive1 ~$ cat /proc/24019/cmdline |xargs -0
/gnu/store/zpir9n73amaxrwz2k7x46l73v21vxk6s-guile-3.0.8/bin/guile --no-auto-compile -e main -s /gnu/store/rlqdzmfyamjpn6lz07yqk2hsabv3l7g5-cuirass-1.1.0-11.9f08035/bin/.cuirass-real remote-worker --workers=2 --server=10.0.0.1:5555 --systems=armhf-linux,aarch64-linux --publish-port=5558 --substitute-urls=http://10.0.0.1
ludo@overdrive1 ~$ guix system describe
Generation 36 Sep 27 2022 09:06:48 (current)
  file name: /var/guix/profiles/system-36-link
  canonical file name: /gnu/store/m04qw6f0lfd0wpn1skiys4b56wqfc3b8-system
  label: GNU with Linux-Libre 5.19.11
  bootloader: grub-efi
  root device: /dev/sda3
  kernel: /gnu/store/09r4wbbabskmbrnwmshpdk7vh6g87gam-linux-libre-5.19.11/Image
  channels:
    guix:
      repository URL: https://git.savannah.gnu.org/git/guix.git
      commit: f15a141cf35bd4188767f0e91c0654991d4c49e0
  configuration file: /gnu/store/myvzd1kpw2pfzfj3krl4lzpcbqsdn48x-configuration.scm

The sequence leading to this seems to be:

22340 eventfd2(0, EFD_CLOEXEC <unfinished ...>
[…]
22340 <... eventfd2 resumed>) = 15
[…]
22340 ppoll([{fd=15, events=POLLIN}], 1, NULL, NULL, 0 <unfinished ...>
[…]
22340 <... ppoll resumed>) = 1 ([{fd=15, revents=POLLIN}])
22343 epoll_pwait(8, <unfinished ...>
22340 read(15, "\1\0\0\0\0\0\0\0", 8) = 8
22340 ppoll([{fd=15, events=POLLIN}], 1, {tv_sec=0, tv_nsec=0}, NULL, 0) = 0 (Timeout)
22340 write(2, "Backtrace:\n", 11) = 11

Does that ring a bell? Perhaps that was fixed in the meantime?

Right now it cannot be restarted: it always fails at start up with the
error above. 10.0.0.1 is reachable though so I’m not sure what’s up.

Ludo’.
Mathieu Othacehe wrote on 23 Nov 2022 09:08
(name . Ludovic Courtès)(address . ludovic.courtes@inria.fr)(address . 59493@debbugs.gnu.org)
87h6yqw0sf.fsf@gnu.org
Hello Ludo,

Thanks for gathering this information.

> 2022-11-21 14:27:24 1685:16 0 (raise-exception _ #:continuable? _)
> 2022-11-21 14:27:24
> 2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
> 2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.

Yes this is because a new remote-server is running on Berlin and it
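sends an empty sequence at every connection:
https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=fc1641381d2a8a0472a71ef5ad2b64361faaaab4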

All remote-workers must update, and I have deployed Cuirass
1.1.0-13.1341725 on all hydra workers + guix9p.
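
To illustrate why the old worker fails: its `match` has a single clause expecting exactly two message parts, so a message made of one empty frame, `(#vu8())`, falls through and raises `match-error`. The sketch below is purely illustrative and is not the change made in the commit above. It uses a hypothetical procedure `classify-parts` operating on plain lists of bytevectors rather than on a real ZMQ socket, and it adds fallback clauses so that unexpected input is reported instead of crashing:

```scheme
(use-modules (ice-9 match)
             (rnrs bytevectors))

;; Hypothetical helper: classify the list of message parts that a
;; worker would read from its socket.
(define (classify-parts parts)
  (match parts
    ;; The shape the old worker expects: an empty delimiter frame
    ;; followed by the payload.
    ((empty info)
     (list 'info (utf8->string info)))
    ;; A single frame, as in the backtrace: `(#vu8())'.
    ((part)
     (list 'single-frame part))
    ;; Anything else is reported rather than raising `match-error'.
    (other
     (list 'unexpected other))))
```

For example, `(classify-parts (list #vu8()))` would return a `single-frame` entry, which a caller could choose to log or skip.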

I have been trying to deploy that to overdrive1 for two days, but Berlin
offloads the builds to kreuzberg, which has some issues because a lot of
builds are timing out:

building of `/gnu/store/9jg75a8rvdz3qxcbbm95312rlc4hyi98-mrustc-0.10-2.597593a-checkout.drv' timed out after 3600 seconds of silence
build of /gnu/store/9jg75a8rvdz3qxcbbm95312rlc4hyi98-mrustc-0.10-2.597593a-checkout.drv failed
View build log at '/var/log/guix/drvs/9j/g75a8rvdz3qxcbbm95312rlc4hyi98-mrustc-0.10-2.597593a-checkout.drv.gz'.
cannot build derivation `/gnu/store/wavx7rl6h93fpmc46nggnhkyxm75lqa4-mrustc-0.10-2.597593a-checkout.drv': 1 dependencies couldn't be built

> (Stuttering is due to the unprotected use of ‘primitive-fork’: a
> non-local exit in the child leads it to execute the same code as its
> parent. We should fix that, but should we really fork in the first
> place? :-))

Right, this is problematic. I can't remember why I chose to fork.

In the meantime, this should be fixed by updating to 1.1.0-13.1341725 so
we can close this one I guess.

Mathieu
Ludovic Courtès wrote on 23 Nov 2022 16:47
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 59493@debbugs.gnu.org)
87tu2pfzaj.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> 2022-11-21 14:27:24 1685:16 0 (raise-exception _ #:continuable? _)
>> 2022-11-21 14:27:24
>> 2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
>> 2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.
>
> Yes this is because a new remote-server is running on Berlin and it
> sends an empty sequence at every connection:
> https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=fc1641381d2a8a0472a71ef5ad2b64361faaaab4

Oh I see. It would be nice to avoid non-backward-compatible changes in
the protocol so we can upgrade more smoothly.

> All remote-workers must update, and I have deployed Cuirass
> 1.1.0-13.1341725 on all hydra workers + guix9p.
>
> I have been trying to deploy that to overdrive1 for two days, but Berlin
> offloads the builds to kreuzberg, which has some issues because a lot of
> builds are timing out:

Done now!

ludo@overdrive1 ~$ guix system describe
Generation 37 Nov 23 2022 15:58:08 (current)
  file name: /var/guix/profiles/system-37-link
  canonical file name: /gnu/store/62dr875n7i30l375j87flbqfym78kddg-system
  label: GNU with Linux-Libre 6.0.9
  bootloader: grub-efi
  root device: /dev/sda3
  kernel: /gnu/store/p4impcxw8lba8600acrxs21lgzc06xzq-linux-libre-6.0.9/Image
  channels:
    guix:
      repository URL: https://git.savannah.gnu.org/git/guix.git
      commit: 78f03567f44f704dfbc03cb64368aa42a01e78ad
  configuration file: /gnu/store/myvzd1kpw2pfzfj3krl4lzpcbqsdn48x-configuration.scm

Running the Shepherd 0.9.3 and all, wonderful.

>> (Stuttering is due to the unprotected use of ‘primitive-fork’: a
>> non-local exit in the child leads it to execute the same code as its
>> parent. We should fix that, but should we really fork in the first
>> place? :-))

Fixed in Cuirass commit 9fb6f21d29c5398b35f4c1a77cf6c20f207c9ebb.

> Right, this is problematic. I can't remember why I chose to fork.

One concern is that, in the Avahi case, we create at least one thread
before forking, and as we know that doesn’t work (as in: it might work
sometimes). ZMQ may also create threads behind our back.

The parent doesn’t call ‘waitpid’ on its children, which isn’t great.
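
As a rough sketch of what reaping could look like, assuming a hypothetical helper `reap-children` that is not part of Cuirass: the parent calls `waitpid` on each child PID it recorded and collects the exit codes, so that terminated children do not linger as zombies.

```scheme
(use-modules (ice-9 match))

;; Hypothetical helper: block until each process in PIDS has
;; terminated and return a list of (PID . EXIT-CODE) pairs.
;; EXIT-CODE is #f when the child was killed by a signal.
(define (reap-children pids)
  (map (lambda (pid)
         (match (waitpid pid)
           ((pid . status)
            (cons pid (status:exit-val status)))))
       pids))
```

A long-running parent would more likely reap asynchronously (for instance with the `WNOHANG` option or from a `SIGCHLD` handler) rather than block as this sketch does.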

To me, ideally this would be either multi-threaded or Fiberized. The
latter would be more fruitful but what might be difficult is
guile-simple-zmq integration with Fibers (but maybe not: zmq_getsockopt
+ ZMQ_FD lets us get the file descriptor of a socket).

Something to consider…

Thanks,
Ludo’.
Ludovic Courtès wrote on 23 Nov 2022 16:47
control message for bug #59493
(address . control@debbugs.gnu.org)
87sfi9fzaa.fsf@gnu.org
close 59493
quit
Mathieu Othacehe wrote on 23 Nov 2022 17:03
Re: bug#59493: cuirass-remote-worker crash
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 59493-done@debbugs.gnu.org)
87k03lwtd7.fsf@gnu.org
Hey,

> Oh I see. It would be nice to avoid non-backward-compatible changes in
> the protocol so we can upgrade more smoothly.

Right, sorry. We should introduce a protocol version to avoid that in
the future.
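
One possible shape for such a check, as a sketch only: the `hello` message, the `%protocol-version` variable, and the `check-hello` procedure below are all hypothetical and do not exist in Cuirass. The idea is that each side announces a version number first, and a peer that receives an unknown or missing version can report it cleanly instead of failing in `match`:

```scheme
(use-modules (ice-9 match))

;; Hypothetical protocol version; Cuirass does not define this.
(define %protocol-version 2)

;; Hypothetical handshake message shape: (hello (version N)).
(define (check-hello message)
  (match message
    (('hello ('version (? integer? version)))
     (if (= version %protocol-version)
         'ok
         (list 'version-mismatch version %protocol-version)))
    ;; Peers that predate versioning send something else entirely.
    (_ 'no-version)))
```

A worker receiving `version-mismatch` or `no-version` could then log an explicit "please upgrade" message and exit or retry later.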

> Fixed in Cuirass commit 9fb6f21d29c5398b35f4c1a77cf6c20f207c9ebb.

Awesome, thanks :)

> To me, ideally this would be either multi-threaded or Fiberized. The
> latter would be more fruitful but what might be difficult is
> guile-simple-zmq integration with Fibers (but maybe not: zmq_getsockopt
> + ZMQ_FD lets us get the file descriptor of a socket).

I would prefer the multi-threaded approach if possible. While the
concept of Fiber is nice it adds another layer of complexity and
instability to those programs which are already hard to debug.

Mathieu
Closed
Ludovic Courtès wrote on 26 Nov 2022 16:04
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 59493-done@debbugs.gnu.org)
87edtp92q3.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> To me, ideally this would be either multi-threaded or Fiberized. The
>> latter would be more fruitful but what might be difficult is
>> guile-simple-zmq integration with Fibers (but maybe not: zmq_getsockopt
>> + ZMQ_FD lets us get the file descriptor of a socket).
>
> I would prefer the multi-threaded approach if possible. While the
> concept of Fiber is nice it adds another layer of complexity and
> instability to those programs which are already hard to debug.

I guess it’s not black and white. Shared-state multithreading is an
endless source of bugs, regardless of the language being used;
message-passing (what Fibers is about) is more tractable.

Sure, Fibers can have bugs of its own (I’m well aware of that :-)) but at
least Fiber-using code can be simpler and less error-ridden than the
equivalent shared-state code.

Anyway, we’re not there yet.

Can you remember the rationale for forking in remote-worker.scm, or do
you think we might as well do it all in a single process?

Thanks,
Ludo’.