cuirass: Fibers scheduling blocked.

Status: Done
Participants: Ludovic Courtès, Mathieu Othacehe
Owner: unassigned
Submitted by: Mathieu Othacehe
Severity: normal

Mathieu Othacehe wrote on 22 Sep 2020 18:58
(address . bug-guix@gnu.org)
87eemtzr1q.fsf@gnu.org
Hello,

Today between 04:04 and 10:36 no inputs were fetched. Fetching is
supposed to happen every 5 minutes. This seems to be correlated with the
duration of the garbage collection happening on berlin.

2020-09-22T04:04:23 fetching input 'core-updates' of spec 'core-updates-core-updates'
2020-09-22T04:04:25 build succeeded: '/gnu/store/c7m6jxdkyjs7m5ynavagjwgp172a3xzv-partition.img.drv'
waiting for the big garbage collector lock...
...
2020-09-22T10:36:02 fetching input 'guix' of spec 'guix-master'

A potential cause is described here:
https://issues.guix.gnu.org/43552#1.

Thanks,

Mathieu

--
Ludovic Courtès wrote on 5 Oct 2020 14:13
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 43565@debbugs.gnu.org)
87r1qc27mo.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

> Today between 04:04 and 10:36 no inputs were fetched. Fetching is
> supposed to happen every 5 minutes. This seems to be correlated with the
> duration of the garbage collection happening on berlin.
>
> 2020-09-22T04:04:23 fetching input 'core-updates' of spec 'core-updates-core-updates'
> 2020-09-22T04:04:25 build succeeded: '/gnu/store/c7m6jxdkyjs7m5ynavagjwgp172a3xzv-partition.img.drv'
> waiting for the big garbage collector lock...
> ...
> 2020-09-22T10:36:02 fetching input 'guix' of spec 'guix-master'
>
> A potential cause is described here:
> https://issues.guix.gnu.org/43552#1.

‘process-build-log’ in Cuirass uses ‘read-line/non-blocking’ to read a
line from the log port of ‘build-derivations&’. If that really is
non-blocking (and I think it is), then we should be fine?

We should attach GDB to Cuirass next time to see what’s blocking.

Ludo’.
Mathieu Othacehe wrote on 22 Oct 2020 13:55
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 43565@debbugs.gnu.org)
874kmmzd92.fsf@gnu.org
Hey Ludo!

> ‘process-build-log’ in Cuirass uses ‘read-line/non-blocking’ to read a
> line from the log port of ‘build-derivations&’. If that really is
> non-blocking (and I think it is), then we should be fine?
>
> We should attach GDB to Cuirass next time to see what’s blocking.

Cuirass is currently hanging probably due to the same issue. I saved a
GDB core dump in /home/mathieu/core.76483.

Could use your help finding the guilty thread :)

Thanks,

Mathieu
Ludovic Courtès wrote on 23 Oct 2020 14:21
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 43565@debbugs.gnu.org)
871rhpqgjy.fsf@gnu.org
Good afternoon fearless hacker!

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> ‘process-build-log’ in Cuirass uses ‘read-line/non-blocking’ to read a
>> line from the log port of ‘build-derivations&’. If that really is
>> non-blocking (and I think it is), then we should be fine?
>>
>> We should attach GDB to Cuirass next time to see what’s blocking.
>
> Cuirass is currently hanging probably due to the same issue. I saved a
> GDB core dump in /home/mathieu/core.76483.

For those following along at home, we have 60 threads in there.

A couple of threads are blocked in ‘clock_nanosleep’, which I considered
fishy at first:

(gdb) bt
#0 0x00007fe26752f7a1 in __GI___clock_nanosleep (clock_id=-612010, flags=0, req=0x7fdf6b40d140, rem=0x7fdf6b40d140)
at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1 0x00007fe267a0166d in ffi_call_unix64 ()
from /gnu/store/bw15z9kh9c65ycc2vbhl2izwfwfva7p1-libffi-3.3/lib/libffi.so.7
#2 0x00007fe2679ffac0 in ffi_call_int () from /gnu/store/bw15z9kh9c65ycc2vbhl2izwfwfva7p1-libffi-3.3/lib/libffi.so.7
#3 0x00007fe267af5f2e in scm_i_foreign_call (cif_scm=<optimized out>, pointer_scm=<optimized out>,
errno_ret=errno_ret@entry=0x7fe25a8e86cc, argv=0x7fe25b955df0) at foreign.c:1073
#4 0x00007fe267b64a84 in foreign_call (thread=0x7fe26741e480, cif=<optimized out>, pointer=<optimized out>)
at vm.c:1282
#5 0x00007fe2505253e0 in ?? ()
#6 0x00007fe26741e480 in ?? ()
#7 0x00007fe267bd7620 in ?? () from /gnu/store/0w76khfspfy8qmcpjya41chj3bgfcy0k-guile-3.0.4/lib/libguile-3.0.so.1
#8 0x00007fe26741e480 in ?? ()
#9 0x00007fe267b1043b in scm_jit_enter_mcode (thread=0x7fe26741e480, thread@entry=0x7fe2505253b0,
mcode=0x7fe25052627c "L\215\243\210") at jit.c:5852
#10 0x00007fe267b6bc24 in vm_regular_engine (thread=0x7fe2505253b0) at vm-engine.c:415
#11 0x00007fe267b6c5b5 in scm_call_n (proc=proc@entry=#<unmatched-tag 20045>, argv=argv@entry=0x0,
nargs=nargs@entry=0) at vm.c:1608
#12 0x00007fe267ae8ae9 in scm_call_0 (proc=proc@entry=#<unmatched-tag 20045>) at eval.c:490
#13 0x00007fe267adb138 in scm_call_with_unblocked_asyncs (proc=#<unmatched-tag 20045>) at async.c:406

This can only come from (fibers posix-clocks) via
‘with-interrupts’—probably OK.

Then there’s a couple of threads blocked in ‘pthread_cond_wait’, but
that’s presumably also Fibers internals.

Then there’s a whole bunch of threads stuck in ‘read’:

(gdb) bt
#0 0x00007fe267a180a4 in __libc_read (fd=80, buf=buf@entry=0x7fe22b0bb8f0, nbytes=nbytes@entry=8)
at ../sysdeps/unix/sysv/linux/read.c:26
#1 0x00007fe267af69c7 in fport_read (port=<optimized out>, dst=<optimized out>, start=<optimized out>, count=8)
at fports.c:597
#2 0x00007fe267b30542 in trampoline_to_c_read (port=#<port #<port-type file 7fe25fb4db40> 7fe22b7b9880>,
dst="#<vu8vector>" = {...}, start=0, count=8) at ports.c:266
#3 0x00007fe2580cb5fe in ?? ()
#4 0x00007fe267431d80 in ?? ()
#5 0x00007fe267bd7620 in ?? () from /gnu/store/0w76khfspfy8qmcpjya41chj3bgfcy0k-guile-3.0.4/lib/libguile-3.0.so.1
#6 0x00007fe267431d80 in ?? ()
#7 0x00007fe267b1043b in scm_jit_enter_mcode (thread=0x7fe267431d80, thread@entry=0x7fe2580cb5d0,
mcode=0x7fe229340690 "H\203\350(I\211\314I)\304I\203\374\060\017\205T\003") at jit.c:5852
#8 0x00007fe267b6b8e9 in vm_regular_engine (thread=0x7fe2580cb5d0) at vm-engine.c:360
#9 0x00007fe267b6c5b5 in scm_call_n (proc=proc@entry=#<unmatched-tag 20045>, argv=argv@entry=0x0,
nargs=nargs@entry=0) at vm.c:1608
#10 0x00007fe267ae8ae9 in scm_call_0 (proc=proc@entry=#<unmatched-tag 20045>) at eval.c:490
#11 0x00007fe267adb138 in scm_call_with_unblocked_asyncs (proc=#<unmatched-tag 20045>) at async.c:406

‘trampoline_to_c_read’ is known as ‘port-read’ in Scheme, so I think the
call above comes from ‘read-bytes’ in (ice-9 suspendable-ports).

Normally, this file descriptor is O_NONBLOCK, and thus ‘fport_read’
immediately returns EAGAIN, so ‘trampoline_to_c_read’ returns #false.

But does Cuirass create file descriptors as O_NONBLOCK? This has to be
done explicitly, Fibers won’t do it for us. As it turns out, the answer
is no, in at least one important case: the connection to the daemon
(untested patch below).
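
To make the O_NONBLOCK point concrete, here is a minimal sketch in Guile that checks and sets the flag on a pipe port. The helper names `set-nonblocking!` and `nonblocking?` are made up for illustration and are not part of Cuirass or Guix; only `pipe`, `fcntl` and the `F_GETFL`, `F_SETFL` and `O_NONBLOCK` constants come from Guile itself.

```scheme
;; Hypothetical helper: add O_NONBLOCK to PORT's file status flags,
;; keeping the flags that are already set.
(define (set-nonblocking! port)
  (let ((flags (fcntl port F_GETFL)))
    (fcntl port F_SETFL (logior O_NONBLOCK flags))
    port))

;; Hypothetical helper: return #t if PORT's descriptor has O_NONBLOCK set.
(define (nonblocking? port)
  (not (zero? (logand O_NONBLOCK (fcntl port F_GETFL)))))

;; A freshly created pipe is blocking until the flag is set explicitly.
(let ((in (car (pipe))))
  (format #t "before: ~a~%" (nonblocking? in))
  (set-nonblocking! in)
  (format #t "after:  ~a~%" (nonblocking? in)))
```

The same check could presumably be pointed at any port a fiber reads from, to see whether a `read` on it can stall the whole scheduler thread.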

While GC is running, Cuirass typically sends ‘build-derivations’ RPCs
and they block until the GC lock is released. That can lead to the
situation above: a bunch of threads blocked in ‘read’ from their daemon
socket, waiting for the RPC reply. OTOH, ‘build-derivations’ RPCs are
made from a fresh thread created by ‘build-derivations&’.

There are probably other situations where the daemon replies slowly.
For instance, ‘fetch-input’ can remain stuck until GC is over.

WDYT?

Thanks for investigating!

Ludo’.
diff --git a/src/cuirass/base.scm b/src/cuirass/base.scm
index 5a0c826..6db43c4 100644
--- a/src/cuirass/base.scm
+++ b/src/cuirass/base.scm
@@ -36,6 +36,9 @@
   #:use-module ((guix config) #:select (%state-directory))
   #:use-module (git)
   #:use-module (ice-9 binary-ports)
+  #:use-module ((ice-9 suspendable-ports)
+                #:select (current-read-waiter
+                          current-write-waiter))
   #:use-module (ice-9 format)
   #:use-module (ice-9 match)
   #:use-module (ice-9 popen)
@@ -79,7 +82,12 @@
   ;; currently closes in a 'dynamic-wind' handler, which means it would close
   ;; the store at each context switch. Remove this when the real 'with-store'
   ;; has been fixed.
-  (let ((store (open-connection)))
+  (let* ((store (open-connection))
+         (socket (store-connection-socket store)))
+    ;; Mark SOCKET as non-blocking so Fibers can schedule the way it wants.
+    (let ((flags (fcntl socket F_GETFL)))
+      (fcntl socket F_SETFL (logior O_NONBLOCK flags)))
+
     (unwind-protect
      ;; Always set #:keep-going? so we don't stop on the first build failure.
      ;; Set #:print-build-trace explicitly to make sure 'process-build-log'
@@ -422,7 +430,12 @@ Essentially this procedure inverts the inversion-of-control that
        (lambda ()
          (guard (c ((store-error? c)
                     (atomic-box-set! result c)))
-           (parameterize ((current-build-output-port output))
+           (parameterize ((current-build-output-port output)
+
+                          ;; STORE's socket is O_NONBLOCK but since we're
+                          ;; not in a fiber, disable Fiber's handlers.
+                          (current-read-waiter #f)
+                          (current-write-waiter #f))
              (let ((x (build-derivations store lst)))
                (atomic-box-set! result x))))
          (close-port output))
Mathieu Othacehe wrote on 26 Oct 2020 15:22
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 43565@debbugs.gnu.org)
87v9excbj8.fsf@gnu.org
Hey!

Many thanks for your help, you rock!

> But does Cuirass create file descriptors as O_NONBLOCK? This has to be
> done explicitly, Fibers won’t do it for us. As it turns out, the answer
> is no, in at least one important case: the connection to the daemon
> (untested patch below).
>
> While GC is running, Cuirass typically sends ‘build-derivations’ RPCs
> and they block until the GC lock is released. That can lead to the
> situation above: a bunch of threads blocked in ‘read’ from their daemon
> socket, waiting for the RPC reply. OTOH, ‘build-derivations’ RPCs are
> made from a fresh thread created by ‘build-derivations&’.

While I agree not opening file descriptors with O_NONBLOCK is an issue,
build-derivations is called in a separate thread. Blocking this separate
thread should not block the fibers.

For instance, the following program:

(use-modules (fibers)
             (ice-9 threads))

(run-fibers
 (lambda ()
   (spawn-fiber
    (lambda ()
      (call-with-new-thread
       (lambda ()
         (read (car (pipe)))))))
   (spawn-fiber
    (lambda ()
      (while #t
        (format #t "alive~%")
        (sleep 1)))))
 #:hz 10
 #:drain? #t)

keeps displaying "alive" even if the spawned thread is blocking. I guess
that's also what's happening in Cuirass because the log shows that some
fibers are scheduled while the GC is running.

Now the question is why there's no fetching while the GC is running? The
answer is that "latest-repository-commit" called by "fetch-input" will
block the only fiber dedicated to fetching. Having multiple fibers
trying to fetch wouldn't solve anything because fetching requires some
building from the daemon.

Long story short, I think we can apply your patch, which can be useful to
prevent fibers that talk directly to the daemon from blocking, even though it
won't help with this particular hang. That will only be fixed once the GC time
is reduced to something more acceptable.

Thanks,

Mathieu
Ludovic Courtès wrote on 26 Oct 2020 17:20
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 43565@debbugs.gnu.org)
87tuuh9cxe.fsf@gnu.org
Hello!

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> But does Cuirass create file descriptors as O_NONBLOCK? This has to be
>> done explicitly, Fibers won’t do it for us. As it turns out, the answer
>> is no, in at least one important case: the connection to the daemon
>> (untested patch below).
>>
>> While GC is running, Cuirass typically sends ‘build-derivations’ RPCs
>> and they block until the GC lock is released. That can lead to the
>> situation above: a bunch of threads blocked in ‘read’ from their daemon
>> socket, waiting for the RPC reply. OTOH, ‘build-derivations’ RPCs are
>> made from a fresh thread created by ‘build-derivations&’.
>
> While I agree not opening file descriptors with O_NONBLOCK is an issue,
> build-derivations is called in a separate thread. Blocking this separate
> thread should not block the fibers.

Agreed.

> Now the question is why there's no fetching while the GC is running? The
> answer is that "latest-repository-commit" called by "fetch-input" will
> block the only fiber dedicated to fetching. Having multiple fibers
> trying to fetch wouldn't solve anything because fetching requires some
> building from the daemon.

Exactly: when the GC lock is taken, ‘latest-repository-commit’ makes an
‘add-to-store’ RPC, and that RPC blocks. Thus the whole fetch fiber is
blocked.

The patch should address this case. That said, nothing useful happens
anyway when the GC lock is held, so it wouldn’t have any practical
effect.

I believe there are other cases where RPCs can be slow, for example when
there’s contention on the sqlite database. Perhaps that could help a
bit there although again, it’s a situation where nothing useful can
happen.

> Long story short, I think we can apply your patch, which can be useful to
> prevent fibers that talk directly to the daemon from blocking, even though it
> won't help with this particular hang. That will only be fixed once the GC time
> is reduced to something more acceptable.

Yeah please go ahead if you want, or let me know if you’d rather let me
apply it.

Thanks!

Ludo’.
Mathieu Othacehe wrote on 27 Oct 2020 19:03
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 43565-done@debbugs.gnu.org)
87sg9zh7go.fsf@gnu.org
Hey,

> Yeah please go ahead if you want, or let me know if you’d rather let me
> apply it.

I applied your patch, thanks! I'm closing this one, because there's
nothing much that can be done right now.

Thanks,

Mathieu
Closed
Mathieu Othacehe wrote on 2 Nov 2020 11:09
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 43565@debbugs.gnu.org)
87k0v4ax42.fsf@gnu.org
Hey,

> Yeah please go ahead if you want, or let me know if you’d rather let me
> apply it.

I ended up reverting this patch because it causes the following error:

2020-11-02T11:05:08 fatal: uncaught exception 'wrong-type-arg' in 'build' fiber!
2020-11-02T11:05:08 exception arguments: ("struct-vtable" "Wrong type argument in position 1 (expecting struct): ~S" (#f) (#f))
In ice-9/boot-9.scm:
1731:15 12 (with-exception-handler #<procedure 7fb1a93f9930 at ic…> …)
1736:10 11 (with-exception-handler _ _ #:unwind? _ # _)
718:2 10 (call-with-prompt ("break") #<procedure 7fb1ab76f440 a…> …)
718:2 9 (call-with-prompt ("continue") #<procedure 7fb1ab77084…> …)
In ice-9/eval.scm:
619:8 8 (_ #(#(#<directory (guile-user) 7fb1ac680f00> #<var…> …)))
In srfi/srfi-1.scm:
634:9 7 (for-each #<procedure 7fb1a9525900 at cuirass/base.scm…> …)
In ice-9/boot-9.scm:
1731:15 6 (with-exception-handler #<procedure 7fb1a95a94e0 at ic…> …)
1669:16 5 (raise-exception _ #:continuable? _)
1764:13 4 (_ #<&compound-exception components: (#<&assertion-fail…>)
In cuirass/utils.scm:
319:8 3 (_ _ . _)
In ice-9/boot-9.scm:
1731:15 2 (with-exception-handler #<procedure 7fb1ab2e3720 at ic…> …)
In cuirass/utils.scm:
320:22 1 (_)
In unknown file:
0 (make-stack #t)
ERROR: In procedure make-stack:
In procedure struct-vtable: Wrong type argument in position 1 (expecting struct): #f

Thanks,

Mathieu
Mathieu Othacehe wrote on 19 Nov 2020 11:56
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 43565@debbugs.gnu.org)
87ima1pqc7.fsf@gnu.org
Hey,

> In cuirass/utils.scm:
> 320:22 1 (_)
> In unknown file:
> 0 (make-stack #t)
> ERROR: In procedure make-stack:
> In procedure struct-vtable: Wrong type argument in position 1 (expecting struct): #f

I think this error is caused by setting:

;; STORE's socket is O_NONBLOCK but since we're
;; not in a fiber, disable Fiber's handlers.
(current-read-waiter #f)
(current-write-waiter #f)

where it should be:

;; STORE's socket is O_NONBLOCK but since we're
;; not in a fiber, disable Fiber's handlers.
(current-read-waiter
 (lambda (port)
   (port-poll port "r")))
(current-write-waiter
 (lambda (port)
   (port-poll port "w")))

Then this should also be done in "fetch-inputs", which uses
non-blocking ports outside of Fibers.
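
As an illustration of the polling waiters suggested above, the following sketch reads from a non-blocking pipe in a plain thread context (no fiber involved), with `current-read-waiter` and `current-write-waiter` set to poll the port. The names `call-with-polling-waiters` and `demo` are hypothetical, and the pipe merely stands in for the daemon socket; this is not the Cuirass code.

```scheme
(use-modules (ice-9 suspendable-ports)
             (ice-9 binary-ports)
             (ice-9 threads)
             (rnrs bytevectors))

;; Make port operations call the waiters when a read or write would block.
(install-suspendable-ports!)

;; Hypothetical helper: run THUNK with waiters that block in 'port-poll'
;; until the port is ready, which is what we want outside of a fiber.
(define (call-with-polling-waiters thunk)
  (parameterize ((current-read-waiter
                  (lambda (port) (port-poll port "r")))
                 (current-write-waiter
                  (lambda (port) (port-poll port "w"))))
    (thunk)))

;; Hypothetical demo: read a late "reply" from a non-blocking pipe.
(define (demo)
  (let* ((ports (pipe))
         (in    (car ports))
         (out   (cdr ports)))
    ;; Mark the read end as non-blocking, as the patch does for the socket.
    (fcntl in F_SETFL (logior O_NONBLOCK (fcntl in F_GETFL)))
    (let ((writer (call-with-new-thread
                   (lambda ()
                     (usleep 200000)    ;the reply arrives late
                     (put-bytevector out (string->utf8 "hello"))
                     (force-output out)))))
      (let ((reply (call-with-polling-waiters
                    (lambda ()
                      (utf8->string (get-bytevector-n in 5))))))
        (join-thread writer)
        (close-port in)
        (close-port out)
        reply))))

(display (demo))
(newline)
```

When the read hits an empty pipe, the read waiter is invoked and blocks in `port-poll` until the writer thread has sent its bytes, after which the read is retried. With `#f` in place of the lambdas there would be nothing to call at that point, which is consistent with the 'wrong-type-arg' error reported earlier.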

However, I still have the following error:

In ice-9/boot-9.scm:
1731:15 17 (with-exception-handler #<procedure 7fac67194000 at ic…> …)
1736:10 16 (with-exception-handler _ _ #:unwind? _ # _)
In ice-9/eval.scm:
619:8 15 (_ #(#(#(#(#<directory (cuirass base) 7fac6b51c…>)) …) …))
In unknown file:
14 (_ #<procedure 7fac69b10b20 at ice-9/eval.scm:330:13 ()> …)
13 (partition #<procedure 7fac69b10880 at ice-9/eval.scm:…> …)
In guix/store.scm:
1008:0 12 (valid-path? #<store-connection 256.99 7fac6b3fd6e0> "/…")
2020-11-19T11:47:23 Failed to compute metric average-eval-build-start-time (1).
717:11 11 (process-stderr #<store-connection 256.99 7fac6b3fd6e0> _)
In guix/serialization.scm:
76:12 10 (read-int #<input-output: socket 49>)
In ice-9/suspendable-ports.scm:
307:17 9 (get-bytevector-n #<input-output: socket 49> 8)
2020-11-19T11:47:23 Failed to compute metric average-eval-build-complete-time (1).
2020-11-19T11:47:23 Failed to compute metric evaluation-completion-speed (1).
284:18 8 (get-bytevector-n! #<input-output: socket 49> #vu8(0 …) …)
67:33 7 (read-bytes #<input-output: socket 49> #vu8(0 0 0 0 0 …) …)
In fibers/internal.scm:
402:6 6 (suspend-current-fiber _)
In ice-9/boot-9.scm:
1669:16 5 (raise-exception _ #:continuable? _)
1764:13 4 (_ #<&compound-exception components: (#<&error> #<&orig…>)
In cuirass/utils.scm:
319:8 3 (_ _ . _)
In ice-9/boot-9.scm:
1731:15 2 (with-exception-handler #<procedure 7fac683ea300 at ic…> …)
In cuirass/utils.scm:
320:22 1 (_)
In unknown file:
0 (make-stack #t)
ERROR: In procedure make-stack:
Attempt to suspend fiber within continuation barrier

This originates from "valid-path?" in "restart-builds"; I'm not sure how to
fix it yet.

Thanks,

Mathieu
Ludovic Courtès wrote on 20 Nov 2020 09:37
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 43565@debbugs.gnu.org)
87wnygfmot.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> In cuirass/utils.scm:
>> 320:22 1 (_)
>> In unknown file:
>> 0 (make-stack #t)
>> ERROR: In procedure make-stack:
>> In procedure struct-vtable: Wrong type argument in position 1 (expecting struct): #f
>
> I think this error is caused by setting:
>
> ;; STORE's socket is O_NONBLOCK but since we're
> ;; not in a fiber, disable Fiber's handlers.
> (current-read-waiter #f)
> (current-write-waiter #f)
>
>
> where it should be:
>
> ;; STORE's socket is O_NONBLOCK but since we're
> ;; not in a fiber, disable Fiber's handlers.
> (current-read-waiter
> (lambda (port)
> (port-poll port "r")))
> (current-write-waiter
> (lambda (port)
> (port-poll port "w")))

Ooh, good catch.

> Then this should also be done in "fetch-inputs", which uses
> non-blocking ports outside of Fibers.
>
> However, I still have the following error:
>
> In ice-9/boot-9.scm:
> 1731:15 17 (with-exception-handler #<procedure 7fac67194000 at ic…> …)
> 1736:10 16 (with-exception-handler _ _ #:unwind? _ # _)
> In ice-9/eval.scm:
> 619:8 15 (_ #(#(#(#(#<directory (cuirass base) 7fac6b51c…>)) …) …))
> In unknown file:
> 14 (_ #<procedure 7fac69b10b20 at ice-9/eval.scm:330:13 ()> …)
> 13 (partition #<procedure 7fac69b10880 at ice-9/eval.scm:…> …)
> In guix/store.scm:
> 1008:0 12 (valid-path? #<store-connection 256.99 7fac6b3fd6e0> "/…")
> 2020-11-19T11:47:23 Failed to compute metric average-eval-build-start-time (1).
> 717:11 11 (process-stderr #<store-connection 256.99 7fac6b3fd6e0> _)
> In guix/serialization.scm:
> 76:12 10 (read-int #<input-output: socket 49>)
> In ice-9/suspendable-ports.scm:
> 307:17 9 (get-bytevector-n #<input-output: socket 49> 8)
> 2020-11-19T11:47:23 Failed to compute metric average-eval-build-complete-time (1).
> 2020-11-19T11:47:23 Failed to compute metric evaluation-completion-speed (1).
> 284:18 8 (get-bytevector-n! #<input-output: socket 49> #vu8(0 …) …)
> 67:33 7 (read-bytes #<input-output: socket 49> #vu8(0 0 0 0 0 …) …)
> In fibers/internal.scm:
> 402:6 6 (suspend-current-fiber _)
> In ice-9/boot-9.scm:
> 1669:16 5 (raise-exception _ #:continuable? _)
> 1764:13 4 (_ #<&compound-exception components: (#<&error> #<&orig…>)
> In cuirass/utils.scm:
> 319:8 3 (_ _ . _)
> In ice-9/boot-9.scm:
> 1731:15 2 (with-exception-handler #<procedure 7fac683ea300 at ic…> …)
> In cuirass/utils.scm:
> 320:22 1 (_)
> In unknown file:
> 0 (make-stack #t)
> ERROR: In procedure make-stack:
> Attempt to suspend fiber within continuation barrier
>
> This originates from "valid-path?" in "restart-builds"; I'm not sure how to
> fix it yet.

I think that’s because of the ‘partition’ call: ‘partition’ is currently
implemented in C and the stack cannot be captured if it contains C calls
in the middle.

The simplest fix is probably to have a Scheme implementation:
diff --git a/src/cuirass/base.scm b/src/cuirass/base.scm
index 5a0c826..99a17fa 100644
--- a/src/cuirass/base.scm
+++ b/src/cuirass/base.scm
@@ -632,6 +632,21 @@ This procedure is meant to be called at startup."
      db "UPDATE Builds SET status = 4 WHERE status = -2 AND timestamp < "
      (- (time-second (current-time time-utc)) age) ";")))
 
+(define (partition pred lst)
+  ;; Scheme implementation of SRFI-1 'partition' so stack activations can be
+  ;; captured via 'abort-to-prompt'.
+  (let loop ((lst lst)
+             (pass '())
+             (fail '()))
+    (match lst
+      (()
+       (values (reverse pass) (reverse fail)))
+      ((head . tail)
+       (let ((pass? (pred head)))
+         (loop tail
+               (if pass? (cons head pass) pass)
+               (if pass? fail (cons head fail))))))))
+
 (define (restart-builds)
   "Restart builds whose status in the database is \"pending\" (scheduled or
 started)."

It’s a bummer that one has to be aware of all these implementation
details when using Fibers. The vision I think is that asymptotically
these issues would vanish as more things move from C to Scheme.
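
To illustrate why a pure Scheme 'partition' helps, here is a small sketch that uses bare prompts as a stand-in for Fibers. The predicate aborts to a prompt before answering (roughly what a fiber does when it suspends on a non-blocking socket), and a toy scheduler resumes it right away. The names `partition*`, `suspending-even?` and `run` are hypothetical; this is not how Fibers or Cuirass are implemented, only the same capture-and-resume mechanism in miniature.

```scheme
(use-modules (ice-9 match))

;; Hypothetical name; same idea as the Scheme 'partition' in the patch.
(define (partition* pred lst)
  (let loop ((lst lst)
             (pass '())
             (fail '()))
    (match lst
      (()
       (values (reverse pass) (reverse fail)))
      ((head . tail)
       (if (pred head)
           (loop tail (cons head pass) fail)
           (loop tail pass (cons head fail)))))))

(define tag (make-prompt-tag "suspend"))

;; A predicate that "suspends" before answering.
(define (suspending-even? n)
  (abort-to-prompt tag)
  (even? n))

;; Toy scheduler: run THUNK under a prompt; each time it suspends,
;; resume the captured continuation K under a fresh prompt.
(define (run thunk)
  (call-with-prompt tag
    thunk
    (lambda (k)
      (run (lambda () (k #t))))))

(call-with-values
    (lambda ()
      (run (lambda ()
             (partition* suspending-even? '(1 2 3 4)))))
  (lambda (evens odds)
    (format #t "~s ~s~%" evens odds)))
```

Every frame between the prompt and the `abort-to-prompt` call is a Scheme frame here, so the continuation can be captured and reinstated. With a C frame in the middle, resuming is what fails, which matches the "continuation barrier" error in the backtrace.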

Thanks,
Ludo’.