[cuirass] remote-worker process freeze

  • Done
  • quality assurance status badge
Details
3 participants
  • Ludovic Courtès
  • Maxim Cournoyer
  • Mathieu Othacehe
Owner
unassigned
Submitted by
Mathieu Othacehe
Severity
important
M
M
Mathieu Othacehe wrote on 29 Nov 2021 16:19
(address . bug-guix@gnu.org)
87ilwbj8ff.fsf@gnu.org
Hello,

On the newly installed honeycomb machines, some Cuirass remote-worker
process freeze completely and stop communicating with the
remote-server.

This has already been observed, but is for some reason more repeatable
on those machines.

Here are the information I could collect on such a frozen process using
GDB:

Toggle snippet (38 lines)
(gdb) attach 5660 ;frozen cuirass-remote-worker PID
(gdb) info thr
Id Target Id Frame
* 1 Thread 0xffffafd32e20 (LWP 5660) "yHg3r3fS" 0x0000ffffafb3fa80 in do_futex_wait.constprop () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libpthread.so.0
2 Thread 0xffffa6c1c1d0 (LWP 5666) "ZMQbg/Reaper" 0x0000ffffaf7ec294 in epoll_pwait () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libc.so.6
3 Thread 0xffffaf0071d0 (LWP 5667) "ZMQbg/IO/0" 0x0000ffffaf7ec294 in epoll_pwait () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libc.so.6
4 Thread 0xffffa641b1d0 (LWP 5674) "yHg3r3fS" 0x0000ffffaf7b9d04 in clock_nanosleep@@GLIBC_2.17 () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libc.so.6
(gdb) bt
#0 0x0000ffffafb3fa80 in do_futex_wait.constprop () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libpthread.so.0
#1 0x0000ffffafb3fb78 in __new_sem_wait_slow.constprop.0 () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libpthread.so.0
#2 0x0000ffffafb80318 in GC_stop_world () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
#3 0x0000ffffafb6c020 in GC_stopped_mark () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
#4 0x0000ffffafb6c8dc in GC_try_to_collect_inner () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
#5 0x0000ffffafb6d598 in GC_collect_or_expand () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
#6 0x0000ffffafb73b4c in GC_alloc_large () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
#7 0x0000ffffafb74038 in GC_generic_malloc () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
#8 0x0000ffffafb74298 in GC_malloc_kind_global () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
#9 0x0000ffffafc11fa8 in scm_make_bytevector () from /gnu/store/7g3nbnf2kf31jk696k0nyz9ck55b11a0-guile-3.0.7/lib/libguile-3.0.so.1
#10 0x0000ffffacacc418 in ?? ()
#11 0x0000ffffacc2ef2c in ?? ()
(gdb) thr 4
[Switching to thread 4 (Thread 0xffffa641b1d0 (LWP 5674))]
#0 0x0000ffffaf7b9d04 in clock_nanosleep@@GLIBC_2.17 () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libc.so.6
(gdb) bt
#0 0x0000ffffaf7b9d04 in clock_nanosleep@@GLIBC_2.17 () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libc.so.6
#1 0x0000ffffaf7bf55c in nanosleep () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libc.so.6
#2 0x0000ffffafb7e844 in GC_lock () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
#3 0x0000ffffafb7ecdc in GC_do_blocking_inner () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
#4 0x0000ffffafb73998 in GC_with_callee_saves_pushed () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
#5 0x0000ffffafb79654 in GC_do_blocking () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
#6 0x0000ffffafc96d94 in scm_without_guile () from /gnu/store/7g3nbnf2kf31jk696k0nyz9ck55b11a0-guile-3.0.7/lib/libguile-3.0.so.1
#7 0x0000ffffafc97050 in scm_std_select () from /gnu/store/7g3nbnf2kf31jk696k0nyz9ck55b11a0-guile-3.0.7/lib/libguile-3.0.so.1
#8 0x0000ffffafc97b5c in scm_std_sleep () from /gnu/store/7g3nbnf2kf31jk696k0nyz9ck55b11a0-guile-3.0.7/lib/libguile-3.0.so.1
#9 0x0000ffffafc75918 in scm_sleep () from /gnu/store/7g3nbnf2kf31jk696k0nyz9ck55b11a0-guile-3.0.7/lib/libguile-3.0.so.1
#10 0x0000ffffa6c50d94 in ?? ()
#11 0x0000ffffacc2ee0c in ?? ()

So the threads 2 and 3 are managed internally by ZMQ. The threads 1 and
4 are respectively the thread pinging the remote-server and the thread
actually building stuff.

Looks like they are both stuck doing GC stuff.

Thanks,

Mathieu
M
M
Maxim Cournoyer wrote on 1 Feb 2022 16:15
control message for bug #52182
(address . control@debbugs.gnu.org)
87y22uehyi.fsf@gmail.com
severity 52182 important
quit
M
M
Maxim Cournoyer wrote on 7 Feb 2022 20:39
Re: bug#52182: [cuirass] remote-worker process freeze
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 52182@debbugs.gnu.org)
874k5a1n6v.fsf@gmail.com
Hello Mathieu,

Mathieu Othacehe <othacehe@gnu.org> writes:

Toggle quote (53 lines)
> Hello,
>
> On the newly installed honeycomb machines, some Cuirass remote-worker
> process freeze completely and stop communicating with the
> remote-server.
>
> This has already been observed, but is for some reason more repeatable
> on those machines.
>
> Here are the information I could collect on such a frozen process using
> GDB:
>
> (gdb) attach 5660 ;frozen cuirass-remote-worker PID
> (gdb) info thr
> Id Target Id Frame
> * 1 Thread 0xffffafd32e20 (LWP 5660) "yHg3r3fS" 0x0000ffffafb3fa80 in do_futex_wait.constprop () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libpthread.so.0
> 2 Thread 0xffffa6c1c1d0 (LWP 5666) "ZMQbg/Reaper" 0x0000ffffaf7ec294 in epoll_pwait () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libc.so.6
> 3 Thread 0xffffaf0071d0 (LWP 5667) "ZMQbg/IO/0" 0x0000ffffaf7ec294 in epoll_pwait () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libc.so.6
> 4 Thread 0xffffa641b1d0 (LWP 5674) "yHg3r3fS" 0x0000ffffaf7b9d04 in clock_nanosleep@@GLIBC_2.17 () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libc.so.6
> (gdb) bt
> #0 0x0000ffffafb3fa80 in do_futex_wait.constprop () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libpthread.so.0
> #1 0x0000ffffafb3fb78 in __new_sem_wait_slow.constprop.0 () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libpthread.so.0
> #2 0x0000ffffafb80318 in GC_stop_world () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
> #3 0x0000ffffafb6c020 in GC_stopped_mark () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
> #4 0x0000ffffafb6c8dc in GC_try_to_collect_inner () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
> #5 0x0000ffffafb6d598 in GC_collect_or_expand () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
> #6 0x0000ffffafb73b4c in GC_alloc_large () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
> #7 0x0000ffffafb74038 in GC_generic_malloc () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
> #8 0x0000ffffafb74298 in GC_malloc_kind_global () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
> #9 0x0000ffffafc11fa8 in scm_make_bytevector () from /gnu/store/7g3nbnf2kf31jk696k0nyz9ck55b11a0-guile-3.0.7/lib/libguile-3.0.so.1
> #10 0x0000ffffacacc418 in ?? ()
> #11 0x0000ffffacc2ef2c in ?? ()
> (gdb) thr 4
> [Switching to thread 4 (Thread 0xffffa641b1d0 (LWP 5674))]
> #0 0x0000ffffaf7b9d04 in clock_nanosleep@@GLIBC_2.17 () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libc.so.6
> (gdb) bt
> #0 0x0000ffffaf7b9d04 in clock_nanosleep@@GLIBC_2.17 () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libc.so.6
> #1 0x0000ffffaf7bf55c in nanosleep () from /gnu/store/cb88z63hyg1icd2kkahiink2p291mhr2-glibc-2.31/lib/libc.so.6
> #2 0x0000ffffafb7e844 in GC_lock () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
> #3 0x0000ffffafb7ecdc in GC_do_blocking_inner () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
> #4 0x0000ffffafb73998 in GC_with_callee_saves_pushed () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
> #5 0x0000ffffafb79654 in GC_do_blocking () from /gnu/store/jsda4njqwjp4kb60fwa7n4mlfi1aanpq-libgc-7.6.12/lib/libgc.so.1
> #6 0x0000ffffafc96d94 in scm_without_guile () from /gnu/store/7g3nbnf2kf31jk696k0nyz9ck55b11a0-guile-3.0.7/lib/libguile-3.0.so.1
> #7 0x0000ffffafc97050 in scm_std_select () from /gnu/store/7g3nbnf2kf31jk696k0nyz9ck55b11a0-guile-3.0.7/lib/libguile-3.0.so.1
> #8 0x0000ffffafc97b5c in scm_std_sleep () from /gnu/store/7g3nbnf2kf31jk696k0nyz9ck55b11a0-guile-3.0.7/lib/libguile-3.0.so.1
> #9 0x0000ffffafc75918 in scm_sleep () from /gnu/store/7g3nbnf2kf31jk696k0nyz9ck55b11a0-guile-3.0.7/lib/libguile-3.0.so.1
> #10 0x0000ffffa6c50d94 in ?? ()
> #11 0x0000ffffacc2ee0c in ?? ()
>
> So the threads 2 and 3 are managed internally by ZMQ. The threads 1 and
> 4 are respectively the thread pinging the remote-server and the thread
> actually building stuff.

After asking about this issue on #guix, cbaines pointed to this relevant
thread:

Ludovic mentioned it is known that forking processes in threads would be
undefined behavior in POSIX, but that doesn't match what the worker code
is currently doing, which is to fork *then* spawn threads.

Christopher mentioned perhaps something to try is to call execlp in the
primitive-fork new process; this seems to work reliable for them in the
guix-data-service.

Thanks,

Maxim
L
Closed
M
M
Maxim Cournoyer wrote on 11 Sep 2023 16:11
(name . Ludovic Courtès)(address . ludo@gnu.org)
87tts0yf3o.fsf@gmail.com
Hi,

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (8 lines)
> Hello!
>
> I’m optimistically closing this bug because the implementation of
> ‘cuirass remote-worker’ has now changed to a single Fibers process:
>
> https://lists.gnu.org/archive/html/guix-devel/2023-07/msg00096.html
> https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=de8586080e04677cbe34c58f34715757ac61eea3

Thanks, this reads fantastic!

--
Thanks,
Maxim
Closed
?