[Shepherd] Indefinite heap growth (memory leak)

  • Done
  • quality assurance status badge
Details
4 participants
  • Joshua Branson
  • Liliana Marie Prikler
  • Ludovic Courtès
  • Maxime Devos
Owner
unassigned
Submitted by
Ludovic Courtès
Severity
important
L
L
Ludovic Courtès wrote on 19 Oct 2022 14:17
Shepherd crash on berlin
(address . bug-guix@gnu.org)
87pmeohute.fsf@gnu.org
Earlier today, berlin was unreachable. From /var/log/messages, this
appears to be due to a crash of shepherd (PID 1); here are its last
words:

Toggle snippet (16 lines)
Oct 19 06:32:36 localhost shepherd[1]: Respawning mumi-worker.
Oct 19 06:32:37 localhost shepherd[1]: Service mumi-worker has been started.
Oct 19 06:32:52 localhost ntpd[1837]: Soliciting pool server 193.141.27.6
Oct 19 06:33:56 localhost ntpd[1837]: Soliciting pool server 193.158.22.13
Oct 19 06:35:02 localhost ntpd[1837]: Soliciting pool server 62.75.236.38
Oct 19 06:35:43 localhost vmunix: [1586150.310354] shepherd[87089]: segfault at 0 ip 00007ff3018dc264 sp 00007ffeae066810 error 6 in crash-handler.so[7ff3018dc000+1000]
Oct 19 06:35:43 localhost vmunix: [1586150.322247] Code: ff ff ff e8 6e fe ff ff 48 8d 3d b7 0d 00 00 e8 12 fe ff ff bf 27 00 00 00 31 c0 e8 26 fe ff ff 89 ee 48 89 c7 e8 2c fe ff ff <c7> 04 25 00 00 00 00 00 00 00 00 0f 0b 48 89 c3 be 01 00 00 00 89
Oct 19 06:36:09 localhost ntpd[1837]: Soliciting pool server 51.75.67.47
Oct 19 06:36:40 localhost postgres[1389]: [73-1] 2022-10-19 04:36:40.036 GMT [1389] LOG: using stale statistics instead of current ones because stats collector is not responding
Oct 19 06:36:50 localhost postgres[87362]: [6-1] 2022-10-19 04:36:50.101 GMT [87362] LOG: using stale statistics instead of current ones because stats collector is not responding
Oct 19 06:37:00 localhost postgres[1389]: [74-1] 2022-10-19 04:37:00.062 GMT [1389] LOG: using stale statistics instead of current ones because stats collector is not responding
Oct 19 06:37:10 localhost postgres[87368]: [6-1] 2022-10-19 04:37:10.128 GMT [87368] LOG: using stale statistics instead of current ones because stats collector is not responding
Oct 19 06:37:16 localhost ntpd[1837]: Soliciting pool server 3.64.117.201
Oct 19 12:48:58 localhost syslogd (GNU inetutils 2.0): restart

The interesting part is that we have a core dump—see ‘crash-handler.so’
above, which is shepherd’s mechanism to ensure there’s a core dump
before the machine stops. Here’s what we get:

Toggle snippet (104 lines)
ludo@berlin ~$ gdb -ix ~/.gdbinit /gnu/store/cnfsv9ywaacyafkqdqsv2ry8f01yr7a9-guile-3.0.7/bin/guile /core.shepherd-20221019
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /gnu/store/cnfsv9ywaacyafkqdqsv2ry8f01yr7a9-guile-3.0.7/bin/guile...
Reading symbols from /gnu/store/4ws3vh3zrs1yi9lfaibha64chf3vn2rm-guile-3.0.7-debug/lib/debug//gnu/store/cnfsv9ywaacyafkqdqsv2ry8f01yr7a9-guile-3.0.7/bin/guile.debug...

warning: core file may not match specified executable file.
[New LWP 87089]

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Core was generated by `/gnu/store/cnfsv9ywaacyafkqdqsv2ry8f01yr7a9-guile-3.0.7/bin/guile --no-auto-com'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007ff3018dc264 in ?? ()
from /gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/lib/shepherd/crash-handler.so
(gdb) bt
#0 0x00007ff3018dc264 in ?? ()
from /gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/lib/shepherd/crash-handler.so
#1 <signal handler called>
#2 0x00007ff3018dc264 in ?? ()
from /gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/lib/shepherd/crash-handler.so
#3 <signal handler called>
#4 0x00007ff30b2d2030 in raise ()
from /gnu/store/5h2w4qi9hk1qzzgi1w83220ydslinr4s-glibc-2.33/lib/libc.so.6
#5 0x00007ff30b2bc526 in abort ()
from /gnu/store/5h2w4qi9hk1qzzgi1w83220ydslinr4s-glibc-2.33/lib/libc.so.6
#6 0x00007ff30b7bb263 in ?? () from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#7 0x00007ff30b7c2232 in ?? () from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#8 0x00007ff30b7c25a8 in ?? () from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#9 0x00007ff30b7c290f in ?? () from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#10 0x00007ff30b7c78da in GC_generic_malloc_many ()
from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#11 0x00007ff30b87fb45 in scm_inline_gc_alloc (kind=SCM_INLINE_GC_KIND_NORMAL, idx=6,
freelist=0x7ff30b246e48) at gc-inline.h:79
#12 allocate_words_with_freelist (thread=0x7ff30b246d80, freelist_idx=6) at intrinsics.c:470
#13 0x00007ff300039f95 in ?? ()
#14 0x00007ff30b246d80 in ?? ()
#15 0x00007ff302a2f7e8 in ?? ()
#16 0x0000000000000040 in ?? ()
#17 0x00007ff30b88bb1c in scm_jit_enter_mcode (thread=0x7ff30b246d80,
mcode=0x7ff30231e75c "B", <incomplete sequence \340>) at jit.c:6038
#18 0x00007ff30b8e80c5 in scm_call_n (proc=<optimized out>, argv=argv@entry=0x7ffeae067878,
nargs=nargs@entry=1) at vm.c:1602
#19 0x00007ff30b862ea7 in scm_primitive_eval (exp=<optimized out>) at eval.c:671
#20 0x00007ff30b88d729 in scm_primitive_load (filename=<optimized out>) at load.c:131
#21 0x00007ff30b8e5915 in vm_regular_engine (thread=0x7ff30b246d80) at vm-engine.c:972
#22 0x00007ff30b8e8029 in scm_call_n (proc=<optimized out>, argv=argv@entry=0x7ffeae067a58,
nargs=nargs@entry=1) at vm.c:1608
#23 0x00007ff30b862ea7 in scm_primitive_eval (exp=<optimized out>,
exp@entry=((@ (ice-9 control) %) (begin ((@@ (ice-9 command-line) load/lang) "/gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/bin/shepherd") (quit)))) at eval.c:671
#24 0x00007ff30b862f06 in scm_eval (
exp=((@ (ice-9 control) %) (begin ((@@ (ice-9 command-line) load/lang) "/gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/bin/shepherd") (quit))),
module_or_state=module_or_state@entry="#<struct module>" = {...}) at eval.c:705
#25 0x00007ff30b8bde76 in scm_shell (argc=5, argv=0x7ffeae0680e8) at script.c:357
#26 0x00007ff30b87b36d in invoke_main_func (body_data=0x7ffeae067f80) at init.c:313
#27 0x00007ff30b85cbea in c_body (d=0x7ffeae067ec0) at continuations.c:430
#28 0x00007ff30b8e5915 in vm_regular_engine (thread=0x7ff30b246d80) at vm-engine.c:972
#29 0x00007ff30b8e8029 in scm_call_n (proc=<optimized out>, argv=argv@entry=0x7ffeae067c80,
nargs=nargs@entry=2) at vm.c:1608
#30 0x00007ff30b861dfa in scm_call_2 (proc=<optimized out>, arg1=<optimized out>, arg2=<optimized out>)
at eval.c:503
#31 0x00007ff30b863529 in scm_c_with_exception_handler (type=type@entry=#t,
handler=handler@entry=0x7ff30b8dd750 <catch_post_unwind_handler>,
handler_data=handler_data@entry=0x7ffeae067df0, thunk=thunk@entry=0x7ff30b8dd890 <catch_body>,
thunk_data=thunk_data@entry=0x7ffeae067df0) at exceptions.c:170
#32 0x00007ff30b8dda8d in scm_c_catch (tag=tag@entry=#t, body=body@entry=0x7ff30b85cbe0 <c_body>,
body_data=body_data@entry=0x7ffeae067ec0, handler=handler@entry=0x7ff30b85ce80 <c_handler>,
handler_data=handler_data@entry=0x7ffeae067ec0,
pre_unwind_handler=pre_unwind_handler@entry=0x7ff30b85ccd0 <pre_unwind_handler>,
pre_unwind_handler_data=0x7ff3038c4b00) at throw.c:168
#33 0x00007ff30b85d238 in scm_i_with_continuation_barrier (body=0x7ff30b85cbe0 <c_body>,
body_data=0x7ffeae067ec0, handler=0x7ff30b85ce80 <c_handler>, handler_data=0x7ffeae067ec0,
pre_unwind_handler=0x7ff30b85ccd0 <pre_unwind_handler>, pre_unwind_handler_data=0x7ff3038c4b00)
at continuations.c:368
#34 0x00007ff30b85d295 in scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>)
at continuations.c:464
#35 0x00007ff30b8dc549 in with_guile (base=0x7ffeae067f28, data=0x7ffeae067f50) at threads.c:645
#36 0x00007ff30b7b90ba in GC_call_with_stack_base ()
from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#37 0x00007ff30b8dc848 in scm_i_with_guile (dynamic_state=<optimized out>, data=data@entry=0x7ffeae067f30,
func=func@entry=0x7ff30b87b350 <invoke_main_func>) at threads.c:688
#38 scm_with_guile (func=func@entry=0x7ff30b87b350 <invoke_main_func>, data=data@entry=0x7ffeae067f80)
at threads.c:694
#39 0x00007ff30b87b4e2 in scm_boot_guile (argc=argc@entry=5, argv=argv@entry=0x7ffeae0680e8,
main_func=main_func@entry=0x401230 <inner_main>, closure=closure@entry=0x0) at init.c:296
#40 0x00000000004010f7 in main (argc=5, argv=0x7ffeae0680e8) at guile.c:94
(gdb) info proc status
unable to handle request
(gdb) info proc
exe = '/gnu/store/cnfsv9ywaacyafkqdqsv2ry8f01yr7a9-guile-3.0.7/bin/guile --no-auto-com'

Ludo’.
L
L
Ludovic Courtès wrote on 19 Oct 2022 18:16
control message for bug #58631
(address . control@debbugs.gnu.org)
877d0viyaz.fsf@gnu.org
severity 58631 important
quit
L
L
Ludovic Courtès wrote on 19 Oct 2022 18:24
Re: bug#58631: Shepherd crash on berlin
(address . 58631@debbugs.gnu.org)
87sfjjhjcd.fsf@gnu.org
Same backtrace with debugging symbols for libgc and libc:

Toggle snippet (38 lines)
(gdb) bt
#0 0x00007ff3018dc264 in ?? ()
from /gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/lib/shepherd/crash-handler.so
#1 <signal handler called>
#2 0x00007ff3018dc264 in ?? ()
from /gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/lib/shepherd/crash-handler.so
#3 <signal handler called>
#4 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#5 0x00007ff30b2bc526 in __GI_abort () at abort.c:79
#6 0x00007ff30b7bb263 in GC_add_to_heap (bytes=<optimized out>, p=0x7fefd1f70000) at extra/../alloc.c:1232
#7 GC_expand_hp_inner (n=n@entry=2048) at extra/../alloc.c:1386
#8 0x00007ff30b7c2232 in GC_collect_or_expand (needed_blocks=needed_blocks@entry=1,
ignore_off_page=ignore_off_page@entry=0, retry=retry@entry=0) at extra/../alloc.c:1480
#9 0x00007ff30b7c25a8 in GC_allocobj (gran=gran@entry=7, kind=<optimized out>) at extra/../alloc.c:1543
#10 0x00007ff30b7c290f in GC_generic_malloc_inner (lb=lb@entry=112, k=k@entry=1) at extra/../malloc.c:191
#11 0x00007ff30b7c78da in GC_generic_malloc_many (lb=lb@entry=112, k=k@entry=1,
result=result@entry=0x7ff30b246e48) at extra/../mallocx.c:473
#12 0x00007ff30b87fb45 in scm_inline_gc_alloc (kind=SCM_INLINE_GC_KIND_NORMAL, idx=6,
freelist=0x7ff30b246e48) at gc-inline.h:79
#13 allocate_words_with_freelist (thread=0x7ff30b246d80, freelist_idx=6) at intrinsics.c:470
#14 0x00007ff300039f95 in ?? ()
#15 0x00007ff30b246d80 in ?? ()
#16 0x00007ff302a2f7e8 in ?? ()
#17 0x0000000000000040 in ?? ()
#18 0x00007ff30b88bb1c in scm_jit_enter_mcode (thread=thread@entry=0x7ff30b246d80,
mcode=0x7ff30231e75c "B", <incomplete sequence \340>,
mcode@entry=0x7ff302a2f7e8 "I\211\314I)\304I\203\374 \017\205\345\b") at jit.c:6038
#19 0x00007ff30b8e80c5 in scm_call_n (proc=<optimized out>, argv=argv@entry=0x7ffeae067878,
nargs=nargs@entry=1) at vm.c:1602
#20 0x00007ff30b862ea7 in scm_primitive_eval (exp=<optimized out>) at eval.c:671
#21 0x00007ff30b88d729 in scm_primitive_load (filename=<optimized out>) at load.c:131
#22 0x00007ff30b8e5915 in vm_regular_engine (thread=0x7ff30b246d80) at vm-engine.c:972
#23 0x00007ff30b8e8029 in scm_call_n (proc=<optimized out>, argv=argv@entry=0x7ffeae067a58,
nargs=nargs@entry=1) at vm.c:1608
#24 0x00007ff30b862ea7 in scm_primitive_eval (exp=<optimized out>,
exp@entry=((@ (ice-9 control) %) (begin ((@@ (ice-9 command-line) load/lang) "/gnu/store/cdc1gzbp3q15kdiwn2i5j3437jwx61ac-shepherd-0.9.2/bin/shepherd") (quit)))) at eval.c:671

This means we’re hitting this ‘abort’ call:

Toggle snippet (10 lines)
GC_INNER void GC_add_to_heap(struct hblk *p, size_t bytes)
{
hdr * phdr;
word endp;

if (GC_n_heap_sects >= MAX_HEAP_SECTS) {
ABORT("Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS");
}

Indeed, the core dump weighs 13 GiB, so it looks like a memory leak (too
bad ‘info proc stat’ in GDB doesn’t work).

I have not observed it on any other machine. The difference between
berlin and machines I have access to is that berlin is being hammered on
port 22, which means it gets to spawn sshd processes very often. This
could be where the leak is.

Ludo’.
L
L
Ludovic Courtès wrote on 20 Oct 2022 11:29
(address . 58631@debbugs.gnu.org)
87a65qhmh3.fsf@gnu.org
Heap usage of shepherd on berlin gathered roughly at T0 (a few hours
after booting), T0+8h, an T0+10h:

Toggle snippet (53 lines)
ludo@berlin ~$ sudo herd eval root '(gc-stats)'
Evaluating user expression (gc-stats).
((gc-time-taken . 36730013844) (heap-size . 31494144) (heap-free-size . 8339456) (heap-total-allocated . 1198237472) (heap-allocated-since-gc . 7581824) (protected-objects . 0) (gc-times . 157))
ludo@berlin ~$ sudo cat /proc/1/maps |grep -A10 heap
016fe000-01810000 rw-p 00000000 00:00 0 [heap]
7f78a387d000-7f78a4000000 rw-p 00000000 00:00 0
7f78a4000000-7f78a4021000 rw-p 00000000 00:00 0
7f78a4021000-7f78a8000000 ---p 00000000 00:00 0
7f78a81d0000-7f78a8600000 rw-p 00000000 00:00 0
7f78a8600000-7f78a8640000 rwxp 00000000 00:00 0
7f78a8640000-7f78a8700000 rw-p 00000000 00:00 0
7f78a8700000-7f78a8740000 rwxp 00000000 00:00 0
7f78a8740000-7f78a8c6c000 rw-p 00000000 00:00 0
7f78a8c6c000-7f78a8cac000 rwxp 00000000 00:00 0
7f78a8cac000-7f78a8ccc000 r--p 00000000 00:17 167921 /gnu/store/9nb1nplyx4gx5pdrmmidngir9jmrzxi3-guile-netlink-1.1.1/lib/guile/3.0/site-ccache/ip/route.go
ludo@berlin ~$ sudo herd eval root '(gc-stats)'
Password:
Evaluating user expression (gc-stats).
((gc-time-taken . 2698835942232) (heap-size . 322998272) (heap-free-size . 56537088) (heap-total-allocated . 49605825968) (heap-allocated-since-gc . 67360208) (protected-objects . 0) (gc-times . 1356))
ludo@berlin ~$ sudo cat /proc/1/maps |grep -A12 heap
016fe000-01810000 rw-p 00000000 00:00 0 [heap]
7f788ce8d000-7f78a1bdd000 rw-p 00000000 00:00 0
7f78a1bdd000-7f78a1c1d000 rwxp 00000000 00:00 0
7f78a1c1d000-7f78a4000000 rw-p 00000000 00:00 0
7f78a4000000-7f78a4021000 rw-p 00000000 00:00 0
7f78a4021000-7f78a8000000 ---p 00000000 00:00 0
7f78a8000000-7f78a8600000 rw-p 00000000 00:00 0
7f78a8600000-7f78a8640000 rwxp 00000000 00:00 0
7f78a8640000-7f78a8700000 rw-p 00000000 00:00 0
7f78a8700000-7f78a8740000 rwxp 00000000 00:00 0
7f78a8740000-7f78a8c6c000 rw-p 00000000 00:00 0
7f78a8c6c000-7f78a8cac000 rwxp 00000000 00:00 0
7f78a8cac000-7f78a8ccc000 r--p 00000000 00:17 167921 /gnu/store/9nb1nplyx4gx5pdrmmidngir9jmrzxi3-guile-netlink-1.1.1/lib/guile/3.0/site-ccache/ip/route.go
ludo@berlin ~$ sudo herd eval root '(gc-stats)'
Password:
Evaluating user expression (gc-stats).
((gc-time-taken . 2909168084878) (heap-size . 348164096) (heap-free-size . 51949568) (heap-total-allocated . 53131500656) (heap-allocated-since-gc . 94062880) (protected-objects . 0) (gc-times . 1383))
ludo@berlin ~$ sudo cat /proc/1/maps |grep -A12 heap
016fe000-01810000 rw-p 00000000 00:00 0 [heap]
7f788b4bd000-7f78a1bdd000 rw-p 00000000 00:00 0
7f78a1bdd000-7f78a1c1d000 rwxp 00000000 00:00 0
7f78a1c1d000-7f78a4000000 rw-p 00000000 00:00 0
7f78a4000000-7f78a4021000 rw-p 00000000 00:00 0
7f78a4021000-7f78a8000000 ---p 00000000 00:00 0
7f78a8000000-7f78a8600000 rw-p 00000000 00:00 0
7f78a8600000-7f78a8640000 rwxp 00000000 00:00 0
7f78a8640000-7f78a8700000 rw-p 00000000 00:00 0
7f78a8700000-7f78a8740000 rwxp 00000000 00:00 0
7f78a8740000-7f78a8c6c000 rw-p 00000000 00:00 0
7f78a8c6c000-7f78a8cac000 rwxp 00000000 00:00 0
7f78a8cac000-7f78a8ccc000 r--p 00000000 00:17 167921 /gnu/store/9nb1nplyx4gx5pdrmmidngir9jmrzxi3-guile-netlink-1.1.1/lib/guile/3.0/site-ccache/ip/route.go

Heap keeps increasing, and quite quickly. That includes code arenas
allocated for JIT (the “rwx” mappings), but the biggest part seems to be
the rest of the heap (in particular the second mapping on the second and
third samples).

During that time shepherd is mostly receiving and logging messages from
‘guix publish’, and occasionally accepting a connection on port 22 and
spawning ‘sshd’:

Toggle snippet (25 lines)
ludo@berlin ~$ sudo strace -p1 -Tt
strace: Process 1 attached
10:33:59 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 <0.000009>
10:33:59 epoll_wait(13, [{events=EPOLLIN, data={u32=50, u64=50}}], 8, -1) = 1 <0.925330>
10:34:00 read(50, "GET /nar/lzip/z0l3fp12ck539qkzc2"..., 4096) = 76 <0.000012>
10:34:00 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}, 0) = 0 <0.000012>
10:34:00 write(51, "2022-10-20 10:34:00 GET /nar/lzi"..., 96) = 96 <0.000036>
10:34:00 read(50, 0x7f78a88d8020, 4096) = -1 EAGAIN (Resource temporarily unavailable) <0.000008>
10:34:00 epoll_ctl(13, EPOLL_CTL_MOD, 50, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=50, u64=50}}) = 0 <0.000009>
10:34:00 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 <0.000007>
10:34:00 epoll_wait(13, [{events=EPOLLIN, data={u32=50, u64=50}}], 8, -1) = 1 <0.001968>
10:34:00 read(50, "GET /x2ii2b9zwxnaf020zinccckl2bh"..., 4096) = 46 <0.000010>
10:34:00 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}, 0) = 0 <0.000012>
10:34:00 write(51, "2022-10-20 10:34:00 GET /x2ii2b9"..., 66) = 66 <0.000019>
10:34:00 read(50, 0x7f78a88d8020, 4096) = -1 EAGAIN (Resource temporarily unavailable) <0.000007>
10:34:00 epoll_ctl(13, EPOLL_CTL_MOD, 50, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=50, u64=50}}) = 0 <0.000008>
10:34:00 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 <0.000007>
10:34:00 epoll_wait(13, [{events=EPOLLIN, data={u32=50, u64=50}}], 8, -1) = 1 <0.097714>
10:34:00 read(50, "GET /lclbcq0jds63zal1p55g6v0mwz9"..., 4096) = 46 <0.000009>
10:34:00 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}, 0) = 0 <0.000011>
10:34:00 write(51, "2022-10-20 10:34:00 GET /lclbcq0"..., 66) = 66 <0.000018>
10:34:00 read(50, 0x7f78a88d8020, 4096) = -1 EAGAIN (Resource temporarily unavailable) <0.000009>
10:34:00 epoll_ctl(13, EPOLL_CTL_MOD, 50, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=50, u64=50}}) = 0 <0.000011>

Logging as performed by ‘%service-file-logger’ is quite GC-intensive
(but shouldn’t be leaky!); this bit:

(let ((prefix (strftime default-logfile-date-format
(localtime (current-time)))))
(format output "~a~a~%" prefix line)
(loop))

ends up allocating a couple of vectors (for ‘localtime’), a bunch of
strings, a string port (via (ice-9 format) it seems), etc. That puts GC
under stress.

Still, what’s leaking? Why are JIT code arenas being allocated?

Ludo’.
L
L
Ludovic Courtès wrote on 20 Oct 2022 11:30
control message for bug #58631
(address . control@debbugs.gnu.org)
878rlahmfq.fsf@gnu.org
retitle 58631 [Shepherd] Indefinite heap growth (memory leak)
quit
J
J
Joshua Branson wrote on 20 Oct 2022 19:49
Re: bug#58631: [Shepherd] Indefinite heap growth (memory leak)
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 58631@debbugs.gnu.org)
878rlajsgd.fsf_-_@dismail.de
Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (9 lines)
> Indeed, the core dump weighs 13 GiB, so it looks like a memory leak (too
> bad ‘info proc stat’ in GDB doesn’t work).
>
> I have not observed it on any other machine. The difference between
> berlin and machines I have access to is that berlin is being hammered on
> port 22, which means it gets to spawn sshd processes very often. This
> could be where the leak is.
>

So, this is a hack, but perhaps berlin could start to use endlessh on
port 22, and then the real ssh port could be on another standard port.


It would change most guix developers work loads. But it's a thought. :)

Toggle quote (1 lines)
> Ludo’.
L
L
Ludovic Courtès wrote on 20 Oct 2022 23:46
(name . Joshua Branson)(address . jbranso@dismail.de)(address . 58631@debbugs.gnu.org)
87ilkeb238.fsf@gnu.org
Joshua Branson <jbranso@dismail.de> skribis:

Toggle quote (3 lines)
> So, this is a hack, but perhaps berlin could start to use endlessh on
> port 22, and then the real ssh port could be on another standard port.

We should do that, I agree; we should still fix the actual bug though!

Ludo’.
L
L
Liliana Marie Prikler wrote on 21 Oct 2022 15:07
Re: bug#58631: Shepherd crash on berlin
3ba120cf2d9ec36d15bc89bc797eb004dbe91c18.camel@ist.tugraz.at
Am Donnerstag, dem 20.10.2022 um 11:29 +0200 schrieb Ludovic Courtès:
Toggle quote (8 lines)
> Logging as performed by ‘%service-file-logger’ is quite GC-intensive
> (but shouldn’t be leaky!); this bit:
>
>      (let ((prefix (strftime default-logfile-date-format
>                              (localtime (current-time)))))
>        (format output "~a~a~%" prefix line)
>        (loop))

(let ((prefix (strftime default-logfile-date-format
(localtime (current-time)))))
(format output "~a~a~%" prefix line)
(loop))

Would it make a difference if you instead wrote

(let ((prefix (strftime default-logfile-date-format
(localtime (current-time)))))
(format output "~a~a~%" prefix line))
(loop)

? Semantically, they're equivalent (perhaps with a wrapping begin
needed), but the compiler might fail to see this.

Cheers
L
L
Ludovic Courtès wrote on 22 Oct 2022 22:21
Re: bug#58631: [Shepherd] Indefinite heap growth (memory leak)
(name . Liliana Marie Prikler)(address . liliana.prikler@ist.tugraz.at)(address . 58631@debbugs.gnu.org)
87sfjf4njz.fsf_-_@gnu.org
Liliana Marie Prikler <liliana.prikler@ist.tugraz.at> skribis:

Toggle quote (24 lines)
> Am Donnerstag, dem 20.10.2022 um 11:29 +0200 schrieb Ludovic Courtès:
>> Logging as performed by ‘%service-file-logger’ is quite GC-intensive
>> (but shouldn’t be leaky!); this bit:
>>
>>      (let ((prefix (strftime default-logfile-date-format
>>                              (localtime (current-time)))))
>>        (format output "~a~a~%" prefix line)
>>        (loop))
>
> (let ((prefix (strftime default-logfile-date-format
> (localtime (current-time)))))
> (format output "~a~a~%" prefix line)
> (loop))
>
> Would it make a difference if you instead wrote
>
> (let ((prefix (strftime default-logfile-date-format
> (localtime (current-time)))))
> (format output "~a~a~%" prefix line))
> (loop)
>
> ? Semantically, they're equivalent (perhaps with a wrapping begin
> needed), but the compiler might fail to see this.

I’m confident the compiler does the right thing, and it’s a tail call so
‘prefix’ doesn’t outlive one loop iteration.

Ludo’.
L
L
Ludovic Courtès wrote on 22 Oct 2022 22:20
(name . Liliana Marie Prikler)(address . liliana.prikler@ist.tugraz.at)(address . 58631@debbugs.gnu.org)
87tu3v4nl8.fsf_-_@gnu.org
Liliana Marie Prikler <liliana.prikler@ist.tugraz.at> skribis:

Toggle quote (24 lines)
> Am Donnerstag, dem 20.10.2022 um 11:29 +0200 schrieb Ludovic Courtès:
>> Logging as performed by ‘%service-file-logger’ is quite GC-intensive
>> (but shouldn’t be leaky!); this bit:
>>
>>      (let ((prefix (strftime default-logfile-date-format
>>                              (localtime (current-time)))))
>>        (format output "~a~a~%" prefix line)
>>        (loop))
>
> (let ((prefix (strftime default-logfile-date-format
> (localtime (current-time)))))
> (format output "~a~a~%" prefix line)
> (loop))
>
> Would it make a difference if you instead wrote
>
> (let ((prefix (strftime default-logfile-date-format
> (localtime (current-time)))))
> (format output "~a~a~%" prefix line))
> (loop)
>
> ? Semantically, they're equivalent (perhaps with a wrapping begin
> needed), but the compiler might fail to see this.

I’m confident the compiler does the right thing, and it’s a tail call so
‘prefix’ doesn’t outlive one loop iteration.

Ludo’.
L
L
Ludovic Courtès wrote on 22 Oct 2022 22:29
(address . 58631@debbugs.gnu.org)
87ilkb4n5l.fsf_-_@gnu.org
An update: I can reproduce in a VM running a slightly simplified version
of ‘hydra/berlin.scm’, with loop running:

while true; do wget -qO/dev/null http://localhost:3000/nix-cache-info;done

to trigger ‘guix publish’ logging.

Better, I can mostly reproduce the issue with the attached config file,
then starting shepherd:

rm -f sock && ./shepherd -s sock -c log-conf.scm -I

… monitoring heap usage:

./herd -s sock eval root '(gc-stats)'

… and triggering inetd service startup, which in turn triggers heap
growth:

while : ; do echo foo> /dev/tcp/localhost/4567 ; done

Then the heap size reported by ‘gc-stats’ seems to hit a threshold above
which is stop growing, or it grows too slowly (IOW the problem is not as
easy to observe as on berlin).

That’s pretty much all I have so far. :-/

Ludo’.
;; https://issues.guix.gnu.org/58631 (define %command (list "/bin/sh" "-c" (string-append "while true; do " (string-concatenate (make-list 30 (string-append "echo " (string-concatenate (make-list 8 "logging ")) "; "))) "sleep 0.2; " " done"))) (define %echo-server ;; Simple echo server. '("/bin/sh" "-c" "echo hello; read line; echo line; echo done")) (define loss (make-vector (* 10 (expt 2 20)))) (register-services (make <service> #:provides '(test-logging) #:start (make-forkexec-constructor %command #:log-file "/tmp/service.log") #:stop (make-kill-destructor)) (make <service> #:provides '(test-inetd) #:start (make-inetd-constructor %echo-server (list (endpoint (make-socket-address AF_INET INADDR_LOOPBACK 4567)))) #:stop (make-inetd-destructor))) (start 'test-logging) (start 'test-inetd) (pk 'init-gc (gc-stats))
L
L
Ludovic Courtès wrote on 29 Oct 2022 12:01
(address . 58631@debbugs.gnu.org)
87r0yrymlm.fsf@gnu.org
The attached Fibers program illustrates the problem: heap grows even
though it’s not supposed to.

Ludo’.
;; https://issues.guix.gnu.org/58631 (use-modules (fibers) (fibers channels) (ice-9 rdelim) (statprof)) (run-fibers (lambda () (define channel (make-channel)) (define leave-channel (make-channel)) (spawn-fiber (lambda () (sleep 10) (put-message leave-channel 'leave))) (spawn-fiber (lambda () (let loop () (put-message channel 'hi!) (get-message channel) (loop)))) (spawn-fiber (lambda () (let loop () (get-message channel) (put-message channel 'hey!) (loop)))) (spawn-fiber (lambda () (let loop () (pk 'heap-size (assoc-ref (gc-stats) 'heap-size)) (sleep 2) (loop)))) (get-message leave-channel)) ;; #:drain? #t #:parallelism 1 ;don't create POSIX threads #:hz 0)
M
M
Maxime Devos wrote on 29 Oct 2022 16:54
1061590e-c167-7cf3-2f64-1e0117efbf8a@telenet.be
On 29-10-2022 12:01, Ludovic Courtès wrote:
Toggle quote (2 lines)
> The attached Fibers program illustrates the problem: heap grows even
> though it’s not supposed to.
If I add (gc) before the (pk 'heap-size ...), I can't reproduce:
;;; (heap-size 12742656)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
(another run)
;;; (heap-size 4976640)
;;; (heap-size 8855552)
;;; (heap-size 12029952)
;;; (heap-size 12029952)
;;; (heap-size 12029952)
However, if I increase the number of seconds waited, I can eventually
reproduce some heap growth:
;;; (heap-size 12742656)
;;; (heap-size 12742656)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;;; (heap-size 12808192)
;; <--- here's the jump
;;; (heap-size 23359488)
;;; (heap-size 23359488)
;;; (heap-size 23359488)
;;; (heap-size 23359488)
;;; (heap-size 23359488)
;;; (heap-size 23359488)
;;; (heap-size 23359488)
;;; (heap-size 23359488)
;;; (heap-size 23359488)
(This is in a "guix shell guile guile-fibers" environment)
Greetings,
Maxime.
Attachment: OpenPGP_signature
L
L
Ludovic Courtès wrote on 30 Oct 2022 10:39
(address . 58631@debbugs.gnu.org)
87mt9dzm3q.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (3 lines)
> The attached Fibers program illustrates the problem: heap grows even
> though it’s not supposed to.

L
L
Ludovic Courtès wrote on 6 Nov 2022 23:45
(address . 58631@debbugs.gnu.org)
87edufg0st.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (7 lines)
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> The attached Fibers program illustrates the problem: heap grows even
>> though it’s not supposed to.
>
> The saga continues at <https://github.com/wingo/fibers/issues/65>.

… and then at https://issues.guix.gnu.org/59021, which appears to have
a happy end: these Guix commits deploy the fix:

072fd1124a gnu: shepherd@0.9: Use 'guile-3.0-latest' to address memory leak.
7138ba34fa gnu: guile@3.0.8: Add patch to address continuation memory leak.

In my testing it fixes the problem. It’s now deployed it on
bayfront.guix.gnu.org and we’ll deploy to berlin.guix.gnu.org later on
and see how it goes in the coming days.

Ludo’.
L
L
Ludovic Courtès wrote on 7 Nov 2022 11:43
(address . 58631@debbugs.gnu.org)
87sfivujrk.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (19 lines)
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> Ludovic Courtès <ludo@gnu.org> skribis:
>>
>>> The attached Fibers program illustrates the problem: heap grows even
>>> though it’s not supposed to.
>>
>> The saga continues at <https://github.com/wingo/fibers/issues/65>.
>
> … and then at <https://issues.guix.gnu.org/59021>, which appears to have
> a happy end: these Guix commits deploy the fix:
>
> 072fd1124a gnu: shepherd@0.9: Use 'guile-3.0-latest' to address memory leak.
> 7138ba34fa gnu: guile@3.0.8: Add patch to address continuation memory leak.
>
> In my testing it fixes the problem. It’s now deployed it on
> bayfront.guix.gnu.org and we’ll deploy to berlin.guix.gnu.org later on
> and see how it goes in the coming days.

Turns out it’s neither the end nor happy! On bayfront, shepherd is
apparently still leaking:

Toggle snippet (17 lines)
ludo@bayfront ~$ date && sudo herd eval root '(gc-stats)'
Sun Nov 6 11:50:48 PM CET 2022
Password:
Evaluating user expression (gc-stats).
((gc-time-taken . 2193894749) (heap-size . 16404480) (heap-free-size . 5505024) (heap-total-allocated . 119183424) (heap-allocated-since-gc . 4071216) (protected-objects . 0) (gc-times . 21))
ludo@bayfront ~$ date && sudo herd eval root '(gc-stats)'
Mon Nov 7 09:45:35 AM CET 2022
Password:
Evaluating user expression (gc-stats).
((gc-time-taken . 741717175797) (heap-size . 47304704) (heap-free-size . 10625024) (heap-total-allocated . 17295529056) (heap-allocated-since-gc . 10237808) (protected-objects . 0) (gc-times . 1700))
ludo@bayfront ~$ date && sudo herd eval root '(gc-stats)'
Mon Nov 7 11:18:05 AM CET 2022
Password:
Evaluating user expression (gc-stats).
((gc-time-taken . 1090097976826) (heap-size . 55693312) (heap-free-size . 5206016) (heap-total-allocated . 23190066736) (heap-allocated-since-gc . 18995744) (protected-objects . 0) (gc-times . 1996))

There seems to be an unusually high number of “program” objects in the
heap:

Toggle snippet (39 lines)
ludo@bayfront ~$ sudo herd eval root '(begin (load "/home/ludo/heap-profiler.scm") (profile-heap))'
Password:
Evaluating user expression (begin (load "/home/ludo/heap-profiler.scm") (#)).
% type self avg obj size
48.0 program 1188240 32.1
21.8 pair 539904 16.0
7.4 bytevector 184368 784.5
5.6 vector 138373 7.1
4.6 struct 113328 85.8
3.5 unknown 86912 39.8
2.0 stringbuf 50144 60.5
1.3 symbol 33152 32.9
0.6 string 13728 34.1
0.5 smob 12656 39.9
0.5 variable 12432 22.6
0.4 frame 10720 34.6
0.4 pointer 10256 32.5
0.4 weak-vector 9968 32.8
0.4 bitvector 9936 34.4
0.4 hash-table 9840 34.5
0.3 weak-table 7504 34.0
0.3 vm-continuation 6976 31.7
0.3 heap-number 6848 35.3
0.2 atomic-box 5488 36.1
0.2 port 5248 36.4
0.2 dynamic-state 3760 30.1
0.1 fluid 3664 70.5
0.1 syntax 3312 48.0
0.1 keyword 2016 40.3
0.1 weak-set 1872 48.0
0.1 array 1776 48.0
0.1 partial-continuation 1760 39.1
0.0 primitive 528 16.0
0.0 foreign-program 448 44.8
0.0 continuation 336 48.0
0.0 primitive-generic 176 35.2
sampled heap: 2.36098 MiB (heap size: 53.11328 MiB)

(Note that ‘partial-continuation’, which are programs with the
SCM_F_PROGRAM_IS_PARTIAL_CONTINUATION flag, are counted separately. So
here we really talking about programs with flags == 0.)

Ludo’.
L
L
Ludovic Courtès wrote on 10 Nov 2022 09:51
(address . 58631@debbugs.gnu.org)
87k043kx9z.fsf@gnu.org
backported in Guix commit d9ca9cdd01bf1097343a047b51a1392131c7cf58, this
shepherd memory leak seems to be finally fixed (details in the issue
above).

Bayfront.guix has been running it for more than 12h and heap size hasn’t
changed in that time span.

I’ll wait a bit more before closing the issue, but we’ll soon be done.

Ludo’.
L
L
Ludovic Courtès wrote on 11 Nov 2022 11:55
(address . 58631-done@debbugs.gnu.org)
877d01g3py.fsf@gnu.org
Hi!

Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (3 lines)
> Bayfront.guix has been running it for more than 12h and heap size hasn’t
> changed in that time span.

Heap size has remained unchanged for the past 48h. Berlin.guix was
reconfigured & rebooted 12h and shepherd’s heap size has remained
unchanged.

Officially closing now!

Ludo’.
Closed
?