shepherd segfaults upon shutdown (kernel panic)

Done. Submitted by Jesse Gibbons.
Details
4 participants
  • Arne Babenhauserheide
  • Jesse Gibbons
  • Ludovic Courtès
  • Jan
Owner
unassigned
Severity
important
Jesse Gibbons wrote on 15 Oct 2019 05:49
kernel panic
(name . bug-guix mailing list)(address . bug-guix@gnu.org)
0876c9961fdffa47be54b756a05eb6320b6bdb18.camel@gmail.com
Attached is a picture of the kernel panic. It happened when I tried to shut down. I do not know what log to look at to get any details about what happened about that time. Of course, the panic itself is not in any of the logs in /var/log. This is not the first time there was a kernel panic during the shutdown process.
Attachment: Image-HRCX9Z.png
Ludovic Courtès wrote on 28 Oct 2019 23:18
control message for bug #37757
(address . control@debbugs.gnu.org)
87mudkfr8s.fsf@gnu.org
retitle 37757 Kernel panic upon shutdown
quit
Ludovic Courtès wrote on 28 Oct 2019 23:19
(address . control@debbugs.gnu.org)
87lft4fr8n.fsf@gnu.org
severity 37757 important
quit
Ludovic Courtès wrote on 28 Oct 2019 23:28
Kernel panic upon shutdown
(name . Jesse Gibbons)(address . jgibbons2357@gmail.com)(address . 37757@debbugs.gnu.org)
874kzsfqsx.fsf@gnu.org
Hi,
Jesse Gibbons <jgibbons2357@gmail.com> skribis:
> Attached is a picture of the kernel panic. It happened when I tried to shut
> down.
> I do not know what log to look at to get any details about what happened
> about that time. Of course, the panic itself is not in any of the logs in
> /var/log.
> This is not the first time there was a kernel panic during the shutdown
> process.
I’ve just seen it on a laptop running GNOME and ‘%desktop-services’. The kernel panic appeared right after shutting down ModemManager (I don’t have ModemManager on my own laptop and I’ve never experienced the bug, but I don’t know if it’s significant.)
Note that we see (roughly):
attempted to kill init! exit code=0x0000000b
which, unless I’m mistaken, means that PID 1 segfaulted (SIGSEGV = 11), which is bad.
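To make the decoding of that exit code concrete, here is a minimal C sketch. It assumes the value printed in the panic message uses the same encoding as a wait(2) status word, where the low 7 bits hold the terminating signal. The function name `fatal_signal` is hypothetical and used only for illustration.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/* Return the signal number encoded in a wait-style status word, or 0 if
   the status does not describe a death by signal.  Assumption: the
   "exit code" shown in the panic message follows this encoding.  */
int
fatal_signal (int status)
{
  return WIFSIGNALED (status) ? WTERMSIG (status) : 0;
}
```

Under that assumption, passing 0x0000000b yields 11, which is SIGSEGV on Linux, while a normal exit with status 1 (encoded as 0x0100) yields 0.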
According to reboot(2), the ‘reboot’ syscall doesn’t return in this case, so the segfault must have happened before the ‘reboot’ call.
The problem appeared roughly after the ‘core-updates’ merge, but I don’t see any change to the ‘reboot’ wrapper in glibc 2.29.
Is it reproducible for you in a VM built with ‘guix system vm’? It would be helpful if we had that.
Thanks,
Ludo’.
Ludovic Courtès wrote on 13 Nov 2019 23:05
(name . Jesse Gibbons)(address . jgibbons2357@gmail.com)(address . 37757@debbugs.gnu.org)
87k183mnza.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:
> I’ve just seen it on a laptop running GNOME and ‘%desktop-services’.
> The kernel panic appeared right after shutting down ModemManager (I
> don’t have ModemManager on my own laptop and I’ve never experienced the
> bug, but I don’t know if it’s significant.)
>
> Note that we see (roughly):
>
> attempted to kill init! exit code=0x0000000b
[...]
> Is it reproducible for you in a VM built with ‘guix system vm’? It
> would be helpful if we had that.
For the record, apparently I can’t reproduce it in a ‘guix system vm gnu/system/examples/desktop.tmpl’ VM.
Ludo’.
(name . Ludovic Courtès)(address . ludo@gnu.org)
20191113232202.42702143@kompiuter
Hi,
I encountered the same error today. I had run "sudo herd stop tor" and then "sudo herd stop xorg-server" and it panicked.

Jan Wielkiewicz
Ludovic Courtès wrote on 28 Nov 2019 12:41
control message for bug #37757
(address . control@debbugs.gnu.org)
878so0xm77.fsf@gnu.org
retitle 37757 shepherd segfaults upon shutdown (kernel panic)
quit
Ludovic Courtès wrote on 28 Nov 2019 12:45
Re: bug#37757: Kernel panic upon shutdown
(name . Jesse Gibbons)(address . jgibbons2357@gmail.com)(address . 37757@debbugs.gnu.org)
87wobkw7gj.fsf@gnu.org
Hello!
The attached patch should allow shepherd (PID 1) to dump core when it crashes (systemd does something similar).
Jesse (and anyone else experiencing this!), could you try to (1) reconfigure with this patch, (2) reboot, (3) try to halt the system to reproduce the crash, and (4) retrieve a backtrace from the ‘core’ file?
For #4, you’ll have to do something along these lines once you’ve rebooted after the crash:
sudo gdb /run/current-system/profile/bin/guile /core
and then type “thread apply all bt” at the GDB prompt.
I’ll also try to do that on another machine where I’ve seen it happen.
Thanks in advance!
Ludo’.
diff --git a/gnu/services/shepherd.scm b/gnu/services/shepherd.scm
index 08bb33039c..ec49244cf6 100644
--- a/gnu/services/shepherd.scm
+++ b/gnu/services/shepherd.scm
@@ -277,45 +277,87 @@ and return the resulting '.go' file."
 
   (let ((files (map shepherd-service-file services)))
     (define config
-      #~(begin
-          (use-modules (srfi srfi-34)
-                       (system repl error-handling))
+      (with-imported-modules '((guix build syscalls))
+        #~(begin
+            (use-modules (srfi srfi-34)
+                         (system repl error-handling)
+                         (guix build syscalls)
+                         (system foreign))
 
-          ;; Arrange to spawn a REPL if something goes wrong.  This is better
-          ;; than a kernel panic.
-          (call-with-error-handling
-            (lambda ()
-              (apply register-services
-                     (map load-compiled '#$(map scm->go files)))))
+            (define signal
+              (let ((proc (pointer->procedure int
+                                              (dynamic-func "signal"
+                                                            (dynamic-link))
+                                              (list int '*))))
+                (lambda (signum handler)
+                  (proc signum
+                        (if (integer? handler)      ;SIG_DFL, etc.
+                            (make-pointer handler)
+                            (procedure->pointer void handler (list int)))))))
 
-          ;; guix-daemon 0.6 aborts if 'PATH' is undefined, so work around
-          ;; it.
-          (setenv "PATH" "/run/current-system/profile/bin")
+            (define (handle-crash sig)
+              (dynamic-wind
+                (const #t)
+                (lambda ()
+                  (gc-disable)
+                  (pk 'crash! sig)
+                  ;; Fork and have the child dump core at the root.
+                  (match (clone SIGCHLD)
+                    (0
+                     (setrlimit 'core #f #f)
+                     (chdir "/")
+                     (signal sig SIG_DFL)
+                     ;; Note: 'getpid' would return 1, hence this hack.
+                     (kill (string->number (readlink "/proc/self"))
+                           sig)
+                     (primitive-_exit 253))
+                    (child
+                     (waitpid child)
+                     (sync)
+                     ;; Hopefully at this point core has been dumped.
+                     (pk 'done)
+                     (sleep 3)
+                     (primitive-_exit 255))))
+                (lambda ()
+                  (primitive-_exit 254))))
 
-          (format #t "starting services...~%")
-          (for-each (lambda (service)
-                      ;; In the Shepherd 0.3 the 'start' method can raise
-                      ;; '&action-runtime-error' if it fails, so protect
-                      ;; against it.  (XXX: 'action-runtime-error?' is not
-                      ;; exported is 0.3, hence 'service-error?'.)
-                      (guard (c ((service-error? c)
-                                 (format (current-error-port)
-                                         "failed to start service '~a'~%"
-                                         service)))
-                        (start service)))
-                    '#$(append-map shepherd-service-provision
-                                   (filter shepherd-service-auto-start?
-                                           services)))
+            (signal SIGSEGV handle-crash)
 
-          ;; Hang up stdin.  At this point, we assume that 'start' methods
-          ;; that required user interaction on the console (e.g.,
-          ;; 'cryptsetup open' invocations, post-fsck emergency REPL) have
-          ;; completed.  User interaction becomes impossible after this
-          ;; call; this avoids situations where services wrongfully lead
-          ;; PID 1 to read from stdin (the console), which users may not
-          ;; have access to (see <https://bugs.gnu.org/23697>).
-          (redirect-port (open-input-file "/dev/null")
-                         (current-input-port))))
+            ;; Arrange to spawn a REPL if something goes wrong.  This is better
+            ;; than a kernel panic.
+            (call-with-error-handling
+              (lambda ()
+                (apply register-services
+                       (map load-compiled '#$(map scm->go files)))))
+
+            ;; guix-daemon 0.6 aborts if 'PATH' is undefined, so work around
+            ;; it.
+            (setenv "PATH" "/run/current-system/profile/bin")
+
+            (format #t "starting services...~%")
+            (for-each (lambda (service)
+                        ;; In the Shepherd 0.3 the 'start' method can raise
+                        ;; '&action-runtime-error' if it fails, so protect
+                        ;; against it.  (XXX: 'action-runtime-error?' is not
+                        ;; exported is 0.3, hence 'service-error?'.)
+                        (guard (c ((service-error? c)
+                                   (format (current-error-port)
+                                           "failed to start service '~a'~%"
+                                           service)))
+                          (start service)))
+                      '#$(append-map shepherd-service-provision
+                                     (filter shepherd-service-auto-start?
+                                             services)))
+
+            ;; Hang up stdin.  At this point, we assume that 'start' methods
+            ;; that required user interaction on the console (e.g.,
+            ;; 'cryptsetup open' invocations, post-fsck emergency REPL) have
+            ;; completed.  User interaction becomes impossible after this
+            ;; call; this avoids situations where services wrongfully lead
+            ;; PID 1 to read from stdin (the console), which users may not
+            ;; have access to (see <https://bugs.gnu.org/23697>).
+            (redirect-port (open-input-file "/dev/null")
+                           (current-input-port)))))
 
     (scheme-file "shepherd.conf" config)))
Ludovic Courtès wrote on 2 Dec 2019 18:33
(address . 37757@debbugs.gnu.org)
87d0d6k4z4.fsf@gnu.org
Hi!
Ludovic Courtès <ludo@gnu.org> skribis:
> Jesse (and anyone else experiencing this!), could you try to (1)
> reconfigure with this patch, (2) reboot, (3) try to halt the system to
> reproduce the crash, and (4) retrieve a backtrace from the ‘core’ file?
>
> For #4, you’ll have to do something along these lines once you’ve
> rebooted after the crash:
>
> sudo gdb /run/current-system/profile/bin/guile /core
>
> and then type “thread apply all bt” at the GDB prompt.
It turns out the previous patch didn’t work; in short, we really have to use async-signal-safe functions only from the signal handler, so this has to be done in C.
The attached patch does that. I’ve tried it with ‘guix system container’ and it seems to dump core as expected, from what I can see.
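As an illustration of the async-signal-safe constraint, here is a minimal C sketch, separate from the actual patch. The handler restricts itself to write(2) and a store to a `volatile sig_atomic_t` variable, both of which are safe to use from a signal handler. The names `note_signal`, `install_safe_handler`, and `last_signal` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Written by the handler.  'volatile sig_atomic_t' is the object type
   that C guarantees can be stored to safely from a signal handler.  */
volatile sig_atomic_t last_signal = 0;

static void
note_signal (int sig)
{
  static const char msg[] = "signal caught\n";

  /* write(2) is on the POSIX list of async-signal-safe functions;
     printf, malloc, and a garbage-collected runtime are not.  */
  (void) write (STDERR_FILENO, msg, sizeof msg - 1);
  last_signal = sig;
}

/* Install 'note_signal' as the handler for SIG.
   Return 0 on success, -1 on error (as sigaction does).  */
int
install_safe_handler (int sig)
{
  struct sigaction sa;

  memset (&sa, 0, sizeof sa);
  sa.sa_handler = note_signal;
  sigemptyset (&sa.sa_mask);
  return sigaction (sig, &sa, NULL);
}
```

A handler written in Scheme cannot give this guarantee, since running it may allocate memory or take locks inside the Guile runtime while the interrupted code holds them.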
Let me know if you manage to reproduce the bug and to get a core dumped with this patch.
To everyone reading this: if you’re experiencing shepherd crashes, please raise your hand :-) and consider applying this patch so we can gather debugging info!
Thanks,
Ludo’.
diff --git a/gnu/services/shepherd.scm b/gnu/services/shepherd.scm
index 08bb33039c..cf82ef0a4c 100644
--- a/gnu/services/shepherd.scm
+++ b/gnu/services/shepherd.scm
@@ -271,6 +271,23 @@ and return the resulting '.go' file."
           (compile-file #$file #:output-file #$output
                         #:env env))))))
 
+(define (crash-handler)
+  (define gcc-toolchain
+    (module-ref (resolve-interface '(gnu packages commencement))
+                'gcc-toolchain))
+
+  (define source
+    (local-file "../system/aux-files/shepherd-crash-handler.c"))
+
+  (computed-file "crash-handler.so"
+                 #~(begin
+                     (setenv "PATH" #+(file-append gcc-toolchain "/bin"))
+                     (setenv "CPATH" #+(file-append gcc-toolchain "/include"))
+                     (setenv "LIBRARY_PATH"
+                             #+(file-append gcc-toolchain "/lib"))
+                     (system* "gcc" "-Wall" "-g" "-O3" "-fPIC"
+                              "-shared" "-o" #$output #$source))))
+
 (define (shepherd-configuration-file services)
   "Return the shepherd configuration file for SERVICES."
   (assert-valid-graph services)
@@ -281,6 +298,9 @@ and return the resulting '.go' file."
           (use-modules (srfi srfi-34)
                        (system repl error-handling))
 
+          ;; Load the crash handler, which allows shepherd to dump core.
+          (dynamic-link #$(crash-handler))
+
           ;; Arrange to spawn a REPL if something goes wrong.  This is better
           ;; than a kernel panic.
           (call-with-error-handling
diff --git a/gnu/system/aux-files/shepherd-crash-handler.c b/gnu/system/aux-files/shepherd-crash-handler.c
new file mode 100644
index 0000000000..6b2db10866
--- /dev/null
+++ b/gnu/system/aux-files/shepherd-crash-handler.c
@@ -0,0 +1,70 @@
+#define _GNU_SOURCE
+
+#include <stdlib.h>
+#include <unistd.h>
+#include <sched.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/syscall.h>   /* For SYS_xxx definitions */
+#include <signal.h>
+
+static void
+handle_crash (int sig)
+{
+  static const char msg[] = "Shepherd crashed!\n";
+  write (2, msg, sizeof msg);
+
+#ifdef __sparc__
+  /* See 'raw_clone' in systemd.  */
+# error "SPARC uses a different 'clone' syscall convention"
+#endif
+
+  pid_t pid = syscall (SYS_clone, SIGCHLD, NULL);
+  if (pid < 0)
+    abort ();
+
+  if (pid == 0)
+    {
+      /* Restore the default signal handler to get a core dump.  */
+      signal (sig, SIG_DFL);
+
+      const struct rlimit infinity = { RLIM_INFINITY, RLIM_INFINITY };
+      setrlimit (RLIMIT_CORE, &infinity);
+      chdir ("/");
+
+      int pid = syscall (SYS_getpid);
+      kill (pid, sig);
+
+      /* As it turns out, 'kill' simply returns without doing anything, which
+         is consistent with the "Notes" section of kill(2).  Thus, force a
+         crash.  */
+      * (int *) 0 = 42;
+
+      _exit (254);
+    }
+  else
+    {
+      signal (sig, SIG_IGN);
+
+      int status;
+      waitpid (pid, &status, 0);
+
+      sync ();
+
+      _exit (255);
+    }
+
+  _exit (253);
+}
+
+static void initialize_crash_handler (void)
+  __attribute__ ((constructor));
+
+static void
+initialize_crash_handler (void)
+{
+  signal (SIGSEGV, handle_crash);
+  signal (SIGABRT, handle_crash);
+}
Arne Babenhauserheide wrote on 3 Dec 2019 10:43
(address . bug-guix@gnu.org)
875zixd9se.fsf@web.de
Ludovic Courtès <ludo@gnu.org> writes:
> To everyone reading this: if you’re experiencing shepherd crashes,
> please raise your hand :-)
\o
> and consider applying this patch so we can gather debugging info!
Can I do that without installing from a local checkout?
Best wishes,
Arne
--
To be apolitical
means to be political
without noticing it
Ludovic Courtès wrote on 9 Dec 2019 14:47
(name . Jesse Gibbons)(address . jgibbons2357@gmail.com)
87lfrlfw4w.fsf@gnu.org
Hello,
[+Cc: Andy for a heads-up on the fix below.]
Ludovic Courtès <ludo@gnu.org> skribis:
> It turns out the previous patch didn’t work; in short, we really have to
> use async-signal-safe functions only from the signal handler, so this
> has to be done in C.
>
> The attached patch does that. I’ve tried it with ‘guix system
> container’ and it seems to dump core as expected, from what I can see.
>
> Let me know if you manage to reproduce the bug and to get a core dumped
> with this patch.
Good news! The patch does indeed allow shepherd to dump core, and I managed to grab the backtrace below on an x86_64 machine running Guix System (from yesterday) with GNOME:
Using host libthread_db library "/gnu/store/ahqgl4h89xqj695lgqvsaf6zh2nhy4pj-glibc-2.29/lib/libthread_db.so.1".
Core was generated by `/gnu/store/1mkkv2caiqbdbbd256c4dirfi4kwsacv-guile-2.2.6/bin/guile --no-auto-com'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  handle_crash (sig=11) at /gnu/store/dayk54wxskp14w53813384azhxmd5awz-shepherd-crash-handler.c:43
43        * (int *) 0 = 42;
[Current thread is 1 (LWP 4635)]
[…]
Thread 1 (LWP 4635):
#0  handle_crash (sig=11) at /gnu/store/dayk54wxskp14w53813384azhxmd5awz-shepherd-crash-handler.c:43
        infinity = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        pid = <optimized out>
        msg = "Shepherd crashed!\n"
        pid = <optimized out>
#1  <signal handler called>
No locals.
#2  handle_crash (sig=6) at /gnu/store/dayk54wxskp14w53813384azhxmd5awz-shepherd-crash-handler.c:43
        infinity = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        pid = <optimized out>
        msg = "Shepherd crashed!\n"
        pid = <optimized out>
#3  <signal handler called>
No locals.
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
        set = {__val = {0, 2314885530818445312, 0 <repeats 14 times>}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#5  0x00007f03eef40891 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 <repeats 13 times>, 139654877144192, 0, 139654877624544}}, sa_flags = -279049286, sa_restorer = 0x7f03ef57e480 <read_finalization_pipe_data>}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#6  0x00007f03ef57e89a in finalization_thread_proc (unused=<optimized out>) at finalizers.c:228
        data = {byte = -24 '\350', n = -1, err = 4}
#7  0x00007f03ef56f35a in c_body (d=0x7f03ed152e50) at continuations.c:422
        data = 0x7f03ed152e50
#8  0x00007f03ef5f079f in vm_regular_engine (thread=0x2, vp=0x7f03eb1caea0, registers=0x0, resume=-286001158) at vm-engine.c:786
        ret = 2
        ip = <optimized out>
        sp = <optimized out>
        op = 10
        jump_table_ = {…}
        jump_table = 0x7f03ef64d8e0 <jump_table_>
[…]
#19 scm_with_guile (func=<optimized out>, data=<optimized out>) at threads.c:710
No locals.
#20 0x00007f03ef497015 in start_thread (arg=0x7f03ed153700) at pthread_create.c:486
        ret = <optimized out>
        pd = 0x7f03ed153700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139654839219968, -749312912628550421, 140727702524830, 140727702524831, 140727702524832, 139654839219968, 837174519050892523, 837169745183601899}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#21 0x00007f03eeffd91f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.
So what happens is that ‘finalization_thread_proc’ in Guile receives EINTR (data.err == 4) but then, despite EINTR, it goes on to check the value of ‘data.byte’ and aborts because it’s neither 0 nor 1.
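The intended decision logic can be sketched in C as a small pure function. This is an illustration under assumptions, not Guile's actual code: the enum, its constants, and `next_action` are hypothetical names, and the three parameters stand in for the ‘n’, ‘err’, and ‘byte’ fields visible in the backtrace.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/* Hypothetical outcomes for one read from the finalization pipe.  */
enum finalizer_action
{
  ACT_RETRY,            /* interrupted read: loop and read again */
  ACT_RUN_FINALIZERS,   /* byte 0 */
  ACT_STOP_THREAD,      /* byte 1 */
  ACT_FAIL,             /* genuine read error */
  ACT_ABORT             /* unexpected byte value */
};

/* N is read's return value, ERR the errno saved right after the read,
   and BYTE the byte that was read (meaningful only when N > 0).  */
enum finalizer_action
next_action (ssize_t n, int err, char byte)
{
  if (n <= 0)
    /* Nothing was read, so BYTE holds garbage and must not be
       inspected.  Only a non-EINTR error is fatal.  */
    return err == EINTR ? ACT_RETRY : ACT_FAIL;

  switch (byte)
    {
    case 0:
      return ACT_RUN_FINALIZERS;
    case 1:
      return ACT_STOP_THREAD;
    default:
      return ACT_ABORT;
    }
}
```

With the values from frame #6 (n = -1, err = 4, byte = -24), this returns ACT_RETRY, whereas the logic described above fell through to the byte check and aborted.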
My plan is to:
1. push the patch below to the ‘stable-2.2’ branch of Guile; done: https://git.savannah.gnu.org/cgit/guile.git/commit/?h=stable-2.2&id=edf5aea7ac852db2356ef36cba4a119eb0c81ea9;
2. use a patched Guile for the ‘shepherd’ package;
3. include the crash handler in the Shepherd.
Thoughts?
Thanks,
Ludo’.
diff --git a/libguile/finalizers.c b/libguile/finalizers.c
index c5d69e8e3..94a6e6b0a 100644
--- a/libguile/finalizers.c
+++ b/libguile/finalizers.c
@@ -1,4 +1,4 @@
-/* Copyright (C) 2012, 2013, 2014 Free Software Foundation, Inc.
+/* Copyright (C) 2012, 2013, 2014, 2019 Free Software Foundation, Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public License
@@ -211,21 +211,26 @@ finalization_thread_proc (void *unused)
 
       scm_without_guile (read_finalization_pipe_data, &data);
 
-      if (data.n <= 0 && data.err != EINTR)
+      if (data.n <= 0)
         {
-          perror ("error in finalization thread");
-          return NULL;
+          if (data.err != EINTR)
+            {
+              perror ("error in finalization thread");
+              return NULL;
+            }
         }
-
-      switch (data.byte)
+      else
         {
-        case 0:
-          scm_run_finalizers ();
-          break;
-        case 1:
-          return NULL;
-        default:
-          abort ();
+          switch (data.byte)
+            {
+            case 0:
+              scm_run_finalizers ();
+              break;
+            case 1:
+              return NULL;
+            default:
+              abort ();
+            }
         }
     }
 }
Ludovic Courtès wrote on 10 Dec 2019 00:13
(name . Jesse Gibbons)(address . jgibbons2357@gmail.com)
87v9qp9jo7.fsf@gnu.org
Hi,
Ludovic Courtès <ludo@gnu.org> skribis:
> My plan is to:
>
> 1. push the patch below to the ‘stable-2.2’ branch of Guile;
> done:
> <https://git.savannah.gnu.org/cgit/guile.git/commit/?h=stable-2.2&id=edf5aea7ac852db2356ef36cba4a119eb0c81ea9>;
>
> 2. use a patched Guile for the ‘shepherd’ package;
> 3. include the crash handler in the Shepherd.
Done: https://git.savannah.gnu.org/cgit/shepherd.git/commit/?id=dfb7c7ecdb2d12061073e6939ec6e765ae59c00c.
I’m closing the bug. Please reopen it if you notice anything wrong!
Ludo’.
Closed

This issue is archived.

To comment on this conversation send email to 37757@debbugs.gnu.org