[Shepherd] Non-responding service control fiber

  • Done
Details
4 participants
  • Attila Lendvai
  • Timo Wilken
  • Hilton Chain
  • Ludovic Courtès
Owner
unassigned
Submitted by
Hilton Chain
Severity
important
Merged with
Hilton Chain wrote on 9 Aug 2023 14:41
Shepherd hangs (was: Getting Guix to shutdown my laptop properly with Sway and no DE)
(address . bug-guix@gnu.org)
87bkfg1j7b.wl-hako@ultrarare.space
Hello!

I have experienced many instances of Shepherd hanging through my use
of Guix, though I don't have a clear record of when it first happened.

I have seen a few reports on the subject recently. A quick search of
recent bug reports turned up nothing related, only this thread [1] on
help-guix. So I'll start a bug report here, though I don't know how to
debug Shepherd and I haven't found a way to reproduce the hang
reliably.

I'm not sure whether Shepherd hangs during normal use, but most of the
time I find it already hung when doing a reconfiguration. The
reconfiguration becomes unresponsive and won't accept a ^C, and herd
actions also hang. This usually happens with home reconfiguration,
but I remember it happening once with system reconfiguration, when
adding and deleting some services in the configuration file.

I'm not sure how Shepherd hangs either, because in the latter case
(the system one) I could still see logs indicating that it was trying
to respawn a process I had killed manually, even though that was just
log output and no processes were actually spawned.

And as shown in [1], there are also cases where Shepherd hangs at some
point in the halting process, usually after syslogd has been
terminated but before term-tty*.

(The termination message indicates that Shepherd is still functional
at that point, and the absence of logs afterwards shows that syslogd
really was stopped, but for the same reason I can't tell what happened
next. I can still switch ttys afterwards, so I assume the term-tty*
services are alive.)

Although I don't know how they are related, I have linked my
configurations below:

Thanks

[1]:
(public-inbox mirror on yhetil.org)
Hilton Chain wrote on 13 Aug 2023 17:25
(address . 65178@debbugs.gnu.org)
87leefx8u0.wl-hako@ultrarare.space
On Wed, 09 Aug 2023 20:41:44 +0800,
Hilton Chain wrote:
> I'm not sure if Shepherd hangs at usual, but most of the time I find
> it already hanging is when doing a reconfiguration. The
> reconfiguration becomes unresponsive and it won't accept a ^C, herd
> actions also hang. This usually happens with home reconfiguration,

Today I encountered the home reconfiguration issue. The behavior is
similar to <https://issues.guix.gnu.org/54919>.

Ending part of output for the hanging reconfiguration:
[...]
Symlinking /home/hako/.config/fontconfig/fonts.conf -> /gnu/store/fvvqbma1xxgisfcq7rrwihbw7jwnyliv-fonts.conf... done
Symlinking /home/hako/.gnupg/gpg-agent.conf -> /gnu/store/kfaz4zrxmfz6p72x47c7qrqvb873gbyi-gpg-agent.conf... done
Symlinking /home/hako/.ssh/config -> /gnu/store/xb6f584pwclg48fr28wl21v1mxplqp6f-ssh.conf... done
Symlinking /home/hako/.icons/default/index.theme -> /gnu/store/3sraq69nrs04ii0fjgk36aw2c57q6z27-icons.theme... done
done
Finished updating symlinks.



And `herd status' also hangs:
$ herd status

Hilton Chain wrote on 15 Aug 2023 15:20
(address . 65178@debbugs.gnu.org)
87y1icmohn.wl-hako@ultrarare.space
On Sun, 13 Aug 2023 23:25:59 +0800,
Hilton Chain wrote:
>
> Today I encountered the home reconfiguration issue. The behavior is
> similar to <https://issues.guix.gnu.org/54919>.

And today Shepherd hung after starting a service [1], the service
itself started successfully (process started, logs available):
$ sudo herd enable cloudflare-tunnel && sudo herd start cloudflare-tunnel
Enabled service cloudflare-tunnel.


Ludovic Courtès wrote on 2 Sep 2023 22:49
Re: bug#65178: Shepherd hangs (was: Getting Guix to shutdown my laptop properly with Sway and no DE)
(name . Hilton Chain)(address . hako@ultrarare.space)(address . 65178@debbugs.gnu.org)
87a5u4e1wg.fsf_-_@gnu.org
Hi!

Hilton Chain <hako@ultrarare.space> writes:

> On Sun, 13 Aug 2023 23:25:59 +0800,
> Hilton Chain wrote:
>>
>> Today I encountered the home reconfiguration issue. The behavior is
>> similar to <https://issues.guix.gnu.org/54919>.
>
> And today Shepherd hung after starting a service [1], the service
> itself started successfully (process started, logs available):

I’m assuming this is shepherd 0.10.2, right?

> $ sudo herd enable cloudflare-tunnel && sudo herd start cloudflare-tunnel
> Enabled service cloudflare-tunnel.
>
> [1]: <https://codeberg.org/hako/Rosenthal/src/commit/c7dc95c2932d7362673c28cdc2f52e6bb8357c18/rosenthal/services/child-error.scm#L151>

Is any of the services you’re using doing “non-standard things” such as
using constructors/destructors other than those provided by shepherd
(‘make-forkexec-constructor’ et al.)?

Is it reproducible, and do you think you could come up with a reduced
test case (for example by removing services from the config until you
reach the minimum)?

Thanks,
Ludo’.
Hilton Chain wrote on 3 Sep 2023 10:21
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 65178@debbugs.gnu.org)
87h6ob4qh9.wl-hako@ultrarare.space
On Sun, 03 Sep 2023 04:49:35 +0800,
Ludovic Courtès wrote:
>
> Hi!
>
> Hilton Chain <hako@ultrarare.space> scribes:
>
> > On Sun, 13 Aug 2023 23:25:59 +0800,
> > Hilton Chain wrote:
> >>
> >> Today I encountered the home reconfiguration issue. The behavior is
> >> similar to <https://issues.guix.gnu.org/54919>.
> >
> > And today Shepherd hung after starting a service [1], the service
> > itself started successfully (process started, logs available):
>
> I’m assuming this is shepherd 0.10.2, right?
Yes!
>
> > $ sudo herd enable cloudflare-tunnel && sudo herd start cloudflare-tunnel
> > Enabled service cloudflare-tunnel.
> >
> > [1]: <https://codeberg.org/hako/Rosenthal/src/commit/c7dc95c2932d7362673c28cdc2f52e6bb8357c18/rosenthal/services/child-error.scm#L151>
>
> Is any of the services you’re using doing “non-standard things” such as
> using constructors/destructors other than those provided by shepherd
> (‘make-forkexec-constructor’ et al.)?
No, I'm unaware of such things.
> Is it reproducible, and do you think you could come up with a reduce
> test case (for example by removing services from the config until you
> reach the minimum)?
I still don't know which condition triggers it, so I can't make a test
case.
It's not reproducible. And I don't think it's really related to the
config, since Shepherd doesn't hang when I reboot into a system
generation that previously made it hang during reconfiguration.
It might be related to bug#65419 ([Shepherd] Non-responding service
control fiber), which you reported, since the behavior is similar:
`herd status nscd' still works when Shepherd hangs.
Ludovic Courtès wrote on 3 Sep 2023 21:59
control message for bug #65419
(address . control@debbugs.gnu.org)
87v8crc9k2.fsf@gnu.org
merge 65419 65178
quit
Ludovic Courtès wrote on 3 Sep 2023 21:59
(address . control@debbugs.gnu.org)
87ttsbc9jn.fsf@gnu.org
severity 65419 important
quit
Ludovic Courtès wrote on 23 Nov 2023 21:42
(address . control@debbugs.gnu.org)
87sf4w8ana.fsf@gnu.org
retitle 65419 [Shepherd] Non-responding service control fiber
quit
Timo Wilken wrote on 14 Dec 2023 23:55
Re: Shepherd stops responding during "guix system reconfigure"
(name . Attila Lendvai)(address . attila@lendvai.name)
CXODHED6PPG6.3PMFS738SPPMZ@lap.twilken.net
After a bit of searching, it looks like 67538, 67230 and 65178 may be the same
issue.

Attila Lendvai wrote:
> > > my suspicion is that it's due to some error coming from a start
> > > GEXP that somehow derails shepherd's event loop.
> >
> > iirc I once managed to get a debugger out when it happened and it's
> > stuck waiting in one of the epoll/select/alike calls,
>
> ...or one of the start/stop GEXP's calls something that (sometimes?) blocks
> indefinitely (which violates the API of shepherd).

Same symptoms here again.
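
To illustrate the hypothesis quoted above (a start/stop action that blocks and thereby stalls the event loop), here is a minimal analogy in Python's asyncio. It is not Shepherd code and does not claim to show the actual cause; `heartbeat`, `blocking_start`, `cooperative_start` and `ticks_during` are hypothetical names, and asyncio merely stands in for a cooperatively scheduled loop such as the one guile-fibers provides.

```python
import asyncio
import time


async def heartbeat(ticks):
    # Stands in for every other task the event loop should keep serving.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def blocking_start():
    # Hypothetical "start" action that blocks the whole thread.
    time.sleep(0.2)


async def cooperative_start():
    # Same delay, but it yields to the event loop while waiting.
    await asyncio.sleep(0.2)


async def ticks_during(start):
    """Count heartbeat ticks that happen while `start` is running."""
    ticks = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat run once
    before = len(ticks)
    await start()
    during = len(ticks) - before
    task.cancel()
    return during


if __name__ == "__main__":
    print("blocking:", asyncio.run(ticks_during(blocking_start)))
    print("cooperative:", asyncio.run(ticks_during(cooperative_start)))
```

The blocking variant reports 0 ticks, because nothing else gets scheduled until the blocking call returns, while the cooperative variant reports a positive count. A call that never returns would stall such a loop indefinitely.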

For context: this time I was trying to deploy some OCI/Docker containers using
Guix' `oci-container-service-type', specifically a Shepherd service called
"conduit". My code is here:


(Specifically, commits bf94f7872a1df293bd904bbd2c1ef7229f4f98a8 and
c87dcdae79c6266ac3dac70af08fbef5eb21629b.)

This is with Guix commit 1b2505217cf222d98cc960b8510660976a01cfa1.

I first ran "guix system reconfigure -L . tw/system/lud.scm" with commit
bf94f7872a1df293bd904bbd2c1ef7229f4f98a8, which had a bug (an env var was
wrong, so the container failed to start). This worked as expected in that
Shepherd tried to start the service, which failed, so Shepherd disabled it.

Then, I fixed the env var and re-ran "guix system reconfigure -L .
tw/system/lud.scm" with commit c87dcdae79c6266ac3dac70af08fbef5eb21629b.
Shepherd loaded the new "conduit" service fine, as far as I can tell, but
didn't restart it because it was still disabled.

I then enabled and started the service manually. Enabling worked fine, but on
start, I got no terminal output from Shepherd, and it hung.

I still had an error in my setup (directory permissions were wrong), and I got
a message in /var/log/messages to that effect:

Dec 14 21:33:50 localhost shepherd[1]: Service conduit is currently disabled.
Dec 14 21:34:04 localhost shepherd[1]: Enabled service conduit.
Dec 14 21:34:07 localhost shepherd[1]: Starting service user-homes...
Dec 14 21:34:07 localhost shepherd[1]: Service user-homes has been started.
Dec 14 21:34:07 localhost shepherd[1]: Service user-homes started.
Dec 14 21:34:07 localhost shepherd[1]: Service user-homes running with value #t.
Dec 14 21:34:07 localhost shepherd[1]: Starting service conduit...
Dec 14 21:34:07 localhost shepherd[1]: Service conduit has been started.
Dec 14 21:34:07 localhost shepherd[1]: Service conduit started.
Dec 14 21:34:07 localhost shepherd[1]: Service conduit running with value 13226.
Dec 14 21:34:07 localhost shepherd[1]: [docker] conduit: [...] "IO error: While open a file for appending: /var/lib/matrix-conduit/LOG: Permission denied"

...showing that Shepherd had at least tried to start the new container. The
container is not running, though (due to the error shown above), and nothing
with PID 13226 is running.
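
The mismatch described above (a log line reporting a PID that no longer exists) can be checked mechanically. This is a minimal sketch that assumes the log line format shown in the excerpt; the helper names `reported_pid` and `pid_alive` are made up for illustration.

```python
import os
import re

# Matches lines such as "... Service conduit running with value 13226."
# Non-numeric values (for example "#t") are deliberately not matched.
RUNNING = re.compile(r"Service (\S+) running with value (\d+)\.")


def reported_pid(line):
    """Return (service, pid) if the line reports a numeric running value, else None."""
    match = RUNNING.search(line)
    return (match.group(1), int(match.group(2))) if match else None


def pid_alive(pid):
    """Check for a live process by sending signal 0, which delivers nothing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

For the excerpt above, `reported_pid` yields `("conduit", 13226)` for the last "running with value" line, and `pid_alive(13226)` returning `False` would match the observation that nothing with that PID is running. Since PIDs can be reused, a `True` result is not conclusive on its own.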

The "herd start conduit" command did not return, and ^C-ing it did not help.
Afterwards, every "herd" command also hung without any output.

Here are the last four lines of the output of "sudo strace -s1000 herd status"
on such a hung machine:

connect(10, {sa_family=AF_UNIX, sun_path="/var/run/shepherd/socket"}, 26) = 0
getcwd("/home/timo", 100) = 11
write(10, "(shepherd-command (version 0) (action status) (service root) (arguments ()) (directory \"/home/timo\"))", 101) = 101
read(10,

The "read(10, " call never completes.
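
The same check can be scripted with a deadline, so that the probe itself does not hang. The sketch below replays the request seen in the strace output and waits a bounded time for a reply. The socket path and the request text are taken from that output, except that the directory is set to "/" here; the function name `probe` and the assumption that shepherd accepts this request from a client other than herd are mine. Connecting to the system socket will most likely require root.

```python
import socket

SOCKET_PATH = "/var/run/shepherd/socket"  # path seen in the strace output
STATUS_COMMAND = (
    b"(shepherd-command (version 0) (action status) (service root) "
    b'(arguments ()) (directory "/"))'
)


def probe(path=SOCKET_PATH, timeout=5.0):
    """Return 'responsive', 'closed', 'hung' or 'unreachable'."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(path)
            sock.sendall(STATUS_COMMAND)
            reply = sock.recv(4096)
        except socket.timeout:
            return "hung"  # no reply within the deadline
        except OSError:
            return "unreachable"  # missing socket, permissions, ...
    return "responsive" if reply else "closed"


if __name__ == "__main__":
    print(probe())
```

A result of "hung" corresponds to the never-completing read shown above: the connection is accepted and the request is written, but no reply arrives.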

At least in this case, Shepherd still seems to be processing inbound inet
connections, so I can open new SSH connections to the machine.

Attaching to PID 1 with strace shows it is stuck in "epoll_wait(13, "
(unsurprisingly, fd 13 points to "anon_inode:[eventpoll]"). Here's a backtrace
of all threads in "gdb -p 1":

(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7f786544c380 (LWP 1) "shepherd" 0x00007f7865552626 in epoll_wait ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
2 Thread 0x7f7864e16640 (LWP 186) "GC-marker-0" 0x00007f78654cf16a in __futex_abstimed_wait_common ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
3 Thread 0x7f7864615640 (LWP 187) "GC-marker-1" 0x00007f78654cf16a in __futex_abstimed_wait_common ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
4 Thread 0x7f7863e14640 (LWP 188) "GC-marker-2" 0x00007f78654cf16a in __futex_abstimed_wait_common ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
5 Thread 0x7f78634c6640 (LWP 190) "shepherd" 0x00007f786554300c in read ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
(gdb) thread apply all bt

Thread 5 (Thread 0x7f78634c6640 (LWP 190) "shepherd"):
#0 0x00007f786554300c in read () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#1 0x00007f7865a48cc7 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#2 0x00007f78659427d1 in ?? () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#3 0x00007f786594438c in ?? () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#4 0x00007f786594e83c in GC_do_blocking () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#5 0x00007f7865a65455 in scm_without_guile () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#6 0x00007f7865a4d570 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#7 0x00007f7865a71390 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#8 0x00007f7865a7edb5 in scm_call_n () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#9 0x00007f78659e5b3e in scm_call_with_unblocked_asyncs () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#10 0x00007f7865a71390 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#11 0x00007f7865a7edb5 in scm_call_n () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#12 0x00007f7865a6b0f3 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#13 0x00007f78659e7e1a in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#14 0x00007f7865a71390 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#15 0x00007f7865a7edb5 in scm_call_n () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#16 0x00007f78659e95ca in scm_call_2 () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#17 0x00007f7865a90092 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#18 0x00007f7865a6be1f in scm_c_catch () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#19 0x00007f78659ea396 in scm_c_with_continuation_barrier () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#20 0x00007f7865a6b049 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#21 0x00007f786594e7fa in GC_call_with_stack_base () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#22 0x00007f7865a64c5d in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#23 0x00007f78654d23aa in start_thread () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#24 0x00007f7865552f7c in clone3 () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6

Thread 4 (Thread 0x7f7863e14640 (LWP 188) "GC-marker-2"):
#0 0x00007f78654cf16a in __futex_abstimed_wait_common () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#1 0x00007f78654d17e8 in pthread_cond_wait@@GLIBC_2.3.2 () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#2 0x00007f7865948740 in ?? () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#3 0x00007f7865948897 in ?? () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#4 0x00007f78654d23aa in start_thread () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#5 0x00007f7865552f7c in clone3 () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6

Thread 3 (Thread 0x7f7864615640 (LWP 187) "GC-marker-1"):
#0 0x00007f78654cf16a in __futex_abstimed_wait_common () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#1 0x00007f78654d17e8 in pthread_cond_wait@@GLIBC_2.3.2 () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#2 0x00007f7865948740 in ?? () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#3 0x00007f7865948897 in ?? () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#4 0x00007f78654d23aa in start_thread () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#5 0x00007f7865552f7c in clone3 () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6

Thread 2 (Thread 0x7f7864e16640 (LWP 186) "GC-marker-0"):
#0 0x00007f78654cf16a in __futex_abstimed_wait_common () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#1 0x00007f78654d17e8 in pthread_cond_wait@@GLIBC_2.3.2 () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#2 0x00007f7865948740 in ?? () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#3 0x00007f7865948897 in ?? () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#4 0x00007f78654d23aa in start_thread () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#5 0x00007f7865552f7c in clone3 () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6

Thread 1 (Thread 0x7f786544c380 (LWP 1) "shepherd"):
#0 0x00007f7865552626 in epoll_wait () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#1 0x00007f7862bb9335 in ?? () from /gnu/store/h4nsywbhn8b4qyh40fhykk3q40qkr3wd-guile-fibers-1.3.1/lib/guile/3.0/extensions/fibers-epoll.so
#2 0x00007f78659427d1 in ?? () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#3 0x00007f786594438c in ?? () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#4 0x00007f786594e83c in GC_do_blocking () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#5 0x00007f7865a65455 in scm_without_guile () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#6 0x00007f7862bb96ce in ?? () from /gnu/store/h4nsywbhn8b4qyh40fhykk3q40qkr3wd-guile-fibers-1.3.1/lib/guile/3.0/extensions/fibers-epoll.so
#7 0x00007f78606246c2 in ?? ()
#8 0x00007f78620ba628 in ?? ()
#9 0x00007f7860627610 in ?? ()
#10 0x00007f786520ad80 in ?? ()
#11 0x00007f7865a14edc in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#12 0x00007f7865a71215 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#13 0x00007f7865a7edb5 in scm_call_n () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#14 0x00007f78659e9977 in scm_primitive_eval () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#15 0x00007f7865a1dff9 in scm_primitive_load () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#16 0x00007f7865a71390 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#17 0x00007f7865a7edb5 in scm_call_n () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#18 0x00007f78659e9977 in scm_primitive_eval () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#19 0x00007f78659ef846 in scm_eval () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#20 0x00007f7865a4e3e6 in scm_shell () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#21 0x00007f7865a008cc in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#22 0x00007f78659e7e1a in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#23 0x00007f7865a71390 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#24 0x00007f7865a7edb5 in scm_call_n () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#25 0x00007f78659e95ca in scm_call_2 () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#26 0x00007f7865a90092 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#27 0x00007f7865a6be1f in scm_c_catch () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#28 0x00007f78659ea396 in scm_c_with_continuation_barrier () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#29 0x00007f7865a6b049 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#30 0x00007f786594e7fa in GC_call_with_stack_base () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#31 0x00007f7865a653f8 in scm_with_guile () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#32 0x00007f7865a098e5 in scm_boot_guile () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#33 0x00000000004010f7 in ?? ()
#34 0x00007f78654761f7 in __libc_start_call_main () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#35 0x00007f78654762ac in __libc_start_main_impl () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#36 0x0000000000401171 in ?? ()
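
When comparing several such dumps, it can help to reduce the "info threads" listing to a count of threads per innermost function. This is a minimal sketch that assumes the line format shown above; `summarize_threads` is a hypothetical helper, not part of gdb or Shepherd.

```python
import re
from collections import Counter

# Matches the per-thread lines of "info threads", for example:
#   * 1 Thread 0x... (LWP 1) "shepherd" 0x... in epoll_wait ()
# The "Thread N (...):" headers of the backtraces are not matched.
THREAD_LINE = re.compile(
    r'\(LWP (\d+)\)\s+"([^"]+)"\s+0x[0-9a-f]+\s+in\s+(\w+)\s+\(\)'
)


def summarize_threads(gdb_output):
    """Count threads by (thread name, innermost function)."""
    return Counter(
        (m.group(2), m.group(3)) for m in THREAD_LINE.finditer(gdb_output)
    )


if __name__ == "__main__":
    import sys

    # Usage: paste or pipe saved gdb output on standard input.
    for (name, function), count in summarize_threads(sys.stdin.read()).items():
        print(f"{count:3d}  {name:<14} {function}")
```

For the listing above, this groups the main "shepherd" thread under `epoll_wait`, the second "shepherd" thread under `read`, and each GC marker thread under `__futex_abstimed_wait_common`.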

Separately, I also have another Shepherd on a different machine that became
stuck after I ran a bunch of "guix system reconfigure" commands. Here are the
backtraces from that machine, in case they help:

(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ffaceef2380 (LWP 1) "shepherd" 0x00007fface938626 in epoll_wait ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
2 Thread 0x7fface1aa640 (LWP 231) "GC-marker-0" 0x00007fface8b516a in __futex_abstimed_wait_common ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
3 Thread 0x7ffacd9a9640 (LWP 232) "GC-marker-1" 0x00007fface8b516a in __futex_abstimed_wait_common ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
4 Thread 0x7ffacd1a8640 (LWP 233) "GC-marker-2" 0x00007fface8b516a in __futex_abstimed_wait_common ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
5 Thread 0x7ffacc9a7640 (LWP 234) "GC-marker-3" 0x00007fface8b516a in __futex_abstimed_wait_common ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
6 Thread 0x7ffacc1a6640 (LWP 235) "GC-marker-4" 0x00007fface8b516a in __futex_abstimed_wait_common ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
7 Thread 0x7ffacb9a5640 (LWP 236) "GC-marker-5" 0x00007fface8b516a in __futex_abstimed_wait_common ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
8 Thread 0x7ffacb1a4640 (LWP 237) "GC-marker-6" 0x00007fface8b516a in __futex_abstimed_wait_common ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
9 Thread 0x7ffaca832640 (LWP 249) "shepherd" 0x00007fface92900c in read ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
10 Thread 0x7ffac89ca640 (LWP 26693) "shepherd" 0x00007fface92900c in read ()
from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
(gdb) thread apply all bt

Thread 10 (Thread 0x7ffac89ca640 (LWP 26693) "shepherd"):
#0 0x00007fface92900c in read () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#1 0x00007ffacedf0e57 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#2 0x00007ffaced3c7d1 in ?? () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#3 0x00007ffaced3e38c in ?? () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#4 0x00007ffaced4883c in GC_do_blocking () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#5 0x00007ffacee62455 in scm_without_guile () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#6 0x00007ffacedf903d in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#7 0x00007ffacede4e1a in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#8 0x00007ffac6832022 in ?? ()
#9 0x00007fface4d97f0 in ?? ()
#10 0x00007ffac94766c0 in ?? ()
#11 0x00007fface5f4b40 in ?? ()
#12 0x00007ffacee11edc in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#13 0x00007ffacee6e215 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#14 0x00007ffacee7bdb5 in scm_call_n () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#15 0x00007ffacede65ca in scm_call_2 () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#16 0x00007ffacee8d092 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#17 0x00007ffacee68e1f in scm_c_catch () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#18 0x00007ffacede7396 in scm_c_with_continuation_barrier () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#19 0x00007ffacee68049 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#20 0x00007ffaced487fa in GC_call_with_stack_base () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#21 0x00007ffacee623f8 in scm_with_guile () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#22 0x00007fface8b83aa in start_thread () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#23 0x00007fface938f7c in clone3 () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6

Thread 9 (Thread 0x7ffaca832640 (LWP 249) "shepherd"):
#0 0x00007fface92900c in read () from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
#1 0x00007ffacee45cc7 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#2 0x00007ffaced3c7d1 in ?? () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#3 0x00007ffaced3e38c in ?? () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#4 0x00007ffaced4883c in GC_do_blocking () from /gnu/store/k1ha4n9v8d7myiiszvl2ic7xnb56l219-libgc-8.2.2/lib/libgc.so.1
#5 0x00007ffacee62455 in scm_without_guile () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4-guile-3.0.9/lib/libguile-3.0.so.1
#6 0x00007ffacee4a570 in ?? () from /gnu/store/n24l8hxn6nvb7lz7zjlyd7i05khrm0i4
This message was truncated. Download the full message here.
Attila Lendvai wrote on 15 Dec 2023 20:47
(name . Timo Wilken)(address . guix@twilken.net)
mnlkXTye7wOZNVq5cAf2dviIwfvb8APeTaM5rHUGoPKVlZz9l4VxGNAde8FbJIPf9yPYn0FRXSh0Vb5myA3E3407Us_echjM6SOWv6sE8jY=@lendvai.name
i think i have found the root cause of this, as documented here: https://issues.guix.gnu.org/67839

that issue contains patches for shepherd to reproduce it in its test suite.

--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“What divides libertarians from everybody else is not a belief about rights or what rights people have, because the judgments libertarians make about the state are the same as the judgments almost everyone makes about private agents. So it's not that we believe in rights that other people don't believe in, or that other people believe in rights that we don't believe in. It's that other people think the state is exempt from the moral principles that apply to non-government agents.”
— Michael Huemer
Timo Wilken wrote on 15 Dec 2023 21:33
(name . Attila Lendvai)(address . attila@lendvai.name)
CXP6VUA179NT.24MHZOOPR4XQN@lap.twilken.net
On Fri Dec 15, 2023 at 8:47 PM CET, Attila Lendvai wrote:
> i think i have found the root cause of this, as documented here: https://issues.guix.gnu.org/67839
>
> that issue contains patches for shepherd to reproduce it in its test suite.

Thank you very much for this, Attila!

Are the patches in 67839 and/or your branch "attila" linked from there in a
state where I could test them locally? Would it be valuable to you if I ran a
patched Shepherd and sent logs and/or backtraces as I encountered them?
Attila Lendvai wrote on 15 Dec 2023 22:24
(name . Timo Wilken)(address . guix@twilken.net)
Q25YR0CIwjc3XL1sWDYgC7JrIarMKDw3wg326jlNiwx0Rtmj7e_kkoFiry6KGSbid8fBo8bqW5jiaOs8WOK43ijSvz8qWq-RPNoCjz26kus=@lendvai.name
> Thank you very much for this, Attila!


you're welcome! :)


> Are the patch in 67839 and/or your branch "attila" linked from there in a
> state that I could test them locally? Would it be valuable to you if I ran a
> patched Shepherd and sent logs and/or backtraces as I encountered them?


it's nice of you, but not really, now that we have a failing test case in shepherd's unit tests that can reproduce it much more easily.

with #67839 you would only get an extra "Assertion failed" message over master, without much useful output.

as for my branch, it would emit a lot of useful logs, including backtraces, but i keep force-pushing into it. i'm running my servers with it, though, so if you feel really adventurous, and want to join the debugging, then you can try... otherwise it's too much in flux.

what we need to focus on now is making shepherd's test suite run clean again, one way or another. then i can test it in a real life environment, and report back with any possible findings.

--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“Ignorance might be bliss for the ignorant, but for the rest of us it's a fucking pain in the ass.”
— Ricky Gervais
Ludovic Courtès wrote on 20 Dec 2023 00:00
Re: bug#65419: [Shepherd] Non-responding service control fiber
(name . Attila Lendvai)(address . attila@lendvai.name)
87plz1hk6j.fsf_-_@gnu.org
Hello,

Attila Lendvai <attila@lendvai.name> writes:

> i think i have found the root cause of this, as documented here: https://issues.guix.gnu.org/67839
>
> that issue contains patches for shepherd to reproduce it in its test suite.

Yes, it looks like this long-standing and hard-to-debug issue may well
be fixed now, thumbs up Attila!!

We have accumulated quite a few fixes by now so I think I’ll release
0.10.3 hopefully in 2023 and otherwise soon after.

Thanks,
Ludo’.
Ludovic Courtès wrote on 2 Jan 23:09 +0100
control message for bug #65419
(address . control@debbugs.gnu.org)
87mstngzff.fsf@gnu.org
close 65419
quit
This issue is archived.
