Shepherd 0.9.2 possible regressions

  • Done
  • quality assurance status badge
Details
4 participants
  • Ludovic Courtès
  • Christopher Baines
  • Mathieu Othacehe
  • Mathieu Othacehe
Owner
unassigned
Submitted by
Mathieu Othacehe
Severity
important
M
M
Mathieu Othacehe wrote on 15 Sep 2022 15:05
(address . bug-guix@gnu.org)
87sfksn5yg.fsf@gnu.org
Hello,

Since Shepherd 0.9.2 the following tests are failing:


It seems that an unexpected #<eof> object is received on the marionette
socket.

* gui-installed-desktop-os-encrypted https://ci.guix.gnu.org/build/1431044/details

It seems that the Shepherd cannot be restarted in the install-system
procedure.

Thanks,

Mathieu
M
M
Mathieu Othacehe wrote on 15 Sep 2022 15:06
control message for bug #57827
(address . control@debbugs.gnu.org)
87r10cn5wo.fsf@meije.mail-host-address-is-not-set
severity 57827 important
quit
M
M
Mathieu Othacehe wrote on 15 Sep 2022 15:07
control message for bug #53214
(address . control@debbugs.gnu.org)
87pmfwn5uy.fsf@meije.mail-host-address-is-not-set
block 53214 by 57827
quit
C
C
Christopher Baines wrote on 16 Sep 2022 09:35
Re: bug#57827: Shepherd 0.9.2 possible regressions
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 57827@debbugs.gnu.org)
87czbvkbqp.fsf@cbaines.net
Mathieu Othacehe <othacehe@gnu.org> writes:

Toggle quote (8 lines)
> Since Shepherd 0.9.2 the following tests are failing:
>
> * cgit: https://ci.guix.gnu.org/build/1427375/details
> * gitile https://ci.guix.gnu.org/build/1427377/details
>
> It seems that an unexpected #<eof> object is received on the marionette
> socket.

I had a look at the cgit system test, and it seems like this #<eof> is
coming from the NGinx pid file being empty.

Since empty files is a possibility with wait-for-file, I've sent a patch
to [1] which prevents the eof issue, plus another change to make it
easier to debug.


With that change, both of the above tests seem to pass for me.

This could be related to the Shepherd upgrade, but only indirectly, as I
think the failure at least for cgit was also timing dependent.

Chris
-----BEGIN PGP SIGNATURE-----

iQKlBAEBCgCPFiEEPonu50WOcg2XVOCyXiijOwuE9XcFAmMkKI5fFIAAAAAALgAo
aXNzdWVyLWZwckBub3RhdGlvbnMub3BlbnBncC5maWZ0aGhvcnNlbWFuLm5ldDNF
ODlFRUU3NDU4RTcyMEQ5NzU0RTBCMjVFMjhBMzNCMEI4NEY1NzcRHG1haWxAY2Jh
aW5lcy5uZXQACgkQXiijOwuE9XeuexAAhGhqYVXS5Pfea84VfYP0VRViHYxry5Qm
J254BigUmoFsjezM6KiQKZjZ5tzwjSvpnjguPqFjro1eRwTOD0L5kh0JvNPfgKjo
bjD8U7KCDvgMd9sXi6W1R64Uh8u58eha5ux2uM4yacbBpY9neVKEQPjbO55PbqDr
WRPWrqQbIUAGPI0kdK0X5QWLNyrqrAP4WS3oVXpcGEFAueyyVw2HViNnH2limSiJ
MxQscXsAPEfoMsLWNMFQETxCzqHZ/KMJnUyElqgtLXSr8jeLX4FlHLsH3MCg2f44
oCco7o6eumtUgRmsHXSJmPR082ZyM4DjZHwYzQBARhsEdjsYbzt0SacgOEU3I5a1
eNOaYxbJUAm+abjThJMnI2e/IECSyRlpBQmFigMuOU1J+Q6/2IquZCQM7nqL7byQ
lkfM0zJg5lo/hIf5/E2fasNAqFmnRVLqzrgoERv7KEakzw/0Kp8b1GayV6HR/Brq
nddKqU4Y8PSGs8cdfw8xuBZYboyrb6/BU5aJ5mxCtukUg2CC84rInjDN6neQC2Qe
sjtuncvbPm+vEdleT4uVEUGXBY2lH1Cl9lZWDGX3jThwhbXZ3ncJnM9iAVECi3L7
lYnMyds6f9ToDwgz18snDyjGgAJdeyKdeZAQ/2+IGtH2CHhOAavUkjGradP7Hjx8
Mvr2nEV9nzs=
=dezb
-----END PGP SIGNATURE-----

M
M
Mathieu Othacehe wrote on 16 Sep 2022 10:35
(name . Christopher Baines)(address . mail@cbaines.net)(address . 57827@debbugs.gnu.org)
874jx7hg2m.fsf@gnu.org
Hey Chris,

Toggle quote (4 lines)
> Since empty files is a possibility with wait-for-file, I've sent a patch
> to [1] which prevents the eof issue, plus another change to make it
> easier to debug.

Nice! I wonder if we should maybe block in wait-for-file until the file
has some content. In the cgit/gitile tests, we will keep going while the
pid file is empty so the nginx server is not maybe completely up yet.

Toggle quote (3 lines)
> This could be related to the Shepherd upgrade, but only indirectly, as I
> think the failure at least for cgit was also timing dependent.

Yes looks like so.

Thanks,

Mathieu
M
M
Mathieu Othacehe wrote on 19 Sep 2022 08:41
(address . 57827@debbugs.gnu.org)
871qs76f3y.fsf@gnu.org
Hey,

Toggle quote (3 lines)
> * gitile https://ci.guix.gnu.org/build/1427377/details

OK so those two are fixed by Chris patch.

Toggle quote (5 lines)
> * gui-installed-os https://ci.guix.gnu.org/build/1431027/details
> * gui-installed-os-encrypted https://ci.guix.gnu.org/build/1431040/details
> * gui-installed-desktop-os-encrypted https://ci.guix.gnu.org/build/1431044/details

Regarding those four, I was able to reproduce the issue this way:

Toggle snippet (5 lines)
$ guix repl
(stop-service 'guix-daemon)
(start-service 'guix-daemon (list (number->string (getpid))))

The latter command hangs and Shepherd becomes unresponsive. I collected
an (attached) strace dump of Shepherd showing that there is no response
on the socket when the service is started.

Note that, this works:

Toggle snippet (5 lines)
$ guix repl
(stop-service 'guix-daemon)
(start-service 'guix-daemon)

So the problem could be caused by the "container-excursion*" in the
"fork+exec-command/container" procedure.

Thanks,

Mathieu
Attachment: out
L
L
Ludovic Courtès wrote on 20 Sep 2022 19:30
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 57827-done@debbugs.gnu.org)
87r106lzqq.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (6 lines)
> Regarding those four, I was able to reproduce the issue this way:
>
> $ guix repl
> (stop-service 'guix-daemon)
> (start-service 'guix-daemon (list (number->string (getpid))))

Or from the shell:

herd stop guix-daemon
herd start guix-daemon $$

I was able to reproduce it using a bare-bones.tmpl VM.

Toggle quote (13 lines)
> The latter command hangs and Shepherd becomes unresponsive. I collected
> an (attached) strace dump of Shepherd showing that there is no response
> on the socket when the service is started.
>
> Note that, this works:
>
> $ guix repl
> (stop-service 'guix-daemon)
> (start-service 'guix-daemon)
>
> So the problem could be caused by the "container-excursion*" in the
> "fork+exec-command/container" procedure.

PID 1 gets stuck on read(16, …) forever, after reading the string “2866”
(a PID):

Toggle snippet (41 lines)
[pid 2865] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLDstrace: Process 2866 attached
, child_tidptr=0x7fccfbe00a10) = 2866
[pid 2866] set_robust_list(0x7fccfbe00a20, 24) = 0
[pid 2866] close(3) = 0
[pid 2865] write(39, "2866", 4 <unfinished ...>
[pid 2866] close(4 <unfinished ...>
[pid 2865] <... write resumed>) = 4
[pid 2866] <... close resumed>) = 0
[pid 2866] pipe2( <unfinished ...>
[pid 2865] close(39 <unfinished ...>
[pid 2866] <... pipe2 resumed>[3, 4], O_CLOEXEC) = 0
[pid 2865] <... close resumed>) = 0
[pid 2865] exit_group(0) = ?
[pid 2866] rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fccfc304d80}, {sa_handler=0x7fccfc427d50, sa_mask=[], sa_flags=SA_RESTORER|SA_NOCLDSTOP, sa_restorer=0x7fccfc304d80}, 8) = 0
[pid 2866] rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fccfc304d80}, {sa_handler=0x7fccfc427d50, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fccfc304d80}, 8) = 0
[pid 2866] rt_sigaction(SIGHUP, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fccfc304d80}, {sa_handler=0x7fccfc427d50, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fccfc304d80}, 8) = 0
[pid 2866] rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fccfc304d80}, {sa_handler=0x7fccfc427d50, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fccfc304d80}, 8) = 0
[pid 2866] rt_sigprocmask(SIG_UNBLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0
[pid 2866] mkdir("/var", 0777) = -1 EEXIST (File exists)
[pid 2866] mkdir("/var/run", 0777) = -1 EEXIST (File exists)
[pid 2865] +++ exited with 0 +++
[pid 1] <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 2865
[pid 1] close(39) = 0
[pid 2866] setsid( <unfinished ...>
[pid 1] read(16, <unfinished ...>
[pid 2866] <... setsid resumed>) = 2866
[pid 1] <... read resumed>"2866", 4096) = 4
[pid 2866] chdir("/") = 0
[pid 1] read(16, <unfinished ...>
[pid 2866] prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=4*1024}) = 0
[pid 2866] close(0) = 0
[pid 2866] openat(AT_FDCWD, "/dev/null", O_RDONLY) = 0
[pid 2866] dup2(0, 0) = 0
[pid 2866] close(1) = 0
[pid 2866] close(2) = 0
[pid 2866] openat(AT_FDCWD, "/var/log/guix-daemon.log", O_WRONLY|O_CREAT|O_APPEND, 0640) = 1
[pid 2866] dup2(1, 1) = 1
[pid 2866] dup2(1, 2) = 2
[pid 2866] execve("/gnu/store/bxnkqnpbf4q4z6245b61wgpm8gkr9nj1-guix-1.3.0-29.9e46320/bin/guix-daemon", ["/gnu/store/bxnkqnpbf4q4z6245b61w"..., "--build-users-group", "guixbuild", "--max-silent-time", "0", "--timeout", "0", "--log-compression", "gzip", "--discover=yes", "--substitute-urls", "https://substitutes.nonguix.org "...], 0x7fccf71fa480 /* 3 vars */) = 0

This happens because the other end of the file descriptor happens to be
inherited by 2866, which will never close it because it just execs
guix-daemon.

This is fixed by 6abdcef4a68e98f538ab69fde096adc5f5ca4ff4; the log
contains extra details.

Thanks!

Ludo’.
Closed
M
M
Mathieu Othacehe wrote on 24 Sep 2022 14:04
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 57827-done@debbugs.gnu.org)
878rm97zd3.fsf@gnu.org
Hey,

Toggle quote (3 lines)
> This is fixed by 6abdcef4a68e98f538ab69fde096adc5f5ca4ff4; the log
> contains extra details.

Thanks for fixing it! Turns out we still have an issue on all four
installer tests.

The error messages look like:

Toggle snippet (5 lines)
Sep 22 09:49:19 localhost installer[252]: running command ("guix" "system" "init" "--fallback" "--no-grafts" "--no-substitutes" "/mnt/etc/config.scm" "/mnt")

Sep 22 09:49:21 localhost installer[252]: guix system: error: read error while loading '/mnt/etc/config.scm': /mnt/etc/config.scm:63:51: unexpected ")"

Having a look at the configuration file (attached), it seems that the
problem comes from the edit-configuration-file function. As the
rewritten configuration is shorter than the initial because the comments
are stripped by the pretty-print function, we have some leftovers from
the initial config.

I do not really understand why because call-with-output-file is supposed
to use the O_TRUNC flag and remove the existing content of the
configuration file. Plus, calling edit-configuration-file from a guix
repl does the right thing.

Anyways, I pushed fe4663ae2476cb527d4f1f49ff8fa077d43f7251 which fixes
the issue by removing the file before rewriting it.

Mathieu
Closed
?