tests/no-home test failure in Shepherd

  • Done
  • quality assurance status badge
Details
2 participants
  • Ludovic Courtès
  • Maxim Cournoyer
Owner
unassigned
Submitted by
Maxim Cournoyer
Severity
normal
M
M
Maxim Cournoyer wrote on 24 Dec 2021 15:52
(name . bug-guix)(address . bug-guix@gnu.org)
87h7ayys14.fsf@gmail.com
Hello Guix,

I encountered the following (nondeterministic) test failure:

Toggle snippet (57 lines)
# TOTAL: 13
# PASS: 12
# SKIP: 0
# XFAIL: 0
# FAIL: 1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: tests/no-home
===================

+ shepherd --version
shepherd (GNU Shepherd) 0.8.1
Copyright (C) 2020 the Shepherd authors
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
+ HOME=/nowhere
+ export HOME
+ test -w /nowhere
+ socket=t-socket-1651
+ pid=t-pid-1651
+ herd='herd -s t-socket-1651'
+ trap 'rm -f t-socket-1651;
test -f t-pid-1651 && kill `cat t-pid-1651` || true; rm -f t-pid-1651' EXIT
+ shepherd_pid=1896
+ test -f t-pid-1651
+ shepherd -I -s t-socket-1651 -c /dev/null -l /dev/null --pid=t-pid-1651
+ kill -0 1896
+ sleep 0.3
Service root has been started.
+ test -f t-pid-1651
++ cat t-pid-1651
+ kill -0 1896
+ herd -s t-socket-1651 status root
Started:
+ root
+ herd -s t-socket-1651 stop root
++ cat t-pid-1651
+ kill 1896
+ exit 1
+ rm -f t-socket-1651
+ test -f t-pid-1651
++ cat t-pid-1651
+ kill 1896
+ rm -f t-pid-1651
FAIL tests/no-home.sh (exit status: 1)


error: in phase 'check': uncaught exception:
%exception #<&invoke-error program: "make" arguments: ("check" "-j" "24" "GUILE_AUTO_COMPILE=0") exit-status: 2 term-signal: #f stop-signal: #f>
phase `check' failed after 15.3 seconds
command "make" "check" "-j" "24" "GUILE_AUTO_COMPILE=0" failed with status 2

Perhaps due to high load, as I can't seem to reproduce when building
Shepherd alone.

Thanks,

Maxim
M
M
Maxim Cournoyer wrote on 3 Jan 2022 22:09
(address . 52779@debbugs.gnu.org)
87h7akv851.fsf@gmail.com
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (63 lines)
> Hello Guix,
>
> I encountered the following (nondeterministic) test failure:
>
> # TOTAL: 13
> # PASS: 12
> # SKIP: 0
> # XFAIL: 0
> # FAIL: 1
> # XPASS: 0
> # ERROR: 0
>
> .. contents:: :depth: 2
>
> FAIL: tests/no-home
> ===================
>
> + shepherd --version
> shepherd (GNU Shepherd) 0.8.1
> Copyright (C) 2020 the Shepherd authors
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> + HOME=/nowhere
> + export HOME
> + test -w /nowhere
> + socket=t-socket-1651
> + pid=t-pid-1651
> + herd='herd -s t-socket-1651'
> + trap 'rm -f t-socket-1651;
> test -f t-pid-1651 && kill `cat t-pid-1651` || true; rm -f t-pid-1651' EXIT
> + shepherd_pid=1896
> + test -f t-pid-1651
> + shepherd -I -s t-socket-1651 -c /dev/null -l /dev/null --pid=t-pid-1651
> + kill -0 1896
> + sleep 0.3
> Service root has been started.
> + test -f t-pid-1651
> ++ cat t-pid-1651
> + kill -0 1896
> + herd -s t-socket-1651 status root
> Started:
> + root
> + herd -s t-socket-1651 stop root
> ++ cat t-pid-1651
> + kill 1896
> + exit 1
> + rm -f t-socket-1651
> + test -f t-pid-1651
> ++ cat t-pid-1651
> + kill 1896
> + rm -f t-pid-1651
> FAIL tests/no-home.sh (exit status: 1)
>
>
> error: in phase 'check': uncaught exception:
> %exception #<&invoke-error program: "make" arguments: ("check" "-j" "24" "GUILE_AUTO_COMPILE=0") exit-status: 2 term-signal: #f stop-signal: #f>
> phase `check' failed after 15.3 seconds
> command "make" "check" "-j" "24" "GUILE_AUTO_COMPILE=0" failed with status 2
>
> Perhaps due to high load, as I can't seem to reproduce when building
> Shepherd alone.

I encountered the problem once more; the errors are slightly different
so I'm pasting the failed test output here:

Toggle snippet (49 lines)
FAIL: tests/no-home
===================

+ shepherd --version
shepherd (GNU Shepherd) 0.8.1
Copyright (C) 2020 the Shepherd authors
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
+ HOME=/nowhere
+ export HOME
+ test -w /nowhere
+ socket=t-socket-1606
+ pid=t-pid-1606
+ herd='herd -s t-socket-1606'
+ trap 'rm -f t-socket-1606;
test -f t-pid-1606 && kill `cat t-pid-1606` || true; rm -f t-pid-1606' EXIT
+ shepherd_pid=1868
+ test -f t-pid-1606
+ shepherd -I -s t-socket-1606 -c /dev/null -l /dev/null --pid=t-pid-1606
+ kill -0 1868
+ sleep 0.3
Service root has been started.
+ test -f t-pid-1606
++ cat t-pid-1606
+ kill -0 1868
+ herd -s t-socket-1606 status root
Started:
+ root
+ herd -s t-socket-1606 stop root
++ cat t-pid-1606
+ kill 1868
+ exit 1
+ rm -f t-socket-1606
+ test -f t-pid-1606
++ cat t-pid-1606
+ kill 1868
./tests/no-home.sh: line 2: kill: (1868) - No such process
+ true
+ rm -f t-pid-1606
FAIL tests/no-home.sh (exit status: 1)


error: in phase 'check': uncaught exception:
%exception #<&invoke-error program: "make" arguments: ("check" "-j" "24" "GUILE_AUTO_COMPILE=0") exit-status: 2 term-signal: #f stop-signal: #f>
phase `check' failed after 15.6 seconds
command "make" "check" "-j" "24" "GUILE_AUTO_COMPILE=0" failed with status 2

Thanks,

Maxim
M
M
Maxim Cournoyer wrote on 10 Jan 2022 19:17
(address . 52779@debbugs.gnu.org)
87ilurpi9b.fsf@gmail.com
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (21 lines)
> Hi,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> Hello Guix,
>>
>> I encountered the following (nondeterministic) test failure:
>>
>> # TOTAL: 13
>> # PASS: 12
>> # SKIP: 0
>> # XFAIL: 0
>> # FAIL: 1
>> # XPASS: 0
>> # ERROR: 0
>>
>> .. contents:: :depth: 2
>>
>> FAIL: tests/no-home
>> ===================

[...]

I tried to reproduce this is a controlled setting, using the 'stress'
command to effect a high load on the system, but in vain.

I could only trigger the problem on a highly parallel CPU (24 logical
cores) busy building 10s of Guix packages at the same time.

Thanks,

Maxim
L
L
Ludovic Courtès wrote on 13 Jan 2022 16:37
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 52779@debbugs.gnu.org)
87ee5b7ik6.fsf@gnu.org
Hello,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (14 lines)
> + herd -s t-socket-1651 status root
> Started:
> + root
> + herd -s t-socket-1651 stop root
> ++ cat t-pid-1651
> + kill 1896
> + exit 1
> + rm -f t-socket-1651
> + test -f t-pid-1651
> ++ cat t-pid-1651
> + kill 1896
> + rm -f t-pid-1651
> FAIL tests/no-home.sh (exit status: 1)

What happens here is that the shepherd process is still alive after
‘herd stop root’ has completed, contrary to what’s expected:

Toggle snippet (8 lines)
$herd stop root

if kill `cat "$pid"`
then
exit 1
fi

The expectation is that shepherd has terminated by the time ‘herd stop
root’ exits; I wonder if that’s bogus.

‘herd stop root’ terminates when shepherd has closed its connection,
which normally happens when shepherd exits:

Toggle snippet (24 lines)
28003 read(15, "(shepherd-command (version 0) (action stop) (service root) (arguments ()) (directory \"/data/src/shepherd\"))", 1024) = 107
28003 brk(0x1030000) = 0x1030000
28003 mmap(NULL, 262144, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0072be8000
28003 brk(0x100f000) = 0x100f000
28003 getcwd("/data/src/shepherd", 100) = 19
28003 chdir("/data/src/shepherd") = 0
28003 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2962, ...}, 0) = 0
28003 write(7, "2022-01-13 16:21:16 Exiting shepherd...\n", 40) = 40
28003 chdir("/data/src/shepherd") = 0
28003 getuid() = 1000
28003 close(13) = 0
28003 unlink("test") = 0
28003 exit_group(0) = ?
28006 <... futex resumed>) = ?
28008 <... read resumed> <unfinished ...>) = ?
28005 <... futex resumed>) = ?
28004 <... futex resumed>) = ?
28008 +++ exited with 0 +++
28006 +++ exited with 0 +++
28005 +++ exited with 0 +++
28004 +++ exited with 0 +++
28003 +++ exited with 0 +++

Maybe there’s a chance that the shell hasn’t processed the shepherd’s
SIGCHLD when it evaluates the “if kill `cat "$pid"`” condition; in that
case, the shepherd process still exists as a zombie.

A more robust approach might be to use the shell’s builtin ‘wait’,
because then I suppose the shell will be forced to process pending
SIGCHLDs:
Toggle diff (20 lines)
diff --git a/tests/no-home.sh b/tests/no-home.sh
index 85b6116..5a8c278 100644
--- a/tests/no-home.sh
+++ b/tests/no-home.sh
@@ -1,5 +1,5 @@
# GNU Shepherd --- Make sure shepherd doesn't fail when $HOME is not writable.
-# Copyright © 2014, 2016 Ludovic Courtès <ludo@gnu.org>
+# Copyright © 2014, 2016, 2022 Ludovic Courtès <ludo@gnu.org>
#
# This file is part of the GNU Shepherd.
#
@@ -46,7 +46,4 @@ kill -0 `cat "$pid"`
$herd status root
$herd stop root
-if kill `cat "$pid"`
-then
- exit 1
-fi
+wait `cat "$pid"`
I can’t get it to fail while waiting for a few minutes of:

while make check TESTS=tests/no-home.sh ; do : ; done

… but I cannot get the original one to fail either.

Does it work for you?

Ludo’.
M
M
Maxim Cournoyer wrote on 13 Jan 2022 18:29
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 52779@debbugs.gnu.org)
875yqn1r4l.fsf@gmail.com
Hi Ludo!

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (28 lines)
> Hello,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> + herd -s t-socket-1651 status root
>> Started:
>> + root
>> + herd -s t-socket-1651 stop root
>> ++ cat t-pid-1651
>> + kill 1896
>> + exit 1
>> + rm -f t-socket-1651
>> + test -f t-pid-1651
>> ++ cat t-pid-1651
>> + kill 1896
>> + rm -f t-pid-1651
>> FAIL tests/no-home.sh (exit status: 1)
>
> What happens here is that the shepherd process is still alive after
> ‘herd stop root’ has completed, contrary to what’s expected:
>
> $herd stop root
>
> if kill `cat "$pid"`
> then
> exit 1
> fi

Yes!

[...]

Toggle quote (29 lines)
> Maybe there’s a chance that the shell hasn’t processed the shepherd’s
> SIGCHLD when it evaluates the “if kill `cat "$pid"`” condition; in that
> case, the shepherd process still exists as a zombie.
>
> A more robust approach might be to use the shell’s builtin ‘wait’,
> because then I suppose the shell will be forced to process pending
> SIGCHLDs:
>
> diff --git a/tests/no-home.sh b/tests/no-home.sh
> index 85b6116..5a8c278 100644
> --- a/tests/no-home.sh
> +++ b/tests/no-home.sh
> @@ -1,5 +1,5 @@
> # GNU Shepherd --- Make sure shepherd doesn't fail when $HOME is not writable.
> -# Copyright © 2014, 2016 Ludovic Courtès <ludo@gnu.org>
> +# Copyright © 2014, 2016, 2022 Ludovic Courtès <ludo@gnu.org>
> #
> # This file is part of the GNU Shepherd.
> #
> @@ -46,7 +46,4 @@ kill -0 `cat "$pid"`
> $herd status root
> $herd stop root
>
> -if kill `cat "$pid"`
> -then
> - exit 1
> -fi
> +wait `cat "$pid"`

As I wrote, I was also unable to reproduce this (but when I had a high
load of packages to build at the same time, I could get it to happen a
couple times upon retrying). Your analysis (and the narrow window which
would allow for a failure) makes sense to me, along with the proposed
fix.

I think you should commit it and tentatively mark this bug as fixed :-).

Thank you for looking into it!

Maxim
L
L
Ludovic Courtès wrote on 17 Jan 2022 14:43
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 52779-done@debbugs.gnu.org)
8735lmpjdw.fsf@gnu.org
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (2 lines)
> I think you should commit it and tentatively mark this bug as fixed :-).

Alright, pushed as Shepherd commit
e025f83c2072e4a6533a5747682810a885e20b2e!

Thanks,
Ludo’.
Closed
?
Your comment

This issue is archived.

To comment on this conversation send an email to 52779@debbugs.gnu.org

To respond to this issue using the mumi CLI, first switch to it
mumi current 52779
Then, you may apply the latest patchset in this issue (with sign off)
mumi am -- -s
Or, compose a reply to this issue
mumi compose
Or, send patches to this issue
mumi send-email *.patch