tests/no-home test failure in Shepherd

  • Done
  • quality assurance status badge
Details
2 participants
  • Ludovic Courtès
  • Maxim Cournoyer
Owner
unassigned
Submitted by
Maxim Cournoyer
Severity
normal
M
M
Maxim Cournoyer wrote on 24 Dec 2021 15:52
(name . bug-guix)(address . bug-guix@gnu.org)
87h7ayys14.fsf@gmail.com
Hello Guix,

I encountered the following (nondeterministic) test failure:

Toggle snippet (57 lines)
# TOTAL: 13
# PASS: 12
# SKIP: 0
# XFAIL: 0
# FAIL: 1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: tests/no-home
===================

+ shepherd --version
shepherd (GNU Shepherd) 0.8.1
Copyright (C) 2020 the Shepherd authors
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
+ HOME=/nowhere
+ export HOME
+ test -w /nowhere
+ socket=t-socket-1651
+ pid=t-pid-1651
+ herd='herd -s t-socket-1651'
+ trap 'rm -f t-socket-1651;
test -f t-pid-1651 && kill `cat t-pid-1651` || true; rm -f t-pid-1651' EXIT
+ shepherd_pid=1896
+ test -f t-pid-1651
+ shepherd -I -s t-socket-1651 -c /dev/null -l /dev/null --pid=t-pid-1651
+ kill -0 1896
+ sleep 0.3
Service root has been started.
+ test -f t-pid-1651
++ cat t-pid-1651
+ kill -0 1896
+ herd -s t-socket-1651 status root
Started:
+ root
+ herd -s t-socket-1651 stop root
++ cat t-pid-1651
+ kill 1896
+ exit 1
+ rm -f t-socket-1651
+ test -f t-pid-1651
++ cat t-pid-1651
+ kill 1896
+ rm -f t-pid-1651
FAIL tests/no-home.sh (exit status: 1)


error: in phase 'check': uncaught exception:
%exception #<&invoke-error program: "make" arguments: ("check" "-j" "24" "GUILE_AUTO_COMPILE=0") exit-status: 2 term-signal: #f stop-signal: #f>
phase `check' failed after 15.3 seconds
command "make" "check" "-j" "24" "GUILE_AUTO_COMPILE=0" failed with status 2

Perhaps due to high load, as I can't seem to reproduce when building
Shepherd alone.

Thanks,

Maxim
M
M
Maxim Cournoyer wrote on 3 Jan 2022 22:09
(address . 52779@debbugs.gnu.org)
87h7akv851.fsf@gmail.com
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (63 lines)
> Hello Guix,
>
> I encountered the following (nondeterministic) test failure:
>
> # TOTAL: 13
> # PASS: 12
> # SKIP: 0
> # XFAIL: 0
> # FAIL: 1
> # XPASS: 0
> # ERROR: 0
>
> .. contents:: :depth: 2
>
> FAIL: tests/no-home
> ===================
>
> + shepherd --version
> shepherd (GNU Shepherd) 0.8.1
> Copyright (C) 2020 the Shepherd authors
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> + HOME=/nowhere
> + export HOME
> + test -w /nowhere
> + socket=t-socket-1651
> + pid=t-pid-1651
> + herd='herd -s t-socket-1651'
> + trap 'rm -f t-socket-1651;
> test -f t-pid-1651 && kill `cat t-pid-1651` || true; rm -f t-pid-1651' EXIT
> + shepherd_pid=1896
> + test -f t-pid-1651
> + shepherd -I -s t-socket-1651 -c /dev/null -l /dev/null --pid=t-pid-1651
> + kill -0 1896
> + sleep 0.3
> Service root has been started.
> + test -f t-pid-1651
> ++ cat t-pid-1651
> + kill -0 1896
> + herd -s t-socket-1651 status root
> Started:
> + root
> + herd -s t-socket-1651 stop root
> ++ cat t-pid-1651
> + kill 1896
> + exit 1
> + rm -f t-socket-1651
> + test -f t-pid-1651
> ++ cat t-pid-1651
> + kill 1896
> + rm -f t-pid-1651
> FAIL tests/no-home.sh (exit status: 1)
>
>
> error: in phase 'check': uncaught exception:
> %exception #<&invoke-error program: "make" arguments: ("check" "-j" "24" "GUILE_AUTO_COMPILE=0") exit-status: 2 term-signal: #f stop-signal: #f>
> phase `check' failed after 15.3 seconds
> command "make" "check" "-j" "24" "GUILE_AUTO_COMPILE=0" failed with status 2
>
> Perhaps due to high load, as I can't seem to reproduce when building
> Shepherd alone.

I encountered the problem once more; the errors are slightly different
so I'm pasting the failed test output here:

Toggle snippet (49 lines)
FAIL: tests/no-home
===================

+ shepherd --version
shepherd (GNU Shepherd) 0.8.1
Copyright (C) 2020 the Shepherd authors
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
+ HOME=/nowhere
+ export HOME
+ test -w /nowhere
+ socket=t-socket-1606
+ pid=t-pid-1606
+ herd='herd -s t-socket-1606'
+ trap 'rm -f t-socket-1606;
test -f t-pid-1606 && kill `cat t-pid-1606` || true; rm -f t-pid-1606' EXIT
+ shepherd_pid=1868
+ test -f t-pid-1606
+ shepherd -I -s t-socket-1606 -c /dev/null -l /dev/null --pid=t-pid-1606
+ kill -0 1868
+ sleep 0.3
Service root has been started.
+ test -f t-pid-1606
++ cat t-pid-1606
+ kill -0 1868
+ herd -s t-socket-1606 status root
Started:
+ root
+ herd -s t-socket-1606 stop root
++ cat t-pid-1606
+ kill 1868
+ exit 1
+ rm -f t-socket-1606
+ test -f t-pid-1606
++ cat t-pid-1606
+ kill 1868
./tests/no-home.sh: line 2: kill: (1868) - No such process
+ true
+ rm -f t-pid-1606
FAIL tests/no-home.sh (exit status: 1)


error: in phase 'check': uncaught exception:
%exception #<&invoke-error program: "make" arguments: ("check" "-j" "24" "GUILE_AUTO_COMPILE=0") exit-status: 2 term-signal: #f stop-signal: #f>
phase `check' failed after 15.6 seconds
command "make" "check" "-j" "24" "GUILE_AUTO_COMPILE=0" failed with status 2

Thanks,

Maxim
M
M
Maxim Cournoyer wrote on 10 Jan 2022 19:17
(address . 52779@debbugs.gnu.org)
87ilurpi9b.fsf@gmail.com
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

Toggle quote (21 lines)
> Hi,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> Hello Guix,
>>
>> I encountered the following (nondeterministic) test failure:
>>
>> # TOTAL: 13
>> # PASS: 12
>> # SKIP: 0
>> # XFAIL: 0
>> # FAIL: 1
>> # XPASS: 0
>> # ERROR: 0
>>
>> .. contents:: :depth: 2
>>
>> FAIL: tests/no-home
>> ===================

[...]

I tried to reproduce this is a controlled setting, using the 'stress'
command to effect a high load on the system, but in vain.

I could only trigger the problem on a highly parallel CPU (24 logical
cores) busy building 10s of Guix packages at the same time.

Thanks,

Maxim
L
L
Ludovic Courtès wrote on 13 Jan 2022 16:37
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 52779@debbugs.gnu.org)
87ee5b7ik6.fsf@gnu.org
Hello,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (14 lines)
> + herd -s t-socket-1651 status root
> Started:
> + root
> + herd -s t-socket-1651 stop root
> ++ cat t-pid-1651
> + kill 1896
> + exit 1
> + rm -f t-socket-1651
> + test -f t-pid-1651
> ++ cat t-pid-1651
> + kill 1896
> + rm -f t-pid-1651
> FAIL tests/no-home.sh (exit status: 1)

What happens here is that the shepherd process is still alive after
‘herd stop root’ has completed, contrary to what’s expected:

Toggle snippet (8 lines)
$herd stop root

if kill `cat "$pid"`
then
exit 1
fi

The expectation is that shepherd has terminated by the time ‘herd stop
root’ exits; I wonder if that’s bogus.

‘herd stop root’ terminates when shepherd has closed its connection,
which normally happens when shepherd exits:

Toggle snippet (24 lines)
28003 read(15, "(shepherd-command (version 0) (action stop) (service root) (arguments ()) (directory \"/data/src/shepherd\"))", 1024) = 107
28003 brk(0x1030000) = 0x1030000
28003 mmap(NULL, 262144, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0072be8000
28003 brk(0x100f000) = 0x100f000
28003 getcwd("/data/src/shepherd", 100) = 19
28003 chdir("/data/src/shepherd") = 0
28003 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2962, ...}, 0) = 0
28003 write(7, "2022-01-13 16:21:16 Exiting shepherd...\n", 40) = 40
28003 chdir("/data/src/shepherd") = 0
28003 getuid() = 1000
28003 close(13) = 0
28003 unlink("test") = 0
28003 exit_group(0) = ?
28006 <... futex resumed>) = ?
28008 <... read resumed> <unfinished ...>) = ?
28005 <... futex resumed>) = ?
28004 <... futex resumed>) = ?
28008 +++ exited with 0 +++
28006 +++ exited with 0 +++
28005 +++ exited with 0 +++
28004 +++ exited with 0 +++
28003 +++ exited with 0 +++

Maybe there’s a chance that the shell hasn’t processed the shepherd’s
SIGCHLD when it evaluates the “if kill `cat "$pid"`” condition; in that
case, the shepherd process still exists as a zombie.

A more robust approach might be to use the shell’s builtin ‘wait’,
because then I suppose the shell will be forced to process pending
SIGCHLDs:
Toggle diff (20 lines)
diff --git a/tests/no-home.sh b/tests/no-home.sh
index 85b6116..5a8c278 100644
--- a/tests/no-home.sh
+++ b/tests/no-home.sh
@@ -1,5 +1,5 @@
# GNU Shepherd --- Make sure shepherd doesn't fail when $HOME is not writable.
-# Copyright © 2014, 2016 Ludovic Courtès <ludo@gnu.org>
+# Copyright © 2014, 2016, 2022 Ludovic Courtès <ludo@gnu.org>
#
# This file is part of the GNU Shepherd.
#
@@ -46,7 +46,4 @@ kill -0 `cat "$pid"`
$herd status root
$herd stop root
-if kill `cat "$pid"`
-then
- exit 1
-fi
+wait `cat "$pid"`
I can’t get it to fail while waiting for a few minutes of:

while make check TESTS=tests/no-home.sh ; do : ; done

… but I cannot get the original one to fail either.

Does it work for you?

Ludo’.
M
M
Maxim Cournoyer wrote on 13 Jan 2022 18:29
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 52779@debbugs.gnu.org)
875yqn1r4l.fsf@gmail.com
Hi Ludo!

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (28 lines)
> Hello,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> + herd -s t-socket-1651 status root
>> Started:
>> + root
>> + herd -s t-socket-1651 stop root
>> ++ cat t-pid-1651
>> + kill 1896
>> + exit 1
>> + rm -f t-socket-1651
>> + test -f t-pid-1651
>> ++ cat t-pid-1651
>> + kill 1896
>> + rm -f t-pid-1651
>> FAIL tests/no-home.sh (exit status: 1)
>
> What happens here is that the shepherd process is still alive after
> ‘herd stop root’ has completed, contrary to what’s expected:
>
> $herd stop root
>
> if kill `cat "$pid"`
> then
> exit 1
> fi

Yes!

[...]

Toggle quote (29 lines)
> Maybe there’s a chance that the shell hasn’t processed the shepherd’s
> SIGCHLD when it evaluates the “if kill `cat "$pid"`” condition; in that
> case, the shepherd process still exists as a zombie.
>
> A more robust approach might be to use the shell’s builtin ‘wait’,
> because then I suppose the shell will be forced to process pending
> SIGCHLDs:
>
> diff --git a/tests/no-home.sh b/tests/no-home.sh
> index 85b6116..5a8c278 100644
> --- a/tests/no-home.sh
> +++ b/tests/no-home.sh
> @@ -1,5 +1,5 @@
> # GNU Shepherd --- Make sure shepherd doesn't fail when $HOME is not writable.
> -# Copyright © 2014, 2016 Ludovic Courtès <ludo@gnu.org>
> +# Copyright © 2014, 2016, 2022 Ludovic Courtès <ludo@gnu.org>
> #
> # This file is part of the GNU Shepherd.
> #
> @@ -46,7 +46,4 @@ kill -0 `cat "$pid"`
> $herd status root
> $herd stop root
>
> -if kill `cat "$pid"`
> -then
> - exit 1
> -fi
> +wait `cat "$pid"`

As I wrote, I was also unable to reproduce this (but when I had a high
load of packages to build at the same time, I could get it to happen a
couple times upon retrying). Your analysis (and the narrow window which
would allow for a failure) makes sense to me, along with the proposed
fix.

I think you should commit it and tentatively mark this bug as fixed :-).

Thank you for looking into it!

Maxim
L
L
Ludovic Courtès wrote on 17 Jan 2022 14:43
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)(address . 52779-done@debbugs.gnu.org)
8735lmpjdw.fsf@gnu.org
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (2 lines)
> I think you should commit it and tentatively mark this bug as fixed :-).

Alright, pushed as Shepherd commit
e025f83c2072e4a6533a5747682810a885e20b2e!

Thanks,
Ludo’.
Closed
?