Installation tests are failing

Open
Submitted by Mathieu Othacehe.
Details
4 participants
  • Ludovic Courtès
  • Mathieu Othacehe
  • Maxim Cournoyer
  • Mathieu Othacehe
Owner
unassigned
Severity
important
Mathieu Othacehe wrote on 8 Apr 11:51 +0200
(address . bug-guix@gnu.org)(name . Ludovic Courtès)(address . ludo@gnu.org)
87r167rjhv.fsf@gnu.org
Hello,

The installation tests are failing this way:

conversation expecting pattern ((quote pause))
Apr 7 17:41:58 localhost installer[227]: guix system: error: failed to connect to `/var/guix/daemon-socket/socket': Connection refused

This happens right after the 'guix-daemon' service is restarted. It looks
like this regression was introduced by the switch to the new Shepherd
release.

See:

Thanks,

Mathieu
Mathieu Othacehe wrote on 8 Apr 17:10 +0200
(address . bug-guix@gnu.org)(name . Ludovic Courtès)(address . ludo@gnu.org)
87v8vjwqzk.fsf@gnu.org
The following tests are also failing since the Shepherd upgrade:


Thanks,

Mathieu
Mathieu Othacehe wrote on 26 Apr 10:27 +0200
control message for bug #54786
(address . control@debbugs.gnu.org)
87bkwow8oo.fsf@meije.i-did-not-set--mail-host-address--so-tickle-me
severity 54786 important
quit
Mathieu Othacehe wrote on 28 Apr 09:22 +0200
Re: bug#54786: Installation tests are failing
(address . 54786@debbugs.gnu.org)(name . Ludovic Courtès)(address . ludo@gnu.org)
87zgk5brkd.fsf@gnu.org
Hello,

Those tests are still failing. It looks like most of the failures are
caused by daemons started multiple times.

The nginx daemon seems to be started multiple times:

nginx: [emerg] bind() to 0.0.0.0:19418 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:19418 failed (98: Address already in use)



This is the GNU system. Welcome.
komputilo login: nginx: [emerg] bind() to 0.0.0.0:19418 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:19418 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:19418 failed (98: Address already in use)
nginx: [emerg] still could not bind()
/gnu/store/01phrvxnxrg1q0gxa35g7f77q06crf6v-shepherd-marionette.scm:1:1718: ERROR:
1. &action-exception-error:
service: nginx
action: start
key: %exception
args: ("#<&invoke-error program: \"/gnu/store/815abphg8vr8qkl8ykd8pyxp1v62c9gk-nginx-1.21.6/sbin/nginx\" arguments: (\"-c\" \"/gnu/store/rbjgg41p22lgkjwrc8inrhbmqah54cgq-nginx.conf\" \"-p\" \"/var/run/nginx\") exit-status: 1 term-signal: #f stop-signal: #f>")

Tests failed, dumping log file '/gnu/store/p72g83l9nag6c830pzwgcgpnvnyr53p1-cgit-test/cgit.log'.

The nginx daemon seems to be started multiple times:

nginx: [emerg] bind() to 0.0.0.0:19418 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:19418 failed (98: Address already in use)



This is the GNU system. Welcome.
komputilo login: nginx: [emerg] bind() to 0.0.0.0:19418 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:19418 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:19418 failed (98: Address already in use)
nginx: [emerg] still could not bind()
/gnu/store/01phrvxnxrg1q0gxa35g7f77q06crf6v-shepherd-marionette.scm:1:1718: ERROR:
1. &action-exception-error:
service: nginx
action: start
key: %exception
args: ("#<&invoke-error program: \"/gnu/store/815abphg8vr8qkl8ykd8pyxp1v62c9gk-nginx-1.21.6/sbin/nginx\" arguments: (\"-c\" \"/gnu/store/ayafihmfwg3yw4hp8nw622g2rr9mw7vn-nginx.conf\" \"-p\" \"/var/run/nginx\") exit-status: 1 term-signal: #f stop-signal: #f>")

Tests failed, dumping log file '/gnu/store/ix0hpwpr7b6zh20arig9bpg2lqzysxi7-gitile-test/gitile.log'.

> * jami-test (https://ci.guix.gnu.org/build/646811/details)

Looks like those tests are failing because the daemon is started
multiple times:

This is the GNU system. Welcome.
jami login: Jami Daemon 11.0.0, by Savoir-faire Linux 2004-2019
https://jami.net/
[Video support enabled]
[Plugins support enabled]

12:21:08.165 os_core_unix.c !pjlib 2.11 for POSIX initialized
Jami Daemon 11.0.0, by Savoir-faire Linux 2004-2019
https://jami.net/
[Video support enabled]
[Plugins support enabled]

One does not simply initialize the client: Another daemon is detected
/gnu/store/01phrvxnxrg1q0gxa35g7f77q06crf6v-shepherd-marionette.scm:1:1718: ERROR:
1. &action-exception-error:
service: jami
action: start
key: match-error
args: ("match" "no matching pattern" #f)
Jami Daemon 11.0.0, by Savoir-faire Linux 2004-2019
https://jami.net/
[Video support enabled]
[Plugins support enabled]

One does not simply initialize the client: Another daemon is detected
/gnu/store/01phrvxnxrg1q0gxa35g7f77q06crf6v-shepherd-marionette.scm:1:1718: ERROR:
1. &action-exception-error:
service: jami
action: start
key: match-error
args: ("match" "no matching pattern" #f)
Jami Daemon 11.0.0, by Savoir-faire Linux 2004-2019
https://jami.net/
[Video support enabled]
[Plugins support enabled]

One does not simply initialize the client: Another daemon is detected
/gnu/store/01phrvxnxrg1q0gxa35g7f77q06crf6v-shepherd-marionette.scm:1:1718: ERROR:
1. &action-exception-error:
service: jami
action: start
key: match-error
args: ("match" "no matching pattern" #f)

Thanks,

Mathieu
Ludovic Courtès wrote on 28 Apr 21:19 +0200
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 54786@debbugs.gnu.org)
87y1zpaud6.fsf@gnu.org
Hi!

Mathieu Othacehe <othacehe@gnu.org> skribis:

>
> The nginx daemon seems to be started multiple times:

I believe this is caused by a change of semantics (really: a bug) in the
shepherd ‘start’ method in 0.9.0.

Previously, ‘start’ would wait until the daemon was started. If the
service was being started, shepherd wouldn’t reply until it was done
starting it.

In 0.9.0, shepherd replies even while it’s waiting for the service to be
started. But as a consequence, it lets you start a service that is
already being started, leading to this mess you reported.


The proper fix is to better track the status of each service in
shepherd, and to prevent double-starts.
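A minimal sketch of that kind of status tracking, assuming a hypothetical in-memory registry. The names `%statuses`, `service-status`, and `start-once` are made up for illustration and are not Shepherd's API or data model; the point is only that a second start request arriving while a start is in progress gets refused instead of spawning a second daemon.

```scheme
(use-modules (ice-9 match))

;; Hypothetical registry mapping a service name to one of the symbols
;; 'stopped, 'starting or 'running.  This is NOT how Shepherd stores state.
(define %statuses (make-hash-table))

(define (service-status name)
  "Return the recorded status of NAME, defaulting to 'stopped."
  (hashq-ref %statuses name 'stopped))

(define (start-once name thunk)
  "Call THUNK to start NAME unless NAME is already starting or running.
Return 'started, 'already-starting or 'already-running."
  (match (service-status name)
    ('running 'already-running)
    ('starting 'already-starting)
    (_
     (hashq-set! %statuses name 'starting)
     (catch #t
       (lambda ()
         (thunk)
         (hashq-set! %statuses name 'running)
         'started)
       (lambda (key . args)
         ;; The start failed: forget the 'starting mark and re-raise.
         (hashq-set! %statuses name 'stopped)
         (apply throw key args))))))
```

With this in place, a nested `(start-once 'nginx ...)` issued while the first start is still in progress returns `'already-starting` without running its thunk.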

In the interim, perhaps we can work around that by using a different
check to determine whether the service is running. For instance,
instead of:

  (test-assert "nginx running"
    (marionette-eval
     '(begin
        (use-modules (gnu services herd))
        (start-service 'nginx))
     marionette))

… we’d write something like:

  (test-assert "nginx running"
    (wait-for-file "/var/run/nginx/pid"))
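
The idea behind the second form is to observe a side effect of the daemon (its PID file) rather than ask shepherd to start it again. As a rough, standalone illustration of that polling pattern, here is a sketch; `wait-for-path` is a hypothetical name and is not the `wait-for-file` helper used by the system tests, which is assumed to do its check inside the guest.

```scheme
;; Hypothetical helper: poll the local file system until FILE exists.
(define* (wait-for-path file #:key (timeout 10) (interval 1))
  "Poll until FILE exists, waiting INTERVAL seconds between checks and
giving up after about TIMEOUT seconds.  Return #t if FILE showed up,
#f otherwise."
  (let loop ((elapsed 0))
    (cond ((file-exists? file) #t)
          ((>= elapsed timeout) #f)
          (else
           (sleep interval)
           (loop (+ elapsed interval))))))
```

A test could then assert `(wait-for-path "/var/run/nginx/pid")`, which never triggers a second start of the service.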

Thoughts? I’ll give that a try.

Thanks for the heads-up!

Ludo’.
Ludovic Courtès wrote on 29 Apr 21:50 +0200
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 54786@debbugs.gnu.org)
87mtg3655a.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> In the interim, perhaps we can work around that by using a different
> check to determine whether the service is running. For instance,
> instead of:
>
>   (test-assert "nginx running"
>     (marionette-eval
>      '(begin
>         (use-modules (gnu services herd))
>         (start-service 'nginx))
>      marionette))
>
> … we’d write something like:
>
>   (test-assert "nginx running"
>     (wait-for-file "/var/run/nginx/pid"))

I pushed something along these lines as
73eeeeafbb0765f76834b53c9fe6cf3c8f740840.

I wasn’t able to fix the tailon test because the ‘tailon’ package
doesn’t build and I failed to address that in a timely fashion.

Ludo’.
Mathieu Othacehe wrote on 30 Apr 15:02 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 54786@debbugs.gnu.org)
875ymqwwq8.fsf@gnu.org
Hey Ludo,

> I pushed something along these lines as
> 73eeeeafbb0765f76834b53c9fe6cf3c8f740840.

Thanks for the fix! The jami and jami-provisioning tests are also broken
because of what looks to be the same issue:

One does not simply initialize the client: Another daemon is detected
/gnu/store/01phrvxnxrg1q0gxa35g7f77q06crf6v-shepherd-marionette.scm:1:1718: ERROR:
1. &action-exception-error:
service: jami
action: start
key: match-error
args: ("match" "no matching pattern" #f)
Jami Daemon 11.0.0, by Savoir-faire Linux 2004-2019
https://jami.net/
[Video support enabled]
[Plugins support enabled]

I think we don't have the right approach here: we should check that the
system tests are passing before pushing series and not adapt the tests
afterwards.

Historically this was difficult because the system tests were often in a
semi-broken state. Before the Shepherd update the tests were however all
passing (modulo rare intermittent failures).

As it's not always obvious what's going to break the system tests and
what's not (a simple package update can easily break them), it would be
really nice to have mandatory commit verification.

The mumi/cuirass gateway that has already been discussed could really
help us here. If some people are motivated, we could split the work and
introduce such a mechanism.

Thanks,

Mathieu
Ludovic Courtès wrote on 1 May 15:26 +0200
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 54786@debbugs.gnu.org)
875ymp4c5f.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

> Thanks for the fix! The jami and jami-provisioning tests are also broken
> because of what looks to be the same issue:
>
> One does not simply initialize the client: Another daemon is detected
> /gnu/store/01phrvxnxrg1q0gxa35g7f77q06crf6v-shepherd-marionette.scm:1:1718: ERROR:
> 1. &action-exception-error:
> service: jami
> action: start
> key: match-error
> args: ("match" "no matching pattern" #f)
> Jami Daemon 11.0.0, by Savoir-faire Linux 2004-2019
> https://jami.net/
> [Video support enabled]
> [Plugins support enabled]

Yes, I noticed that, but I’m not sure how to apply a similar workaround.

> I think we don't have the right approach here: we should check that the
> system tests are passing before pushing series and not adapt the tests
> afterwards.

Yes, apologies for that.

> Historically this was difficult because the system tests were often in a
> semi-broken state. Before the Shepherd update the tests were however all
> passing (modulo rare intermittent failures).
>
> As it's not always obvious what's going to break the system tests and
> what's not (a simple package update can easily break them), it would be
> really nice to have mandatory commit verification.
>
> The mumi/cuirass gateway that has already been discussed could really
> help us here. If some people are motivated, we could split the work and
> introduce such a mechanism.

Yes, I agree; an “always green” ‘master’ branch would be great.

Do you have milestones in mind for “commit verification”?

As I see it, the difficulty is that we’ve been looking at a horizon of
features à la GitLab-CI without being quite sure how to get there (apart
from deploying GitLab or a similar tool, that is).

A first step that comes to mind would be an easier way to set up
transient jobsets for a branch (or, ideally, for an issue: the thing
would apply patches and create the branch).

Thoughts?

(Maybe worth moving to guix-devel.)

Ludo’.
Maxim Cournoyer wrote 3 days ago
(name . Ludovic Courtès)(address . ludo@gnu.org)
87a6b646qs.fsf@gmail.com
Hi,

Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> Mathieu Othacehe <othacehe@gnu.org> skribis:
>
>> Thanks for the fix! The jami and jami-provisioning tests are also broken
>> because of what looks to be the same issue:
>>
>> One does not simply initialize the client: Another daemon is detected
>> /gnu/store/01phrvxnxrg1q0gxa35g7f77q06crf6v-shepherd-marionette.scm:1:1718: ERROR:
>> 1. &action-exception-error:
>> service: jami
>> action: start
>> key: match-error
>> args: ("match" "no matching pattern" #f)
>> Jami Daemon 11.0.0, by Savoir-faire Linux 2004-2019
>> https://jami.net/
>> [Video support enabled]
>> [Plugins support enabled]
>
> Yes, I noticed that, but I’m not sure how to apply a similar workaround.

I tried fixing that today, but so far I've only managed to understand
what seems to be going wrong, with this (not so great) workflow:

1. Add pk uses in the code.

2. $(./pre-inst-env guix system vm --no-graphic \
     -e '(@@ (gnu tests telephony) %jami-os)' \
     --no-offload --no-substitutes) \
     -m 512 -nic user,model=virtio-net-pci,hostfwd=tcp::10022-:22

3. ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no \
     -p 10022 root@localhost

and poke around with 'herd status', read /var/log/messages, experiment
with dbus-send, etc.

This allowed me to find out that (dbus-available-services) appears to be
broken. I'm not sure why the exceptions are reported so badly by
Shepherd (are exceptions raised with 'error' not handled by Shepherd or
something?). The with-retries loop should end up printing the caught
exception arguments, and I would also have expected to see the backtrace
somewhere.

Anyway, connecting to another machine that is still running the
jami-service-type (it hasn't been reconfigured in a while), I could
see:

scheme@(guix-user)> ,use (gnu build jami-service)
scheme@(guix-user)> (dbus-available-services)
;;; Failed to autoload fork+exec-command in (shepherd service):
;;; no code for module (fibers)
ice-9/boot-9.scm:1685:16: In procedure raise-exception:
error: fork+exec-command: unbound variable

Oh yes, so it now requires guile-fibers. After installing it:

scheme@(guix-user)> ,use (gnu build jami-service)
scheme@(guix-user)> (dbus-available-services)
ice-9/boot-9.scm:1685:16: In procedure raise-exception:
No scheduler current; call within run-fibers instead

So the users of fork+exec-command (a public API) need to be adjusted.
I suspect that's the crux of the issue here. The rest (the jami tests
using Shepherd's start-service to check the service status and causing
multiple starts) should be easy to work around.

To be continued...

Maxim