‘gui-installed-desktop-os-encrypted’ test intermittent failures

  • Done
  • quality assurance status badge
Details
2 participants
  • Ludovic Courtès
  • Mathieu Othacehe
Owner
unassigned
Submitted by
Ludovic Courtès
Severity
important
L
L
Ludovic Courtès wrote on 7 Dec 2022 14:50
(address . bug-guix@gnu.org)
878rjjjpa5.fsf@inria.fr
Hi,

The ‘gui-installed-desktop-os-encrypted’ test fails intermittently like
so:

Toggle snippet (26 lines)
Dec 7 12:14:01 localhost installer[253]: ^M^[[Kguix system: ^[[1m^[[0mbootloader successfully installed on /dev/vda^M
Dec 7 12:14:01 localhost installer[253]: command ("guix" "system" "init" "--fallback" "--no-grafts" "--no-substitutes" "/mnt/etc/config.scm" "/mnt") succeeded
conversation expecting pattern ((quote installation-complete))
Dec 7 12:14:01 localhost shepherd[1]: Service guix-daemon has been stopped.
Dec 7 12:14:01 localhost shepherd[1]: Service guix-daemon has been started.
Dec 7 12:14:01 localhost installer[253]: crashing due to uncaught exception: system-error ("umount" "~S: ~A" ("/remove" "Device or resource busy") (16))
Dec 7 12:14:01 localhost installer[253]: running form #<newt-form aec220> ("Unexpected problem") with 1 clients
Dec 7 12:14:01 localhost installer[253]: form #<newt-form aec220> ("Unexpected problem"): client 20 replied #t
/gnu/store/2bdm6l1j7siky5vadr62l3ns2z8qz75a-shepherd-marionette.scm:1:1718: ERROR:
1. &pattern-not-matched:
pattern: ((quote installation-complete))
sexp: (contents-dialog (title "Unexpected problem") (text "The installer has encountered an unexpected problem. The backtrace is displayed below. You may choose to exit or create a dump archive.") (content "In ./gnu/installer/steps.scm:\n 150:13 19 (run ((substitutes . #f) (network (select-technology . #<<technology> name: \"Wired\" type: \"ethernet\" powered?: #t connected?: #t>) (power-technology . #<unspecified>) (# . #<<???>) ???) ???) ???)\n 150:13 18 (run ((user #<<user> name: \"root\" real-name: \"\" group: \"users\" password: <secret> home-directory: \"/root\"> #<<user> name: \"alice\" real-name: \"Alice\" group: \"users\" password: <s???> ???) ???) ???)\n 150:13 17 (run ((services #<<system-service> name: \"GNOME\" type: desktop recommended?: #f snippet: ((service gnome-desktop-service-type)) packages: ()> #<<system-service> name: \"Xfce\" ty???> ???) ???) ???)\n 148:23 16 (run ((partition #<<user-partition> name: #f type: normal file-name: \"/dev/vda1\" disk-file-name: \"/dev/vda\" crypt-label: #f crypt-password: #f fs-type: ext4 bootable?: #t esp?:???> ???) ???) ???)\nIn ./gnu/installer/newt/final.scm:\n 132:11 15 (run-final-page ((partition #<<user-partition> name: #f type: normal file-name: \"/dev/vda1\" disk-file-name: \"/dev/vda\" crypt-label: #f crypt-password: #f fs-type: ext4 bootable???> ???) ???) ???)\n 105:21 14 (run-install-shell \"en_HK.utf8\" #:users _)\nIn ./gnu/installer/final.scm:\n 191:5 13 (install-system \"en_HK.utf8\" #:users _)\n 113:13 12 (call-with-mnt-container #<procedure 7f0e0b7d2f00 at ./gnu/installer/final.scm:192:7 ()>)\nIn gnu/build/linux-container.scm:\n 265:16 11 (run-container _ _ (mnt) _ #<procedure 7f0e0b7d2f00 at ./gnu/installer/final.scm:192:7 ()> #:guest-uid _ #:guest-gid _)\nIn ./gnu/installer/final.scm:\n 222:13 10 (_)\nIn gnu/build/install.scm:\n 290:5 9 (unmount-cow-store \"/mnt\" \"/tmp/guix-inst\")\nIn guix/build/syscalls.scm:\n 600:10 8 (_ _ _ #:update-mtab? _)\nIn ice-9/boot-9.scm:\n 1685:16 7 (raise-exception _ #:continuable? _)\n 1780:13 6 (_ #<&compound-exception components: (#<&external-error> #<&origin origin: \"umount\"> #<&message message: \"~S: ~A\"> #<&irritants irritants: (\"/remove\" \"Device or resource busy\")> #<&ex???>)\nIn ice-9/eval.scm:\n 619:8 5 (_ #(#(#(#<directory (guile-user) 7f0e0cee8c80>) system-error (\"umount\" \"~S: ~A\" (\"/remove\" \"Device or resource busy\") (16))) #<variable 7f0e08731270 value: #<unspecified>> #<varia???> ???))\n 626:19 4 (_ #(#(#(#<directory (guile-user) 7f0e0cee8c80>) system-error (\"umount\" \"~S: ~A\" (\"/remove\" \"Device or resource busy\") (16))) #<variable 7f0e08731270 value: #<unspecified>> #<varia???> ???))\nIn ./gnu/installer/dump.scm:\n 58:4 3 (prepare-dump system-error (\"umount\" \"~S: ~A\" (\"/remove\" \"Device or resource busy\") (16)) #:result _)\nIn ice-9/ports.scm:\n 433:17 2 (call-with-output-file _ _ #:binary _ #:encoding _)\nIn ./gnu/installer/dump.scm:\n 60:27 1 (_ #<output: installer-backtrace 7>)\nIn unknown file:\n 0 (make-stack #t)\n./gnu/installer/dump.scm:62:36: In procedure umount: \"/remove\": Device or resource busy\n"))
Dec 7 12:14:01 localhost installer[198]: unmounting "/mnt/"

Backtrace:
2 (primitive-load "/gnu/store/s6a8v2l4387bgsp0w0pcbg4wjsy?")
In ice-9/eval.scm:
191:35 1 (_ #f)
619:8 0 (_ #(#<directory (guile-user) 7fffeffcfc80> #<variabl?>))

ice-9/eval.scm:619:8: Throw to key `marionette-eval-failure' with args `((quote (complete-installation installer-socket)))'.
builder for `/gnu/store/wgw64jfyqrrg27afqmlj70a22d1mr5mv-installation.drv' failed with exit code 1
@ build-failed /gnu/store/wgw64jfyqrrg27afqmlj70a22d1mr5mv-installation.drv - 1 builder for `/gnu/store/wgw64jfyqrrg27afqmlj70a22d1mr5mv-installation.drv' failed with exit code 1
cannot build derivation `/gnu/store/dr20sisps9rlpbq0vfzncdiyymrd5r0i-gui-installed-desktop-os-encrypted.drv': 1 dependencies couldn't be built


Does that ring a bell, Mathieu?

Thanks,
Ludo’.
L
L
Ludovic Courtès wrote on 7 Dec 2022 18:36
control message for bug #59884
(address . control@debbugs.gnu.org)
87pmcvi0a1.fsf@gnu.org
severity 59884 important
quit
M
M
Mathieu Othacehe wrote on 8 Dec 2022 19:02
Re: bug#59884: ‘gui-installed-desktop-os-encryp ted’ test intermittent failures
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 59884@debbugs.gnu.org)
87359p22pt.fsf@gnu.org
Hello,

Toggle quote (9 lines)
> ice-9/eval.scm:619:8: Throw to key `marionette-eval-failure' with args `((quote (complete-installation installer-socket)))'.
> builder for `/gnu/store/wgw64jfyqrrg27afqmlj70a22d1mr5mv-installation.drv' failed with exit code 1
> @ build-failed /gnu/store/wgw64jfyqrrg27afqmlj70a22d1mr5mv-installation.drv - 1 builder for `/gnu/store/wgw64jfyqrrg27afqmlj70a22d1mr5mv-installation.drv' failed with exit code 1
> cannot build derivation `/gnu/store/dr20sisps9rlpbq0vfzncdiyymrd5r0i-gui-installed-desktop-os-encrypted.drv': 1 dependencies couldn't be built
>
> (From <https://ci.guix.gnu.org/build/224930/details>.)
>
> Does that ring a bell, Mathieu?

I spent days on that issue before. It used to show up on all installer
tests, and even on real hardware, then
8ce6f4dc2879919c12bc76a2f4b01200af97e01 mitigated it.

The installation is now made in a container to make sure that we are
later on able to umount the store overlay even though some background
processes such as kmscon or udev opened files from the overlay.

Now the issue only shows up on that specific test and is intermittent as
you noticed.

To be honest, that was quite painful to debug and I'm a bit scared to
jump back in. I think I had the marionette produce some lsof reports
back then, or something like that. I very much regret not to have kept
notes somewhere.

Mathieu
L
L
Ludovic Courtès wrote on 9 Dec 2022 23:32
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 59884@debbugs.gnu.org)
87pmcs5hui.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (8 lines)
> I spent days on that issue before. It used to show up on all installer
> tests, and even on real hardware, then
> 8ce6f4dc2879919c12bc76a2f4b01200af97e01 mitigated it.
>
> The installation is now made in a container to make sure that we are
> later on able to umount the store overlay even though some background
> processes such as kmscon or udev opened files from the overlay.

Right.

Toggle quote (3 lines)
> Now the issue only shows up on that specific test and is intermittent as
> you noticed.

On this particular test it seems to be frequent.

Toggle quote (5 lines)
> To be honest, that was quite painful to debug and I'm a bit scared to
> jump back in. I think I had the marionette produce some lsof reports
> back then, or something like that. I very much regret not to have kept
> notes somewhere.

Yeah. One possibility is timing: (restart-service 'guix-daemon) kills
the daemon’s process group, waits for the group leader to terminate,
then starts the daemon. I think there’s a possibility that other
processes in the group (like a ‘guix substitute’ child) are still alive
at that point and they might be the one keeping the device busy.
Sleeping for a couple of seconds should allow us to kinda verify that
hypothesis. However, the fact that it’s only in the cryptsetup case
suggests this hypothesis might well be completely bogus.

But yeah, invoking ‘lsof’ would tell us.

I guess we’ll leave that as future work though.

Thanks for your feedback!

Ludo’.
L
L
Ludovic Courtès wrote on 10 Dec 2022 16:35
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 59884-done@debbugs.gnu.org)
87y1rf46go.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (8 lines)
> Yeah. One possibility is timing: (restart-service 'guix-daemon) kills
> the daemon’s process group, waits for the group leader to terminate,
> then starts the daemon. I think there’s a possibility that other
> processes in the group (like a ‘guix substitute’ child) are still alive
> at that point and they might be the one keeping the device busy.
> Sleeping for a couple of seconds should allow us to kinda verify that
> hypothesis.

For good measure, I threw a retry loop in commit
61b7e9687757aff013b99e4ab15669a950c8b222. The test passes on my laptop
and it passed on ci.guix as well:


I’m considering the case closed and now getting ready for 1.4.0rc2.

Ludo’.
Closed
?