Ludovic Courtès <ludo@gnu.org> writes:
> Hey Tomas,
>
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> I tried the config file you gave with:
>>
>> ./pre-inst-env guix system vm /tmp/config.scm
>>
>> and it hangs, to my surprise (I’ve been using ‘system-log’ on my laptop
>> since June, and “make check-system TESTS=basic” & co. pass).
>
> After spending hours on this and fixing improbable issues in the
> Shepherd (will push shortly), I found that the root of the problem is
> exactly what I feared and which led to the patches at
> <https://issues.guix.gnu.org/76262>.
>
> Namely, ‘dhcp-client-service-type’ calls ‘waitpid’; that call competes
> with the one done by shepherd’s SIGCHLD handler and, if you’re unlucky,
> it loses the race and waits forever.
One observation here. Based on the description I agree that it comes
down to (bad) luck, but in practice it seems to reproduce extremely
reliably.
At first I struggled to reproduce it again: it did not hang even a
single time (out of 5 tries) on the bad commit. But once I reverted my
configuration to what it was back then (i.e. removed a few shepherd
timers), the hang started happening every single time.
So, while in theory it should be a probabilistic problem, in practice it
does not seem to be. Not sure where I am going with this, I just think
it is interesting.
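For anyone following along, here is a minimal sketch of why two waiters
on the same child cannot both succeed. It is Python, not the Shepherd's
Guile code, and the function name `reap_twice` is made up for
illustration. It only shows the underlying rule that a child's exit
status can be collected once; in this sketch the second waiter gets an
error, whereas in the Shepherd case described above the losing side
apparently keeps waiting instead.

```python
import os


def reap_twice():
    """Fork a child, then try to collect its exit status twice."""
    pid = os.fork()
    if pid == 0:
        # Child process: exit immediately with status 7.
        os._exit(7)

    # First waiter (stands in for the SIGCHLD handler): collects the status.
    _, status = os.waitpid(pid, 0)
    first = os.WEXITSTATUS(status)

    # Second waiter (stands in for the service code): the child has
    # already been reaped, so there is nothing left to wait for.
    try:
        os.waitpid(pid, 0)
        second = "reaped"
    except ChildProcessError:
        second = "no child left"

    return first, second


if __name__ == "__main__":
    print(reap_twice())
```

Which waiter wins in a real system depends on scheduling, which may be
why changing the set of services (such as the timers) changes how often
the hang shows up.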
>
> Could you try your config with the patch at
> <https://issues.guix.gnu.org/76262#2>, at least in a VM and ideally on
> the metal?
I have reverted your revert and applied patch 2 on top of that.
Steps I took (both in VM and on a spare laptop):
1. Reconfigure from commit 1.
2. Ensure it still hangs (5x).
3. Reconfigure from commit 2.
4. Ensure it no longer hangs (5x).
I can confirm that patch 2 fixes the issue for me, both in the VM and on
a physical machine.
The only thing I noticed is that even when deploying the "good" commit,
I see the following error in the log:
guix deploy: warning: an error occurred while upgrading services on '127.0.0.1':
%exception #<inferior-object #<&service-not-found-error service: system-log>>
The system comes up fine after reboot though.
>
> Thanks in advance,
> Ludo’.
Thank you for figuring this one out. :)
Tomas
--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.