ci.guix.gnu.org not building the 'guix' job

Status: Open
Submitted by: Leo Famulari
Participants: Leo Famulari, Ludovic Courtès, Maxim Cournoyer, Mathieu Othacehe, Ricardo Wurmus
Owner: unassigned
Severity: important
Leo Famulari wrote on 23 Jan 01:56 +0100
(address . bug-guix@gnu.org)(address . guix-sysadmin@gnu.org)
YeynsnjtpealqzUN@jasmine.lan
As far as I can tell, ci.guix.gnu.org stopped building the 'guix'
job a couple of days ago:

Leo Famulari wrote on 24 Jan 00:00 +0100
(address . 53463@debbugs.gnu.org)
Ye3eGOrjtf33GDz5@jasmine.lan
Also, the 'master' job hasn't been run in ~2 days:

https://ci.guix.gnu.org/jobset/master

I think the build farm is waiting to finish collecting garbage.
Leo Famulari wrote on 27 Jan 23:13 +0100
(address . 53463@debbugs.gnu.org)
YfMZAP256r3Cbjdp@jasmine.lan
On Sun, Jan 23, 2022 at 06:00:40PM -0500, Leo Famulari wrote:
> Also, the 'master' job hasn't been run in ~2 days:
>
> https://ci.guix.gnu.org/jobset/master
>
> I think the build farm is waiting to finish collecting garbage.

Unfortunately, the 'master' jobset is broken again, and the 'guix'
jobset is still broken.
Leo Famulari wrote on 29 Jan 22:11 +0100
(no subject)
(address . control@debbugs.gnu.org)
YfWtnSc39wmufkRL@jasmine.lan
block 53214 with 52943
block 53214 with 53463
Maxim Cournoyer wrote on 1 Feb 16:18 +0100
control message for bug #53463
(address . control@debbugs.gnu.org)
87wnieehtu.fsf@gmail.com
severity 53463 important
quit
Mathieu Othacehe wrote on 2 Feb 19:41 +0100
Re: bug#53463: ci.guix.gnu.org not building the 'guix' job
(name . Leo Famulari)(address . leo@famulari.name)(address . 53463@debbugs.gnu.org)
87leyt2jsr.fsf@gnu.org
Hello,

The issue here seems to be that the evaluations of the 'guix' jobset are
never finishing, even when the GC is not running.

I ran strace on one of the stuck evaluation processes; it repeatedly
prints:

[pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
[pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
[pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
[pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
[pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
[pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96

To be continued,

Thanks,

Mathieu
Ludovic Courtès wrote on 4 Feb 09:58 +0100
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87sfsznh2z.fsf@gnu.org
Hello!

Mathieu Othacehe <othacehe@gnu.org> skribis:

> I tried to strace one of the stuck evaluation process, it returns
> repeatedly:
>
> [pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
> [pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
> [pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
> [pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
> [pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
> [pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96

Oh! That indicates that it’s failing to offload to one of the
‘localhost’ build machines specified in /etc/guix/machines.scm.
Normally there’s an SSH tunnel set up for those, but I guess it broke.

Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
machines by their WireGuard IP?
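
A quick way to confirm that a forwarded port like this is down is to try opening a TCP connection to it with a short timeout. The following is only a sketch: the helper name `port_open` is made up for illustration, and it assumes `bash` (for its `/dev/tcp` redirection) and coreutils' `timeout` are available on the head node.

```shell
# Return 0 if a TCP connection to HOST:PORT succeeds within TIMEOUT
# seconds (default 5), non-zero otherwise.
# Hypothetical helper, not part of Guix or the maintenance repository.
port_open() {
    local host=$1 port=$2 timeout=${3:-5}
    # bash's /dev/tcp pseudo-device opens a TCP socket; inside the
    # inner script, $0 is the host and $1 is the port.
    timeout "$timeout" bash -c 'exec 3<>"/dev/tcp/$0/$1"' \
        "$host" "$port" 2>/dev/null
}

# Example (localhost:2224 is the build slot seen in the strace output):
# if port_open localhost 2224; then
#     echo "tunnel is up"
# else
#     echo "tunnel is down (refused or timed out)"
# fi
```

A refused connection makes the function return immediately, while an unresponsive host is cut off after the timeout, so the check itself cannot hang.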

Thanks,
Ludo’.
Mathieu Othacehe wrote on 4 Feb 10:54 +0100
(name . Ludovic Courtès)(address . ludo@gnu.org)
875ypv3qjo.fsf@gnu.org
Hey,

> Oh! That indicates that it’s failing to offload to one of the
> ‘localhost’ build machines specified in /etc/guix/machines.scm.
> Normally there’s an SSH tunnel set up for those, but I guess it broke.
>
> Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
> machines by their WireGuard IP?

Seems like the right thing to do. This bit is also an unstaged change in
the berlin maintenance repository; we should commit it. Tobias, could
you have a look? :)

+(define powerpc64le
+  (list
+   ;; A VM donated/hosted by OSUOSL & administered by nckx.
+   ;; XXX: SSH tunnel via overdrive1:
+   ;;   ssh -L 2224:p9.tobias.gr:22 hydra@10.0.0.3
+   #;(build-machine
+      ;;(name "p9.tobias.gr")
+      (name "localhost")
+      (port 2224)
+      (user "hydra")
+      (systems '("powerpc64le-linux"))
+      (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJEbRxJ6WqnNLYEMNDUKFcdMtyZ9V/6oEfBFSHY8xE6A nckx"))))

I also found that other machines were unreachable and commented them out:

;; CPU: 16 ARM Cortex-A72 cores
;; RAM: 32 GB
- (list (build-machine
+ (list #;(build-machine
;;kreuzberg
(name "10.0.0.9")
(user "hydra")
@@ -243,13 +256,13 @@
;; BeagleBoard X15 kindly hosted by Simon Josefsson.
;; CPU: Cortex A15 (2 cores)
;; RAM: 2 GB
- (build-machine
+ #;(build-machine
(name "10.0.0.5") ;guix-x15
(user "hydra")
(systems '("armhf-linux"))
(host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOfXjwCAFWeGiUoOVXEgtIeXxbtymjOTg7ph1ObMAcJ0 root@beaglebone"))
- (build-machine
+ #;(build-machine
(name "10.0.0.6") ;guix-x15b
(user "hydra")
(systems '("armhf-linux"))

Nevertheless, we are hitting an offload issue here, maybe an occurrence
of #24496. The offload mechanism should time out when a machine is
unreachable instead of retrying over and over, which causes all
evaluation processes to hang.
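
To illustrate the behaviour asked for here (give up after a bounded number of attempts, each with its own deadline), here is a generic shell sketch. It is not how `guix offload` is implemented; the function name `try_bounded` and its arguments are invented for the example.

```shell
# Run a command at most MAX_TRIES times, allowing each attempt at most
# PER_TRY_TIMEOUT seconds. Return 0 on the first success, 1 once every
# attempt has failed or timed out.
# Hypothetical helper for illustration only.
try_bounded() {
    local max_tries=$1 per_try_timeout=$2 attempt=1
    shift 2
    while [ "$attempt" -le "$max_tries" ]; do
        # coreutils' timeout kills the command if it runs too long.
        if timeout "$per_try_timeout" "$@"; then
            return 0
        fi
        echo "attempt $attempt/$max_tries failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Example: try an SSH connection three times, 10 seconds each.
# try_bounded 3 10 ssh -p 2224 hydra@localhost true
```

With a bound like this, an unreachable machine costs at most MAX_TRIES times PER_TRY_TIMEOUT seconds before the caller can move on to another machine or report failure.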

Thanks,

Mathieu
Ludovic Courtès wrote on 8 Feb 11:22 +0100
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87zgn1aca4.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> Oh! That indicates that it’s failing to offload to one of the
>> ‘localhost’ build machines specified in /etc/guix/machines.scm.
>> Normally there’s an SSH tunnel set up for those, but I guess it broke.
>>
>> Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
>> machines by their WireGuard IP?
>
> Seems like the right thing to do. This bit is also an unstaged change in
> the berlin maintenance repository, we should commit it. Tobias, could
> you have a look :) ?
>
> +(define powerpc64le
> + (list
> + ;; A VM donated/hosted by OSUOSL & administered by nckx.
> + ;; XXX: SSH tunnel via overdrive1:
> + ;; ssh -L 2224:p9.tobias.gr:22 hydra@10.0.0.3
> + #;(build-machine
> + ;;(name "p9.tobias.gr")
> + (name "localhost")
> + (port 2224)
> + (user "hydra")
> + (systems '("powerpc64le-linux"))
> + (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJEbRxJ6WqnNLYEMNDUKFcdMtyZ9V/6oEfBFSHY8xE6A nckx"))))

IIRC this machine is now running WireGuard, Tobias? If so, could you
change this to refer to its WireGuard IP and commit it?

> I also found that other machines were unreachable and commented them:
>
> ;; CPU: 16 ARM Cortex-A72 cores
> ;; RAM: 32 GB
> - (list (build-machine
> + (list #;(build-machine
> ;;kreuzberg
> (name "10.0.0.9")
> (user "hydra")

Ricardo, could you check what’s wrong with kreuzberg?

> @@ -243,13 +256,13 @@
> ;; BeagleBoard X15 kindly hosted by Simon Josefsson.
> ;; CPU: Cortex A15 (2 cores)
> ;; RAM: 2 GB
> - (build-machine
> + #;(build-machine
> (name "10.0.0.5") ;guix-x15
> (user "hydra")
> (systems '("armhf-linux"))
> (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOfXjwCAFWeGiUoOVXEgtIeXxbtymjOTg7ph1ObMAcJ0 root@beaglebone"))
>
> - (build-machine
> + #;(build-machine
> (name "10.0.0.6") ;guix-x15b
> (user "hydra")
> (systems '("armhf-linux"))

Oops.

Note that it’s not necessary to comment them all out. As long as at
least one machine is available for a given system type, we’re fine:
‘guix offload’ will pick it up.
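
The "at least one machine per system type" condition can be checked mechanically. The sketch below assumes a hand-made inventory in a made-up format (`<system> <machine> <up|down>` per line); neither the format nor the function name `systems_without_machines` comes from Guix.

```shell
# Read "<system> <machine> <up|down>" lines on stdin and print every
# system type for which no machine is marked "up".
# The input format is an assumption made for this example.
systems_without_machines() {
    awk '{ seen[$1] = 1 }
         $3 == "up" { up[$1] = 1 }
         END { for (s in seen) if (!(s in up)) print s }' | sort
}

# Example with hypothetical machine names:
# printf '%s\n' \
#     'armhf-linux   board-a down' \
#     'armhf-linux   board-b down' \
#     'aarch64-linux server-a up' | systems_without_machines
```

Any system type it prints has nowhere to offload to. For checking the machines themselves, the Guix manual documents `guix offload test`, which tries to connect to the machines listed in the machines file.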

> Nevertheless we are hitting an offload issue here, maybe an occurrence
> of #24496. The offload mechanism should timeout when a machine is
> unreachable instead of retrying over and over, causing all evaluation
> processes to hang.

Yes, though the problem here is that some architectures were left with
zero machines IIRC, so it would have failed one way or another.

Thanks!

Ludo’.
Ricardo Wurmus wrote on 8 Feb 13:52 +0100
(name . Ludovic Courtès)(address . ludo@gnu.org)
87ee4dzfeu.fsf@elephly.net
Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> Mathieu Othacehe <othacehe@gnu.org> skribis:
>
>>> Oh! That indicates that it’s failing to offload to one of the
>>> ‘localhost’ build machines specified in /etc/guix/machines.scm.
>>> Normally there’s an SSH tunnel set up for those, but I guess it broke.
>>>
>>> Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
>>> machines by their WireGuard IP?
>>
>> Seems like the right thing to do. This bit is also an unstaged change in
>> the berlin maintenance repository, we should commit it. Tobias, could
>> you have a look :) ?
>>
>> +(define powerpc64le
>> + (list
>> + ;; A VM donated/hosted by OSUOSL & administered by nckx.
>> + ;; XXX: SSH tunnel via overdrive1:
>> + ;; ssh -L 2224:p9.tobias.gr:22 hydra@10.0.0.3
>> + #;(build-machine
>> + ;;(name "p9.tobias.gr")
>> + (name "localhost")
>> + (port 2224)
>> + (user "hydra")
>> + (systems '("powerpc64le-linux"))
>> + (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJEbRxJ6WqnNLYEMNDUKFcdMtyZ9V/6oEfBFSHY8xE6A nckx"))))
>
> IIRC this machine is now running WireGuard, Tobias? If so, could you
> change this to refer to its WireGuard IP and commit it?
>
>> I also found that other machines were unreachable and commented them:
>>
>> ;; CPU: 16 ARM Cortex-A72 cores
>> ;; RAM: 32 GB
>> - (list (build-machine
>> + (list #;(build-machine
>> ;;kreuzberg
>> (name "10.0.0.9")
>> (user "hydra")
>
> Ricardo, could you check what’s wrong with kreuzberg?

Oh, the usual…

root@kreuzberg ~# guix shell wireguard-tools -- wg
interface: wg0
public key: f9WGJTXp8bozJb0KxePjkOclF5pJUy1AomHWJHy80y4=
private key: (hidden)
listening port: 51820

peer: wOIfhHqQ+JQmskRS2qSvNRgZGh33UxFDi8uuSXOltF0=
endpoint: 141.80.181.40:51820
allowed ips: 10.0.0.1/32
latest handshake: 2 days, 2 hours, 11 minutes, 13 seconds ago
transfer: 292.79 MiB received, 6.05 GiB sent

Whenever the build farm is awfully quiet (e.g. because of GC) the
wireguard connection times out. I usually restart the
cuirass-remote-worker and everything’s fine again.
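
A stale handshake like the one above can also be detected automatically. This sketch relies on `wg show <interface> latest-handshakes`, which prints one `<peer-public-key> <unix-timestamp>` pair per line; the function name `stale_peers` and the 180-second threshold are arbitrary choices for illustration.

```shell
# Read "<peer-public-key> <unix-timestamp>" lines on stdin and print
# the peers whose last handshake is older than MAX_AGE seconds.
# NOW defaults to the current time; a timestamp of 0 means the peer
# has never completed a handshake, so it is reported as well.
# Hypothetical helper for illustration only.
stale_peers() {
    local max_age=$1 now=${2:-$(date +%s)}
    awk -v max="$max_age" -v now="$now" \
        '$2 == 0 || now - $2 > max { print $1 }'
}

# Example: list peers that have been silent for more than 3 minutes.
# wg show wg0 latest-handshakes | stale_peers 180
```

Run periodically, a check like this would flag the dead tunnel before evaluations start waiting on an unreachable machine.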

Today I got some additional SD cards for these machines, so I’m going to
reconfigure them (locally, because of the “guix deploy” bug) and then
move them to the data centre. Once reconfigured they will keep the
wireguard connection alive all by themselves, so no manual intervention
is necessary.

I hadn’t reconfigured them locally until now because I had hoped we would
be able to make time for the “guix deploy” bug, but things turned out
differently.

--
Ricardo
Ludovic Courtès wrote on 21 Mar 09:38 +0100
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87cziflmyh.fsf@gnu.org
Hi there!

Looks like this bug is solved: the ‘guix’ jobset is getting built.

However, evaluations are marked as “failed”, even though their build log
shows they succeeded, and if you click on one of them, you see that all
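its builds are there:

https://ci.guix.gnu.org/eval/168652
https://ci.guix.gnu.org/eval/168652/log/raw
https://ci.guix.gnu.org/jobset/guix?border-high=169749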

Any idea what could be wrong?

Thanks,
Ludo’.
Mathieu Othacehe wrote on 21 Mar 09:55 +0100
(name . Ludovic Courtès)(address . ludo@gnu.org)
87tubradne.fsf@gnu.org
Hey Ludo,

> However, evaluations are marked as “failed”, even though their build log
> shows they succeeded, and if you click on one of them, you see that all
> its builds are there:
>
> https://ci.guix.gnu.org/eval/168652
> https://ci.guix.gnu.org/eval/168652/log/raw
> https://ci.guix.gnu.org/jobset/guix?border-high=169749

This started at the time we enabled the armhf architecture, so I guess
it is marked as failed because the guix specification could not be
evaluated for this architecture.

Thanks,

Mathieu