[cuirass] workers stalled

Open. Submitted by Mathieu Othacehe.
Details
7 participants
  • Greg Hogan
  • Ludovic Courtès
  • Maxim Cournoyer
  • Maxime Devos
  • Mathieu Othacehe
  • Ricardo Wurmus
  • Tom Fitzhenry
Owner
unassigned
Severity
important
Mathieu Othacehe wrote on 8 Jun 17:31 +0200
(address . bug-guix@gnu.org)
87h74v2mu7.fsf@gnu.org
Hello,

The aarch64 workers were all idle whereas 70k builds were
available. Once restarted, they started building again.

The problem might be that when the server is unavailable for a while the
worker connections expire and cannot be resumed once the server is
available again.

Thanks,

Mathieu
Greg Hogan wrote on 8 Jun 21:07 +0200
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 55848@debbugs.gnu.org)
CA+3U0ZmY4jcrZ6FPeQCgHHZKpvSh0BZ1ki5KTXfGemQRf-ZOkw@mail.gmail.com
On Wed, Jun 8, 2022 at 11:32 AM Mathieu Othacehe <othacehe@gnu.org> wrote:
>
>
> Hello,
>
> The aarch64 workers were all idle whereas 70k builds were
> available. Once restarted, they started building again.
>
> The problem might be that when the server is unavailable for a while the
> worker connections expire and cannot be resumed once the server is
> available again.
>
> Thanks,
>
> Mathieu

The recent aarch64 builds look to all be failing with the following message.

===== <cut> =====
substitute:
substitute: updating substitutes from 'https://ci.guix.gnu.org'...
0.0%guix substitute: error: TLS error in procedure 'handshake': Error
in the pull function.
===== </cut> =====
Tom Fitzhenry wrote on 11 Jun 12:44 +0200
(name . Greg Hogan)(address . code@greghogan.com)
878rq3scn3.fsf@tom-fitzhenry.me.uk
Greg Hogan <code@greghogan.com> writes:

> On Wed, Jun 8, 2022 at 11:32 AM Mathieu Othacehe <othacehe@gnu.org> wrote:
>> The aarch64 workers were all idle whereas 70k builds were
>> available. Once restarted, they started building again.

From following the builds on http://ci.guix.gnu.org/workers, many
(all?) builds are failing on the following workers:

* grunewald
* kreuzberg
* pankow

The builds are failing with the same error:

"substitute: updating substitutes from 'https://ci.guix.gnu.org'...
0.0%guix substitute: error: TLS error in procedure 'handshake': Error in
the pull function."

Here's some examples:


On worker overdrive1, the raw log of the rust-async-mutex build shows
it managing to pull substitutes, but it seems to be compiling
rust-1.57 itself.
Ludovic Courtès wrote on 11 Jun 22:33 +0200
control message for bug #55848
(address . control@debbugs.gnu.org)
8735gbx7mk.fsf@gnu.org
severity 55848 important
quit
Ludovic Courtès wrote on 12 Jun 15:33 +0200
Re: bug#55848: [cuirass] workers stalled
(name . Tom Fitzhenry)(address . tom@tom-fitzhenry.me.uk)
87bkuyvwf0.fsf@gnu.org
Hi,

(+Cc: guix-sysadmin)

Tom Fitzhenry <tom@tom-fitzhenry.me.uk> skribis:

> From following the builds on http://ci.guix.gnu.org/workers, many
> (all?) builds are failing on the following workers:
>
> * grunewald
> * kreuzberg
> * pankow
>
> The builds are failing with the same error:
>
> "substitute: updating substitutes from 'https://ci.guix.gnu.org'...
> 0.0%guix substitute: error: TLS error in procedure 'handshake': Error in
> the pull function."

On these machines, https://ci.guix.gnu.org (among others) is unavailable
for some reason (firewall I guess):

ludo@grunewald ~$ wget --debug -O/dev/null https://ci.guix.gnu.org
Setting --output-document (outputdocument) to /dev/null
DEBUG output created by Wget 1.21.1 on linux-gnu.

Reading HSTS entries from /home/ludo/.wget-hsts
URI encoding = ‘UTF-8’
--2022-06-11 22:38:59-- https://ci.guix.gnu.org/
Certificates loaded: 444
Resolving ci.guix.gnu.org (ci.guix.gnu.org)... 141.80.181.40
Caching ci.guix.gnu.org => 141.80.181.40
Connecting to ci.guix.gnu.org (ci.guix.gnu.org)|141.80.181.40|:443... connected.
Created socket 4.
Releasing 0x000000001fd26b50 (new refcount 1).

[Sits there forever…]
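
A probe like the one above can be bounded in time, so that a stalled TLS handshake shows up as a failure instead of a hang. This is a minimal sketch, assuming GNU coreutils `timeout` and `wget` are installed; the function name `probe_https` and the 15-second default are illustrative and not part of any Guix tooling.

```shell
# probe_https URL [SECONDS]
# Try to fetch URL, giving up after SECONDS (default 15).
# Hypothetical helper; relies on coreutils `timeout`, which exits with
# status 124 when the time limit is reached.
probe_https() {
    timeout "${2:-15}" wget --quiet --tries=1 -O /dev/null "$1"
}
```

It could be called as `probe_https https://ci.guix.gnu.org || echo "unreachable or stalled (status $?)"`. A status of 124 would point at a hang such as the one shown above rather than an immediate refusal.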

These machines are configured using ‘honeycomb-system’ from (sysadmin
honeycomb) in maintenance.git.

guix-daemon is configured to use the default substitute URLs,
https://ci.guix.gnu.org and https://bordeaux.guix.gnu.org, which we know
are unreachable.

I’ve theoretically addressed this here:


I tried to reconfigure those boxes with ‘guix deploy’, but this is
currently on hold because ci.guix has run out of inodes…

To be continued!

Ludo’.
Ricardo Wurmus wrote on 12 Jun 18:10 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)
871qvt2701.fsf@elephly.net
Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> (+Cc: guix-sysadmin)
>
> Tom Fitzhenry <tom@tom-fitzhenry.me.uk> skribis:
>
>> From following the builds on http://ci.guix.gnu.org/workers, many
>> (all?) builds are failing on the following workers:
>>
>> * grunewald
>> * kreuzberg
>> * pankow
>>
>> The builds are failing with the same error:
>>
>> "substitute: updating substitutes from 'https://ci.guix.gnu.org'...
>> 0.0%guix substitute: error: TLS error in procedure 'handshake': Error in
>> the pull function."
>
> On these machines, https://ci.guix.gnu.org (among others) is unavailable
> for some reason (firewall I guess):

They should be using the local IP instead of routing through the
internet, so /etc/hosts should contain an entry for

141.80.167.131 ci.guix.gnu.org

(We have the same entry on the other build nodes hosted at the MDC.)
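
Whether a node already carries such a mapping can be checked without changing anything. This is a rough sketch: the helper name `has_hosts_entry` is hypothetical, and on Guix System /etc/hosts is generated from the operating-system configuration, so this only verifies the result and is not a way to add the entry.

```shell
# has_hosts_entry FILE IP HOST
# Succeed if FILE (in /etc/hosts format) maps HOST to IP.
# Hypothetical helper for illustration only.
has_hosts_entry() {
    awk -v ip="$2" -v host="$3" '
        { sub(/#.*/, "") }              # ignore comments
        $1 == ip {
            for (i = 2; i <= NF; i++)   # remaining fields are host names
                if ($i == host) found = 1
        }
        END { exit !found }
    ' "$1"
}
```

For example, `has_hosts_entry /etc/hosts 141.80.167.131 ci.guix.gnu.org && echo present`.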

“guix deploy” did not work on these nodes due to a serious problem: they
were given *some* x86_64 binaries to execute, so deployed systems were
unbootable. Since we don’t have a serial interface through which you
could debug this remotely, please make sure not to deploy a broken
system. I’d like to avoid trips to the data centre.

--
Ricardo
Ludovic Courtès wrote on 12 Jun 22:22 +0200
(name . Ricardo Wurmus)(address . rekado@elephly.net)
87k09lvdh2.fsf@gnu.org
Ricardo Wurmus <rekado@elephly.net> skribis:

> They should be using the local IP instead of routing through the
> internet, so /etc/hosts should contain an entry for
>
> 141.80.167.131 ci.guix.gnu.org

Good idea.

> “guix deploy” did not work on these nodes due to a serious problem: they
> were given *some* x86_64 binaries to execute, so deployed systems were
> unbootable. Since we don’t have a serial interface through which you
> could debug this remotely, please make sure not to deploy a broken
> system. I’d like to avoid trips to the data centre.

Ooooh right, thanks for the reminder!

Ludo’.
Tom Fitzhenry wrote on 19 Jun 04:07 +0200
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 55848@debbugs.gnu.org)
875ykxcsnv.fsf@tom-fitzhenry.me.uk
Mathieu Othacehe <othacehe@gnu.org> writes:

Substitutes for aarch64 are a lot healthier now. Thanks Ludovic!

* kreuzberg is now successfully building and has been for a while.
* ci.guix.gnu.org has 41% of substitutes (a low percentage, but likely a
high percentage of toolchains). 0 jobs are queued, presumably because Cuirass
believes it's up to date. This should increase over time, as packages
are updated.
* bordeaux has 83.8% of substitutes.

A few issues remain for aarch64:

* grunewald and kreuzberg are not on https://ci.guix.gnu.org/workers.
Perhaps they were taken down while the substitute ratio was low to
avoid each worker independently recompiling expensive toolchains?
* rust@1.39.0 (and thus all of Rust) is missing from ci and bordeaux. I
had expected this would have been working. I'll take a look and raise
a separate issue.

$ ./pre-inst-env guix weather -s aarch64-linux -c2000
computing 15514 package derivations for aarch64-linux...
looking for 16265 store items on https://ci.guix.gnu.org...
https://ci.guix.gnu.org
41.0% substitutes available (6668 out of 16265)
at least 34188.1 MiB of nars (compressed)
45362.5 MiB on disk (uncompressed)
0.015 seconds per request (144.9 seconds in total)
66.2 requests per second

0.0% (0 out of 9597) of the missing items are queued
at least 1000 queued builds
aarch64-linux: 110 (11.0%)
powerpc64le-linux: 890 (89.0%)
build rate: 36.81 builds per hour
aarch64-linux: 17.23 builds per hour
x86_64-linux: 14.25 builds per hour
powerpc64le-linux: 1.01 builds per hour
i686-linux: 4.83 builds per hour
1871 packages are missing from 'https://ci.guix.gnu.org' for 'aarch64-linux', among which:
3479 rust@1.39.0 /gnu/store/xxlgndidxvhdd391k35vcmviixq5d9b0-rust-1.39.0-cargo /gnu/store/cfy1p8q4bwwy1i01cjfssfry21kpljz3-rust-1.39.0
2111 cairomm@1.14.2 /gnu/store/bxknxn3nbmmvavf537k0pggrynhrgsaf-cairomm-1.14.2-doc /gnu/store/3sn66mgr29v73zpp93c2v09a0rj87l3w-cairomm-1.14.2
2101 texlive-latex-pgf@59745 /gnu/store/l6jr7v8ygn3ybj4gxcwskf8ifsjcj6x1-texlive-latex-pgf-59745
looking for 16265 store items on https://bordeaux.guix.gnu.org...
https://bordeaux.guix.gnu.org
83.8% substitutes available (13624 out of 16265)
35138.6 MiB of nars (compressed)
109501.6 MiB on disk (uncompressed)
0.060 seconds per request (699.4 seconds in total)
16.7 requests per second
(continuous integration information unavailable)
579 packages are missing from 'https://bordeaux.guix.gnu.org' for 'aarch64-linux', among which:
3479 rust@1.39.0 /gnu/store/xxlgndidxvhdd391k35vcmviixq5d9b0-rust-1.39.0-cargo /gnu/store/cfy1p8q4bwwy1i01cjfssfry21kpljz3-rust-1.39.0
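
To track how coverage evolves between runs, the per-server percentages can be pulled out of this output. This is only a sketch: the function name `weather_coverage` is made up here, and it assumes the layout shown above, where a line holding just the server URL is followed by a "% substitutes available" line.

```shell
# weather_coverage: read `guix weather` output on stdin and print
# "<server-url> <percent>" for each server section.
# Hypothetical helper; it depends on the textual layout shown above,
# which is not a stable interface.
weather_coverage() {
    awk '
        # A line consisting only of a URL starts a server section.
        /^[ \t]*https?:\/\/[^ \t]+[ \t]*$/ { server = $1 }
        /% substitutes available/ {
            pct = $1
            sub(/%$/, "", pct)          # "41.0%" becomes "41.0"
            print server, pct
        }
    '
}
```

It could be used as `guix weather -s aarch64-linux | weather_coverage`, which for the run above would be expected to list 41.0 for ci.guix.gnu.org and 83.8 for bordeaux.guix.gnu.org.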



> Hello,
>
> The aarch64 workers were all idle whereas 70k builds were
> available. Once restarted, they started building again.
>
> The problem might be that when the server is unavailable for a while the
> worker connections expire and cannot be resumed once the server is
> available again.
>
> Thanks,
>
> Mathieu
Maxim Cournoyer wrote 6 days ago
(name . Tom Fitzhenry)(address . tom@tom-fitzhenry.me.uk)
878rps83ec.fsf@gmail.com
Hi Mathieu!

[...]

> A few issues remain for aarch64:
>
> * grunewald and kreuzberg are not on <https://ci.guix.gnu.org/workers>.
> Perhaps they were taken down while the substitute ratio was low to
> avoid each worker independently recompiling expensive toolchains?
> * rust@1.39.0 (and thus all of Rust) is missing from ci and bordeaux. I
> had expected this would have been working. I'll take a look and raise
> a separate issue.

That's a known issue with mrustc; it only succeeds with x86_64; the
other architectures have problems. That's a bug the mrustc author would
like to fix, so perhaps in time it will improve (especially if
interested parties can lend a hand).

There was also an attempt to cross-compile a rust/cargo bootstrap seed
for other architectures (branch: wip-cross-built-rust) but due to
complications with building rust as a static archive (it relies on
dynamic linking for its macro expand crates), the effort stalled.

Thanks,

Maxim
Tom Fitzhenry wrote 6 days ago
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
173fb399-9db1-40de-b8bc-662f1f1736d2@www.fastmail.com
On Mon, 20 Jun 2022, at 12:39 PM, Maxim Cournoyer wrote:
> That's a known issue with mrustc; it only succeeds with x86_64; the
> other architectures have problems. That's a bug the mrustc author would
> like to fix, so perhaps in time it will improve (especially if
> interested parties can lend a hand).

mrustc was fixed on aarch64 in https://issues.guix.gnu.org/54580 on staging, which was recently merged to master.

I had tested mrustc and rust-1.39 to compile on aarch64 on staging, but now I observe rust-1.39 failing.

I'll take a closer look, maybe I'm missing something.
Maxime Devos wrote 6 days ago
40c9de93c11d0b93a2df2b23ef6d1a4b56eeac0b.camel@telenet.be
Maxim Cournoyer schreef op zo 19-06-2022 om 22:39 [-0400]:
> There was also an attempt to cross-compile a rust/cargo bootstrap seed
> for other architectures (branch: wip-cross-built-rust) but due to
> complications with building rust as a static archive (it relies on
> dynamic linking for its macro expand crates), the effort stalled.

FWIW, has it been considered to cross-compile rust non-statically
(not as a seed, just as an input cross-compiled from another system)?
Doesn't help for people that cannot offload to x86_64 and don't have
substitutes from ci.guix.gnu.org or such enabled, but could still be an
improvement.

Greetings,
Maxime.


Maxim Cournoyer wrote 5 days ago
(name . Maxime Devos)(address . maximedevos@telenet.be)
87zgi67f9q.fsf@gmail.com
Hi Maxime,

Maxime Devos <maximedevos@telenet.be> writes:

> Maxim Cournoyer schreef op zo 19-06-2022 om 22:39 [-0400]:
>> There was also an attempt to cross-compile a rust/cargo bootstrap seed
>> for other architectures (branch: wip-cross-built-rust) but due to
>> complications with building rust as a static archive (it relies on
>> dynamic linking for its macro expand crates), the effort stalled.
>
> FWIW, has it been considered to cross-compile rust non-statically
> (not as a seed, just as an input cross-compiled from another system)?
> Doesn't help for people that cannot offload to x86_64 and don't have
> substitutes from ci.guix.gnu.org or such enabled, but could still be an
> improvement.

This already works, on the branch. One of the patches carried there
that made it possible has been merged upstream too. The issue is that
to offer a useful cross-compiled rust on non-x86_64 systems, you need to
move it from system domains; the clean way to do this is to archive a
static binary that depends on nothing else somewhere, and extract it in
a package for the target architecture.

Currently it's not cleanly self-contained because it still references
GCC libraries.

Maxim