Hydra: mozjs-60 builds on x86_64 and i686 seemingly get stuck

  • Done
  • quality assurance status badge
Details
3 participants
  • Efraim Flashner
  • Ludovic Courtès
  • Mark H Weaver
Owner
unassigned
Submitted by
Mark H Weaver
Severity
normal
Merged with
M
M
Mark H Weaver wrote on 21 Jan 2019 16:31
(address . bug-guix@gnu.org)
87zhruuiv6.fsf@netris.org
Yesterday on Hydra, I found both Intel mozjs-60 builds seemingly stuck
while exporting the source checkout to hydra.gnunet.org. One had been
going for ~22.5 hours, and the other for ~12 hours. I forcefully killed
them and restarted them. Now I see the same thing has happened on the
second attempt. Both builds have been seemingly stuck like this for
about 19 hours:


In both cases, the build logs are empty, and the hydra log ends with:

sending 1 store item to 'hydra.gnunet.org'...
exporting path `/gnu/store/j2sz7dg35vkcz38sim71jll2ix1nk554-mozjs-60.2.3-2-checkout'

Of course, it's possible that they're not really stuck, but that they're
merely taking a ridiculously long time to send the source checkout to
the build slave. My personal checkout of the mozilla-esr60 branch,
without the .hg directory, is about 2.1 gigabytes.

What do you think?

Mark
E
E
Efraim Flashner wrote on 21 Jan 2019 16:39
(name . Mark H Weaver)(address . mhw@netris.org)(address . 34157@debbugs.gnu.org)
20190121153947.GD11658@macbook41
On Mon, Jan 21, 2019 at 10:31:46AM -0500, Mark H Weaver wrote:
Toggle quote (24 lines)
> Yesterday on Hydra, I found both Intel mozjs-60 builds seemingly stuck
> while exporting the source checkout to hydra.gnunet.org. One had been
> going for ~22.5 hours, and the other for ~12 hours. I forcefully killed
> them and restarted them. Now I see the same thing has happened on the
> second attempt. Both builds have been seemingly stuck like this for
> about 19 hours:
>
> https://hydra.gnu.org/build/3342528
> https://hydra.gnu.org/build/3343511
>
> In both cases, the build logs are empty, and the hydra log ends with:
>
> sending 1 store item to 'hydra.gnunet.org'...
> exporting path `/gnu/store/j2sz7dg35vkcz38sim71jll2ix1nk554-mozjs-60.2.3-2-checkout'
>
> Of course, it's possible that they're not really stuck, but that they're
> merely taking a ridiculously long time to send the source checkout to
> the build slave. My personal checkout of the mozilla-esr60 branch,
> without the .hg directory, is about 2.1 gigabytes.
>
> What do you think?
>
> Mark
>
12 hours is far too long for it to tie up a build slave, sending code or
not. Being silent that long doesn't trigger the auto-kill?

--
Efraim Flashner <efraim@flashner.co.il> ????? ?????
GPG key = A28B F40C 3E55 1372 662D 14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted
-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEoov0DD5VE3JmLRT3Qarn3Mo9g1EFAlxF578ACgkQQarn3Mo9
g1Fjyg/+KLxi3DQijzrjUc+eZRmbS/Sy5V9Y6Mh6li8/6xKrdsYP6WwSoQIPiuo3
Sbd+1wW98Naux/REQKM+kjlEEI2oBma5gv45A30TpicqRfaMIL+RAXDgTbdJtw5e
8i+hhlk7w5njn/4xtVjW+bOgEDm3Zj3nJEWBFtP9wSeAiF+UzjwAZ2pIXMZKmRVT
Kr3z+Az7XaCcuTWkOEGe+2T4lcMe1hOF6EpYoIx28uq/2VHFMg6/zo19F7XSgy4u
2BspFAB+u96Vv10gWipzJ6+DBydstmjqZFcmNarBq5YBwLh/EKX+S2u88wB6Iiko
jYZfpkP03sO5Yv4qlrBh5umvez9+o4vrDkO8OZVfS3PA2ukP3PMlrQkGiSz98pwN
hMYyJdCTFTp/GECpVfMfUMcizj57YdusXtYU2W/QY4Z/4QinGbgzhoaHFGsi6+hk
rCRbjEoAoX8I5l7QnrMq/nJEonQ30e/7YX01/We0St7BlTokHSsbkWwgcE+YBjBD
Hbe622B+at1ryLVeDR00QNGlqcLEIlrdoAMLqoYoeUbqniYaihZpXselsDdjk3mo
umr5v2y7Hqt9mxT5PuBh58S6vwRegLirvVG+58fLaggxSKRrR7/hcS+A7B3blQGc
yTqcw6+Vlo56EUhf5zLUBdHSYnszy1QteL1PCRXUCKilxmO++eE=
=MEwG
-----END PGP SIGNATURE-----


M
M
Mark H Weaver wrote on 22 Jan 2019 03:54
(name . Efraim Flashner)(address . efraim@flashner.co.il)(address . 34157@debbugs.gnu.org)
87sgxlv1td.fsf@netris.org
Efraim Flashner <efraim@flashner.co.il> writes:

Toggle quote (28 lines)
> On Mon, Jan 21, 2019 at 10:31:46AM -0500, Mark H Weaver wrote:
>> Yesterday on Hydra, I found both Intel mozjs-60 builds seemingly stuck
>> while exporting the source checkout to hydra.gnunet.org. One had been
>> going for ~22.5 hours, and the other for ~12 hours. I forcefully killed
>> them and restarted them. Now I see the same thing has happened on the
>> second attempt. Both builds have been seemingly stuck like this for
>> about 19 hours:
>>
>> https://hydra.gnu.org/build/3342528
>> https://hydra.gnu.org/build/3343511
>>
>> In both cases, the build logs are empty, and the hydra log ends with:
>>
>> sending 1 store item to 'hydra.gnunet.org'...
>> exporting path `/gnu/store/j2sz7dg35vkcz38sim71jll2ix1nk554-mozjs-60.2.3-2-checkout'
>>
>> Of course, it's possible that they're not really stuck, but that they're
>> merely taking a ridiculously long time to send the source checkout to
>> the build slave. My personal checkout of the mozilla-esr60 branch,
>> without the .hg directory, is about 2.1 gigabytes.
>>
>> What do you think?
>>
>> Mark
>>
> 12 hours is far too long for it to tie up a build slave, sending code or
> not.

Those two builds are still occupying build slots. As I write this,
they've been running for over 30 hours.

I was curious whether the transfers were actually happening, even if
slowly, so I looked at 'netstat' output:

Toggle snippet (6 lines)
root@20121227-hydra:~# netstat --inet --program | grep net.in.tum
tcp 0 0 20121227-hydra.gn:58007 hydra.net.in.tum.de:ssh ESTABLISHED 18774/guile
tcp 0 0 20121227-hydra.gn:42586 hydra.net.in.tum.de:ssh ESTABLISHED 10042/guile
tcp 0 0 20121227-hydra.gn:56413 hydra.net.in.tum.de:ssh ESTABLISHED 16236/guile

There are currently three builds allocated to hydra.gnunet.org
(a.k.a. hydra.net.in.tum), so it appears that all three ssh connections
are still active. However, even after repeating this command many
times, I've never seen a non-zero "Send-Q" value. This suggests that no
data is actually being sent, but that it's stuck waiting for something.

I'll leave these builds alone for now, in case Ludovic wants to
investigate further.

Toggle quote (2 lines)
> Being silent that long doesn't trigger the auto-kill?

I guess that the usual timeouts do not apply to file transfers performed
before the actual build takes place.

Thanks,
Mark
L
L
Ludovic Courtès wrote on 22 Jan 2019 14:24
(name . Mark H Weaver)(address . mhw@netris.org)
878szcome4.fsf@gnu.org
Hi Mark,

Mark H Weaver <mhw@netris.org> skribis:

Toggle quote (17 lines)
> Those two builds are still occupying build slots. As I write this,
> they've been running for over 30 hours.
>
> I was curious whether the transfers were actually happening, even if
> slowly, so I looked at 'netstat' output:
>
> root@20121227-hydra:~# netstat --inet --program | grep net.in.tum
> tcp 0 0 20121227-hydra.gn:58007 hydra.net.in.tum.de:ssh ESTABLISHED 18774/guile
> tcp 0 0 20121227-hydra.gn:42586 hydra.net.in.tum.de:ssh ESTABLISHED 10042/guile
> tcp 0 0 20121227-hydra.gn:56413 hydra.net.in.tum.de:ssh ESTABLISHED 16236/guile
>
> There are currently three builds allocated to hydra.gnunet.org
> (a.k.a. hydra.net.in.tum), so it appears that all three ssh connections
> are still active. However, even after repeating this command many
> times, I've never seen a non-zero "Send-Q" value. This suggests that no
> data is actually being sent, but that it's stuck waiting for something.

Weird.

Toggle quote (3 lines)
> I'll leave these builds alone for now, in case Ludovic wants to
> investigate further.

I think you can terminate them as I’d rather not commit to investigate
further now.

I believe hydra.gnu.org is still running a rather old
guix-daemon/offload, right? We should upgrade to the latest and
greatest to make sure we’re after a bug that’s still present.

Thanks,
Ludo’.
M
M
Mark H Weaver wrote on 9 Apr 2019 03:06
Re: bug#35181: Hydra offloads often get stuck while exporting build requisites
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 35181@debbugs.gnu.org)
87k1g456nc.fsf@netris.org
merge 35181 34157
thanks

I looked more closely at the 'mozjs-60' failures, and I'm convinced that
it's an instance of the same problem that's currently affecting these
large font builds.

Mozjs-60 was pushed to the master branch on 2019-01-18. It has _never_
successfully built on x86_64 or i686, although all builds were
successful on armhf. See below for the complete list of build attempts
of mozjs-60 on Hydra.

Also of note: So far, all known instances of this problem have occurred
while transferring a large directory, as opposed to a tarball.

We have several packages with source tarballs _much_ larger than these
problematic source checkouts, and which are updated more much
frequently, and yet I've *never* seen an instance of this problem while
exporting a plain file to a build slave. For example, the upstream
IceCat and Firefox ESR tarballs are ~270 megabytes compressed, whereas
'font-google-material-design-icons-3.0.1' source is only ~176 megabytes
_uncompressed_.

I have no explanation for why the superficial form of the store item
should matter here, but maybe it's a clue. I know that plain
non-executable files in the store are handled somewhat differently in
the Nix model than directories or executable files, the latter
associated with the word "recursive", and requiring an additional layer
of encoding for purposes of serialization, but I'm not sufficiently
familiar with the details or relevant code.

Ludovic, can you think of a reason why the file/directory distinction
could be relevant to this issue?

Finally: the problem seems to have been introduced into Hydra sometime
between September 2018 and January 2019. September 2018 is when the
last successful build of the problematic font packages was performed,
and January 2019 is the first known instance of the problem. I do not
currently know of any relevant data points in that time range. The last
'core-updates' merge into 'master' was on December 3rd.

Mark


PS: Here's the complete history of 'mozjs-60' build attempts on Hydra:
First are the 'armhf' attempts, followed by i686, and x86_64. Note
that the two armhf aborts happened after only 2 seconds, and surely
had a different cause than this issue.

Toggle snippet (30 lines)
hydra=> select case when s.machine~'^(hydra|guix)\.' then s.machine else substring(s.machine from '^[^.]*') end as machine, jobset, s.build, s.stepnr as step, case when s.busy=1 then 'busy' when s.status=0 then NULL when s.status=1 then 'fail' when s.status=4 then 'abort' when s.status=7 then 'timeout' when s.status=8 then 'cfail' else '?' end as stat, regexp_replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what, date_trunc('second', to_timestamp(s.stoptime)) as finished, date_trunc('second', to_timestamp(s.stoptime) - to_timestamp(s.starttime)) as duration from builds b, buildsteps s where b.id=s.build and b.job='mozjs-60.2.3-2.armhf-linux' order by s.stoptime;
machine | jobset | build | step | stat | what | finished | duration
--------------+--------+---------+------+-------+-------------------------+------------------------+----------
hydra-slave2 | master | 3342804 | 1 | | mozjs-60.2.3-2-checkout | 2019-01-19 12:58:52+00 | 00:23:55
hydra-slave2 | master | 3342804 | 2 | | mozjs-60.2.3-2 | 2019-01-19 15:49:37+00 | 02:50:42
| master | 3367975 | 1 | abort | mozjs-60.2.3-2 | 2019-02-13 00:03:58+00 | 00:00:02
| master | 3367975 | 2 | abort | mozjs-60.2.3-2 | 2019-02-15 15:35:45+00 | 00:00:02
hydra-slave3 | master | 3367975 | 3 | | mozjs-60.2.3-2 | 2019-02-18 16:38:08+00 | 02:46:42
(5 rows)

hydra=> select case when s.machine~'^(hydra|guix)\.' then s.machine else substring(s.machine from '^[^.]*') end as machine, jobset, s.build, s.stepnr as step, case when s.busy=1 then 'busy' when s.status=0 then NULL when s.status=1 then 'fail' when s.status=4 then 'abort' when s.status=7 then 'timeout' when s.status=8 then 'cfail' else '?' end as stat, regexp_replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what, date_trunc('second', to_timestamp(s.stoptime)) as finished, date_trunc('second', to_timestamp(s.stoptime) - to_timestamp(s.starttime)) as duration from builds b, buildsteps s where b.id=s.build and b.job='mozjs-60.2.3-2.i686-linux' order by s.stoptime;
machine | jobset | build | step | stat | what | finished | duration
---------+--------+---------+------+-------+----------------+------------------------+-----------------
| master | 3343511 | 1 | abort | mozjs-60.2.3-2 | 2019-01-20 20:05:16+00 | 12:11:12
| master | 3343511 | 2 | abort | mozjs-60.2.3-2 | 2019-01-23 01:52:01+00 | 2 days 05:42:55
| master | 3360985 | 1 | abort | mozjs-60.2.3-2 | 2019-02-15 19:59:42+00 | 09:31:25
| master | 3360985 | 2 | abort | mozjs-60.2.3-2 | 2019-02-16 17:37:06+00 | 05:57:15
| master | 3360985 | 3 | abort | mozjs-60.2.3-2 | 2019-02-17 17:39:49+00 | 16:06:14
| master | 3360985 | 4 | abort | mozjs-60.2.3-2 | 2019-03-03 21:50:48+00 | 00:02:19
(6 rows)

hydra=> select case when s.machine~'^(hydra|guix)\.' then s.machine else substring(s.machine from '^[^.]*') end as machine, jobset, s.build, s.stepnr as step, case when s.busy=1 then 'busy' when s.status=0 then NULL when s.status=1 then 'fail' when s.status=4 then 'abort' when s.status=7 then 'timeout' when s.status=8 then 'cfail' else '?' end as stat, regexp_replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what, date_trunc('second', to_timestamp(s.stoptime)) as finished, date_trunc('second', to_timestamp(s.stoptime) - to_timestamp(s.starttime)) as duration from builds b, buildsteps s where b.id=s.build and b.job='mozjs-60.2.3-2.x86_64-linux' order by s.stoptime;
machine | jobset | build | step | stat | what | finished | duration
---------+--------+---------+------+-------+----------------+------------------------+-----------------
| master | 3342528 | 1 | abort | mozjs-60.2.3-2 | 2019-01-20 20:04:50+00 | 22:25:28
| master | 3342528 | 2 | abort | mozjs-60.2.3-2 | 2019-01-23 01:51:48+00 | 2 days 05:19:35
| master | 3366691 | 1 | abort | mozjs-60.2.3-2 | 2019-02-17 17:39:59+00 | 09:21:24
(3 rows)

?