Offloaded builds can get stuck indefinitely due to network issues

Open

2 participants

Ludovic Courtès
Mark H Weaver

Owner: unassigned

Submitted by: Mark H Weaver

Severity: normal

Mark H Weaver wrote 6 years ago

Recipients:(address . bug-guix@gnu.org)

Message-ID:87a7m8xs42.fsf@netris.org

I just discovered that 4 out of 5 armhf build slots on Hydra have been
stuck for 24 hours, apparently after the network connections to the
build slaves were lost, possibly due to a temporary network outage.

I've seen this kind of thing happen periodically since we switched to
using guile-ssh for offloaded builds.

On Hydra I can monitor the builds and investigate when a given build
seems to be taking far too long, and I can kill those jobs to free up
the build slots. There's no way to kill the builds from Hydra's web
interface, but I can kill them manually by logging into Hydra.

This might become a more serious problem on Berlin, as we add ARM build
slaves that are not on the same local network as Berlin itself, until
the web interface allows for this kind of monitoring and intervention.

Mark

Ludovic Courtès wrote 6 years ago

Recipients:(name . Mark H Weaver)(address . mhw@netris.org)(address . 33410@debbugs.gnu.org)

Message-ID:87efbjokcz.fsf@gnu.org

Hello,

Mark H Weaver <mhw@netris.org> skribis:

> I just discovered that 4 out of 5 armhf build slots on Hydra have been
> stuck for 24 hours, apparently after the network connections to the
> build slaves were lost, possibly due to a temporary network outage.
>
> I've seen this kind of thing happen periodically since we switched to
> using guile-ssh for offloaded builds.

Which guix-daemon version is hydra running?

Commit a708de151c255712071e42e5c8284756b51768cd adds a safeguard to make
sure timeouts are honored, though there might be some cases where it
doesn’t quite work as expected (I suspect libssh handles EINTR
internally by looping, in which case our signal handling async doesn’t
get a chance to run.)
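
The suspected failure mode can be illustrated with a toy model. This is only a sketch in Python, not Guile or libssh code: `Runtime`, `retrying_read`, `surfacing_read` and `guarded_read` are hypothetical names, and the model assumes a runtime that queues signal handlers and runs them only once control returns from native code. Each `InterruptedError` stands for one EINTR on a dead connection.

```python
class Runtime:
    """Toy runtime that defers signal handlers until native code returns."""

    def __init__(self):
        self.pending = []                  # handlers queued by signal delivery
        self.attempts = 0                  # raw read attempts made so far
        self.attempts_when_handled = None  # attempts seen when the handler ran

    def deliver_alarm(self):
        # The timeout signal arrives: the handler is queued, not run.
        self.pending.append(self._on_alarm)

    def _on_alarm(self):
        self.attempts_when_handled = self.attempts

    def run_pending(self):
        # Only reachable while the runtime itself is executing.
        while self.pending:
            self.pending.pop(0)()

    def raw_read(self):
        # A read on a dead connection: it never yields data, it is only
        # ever interrupted. The first interruption is the timeout alarm.
        self.attempts += 1
        if self.attempts == 1:
            self.deliver_alarm()
        raise InterruptedError


def retrying_read(rt, max_steps):
    """Native call that swallows EINTR and retries internally."""
    for _ in range(max_steps):
        try:
            return rt.raw_read()
        except InterruptedError:
            continue
    return None  # simulation bound; a real call would keep blocking


def surfacing_read(rt):
    """Native call that lets EINTR propagate to its caller."""
    return rt.raw_read()


def guarded_read(rt, native_read):
    """Call native code, then run deferred handlers once control is back."""
    try:
        return native_read(rt)
    except InterruptedError:
        return None
    finally:
        rt.run_pending()


looping = Runtime()
guarded_read(looping, lambda rt: retrying_read(rt, max_steps=1000))

surfacing = Runtime()
guarded_read(surfacing, surfacing_read)

print(looping.attempts_when_handled, surfacing.attempts_when_handled)
```

When EINTR is surfaced, the timeout handler runs after the first interrupted read. With the internal retry loop it runs only once the artificial bound of 1000 attempts ends the loop; without such a bound, as in a real blocking call, it would never run.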

> On Hydra I can monitor the builds and investigate when a given build
> seems to be taking far too long, and I can kill those jobs to free up
> the build slots. There's no way to kill the builds from Hydra's web
> interface, but I can kill them manually by logging into Hydra.
>
> This might become a more serious problem on Berlin, as we add ARM build
> slaves that are not on the same local network as Berlin itself, until
> the web interface allows for this kind of monitoring and intervention.

The current situation on berlin is suboptimal: I run ‘guix processes’
when I suspect something is wrong, and that’s how I found out about
https://issues.guix.info/issue/33239.

Thanks,
Ludo’.
