Stuck builds in Cuirass

  • Open
Details
2 participants
  • Marius Bakke
  • Mathieu Othacehe
Owner
unassigned
Submitted by
Marius Bakke
Severity
normal
Marius Bakke wrote on 23 Nov 2022 13:50
(address . bug-guix@gnu.org)
87tu2pzvfo.fsf@gnu.org
Hi,

Cuirass has a tendency to not notice when a build is finished, leaving
it in a "running" state.

The phenomenon can be observed by going to
https://ci.guix.gnu.org/status and looking at builds that have been
running for a suspiciously long time.

Typically the build log will indicate that it has finished, yet Cuirass
is patiently waiting...and not scheduling further builds.

Restarting the builds typically gets things going again.

I wrote a nasty script to automatically restart builds that are running
for >1 hour, but it's not a sustainable solution:
#!/usr/bin/env python3

# Restart stuck builds.... TODO fix cuirass properly.

import re
import time

import requests
from bs4 import BeautifulSoup

# The definition of builds_page is missing from the archived message;
# the status page mentioned above is assumed here.
builds_page = "https://ci.guix.gnu.org/status"

builds_html = requests.get(builds_page).text

soup = BeautifulSoup(builds_html, "html5lib")
main = soup.find('main', {'id': 'content'})
table = main.find('table')

result = {}

# Collect every build listed in the status table, keyed by build id.
for row in table.find_all('tr'):
    data = row.find_all('td')
    if len(data) > 0:
        build_id = row.find('a').contents[0]
        name = data[0].contents[0]
        age = data[1].contents[0]
        system = data[2].contents[0]
        log = data[3]

        result[build_id] = {'name': name, 'age': age, 'system': system}

age_re = re.compile(r"(\d+) (\w+) ago")
restart = []

# Select builds whose age is reported in hours, or as more than 60 minutes.
for id in result.keys():
    age = result[id]['age']
    match = age_re.match(age)
    if match is not None:  # None for ages without a number, e.g. "seconds ago"
        digits = match.group(1)
        time_unit = match.group(2)
        if time_unit == "hours":
            restart.append(id)
        elif time_unit == "minutes" and int(digits) > 60:
            restart.append(id)

certificate_file = "/home/marius/tmp/mbakke.cert.pem"
certificate_key = "/home/marius/tmp/mbakke.key.pem"

print(f"Found {len(restart)} stuck builds..!")

for id in restart:
    print(f"Going to restart {result[id]['name']} ({id}, running since {result[id]['age']})...")
    # The start of the restart request is missing from the archived message;
    # only its trailing argument survives:
    #     cert=(certificate_file, certificate_key))
    time.sleep(3)
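
The age check in the script depends on the exact unit word shown on the status page. A minimal sketch of a more explicit check follows, assuming ages always look like "<number> <unit> ago". The names parse_age, is_stuck and UNIT_SECONDS are hypothetical helpers, not part of Cuirass or of the script above, and the list of units is a guess at what the page can display.

```python
import re
from datetime import timedelta

# Assumed age format: "<number> <unit> ago", e.g. "2 hours ago".
AGE_RE = re.compile(r"(\d+) (\w+) ago")

# Seconds per unit; the unit names are assumptions about the status page.
UNIT_SECONDS = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
}


def parse_age(text):
    """Return the age as a timedelta, or None if the text is not recognised."""
    match = AGE_RE.match(text.strip())
    if match is None:
        return None
    unit = match.group(2).rstrip("s")  # "hours" -> "hour"
    if unit not in UNIT_SECONDS:
        return None
    return timedelta(seconds=int(match.group(1)) * UNIT_SECONDS[unit])


def is_stuck(text, threshold=timedelta(hours=1)):
    """True if the age parses and is strictly greater than the threshold."""
    age = parse_age(text)
    return age is not None and age > threshold
```

With this, the selection loop could test is_stuck(result[id]['age']) instead of comparing unit strings. Ages without a number, such as "seconds ago", are treated as not stuck, and so is "1 hour ago", because the comparison is strict.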

Mathieu Othacehe wrote on 23 Nov 2022 14:26
(name . Marius Bakke)(address . marius@gnu.org)(address . 59514@debbugs.gnu.org)
874jupx0md.fsf@gnu.org
Hello Marius,

> Cuirass has a tendency to not notice when a build is finished, leaving
> it in a "running" state.
>
> The phenomenon can be observed by going to
> <https://ci.guix.gnu.org/status> and look at builds that are running for
> a suspiciously long time.

I suspect this is caused by https://issues.guix.gnu.org/59510 which
causes the worker threads to bail out.

We can probably merge those two issues. The
/var/log/cuirass-remote-server.log file on Berlin also indicates when
the build-succeeded or build-failed message is received by the server,
and how long the fetch from the worker took.

Thanks,

Mathieu