Topological sorting in cuirass

  • Open
  • quality assurance status badge
Details
3 participants
  • Andreas Enge
  • Dr. Arne Babenhauserheide
  • Ludovic Courtès
Owner
unassigned
Submitted by
Andreas Enge
Severity
normal
A
A
Andreas Enge wrote on 10 May 2023 12:06
(address . bug-guix@gnu.org)
ZFtspPexmg3YM/ug@jurong
This is a wishlist bug, but it is important for architectures where we
are currently short on build power, and where this issue can stall builds
and waste an arbitrary amount of build power.

Cuirass should sort builds and only offload derivations for which all
inputs are available.

In my current understanding, cuirass offloads arbitrary derivations, and
the machine to which they are offloaded then starts building recursively
all inputs. If this is true, then it is possible that at some point in time,
all build slots are taken by the same package built as many times as there
are machines; I have seen something like this when working on core-updates,
where several machines were building the main gcc compiler at the same time.
At worse, if cuirass asks every machine to build a leaf package, this may
result in a simultaneous full bootstrap on all of them.

The situation becomes worse when the package in question fails. Then as
I understand it, each machine may receive a request to build something
depending on the failing package and try the failing build and thus waste
build power that will not be available to build other packages successfully.

Solving this problem may also make reports of build failures more accurate
and legible. For instance, doxygen currently fails to build on aarch64:
and is reported as "Failed", and not as "Failed (dependency)".
However, looking at the build log
shows this:
...
building path(s) `/gnu/store/p5vqrwywz053r1vkiyw54dp9gj7vw9xd-ninja-1.11.1'
...
builder for `/gnu/store/0zf7fqndzf2k595r4s6wblmpccdwr3nx-ninja-1.11.1.drv' failed with exit code 1
@ build-failed /gnu/store/0zf7fqndzf2k595r4s6wblmpccdwr3nx-ninja-1.11.1.drv - 1 builder for `/gnu/store/0zf7fqndzf2k595r4s6wblmpccdwr3nx-ninja-1.11.1.drv' failed with exit code 1
cannot build derivation `/gnu/store/hlscqram59id51hxg0fj15041v52h1kw-meson-1.1.0.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/w8qxkrwpffd9qs5w1jggy1yi27ycm0xr-jsoncpp-1.9.5.drv': 2 dependencies couldn't be built
cannot build derivation `/gnu/store/mss4yv015cil1vnjnglq506m83b7n3dy-cmake-bootstrap-3.24.2.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/w0irp6xn30nlmpizhcbjnvhqmsba41jn-cmake-minimal-3.24.2.drv': 2 dependencies couldn't be built
cannot build derivation `/gnu/store/rqk2rbnpjpcnqswz8hqari1rnw6r8v1m-doxygen-1.9.5.drv': 1 dependencies couldn't be built

So it is indeed a different package that fails (and the last few lines
give a list of dependencies between ninja and doxygen, each of which may
or may not fail once ninja is fixed).

Notice that this could be solved without a topological sorting of the
dependency graph: It would be enough to keep an array deriv in which
deriv[i] contains a list of derivations requiring i more inputs to be built,
together with the list of inputs; elements in deriv[0] are ready to be sent
to a build machine, and upon completion of a build, all derivations
depending on it should be moved from deriv[i] to deriv[i-1] if the input
has been built successfully, or marked as "Failed (dependency)" if the
input has failed. (But this could be expensive, and may require appropriate
data structures.)

Alternatively, build jobs could be sorted topologically and then be kept
in a list; then before sending out a job, all its inputs have been tried
to be built; the job should then be sent if all inputs are available, or
be marked as "Failed (dependency)" if any of them has failed.

Andreas
D
D
Dr. Arne Babenhauserheide wrote on 10 May 2023 15:59
(name . Andreas Enge)(address . andreas@enge.fr)
87r0ro9tdt.fsf@web.de
Andreas Enge <andreas@enge.fr> writes:

Toggle quote (2 lines)
> Cuirass should sort builds and only offload derivations for which all
> inputs are available.
...
Toggle quote (5 lines)
> Alternatively, build jobs could be sorted topologically and then be kept
> in a list; then before sending out a job, all its inputs have been tried
> to be built; the job should then be sent if all inputs are available, or
> be marked as "Failed (dependency)" if any of them has failed.

If you want to try this out quickly, you could use code from the
topological sorting SRFI I’m slowly finalizing:




edgelist->graph should do the conversion from inputs per package to the
input format.

Best wishes,
Arne
--
Unpolitisch sein
heißt politisch sein,
ohne es zu merken.
draketo.de
-----BEGIN PGP SIGNATURE-----

iQJEBAEBCAAuFiEE801qEjXQSQPNItXAE++NRSQDw+sFAmRbo/4QHGFybmVfYmFi
QHdlYi5kZQAKCRAT741FJAPD61HYEAC0cngpr3CGwz0WIR08pCPoDJ4aqQZ5hLMJ
NWCNS/UEa+svUUe8M3RTD0ULLu6bdPVIqapWxmU6eiCAWRR/Sw0yC//RwAHrrNEm
HsWT9/yX44upkyG4jD1UI04FVo/4anOi87tcPVxoUl+07g3NTs4I682W/nv0WxDY
8OURhIwMe3eeHF7bSX9cLzx6QbfmUwP0Z6uwi7ZTyzOat8FuJNqLIF13WofMj3N3
osKQerCADYZrpWMeiFkHrBIFF2n0Bx61SwVrWkZJJt6CDf2lUQhjEOhZtPKRDCsW
8299sdPs6NY2bcsRAhi4I2ImFVhVdL15oFvk/Z0eYu180J1cPedvjgvxHKjnYGSH
uvKWv6Yl8F7K1wNmkkV80OJwJfbbfYWE+YMoPAcMwBwCCfQh9HaErOF58tfoL18a
W/xOZ1x80YC1zmdFS85FSzn6DCwK+OJXOUPiAgMwrpVWq6jKvEST17Lt40XTEZWG
UM2vKd6anTMwdAYRjsioBDOHiV5EQCtK4gtCIjoodIu8UtZos1xW91GQ+jFYFFCy
aia90PgwLKBDjQ8fWDFQ8GF2lo5CYGKk+dvXfc3SruzpRTJAunPT/X5utNcCjvV8
0NsOdFpdNE5u5C6snAwJQChGSI+5UDNw1rUI/oXEIniQaEuwB+wCiuYSeCsbN5+C
WCVJqBjbK4jEBAEBCAAuFiEE3Si95tmHXKvOSosd3M8NswvBBUgFAmRbo/4QHGFy
bmVfYmFiQHdlYi5kZQAKCRDczw2zC8EFSCYSA/9dFjzglwj8/z4++auf9c0iHIG6
ve33HOtv6FuZd8Uqs4lOZed7dpIjhXaIkPPApxtRwF3qacs7BbY0VZlx+V8+z0US
fANOg45cha9eJJchXfQWWwussE/VePM5hY3I5gUZo6UaL0Pjrw4gEhJnHIhFxV0c
mq+xD9Tiiac9roJsUw==
=6oiM
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 29 Oct 2023 18:01
(name . Andreas Enge)(address . andreas@enge.fr)(address . 63412@debbugs.gnu.org)
87jzr5747z.fsf@gnu.org
Hi,

Andreas Enge <andreas@enge.fr> skribis:

Toggle quote (7 lines)
> This is a wishlist bug, but it is important for architectures where we
> are currently short on build power, and where this issue can stall builds
> and waste an arbitrary amount of build power.
>
> Cuirass should sort builds and only offload derivations for which all
> inputs are available.

Cuirass has two build backends: talking to the build daemon, and using
the ZeroMQ-based “remote worker” protocol. In practice we use the
latter, which fixes scalability issues with the former.

The worker protocol implements work stealing: workers periodically send
messages to “remote server” asking for work. Said server replies with a
derivation for one of the systems the worker supports; that derivation
is chosen among the “builds” whose dependencies have all been
successfully built.

And here’s the trick: the server is doing the right thing, but it has a
partial view. Namely, the server sees “builds” rather than
“derivations”. “Builds” are the things explicitly declared in the
jobset. If you have a declared build for GCC, but no build for MPC and
MPFR, then the server will consider that GCC has zero non-built
dependencies, even though MPFR and MPC may still need to be built.

Having ‘remote-worker’ operate on derivations rather than builds would
address this impedance mismatch, though there are complications (there
are bits of the database schema that amalgamate build/derivation).

Food for thought!

Ludo’.
?