cuirass: Contention while registering new builds.

  • Done
  • quality assurance status badge
Details
2 participants
  • Ludovic Courtès
  • Mathieu Othacehe
Owner
unassigned
Submitted by
Mathieu Othacehe
Severity
normal
M
M
Mathieu Othacehe wrote on 22 Sep 2020 18:52
(address . bug-guix@gnu.org)
87imc5zrc8.fsf@gnu.org
Hello,

The attached screenshot shows that 9 evaluations are currently "In
progress" for "guix-master" specification.

Evaluations 16725 to 16738 are completed, 7 hours ago and 56 minutes ago
respectively. They are still shown as "In progress" because the
"register" phase in "build-packages" is not over. It is also possible to
see that new builds are registered for those evaluations, but very
slowly.

As each evaluation has to register around 50k builds, there might be
some sort of "writing" contention on the database, explaining the long
build registration time.

Thanks,

Mathieu
--
Attachment: cuirass.png
M
M
Mathieu Othacehe wrote on 28 Sep 2020 09:51
(address . 43564-done@debbugs.gnu.org)
87k0wee3ud.fsf@gnu.org
Hello,

Toggle quote (4 lines)
> As each evaluation has to register around 50k builds, there might be
> some sort of "writing" contention on the database, explaining the long
> build registration time.

Turns out, once an evaluation is over, new builds are registered. This
registration tries to insert a new build for each derivation returned by
the evaluation phase. If the new build does not add a new output, the
insertion query is then rollbacked. This means that there are at least
as many insertion queries as new derivations.

SQlite allows at most one writer at a time, and even though we are using WAL
mode, performing a lot of insertions will reduce the reading performances.
When multiple evaluations are performed in parallel, the large number of
concurrent insertion queries also causes contention.

Having 10 pending evaluations means that we are probably trying to
insert around 500.000 records concurrently. This is something that
SQLite does not seem to be designed for.

To avoid those issues, we need to check first in the "Outputs" table
which derivations are already registered. This means that most of the
insertion queries will be replaced by reading queries, that are much
less expensive and more suitable for Cuirass concurrent implementation.

In the best case of one pending evaluation, the registration duration is
reduced from 1800 seconds to 320 seconds. I think that the gain is way
larger when there are multiple pending evaluations.

Pushed a fix as 461e07e14e1c8013343c0a2cb26c0e022e10d5e4.

Thanks,

Mathieu
Closed
M
M
Mathieu Othacehe wrote on 28 Sep 2020 10:17
(address . 43564@debbugs.gnu.org)
87d026e2n5.fsf@gnu.org
For future reference, here's a small addition. As I said, we are using
the Cuirass SQLite database in WAL mode. This means that insertion
queries are added to a separate cuirass.db-wal file. Once in a while,
this file is supposed to be merged in the main cuirass.db file.

This mechanism known as checkpointing is supposed to occur automatically
once 4MB of data are added to the WAL file. On berlin, the WAL file of
Cuirass is 9.4M and the Guix WAL file is almost 1G.

As reading from the WAL file is less effective, and checkpointing a
large file will end-up locking the database for some times, we should
definitely try to understand why the Guix WAL file get so big.

Thanks,

Mathieu
M
M
Mathieu Othacehe wrote on 29 Sep 2020 17:02
(address . 43564@debbugs.gnu.org)
87h7rglj7t.fsf@gnu.org
Hello,

Toggle quote (6 lines)
> In the best case of one pending evaluation, the registration duration is
> reduced from 1800 seconds to 320 seconds. I think that the gain is way
> larger when there are multiple pending evaluations.
>
> Pushed a fix as 461e07e14e1c8013343c0a2cb26c0e022e10d5e4.

While this improves the situation, another evaluation contention episode
occurred this afternoon (evaluations 16920 to 16938).

Even though reading queries are cheaper, I guess that when 10 fibers try
to execute 50k read queries, there's some contention in the
communication with the 4 database workers.

I'm trying locally to spawn database workers dedicated to registration,
that execute the evaluation queries in a single batch. Let's see if it
helps.

Thanks,

Mathieu
M
M
Mathieu Othacehe wrote on 4 Oct 2020 13:12
(address . 43564@debbugs.gnu.org)
87r1qep7m9.fsf@gnu.org
Hello,

Toggle quote (4 lines)
> I'm trying locally to spawn database workers dedicated to registration,
> that execute the evaluation queries in a single batch. Let's see if it
> helps.

I added 4 registration dedicated database workers and removed from
registration all accesses to the store. Now the evaluation returns
objects that contain all required information.

This seem to improve the situation quite a lot. For example, those three
registrations were done simultaneously in ~200 seconds:

Toggle snippet (14 lines)
2020-10-04T12:46:19 registration took 205.429772 seconds
2020-10-04T12:46:19 evaluation 17099 registered 156 new derivations
2020-10-04T12:46:19 building 156 derivations in batches of 200
2020-10-04T12:46:19 building batch of 200 derivations (0/156)
2020-10-04T12:46:19 registration took 212.892482 seconds
2020-10-04T12:46:19 evaluation 17105 registered 164 new derivations
2020-10-04T12:46:19 building 164 derivations in batches of 200
2020-10-04T12:46:19 building batch of 200 derivations (0/164)
2020-10-04T12:46:19 registration took 205.009416 seconds
2020-10-04T12:46:19 evaluation 17102 registered 160 new derivations
2020-10-04T12:46:19 building 160 derivations in batches of 200
2020-10-04T12:46:19 building batch of 200 derivations (0/160)

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 5 Oct 2020 14:07
(address . 43564@debbugs.gnu.org)(address . othacehe@gnu.org)
87v9fo27vz.fsf@gnu.org
Hello!

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (6 lines)
> Turns out, once an evaluation is over, new builds are registered. This
> registration tries to insert a new build for each derivation returned by
> the evaluation phase. If the new build does not add a new output, the
> insertion query is then rollbacked. This means that there are at least
> as many insertion queries as new derivations.

Isn’t that the real problem, that we’re doing one transaction per
derivation?

Ideally, upon evaluation completion, I’d expect us to create a single
transaction that registers all 50K derivations at once, no?

Thanks for the great improvements in this area!

Ludo’.
M
M
Mathieu Othacehe wrote on 5 Oct 2020 15:09
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 43564@debbugs.gnu.org)
87k0w47rbh.fsf@gnu.org
Hey Ludo,

Toggle quote (3 lines)
> Isn’t that the real problem, that we’re doing one transaction per
> derivation?

Is it really better in term of performance to send batch of queries
within a single transaction? I haven't tried it yet.

I think that the real bottleneck was having N fibers fighting over 4
workers to execute large number of insertions. Having all those queries
done by workers dedicated to registration in a single pass seems to
improve a lot the situation.

Maybe having those registration workers posting unique transactions
containing all the queries would be even more beneficial.

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 9 Oct 2020 09:59
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 43564@debbugs.gnu.org)
87y2kfomn0.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (6 lines)
>> Isn’t that the real problem, that we’re doing one transaction per
>> derivation?
>
> Is it really better in term of performance to send batch of queries
> within a single transaction? I haven't tried it yet.

I’m definitely not a database expert, but I think there’s overhead for
setting up a transaction, which we’re paying N times here.

Toggle quote (5 lines)
> I think that the real bottleneck was having N fibers fighting over 4
> workers to execute large number of insertions. Having all those queries
> done by workers dedicated to registration in a single pass seems to
> improve a lot the situation.

Then that’s great, let’s profit! :-)

Ludo’.
?