cuirass: Contention while registering new builds.

Done

Details

2 participants

Ludovic Courtès
Mathieu Othacehe

Owner: unassigned

Submitted by: Mathieu Othacehe

Severity: normal

Mathieu Othacehe wrote on 22 Sep 2020 18:52

Recipients:(address . bug-guix@gnu.org)

Message-ID:87imc5zrc8.fsf@gnu.org

Hello,

The attached screenshot shows that 9 evaluations are currently "In

progress" for "guix-master" specification.

Evaluations 16725 to 16738 are completed, 7 hours ago and 56 minutes ago

respectively. They are still shown as "In progress" because the

"register" phase in "build-packages" is not over. It is also possible to

see that new builds are registered for those evaluations, but very

slowly.

As each evaluation has to register around 50k builds, there might be

some sort of "writing" contention on the database, explaining the long

build registration time.

Thanks,

Mathieu

https://othacehe.org

Attachment: cuirass.png

Mathieu Othacehe wrote on 28 Sep 2020 09:51

Recipients:(address . 43564-done@debbugs.gnu.org)

Message-ID:87k0wee3ud.fsf@gnu.org

Hello,

Toggle quote (4 lines)

> As each evaluation has to register around 50k builds, there might be

> some sort of "writing" contention on the database, explaining the long

> build registration time.

Turns out, once an evaluation is over, new builds are registered. This

registration tries to insert a new build for each derivation returned by

the evaluation phase. If the new build does not add a new output, the

insertion query is then rollbacked. This means that there are at least

as many insertion queries as new derivations.

SQlite allows at most one writer at a time, and even though we are using WAL

mode, performing a lot of insertions will reduce the reading performances.

When multiple evaluations are performed in parallel, the large number of

concurrent insertion queries also causes contention.

Having 10 pending evaluations means that we are probably trying to

insert around 500.000 records concurrently. This is something that

SQLite does not seem to be designed for.

To avoid those issues, we need to check first in the "Outputs" table

which derivations are already registered. This means that most of the

insertion queries will be replaced by reading queries, that are much

less expensive and more suitable for Cuirass concurrent implementation.

In the best case of one pending evaluation, the registration duration is

reduced from 1800 seconds to 320 seconds. I think that the gain is way

larger when there are multiple pending evaluations.

Pushed a fix as 461e07e14e1c8013343c0a2cb26c0e022e10d5e4.

Thanks,

Mathieu

Closed

Mathieu Othacehe wrote on 28 Sep 2020 10:17

Recipients:(address . 43564@debbugs.gnu.org)

Message-ID:87d026e2n5.fsf@gnu.org

For future reference, here's a small addition. As I said, we are using

the Cuirass SQLite database in WAL mode. This means that insertion

queries are added to a separate cuirass.db-wal file. Once in a while,

this file is supposed to be merged in the main cuirass.db file.

This mechanism known as checkpointing is supposed to occur automatically

once 4MB of data are added to the WAL file. On berlin, the WAL file of

Cuirass is 9.4M and the Guix WAL file is almost 1G.

As reading from the WAL file is less effective, and checkpointing a

large file will end-up locking the database for some times, we should

definitely try to understand why the Guix WAL file get so big.

Thanks,

Mathieu

Mathieu Othacehe wrote on 29 Sep 2020 17:02

Recipients:(address . 43564@debbugs.gnu.org)

Message-ID:87h7rglj7t.fsf@gnu.org

Hello,

Toggle quote (6 lines)

> In the best case of one pending evaluation, the registration duration is

> reduced from 1800 seconds to 320 seconds. I think that the gain is way

> larger when there are multiple pending evaluations.

> Pushed a fix as 461e07e14e1c8013343c0a2cb26c0e022e10d5e4.

While this improves the situation, another evaluation contention episode

occurred this afternoon (evaluations 16920 to 16938).

Even though reading queries are cheaper, I guess that when 10 fibers try

to execute 50k read queries, there's some contention in the

communication with the 4 database workers.

I'm trying locally to spawn database workers dedicated to registration,

that execute the evaluation queries in a single batch. Let's see if it

helps.

Thanks,

Mathieu

Mathieu Othacehe wrote on 4 Oct 2020 13:12

Recipients:(address . 43564@debbugs.gnu.org)

Message-ID:87r1qep7m9.fsf@gnu.org

Hello,

Toggle quote (4 lines)

> I'm trying locally to spawn database workers dedicated to registration,

> that execute the evaluation queries in a single batch. Let's see if it

> helps.

I added 4 registration dedicated database workers and removed from

registration all accesses to the store. Now the evaluation returns

objects that contain all required information.

This seem to improve the situation quite a lot. For example, those three

registrations were done simultaneously in ~200 seconds:

Toggle snippet (14 lines)2020-10-04T12:46:19 registration took 205.429772 seconds
2020-10-04T12:46:19 evaluation 17099 registered 156 new derivations
2020-10-04T12:46:19 building 156 derivations in batches of 200
2020-10-04T12:46:19 building batch of 200 derivations (0/156)
2020-10-04T12:46:19 registration took 212.892482 seconds
2020-10-04T12:46:19 evaluation 17105 registered 164 new derivations
2020-10-04T12:46:19 building 164 derivations in batches of 200
2020-10-04T12:46:19 building batch of 200 derivations (0/164)
2020-10-04T12:46:19 registration took 205.009416 seconds
2020-10-04T12:46:19 evaluation 17102 registered 160 new derivations
2020-10-04T12:46:19 building 160 derivations in batches of 200
2020-10-04T12:46:19 building batch of 200 derivations (0/160)

Thanks,

Mathieu

Ludovic Courtès wrote on 5 Oct 2020 14:07

Recipients:(address . 43564@debbugs.gnu.org)(address . othacehe@gnu.org)

Message-ID:87v9fo27vz.fsf@gnu.org

Hello!

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (6 lines)

> Turns out, once an evaluation is over, new builds are registered. This

> registration tries to insert a new build for each derivation returned by

> the evaluation phase. If the new build does not add a new output, the

> insertion query is then rollbacked. This means that there are at least

> as many insertion queries as new derivations.

Isn’t that the real problem, that we’re doing one transaction per

derivation?

Ideally, upon evaluation completion, I’d expect us to create a single

transaction that registers all 50K derivations at once, no?

Thanks for the great improvements in this area!

Ludo’.

Mathieu Othacehe wrote on 5 Oct 2020 15:09

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 43564@debbugs.gnu.org)

Message-ID:87k0w47rbh.fsf@gnu.org

Hey Ludo,

Toggle quote (3 lines)

> Isn’t that the real problem, that we’re doing one transaction per

> derivation?

Is it really better in term of performance to send batch of queries

within a single transaction? I haven't tried it yet.

I think that the real bottleneck was having N fibers fighting over 4

workers to execute large number of insertions. Having all those queries

done by workers dedicated to registration in a single pass seems to

improve a lot the situation.

Maybe having those registration workers posting unique transactions

containing all the queries would be even more beneficial.

Thanks,

Mathieu

Ludovic Courtès wrote on 9 Oct 2020 09:59

Recipients:(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 43564@debbugs.gnu.org)

Message-ID:87y2kfomn0.fsf@gnu.org

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (6 lines)

>> Isn’t that the real problem, that we’re doing one transaction per

>> derivation?

> Is it really better in term of performance to send batch of queries

> within a single transaction? I haven't tried it yet.

I’m definitely not a database expert, but I think there’s overhead for

setting up a transaction, which we’re paying N times here.

Toggle quote (5 lines)

> I think that the real bottleneck was having N fibers fighting over 4

> workers to execute large number of insertions. Having all those queries

> done by workers dedicated to registration in a single pass seems to

> improve a lot the situation.

Then that’s great, let’s profit! :-)

Ludo’.

Your comment

This issue is archived.

To comment on this conversation send an email to 43564@debbugs.gnu.org

is:open	open issues
is:done	closed issues
submitter:<who>	search issue submitter
author:<who>	search by message author
date:yesterday..now	search by issue date
mdate:3m..2d	search by message date