From debbugs-submit-bounces@debbugs.gnu.org Mon Jun 22 07:49:45 2020 Received: (at submit) by debbugs.gnu.org; 22 Jun 2020 11:49:45 +0000 Received: from localhost ([127.0.0.1]:33397 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1jnKxM-0004va-Qa for submit@debbugs.gnu.org; Mon, 22 Jun 2020 07:49:45 -0400 Received: from lists.gnu.org ([209.51.188.17]:48372) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1jnKxI-0004vM-UZ for submit@debbugs.gnu.org; Mon, 22 Jun 2020 07:49:43 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:51724) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1jnKxI-0005C2-0n for bug-guix@gnu.org; Mon, 22 Jun 2020 07:49:40 -0400 Received: from fencepost.gnu.org ([2001:470:142:3::e]:40147) by eggs.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jnKxH-0003P1-OQ for bug-guix@gnu.org; Mon, 22 Jun 2020 07:49:39 -0400 Received: from [2a01:e0a:1d:7270:af76:b9b:ca24:c465] (port=48038 helo=ribbon) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1jnKxF-0006OQ-SN for bug-guix@gnu.org; Mon, 22 Jun 2020 07:49:39 -0400 From: =?utf-8?Q?Ludovic_Court=C3=A8s?= To: Subject: =?utf-8?Q?=E2=80=9CSQLite?= database is =?utf-8?Q?busy=E2=80=9D?= =?utf-8?Q?=3A?= contention on the store database X-URL: http://www.fdn.fr/~lcourtes/ X-Revolutionary-Date: 5 Messidor an 228 de la =?utf-8?Q?R=C3=A9volution?= X-PGP-Key-ID: 0x090B11993D9AEBB5 X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4 0CFB 090B 1199 3D9A EBB5 X-OS: x86_64-pc-linux-gnu Date: Mon, 22 Jun 2020 13:49:35 +0200 Message-ID: <877dvz7228.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) Hi, On berlin.guix.gnu.org, we=E2=80=99ve seen a lot of contention on the store database for a month or so, with messages like: warning: SQLite database is busy More often than not, everything slows down to a halt, and builds don=E2=80= =99t proceed. We initially discussed it here: https://issues.guix.gnu.org/41119#4 When that happens we typically see 30+ processes that have opened the database: --8<---------------cut here---------------start------------->8--- ludo@berlin ~$ sudo fuser -v /var/guix/db/db.sqlite-shm=20 USER PID ACCESS COMMAND /var/guix/db/db.sqlite-shm: root 1211 F...m guix offload root 1845 F...m guix-daemon root 6070 F...m guix-daemon root 31378 F...m guix-daemon root 31380 F...m guix-daemon root 47381 F...m guix-daemon root 47382 F...m guix-daemon root 47383 F...m guix-daemon root 47385 F...m guix-daemon root 47681 F...m guix-daemon root 48506 F...m guix offload root 48542 F...m guix offload root 49417 F...m guix-daemon root 49472 F...m guix-daemon root 49971 F...m guix-daemon root 50027 F...m guix-daemon root 50847 F...m guix offload root 54625 F...m guix-daemon root 54723 F...m guix-daemon root 55470 F...m guix-daemon root 55614 F...m guix-daemon root 55617 F...m guix-daemon root 56225 F...m guix-daemon root 57732 F...m guix-daemon root 57733 F...m guix-daemon root 57738 F...m guix-daemon root 57739 F...m guix-daemon root 57740 F...m guix-daemon root 58772 F...m guix-daemon root 59056 F...m guix-daemon root 99055 F...m guix-daemon root 127690 F...m guix offload --8<---------------cut here---------------end--------------->8--- In this case, the first process seems to be in the deduplication phase in =E2=80=98finalize-store-file=E2=80=99: --8<---------------cut here---------------start------------->8--- ludo@berlin ~$ sudo strace -p 1211 strace: Process 1211 attached utimensat(AT_FDCWD, "/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root= /gnu/store/n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/= kernel/net/netfilter", [{tv_sec=3D1592814617, tv_nsec=3D396083395} /* 2020-= 06-22T10:30:17.396083395+0200 */, {tv_sec=3D1, tv_nsec=3D0} /* 1970-01-01T0= 1:00:01+0100 */], 0) =3D 0 chmod("/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l= 40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/net= filter", 040555) =3D 0 getrusage(RUSAGE_SELF, {ru_utime=3D{tv_sec=3D1306, tv_usec=3D334062}, ru_st= ime=3D{tv_sec=3D143, tv_usec=3D819675}, ...}) =3D 0 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=3D1450, tv_nsec=3D153748295= }) =3D 0=20 lstat("/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l= 40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/net= filter/xt_connlimit.ko", {st_mode=3DS_IFLNK|0777, st_size=3D125, ...}) =3D 0 readlink("/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/= n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/= netfilter/xt_connlimit.ko", "/gnu/store/x8d1hmv17jawp4a7nwwb4"..., 100) =3D= 100 readlink("/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/= n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/= netfilter/xt_connlimit.ko", "/gnu/store/x8d1hmv17jawp4a7nwwb4"..., 200) =3D= 125 mkdir("/gnu", 0777) =3D -1 EEXIST (File exists) mkdir("/gnu/store", 0777) =3D -1 EEXIST (File exists) mkdir("/gnu/store/.links", 0777) =3D -1 EEXIST (File exists) lstat("/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l= 40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/net= filter/xt_connlimit.ko", {st_mode=3DS_IFLNK|0777, st_size=3D125, ...}) =3D 0 stat("/gnu/store/.links/05p6zpj56f00g07c3621747kz3bjc6zbr8bgg8xysdajq18qgvl= v", {st_mode=3DS_IFREG|0444, st_size=3D6880, ...}) =3D 0 --8<---------------cut here---------------end--------------->8--- At that point it has the database opened 3 times: once from =E2=80=98with-temporary-store-file=E2=80=99, once =E2=80=98with-database=E2= =80=99 in =E2=80=98finalize-store-file=E2=80=99, and once from =E2=80=98register-item= s=E2=80=99 in =E2=80=98finalize-store-file=E2=80=99. Commit d7fb5538013710288e91657499f0= e04207115776 removed the last one. --8<---------------cut here---------------start------------->8--- ludo@berlin ~$ sudo ls -l /proc/1211/fd total 0 lr-x------ 1 root root 64 Jun 22 11:49 0 -> 'pipe:[10239344]' l-wx------ 1 root root 64 Jun 22 11:49 1 -> 'pipe:[10239343]' lr-x------ 1 root root 64 Jun 22 11:49 10 -> 'pipe:[10129708]' l-wx------ 1 root root 64 Jun 22 11:49 11 -> 'pipe:[10129708]' lr-x------ 1 root root 64 Jun 22 11:49 12 -> 'pipe:[10214031]' l-wx------ 1 root root 64 Jun 22 11:49 13 -> 'pipe:[10214031]' l-wx------ 1 root root 64 Jun 22 11:49 14 -> /var/guix/offload/141.80.167.1= 68:22/1 lrwx------ 1 root root 64 Jun 22 11:49 15 -> 'socket:[8056368]' lrwx------ 1 root root 64 Jun 22 11:49 16 -> 'socket:[10433466]' lrwx------ 1 root root 64 Jun 22 11:49 17 -> /var/guix/db/db.sqlite lrwx------ 1 root root 64 Jun 22 11:49 18 -> /var/guix/db/db.sqlite-wal lrwx------ 1 root root 64 Jun 22 11:49 19 -> /var/guix/db/db.sqlite-shm l-wx------ 1 root root 64 Jun 22 11:49 2 -> 'pipe:[10239343]' lrwx------ 1 root root 64 Jun 22 11:49 20 -> /var/guix/db/db.sqlite lrwx------ 1 root root 64 Jun 22 11:49 21 -> /var/guix/db/db.sqlite-wal lr-x------ 1 root root 64 Jun 22 11:49 3 -> 'pipe:[10021329]' l-wx------ 1 root root 64 Jun 22 11:49 4 -> 'pipe:[10239345]' lr-x------ 1 root root 64 Jun 22 11:49 40 -> /dev/urandom l-wx------ 1 root root 64 Jun 22 11:49 5 -> 'pipe:[10021329]' lr-x------ 1 root root 64 Jun 22 11:49 6 -> 'pipe:[10222125]' l-wx------ 1 root root 64 Jun 22 11:49 7 -> 'pipe:[10222125]' lr-x------ 1 root root 64 Jun 22 11:49 8 -> 'pipe:[10021332]' l-wx------ 1 root root 64 Jun 22 11:49 9 -> 'pipe:[10021332]' --8<---------------cut here---------------end--------------->8--- The are several possible causes of contention: (1) database opened too many times, (2) database kept opened for too long, and (3) sqlite mishandling contention somehow. I think we can work on #1 and #2. As for #3, I wonder if there=E2=80=99s a change in sqlite that could have made contention more contentious (ah ha!), given that all this happened around the time of the last =E2=80=98core-updates=E2=80=99 merge, roughly corresponding to an upgrade f= rom 3.24.0 to 3.31.0: http://data.guix.gnu.org/repository/1/branch/master/package/sqlite Maybe the most likely change that caused contention is that we increased the number of machines (or rather the sum of build jobs that can be handled by the build machines), and thus the number of parallel build processes. Ludo=E2=80=99.