[PATCH] git: Shell out to ‘git gc’ when necessary.

  • Done
  • quality assurance status badge
Details
3 participants
  • Ludovic Courtès
  • Christopher Baines
  • Simon Tournier
Owner
unassigned
Submitted by
Ludovic Courtès
Severity
normal
L
L
Ludovic Courtès wrote on 20 Oct 2023 18:15
[PATCH] git: Shell out to ‘git gc ’ when necessary.
(address . guix-patches@gnu.org)
f588bb38b4b9fdaff29dd8af8c62aa3c55902f7c.1697818202.git.ludo@gnu.org

This fixes a bug whereby libgit2-managed checkouts would keep growing as
we fetch.

* guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New
procedures.
(update-cached-checkout): Use it.
---
guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++---
1 file changed, 36 insertions(+), 3 deletions(-)

Hi!

This is a radical fix/workaround for the unbounded Git checkout growth
problem, shelling out to ‘git gc’ when it’s likely needed (“too many”
pack files around).

I thought we might be able to implement a ‘git gc’ approximation using
the libgit2 “packbuilder” interface, but I haven’t got around to doing

Once again, shelling out is not my favorite option, but it’s a bug we
should fix sooner rather than later, hence this compromise.

Thoughts?

Ludo’.

Toggle diff (81 lines)
diff --git a/guix/git.scm b/guix/git.scm
index b7182305cf..d704b62333 100644
--- a/guix/git.scm
+++ b/guix/git.scm
@@ -1,6 +1,6 @@
;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2017, 2020 Mathieu Othacehe <m.othacehe@gmail.com>
-;;; Copyright © 2018-2022 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2018-2023 Ludovic Courtès <ludo@gnu.org>
;;; Copyright © 2021 Kyle Meyer <kyle@kyleam.com>
;;; Copyright © 2021 Marius Bakke <marius@gnu.org>
;;; Copyright © 2022 Maxime Devos <maximedevos@telenet.be>
@@ -29,15 +29,16 @@ (define-module (guix git)
#:use-module (guix cache)
#:use-module (gcrypt hash)
#:use-module ((guix build utils)
- #:select (mkdir-p delete-file-recursively))
+ #:select (mkdir-p delete-file-recursively invoke/quiet))
#:use-module (guix store)
#:use-module (guix utils)
#:use-module (guix records)
#:use-module (guix gexp)
#:autoload (guix git-download)
(git-reference-url git-reference-commit git-reference-recursive?)
+ #:autoload (guix config) (%git)
#:use-module (guix sets)
- #:use-module ((guix diagnostics) #:select (leave warning))
+ #:use-module ((guix diagnostics) #:select (leave warning info))
#:use-module (guix progress)
#:autoload (guix swh) (swh-download commit-id?)
#:use-module (rnrs bytevectors)
@@ -428,6 +429,35 @@ (define (delete-checkout directory)
(rename-file directory trashed)
(delete-file-recursively trashed)))
+(define (packs-in-git-repository directory)
+ "Return the number of pack files under DIRECTORY, a Git checkout."
+ (catch 'system-error
+ (lambda ()
+ (let ((directory (opendir (in-vicinity directory ".git/objects/pack"))))
+ (let loop ((count 0))
+ (match (readdir directory)
+ ((? eof-object?)
+ (closedir directory)
+ count)
+ (str
+ (loop (if (string-suffix? ".pack" str)
+ (+ 1 count)
+ count)))))))
+ (const 0)))
+
+(define (maybe-run-git-gc directory)
+ "Run 'git gc' in DIRECTORY if needed."
+ ;; XXX: As of libgit2 1.3.x (used by Guile-Git), there's no support for GC.
+ ;; Each time a checkout is pulled, a new pack is created, which eventually
+ ;; takes up a lot of space (lots of small, poorly-compressed packs). As a
+ ;; workaround, shell out to 'git gc' when the number of packs in a
+ ;; repository has become "too large", potentially wasting a lot of space.
+ ;; See <https://issues.guix.gnu.org/65720>.
+ (when (> (packs-in-git-repository directory) 25)
+ (info (G_ "compressing cached Git repository at '~a'...~%")
+ directory)
+ (invoke/quiet %git "-C" directory "gc")))
+
(define* (update-cached-checkout url
#:key
(ref '())
@@ -515,6 +545,9 @@ (define* (update-cached-checkout url
seconds seconds
nanoseconds nanoseconds))))
+ ;; Run 'git gc' if needed.
+ (maybe-run-git-gc cache-directory)
+
;; When CACHE-DIRECTORY is a sub-directory of the default cache
;; directory, remove expired checkouts that are next to it.
(let ((parent (dirname cache-directory)))

base-commit: 6b0a32196982a0a2f4dbb59d35e55833a5545ac6
--
2.41.0
C
C
Christopher Baines wrote on 30 Oct 2023 13:02
(name . Ludovic Courtès)(address . ludo@gnu.org)
87sf5swc3j.fsf@cbaines.net
Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (27 lines)
>
> This fixes a bug whereby libgit2-managed checkouts would keep growing as
> we fetch.
>
> * guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New
> procedures.
> (update-cached-checkout): Use it.
> ---
> guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++---
> 1 file changed, 36 insertions(+), 3 deletions(-)
>
> Hi!
>
> This is a radical fix/workaround for the unbounded Git checkout growth
> problem, shelling out to ‘git gc’ when it’s likely needed (“too many”
> pack files around).
>
> I thought we might be able to implement a ‘git gc’ approximation using
> the libgit2 “packbuilder” interface, but I haven’t got around to doing
> it: <https://libgit2.org/libgit2/#HEAD/search/pack>.
>
> Once again, shelling out is not my favorite option, but it’s a bug we
> should fix sooner rather than later, hence this compromise.
>
> Thoughts?

This sounds good to me, the data service has this problem as well of
cached checkouts that grow to be too large and this sounds like it'll
address it.
-----BEGIN PGP SIGNATURE-----

iQKlBAEBCgCPFiEEPonu50WOcg2XVOCyXiijOwuE9XcFAmU/m8BfFIAAAAAALgAo
aXNzdWVyLWZwckBub3RhdGlvbnMub3BlbnBncC5maWZ0aGhvcnNlbWFuLm5ldDNF
ODlFRUU3NDU4RTcyMEQ5NzU0RTBCMjVFMjhBMzNCMEI4NEY1NzcRHG1haWxAY2Jh
aW5lcy5uZXQACgkQXiijOwuE9Xcx2A//f3qqeG4+4J5uvVPcP/26SMtpzvNCUsRs
5hzUgJherxg4U25tlUdXlpjSVTNDSc0qN7RpUQWy6rca/S/ro2NL/KcR7VAdyumu
v39ldNwbq0W2YC/eZ9fxS7SzeCnWV2oOpO5X5sy69TGvjE/plWNStttvOF/HFy6Y
CtH9GNXZE9xwL5PbRK1Lxun5JmamP3Lxk+oivN3ZC7AnhoYxGxJ4xHD9HfN970b/
6kBbc0Vcf0wHtHRVIEuhBw01JkklchBhTTzbYoi3SUFZeovbkm4Ys5g3s7nDVVZD
5XNMkdp2YcgdUtsfZN1jhgFXTsa6XyfFnQS/1qMfPg3U1niAj0nKIqjEfmUNnGD1
PpQbQ5WvOZm5S70HHDG9Cg58BVIcH0hrHqfVYyghhttf2yUvdKp6CtqNyuBzsr7D
276K8EAeTMcfQwtArxKaFfFG/ggInMvPy1UA1FoN2j0EIIxeND/7vcejqqIssjZm
jsU716+s9bP1JCf0s/gJPWSw7Iph7gOs4CKFUdQSeEqNawyXyetxc5PjI6K7NKzq
QSa7SJlTe0Lv8maRIZ7LV8t08n3PPFO0sFC7MQMVTCFbkwqFwaVGGdeFiF8dYXfN
m4eigk0nl9Poq7gQ79r0igy/rfkZW8mVKRucqVdUJ/znykJfiyBbadSmPcCZP6MS
XzvO0Cgwki8=
=sesm
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 14 Nov 2023 10:19
(name . Christopher Baines)(address . mail@cbaines.net)
87o7fwae0q.fsf@gnu.org
Hello,

Christopher Baines <mail@cbaines.net> skribis:

Toggle quote (7 lines)
> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Fixes <https://issues.guix.gnu.org/65720>.
>>
>> This fixes a bug whereby libgit2-managed checkouts would keep growing as
>> we fetch.

[...]

Toggle quote (4 lines)
> This sounds good to me, the data service has this problem as well of
> cached checkouts that grow to be too large and this sounds like it'll
> address it.

Thanks for your input, Chris.

Any other comments? I’d like to push the patch within a few days if
there are no objections.


Ludo’.
S
S
Simon Tournier wrote on 14 Nov 2023 10:32
Re: bug#65720: [bug#66650] [PATCH] git: Shell out to ‘git gc’ when necessary.
87v8a4el3a.fsf@gmail.com
Hi,

On Tue, 14 Nov 2023 at 10:19, Ludovic Courtès <ludo@gnu.org> wrote:

Toggle quote (3 lines)
> Any other comments? I’d like to push the patch within a few days if
> there are no objections.

As mentioned in [1],

Toggle quote (7 lines)
>> * guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New
>> procedures.
>> (update-cached-checkout): Use it.
>> ---
>> guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 36 insertions(+), 3 deletions(-)

LGTM. Just two colors for the bikeshed. :-)


Toggle quote (2 lines)
>> + (when (> (packs-in-git-repository directory) 25)

Why 25? And not 10 or 50 or 100?


Toggle quote (10 lines)
>> (define* (update-cached-checkout url
>> #:key
>> (ref '())
>> @@ -515,6 +545,9 @@ (define* (update-cached-checkout url
>> seconds seconds
>> nanoseconds nanoseconds))))
>>
>> + ;; Run 'git gc' if needed.
>> + (maybe-run-git-gc cache-directory)

Why not trigger it by “guix gc”?

Well, I expect “guix gc” to take some time and I choose when. However,
I want “guix pull” or “guix time-machine” to be as fast as possible and
here some extra time is added, and I cannot control exactly when.


Cheers,
simon


1: bug#65720: [PATCH] git: Shell out to ‘git gc’ when necessary.
Simon Tournier <zimon.toutoune@gmail.com>
Mon, 23 Oct 2023 12:08:07 +0200
id:87il6xlkhk.fsf@gmail.com
L
L
Ludovic Courtès wrote on 16 Nov 2023 13:12
(name . Simon Tournier)(address . zimon.toutoune@gmail.com)
87h6ll28yh.fsf@gnu.org
Hi,

Simon Tournier <zimon.toutoune@gmail.com> skribis:

Toggle quote (9 lines)
>>> * guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New
>>> procedures.
>>> (update-cached-checkout): Use it.
>>> ---
>>> guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 36 insertions(+), 3 deletions(-)
>
> LGTM.

Thanks!

Toggle quote (7 lines)
> Just two colors for the bikeshed. :-)
>
>
>>> + (when (> (packs-in-git-repository directory) 25)
>
> Why 25? And not 10 or 50 or 100?

Totally arbitrary. :-) I sampled the checkouts I had on my laptop and
that seems like a reasonable heuristic. In particular, it seems that
Git-managed checkouts never have this many packs; only libgit2-managed
checkouts do, precisely because libgit2 doesn’t repack/GC.

Toggle quote (5 lines)
>>> + ;; Run 'git gc' if needed.
>>> + (maybe-run-git-gc cache-directory)
>
> Why not trigger it by “guix gc”?

Because so far the idea is that ~/.cache/guix/checkouts is automatically
managed without user intervention; it’s really a cache in that sense.

Toggle quote (4 lines)
> Well, I expect “guix gc” to take some time and I choose when. However,
> I want “guix pull” or “guix time-machine” to be as fast as possible and
> here some extra time is added, and I cannot control exactly when.

Yes, I see. The thing is ‘maybe-run-git-gc’ is only called on the slow
path; so for example, it’s not called on a ‘time-machine’ cache hit, but
only on a cache miss, which is already expensive anyway.

Does that make sense?

Thanks,
Ludo’.
S
S
Simon Tournier wrote on 16 Nov 2023 14:24
Re: bug#65720: [bug#66650] [PATCH] git: Shell out to ‘git gc’ when necessary.
(name . Ludovic Courtès)(address . ludo@gnu.org)
CAJ3okZ2-W_Me-Gao44+LeKGCm7dhb8VkLfC2doL4NE9VO88HYg@mail.gmail.com
Hi,

On Thu, 16 Nov 2023 at 13:12, Ludovic Courtès <ludo@gnu.org> wrote:

Toggle quote (8 lines)
> > Well, I expect “guix gc” to take some time and I choose when. However,
> > I want “guix pull” or “guix time-machine” to be as fast as possible and
> > here some extra time is added, and I cannot control exactly when.
>
> Yes, I see. The thing is ‘maybe-run-git-gc’ is only called on the slow
> path; so for example, it’s not called on a ‘time-machine’ cache hit, but
> only on a cache miss, which is already expensive anyway.

What you mean as "only called on the slow path" is each time
'update-cached-checkout' is called, right? So, somehow when
'maybe-run-git-gc' is called appears to me "unpredictable". But
anyway. :-)

Let move it elsewhere if I am really annoyed.

Cheers,
simon
L
L
Ludovic Courtès wrote on 22 Nov 2023 12:17
Re: [bug#66650] bug#65720: [bug#66650] [PATCH] git: Shell out to ‘git gc’ when necessary.
(name . Simon Tournier)(address . zimon.toutoune@gmail.com)
874jhem3z0.fsf@gnu.org
Hi,

Simon Tournier <zimon.toutoune@gmail.com> skribis:

Toggle quote (13 lines)
> On Thu, 16 Nov 2023 at 13:12, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> > Well, I expect “guix gc” to take some time and I choose when. However,
>> > I want “guix pull” or “guix time-machine” to be as fast as possible and
>> > here some extra time is added, and I cannot control exactly when.
>>
>> Yes, I see. The thing is ‘maybe-run-git-gc’ is only called on the slow
>> path; so for example, it’s not called on a ‘time-machine’ cache hit, but
>> only on a cache miss, which is already expensive anyway.
>
> What you mean as "only called on the slow path" is each time
> 'update-cached-checkout' is called, right?

Yes, which usually indicates we’re on a cache miss (for example a cache
miss of ‘guix time-machine’) and thus are going to do potentially more
work (updating a Git repo, building things, etc.). That’s why I think
it’s on the “slow path” and shouldn’t make much of a difference. More
importantly, unless I’m mistaken, it’s rarely going to fire.

Toggle quote (3 lines)
> So, somehow when 'maybe-run-git-gc' is called appears to me
> "unpredictable". But anyway. :-)

Sure, but the way I see it, that’s the nature of caches.

Toggle quote (2 lines)
> Let move it elsewhere if I am really annoyed.

:-/

Ludo’.
S
S
Simon Tournier wrote on 22 Nov 2023 12:57
Re: bug#65720: Guile-Git-managed checkouts grow way too much
(name . Ludovic Courtès)(address . ludo@gnu.org)
86ttpehug4.fsf_-_@gmail.com
Hi Ludo,

Thanks for explaining.

On Wed, 22 Nov 2023 at 12:17, Ludovic Courtès <ludo@gnu.org> wrote:

Toggle quote (2 lines)
> it’s rarely going to fire.

[...]

Toggle quote (4 lines)
>> Let move it elsewhere if I am really annoyed.
>
> :-/

Sorry, I poorly worded my last comment. :-)

Somehow I was expressing: my view probably falls into the “Premature
optimization is the root of all evil” category. Other said, I have no
objection and I will revisit the issue when I will be on fire, if I am,
or annoyed for real.

Cheers,
simon

PS:

Aside this patch:

Toggle quote (5 lines)
>> So, somehow when 'maybe-run-git-gc' is called appears to me
>> "unpredictable". But anyway. :-)
>
> Sure, but the way I see it, that’s the nature of caches.

What makes cache unpredictable is their current state. However, this
does not imply that *all* the actions modifying from one state to
another must also be triggered in unpredictable moment.

For instance, I choose when I wash family’s clothes and the wash-machine
does not start by itself when the unpredictable stack of family’s dirty
clothes is enough. Because, maybe today it’s rainy so drying is
difficult and tomorrow will be sunny so it will be a better moment. :-)

For me, “guix gc” should be the driver for cleaning all the various Guix
caches. Anyway. :-D
L
L
Ludovic Courtès wrote on 22 Nov 2023 17:00
Re: [bug#66650] bug#65720: Guile-Git-managed checkouts grow way too much
(name . Simon Tournier)(address . zimon.toutoune@gmail.com)
87h6ldkcc8.fsf@gnu.org
Hi,

Simon Tournier <zimon.toutoune@gmail.com> skribis:

Toggle quote (5 lines)
> Somehow I was expressing: my view probably falls into the “Premature
> optimization is the root of all evil” category. Other said, I have no
> objection and I will revisit the issue when I will be on fire, if I am,
> or annoyed for real.

Alright!

Pushed as b150c546b04c9ebb09de9f2c39789221054f5eea.

Let’s see how it behaves and if there are problems we had overlooked…

Ludo’.
Closed
?