Guile-Git-managed checkouts grow way too much

  • Done
Details
8 participants
  • Josselin Poiret
  • Jelle Licht
  • Ludovic Courtès
  • Christopher Baines
  • Tobias Geerinckx-Rice
  • Csepp
  • wolf
  • Simon Tournier
Owner
unassigned
Submitted by
Ludovic Courtès
Severity
important
Ludovic Courtès wrote on 3 Sep 2023 22:44
(address . bug-guix@gnu.org)
87bkejc7go.fsf@inria.fr
Hello!

As reported by Tobias on IRC (in the context of ‘hpcguix-web’),
checkouts managed by Guile-Git appear to grow beyond reason. As an
example, here’s the same ‘.git’ managed with Guile-Git and with Git:

$ du -hs ~/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq
6.7G /home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq
$ du -hs .git
517M .git

It would seem that libgit2 doesn’t do the equivalent of ‘git gc’.

Ludo’.
Ludovic Courtès wrote on 4 Sep 2023 23:13
control message for bug #65720
(address . control@debbugs.gnu.org)
87zg21od50.fsf@gnu.org
severity 65720 important
quit
Ludovic Courtès wrote on 4 Sep 2023 23:47
Re: bug#65720: Guile-Git-managed checkouts grow way too much
(address . 65720@debbugs.gnu.org)
87fs3tobju.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> As reported by Tobias on IRC (in the context of ‘hpcguix-web’),
> checkouts managed by Guile-Git appear to grow beyond reason. As an
> example, here’s the same ‘.git’ managed with Guile-Git and with Git:
>
> $ du -hs ~/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq
> 6.7G /home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq
> $ du -hs .git
> 517M .git

Unsurprisingly, GC makes a big difference:

$ cp -r ~/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq /tmp/checkout
$ (cd /tmp/checkout/; git gc)
Enumerating objects: 717785, done.
Counting objects: 100% (717785/717785), done.
Delta compression using up to 4 threads
Compressing objects: 100% (154644/154644), done.
Writing objects: 100% (717785/717785), done.
Total 717785 (delta 569440), reused 710535 (delta 562274), pack-reused 0
Enumerating cruft objects: 103412, done.
Traversing cruft objects: 81753, done.
Counting objects: 100% (64171/64171), done.
Delta compression using up to 4 threads
Compressing objects: 100% (17379/17379), done.
Writing objects: 100% (64171/64171), done.
Total 64171 (delta 52330), reused 58296 (delta 46792), pack-reused 0
Expanding reachable commits in commit graph: 133730, done.
$ du -hs /tmp/checkout
539M /tmp/checkout

> It would seem that libgit2 doesn’t do the equivalent of ‘git gc’.

Confirmed: <https://github.com/libgit2/libgit2/issues/3247>.


My inclination for the short term would be to work around this
limitation by (1) finding a heuristic to determine if a checkout has
likely accumulated too much cruft, and (2) considering such checkouts as
expired (thereby forcing a re-clone) or running ‘git gc’ on them if
‘git’ is available.
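
One possible shape for (1), sketched under assumptions the thread does not make: reuse the default thresholds of ‘git gc --auto’ (more than 6700 loose objects, or more than 50 packs) and read the counts from ‘git count-objects -v’. The function name ‘needs_gc’ and its arguments are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: decide whether a checkout looks like it needs 'git gc'.
# Reads the output of 'git count-objects -v' on standard input.
# $1: loose-object limit (defaults to 6700, like gc.auto)
# $2: pack limit (defaults to 50, like gc.autoPackLimit)
# Exit status 0 means "clean-up looks warranted".
needs_gc() {
    awk -v max_loose="${1:-6700}" -v max_packs="${2:-50}" '
        $1 == "count:" { loose = $2 }   # number of loose objects
        $1 == "packs:" { packs = $2 }   # number of pack files
        END { exit !(loose > max_loose || packs > max_packs) }'
}
```

With ‘git’ at hand, this would be used as ‘git -C "$checkout" count-objects -v | needs_gc’. It does not help when ‘git’ is missing, since the counts themselves come from ‘git’.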

I can’t think of a good heuristic for (1). Birth time could be one, but
we’d need statx(2):

$ stat ~/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq | tail -4
Access: 2023-09-04 23:13:54.668279105 +0200
Modify: 2023-09-04 11:34:41.665385000 +0200
Change: 2023-09-04 11:34:41.661629102 +0200
Birth: 2021-08-09 10:48:17.748722151 +0200

Lacking statx(2), we can approximate creation time by looking at
‘.git/config’:

$ stat ~/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq/.git/config | tail -3
Modify: 2021-08-09 10:50:28.031760953 +0200
Change: 2021-08-09 10:50:28.031760953 +0200
Birth: 2021-08-09 10:50:28.031760953 +0200

This strategy can be implemented like this:

diff --git a/guix/git.scm b/guix/git.scm
index ebe2600209..ed3fa56bc8 100644
--- a/guix/git.scm
+++ b/guix/git.scm
@@ -405,7 +405,16 @@ (define cached-checkout-expiration
   ;; Use the mtime rather than the atime to cope with file systems mounted
   ;; with 'noatime'.
-  (file-expiration-time (* 90 24 3600) stat:mtime))
+  (let ((ttl (* 90 24 3600))
+        (max-checkout-retention (* 9 30 24 3600)))
+    (lambda (file)
+      (match (false-if-exception (lstat file))
+        (#f 0)                    ;FILE may have been deleted in the meantime
+        (st (min (pk 'ttl (+ (stat:mtime st) ttl))
+                 (pk 'maxttl (match (false-if-exception
+                                     (lstat (in-vicinity file ".git/config")))
+                               (#f +inf.0)
+                               (st (+ (stat:mtime st) max-checkout-retention))))))))))
 (define %checkout-cache-cleanup-period
   ;; Period for the removal of expired cached checkouts.

Namely, a cached checkout is considered “expired” once it is 9 months
old. In my case, it gives this:

scheme@(guix git)> (cached-checkout-expiration "/home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq/")

;;; (ttl 1701596081)

;;; (maxttl 1651827028)
$6 = 1651827028
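
The two values printed by ‘pk’ can be checked by hand. The sketch below redoes the arithmetic of the patch in shell; the function name is hypothetical, and the two arguments are the mtimes from the ‘stat’ output earlier in this message, converted to seconds since the epoch.

```shell
#!/bin/sh
# Hypothetical helper mirroring the 'min' in the patch.
# $1: mtime of the checkout directory (seconds since the epoch)
# $2: mtime of its .git/config, used as an approximation of creation time
checkout_expiration() {
    ttl=$((90 * 24 * 3600))                 # 90 days after the last update
    max_retention=$((9 * 30 * 24 * 3600))   # 9 months after creation
    by_ttl=$(($1 + ttl))
    by_age=$(($2 + max_retention))
    # Whichever deadline comes first wins.
    if [ "$by_ttl" -lt "$by_age" ]; then
        echo "$by_ttl"
    else
        echo "$by_age"
    fi
}

# 2023-09-04 11:34:41 +0200 and 2021-08-09 10:50:28 +0200 as epoch seconds.
checkout_expiration 1693820081 1628499028   # prints 1651827028
```

On a system with GNU coreutils the two inputs could presumably be obtained with ‘stat -c %Y’ on the checkout directory and on its ‘.git/config’.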

Of course having to re-clone entire repositories every 9 months is
ridiculous, but storing gigabytes of packs is worse IMO (I’m
specifically thinking about the Guix repo, which every user copies via
‘guix pull’).

Thoughts?

Thanks,
Ludo’.
Josselin Poiret wrote on 5 Sep 2023 10:18
87tts9uj6x.fsf@jpoiret.xyz
Hi Ludo,

Ludovic Courtès <ludo@gnu.org> writes:

> My inclination for the short term would be to work around this
> limitation by (1) finding a heuristic to determine if a checkout has
> likely accumulated too much cruft, and (2) considering such checkouts as
> expired (thereby forcing a re-clone) or running ‘git gc’ on them if
> ‘git’ is available.

I think using the git binary instead of libgit2 as a workaround is a
good idea. We can consider building it directly as well, so that people
who don't have it in their profiles can still benefit from it. We could
even consider using git commands in most places and using libgit2 only
where we really need the tight coupling. IIUC, libgit2 is eternally
trying to catch up to git and often performs in a counter-intuitive way
(I expect the various bugs with stale deleted files in checkouts to be
caused by this). Maybe it could also let us use bare repositories and
directly extract the refs we want without having to mess with checkouts?

Best,
--
Josselin Poiret

Jelle Licht wrote on 5 Sep 2023 10:22
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 65720@debbugs.gnu.org)
CE9E1465-187B-462B-B9E2-94E6A43B86EC@posteo.net
Hi Ludo,

>
> On 4 Sep 2023, at 23:49, Ludovic Courtès <ludo@gnu.org> wrote:
>
> Of course having to re-clone entire repositories every 9 months is
> ridiculous, but storing gigabytes of packs is worse IMO (I’m
> specifically thinking about the Guix repo, which every user copies via
> ‘guix pull’).

Please ignore if it doesn’t make sense, or would not make a practical difference for the current issue, but wouldn’t a local clone do the trick here? As in, clone from the ‘clogged’ local repo, move over fresh clone to old location.

Kr, Jelle
Ludovic Courtès wrote on 5 Sep 2023 16:11
(address . 65720@debbugs.gnu.org)
87wmx4lnfu.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> $ du -hs ~/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq
> 6.7G /home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq

Another data point, with Cuirass instances:

ludo@berlin ~$ sudo du -hs /var/lib/cuirass/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq
65G /var/lib/cuirass/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq
ludo@berlin ~$ sudo stat /var/lib/cuirass/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq | tail -1
Birth: 2022-07-30 23:15:45.582559879 +0200

… and:

ludo@guix-hpc4 ~$ sudo du -hs /var/lib/cuirass/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq
86G /var/lib/cuirass/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq
ludo@guix-hpc4 ~$ sudo stat /var/lib/cuirass/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq | tail -1
Birth: 2021-06-01 11:48:48.854669310 +0200

So yeah, problem we have.

Ludo’.
Ludovic Courtès wrote on 5 Sep 2023 16:18
(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 65720@debbugs.gnu.org)
87msy0ln4m.fsf@gnu.org
Hi,

Josselin Poiret <dev@jpoiret.xyz> skribis:

> I think using the git binary instead of libgit2 as a workaround is a
> good idea. We can consider building it directly as well, so that people
> who don't have it in their profiles can still benefit from it. We could
> even consider using git commands in most places and using libgit2 only
> where we really need the tight coupling.

Surely you’d agree that it would suck though: depending on two Git
implementations because one doesn’t have a proper API and the other one
lacks a bunch of features.

It would also be pretty bad for closure size:

$ guix size guile-git | tail -1
total: 106.6 MiB
$ guix size guile-git git-minimal | tail -1
total: 169.8 MiB

It’s also not clear concretely how we’d add that dependency. Try
invoking ‘git’ from $PATH and print a warning if it doesn’t work?
But then, what about applications like Cuirass and hpcguix-web?
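
For illustration, the “try ‘git’ from $PATH and warn” idea could look roughly like the sketch below. The function name and the ‘GIT’ override variable are hypothetical; the sketch says nothing about how Cuirass or hpcguix-web would get ‘git’ onto their $PATH.

```shell
#!/bin/sh
# Hypothetical wrapper: run 'git gc' on a checkout if a 'git' command can be
# found; otherwise print a warning and leave the checkout untouched.
# $1: checkout directory.  GIT (optional) overrides the command name.
gc_checkout() {
    git_cmd=${GIT:-git}
    if ! command -v "$git_cmd" >/dev/null 2>&1; then
        echo "warning: '$git_cmd' not found; skipping 'git gc' for $1" >&2
        return 0        # a missing 'git' is not treated as an error
    fi
    "$git_cmd" -C "$1" gc --quiet
}
```

Returning 0 when ‘git’ is absent keeps the clean-up best-effort, so that a missing binary never breaks the operation that triggered it.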

Tricky, tricky.

Ludo’.
Ludovic Courtès wrote on 5 Sep 2023 16:20
(name . Jelle Licht)(address . jlicht@posteo.net)(address . 65720@debbugs.gnu.org)
87il8oln06.fsf@gnu.org
Hello,

Jelle Licht <jlicht@posteo.net> skribis:

>> On 4 Sep 2023, at 23:49, Ludovic Courtès <ludo@gnu.org> wrote:
>>
>> Of course having to re-clone entire repositories every 9 months is
>> ridiculous, but storing gigabytes of packs is worse IMO (I’m
>> specifically thinking about the Guix repo, which every user copies via
>> ‘guix pull’).
>
> Please ignore if it doesn’t make sense, or would not make a practical difference for the current issue, but wouldn’t a local clone do the trick here? As in, clone from the ‘clogged’ local repo, move over fresh clone to old location.

Good question.

scheme@(guix git)> ,use(git)
scheme@(guix git)> (clone "/home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq/" "/tmp/fresh-clone")
$7 = #<git-repository ba4240>
scheme@(guix git)> (system* "du" "-hs" "/tmp/fresh-clone")
6.7G /tmp/fresh-clone
$8 = 0
scheme@(guix git)> (system* "du" "-hs" "/tmp/fresh-clone/.git")
6.6G /tmp/fresh-clone/.git
$9 = 0
scheme@(guix git)> (system* "du" "-hs" "/home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq/")
6.7G /home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq/
$10 = 0

Conclusion: it makes no difference.

Ludo’.
Simon Tournier wrote on 5 Sep 2023 20:59
86edjcqwec.fsf@gmail.com
Hi,

On Mon, 04 Sep 2023 at 23:47, Ludovic Courtès <ludo@gnu.org> wrote:

>> It would seem that libgit2 doesn’t do the equivalent of ‘git gc’.
>
> Confirmed: <https://github.com/libgit2/libgit2/issues/3247>.

Ouch!

The goals of the project haven't changed, and neither have the
tradeoffs. If one were to rewrite git-gc on top of libgit2, the
best-case scenario is ending up with what we already had.

If you want to use regular maintenance on some repositories, use
git gc, that's what it's there for.


> My inclination for the short term would be to work around this
> limitation by (1) finding a heuristic to determine if a checkout has
> likely accumulated too much cruft, and (2) considering such checkouts
> as expired (thereby forcing a re-clone) or running ‘git gc’ on them if
> ‘git’ is available.

About (1): maybe we could add a “counter” and, after X updates of the
checkout, run (2). I guess the amount of cruft is more or less
proportional to the number of checkout updates; that’s the heuristic I
would use.
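
A minimal sketch of such a counter, assuming it lives in a file inside the checkout directory. The file name ‘.guix-update-count’ and the function name are made up for this example; nothing of the sort exists in Guix.

```shell
#!/bin/sh
# Hypothetical helper: bump a per-checkout update counter.
# $1: checkout directory, $2: number of updates between clean-ups.
# Exit status 0 means the threshold was reached (and the counter was reset);
# exit status 1 means "not yet".
bump_update_counter() {
    counter_file=$1/.guix-update-count
    count=0
    if [ -r "$counter_file" ]; then
        read -r count < "$counter_file" || count=0
    fi
    count=$((count + 1))
    if [ "$count" -ge "$2" ]; then
        echo 0 > "$counter_file"    # reset, i.e. mark the clean-up as done
        return 0
    fi
    echo "$count" > "$counter_file"
    return 1
}
```

A caller would run something like ‘bump_update_counter "$checkout" "$x" && run-the-clean-up’ after each update of the checkout, where ‘$x’ is the X above.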

The most annoying part is (2): forcing a re-clone does not look like a
solution to me. I prefer to waste disk space (and probably run ‘git gc’
manually myself) than re-clone… Somehow this re-clone would always
happen when I am on a poor network.

Moreover, assuming this clean-up (2) runs only once in a while, we
could imagine invoking something like

  guix shell -C git-minimal \
       -- git \
       -C ~/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq \
       gc

when the checkout is updated. And maybe we could provide another “guix
pull” command-line option to turn this off and mark it as done
(reset the “counter”).

Well, that’s a poor solution, but we can assume that git-minimal is at
worst available using “guix shell git-minimal”. Note that the closure
of git-minimal is far smaller than a re-clone of the full Guix repository.

Cheers,
simon
Josselin Poiret wrote on 6 Sep 2023 10:04
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 65720@debbugs.gnu.org)
87pm2vvibo.fsf@jpoiret.xyz
Hi Ludo,

Ludovic Courtès <ludo@gnu.org> writes:

> Surely you’d agree that it would suck though: depending on two Git
> implementations because one doesn’t have a proper API and the other one
> lacks a bunch of features.

Right, although I wouldn't necessarily say that the former doesn't have
a proper API, but rather that it has a Unix-oriented API. That leads to
performance issues on e.g. Windows but on Linux I'm not sure there's
much of a difference.

> It would also be pretty bad for closure size:
>
> --8<---------------cut here---------------start------------->8---
> $ guix size guile-git | tail -1
> total: 106.6 MiB
> $ guix size guile-git git-minimal | tail -1
> total: 169.8 MiB
> --8<---------------cut here---------------end--------------->8---
>
> It’s also not clear concretely how we’d add that dependency. Try
> invoking ‘git’ from $PATH and print a warning if it doesn’t work?
> But then, what about applications like Cuirass and hpcguix-web?
>
> Tricky, tricky.

We could consider replacing the guile-git dependency with another
library built directly on top of git-minimal, and have this be a
dependency of Guix. Not ideal though, and not really scalable either:
we can't just add every VCS as direct dependencies.

From what I've seen, people are now scaling back on their use of
libgit2 because of the impedance mismatch and are resorting more and
more to git plumbing. From a pragmatic point of view, I'd prefer the
latter, since it is more stable and feature-complete.

Best,
--
Josselin Poiret

Simon Tournier wrote on 7 Sep 2023 02:41
(address . 65720@debbugs.gnu.org)
86il8mn7al.fsf@gmail.com
Hi,

On Tue, 05 Sep 2023 at 16:18, Ludovic Courtès <ludo@gnu.org> wrote:

> It would also be pretty bad for closure size:
>
> --8<---------------cut here---------------start------------->8---
> $ guix size guile-git | tail -1
> total: 106.6 MiB
> $ guix size guile-git git-minimal | tail -1
> total: 169.8 MiB
> --8<---------------cut here---------------end--------------->8---
>
> It’s also not clear concretely how we’d add that dependency. Try
> invoking ‘git’ from $PATH and print a warning if it doesn’t work?
> But then, what about applications like Cuirass and hpcguix-web?

I think we can rely on something like,

guix shell -C git-minimal -- git gc

It would be invoked internally using the Scheme API for inferiors and
friends. Doing so, it would add nothing to the closure size.

It appears to me safe to assume that this command can be run from any
Guix installation. Since the Git GC would only be done once every X Git
fetches, the overhead would be much lower.

Hum, am I repeating myself [1]? :-)

And I would run this “git gc” via “guix gc”, not via “guix pull”. Well,
I do not like all these automatic removals that usual commands trigger
based on a date (last-expiry-cleanup). They always happen when I do not
want them to. ;-) Contrary to “guix gc”. Bah, another story. :-)

Cheers,
simon


1: bug#65720: Guile-Git-managed checkouts grow way too much
Simon Tournier <zimon.toutoune@gmail.com>
Tue, 05 Sep 2023 20:59:07 +0200
id:86edjcqwec.fsf@gmail.com
Ludovic Courtès wrote on 8 Sep 2023 19:08
(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 65720@debbugs.gnu.org)
87pm2s385m.fsf@gnu.org
Hello!

Josselin Poiret <dev@jpoiret.xyz> skribis:

> Right, although I wouldn't necessarily say that the former doesn't have
> a proper API, but rather that it has a Unix-oriented API. That leads to
> performance issues on e.g. Windows but on Linux I'm not sure there's
> much of a difference.

[...]

> We could consider replacing the guile-git dependency with another
> library built directly on top of git-minimal, and have this be a
> dependency of Guix. Not ideal though, and not really scalable either:
> we can't just add every VCS as direct dependencies.

I cannot imagine a viable implementation of things like ‘commit-closure’
and ‘commit-relation’ from (guix git) done by shelling out to ‘git’.
I’m quite confident this would be slow and brittle.

It looks like there’s no option other than carrying the two
implementations.

~~~

Years ago, Andy Wingo sketched a plan for GNU hackers to implement Git
in pure Scheme. That was on April 1st though, so people mistakenly
assumed it was a joke and the project was never carried out.

I digress, but I wonder: is there not even a viable Haskell or OCaml
implementation of Git?

Thanks,
Ludo’.
Ludovic Courtès wrote on 8 Sep 2023 19:09
(name . Simon Tournier)(address . zimon.toutoune@gmail.com)
87jzt0382l.fsf@gnu.org
Hi!

Simon Tournier <zimon.toutoune@gmail.com> skribis:

> On Tue, 05 Sep 2023 at 16:18, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> It would also be pretty bad for closure size:
>>
>> --8<---------------cut here---------------start------------->8---
>> $ guix size guile-git | tail -1
>> total: 106.6 MiB
>> $ guix size guile-git git-minimal | tail -1
>> total: 169.8 MiB
>> --8<---------------cut here---------------end--------------->8---
>>
>> It’s also not clear concretely how we’d add that dependency. Try
>> invoking ‘git’ from $PATH and print a warning if it doesn’t work?
>> But then, what about applications like Cuirass and hpcguix-web?
>
> I think we can rely on something like,
>
> guix shell -C git-minimal -- git gc

We’re talking about the implementation of a cache (meant to speed up
operations), that would actually fill said cache plus do a whole bunch
of expensive operations? Nah. :-)

Ludo’.
Simon Tournier wrote on 9 Sep 2023 12:31
(name . Ludovic Courtès)(address . ludo@gnu.org)
86cyyrskmj.fsf@gmail.com
Hi,

On Fri, 08 Sep 2023 at 19:09, Ludovic Courtès <ludo@gnu.org> wrote:

>>> It would also be pretty bad for closure size:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> $ guix size guile-git | tail -1
>>> total: 106.6 MiB
>>> $ guix size guile-git git-minimal | tail -1
>>> total: 169.8 MiB
>>> --8<---------------cut here---------------end--------------->8---
>>>
>>> It’s also not clear concretely how we’d add that dependency. Try
>>> invoking ‘git’ from $PATH and print a warning if it doesn’t work?
>>> But then, what about applications like Cuirass and hpcguix-web?
>>
>> I think we can rely on something like,
>>
>> guix shell -C git-minimal -- git gc
>
> We’re talking about the implementation of a cache (meant to speed up
> operations), that would actually fill said cache plus do a whole bunch
> of expensive operations? Nah. :-)

I do not think so. If I understand correctly, we need to run “git gc” at
some point, therefore git-minimal needs to be around. The question is
how and when.

Well, maybe I am missing what the bug is about. For me, it is about
running ‘git gc’ for cleaning the Git checkout cache, no?


Solution #1. Add git-minimal as an input. It increases the closure, and
the extra load (on average) is about the ratio between the rate of “guix
pull” and the rate of git-minimal changes.

Assume that people run “guix pull” once per week and that, say, “git
gc” is run after 50 pulls. (Both numbers are totally arbitrary and
based on my personal estimate.)

Data Service [1] tells:

2023-07-07 15:45:22 2023-09-08 21:22:08
2023-05-11 16:10:48 2023-07-07 14:21:45
2023-05-01 16:40:08 2023-05-11 14:36:16
2023-04-25 13:34:54 2023-05-01 15:19:55
2023-04-25 13:34:54 2023-09-08 21:22:08
2023-03-06 17:22:28 2023-04-25 12:27:33
2023-01-17 23:49:19 2023-03-06 16:48:43
2022-11-08 13:06:42 2023-01-17 15:11:47
2022-10-08 05:14:46 2022-11-08 09:56:31
2022-09-06 15:00:08 2022-10-08 04:15:43
2022-08-13 22:02:31 2022-09-06 12:58:52

It means that a user will download git-minimal ~10 times for nothing.


Solution #2. The one I am proposing. :-) Download git-minimal only
when Guix needs it for running “git gc”. Yeah, there is probably a
small overhead for some operations. But I bet this overhead is much
smaller than the one of solution #1.

Well, it depends on the number of times people are updating the cache vs
the rate of change of git-minimal.

For sure, if one updates the cache 100 times per week, having
git-minimal as an input is far better. But I do not think that is the
regular usage on average. :-)

That’s why I am proposing to have an option for turning off this “git
gc” operation.

Well, we have lived for years without running ‘git gc’, so running it
once per year on average is probably enough to keep the cache size
reasonable. And git-minimal is changing every month.


Maybe, there is some solution #3. ;-)

Cheers,
simon

1: https://data.guix.gnu.org/repository/1/branch/master/package/git-minimal/output-history


C
(name . Ludovic Courtès)(address . ludo@gnu.org)
cuch6o16vgh.fsf@riseup.net
Ludovic Courtès <ludo@gnu.org> writes:

> Hello!
>
> Josselin Poiret <dev@jpoiret.xyz> skribis:
>
>> Right, although I wouldn't necessarily say that the former doesn't have
>> a proper API, but rather that it has a Unix-oriented API. That leads to
>> performance issues on e.g. Windows but on Linux I'm not sure there's
>> much of a difference.
>
> [...]
>
>> We could consider replacing the guile-git dependency with another
>> library built directly on top of git-minimal, and have this be a
>> dependency of Guix. Not ideal though, and not really scalable either:
>> we can't just add every VCS as direct dependencies.
>
> I cannot imagine a viable implementation of things like ‘commit-closure’
> and ‘commit-relation’ from (guix git) done by shelling out to ‘git’.
> I’m quite confident this would be slow and brittle.
>
> It looks like there’s no option other than carrying the two
> implementations.
>
> ~~~
>
> Years ago, Andy Wingo sketched a plan for GNU hackers to implement Git
> in pure Scheme. That was on April 1st though, so people mistakenly
> assumed it was a joke and the project was never carried out.
>
> I digress, but I wonder: is there not even a viable Haskell or OCaml
> implementation of Git?
>
> Thanks,
> Ludo’.

For the sake of completeness:
There is an alternative implementation in C for Plan 9 that I've used and
that is now mature enough that the 9front project switched to it from
Mercurial.
It might be possible to compile it with the plan9port compiler wrapper.

There is also a Git implementation in OCaml that some MirageOS
unikernels use to serve static content from a git repository.
Also the Irmin "database" is based on git and is written in OCaml.
C
(name . Simon Tournier)(address . zimon.toutoune@gmail.com)
cuca5tt6va2.fsf@riseup.net
Simon Tournier <zimon.toutoune@gmail.com> writes:

> Hi,
>
> On Fri, 08 Sep 2023 at 19:09, Ludovic Courtès <ludo@gnu.org> wrote:
>
>>>> It would also be pretty bad for closure size:
>>>>
>>>> --8<---------------cut here---------------start------------->8---
>>>> $ guix size guile-git | tail -1
>>>> total: 106.6 MiB
>>>> $ guix size guile-git git-minimal | tail -1
>>>> total: 169.8 MiB
>>>> --8<---------------cut here---------------end--------------->8---
>>>>
>>>> It’s also not clear concretely how we’d add that dependency. Try
>>>> invoking ‘git’ from $PATH and print a warning if it doesn’t work?
>>>> But then, what about applications like Cuirass and hpcguix-web?
>>>
>>> I think we can rely on something like,
>>>
>>> guix shell -C git-minimal -- git gc
>>
>> We’re talking about the implementation of a cache (meant to speed up
>> operations), that would actually fill said cache plus do a whole bunch
>> of expensive operations? Nah. :-)
>
> I do not think. If I understand correctly, we need to run “git gc” at
> some point, therefore git-minimal needs to me around. The question is
> how and when.
>
> Well, maybe I am missing what the bug is about. For me, it is about
> running ‘git gc’ for cleaning the Git checkout cache, no?
>
>
> Solution #1. Add git-minimal as inputs. It increases the closure and
> the extra load (on average) is about the ratio between the rate of “guix
> pull” and the rate of the git-minimal changes.
>
> Assuming, that people are running “guix pull” once per week and say “git
> gc” is run after 50 pulls. (These both number are totally arbitrary and
> based on my personal estimate).
>
> Data Service [1] tells:
>
> 2023-07-07 15:45:22 2023-09-08 21:22:08
> 2023-05-11 16:10:48 2023-07-07 14:21:45
> 2023-05-01 16:40:08 2023-05-11 14:36:16
> 2023-04-25 13:34:54 2023-05-01 15:19:55
> 2023-04-25 13:34:54 2023-09-08 21:22:08
> 2023-03-06 17:22:28 2023-04-25 12:27:33
> 2023-01-17 23:49:19 2023-03-06 16:48:43
> 2022-11-08 13:06:42 2023-01-17 15:11:47
> 2022-10-08 05:14:46 2022-11-08 09:56:31
> 2022-09-06 15:00:08 2022-10-08 04:15:43
> 2022-08-13 22:02:31 2022-09-06 12:58:52
> …
>
> It means that an user will download ~10 times git-minimal for nothing.
>
>
> Solution #2. The one I am proposing. :-) Download git-minimal only
> when Guix needs it for running “git gc”. Yeah, there is probably a
> small overload with some operations. But, I bet this overload is much
> smaller than the one of solution #1.
>
> Well, it depends on the number of times people are updating the cache vs
> the rate of change of git-minimal.
>
> For sure, if one updates 100 times per week the cache, having
> git-minimal as inputs is far better. But I do not think that the
> regular usage on average. :-)
>
> That’s why I am proposing to have an option for turning off this “git
> gc“ operation.
>
> Well, we have lived since years without running ‘git gc’ so running it
> once per year on average is probably enough to keep the cache size
> reasonable. And git-minimal is changing every month.
>
>
> Maybe, there is some solution #3. ;-)
>
> Cheers,
> simon
>
>
> 1: https://data.guix.gnu.org/repository/1/branch/master/package/git-minimal/output-history

Please don't create another situation like with guix system roll-back,
where a crucial sysadmin operation doesn't work without network access.
Or at least make it configurable, so things that are likely to be needed
for future operations are pre-fetched.
Simon Tournier wrote on 11 Sep 2023 10:42
Digression about Git implementations (was Re: bug#65720: Guile-Git-managed checkouts grow way too much)
(address . 65720@debbugs.gnu.org)
87zg1tje2s.fsf@gmail.com
Hi Ludo,

On Fri, 08 Sep 2023 at 19:08, Ludovic Courtès <ludo@gnu.org> wrote:

> Years ago, Andy Wingo sketched a plan for GNU hackers to implement Git
> in pure Scheme. That was on April 1st though, so people mistakenly
> assumed it was a joke and the project was never carried out.

Well, that is a piece of work. :-)

Maybe there is some hope with git-std-lib:

Subject: Proposal/Discussion: Turning parts of Git into libraries
From: Emily Shaffer <nasamuffin@google.com>
To: Git List <git@vger.kernel.org>
Date: Fri, 17 Feb 2023 13:12:23 -0800

And some patches are starting to float around.


> I digress, but I wonder: is there not even a viable Haskell or OCaml
> implementation of Git?

It depends on what “viable” means. :-)


Irmin [1] is an OCaml library for building mergeable, branchable
distributed data stores – A Distributed Database Built on the Same
Principles as Git. And irmin relies on ocaml-git.


Then there is a pure Go implementation and another using Java.


I do not know which of these are “viable”. Well, I do not know if ‘git gc’
is implemented. And I do not know which plumbing is implemented and which
porcelain is available.

Last, SWH uses dulwich [2] which is a pure Python implementation of Git.


To my knowledge, there is no “dulwich gc” but they implement “dulwich
fsck” and “dulwich repack”.

Back at 10 Years of Guix, or at UNESCO in February (I do not remember
exactly when), we were discussing implementations of Git. And we
mentioned an implementation in Rust. Maybe this one:


Cheers,
simon
Ludovic Courtès wrote on 11 Sep 2023 16:37
Re: bug#65720: Guile-Git-managed checkouts grow way too much
(name . Josselin Poiret)(address . dev@jpoiret.xyz)(address . 65720@debbugs.gnu.org)
87jzswsrlt.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> It would also be pretty bad for closure size:
>
> $ guix size guile-git | tail -1
> total: 106.6 MiB
> $ guix size guile-git git-minimal | tail -1
> total: 169.8 MiB
>
> It’s also not clear concretely how we’d add that dependency. Try
> invoking ‘git’ from $PATH and print a warning if it doesn’t work?

A solution to this particular problem is coming:


Ludo’.
W
(name . Ludovic Courtès)(address . ludo@gnu.org)
ZP8nc1m8rN_34XV-@ws
On 2023-09-08 19:08:05 +0200, Ludovic Courtès wrote:
> Hello!
>
> Josselin Poiret <dev@jpoiret.xyz> skribis:
>
> > Right, although I wouldn't necessarily say that the former doesn't have
> > a proper API, but rather that it has a Unix-oriented API. That leads to
> > performance issues on e.g. Windows but on Linux I'm not sure there's
> > much of a difference.
>
> [...]
>
> > We could consider replacing the guile-git dependency with another
> > library built directly on top of git-minimal, and have this be a
> > dependency of Guix. Not ideal though, and not really scalable either:
> > we can't just add every VCS as direct dependencies.
>
> I cannot imagine a viable implementation of things like ‘commit-closure’
> and ‘commit-relation’ from (guix git) done by shelling out to ‘git’.

I am sure I must be missing some part of the contract of the function, but at
least the commit-relation seems fairly straightforward:

  (define (shelling-commit-relation old new)
    (let ((h-old (oid->string (commit-id old)))
          (h-new (oid->string (commit-id new))))
      (cond ((eq? old new)
             'self)
            ((zero? (git-C %repo "merge-base" "--is-ancestor" h-old h-new))
             'ancestor)
            ((zero? (git-C %repo "merge-base" "--is-ancestor" h-new h-old))
             'descendant)
            (else
             'unrelated))))

I would argue it is even somewhat more readable than the current implementation.

> I’m quite confident this would be slow

My version is ~2000x faster compared to (guix git):

Guix: 1048.620992ms
Git: 0.532143ms

Again, I am sure I must have missed something, either in the implementation or in
the measurements, because it is pretty hard to believe there is so much room for
improvement.

The full script I used is attached to this email.

> and brittle.

In general, Git plumbing commands are designed to have a stable CLI so that
they can be used in scripts. So I am not sure where the brittleness would come
from.
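
As a minimal sketch of that idea: ‘git merge-base --is-ancestor’ exits with 0
when the first commit is an ancestor of the second, with 1 when it is not, and
with another status on error, so a wrapper can map those three cases to #t, #f,
or an exception. The name ‘git-ancestor?’ is hypothetical and is not part of
(guix git); it assumes ‘git’ is found in $PATH.

```scheme
;; Hypothetical helper, not part of (guix git).
(define (git-ancestor? repo old new)
  "Return #t if commit OLD is an ancestor of NEW in REPO, #f if it is not;
raise an error if 'git' fails for any other reason."
  (let* ((status (system* "git" "-C" repo
                          "merge-base" "--is-ancestor" old new))
         ;; 'status:exit-val' returns #f if the process was killed by a
         ;; signal; that case falls through to the 'else' branch below.
         (code   (status:exit-val status)))
    (case code
      ((0) #t)
      ((1) #f)
      (else (error "git merge-base failed" repo old new code)))))
```

With such a helper, a missing repository or an unknown commit is reported as
an error rather than being silently treated as “not an ancestor”.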

>
> It looks like there’s no option other than carrying the two
> implementations.

Assuming I made no mistake (hard to believe), it is probably worth exploring the
feasibility of just shelling out to the git binary some more.

>
> ~~~
>
> Years ago, Andy Wingo sketched a plan for GNU hackers to implement Git
> in pure Scheme. That was on April 1st though, so people mistakenly
> assumed it was a joke and the project was never carried out.
>
> I digress, but I wonder: is there not even a viable Haskell or OCaml
> implementation of Git?
>
> Thanks,
> Ludo’.
>

W.

--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
#!/bin/sh
# -*-scheme-*-
exec guile -s "$0" "$@"
!#

(use-modules (git)
             (guix git))

(define %repo "/tmp/guix-fork")
(define h1 "72745172d155e489936f694d6b9013cb76272370")
(define h2 "6d60d7ccba5a8e06c17d55a1772fa7f4529b5eff")
(define h3 "c3db650680f995f0556d3ddce567cdc1c33e4603")

;;; r has to still be defined when the commit-relation is called. There is *no*
;;; error, but it always returns 'unrelated. Quite a footgun.
(define r (repository-open %repo))
(define c1 (commit-lookup r (string->oid h1)))
(define c2 (commit-lookup r (string->oid h2)))
(define c3 (commit-lookup r (string->oid h3)))

(define (git-C dir . args)
  (apply system* "git" "-C" dir args))

(define (shelling-commit-relation old new)
  (let ((h-old (oid->string (commit-id old)))
        (h-new (oid->string (commit-id new))))
    (cond ((eq? old new)
           'self)
          ;; In real code, git-C should probably return #t (for 0), #f (for 1)
          ;; or raise (for anything else).
          ((zero? (git-C %repo "merge-base" "--is-ancestor" h-old h-new))
           'ancestor)
          ((zero? (git-C %repo "merge-base" "--is-ancestor" h-new h-old))
           'descendant)
          (else
           'unrelated))))

;;; Make sure it actually works.
(let ((tests `((,c1 . ,c1)
               (,c1 . ,c2)
               (,c2 . ,c1)
               (,c1 . ,c3))))
  (for-each (λ (c)
              (format #t "Guix: ~a\nGit: ~a\n\n"
                      (commit-relation (car c) (cdr c))
                      (shelling-commit-relation (car c) (cdr c))))
            tests))

(define (time proc)
  (let* ((start (get-internal-run-time))
         (_ (proc))
         (end (get-internal-run-time)))
    (exact->inexact (* 1000 (/ (- end start) internal-time-units-per-second)))))

(format #t "Guix: ~ams\nGit: ~ams\n"
        (time (λ () (commit-relation c1 c2)))
        (time (λ () (shelling-commit-relation c1 c2))))


Ludovic Courtès wrote on 13 Sep 2023 20:10
(name . wolf)(address . wolf@wolfsden.cz)
874jjylza9.fsf@gnu.org
Hi,

wolf <wolf@wolfsden.cz> skribis:

> (define (time proc)
> (let* ((start (get-internal-run-time))
> (_ (proc))
> (end (get-internal-run-time)))
> (exact->inexact (* 1000 (/ (- end start) internal-time-units-per-second)))))
>
> (format #t "Guix: ~ams\nGit: ~ams\n"
> (time (λ () (commit-relation c1 c2)))
> (time (λ () (shelling-commit-relation c1 c2))))

‘get-internal-run-time’ returns “units of processor time” used by the
current process (info "(guile) Time"). When shelling out, the process
calls waitpid(2) and does nothing, so naturally its processor time is
close to zero.

‘get-internal-real-time’ should give something closer to elapsed time.
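
A minimal sketch of a wall-clock variant of the ‘time’ helper quoted above;
‘elapsed-ms’ is a hypothetical name. Unlike processor time, it includes the
time spent waiting for a child process:

```scheme
;; Hypothetical helper: wall-clock time taken by THUNK, in milliseconds.
(define (elapsed-ms thunk)
  (let ((start (get-internal-real-time)))
    (thunk)
    (exact->inexact
     (* 1000 (/ (- (get-internal-real-time) start)
                internal-time-units-per-second)))))

;; Example use: time a call that mostly waits on a child process.
(format #t "sleep: ~ams~%"
        (elapsed-ms (lambda () (system* "sleep" "0.1"))))
```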

Ludo’.
Simon Tournier wrote on 14 Sep 2023 00:36
86o7i5wvj2.fsf@gmail.com
Hi Ludo,

On Wed, 13 Sep 2023 at 20:10, Ludovic Courtès <ludo@gnu.org> wrote:

> ‘get-internal-run-time’ returns “units of processor time” used by the
> current process (info "(guile) Time"). When shelling out, the process
> calls waitpid(2) and does nothing, so naturally its processor time is
> close to zero.
>
> ‘get-internal-real-time’ should give something closer to elapsed time.

Well, let’s avoid mixing unrelated discussions. :-) To discuss that
specific part, I reported my timings, measured with ,time, on guix-devel.

comparing commit-relation using Scheme+libgit2 vs shellout plumbing Git
Simon Tournier <zimon.toutoune@gmail.com>
Tue, 12 Sep 2023 00:48:30 +0200
id:865y4gz5q9.fsf@gmail.com

The result is still significantly lower, and discussion is welcome
over there. :-)

Cheers,
simon
Ludovic Courtès wrote on 19 Sep 2023 00:35
(address . 65720@debbugs.gnu.org)
87jzsnf6tr.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> As reported by Tobias on IRC (in the context of ‘hpcguix-web’),
> checkouts managed by Guile-Git appear to grow beyond reason. As an
> example, here’s the same ‘.git’ managed with Guile-Git and with Git:
>
> $ du -hs ~/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq
> 6.7G /home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq
> $ du -hs .git
> 517M .git

More data… The biggest file in that repo is a pack that was created
when that repo was first cloned (Aug. 2021):

$ du /home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq/.git/objects/pack/* |sort -k1 -n| tail -3
44272 /home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq/.git/objects/pack/pack-3c2f1857501b01c321bc67ba1f30704deb9e18e9.pack
47272 /home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq/.git/objects/pack/pack-30d5b35ad14a8398464e49e224811b162f673d66.pack
191492 /home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq/.git/objects/pack/pack-d39507858782209d1ad87e389e4dffd4b6ff7ea2.pack
$ ls -l /home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq/.git/objects/pack/pack-d39507858782209d1ad87e389e4dffd4b6ff7ea2.pack
-r--r--r-- 1 ludo users 196079671 Aug 9 2021 /home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq/.git/objects/pack/pack-d39507858782209d1ad87e389e4dffd4b6ff7ea2.pack
$ ls -ld /home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq/.git/config
-rw-r--r-- 1 ludo users 266 Aug 9 2021 /home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq/.git/config

The pack starts with things from Aug. 2021:

$ git show-index < pack-d39507858782209d1ad87e389e4dffd4b6ff7ea2.idx|sort -k1 -n|head -3
12 30289f4d4638452520f52c1a36240220d0d940ff (852d8cb3)
927 d7ffc535c52f49177a8e5553569cdb1e321b5bc6 (2007c5d0)
1800 0a379de3249d5e9ff66fb404f7e5aa8ce2cb3d24 (b1e69aa4)
$ git show 30289f4d4638452520f52c1a36240220d0d940ff
commit 30289f4d4638452520f52c1a36240220d0d940ff
Author: Milkey Mouse <milkeymouse@meme.institute>
Date: Sun Aug 8 22:15:40 2021 -0700

[…]

… and at the bottom (large offsets) it contains very old blobs from the
Nix repo that somehow made it here.

I figured we still had a ‘nix’ branch from the early days, that contains
the history of Nix. I’ve now removed it, which helps a bit:

scheme@(guile-user)> ,use(git)
scheme@(guile-user)> ,t (clone "https://git.savannah.gnu.org/git/guix.git" "/tmp/guix")
$5 = #<git-repository 91a7b0>
;; 600.534529s real time, 435.260926s run time. 0.000000s spent in GC.
scheme@(guile-user)> ,t (clone "https://git.savannah.gnu.org/git/guix.git" "/tmp/guix-after-removing-nix-branch")
$6 = #<git-repository 4465a50>
;; 420.321511s real time, 398.772963s run time. 0.000000s spent in GC.

… and more importantly:

$ du -hs /tmp/guix/.git
373M /tmp/guix/.git
$ du -hs /tmp/guix-after-removing-nix-branch/.git
362M /tmp/guix-after-removing-nix-branch/.git

Anyway, what seems to happen is that every pull (every call to
‘remote-fetch’) creates a new pack (see ‘git_fetch_download_pack’ in
libgit2), which becomes inefficient in the long run (lots of small
poorly-compressed packs). That’s at least one possible explanation.
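
To check this on a given checkout, one could count the ‘.pack’ files under
‘objects/pack’. A minimal sketch, assuming GIT-DIR names the ‘.git’ directory
of the checkout; ‘count-packs’ is a hypothetical helper:

```scheme
(use-modules (ice-9 ftw))

;; Hypothetical helper: number of pack files in GIT-DIR, the '.git'
;; directory of a checkout.  Return 0 if the pack directory is missing.
(define (count-packs git-dir)
  (let ((packs (scandir (string-append git-dir "/objects/pack")
                        (lambda (file)
                          (string-suffix? ".pack" file)))))
    ;; 'scandir' returns #f when the directory cannot be read.
    (if packs (length packs) 0)))
```

A checkout that has been fetched into many times without ever being repacked
would be expected to report a large number here.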

To be continued…

Ludo’.
Simon Tournier wrote on 19 Sep 2023 09:19
86wmwmlje7.fsf@gmail.com
Hi Ludo.

On Tue, 19 Sep 2023 at 00:35, Ludovic Courtès <ludo@gnu.org> wrote:

> --8<---------------cut here---------------start------------->8---
> scheme@(guile-user)> ,use(git)
> scheme@(guile-user)> ,t (clone "https://git.savannah.gnu.org/git/guix.git" "/tmp/guix")
> $5 = #<git-repository 91a7b0>
> ;; 600.534529s real time, 435.260926s run time. 0.000000s spent in GC.
> scheme@(guile-user)> ,t (clone "https://git.savannah.gnu.org/git/guix.git" "/tmp/guix-after-removing-nix-branch")
> $6 = #<git-repository 4465a50>
> ;; 420.321511s real time, 398.772963s run time. 0.000000s spent in GC.
> --8<---------------cut here---------------end--------------->8---

[...]

> --8<---------------cut here---------------start------------->8---
> $ du -hs /tmp/guix/.git
> 373M /tmp/guix/.git
> $ du -hs /tmp/guix-after-removing-nix-branch/.git
> 362M /tmp/guix-after-removing-nix-branch/.git
> --8<---------------cut here---------------end--------------->8---

Let me also point out [1] that using a shallow clone, restricted to the
oldest commit reachable by time-machine, saves about 25% of the data to
download, and similarly on disk.

scheme@(guix-user)> ,t (clone "https://git.savannah.gnu.org/git/guix.git" "/tmp/guix-guile")
$1 = #<git-repository df3710>
;; 383.186818s real time, 278.060733s run time. 0.000000s spent in GC.

$ time git clone https://git.savannah.gnu.org/git/guix.git guix-full
Receiving objects: 100% (693699/693699), 342.14 MiB | 2.87 MiB/s, done.
real 2m40,830s
user 3m4,683s
sys 0m8,189s

$ time git clone --shallow-since=2019-04-30 https://git.savannah.gnu.org/git/guix.git guix-oldest
Receiving objects: 100% (428646/428646), 259.41 MiB | 3.87 MiB/s, done.
real 1m45,604s
user 2m32,370s
sys 0m5,916s

$ du -sh guix-*/.git
362M guix-full/.git
362M guix-guile/.git
272M guix-oldest/.git

Cheers,
simon


1: Re: hard dependency on Git? (was bug#65866: [PATCH 0/8] Add built-in builder for Git checkouts)
Simon Tournier <zimon.toutoune@gmail.com>
Mon, 11 Sep 2023 19:52:34 +0200
id:871qf4ha1p.fsf@gmail.com
Ludovic Courtès wrote on 20 Oct 2023 18:15
[PATCH] git: Shell out to ‘git gc’ when necessary.
(address . guix-patches@gnu.org)
f588bb38b4b9fdaff29dd8af8c62aa3c55902f7c.1697818202.git.ludo@gnu.org
Fixes <https://issues.guix.gnu.org/65720>.

This fixes a bug whereby libgit2-managed checkouts would keep growing as
we fetch.

* guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New
procedures.
(update-cached-checkout): Use it.
---
guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++---
1 file changed, 36 insertions(+), 3 deletions(-)

Hi!

This is a radical fix/workaround for the unbounded Git checkout growth
problem, shelling out to ‘git gc’ when it’s likely needed (“too many”
pack files around).

I thought we might be able to implement a ‘git gc’ approximation using
the libgit2 “packbuilder” interface, but I haven’t got around to doing
it: <https://libgit2.org/libgit2/#HEAD/search/pack>.

Once again, shelling out is not my favorite option, but it’s a bug we
should fix sooner rather than later, hence this compromise.

Thoughts?

Ludo’.

diff --git a/guix/git.scm b/guix/git.scm
index b7182305cf..d704b62333 100644
--- a/guix/git.scm
+++ b/guix/git.scm
@@ -1,6 +1,6 @@
;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2017, 2020 Mathieu Othacehe <m.othacehe@gmail.com>
-;;; Copyright © 2018-2022 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2018-2023 Ludovic Courtès <ludo@gnu.org>
;;; Copyright © 2021 Kyle Meyer <kyle@kyleam.com>
;;; Copyright © 2021 Marius Bakke <marius@gnu.org>
;;; Copyright © 2022 Maxime Devos <maximedevos@telenet.be>
@@ -29,15 +29,16 @@ (define-module (guix git)
#:use-module (guix cache)
#:use-module (gcrypt hash)
#:use-module ((guix build utils)
- #:select (mkdir-p delete-file-recursively))
+ #:select (mkdir-p delete-file-recursively invoke/quiet))
#:use-module (guix store)
#:use-module (guix utils)
#:use-module (guix records)
#:use-module (guix gexp)
#:autoload (guix git-download)
(git-reference-url git-reference-commit git-reference-recursive?)
+ #:autoload (guix config) (%git)
#:use-module (guix sets)
- #:use-module ((guix diagnostics) #:select (leave warning))
+ #:use-module ((guix diagnostics) #:select (leave warning info))
#:use-module (guix progress)
#:autoload (guix swh) (swh-download commit-id?)
#:use-module (rnrs bytevectors)
@@ -428,6 +429,35 @@ (define (delete-checkout directory)
(rename-file directory trashed)
(delete-file-recursively trashed)))
+(define (packs-in-git-repository directory)
+ "Return the number of pack files under DIRECTORY, a Git checkout."
+ (catch 'system-error
+ (lambda ()
+ (let ((directory (opendir (in-vicinity directory ".git/objects/pack"))))
+ (let loop ((count 0))
+ (match (readdir directory)
+ ((? eof-object?)
+ (closedir directory)
+ count)
+ (str
+ (loop (if (string-suffix? ".pack" str)
+ (+ 1 count)
+ count)))))))
+ (const 0)))
+
+(define (maybe-run-git-gc directory)
+ "Run 'git gc' in DIRECTORY if needed."
+ ;; XXX: As of libgit2 1.3.x (used by Guile-Git), there's no support for GC.
+ ;; Each time a checkout is pulled, a new pack is created, which eventually
+ ;; takes up a lot of space (lots of small, poorly-compressed packs). As a
+ ;; workaround, shell out to 'git gc' when the number of packs in a
+ ;; repository has become "too large", potentially wasting a lot of space.
+ ;; See <https://issues.guix.gnu.org/65720>.
+ (when (> (packs-in-git-repository directory) 25)
+ (info (G_ "compressing cached Git repository at '~a'...~%")
+ directory)
+ (invoke/quiet %git "-C" directory "gc")))
+
(define* (update-cached-checkout url
#:key
(ref '())
@@ -515,6 +545,9 @@ (define* (update-cached-checkout url
seconds seconds
nanoseconds nanoseconds))))
+ ;; Run 'git gc' if needed.
+ (maybe-run-git-gc cache-directory)
+
;; When CACHE-DIRECTORY is a sub-directory of the default cache
;; directory, remove expired checkouts that are next to it.
(let ((parent (dirname cache-directory)))

base-commit: 6b0a32196982a0a2f4dbb59d35e55833a5545ac6
--
2.41.0
Simon Tournier wrote on 23 Oct 2023 12:08
87il6xlkhk.fsf@gmail.com
Hi Ludo,

On Fri, 20 Oct 2023 at 18:15, Ludovic Courtès <ludo@gnu.org> wrote:

> * guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New
> procedures.
> (update-cached-checkout): Use it.
> ---
> guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++---
> 1 file changed, 36 insertions(+), 3 deletions(-)

LGTM. Just two colors for the bikeshed. :-)


> + (when (> (packs-in-git-repository directory) 25)

Why 25? And not 10 or 50 or 100?


> (define* (update-cached-checkout url
> #:key
> (ref '())
> @@ -515,6 +545,9 @@ (define* (update-cached-checkout url
> seconds seconds
> nanoseconds nanoseconds))))
>
> + ;; Run 'git gc' if needed.
> + (maybe-run-git-gc cache-directory)

Why not trigger it by “guix gc”?

Well, I expect “guix gc” to take some time and I choose when. However,
I want “guix pull” or “guix time-machine” to be as fast as possible and
here some extra time is added, and I cannot control exactly when.

Cheers,
simon
Tobias Geerinckx-Rice wrote on 24 Oct 2023 00:27
Re: bug#65720: [PATCH] git: Shell out to ‘git gc’ when necessary.
8A262178-2BFF-41EE-BEF2-5DC3270EF9C5@tobias.gr
>Why not trigger it by “guix gc”?

Unless there's a new option I missed, guix gc doesn't handle this.

>Well, I expect “guix gc” to take some time and I choose when. However,
>I want “guix pull” or “guix time-machine” to be as fast as possible

I don't think that things should be pushed into guix gc merely because they are slow.

This is not a great post (I'd look at the git code if I were at a computer) but I remember git printing something like 'optimising repository in the background'. Maybe something similar would be appropriate here, to better hide such housekeeping from the user.


Kind regards,

T G-R

Sent on the go. Excuse or enjoy my brevity.
Simon Tournier wrote on 24 Oct 2023 01:28
Re: bug#65720: Guile-Git-managed checkouts grow way too much
(name . Tobias Geerinckx-Rice)(address . me@tobias.gr)
86r0lkvrzn.fsf_-_@gmail.com
Hi,

On Mon, 23 Oct 2023 at 22:27, Tobias Geerinckx-Rice <me@tobias.gr> wrote:

>>Why not trigger it by “guix gc”?
>
> Unless there's a new option I missed, guix gc doesn't handle this.

Maybe I missed something but “guix gc” handles what we implement, no? :-)

Well, I run “guix gc” when I need some space. And this
“maybe-run-git-gc” does exactly that: it reclaims some space when I need
it.

For me, they are part of “guix gc” and not part of some update.


Aside, re-thinking about other features, I am consistent with other
comments I made when introducing ’maybe-remove-expired-cache-entries’;
see https://issues.guix.gnu.org/45327#4. And consistent because most
probably I still think the same: cache cleanup should be handled by
“guix gc” and not by the commands themselves. And maybe we are having
the same discussion. ;-)


>>Well, I expect “guix gc” to take some time and I choose when. However,
>>I want “guix pull” or “guix time-machine” to be as fast as possible
>
> I don't think that things should be pushed into guix gc merely because
> they are slow.

Maybe I misread, somehow it appears to me that you miss the key part: I
choose when some extra work is done and I keep “guix pull” and “guix
time-machine” as fast as possible.


Cheers,
simon
Christopher Baines wrote on 30 Oct 2023 13:02
Re: [bug#66650] [PATCH] git: Shell out to ‘git gc’ when necessary.
(name . Ludovic Courtès)(address . ludo@gnu.org)
87sf5swc3j.fsf@cbaines.net
Ludovic Courtès <ludo@gnu.org> writes:

>
> This fixes a bug whereby libgit2-managed checkouts would keep growing as
> we fetch.
>
> * guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New
> procedures.
> (update-cached-checkout): Use it.
> ---
> guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++---
> 1 file changed, 36 insertions(+), 3 deletions(-)
>
> Hi!
>
> This is a radical fix/workaround for the unbounded Git checkout growth
> problem, shelling out to ‘git gc’ when it’s likely needed (“too many”
> pack files around).
>
> I thought we might be able to implement a ‘git gc’ approximation using
> the libgit2 “packbuilder” interface, but I haven’t got around to doing
> it: <https://libgit2.org/libgit2/#HEAD/search/pack>.
>
> Once again, shelling out is not my favorite option, but it’s a bug we
> should fix sooner rather than later, hence this compromise.
>
> Thoughts?

This sounds good to me, the data service has this problem as well of
cached checkouts that grow to be too large and this sounds like it'll
address it.

Ludovic Courtès wrote on 14 Nov 2023 10:19
(name . Christopher Baines)(address . mail@cbaines.net)
87o7fwae0q.fsf@gnu.org
Hello,

Christopher Baines <mail@cbaines.net> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Fixes <https://issues.guix.gnu.org/65720>.
>>
>> This fixes a bug whereby libgit2-managed checkouts would keep growing as
>> we fetch.

[...]

> This sounds good to me, the data service has this problem as well of
> cached checkouts that grow to be too large and this sounds like it'll
> address it.

Thanks for your input, Chris.

Any other comments? I’d like to push the patch within a few days if
there are no objections.


Ludo’.
Simon Tournier wrote on 14 Nov 2023 10:32
Re: bug#65720: [bug#66650] [PATCH] git: Shell out to ‘git gc’ when necessary.
87v8a4el3a.fsf@gmail.com
Hi,

On Tue, 14 Nov 2023 at 10:19, Ludovic Courtès <ludo@gnu.org> wrote:

> Any other comments? I’d like to push the patch within a few days if
> there are no objections.

As mentioned in [1],

>> * guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New
>> procedures.
>> (update-cached-checkout): Use it.
>> ---
>> guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 36 insertions(+), 3 deletions(-)

LGTM. Just two colors for the bikeshed. :-)


>> + (when (> (packs-in-git-repository directory) 25)

Why 25? And not 10 or 50 or 100?


>> (define* (update-cached-checkout url
>> #:key
>> (ref '())
>> @@ -515,6 +545,9 @@ (define* (update-cached-checkout url
>> seconds seconds
>> nanoseconds nanoseconds))))
>>
>> + ;; Run 'git gc' if needed.
>> + (maybe-run-git-gc cache-directory)

Why not trigger it by “guix gc”?

Well, I expect “guix gc” to take some time and I choose when. However,
I want “guix pull” or “guix time-machine” to be as fast as possible and
here some extra time is added, and I cannot control exactly when.


Cheers,
simon


1: bug#65720: [PATCH] git: Shell out to ‘git gc’ when necessary.
Simon Tournier <zimon.toutoune@gmail.com>
Mon, 23 Oct 2023 12:08:07 +0200
id:87il6xlkhk.fsf@gmail.com
Ludovic Courtès wrote on 16 Nov 2023 13:12
(name . Simon Tournier)(address . zimon.toutoune@gmail.com)
87h6ll28yh.fsf@gnu.org
Hi,

Simon Tournier <zimon.toutoune@gmail.com> skribis:

>>> * guix/git.scm (packs-in-git-repository, maybe-run-git-gc): New
>>> procedures.
>>> (update-cached-checkout): Use it.
>>> ---
>>> guix/git.scm | 39 ++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 36 insertions(+), 3 deletions(-)
>
> LGTM.

Thanks!

> Just two colors for the bikeshed. :-)
>
>
>>> + (when (> (packs-in-git-repository directory) 25)
>
> Why 25? And not 10 or 50 or 100?

Totally arbitrary. :-) I sampled the checkouts I had on my laptop and
that seems like a reasonable heuristic. In particular, it seems that
Git-managed checkouts never have this many packs; only libgit2-managed
checkouts do, precisely because libgit2 doesn’t repack/GC.

>>> + ;; Run 'git gc' if needed.
>>> + (maybe-run-git-gc cache-directory)
>
> Why not trigger it by “guix gc”?

Because so far the idea is that ~/.cache/guix/checkouts is automatically
managed without user intervention; it’s really a cache in that sense.

> Well, I expect “guix gc” to take some time and I choose when. However,
> I want “guix pull” or “guix time-machine” to be as fast as possible and
> here some extra time is added, and I cannot control exactly when.

Yes, I see. The thing is ‘maybe-run-git-gc’ is only called on the slow
path; so for example, it’s not called on a ‘time-machine’ cache hit, but
only on a cache miss, which is already expensive anyway.

Does that make sense?

Thanks,
Ludo’.
Simon Tournier wrote on 16 Nov 2023 14:24
Re: bug#65720: [bug#66650] [PATCH] git: Shell out to ‘git gc’ when necessary.
(name . Ludovic Courtès)(address . ludo@gnu.org)
CAJ3okZ2-W_Me-Gao44+LeKGCm7dhb8VkLfC2doL4NE9VO88HYg@mail.gmail.com
Hi,

On Thu, 16 Nov 2023 at 13:12, Ludovic Courtès <ludo@gnu.org> wrote:

> > Well, I expect “guix gc” to take some time and I choose when. However,
> > I want “guix pull” or “guix time-machine” to be as fast as possible and
> > here some extra time is added, and I cannot control exactly when.
>
> Yes, I see. The thing is ‘maybe-run-git-gc’ is only called on the slow
> path; so for example, it’s not called on a ‘time-machine’ cache hit, but
> only on a cache miss, which is already expensive anyway.

What you mean as "only called on the slow path" is each time
'update-cached-checkout' is called, right? So, somehow when
'maybe-run-git-gc' is called appears to me "unpredictable". But
anyway. :-)

Let's move it elsewhere if I am really annoyed.

Cheers,
simon
Ludovic Courtès wrote on 22 Nov 2023 12:17
Re: [bug#66650] bug#65720: [bug#66650] [PATCH] git: Shell out to ‘git gc’ when necessary.
(name . Simon Tournier)(address . zimon.toutoune@gmail.com)
874jhem3z0.fsf@gnu.org
Hi,

Simon Tournier <zimon.toutoune@gmail.com> skribis:

> On Thu, 16 Nov 2023 at 13:12, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> > Well, I expect “guix gc” to take some time and I choose when. However,
>> > I want “guix pull” or “guix time-machine” to be as fast as possible and
>> > here some extra time is added, and I cannot control exactly when.
>>
>> Yes, I see. The thing is ‘maybe-run-git-gc’ is only called on the slow
>> path; so for example, it’s not called on a ‘time-machine’ cache hit, but
>> only on a cache miss, which is already expensive anyway.
>
> What you mean as "only called on the slow path" is each time
> 'update-cached-checkout' is called, right?

Yes, which usually indicates we’re on a cache miss (for example a cache
miss of ‘guix time-machine’) and thus are going to do potentially more
work (updating a Git repo, building things, etc.). That’s why I think
it’s on the “slow path” and shouldn’t make much of a difference. More
importantly, unless I’m mistaken, it’s rarely going to fire.

> So, somehow when 'maybe-run-git-gc' is called appears to me
> "unpredictable". But anyway. :-)

Sure, but the way I see it, that’s the nature of caches.

> Let's move it elsewhere if I am really annoyed.

:-/

Ludo’.
Simon Tournier wrote on 22 Nov 2023 12:57
Re: bug#65720: Guile-Git-managed checkouts grow way too much
(name . Ludovic Courtès)(address . ludo@gnu.org)
86ttpehug4.fsf_-_@gmail.com
Hi Ludo,

Thanks for explaining.

On Wed, 22 Nov 2023 at 12:17, Ludovic Courtès <ludo@gnu.org> wrote:

> it’s rarely going to fire.

[...]

>> Let's move it elsewhere if I am really annoyed.
>
> :-/

Sorry, I poorly worded my last comment. :-)

Somehow I was expressing: my view probably falls into the “Premature
optimization is the root of all evil” category. In other words, I have no
objection, and I will revisit the issue if I am ever on fire or annoyed
for real.

Cheers,
simon

PS:

Aside this patch:

>> So, somehow when 'maybe-run-git-gc' is called appears to me
>> "unpredictable". But anyway. :-)
>
> Sure, but the way I see it, that’s the nature of caches.

What makes caches unpredictable is their current state. However, this
does not imply that *all* the actions moving them from one state to
another must also be triggered at unpredictable moments.

For instance, I choose when I wash the family’s clothes, and the washing
machine does not start by itself once the unpredictable pile of dirty
clothes is big enough. Because maybe today is rainy, so drying is
difficult, and tomorrow will be sunny, so it will be a better moment. :-)

For me, “guix gc” should be the driver for cleaning all the various Guix
caches. Anyway. :-D
Ludovic Courtès wrote on 22 Nov 2023 17:00
Re: [bug#66650] bug#65720: Guile-Git-managed checkouts grow way too much
(name . Simon Tournier)(address . zimon.toutoune@gmail.com)
87h6ldkcc8.fsf@gnu.org
Hi,

Simon Tournier <zimon.toutoune@gmail.com> skribis:

> Somehow I was expressing: my view probably falls into the “Premature
> optimization is the root of all evil” category. In other words, I have no
> objection, and I will revisit the issue if I am ever on fire or annoyed
> for real.

Alright!

Pushed as b150c546b04c9ebb09de9f2c39789221054f5eea.

Let’s see how it behaves and if there are problems we had overlooked…

Ludo’.
Closed
Ludovic Courtès wrote on 23 Nov 2023 12:35
Re: bug#65720: Guile-Git-managed checkouts grow way too much
(address . 65720-done@debbugs.gnu.org)
87ttpcd7nb.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> As reported by Tobias on IRC (in the context of ‘hpcguix-web’),
> checkouts managed by Guile-Git appear to grow beyond reason. As an
> example, here’s the same ‘.git’ managed with Guile-Git and with Git:
>
> $ du -hs ~/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq
> 6.7G /home/ludo/.cache/guix/checkouts/pjmkglp4t7znuugeurpurzikxq3tnlaywmisyr27shj7apsnalwq
> $ du -hs .git
> 517M .git

Fixed by b150c546b04c9ebb09de9f2c39789221054f5eea.

We still need to update the ‘guix’ package so that tools that rely on
(guix git) such as the Data Service, hpcguix-web, and Cuirass, can
benefit from this change.

Ludo’.
Closed