Disarchive as a fallback for downloads

  • Done
Details
3 participants
  • Ludovic Courtès
  • Timothy Sample
  • zimoun
Owner
unassigned
Submitted by
Timothy Sample
Severity
normal
Timothy Sample wrote on 23 Mar 2021 05:42
(address . bug-guix@gnu.org)
87eeg6o50b.fsf@ngyro.com
Hello,

This patch series adds Disarchive assembly (backed by SWH lookup) as a
fallback for downloads.

To try it, make sure you are running the daemon in an environment with
Disarchive available:

$ ./pre-inst-env guix environment --ad-hoc guile disarchive
# ./pre-inst-env guix-daemon --build-users-group=guixbuild

Don’t forget to stop your existing Guix Daemon. :)

You also need to make sure that regular downloads are unavailable. I do
this by adjusting the “try” loop at the end of “url-fetch” in
“guix/build/download.scm”. I replace the usual list of URLs with ‘()’:

  (let try ((uri (append uri content-addressed-uris)))
    (match '() ; uri
      ...))

Now you can ask Guix for a recent .tar.gz source package:

$ ./pre-inst-env guix build --no-substitutes -S python-httpretty

You should see:

Trying to use Disarchive to assemble /gnu/store/kbcnm57y2q1jvhvd8zw1g5vdiwlv19y9-httpretty-1.0.5.tar.gz
Assembling the directory httpretty-1.0.5
Downloading from Software Heritage...
7903d608efc89c14afb4d692a3721156e31a43e2/
7903d608efc89c14afb4d692a3721156e31a43e2/httpretty-1.0.5/
7903d608efc89c14afb4d692a3721156e31a43e2/httpretty-1.0.5/COPYING
[...]
Checking httpretty-1.0.5 digest... ok
Assembling the tarball httpretty-1.0.5.tar
Checking httpretty-1.0.5.tar digest... ok
Assembling the Gzip file httpretty-1.0.5.tar.gz
Checking httpretty-1.0.5.tar.gz digest... ok
Copying result to /gnu/store/kbcnm57y2q1jvhvd8zw1g5vdiwlv19y9-httpretty-1.0.5.tar.gz
successfully built /gnu/store/k0b3c7kgzyn1nlyhx192pcbcgbfnhnwa-httpretty-1.0.5.tar.gz.drv

There’s lots to talk about though....

First, it looks up the metadata on my server. This is fine for a demo,
but not what we want forever. The patch series supports adding several
mirrors for looking up the metadata. In the past, we talked about
putting everything on one or a few of the big Git hosting platforms like
GitHub or GitLab. That way, it would be easily picked up by SWH and
archived “forever”. Right now, I have Cuirass set up to build the
metadata, and a little script that moves it from the build server to my
Web server. It would be simple enough to adjust that script to push it
to a remote Git repo. (Of course, the next step is to move this setup
to Guix infrastructure.) Thoughts?
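
For reference, here is a minimal sketch of how a metadata lookup URL is
put together from a mirror and a hash, following the 'disarchive-uris'
definition in patch 2/2. The names 'bytevector->hex' and
'disarchive-lookup-urls' are made up for illustration; the patch itself
uses 'bytevector->base16-string' from (guix base16) and loops over
every entry of HASHES rather than taking a single hash.

```scheme
(use-modules (rnrs bytevectors))

;; Hypothetical stand-in for 'bytevector->base16-string' from (guix base16),
;; so that this sketch runs with plain Guile.
(define (bytevector->hex bv)
  (string-concatenate
   (map (lambda (byte)
          (string-append (if (< byte 16) "0" "")
                         (number->string byte 16)))
        (bytevector->u8-list bv))))

;; Hypothetical helper: build one lookup URL per mirror, shaped like
;; MIRROR + HASH-ALGO + "/" + base16(HASH).
(define (disarchive-lookup-urls mirrors hash-algo hash)
  (map (lambda (mirror)
         (string-append mirror
                        (symbol->string hash-algo) "/"
                        (bytevector->hex hash)))
       mirrors))

;; The three-byte hash is a toy value, not a real SHA-256 digest.
(display (disarchive-lookup-urls '("https://disarchive.ngyro.com/")
                                 'sha256
                                 #vu8(0 255 16)))
(newline)
```

Since mirrors are plain strings here, each one has to end with a slash
for the concatenation to produce a valid URL.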

On the code level, there were two things I couldn’t figure out for
myself.

I made the mirror list just simple strings. AIUI, the client and the
daemon have to agree about the format of the mirror list. Given that
running old daemons is common, changing the format is difficult. Is it
worth it to copy the more flexible interface used by the content
addressed mirrors? If yes, do I have to do the same ‘module-autoload!’
dance to use ‘bytevector->base16-string’? :) (I probably would have
just copied it, but that part confused me a bit.)

I imported some modules from “guix/build/download.scm” (well, just
“base16” and “swh”). It feels weird to use a bunch of host-side modules
from what’s nominally a “guix/build” module. This is okay because
“guix/build/download.scm” is not /really/ build-side code. It’s more
like daemon (-ish) code that just happens to live in “guix/build”, which
is why importing host-side modules is OK... right?

Hopefully everything else is more-or-less fine. :)


-- Tim
Timothy Sample wrote on 23 Mar 2021 05:52
[PATCH 1/2] swh: Add a directory download procedure.
(address . 47336@debbugs.gnu.org)(name . Timothy Sample)(address . samplet@ngyro.com)
20210323045213.9419-1-samplet@ngyro.com
* guix/swh.scm (swh-download-directory): New procedure (with
implementation extracted from 'swh-download').
(swh-download): Use it to download the revision directory.
---
guix/swh.scm | 65 +++++++++++++++++++++++++++++-----------------------
1 file changed, 36 insertions(+), 29 deletions(-)

diff --git a/guix/swh.scm b/guix/swh.scm
index f11b7ea2d5..2402ec98e6 100644
--- a/guix/swh.scm
+++ b/guix/swh.scm
@@ -108,6 +108,7 @@
commit-id?
+ swh-download-directory
swh-download))
;;; Commentary:
@@ -558,12 +559,6 @@ requested bundle cooking, waiting for completion...~%"))
;;; High-level interface.
;;;
-(define (commit-id? reference)
- "Return true if REFERENCE is likely a commit ID, false otherwise---e.g., if
-it is a tag name. This is based on a simple heuristic so use with care!"
- (and (= (string-length reference) 40)
- (string-every char-set:hex-digit reference)))
-
(define (call-with-temporary-directory proc) ;FIXME: factorize
"Call PROC with a name of a temporary directory; close the directory and
delete it when leaving the dynamic extent of this call."
@@ -577,6 +572,39 @@ delete it when leaving the dynamic extent of this call."
(lambda ()
(false-if-exception (delete-file-recursively tmp-dir))))))
+(define* (swh-download-directory id output
+ #:key (log-port (current-error-port)))
+ "Download from Software Heritage the directory with the given ID, and
+unpack it to OUTPUT. Return #t on success and #f on failure."
+ (call-with-temporary-directory
+ (lambda (directory)
+ (match (vault-fetch id 'directory #:log-port log-port)
+ (#f
+ (format log-port
+ "SWH: directory ~a could not be fetched from the vault~%"
+ id)
+ #f)
+ ((? port? input)
+ (let ((tar (open-pipe* OPEN_WRITE "tar" "-C" directory "-xzvf" "-")))
+ (dump-port input tar)
+ (close-port input)
+ (let ((status (close-pipe tar)))
+ (unless (zero? status)
+ (error "tar extraction failure" status)))
+
+ (match (scandir directory)
+ (("." ".." sub-directory)
+ (copy-recursively (string-append directory "/" sub-directory)
+ output
+ #:log (%make-void-port "w"))
+ #t))))))))
+
+(define (commit-id? reference)
+ "Return true if REFERENCE is likely a commit ID, false otherwise---e.g., if
+it is a tag name. This is based on a simple heuristic so use with care!"
+ (and (= (string-length reference) 40)
+ (string-every char-set:hex-digit reference)))
+
(define* (swh-download url reference output
#:key (log-port (current-error-port)))
"Download from Software Heritage a checkout of the Git tag or commit
@@ -593,28 +621,7 @@ wait until it becomes available, which could take several minutes."
(format log-port "SWH: found revision ~a with directory at '~a'~%"
(revision-id revision)
(swh-url (revision-directory-url revision)))
- (call-with-temporary-directory
- (lambda (directory)
- (match (vault-fetch (revision-directory revision) 'directory
- #:log-port log-port)
- (#f
- (format log-port
- "SWH: directory ~a could not be fetched from the vault~%"
- (revision-directory revision))
- #f)
- ((? port? input)
- (let ((tar (open-pipe* OPEN_WRITE "tar" "-C" directory "-xzvf" "-")))
- (dump-port input tar)
- (close-port input)
- (let ((status (close-pipe tar)))
- (unless (zero? status)
- (error "tar extraction failure" status)))
-
- (match (scandir directory)
- (("." ".." sub-directory)
- (copy-recursively (string-append directory "/" sub-directory)
- output
- #:log (%make-void-port "w"))
- #t))))))))
+ (swh-download-directory (revision-directory revision) output
+ #:log-port log-port))
(#f
#f)))
--
2.31.0
Timothy Sample wrote on 23 Mar 2021 05:52
[PATCH 2/2] download: Use Disarchive as a last resort.
(address . 47336@debbugs.gnu.org)(name . Timothy Sample)(address . samplet@ngyro.com)
20210323045213.9419-2-samplet@ngyro.com
* guix/download.scm (%disarchive-mirrors): New variable.
(%disarchive-mirror-file): New variable.
(built-in-download): Add 'disarchive-mirrors' keyword argument and
pass its value along to the 'builtin:download' derivation.
(url-fetch): Pass '%disarchive-mirror-file' to 'built-in-download'.
* guix/scripts/perform-download.scm (perform-download): Read
Disarchive mirrors from the environment and pass them to
'url-fetch'.
* guix/build/download.scm (disarchive-fetch/any): New procedure.
(url-fetch): Add 'disarchive-mirrors' keyword argument, use it to
make a list of URIs, and use the new procedure to fetch the file if
all other methods fail.
---
guix/build/download.scm | 77 +++++++++++++++++++++++++++----
guix/download.scm | 19 ++++++--
guix/scripts/perform-download.scm | 7 ++-
3 files changed, 89 insertions(+), 14 deletions(-)

diff --git a/guix/build/download.scm b/guix/build/download.scm
index a22d4064ca..f476d0f8ec 100644
--- a/guix/build/download.scm
+++ b/guix/build/download.scm
@@ -2,6 +2,7 @@
;;; Copyright © 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021 Ludovic Courtès <ludo@gnu.org>
;;; Copyright © 2015 Mark H Weaver <mhw@netris.org>
;;; Copyright © 2017 Tobias Geerinckx-Rice <me@tobias.gr>
+;;; Copyright © 2021 Timothy Sample <samplet@ngyro.com>
;;;
;;; This file is part of GNU Guix.
;;;
@@ -23,10 +24,12 @@
#:use-module (web http)
#:use-module ((web client) #:hide (open-socket-for-uri))
#:use-module (web response)
+ #:use-module (guix base16)
#:use-module (guix base64)
#:use-module (guix ftp-client)
#:use-module (guix build utils)
#:use-module (guix progress)
+ #:use-module (guix swh)
#:use-module (rnrs io ports)
#:use-module (rnrs bytevectors)
#:use-module (srfi srfi-1)
@@ -626,10 +629,50 @@ Return a list of URIs."
(else
(list uri))))
+(define* (disarchive-fetch/any uris file
+ #:key (timeout 10))
+ "Fetch a Disarchive specification from any of URIS, assemble it,
+and write the output to FILE."
+ (define (fetch-specification uris)
+ (any (lambda (uri)
+ (false-if-exception*
+ (let-values (((port size) (http-fetch uri
+ #:verify-certificate? #t
+ #:timeout timeout)))
+ (let ((specification (read port)))
+ (close-port port)
+ specification))))
+ uris))
+
+ (define (resolve addresses output)
+ (any (match-lambda
+ (('swhid swhid)
+ (match (string-split swhid #\:)
+ (("swh" "1" "dir" id)
+ (format #t "Downloading from Software Heritage...~%")
+ (false-if-exception*
+ (swh-download-directory id output)))
+ (_ #f)))
+ (_ #f))
+ addresses))
+
+ (match (and=> (resolve-module '(disarchive) #:ensure #f)
+ (lambda (disarchive)
+ (cons (module-ref disarchive '%disarchive-log-port)
+ (module-ref disarchive 'disarchive-assemble))))
+ (#f #f)
+ ((%disarchive-log-port . disarchive-assemble)
+ (format #t "Trying to use Disarchive to assemble ~a~%" file)
+ (match (fetch-specification uris)
+ (#f #f)
+ (spec (parameterize ((%disarchive-log-port (current-output-port)))
+ (disarchive-assemble spec file #:resolver resolve)))))))
+
(define* (url-fetch url file
#:key
(timeout 10) (verify-certificate? #t)
(mirrors '()) (content-addressed-mirrors '())
+ (disarchive-mirrors '())
(hashes '())
print-build-trace?)
"Fetch FILE from URL; URL may be either a single string, or a list of
@@ -693,6 +736,17 @@ otherwise simply ignore them."
hashes))
content-addressed-mirrors))
+ (define disarchive-uris
+ (append-map (lambda (mirror)
+ (map (match-lambda
+ ((hash-algo . hash)
+ (string->uri
+ (string-append mirror
+ (symbol->string hash-algo) "/"
+ (bytevector->base16-string hash)))))
+ hashes))
+ disarchive-mirrors))
+
;; Make this unbuffered so 'progress-report/file' works as expected. 'line
;; means '\n', not '\r', so it's not appropriate here.
(setvbuf (current-output-port) 'none)
@@ -705,15 +759,18 @@ otherwise simply ignore them."
(or (fetch uri file)
(try tail)))
(()
- (format (current-error-port) "failed to download ~s from ~s~%"
- file url)
-
- ;; Remove FILE in case we made an incomplete download, for example due
- ;; to ENOSPC.
- (catch 'system-error
- (lambda ()
- (delete-file file))
- (const #f))
- #f))))
+ ;; If we are looking for a software archive, one last thing we
+ ;; can try is to use Disarchive to assemble it.
+ (or (disarchive-fetch/any disarchive-uris file #:timeout timeout)
+ (begin
+ (format (current-error-port) "failed to download ~s from ~s~%"
+ file url)
+ ;; Remove FILE in case we made an incomplete download, for
+ ;; example due to ENOSPC.
+ (catch 'system-error
+ (lambda ()
+ (delete-file file))
+ (const #f))
+ #f))))))
;;; download.scm ends here
diff --git a/guix/download.scm b/guix/download.scm
index 30f69c0325..72094e7318 100644
--- a/guix/download.scm
+++ b/guix/download.scm
@@ -406,12 +406,19 @@
(plain-file "content-addressed-mirrors"
(object->string %content-addressed-mirrors)))
+(define %disarchive-mirrors
+ '("https://disarchive.ngyro.com/"))
+
+(define %disarchive-mirror-file
+ (plain-file "disarchive-mirrors" (object->string %disarchive-mirrors)))
+
(define built-in-builders*
(store-lift built-in-builders))
(define* (built-in-download file-name url
#:key system hash-algo hash
mirrors content-addressed-mirrors
+ disarchive-mirrors
executable?
(guile 'unused))
"Download FILE-NAME from URL using the built-in 'download' builder. When
@@ -422,13 +429,16 @@ explicitly depend on Guile, GnuTLS, etc. Instead, the daemon performs the
download by itself using its own dependencies."
(mlet %store-monad ((mirrors (lower-object mirrors))
(content-addressed-mirrors
- (lower-object content-addressed-mirrors)))
+ (lower-object content-addressed-mirrors))
+ (disarchive-mirrors (lower-object disarchive-mirrors)))
(raw-derivation file-name "builtin:download" '()
#:system system
#:hash-algo hash-algo
#:hash hash
#:recursive? executable?
- #:sources (list mirrors content-addressed-mirrors)
+ #:sources (list mirrors
+ content-addressed-mirrors
+ disarchive-mirrors)
;; Honor the user's proxy and locale settings.
#:leaked-env-vars '("http_proxy" "https_proxy"
@@ -439,6 +449,7 @@ download by itself using its own dependencies."
("mirrors" . ,mirrors)
("content-addressed-mirrors"
. ,content-addressed-mirrors)
+ ("disarchive-mirrors" . ,disarchive-mirrors)
,@(if executable?
'(("executable" . "1"))
'()))
@@ -492,7 +503,9 @@ name in the store."
#:executable? executable?
#:mirrors %mirror-file
#:content-addressed-mirrors
- %content-addressed-mirror-file)))))
+ %content-addressed-mirror-file
+ #:disarchive-mirrors
+ %disarchive-mirror-file)))))
(define* (url-fetch/executable url hash-algo hash
#:optional name
diff --git a/guix/scripts/perform-download.scm b/guix/scripts/perform-download.scm
index 8d409092ba..6889bcef79 100644
--- a/guix/scripts/perform-download.scm
+++ b/guix/scripts/perform-download.scm
@@ -54,7 +54,8 @@ actual output is different from that when we're doing a 'bmCheck' or
(output* "out")
(executable "executable")
(mirrors "mirrors")
- (content-addressed-mirrors "content-addressed-mirrors"))
+ (content-addressed-mirrors "content-addressed-mirrors")
+ (disarchive-mirrors "disarchive-mirrors"))
(unless url
(leave (G_ "~a: missing URL~%") (derivation-file-name drv)))
@@ -79,6 +80,10 @@ actual output is different from that when we're doing a 'bmCheck' or
(lambda (port)
(eval (read port) %user-module)))
'())
+ #:disarchive-mirrors
+ (if disarchive-mirrors
+ (call-with-input-file disarchive-mirrors read)
+ '())
#:hashes `((,algo . ,hash))
;; Since DRV's output hash is known, X.509 certificate
--
2.31.0
Timothy Sample wrote on 23 Mar 2021 06:11
Re: bug#47336: Disarchive as a fallback for downloads
(address . 47336@debbugs.gnu.org)(address . control@debbugs.gnu.org)
8735wmo3n2.fsf@ngyro.com
reassign 47336 guix-patches
thanks

Oops! I sent this to the wrong list. My apologies.
zimoun wrote on 23 Mar 2021 10:35
86sg4mnreu.fsf@gmail.com
Hi Timothy,

(CC Mathieu to advise whether this could be a feature of Cuirass.)


On Tue, 23 Mar 2021 at 00:42, Timothy Sample <samplet@ngyro.com> wrote:

> This patch series adds Disarchive assembly (backed by SWH lookup) as a
> fallback for downloads.

Awesome!


> You also need to make sure that regular downloads are unavailable. I do
> this by adjusting the “try” loop at the end of “url-fetch” in
> “guix/build/download.scm”. I replace the usual list of URLs with ‘()’:
>
> (let try ((uri (append uri content-addressed-uris)))
> (match '() ; uri
> ...))
>
> Now you can ask Guix for a recent .tar.gz source package:
>
> $ ./pre-inst-env guix build --no-substitutes -S python-httpretty

Neat! Now, there is a way to easily check the coverage, right? Since
SWH is ingesting the tarballs using http://guix.gnu.org/sources.json,
there is now a means to report what Guix is able to rebuild.

> Checking httpretty-1.0.5 digest... ok

What happens if it is not ok?

> Assembling the tarball httpretty-1.0.5.tar
> Checking httpretty-1.0.5.tar digest... ok
> Assembling the Gzip file httpretty-1.0.5.tar.gz
> Checking httpretty-1.0.5.tar.gz digest... ok
> Copying result to /gnu/store/kbcnm57y2q1jvhvd8zw1g5vdiwlv19y9-httpretty-1.0.5.tar.gz

Where is the assembly done? In /tmp/, right?

> successfully built /gnu/store/k0b3c7kgzyn1nlyhx192pcbcgbfnhnwa-httpretty-1.0.5.tar.gz.drv

Just to be sure, when does Guix check the integrity checksum? I mean,
does Guix check the checksum after Disarchive has reassembled the source?


> First, it looks up the metadata on my server. This is fine for a demo,
> but not what we want forever. The patch series supports adding
> several

As we discussed before, how does the database scale? Do you have some
numbers for the current demo? They would help extrapolate what it
means for a server to «store the metadata».

> mirrors for looking up the metadata. In the past, we talked about
> putting everything on one or a few of the big Git hosting platforms like
> GitHub or Gitlab. That way, it would be easily picked up by SWH and
> archived “forever”. Right now, I have Cuirass set up to build the
> metadata, and a little script that moves it from the build server to my
> Web server. It would be simple enough to adjust that script to push it
> to a remote Git repo. (Of course, the next step is to move this setup
> to Guix infrastructure.) Thoughts?

Maybe this database could be a package, say “guix-tarball-db”, updated
in step with the package “guix”. The source of “guix-tarball-db” would
live on one of the big Git hosting platforms, like GitHub or whatever,
rather than on Guix infrastructure, or maybe on Guix infra after all.

Regularly, i.e., whenever the package “guix” is updated, the package
“guix-tarball-db” is updated at the same time. Then “guix lint -c
archival” sends the save request to SWH, even if this save request
should be automated soon. :-)

Cuirass would then need a feature to disassemble the tarballs and
update the Git repo.

Last, a service should run as in your demo. But in the long term, this
service could disappear (assuming SWH does not :-)). Therefore, we
could imagine installing “guix-tarball-db” and then tweaking some
parameters of the guix-daemon and “guix build <foo>”. Both installing
and building would fetch from SWH if both upstream disappear.

Or this “guix-tarball-db” should not be a plain package but only an
input, as an origin, for the package “guix”.


> Hopefully everything else is more-or-less fine. :)

Thanks! That’s awesome!


Cheers,
simon
Timothy Sample wrote on 23 Mar 2021 15:31
(name . zimoun)(address . zimon.toutoune@gmail.com)
87sg4mt00c.fsf@ngyro.com
Hi zimoun,

You make a lot of good points here. Let me at least provide some quick
answers even if I’m not ready to comment on some of the bigger picture
stuff.

zimoun <zimon.toutoune@gmail.com> writes:

> (CC Mathieu to advice if it could be a feature of Cuirass.)

So far I have been using Cuirass with only a tiny patch. I’m not sure
we need anything more than what Cuirass already provides. (The tiny
patch is for allowing sorting the “latestbuilds” results by “stoptime”
and “id”. This in turn allows paging through all the builds from the
API.)

> On Tue, 23 Mar 2021 at 00:42, Timothy Sample <samplet@ngyro.com> wrote:
>
>> Now you can ask Guix for a recent .tar.gz source package:
>>
>> $ ./pre-inst-env guix build --no-substitutes -S python-httpretty
>
> Neat! Now, there is a way to easily check the coverage, right? Since
> SWH is ingesting the tarball using <http://guix.gnu.org/sources.json>,
> there is now a mean to report what Guix is able to rebuild.

I’m not sure I fully understand. Disarchive covers about 4,300 Gzip’ed
tarballs (no XZ yet). There are about 100 for which compression
parameters cannot be found, and a handful (about 5) that have a
particularly funny idea about what a tarball is. The metadata builds
for my database started one week ago and have been continuously updating
since then.

Are you asking if we could check what SWH has? Yes! Each metadata
file contains the SWHID of the input directory. You could use
Disarchive to get this value or a simple “grep swhid” would do it. :)
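
As a rough sketch of the Scheme route, the directory ID can be pulled
out of a SWHID the same way the 'resolve' procedure in patch 2/2 does
it, by splitting on colons. The procedure name 'swhid->directory-id'
and the example ID below are invented for illustration; this is not
part of Disarchive or Guix.

```scheme
(use-modules (ice-9 match))

;; Hypothetical helper: return the directory ID if SWHID has the shape
;; "swh:1:dir:ID", and #f for anything else (revisions, contents, junk).
(define (swhid->directory-id swhid)
  (match (string-split swhid #\:)
    (("swh" "1" "dir" id) id)
    (_ #f)))

;; The 40-character ID here is a made-up placeholder.
(display (swhid->directory-id
          "swh:1:dir:0123456789abcdef0123456789abcdef01234567"))
(newline)
```

Anything that is not a version 1 directory identifier yields #f, which
matches how the patch falls through to the next address.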


It would be neat to have a big database of archive coverage from Guix
1.0 through to the present. It’s quite a big project though.

Of course, you know all about the SWH rate limit....

>> Checking httpretty-1.0.5 digest... ok
>
> What happens if it is not ok?

For that particular digest, it means the source directory is wrong.
Since we get the source from SWH, it means that the SWH archive is
wrong. You will have to look elsewhere, I guess (this seems pretty
unlikely). (There is a vanishing possibility that Disarchive
miscomputed the SWHID and managed to come up with a different, but still
valid SWHID....)

The other digest checks are more likely to fail. They would indicate
that Disarchive no longer knows how to interpret the metadata. Maybe
there will be a subtle bug in Disarchive 0.3.0 that causes this. Either
use an old version of Disarchive or try to fix the current version. :)
I worry about this, because it would be annoying, but the metadata does
have all the information needed to recover the original archive, so
nothing is really lost (except the user’s time).

>> Assembling the tarball httpretty-1.0.5.tar
>> Checking httpretty-1.0.5.tar digest... ok
>> Assembling the Gzip file httpretty-1.0.5.tar.gz
>> Checking httpretty-1.0.5.tar.gz digest... ok
>> Copying result to
>> /gnu/store/kbcnm57y2q1jvhvd8zw1g5vdiwlv19y9-httpretty-1.0.5.tar.gz
>
> Where is the assembly done? In /tmp/, right?

Yes.

>> successfully built
>> /gnu/store/k0b3c7kgzyn1nlyhx192pcbcgbfnhnwa-httpretty-1.0.5.tar.gz.drv
>
> Just to be sure, when does Guix check the integrity checksum? I mean,
> does Guix check the checksum after ’disassemble’ re-assembled the source?

Disarchive checks the result against the metadata to make sure it didn’t
make a mistake. Guix also checks the final result to make sure the
fixed-output derivation is correct. A fixed-output derivation is
basically just a checksum with a hint about how the data can be
obtained. Guix really only cares about the checksum, the hint can do
whatever as long as it produces the result Guix wants. With this patch
series, Disarchive is part of the hint.

>> First, it looks up the metadata on my server. This is fine for a demo,
>> but not what we want forever. The patch series supports adding
>> several
>
> As we talked before, how does the database scale? Do you have some
> numbers for the current demo? In order to try to extrapolate what does
> it mean for a server to «store the metadata».

With “gzip -9”, the average metadata file is 6.8KiB. It’s pretty
manageable. There’s room for improvement on the Disarchive side, too.
It still stores some redundant information. Uncompressed, it’s more
like 112KiB per file. This is still pretty okay, really. It means we
might hit tens of GiB over a couple years. (It would take just over
100GiB to store a million uncompressed metadata files.) The compression
ratio is what drove me to skip Git for now.
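
The back-of-the-envelope arithmetic can be checked with a few lines of
Guile. This is only a sketch: 'total-gib' is a name made up here, and
the per-file sizes are the averages quoted above, not new measurements.

```scheme
;; Hypothetical helper: total size in GiB of FILES metadata files that
;; each take KIB-PER-FILE KiB (1 GiB = 1024 * 1024 KiB).
(define (total-gib kib-per-file files)
  (exact->inexact (/ (* kib-per-file files) (* 1024 1024))))

;; One million uncompressed files at 112 KiB each: just under 107 GiB.
(display (total-gib 112 1000000))
(newline)

;; The same million files at the "gzip -9" average of 6.8 KiB:
;; roughly 6.5 GiB.
(display (total-gib 6.8 1000000))
(newline)
```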

>> mirrors for looking up the metadata. In the past, we talked about
>> putting everything on one or a few of the big Git hosting platforms like
>> GitHub or Gitlab. That way, it would be easily picked up by SWH and
>> archived “forever”. Right now, I have Cuirass set up to build the
>> metadata, and a little script that moves it from the build server to my
>> Web server. It would be simple enough to adjust that script to push it
>> to a remote Git repo. (Of course, the next step is to move this setup
>> to Guix infrastructure.) Thoughts?
>
> Maybe this database could be a package, say “guix-tarball-db”, updated
> in agreement with the package “guix”. The source of this
> “guix-tarball-db” would be a remote big Git hosting platforms like
> GitHub or whatever and not stored on Guix infrastructure, or maybe
> stored on Guix infra.
>
> Regularly, i.e., when the package “guix” is updated, in the same time,
> the package “guix-tarball-db” is updated too. The “guix lint -c
> archival” sends the saving request to SWH. Even if this saving request
> should be automated soon. :-)
>
> Then if Cuirass would have a feature to disassemble and update the Git
> repo.
>
> Last, a service should run as your demo. But for long-term, this
> service could disappear––assuming SWH not :-). Therefore, we could
> imagine installing “guix-tarball-db” then tweak some parameters of the
> guix-daemon and “guix build <foo>”. Both installing and building would
> fetch from SWH if both upstream disappear.
>
> Or this “guix-tarball-db” should not be a plain package but only an
> input as origin for the package “guix”.

This is an interesting idea, but one that I would have to think about
more. :)


-- Tim
Ludovic Courtès wrote on 27 Mar 2021 11:39
(name . Timothy Sample)(address . samplet@ngyro.com)
87eeg0284p.fsf_-_@gnu.org
Hi!

Timothy Sample <samplet@ngyro.com> skribis:

> With “gzip -9”, the average metadata file is 6.8KiB. It’s pretty
> manageable. There’s room for improvement on the Disarchive side, too.
> It still stores some redundant information. Uncompressed, it’s more
> like 112KiB per file. This is still pretty okay, really. It means we
> might hit tens of GiB over a couple years. (It would take just over
> 100GiB to store a million uncompressed metadata files.) The compression
> ratio is what drove me to skip Git for now.

If needed, the sexp serialization could still be made more compact:
using ‘write’ instead of ‘pretty-print’, shortening field names (but
that’d be incompatible).

We could also use CBOR or canonical sexp serialization, though maybe
gzipped sexps are more compact than what we could achieve?

Anyway, these are surface syntax optimizations that can always be made
at a later point in time when we feel a need for them.

Ludo’.
Ludovic Courtès wrote on 27 Mar 2021 11:40
(name . Timothy Sample)(address . samplet@ngyro.com)(address . 47336@debbugs.gnu.org)
87a6qo2835.fsf_-_@gnu.org
Timothy Sample <samplet@ngyro.com> skribis:

> * guix/swh.scm (swh-download-directory): New procedure (with
> implementation extracted from 'swh-download').
> (swh-download): Use it to download the revision directory.

LGTM!
Ludovic Courtès wrote on 27 Mar 2021 11:57
(name . Timothy Sample)(address . samplet@ngyro.com)(address . 47336@debbugs.gnu.org)
87y2e8zwxj.fsf_-_@gnu.org
Hi!

Timothy Sample <samplet@ngyro.com> skribis:

> * guix/download.scm (%disarchive-mirrors): New variable.
> (%disarchive-mirror-file): New variable.
> (built-in-download): Add 'disarchive-mirrors' keyword argument and
> pass its value along to the 'builtin:download' derivation.
> (url-fetch): Pass '%disarchive-mirror-file' to 'built-in-download'.
> * guix/scripts/perform-download.scm (perform-download): Read
> Disarchive mirrors from the environment and pass them to
> 'url-fetch'.
> * guix/build/download.scm (disarchive-fetch/any): New procedure.
> (url-fetch): Add 'disarchive-mirrors' keyword argument, use it to
> make a list of URIs, and use the new procedure to fetch the file if
> all other methods fail.

[...]

> + #:use-module (guix base16)
> #:use-module (guix base64)
> #:use-module (guix ftp-client)
> #:use-module (guix build utils)
> #:use-module (guix progress)
> + #:use-module (guix swh)

Maybe #:autoload them.

> +(define* (disarchive-fetch/any uris file
> + #:key (timeout 10))
> + "Fetch a Disarchive specification from any of URIS, assemble it,
> +and write the output to FILE."
> + (define (fetch-specification uris)
> + (any (lambda (uri)
> + (false-if-exception*
> + (let-values (((port size) (http-fetch uri
> + #:verify-certificate? #t
> + #:timeout timeout)))

Perhaps add #:key (verify-certificate? #t) and have the caller pass it?
Currently (guix scripts perform-download) sets it to #f, which is a good
idea IMO.

> + (match (and=> (resolve-module '(disarchive) #:ensure #f)
> + (lambda (disarchive)
> + (cons (module-ref disarchive '%disarchive-log-port)
> + (module-ref disarchive 'disarchive-assemble))))
> + (#f #f)
> + ((%disarchive-log-port . disarchive-assemble)
> + (format #t "Trying to use Disarchive to assemble ~a~%" file)
> + (match (fetch-specification uris)
> + (#f #f)
> + (spec (parameterize ((%disarchive-log-port (current-output-port)))
> + (disarchive-assemble spec file #:resolver resolve)))))))

So we would normally arrange so that the ‘guix’ package depends on
Disarchive, such that the above ‘resolve-module’ call works when done
via ‘guix perform-download’, right?

In the #f case, perhaps we should print something like “Disarchive not
found, bailing out”?

That’s all I have to say; it looks great to me!

That’s quite a milestone, it’d be great to have that in the upcoming
release. Next we can discuss how to populate the Disarchive database
and where to do that (or your hosting fees could easily skyrocket :-)).
I suppose we could run that in Berlin and/or we could make an argument
about using SWH or Inria resources for that.

Thanks,
Ludo’.
Ludovic Courtès wrote on 10 Apr 2021 22:52
(name . Timothy Sample)(address . samplet@ngyro.com)(address . 47336@debbugs.gnu.org)
87blal2772.fsf_-_@gnu.org
Ping! :-)

Ludovic Courtès <ludo@gnu.org> skribis:

> Timothy Sample <samplet@ngyro.com> skribis:
>
>> * guix/swh.scm (swh-download-directory): New procedure (with
>> implementation extracted from 'swh-download').
>> (swh-download): Use it to download the revision directory.
>
> LGTM!
Ludovic Courtès wrote on 26 Apr 2021 11:49
(name . Timothy Sample)(address . samplet@ngyro.com)(address . 47336@debbugs.gnu.org)
87v989e5oc.fsf_-_@gnu.org
Hi Timothy,

Ping²!

Let me know if you’d like me to apply the patches on your behalf.

Ludo’.

Ludovic Courtès <ludo@gnu.org> skribis:

> Timothy Sample <samplet@ngyro.com> skribis:
>
>> * guix/swh.scm (swh-download-directory): New procedure (with
>> implementation extracted from 'swh-download').
>> (swh-download): Use it to download the revision directory.
>
> LGTM!
Timothy Sample wrote on 28 Apr 2021 04:30
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 47336-done@debbugs.gnu.org)
87o8dzkunk.fsf_-_@ngyro.com
Hi,

Ludovic Courtès <ludo@gnu.org> writes:

> Ping²!
>
> Let me know if you’d like me to apply the patches on your behalf.

No, no. I’m just a little distracted over here. I just pushed this
series with the updates you suggested (using #:autoload, passing
#:verify-certificate?, and being a bit more chatty). Sorry for the
delay and thanks for the reminder.

Next, I’ll convert my Cuirass 0.x setup to a Cuirass 1.x setup, and then
I can start a discussion about moving the metadata builds to
ci.guix.gnu.org.

Also, to answer your other question:

> So we would normally arrange so that the ‘guix’ package depends on
> Disarchive, such that the above ‘resolve-module’ call works when done
> via ‘guix perform-download’, right?

That’s the idea. I’m not confident about updating the ‘guix’ package
myself, though....


-- Tim
Closed
Timothy Sample wrote on 28 Apr 2021 09:01
(name . Ludovic Courtès)(address . ludo@gnu.org)
87h7jqlwnw.fsf@ngyro.com
reopen 47336
thanks

Hi again,

Timothy Sample <samplet@ngyro.com> writes:

> I just pushed this series [...]

And broke “guix pull”!! (I somehow fooled myself into thinking that I
had already tested with “guix pull --url=...” locally.) I reverted the
offending commit.

It turns out that adding a reference from “(guix build download)” to
“(guix swh)” breaks “compute-guix-derivation” in
“build-aux/build-self.scm”. This is because “(guix swh)” references
“(json)”, which is not available in the “compute-guix-derivation”
environment. I tried mimicking the “fake-git” trick, but it didn’t work
(I guess it needs the “define-json-mapping” macro at compile time).

Everything works if I remove the #:autoload for “(guix swh)” and put

  ;; If we import (guix swh) directly, we introduce a compile-time
  ;; dependency on Guile-JSON.  This breaks the "build-self" code, which
  ;; needs to build this module without Guile-JSON.  Hence, we track
  ;; down the following procedure at runtime.
  (define swh-download-directory
    (module-ref (resolve-module '(guix swh)) 'swh-download-directory))

inside of “disarchive-fetch/any” (just before it’s needed). Does this
approach look okay?


-- Tim
Ludovic Courtès wrote on 29 Apr 2021 09:48
(name . Timothy Sample)(address . samplet@ngyro.com)(address . 47336@debbugs.gnu.org)
874kfpo7is.fsf@gnu.org
Hi!

Timothy Sample <samplet@ngyro.com> skribis:

> And broke “guix pull”!! (I somehow fooled myself into thinking that I
> had already tested with “guix pull --url=...” locally.) I reverted the
> offending commit.

You can test with ‘guix pull’ (you need to make sure to specify the
right file:// URL *and* branch), or you can run “make as-derivation”.

> It turns out that adding a reference from “(guix build download)” to
> “(guix swh)” breaks “compute-guix-derivation” in
> “build-aux/build-self.scm”. This is because “(guix swh)” references
> “(json)”, which is not available in the “compute-guix-derivation”
> environment. I tried mimicking the “fake-git” trick, but it didn’t work
> (I guess it needs the “define-json-mapping” macro at compile time).
>
> Everything works if I remove the #:autoload for “(guix swh)” and put
>
>   ;; If we import (guix swh) directly, we introduce a compile-time
>   ;; dependency on Guile-JSON.  This breaks the "build-self" code, which
>   ;; needs to build this module without Guile-JSON.  Hence, we track
>   ;; down the following procedure at runtime.
>   (define swh-download-directory
>     (module-ref (resolve-module '(guix swh)) 'swh-download-directory))
>
> inside of “disarchive-fetch/any” (just before it’s needed). Does this
> approach look okay?

That’s one possibility.

The patch below takes another approach. I think it’s aesthetically
slightly more pleasant because we don’t have to play ‘resolve-module’
tricks for obscure reasons. WDYT?

(It also fixes a format string argument mismatch.)

Thanks!

Ludo’.
diff --git a/build-aux/build-self.scm b/build-aux/build-self.scm
index 853a2f328f..f100ff4aae 100644
--- a/build-aux/build-self.scm
+++ b/build-aux/build-self.scm
@@ -250,6 +250,7 @@ interface (FFI) of Guile.")
(match-lambda
(('guix 'config) #f)
(('guix 'channels) #f)
+ (('guix 'build 'download) #f) ;autoloaded by (guix download)
(('guix _ ...) #t)
(('gnu _ ...) #t)
(_ #f)))
diff --git a/guix/build/download.scm b/guix/build/download.scm
index 5431d7c682..ce31038b05 100644
--- a/guix/build/download.scm
+++ b/guix/build/download.scm
@@ -650,7 +650,7 @@ and write the output to FILE."
(('swhid swhid)
(match (string-split swhid #\:)
(("swh" "1" "dir" id)
- (format #t "Downloading from Software Heritage...~%" file)
+ (format #t "Downloading ~a from Software Heritage...~%" file)
(false-if-exception*
(swh-download-directory id output)))
(_ #f)))
diff --git a/guix/self.scm b/guix/self.scm
index 3154d180ac..7181205610 100644
--- a/guix/self.scm
+++ b/guix/self.scm
@@ -878,7 +878,8 @@ itself."
("guix/store/schema.sql"
,(local-file "../guix/store/schema.sql")))
- #:extensions (list guile-gcrypt)
+ #:extensions (list guile-gcrypt
+ guile-json) ;for (guix swh)
#:guile-for-build guile-for-build))
(define *extra-modules*
Timothy Sample wrote on 29 Apr 2021 19:24
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 47336-done@debbugs.gnu.org)
878s51knra.fsf_-_@ngyro.com
Hello,

Ludovic Courtès <ludo@gnu.org> writes:

> Timothy Sample <samplet@ngyro.com> skribis:
>
>> And broke “guix pull”!! (I somehow fooled myself into thinking that I
>> had already tested with “guix pull --url=...” locally.) I reverted the
>> offending commit.
>
> You can test with ‘guix pull’ (you need to make sure to specify the
> right file:// URL *and* branch), or you can run “make as-derivation”.

I will definitely be more careful with this in the future.

>> [...] Does this approach look okay?
>
> That’s one possibility.
>
> The patch below takes another approach. I think it’s aesthetically
> slightly more pleasant because we don’t have to play ‘resolve-module’
> tricks for obscure reasons. WDYT?

This is exactly what I was hoping for, but I couldn’t quite connect all
the dots in “build-self.scm”. Thanks!

> (It also fixes a format string argument mismatch.)

Good catch!

I’ve pushed the updated patch and am closing the issue. :)


-- Tim
Closed
Ludovic Courtès wrote on 14 May 2021 23:36
(name . Timothy Sample)(address . samplet@ngyro.com)(address . 47336@debbugs.gnu.org)
87im3lknfu.fsf@gnu.org
Hi!

Timothy Sample <samplet@ngyro.com> skribis:

> This patch series adds Disarchive assembly (backed by SWH lookup) as a
> fallback for downloads.
>
> To try it, make sure you are running the daemon in an environment with
> Disarchive available:
>
> $ ./pre-inst-env guix environment --ad-hoc guile disarchive
> # ./pre-inst-env guix-daemon --build-users-group=guixbuild
>
> Don’t forget to stop your existing Guix Daemon. :)
>
> You also need to make sure that regular downloads are unavailable. I do
> this by adjusting the “try” loop at the end of “url-fetch” in
> “guix/build/download.scm”. I replace the usual list of URLs with ‘()’:
>
> (let try ((uri (append uri content-addressed-uris)))
> (match '() ; uri
> ...))
>
> Now you can ask Guix for a recent .tar.gz source package:
>
> $ ./pre-inst-env guix build --no-substitutes -S python-httpretty
>
> You should see:
>
> Trying to use Disarchive to assemble /gnu/store/kbcnm57y2q1jvhvd8zw1g5vdiwlv19y9-httpretty-1.0.5.tar.gz
> Assembling the directory httpretty-1.0.5
> Downloading from Software Heritage...
> 7903d608efc89c14afb4d692a3721156e31a43e2/
> 7903d608efc89c14afb4d692a3721156e31a43e2/httpretty-1.0.5/
> 7903d608efc89c14afb4d692a3721156e31a43e2/httpretty-1.0.5/COPYING
> [...]
> Checking httpretty-1.0.5 digest... ok
> Assembling the tarball httpretty-1.0.5.tar
> Checking httpretty-1.0.5.tar digest... ok
> Assembling the Gzip file httpretty-1.0.5.tar.gz
> Checking httpretty-1.0.5.tar.gz digest... ok
> Copying result to /gnu/store/kbcnm57y2q1jvhvd8zw1g5vdiwlv19y9-httpretty-1.0.5.tar.gz
> successfully built /gnu/store/k0b3c7kgzyn1nlyhx192pcbcgbfnhnwa-httpretty-1.0.5.tar.gz.drv

Commits 67bf61255414115ffae0141df9dd3623bc742bff and
0b1f70d1a792af40aa0d13b3d227fde88f02d061 add the dependency on
Disarchive, so this fallback path is now enabled!

> There’s lots to talk about though....
>
> First, it looks up the metadata on my server. This is fine for a demo,
> but not what we want forever. The patch series supports adding several
> mirrors for looking up the metadata. In the past, we talked about
> putting everything on one or a few of the big Git hosting platforms like
> GitHub or Gitlab. That way, it would be easily picked up by SWH and
> archived “forever”. Right now, I have Cuirass set up to build the
> metadata, and a little script that moves it from the build server to my
> Web server. It would be simple enough to adjust that script to push it
> to a remote Git repo. (Of course, the next step is to move this setup
> to Guix infrastructure.) Thoughts?

We should talk to SWH, giving them the figures you gave earlier in this
thread. But yeah, a Git repo looks best to me (it would be useful to
keep track of changes, for example if we eventually update metadata to a
new format) and it simplifies archival to SWH.

The second thing we need to figure out is where to create this database.
If you have a Cuirass job already, we should run it on ci.guix. WDYT?

> On the code level, there were two things I couldn’t figure out for
> myself.
>
> I made the mirror list just simple strings. AIUI, the client and the
> daemon have to agree about the format of the mirror list. Given that
> running old daemons is common, changing the format is difficult. Is it
> worth it to copy the more flexible interface used by the content
> addressed mirrors? If yes, do I have to do the same ‘module-autoload!’
> dance to use ‘bytevector->base16-string’? :) (I probably would have
> just copied it, but that part confused me a bit.)

I had overlooked this suggestion of yours. Yes, I think it’s best to
copy the SWH scheme. Don’t worry about ‘module-autoload!’: nowadays we
can safely assume (guix base16) is available.

When we change from list-of-strings to list-of-procedures, we’ll have to
adjust the (guix build download) code so that it can deal with both.
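
As a rough sketch of what “dealing with both” could look like, assuming a mirror is either a base URL string or a procedure that builds the URL: the procedure name ‘mirror->url’, the two-argument procedure signature, the example.org URLs, and the “ALGORITHM/HEX-DIGEST” path layout are all assumptions made for illustration, not the actual (guix build download) interface.

```scheme
(use-modules (ice-9 match))

;; Return the metadata URL for a file with the given ALGORITHM (a
;; symbol) and HEX-DIGEST (a string).  MIRROR is either a base URL
;; string (old format) or a procedure (new format).  Hypothetical.
(define (mirror->url mirror algorithm hex-digest)
  (match mirror
    ((? string? base)
     ;; Old-style mirror: append "ALGORITHM/HEX-DIGEST" to the base.
     (string-append base (symbol->string algorithm) "/" hex-digest))
    ((? procedure? proc)
     ;; New-style mirror: let the procedure build the URL itself.
     (proc algorithm hex-digest))))

;; A mixed list, as an old client and a new client might send it.
(define %example-mirrors
  (list "https://mirror.example.org/disarchive/"
        (lambda (algorithm hex-digest)
          (string-append "https://other.example.org/lookup?algo="
                         (symbol->string algorithm)
                         "&hash=" hex-digest))))

(for-each (lambda (mirror)
            (display (mirror->url mirror 'sha256 "abc123"))
            (newline))
          %example-mirrors)
```

Dispatching on the type of each element lets a list of plain strings from an older client keep working unchanged.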

> I imported some modules from “guix/build/download.scm” (well, just
> “base16” and “swh”). It feels weird to use a bunch of host-side modules
> from what’s nominally a “guix/build” module. This is okay because
> “guix/build/download.scm” is not /really/ build-side code. It’s more
> like daemon (-ish) code that just happens to live in “guix/build”, which
> is why importing host-side modules is OK... right?

Yup. :-) In the end, the whole point is to reuse code on both sides,
and that’s what’s being done here.

Thanks,
Ludo’.