gforge.inria.fr to be taken off-line in Dec. 2020

Open
Submitted by Ludovic Courtès.
Details
7 participants
  • Dr. Arne Babenhauserheide
  • Ludovic Courtès
  • Christopher Baines
  • Ricardo Wurmus
  • Timothy Sample
  • zimoun
Owner
unassigned
Severity
important
Ludovic Courtès wrote on 2 Jul 09:29 +0200
(address . bug-guix@gnu.org)
87mu4iv0gc.fsf@inria.fr
Hello!
The hosting site gforge.inria.fr will be taken off-line in December 2020. This GForge instance hosts source code as tarballs, Subversion repos, and Git repos. Users have been invited to migrate to gitlab.inria.fr, which is Git only. It seems that Software Heritage hasn’t archived (yet) all of gforge.inria.fr. Let’s keep track of the situation in this issue.
The following packages have their source on gforge.inria.fr:
scheme@(guile-user)> ,pp packages-on-gforge
$7 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
      #<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0>
      #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280>
      #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
      #<package pt-scotch@6.0.6 gnu/packages/maths.scm:2920 7f632d832640>
      #<package scotch@6.0.6 gnu/packages/maths.scm:2774 7f632d832780>
      #<package pt-scotch32@6.0.6 gnu/packages/maths.scm:2944 7f632d8325a0>
      #<package scotch32@6.0.6 gnu/packages/maths.scm:2873 7f632d8326e0>
      #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
      #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
      #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
‘isl’ (a dependency of GCC) has its source on gforge.inria.fr but it’s also mirrored at gcc.gnu.org apparently.
Of these, the following are available on Software Heritage:
scheme@(guile-user)> ,pp archived-source
$8 = (#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0>
      #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280>
      #<package pt-scotch@6.0.6 gnu/packages/maths.scm:2920 7f632d832640>
      #<package scotch@6.0.6 gnu/packages/maths.scm:2774 7f632d832780>
      #<package pt-scotch32@6.0.6 gnu/packages/maths.scm:2944 7f632d8325a0>
      #<package scotch32@6.0.6 gnu/packages/maths.scm:2873 7f632d8326e0>
      #<package isl@0.18 gnu/packages/gcc.scm:925 7f632dc82320>
      #<package isl@0.11.1 gnu/packages/gcc.scm:939 7f632dc82280>)
So we’ll be missing these:
scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
$11 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
       #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
       #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
       #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
       #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
Attached is the code I used for this.
Thanks,
Ludo’.
(use-modules (guix) (gnu)
             (guix svn-download)
             (guix git-download)
             (guix swh)
             (ice-9 match)
             (srfi srfi-1)
             (srfi srfi-26))

(define (gforge? package)
  (define (gforge-string? str)
    (string-contains str "gforge.inria.fr"))

  (match (package-source package)
    ((? origin? o)
     (match (origin-uri o)
       ((? string? url)
        (gforge-string? url))
       (((? string? urls) ...)
        (any gforge-string? urls))              ;or 'find'
       ((? git-reference? ref)
        (gforge-string? (git-reference-url ref)))
       ((? svn-reference? ref)
        (gforge-string? (svn-reference-url ref)))
       (_ #f)))
    (_ #f)))

(define packages-on-gforge
  (fold-packages (lambda (package result)
                   (if (gforge? package)
                       (cons package result)
                       result))
                 '()))

(define archived-source
  (filter (lambda (package)
            (let* ((origin (package-source package))
                   (hash (origin-hash origin)))
              (lookup-content (content-hash-value hash)
                              (symbol->string
                               (content-hash-algorithm hash)))))
          packages-on-gforge))
zimoun wrote on 2 Jul 10:50 +0200
(name . Maurice Brémond)(address . Maurice.Bremond@inria.fr)
86h7uq8fmk.fsf@gmail.com
Hi Ludo,
On Thu, 02 Jul 2020 at 09:29, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:
> The hosting site gforge.inria.fr will be taken off-line in December
> 2020. This GForge instance hosts source code as tarballs, Subversion
> repos, and Git repos. Users have been invited to migrate to
> gitlab.inria.fr, which is Git only. It seems that Software Heritage
> hasn’t archived (yet) all of gforge.inria.fr. Let’s keep track of the
> situation in this issue.
[...]
> --8<---------------cut here---------------start------------->8---
> scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
> $11 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
>        #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
>        #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
>        #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
>        #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
> --8<---------------cut here---------------end--------------->8---
All 5 of them are 'url-fetch', so we can expect that sources.json will be up before the shutdown in December. :-)
Then, all the 14 packages we have from gforge.inria.fr will be git-fetch, right? So should we contact upstream and ask them to inform us when they switch? Then we can adapt the origin.
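For illustration, adapting such an origin would look roughly like this; the URL, tag, and hash below are placeholders rather than the real values for any of these packages:

;; Hypothetical migration of an origin from url-fetch to git-fetch once the
;; project has moved to gitlab.inria.fr.  All values here are placeholders.
(origin
  (method git-fetch)
  (uri (git-reference
        (url "https://gitlab.inria.fr/example/example.git")  ;placeholder
        (commit (string-append "v" version))))
  (file-name (git-file-name name version))
  (sha256
   (base32 "0000000000000000000000000000000000000000000000000000")))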
> (use-modules (guix) (gnu)
>              (guix svn-download)
>              (guix git-download)
>              (guix swh)
It does not work properly if I do not replace it with
((guix swh) #:hide (origin?))
Well, I have not investigated further.
>              (ice-9 match)
>              (srfi srfi-1)
>              (srfi srfi-26))
[...]
> (define archived-source
>   (filter (lambda (package)
>             (let* ((origin (package-source package))
>                    (hash (origin-hash origin)))
>               (lookup-content (content-hash-value hash)
>                               (symbol->string
>                                (content-hash-algorithm hash)))))
>           packages-on-gforge))
I am a bit lost about the other discussion on falling back for tarballs. But that's another story. :-)

Cheers,
simon
Ludovic Courtès wrote on 2 Jul 12:03 +0200
(name . zimoun)(address . zimon.toutoune@gmail.com)
87d05etero.fsf@gnu.org
zimoun <zimon.toutoune@gmail.com> skribis:
> On Thu, 02 Jul 2020 at 09:29, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:
>
>> The hosting site gforge.inria.fr will be taken off-line in December
>> 2020. This GForge instance hosts source code as tarballs, Subversion
>> repos, and Git repos. Users have been invited to migrate to
>> gitlab.inria.fr, which is Git only. It seems that Software Heritage
>> hasn’t archived (yet) all of gforge.inria.fr. Let’s keep track of the
>> situation in this issue.
>
> [...]
>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
>> $11 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
>>        #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
>>        #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
>>        #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
>>        #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
>> --8<---------------cut here---------------end--------------->8---
>
> All 5 of them are 'url-fetch', so we can expect that sources.json will be up
> before the shutdown in December. :-)
Unfortunately, it won’t help for tarballs:
https://sympa.inria.fr/sympa/arc/swh-devel/2020-07/msg00001.html
There’s this other discussion you mentioned, which I hope will have a positive outcome:
https://forge.softwareheritage.org/T2430
>> (use-modules (guix) (gnu)
>>              (guix svn-download)
>>              (guix git-download)
>>              (guix swh)
>
> It does not work properly if I do not replace it with
>
> ((guix swh) #:hide (origin?))
Oh right, I had overlooked this as I played at the REPL.
Thanks,
Ludo’.
Ludovic Courtès wrote on 10 Jul 00:30 +0200
control message for bug #42162
(address . control@debbugs.gnu.org)
877dvc9v9o.fsf@gnu.org
severity 42162 important
quit
Ludovic Courtès wrote on 11 Jul 17:50 +0200
Recovering source tarballs
(name . zimoun)(address . zimon.toutoune@gmail.com)
87r1tit5j6.fsf_-_@gnu.org
Hi,
Ludovic Courtès <ludo@gnu.org> skribis:
> There’s this other discussion you mentioned, which I hope will have a
> positive outcome:
>
> https://forge.softwareheritage.org/T2430
This discussion as well as discussions on #swh-devel have made it clear that SWH will not archive raw tarballs, at least not in the foreseeable future. Instead, it will keep archiving the contents of tarballs, as it has always done—that’s already a huge service.
Not storing raw tarballs makes sense from an engineering perspective, but it does mean that we cannot rely on SWH as a content-addressed mirror for tarballs. (In fact, some raw tarballs are available on SWH, but that’s mostly “by chance”, for instance because they appear as-is in a Git repo that was ingested.) In fact this is one of the challenges mentioned in https://guix.gnu.org/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/.
So we need a solution for now (and quite urgently), and a solution for the future.
For the now, since 70% of our packages use ‘url-fetch’, we need to be able to fetch or to reconstruct tarballs. There’s no way around it.
In the short term, we should arrange so that the build farm keeps GC roots on source tarballs for an indefinite amount of time. Cuirass jobset? Mcron job to preserve GC roots? Ideas?
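As a very rough sketch of the mcron option (the schedule, root location, and reliance on ‘guix build --sources=transitive’ here are assumptions, not a worked-out design; in practice the package list would probably need batching):

;; Hypothetical mcron job that downloads and GC-roots the sources of every
;; package once a day.  Schedule and root path are arbitrary placeholders.
(define preserve-sources-job
  #~(job "0 3 * * *"                     ;daily at 03:00
         (string-append "guix build --sources=transitive --keep-going "
                        "--root=/var/guix/gcroots/package-sources "
                        "$(guix package -A | cut -f1)")))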
For the future, we could store nar hashes of unpacked tarballs instead of hashes over tarballs. But that raises two questions:
• If we no longer deal with tarballs but upstreams keep signing tarballs (not raw directory hashes), how can we authenticate our code after the fact?
• SWH internally store Git-tree hashes, not nar hashes, so we still wouldn’t be able to fetch our unpacked trees from SWH.
(Both issues were previously discussed at https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/.)
So for the medium term, and perhaps for the future, a possible option would be to preserve tarball metadata so we can reconstruct them:
tarball = metadata + tree
After all, tarballs are byproducts and should be no exception: we should build them from source. :-)
In https://forge.softwareheritage.org/T2430, Stefano mentioned pristine-tar, which does almost that, but not quite: it stores a binary delta between a tarball and a tree:
https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html
I think we should have something more transparent than a binary delta.
The code below can “disassemble” and “assemble” a tar. When it disassembles it, it generates metadata like this:
(tar-source
  (version 0)
  (headers
   (("guile-3.0.4/"
     (mode 493)
     (size 0)
     (mtime 1593007723)
     (chksum 3979)
     (typeflag #\5))
    ("guile-3.0.4/m4/"
     (mode 493)
     (size 0)
     (mtime 1593007720)
     (chksum 4184)
     (typeflag #\5))
    ("guile-3.0.4/m4/pipe2.m4"
     (mode 420)
     (size 531)
     (mtime 1536050419)
     (chksum 4812)
     (hash (sha256
            "arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))
    ("guile-3.0.4/m4/time_h.m4"
     (mode 420)
     (size 5471)
     (mtime 1536050419)
     (chksum 4974)
     (hash (sha256
            "z4py26rmvsk4st7db6vwziwwhkrjjrwj7nra4al6ipqh2ms45kka")))
    […]
The ’assemble-archive’ procedure consumes that, looks up file contents by hash on SWH, and reconstructs the original tarball…
… at least in theory, because in practice we hit the SWH rate limit after looking up a few files:
https://archive.softwareheritage.org/api/#rate-limiting
So it’s a bit ridiculous, but we may have to store a SWH “dir” identifier for the whole extracted tree—a Git-tree hash—since that would allow us to retrieve the whole thing in a single HTTP request.
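For what it’s worth, a single-request lookup of such a “dir” identifier could look roughly like this, going through the public SWH API; this assumes guile-json is available, omits error handling, and the exact endpoint and JSON shape would need double-checking:

;; Sketch: list the entries of an SWH "dir" (a Git-tree hash) in one request.
(use-modules (web client) (web response) (json) (ice-9 receive))

(define (swh-directory-entries dir-id)
  "Return the entries of the SWH directory DIR-ID, as decoded JSON."
  (receive (response port)
      (http-get (string-append
                 "https://archive.softwareheritage.org/api/1/directory/"
                 dir-id "/")
                #:streaming? #t)
    (and (= 200 (response-code response))
         (json->scm port))))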
Besides, we’ll also have to handle compression: storing gzip/xz headers and compression levels.

How would we put that in practice? Good question. :-)
I think we’d have to maintain a database that maps tarball hashes to metadata (!). A simple version of it could be a Git repo where, say, ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would contain the metadata above. The nice thing is that the Git repo itself could be archived by SWH. :-)
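Concretely, the writer side of such a database could be as simple as the sketch below; the layout, and the use of guile-gcrypt’s ‘file-sha256’, are illustrative assumptions, and ‘disassemble-archive’ is the procedure from the prototype attached to this message (it expects an uncompressed tar, compression metadata being handled separately):

;; Disassemble TARBALL and store the resulting sexp under
;; DB-DIRECTORY/sha256/<base32 of the tarball's sha256>.
(use-modules (gcrypt hash) (guix base32) (guix build utils)
             (ice-9 pretty-print))

(define (store-tarball-metadata db-directory tarball)
  "Write the metadata of TARBALL under DB-DIRECTORY, keyed by its sha256.
Return the name of the metadata file."
  (let* ((hash (file-sha256 tarball))
         (file (string-append db-directory "/sha256/"
                              (bytevector->base32-string hash))))
    (mkdir-p (dirname file))
    (call-with-output-file file
      (lambda (port)
        (pretty-print (call-with-input-file tarball disassemble-archive)
                      port)))
    file))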
Thus, if a tarball vanishes, we’d look it up in the database and reconstruct it from its metadata plus content stored in SWH.
Thoughts?
Anyhow, we should team up with fellow NixOS and SWH hackers to address this, and with developers of other distros as well—this problem is not just that of the functional deployment geeks, is it?
Ludo’.
;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2020 Ludovic Courtès <ludo@gnu.org>
;;;
;;; This file is part of GNU Guix.
;;;
;;; GNU Guix is free software; you can redistribute it and/or modify it
;;; under the terms of the GNU General Public License as published by
;;; the Free Software Foundation; either version 3 of the License, or (at
;;; your option) any later version.
;;;
;;; GNU Guix is distributed in the hope that it will be useful, but
;;; WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;;; GNU General Public License for more details.
;;;
;;; You should have received a copy of the GNU General Public License
;;; along with GNU Guix.  If not, see <http://www.gnu.org/licenses/>.

(define-module (tar)
  #:use-module (ice-9 match)
  #:use-module (ice-9 binary-ports)
  #:use-module (rnrs bytevectors)
  #:use-module (srfi srfi-1)
  #:use-module (srfi srfi-9)
  #:use-module (srfi srfi-26)
  #:use-module (gcrypt hash)
  #:use-module (guix base16)
  #:use-module (guix base32)
  #:use-module ((ice-9 rdelim) #:select ((read-string . get-string-all)))
  #:use-module (web client)
  #:use-module (web response)
  #:export (disassemble-archive
            assemble-archive))

;;;
;;; Tar.
;;;

(define %TMAGIC "ustar\0")
(define %TVERSION "00")

(define-syntax-rule (define-field-type type type-size read-proc write-proc)
  "Define TYPE as a ustar header field type of TYPE-SIZE bytes.  READ-PROC is
the procedure to obtain the value of an object of this type from a bytevector,
and WRITE-PROC writes it to a bytevector."
  (define-syntax type
    (syntax-rules (read write size)
      ((_ size) type-size)
      ((_ read) read-proc)
      ((_ write) write-proc))))

(define (sub-bytevector bv offset size)
  (let ((sub (make-bytevector size)))
    (bytevector-copy! bv offset sub 0 size)
    sub))

(define (read-integer bv offset len)
  (string->number (read-string bv offset len) 8))

(define read-integer12 (cut read-integer <> <> 12))
(define read-integer8 (cut read-integer <> <> 8))

(define (read-string bv offset max-len)
  (define len
    (let loop ((len 0))
      (cond ((= len max-len) len)
            ((zero? (bytevector-u8-ref bv (+ offset len))) len)
            (else (loop (+ 1 len))))))
  (utf8->string (sub-bytevector bv offset len)))

(define read-string155 (cut read-string <> <> 155))
(define read-string100 (cut read-string <> <> 100))
(define read-string32 (cut read-string <> <> 32))
(define read-string6 (cut read-string <> <> 6))
(define read-string2 (cut read-string <> <> 2))

(define (read-character bv offset)
  (integer->char (bytevector-u8-ref bv offset)))

(define (read-padding12 bv offset)
  (bytevector-uint-ref bv offset (endianness big) 12))

(define (write-integer! bv offset value len)
  (let ((str (string-pad (number->string value 8) (- len 1) #\0)))
    (write-string! bv offset str len)))

(define write-integer12! (cut write-integer! <> <> <> 12))
(define write-integer8! (cut write-integer! <> <> <> 8))

(define (write-string! bv offset str len)
  (let* ((str (string-pad-right str len #\nul))
         (buf (string->utf8 str)))
    (bytevector-copy! buf 0 bv offset (bytevector-length buf))))

(define write-string155! (cut write-string! <> <> <> 155))
(define write-string100! (cut write-string! <> <> <> 100))
(define write-string32! (cut write-string! <> <> <> 32))
(define write-string6! (cut write-string! <> <> <> 6))
(define write-string2! (cut write-string! <> <> <> 2))

(define (write-character! bv offset value)
  (bytevector-u8-set! bv offset (char->integer value)))

(define (write-padding12! bv offset value)
  (bytevector-uint-set! bv offset value (endianness big) 12))

(define-field-type integer12 12 read-integer12 write-integer12!)
(define-field-type integer8 8 read-integer8 write-integer8!)
(define-field-type character 1 read-character write-character!)
(define-field-type string155 155 read-string155 write-string155!)
(define-field-type string100 100 read-string100 write-string100!)
(define-field-type string32 32 read-string32 write-string32!)
(define-field-type string6 6 read-string6 write-string6!)
(define-field-type string2 2 read-string2 write-string2!)
(define-field-type padding12 12 read-padding12 write-padding12!)

(define-syntax define-pack
  (syntax-rules ()
    ((_ type ctor pred write-header read-header
        (field-names field-types field-getters) ...)
     (begin
       (define-record-type type
         (ctor field-names ...)
         pred
         (field-names field-getters) ...)

       (define (read-header port)
         "Return the ustar header read from PORT."
         (set-port-encoding! port "ISO-8859-1")
         (let ((bv (get-bytevector-n port (+ (field-types size) ...))))
           (letrec-syntax ((build (syntax-rules ()
                                    ((_ bv () offset (fields (... ...)))
                                     (ctor fields (... ...)))
                                    ((_ bv (type0 types (... ...)) offset
                                        (fields (... ...)))
                                     (build bv (types (... ...))
                                            (+ offset (type0 size))
                                            (fields (... ...)
                                                    ((type0 read) bv offset)))))))
             (build bv (field-types ...) 0 ()))))

       (define (write-header header port)
         "Serialize HEADER, a <ustar-header> record, to PORT."
         (let* ((len (+ (field-types size) ...))
                (bv (make-bytevector len)))
           (match header
             (($ type field-names ...)
              (letrec-syntax ((write! (syntax-rules ()
                                        ((_ () offset) #t)
                                        ((_ ((type value) rest (... ...)) offset)
                                         (begin
                                           ((type write) bv offset value)
                                           (write! (rest (... ...))
                                                   (+ offset (type size))))))))
                (write! ((field-types field-names) ...) 0)
                (put-bytevector port bv))))))))))

;; The ustar header.  See <tar.h>.
(define-pack <ustar-header>
  %make-ustar-header ustar-header?
  write-ustar-header read-ustar-header
  (name      string100 ustar-header-name)   ;NUL-terminated if NUL fits
  (mode      integer8  ustar-header-mode)
  (uid       integer8  ustar-header-uid)
  (gid       integer8  ustar-header-gid)
  (size      integer12 ustar-header-size)
  (mtime     integer12 ustar-header-mtime)
  (chksum    integer8  ustar-header-checksum)
  (typeflag  character ustar-header-type-flag)
  (linkname  string100 ustar-header-link-name)
  (magic     string6   ustar-header-magic)    ;must be TMAGIC
  (version   string2   ustar-header-version)  ;must be TVERSION
  (uname     string32  ustar-header-uname)    ;NUL-terminated
  (gname     string32  ustar-header-gname)    ;NUL-terminated
  (devmajor  integer8  ustar-header-device-major)
  (devminor  integer8  ustar-header-device-minor)
  (prefix    string155 ustar-header-prefix)   ;NUL-terminated if NUL fits
  (padding   padding12 ustar-header-padding))

(define* (make-ustar-header name #:key
                            (mode 0) (uid 0) (gid 0) (size 0) (mtime 0)
                            (checksum 0) (type-flag 0) (link-name "")
                            (magic %TMAGIC) (version %TVERSION)
                            (uname "") (gname "")
                            (device-major 0) (device-minor 0)
                            (prefix "") (padding 0))
  (%make-ustar-header name mode uid gid size mtime checksum type-flag
                      link-name magic version uname gname
                      device-major device-minor prefix padding))

(define %zero-header
  ;; The all-zeros header, which marks the end of stream.
  (read-ustar-header (open-bytevector-input-port (make-bytevector 512 0))))

(define (consumer port)
  "Return a procedure that consumes or skips the given number of bytes from
PORT."
  (if (false-if-exception (seek port 0 SEEK_CUR))
      (lambda (len)
        (seek port len SEEK_CUR))
      (lambda (len)
        (define bv (make-bytevector 8192))
        (let loop ((len len))
          (define block (min len (bytevector-length bv)))
          (unless (or (zero? block)
                      (eof-object? (get-bytevector-n! port bv 0 block)))
            (loop (- len block)))))))

(define (fold-archive proc seed port)
  "Read ustar headers from PORT; for each header, call PROC."
  (define skip (consumer port))

  (let loop ((result seed))
    (define header (read-ustar-header port))
    (if (equal? header %zero-header)
        result
        (let* ((result (proc header port result))
               (size (ustar-header-size header))
               (remainder (modulo size 512)))
          ;; It's up to PROC to consume the SIZE bytes of data corresponding
          ;; to HEADER.  Here we consume padding.
          (unless (zero? remainder)
            (skip (- 512 remainder)))
          (loop result)))))

;;;
;;; Disassembling/assembling an archive.
;;;

(define (dump in out size)
  "Copy SIZE bytes from IN to OUT."
  (define buf-size 65536)
  (define buf (make-bytevector buf-size))
  (let loop ((left size))
    (if (<= left 0)
        0
        (let ((read (get-bytevector-n! in buf 0 (min left buf-size))))
          (if (eof-object? read)
              left
              (begin
                (put-bytevector out buf 0 read)
                (loop (- left read))))))))

(define* (disassemble-archive port
                              #:optional (algorithm (hash-algorithm sha256)))
  "Read tar archive from PORT and return an sexp representing its metadata,
including individual file hashes with ALGORITHM."
  (define headers+hashes
    (fold-archive (lambda (header port result)
                    (if (zero? (ustar-header-size header))
                        (alist-cons header #f result)
                        (let ()
                          (define-values (hash-port get-hash)
                            (open-hash-port algorithm))
                          (dump port hash-port (ustar-header-size header))
                          (close-port hash-port)
                          (alist-cons header (get-hash) result))))
                  '()
                  port))

  (define header+hash->sexp
    (match-lambda
      ((header . hash)
       (letrec-syntax ((serialize (syntax-rules ()
                                    ((_) '())
                                    ((_ (tag get default) rest ...)
                                     (let ((value (get header)))
                                       (append (if (equal? default value)
                                                   '()
                                                   `((tag ,value)))
                                               (serialize rest ...))))
                                    ((_ (tag get) rest ...)
                                     (append `((tag ,(get header)))
                                             (serialize rest ...))))))
         `(,(ustar-header-name header)
           ,@(serialize (mode ustar-header-mode)
                        (uid ustar-header-uid 0)
                        (gid ustar-header-gid 0)
                        (size ustar-header-size)
                        (mtime ustar-header-mtime)
                        (chksum ustar-header-checksum)
                        (typeflag ustar-header-type-flag #\nul)
                        (linkname ustar-header-link-name "")
                        (magic ustar-header-magic "")
                        (version ustar-header-version "")
                        (uname ustar-header-uname "")
                        (gname ustar-header-gname "")
                        (devmajor ustar-header-device-major 0)
                        (devminor ustar-header-device-minor 0)
                        (prefix ustar-header-prefix "")
                        (padding ustar-header-padding 0)
                        (hash (lambda (_)
                                (and hash
                                     `(,(hash-algorithm-name algorithm)
                                       ,(bytevector->base32-string hash))))
                              #f)))))))

  `(tar-source
    (version 0)
    (headers ,(map header+hash->sexp (reverse headers+hashes)))))

(define (fetch-from-swh algorithm hash)
  (define url
    (string-append "https://archive.softwareheritage.org/api/1/content/"
                   (symbol->string algorithm) ":"
                   (bytevector->base16-string hash) "/raw/"))
  (define-values (response port)
    (http-get url #:streaming? #t #:verify-certificate? #f))
  (if (= 200 (response-code response))
      port
      (throw 'swh-fetch-error url (get-string-all port))))

(define* (assemble-archive source port
                           #:optional (fetch-data fetch-from-swh))
  "Assemble archive from SOURCE, an sexp as returned by 'disassemble-archive'."
  (define sexp->header
    (match-lambda
      ((name . properties)
       (let ((ref (lambda (field)
                    (and=> (assq-ref properties field) car))))
         (make-ustar-header name
                            #:mode (ref 'mode)
                            #:uid (or (ref 'uid) 0)
                            #:gid (or (ref 'gid) 0)
                            #:size (ref 'size)
                            #:mtime (ref 'mtime)
                            #:checksum (ref 'chksum)
                            #:type-flag (or (ref 'typeflag) #\nul)
                            #:link-name (or (ref 'linkname) "")
                            #:magic (or (ref 'magic) "")
                            #:version (or (ref 'version) "")
                            #:uname (or (ref 'uname) "")
                            #:gname (or (ref 'gname) "")
                            #:device-major (or (ref 'devmajor) 0)
                            #:device-minor (or (ref 'devminor) 0)
                            #:prefix (or (ref 'prefix) "")
                            #:padding (or (ref 'padding) 0))))))

  (define sexp->data
    (match-lambda
      ((name . properties)
       (match (assq-ref properties 'hash)
         (((algorithm (= base32-string->bytevector hash)) _ ...)
          (fetch-data algorithm hash))
         (#f
          (open-input-string ""))))))

  (match source
    (('tar-source ('version 0) ('headers headers) _ ...)
     (for-each (lambda (sexp)
                 (let ((header (sexp->header sexp))
                       (data (sexp->data sexp)))
                   (write-ustar-header header port)
                   (dump-port data port)
                   (close-port data)))
               headers))))
Christopher Baines wrote on 13 Jul 21:20 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)
87a703jk78.fsf@cbaines.net
Ludovic Courtès <ludo@gnu.org> writes:
> Hi,
>
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> There’s this other discussion you mentioned, which I hope will have a
>> positive outcome:
>>
>> https://forge.softwareheritage.org/T2430
>
> This discussion as well as discussions on #swh-devel have made it clear
> that SWH will not archive raw tarballs, at least not in the foreseeable
> future. Instead, it will keep archiving the contents of tarballs, as it
> has always done—that’s already a huge service.
>
> Not storing raw tarballs makes sense from an engineering perspective,
> but it does mean that we cannot rely on SWH as a content-addressed
> mirror for tarballs. (In fact, some raw tarballs are available on SWH,
> but that’s mostly “by chance”, for instance because they appear as-is in
> a Git repo that was ingested.) In fact this is one of the challenges
> mentioned in
> <https://guix.gnu.org/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/>.
>
> So we need a solution for now (and quite urgently), and a solution for
> the future.
>
> For the now, since 70% of our packages use ‘url-fetch’, we need to be
> able to fetch or to reconstruct tarballs. There’s no way around it.
>
> In the short term, we should arrange so that the build farm keeps GC
> roots on source tarballs for an indefinite amount of time. Cuirass
> jobset? Mcron job to preserve GC roots? Ideas?
Going forward, being methodical as a project about storing the tarballs and source material for the packages is probably the way to ensure it's available for the future. I'm not sure the data storage cost is significant; the cost of doing this is probably in working out what to store, doing so in a redundant manner, and making the data available.
The Guix Data Service knows about fixed output derivations, so it might be possible to backfill such a store by just attempting to build those derivations. It might also be possible to use the Guix Data Service to work out what's available, and what tarballs are missing.
Chris
zimoun wrote on 15 Jul 18:55 +0200
Re: Recovering source tarballs
(name . Ludovic Courtès)(address . ludo@gnu.org)
CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@mail.gmail.com
Hi Ludo,
Well, you enlarge the discussion to more than the issue of the 5 url-fetch packages on gforge.inria.fr. :-)

First of all, you wrote [1] ``Migration away from tarballs is already happening as more and more software is distributed straight from content-addressed VCS repositories, though progress has been relatively slow since we first discussed it in 2016.'' but on the other hand Guix more often than not [2] uses "url-fetch" even when "git-fetch" is available upstream. In other words, I am not convinced the migration is really happening...
The issue would be mitigated if Guix transitions from "url-fetch" to "git-fetch" when possible.
1: https://forge.softwareheritage.org/T2430#45800
2: https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html

Second, trying to do some stats about the SWH coverage, I note that a non-negligible number of "url-fetch" packages are reachable by "lookup-content". The coverage is not straightforward to assess because of the 120 requests per hour rate limit or unexpected server errors. Another story.
Well, I would like to have numbers because I do not know what the issue concretely is: how many "url-fetch" packages are reachable? And if they are unreachable, is it because they are not in the archive yet? Or is it because Guix does not have enough info to look them up?

On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:
> For the now, since 70% of our packages use ‘url-fetch’, we need to be
> able to fetch or to reconstruct tarballs. There’s no way around it.
Yes, but for example all the packages in gnu/packages/bioconductor.scm could be "git-fetch". Today the source is fetched with url-fetch but it could be fetched with git-fetch from https://git.bioconductor.org/packages/flowCore or git@git.bioconductor.org:packages/flowCore.
Another example is the packages in gnu/packages/emacs-xyz.scm: the ones from elpa.gnu.org are "url-fetch" and could be "git-fetch", for example using http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD
So I would be more reserved about the "no way around it". :-) I mean the 70% could be a bit mitigated.

> In the short term, we should arrange so that the build farm keeps GC
> roots on source tarballs for an indefinite amount of time. Cuirass
> jobset? Mcron job to preserve GC roots? Ideas?
Yes, preserving source tarballs for an indefinite amount of time will help. At least all the packages where "lookup-content" returns #f, which means they are not in SWH or they are unreachable -- both are equivalent from Guix's side.
What about in addition push to IPFS? Feasible? Lookup issue?
> For the future, we could store nar hashes of unpacked tarballs instead
> of hashes over tarballs. But that raises two questions:
>
>   • If we no longer deal with tarballs but upstreams keep signing
>     tarballs (not raw directory hashes), how can we authenticate our
>     code after the fact?
Does Guix automatically authenticate code using signed tarballs?

>   • SWH internally store Git-tree hashes, not nar hashes, so we still
>     wouldn’t be able to fetch our unpacked trees from SWH.
>
> (Both issues were previously discussed at
> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
>   tarball = metadata + tree
There are different issues at different levels:

1. how to look up? what information do we need to keep/store to be able
   to query SWH?
2. how to check the integrity? what information do we need to keep/store
   to be able to verify that SWH returns what Guix expects?
3. how to authenticate? where does the tarball metadata have to be stored
   if SWH removes it?
Basically, the git-fetch source stores 3 identifiers:
- upstream url
- commit / tag
- integrity (sha256)
Fetching from SWH requires the commit only (lookup-revision) or the tag+url (lookup-origin-revision); then from the returned revision, the integrity of the downloaded data is checked using the sha256, right?
Therefore, one way to fix lookup of the url-fetch source is to add an extra field mimicking the commit role.
The easiest is to store a SWHID or an identifier allowing us to deduce the SWHID.
I have not checked the code, but something like this:
https://pypi.org/project/swh.model/
https://forge.softwareheritage.org/source/swh-model/
and at package time, this identifier is added, similarly to integrity.
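To make the idea concrete, an origin carrying such an extra identifier might look something like the following; the 'swhid' field does not exist in Guix today and is purely hypothetical, as are the URL and hashes:

;; Purely hypothetical: an origin that carries, next to the usual sha256,
;; an SWHID-like identifier usable for later lookups in the SWH archive.
(origin
  (method url-fetch)
  (uri "https://example.org/foo-1.0.tar.gz")          ;placeholder
  (sha256
   (base32 "0000000000000000000000000000000000000000000000000000"))
  (swhid "swh:1:dir:0000000000000000000000000000000000000000"))  ;hypothetical field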
Aside, does Guix use the authentication metadata that tarballs provide?

(BTW, I failed [3,4] to package swh.model, so if someone wants to give it a try…
3: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00158.html
4: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00161.html)

> After all, tarballs are byproducts and should be no exception: we should
> build them from source. :-)
[...]
> The code below can “disassemble” and “assemble” a tar. When it
> disassembles it, it generates metadata like this:
[...]
> The ’assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…
Where do you plan to store the "disassembled" metadata?
And where do you plan to "assemble-archive"?
I mean,
What is pushed to SWH? And how? What is fetched from SWH? And how?
(Well, answer below. :-))
> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:
Yes, it is 120 requests per hour and 10 saves per hour. Well, I do not think they will increase these numbers much in general. However, they seem open to it for specific machines. So, I do not want to speak for them, but we could ask for a higher rate limit for ci.guix.gnu.org, for example. Then we need to distinguish between source substitutes and binary substitutes. And basically, when a user runs "guix build foo", if the source is not available upstream nor already on ci.guix.gnu.org, then ci.guix.gnu.org fetches the missing sources from SWH and delivers them to the user.

> https://archive.softwareheritage.org/api/#rate-limiting
>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree—a Git-tree hash—since that would
> allow us to retrieve the whole thing in a single HTTP request.
Well, the limited resources of SWH are an issue, but SWH is not a mirror, it is an archive. :-)
And as I wrote above, we could ask SWH to increase the rate limit for specific machines such as ci.guix.gnu.org.

> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!). A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above. The nice thing is that the Git repo itself
> could be archived by SWH. :-)
How should this database that maps tarball hashes to metadata be maintained? Git push hook? Cron task?
What about foreign channels? Should they maintain their own map?
To summarize, it would work like this, right?
at package time:
 - store an integrity identifier (today sha256-nix-base32)
 - disassemble the tarball
 - commit to another repo the metadata using the path (address)
   sha256/base32/<identifier>
 - push to packages-repo *and* metadata-database-repo
at future time: (upstream has disappeared, say!)
 - use the integrity identifier to query the database repo
 - look up the SWHID from the database repo
 - fetch the data from SWH
 - or look up the IPFS identifier from the database repo and fetch the
   data from IPFS, for another example
 - re-assemble the tarball using the metadata from the database repo
   (see the sketch below)
 - check integrity, authentication, etc.
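A very rough sketch of that "at future time" path, assuming the hypothetical database layout described above and the 'assemble-archive' procedure from the (tar) prototype posted earlier in this thread:

;; Given the sha256 of a vanished tarball, read its metadata from the
;; database repo and re-assemble it, pulling file contents from SWH.
(use-modules (guix base32))

(define (recover-tarball db-directory sha256-hash output)
  "Re-create into OUTPUT the tarball whose sha256 is SHA256-HASH (a
bytevector), using metadata stored under DB-DIRECTORY."
  (let* ((file (string-append db-directory "/sha256/"
                              (bytevector->base32-string sha256-hash)))
         (metadata (call-with-input-file file read)))
    (call-with-output-file output
      (lambda (port)
        (assemble-archive metadata port)))))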
Well, right, it is better than only adding an identifier for looking up, as I described above, because it is more general and flexible than having only SWH as a fall-back.
The format of metadata (disassemble) that you propose is schemish (obviously! :-)) but we could propose something more JSON-like.

All the best,
simon
Ludovic Courtès wrote on 20 Jul 10:39 +0200
(name . zimoun)(address . zimon.toutoune@gmail.com)
87365mzil1.fsf@gnu.org
Hi!
There are many many comments in your message, so I took the liberty to reply only to the essence of it. :-)
zimoun <zimon.toutoune@gmail.com> skribis:
> On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> For the now, since 70% of our packages use ‘url-fetch’, we need to be
>> able to fetch or to reconstruct tarballs. There’s no way around it.
>
> Yes, but for example all the packages in gnu/packages/bioconductor.scm
> could be "git-fetch". Today the source is fetched with url-fetch but it
> could be fetched with git-fetch from
> https://git.bioconductor.org/packages/flowCore or
> git@git.bioconductor.org:packages/flowCore.
>
> Another example is the packages in gnu/packages/emacs-xyz.scm: the
> ones from elpa.gnu.org are "url-fetch" and could be "git-fetch", for
> example using
> http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD
>
> So I would be more reserved about the "no way around it". :-) I mean
> the 70% could be a bit mitigated.
The “no way around it” was about the situation today: it’s a fact that 70% of packages are built from tarballs, so we need to be able to fetch them or reconstruct them.
However, the two examples above are good ideas as to the way forward: we could start a url-fetch-to-git-fetch migration in these two cases, and perhaps more.
>> In the short term, we should arrange so that the build farm keeps GC
>> roots on source tarballs for an indefinite amount of time. Cuirass
>> jobset? Mcron job to preserve GC roots? Ideas?
>
> Yes, preserving source tarballs for an indefinite amount of time will
> help. At least all the packages where "lookup-content" returns #f,
> which means they are not in SWH or they are unreachable -- both are
> equivalent from Guix's side.
>
> What about in addition push to IPFS? Feasible? Lookup issue?
Lookup issue. :-) The hash in a CID is not just a raw blob hash. Files are typically chunked beforehand, assembled as a Merkle tree, and the CID is roughly the hash to the tree root. So it would seem we can’t use IPFS as-is for tarballs.
>> For the future, we could store nar hashes of unpacked tarballs instead
>> of hashes over tarballs. But that raises two questions:
>>
>>   • If we no longer deal with tarballs but upstreams keep signing
>>     tarballs (not raw directory hashes), how can we authenticate our
>>     code after the fact?
>
> Does Guix automatically authenticate code using signed tarballs?
Not automatically; packagers are supposed to authenticate code when they add a package (‘guix refresh -u’ does that automatically).
>>   • SWH internally store Git-tree hashes, not nar hashes, so we still
>>     wouldn’t be able to fetch our unpacked trees from SWH.
>>
>> (Both issues were previously discussed at
>> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>>
>> So for the medium term, and perhaps for the future, a possible option
>> would be to preserve tarball metadata so we can reconstruct them:
>>
>>   tarball = metadata + tree
>
> There are different issues at different levels:
>
> 1. how to look up? what information do we need to keep/store to be able
>    to query SWH?
> 2. how to check the integrity? what information do we need to keep/store
>    to be able to verify that SWH returns what Guix expects?
> 3. how to authenticate? where does the tarball metadata have to be stored
>    if SWH removes it?
>
> Basically, the git-fetch source stores 3 identifiers:
>
> - upstream url
> - commit / tag
> - integrity (sha256)
>
> Fetching from SWH requires the commit only (lookup-revision) or the
> tag+url (lookup-origin-revision); then from the returned revision, the
> integrity of the downloaded data is checked using the sha256, right?
Yes.
> Therefore, one way to fix lookup of the url-fetch source is to add an
> extra field mimicking the commit role.
But today, we store tarball hashes, not directory hashes.
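To illustrate the distinction being made here, a sketch of the two hashes in question, assuming guile-gcrypt's (gcrypt hash) and Guix's (guix serialization) modules:

;; The sha256 of the tarball itself (what 'url-fetch' origins record today)
;; versus the sha256 of the nar serialization of an unpacked tree (what
;; 'git-fetch' origins record).  Illustration only.
(use-modules (gcrypt hash) (guix serialization) (guix base32)
             (srfi srfi-11))

(define (tarball-hash file)
  (bytevector->base32-string (file-sha256 file)))

(define (nar-hash directory)
  (let-values (((port get-hash) (open-sha256-port)))
    (write-file directory port)          ;serialize DIRECTORY as a nar
    (force-output port)
    (bytevector->base32-string (get-hash))))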
> The easiest is to store a SWHID or an identifier allowing us to deduce the
> SWHID.
>
> I have not checked the code, but something like this:
>
> https://pypi.org/project/swh.model/
> https://forge.softwareheritage.org/source/swh-model/
>
> and at package time, this identifier is added, similarly to integrity.
I’m skeptical about adding a field that is practically never used.
[...]
>> The code below can “disassemble” and “assemble” a tar. When it
>> disassembles it, it generates metadata like this:
>
> [...]
>
>> The ’assemble-archive’ procedure consumes that, looks up file contents
>> by hash on SWH, and reconstructs the original tarball…
>
> Where do you plan to store the "disassembled" metadata?
> And where do you plan to "assemble-archive"?
We’d have a repo/database containing metadata indexed by tarball sha256.
> How should this database that maps tarball hashes to metadata be
> maintained? Git push hook? Cron task?
Yes, something like that. :-)
> What about foreign channels? Should they maintain their own map?
Yes, presumably.
> To summarize, it would work like this, right?
>
> at package time:
>  - store an integrity identifier (today sha256-nix-base32)
>  - disassemble the tarball
>  - commit to another repo the metadata using the path (address)
>    sha256/base32/<identifier>
>  - push to packages-repo *and* metadata-database-repo
>
> at future time: (upstream has disappeared, say!)
>  - use the integrity identifier to query the database repo
>  - look up the SWHID from the database repo
>  - fetch the data from SWH
>  - or look up the IPFS identifier from the database repo and fetch the
>    data from IPFS, for another example
>  - re-assemble the tarball using the metadata from the database repo
>  - check integrity, authentication, etc.
That’s the idea.
> The format of metadata (disassemble) that you propose is schemish
> (obviously! :-)) but we could propose something more JSON-like.
Sure, if that helps get other people on-board, why not (though sexps have lived much longer than JSON and XML together :-)).
Thanks,
Ludo’.
zimoun wrote on 20 Jul 17:52 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)
CAJ3okZ0iMNjv93MM1FkEB3_zXA48Rq3rKXhwwug85fNRRc41Mg@mail.gmail.com
Hi,
On Mon, 20 Jul 2020 at 10:39, Ludovic Courtès <ludo@gnu.org> wrote:
> zimoun <zimon.toutoune@gmail.com> skribis:
>
>> On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:
>
> There are many many comments in your message, so I took the liberty to
> reply only to the essence of it. :-)
Many comments because many open topics. ;-)

> However, the two examples above are good ideas as to the way forward: we
> could start a url-fetch-to-git-fetch migration in these two cases, and
> perhaps more.
Well, to be honest, I tried to probe such a migration when I opened this thread:
https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html
and I have tried to summarize the pros/cons arguments here:
https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00448.html

>> What about in addition push to IPFS? Feasible? Lookup issue?
>
> Lookup issue. :-) The hash in a CID is not just a raw blob hash.
> Files are typically chunked beforehand, assembled as a Merkle tree, and
> the CID is roughly the hash to the tree root. So it would seem we can’t
> use IPFS as-is for tarballs.
Using the Git-repo map/table, then it becomes an option, right? Well, SWH would be a backend and IPFS could be another one. Or any "cloudy" storage system that could appear in the future, right?

>>>   • If we no longer deal with tarballs but upstreams keep signing
>>>     tarballs (not raw directory hashes), how can we authenticate our
>>>     code after the fact?
>>
>> Does Guix automatically authenticate code using signed tarballs?
>
> Not automatically; packagers are supposed to authenticate code when they
> add a package (‘guix refresh -u’ does that automatically).
So I miss the point of having this authentication information in the future where upstream has disappeared. The authentication is done at packaging time. So once it is done, merged into master and then pushed to SWH, being able to authenticate again does not really matter.
And if it matters, all should be updated each time vulnerabilities are discovered, and so I am not sure SWH makes sense for this use-case.

> But today, we store tarball hashes, not directory hashes.
We store what "guix hash" returns. ;-)
So it is easy to migrate from tarball hashes to whatever else. :-)
I mean, it is "(sha256 (base32" and it is easy to have also
"(sha256-tree (base32" or something like that.
In the case where the integrity is also used as lookup key.
>> The format of metadata (disassemble) that you propose is schemish
>> (obviously! :-)) but we could propose something more JSON-like.
>
> Sure, if that helps get other people on-board, why not (though sexps
> have lived much longer than JSON and XML together :-)).
Lived much longer and still less less less used than JSON or XML alone. ;-)

I have not yet done the clear back-of-the-envelope computations. Roughly, there are ~23 commits on average per day updating packages, so if say 70% of them are url-fetch, it is ~16 new tarballs per day, on average. How will the model using a Git repo scale? Because, naively, the output of "disassemble-archive" in full text (pretty-print format) for hello-2.10.tar is 120KB, and so 16*365*120K = ~700MB per year without considering all the Git internals. Obviously, it depends on the number of files and I do not know if hello is a representative example.
And I do not know how Git operates on binary files if the disassembled tarball is stored as a .go file, or any other binary format.

All the best,
simon
PS: Just in case someone wants to check where I estimated the numbers from.
for ci in $(git log --after=v1.0.0 --oneline \
              | grep "gnu:" | grep -E "(Add|Update)" \
              | cut -f1 -d' ')
do
    git --no-pager log -1 $ci --format="%cs"
done | uniq -c > /tmp/commits
guix environment --ad-hoc r-minimal \
     -- R -e 'summary(read.table("/tmp/commits"))'
gzip -dc < $(guix build -S hello) > /tmp/hello.tar
guix repl -L /tmp/tar/
scheme@(guix-user)> (call-with-input-file "hello.tar" (lambda (port) (disassemble-archive port)))
Dr. Arne Babenhauserheide wrote on 20 Jul 19:05 +0200
Re: bug#42162: Recovering source tarballs
(name . zimoun)(address . zimon.toutoune@gmail.com)
87wo2ynml7.fsf@web.de
zimoun <zimon.toutoune@gmail.com> writes:
>>> The format of metadata (disassemble) that you propose is schemish
>>> (obviously! :-)) but we could propose something more JSON-like.
>>
>> Sure, if that helps get other people on-board, why not (though sexps
>> have lived much longer than JSON and XML together :-)).
>
> Lived much longer and still less less less used than JSON or XML alone. ;-)
Though this is likely not a function of the format, but of the popularity of both Javascript and Java.
JSON isn’t a well defined format for arbitrary data (try to store numbers as keys and reason about what you get as return values), and XML is a monster of complexity.
Best wishes,
Arne
--
Unpolitisch sein
heißt politisch sein
ohne es zu merken
zimoun wrote on 20 Jul 21:59 +0200
(name . Dr. Arne Babenhauserheide)(address . arne_bab@web.de)
CAJ3okZ2ndtsn5t38t+C_odoYDa-m8cdpFG9tnKC8FoKuoHXveA@mail.gmail.com
On Mon, 20 Jul 2020 at 19:05, Dr. Arne Babenhauserheide <arne_bab@web.de> wrote:
> zimoun <zimon.toutoune@gmail.com> writes:
>
>>>> The format of metadata (disassemble) that you propose is schemish
>>>> (obviously! :-)) but we could propose something more JSON-like.
>>>
>>> Sure, if that helps get other people on-board, why not (though sexps
>>> have lived much longer than JSON and XML together :-)).
>>
>> Lived much longer and still less less less used than JSON or XML alone. ;-)
>
> Though this is likely not a function of the format, but of the
> popularity of both Javascript and Java.
Well, the popularity matters to attract a broad audience and maybe get other people on-board, if that is the aim. It seems to be the de-facto format, even if JSON has flaws. And zillions of parsers for all the languages are floating around, which is not the case for sexps, even if they are easier to parse.
And JSON is already used in Guix, see [1] for an example.
1: https://guix.gnu.org/manual/devel/en/guix.html#Additional-Build-Options
However, I am not convinced that JSON, or similarly sexps, will scale well from a "Tarball Heritage" perspective.
All the best,
simon
zimoun wrote on 20 Jul 23:27 +0200
865zahev23.fsf@gmail.com
Hi Chris,
On Mon, 13 Jul 2020 at 20:20, Christopher Baines <mail@cbaines.net> wrote:
> Going forward, being methodical as a project about storing the tarballs
> and source material for the packages is probably the way to ensure it's
> available for the future. I'm not sure the data storage cost is
> significant; the cost of doing this is probably in working out what to
> store, doing so in a redundant manner, and making the data available.
A really rough estimate is 120KB on average* per raw tarball. So if we consider 14000 packages and 70% of them are url-fetch, then it leads to 14k*0.7*120K = 1.2GB, which is not significant. Moreover, if we extrapolate the numbers, between v1.0.0 and now it is 23 commits per day modifying gnu/packages/, so 0.7*23*120K*365 = 700MB per year. However, the 120KB of metadata to re-assemble the tarball have to be compared to the 712KB of raw compressed tarball; both figures are about the hello package.
*based on the hello package. And it depends on the number of files in the tarball. The file is stored uncompressed: a plain sexp.

Therefore, in addition to what to store, redundancy and availability, one question is how to store it: Git repo? SQL database? etc.


> The Guix Data Service knows about fixed output derivations, so it might
> be possible to backfill such a store by just attempting to build those
> derivations. It might also be possible to use the Guix Data Service to
> work out what's available, and what tarballs are missing.
Missing from where? The substitutes farm or SWH?

Cheers,
simon
Ludovic Courtès wrote on 21 Jul 23:22 +0200
Re: Recovering source tarballs
(name . zimoun)(address . zimon.toutoune@gmail.com)
87k0ywlg1z.fsf@gnu.org
Hi!
zimoun <zimon.toutoune@gmail.com> skribis:
> On Mon, 20 Jul 2020 at 10:39, Ludovic Courtès <ludo@gnu.org> wrote:
>> zimoun <zimon.toutoune@gmail.com> skribis:
>>> On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:
>>
>> There are many many comments in your message, so I took the liberty to
>> reply only to the essence of it. :-)
>
> Many comments because many open topics. ;-)
Understood, and they’re very valuable, but (1) I choose not to just do email :-), and (2) I like to separate issues in reasonable chunks rather than long threads addressing all the problems we’ll have to deal with.
I think it really helps keep things tractable!
>> Lookup issue. :-) The hash in a CID is not just a raw blob hash.
>> Files are typically chunked beforehand, assembled as a Merkle tree, and
>> the CID is roughly the hash to the tree root. So it would seem we can’t
>> use IPFS as-is for tarballs.
>
> Using the Git-repo map/table, then it becomes an option, right?
> Well, SWH would be a backend and IPFS could be another one. Or any
> "cloudy" storage system that could appear in the future, right?
Sure, why not.
>>>>   • If we no longer deal with tarballs but upstreams keep signing
>>>>     tarballs (not raw directory hashes), how can we authenticate our
>>>>     code after the fact?
>>>
>>> Does Guix automatically authenticate code using signed tarballs?
>>
>> Not automatically; packagers are supposed to authenticate code when they
>> add a package (‘guix refresh -u’ does that automatically).
>
> So I miss the point of having this authentication information in the
> future where upstream has disappeared.
What I meant above is that often, what we have is things like detached signatures of raw tarballs, or documents referring to a tarball hash:
https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html
>> But today, we store tarball hashes, not directory hashes.
>
> We store what "guix hash" returns. ;-)
> So it is easy to migrate from tarball hashes to whatever else. :-)
True, but that other thing, as it stands, would be a nar hash (like for ‘git-fetch’), not a Git-tree hash (what SWH uses).
> I mean, it is "(sha256 (base32" and it is easy to have also
> "(sha256-tree (base32" or something like that.
Right, but that first and foremost requires daemon support.
It’s doable, but migration would have to take a long time, since this is touching core parts of the “protocol”.
> I have not yet done the clear back-of-the-envelope computations. Roughly,
> there are ~23 commits on average per day updating packages, so if say 70%
> of them are url-fetch, it is ~16 new tarballs per day, on average.
> How will the model using a Git repo scale? Because, naively, the
> output of "disassemble-archive" in full text (pretty-print format) for
> hello-2.10.tar is 120KB, and so 16*365*120K = ~700MB per year
> without considering all the Git internals. Obviously, it depends on
> the number of files and I do not know if hello is a representative
> example.
Interesting, thanks for making that calculation! We could make the format more compact if needed.
Thanks,
Ludo’.
zimoun wrote on 22 Jul 02:27 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)
86o8o81jic.fsf@gmail.com
Hi!
On Tue, 21 Jul 2020 at 23:22, Ludovic Courtès <ludo@gnu.org> wrote:
>>>>>   • If we no longer deal with tarballs but upstreams keep signing
>>>>>     tarballs (not raw directory hashes), how can we authenticate our
>>>>>     code after the fact?
>>>>
>>>> Does Guix automatically authenticate code using signed tarballs?
>>>
>>> Not automatically; packagers are supposed to authenticate code when they
>>> add a package (‘guix refresh -u’ does that automatically).
>>
>> So I miss the point of having this authentication information in the
>> future where upstream has disappeared.
>
> What I meant above is that often, what we have is things like detached
> signatures of raw tarballs, or documents referring to a tarball hash:
>
> https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html
I still miss why it matters to store detached signatures of raw tarballs.
The authentication is done now (at package time and/or at inclusion in the lookup table proposal). I miss why we would have to re-authenticate again later.
IMHO, having a lookup table that returns the signatures from a tarball hash, or an archive of all the OpenPGP keys ever published, is another topic.

>>> But today, we store tarball hashes, not directory hashes.
>>
>> We store what "guix hash" returns. ;-)
>> So it is easy to migrate from tarball hashes to whatever else. :-)
>
> True, but that other thing, as it stands, would be a nar hash (like for
> ‘git-fetch’), not a Git-tree hash (what SWH uses).
Ok, now I am totally convinced that a lookup table is The Right Thing™. :-)
>> I mean, it is "(sha256 (base32" and it is easy to have also
>> "(sha256-tree (base32" or something like that.
>
> Right, but that first and foremost requires daemon support.
>
> It’s doable, but migration would have to take a long time, since this is
> touching core parts of the “protocol”.
Doable but not necessarily tractable. :-)

>> I have not yet done the clear back-of-the-envelope computations. Roughly,
>> there are ~23 commits on average per day updating packages, so if say 70%
>> of them are url-fetch, it is ~16 new tarballs per day, on average.
>> How will the model using a Git repo scale? Because, naively, the
>> output of "disassemble-archive" in full text (pretty-print format) for
>> hello-2.10.tar is 120KB, and so 16*365*120K = ~700MB per year
>> without considering all the Git internals. Obviously, it depends on
>> the number of files and I do not know if hello is a representative
>> example.
>
> Interesting, thanks for making that calculation! We could make the
> format more compact if needed.
Compressing should help.
Considering 14000 packages, based on this 120KB estimation, it leads to: 0.7*14k*120K = ~1.2GB for the Git repo of the current Guix.
Cheers,
simon
Ludovic Courtès wrote on 22 Jul 12:28 +0200
(name . zimoun)(address . zimon.toutoune@gmail.com)
875zafkfml.fsf@gnu.org
Hello!
zimoun <zimon.toutoune@gmail.com> skribis:
> On Tue, 21 Jul 2020 at 23:22, Ludovic Courtès <ludo@gnu.org> wrote:
>
>>>>>>   • If we no longer deal with tarballs but upstreams keep signing
>>>>>>     tarballs (not raw directory hashes), how can we authenticate our
>>>>>>     code after the fact?
>>>>>
>>>>> Does Guix automatically authenticate code using signed tarballs?
>>>>
>>>> Not automatically; packagers are supposed to authenticate code when they
>>>> add a package (‘guix refresh -u’ does that automatically).
>>>
>>> So I miss the point of having this authentication information in the
>>> future where upstream has disappeared.
>>
>> What I meant above is that often, what we have is things like detached
>> signatures of raw tarballs, or documents referring to a tarball hash:
>>
>> https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html
>
> I still miss why it matters to store detached signatures of raw tarballs.
I’m not saying we (Guix) should store signatures; I’m just saying that developers typically sign raw tarballs. It’s a general statement to explain why storing or being able to reconstruct tarballs matters.
Thanks,
Ludo’.
Timothy Sample wrote on 30 Jul 19:36 +0200
Re: bug#42162: Recovering source tarballs
(name . Ludovic Courtès)(address . ludo@gnu.org)
875za4ykej.fsf@ngyro.com
Hi Ludovic,
Ludovic Courtès <ludo@gnu.org> writes:
Toggle quote (71 lines)> Hi,>> Ludovic Courtès <ludo@gnu.org> skribis:>> [...]>> So for the medium term, and perhaps for the future, a possible option> would be to preserve tarball metadata so we can reconstruct them:>> tarball = metadata + tree>> After all, tarballs are byproducts and should be no exception: we should> build them from source. :-)>> In <https://forge.softwareheritage.org/T2430>, Stefano mentioned> pristine-tar, which does almost that, but not quite: it stores a binary> delta between a tarball and a tree:>> https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html>> I think we should have something more transparent than a binary delta.>> The code below can “disassemble” and “assemble” a tar. When it> disassembles it, it generates metadata like this:>> (tar-source> (version 0)> (headers> (("guile-3.0.4/"> (mode 493)> (size 0)> (mtime 1593007723)> (chksum 3979)> (typeflag #\5))> ("guile-3.0.4/m4/"> (mode 493)> (size 0)> (mtime 1593007720)> (chksum 4184)> (typeflag #\5))> ("guile-3.0.4/m4/pipe2.m4"> (mode 420)> (size 531)> (mtime 1536050419)> (chksum 4812)> (hash (sha256> "arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))> ("guile-3.0.4/m4/time_h.m4"> (mode 420)> (size 5471)> (mtime 1536050419)> (chksum 4974)> (hash (sha256> "z4py26rmvsk4st7db6vwziwwhkrjjrwj7nra4al6ipqh2ms45kka")))> […]>> The ’assemble-archive’ procedure consumes that, looks up file contents> by hash on SWH, and reconstructs the original tarball…>> … at least in theory, because in practice we hit the SWH rate limit> after looking up a few files:>> https://archive.softwareheritage.org/api/#rate-limiting>> So it’s a bit ridiculous, but we may have to store a SWH “dir”> identifier for the whole extracted tree—a Git-tree hash—since that would> allow us to retrieve the whole thing in a single HTTP request.>> Besides, we’ll also have to handle compression: storing gzip/xz headers> and compression levels.
This jumped out at me because I have been working with compression and tarballs for the bootstrapping effort. I started pulling some threads and doing some research, and ended up prototyping an end-to-end solution for decomposing a Gzip’d tarball into Gzip metadata, tarball metadata, and an SWH directory ID. It can even put them back together! :) There are a bunch of problems still, but I think this project is doable in the short-term. I’ve tested 100 arbitrary Gzip’d tarballs from Guix, and found and fixed a bunch of little gaffes. There’s a ton of work to do, of course, but here’s another small step.
I call the thing “Disarchive” as in “disassemble a source code archive”. You can find it at https://git.ngyro.com/disarchive/. It has a simple command-line interface so you can do
$ disarchive save software-1.0.tar.gz
which serializes a disassembled version of “software-1.0.tar.gz” to the database (which is just a directory) specified by the “DISARCHIVE_DB” environment variable. Next, you can run
$ disarchive load hash-of-something-in-the-db
which will recover an original file from its metadata (stored in the database) and data retrieved from the SWH archive or taken from a cache (again, just a directory) specified by “DISARCHIVE_DIRCACHE”.
Now some implementation details. The way I’ve set it up is that all of the assembly happens through Guix. Each step in recreating a compressed tarball is a fixed-output derivation: the download from SWH, the creation of the tarball, and the compression. I wanted an easy way to build and verify things according to a dependency graph without writing any code. Hi Guix Daemon! I’m not sure if this is a good long-term approach, though. It could work well for reproducibility, but it might be easier to let some external service drive my code as a Guix package. Either way, it was an easy way to get started.
For disassembly, it takes a Gzip file (containing a single member) and breaks it down like this:
(gzip-member
 (version 0)
 (name "hungrycat-0.4.1.tar.gz")
 (input (sha256
         "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))
 (header
  (mtime 0)
  (extra-flags 2)
  (os 3))
 (footer
  (crc 3863610951)
  (isize 194560))
 (compressor gnu-best)
 (digest
  (sha256
   "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")))
The header and footer are read directly from the file. Finding the compressor is harder. I followed the approach taken by the pristine-tar project. That is, try a bunch of compressors and hope for a match. Currently, I have:
• gnu-best
• gnu-best-rsync
• gnu
• gnu-rsync
• gnu-fast
• gnu-fast-rsync
• zlib-best
• zlib
• zlib-fast
• zlib-best-perl
• zlib-perl
• zlib-fast-perl
• gnu-best-rsync-1.4
• gnu-rsync-1.4
• gnu-fast-rsync-1.4
This list is inspired by pristine-tar. The first couple GNU compressors use modern Gzip from Guix. The zlib and rsync-1.4 ones use the Gzip and zlib wrapper from pristine-tar called “zgz”. The 100 Gzip files I looked at use “gnu”, “gnu-best”, “gnu-best-rsync-1.4”, “zlib”, “zlib-best”, and “zlib-fast-perl”.
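For illustration, here is a minimal sketch of the “try compressors and compare” idea, restricted to stock gzip levels and shelling out to gzip and sha256sum; the helper names are illustrative assumptions, not Disarchive’s actual API:

(use-modules (ice-9 popen) (ice-9 rdelim) (srfi srfi-1))

(define (sha256-of-file file)
  ;; Hash FILE with the coreutils sha256sum utility.
  (let* ((port (open-input-pipe (string-append "sha256sum " file)))
         (line (read-line port)))
    (close-pipe port)
    (car (string-split line #\space))))

(define (matching-gzip-level tar-file original-gz)
  ;; Return the first gzip level whose output matches ORIGINAL-GZ, or #f.
  (let ((wanted (sha256-of-file original-gz)))
    (find (lambda (level)
            (system (format #f "gzip -c -~a '~a' > /tmp/candidate.gz"
                            level tar-file))
            (string=? wanted (sha256-of-file "/tmp/candidate.gz")))
          '(9 6 1))))

Disarchive’s real detection additionally has to cover the zlib and “zgz” variants listed above, but the principle is the same.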
(As an aside, I had a way to decompose multi-member Gzip files, but it was much, much slower. Since I doubt they exist in the wild, I removed that code.)
The “input” field likely points to a tarball, which looks like this:
(tarball
 (version 0)
 (name "hungrycat-0.4.1.tar")
 (input (sha256
         "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))
 (default-header)
 (headers
  ((name "hungrycat-0.4.1/")
   (mode 493)
   (mtime 1513360022)
   (chksum 5058)
   (typeflag 53))
  ((name "hungrycat-0.4.1/configure")
   (mode 493)
   (size 130263)
   (mtime 1513360022)
   (chksum 6043))
  ...)
 (padding 3584)
 (digest
  (sha256
   "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1")))
Originally, I used your code, but I ran into some problems. Namely, real tarballs are not well-behaved. I wrote new code to keep track of subtle things like the formatting of the octal values. Even though they are not well-behaved, they are usually self-consistent, so I introduced the “default-header” field to set default values for all headers. Any omitted fields in the headers use the value from the default header, and the default header takes defaults from a “default default header” defined in the code. Here’s a default header from a different tarball:
(default-header
 (uid 1199)
 (gid 30)
 (magic "ustar ")
 (version " \x00")
 (uname "cagordon")
 (gname "lhea")
 (devmajor-format (width 0))
 (devminor-format (width 0)))
These default values are computed to minimize the noise in the serialized form. Here we see for example that each header should have UID 1199 unless otherwise specified. We also see that the device fields should be null strings instead of octal zeros. Another good example here is that the magic field has a space after “ustar”, which is not what modern POSIX says to do.
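A sketch of that fallback chain, with headers represented as alists for brevity (Disarchive’s own representation and built-in default values differ):

(define default-default-header
  ;; Assumed built-in defaults; the real values live in Disarchive's code.
  '((uid . 0) (gid . 0) (uname . "") (gname . "") (magic . "ustar\x00")))

(define (header-field field header default-header)
  ;; Look FIELD up in HEADER, then in the tarball's DEFAULT-HEADER,
  ;; then in the built-in defaults.
  (or (assq-ref header field)
      (assq-ref default-header field)
      (assq-ref default-default-header field)))

;; With the default header shown above, an entry that omits its UID gets 1199:
(header-field 'uid
              '((name . "hungrycat-0.4.1/") (mode . 493))
              '((uid . 1199) (gid . 30)))
;; => 1199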
My tarball reader has minimal support for extended headers, but they are not serialized cleanly (they survive the round-trip, but they are not human-readable).
Finally, the “input” field here points to an “swh-directory” object. It looks like this:
(swh-directory
 (version 0)
 (name "hungrycat-0.4.1")
 (id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a")
 (digest
  (sha256
   "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")))
I have a little module for computing the directory hash like SWH does (which is in turn like what Git does). I did not verify that the 100 packages were in the SWH archive. I did verify a couple of packages, but I hit the rate limit and decided to avoid it for now.
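For reference, a tiny sketch of the blob-level hashing scheme that Git and SWH share, shelling out to ‘git hash-object’; directory (tree) IDs are built recursively on top of this, and this is not Disarchive’s actual module:

(use-modules (ice-9 popen) (ice-9 rdelim))

(define (git-blob-id file)
  ;; Git hashes a blob as the SHA-1 of the header "blob <size>\0"
  ;; followed by the file contents; `git hash-object' does exactly that.
  (let* ((port (open-pipe* OPEN_READ "git" "hash-object" file))
         (id (read-line port)))
    (close-pipe port)
    id))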
To avoid hitting the SWH archive at all, I introduced a directory cache so that I can store the directories locally. If the directory cache is available, directories are stored and retrieved from it.
Toggle quote (8 lines)> How would we put that in practice? Good question. :-)>> I think we’d have to maintain a database that maps tarball hashes to> metadata (!). A simple version of it could be a Git repo where, say,> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would> contain the metadata above. The nice thing is that the Git repo itself> could be archived by SWH. :-)
You mean like https://git.ngyro.com/disarchive-db/? :)
This was generated by a little script built on top of “fold-packages”. It downloads Gzip’d tarballs used by Guix packages and passes them on to Disarchive for disassembly. I limited the number to 100 because it’s slow and because I’m sure there is a long tail of weird software archives that are going to be hard to process. The metadata directory ended up being 13M and the directory cache 2G.
Toggle quote (5 lines)> Thus, if a tarball vanishes, we’d look it up in the database and> reconstruct it from its metadata plus content store in SWH.>> Thoughts?
Obviously I like the idea. ;)
Even with the code I have so far, I have a lot of questions. Mainly I’m worried about keeping everything working into the future. It would be easy to make incompatible changes. A lot of care would have to be taken. Of course, keeping a Guix commit and a Disarchive commit might be enough to make any assembling reproducible, but there’s a chicken-and-egg problem there. What if a tarball from the closure of one of the derivations is missing? I guess you could work around it, but it would be tricky.
Toggle quote (4 lines)> Anyhow, we should team up with fellow NixOS and SWH hackers to address> this, and with developers of other distros as well—this problem is not> just that of the functional deployment geeks, is it?
I could remove most of the Guix stuff so that it would be easy to package in Guix, Nix, Debian, etc. Then, someone™ could write a service that consumes a “sources.json” file, adds the sources to a Disarchive database, and pushes everything to a Git repo. I guess everyone who cares has to produce a “sources.json” file anyway, so it will be very little extra work. Other stuff like changing the serialization format to JSON would be pretty easy, too. I’m not well connected to these other projects, mind you, so I’m not really sure how to reach out.
Sorry about the big mess of code and ideas – I realize I may have taken the “do-ocracy” approach a little far here. :) Even if this is not “the” solution, hopefully it’s useful for discussion!

-- Tim
Ludovic Courtès wrote on 31 Jul 16:41 +0200
(name . Timothy Sample)(address . samplet@ngyro.com)
87bljvu4p4.fsf@gnu.org
Hi Timothy!
Timothy Sample <samplet@ngyro.com> skribis:
Toggle quote (26 lines)> This jumped out at me because I have been working with compression and> tarballs for the bootstrapping effort. I started pulling some threads> and doing some research, and ended up prototyping an end-to-end solution> for decomposing a Gzip’d tarball into Gzip metadata, tarball metadata,> and an SWH directory ID. It can even put them back together! :) There> are a bunch of problems still, but I think this project is doable in the> short-term. I’ve tested 100 arbitrary Gzip’d tarballs from Guix, and> found and fixed a bunch of little gaffes. There’s a ton of work to do,> of course, but here’s another small step.>> I call the thing “Disarchive” as in “disassemble a source code archive”.> You can find it at <https://git.ngyro.com/disarchive/>. It has a simple> command-line interface so you can do>> $ disarchive save software-1.0.tar.gz>> which serializes a disassembled version of “software-1.0.tar.gz” to the> database (which is just a directory) specified by the “DISARCHIVE_DB”> environment variable. Next, you can run>> $ disarchive load hash-of-something-in-the-db>> which will recover an original file from its metadata (stored in the> database) and data retrieved from the SWH archive or taken from a cache> (again, just a directory) specified by “DISARCHIVE_DIRCACHE”.
Wooohoo! Is it that time of the year when people give presents to one another? I can’t believe it. :-)
Toggle quote (30 lines)> Now some implementation details. The way I’ve set it up is that all of> the assembly happens through Guix. Each step in recreating a compressed> tarball is a fixed-output derivation: the download from SWH, the> creation of the tarball, and the compression. I wanted an easy way to> build and verify things according to a dependency graph without writing> any code. Hi Guix Daemon! I’m not sure if this is a good long-term> approach, though. It could work well for reproducibility, but it might> be easier to let some external service drive my code as a Guix package.> Either way, it was an easy way to get started.>> For disassembly, it takes a Gzip file (containing a single member) and> breaks it down like this:>> (gzip-member> (version 0)> (name "hungrycat-0.4.1.tar.gz")> (input (sha256> "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))> (header> (mtime 0)> (extra-flags 2)> (os 3))> (footer> (crc 3863610951)> (isize 194560))> (compressor gnu-best)> (digest> (sha256> "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")))
Awesome.
Toggle quote (21 lines)> The header and footer are read directly from the file. Finding the> compressor is harder. I followed the approach taken by the pristine-tar> project. That is, try a bunch of compressors and hope for a match.> Currently, I have:>> • gnu-best> • gnu-best-rsync> • gnu> • gnu-rsync> • gnu-fast> • gnu-fast-rsync> • zlib-best> • zlib> • zlib-fast> • zlib-best-perl> • zlib-perl> • zlib-fast-perl> • gnu-best-rsync-1.4> • gnu-rsync-1.4> • gnu-fast-rsync-1.4
I would have used the integers that zlib supports, but I guess that doesn’t capture this whole gamut of compression setups. And yeah, it’s not great that we actually have to try and find the right compression levels, but there’s no way around it, it seems, and as you write, we can expect a couple of variants to be the most commonly used ones.
Toggle quote (29 lines)> The “input” field likely points to a tarball, which looks like this:>> (tarball> (version 0)> (name "hungrycat-0.4.1.tar")> (input (sha256> "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))> (default-header)> (headers> ((name "hungrycat-0.4.1/")> (mode 493)> (mtime 1513360022)> (chksum 5058)> (typeflag 53))> ((name "hungrycat-0.4.1/configure")> (mode 493)> (size 130263)> (mtime 1513360022)> (chksum 6043))> ...)> (padding 3584)> (digest> (sha256> "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1")))>> Originally, I used your code, but I ran into some problems. Namely,> real tarballs are not well-behaved. I wrote new code to keep track of> subtle things like the formatting of the octal values.
Yeah I guess I was too optimistic. :-) I wanted to have the serialization/deserialization code automatically generated by that macro, but yeah, it doesn’t capture enough details for real-world tarballs.
Do you know how frequently you get “weird” tarballs? I was thinking about having something that works for plain GNU tar, but it’s even better to have something that works with “unusual” tarballs!
(BTW the code I posted or the one in Disarchive could perhaps replace the one in Gash-Utils. I was frustrated to not see a ‘fold-archive’ procedure there, notably.)
Toggle quote (17 lines)> Even though they are not well-behaved, they are usually> self-consistent, so I introduced the “default-header” field to set> default values for all headers. Any omitted fields in the headers use> the value from the default header, and the default header takes> defaults from a “default default header” defined in the code. Here’s> a default header from a different tarball:>> (default-header> (uid 1199)> (gid 30)> (magic "ustar ")> (version " \x00")> (uname "cagordon")> (gname "lhea")> (devmajor-format (width 0))> (devminor-format (width 0)))
Very nice.
Toggle quote (11 lines)> Finally, the “input” field here points to an “swh-directory” object. It> looks like this:>> (swh-directory> (version 0)> (name "hungrycat-0.4.1")> (id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a")> (digest> (sha256> "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")))
Yay!
Toggle quote (9 lines)> I have a little module for computing the directory hash like SWH does> (which is in-turn like what Git does). I did not verify that the 100> packages where in the SWH archive. I did verify a couple of packages,> but I hit the rate limit and decided to avoid it for now.>> To avoid hitting the SWH archive at all, I introduced a directory cache> so that I can store the directories locally. If the directory cache is> available, directories are stored and retrieved from it.
I guess we can get back to them eventually to estimate our coverage ratio.
Toggle quote (8 lines)>> I think we’d have to maintain a database that maps tarball hashes to>> metadata (!). A simple version of it could be a Git repo where, say,>> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would>> contain the metadata above. The nice thing is that the Git repo itself>> could be archived by SWH. :-)>> You mean like <https://git.ngyro.com/disarchive-db/>? :)
Woow. :-)
We could actually have a CI job to create the database: it would basically do ‘disarchive save’ for each tarball and store that using a layout like the one you used. Then we could have a job somewhere that periodically fetches that and adds it to the database. WDYT?
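A rough sketch of what such a job could do, with placeholder paths; a real job would take the tarball list from ‘fold-packages’ rather than a flat directory:

(use-modules (ice-9 ftw) (srfi srfi-1))

(setenv "DISARCHIVE_DB" "/srv/disarchive-db")   ;where the serialized entries end up

(for-each (lambda (file)
            ;; Disassemble each tarball into the database.
            (system* "disarchive" "save"
                     (string-append "/srv/tarballs/" file)))
          (filter (lambda (file)
                    (string-suffix? ".tar.gz" file))
                  (scandir "/srv/tarballs")))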
I think we should leave room for other hash algorithms (in the sexps above too).
Toggle quote (7 lines)> This was generated by a little script built on top of “fold-packages”.> It downloads Gzip’d tarballs used by Guix packages and passes them on to> Disarchive for disassembly. I limited the number to 100 because it’s> slow and because I’m sure there is a long tail of weird software> archives that are going to be hard to process. The metadata directory> ended up being 13M and the directory cache 2G.
Neat.
So it does mean that we could pretty much right away add a fall-back in (guix download) that looks up tarballs in your database and uses Disarchive to reconstruct it, right? I love solved problems. :-)
Of course we could improve Disarchive and the database, but it seems to me that we already have enough to improve the situation. WDYT?
Toggle quote (7 lines)> Even with the code I have so far, I have a lot of questions. Mainly I’m> worried about keeping everything working into the future. It would be> easy to make incompatible changes. A lot of care would have to be> taken. Of course, keeping a Guix commit and a Disarchive commit might> be enough to make any assembling reproducible, but there’s a> chicken-and-egg problem there.
The way I see it, Guix would always look up tarballs in the HEAD of the database (no need to pick a specific commit). Worst that could happen is we reconstruct a tarball that doesn’t match, and so the daemon errors out.
Regarding future-proofness, I think we must be super careful about the file formats (the sexps). You did pay attention to not having implicit defaults, which is perfect. Perhaps one thing to change (or perhaps it’s already there) is support for other hashes in those sexps: both hash algorithms and directory hash methods (SWH dir/Git tree, nar, Git tree with different hash algorithm, IPFS CID, etc.). Also the ability to specify several hashes.
That way we could “refresh” the database anytime by adding the hash du jour for already-present tarballs.
Toggle quote (3 lines)> What if a tarball from the closure of one the derivations is missing?> I guess you could work around it, but it would be tricky.
Well, more generally, we’ll have to monitor archive coverage. But I don’t think the issue is specific to this method.
Toggle quote (13 lines)>> Anyhow, we should team up with fellow NixOS and SWH hackers to address>> this, and with developers of other distros as well—this problem is not>> just that of the functional deployment geeks, is it?>> I could remove most of the Guix stuff so that it would be easy to> package in Guix, Nix, Debian, etc. Then, someone™ could write a service> that consumes a “sources.json” file, adds the sources to a Disarchive> database, and pushes everything to a Git repo. I guess everyone who> cares has to produce a “sources.json” file anyway, so it will be very> little extra work. Other stuff like changing the serialization format> to JSON would be pretty easy, too. I’m not well connected to these> other projects, mind you, so I’m not really sure how to reach out.
If you feel like it, you’re welcome to point them to your work in the discussion at https://forge.softwareheritage.org/T2430. There’s one person from NixOS (lewo) participating in the discussion and I’m sure they’d be interested. Perhaps they’ll tell whether they care about having it available as JSON.
Toggle quote (4 lines)> Sorry about the big mess of code and ideas – I realize I may have taken> the “do-ocracy” approach a little far here. :) Even if this is not> “the” solution, hopefully it’s useful for discussion!
You did great! I had a very rough sketch and you did the real thing, that’s just awesome. :-)
Thanks a lot!
Ludo’.
Timothy Sample wrote on 3 Aug 18:59 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)
87d047u0l3.fsf@ngyro.com
Hi Ludovic,
Ludovic Courtès <ludo@gnu.org> writes:
Toggle quote (3 lines)> Wooohoo! Is it that time of the year when people give presents to one> another? I can’t believe it. :-)
Not to be too cynical, but I think it’s just the time of year that I get frustrated with what I should be working on, and start fantasizing about green-field projects. :p
Toggle quote (29 lines)> Timothy Sample <samplet@ngyro.com> skribis:>>> The header and footer are read directly from the file. Finding the>> compressor is harder. I followed the approach taken by the pristine-tar>> project. That is, try a bunch of compressors and hope for a match.>> Currently, I have:>>>> • gnu-best>> • gnu-best-rsync>> • gnu>> • gnu-rsync>> • gnu-fast>> • gnu-fast-rsync>> • zlib-best>> • zlib>> • zlib-fast>> • zlib-best-perl>> • zlib-perl>> • zlib-fast-perl>> • gnu-best-rsync-1.4>> • gnu-rsync-1.4>> • gnu-fast-rsync-1.4>> I would have used the integers that zlib supports, but I guess that> doesn’t capture this whole gamut of compression setups. And yeah, it’s> not great that we actually have to try and find the right compression> levels, but there’s no way around it it seems, and as you write, we can> expect a couple of variants to be the most commonly used ones.
My first instinct was “this is impossible – a DEFLATE compressor can do just about whatever it wants!” Then I looked at pristine-tar and realized that their hack probably works pretty well. If I had infinite time, I would think about some kind of fully general, parameterized LZ77 algorithm that could describe any implementation. If I had a lot of time I would peel back the curtain on Gzip and zlib and expose their tuning parameters. That would be nicer, but keep in mind we will have to cover XZ, bzip2, and ZIP, too! There’s a bit of balance between quality and coverage. Any improvement to the representation of the compression algorithm could be implemented easily: just replace the names with their improved representation.
One thing pristine-tar does is reorder the compressor list based on the input metadata. A Gzip member usually stores its compression level, so it makes sense to try everything at that level first before moving on.
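A possible sketch of that reordering, assuming the compressor names used above (in Gzip, extra-flags 2 usually means maximum compression and 4 means fastest):

(use-modules (srfi srfi-1))

(define (reorder-compressors extra-flags candidates)
  ;; Put compressors whose level matches the Gzip member's "extra flags"
  ;; field first, keeping the rest afterwards.
  (define preferred?
    (case extra-flags
      ((2) (lambda (c) (string-contains (symbol->string c) "best")))
      ((4) (lambda (c) (string-contains (symbol->string c) "fast")))
      (else (const #f))))
  (append (filter preferred? candidates)
          (remove preferred? candidates)))

;; (reorder-compressors 2 '(gnu gnu-best zlib-fast zlib-best))
;; => (gnu-best zlib-best gnu zlib-fast)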
Toggle quote (9 lines)>> Originally, I used your code, but I ran into some problems. Namely,>> real tarballs are not well-behaved. I wrote new code to keep track of>> subtle things like the formatting of the octal values.>> Yeah I guess I was too optimistic. :-) I wanted to have the> serialization/deserialization code automatically generated by that> macro, but yeah, it doesn’t capture enough details for real-world> tarballs.
I enjoyed your implementation! I might even bring back its style. It was a little stiff for trying to figure out exactly what I needed for reproducing the tarballs.
Toggle quote (4 lines)> Do you know how frequently you get “weird” tarballs? I was thinking> about having something that works for plain GNU tar, but it’s even> better to have something that works with “unusual” tarballs!
I don’t have hard numbers, but I would say that a good handful (5–10%) have “X-format” fields, meaning their octal formatting is unusual. (I’m looking at “grep -A 10 default-header” over all the S-Exp files.) The most charming thing is the “uname” and “gname” fields. For example, “rtmidi-4.0.0” was made by “gary” from “staff”. :)
Toggle quote (4 lines)> (BTW the code I posted or the one in Disarchive could perhaps replace> the one in Gash-Utils. I was frustrated to not see a ‘fold-archive’> procedure there, notably.)
I really like “fold-archive”. One of the reasons I started doing this is to possibly share code with Gash-Utils. It’s not as easy as I was hoping, but I’m planning on improving things there based on my experience here. I’ve now worked with four Scheme tar implementations, maybe if I write a really good one I could cap that number at five!
Toggle quote (6 lines)>> To avoid hitting the SWH archive at all, I introduced a directory cache>> so that I can store the directories locally. If the directory cache is>> available, directories are stored and retrieved from it.>> I guess we can get back to them eventually to estimate our coverage ratio.
It would be nice to know, but pretty hard to find out with the rate limit. I guess it will improve immensely when we set up a “sources.json” file.
Toggle quote (9 lines)>> You mean like https://git.ngyro.com/disarchive-db/? :)>> Woow. :-)>> We could actually have a CI job to create the database: it would> basically do ‘disarchive save’ for each tarball and store that using a> layout like the one you used. Then we could have a job somewhere that> periodically fetches that and adds it to the database. WDYT?
Maybe.... I assume that Disarchive would fail for a few of them. We would need a plan for monitoring those failures so that Disarchive can be improved. Also, unless I’m misunderstanding something, this means building the whole database at every commit, no? That would take a lot of time and space. On the other hand, it would be easy enough to try. If it works, it’s a lot easier than setting up a whole other service.
Toggle quote (3 lines)> I think we should leave room for other hash algorithms (in the sexps> above too).
It works for different hash algorithms, but not for different directory hashing methods (like you mention below).
Toggle quote (16 lines)>> This was generated by a little script built on top of “fold-packages”.>> It downloads Gzip’d tarballs used by Guix packages and passes them on to>> Disarchive for disassembly. I limited the number to 100 because it’s>> slow and because I’m sure there is a long tail of weird software>> archives that are going to be hard to process. The metadata directory>> ended up being 13M and the directory cache 2G.>> Neat.>> So it does mean that we could pretty much right away add a fall-back in> (guix download) that looks up tarballs in your database and uses> Disarchive to recontruct it, right? I love solved problems. :-)>> Of course we could improve Disarchive and the database, but it seems to> me that we already have enough to improve the situation. WDYT?
I would say that we are darn close! In theory it would work. It would be much more practical if we had better coverage in the SWH archive (i.e., “sources.json”) and a way to get metadata for a source archive without downloading the entire Disarchive database. It’s 13M now, but it will likely be 500M with all the Gzip’d tarballs from a recent commit of Guix. It will only grow after that, too.
Of course those are not hard blockers, so ‘(guix download)’ could start using Disarchive as soon as we package it. I’ve started looking into it, but I’m confused about getting access to Disarchive from the “out-of-band” download system. Would it have to become a dependency of Guix?
Toggle quote (12 lines)>> Even with the code I have so far, I have a lot of questions. Mainly I’m>> worried about keeping everything working into the future. It would be>> easy to make incompatible changes. A lot of care would have to be>> taken. Of course, keeping a Guix commit and a Disarchive commit might>> be enough to make any assembling reproducible, but there’s a>> chicken-and-egg problem there.>> The way I see it, Guix would always look up tarballs in the HEAD of the> database (no need to pick a specific commit). Worst that could happen> is we reconstruct a tarball that doesn’t match, and so the daemon errors> out.
I was imagining an escape hatch beyond this, where one could look up a provenance record from when Disarchive ingested and verified a source code archive. The provenance record would tell you which version of Guix was used when saving the archive, so you could try your luck with using “guix time-machine” to reproduce Disarchive’s original computation. If we perform database migrations, you would need to travel back in time in the database, too. The idea is that you could work around breakages in Disarchive automatically using the Power of Guix™. Just a stray thought, really.
Toggle quote (11 lines)> Regarding future-proofness, I think we must be super careful about the> file formats (the sexps). You did pay attention to not having implicit> defaults, which is perfect. Perhaps one thing to change (or perhaps> it’s already there) is support for other hashes in those sexps: both> hash algorithms and directory hash methods (SWH dir/Git tree, nar, Git> tree with different hash algorithm, IPFS CID, etc.). Also the ability> to specify several hashes.>> That way we could “refresh” the database anytime by adding the hash du> jour for already-present tarballs.
The hash algorithm is already configurable, but the directory hash method is not. You’re right that it should be, and that there should be support for multiple digests.
Toggle quote (6 lines)>> What if a tarball from the closure of one the derivations is missing?>> I guess you could work around it, but it would be tricky.>> Well, more generally, we’ll have to monitor archive coverage. But I> don’t think the issue is specific to this method.
Again, I’m thinking about the case where I want to travel back in time to reproduce a Disarchive computation. It’s really an unlikely scenario; I’m just trying to think of everything that could go wrong.
Toggle quote (19 lines)>>> Anyhow, we should team up with fellow NixOS and SWH hackers to address>>> this, and with developers of other distros as well—this problem is not>>> just that of the functional deployment geeks, is it?>>>> I could remove most of the Guix stuff so that it would be easy to>> package in Guix, Nix, Debian, etc. Then, someone™ could write a service>> that consumes a “sources.json” file, adds the sources to a Disarchive>> database, and pushes everything to a Git repo. I guess everyone who>> cares has to produce a “sources.json” file anyway, so it will be very>> little extra work. Other stuff like changing the serialization format>> to JSON would be pretty easy, too. I’m not well connected to these>> other projects, mind you, so I’m not really sure how to reach out.>> If you feel like it, you’re welcome to point them to your work in the> discussion at <https://forge.softwareheritage.org/T2430>. There’s one> person from NixOS (lewo) participating in the discussion and I’m sure> they’d be interested. Perhaps they’ll tell whether they care about> having it available as JSON.
Good idea. I will work out a few more kinks and then bring it up there. I’ve already rewritten the parts that used the Guix daemon. Disarchive now only needs a handful of Guix modules ('base32', 'serialization', and 'swh' are the ones that would be hard to remove).
Toggle quote (9 lines)>> Sorry about the big mess of code and ideas – I realize I may have taken>> the “do-ocracy” approach a little far here. :) Even if this is not>> “the” solution, hopefully it’s useful for discussion!>> You did great! I had a very rough sketch and you did the real thing,> that’s just awesome. :-)>> Thanks a lot!
My pleasure! Thanks for the feedback so far.

-- Tim
Ricardo Wurmus wrote on 3 Aug 23:10 +0200
(name . zimoun)(address . zimon.toutoune@gmail.com)(address . 42162@debbugs.gnu.org)
87r1snfnb1.fsf@elephly.net
zimoun <zimon.toutoune@gmail.com> writes:
Toggle quote (5 lines)> Yes, but for example all the packages in gnu/packages/bioconductor.scm> could be "git-fetch". Today the source is over url-fetch but it could> be over git-fetch with https://git.bioconductor.org/packages/flowCore or> git@git.bioconductor.org:packages/flowCore.
We should do that (and soon), especially because Bioconductor does not keep an archive of old releases. We can discuss this on a separate issue lest we derail the discussion at hand.
-- Ricardo
Ludovic Courtès wrote on 5 Aug 19:14 +0200
(name . Timothy Sample)(address . samplet@ngyro.com)
87wo2dnhgb.fsf@gnu.org
Hello!
Timothy Sample <samplet@ngyro.com> skribis:
Toggle quote (9 lines)> Ludovic Courtès <ludo@gnu.org> writes:>>> Wooohoo! Is it that time of the year when people give presents to one>> another? I can’t believe it. :-)>> Not to be too cynical, but I think it’s just the time of year that I get> frustrated with what I should be working on, and start fantasizing about> green-field projects. :p
:-)
Toggle quote (41 lines)>> Timothy Sample <samplet@ngyro.com> skribis:>>>>> The header and footer are read directly from the file. Finding the>>> compressor is harder. I followed the approach taken by the pristine-tar>>> project. That is, try a bunch of compressors and hope for a match.>>> Currently, I have:>>>>>> • gnu-best>>> • gnu-best-rsync>>> • gnu>>> • gnu-rsync>>> • gnu-fast>>> • gnu-fast-rsync>>> • zlib-best>>> • zlib>>> • zlib-fast>>> • zlib-best-perl>>> • zlib-perl>>> • zlib-fast-perl>>> • gnu-best-rsync-1.4>>> • gnu-rsync-1.4>>> • gnu-fast-rsync-1.4>>>> I would have used the integers that zlib supports, but I guess that>> doesn’t capture this whole gamut of compression setups. And yeah, it’s>> not great that we actually have to try and find the right compression>> levels, but there’s no way around it it seems, and as you write, we can>> expect a couple of variants to be the most commonly used ones.>> My first instinct was “this is impossible – a DEFLATE compressor can do> just about whatever it wants!” Then I looked at pristine-tar and> realized that their hack probably works pretty well. If I had infinite> time, I would think about some kind of fully general, parameterized LZ77> algorithm that could describe any implementation. If I had a lot of> time I would peel back the curtain on Gzip and zlib and expose their> tuning parameters. That would be nicer, but keep in mind we will have> to cover XZ, bzip2, and ZIP, too! There’s a bit of balance between> quality and coverage. Any improvement to the representation of the> compression algorithm could be implemented easily: just replace the> names with their improved representation.
Yup, it makes sense to not spend too much time on this bit. I guess we’d already have good coverage with gzip and xz.
Toggle quote (10 lines)>> (BTW the code I posted or the one in Disarchive could perhaps replace>> the one in Gash-Utils. I was frustrated to not see a ‘fold-archive’>> procedure there, notably.)>> I really like “fold-archive”. One of the reasons I started doing this> is to possibly share code with Gash-Utils. It’s not as easy as I was> hoping, but I’m planning on improving things there based on my> experience here. I’ve now worked with four Scheme tar implementations,> maybe if I write a really good one I could cap that number at five!
Heh. :-) The needs are different anyway. In Gash-Utils the focus is probably on simplicity/maintainability, whereas here you really want to cover all the details of the wire representation.
Toggle quote (10 lines)>>> To avoid hitting the SWH archive at all, I introduced a directory cache>>> so that I can store the directories locally. If the directory cache is>>> available, directories are stored and retrieved from it.>>>> I guess we can get back to them eventually to estimate our coverage ratio.>> It would be nice to know, but pretty hard to find out with the rate> limit. I guess it will improve immensely when we set up a> “sources.json” file.
Note that we have https://guix.gnu.org/sources.json. Last I checked, SWH was ingesting it in its “qualification” instance, so it should be ingesting it for good real soon if it’s not doing it already.
Toggle quote (16 lines)>>> You mean like https://git.ngyro.com/disarchive-db/? :)>>>> Woow. :-)>>>> We could actually have a CI job to create the database: it would>> basically do ‘disarchive save’ for each tarball and store that using a>> layout like the one you used. Then we could have a job somewhere that>> periodically fetches that and adds it to the database. WDYT?>> Maybe.... I assume that Disarchive would fail for a few of them. We> would need a plan for monitoring those failures so that Disarchive can> be improved. Also, unless I’m misunderstanding something, this means> building the whole database at every commit, no? That would take a lot> of time and space. On the other hand, it would be easy enough to try.> If it works, it’s a lot easier than setting up a whole other service.
One can easily write a procedure that takes a tarball and returns a <computed-file> that builds its database entry. So at each commit, we’d just rebuild things that have changed.
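A rough sketch of that idea, assuming the ‘disarchive’ executable is available in the build environment (in practice it would have to be an input of the derivation, and the exact invocation is a guess):

(use-modules (guix gexp))

(define (disarchive-entry tarball)
  ;; TARBALL is a file-like object; the result is a <computed-file> whose
  ;; output directory holds the Disarchive entry for that one tarball.
  (computed-file "disarchive-entry"
                 (with-imported-modules '((guix build utils))
                   #~(begin
                       (use-modules (guix build utils))
                       (mkdir-p #$output)
                       (setenv "DISARCHIVE_DB" #$output)
                       (invoke "disarchive" "save" #$tarball)))))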
Toggle quote (6 lines)>> I think we should leave room for other hash algorithms (in the sexps>> above too).>> It works for different hash algorithms, but not for different directory> hashing methods (like you mention below).
OK.
[...]
Toggle quote (14 lines)>> So it does mean that we could pretty much right away add a fall-back in>> (guix download) that looks up tarballs in your database and uses>> Disarchive to recontruct it, right? I love solved problems. :-)>>>> Of course we could improve Disarchive and the database, but it seems to>> me that we already have enough to improve the situation. WDYT?>> I would say that we are darn close! In theory it would work. It would> be much more practical if we had better coverage in the SWH archive> (i.e., “sources.json”) and a way to get metadata for a source archive> without downloading the entire Disarchive database. It’s 13M now, but> it will likely be 500M with all the Gzip’d tarballs from a recent commit> of Guix. It will only grow after that, too.
If we expose the database over HTTP (like over cgit), we can arrange so that (guix download) simply GETs db.example.org/sha256/xyz. No need to fetch the whole database.
It might be more reasonable to have a real database and a real service around it, I’m sure Chris Baines would agree ;-), but we can choose URLs that could easily be implemented by a “real” service instead of cgit in the future.
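For illustration, a minimal lookup sketch against such a URL scheme; ‘db.example.org’ is a placeholder host, not a real service:

(use-modules (web client) (web response))

(define %disarchive-db-url "https://db.example.org")   ;placeholder

(define (fetch-disarchive-metadata sha256-base32)
  ;; GET <db>/sha256/<hash> and return the body as a string, or #f if
  ;; the entry is missing.
  (call-with-values
      (lambda ()
        (http-get (string-append %disarchive-db-url "/sha256/" sha256-base32)
                  #:decode-body? #t))
    (lambda (response body)
      (and (= 200 (response-code response)) body))))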
Toggle quote (6 lines)> Of course those are not hard blockers, so ‘(guix download)’ could start> using Disarchive as soon as we package it. I’ve starting looking into> it, but I’m confused about getting access to Disarchive from the> “out-of-band” download system. Would it have to become a dependency of> Guix?
Yes. It could be a behind-the-scenes dependency of “builtin:download”; it doesn’t have to be a dependency of each and every fixed-output derivation.
Toggle quote (10 lines)> I was imagining an escape hatch beyond this, where one could look up a> provenance record from when Disarchive ingested and verified a source> code archive. The provenance record would tell you which version of> Guix was used when saving the archive, so you could try your luck with> using “guix time-machine” to reproduce Disarchive’s original> computation. If we perform database migrations, you would need to> travel back in time in the database, too. The idea is that you could> work around breakages in Disarchive automatically using the Power of> Guix™. Just a stray thought, really.
Seems to me it Shouldn’t Be Necessary? :-)
I mean, as long as the format is extensible and “future-proof”, we’ll always be able to rebuild tarballs and then re-disassemble them if we need to compute new hashes or whatever.
Toggle quote (11 lines)>> If you feel like it, you’re welcome to point them to your work in the>> discussion at <https://forge.softwareheritage.org/T2430>. There’s one>> person from NixOS (lewo) participating in the discussion and I’m sure>> they’d be interested. Perhaps they’ll tell whether they care about>> having it available as JSON.>> Good idea. I will work out a few more kinks and then bring it up there.> I’ve already rewritten the parts that used the Guix daemon. Disarchive> now only needs a handful Guix modules ('base32', 'serialization', and> 'swh' are the ones that would be hard to remove).
An option would be to use (gcrypt base64); another one would be to bundle (guix base32).
I was thinking that it might be best to not use Guix for computations. For example, have “disarchive save” not build derivations and instead do everything “here and now”. That would make it easier for others to adopt. Wait, looking at the Git history, it looks like you already addressed that point, neat. :-)
Thank you!
Ludo’.
Timothy Sample wrote on 5 Aug 20:57 +0200
(name . Ludovic Courtès)(address . ludo@gnu.org)
874kpgudic.fsf@ngyro.com
Hey,
Ludovic Courtès <ludo@gnu.org> writes:
Toggle quote (4 lines)> Note that we have https://guix.gnu.org/sources.json. Last I checked,> SWH was ingesting it in its “qualification” instance, so it should be> ingesting it for good real soon if it’s not doing it already.
Oh fantastic! I was going to volunteer to do it, so that’s one thing off my list.
Toggle quote (4 lines)> One can easily write a procedure that takes a tarball and returns a> <computed-file> that builds its database entry. So at each commit, we’d> just rebuild things that have changed.
That makes more sense. I will give this a shot soon.
Toggle quote (9 lines)> If we expose the database over HTTP (like over cgit), we can arrange so> that (guix download) simply GETs db.example.org/sha256/xyz. No need to> fetch the whole database.>> It might be more reasonable to have a real database and a real service> around it, I’m sure Chris Baines would agree ;-), but we can choose URLs> that could easily be implemented by a “real” service instead of cgit in> the future.
I got it working over cgit shortly after sending my last message. :) So far, I am very much on team “good enough for now”.
Toggle quote (18 lines)> Timothy Sample <samplet@ngyro.com> skribis:>>> I was imagining an escape hatch beyond this, where one could look up a>> provenance record from when Disarchive ingested and verified a source>> code archive. The provenance record would tell you which version of>> Guix was used when saving the archive, so you could try your luck with>> using “guix time-machine” to reproduce Disarchive’s original>> computation. If we perform database migrations, you would need to>> travel back in time in the database, too. The idea is that you could>> work around breakages in Disarchive automatically using the Power of>> Guix™. Just a stray thought, really.>> Seems to me it Shouldn’t Be Necessary? :-)>> I mean, as long as the format is extensible and “future-proof”, we’ll> always be able to rebuild tarballs and then re-disassemble them if we> need to compute new hashes or whatever.
If Disarchive relies on external compressors, there’s an outside chance that those compressors could change under our feet. In that case, one would want to be able to track down exactly which version of XZ was used when Disarchive verified that it could reassemble a given source archive. Maybe I’m being paranoid, but if the database entries are being computed by the CI infrastructure it would be pretty easy to note the Guix commit just in case.
Toggle quote (6 lines)> I was thinking that it might be best to not use Guix for computations.> For example, have “disarchive save” not build derivations and instead do> everything “here and now”. That would make it easier for others to> adopt. Wait, looking at the Git history, it looks like you already> addressed that point, neat. :-)
Since my last message I managed to remove Guix as a dependency completely. Right now it loads ‘(guix swh)’ opportunistically, but I might just copy the code in. Directory references now support multiple “addresses” so that you could have Nix-style, SWH-style, IPFS-style, etc. Hopefully my next message will have a WIP patch enabling Guix to use Disarchive!
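A purely hypothetical example of what a multi-address directory reference could look like, in the spirit of the sexps shown earlier (the field names here are guesses, not Disarchive’s final format):

(directory
 (version 0)
 (name "hungrycat-0.4.1")
 (addresses
  (swhid "swh:1:dir:0496abd5a2e9e05c9fe20ae7684f48130ef6124a")
  (nar-sha256 "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))
 (digest
  (sha256
   "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")))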

-- Tim