From: zimoun
Date: Wed, 15 Jul 2020 18:55:21 +0200
To: Ludovic Courtès
Cc: 42162@debbugs.gnu.org, Maurice Brémond
Subject: Re: Recovering source tarballs

Hi Ludo,

Well, you are enlarging the discussion to more than just the issue of
the 5 url-fetch packages on gforge.inria.fr. :-)

First of all, you wrote [1]:

  ``Migration away from tarballs is already happening as more and more
  software is distributed straight from content-addressed VCS
  repositories, though progress has been relatively slow since we
  first discussed it in 2016.''

but, on the other hand, Guix more often than not uses [2] "url-fetch"
even when "git-fetch" is available upstream.  In other words, I am not
convinced the migration is really happening...  The issue would be
mitigated if Guix transitioned from "url-fetch" to "git-fetch"
whenever possible.

1: https://forge.softwareheritage.org/T2430#45800
2: https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html
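For what it is worth, the switch is mostly mechanical at the origin
level.  Here is a minimal sketch, with a made-up package name, made-up
URLs, and placeholder hashes, assuming the usual (guix download) and
(guix git-download) modules; it only illustrates what the migration
means per package:

;; Today: the source is a tarball fetched with url-fetch.
;; Package name, URL, and hash are placeholders, not a real package.
(origin
  (method url-fetch)
  (uri "https://example.org/foo/foo-1.0.tar.gz")
  (sha256
   (base32 "0000000000000000000000000000000000000000000000000000")))

;; Tomorrow: the same source fetched with git-fetch, straight from the
;; VCS, which is what SWH archives natively.
(origin
  (method git-fetch)
  (uri (git-reference
        (url "https://example.org/foo.git")   ;hypothetical upstream
        (commit "v1.0")))                     ;tag or commit
  (file-name (git-file-name "foo" "1.0"))
  (sha256
   (base32 "0000000000000000000000000000000000000000000000000000")))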
Second, trying to gather some statistics about the SWH coverage, I
note that a non-negligible number of "url-fetch" origins are reachable
via "lookup-content".  The coverage is not straightforward to measure
because of the 120-requests-per-hour rate limit and occasional
unexpected server errors.  Another story.  Well, I would like to have
numbers, because I do not know what the issue concretely is: how many
"url-fetch" packages are reachable?  And if they are unreachable, is
it because they are not archived yet, or because Guix does not have
enough information to look them up?

On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès wrote:

> For the now, since 70% of our packages use ‘url-fetch’, we need to be
> able to fetch or to reconstruct tarballs.  There’s no way around it.

Yes, but for example all the packages in gnu/packages/bioconductor.scm
could use "git-fetch": today the source comes via url-fetch, but it
could come via git-fetch from
https://git.bioconductor.org/packages/flowCore or
git@git.bioconductor.org:packages/flowCore.  Another example is
gnu/packages/emacs-xyz.scm: the packages coming from elpa.gnu.org use
"url-fetch" and could use "git-fetch", for example via

  http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD

So I would be more reserved about the "no way around it". :-)  I mean,
the 70% could be somewhat mitigated.

> In the short term, we should arrange so that the build farm keeps GC
> roots on source tarballs for an indefinite amount of time.  Cuirass
> jobset?  Mcron job to preserve GC roots?  Ideas?

Yes, preserving source tarballs for an indefinite amount of time will
help, at least for all the packages where "lookup-content" returns #f,
which means they are either not in SWH or unreachable -- both are
equivalent from Guix's side.

What about pushing to IPFS in addition?  Feasible?  Any lookup issue?

> For the future, we could store nar hashes of unpacked tarballs
> instead of hashes over tarballs.  But that raises two questions:
>
>   • If we no longer deal with tarballs but upstreams keep signing
>     tarballs (not raw directory hashes), how can we authenticate our
>     code after the fact?

Does Guix automatically authenticate code using signed tarballs?

>   • SWH internally store Git-tree hashes, not nar hashes, so we still
>     wouldn’t be able to fetch our unpacked trees from SWH.
>
> (Both issues were previously discussed at .)
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
>   tarball = metadata + tree

There are different issues at different levels:

 1. how to look up?  what information do we need to keep/store to be
    able to query SWH?
 2. how to check integrity?  what information do we need to keep/store
    to be able to verify that SWH returns what Guix expects?
 3. how to authenticate?  where does the tarball metadata have to be
    stored if SWH removes it?

Basically, a git-fetch source stores 3 identifiers:

 - upstream url
 - commit / tag
 - integrity (sha256)

Fetching from SWH requires only the commit (lookup-revision) or the
tag+url (lookup-origin-revision); then, from the returned revision,
the integrity of the downloaded data is checked using the sha256,
right?
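If I got that flow right, it roughly amounts to the sketch below.  The
procedure names are the ones from (guix swh) mentioned above, but I
have not double-checked the exact signatures, and the URL, tag, and
commit values are made up for illustration:

(use-modules (guix swh))

;; The three identifiers stored by a git-fetch origin (made-up values).
(define upstream-url "https://example.org/foo.git")
(define tag "v1.0")
(define commit "0123456789abcdef0123456789abcdef01234567")

;; Look up the revision in SWH, either from the commit alone…
(define revision (lookup-revision commit))

;; … or from the upstream URL plus the tag.
(define revision* (lookup-origin-revision upstream-url tag))

;; The corresponding checkout is then downloaded and its hash compared
;; against the sha256 recorded in the origin.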
Therefore, one way to fix the lookup of a url-fetch source is to add
an extra field mimicking the role of the commit.  The easiest would be
to store a SWHID, or an identifier from which the SWHID can be
deduced.  I have not checked the code, but something like this:

  https://pypi.org/project/swh.model/
  https://forge.softwareheritage.org/source/swh-model/

and at package time this identifier is added, similarly to the
integrity one.

Aside, does Guix use the authentication metadata that tarballs
provide?

(BTW, I failed [3,4] to package swh.model, so if someone wants to give
it a try...

3: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00158.html
4: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00161.html
)

> After all, tarballs are byproducts and should be no exception: we
> should build them from source. :-)

[...]

> The code below can “disassemble” and “assemble” a tar.  When it
> disassembles it, it generates metadata like this:

[...]

> The ‘assemble-archive’ procedure consumes that, looks up file
> contents by hash on SWH, and reconstructs the original tarball…

Where do you plan to store the "disassembled" metadata?  And where do
you plan to run "assemble-archive"?  I mean, what is pushed to SWH?
And how?  What is fetched from SWH?  And how?  (Well, the answer is
below. :-))

> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:

Yes, it is 120 requests per hour and 10 "save" requests per hour.
Well, I do not think they will increase these numbers much in general.
However, they seem open to exceptions for specific machines.  So, I do
not want to speak for them, but we could ask for a higher rate limit
for ci.guix.gnu.org, for example.  Then we need to distinguish between
source substitutes and binary substitutes; basically, when a user runs
"guix build foo", if the source is available neither upstream nor on
ci.guix.gnu.org, then ci.guix.gnu.org fetches the missing sources from
SWH and delivers them to the user.

> https://archive.softwareheritage.org/api/#rate-limiting
>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree -- a Git-tree hash -- since
> that would allow us to retrieve the whole thing in a single HTTP
> request.

Well, the limited resources of SWH are an issue, but SWH is not a
mirror, it is an archive. :-)  And, as I wrote above, we could ask SWH
to increase the rate limit for a specific machine such as
ci.guix.gnu.org.

> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!).  A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above.  The nice thing is that the Git repo
> itself could be archived by SWH. :-)

How should this database that maps tarball hashes to metadata be
maintained?  Git push hook?  Cron task?  What about foreign channels?
Should they maintain their own map?
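Whatever the maintenance mechanism, the addressing itself is cheap.
Here is a minimal sketch, assuming a plain local checkout of such a
database repo laid out as in your ‘sha256/<nix-base32>’ example; the
checkout path and the helper names are hypothetical:

(use-modules (guix base32))

;; Hypothetical local checkout of the metadata database repository.
(define database-checkout "/var/cache/guix/tarball-metadata")

(define (metadata-file sha256)
  ;; Address the metadata by "sha256/<nix-base32>", as in the example
  ;; above.  SHA256 is a bytevector, as recorded in an <origin>.
  (string-append database-checkout "/sha256/"
                 (bytevector->nix-base32-string sha256)))

(define (lookup-metadata sha256)
  ;; Return the disassembled-tarball metadata (an sexp), or #f if this
  ;; tarball hash is not in the database.
  (let ((file (metadata-file sha256)))
    (and (file-exists? file)
         (call-with-input-file file read))))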
To summarize, it would work like this, right?

At package time:

 - store an integrity identifier (today the sha256, nix-base32 encoded)
 - disassemble the tarball
 - commit the metadata to another repo, using the path (address)
   sha256/<base32>
 - push to packages-repo *and* metadata-database-repo

At some future time (upstream has disappeared, say!):

 - use the integrity identifier to query the database repo
 - look up the SWHID from the database repo
 - fetch the data from SWH
 - or look up the IPFS identifier from the database repo and fetch the
   data from IPFS, as another example
 - re-assemble the tarball using the metadata from the database repo
 - check integrity, authentication, etc.

Well, right, this is better than only adding a lookup identifier as I
described above, because it is more general and more flexible than
having SWH as the only fall-back.

The format of the (disassembled) metadata that you propose is Schemish
(obviously! :-)), but we could propose something more JSON-like.

All the best,
simon