From: zimoun
To: Timothy Sample, Ludovic Courtès
Cc: 42162@debbugs.gnu.org, Maurice Brémond
Subject: Re: bug#42162: Recovering source tarballs
Date: Wed, 26 Aug 2020 12:04:55 +0200
Message-ID: <86blixyb7c.fsf@gmail.com>

Dear Timothy,

On Thu, 30 Jul 2020 at 13:36, Timothy Sample wrote:

> I call the thing “Disarchive” as in “disassemble a source code archive”.
> You can find it at .  It has a simple
> command-line interface so you can do
>
>     $ disarchive save software-1.0.tar.gz
>
> which serializes a disassembled version of “software-1.0.tar.gz” to the
> database (which is just a directory) specified by the “DISARCHIVE_DB”
> environment variable.
> Next, you can run
>
>     $ disarchive load hash-of-something-in-the-db
>
> which will recover an original file from its metadata (stored in the
> database) and data retrieved from the SWH archive or taken from a cache
> (again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Really nice!  Thank you!

>> I think we’d have to maintain a database that maps tarball hashes to
>> metadata (!).  A simple version of it could be a Git repo where, say,
>> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
>> contain the metadata above.  The nice thing is that the Git repo itself
>> could be archived by SWH.  :-)
>
> You mean like ?  :)

[...]

> This was generated by a little script built on top of “fold-packages”.
> It downloads Gzip’d tarballs used by Guix packages and passes them on to
> Disarchive for disassembly.  I limited the number to 100 because it’s
> slow and because I’m sure there is a long tail of weird software
> archives that are going to be hard to process.  The metadata directory
> ended up being 13M and the directory cache 2G.

One question is how this database scales.  For example, a quick
back-of-the-envelope estimate gives ~1.2GB of metadata for ~14k
packages, followed by an increase of ~700MB per year, both based on
Ludo’s code [1].

[1]

> I could remove most of the Guix stuff so that it would be easy to
> package in Guix, Nix, Debian, etc.  Then, someone™ could write a service
> that consumes a “sources.json” file, adds the sources to a Disarchive
> database, and pushes everything to a Git repo.  I guess everyone who
> cares has to produce a “sources.json” file anyway, so it will be very
> little extra work.  Other stuff like changing the serialization format
> to JSON would be pretty easy, too.  I’m not well connected to these
> other projects, mind you, so I’m not really sure how to reach out.
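
To make the idea concrete, here is a rough sketch of what such a service
could look like.  Most of it is assumption: the layout of “sources.json”
(a top-level "sources" list whose entries carry a "urls" list), the
function names, and the default limit of 100 are all hypothetical.  Only
the “disarchive save” command and the “DISARCHIVE_DB” variable come from
the message above.  The step that commits and pushes the database to a
Git repo is left out.

```python
import itertools
import json
import os
import subprocess
import tempfile
import urllib.request
from pathlib import Path


def tarball_urls(sources_path):
    """Yield the first URL of every entry that looks like a gzip'd tarball.

    ASSUMPTION: the file looks like {"sources": [{"urls": [...]}, ...]}.
    """
    with open(sources_path, encoding="utf-8") as f:
        data = json.load(f)
    for entry in data.get("sources", []):
        urls = entry.get("urls") or []
        if urls and urls[0].endswith((".tar.gz", ".tgz")):
            yield urls[0]


def save_all(sources_path, db_dir, limit=100,
             fetch=urllib.request.urlretrieve, run=subprocess.run):
    """Download up to `limit` tarballs and hand each to `disarchive save`.

    `fetch` and `run` are injectable so the loop can be tried without
    network access or a Disarchive installation.
    """
    # Disarchive reads the database location from DISARCHIVE_DB.
    env = dict(os.environ, DISARCHIVE_DB=str(db_dir))
    saved = []
    with tempfile.TemporaryDirectory() as tmp:
        for url in itertools.islice(tarball_urls(sources_path), limit):
            target = Path(tmp) / url.rsplit("/", 1)[-1]
            fetch(url, str(target))
            run(["disarchive", "save", str(target)], env=env, check=True)
            saved.append(url)
    return saved
```

Calling save_all("sources.json", "/path/to/db") would then fill the
database directory; a real service would also need error handling for
the long tail of archives that fail to disassemble.
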

This service could be really useful.  Yes, it could be easy to update
the database each time Guix produces a new “sources.json”.  As
mentioned [2], should this service be part of SWH (download cooking
task)?  Or should it live on the project side?

[2]

Thank you again for this piece of work.

All the best,
simon