Hi,

On Wed, 26 Aug 2020 at 17:11, Timothy Sample wrote:
> zimoun writes:
>
>> One question is how this database scales?
>>
>> For example, a quick back-to-envelop estimation leads to ~1.2GB metadata
>> for ~14k packages and then an increase of ~700MB per year, both with the
>> Ludo’s code [1].
>>
>> [1]
>
> It’s a good question. A good part of the size comes from the
> representation rather than the data. Compression helps a lot here. I
> have a database of 3,912 packages. It’s 295M uncompressed (which is a
> little better than your estimation). If I pass each file through Lzip,
> it shrinks down to 60M. That’s more like 15.5K per package, which is
> almost an order of magnitude smaller than the estimation you used
> (120K). I think that makes the numbers rather pleasant, but it comes at
> the expense of easy storing in Git.

Thank you for these numbers. Really interesting!

First, I do not know if the database needs to be stored with Git. What
would be the advantage? (naive question :-))

On SWH T2430 [1], you explain the “default-header” trick to cut down
the size. Nice!

Moreover, the format is a long list, e.g.,

--8<---------------cut here---------------start------------->8---
(headers
 ((name "raptor2-2.0.15/") (mode 493) (mtime 1414909500) (chksum 4225) (typeflag 53))
 ((name "raptor2-2.0.15/build/") (mode 493) (mtime 1414909497) (chksum 4797) (typeflag 53))
 ((name "raptor2-2.0.15/build/ltversion.m4") (size 690) (mtime 1414908273) (chksum 5958))
 […])
--8<---------------cut here---------------end--------------->8---

which is human-readable. Is it useful?

Instead, one could imagine shorter keywords:

 ((na "raptor2-2.0.15/") (mo 493) (mt 1414909500) (ch 4225) (ty 53))

which, using your database (commit fc50927), reduces the size from
295MB to 279MB. Or even a plain list:

 (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
 (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)

where the first element gives the “type” of the list, so that the
reader knows which fields to expect.

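To make the comparison concrete, here is a rough sketch (in Python, only
for illustration; it is not the code that produced the 295MB and 279MB
figures) that renders the three sample entries in each of the three
layouts and prints their raw sizes. The helper names and the use of a
small integer as the “type” tag (instead of \x00, \x01) are my own
assumptions.

```python
# Sample entries copied from the excerpt; keys are the tar header field names.
ENTRIES = [
    {"name": "raptor2-2.0.15/", "mode": 493, "mtime": 1414909500,
     "chksum": 4225, "typeflag": 53},
    {"name": "raptor2-2.0.15/build/", "mode": 493, "mtime": 1414909497,
     "chksum": 4797, "typeflag": 53},
    {"name": "raptor2-2.0.15/build/ltversion.m4", "size": 690,
     "mtime": 1414908273, "chksum": 5958},
]


def atom(value):
    """Render a string with double quotes, a number as-is."""
    return f'"{value}"' if isinstance(value, str) else str(value)


def verbose(entry):
    """Current layout: one (keyword value) pair per field."""
    return "(" + " ".join(f"({k} {atom(v)})" for k, v in entry.items()) + ")"


def short(entry):
    """Same layout, keywords cut to their first two letters."""
    return "(" + " ".join(f"({k[:2]} {atom(v)})" for k, v in entry.items()) + ")"


def positional(entries):
    """Plain lists: a leading tag names the field layout, values follow.

    The tag is a hypothetical integer index into `layouts`, assigned in
    order of first appearance.
    """
    layouts = {}
    lines = []
    for entry in entries:
        tag = layouts.setdefault(tuple(entry), len(layouts))
        values = [str(tag)] + [atom(v) for v in entry.values()]
        lines.append("(" + " ".join(values) + ")")
    return layouts, lines


def raw_size(lines):
    """Size in bytes of the lines joined with newlines, encoded as UTF-8."""
    return len("\n".join(lines).encode("utf-8"))


_, positional_lines = positional(ENTRIES)
print("verbose   ", raw_size([verbose(e) for e in ENTRIES]))
print("short     ", raw_size([short(e) for e in ENTRIES]))
print("positional", raw_size(positional_lines))
```

Three entries say nothing about the whole database, and the gain after
Lzip is probably smaller than the raw gain since repeated keywords
compress well, so the real test would be to run something like this over
the complete database and compress the result.
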
Well, the 2 naive questions are: does it make sense to
 - have the database stored under Git?
 - have a human-readable format?

Thank you again for pushing forward this topic. :-)

All the best,
simon

[1] https://forge.softwareheritage.org/T2430#47522