Simon Tournier <zimon.toutoune@gmail.com> writes:
Toggle quote (17 lines)
> Hi,
>
> On Wed, 15 Feb 2023 at 05:21, Kyle Andrews <kyle@posteo.net> wrote:
>> Dear Guix,
>>
>> I would like to be able to conveniently analyze Guix package
>> dependencies using general purpose network analysis software such as
>> igraph. To achieve this, I have added another backend to Guix and which
>> is exposed via guix graph which spits out a three column table that,
>> while not technically and edge list, is readily transformed into one
>> with minimal data munging.
>
> You might be interested by [1] where I export all the packages as
> JSON-like (Python dictionary) and then import with python-networkx.
>
> Feel free to report your analyses, I am very interested by such. :-)
That's my plan. I hope to provide Guix developers some tools to guide
their efforts. Of course, I would really like to make networks of many
more things in Guix including the Guile code itself and the Git history.
Toggle quote (2 lines)
That looks like a great reference! With regards to your query about
graphs in that email, I remember there being a Racket library which
provides the core graph data structures. Maybe those could be ported to
guile? However, I wonder if it would more practical for guile to
interface with igraph.
Toggle quote (12 lines)
>> +(define (emit-edgelist-prologue name port)
>> + (display "" port))
>
> Here, I would add the description of the data as header of the CSV-like
> file. For instance, something:
>
> --8<---------------cut here---------------start------------->8---
> # type, name-or-edge1, item-or-edge2
> # package, name, item
> # depends, edge1, edge2
> --8<---------------cut here---------------end--------------->8---
I toyed with calling columns 2 and 3 "parent/child", "source/sink",
"input/output", "origin/destination". The "input/output" option sounds
the best to me.
Toggle quote (2 lines)
> Well, is this format a standard format for representing graph?
No it's not. Since I am not particularly comfortable with guile, I was
hesitant to make extensive changes to the existing backend code. If I
could have produced just the "depends" lines with the integers
substituted with their meaningful names@versions, I would have done that
instead. If I could pass id1 id2 label1 label2 to all of the emit
procedures that would be pretty handy!
For example:
```
input, output
package1@1.2, package3@1.4
package1@1.2, package2@3.9
...
```
Technically, R programmers would probably first turn this into a matrix
or a data frame. That intermediate step would provide a convenient
opportuntity to extract out the versions strings so that networks could
more readily be compared across commits and branches. The versions could
be added back in later as node attributes. Just as with relational
tables, network analysis gets much more powerful when they have lots of
attributes, some of which may refer to hash table keys pointing to other
data structures. For example, libressl and openssl might share a
protocol attribute with a value of "SSL". With a rich set of attributes
data, researchers could start thinking about how to sample from the
distribution of possible alternative system configurations when doing
reproducibility studies. This might reveal "hot spots" of
irreproducibility which package authors could be looking out for. That's
one idea I just had while writing this email. I'm sure many people could
come up with many more neat ideas if the biggest barriers to getting the
data in the first place were removed.
Toggle quote (14 lines)
> From igraph documentation [1], it reads ’igraph_read_graph_edgelist’:
>
> This format is simply a series of an even number of non-negative
> integers separated by whitespace. The integers represent vertex
> IDs. Placing each edge (i.e. pair of integers) on a separate
> line is not required, but it is recommended for
> readability. Edges of directed graphs are assumed to be in
> "from, to" order.
>
> so maybe it could be nice to use this plain list for the edgelist
> backend. WDYT?
>
> 1: https://igraph.org/c/doc/igraph-Foreign.html#igraph_read_graph_edgelist
This is exactly why I did what I did in this patch. Just by filtering
rows with "depends" in the first column, you get the edge list the
igraph manual describes. To get the labels, you just need to filter rows
with "package" instead. These are straightfoward post processing steps
for many R and python users.
I don't like the idea of just returning integers though. It's no fun to
not be able to readily see what nodes refer to. That's why I prefer the
CSV view of things with descriptive labels.
Thanks for looking at my patch!
Cheers,
Kyle