Updater needs to support HTTP(S) servers

  • Done
Details
4 participants
  • Brice Waegeneire
  • Hartmut Goebel
  • Ludovic Courtès
  • Rubén Rodríguez Pérez via RT
Owner
unassigned
Submitted by
Hartmut Goebel
Severity
normal
Hartmut Goebel wrote on 20 Aug 2017 14:06
(name . bug-guix)(address . bug-guix@gnu.org)
2c2838f3-24d6-5010-faf6-49e70f85e963@crazy-compilers.com
Hi,

our updater currently only supports FTP servers, but more and more
projects shut down their FTP service and provide HTTP(S) servers only
(e.g. the Linux kernel). For other projects, the main distribution point
has changed to HTTP and the mirrors still providing FTP are lagging
(e.g. KDE, see [1]).

A common case is to simply use Apache to serve the directories, but it
will deliver an HTML view of the directory contents (using mod_autoindex
[3]).

In [2] Ludo wrote:

> So we need a way to list the latest releases somehow. If they publish
> JSON, XML, or some other structured info format, that’s fine too. But
> HTTP alone is not good: we’d have to infer the information from HTML
> pages, which sounds fragile.

IMHO we cannot expect project and mirror sites to provide this
additional data. Most projects simply will not, since this would
require the server to generate some data files on the fly.

OTOH, I assume the delivered directory index pages to be well-formed
(X)HTML. Thus parsing the HTML should be quite simple: we only need to
pattern-match "<A>" tags or, if Guile has a decent XML/HTML parser, use
that to query the data.

Only relative links without a slash (except a trailing one) have to be
handled. Links with a trailing slash can be assumed to be directories.
(Since auto-index only works if the URL points to a directory, and the
directory is marked by a trailing slash, we can assume the generated
links for directories will all have the trailing slash.) At least this
would be a good start which could be refined if necessary.

Please note that I'm not suggesting writing a general-purpose parser,
but aiming for auto-index HTML pages only.
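
The link-filtering rule described above can be sketched as follows. This is
only an illustration in Python, not Guix code: the class name
IndexLinkParser, the function list_directory and the sample page are all
made up for this example, and the page is merely written in the style of a
mod_autoindex listing.

```python
from html.parser import HTMLParser


class IndexLinkParser(HTMLParser):
    """Collect the href value of every <a> tag on an auto-index page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <A> and <a> both arrive as "a".
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def list_directory(html):
    """Return (directories, files) named by the relative links in HTML."""
    parser = IndexLinkParser()
    parser.feed(html)
    directories, files = [], []
    for href in parser.hrefs:
        # Skip sort links ("?C=N;O=D"), absolute paths, parent links, full URLs.
        if href.startswith(("?", "/", "..")) or "://" in href:
            continue
        name = href.rstrip("/")
        # Only a trailing slash is allowed; anything deeper is ignored.
        if not name or "/" in name:
            continue
        (directories if href.endswith("/") else files).append(name)
    return directories, files


# Hypothetical sample in the style of an Apache mod_autoindex page.
SAMPLE = """
<html><body><h1>Index of /gnu/guile</h1>
<a href="?C=N;O=D">Name</a>
<a href="/gnu/">Parent Directory</a>
<a href="old/">old/</a>
<a href="guile-2.2.1.tar.gz">guile-2.2.1.tar.gz</a>
<a href="guile-2.2.2.tar.gz">guile-2.2.2.tar.gz</a>
</body></html>
"""

directories, files = list_directory(SAMPLE)
print(directories)
print(files)
```

With the sample page, the sort link and the parent-directory link are
dropped, "old" is reported as a directory, and the two tarballs as files.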

Some things I already found out:

* Directory-listings generated by mod_autoindex can be provided as a
simple list by passing the query-parameter "F=0" in the URL [4].
There are other query parameters for sorting and pattern matching.
* nginx's "ngx_http_autoindex_module" [6] seems not to use query
parameters, but can be configured (on the server side) to provide
the content as XML or JSON. The "fancy_index" module [7] is
documented to "Allow choosing to sort elements", but [7] does not
state how, or whether, "fancy" can be switched off.
* lighttpd supports some of these options [5].

[5]

--

Regards
Hartmut Goebel

| Hartmut Goebel | h.goebel@crazy-compilers.com |
| www.crazy-compilers.com | compilers which you thought are impossible |
Attachment: 0xBF773B65.asc
Ludovic Courtès wrote on 22 Aug 2017 10:57
(name . Hartmut Goebel)(address . h.goebel@crazy-compilers.com)(address . 28159@debbugs.gnu.org)
87poboasjz.fsf@gnu.org
Hi Hartmut,

Hartmut Goebel <h.goebel@crazy-compilers.com> writes:

> our updater currently only supports FTP servers,

More precisely, several updaters rely on FTP (gnu, kernel.org, kde,
etc. see (guix gnu-maintenance)), but others rely on structured data
retrieved over HTTP(S) (pypi, cran, elpa, etc.)

> but more and more projects shut down their FTP service and provide
> HTTP(S) servers only (e.g. the Linux kernel). For other projects, the
> main distribution point has changed to HTTP and the mirrors still
> providing FTP are lagging (e.g. KDE, see [1]).

The FTP updater had the advantage of being simple and fairly generic,
but here we’ll probably have to go for project-specific methods.

So I would suggest picking one updater, say kde, and implementing it
using whatever metadata can be retrieved from kde.org.

This should be simpler than trying to figure out a generic method that
will work for every software project.

HTH!

Ludo’.
Hartmut Goebel wrote on 23 Aug 2017 12:20
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 28159@debbugs.gnu.org)
570534f1-58d2-6db5-b5c2-b9e5276c5974@crazy-compilers.com
On 22.08.2017 at 10:57, Ludovic Courtès wrote:
> So I would suggest picking one updater, say kde, and implementing it
> using whatever metadata can be retrieved from kde.org.

I'm not sure I understood what you mean by "whatever metadata can
be retrieved from kde.org".

By chance, download.kde.org indeed provides "ls-lR" and "ls-lR.bz"
files at the top level. I was not aware of this until just now. Using
this might be an option. (It lags a bit, though I think this is
acceptable. From what I've seen, I guess it is regenerated each hour if
some file changed.)

So for KDE we might find a simpler solution. But in the long run IMHO we
need a simple HTML parser.

I'm not skilled enough in scheme/guile to write such a parser, sorry.

--
Regards
Hartmut Goebel

| Hartmut Goebel | h.goebel@crazy-compilers.com |
| www.crazy-compilers.com | compilers which you thought are impossible |
Ludovic Courtès wrote on 23 Aug 2017 23:30
(name . Hartmut Goebel)(address . h.goebel@crazy-compilers.com)(address . 28159@debbugs.gnu.org)
87ziaqj7k4.fsf@gnu.org
Hartmut Goebel <h.goebel@crazy-compilers.com> writes:

> On 22.08.2017 at 10:57, Ludovic Courtès wrote:
>> So I would suggest picking one updater, say kde, and implementing it
>> using whatever metadata can be retrieved from kde.org.
>
> I'm not sure if I understood what you mean with "whatever metadata can
> be retrieved from kde.org".

I mean using package metadata provided by kde.org (maybe they have a
JSON representation of the package graph or something?), or the ‘ls-lR’
files at worst.

> By chance, download.kde.org indeed provides "ls-lR" and "ls-lR.bz"
> files at the top level. I was not aware of this until just now. Using
> this might be an option. (It lags a bit, though I think this is
> acceptable. From what I've seen, I guess it is regenerated each hour if
> some file changed.)

Sounds good.

> So for kde we might find a simpler solution. But in the long-run IMHO we
> need a simple html parser.

In some cases yes, but maybe not in all cases. I also suspect that
something that attempts to extract the latest release number from a home
page may be brittle.

> I'm not skilled enough in scheme/guile to write such a parser, sorry.

This can be done along these lines:

scheme@(guile-user)> ,use(sxml simple)
scheme@(guile-user)> ,use(web client)
scheme@(guile-user)> ,use(sxml match)
scheme@(guile-user)> (define page
                       (xml->sxml
                        (call-with-values
                            (lambda ()
                              (http-get "http://www.gnu.org/software/guix/guix.html"
                                        #:streaming? #t))
                          (lambda (response port)
                            port))))

… where ‘page’ is the SXML representation of the web page. The
difficulty is to browse this page (using ‘match’ or ‘sxml-match’.)

HTH,
Ludo’.
Ludovic Courtès wrote on 26 Aug 2017 11:54
(name . Hartmut Goebel)(address . h.goebel@crazy-compilers.com)(address . 28159@debbugs.gnu.org)
87r2vybqnw.fsf@gnu.org
Hello,

I just learned that ftp://ftp.gnu.org will be retired on Nov. 1st, 2017,
so we’ll have to implement a replacement for the ‘gnu’ updater at least.

At worst, we’ll parse HTML index files like the one at
https://ftp.gnu.org/gnu/guile/, but I’m trying to see if the FSF
sysadmin could generate an ‘ls-lR’ file or similar.

Ludo’.
Hartmut Goebel wrote on 26 Aug 2017 12:33
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 28159@debbugs.gnu.org)
263fe0e0-b9f5-3377-30ad-2675698d41c8@crazy-compilers.com
Hi,
> I just learned that ftp://ftp.gnu.org will be retired on Nov. 1st, 2017,
> so we’ll have to implement a replacement for the ‘gnu’ updater at least.

By chance, this server also provides an `ls-lrRt.txt.gz` file.
Unfortunately it has a slightly different (date) format than the one at
kde.org:

kde:

drwxr-xr-x   3 ftpadmin packager       6 2000-10-01 14:07 adm


gnu:

drwxr-xr-x   2 root root      4096 Aug  2  2003 third-party


Also by chance, ftp.gnu.org provides a file `find.txt.gz`, listing
all files, including the full path:

./video/Stephen_Fry-Happy_Birthday_GNU-nq_600px_425kbit.ogv
./old-gnu/g77/g77-0.5.21.tar.gz
./old-gnu/guile
./old-gnu/guile/guile-www-1.0.1.tar.gz
./old-gnu/guile/guile-1.3.2.tar.gz


> At worst, we’ll parse HTML index files like the one at
> <https://ftp.gnu.org/gnu/guile/>,

This is what this bug is about :-) Please note the query parameters one
can pass to Apache: https://ftp.gnu.org/gnu/guile/?F=0 is much more terse.

--
Regards
Hartmut Goebel

| Hartmut Goebel | h.goebel@crazy-compilers.com |
| www.crazy-compilers.com | compilers which you thought are impossible |
Ludovic Courtès wrote on 3 Sep 2017 23:40
(name . Hartmut Goebel)(address . h.goebel@crazy-compilers.com)(address . 28159@debbugs.gnu.org)
87pob71mwt.fsf@gnu.org
Hi Hartmut,

Hartmut Goebel <h.goebel@crazy-compilers.com> writes:

> Also by chance ftp.gnu.org also provides a file `find.txt.gz`, listing
> all files, including the full path:
>
> ./video/Stephen_Fry-Happy_Birthday_GNU-nq_600px_425kbit.ogv
> ./old-gnu/g77/g77-0.5.21.tar.gz
> ./old-gnu/guile
> ./old-gnu/guile/guile-www-1.0.1.tar.gz
> ./old-gnu/guile/guile-1.3.2.tar.gz

This one is nice and smaller than ‘ls-lR’. I reimplemented the GNU
updater in terms of this file, and kept the previous FTP-based updater
around for GNU packages not hosted on ftp.gnu.org:

https://git.savannah.gnu.org/cgit/guix.git/commit/?id=100b216d8a4218daec4a79024d62d54b52dc07be

“guix refresh -t gnu” is now much faster.
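
To illustrate why a flat path list such as ‘find.txt’ is convenient, here
is a rough Python sketch of picking the newest tarball of a package from
it. It is not the code from the commit; latest_release is a hypothetical
helper, and the guile-2.x paths in the sample are invented (the other
paths are the ones quoted earlier in this thread).

```python
import re


def latest_release(paths, package):
    """Return the highest version of PACKAGE found in PATHS, or None."""
    # Match ".../<package>-<dotted version>.tar.<ext>" at the end of a path.
    pattern = re.compile(
        r"(?:^|/)" + re.escape(package)
        + r"-(\d+(?:\.\d+)*)\.tar\.(?:gz|bz2|xz|lz)$"
    )
    versions = []
    for path in paths:
        match = pattern.search(path)
        if match:
            versions.append(match.group(1))
    if not versions:
        return None
    # Compare numerically, component by component, so that 2.0.14 > 2.0.9.
    return max(versions, key=lambda v: tuple(int(part) for part in v.split(".")))


PATHS = [
    "./old-gnu/g77/g77-0.5.21.tar.gz",
    "./old-gnu/guile",
    "./old-gnu/guile/guile-www-1.0.1.tar.gz",
    "./old-gnu/guile/guile-1.3.2.tar.gz",
    # The next three paths are invented for the example.
    "./gnu/guile/guile-2.0.9.tar.gz",
    "./gnu/guile/guile-2.0.14.tar.gz",
    "./gnu/guile/guile-2.2.2.tar.gz",
]

print(latest_release(PATHS, "guile"))
print(latest_release(PATHS, "guile-www"))
```

Note that "guile-www-1.0.1.tar.gz" is not mistaken for a release of
"guile", because the version must start right after the package name.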

The next step may be to have a more-or-less generic updater based on
‘ls-lR’ files.

Thanks,
Ludo’.
Ludovic Courtès wrote on 8 Sep 2017 10:30
(address . sysadmin@gnu.org)(address . 28159@debbugs.gnu.org)
87r2vhk2x9.fsf@gnu.org
Hello sysadmins!

How frequently is ftp.gnu.org/find.txt.gz updated? It seems to be less
than once a day.

Could we arrange to have it regenerated every time a new file is
uploaded?

I suppose uploads aren’t this frequent, but regenerating ‘find.txt.gz’
right after an upload would ensure that it’s always current.

Thanks in advance!

Ludo’.

PPS: Please reply to all.
Rubén Rodríguez Pérez via RT wrote on 14 Sep 2017 18:50
(address . 28159@debbugs.gnu.org)
rt-4.2.13-5-gc649048-446-1505407828-916.1238656-6-0@rt.gnu.org
On Fri Sep 08 04:31:05 2017, ludo@gnu.org wrote:
> Hello sysadmins!

Hi Ludo

> How frequently is ftp.gnu.org/find.txt.gz updated? It seems to be less
> than once a day.

It is run by cron.daily

> Could we arrange to have it regenerated every time a new file is
> uploaded?
>
> I suppose uploads aren’t this frequent, but regenerating ‘find.txt.gz’
> right after an upload would ensure that it’s always current.

I've modified the cron script to apply that change; I'm now on the lookout for the next upload to see if it worked.

Regards,
--
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org
Ludovic Courtès wrote on 15 Sep 2017 09:50
(name . Rubén Rodríguez Pérez via RT)(address . sysadmin@gnu.org)(address . 28159@debbugs.gnu.org)
878thg7644.fsf@gnu.org
Hi Rubén,

"Rubén Rodríguez Pérez via RT" <sysadmin@gnu.org> writes:

> On Fri Sep 08 04:31:05 2017, ludo@gnu.org wrote:

[...]

>> Could we arrange to have it regenerated every time a new file is
>> uploaded?
>>
>> I suppose uploads aren’t this frequent, but regenerating ‘find.txt.gz’
>> right after an upload would ensure that it’s always current.
>
> I've modified the cron script to apply that change, now on the look for the next upload to see if it worked.

Awesome, thanks!

Ludo’.
Ludovic Courtès wrote on 26 Sep 2017 00:39
Re: bug#28159: Updater needs to support HTTP(S) servers
(name . Hartmut Goebel)(address . h.goebel@crazy-compilers.com)(address . 28159@debbugs.gnu.org)
87tvzqpflx.fsf@gnu.org
ludo@gnu.org (Ludovic Courtès) writes:

> This one is nice and smaller than ‘ls-lR’. I reimplemented the GNU
> updater in terms of this file, and kept the previous FTP-based updater
> around for GNU packages not hosted on ftp.gnu.org:
>
> https://git.savannah.gnu.org/cgit/guix.git/commit/?id=100b216d8a4218daec4a79024d62d54b52dc07be
>
> “guix refresh -t gnu” is now much faster.

Commit c1d8b3b3b5af8282328b87dd7a8d09357cbb0af7 rewrites the GNOME
updater in terms of the ‘cache.json’ files that can be found in each
package directory at <https://ftp.gnome.org/pub/gnome/sources>.

Ludo’.
Ludovic Courtès wrote on 10 Nov 2018 23:38
(name . Hartmut Goebel)(address . h.goebel@crazy-compilers.com)(address . 28159@debbugs.gnu.org)
87efbsv9rb.fsf@gnu.org
ludo@gnu.org (Ludovic Courtès) writes:

> ludo@gnu.org (Ludovic Courtès) writes:
>
>> This one is nice and smaller than ‘ls-lR’. I reimplemented the GNU
>> updater in terms of this file, and kept the previous FTP-based updater
>> around for GNU packages not hosted on ftp.gnu.org:
>>
>> https://git.savannah.gnu.org/cgit/guix.git/commit/?id=100b216d8a4218daec4a79024d62d54b52dc07be
>>
>> “guix refresh -t gnu” is now much faster.
>
> Commit c1d8b3b3b5af8282328b87dd7a8d09357cbb0af7 rewrites the GNOME
> updater in terms of the ‘cache.json’ files that can be found in each
> package directory at <https://ftp.gnome.org/pub/gnome/sources>.

Commit 5230dce154a8861d806fcd667f2d424def571ed6 rewrites the kernel.org
updater so that it’s based on an analysis of HTML directory listings
such as <https://cdn.kernel.org/pub/software/scm/git/>.

Ludo’.
Hartmut Goebel wrote on 10 Sep 2019 19:25
(address . 28159@debbugs.gnu.org)
901a1993-ed8e-4306-d3f2-f3dbcac30424@crazy-compilers.com
On 22.08.17 at 10:57, Ludovic Courtès wrote:
> More precisely, several updaters rely on FTP (gnu, kernel.org, kde,
> etc. see (guix gnu-maintenance)), but others rely on structured data
> retrieved over HTTP(S) (pypi, cran, elpa, etc.)

For the record: KDE no longer relies on FTP access. It now fetches the
ls-lR.bz2 file list using HTTPS from download.kde.org, converts it into
a list of file paths and caches the list.

commit 4eb69bf0d33810886ee118f38989cef696e4c868
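
The conversion step (an ‘ls -lR’ listing turned into a list of file paths)
could look roughly like the Python sketch below. It is not the code from
the commit above: ls_lr_to_paths is a hypothetical helper, downloading,
decompression and caching are left out, and the README and NOTES.txt
entries in the sample are invented. It copes with both date layouts shown
earlier in this thread (kde.org and ftp.gnu.org).

```python
import re

# An entry line starts with a file-type character and nine permission bits.
ENTRY = re.compile(r"^[-dlbcps][-rwxsStT]{9}")
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def ls_lr_to_paths(text):
    """Convert the output of 'ls -lR' into a list of file paths."""
    paths = []
    directory = "."
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line.startswith("total "):
            continue
        if not ENTRY.match(line):
            # A header such as "./adm:" introduces the entries of a directory.
            if line.endswith(":"):
                directory = line[:-1]
            continue
        fields = line.split(None, 5)  # perms, links, owner, group, size, rest
        if len(fields) < 6 or fields[0].startswith("d"):
            continue  # malformed line, or a directory rather than a file
        rest = fields[5]
        # "2000-10-01 14:07 name" has two date fields, "Aug  2  2003 name" three.
        date_fields = 2 if ISO_DATE.fullmatch(rest.split(None, 1)[0]) else 3
        parts = rest.split(None, date_fields)
        if len(parts) <= date_fields:
            continue
        name = parts[date_fields].split(" -> ")[0]  # drop a symlink target
        paths.append(directory + "/" + name)
    return paths


SAMPLE = """
.:
total 8
drwxr-xr-x   3 ftpadmin packager       6 2000-10-01 14:07 adm
drwxr-xr-x   2 root root      4096 Aug  2  2003 third-party

./adm:
total 4
-rw-r--r--   1 ftpadmin packager    1234 2000-10-01 14:07 README

./third-party:
total 4
-rw-r--r--   1 root root       512 Aug  2  2003 NOTES.txt
"""

print(ls_lr_to_paths(SAMPLE))
```

For the sample this yields "./adm/README" and "./third-party/NOTES.txt";
the two directory entries themselves are skipped.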

--
Regards
Hartmut Goebel

| Hartmut Goebel | h.goebel@crazy-compilers.com |
| www.crazy-compilers.com | compilers which you thought are impossible |
Brice Waegeneire wrote on 29 Apr 2020 10:21
Closing bug #28159? Updater needs to support HTTP(S) servers
(address . 28159@debbugs.gnu.org)
07e7317fe16fc58790cd99c0a712fcf5@waegenei.re
Hello Guix,

It looks like most of the major updaters that relied on FTP (GNU,
kernel.org, KDE and GNOME) now support HTTP(S). I think we can close
this bug.

Ludovic Courtès wrote on Tue Aug 22 10:57:20+0200 2017:
> More precisely, several updaters rely on FTP (gnu, kernel.org, kde,
> etc. see (guix gnu-maintenance)), but others rely on structured data
> retrieved over HTTP(S) (pypi, cran, elpa, etc.)

Ludovic Courtès wrote on Sun Sep 03 23:40:18+0200 2017:
> This one is nice and smaller than ‘ls-lR’. I reimplemented the GNU
> updater in terms of this file, and kept the previous FTP-based updater
> around for GNU packages not hosted on ftp.gnu.org:

Ludovic Courtès wrote on Tue Sep 26 00:39:54+0200 2017:
> Commit c1d8b3b3b5af8282328b87dd7a8d09357cbb0af7 rewrites the GNOME
> updater in terms of the ‘cache.json’ files that can be found in each
> package directory at <https://ftp.gnome.org/pub/gnome/sources>.

Ludovic Courtès wrote on Sat Nov 10 23:38:16+0100 2018:
> Commit 5230dce154a8861d806fcd667f2d424def571ed6 rewrites the kernel.org
> updater so that it’s based on an analysis of HTML directory listings
> such as <https://cdn.kernel.org/pub/software/scm/git/>.

Hartmut Goebel wrote on Tue Sep 10 19:25:58+0200 2019:
> For the records: KDE no longer relies on FTP access. It now fetches the
> ls-lR.bz2 file list using HTTPS from download.kde.org, converts it into
> a list of file paths and caches the list.

- Brice
Ludovic Courtès wrote on 30 Apr 2020 23:14
(name . Brice Waegeneire)(address . brice@waegenei.re)
87lfmcoeuk.fsf@gnu.org
Hi Brice,

Brice Waegeneire <brice@waegenei.re> writes:

> It looks like most of the major updaters that relied on FTP (GNU,
> kernel.org, KDE and GNOME) now support HTTP(S). I think we can close
> this bug.

Yup. There’s still the ‘gnu-ftp’ and the ‘xorg’ updaters which,
according to ‘guix refresh --list-updaters’, account for 2.2% of the
packages. We can change them later when it becomes necessary.

Closing, thank you!

Ludo’.
Closed