Hi,
Here is the second iteration of my Xapian Guix package search patchset. I have found the reason the earlier patchset did not show significant speedup. It turns out that most of the time is spent in printing and texinfo rendering of the search results. So, in this patchset, I pre-render the search results while building the Xapian index and stuff them into the Xapian database itself. Therefore, during `guix search`, I just pull out the pre-rendered search results and print them on the screen. This is much faster. See comparison below.
With a warm cache,
$ time guix search inkscape
real 0m1.787s
user 0m1.745s
sys 0m0.111s
$ time /tmp/test/bin/guix search inkscape
real 0m0.199s
user 0m0.182s
sys 0m0.024s
If most of the speedup comes from pre-rendering the results, it might seem that the Xapian search is not so useful. We might as well have stuffed the pre-rendered search results into the existing package cache generated by generate-package-cache, or so it might seem. But, there are the following arguments in favor of Xapian.
- The package cache would grow in size, and lookup would be slowed down because we need to load the entire cache into memory. Xapian, on the other hand, need only look up the specific packages that match the search query.
- Xapian can provide superior search results due to its stemming and language models.
- Xapian can provide spelling correction and query expansion -- that is, suggest search terms to improve search results. Note that I haven't implemented this yet; it is out of scope for this patchset.
* Simplify our package search results
Why not use a simpler package search results format like Arch Linux or Debian does? We could just display the package name, version and synopsis like so.
inkscape 0.92.4 Vector graphics editor
inklingreader 0.8 Wacom Inkling sketch format conversion and manipulation
Why do we need the entire recutils format? If the user is interested, they can always use `guix package --show` to get the full recutils formatted info. Having shorter search results will make everything even faster and much more readable. WDYT?
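To get a feel for the shorter format without touching Guix itself, the existing recutils output can be collapsed after the fact. The following is only a sketch: `summarize_records` is a hypothetical helper name, and it assumes records are separated by blank lines with single-line `name:`, `version:` and `synopsis:` fields. It post-processes the current output and is not the change proposed for `guix search`.

```shell
# Hypothetical helper: read recutils-style records on stdin and print one
# "name version synopsis" line per record.
summarize_records() {
    awk '
        function emit() {
            # Print the record collected so far, then reset for the next one.
            if (name != "") print name, version, synopsis
            name = version = synopsis = ""
        }
        /^name: /     { name = substr($0, 7) }
        /^version: /  { version = substr($0, 10) }
        /^synopsis: / { synopsis = substr($0, 11) }
        /^$/          { emit() }   # a blank line ends a record
        END           { emit() }   # the last record may lack a trailing blank line
    '
}

# Example use (assumes guix is installed):
#   guix search inkscape | summarize_records
```

Since this still pays the full rendering cost of the long format, it only previews readability and says nothing about the speed gain.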
* How to test this patchset
To get guile-xapian, run a `guix pull`, if you haven't already. Then in your Guix source directory, drop into an environment with guix dependencies and guile-xapian.
$ guix environment guix --ad-hoc guile-xapian
Apply patches and build.
$ git am v2-0000-cover-letter.patch v2-0002-gnu-Generate-Xapian-package-search-index.patch v2-0001-build-self-Add-guile-xapian-to-Guix-dependencies.patch v2-0003-gnu-Use-Xapian-index-for-package-search.patch
$ make
Run a test guix pull.
$ ./pre-inst-env guix pull --url=$PWD --branch=xapian -p /tmp/test
where xapian is the name of the branch you committed the patches to.
Then, run the guix search in /tmp/test.
$ /tmp/test/bin/guix search game
* Comments
Pierre Neidhardt <mail@ambrevar.xyz> writes:
>> +(define (search-package-index profile querystring)
>
> Maybe `query-string'?
Done in this patchset.
>> + (define (regexp? str)
>> + (string-any
>> + (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
>> + str))
>> +
>> + (if (and (current-profile)
>> + (not (any regexp? patterns)))
>
> I would not put characters like ".", "$", or "+" here, lest we mistake a
> Xapian pattern for a regexp.
>
> As you said, I don't think both are compatible without ambiguity
> anyways, so we should probably drop regexp (or at least toggle them with
> a command line argument).
I agree.
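To make the ambiguity concrete, here is a small shell sketch that mirrors the character set of the quoted regexp? predicate. It is an illustration only: the patch itself uses Guile's string-any, and `looks_like_regexp` is a hypothetical name.

```shell
# Hypothetical stand-in for the quoted regexp? predicate: succeed if the
# argument contains any of . [ { } ( ) \ * + ? | ^ $
looks_like_regexp() {
    printf '%s\n' "$1" | grep -q '[{}()\*+?|^$.[]'
}

for q in inkscape 'ink.*' guile-2.2 c++; do
    if looks_like_regexp "$q"; then
        echo "$q: treated as regexp"
    else
        echo "$q: treated as plain query"
    fi
done
```

With this character set, ordinary queries such as guile-2.2 and c++ are classified as regexps alongside ink.*, which is the kind of misclassification the comment above warns about.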
zimoun <zimon.toutoune@gmail.com> writes:
> In the commit message, I would capitalize Xapian.
Done in this patchset.
>> +(define (generate-package-search-index directory)
>> + "Generate under DIRECTORY a xapian index of all the available packages."
>
> Xapian with capital.
Done in this patchset.
> Is (make-stem "en") for the locale?
I still have English hard-coded. I haven't yet figured out how to detect the locale and stem accordingly. But, there is a larger problem. Since we cannot anticipate what locale the user will run guix search with, should we build the Xapian index for all locales? That is, should we index not only the English versions of the packages but also all other translations as well?
> package-search-index and package-cache-file could be refactored
> because they share all the same code.
Yes, they could be. However, I'll postpone that to the next iteration of the patchset.
> I do not know what is the convention for the bindings.
> But there is 'fold-packages' so I would be inclined to 'fold-msets' or
> something in this flavour.
Well, everywhere else in Guile we have such things as vhash-fold, string-fold, hash-fold, stream-fold, etc. That's why I went with mset-fold. Also, we are folding over a single mset (match-set). So, mset should be in the singular.
> And more importantly, 'make as-derivations' to avoid a "guix pull" breakage,
> Ah do not forget to adapt some tests.
Will do this once we have consensus about the other features of this patchset.
> b. The xapian relevance should truncated
Done in this patchset.
> Xapian does not return the package 'emacs' itself as the first. And worse,
> it is not returned at all.
In this patchset, since we're indexing the package name as well, emacs is returned but it is still far from the beginning.
> I propose the value of 4294967295 for pagesize.
In this patchset, I pass (database-document-count db) as the #:maximum-items keyword argument to enquire-mset. This is the upstream recommended way to get all search results. I hadn't done this earlier since I hadn't yet wrapped database-document-count in guile-xapian.
>> In this patchset, I have only indexed the package descriptions. In the next
>> version of this patchset, I will index all other terms as specified in
>> %package-metrics of guix/ui.scm.
>
> Yes, it appears to me a detail that should be easy to fix. I mean, it
> does not seems blocking.
Done in this patchset.
Ludovic Courtès <ludo@gnu.org> writes:
> Note that ‘guix search’ time is largely dominated by I/O.
Yes, `guix search` is I/O intensive. That is why I expect Xapian to do better, since it only needs to access matching packages, not all packages. Also, the Xapian index is fast at all times. It is not very dependent on a warm filesystem cache.
> On my laptop,
> I get (first measurement is cold cache, second one is warm cache):
>
> --8<---------------cut here---------------start------------->8---
> $ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
> $ time guix search foo >/dev/null
>
> real 0m2.631s
> user 0m1.134s
> sys 0m0.124s
> $ time guix search foo >/dev/null
>
> real 0m0.836s
> user 0m1.027s
> sys 0m0.053s
> --8<---------------cut here---------------end--------------->8---
>
> It’s hard to do better on the warm cache case because at this level,
> there may be other things to optimize having little to do with searching
> itself.
>
> Note that this is on an SSD; the cold-cache case must be worse on NFS or
> on a spinning disk, and there we could gain a lot.
My laptop is quite old with a particularly slow HDD. Hence my motivation to improve guix search performance!
> I think we should weigh the pros and cons on all these aspects: speed,
> complexity and maintenance cost, search result quality, search features,
> etc.
I agree.
> PS: I have not yet looked at the whole series as I’m just coming back to
> the keyboard. :-)
Welcome back! :-)
Arun Isaac (3):
  build-self: Add guile-xapian to Guix dependencies.
  gnu: Generate Xapian package search index.
  gnu: Use Xapian index for package search.
 build-aux/build-self.scm | 11 +++++++
 gnu/packages.scm         | 62 +++++++++++++++++++++++++++++++++++++++-
 guix/channels.scm        | 34 +++++++++++++++++++++-
 guix/scripts/package.scm |  7 +++--
 guix/self.scm            |  7 ++++-
 guix/ui.scm              | 37 ++++++++++++++++++++++++
 6 files changed, 153 insertions(+), 5 deletions(-)
--
2.25.1