From: Ludovic Courtès
To: Arun Isaac
Cc: mail@ambrevar.xyz, 39258@debbugs.gnu.org, zimon.toutoune@gmail.com
Subject: Re: [PATCH v2 0/3] Xapian for Guix package search
Date: Sun, 08 Mar 2020 12:33:44 +0100
Message-ID: <875zffcc87.fsf@gnu.org>
In-Reply-To: (Arun Isaac's message of "Sun, 08 Mar 2020 14:31:42 +0530")
References: <20200307133116.11443-1-arunisaac@systemreboot.net> <87sgijgb1v.fsf@gnu.org>

Hi,

Arun Isaac wrote:

>>> It turns out that most of the time is spent in printing and texinfo
>>> rendering of the search results.
>
> Also, when we put all package metadata into the Xapian index, we don't
> have to look up any of the package variables in (gnu packages *) during
> `guix search` time. This also contributes substantially to the speedup.

Yup.

>> In general, pre-rendering doesn’t seem practical to me: the output of
>> ‘guix search’ is locale-dependent (it speaks the user’s language) and
>
> Note that we already need to index package synopses and descriptions in
> all languages. I still haven't implemented this, though.

Oh, right. Tricky!

>> adjusts to the terminal width (well, this is temporarily broken on
>> Guile 3.0.0, but see ‘%text-width’ in (guix ui)).
>
> This could be accomplished even with pre-rendering. Xapian provides
> "slots" to store arbitrary strings with a document. Instead of storing
> the pre-rendered document as a whole, we could store pre-rendered fields
> in separate slots. Then, during `guix search` time, we can assemble the
> result from these pre-rendered fields.

I’m not sure I understand. The index wouldn’t store pre-rendered
strings for every possible terminal width, right?

>> Also, if the 12K+ descriptions need to be rendered at the time the user
>> runs ‘guix pull’, the experience may not be great, because it could take
>> a bit of time.
>
> This is a problem, but I would see it as a necessary "compilation"
> step. :-P In fact, this whole patchset speeds up `guix search` by doing
> part of the work of `guix search` ahead of time. So, some such cost is
> unavoidable.

Yeah.
I think we need to take the whole user experience into account,
not just ‘guix search’. ‘guix pull’ already feels very slow, and it’s a
fairly common operation. Conversely, ‘guix search’ takes roughly
between 0.5 and 2 seconds and is an uncommon operation on a “slow path”
(in the sense that when you’re searching for software, you’ll probably
have to spend more than a couple of seconds to find what you’re looking
for.)

>> What I like about the recutils format in this context is that it’s both
>> human- and machine-readable. The examples in the manual show how it can
>> be useful to select the information displayed or to refine the search
>> (info "(guix) Invoking guix package").
>
> Xapian's query language is much more natural (as in natural language)
> than the regexp based techniques we need to use with recutils. I have
> hardly ever used the regexp based search and I suspect many others
> haven't either. Also, refining the search query should be easier to do
> with Xapian. We could even use Xapian's query expansion feature to
> suggest improved queries to the user.

I’m not sufficiently familiar with Xapian’s query language. The
examples I had in mind were:

  guix search malloc | recsel -p name,version,relevance
  guix search | recsel -p name -e 'license ~ "LGPL 3"'
  guix search crypto library | \
    recsel -e '! (name ~ "^(ghc|perl|python|ruby)")' -p name,synopsis

It’s not so much about regexps as it is about selecting individual
fields.

>> Were you able to measure the cost of rendering specifically?
>
> generate-package-search-index takes around 50 seconds. If I modify
> generate-package-search-index to not pre-render but simply store the
> package description alone, it takes around 20 seconds. That gives us a
> rough idea of the cost of pre-rendering.

To me, adding 20–50 seconds on ‘guix pull’ would be undesirable. :-/

>> I think we should look at a profile of ‘package->recutils’, there’s
>> probably room for improvement there.
>
> On quick inspection, most of the time in package->recutils is spent in
> texinfo rendering the description. Unless we use the simplified search
> results format as discussed above, we cannot avoid it.

What I meant was that we could use (statprof) to see whether/how Texinfo
rendering/parsing can be optimized.

Thanks,
Ludo’.
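
PS: To make the point about selecting individual fields concrete, here is a rough Python sketch of what the recsel pipelines in this message do with recutils-formatted output. It is only an illustration under assumptions: the sample records are made up, `parse_records` and `select` are hypothetical helper names, and the parser ignores recutils continuation lines and comments, so it is not a substitute for recsel itself.

```python
import re

# Hypothetical sample of recutils-formatted search output; the package
# names and fields below are invented for illustration only.
SAMPLE = """\
name: libfoo
version: 1.2
license: LGPL 3+
synopsis: Example crypto library

name: python-bar
version: 0.4
license: Expat
synopsis: Example Python bindings

name: bazmalloc
version: 2.0
license: LGPL 3+
synopsis: Example memory allocator
"""


def parse_records(text):
    """Split blank-line-separated 'field: value' blocks into dicts.

    Simplification: continuation lines and comments are not handled.
    """
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            field, _, value = line.partition(":")
            record[field.strip()] = value.strip()
        records.append(record)
    return records


def select(records, fields, where=None):
    """Keep records for which `where` holds, reduced to `fields`."""
    return [
        {f: r[f] for f in fields if f in r}
        for r in records
        if where is None or where(r)
    ]


records = parse_records(SAMPLE)

# Roughly: recsel -p name -e 'license ~ "LGPL 3"'
lgpl = select(records, ["name"],
              where=lambda r: re.search(r"LGPL 3", r["license"]))

# Roughly: recsel -e '! (name ~ "^(ghc|perl|python|ruby)")' -p name,synopsis
not_bindings = select(
    records, ["name", "synopsis"],
    where=lambda r: not re.match(r"(ghc|perl|python|ruby)", r["name"]))

print(lgpl)
print(not_bindings)
```

The regexp is only one kind of predicate here; the part that matters is that each record stays a set of named fields that can be projected and filtered independently.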