[PATCH 0/7] mumi: Boolean prefixes in xapian indexing and others

  • Done
Details
3 participants
  • Arun Isaac
  • Felix Lechner
  • Ricardo Wurmus
Owner
unassigned
Submitted by
Arun Isaac
Severity
normal
Arun Isaac wrote on 29 Dec 2022 21:18
(name . Arun Isaac)(address . arunisaac@systemreboot.net)
20221229201809.27997-1-arunisaac@systemreboot.net
Hi Ricardo,

This is a patchset that has been sleeping for some time in my local
git repo. So, I thought it was about time to send it over!

The main change is that some xapian prefixes should be indexed as
boolean prefixes. This makes the use of an implicit AND operator
unnecessary and lets xapian do the natural thing of ordering results
by relevance. I believe this improves the search significantly. Also,
since we retrieve search results by relevance, we can offload limiting
of search results to xapian. Thus, we improve performance as well.

For this patchset to be useful, mumi's xapian index will have to be
rebuilt. In general, it is good to periodically rebuild the xapian
index from scratch.

Regards,
Arun

Arun Isaac (7):
xapian: Index several terms as boolean and without positions.
xapian: Declare some prefixes as boolean.
xapian: Do not override the default OR implicit query operator.
messages: Remove unused set intersection feature in search-bugs.
messages: Offload limiting search results to xapian.
cache: Specify that cache! returns the cached value.
xapian: Preserve order of search results.

mumi/cache.scm | 3 +-
mumi/messages.scm | 29 ++++--------
mumi/xapian.scm | 109 +++++++++++++++++++++++++++++++---------------
3 files changed, 86 insertions(+), 55 deletions(-)

--
2.38.1
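
As a rough illustration of the cover letter's argument, the following toy model in plain Guile treats a status term as a boolean filter and ranks the remaining documents by how many query terms they contain. It is only a sketch: the document list, `score` and `search-toy` are made up for this example, and the scoring is a crude stand-in for Xapian's actual relevance weighting.

```scheme
(use-modules (srfi srfi-1))

;; Toy index (hypothetical data): each entry is
;; (id boolean-terms text-terms).
(define documents
  '((1 ("XSTATUSopen") ("guix" "pull" "fails"))
    (2 ("XSTATUSopen") ("guix" "build" "fails"))
    (3 ("XSTATUSdone") ("guix" "pull" "fails"))))

(define (score text-terms query-terms)
  "Count the TEXT-TERMS that also occur in QUERY-TERMS."
  (count (lambda (term) (member term query-terms)) text-terms))

(define (search-toy filter-term query-terms)
  "Return the ids of documents carrying FILTER-TERM, best match first."
  (let* ((candidates
          ;; Boolean filter: a document either has the term or is dropped.
          (filter (lambda (doc) (member filter-term (cadr doc)))
                  documents))
         (scored
          ;; Implicit OR: keep every candidate matching at least one term.
          (filter-map (lambda (doc)
                        (let ((s (score (caddr doc) query-terms)))
                          (and (> s 0) (cons (car doc) s))))
                      candidates)))
    ;; Order by descending score, then strip the scores.
    (map car
         (sort scored (lambda (a b) (> (cdr a) (cdr b)))))))

(search-toy "XSTATUSopen" '("pull" "fails")) ; => (1 2)
```

With an implicit AND, document 2 would be dropped because it lacks "pull". With OR it is kept but ranked below document 1, and document 3 is excluded by the filter regardless of its text.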
Arun Isaac wrote on 29 Dec 2022 21:23
[PATCH 1/7] xapian: Index several terms as boolean and without positions.
(name . Arun Isaac)(address . arunisaac@systemreboot.net)
20221229202400.28565-1-arunisaac@systemreboot.net
* mumi/xapian.scm (index-files): Index bug number, submitter, authors,
owner, severity, tags, status, file and msgids as boolean terms. Index
bug number, severity, tags, status, file and msgids without position
information.
---
mumi/xapian.scm | 65 ++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 51 insertions(+), 14 deletions(-)

diff --git a/mumi/xapian.scm b/mumi/xapian.scm
index 68169e8..06a54cd 100644
--- a/mumi/xapian.scm
+++ b/mumi/xapian.scm
@@ -1,6 +1,6 @@
;;; mumi -- Mediocre, uh, mail interface
;;; Copyright © 2020, 2022 Ricardo Wurmus <rekado@elephly.net>
-;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
+;;; Copyright © 2020, 2022 Arun Isaac <arunisaac@systemreboot.net>
;;;
;;; This program is free software: you can redistribute it and/or
;;; modify it under the terms of the GNU Affero General Public License
@@ -119,20 +119,57 @@ messages and index their contents in the Xapian database at DBPATH."
(term-generator (make-term-generator #:stem (make-stem "en")
#:document doc)))
;; Index fields with a suitable prefix. This allows for
- ;; searching separate fields as in subject:foo,
- ;; from:bar, etc.
- (index-text! term-generator bugid #:prefix "B")
- (index-text! term-generator submitter #:prefix "A")
- (index-text! term-generator authors #:prefix "XA")
+ ;; searching separate fields as in subject:foo, from:bar,
+ ;; etc. We do not keep track of the within document
+ ;; frequencies of terms that will be used for boolean
+ ;; filtering. We do not generate position information for
+ ;; fields that will not need phrase searching or NEAR
+ ;; searches.
+ (index-text! term-generator
+ bugid
+ #:prefix "B"
+ #:wdf-increment 0
+ #:positions? #f)
+ (index-text! term-generator
+ submitter
+ #:prefix "A"
+ #:wdf-increment 0)
+ (index-text! term-generator
+ authors
+ #:prefix "XA"
+ #:wdf-increment 0)
(index-text! term-generator subjects #:prefix "S")
- (index-text! term-generator (or (bug-owner bug) "") #:prefix "XO")
- (index-text! term-generator (or (bug-severity bug) "normal") #:prefix "XS")
- (index-text! term-generator (or (bug-tags bug) "") #:prefix "XT")
- (index-text! term-generator (cond
- ((bug-done bug) "done")
- (else "open")) #:prefix "XSTATUS")
- (index-text! term-generator file #:prefix "F")
- (index-text! term-generator msgids #:prefix "XU")
+ (index-text! term-generator
+ (or (bug-owner bug) "")
+ #:prefix "XO"
+ #:wdf-increment 0)
+ (index-text! term-generator
+ (or (bug-severity bug) "normal")
+ #:prefix "XS"
+ #:wdf-increment 0
+ #:positions? #f)
+ (index-text! term-generator
+ (or (bug-tags bug) "")
+ #:prefix "XT"
+ #:wdf-increment 0
+ #:positions? #f)
+ (index-text! term-generator
+ (cond
+ ((bug-done bug) "done")
+ (else "open"))
+ #:prefix "XSTATUS"
+ #:wdf-increment 0
+ #:positions? #f)
+ (index-text! term-generator
+ file
+ #:prefix "F"
+ #:wdf-increment 0
+ #:positions? #f)
+ (index-text! term-generator
+ msgids
+ #:prefix "XU"
+ #:wdf-increment 0
+ #:positions? #f)
;; Index subject and body without prefixes for general
;; search.
--
2.38.1
Arun Isaac wrote on 29 Dec 2022 21:23
[PATCH 2/7] xapian: Declare some prefixes as boolean.
(name . Arun Isaac)(address . arunisaac@systemreboot.net)
20221229202400.28565-2-arunisaac@systemreboot.net
Some prefixes will only ever be used to filter the rest of the query
and not for matching approximately using relevance weighting
schemes. Such prefixes should be indexed as boolean prefixes.

* mumi/xapian.scm (parse-query*): Support boolean prefixes.
(search): Declare author, msgid, owner, severity, status, submitter
and tag as boolean prefixes.
---
mumi/xapian.scm | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/mumi/xapian.scm b/mumi/xapian.scm
index 06a54cd..7bf84d3 100644
--- a/mumi/xapian.scm
+++ b/mumi/xapian.scm
@@ -249,7 +249,7 @@ messages and index their contents in the Xapian database at DBPATH."
(invalid (pk invalid "")))
token))
-(define* (parse-query* querystring #:key stemmer stemming-strategy (prefixes '()))
+(define* (parse-query* querystring #:key stemmer stemming-strategy (prefixes '()) (boolean-prefixes '()))
(let ((queryparser (new-QueryParser))
(date-range-processor (new-DateRangeProcessor 0 "date:" 0))
(mdate-range-processor (new-DateRangeProcessor 1 "mdate:" 0)))
@@ -261,6 +261,10 @@ messages and index their contents in the Xapian database at DBPATH."
((field . prefix)
(QueryParser-add-prefix queryparser field prefix)))
prefixes)
+ (for-each (match-lambda
+ ((field . prefix)
+ (QueryParser-add-boolean-prefix queryparser field prefix)))
+ boolean-prefixes)
(QueryParser-add-rangeprocessor queryparser date-range-processor)
(QueryParser-add-rangeprocessor queryparser mdate-range-processor)
(let ((query (QueryParser-parse-query queryparser querystring
@@ -324,14 +328,14 @@ intact."
;; prefixes for field search.
(query (parse-query* querystring*
#:stemmer (make-stem "en")
- #:prefixes '(("submitter" . "A")
- ("author" . "XA")
- ("subject" . "S")
- ("owner" . "XO")
- ("severity" . "XS")
- ("tag" . "XT")
- ("status" . "XSTATUS")
- ("msgid" . "XU"))))
+ #:prefixes '(("subject" . "S"))
+ #:boolean-prefixes '(("author" . "XA")
+ ("msgid" . "XU")
+ ("owner" . "XO")
+ ("severity" . "XS")
+ ("status" . "XSTATUS")
+ ("submitter" . "A")
+ ("tag" . "XT"))))
(enq (enquire db query)))
;; Collapse on mergedwith value
(Enquire-set-collapse-key enq 2 1)
--
2.38.1
Arun Isaac wrote on 29 Dec 2022 21:23
[PATCH 3/7] xapian: Do not override the default OR implicit query operator.
(name . Arun Isaac)(address . arunisaac@systemreboot.net)
20221229202400.28565-3-arunisaac@systemreboot.net
An implicit AND operator is overly restrictive. It was only necessary
because prefixes that should have been indexed as boolean prefixes
were not.

* mumi/xapian.scm (parse-query*): Do not override the default OR
implicit query operator.
---
mumi/xapian.scm | 1 -
1 file changed, 1 deletion(-)

diff --git a/mumi/xapian.scm b/mumi/xapian.scm
index 7bf84d3..ae01acc 100644
--- a/mumi/xapian.scm
+++ b/mumi/xapian.scm
@@ -253,7 +253,6 @@ messages and index their contents in the Xapian database at DBPATH."
(let ((queryparser (new-QueryParser))
(date-range-processor (new-DateRangeProcessor 0 "date:" 0))
(mdate-range-processor (new-DateRangeProcessor 1 "mdate:" 0)))
- (QueryParser-set-default-op queryparser (Query-OP-AND))
(QueryParser-set-stemmer queryparser stemmer)
(when stemming-strategy
(QueryParser-set-stemming-strategy queryparser stemming-strategy))
--
2.38.1
Arun Isaac wrote on 29 Dec 2022 21:23
[PATCH 4/7] messages: Remove unused set intersection feature in search-bugs.
(name . Arun Isaac)(address . arunisaac@systemreboot.net)
20221229202400.28565-4-arunisaac@systemreboot.net
* mumi/messages.scm (search-bugs): Remove unused set intersection
feature.
---
mumi/messages.scm | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/mumi/messages.scm b/mumi/messages.scm
index fb305bb..75ac3b1 100644
--- a/mumi/messages.scm
+++ b/mumi/messages.scm
@@ -1,6 +1,6 @@
;;; mumi -- Mediocre, uh, mail interface
;;; Copyright © 2017, 2018, 2019, 2020, 2021 Ricardo Wurmus <rekado@elephly.net>
-;;; Copyright © 2018, 2019 Arun Isaac <arunisaac@systemreboot.net>
+;;; Copyright © 2018, 2019, 2022 Arun Isaac <arunisaac@systemreboot.net>
;;;
;;; This program is free software: you can redistribute it and/or
;;; modify it under the terms of the GNU Affero General Public License
@@ -250,16 +250,12 @@ PATCH-SET. If PATCH-SET is not provided, return all patches."
message-numbers)
"\n")))
-(define* (search-bugs query #:key (sets '()) (max 400))
- "Return a list of all bugs matching the given QUERY string.
-Intersect the result with the id sets in the list SETS."
- (let* ((ids (map string->number
- (search query)))
- (filtered (match sets
- (() ids)
- (_ (apply lset-intersection eq? ids sets)))))
- (status-with-cache (if (> (length filtered) max)
- (take filtered max) filtered))))
+(define* (search-bugs query #:key (max 400))
+ "Return a list of all bugs matching the given QUERY string."
+ (let ((ids (map string->number
+ (search query))))
+ (status-with-cache (if (> (length ids) max)
+ (take ids max) ids))))
(define (recent-bugs amount)
"Return up to AMOUNT bugs with most recent activity."
--
2.38.1
Arun Isaac wrote on 29 Dec 2022 21:23
[PATCH 5/7] messages: Offload limiting search results to xapian.
(name . Arun Isaac)(address . arunisaac@systemreboot.net)
20221229202400.28565-5-arunisaac@systemreboot.net
* mumi/messages.scm (search-bugs): Let xapian limit the number of
search results to MAX.
---
mumi/messages.scm | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mumi/messages.scm b/mumi/messages.scm
index 75ac3b1..b3ae962 100644
--- a/mumi/messages.scm
+++ b/mumi/messages.scm
@@ -252,10 +252,8 @@ PATCH-SET. If PATCH-SET is not provided, return all patches."
(define* (search-bugs query #:key (max 400))
"Return a list of all bugs matching the given QUERY string."
- (let ((ids (map string->number
- (search query))))
- (status-with-cache (if (> (length ids) max)
- (take ids max) ids))))
+ (status-with-cache (map string->number
+ (search query #:pagesize max))))
(define (recent-bugs amount)
"Return up to AMOUNT bugs with most recent activity."
--
2.38.1
Arun Isaac wrote on 29 Dec 2022 21:23
[PATCH 6/7] cache: Specify that cache! returns the cached value.
(name . Arun Isaac)(address . arunisaac@systemreboot.net)
20221229202400.28565-6-arunisaac@systemreboot.net
* mumi/cache.scm (cache!): Specify in the docstring that cache!
returns the cached value.
---
mumi/cache.scm | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mumi/cache.scm b/mumi/cache.scm
index 13b21f9..98a7856 100644
--- a/mumi/cache.scm
+++ b/mumi/cache.scm
@@ -1,5 +1,6 @@
;;; mumi -- Mediocre, uh, mail interface
;;; Copyright © 2020 Ricardo Wurmus <rekado@elephly.net>
+;;; Copyright © 2022 Arun Isaac <arunisaac@systemreboot.net>
;;;
;;; This program is free software: you can redistribute it and/or
;;; modify it under the terms of the GNU Affero General Public License
@@ -34,7 +35,7 @@ expired or return #F."
(define* (cache! key value
#:optional (ttl (%config 'cache-ttl)))
"Store VALUE for the given KEY and mark it to expire after TTL
-seconds."
+seconds. Return VALUE."
(let ((t (current-time)))
(hash-set! %cache key `(#:expires ,(+ t ttl) #:value ,value))
value))
--
2.38.1
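
The documented return value matters because it lets a lookup and a store be chained in a single `or` expression. Below is a minimal, self-contained sketch of that idiom. The hash table, `default-ttl` and `lookup` are hypothetical stand-ins rather than mumi's actual definitions, and `cached?` is assumed to return the stored value or #f.

```scheme
;; Hypothetical stand-ins for mumi's cache table and configured TTL.
(define %cache (make-hash-table))
(define default-ttl 60)

(define* (cache! key value #:optional (ttl default-ttl))
  "Store VALUE under KEY for TTL seconds.  Return VALUE."
  (hash-set! %cache key (cons (+ (current-time) ttl) value))
  value)

(define (cached? key)
  "Return the value stored under KEY, or #f if it is absent or expired."
  (let ((entry (hash-ref %cache key)))
    (and entry
         (< (current-time) (car entry))
         (cdr entry))))

(define (lookup key compute)
  "Return the cached value for KEY, calling COMPUTE only on a miss."
  (or (cached? key)
      ;; Because cache! returns VALUE, the freshly computed result
      ;; becomes the value of the whole expression.
      (cache! key (compute key))))
```

One caveat of this idiom: a stored value of #f is indistinguishable from a miss, so it is recomputed on every call.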
Arun Isaac wrote on 29 Dec 2022 21:24
[PATCH 7/7] xapian: Preserve order of search results.
(name . Arun Isaac)(address . arunisaac@systemreboot.net)
20221229202400.28565-7-arunisaac@systemreboot.net
Xapian orders search results by relevance. Preserve this order.

* mumi/xapian.scm (search): Reverse search results after consing to
preserve the original order.
* mumi/messages.scm (status-with-cache): Do not sort bugs by their bug
number. Preserve the order of bugs passed to this function.
---
mumi/messages.scm | 13 ++++---------
mumi/xapian.scm | 21 +++++++++++----------
2 files changed, 15 insertions(+), 19 deletions(-)

diff --git a/mumi/messages.scm b/mumi/messages.scm
index b3ae962..fd52571 100644
--- a/mumi/messages.scm
+++ b/mumi/messages.scm
@@ -64,15 +64,10 @@
(define (status-with-cache ids)
"Invoke GET-STATUS, but only on those IDS that have not been cached
yet. Return new results alongside cached results."
- (let* ((cached (filter-map cached? ids))
- (uncached-ids (lset-difference eq?
- ids
- (map bug-num cached)))
- (new (filter-map bug-status uncached-ids )))
- ;; Cache new things
- (map (lambda (bug) (cache! (bug-num bug) bug)) new)
- ;; Return everything from cache
- (sort (append cached new) (lambda (a b) (< (bug-num a) (bug-num b))))))
+ (map (lambda (id)
+ (or (cached? id)
+ (cache! id (bug-status id))))
+ ids))
(define (extract-name address)
(or (assoc-ref address 'name)
diff --git a/mumi/xapian.scm b/mumi/xapian.scm
index ae01acc..7ca5bb8 100644
--- a/mumi/xapian.scm
+++ b/mumi/xapian.scm
@@ -339,16 +339,17 @@ intact."
;; Collapse on mergedwith value
(Enquire-set-collapse-key enq 2 1)
;; Fold over the results, return bug id.
- (mset-fold (lambda (item acc)
- (cons
- (document-data (mset-item-document item))
- acc))
- '()
- ;; Get an Enquire object from the database with the
- ;; search results. Then, extract the MSet from the
- ;; Enquire object.
- (enquire-mset enq
- #:maximum-items pagesize))))))
+ (reverse
+ (mset-fold (lambda (item acc)
+ (cons
+ (document-data (mset-item-document item))
+ acc))
+ '()
+ ;; Get an Enquire object from the database with the
+ ;; search results. Then, extract the MSet from the
+ ;; Enquire object.
+ (enquire-mset enq
+ #:maximum-items pagesize)))))))
(define* (index! #:key full?)
"Index all Debbugs log files corresponding to the selected
--
2.38.1
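
The reason for the added `reverse` can be seen with an ordinary list fold. This sketch uses SRFI-1 `fold` on a made-up list of ids and assumes that `mset-fold` visits items in rank order in the same way. `ranked-ids` and `collect` are illustrative names only.

```scheme
(use-modules (srfi srfi-1))

;; Stand-in for an MSet: ids in the order the search engine ranked them.
(define ranked-ids '("60410" "49115" "41906"))

(define (collect items)
  "Accumulate ITEMS with cons; the last item visited ends up first."
  (fold (lambda (item acc) (cons item acc)) '() items))

(collect ranked-ids)            ; => ("41906" "49115" "60410")
(reverse (collect ranked-ids))  ; => ("60410" "49115" "41906")

;; map returns its results in the order of its input, so a later
;; conversion step keeps the ranking intact.
(map string->number (reverse (collect ranked-ids)))
;; => (60410 49115 41906)
```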
Ricardo Wurmus wrote on 31 Dec 2022 19:09
Re: [PATCH 1/7] xapian: Index several terms as boolean and without positions.
(name . Arun Isaac)(address . arunisaac@systemreboot.net)(address . 60410@debbugs.gnu.org)
87v8lr5tqj.fsf@elephly.net
Hi Arun,

thank you for your patches! I applied them all and then ran

./pre-inst-env scripts/mumi fetch

but got this error:

worker error: (keyword-argument-error #f Unrecognized keyword () (#:positions?))

> + ;; searching separate fields as in subject:foo, from:bar,
> + ;; etc. We do not keep track of the within document
> + ;; frequencies of terms that will be used for boolean
> + ;; filtering. We do not generate position information for
> + ;; fields that will not need phrase searching or NEAR
> + ;; searches.
> + (index-text! term-generator
> + bugid
> + #:prefix "B"
> + #:wdf-increment 0
> + #:positions? #f)

I made sure to update to guile-xapian 0.2.1, the latest commit, as far
as I can tell.

--
Ricardo
Arun Isaac wrote on 1 Jan 2023 00:02
(name . Ricardo Wurmus)(address . rekado@elephly.net)(address . 60410@debbugs.gnu.org)
871qofnpmt.fsf@systemreboot.net
Hi Ricardo,

> worker error: (keyword-argument-error #f Unrecognized keyword ()
> (#:positions?))

Oops! It looks like I have been working with some unpublished
guile-xapian code. I have pushed those guile-xapian commits, released
guile-xapian 0.3.0 and updated the Guix guile-xapian package. Hopefully,
it should work now. Could you try again?

Thanks,
Arun
Ricardo Wurmus wrote on 1 Jan 2023 13:14
(name . Arun Isaac)(address . arunisaac@systemreboot.net)(address . 60410-done@debbugs.gnu.org)
87mt725u51.fsf@elephly.net
Hi Arun,

>> worker error: (keyword-argument-error #f Unrecognized keyword ()
>> (#:positions?))
>
> Oops! It looks like I have been working with some unpublished
> guile-xapian code. I have pushed those guile-xapian commits, released
> guile-xapian 0.3.0 and updated the Guix guile-xapian package. Hopefully,
> it should work now. Could you try again?

Thank you, this works!
I applied the changes.

--
Ricardo
Closed
Ricardo Wurmus wrote on 2 Jan 2023 00:19
Re: [PATCH 2/7] xapian: Declare some prefixes as boolean.
(name . Arun Isaac)(address . arunisaac@systemreboot.net)(address . 60410@debbugs.gnu.org)
87edsd6du8.fsf@elephly.net
Hi Arun,

> Some prefixes will only ever be used to filter the rest of the query
> and not for matching approximately using relevance weighting
> schemes. Such prefixes should be indexed as boolean prefixes.
[…]
> @@ -324,14 +328,14 @@ intact."
> ;; prefixes for field search.
> (query (parse-query* querystring*
> #:stemmer (make-stem "en")
> - #:prefixes '(("submitter" . "A")
> - ("author" . "XA")
> - ("subject" . "S")
> - ("owner" . "XO")
> - ("severity" . "XS")
> - ("tag" . "XT")
> - ("status" . "XSTATUS")
> - ("msgid" . "XU"))))
> + #:prefixes '(("subject" . "S"))
> + #:boolean-prefixes '(("author" . "XA")
> + ("msgid" . "XU")
> + ("owner" . "XO")
> + ("severity" . "XS")
> + ("status" . "XSTATUS")
> + ("submitter" . "A")
> + ("tag" . "XT"))))

This breaks two tests that check searching for submitters by partial
name, e.g. “Ricardo” instead of my full name and email address.

I think we should move submitter, author, and owner back to the list of
regular prefixes.

--
Ricardo
Arun Isaac wrote on 2 Jan 2023 18:01
(name . Ricardo Wurmus)(address . rekado@elephly.net)(address . 60410@debbugs.gnu.org)
87mt70na53.fsf@systemreboot.net
Hi Ricardo,

> I think we should move submitter, author, and owner back to the list of
> regular prefixes.

You're right. Sorry, I missed that.

Regards,
Arun
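
For reference, the split agreed on above can be sketched by partitioning the field-to-prefix mappings from patch 2/7. The thread does not show the code that was finally committed, so this is only an illustration: `all-prefixes` and `free-text-fields` are names invented here, and the resulting lists would be passed as `#:prefixes` and `#:boolean-prefixes` respectively.

```scheme
(use-modules (srfi srfi-1))

;; Field-to-prefix mappings as they appear in patch 2/7.
(define all-prefixes
  '(("author" . "XA")
    ("msgid" . "XU")
    ("owner" . "XO")
    ("severity" . "XS")
    ("status" . "XSTATUS")
    ("subject" . "S")
    ("submitter" . "A")
    ("tag" . "XT")))

;; Fields holding free text (names, addresses, subjects) that users
;; expect to match by a single word.
(define free-text-fields
  '("author" "owner" "subject" "submitter"))

(define (free-text? pair)
  "Return true if PAIR maps one of the FREE-TEXT-FIELDS."
  (member (car pair) free-text-fields))

;; Regular prefixes: matched word by word and weighted by relevance.
(define prefixes (filter free-text? all-prefixes))

;; Boolean prefixes: exact filters that do not affect ranking.
(define boolean-prefixes (remove free-text? all-prefixes))

(map car prefixes)          ; => ("author" "owner" "subject" "submitter")
(map car boolean-prefixes)  ; => ("msgid" "severity" "status" "tag")
```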
Felix Lechner wrote on 8 Feb 18:25 +0100
(no subject)
(address . control@debbugs.gnu.org)
87plx66f7u.fsf@lease-up.com
unarchive 49115
reassign 49115 mumi
archive 49115

unarchive 41906
reassign 41906 mumi
archive 41906

unarchive 60410
reassign 60410 mumi
archive 60410

unarchive 63215
reassign 63215 mumi
archive 63215

unarchive 41098
reassign 41098 mumi
archive 41098

thanks
Felix Lechner wrote on 23 Feb 14:23 +0100
(address . control@debbugs.gnu.org)
875xyf1fhb.fsf@lease-up.com
unarchive 68680
reassign 68680 mumi
archive 68680

unarchive 63802
reassign 63802 mumi
archive 63802

unarchive 63215
reassign 63215 mumi
archive 63215

unarchive 61645
reassign 61645 mumi
archive 61645

unarchive 60410
reassign 60410 mumi
archive 60410

unarchive 60292
reassign 60292 mumi
archive 60292

unarchive 58573
reassign 58573 mumi
archive 58573

unarchive 54024
reassign 54024 mumi
archive 54024

unarchive 49115
reassign 49115 mumi
archive 49115

unarchive 48160
reassign 48160 mumi
archive 48160

unarchive 47739
reassign 47739 mumi
archive 47739

unarchive 47520
reassign 47520 mumi
archive 47520

unarchive 47121
reassign 47121 mumi
archive 47121

unarchive 45015
reassign 45015 mumi
archive 45015

unarchive 43661
reassign 43661 mumi
archive 43661

unarchive 43166
reassign 43166 mumi
archive 43166

unarchive 41906
reassign 41906 mumi
archive 41906

unarchive 41098
reassign 41098 mumi
archive 41098

unarchive 39924
reassign 39924 mumi
archive 39924

thanks