IRC channel log search results are not chronological for recent logs

  • Open
One participant
  • Hugo Buddelmeijer
Submitted by
Hugo Buddelmeijer
Hugo Buddelmeijer wrote on 24 Mar 2023 16:38
Hi all, Ricardo,

Searching through the IRC channel logs on
shows a list of matches sorted by date in descending order, except for
matches from this February or March: those are at the bottom, often beyond
the 100 match limit.

For example, 'vdirsyncer' results in 31 matches (at the time of writing):

> 2023-01-10 [15:09:09] <elb> this machine has installed emacs, emacs-guix,
> 2023-01-10 [15:12:26] <nckx> For context, ‘guix size emacs emacs-guix ...
> 2022-01-18 [04:43:24] <lfam> At least, vdirsyncer builds when you simply
> 2022-01-17 [16:29:41] <johnhamelink> Hey there :) I'm currently an Arch
> 2020-11-30 [23:29:57] <lfam> jonsger: No, I'm not using radicale. It was
> 2020-11-30 [23:31:08] <lfam> sneek: later tell jonsger: No, I'm not using
> 2020-04-29 [09:34:26] <efraim> it also came up in vdirsyncer on ...
> ...
> 2016-01-24 [22:45:28] <lfam> I don't even think you can run vdirsyncer ...
> 2015-12-10 [00:10:51] <lfam> All that and vdirsyncer doesn't even build
> 2015-12-09 [22:39:51] <lfam>
> 2023-02-25 [03:03:54] <fruit-loops> "#61557 - vdirsyncer fails to verify
> 2023-02-25 [03:08:01] <fruit-loops> "vdirsyncer fails to verify ...
> 2023-02-25 [03:09:41] <elb> nckx: hmmm when I searched mobile ...
> 2023-02-25 [03:10:49] <elb> ok yeah, it's just not tagged or ...
> 2023-02-25 [03:36:53] <fruit-loops> "vdirsyncer fails to verify ...
> 2023-02-25 [03:38:16] <elb> lechner: no, against vdirsyncer
> 2023-02-25 [03:46:06] <fruit-loops> "vdirsyncer fails to verify

All hits from February and March of this year are at the bottom of the
list, while the rest are in descending chronological order. (The
'vdirsyncer' example was chosen because it occurs regularly, but not too
often.) The list cuts off after about 100 matches, so it is impossible to
find recent matches for more popular terms. The most recent chats are
usually the more interesting ones, for example when debugging an issue that
occurred recently. E.g. a search for Python shows nothing beyond 2023-01-31:

So my question is, can we improve the sort order of the IRC logs?

I did a bit of investigating myself and discovered the maintenance
repository with the hydra directory. There is so much to learn from that

However, I could not really figure out what could be the problem. My
hypothesis, which is more like a wild guess:
- It seems the sorting is done implicitly by xapian, which will just return
the matching lines in whatever order they were inserted.
- Something went wrong at the transition between January 31st and February
1st that required manual cleanup. Evidence: there are logs with a tilde in
the filename, 2023-01-31.log~ and 2023-02-01.log~.
- The database was emptied and repopulated to prevent entries from early in
the morning of 2023-02-01 from being counted as beyond-midnight on 2023-01-31.
This put all the lines in the correct order, hence correct sorting up till
- Subsequent lines are added with the mcron job and are therefore at the
end of the database, and thus at the end of the result set (beyond the
limit of 100).
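
To make the guess above concrete, here is a toy model in Python. It is not xapian and not the goggles code: the index is a plain list, `search` and `search_sorted` are hypothetical helpers, and the bulk rebuild is assumed to have stored the older lines newest-first, which is only what the observed output suggests.

```python
from datetime import date, timedelta

LIMIT = 100  # result cap mentioned in this report


def search(index, term, limit=LIMIT):
    """Toy search: return matches in insertion order, cut off at `limit`."""
    hits = [entry for entry in index if term in entry[1]]
    return hits[:limit]


def search_sorted(index, term, limit=LIMIT):
    """Same search, but sort all matches newest-first before cutting off."""
    hits = [entry for entry in index if term in entry[1]]
    hits.sort(key=lambda entry: entry[0], reverse=True)
    return hits[:limit]


# Assumed bulk rebuild: 150 matching lines up to 2023-01-31, newest first.
end = date(2023, 1, 31)
bulk = [(end - timedelta(days=i), "python talk") for i in range(150)]

# Lines added later (standing in for the mcron job) land at the end.
appended = [(date(2023, 2, 25), "python talk"),
            (date(2023, 3, 24), "python talk")]

index = bulk + appended

unsorted_hits = search(index, "python")
sorted_hits = search_sorted(index, "python")

print("newest hit without sorting:", max(d for d, _ in unsorted_hits))
print("newest hit with sorting:   ", sorted_hits[0][0])
```

In this model the appended lines never make it into the first 100 results unless the matches are sorted before the cut-off, which is the behaviour described above.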

Side note: the ~ files cause some lines to show up three times, e.g.
> 2023-01-31 [04:52:19] <jgart[m]> dcunit3d: here's another great config:
> 2023-02-01.log~ [04:52:19] <jgart[m]> dcunit3d: here's another great
> 2023-01-31.log~ [04:52:19] <jgart[m]> dcunit3d: here's another great
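
One possible way to avoid the duplicates would be to skip backup files when collecting logs to index. The sketch below is Python rather than the Guile code in the maintenance repository, and `files_to_index` is a hypothetical helper, not an existing function.

```python
def files_to_index(file_names):
    """Keep only names ending in '.log'; this drops '~' backup files."""
    return sorted(name for name in file_names if name.endswith(".log"))


names = [
    "2023-01-31.log",
    "2023-01-31.log~",
    "2023-02-01.log",
    "2023-02-01.log~",
]
kept = files_to_index(names)
print(kept)
```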

Side side note: those ~ entries cannot be clicked on, because (define stamp
(basename file-name ".log")) does not strip anything from a name ending in
".log~", so goggles treats the ".log~" as part of the date.
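
The suffix behaviour can be illustrated with a small Python stand-in. The `basename` function below only approximates the Scheme procedure (it removes the suffix solely when the name ends with it), and `stamp` is a hypothetical stricter variant that keeps just a leading YYYY-MM-DD; neither is taken from goggles.

```python
import re


def basename(path, suffix=""):
    """Strip directories, then strip `suffix` only if the name ends with it."""
    name = path.rsplit("/", 1)[-1]
    if suffix and name.endswith(suffix):
        name = name[: -len(suffix)]
    return name


DATE = re.compile(r"^\d{4}-\d{2}-\d{2}")


def stamp(path):
    """Hypothetical variant: return only the leading date, or None."""
    match = DATE.match(basename(path))
    return match.group(0) if match else None


print(basename("logs/2023-01-31.log", ".log"))
print(basename("logs/2023-01-31.log~", ".log"))
print(stamp("logs/2023-01-31.log~"))
```

The second call leaves the name untouched because "2023-01-31.log~" does not end in ".log", which matches the unclickable entries shown above.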

What I don't understand is why the matches are not sorted correctly. It
seems to me that (Enquire-set-sort-by-value enq 0 #f) would sort by the
value of slot 0, which seems to be the date-stamp. But I don't really have
a good mental model of how xapian works or what value slots actually are.
(Maybe value slots start at 1 and selecting 0 means do not use any of them?)
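
For what it is worth, the xapian documentation describes value slots as numbered from 0, so slot 0 should be an ordinary slot. The following Python sketch is only a mental model of sorting by a slot, not xapian itself: documents are plain dicts, `sort_by_value` is a hypothetical helper, and treating a missing value as an empty string is an assumption of the sketch.

```python
def sort_by_value(docs, slot, reverse):
    """Order documents by the string stored in `slot`, compared as text."""
    # Assumption of this sketch: a missing value counts as an empty string.
    return sorted(docs, key=lambda doc: doc["values"].get(slot, ""),
                  reverse=reverse)


docs = [
    {"text": "old", "values": {0: "2015-12-09"}},
    {"text": "recent", "values": {0: "2023-02-25"}},
    {"text": "middle", "values": {0: "2022-01-18"}},
]

newest_first = [d["text"] for d in sort_by_value(docs, 0, reverse=True)]
oldest_first = [d["text"] for d in sort_by_value(docs, 0, reverse=False)]
print(newest_first)
print(oldest_first)
```

Because YYYY-MM-DD stamps compare the same way as text and as dates, a sort on such a slot would be expected to put the 2023-02-25 lines first in descending order.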

I tried to compare the results of #guix with those of other channels, but
it seems that the logs of most other channels are either not indexed at
all, or inconsistently. For example, searching for ACTION (which is a "/me"
command it seems) in #spritely shows only 11 matches spread over 5 days,
while it is a very common occurrence:

Attachment: file