# The ones that got away

Datetime: 2016-08-23 02:11:54 · Topic: Solr, Lucene

Two and a half ideas for improving Lucene/Solr performance that did not work out.

## Track the result set bits

At the heart of Lucene (and consequently also Solr and ElasticSearch), there is a great amount of doc ID set handling. Abstractly speaking, a doc ID set is a collection of internal document IDs. If there are 10 documents in the index and a query matches the first and the fourth, the set would be [0, 3] as we count from 0. Every time there is filtering or faceting, bit sets are used.

SortedIntDocSet in Solr is one example of a document ID set. The IDs are stored after each other in an int[]. This takes up 32 bits per document in the set and is used for small result sets only.
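As a minimal illustration of the principle (a hypothetical sketch, not Solr's actual SortedIntDocSet), a sorted int[] doc set supports intersection with a linear merge, so the cost is proportional to the set sizes rather than to the index size:

```java
import java.util.Arrays;

public class SortedDocSetSketch {
    // Intersect two sorted doc ID arrays in O(|a| + |b|).
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[n++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] docs = intersect(new int[]{0, 3, 7}, new int[]{3, 5, 7, 9});
        System.out.println(Arrays.toString(docs)); // [3, 7]
    }
}
```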

FixedBitSet in Lucene is another example. It reserves 1 bit for each document in the full corpus. If the bit for a document is set, it means that the document is part of the set. All the bits are stored in a long[], which makes most operations quite fast: To calculate the union, the relevant code is

```
int pos = Math.min(numWords, otherNumWords);
while (--pos >= 0) {
  thisArr[pos] |= otherArr[pos];
}
```

FixedBitSet is very widely used. It is referenced in the Lucene/Solr code 451 times! As opposed to SortedIntDocSet et al, it can contain result sets of any size, and reading or changing a specific bit is O(1). It does have a weakness, though: sparse sets.
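The O(1) access comes from simple index arithmetic over the long[]: one shift selects the word, one mask selects the bit. A minimal sketch of the idea (hypothetical helpers, not Lucene's code):

```java
public class BitAccessSketch {
    // doc >> 6 selects the word (64 bits per long), doc & 63 selects
    // the bit inside it, so reads and writes touch exactly one slot.
    static void set(long[] bits, int doc) {
        bits[doc >> 6] |= 1L << (doc & 63);
    }

    static boolean get(long[] bits, int doc) {
        return (bits[doc >> 6] & (1L << (doc & 63))) != 0;
    }

    public static void main(String[] args) {
        long[] bits = new long[4]; // room for 256 documents
        set(bits, 130);
        System.out.println(get(bits, 130) + " " + get(bits, 131)); // true false
    }
}
```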

Nearly all operations on FixedBitSet take time linear in the total number of documents in the index, not in the number of documents in the set. Calculating the union of [5, 87] and [1, 5, 1056978], when there are 20 million documents in the index, requires stepping through all the 300,000+ longs (20M / 64) in each array to check for set bits.

Idea: Tracking changes to the structure works really well for sparse faceting, so why not for FixedBitSets? Add a new tracking bitset where each bit corresponds to a single long in the underlying FixedBitSet. If the tracking bit is set, the FixedBitSet long contains at least one set bit. To locate set bits, iterate the tracking bitset and look for longs != 0. This reduces the number of unnecessary lookups by a factor of 64. Of course, a third bitset can be added on top of the two, reducing unnecessary lookups by a total factor of 4096.
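A minimal two-level sketch of the idea (hypothetical and heavily simplified, not the linked implementation):

```java
public class TrackedBitsetSketch {
    final long[] bits;    // 1 bit per document
    final long[] tracker; // 1 bit per long in bits

    TrackedBitsetSketch(int numDocs) {
        bits = new long[(numDocs + 63) >> 6];
        tracker = new long[(bits.length + 63) >> 6];
    }

    void set(int doc) {
        int word = doc >> 6;
        bits[word] |= 1L << (doc & 63);
        tracker[word >> 6] |= 1L << (word & 63); // mark word as non-empty
    }

    // Count set bits, visiting only words the tracker marks as non-empty.
    // For sparse sets this skips nearly all of the underlying array.
    long cardinality() {
        long sum = 0;
        for (int t = 0; t < tracker.length; t++) {
            long marks = tracker[t];
            while (marks != 0) {
                int word = (t << 6) + Long.numberOfTrailingZeros(marks);
                sum += Long.bitCount(bits[word]);
                marks &= marks - 1; // clear lowest set mark
            }
        }
        return sum;
    }
}
```

Note the catch already visible here: set and cardinality both pay for maintaining and scanning the extra structure, which is where the overhead for non-sparse sets comes from.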

An implementation is available at GitHub with a performance test. Instructions for running it are in the TRACKED_README.txt file. Some reduced sample runs, where open is the vanilla FixedBitSet and tracked is the tracked version:

```
2K / 20M bits set (0.01% filled)
test         impl        ms
union        open      0.42
union        tracked   0.16
cardinality  open      0.34
cardinality  tracked   0.09
icount       open      0.59
icount       tracked   0.57
nextSetBit   open      0.31
nextSetBit   tracked   0.12
iterator     open      0.32
iterator     tracked   0.14

20K / 20M bits set (0.1% filled)
test         impl        ms
union        open      0.81
union        tracked   1.08
cardinality  open      0.34
cardinality  tracked   0.49
icount       open      0.58
icount       tracked   2.18
nextSetBit   open      0.58
nextSetBit   tracked   0.79
iterator     open      0.68
iterator     tracked   0.80

200K / 20M bits set (1.0% filled)
test         impl        ms
union        open      3.51
union        tracked   4.93
cardinality  open      0.43
cardinality  tracked   1.48
icount       open      0.59
icount       tracked  10.26
nextSetBit   open      3.21
nextSetBit   tracked   4.24
iterator     open      3.98
iterator     tracked   2.54
```

Technically the implementation delivers: sparse result sets are processed faster. Unfortunately, most operations require the sets to be less than 0.1% filled to come out ahead and, worse, the overhead for non-sparse bitsets is prohibitive. There might be some special places where the tracked bitset can be used, but it is not viable as a general replacement for FixedBitSet.

## Better scaling of top-X searches

This idea got pretty far, with the LUCENE-6828 JIRA issue and an analysis of the speed-up.

Requesting the top-X relevance ranked results from Lucene/Solr is pretty much the core use case. The underlying implementation uses a binary heap based priority queue with Java Objects. Due to the cache-adverse nature of binary heaps and the large overhead of tiny Objects, this does not scale well: Do not ask for the top-1M hits, and really do not ask for the top-100M.
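The standard technique is a bounded min-heap of size X, where the root is the weakest hit kept so far and gets evicted whenever a better hit arrives. A sketch using java.util.PriorityQueue (illustrative only — Lucene's HitQueue keeps Object entries in its own heap, which is exactly the overhead discussed here):

```java
import java.util.PriorityQueue;

public class TopXSketch {
    // Keep the X highest scores seen in a stream of hits.
    static float[] topX(float[] scores, int x) {
        PriorityQueue<Float> pq = new PriorityQueue<>(); // min-heap
        for (float s : scores) {
            if (pq.size() < x) pq.add(s);
            else if (s > pq.peek()) { pq.poll(); pq.add(s); } // evict weakest
        }
        float[] out = new float[pq.size()];
        for (int i = out.length - 1; i >= 0; i--) out[i] = pq.poll(); // descending
        return out;
    }
}
```

Every insertion that beats the root costs O(log X) heap operations on boxed Float Objects scattered across the Java heap, which is why large X hurts both the CPU cache and the garbage collector.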

Using an array based bit packed heap implementation yields superior space- and speed-performance for top-100K+ requests (3x+ for both memory and speed, plus a lot less GC activity; see the previously linked analysis for details), with the advantages growing with the number of concurrent requests. No win for small top-X requests, but no big loss either.
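The bit packing can be sketched as follows (an illustration of the general trick, not the patch's actual encoding): since relevance scores are non-negative floats, their raw IEEE 754 bit patterns sort in the same order as their values, so a score and a doc ID fit into one primitive long that compares correctly and lives in a plain long[] instead of as an Object:

```java
public class PackedEntrySketch {
    // Score bits in the upper half, doc ID in the lower half.
    static long pack(float score, int docId) {
        return ((long) Float.floatToIntBits(score) << 32) | (docId & 0xFFFFFFFFL);
    }

    static float score(long packed) {
        return Float.intBitsToFloat((int) (packed >>> 32));
    }

    static int docId(long packed) {
        return (int) packed;
    }
}
```

A heap over such longs needs no per-entry allocation and no pointer chasing, which is where the memory and GC wins come from.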

So success? Sorta, kinda. Turns out that there are generally two scenarios:

1. Request a relatively modest top-X, such as top-10 or top-1000.
2. Export the *full* result set, which can be millions or billions.

As the array-based heap has no advantage over the vanilla Solr heap for small result sets, and as full exports are much better handled by Solr streaming export, there is near-zero need for the middle ground that the packed heap provides. It is a solution looking for a problem.

## Cache-friendly priority queue

Since the vanilla Solr binary heap is cache-adverse, why not try to fix that problem? Yes, totally ignoring the “there is no problem as nobody requests large top-X results”-part here.

The idea is fairly simple: Binary heaps work well when the heap can be fully contained in the CPU cache, so making a heap of heaps means that most operations will be done on data that are already cached. There is a very nice write-up on this at Playful Programming: Cache optimizing a priority queue.
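The heaps-within-heaps layout itself is not reproduced here, but the cache argument can be illustrated with a simpler relative, a 4-ary implicit heap: each node's four children sit in adjacent array slots, so a sift-down step usually compares values from a single cache line (a sketch under that simplification, not the bheap4 code):

```java
import java.util.Arrays;

public class FourAryHeapSketch {
    private long[] a = new long[4];
    private int size;

    void insert(long v) {
        if (size == a.length) a = Arrays.copyOf(a, size * 2);
        int i = size++;
        while (i > 0 && a[(i - 1) >> 2] > v) { // sift up past larger parents
            a[i] = a[(i - 1) >> 2];
            i = (i - 1) >> 2;
        }
        a[i] = v;
    }

    long popMin() {
        long min = a[0];
        long v = a[--size]; // last element sifts down from the root
        int i = 0;
        while (true) {
            int first = (i << 2) + 1; // the four children are contiguous
            if (first >= size) break;
            int smallest = first;
            int last = Math.min(first + 4, size);
            for (int c = first + 1; c < last; c++)
                if (a[c] < a[smallest]) smallest = c;
            if (a[smallest] >= v) break;
            a[i] = a[smallest];
            i = smallest;
        }
        a[i] = v;
        return min;
    }
}
```

The trade-off is visible even in this toy: child and parent indices need shifts instead of the binary heap's single shift, and the real bheap4 layout piles several more levels of such arithmetic on top.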

This was implemented in Java at the hitdocs-trunk branch. Performance tests are in org.apache.lucene.search.TestHitQueue. Measurements below are from testPQPerformanceAtomicHeapsLimited, where sentinel is vanilla Solr, packed is the array-based heap described above and bheap4 is the best performing heaps-within-heaps implementation. Lower percentages are best and the winning implementation for each test is marked with ■.

```
Threads       Top-X      Hits      Sentinel         Packed         BHeap4
1     1000000    100000   118.49 100%     11.06   9%■    20.47  17%
1     1000000   1000000  1356.56 100%    133.51  10%■   270.82  20%
1     1000000   2000000  2216.50 100%    566.67  26%    472.43  21%■
1     1000000   5000000  3406.35 100%    827.23  24%    771.33  23%■
1     1000000  10000000  4183.84 100%   1054.94  25%   1012.78  24%■
4     1000000    100000   246.00 100%     22.38   9%■    41.73  17%
4     1000000   1000000  2154.64 100%    232.00  11%■   504.80  23%
4     1000000   2000000  3356.96 100%   1057.01  31%    772.90  23%■
4     1000000   5000000  5022.50 100%   1484.44  30%   1227.93  24%■
4     1000000  10000000  6503.33 100%   1910.10  29%   1891.28  29%■
16     1000000    100000  1310.95 100%     68.29   5%■   119.87   9%
16     1000000   1000000  6731.49 100%    605.90   9%■  1405.25  21%
16     1000000   2000000 11882.59 100%   3713.82  31%   2744.36  23%■
16     1000000   5000000 19589.75 100%   5389.25  28%   4447.21  23%■
16     1000000  10000000 25522.07 100%   6921.96  27%   4843.61  19%■
```

Note: The cases where hits <= top-X (100K and 1M) are special, as packed sorts result sets with hits <= top-X using the more cache-friendly merge sort. This explains why packed is twice as fast as bheap4 for these cases. A similar optimization could be implemented for bheap4.

Looking through rose-tinted glasses, bheap4 is a bit better than packed (and they are both vastly superior to sentinel). But the implementation of bheap4 is bordering on voodoo, with the multi-level offset calculations and all the bit-tricks needed to avoid doing expensive multiplications and modulo calculations.

Hopefully the speed of bheap can be improved further – the Playful Programming author Björn Fahller certainly got fine results with his C++ implementation. But until that happens, its minuscule gains are not worth the added complexity.