Commit ebb3ca2

index: use a random sample of ngrams when limiting (#797)
The first bit of data I am getting back indicates this strategy of limiting the number of ngrams we look up isn't working. I am still experimenting with different limits, but in the meantime it is easy to implement a strategy which picks a random subset. This is so that the first N ngrams of a query aren't the only ones being consulted.

Test Plan: ran all tests with the envvar set to 2. I expected tests that assert on stats to fail, but everything else to pass. This was the case.

SRC_EXPERIMENT_ITERATE_NGRAM_LOOKUP_LIMIT=2 go test ./...
1 parent 04e7057 commit ebb3ca2

1 file changed: +19 -1 lines changed
bits.go

@@ -18,6 +18,8 @@ import (
 	"cmp"
 	"encoding/binary"
 	"math"
+	"math/rand/v2"
+	"slices"
 	"sort"
 	"unicode"
 	"unicode/utf8"
@@ -136,7 +138,7 @@ func splitNGramsLimit(str []byte, maxNgrams int) []runeNgramOff {
 	result := make([]runeNgramOff, 0, len(str))
 	var i uint32
 
-	for len(str) > 0 && len(result) < maxNgrams {
+	for len(str) > 0 {
 		r, sz := utf8.DecodeRune(str)
 		str = str[sz:]
 		runeGram[0] = runeGram[1]
@@ -157,6 +159,22 @@ func splitNGramsLimit(str []byte, maxNgrams int) []runeNgramOff {
 			index: len(result),
 		})
 	}
+
+	// We return a random subset of size maxNgrams. This is to prevent the start
+	// of the string biasing ngram selection.
+	if maxNgrams < len(result) {
+		// Deterministic seed for tests. Additionally makes comparing repeated
+		// queries performance easier.
+		r := rand.New(rand.NewPCG(uint64(maxNgrams), 0))
+
+		// Pick random subset via a shuffle
+		r.Shuffle(maxNgrams, func(i, j int) { result[i], result[j] = result[j], result[i] })
+		result = result[:maxNgrams]
+
+		// Caller expects ngrams in order of appearance.
+		slices.SortFunc(result, runeNgramOff.Compare)
+	}
+
 	return result
 }
 