-
Notifications
You must be signed in to change notification settings - Fork 1.3k
☂️ Search: simpler query language inspired by "keyword search" #58815
Comments
DONE Test that Zoekt can handle AND queries with many clausesTL;DR from random real world queries I tried I got acceptable performance. For s2 I normally got around 1s when ANDing, but with literal it would be much faster (0.2s). This is slower, but acceptable. I got started on writing something which would try out a bunch of queries and collect data and graph it. But I ended up just writing the part which sampled out real strings from a codebase. When trying out a few samples at different token counts I got pretty consistent performance. In particular I did a bunch of testing with 2, 10, 15 and 20 token lengths. My mental model of why this worked out fine is the larger the number of tokens, the less documents to consider. So even though it is querying the index more, it balances out so we have pretty consistent performance. What I did notice is for a few of the large AND queries the ranking was terrible. I think it will be quite important to fix that (ie promote atom counts on same line, etc). In particular if you have an atom that is a common word, you quickly run into limits since each atom matching counts towards your limit. So I often ran into a top ranked document which was just a large file that happened to contain all the terms as substrings. In the interest of someone maybe wanting to take this further see below for the code which outputs strings from the sourcegraph codebase. #!/usr/bin/env python3
import collections
import subprocess
import random
def gen_corpus():
"return a corpus of strings from go code"
proc = subprocess.run(
args=('rg', '-t', 'go', '-o', '--no-filename', '"[^"]+?"'),
cwd='/Users/keegan/src/github.com/sourcegraph/sourcegraph/',
capture_output=True,
check=True,
)
for line in proc.stdout.split(b'\n'):
yield(line.decode('utf-8'))
literals = collections.defaultdict(set)
for s in gen_corpus():
if ' ' not in s:
continue
if s.startswith('" ') or s.endswith(' "'):
continue
literals[len(s.split())].add(s)
def summarize(upto=20):
for count, ss in sorted(literals.items()):
if count > upto:
break
s = repr(ss)
if len(s) > 80:
s = s[:80]
print(count, len(ss), s)
def sample(tokens, count=10):
# remove items which contain keywords
keywords = set(['and', 'not', 'or'])
population = [s for s in literals[tokens] if keywords.isdisjoint(s.lower().split())]
count = min(count, len(population))
for s in random.sample(population, count):
try:
print(eval(s))
print(' AND '.join(eval(s).split()))
print()
except SyntaxError:
pass
#summarize()
sample(20) |
Relates to #58815 With this change, a quoted pattern, like in `"foo bar"`, is interpreted literally IE spaces are interpreted as spaces instead of as logical AND. Quotes, which are part of the pattern have to be escaped. Example: searching for the literal `foo "bar"`, where "bar" is surrounded by quotes, the query is `"foo \"bar\""` (or equivalently `'foo "bar"'`) Note: This only applies for our keyword search prototype. Test plan: - updated unit test - manual testing - tried out various combinations
I looked into our metrics collection for searches and think we are already tracking the right metrics:
All logged events contain the feature flags that were enabled, so we can compare metrics from before and after the flag is enabled for an instance. We need to be aware of confounding factors, like the fact that they also upgraded to a new version which could affect "search success". One complexity is that users will be able to toggle the new search syntax from the UI. This is controlled through the search pattern type, which is not stored as part of the event. I think this is okay, we'll still get good signal at the instance level as to whether enabling the feature was helpful. It'd be good to add tracking for how often users disable the toggle though. |
Relates to #58815 With this change, a quoted pattern, like `"foo bar"`, is interpreted literally IE spaces are interpreted as spaces instead of as logical `AND`. Quotes that should be matched literally have to be escaped. Example: To search for the text `foo "bar"`, where `bar` is surrounded by quotes, the query is either `"foo \"bar\""` or, if we use single quotes, `'foo "bar"'`. Note: This change only affects our keyword search prototype. Test plan: - updated unit test
Relates to https://github.com/sourcegraph/sourcegraph/issues/58815 Ported directly from https://github.com/sourcegraph/sourcegraph/pull/58849/ We add support for glob syntax to file and repo filters. Notes: - `f:` matches nothing. I think this is less surprising than our [current behavior](https://sourcegraph.com/search?q=context:global+f:&patternType=standard&sm=1) - `*` matches any sequence of characters, including `/` - No other special characters are supported Test plan: - new unit tests
Relates to #58815 When a user clicks on a search result we navigate to the file and set the repo and file filters in the query input. However, this was hardcoded to use regex syntax. With this change we respect the patternType. ## Test plan - new unit tests
5.3 release status ✅ (on track). Feature work completed, now moving on to testing and squashing bugs. |
QA Checklist
Link to test plan: https://docs.google.com/document/d/1skfewTKAcGR0oZrSGYLLBtFhlAHo3ORH5jQQAo2pcNk/edit |
Add a changelog entry for the keyword search feature. Closes #58815
Keyword search: add a changelog entry (#60280) Add a changelog entry for the keyword search feature. Closes #58815 (cherry picked from commit 28443cb) Co-authored-by: Julie Tibshirani <[email protected]>
We plan to bring code search closer to the "keyword search" style that users are used to, where queries are broken up into individual terms that can match broadly across the file name, contents, and symbols. This issue tracks our first round of work on improving the query language, which we aim to merge next release as 'beta'.
Example query:
repo:sourcegraph/sourcegraph auth provider
auth provider
matches files that contain both the stringsauth
andprovider
in any order. Before, it only matched files containing the exact stringauth provider
repo:sourcegraph/sourcegraph
works as usualCore changes
/.../
Usability / performance
Testing + polish
/cc @sourcegraph/search-platform
The text was updated successfully, but these errors were encountered: