☂️ Search: simpler query language inspired by "keyword search" #58815

jtibshirani · 2023-12-06T23:46:50Z

We plan to bring code search closer to the "keyword search" style that users are used to, where queries are broken up into individual terms that can match broadly across the file name, contents, and symbols. This issue tracks our first round of work on improving the query language, which we aim to merge next release as 'beta'.

Example query: `repo:sourcegraph/sourcegraph auth provider`

- `auth provider` matches files that contain both the strings `auth` and `provider` in any order. Before, it only matched files containing the exact string `auth provider`.
- There are no changes to filters in this release, so `repo:sourcegraph/sourcegraph` works as usual.
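To make the matching rule above concrete, here is a minimal sketch. It is not Sourcegraph's implementation: the `File` type, `parse`, and `matches` are hypothetical names, filters are simplified to substring checks, and symbol matching is left out.

```python
from dataclasses import dataclass


@dataclass
class File:
    repo: str
    path: str
    content: str


def parse(query):
    """Split a query into (filters, terms). Hypothetical, simplified parser."""
    filters, terms = {}, []
    for token in query.split():
        field, sep, value = token.partition(":")
        if sep and field in ("repo", "file"):
            filters[field] = value
        else:
            terms.append(token.lower())
    return filters, terms


def matches(file, query):
    """True if the file passes the filters and contains every term."""
    filters, terms = parse(query)
    # Assumption: filters are plain substring checks here.
    if "repo" in filters and filters["repo"] not in file.repo:
        return False
    if "file" in filters and filters["file"] not in file.path:
        return False
    # Terms may match in the file name or the contents, in any order.
    haystack = (file.path + "\n" + file.content).lower()
    return all(term in haystack for term in terms)


files = [
    File("sourcegraph/sourcegraph", "internal/auth/provider.go", "package auth"),
    File("sourcegraph/sourcegraph", "doc/auth.md", "how to add an auth provider"),
    File("sourcegraph/sourcegraph", "cmd/main.go", "provider := newProvider()"),
]
query = "repo:sourcegraph/sourcegraph auth provider"
hits = [f.path for f in files if matches(f, query)]
print(hits)
```

The first file matches through its path alone, the second through its contents, and the third is dropped because it never mentions `auth`.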

Core changes

Usability / performance

Test that Zoekt can handle AND queries with many clauses @keegancsmith
https://github.com/sourcegraph/sourcegraph/issues/59038 @jtibshirani
Debug why new search mode is much slower on S2 @keegancsmith
Promote results with an exact match, so users can still easily copy/paste in snippets @keegancsmith
- score: introduce query.Boost to scale score zoekt#728
- https://github.com/sourcegraph/sourcegraph/pull/59940

Testing + polish

search: document keyword search docs#55
Telemetry @jtibshirani
Complete "Keyword search quality plan" (separate doc)

/cc @sourcegraph/search-platform


keegancsmith · 2023-12-14T09:51:29Z

DONE Test that Zoekt can handle AND queries with many clauses

TL;DR: for the random real-world queries I tried, performance was acceptable. On S2 I normally got around 1s when ANDing the terms, whereas the same query as a literal was much faster (0.2s). So AND queries are slower, but acceptably so.

I got started on writing something which would try out a bunch of queries and collect data and graph it. But I ended up just writing the part which sampled out real strings from a codebase. When trying out a few samples at different token counts I got pretty consistent performance. In particular I did a bunch of testing with 2, 10, 15 and 20 token lengths.

My mental model of why this worked out fine is that the larger the number of tokens, the fewer documents there are to consider. So even though it is querying the index more, it balances out and we get pretty consistent performance.

What I did notice is that for a few of the large AND queries the ranking was terrible. I think it will be quite important to fix that (i.e. promote results by how many atoms match on the same line, etc.). In particular, if you have an atom that is a common word, you quickly run into limits, since each atom match counts towards your limit. So I often ran into a top-ranked document which was just a large file that happened to contain all the terms as substrings.
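The ranking idea mentioned above (promoting documents where atoms match on the same line) could look roughly like the following sketch. The `score` function is hypothetical and is not how Zoekt scores results; it only shows why a same-line signal would push a large file with scattered substring matches below a file where the terms occur together.

```python
def score(content, terms):
    """Hypothetical score: the largest number of distinct terms found on any single line."""
    terms = [t.lower() for t in terms]
    best_line = 0
    for line in content.lower().splitlines():
        best_line = max(best_line, sum(1 for t in terms if t in line))
    return best_line


docs = {
    # A large file that contains every term, but far apart.
    "big_generated.txt": "auth\n" + "filler\n" * 1000 + "provider\n",
    # A small file where both terms appear on one line.
    "auth.go": "func newAuthProvider() {}\n",
}
terms = ["auth", "provider"]
ranked = sorted(docs, key=lambda name: score(docs[name], terms), reverse=True)
print(ranked)
```

Both files satisfy the AND query, so a pure "contains all terms" check cannot tell them apart; the per-line count is what separates them here.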

In case someone wants to take this further, see below for the code which outputs strings from the sourcegraph codebase.

#!/usr/bin/env python3

import ast
import collections
import random
import subprocess

def gen_corpus():
    "return a corpus of double-quoted string literals from go code"
    proc = subprocess.run(
        args=('rg', '-t', 'go', '-o', '--no-filename', '"[^"]+?"'),
        cwd='/Users/keegan/src/github.com/sourcegraph/sourcegraph/',
        capture_output=True,
        check=True,
    )
    for line in proc.stdout.split(b'\n'):
        yield line.decode('utf-8')

# map from token count to the set of quoted literals with that many tokens
literals = collections.defaultdict(set)
for s in gen_corpus():
    # skip single-token literals and ones with leading/trailing spaces
    if ' ' not in s:
        continue
    if s.startswith('" ') or s.endswith(' "'):
        continue
    literals[len(s.split())].add(s)

def summarize(upto=20):
    "print token count, number of literals and a truncated preview"
    for count, ss in sorted(literals.items()):
        if count > upto:
            break
        s = repr(ss)
        if len(s) > 80:
            s = s[:80]
        print(count, len(ss), s)

def sample(tokens, count=10):
    # remove items which contain keywords
    keywords = set(['and', 'not', 'or'])
    population = [s for s in literals[tokens] if keywords.isdisjoint(s.lower().split())]

    count = min(count, len(population))
    for s in random.sample(population, count):
        try:
            # literal_eval unquotes the literal without executing arbitrary code
            unquoted = ast.literal_eval(s)
        except (SyntaxError, ValueError):
            continue
        print(unquoted)
        print(' AND '.join(unquoted.split()))
        print()

#summarize()
sample(20)

Relates to #58815

With this change, a quoted pattern, like `"foo bar"`, is interpreted literally, i.e. spaces are interpreted as spaces instead of as logical AND. Quotes which are part of the pattern have to be escaped. Example: to search for the literal `foo "bar"`, where "bar" is surrounded by quotes, the query is `"foo \"bar\""` (or equivalently `'foo "bar"'`).

Note: This only applies to our keyword search prototype.

Test plan:
- updated unit test
- manual testing
- tried out various combinations

jtibshirani · 2023-12-19T01:59:41Z

Make sure we can quantify improvements to search experience before/after rollout

I looked into our metrics collection for searches and think we are already tracking the right metrics:

- We log SearchSubmitted and SearchResultClicked, so we can see how many searches resulted in clicks vs. not.
- Thanks to @rrhyne we also report a more nuanced 'search success' metric which captures whether a user clicked then followed up with an action like copying code, searching history, etc.
- We also log if search throws an error (SearchResultsFetchFailed) and whether it returns any results (SearchResultsNonEmpty).

All logged events contain the feature flags that were enabled, so we can compare metrics from before and after the flag is enabled for an instance. We need to be aware of confounding factors, like the fact that the instance was also upgraded to a new version, which could itself affect "search success".

One complexity is that users will be able to toggle the new search syntax from the UI. This is controlled through the search pattern type, which is not stored as part of the event. I think this is okay: we'll still get good signal at the instance level as to whether enabling the feature was helpful. It'd be good to add tracking for how often users disable the toggle, though.
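As a rough illustration of the before/after comparison described in this comment, the sketch below groups events by whether a feature flag was enabled and computes a click-through rate for each group. The event shape, the `flags` field, and the flag name `keyword-search` are assumptions made for the example; only the event names SearchSubmitted and SearchResultClicked come from the discussion above.

```python
from collections import defaultdict


def event(name, enabled):
    """Build a hypothetical logged event; the field names are assumptions."""
    return {"name": name, "flags": {"keyword-search": enabled}}


def click_through_by_flag(events, flag):
    """Return {flag_enabled: clicks / submitted searches} for each group."""
    submitted = defaultdict(int)
    clicked = defaultdict(int)
    for e in events:
        enabled = bool(e["flags"].get(flag, False))
        if e["name"] == "SearchSubmitted":
            submitted[enabled] += 1
        elif e["name"] == "SearchResultClicked":
            clicked[enabled] += 1
    return {k: clicked[k] / n for k, n in submitted.items() if n}


events = (
    [event("SearchSubmitted", False) for _ in range(4)]
    + [event("SearchResultClicked", False)]
    + [event("SearchSubmitted", True) for _ in range(2)]
    + [event("SearchResultClicked", True)]
)
rates = click_through_by_flag(events, "keyword-search")
print(rates)
```

Because the pattern type is not part of the event, a comparison like this only works at the instance level, and it inherits the confounding factors noted above.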

Relates to #58815

With this change, a quoted pattern, like `"foo bar"`, is interpreted literally, i.e. spaces are interpreted as spaces instead of as logical `AND`. Quotes that should be matched literally have to be escaped. Example: to search for the text `foo "bar"`, where `bar` is surrounded by quotes, the query is either `"foo \"bar\""` or, if we use single quotes, `'foo "bar"'`.

Note: This change only affects our keyword search prototype.

Test plan:
- updated unit test
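The quoting rule above can be sketched as a small tokenizer. This is an illustrative approximation, not the actual parser: `tokenize` is a hypothetical name, and the handling of a backslash before any character other than a quote is an assumption.

```python
def tokenize(pattern):
    """Split a pattern into terms that would be ANDed together.

    Quoted spans (single or double quotes) are kept as one literal term,
    spaces included. Inside quotes, a backslash escapes the next character
    (assumption: this applies to any character, not only quotes).
    """
    terms, current = [], []
    quote = None      # the quote character we are currently inside, if any
    in_term = False   # whether a term has been started
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if quote:
            if ch == "\\" and i + 1 < len(pattern):
                current.append(pattern[i + 1])
                i += 1
            elif ch == quote:
                quote = None
            else:
                current.append(ch)
        elif ch in "\"'":
            quote = ch
            in_term = True
        elif ch.isspace():
            if in_term:
                terms.append("".join(current))
                current, in_term = [], False
        else:
            current.append(ch)
            in_term = True
        i += 1
    if in_term:
        terms.append("".join(current))
    return terms


print(tokenize('foo bar'))
print(tokenize('"foo bar"'))
print(tokenize('"foo \\"bar\\""'))
print(tokenize("'foo \"bar\"'"))
```

The first call yields two terms to be ANDed, while the other three each yield a single literal term.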

Relates to https://github.com/sourcegraph/sourcegraph/issues/58815

Ported directly from https://github.com/sourcegraph/sourcegraph/pull/58849/

We add support for glob syntax to file and repo filters.

Notes:
- `f:` matches nothing. I think this is less surprising than our [current behavior](https://sourcegraph.com/search?q=context:global+f:&patternType=standard&sm=1)
- `*` matches any sequence of characters, including `/`
- No other special characters are supported

Test plan:
- new unit tests
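A minimal sketch of the glob rules listed above, translating a filter value into a regular expression. The function names are hypothetical, and whether the real implementation anchors the pattern is not stated in the notes; this sketch assumes unanchored matching.

```python
import re


def glob_to_regex(glob):
    """Translate a filter glob into a compiled regex, or None if it matches nothing."""
    if glob == "":
        return None  # an empty filter value matches nothing
    # '*' is the only special character; everything else is matched literally.
    parts = [re.escape(part) for part in glob.split("*")]
    # DOTALL lets '*' span any character; '/' needs no special handling.
    return re.compile(".*".join(parts), re.DOTALL)


def glob_matches(glob, value):
    """Assumption: unanchored matching, i.e. the glob may match anywhere in the value."""
    rx = glob_to_regex(glob)
    return rx is not None and rx.search(value) is not None


print(glob_matches("cmd/*_test.go", "cmd/frontend/internal/foo_test.go"))
print(glob_matches("", "anything"))
print(glob_matches("a?b", "axb"))
```

In this sketch `*` crosses directory separators, an empty value never matches, and `?` is treated as a literal question mark rather than a wildcard.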

Relates to #58815

When a user clicks on a search result, we navigate to the file and set the repo and file filters in the query input. However, this was hardcoded to use regex syntax. With this change we respect the patternType.

## Test plan
- new unit tests

jtibshirani · 2024-02-01T23:51:12Z

5.3 release status ✅ (on track). Feature work completed, now moving on to testing and squashing bugs.

jtibshirani · 2024-02-02T18:59:28Z

Add a changelog entry for the keyword search feature. Closes #58815


jtibshirani added the team/search-platform label Dec 6, 2023

This was referenced Dec 18, 2023

search: support "exact match" pattern using "..." #59057

Merged

search: support glob syntax for repo and file filters #59080

Merged

This was referenced Dec 21, 2023

search: don't convert rev to regex for kw search #59165

Merged

search: respect patternType when navigating to search results #59164

Merged

stefanhengl mentioned this issue Jan 4, 2024

search: don't highlight operators inside "..." #59328

Merged

keegancsmith mentioned this issue Jan 8, 2024

search: revert glob filter values from new keyword pattern type #59381

Merged

kalanchan assigned jtibshirani Jan 9, 2024

This was referenced Jan 22, 2024

Search: add 'AND' queries to search blitz #59679

Merged

Remove smart search preview on no results page #59755

Merged

search: document keyword search sourcegraph/docs#55

Merged

keegancsmith mentioned this issue Feb 6, 2024

search: introduce max-line-len streaming parameter #60228

Merged

sourcegraph-release-bot mentioned this issue Feb 6, 2024

[Backport 5.3] search: introduce max-line-len streaming parameter #60232

Merged

jtibshirani closed this as completed Feb 7, 2024

jtibshirani mentioned this issue Feb 7, 2024

Keyword search: add a changelog entry #60280

Merged

jtibshirani added a commit that referenced this issue Feb 7, 2024

Keyword search: add a changelog entry (#60280)

28443cb


sourcegraph-release-bot pushed a commit that referenced this issue Feb 7, 2024

Keyword search: add a changelog entry (#60280)

7c01e49


sourcegraph-release-bot mentioned this issue Feb 7, 2024

[Backport 5.3] Keyword search: add a changelog entry #60282

Merged

This was referenced Mar 18, 2024

Keyword search GA #61225

Open

Remove references to smart search sourcegraph/docs#181

Merged
