Skip to content
This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

☂️ Search: simpler query language inspired by "keyword search" #58815

Closed
17 tasks done
jtibshirani opened this issue Dec 6, 2023 · 4 comments · Fixed by #60280
Closed
17 tasks done

☂️ Search: simpler query language inspired by "keyword search" #58815

jtibshirani opened this issue Dec 6, 2023 · 4 comments · Fixed by #60280
Assignees
Labels
team/search-platform Issues owned by the search platform team

Comments

@jtibshirani
Copy link
Member

jtibshirani commented Dec 6, 2023

We plan to bring code search closer to the "keyword search" style that users are used to, where queries are broken up into individual terms that can match broadly across the file name, contents, and symbols. This issue tracks our first round of work on improving the query language, which we aim to merge next release as 'beta'.

Example query: repo:sourcegraph/sourcegraph auth provider

  • auth provider matches files that contain both the strings auth and provider in any order. Before, it only matched files containing the exact string auth provider
  • There are no changes to filters in this release, so repo:sourcegraph/sourcegraph works as usual

Core changes

Usability / performance

Testing + polish

/cc @sourcegraph/search-platform

@jtibshirani jtibshirani added the team/search-platform Issues owned by the search platform team label Dec 6, 2023
@keegancsmith
Copy link
Member

DONE Test that Zoekt can handle AND queries with many clauses

TL;DR from random real world queries I tried I got acceptable performance. For s2 I normally got around 1s when ANDing, but with literal it would be much faster (0.2s). This is slower, but acceptable.

I got started on writing something which would try out a bunch of queries and collect data and graph it. But I ended up just writing the part which sampled out real strings from a codebase. When trying out a few samples at different token counts I got pretty consistent performance. In particular I did a bunch of testing with 2, 10, 15 and 20 token lengths.

My mental model of why this worked out fine is the larger the number of tokens, the less documents to consider. So even though it is querying the index more, it balances out so we have pretty consistent performance.

What I did notice is for a few of the large AND queries the ranking was terrible. I think it will be quite important to fix that (ie promote atom counts on same line, etc). In particular if you have an atom that is a common word, you quickly run into limits since each atom matching counts towards your limit. So I often ran into a top ranked document which was just a large file that happened to contain all the terms as substrings.

In the interest of someone maybe wanting to take this further see below for the code which outputs strings from the sourcegraph codebase.

#!/usr/bin/env python3

import collections
import subprocess
import random

def gen_corpus():
    "return a corpus of strings from go code"
    proc = subprocess.run(
        args=('rg', '-t', 'go', '-o', '--no-filename', '"[^"]+?"'),
        cwd='/Users/keegan/src/github.com/sourcegraph/sourcegraph/',
        capture_output=True,
        check=True,
    )
    for line in proc.stdout.split(b'\n'):
        yield(line.decode('utf-8'))

literals = collections.defaultdict(set)
for s in gen_corpus():
    if ' ' not in s:
        continue
    if s.startswith('" ') or s.endswith(' "'):
        continue
    literals[len(s.split())].add(s)

def summarize(upto=20):
    for count, ss in sorted(literals.items()):
        if count > upto:
            break
        s = repr(ss)
        if len(s) > 80:
            s = s[:80]
        print(count, len(ss), s)

def sample(tokens, count=10):
    # remove items which contain keywords
    keywords = set(['and', 'not', 'or'])
    population = [s for s in literals[tokens] if keywords.isdisjoint(s.lower().split())]

    count = min(count, len(population))
    for s in random.sample(population, count):
        try:
            print(eval(s))
            print(' AND '.join(eval(s).split()))
            print()
        except SyntaxError:
            pass

#summarize()
sample(20)

stefanhengl added a commit that referenced this issue Dec 18, 2023
Relates to #58815

With this change, a quoted pattern, like in `"foo bar"`, is interpreted
literally IE spaces are interpreted as spaces instead of as logical AND.
Quotes, which are part of the pattern have to be escaped.

Example:

searching for the literal `foo "bar"`, where "bar" is surrounded by quotes, the
query is `"foo \"bar\""` (or equivalently `'foo "bar"'`)

Note: This only applies for our keyword search prototype.

Test plan:
- updated unit test
- manual testing
  - tried out various combinations
@jtibshirani
Copy link
Member Author

Make sure we can quantify improvements to search experience before/ after rollout

I looked into our metrics collection for searches and think we are already tracking the right metrics:

  • We log SearchSubmitted and SearchResultClicked, so we can see how many searches resulted in clicks vs. not
  • Thanks to @rrhyne we also report a more nuanced 'search success' metric which captures whether a user clicked then followed up with an action like copying code, searching history, etc.
  • We also log if search throws an error (SearchResultsFetchFailed) and whether it returns any results (SearchResultsNonEmpty)

All logged events contain the feature flags that were enabled, so we can compare metrics from before and after the flag is enabled for an instance. We need to be aware of confounding factors, like the fact that they also upgraded to a new version which could affect "search success".

One complexity is that users will be able to toggle the new search syntax from the UI. This is controlled through the search pattern type, which is not stored as part of the event. I think this is okay, we'll still get good signal at the instance level as to whether enabling the feature was helpful. It'd be good to add tracking for how often users disable the toggle though.

stefanhengl added a commit that referenced this issue Dec 19, 2023
Relates to #58815

With this change, a quoted pattern, like `"foo bar"`, is interpreted literally IE spaces are interpreted as spaces instead of as logical `AND`. Quotes that should be matched literally have to be escaped.

Example:

To search for the text `foo "bar"`, where `bar` is surrounded by quotes, the query is either `"foo \"bar\""` or, if we use single quotes, `'foo "bar"'`.

Note: This change only affects our keyword search prototype.

Test plan:
- updated unit test
stefanhengl referenced this issue Dec 20, 2023
Relates to https://github.com/sourcegraph/sourcegraph/issues/58815
Ported directly from https://github.com/sourcegraph/sourcegraph/pull/58849/

We add support for glob syntax to file and repo filters.

Notes:
- `f:` matches nothing. I think this is less surprising than our [current behavior](https://sourcegraph.com/search?q=context:global+f:&patternType=standard&sm=1)
- `*` matches any sequence of characters, including `/`
- No other special characters are supported

Test plan:
- new unit tests
stefanhengl added a commit that referenced this issue Dec 21, 2023
Relates to #58815 

When a user clicks on a search result we navigate to the file and set the repo and file filters in the query input. However, this was hardcoded to use regex syntax. With this change we respect the patternType.

## Test plan
- new unit tests
@jtibshirani
Copy link
Member Author

jtibshirani commented Feb 1, 2024

5.3 release status ✅ (on track). Feature work completed, now moving on to testing and squashing bugs.

@jtibshirani
Copy link
Member Author

jtibshirani commented Feb 2, 2024

QA Checklist

  1. Have you made any infra related changes to environment variables, new services, or deployment methods that could affect customers?

    • Yes - I've informed #team-cloud and #team-delivery of the changes.
    • Yes - Changelog or documentation has been updated, this includes changes to defaults and site config flags.
    • No
  2. Which environments have the changes been tested on?

    • rctest.sourcegraph.com
    • sourcegraph.sourcegraph.com
    • sourcegraph.com
    • cody-dev.sgdev.dev (running based off 5.2 release branch)
    • scaletesting.sgdev.org
    • other
  3. Experimental features have been marked and behind a feature flag?

    • Yes
    • No
    • N/A
  4. Has telemetry been added as part of the product requirements?

    • Yes
    • No
    • N/A
  5. Completed entry to release post.

    • Yes
    • No
    • N/A
  6. Is a feature flagged in a way when turns the feature off, it behaves as-if the feature does not exist?

    • Yes
    • No
    • N/A
  7. A CHANGELOG entry has been added for the feature/change?

    • Yes
    • No
    • N/A
  8. Please provide any additional testing you've done that has not been covered above:

Link to test plan: https://docs.google.com/document/d/1skfewTKAcGR0oZrSGYLLBtFhlAHo3ORH5jQQAo2pcNk/edit

keegancsmith pushed a commit that referenced this issue Feb 8, 2024
Keyword search: add a changelog entry (#60280)

Add a changelog entry for the keyword search feature.

Closes #58815

(cherry picked from commit 28443cb)

Co-authored-by: Julie Tibshirani <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
team/search-platform Issues owned by the search platform team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants