|
| 1 | +# Design rationale for SCIP |
| 2 | + |
| 3 | +Sourcegraph supports viewing and navigating code for |
| 4 | +many different programming languages. |
| 5 | +This is similar to an IDE, except for the actual "editor" part. |
| 6 | + |
| 7 | +The design of SCIP is based on this primary motivating use case, |
| 8 | +as well as on fixing the pain points we encountered when using LSIF.[^1] |
| 9 | + |
| 10 | +SCIP is meant to be a _transmission_ format for sending |
| 11 | +data from some producers to some consumers |
| 12 | +-- it is not meant as a _storage_ format for querying. |
| 13 | +Ideally, producers should be able to directly output SCIP |
| 14 | +instead of going through an intermediate format |
| 15 | +and then converting to SCIP for transmission. |
| 16 | + |
| 17 | +[^1]: |
| 18 | + Sourcegraph historically supported LSIF uploads |
| 19 | + and maintained its own LSIF indexers, |
| 20 | + but we ran into issues with development velocity |
| 21 | + and debugging, as well as indexer performance bottlenecks. |
| 22 | + |
| 23 | +## Goals |
| 24 | + |
| 25 | +### Primary goals |
| 26 | + |
| 27 | +- Support code navigation at the fidelity of state-of-the-art IDEs. |
| 28 | + - Why: We want people to have an excellent experience navigating |
| 29 | + their code inside Sourcegraph itself. |
| 30 | +- Ease of writing indexers (i.e. producers). |
| 31 | + - Adding cross-repo navigation support should be easy. |
| 32 | + - Why: Scaling for Sourcegraph customers using code spread across many repos, |
| 33 | + as well as supporting code navigation for package ecosystems. |
| 34 | + - Adding file-level incrementality should be easy. |
| 35 | + - Why: Scaling for large monorepos. |
| 36 | + - Making the indexer parallel should be easy. |
| 37 | + - Why: Scaling for large monorepos. |
| 38 | + - Writing indexers in different languages should be feasible. |
| 39 | + - Why: Generally, tooling for a language tends to be implemented |
| 40 | + in the same language or an adjacent one; it would be impractical |
| 41 | + to expect indexers to settle on a common implementation language. |
| 42 | +- Robustness against indexer bugs: Incorrect code nav data for a certain entity |
| 43 | + should have a limited blast radius. |
| 44 | +- Ease of debugging. |
| 45 | + |
| 46 | +## Non-goals |
| 47 | + |
| 48 | +- Support use cases involving code modifications. |
| 49 | + - Why: Sourcegraph's code search and navigation has historically |
| 50 | + focused on read-only use cases. Adding support for code modifications |
| 51 | + introduces more complexity. |
| 52 | + While the same data can be used for refactoring tools |
| 53 | + and IDE tooling, supporting those is not the focus of SCIP. |
| 54 | +- Ease of writing consumers. |
| 55 | + - Why: We expect the number of SCIP producers to be much higher |
| 56 | + than the number of consumers, |
| 57 | + so it makes sense to optimize for producers. |
| 58 | +- Be as compact as possible in uncompressed form.[^2] |
| 59 | + - Why: Modern general-purpose compression formats like gzip |
| 60 | + and zstd are already very good in terms of both compression |
| 61 | + speed and compression ratio.[^3] |
| 62 | +- Support efficient code navigation by itself. |
| 63 | + |
| 64 | + - Why: Code navigation fundamentally requires some form of bidirectional |
| 65 | + lookup which is best served by a query engine. |
| 66 | + |
| 67 | + For example, finding subclasses and superclasses are dual operations; |
| 68 | + supporting both across different indexes (for cross-repository navigation) |
| 69 | + requires some way of connecting the data together anyway. |
| 70 | + However, if the consumer is capable of supporting that, |
| 71 | + then recording bidirectional links in index data is not useful. |
| 72 | + |
| 73 | +[^2]: The only compromise on this decision is in the choice of representation of source ranges, where benchmarks showed that using a variable length integer array encoding provided significant savings compared to a message-based encoding, even after compression. |
| 74 | + |
| 75 | +[^3]: In practice, SCIP data tends to have a compression ratio in the range of 10%-20%, as modern compressors are very good at de-duplicating the repetitive textual symbols. |
| 76 | + |
| 77 | +## Core design decisions |
| 78 | + |
| 79 | +### Using Protobuf for the schema |
| 80 | + |
| 81 | +- Relatively compact binary format, reducing I/O overhead. |
| 82 | +- Protobuf supports easy code generation. |
| 83 | +- Many languages have Protobuf code generators. |
| 84 | +- The tag-length-value (TLV) wire format enables streaming reads and writes |
| 85 | + as well as merging by concatenation. |
| 86 | +- Rules for maintaining forward and backward compatibility |
| 87 | + are easily understood. |
| 88 | + |
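The merge-by-concatenation property can be illustrated without a Protobuf library. The following is a minimal sketch, not SCIP's actual schema: it hand-encodes length-delimited records in the Protobuf wire format (a varint tag, a varint length, then the payload). The field number `DOCUMENTS_FIELD`, the helper names, and the payloads are assumptions made for illustration.

```python
from typing import Iterator, Tuple

LENGTH_DELIMITED = 2  # Protobuf wire type for strings, bytes, and sub-messages
DOCUMENTS_FIELD = 2   # hypothetical field number, chosen for this sketch only


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_record(field_number: int, payload: bytes) -> bytes:
    """Encode one tag-length-value record."""
    tag = (field_number << 3) | LENGTH_DELIMITED
    return encode_varint(tag) + encode_varint(len(payload)) + payload


def read_records(buf: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (field number, payload) pairs one at a time, in stream order."""
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        if tag & 0x7 != LENGTH_DELIMITED:
            raise ValueError("this sketch only handles length-delimited fields")
        length, pos = decode_varint(buf, pos)
        yield tag >> 3, buf[pos:pos + length]
        pos += length


# Two partial outputs, e.g. written by two independent indexer workers.
part_a = encode_record(DOCUMENTS_FIELD, b"src/a.go")
part_b = encode_record(DOCUMENTS_FIELD, b"src/b.go")

# Merging is plain byte concatenation; no re-encoding is needed.
merged = part_a + part_b
records = list(read_records(merged))
```

Because each record carries its own tag and length, a reader can process `merged` record by record without first loading the whole buffer.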
| 89 | +### Using strings for IDs |
| 90 | + |
| 91 | +Hash tables are a core data type used in compilers and |
| 92 | +hence are likely to be useful in indexers generally. |
| 93 | + |
| 94 | +String types in mainstream languages support equality and hashing, |
| 95 | +whereas other object types may not. |
| 96 | + |
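As a small illustration of the point above, here is a sketch in Python, where instances of a plain class hash and compare by identity unless the author defines `__eq__` and `__hash__`. The `SymbolInfo` class and the symbol string `"pkg Foo#"` are hypothetical and do not reflect SCIP's actual symbol syntax.

```python
class SymbolInfo:
    """Hypothetical structured ID without custom equality or hashing."""

    def __init__(self, package: str, name: str) -> None:
        self.package = package
        self.name = name


# Structured object as a key: a second, equal-looking object is a different key,
# because the default hash and equality are based on object identity.
object_table = {SymbolInfo("pkg", "Foo"): "defined at a.py:1"}
lookup_by_object = object_table.get(SymbolInfo("pkg", "Foo"))

# String as a key: equality and hashing are by value, so lookup just works.
string_table = {"pkg Foo#": "defined at a.py:1"}
lookup_by_string = string_table.get("pkg " + "Foo#")
```

Here `lookup_by_object` is `None`, while `lookup_by_string` finds the entry.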
| 97 | +### Avoiding direct encoding of graphs |
| 98 | + |
| 99 | +One very tempting design for code navigation data is to |
| 100 | +think of all semantic entities as nodes and relationships between |
| 101 | +entities as edges, and to simply record ALL data |
| 102 | +using an adjacency list based graph representation. |
| 103 | + |
| 104 | +This is conceptually appealing, |
| 105 | +but is not desirable for a few reasons: |
| 106 | + |
| 107 | +- It encourages a wholesale approach to writing |
| 108 | + indexers as it involves merging all the data together. |
| 109 | + Such indexers are less likely to be able |
| 110 | + to easily support parallelism. |
| 111 | +- It potentially requires keeping a lot of data in memory |
| 112 | + at indexing time. Ideally, an indexer should be able |
| 113 | + to load parts of the codebase, |
| 114 | + append index data for that part to an open file, |
| 115 | + and then move on to the next part having cleared all memory |
| 116 | + being used from the previous iteration. |
| 117 | +- It potentially requires keeping a lot of data in memory |
| 118 | + on the consumer side. |
| 119 | + |
| 120 | +Instead, the approach of using documents and arrays |
| 121 | +helps colocate relevant data and naturally allows for streaming |
| 122 | +if the underlying data format allows streaming. |
| 123 | + |
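The document-at-a-time workflow described above can be sketched as follows. This is an illustrative outline under stated assumptions, not how any particular SCIP indexer works: `index_file` is a hypothetical toy analysis, the record fields are made up, and JSON lines stand in for the real binary encoding.

```python
import io
import json
from typing import Iterable, TextIO, Tuple


def index_file(path: str, source: str) -> dict:
    """Hypothetical per-file analysis: record top-level `def` lines only."""
    occurrences = []
    for line_no, line in enumerate(source.splitlines()):
        if line.startswith("def "):
            name = line[4:].split("(")[0]
            occurrences.append({"symbol": f"{path}/{name}", "line": line_no})
    return {"relative_path": path, "occurrences": occurrences}


def write_index(files: Iterable[Tuple[str, str]], out: TextIO) -> None:
    """Append one self-contained document per file to an open stream."""
    for path, source in files:
        document = index_file(path, source)
        out.write(json.dumps(document) + "\n")
        # Nothing from this file is retained; the next iteration starts fresh.


files = [
    ("a.py", "def f():\n    pass\n"),
    ("b.py", "x = 1\ndef g(y):\n    return y\n"),
]
out = io.StringIO()  # stands in for an open index file
write_index(files, out)

# A consumer can likewise read the index one document at a time.
documents = [json.loads(line) for line in out.getvalue().splitlines()]
```

Since each document is independent, the loop body could also run in parallel workers whose outputs are appended to the same stream.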
| 124 | +### Avoiding integer IDs |
| 125 | + |
| 126 | +This includes avoiding structures like a symbol table |
| 127 | +mapping string IDs to integers and mostly using the integer |
| 128 | +IDs in raw data. |
| 129 | + |
| 130 | +As far as we know, the integer IDs in LSIF are primarily present as |
| 131 | +an ad-hoc compression scheme due to the verbosity of JSON |
| 132 | +and LSIF's graph-based encoding scheme. |
| 133 | + |
| 134 | +Avoiding integer IDs helps with limiting the blast radius |
| 135 | +of indexer bugs. With LSIF, we've had off-by-one bugs in indexers |
| 136 | +cause code navigation to fail repo-wide. |
| 137 | + |
| 138 | +This also helps with debugging, as the raw data itself |
| 139 | +can be inspected much more easily without needing |
| 140 | +a lot of surrounding context. |
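To make the blast-radius argument concrete, here is a toy comparison. It is a sketch of one way such a bug could play out, not a reproduction of any real LSIF indexer bug; the symbol strings and the specific mistakes are hypothetical.

```python
symbols = ["pkg/A#", "pkg/B#", "pkg/C#", "pkg/D#"]

# The symbols that three occurrences are supposed to refer to.
intended = ["pkg/A#", "pkg/C#", "pkg/B#"]

# Integer IDs: each occurrence stores an index into the symbol table.
correct_ids = [symbols.index(s) for s in intended]

# Hypothetical bug: the producer numbers symbols from 1, the consumer from 0.
buggy_ids = [i + 1 for i in correct_ids]
resolved_from_ids = [symbols[i] for i in buggy_ids]
wrong_with_ids = sum(a != b for a, b in zip(resolved_from_ids, intended))

# String IDs: each occurrence stores the symbol itself.
# Hypothetical bug: one occurrence is emitted with a malformed symbol.
buggy_strings = list(intended)
buggy_strings[1] = "pkg/C"
wrong_with_strings = sum(a != b for a, b in zip(buggy_strings, intended))
```

In this toy case the single numbering mistake misresolves every occurrence, while the malformed string affects only the occurrence that contains it. The string form is also readable on its own, without consulting a separate table.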