Commit b469406

authored Oct 24, 2024
docs: Add DESIGN.md (#289)
1 parent 727d53e commit b469406

2 files changed

+144
-4
lines changed

‎DESIGN.md

+140
@@ -0,0 +1,140 @@
1+
# Design rationale for SCIP
2+
3+
Sourcegraph supports viewing and navigating code for
4+
many different programming languages.
5+
This is similar to an IDE, except for the actual "editor" part.
6+
7+
The design of SCIP is based on this primary motivating use case,
8+
as well as on fixing the pain points we encountered while using LSIF.[^1]
9+
10+
SCIP is meant to be a _transmission_ format for sending
11+
data from some producers to some consumers
12+
-- it is not meant as a _storage_ format for querying.
13+
Ideally, producers should be able to directly output SCIP
14+
instead of going through an intermediate format
15+
and then converting to SCIP for transmission.
16+
17+
[^1]:
18+
Sourcegraph historically supported LSIF uploads
19+
and maintained our own LSIF indexers,
20+
but we ran into issues of development velocity,
21+
debugging, as well as indexer performance bottlenecks.
22+
23+
## Goals
24+
25+
### Primary goals
26+
27+
- Support code navigation at the fidelity of state-of-the-art IDEs.
28+
- Why: We want people to have an excellent experience navigating
29+
their code inside Sourcegraph itself.
30+
- Ease of writing indexers (i.e. producers).
31+
- Adding cross-repo navigation support should be easy.
32+
- Why: Scaling for Sourcegraph customers using code spread across many repos,
33+
as well as supporting code navigation for package ecosystems.
34+
- Adding file-level incrementality should be easy.
35+
- Why: Scaling for large monorepos.
36+
- Making the indexer parallel should be easy.
37+
- Why: Scaling for large monorepos.
38+
- Writing indexers in different languages should be feasible.
39+
- Why: Generally, tooling for a language tends to be implemented
40+
in the same language or an adjacent one; it would be impractical
41+
to expect indexers to settle on a common implementation language.
42+
- Robustness against indexer bugs: Incorrect code nav data for a certain entity
43+
should have a limited blast radius.
44+
- Ease of debugging.
45+
46+
## Non-goals
47+
48+
- Support use cases involving code modifications.
49+
- Why: Sourcegraph's code search and navigation has historically
50+
focused on read-only use cases. Adding support for code modifications
51+
introduces more complexity.
52+
While the same data can be used for refactoring tools
53+
and IDE tooling, supporting those is not the focus of SCIP.
54+
- Ease of writing consumers.
55+
- Why: We expect the number of SCIP producers to be much higher
56+
than the number of consumers,
57+
so it makes sense to optimize for producers.
58+
- Be as compact as possible in uncompressed form.[^2]
59+
- Why: Modern general-purpose compression formats like gzip
60+
and zstd are already very good in terms of both compression
61+
speed and compression ratio.[^3]
62+
- Support efficient code navigation by itself.
63+
64+
- Why: Code navigation fundamentally requires some form of bidirectional
65+
lookup which is best served by a query engine.
66+
67+
For example, finding subclasses and superclasses are dual operations;
68+
supporting both across different indexes (for cross-repository navigation)
69+
requires some way of connecting the data together anyway.
70+
However, if the consumer is capable of supporting that,
71+
then recording bidirectional links in index data is not useful.
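
To illustrate the point about dual operations, here is a minimal sketch of a consumer deriving the reverse direction itself. The record shape (a symbol string paired with a list of superclass symbols) and the symbol names are hypothetical simplifications for illustration, not SCIP's actual schema.

```python
from collections import defaultdict

# Hypothetical, simplified records: each index stores only "child -> parents" links.
index_repo_a = [("pkg_a Animal#", [])]
index_repo_b = [
    ("pkg_b Dog#", ["pkg_a Animal#"]),
    ("pkg_b Cat#", ["pkg_a Animal#"]),
]

def build_lookup(*indexes):
    """Combine several indexes and derive the reverse (subclass) direction."""
    superclasses = {}
    subclasses = defaultdict(list)
    for index in indexes:
        for symbol, parents in index:
            superclasses[symbol] = list(parents)
            for parent in parents:
                # The reverse edge is computed by the consumer, not stored in the index.
                subclasses[parent].append(symbol)
    return superclasses, dict(subclasses)

superclasses, subclasses = build_lookup(index_repo_a, index_repo_b)
```

Because the consumer has to join data from both indexes anyway, storing the subclass links in `index_repo_a` would add nothing it cannot compute.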
72+
73+
[^2]: The only compromise on this decision is in the choice of representation of source ranges, where benchmarks showed that using a variable length integer array encoding provided significant savings compared to a message-based encoding, even after compression.
74+
75+
[^3]: In practice, SCIP data tends to have a compression ratio in the range of 10%-20%, as modern compressors are very good at de-duplicating the repetitive textual symbols.
76+
77+
## Core design decisions
78+
79+
### Using Protobuf for the schema
80+
81+
- Relatively compact binary format, reducing I/O overhead.
82+
- Protobuf supports easy code generation.
83+
- Many languages have Protobuf code generators.
84+
- TLV format enables streaming reads and writes
85+
as well as merging by concatenation.
86+
- Rules for maintaining forward and backward compatibility
87+
are easily understood.
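
As a rough sketch of what "merging by concatenation" means at the wire level, the following hand-rolls just enough of the Protobuf encoding (varints and length-delimited fields) to show that concatenating two serialized chunks yields one stream containing all the repeated entries. The field number and the placeholder payloads are assumptions for illustration; a real producer would use generated bindings and serialized document messages instead.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a Protobuf varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_len_field(field_number: int, payload: bytes) -> bytes:
    """Encode one tag-length-value record (wire type 2 = length-delimited)."""
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload

def read_varint(buf: bytes, pos: int):
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7

def iter_len_fields(buf: bytes):
    """Yield (field_number, payload) one record at a time."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        assert (tag & 0x7) == 2, "sketch only handles length-delimited fields"
        length, pos = read_varint(buf, pos)
        yield tag >> 3, buf[pos:pos + length]
        pos += length

DOCUMENTS = 2  # assumed field number for a repeated "documents" field

# Two chunks produced independently; payloads are opaque placeholders.
part_a = encode_len_field(DOCUMENTS, b"a.go")
part_b = encode_len_field(DOCUMENTS, b"b.go") + encode_len_field(DOCUMENTS, b"c.go")

merged = part_a + part_b  # merging is plain byte concatenation
fields = list(iter_len_fields(merged))
```

The reader consumes one record at a time, which is the same property that makes streaming reads possible.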
88+
89+
### Using strings for IDs
90+
91+
Hash tables are a core data type used in compilers and
92+
hence are likely to be useful in indexers generally.
93+
94+
String types in mainstream languages support equality and hashing,
95+
whereas other object types may not.
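
A small sketch of this point in Python: strings can be used as hash table keys directly, while a user-defined class that only defines equality cannot. The `SymbolInfo` class and the symbol string below are hypothetical and exist only for this illustration.

```python
# Strings work as dictionary keys without any extra code.
definitions = {}
definitions["pkg_a Animal#"] = ("animal.py", 10)

class SymbolInfo:
    """Hypothetical structured ID that defines equality but not hashing."""

    def __init__(self, package: str, name: str):
        self.package = package
        self.name = name

    def __eq__(self, other):
        return (self.package, self.name) == (other.package, other.name)

# In Python, defining __eq__ without __hash__ makes instances unhashable,
# so using one as a dictionary key raises TypeError.
try:
    structured = {SymbolInfo("pkg_a", "Animal"): ("animal.py", 10)}
    structured_key_ok = True
except TypeError:
    structured_key_ok = False
```

Other languages differ in the details, but the general pattern holds: string keys need no additional work from the indexer author.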
96+
97+
### Avoid direct encoding of graphs
98+
99+
One very tempting design for code navigation data is to
100+
think of all semantic entities as nodes and relationships between
101+
entities as edges, and to simply record ALL data
102+
using an adjacency list based graph representation.
103+
104+
This is conceptually appealing,
105+
but is not desirable for a few reasons:
106+
107+
- It encourages a wholesale approach to writing
108+
indexers as it involves merging all the data together.
109+
Such indexers are less likely to be able
110+
to easily support parallelism.
111+
- This potentially requires keeping a lot of data in memory
112+
at indexing time. Ideally, an indexer should be able
113+
to load parts of the codebase,
114+
append index data for that part to an open file,
115+
and then move on to the next part having cleared all memory
116+
being used from the previous iteration.
117+
- It potentially requires keeping a lot of data in memory
118+
on the consumer side.
119+
120+
Instead, the approach of using documents and arrays
121+
helps colocate relevant data and naturally allows for streaming
122+
if the underlying data format allows streaming.
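
The document-at-a-time workflow described above can be sketched as follows. The per-file "analysis" and the length-prefixed JSON framing are stand-ins chosen to keep the example short; they are assumptions for illustration and not the encoding SCIP uses.

```python
import io
import json
import struct

def index_file(path: str, source: str) -> dict:
    """Hypothetical per-file analysis: record the indexes of non-empty lines."""
    lines = source.splitlines()
    return {
        "path": path,
        "occurrences": [i for i, line in enumerate(lines) if line.strip()],
    }

def write_documents(out, files) -> None:
    """Append one document per file; nothing accumulates across iterations."""
    for path, source in files:
        doc = json.dumps(index_file(path, source)).encode()
        out.write(struct.pack("<I", len(doc)) + doc)

def read_documents(inp):
    """Yield documents one at a time without loading the whole stream."""
    while header := inp.read(4):
        (length,) = struct.unpack("<I", header)
        yield json.loads(inp.read(length))

buf = io.BytesIO()
write_documents(buf, [("a.py", "x = 1\n\ny = 2\n"), ("b.py", "")])
buf.seek(0)
docs = list(read_documents(buf))
```

Both the writer and the reader hold at most one document in memory at a time, which is the property an adjacency-list graph encoding makes hard to achieve.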
123+
124+
### Avoiding integer IDs
125+
126+
This includes avoiding structures like a symbol table
127+
mapping string IDs to integers and mostly using the integer
128+
IDs in raw data.
129+
130+
As far as we know, the integer IDs in LSIF are primarily present as
131+
an ad-hoc compression scheme due to the verbosity of JSON
132+
and LSIF's graph-based encoding scheme.
133+
134+
Avoiding integer IDs helps with limiting the blast radius
135+
of indexer bugs. With LSIF, we've had off-by-one bugs in indexers
136+
cause code navigation to fail repo-wide.
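
The difference in blast radius can be shown with a toy example. This is an illustration under invented data, not a reproduction of any actual LSIF indexer bug; the symbol names and the shape of the bug are assumptions.

```python
table = ["pkg A#", "pkg B#", "pkg C#", "pkg D#"]

# Integer-ID style: occurrences refer to positions in a symbol table.
intended_ids = [0, 1, 2]  # meant to reference A, B, C
# Hypothetical bug: the indexer numbers symbols from 1, but the table is 0-based.
buggy_ids = [i + 1 for i in intended_ids]
resolved_int = [table[i] for i in buggy_ids]
wrong_int = sum(
    1 for want, got in zip(intended_ids, resolved_int) if table[want] != got
)

# String-ID style: each occurrence carries its own symbol string.
# Hypothetical bug: one symbol is emitted malformed (missing the trailing '#').
buggy_strings = ["pkg A#", "pkg B", "pkg C#"]
known = set(table)
resolved_str = [s if s in known else None for s in buggy_strings]
wrong_str = sum(1 for s in resolved_str if s is None)
```

With integer IDs, a single offset error silently redirects every reference to a different, valid-looking symbol. With string IDs, the malformed symbol fails to resolve on its own and the rest of the data is unaffected.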
137+
138+
This also helps with debugging, as the raw data itself
139+
can be inspected much more easily without needing
140+
a lot of surrounding context.

‎Readme.md ‎README.md

+4-4
@@ -7,15 +7,15 @@ such as Go to definition, Find references, and Find implementations.
77

88
This repository includes:
99

10-
- A [protobuf schema for SCIP](./scip.proto).
10+
- A [Protobuf schema for SCIP](./scip.proto).
1111
- Rich Go and Rust bindings for SCIP: These include many utility functions
1212
to help build tooling on top of SCIP.
1313
- Auto-generated bindings for TypeScript and Haskell.
1414
- The [`scip` CLI](./docs/CLI.md), which makes SCIP indexes
1515
a breeze to work with.
1616

1717
If you're interested in better understanding the motivation behind SCIP,
18-
check out the [announcement blog post](https://about.sourcegraph.com/blog/announcing-scip).
18+
check out the [announcement blog post](https://about.sourcegraph.com/blog/announcing-scip) and the [design doc](./DESIGN.md).
1919

2020
If you're interested in writing a new indexer that emits SCIP,
2121
check out our documentation on
@@ -24,8 +24,8 @@ Also, check out the [Debugging section][] in the Development docs.
2424

2525
If you're interested in consuming SCIP data,
2626
you can either use one of the [provided language bindings](https://github.com/sourcegraph/scip/tree/main/bindings),
27-
or generate code for the [SCIP protobuf schema](./scip.proto)
28-
using the protobuf toolchain for your language ecosystem.
27+
or generate code for the [SCIP Protobuf schema](./scip.proto)
28+
using the Protobuf toolchain for your language ecosystem.
2929
Also, check out the [Debugging section][] in the Development docs.
3030

3131
[debugging section]: ./Development.md#debugging
