Support UTF-16 in LSP #656

jakebailey · 2025-03-18T03:51:52Z

Before:

After:

Includes #653

Per https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments

The eldritch horror is:

JS defines 4 different newline sequences
LSP defines 3 different newline sequences
Tools like vim and git only have one newline sequence
VS Code follows LSP (or vice versa), but you can only select LF or CRLF
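
For readers following along, the mismatch can be made concrete with a small sketch. ECMAScript's line terminators are LF, CR, U+2028, and U+2029, while LSP 3.17 recognizes only "\n", "\r\n", and "\r". The `lineStarts` function below is a hypothetical illustration written for this note, not the code from this PR; it computes byte offsets of line starts under either definition.

```go
package main

import "fmt"

// lineStarts returns the byte offset at which each line begins.
// When ecmaScript is true, U+2028 and U+2029 also end a line (the
// ECMAScript LineTerminator set); otherwise only "\n", "\r\n", and "\r"
// do, matching the LSP definition of an end-of-line sequence.
// Hypothetical helper for illustration only.
func lineStarts(text string, ecmaScript bool) []int {
	starts := []int{0}
	for i := 0; i < len(text); {
		switch {
		case text[i] == '\r':
			i++
			// Treat "\r\n" as a single terminator.
			if i < len(text) && text[i] == '\n' {
				i++
			}
			starts = append(starts, i)
		case text[i] == '\n':
			i++
			starts = append(starts, i)
		case ecmaScript && i+3 <= len(text) &&
			(text[i:i+3] == "\u2028" || text[i:i+3] == "\u2029"):
			// Both separators are three bytes long in UTF-8.
			i += 3
			starts = append(starts, i)
		default:
			i++
		}
	}
	return starts
}

func main() {
	text := "a\r\nb\u2028c\rd\n"
	fmt.Println(lineStarts(text, false)) // [0 3 9 11]
	fmt.Println(lineStarts(text, true))  // [0 3 7 9 11]
}
```

The same text has four lines under the LSP rules and five under the ECMAScript rules, so a line number reported with one definition can be off by one or more in a client that uses the other.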

DanielRosenwasser · 2025-03-18T17:16:03Z

JS defines 4 different newline sequences

LSP defines 3 different newline sequences

VS Code follows LSP (or vice versa), but you can only select LF or CRLF

Obligatory link to microsoft/TypeScript#38078

jakebailey · 2025-03-18T17:18:13Z

Yeah, so I am clearly doing the LSP thing here, and I can't find a reason why we shouldn't just do that regardless for everything else, honestly.

DanielRosenwasser · 2025-03-18T17:28:13Z

internal/lsp/converters.go

+		character = position - start
+	} else {
+		// We need to rescan the text as UTF-16 to find the character offset.
+		for _, r := range scriptInfo.Text()[start:position] {


Didn't realize that range does "the right thing" over a string.

Yes, it's actually faster than manually calling the utf8 lib, IIRC.
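
A minimal sketch of the idea discussed here, assuming the line starts are already available as byte offsets: ranging over a Go string yields decoded runes rather than bytes, so the UTF-16 column can be counted without calling the `unicode/utf8` package directly. The name `toLineAndCharacter` and its signature are made up for this example and are not the functions in `converters.go`.

```go
package main

import (
	"fmt"
	"sort"
)

// toLineAndCharacter converts a UTF-8 byte offset into a 0-indexed line
// and a UTF-16 code unit column. lineStarts must be sorted and begin
// with 0. Hypothetical helper for illustration only.
func toLineAndCharacter(text string, lineStarts []int, pos int) (line, character int) {
	// Find the last line start that is <= pos.
	line = sort.Search(len(lineStarts), func(i int) bool {
		return lineStarts[i] > pos
	}) - 1
	// Rescan the line prefix: range decodes one rune per iteration.
	for _, r := range text[lineStarts[line]:pos] {
		if r >= 0x10000 {
			character += 2 // outside the BMP: encoded as a surrogate pair
		} else {
			character++
		}
	}
	return line, character
}

func main() {
	// "é" is 2 bytes in UTF-8, U+1F600 is 4 bytes and 2 UTF-16 code units.
	text := "\u00e9\n\U0001F600x"
	starts := []int{0, 3}
	fmt.Println(toLineAndCharacter(text, starts, 7)) // 1 2
	fmt.Println(toLineAndCharacter(text, starts, 2)) // 0 1
}
```

The rescan is only needed when the line contains non-ASCII text; for pure ASCII the byte distance from the line start is already the UTF-16 column, which is what the `character = position - start` branch in the diff relies on.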

DanielRosenwasser · 2025-03-18T17:28:21Z

internal/lsp/converters.go

+func positionToLineAndCharacter(scriptInfo *project.ScriptInfo, position core.TextPos) lsproto.Position {
+	// UTF-8 offset to UTF-8/16 0-indexed line and character
+
+	lineMap := scriptInfo.LineMapLSP()


Does this ever get cached anywhere? Is the work being done on every call?

Yes, scriptInfo caches this.

Can you update the source file's line map to be the same if the file contains only ASCII? That way the two will be deduplicated.

Or possibly just track whether a non-CR/LF line ending was encountered here and do it then.

By that point we'll have constructed the whole thing, so I'm not totally sure if it's helpful, but I guess we could save a little memory sometimes... If we've even requested the line map for other reasons, which we might not have at all.
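
To illustrate the two ideas in this thread (computing the line map once, and noticing when the text is ASCII-only), here is a rough sketch. The `document` type, its fields, and `LineMap` are hypothetical stand-ins invented for this example; they are not `project.ScriptInfo` or its API. The reasoning is that U+2028 and U+2029 are non-ASCII, so for ASCII-only text the LSP and ECMAScript line maps are identical and could in principle be shared.

```go
package main

import (
	"fmt"
	"sync"
	"unicode/utf8"
)

// document is a hypothetical stand-in for a type that owns file text.
type document struct {
	text string

	once      sync.Once
	lineMap   []int // byte offsets of line starts (LSP newline rules)
	asciiOnly bool  // true when no byte is >= utf8.RuneSelf
}

// LineMap computes the line starts on first use and returns the cached
// result afterwards. The second result reports whether the text is
// ASCII-only, in which case an ECMAScript line map would be the same.
func (d *document) LineMap() ([]int, bool) {
	d.once.Do(func() {
		d.lineMap = []int{0}
		d.asciiOnly = true
		for i := 0; i < len(d.text); i++ {
			switch c := d.text[i]; {
			case c == '\r':
				// Treat "\r\n" as a single terminator.
				if i+1 < len(d.text) && d.text[i+1] == '\n' {
					i++
				}
				d.lineMap = append(d.lineMap, i+1)
			case c == '\n':
				d.lineMap = append(d.lineMap, i+1)
			case c >= utf8.RuneSelf:
				d.asciiOnly = false
			}
		}
	})
	return d.lineMap, d.asciiOnly
}

func main() {
	d := &document{text: "a\r\nb\nc"}
	starts, ascii := d.LineMap()
	fmt.Println(starts, ascii) // [0 3 5] true
}
```

As noted in the reply above, the flag is only known after the whole map has been built, so any saving would come from sharing one slice between the two maps rather than from skipping work.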

internal/lsp/server.go

andrewbranch · 2025-03-18T18:46:55Z

Yeah, so I am clearly doing the LSP thing here, and I can't find a reason why we shouldn't just do that regardless for everything else, honestly.

I feel like we can just standardize on the LSP-compatible line map as long as we don’t use it for parsing/grammar considerations around [no LineTerminator here], which I’m pretty sure we don’t...

IOW, we have to respect ECMAScript’s conception of a line terminator, but we don’t have to use it in our own reporting of line numbers, which exists to help humans find their code in an editor.

jakebailey · 2025-03-18T18:50:59Z

The line map is also used for source maps. I do genuinely wonder if source maps actually use JS's definition, though.

DanielRosenwasser · 2025-03-18T19:25:57Z

https://tc39.es/ecma426/2024/#extraction-javascript does split on ECMAScript line terminator code points, so it is somewhat implied.

jakebailey · 2025-03-18T19:36:51Z

Let lines be the result of strictly splitting source on ECMAScript line terminator code points.

Yay.

jakebailey added 4 commits March 17, 2025 13:03

Eliminate hardcoded RuneSelf-1

c63c800

Handle positionEncoding

2657a87

Refactor ComputeLineStartsSeq

e188dac

Support UTF-16 in LSP

b31158e

jakebailey requested a review from andrewbranch March 18, 2025 03:51

Readme

717e570

DanielRosenwasser reviewed Mar 18, 2025


jakebailey added 2 commits March 18, 2025 10:34

Daniel's refactor

8b80638

Merge branch 'main' into jabaile/utf16

5781851

andrewbranch approved these changes Mar 18, 2025


jakebailey added this pull request to the merge queue Mar 18, 2025

jakebailey mentioned this pull request Mar 18, 2025

Eliminate hardcoded RuneSelf-1 #653

Closed

Merged via the queue into main with commit 9c58c8b Mar 18, 2025
21 checks passed

jakebailey deleted the jabaile/utf16 branch March 18, 2025 22:27
