Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support UTF-16 in LSP #656

Merged
merged 7 commits into from
Mar 18, 2025
Merged

Support UTF-16 in LSP #656

merged 7 commits into from
Mar 18, 2025

Conversation

jakebailey
Copy link
Member

@jakebailey jakebailey commented Mar 18, 2025

Before:

image

After:

image

Includes #653

Per https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments

The eldritch horror is:

  • JS defines 4 different newline sequences
  • LSP defines 3 different newlines sequences
  • Editors like vim, git, only have one newline sequence
  • VS Code follows LSP (or vice versa), but you can only select LF or CRLF

@jakebailey jakebailey requested a review from andrewbranch March 18, 2025 03:51
@DanielRosenwasser
Copy link
Member

  • JS defines 4 different newline sequences

  • LSP defines 3 different newlines sequences

  • VS Code follows LSP (or vice versa), but you can only select LF or CRLF

Obligatory link to microsoft/TypeScript#38078

@jakebailey
Copy link
Member Author

Yeah, so I am clearly doing the LSP thing here, and I can't find a reason why we shouldn't just do that regardless for everything else, honestly.

character = position - start
} else {
// We need to rescan the text as UTF-16 to find the character offset.
for _, r := range scriptInfo.Text()[start:position] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't realize that range does "the right thing" over a string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's actually faster than manually calling the utf8 lib, IIRC.

func positionToLineAndCharacter(scriptInfo *project.ScriptInfo, position core.TextPos) lsproto.Position {
// UTF-8 offset to UTF-8/16 0-indexed line and character

lineMap := scriptInfo.LineMapLSP()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this ever get cached anywhere? Is the work being done on every call?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, scriptInfo caches this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the source file's line map to be the same if the file contains only ASCII? That way the two will be deduplicated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or possibly just track whether a non CR/LF line ending was encountered here and do it then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By that point we'll have constructed the whole thing, so I'm not totally sure if it's helpful, but I guess we could save a little memory sometimes... If we've even requested the line map for other reasons, which we might not have at all.

@andrewbranch
Copy link
Member

andrewbranch commented Mar 18, 2025

Yeah, so I am clearly doing the LSP thing here, and I can't find a reason why we shouldn't just do that regardless for everything else, honestly.

I feel like we can just standardize on the LSP-compatible line map as long as we don’t use it for parsing/grammar considerations around [no LineTerminator here], which I’m pretty sure we don’t...

IOW, we have to respect ECMAScript’s conception of a line terminator, but we don’t have to use it in our own reporting of line numbers, which exists to help humans find their code in an editor.

@jakebailey
Copy link
Member Author

The line map is also used for source maps. I do genuinely wonder if source maps actually use JS's definition, though.

@DanielRosenwasser
Copy link
Member

https://tc39.es/ecma426/2024/#extraction-javascript does split across ECMAScript code points, so it is a bit implied.

@jakebailey
Copy link
Member Author

Let lines be the result of strictly splitting source on ECMAScript line terminator code points.

Yay.

@jakebailey jakebailey added this pull request to the merge queue Mar 18, 2025
Merged via the queue into main with commit 9c58c8b Mar 18, 2025
21 checks passed
@jakebailey jakebailey deleted the jabaile/utf16 branch March 18, 2025 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants