Reuse algorithms from the Encoding Standard #3

Open
SimonSapin opened this issue Oct 3, 2014 · 4 comments

Comments

@SimonSapin
Owner

The decoding and encoding algorithms are very similar to those of https://encoding.spec.whatwg.org/ . It would be nice to be able to reuse (part of) them instead of duplicating the definitions. However, this would require the Encoding Standard to provide some hooks. @annevk, are you interested in doing this? What’s needed includes:

  • Decode to / encode from code points rather than just Unicode scalar values. (Or maybe the encoders and decoders are already defined this way? It’s not obvious from a quick look at the spec.)
  • Define UTF-16 in terms of code units, separately from the byte serialization (big-endian or little-endian) of code units. In Unicode terms: define the Encoding Form, not just the Encoding Scheme.
  • Hooks to tweak how the algorithms deal with surrogates. (In particular, when decoding WTF-8, rejecting an encoded surrogate pair while still emitting the right number of U+FFFD’s is a bit tricky.)
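
To make the first and third points concrete, here is a minimal Python sketch of the difference between code points and Unicode scalar values, and of a generalized UTF-8 encoder that accepts any code point, surrogates included. The function names are hypothetical and are not taken from either spec; the bit layout is the ordinary UTF-8 one.

```python
def is_surrogate(cp: int) -> bool:
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_scalar_value(cp: int) -> bool:
    """A Unicode scalar value is any code point that is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)


def encode_code_point(cp: int) -> bytes:
    """Encode one code point with the UTF-8 bit layout.

    Unlike a strict UTF-8 encoder, this hypothetical helper does not
    reject surrogates, which is the kind of hook discussed above.
    """
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For scalar values the output matches regular UTF-8; only surrogates produce byte sequences that a strict UTF-8 decoder would reject.
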
@annevk

annevk commented Oct 9, 2014

What's the benefit of an Encoding Form? I would accept a bug report on this, although it's a bit unclear if we want this. Would Servo put wtf-8 in the same library or its own little legacy space?

@SimonSapin
Owner Author

Encoding Form: JS and Windows keep code units in memory as native-endian 16-bit integers (uint16_t) rather than as pairs of bytes in big-endian or little-endian order, so you never deal with bytes when using such data.
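
The Encoding Form versus Encoding Scheme distinction can be sketched in Python as two separate steps: code points to 16-bit code units, then code units to bytes in a chosen byte order. Both function names are hypothetical, chosen only for illustration.

```python
def to_code_units(text: str) -> list[int]:
    """Encoding Form: map code points to 16-bit code units (no bytes yet)."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points, lone surrogates included, are one unit each.
            units.append(cp)
        else:
            # Supplementary code points become a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units


def serialize(units: list[int], byteorder: str) -> bytes:
    """Encoding Scheme: write each code unit as two bytes.

    byteorder is "big" or "little", as accepted by int.to_bytes.
    """
    return b"".join(unit.to_bytes(2, byteorder) for unit in units)
```

Code that works on the list returned by to_code_units never sees bytes; byte order only matters once serialize is called.
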

In the same library as what?

Anyway, I wrote this in response to “why is the WTF-8 spec not based on the Encoding standard?”, but it’s less pressing now (as of c6a271d) that I’ve simplified WTF-8 decoding to assume well-formedness, and not define it from arbitrary bytes. (If you’re decoding WTF-8 from arbitrary bytes such as from the network, you’re probably doing it wrong.)
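
As an illustration of why assuming well-formedness simplifies things, here is a minimal Python sketch of a decoder that trusts its input. It is not the algorithm text from the WTF-8 spec; decode_well_formed is a hypothetical name, and the function does no validation and never emits U+FFFD.

```python
def decode_well_formed(data: bytes) -> list[int]:
    """Decode bytes assumed to be well-formed into code points.

    No error handling: the lead byte alone decides the sequence length,
    and continuation bytes are trusted. Surrogates come out as-is.
    """
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            cp, length = lead, 1
        elif lead < 0xE0:
            cp, length = lead & 0x1F, 2
        elif lead < 0xF0:
            cp, length = lead & 0x0F, 3
        else:
            cp, length = lead & 0x07, 4
        # Fold in the low six bits of each continuation byte.
        for cont in data[i + 1:i + length]:
            cp = (cp << 6) | (cont & 0x3F)
        code_points.append(cp)
        i += length
    return code_points
```

Feeding this function arbitrary bytes gives meaningless results rather than errors, which is why it only makes sense for data already known to be well-formed.
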

If you think de-duplicating is not valuable enough to bother, I’ll just close this.

@annevk

annevk commented Oct 9, 2014

The only thing I'd think is valuable is to link to the Encoding Standard for utf-8/utf-16, scalar value/code point. So that everyone is aware we follow that document for those terms.

@annevk

annevk commented Oct 9, 2014

I would be okay with investigating de-duplication as well.


2 participants