Reuse algorithms from the Encoding Standard #3

SimonSapin · 2014-10-03T14:12:26Z

The decoding and encoding algorithms are very similar to those of https://encoding.spec.whatwg.org/. It would be nice to be able to reuse (part of) them instead of duplicating the definitions. However, this would require Encoding to provide some hooks. @annevk, are you interested in doing this? What’s needed includes:

- Decode to / encode from code points rather than just Unicode scalar values. (Or maybe the encoders and decoders are already defined this way? It’s not obvious from a quick look at the spec.)
- Define UTF-16 in terms of code units, separately from the byte serialization (big-endian or little-endian) of code units. In Unicode terms: define the Encoding Form, not just the Encoding Scheme.
- Hooks to tweak how the algorithms deal with surrogates. (In particular, excluding a surrogate pair when decoding WTF-8 is a bit tricky to get the right number of U+FFFD’s.)
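
To illustrate the first point, here is a minimal Python sketch of the difference between code points and Unicode scalar values. The helper names `is_scalar_value` and `encode_code_point` are hypothetical and come from neither spec. The second function applies the usual UTF-8 bit layout to any code point, surrogates included, which is roughly what an encoder defined over code points would have to allow.

```python
def is_scalar_value(cp: int) -> bool:
    """A Unicode scalar value is any code point except a surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def encode_code_point(cp: int) -> bytes:
    """Encode one code point with the UTF-8 bit layout.

    Hypothetical helper: unlike a strict UTF-8 encoder, it does not
    reject surrogate code points (U+D800 to U+DFFF).
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a code point")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For scalar values the output matches Python's built-in UTF-8 encoder; for a lone surrogate such as U+D800 it yields a three-byte sequence that strict UTF-8 forbids.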

annevk · 2014-10-09T14:54:56Z

What's the benefit of an Encoding Form? I would accept a bug report on this, although it's a bit unclear if we want this. Would Servo put wtf-8 in the same library or its own little legacy space?

SimonSapin · 2014-10-09T16:14:02Z

Encoding Form: JS and Windows keep code units in memory as 16-bit integers (uint16_t) in "native endian" rather than as pairs of bytes in either big-endian or little-endian, so you never deal with bytes when using such data.
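
The distinction can be sketched in Python as two separate steps. The function names `to_code_units` and `serialize` are hypothetical and used only for illustration: the first produces the Encoding Form (a sequence of 16-bit code units, with no byte order involved), and the second produces an Encoding Scheme by picking a byte order.

```python
import struct


def to_code_units(s: str) -> list[int]:
    """Encoding Form: turn code points into 16-bit code units."""
    units = []
    for ch in s:
        cp = ord(ch)
        if cp >= 0x10000:
            # Supplementary code point: split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
        else:
            # BMP code point, lone surrogates included, is one code unit.
            units.append(cp)
    return units


def serialize(units: list[int], big_endian: bool) -> bytes:
    """Encoding Scheme: serialize code units to bytes in a chosen byte order."""
    order = ">" if big_endian else "<"
    return struct.pack(f"{order}{len(units)}H", *units)
```

Code that only ever holds code units in memory needs the first step alone; byte order matters only once the data is written out as bytes.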

In the same library as what?

Anyway, I wrote this in response to “why is the WTF-8 spec not based on the Encoding standard?”, but it’s less pressing now (as of c6a271d) that I’ve simplified WTF-8 decoding to assume well-formedness, and not define it from arbitrary bytes. (If you’re decoding WTF-8 from arbitrary bytes such as from the network, you’re probably doing it wrong.)

If you think de-duplicating is not valuable enough to bother, I’ll just close this.

annevk · 2014-10-09T17:00:11Z

The only thing I'd think is valuable is to link to the Encoding Standard for utf-8/utf-16, scalar value/code point. So that everyone is aware we follow that document for those terms.

annevk · 2014-10-09T17:00:50Z

I would be okay with investigating de-duplication as well.
