IDNA: add a couple interesting ToASCII cases #37907

annevk · 2023-01-12T16:22:24Z

Identified in whatwg/url#341 by karwa.

annevk · 2023-01-12T16:22:34Z

mikewest

I'm happy to trust you that these are the correct serializations, but reading through the issue that wasn't actually clear to me. :/

annevk · 2023-01-12T18:07:10Z

I typically use browsers or something like https://mothereff.in/punycode. And only if they end up disagreeing I'll dive into the relevant specifications.

domenic · 2023-01-13T01:47:39Z

These tests match browsers, but I guess they don't match the URL Standard, right? (Since the URL Standard depends on UTS 46 which disallows them.) So strictly spec-conformant implementations like jsdom/whatwg-url will now fail the test suite?

karwa · 2023-01-13T04:21:22Z

These match the current URL standard, and UTS46.

`xn--ls8h=`

This should fail because in step 4 of compatibility processing, we must decode Punycode labels in to Unicode scalars. This is invalid Punycode, so it should fail.

From RFC-3492, 6.2: Decoding Procedure

for <...>
consume a code point, or fail if there was none to consume
let digit = the code point's digit-value, fail if it has none <---

In the reference implementation, this surfaces as:

/* decode_digit(cp) returns the numeric value of a basic code */
/* point (for use in representing integers) in the range 0 to */
/* base-1, or base if cp is does not represent a value.       */

static punycode_uint decode_digit(punycode_uint cp)
{
  return  cp - 48 < 10 ? cp - 22 :  cp - 65 < 26 ? cp - 65 :
          cp - 97 < 26 ? cp - 97 :  base;
}

enum punycode_status punycode_decode(
  punycode_uint input_length,
  const char input[],
  punycode_uint *output_length,
  punycode_uint output[],
  unsigned char case_flags[] )
{
  [...]
  for (...) {
     if (in >= input_length) return punycode_bad_input;
     digit = decode_digit(input[in++]);
     if (digit >= base) return punycode_bad_input;     // <----
  [...]
}

The = sign is not an ASCII alphanumeric, so its code-point does not have an assigned digit value, and decoding should fail.

JSDOM seems to allow it. It also allows other invalid characters such as in: https://xn--ls8h<=>[^]/.

Chrome seems to not even recognise/decode Punycode at all, or doesn't care if it isn't valid, so it ends up percent-escaping the invalid characters 🤷‍♂️. The standard says all of these should fail.

https://xn--ls8h=/ => https://xn--ls8h%3D/
https://xn--ls8h$<=>*() => https://xn--ls8h%24%3C%3D%3E%2A%28%29/
https://xn--z/ => https://xn--z/

`http://≠/` and friends

These test that the URL parser uses ToAscii with UseStd3AsciiRules = false. In the URL standard, the value of UseStd3AsciiRules is set by the parameter beStrict, and in the parser that is always false.

JSDOM appears to do the correct thing here, from what I can see on he live viewer.

The interesting thing about these inputs is that Unicode only produces conformance test files for UseStd3AsciiRules = true. This is documented in UTS46. So an implementation may erroneously set this parameter true in the parser, and still pass the UTS46 conformance tests.

The situation with the Unicode tests seems weird - surely URLs are the main consumer of UTS46, so why do they not produce test files which allow for the flags we actually use to be tested? The test files include other flags which indicate that a test should fail only if an implementation does such and such, but there is no such indication for UseStd3AsciiRules.

Working around this isn't easy for implementations - we need to detect when inputs fail due to a disallowed code-point, then do further checks to figure out if all disallowed code-points would have actually been allowed if UseStd3AsciiRules = false.

karwa · 2023-01-13T05:09:24Z

I did a bit of digging, and I think I found the package which JSDOM uses for Punycode decoding. I sent a PR which should fix the xn--ls8h= issue.

mathiasbynens/punycode.js#124

annevk · 2023-01-13T06:25:16Z

Thanks for going above and beyond @karwa!

These should indeed match the current specifications. I did file whatwg/url#733 to consider changing the URL Standard, but I'm not entirely convinced it's needed.

TimothyGu · 2023-02-23T19:06:01Z

https://xn--ls8h=/ => https://xn--ls8h%3D/

https://xn--ls8h$<=>*() => https://xn--ls8h%24%3C%3D%3E%2A%28%29/

https://xn--z/ => https://xn--z/

For these cases in Chrome, it's the result of an ASCII fast path that has previously been discussed elsewhere (whatwg/url#438). Not sure about xn--z, but I think forbidding the percent encoding cases might be feasible in Chrome.

IDNA: add a couple interesting ToASCII cases

ac58dc2

Identified in whatwg/url#341 by karwa.

wpt-pr-bot added the url label Jan 12, 2023

wpt-pr-bot assigned mikewest Jan 12, 2023

wpt-pr-bot requested review from domenic, GPHemsley, mikewest and TimothyGu January 12, 2023 16:22

This was referenced Jan 12, 2023

Should we forbid U+226E (≮) and U+226F (≯) in hosts? whatwg/url#733

Closed

Issues with UTS #46 tests whatwg/url#341

Closed

mikewest approved these changes Jan 12, 2023

View reviewed changes

annevk merged commit 3d997a3 into master Jan 12, 2023

annevk deleted the annevk/moar-idna branch January 12, 2023 18:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IDNA: add a couple interesting ToASCII cases #37907

IDNA: add a couple interesting ToASCII cases #37907

annevk commented Jan 12, 2023

annevk commented Jan 12, 2023

mikewest left a comment

annevk commented Jan 12, 2023

domenic commented Jan 13, 2023

karwa commented Jan 13, 2023

karwa commented Jan 13, 2023

annevk commented Jan 13, 2023

TimothyGu commented Feb 23, 2023

IDNA: add a couple interesting ToASCII cases #37907

IDNA: add a couple interesting ToASCII cases #37907

Conversation

annevk commented Jan 12, 2023

annevk commented Jan 12, 2023

mikewest left a comment

Choose a reason for hiding this comment

annevk commented Jan 12, 2023

domenic commented Jan 13, 2023

karwa commented Jan 13, 2023

xn--ls8h=

http://≠/ and friends

karwa commented Jan 13, 2023

annevk commented Jan 13, 2023

TimothyGu commented Feb 23, 2023

`xn--ls8h=`

`http://≠/` and friends