Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IDNA: add a couple interesting ToASCII cases #37907

Merged
merged 1 commit into from
Jan 12, 2023
Merged

Conversation

annevk
Copy link
Member

@annevk annevk commented Jan 12, 2023

Identified in whatwg/url#341 by karwa.

@annevk
Copy link
Member Author

annevk commented Jan 12, 2023

cc @karwa @domenic

Copy link
Member

@mikewest mikewest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to trust you that these are the correct serializations, but reading through the issue that wasn't actually clear to me. :/

@annevk annevk merged commit 3d997a3 into master Jan 12, 2023
@annevk annevk deleted the annevk/moar-idna branch January 12, 2023 18:05
@annevk
Copy link
Member Author

annevk commented Jan 12, 2023

I typically use browsers or something like https://mothereff.in/punycode. And only if they end up disagreeing I'll dive into the relevant specifications.

@domenic
Copy link
Member

domenic commented Jan 13, 2023

These tests match browsers, but I guess they don't match the URL Standard, right? (Since the URL Standard depends on UTS 46 which disallows them.) So strictly spec-conformant implementations like jsdom/whatwg-url will now fail the test suite?

@karwa
Copy link
Contributor

karwa commented Jan 13, 2023

These match the current URL standard, and UTS46.

xn--ls8h=

This should fail because in step 4 of compatibility processing, we must decode Punycode labels in to Unicode scalars. This is invalid Punycode, so it should fail.

From RFC-3492, 6.2: Decoding Procedure

for <...>
consume a code point, or fail if there was none to consume
let digit = the code point's digit-value, fail if it has none <---

In the reference implementation, this surfaces as:

/* decode_digit(cp) returns the numeric value of a basic code */
/* point (for use in representing integers) in the range 0 to */
/* base-1, or base if cp is does not represent a value.       */

static punycode_uint decode_digit(punycode_uint cp)
{
  return  cp - 48 < 10 ? cp - 22 :  cp - 65 < 26 ? cp - 65 :
          cp - 97 < 26 ? cp - 97 :  base;
}
enum punycode_status punycode_decode(
  punycode_uint input_length,
  const char input[],
  punycode_uint *output_length,
  punycode_uint output[],
  unsigned char case_flags[] )
{
  [...]
  for (...) {
     if (in >= input_length) return punycode_bad_input;
     digit = decode_digit(input[in++]);
     if (digit >= base) return punycode_bad_input;     // <----
  [...]
}

The = sign is not an ASCII alphanumeric, so its code-point does not have an assigned digit value, and decoding should fail.

JSDOM seems to allow it. It also allows other invalid characters such as in: https://xn--ls8h<=>[^]/.

Chrome seems to not even recognise/decode Punycode at all, or doesn't care if it isn't valid, so it ends up percent-escaping the invalid characters 🤷‍♂️. The standard says all of these should fail.

  • https://xn--ls8h=/ => https://xn--ls8h%3D/
  • https://xn--ls8h$<=>*() => https://xn--ls8h%24%3C%3D%3E%2A%28%29/
  • https://xn--z/ => https://xn--z/

http://≠/ and friends

These test that the URL parser uses ToAscii with UseStd3AsciiRules = false. In the URL standard, the value of UseStd3AsciiRules is set by the parameter beStrict, and in the parser that is always false.

JSDOM appears to do the correct thing here, from what I can see on he live viewer.

The interesting thing about these inputs is that Unicode only produces conformance test files for UseStd3AsciiRules = true. This is documented in UTS46. So an implementation may erroneously set this parameter true in the parser, and still pass the UTS46 conformance tests.

The situation with the Unicode tests seems weird - surely URLs are the main consumer of UTS46, so why do they not produce test files which allow for the flags we actually use to be tested? The test files include other flags which indicate that a test should fail only if an implementation does such and such, but there is no such indication for UseStd3AsciiRules.

Working around this isn't easy for implementations - we need to detect when inputs fail due to a disallowed code-point, then do further checks to figure out if all disallowed code-points would have actually been allowed if UseStd3AsciiRules = false.

@karwa
Copy link
Contributor

karwa commented Jan 13, 2023

I did a bit of digging, and I think I found the package which JSDOM uses for Punycode decoding. I sent a PR which should fix the xn--ls8h= issue.

mathiasbynens/punycode.js#124

@annevk
Copy link
Member Author

annevk commented Jan 13, 2023

Thanks for going above and beyond @karwa!

These should indeed match the current specifications. I did file whatwg/url#733 to consider changing the URL Standard, but I'm not entirely convinced it's needed.

@TimothyGu
Copy link
Member

  • https://xn--ls8h=/ => https://xn--ls8h%3D/
  • https://xn--ls8h$<=>*() => https://xn--ls8h%24%3C%3D%3E%2A%28%29/
  • https://xn--z/ => https://xn--z/

For these cases in Chrome, it's the result of an ASCII fast path that has previously been discussed elsewhere (whatwg/url#438). Not sure about xn--z, but I think forbidding the percent encoding cases might be feasible in Chrome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants