Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge characterClassEscape and dot type? #68

Open
jviereck opened this issue Sep 5, 2014 · 0 comments
Open

Merge characterClassEscape and dot type? #68

jviereck opened this issue Sep 5, 2014 · 0 comments

Comments

@jviereck
Copy link
Owner

jviereck commented Sep 5, 2014

This came to my mind when preparing the presentation for Amsterdam.JS:

Currently there is a special type="dot" for things like /./. Per see there is nothing wrong with this, but the type feels very similar to type="characterClassEscape ". How do you feel to merge the type characterClassEscape and dot? Maybe into specialCharacterClass?

Or, alternative idea: similar to how different types got merged into type=value, merge dot, characterClassEscape and the existing characterClass into characterClass and add a new kind entry? I like this, as it not only gets away with the type dot, but also with the type characterClassEscape, which sounds similar to characterClass, but is still completly different although similar. Like:

{
  type: "characterClass",
  kind: "range",
  body: [ { type: "characterClassRange", ...} ]
}

{
  type: "characterClass",
  kind: "singleChar",
  char: "d"
  // The body is the not needed here
  // body: [ ]   
}

This looks interesting to me, but I dislike the inconsistency by using body in one case and char in the other one to encode the "meaning" of the characterClass. In the case of value, all the different kinds have a codePoint entry. A possible way to achieve a similar feeling of consistency here could be to store on the body of the type: "characterClass in the case of the kind: "singleChar" the actual ranges that are matched. E.g. in the case of /\d/:

{
  type: "characterClass",
  kind: "singleChar",
  body: [ {type: "characterClassRange", from: 48, to: 57} ],   
  raw: "\d"
}

Looks nice, but encoding /\s/ this way will result in a very large body :/ Here are the two functions used in RegExp.JS to test for a /\s/ string:

function isWhiteSpace(ch) {
    return (ch === 32) ||  // space
        (ch === 9) ||      // tab
        (ch === 0xB) ||
        (ch === 0xC) ||
        (ch === 0xA0) ||
        (ch >= 0x1680 && '\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\uFEFF'.indexOf(String.fromCharCode(ch)) > 0);
}

// 7.3 Line Terminators

function isLineTerminator(ch) {
    return (ch === 10) || (ch === 13) || (ch === 0x2028) || (ch === 0x2029);
}

Personally, I am not sure if the consistency is worth the larger AST output here.

So, maybe go with specialCharacterClass and characterClass? Any thoughts? Or do you think merging dot into a different type is not worth the efford and this issue should be closed right away ;)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant