Using () to delimit objects breaks auto-url-detectors #16
Comments
I could implement this in a v2 but the problem is that a string produced by a v2 will fail to parse with a v1 parser. So far I have resisted making changes because I did not want to break protocols that use jsurl. |
Well, I respect that, but you can always call the encoding jsurl2 and make
it clear there is no compatibility except in spirit…
My usage so far was to encode data for consumption by the same application,
and I would guess that that is the major use case…
…On Tue, Mar 28, 2017, 9:45 PM Bruno Jouhier ***@***.***> wrote:
I could implement this in a v2 but the problem is that a string produced
by a v2 will fail to parse with a v1 parser. So far I have resisted making
changes because I did not want to break protocols that use jsurl.
|
Our situation is different because our app has several components that interact with jsurl and it is more difficult to move them all at once (especially as our components are deployed on-premise). So we need to preserve interop. But I'm not opposed to fixing the issues with a v2. We should solve all the pending issues at once (encoded quote and trailing ~) so that we don't have to move again later. |
So, how about changing the initial character for jsurl2? That way, you can
parse ~ starting strings as v1 and = (or whatever) as v2
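A minimal sketch of that dispatch, assuming v1 strings always start with `~` and the proposed v2 marker is `=` (the "or whatever" above, so the marker is a placeholder). `detectVersion` is a hypothetical helper, not part of jsurl.

```javascript
// Hypothetical helper: pick a parser version from the first character.
// Assumes v1 output starts with "~" and v2 output starts with "=".
function detectVersion(encoded) {
  if (encoded.startsWith('~')) return 1;
  if (encoded.startsWith('=')) return 2;
  return null; // unknown format
}

console.log(detectVersion("~(a~'test)")); // 1
console.log(detectVersion('=/name~John/')); // 2
```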
For the (), I realized that as you descend into a JS value, there are only
a few possibilities, so if you drop some robustness, you can use any valid
character to delimit blocks.
Furthermore, while parsing the inside of a block, you only need 2
characters: one to stop the block and one to go deeper. Normally these are
) and (, but they could also change on every level. So you could delimit
the first block with / (will be part of the url even at end) and then
alternating with | and / (for example):
=/name~"John*20Doe~age~42~children~|~"Mary~"Bill|/
In fact, at each split point of the JSON structures at http://www.json.org/,
you can use a different set of encoding characters. The example could also
be e.g.
=/name~John*_Doe~age~42~children*Mary~Bill~*/, or even
=/!0~John*_Doe~!1~42~!2*Mary~!3~*/ (with pre-shared dictionary):
- /, | and * start objects/arrays depending on level (rotate the set on
every level, note that * is not needed for escaping here)
- " or any a-zA-Z start a string. " is only needed if a string does not
start with alpha
- -, 0-9 and . start a number, so a decimal can be .5
- !/, !|, !* can be true, false and null. That leaves lots of address
space in ! to refer to a pre-shared dictionary. Keys starting with ! could
also refer to that dictionary.
- inside properties and strings, *_ encodes a space. all of *x is
available if *XX requires uppercase. E.g. *! *~ */ *| **
That should make for shorter encodings that still are fairly readable.
For robustness, a short fixed-size checksum could be added to the end, e.g.
2 characters taking the sum of all character values plus the string length,
modulo 64^2, base64 (is that url safe?)
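On the URL-safety question: the standard base64 alphabet contains `+` and `/`, which are not URL safe, while the "base64url" alphabet of RFC 4648 swaps them for `-` and `_`. A rough sketch of the checksum described above, using that alphabet; the function names are made up for illustration.

```javascript
// URL-safe base64 alphabet (RFC 4648 section 5): "-" and "_" instead of "+" and "/".
const ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_';

// Sum of all character codes plus the string length, modulo 64^2,
// written as two base64url digits.
function checksum(s) {
  let sum = s.length;
  for (let i = 0; i < s.length; i++) sum += s.charCodeAt(i);
  const n = sum % 4096; // 64^2
  return ALPHABET[n >> 6] + ALPHABET[n & 63];
}

// Append the checksum, or validate and strip it again.
function addChecksum(s) {
  return s + checksum(s);
}

function stripChecksum(s) {
  const body = s.slice(0, -2);
  return checksum(body) === s.slice(-2) ? body : null;
}

console.log(checksum('a')); // "Bi" (97 + 1 = 98 = 1 * 64 + 34)
```

A sum-based checksum like this catches truncation and most single-character damage, but not reordered characters.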
…On Wed, Mar 29, 2017 at 11:03 AM Bruno Jouhier ***@***.***> wrote:
Our situation is different because our app has several components that
interact with jsurl and it is more difficult to move them all at once
(especially as our components are deployed on-premise). So we need to
preserve interop.
But I'm not opposed to fixing the issues with a v2. We should solve all
the pending issues at once (encoded quote and trailing ~) so that we don't
have to move again later.
|
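I was thinking about less invasive changes. I would like to keep the parentheses. If we add a ~ at the end, do we still have a problem with parentheses? I want the encoded string to be unaltered by encodeURIComponent (this was a strong requirement for v1). This limits the character set to ascii alpha + ascii digits + - _ . ! ~ * ' ( ) (uriUnescaped in https://www.ecma-international.org/ecma-262/5.1/#sec-15.1.3) and I would restrict even further, and eliminate '. This rules out characters like = / |. So I'm proposing the following changes:
- add a ~ at the end, to keep the auto-url-detectors happy. This trailing char can also be used to distinguish between v1 and v2
- replace ' by !, to avoid browser encoding.
- maybe a few special * escapes. I like *_ for space, maybe *- for $ (frequent in object keys because it is valid in js identifiers) but I would not go much further because gain is small and result quickly becomes cryptic.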
|
~ at the end is good, but then ~ at the beginning is no longer needed.
I thought some more about it, and I think we can encode using only the
unreserved characters of section https://www.ietf.org/rfc/rfc3986.txt,
so ALPHA / DIGIT / "-" / "." / "_" / "~".
Here are the rules:
- all values terminate with ~
- true, false, null become -T~, -F~, -N~
- numbers start with - (+ digit) or a digit and end with ~
- strings start with alpha or * (the only extra non-unreserved character
we use) and terminate with ~
- strings internally get space replaced by _ (common and very
readable), * by **, _ by *_, ~ by *-, % by *. and any others we like
- I don't think we need *XX and *XXXX encoding, that will be done by
uriencoding whenever actually needed. Lots of common characters can be
replaced by *+single char
- Empty string is *~
- objects start with _, arrays start with ., both terminate with ~.
- object keys are encoded as strings, so no starting * needed, only *
escaping is done
- [1, 2] becomes .1~2~~
- {"a": "fo%o", "_test": "_hm*h~m", "5": [1, true]} becomes
_a~fo*.o~*_test~**_hm**h*-m~5~.1~-T~~~
This way, the ending ~ doubles as the value terminator. Any value can be
extracted by reading until the next ~. As a bonus, no value starts with ~, so a leading ~ distinguishes v1 strings.
* is not actually 100% needed if we want to stay pure, . or - could serve
as the escape characters with some adjustments
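The rules in this comment can be sketched as a small encoder. This is only an illustration of the proposal as written here, not the code that ended up in jsurl2; `encodeDraft` and `escapeText` are invented names, and numbers are emitted with `String()`, which is an assumption (very large values would produce a `+`).

```javascript
// Escapes from the rules above: space -> _, * -> **, _ -> *_, ~ -> *-, % -> *.
const ESCAPES = {' ': '_', '*': '**', '_': '*_', '~': '*-', '%': '*.'};

function escapeText(s) {
  return s.replace(/[ *_~%]/g, (c) => ESCAPES[c]);
}

function encodeDraft(value) {
  if (value === true) return '-T~';
  if (value === false) return '-F~';
  if (value === null) return '-N~';
  if (typeof value === 'number') return String(value) + '~';
  if (typeof value === 'string') {
    // Strings need the * marker unless they start with a letter.
    const marker = /^[A-Za-z]/.test(value) ? '' : '*';
    return marker + escapeText(value) + '~';
  }
  if (Array.isArray(value)) {
    return '.' + value.map(encodeDraft).join('') + '~';
  }
  if (typeof value === 'object') {
    let out = '_';
    for (const key of Object.keys(value)) {
      // Keys are always strings, so they get escaping but no marker.
      out += escapeText(key) + '~' + encodeDraft(value[key]);
    }
    return out + '~';
  }
  throw new TypeError('unsupported value');
}

console.log(encodeDraft([1, 2])); // .1~2~~
console.log(encodeDraft({a: 'fo%o', _test: '_hm*h~m'})); // _a~fo*.o~*_test~**_hm**h*-m~~
```

One caveat when reproducing the full example from the comment: JavaScript enumerates integer-like keys such as "5" before other keys, so this sketch would emit the `5~...` entry first.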
…On Wed, Mar 29, 2017 at 7:06 PM Bruno Jouhier ***@***.***> wrote:
I was thinking about less invasive changes. I would like to keep the
parentheses. If we add a ~ at the end, do we still have a problem with
parentheses?
I want the encoded string to be unaltered by encodeURIComponent (this was
a *strong* requirement for v1). This limits the character set to ascii
alpha + ascii digits + - _ . ! ~ * ' ( ) (*uriUnescaped* in
https://www.ecma-international.org/ecma-262/5.1/#sec-15.1.3) and I would
restrict even further, and eliminate '. This rules out characters like =
/ |.
So I'm proposing the following changes:
- add a ~ at the end, to keep the auto-url-detectors happy. This
trailing char can also be used to distinguish between v1 and v2
- replace ' by !, to avoid browser encoding.
- maybe a few special * escapes. I like *_ for space, maybe *- for $
(frequent in object keys because it is valid in js identifiers) but I would
not go much further because gain is small and result quickly becomes
cryptic.
|
one more optimization: change repeating final ~ to a single ~, and to grab a value search until ~ or end of string. Then the standard example becomes |
Lots of good ideas here but I want to understand why you want to get rid of parentheses. Lots of URLs have parentheses, and parentheses are a good visual clue for nested substructures. |
They are not guaranteed to be left alone, and by making ~ the terminator
for everything, parsing is faster…
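To illustrate why a single terminator keeps the parser simple, here is a sketch of a decoder for the draft rules discussed earlier in the thread (values end with `~`, arrays start with `.`, objects with `_`, `-T`/`-F`/`-N` for true/false/null, `*` as string marker and escape character). `decodeDraft` is a made-up name, and the sketch ignores edge cases such as empty object keys.

```javascript
// Reverse of the draft escapes: ** -> *, *_ -> _, *- -> ~, *. -> %, bare _ -> space.
const UNESCAPES = {'*': '*', '_': '_', '-': '~', '.': '%'};

function unescapeText(s) {
  return s.replace(/\*(.)|_/g, (m, c) =>
    c === undefined ? ' ' : (UNESCAPES[c] ?? c));
}

function decodeDraft(text) {
  let pos = 0;

  // Read up to the next ~ (or the end of the string) and step past it.
  function readToken() {
    const end = text.indexOf('~', pos);
    const stop = end === -1 ? text.length : end;
    const token = text.slice(pos, stop);
    pos = stop + 1;
    return token;
  }

  function parseValue() {
    const c = text[pos];
    if (c === '.') {
      pos++;
      const arr = [];
      while (pos < text.length && text[pos] !== '~') arr.push(parseValue());
      pos++; // closing ~
      return arr;
    }
    if (c === '_') {
      pos++;
      const obj = {};
      while (pos < text.length && text[pos] !== '~') {
        const key = unescapeText(readToken());
        obj[key] = parseValue();
      }
      pos++; // closing ~
      return obj;
    }
    const token = readToken();
    if (token === '-T') return true;
    if (token === '-F') return false;
    if (token === '-N') return null;
    if (token[0] === '*') return unescapeText(token.slice(1));
    if (/^[A-Za-z]/.test(token)) return unescapeText(token);
    return Number(token);
  }

  return parseValue();
}

console.log(decodeDraft('.1~2~~')); // [ 1, 2 ]
```

Because `readToken` also stops at the end of the string, the same sketch accepts input whose trailing `~` characters were shortened.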
…On Sat, Apr 1, 2017, 4:12 PM Bruno Jouhier ***@***.***> wrote:
Lots of good ideas here but I want to understand why you want to get rid
of parentheses. Lots of URLs have parentheses, and parentheses are a good
visual clue for nested substructures.
|
(plus auto-url detection works better with ~, and we save a few bytes at
the end of the string by merging ~s)
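The merging step itself is a one-liner. A sketch, assuming the encoded string ends in a run of `~` terminators; `mergeTrailingTildes` is a hypothetical name.

```javascript
// Collapse the run of ~ at the very end of an encoded string into a single ~.
function mergeTrailingTildes(encoded) {
  return encoded.replace(/~+$/, '~');
}

console.log(mergeTrailingTildes('.1~2~~')); // .1~2~
console.log(mergeTrailingTildes('..1~~~')); // ..1~
```

The saving is the length of the trailing run minus one, so it grows with the nesting depth of the last value. A parser then has to treat the end of the string as closing every open array or object.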
…On Sat, Apr 1, 2017, 4:16 PM Wout Mertens ***@***.***> wrote:
They are not guaranteed to be left alone, and by making ~ the terminator
for everything, parsing is faster…
|
More detailed comments:
|
When would parentheses get escaped? They are uriUnescaped (but |
There is a problem with strings starting with a number. How do you encode "0"? |
Well another reason for not using () is that you then need an extra char to
start an array and I wanted to minimize byte length. Plus, they are part of
the "reserved" set, and most of those get encoded anyway. (so is * but
replacing that with - or _ would make things uglier)
"0" becomes *0~.
…On Sat, Apr 1, 2017, 4:48 PM Bruno Jouhier ***@***.***> wrote:
There is a problem with strings starting with a number. How do you encode
"0"?
|
We could keep
|
Parentheses are not uriReserved, they are uriUnescaped. |
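This is easy to check in a JS console. The sketch below only uses the built-in `encodeURIComponent`; the helper name is invented.

```javascript
// True when encodeURIComponent would leave the string untouched.
function survivesEncodeURIComponent(s) {
  return encodeURIComponent(s) === s;
}

console.log(encodeURIComponent('()')); // ()
console.log(survivesEncodeURIComponent("~(a~'test)")); // true
console.log(survivesEncodeURIComponent('=/|')); // false
console.log(encodeURIComponent('=/|')); // %3D%2F%7C
```

So a v1 string passes through unchanged, while the `=`, `/` and `|` delimiters suggested earlier in the thread would be percent-encoded.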
So the code works by the fact that at the beginning of a value there are only a number of possible characters. All cases are in the if clauses as https://github.com/wmertens/jsurl/blob/4ffcdea624eb29070bd6c44510e438b46799e986/lib/jsurl2.js#L71 - I tried to optimize for stringified length. So strings only start with Parentheses are in section 2.2 "Reserved Characters" https://tools.ietf.org/html/rfc3986#section-2.2 - although wikipedia says that means they can be used. I must say, if I paste How about starting objects with |
I must say, I really like the As for the URI encoding, I was reasoning thusly:
|
Oh and *20~ is "20". If we do our own encoding still it would be **20~. *
is only escape inside string values.
…On Sat, Apr 1, 2017, 5:12 PM Bruno Jouhier ***@***.***> wrote:
Parentheses are not *uriReserved*, they are *uriUnescaped*.
|
And we could omit the leading ! for object keys if the key starts with alpha. |
That already happens, object keys are string context so they don't need a
string marker…
…On Sat, Apr 1, 2017, 5:42 PM Bruno Jouhier ***@***.***> wrote:
And we could omit the leading ! for object keys if the key starts with
alpha.
|
Point taken about generic URL RFC. I was referring to the specs for JS URL handling functions: https://www.ecma-international.org/ecma-262/5.1/#sec-15.1.3. I care most about the JS functions because that's what JS guys use to encode/decode. I like OK for leaving non-ASCII chars as is instead of encoding with I'd like to have the closing parenthesis at the end of objects too. The whole point is to trade a bit of compactness (one extra char at the end - wtf) for readability. Without it, it is very difficult to see where the object ends. I had misunderstood the leading * in strings. I thought that it was the start of an escape sequence. What about prefixing T, F and N by |
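Note: with this, a non empty object looks like (<...>~)~ and a non empty array like .<...>~~. So we have an unambiguous end marker for objects ()~) and arrays (~~). And then we could use _T, _F and _N because _ is not reserved for object start any more. |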
Right, and actually you can drop ~ before ), if strings cannot contain ).
Then ) is unambiguous and the initial parse split can split on ~ or ). So
then there is no byte cost, and the string end can replace all ) and ~ with
a single ~ still.
Actually I like !T etc, it doesn't read a
…On Sat, Apr 1, 2017, 6:07 PM Bruno Jouhier ***@***.***> wrote:
Note: with this, a non empty object looks like (<...>~)~ and a non empty
array like .<...>~~. So we have an unambiguous end marker for objects ()~)
and arrays (~~).
And then we could use _T, _F and _N because _ is not reserved for object
start any more.
|
Summary of revised proposal:
|
...as a string.
…On Sat, Apr 1, 2017, 6:26 PM Wout Mertens ***@***.***> wrote:
Right, and actually you can drop ~ before ), if strings cannot contain ).
Then ) is unambiguous and the initial parse split can split on ~ or ). So
then there is no byte cost, and the string end can replace all ) and ~ with
a single ~ still.
Actually I like !T etc, it doesn't read a
|
What about having arrays start with |
Also, the "force string start" char could be _. Then the final example
becomes (a~fo*.o~*_test~_*_hm**h*-m~5~.1~!T~
(sorry on mobile)
…On Sat, Apr 1, 2017, 6:26 PM Bruno Jouhier ***@***.***> wrote:
Summary of revised proposal:
- all values terminate with ~
- true, false, null become _T~, _F~, _N~
- numbers start with - (+ digit) or a digit and end with ~
- strings start with alpha or * (the only extra non-unreserved character we use) and terminate with ~
- strings internally get space replaced by _ (common and very readable), * by **, _ by *_, ~ by *-, % by *.
- I don't think we need *XX and *XXXX encoding, that will be done by uriencoding whenever actually needed.
- Empty string is *~
- objects start with ( and end with )~
- arrays start with . and end with ~
- object keys are encoded as strings, so no starting * needed, only * escaping is done *OK*
- [1, 2] becomes .1~2~~
- {"a": "fo%o", "_test": "_hm*h~m", "5": [1, true]} becomes (a~fo*.o~*_test~**_hm**h*-m~5~.1~_T~~)~
|
That can work, it would take the ~ special case for true but that's no
biggie
…On Sat, Apr 1, 2017, 6:32 PM Wout Mertens ***@***.***> wrote:
Also, the "force string start" char could be _. Then the final example
becomes (a~fo*.o~*_test~_*_hm**h*-m~5~.1~!T~
(sorry on mobile)
|
I too was thinking of dropping the ~ after ). Only gotcha is the url-auto-detector issue that started this whole thing 😄. |
No, it would drop ending ) too :)
…On Sat, Apr 1, 2017, 6:39 PM Bruno Jouhier ***@***.***> wrote:
I too was thinking of dropping the ~ after ). Only gotcha is the
url-auto-detector issue that started this whole thing 😄.
|
Summarizing one more time:
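Closing characters (~ and )) could be dropped at the very end? This would solve the original problem but then parentheses are unbalanced. |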
I just realized you can't use ~ to start an array because then you can't
have array-in-array - there would be no difference between start and stop.
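A concrete collision makes the problem visible. The sketch below encodes nested arrays of numbers with `~` as both the start and the end marker, following the `[1, 2]` becomes `~1~2~~` rule from the summary; `encodeTildeArray` is an illustrative name only.

```javascript
// Arrays open and close with ~, numbers are terminated by ~.
function encodeTildeArray(value) {
  if (Array.isArray(value)) {
    return '~' + value.map(encodeTildeArray).join('') + '~';
  }
  return String(value) + '~';
}

const first = encodeTildeArray([[1], []]);
const second = encodeTildeArray([[1, []]]);

console.log(first); // ~~1~~~~~
console.log(second); // ~~1~~~~~
console.log(first === second); // true: two different values, one encoding
```

After reading `1~`, a parser cannot tell whether the next `~` closes the current array or opens a nested one.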
…On Sat, Apr 1, 2017, 6:57 PM Bruno Jouhier ***@***.***> wrote:
Summarizing one more time:
Summary of revised proposal:
- all values terminate with ~
- true, false, null become _T~, _F~, _N~
- numbers start with - (+ digit) or a digit and end with ~
- strings start with alpha or * (the only extra non-unreserved character we use) and terminate with ~
- strings internally get space replaced by _ (common and very readable), * by **, _ by *_, ~ by *-, % by *.
- chars that need escaping are embedded *as is*. URI percent encoding will take care of them.
- empty string is *~
- objects start with ( and end with )
- arrays start with ~, and end with ~
- object keys are encoded as strings, so no starting * needed, only * escaping is done
- [1, 2] becomes ~1~2~~
- {"a": "fo%o", "_test": "_hm*h~m", "5": [1, true]} becomes (a~fo*.o~*_test~**_hm**h*-m~5~~1~_T~~)
Closing characters (~ and )) could be dropped at the very end? This would
solve the original problem but then parentheses are unbalanced.
|
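Good point. It also broke the test on leading ~ to distinguish v1 and v2. I find . a bit too difficult to spot visually. Why not start arrays with ! then? |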
Sure that is fine…
…On Sat, Apr 1, 2017, 7:41 PM Bruno Jouhier ***@***.***> wrote:
Good point. It also broke the test on leading ~ to distinguish v1 and v2.
I find . a bit too difficult to spot visually. Why not start arrays with !
then?
|
Getting there. Here it comes:
Regarding closing characters, the rule is a may. Examples: |
Alright, I implemented this, look at the tests to see the results. I had to also escape () to allow unambiguous parsing of ), which also allowed me to drop the last ~ in objects. |
I also made that shortening optional. I wonder if we should not leave a terminal ~ at all times, or maybe make that optional too. I like how an object with booleans now looks like |
Cool. I'll take a look but only tomorrow. Thanks. |
Well, this was fun. I'm extremely happy to report that on my test object in Chrome at least, v2 now outperforms native JSON for both parsing and stringifying 😁
|
If you embed a jsurl object result in a url as the last component, you get something like http://example.com/foo?q=~(a~'test), and if you paste that somewhere, there's a good chance that the url up to but not including the final ) is recognized. One option is adding a final ~; does that fix it?
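The behaviour can be imitated with a deliberately naive detector. Real auto-linkers differ from one another; the regular expression below is an assumption that only mimics the common habit of dropping trailing punctuation, including `)`, from a detected URL.

```javascript
// Hypothetical auto-linker: take everything up to whitespace, then back off
// until the last character is not typical trailing punctuation.
const NAIVE_URL = /https?:\/\/\S*[^\s.,;:!?)'"]/;

function detectUrl(text) {
  const match = text.match(NAIVE_URL);
  return match ? match[0] : null;
}

console.log(detectUrl("see http://example.com/foo?q=~(a~'test) now"));
// http://example.com/foo?q=~(a~'test
console.log(detectUrl("see http://example.com/foo?q=~(a~'test)~ now"));
// http://example.com/foo?q=~(a~'test)~
```

With this detector the closing parenthesis is lost in the first case, and the trailing `~` keeps the whole string intact in the second.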