Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposed spec changes for string literal aliases #1082

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

cuhtis
Copy link

@cuhtis cuhtis commented Feb 26, 2024

At Meta, our GraphQL clients for both web and mobile care a lot about local data consistency. Local data consistency is when the client is able to reconcile data from multiple GraphQL responses in a local client-side cache which can be subscribed to. When data changes, we issue updates to these subscribers — which allows us to keep various screens "in-sync" and show the same view of the data.

The canonical example of this on Facebook is if you navigate to the Groups page from a Feed post and join the group, then when you navigate back to Feed we should update the "Join Group" button in the post to "Joined".

Local data consistency is supported today by Relay for JS clients (widely adopted within industry) and by Meta's internal mobile GraphQL client for Android/iOS clients.

To support consistency, we need to be able to remap aliases/field names into their "canonical names". The canonical name is what we use to keep two fields consistent, as it represents the true definition of a field.

Given a field selection:

alias: field(arg: "foo")

the canonical name would be:

field(arg:"foo")

Internally, we've explored various ways of doing this. We currently need to embed a lot of information about the schema and query in our clients; for example Relay creates a "normalization AST" with this information, and our internal mobile client requires all the information to be pre-compiled into the app binary. This leads to additional costs, e.g. binary size bloat or wire size costs.

We've found that we can more efficiently deliver the canonical name information by embedding it into our response instead, which removes the need for pre-compiled metadata. We run a transform on our queries to add aliases for each field, and use the canonical name as the alias. However since the canonical name may contain non-supported characters (e.g. (, ), $), we are proposing a spec change to allow StringValue tokens as the alias.

This would allow syntax such as:

"field(arg:\"foo\")": field(arg: "foo")

We've attempted alternative ways of implementing this, such as doing this transform during server-side execution (a spec violation, and high server overhead!) or delivering encoded aliases to abide by the NameValue specification (client parsing overhead for decoding!).

Potential side effects / outcomes of this change:

  • Selection set conflict validation should still work out of the box, even if one selection is a string literal and the other is a normal NameValue.
  • Here we parse the StringValue into a NameNode (same as when parsing a NameValue), which doesn't have any requirements on the name itself as far as I can tell.
  • We allow string literal aliases here but not number literals, which abides by the JSON specification. According to the JSON spec, map/object keys must be strings.

At Meta, our GraphQL clients for both web and mobile care a lot about local data consistency. Local data consistency is when the client is able to reconcile data from multiple GraphQL responses in a local client-side cache which can be subscribed to. When data changes, we issue updates to these subscribers — which allows us to keep various screens "in-sync" and show the same view of the data.

The canonical example of this on Facebook is if you navigate to the Groups page from a Feed post and join the group, then when you navigate back to Feed we should update the "Join Group" button in the post to "Joined".

Local data consistency is supported today by Relay for JS clients (widely adopted within industry) and by Meta's internal mobile GraphQL client for Android/iOS clients.

To support consistency, we need to be able to remap aliases/field names into their "canonical names". The canonical name is what we use to keep two fields consistent, as it represents the true definition of a field.

Given a field selection:

alias: field(arg: "foo")
the canonical name would be:

field(arg:"foo")
Internally, we've explored various ways of doing this. We currently need to embed a lot of information about the schema and query in our clients; for example Relay creates a "normalization AST" with this information, and our internal mobile client requires all the information to be pre-compiled into the app binary. This leads to additional costs, e.g. binary size bloat or wire size costs.

We've found that we can more efficiently deliver the canonical name information by embedding it into our response instead, which removes the need for pre-compiled metadata. We run a transform on our queries to add aliases for each field, and use the canonical name as the alias. However since the canonical name may contain non-supported characters (e.g. (, ), $), we are proposing a spec change to allow StringValue tokens as the alias.

This would allow syntax such as:

"field(arg:\"foo\")": field(arg: "foo")
We've attempted alternative ways of implementing this, such as doing this transform during server-side execution (a spec violation, and high server overhead!) or delivering encoded aliases to abide by the NameValue specification (client parsing overhead for decoding!).

Potential side effects / outcomes of this change:

Selection set conflict validation should still work out of the box, even if one selection is a string literal and the other is a normal NameValue.
Here we parse the StringValue into a NameNode (same as when parsing a NameValue), which doesn't have any requirements on the name itself as far as I can tell.
We allow string literal aliases here but not number literals, which abides by the JSON specification. According to the JSON spec, map/object keys must be strings.
Copy link

netlify bot commented Feb 26, 2024

Deploy Preview for graphql-spec-draft ready!

Name Link
🔨 Latest commit 7f43049
🔍 Latest deploy log https://app.netlify.com/sites/graphql-spec-draft/deploys/65dcf5d8aaf6360008435ba2
😎 Deploy Preview https://deploy-preview-1082--graphql-spec-draft.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@benjie
Copy link
Member

benjie commented Mar 1, 2024

Interesting problem and well described! Thanks for opening this discussion.

A major concern I have with the currently proposed solution is that it allows specifying keys that are "special" properties, for example in JavaScript __proto__. A naive client might copy data around and, if they're not using null-prototype objects, may end up corrupting prototypes. This could be a major security concern. (Update: this is already a problem.)

Another concern I have is that in some situations (e.g. when a field is referenced multiple times with different arguments) it makes sense to use the field alias to store a precalculated result when performing "lookahead". Currently "internal" data can be stored to attributes with prefixes such as $ or @ knowing that these will not satisfy GraphQL Name and thus cannot be aliases and cannot conflict with the precalculated lookahead values. I'm concerned that there would be no "reserved" aliases with this proposal (without using language-specific solutions such as symbols in JS).

Finally, governed by the "Favor no change" guiding principle, I find myself wondering why you can't use a hash of your string as the alias:

_0a0ddca3101722506482d0e7fe35091ba07e65f6857f5300e10a9995d56e07d8: field(arg: "foo")

Where 0a0...7d8 is an SHA256 hash of field(arg:"foo"). Since this is a normalized key, I assume it's a one-way value and thus a hash is perfectly suitable for it? This wouldn't require any changes to GraphQL since this is a valid Name.

@cuhtis
Copy link
Author

cuhtis commented Mar 5, 2024

I find myself wondering why you can't use a hash of your string as the alias

So we actually do need the raw string, so a one-way hash like SHA256 doesn't work. The alias isn't actually a canonical name yet, it's a canonical name template — specifically, it can include variables like field(arg:$var). Once we receive key, we need to substitute in the variable value to get the final canonical name.

There's an alternative question: what about a two-way encoding? tl;dr is that we've actually also tried this!

We ran a test where we used base32 encoding to fit the canonical name into the allowed character set, but this had pretty steep performance regressions. The encoded payload doesn't compress as well, and more bytes meant that we spend more time parsing and use more memory during parse. Concretely, we saw a 5%-15% regression in parse times, 20%-60% regressions in payload sizes, and a 10% regression in OOMs.

We're running a test right now to escape unsupported characters, by using z as the escape character (it's the least commonly used character from our codebase!). The data is still pending, but the payload size regression is smaller and the parse performance is better. However, we still expect some regression.

@cuhtis
Copy link
Author

cuhtis commented Mar 5, 2024

I'm concerned that there would be no "reserved" aliases with this proposal (without using language-specific solutions such as symbols in JS).

Yeah, it's interesting because we also do this. We reserve certain characters as prefixes in the field name that we know are illegal in the current syntax. For example, we may prefix the name with $.

Our thoughts on how to mitigate this is to have compile-time validation against using those characters. Since it's a client-specific limitation (e.g. not every client reserves these characters!) then these clients should ship with validation to avoid conflicts.

A major concern I have with the currently proposed solution is that it allows specifying keys that are "special" properties

This is a real concern, and I don't have a better answer than similarly suggesting compile-time validation for JS-based clients.

@benjie
Copy link
Member

benjie commented Mar 6, 2024

Thinking about it… __proto__ is already an allowed alias 🤷‍♂️

Since it's a client-specific limitation (e.g. not every client reserves these characters!) then these clients should ship with validation to avoid conflicts.

To be clear, I was referring to server-side concerns when you have a system that looks into GraphQLResolveInfo (the fourth argument to resolve() in GraphQL.js) and precomputes required data ahead of time, keying it based on the requested alias.

@benjie
Copy link
Member

benjie commented Mar 6, 2024

So we actually do need the raw string, so a one-way hash like SHA256 doesn't work. The alias isn't actually a canonical name yet, it's a canonical name template — specifically, it can include variables like field(arg:$var). Once we receive key, we need to substitute in the variable value to get the final canonical name.

It feels like it wouldn't be too difficult to turn:

updateFoo(input: {id: $id, patch: {body:$body}})

into _1e09d679a402c951386c35572f6f366fc7618b2211c97ddc075e6abce759b889_2id_4body

Where 1e09d679a402c951386c35572f6f366fc7618b2211c97ddc075e6abce759b889 is the SHA256 of updateFoo(input:{id:$id,patch:{body:$body}}) and _2id_4body is the unique list of variables in occurrence order, separated by underscores, with a numeric length prefix (such that variables containing _ don't make the system confused).

Would this be sufficient for your needs? If you need to reverse every facet you could build your own character-constrained format for it; using the numeric prefixes will help with parsing performance because you'll know exactly how many characters to parse... I wouldn't be surprised if you could build a parser for a format like this that's faster than the parser you'd use on the literal updateFoo(input:{id:$id,patch:{body:$body}}). E.g.:

function digestInputValue(iv) {
  switch (iv.kind) {
    case 'variable': {
      return `v` + iv.name.length + iv.name;
    }
    case 'inputObject': {
      return `o` + Object.entries(iv.value).length + Object.entries(iv.value).map(...)
    }
    default: {
      const val = String(iv.value);
      return `r` + val.length + val;
    }
  }
}
function digestField(field) {
  return `_`
    + field.name.length + field.name // `9updateFoo`
    + `_` + args.length // `_1`
    + args.map(a =>
         `_` + a.name.length + a.name // `_5input`
         + digestInputValue(a.value) // o2_2id_v2id_5patch_o1_4body_v4body
      ); 
}

Result would be something like: _9updateFoo_1_5input_o2_2id_v2id_5patch_o1_4body_v4body

@benjie
Copy link
Member

benjie commented Mar 6, 2024

We ran a test where we used base32 encoding to fit the canonical name into the allowed character set

Another consideration is we could expand Name by a single character and then use a GraphQL-flavoured base64 encoding. Perhaps $. Have you tested what the performance difference would be using base64, since I imagine there are more performant codecs for that due to heavy usage? I know one of the choices behind Name was making it so that it didn't need special handling in most languages, not sure which additional character would have least impact on this, but $ should be safe for JS at least.

@cuhtis
Copy link
Author

cuhtis commented Mar 6, 2024

into _1e09d679a402c951386c35572f6f366fc7618b2211c97ddc075e6abce759b889_2id_4body

Where 1e09d679a402c951386c35572f6f366fc7618b2211c97ddc075e6abce759b889 is the SHA256 of updateFoo(input:{id:$id,patch:{body:$body}}) and _2id_4body is the unique list of variables in occurrence order, separated by underscores, with a numeric length prefix (such that variables containing _ don't make the system confused).

So if you have two queries:

query MyQuery($var: Int!) {
  foo(arg: $var)
}

where $var is 4, and:

query MyOtherQuery {
  foo(arg: 4)
}

We need these two to be consistent with each other in our normalized store, so the two foo selections need to have the same canonical name. In the proposed encoding, we'd have to ensure that the SHA256 part is the same, i.e. always shift the argument values out like SHA256(foo(arg:)_SHA256(4) and SHA256(foo(arg:)_3var.

This has a few implications:

  • We need to hash/encode the argument values since they may contain illegal characters (like " for strings).
  • It is unclear how variables embedded inside of a complex field argument would work, for example foo(arg: [{"bar": "baz"}, {"bar", $var}])

So I think we're pretty resigned to a two-way encoding for this to work, and which also generally benefits as they scale in length relative to the input (as opposed to SHA256 which would be consistently 64 bytes, much longer than the original keys!).

Another consideration is we could expand Name by a single character and then use a GraphQL-flavoured base64 encoding.

We did briefly look into base64! It wasn't possible due to GraphQL not having enough characters as you allude. It shares a lot of common properties with the base32 encoded format:

  • base64 has an encoding overhead of 33% (compared to base32's 60% overhead).
  • Readability is poor, since you can't easily determine what the original field name is.

you could build your own character-constrained format for it

This is what we've done for our current experiment that's running:

We're running a test right now to escape unsupported characters, by using z as the escape character (it's the least commonly used character from our codebase!). The data is still pending, but the payload size regression is smaller and the parse performance is better. However, we still expect some regression.

The specific format we're testing is:

  • Any character in [A-Za-y0-9_] is kept as-is.
  • Any disallowed character (e.g. *, #, ?) is encoded as a z followed by the 2-character hex format of the character. For example, # becomes z23.
  • A set of common characters get shorthand encodings. For example, $ is encoded as zo instead of z24.

This has a few benefits:

  • The payload size overhead is lower, since only unsupported characters in the GraphQL syntax get escaped.
  • The field name is still largely readable in this format, which is good for debuggability.

Again, we're still waiting for data here so it'll be interesting to see how it performs. Our local benchmark tests have it at a 3% parse performance regression.

@benjie
Copy link
Member

benjie commented Apr 3, 2024

Been pondering this in the background a bit, have you considered encoding the response on the server-side in a pre-normalized format, @cuhtis? Could be something that clients opt into e.g. via an Accept: application/relay-normalized-graphql-response header or similar? Just a random thought...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants