(Design) Formatted Parts #463

Merged
merged 13 commits into from
Dec 4, 2023
212 changes: 212 additions & 0 deletions exploration/0003-formatted-parts.md
@@ -0,0 +1,212 @@
# Formatted Parts

Status: **Proposed**

<details>
<summary>Metadata</summary>
<dl>
<dt>Contributors</dt>
<dd>@eemeli</dd>
<dt>First proposed</dt>
<dd>2023-08-29</dd>
<dt>Pull Request</dt>
<dd>#000</dd>
</dl>
</details>

## Objective

Messages often include placeholders that,
when formatted, contain internal structure ("parts").
Preserving this structure in a formatted message
may be helpful to the caller,
who can then manipulate the parts.
For example, a caller may want to style or present
messages with the same content differently
if those messages have different internal structure.

This proposal defines a formatted-parts target for MessageFormat 2.

## Background

Past examples have shown us that if we don't provide a way to format to parts,
users will re-parse and re-process the string output.
Recent examples of web browsers needing to account for such user behaviour are available from
[June 2022](https://github.com/WebKit/WebKit/commit/1dc01f753d89a85ee19df8e8bd75f4aece80c594) and
[November 2022](https://bugs.chromium.org/p/v8/issues/detail?id=13494).

## Use-Cases
Member

These need more flesh.

Collaborator Author

@aphillips This section has been expanded since this comment. Sufficiently?


- Markup elements
- Non-string values
- Message post-processors
- Decoration of placeholder interior parts.
For example, identifying the separate fields in these two currency values
(notice that the symbol, number, and fraction fields
are not in the same order and that the separator has been omitted):
![image](https://github.com/unicode-org/message-format-wg/assets/69082/cb68c87f-9c0c-4bc6-b9a0-b1f97b2b789a)
![image](https://github.com/unicode-org/message-format-wg/assets/69082/aedd4e66-7d47-4026-8b93-4ba061bb4d84)
- Supplying bidirectional isolation of placeholders,
such as by using HTML's `span` element with a `dir` attribute
based on the direction of the placeholder.

## Requirements

- Define an iterable sequence of formatted part objects.
- Include metadata for each part, such as type, source, direction, and locale.
- Allow the representation of non-string values.
- Allow the representation of values that consist of an iterable sequence of formatted parts.
- Be able to represent each resolved value of a pattern with any number of formatted parts, including none.
- Define the formatted parts in a manner that allows synonymous but appropriate implementations in different programming languages.

## Constraints

- The JS Intl formatters already include formatted-parts representations for each supported data type.
The JS implementation of the MF2 formatted-parts representation should be able to match their structure,
at least as far as that's possible and appropriate.

## Proposed Design

The formatted-parts API is included in the spec as an optional but recommended formatting target.

The shape of the formatted-parts output is defined in a manner similar to the data model,
which includes TypeScript, JSON Schema, and XML DTD definitions of the same data structure.

At the top level, the formatted-parts result is an iterable sequence of parts.
Parts corresponding to each _text_ can be simpler than those of _expressions_,
as they do not have a `source` other than their `value`,
or set any of the other possible metadata fields.

```ts
type MessageParts = Iterable<
  MessageTextPart | MessageExpressionPart | MessageBiDiIsolationPart
>;

interface MessageTextPart {
  type: "text";
  value: string;
}
```

For MessageExpressionPart, the `source` corresponds to the expression's fallback value.
The `dir` and `locale` attributes of a part may be inherited from the message
or from the operand (if present),
or overridden by an expression attribute or formatting function,
or otherwise set by the implementation.
Each part should have at most one of `value` or `parts` defined;
some may have none.
Member

Does this make sense? Or would it be better to say:

Suggested change:
- Each part should have at most one of `value` or `parts` defined;
- some may have none.
+ Each part MUST have either a `value` or `parts` defined.
+ A part MAY have a `value` that is the empty string.
+ A part MAY have a `parts` that is an empty list.

Collaborator Author

The suggestion would make the proposed MessageFallbackPart invalid, as it does not include either value or parts. It's conceivable for other parts to also exist which do not include either, such as open/close expressions without an annotation.

Member

Would an empty value or empty parts satisfy that? Or a fallback could have a string expression? Empty strings don't result in the erroneous emission of the string null 😉

I understand that it would "break" the current definition: we should decide what the shapes should be and make consistent.

Collaborator Author

For fallback when formatting to a string, the {...} makes sense as a visual indicator, but for a formatted-parts consumer some different representation could be better. So including an explicit value or parts would be misleading.

For open/close, it doesn't make sense to define their explicit parts shapes in this spec, but for JS I have them as:

interface MessageMarkupPart {
  type: 'open' | 'close';
  source: string;
  name: string;
  value?: unknown;
  options: { [key: string]: unknown };
}

There, the value would be 'b' for {b +html}, but it would not be set for {+html.b}. Setting it to an empty string would be misleading, as {+foo} and {|| +foo} could easily mean different things.

Collaborator

In #463 (comment) I'm suggesting to define separate interfaces for single-valued and multi-valued parts. This could extend to fallback parts and markup, as well.


```ts
interface MessageExpressionPart {
  type: string;
  source: string;
  parts?: Iterable<{ type: string; value: unknown }>;
  value?: unknown;
  dir?: "ltr" | "rtl" | "auto";
Collaborator

It is unclear to me why have dir and locale at this level.

We don't need a locale to format anything, because the parts should be already formatted.
The whole proposal is called "Formatted Parts"

The locale might be needed to render things.

Or to process the formatting result (fix grammatical agreements as a post-step, fix "a apple" to "an apple" (en) or "La abeille" => "L'abeille" (fr), or to sentence-case the result of "{item} is foo").

But then that is something that is needed for the whole collection of parts, not on MessageExpressionPart only.

Collaborator Author

These are needed to allow embedding content in a message that uses a different script or locale than the surrounding message.

Collaborator

Kind of clear what they can be used for.

But this info is on the MessageExpressionPart, which comes from an expression.
And an expression can't create this info out of nothing, it is probably something we passed as a parameter.
So if I already know the info (because I passed it to the expression), having it on the MessageExpressionPart is useless duplication.

Collaborator Author

Consider this message, intended for consumption by a text-to-speech system:

In French, the number {98 :number} is commonly expressed as {98 :number @locale=fr},
but in Belgium it's {98 :number @locale=fr-BE}.

How would you propose that the locale information is transmitted, if not as a field on the formatted parts?

[
  { type: 'text', value: 'In French, the number ' },
  { type: 'number', source: '|98|', parts: [{ type: 'integer', value: '98' }] },
  { type: 'text', value: ' is commonly expressed as ' },
  { type: 'number', source: '|98|', locale: 'fr', parts: [{ type: 'integer', value: '98' }] },
  { type: 'text', value: ", but in Belgium it's " },
  { type: 'number', source: '|98|', locale: 'fr-BE', parts: [{ type: 'integer', value: '98' }] },
  { type: 'text', value: '.' }
]

As context, the fr number would be "quatre-vingt-dix-huit", while in fr-BE it's "nonante-huit".

Member

Language and direction are needed for placeholders because they represent values being inserted into the overall string. The language (locale) is used to ensure proper rendering and processing (such as line breaking, text transforms, or spell-checking). The direction is used to enable bidi isolation and get the direction of the substring correct.

The language and direction of a formatted part might not match that of the message due to resource fallback when looking up the message. Or because values passed in have different language or direction. (And we want bidi isolation even if the directions match!!!!)

Providing the fields in the formatted part structure allows the user to easily access the values, e.g. it makes it easy to do something like this, resulting in proper isolation of the formatted parts (not shown is decorating the parts separately):

// `message` is whatever the host node is for the string
for (const part of formattedMessage.parts) {
    const span = document.createElement('span');
    if (part.locale) span.lang = part.locale; // the part interfaces name this field `locale`
    if (part.dir) span.dir = part.dir;
    span.appendChild(document.createTextNode(String(part.value)));
    message.appendChild(span);
}

That is done on portions of the message. The whole message also has a language (locale) and base paragraph direction.

  locale?: string;
}
```
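
As a rough illustration of how a caller might consume these fields, the following sketch renders a sequence of parts to an HTML string, wrapping each non-text part in a `span` that carries its `locale` and `dir`. The `Part` shape is a simplified stand-in for the interfaces above, and `renderToHtml` and `escapeHtml` are hypothetical helper names, not part of the proposal. Fallback parts are not handled here, and bidi isolation parts are skipped on the assumption that the `dir` attribute already isolates the content.

```ts
// Simplified, illustrative shape loosely covering text and expression parts.
interface Part {
  type: string;
  source?: string;
  value?: unknown;
  parts?: Iterable<{ type: string; value: unknown }>;
  dir?: "ltr" | "rtl" | "auto";
  locale?: string;
}

// Hypothetical helper: escapes the characters that are unsafe in HTML text and attributes.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: renders parts to an HTML string.
function renderToHtml(parts: Iterable<Part>): string {
  let html = "";
  for (const part of parts) {
    // Assumption: a `span` with `dir` isolates the content, so control characters are dropped.
    if (part.type === "bidiIsolation") continue;

    let inner = "";
    if (part.parts !== undefined) {
      // Sub-parts are wrapped individually so that they can be styled by their type.
      for (const sub of part.parts) {
        inner += `<span class="${escapeHtml(sub.type)}">${escapeHtml(String(sub.value))}</span>`;
      }
    } else if (part.value !== undefined) {
      inner = escapeHtml(String(part.value));
    }

    if (part.type === "text") {
      // Text parts inherit language and direction from the surrounding context.
      html += inner;
      continue;
    }
    const lang = part.locale ? ` lang="${escapeHtml(part.locale)}"` : "";
    const dir = part.dir ? ` dir="${part.dir}"` : "";
    html += `<span${lang}${dir}>${inner}</span>`;
  }
  return html;
}
```

Because text parts are emitted without a wrapper, only placeholders receive their own `lang` and `dir`, which keeps the generated markup small.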

The bidi isolation strategies included in the spec may require
the insertion of MessageBiDiIsolationParts in the formatted-parts output.

```ts
interface MessageBiDiIsolationPart {
  type: "bidiIsolation";
  value: "\u2066" | "\u2067" | "\u2068" | "\u2069"; // LRI | RLI | FSI | PDI
}
```
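
To make the idea of inserted isolation parts more concrete, here is a minimal sketch of a generator that surrounds each non-text part with an opening isolate chosen from its `dir` and a closing PDI. The function name `withIsolation`, the simplified `OtherPart` shape, and the rule of isolating every non-text part are assumptions made for this example; they are not one of the strategies defined by the spec.

```ts
type Dir = "ltr" | "rtl" | "auto";

interface BiDiIsolationPart {
  type: "bidiIsolation";
  value: "\u2066" | "\u2067" | "\u2068" | "\u2069"; // LRI | RLI | FSI | PDI
}

// Simplified, illustrative stand-in for text and expression parts.
interface OtherPart {
  type: string;
  source?: string;
  value?: unknown;
  dir?: Dir;
}

const LRI = "\u2066";
const RLI = "\u2067";
const FSI = "\u2068";
const PDI = "\u2069";

// Hypothetical strategy: isolate every non-text part, using FSI when the
// direction is "auto" or not set.
function* withIsolation(
  parts: Iterable<OtherPart>
): Generator<OtherPart | BiDiIsolationPart> {
  for (const part of parts) {
    if (part.type === "text") {
      yield part;
      continue;
    }
    const open = part.dir === "ltr" ? LRI : part.dir === "rtl" ? RLI : FSI;
    yield { type: "bidiIsolation", value: open };
    yield part;
    yield { type: "bidiIsolation", value: PDI };
  }
}
```

A consumer that applies its own isolation, for example through markup, can simply filter out the parts whose `type` is `"bidiIsolation"`.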
Member

Maybe? But since the parts have a direction, the bidi isolation parts are not necessarily required. Let's consider pattern:

{Today is {$date :datetime}}

In locale ar-AE, the output of the :datetime function might be "30‏/08‏/2023" with a dir of rtl. Note that the string looks bad here because it is an LTR context and not bidi isolated. There are U+200F RLMs after 08 and 30 to assist, but the string needs to be wrapped.

Ignoring that the date has interior parts, getting a MessageExpressionPart:

value: "30‏/08‏/2023"
dir: "rtl"

... could result in a formatted string (in an HTML context) like:

Today is <span dir=rtl>30‏/08‏/2023</span>.

This draws correctly without any further intervention.

If this became a list of parts it still works:

parts: [
   { name: "day", value: "30\u200f" },
   { name: "sep", value: "/"},
   { name: "month", value: "08\u200f"},
   { name: "sep", value: "/"},
   { name: "year", value: "2023"}
]
dir: "rtl"

The caller could use this to produce:

Today is <span dir=rtl>
<span class="day">30&#x200f;</span>/08&#x200f;/2023</span>.

You wouldn't want Unicode isolates in this case.

Collaborator Author

I agree that in some cases bidi isolation of formatted-parts output may be achieved e.g. via <span dir=...> tags, but I don't think that's universally true for every use case for formatted parts.

This is one reason why e.g. in the JS Intl.MF proposal my proposal for bidi handling includes the "none" strategy as an alternative to the default (and required) "compatibility" strategy.

Effectively, I think we should include the MessageBiDiIsolationPart by default, but allow for an implementation to provide a way to get rid of them, or just have a consumer of the output ignore all the type: "bidiIsolation" parts.

Member

I agree that span is not the only way that bidi will be implemented. none is easy: one can just ignore the bidi/direction. Obviously sometimes folks will want bidi controls. Or they might want to make an attributed string, I suppose, in some UI frameworks.

I'm reluctant to create a "bidi isolation part" because it separates the direction information from the value it is associated with. It's harder if one is doing decoration, such as building some control using various tags, not all of which are span, some of which have classes or other attributes. The only place where the isolation part is easier is stringifying the message (since one just consumes it).

function stringifyParts(parts) {
  let res = ''
  for (let part of parts) {
    res += bidiStrategyImpl.start(part.dir);  // none does nothing, compatibility does a control, etc.
    if (part.type === 'fallback') res += `{${part.source}}`
    else if ('value' in part) res += String(part.value)
    else if ('parts' in part) {
      for (let sub of part.parts) res += String(sub.value)
    }
    res += bidiStrategyImpl.end(part.dir);
  }
  return res
}

Collaborator Author

Doing as you propose would require implementations to provide explicit bidiStrategyImpl.start() and bidiStrategyImpl.end() interfaces, or the behaviour of format-to-parts-to-string could not be guaranteed to match the behaviour of format-to-string.

That seems like a much bigger ask than defining what the bidi isolation parts would look like if they are to be included in the parts output by the requested bidi isolation strategy.

Member

Format-to-string's output would depend on what the bidi isolation strategy was, no?

The start/end thing was just an example. Note that with isolation parts you have to "read ahead" to find out what is being isolated if what one is doing with the parts is decorating a control or generating HTML.

Collaborator Author

> Format-to-string's output would depend on what the bidi isolation strategy was, no?

Yes.

> The start/end thing was just an example. Note that with isolation parts you have to "read ahead" to find out what is being isolated if what one is doing with the parts is decorating a control or generating HTML.

Sure, but either the bidi isolation parts need to be included, or some way of obtaining their equivalent is required.


Some of the MessageExpressionPart instances may be further defined
without reference to the function registry.

Unannotated expressions with a _literal_ operand
are represented by MessageStringPart.
As with MessageTextPart,
the `value` of MessageStringPart is always a string.
Member

Is there a need for the distinction? It is simply the source?

Collaborator Author

In addition to source the dir and locale are not present on MessageTextPart.

Member

Why don't text parts have a direction and language? Are they inheriting from the message?

Aside: we don't necessarily want to span every section of text, that is, this is better:

<span lang=foo dir=ltr>blah blah blah 
   <span lang=bar dir=auto>placeholder here</span> blah blah</span>

than this:

<span lang=foo dir=ltr>blah blah blah </span>
   <span lang=bar dir=auto>placeholder here</span><span lang=foo dir=ltr> blah blah</span>

... which is why text parts would inherit language and base paragraph direction from the message.

Collaborator Author

Text parts do inherit the language and direction from the message, so they'll be whatever the formatter was called with. But they don't need to be repeated on every part, because the caller already knows what they asked for, and they may well be formatting the message in a context with e.g. <body lang=foo dir=ltr> setting them at a much higher level.


```ts
interface MessageStringPart {
  type: "string";
  source: string;
  value: string;
  dir?: "ltr" | "rtl" | "auto";
  locale?: string;
}
```

Unannotated expressions with a _variable_ operand
whose type is not recognized by the implementation
or for which no default formatter is available
are represented by MessageUnknownPart.

```ts
interface MessageUnknownPart {
  type: "unknown";
  source: string;
  value: unknown;
}
```

When the resolution or formatting of a placeholder fails,
it is represented in the output by MessageFallbackPart.
No `value` is provided; when formatting to a string,
Member

Should we provide the fallback value? I think we have some text in the spec that allows implementations or functions to supply their own fallback.


Question: should a goal be that the string output of a message be equivalent to concatenating the string representation of its parts? Or at least that a test be that one can assemble the string output from the parts?

Collaborator Author

> Should we provide the fallback value? I think we have some text in the spec that allows implementations or functions to supply their own fallback.

Fallback customization is only available for syntax and data model errors, which default to . That would be used as the source value here.

> Question: should a goal be that the string output of a message be equivalent to concatenating the string representation of its parts? Or at least that a test be that one can assemble the string output from the parts?

I think the latter. With the current proposal, it's doable like this:

function stringifyParts(parts) {
  let res = ''
  for (let part of parts) {
    if (part.type === 'fallback') res += `{${part.source}}`
    else if ('value' in part) res += String(part.value)
    else if ('parts' in part) {
      for (let sub of part.parts) res += String(sub.value)
    }
  }
  return res
}

Collaborator

So the parts are technically not formatted yet?
I can't get from the proposal what a sub.value is.
Can it be a date?

the part's representation would be `'{' + source + '}'`.

```ts
interface MessageFallbackPart {
  type: "fallback";
  source: string;
}
```
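
The following sketch shows one way an implementation might substitute a fallback part when formatting a placeholder fails, and how a string formatter could then represent it. It assumes that failure is signalled by a thrown exception, which this design does not mandate; `formatPlaceholder` and `fallbackToString` are hypothetical names, and `ExpressionPart` is a trimmed-down stand-in for MessageExpressionPart.

```ts
// Trimmed-down, illustrative stand-in for MessageExpressionPart.
interface ExpressionPart {
  type: string;
  source: string;
  value?: unknown;
}

interface FallbackPart {
  type: "fallback";
  source: string;
}

// Hypothetical helper: runs a formatter and substitutes a fallback part on failure.
// Assumption: resolution and formatting errors surface as exceptions.
function formatPlaceholder(
  source: string,
  format: () => ExpressionPart
): ExpressionPart | FallbackPart {
  try {
    return format();
  } catch {
    // No `value` is provided; only the source of the failed placeholder is kept.
    return { type: "fallback", source };
  }
}

// String representation of a fallback part: '{' + source + '}'.
function fallbackToString(part: FallbackPart): string {
  return "{" + part.source + "}";
}
```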

Each formatting function defined in the registry MUST define its "formatted-parts" representation.
A function can define either a unitary string `value` or a `parts` representation.
Where possible, a function SHOULD provide a `parts` representation
if its output might reasonably consist of multiple fields.
These sub-parts should not need fields beyond their `type` and `value`,
and in most cases it's presumed that the sub-part `value` would be a string.

```ts
interface MessageDateTimePart {
Member

Can we define the "parts" so that they are generic rather than each type having its own special part type?

Collaborator Author

Yes, that's the MessageExpressionPart definition above, which these interfaces also match. The definitions here are giving more specificity about what e.g. :datetime and :number end up producing, i.e. that they have explicit type identifiers and define parts rather than value.

  type: "datetime";
  source: string;
  parts: Iterable<{ type: string; value: unknown }>;
Member

Notice in my example earlier in this review that I exposed the field name as a field in the parts. This becomes important when trying to decorate an Iterable whose contents shift around due to the locale/localized formatting. Dates have this feature (YMD, DMY, MDY). So do currency values (which may or may not have a decimal part, may have the symbol first or last, and may or may not have a space around the symbol). That's how the screen shots of currency values (from amazon.com and amazon.fr) get decorated.

Collaborator Author

Yes, that's the "type" field here. I picked that rather than "name" because it's used by the JS Intl formatters.

Member

That was super non-obvious to me (hence this conversation!), particularly since the type fields in the examples seemed to be focused on the "type" of formatter (datetime, number) rather than on the parts field within them. Admittedly, the MF2-level parts will be at the placeholder level. Interior parts are the problem of the formatter. But this was not at all clear and probably could use an example.

Collaborator Author

Let's make sure that it's explained in more detail in the spec PR, should this design doc be accepted.

  dir?: "ltr" | "rtl" | "auto";
  locale?: string;
}

interface MessageNumberPart {
  type: "number";
  source: string;
  parts: Iterable<{ type: string; value: unknown }>;
  dir?: "ltr" | "rtl" | "auto";
  locale?: string;
}
```
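
As an example of how a JS implementation could match the structure of the existing Intl formatters, the sketch below builds a number part from `Intl.NumberFormat.prototype.formatToParts`. The function name `formatNumberToPart` and the choice to report the formatter's resolved locale are assumptions for illustration; `NumberPart` mirrors the MessageNumberPart interface above.

```ts
// Mirrors MessageNumberPart; repeated here so that the example is self-contained.
interface NumberPart {
  type: "number";
  source: string;
  parts: Iterable<{ type: string; value: unknown }>;
  dir?: "ltr" | "rtl" | "auto";
  locale?: string;
}

// Hypothetical helper: formats a number and exposes its fields as sub-parts.
function formatNumberToPart(
  source: string,
  value: number,
  locale: string
): NumberPart {
  const nf = new Intl.NumberFormat(locale);
  // Each Intl part already has the { type, value } shape used for sub-parts,
  // with types such as "integer", "group", "decimal", and "fraction".
  const parts = nf.formatToParts(value).map((p) => ({ type: p.type, value: p.value }));
  // Assumption: the resolved locale of the formatter is what gets reported.
  return { type: "number", source, parts, locale: nf.resolvedOptions().locale };
}
```

Concatenating the sub-part values of the result yields the same string as `Intl.NumberFormat.prototype.format` for the same locale and value, which is the property a format-to-string implementation can rely on.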

## Alternatives Considered

### Not Defining a Formatted-Parts Output

Leave it to implementations.
They will each come up with something a bit different,
but each will mostly work.

They will not be interoperable, though.

### Different Parts Shapes

See issue <a href="https://github.com/unicode-org/message-format-wg/issues/41">#41</a> for details.

The shapes discussed there can be considered precursors of the current proposal,
into which they've developed due to evolutionary pressure.

### Annotated String Output

Format to a string, but separately define metadata or other values.

This gets really clunky for parts that are not reasonably stringifiable.