Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

otherMarkup #240

Open
otherdaniel opened this issue Sep 13, 2024 · 7 comments
Open

otherMarkup #240

otherdaniel opened this issue Sep 13, 2024 · 7 comments

Comments

@otherdaniel
Copy link
Collaborator

Comment on #237 (comment): "We should not need otherMarkup"

[Splitting this out from discussion on #237]

I re-introduced otherMarkup, because I wanted a way to express the behaviour of setHTMLUnsafe without parameters. The requirements, I take it, are:

  • parseHTMLUnsafe/setHTMLUnsafe("...") (without passing in a config) should not modify the parse result.
  • the empty options dict should behave like no options dict
    • setHTMLUnsafe("...") == setHTMLUnsafe("...", {}) == setHTMLUnsafe("...", {sanitizer: {})
  • setHTML should behave mostly like setHTMLUnsafe, except for the known-XSS-unsafe stuff.

Generally, I see 3 options to satisfy these:

  1. Have otherMarkup (or somesuch), so that we can have an unsafe-default config that will pass through all markup.
  2. Have the behaviour of otherMarkup, but tie it to the method invocation.
    • I.e., setHTML("<bla>", {sanitizer: s}) would behave differently from setHTMLUnsafe("<bla>", {sanitizer: s})
  3. Special case the absent and empty config to behave in a way that a Sanitizer instance can't.
    • I.e., setHTMLUnsafe("...") does something setHTMLUnsafe("...", {sanitizer: s}) can't, for any value of s.

I have a mild preference for the first; but can live with any of them.

@otherdaniel
Copy link
Collaborator Author

[Follow-up from meeting today:]

I removed otherMarkup from #237. The key sentence is https://pr-preview.s3.amazonaws.com/otherdaniel/purification/pull/237.html#sanitize-core, 2.4.2.

An empty allow list now allows everything. That is:

  • .setHTMLUnsafe("<bla>") => <bla>
  • .setHTML("<bla>") => (Because .setHTML default has an allow-list based default)
  • .setHTML("<bla>", {sanitizer: {}}) => <bla> (Because explicit config is given; and explicit config has no allow list)

One oddity this has is that removing elements can allow more of them. E.g.:

  • s = new Sanitizer({elements: ["div"]); # Allows only <div>, but not e.g. <p>
  • s.removeElement("div"); # Allows everything except <div>. Does allow <p>.

Does this reflect group consensus?


One item that came up when discussing this with Guillaume: Can we distinguish between a missing list and an empty one? That is, {} would allow everything, while {elements: []} would allow nothing?

@annevk
Copy link
Collaborator

annevk commented Sep 18, 2024

I think the policy manipulation methods having some weird side effects is okay, although it does make me wonder whether we need to think even harder about the names and also whether we might need more fine-grained manipulation methods in the future.

Like if you do addElement() twice, and the second call includes or omits details about attributes, I guess it ends up replacing the first call entirely? Almost makes me think they should all be prefixed with set(). setElement(), setRemoveElement(), etc.


It's certainly possible IDL-wise to distinguish between an empty list and a missing list.

@otherdaniel
Copy link
Collaborator Author

Like if you do addElement() twice, and the second call includes or omits details about attributes, I guess it ends up replacing the first call entirely?

Good point. The way it's currently written (https://pr-preview.s3.amazonaws.com/otherdaniel/purification/pull/237.html#sanitizer-allowelement, step 2 + 3) is that is always adds to the per-element lists.

s = new Sanitizer({elements: ["img"]}).
s.get();  // {elements: [{name: "img", namespace: ..html..}]}
s.allowElement({"img", attributes: ["src"], removeAttributes: ["srcset"]});
s.allowElement({"img", attributes: ["alt"]});
s.get();  // {elements: [{name: "img", namespace: ...html..., attributes: ["src", "alt"], removeAttributes: ["srcset"]}]}

In order to set the per-element attributes, one would have to remove it and then allow it again.

Not sure if that's what we want... I admittedly didn't think too much about that. :/

It's certainly possible IDL-wise to distinguish between an empty list and a missing list.

Ah. In that case, I think I'd like to change it to: empty list allowing nothing; missing list meaning no constraint (aka, allowing everything). That seems more consistent.

@annevk
Copy link
Collaborator

annevk commented Sep 18, 2024

Given that we don't have fine-grained operations I think replacing (set semantics) is probably what we want. You don't want multiple of the same element in the list and have to deal with deduplication later and you also don't really want more complicated merge semantics than replace I'd think.


I think I'd like to change it to: empty list allowing nothing; missing list meaning no constraint (aka, allowing everything).

That sounds reasonable. And then elements : [], removeElements: [ aValidRemoveElementEntry ] would throw due to ambiguity?

@otherdaniel
Copy link
Collaborator Author

Given that we don't have fine-grained operations I think replacing (set semantics) is probably what we want. You don't want multiple of the same element in the list and have to deal with deduplication later and you also don't really want more complicated merge semantics than replace I'd think.

Done.

I think I'd like to change it to: empty list allowing nothing; missing list meaning no constraint (aka, allowing everything).

That sounds reasonable. And then elements : [], removeElements: [ aValidRemoveElementEntry ] would throw due to ambiguity?

In the current PR, nothing throws. The old spec had an elaborate "canonicalization" step, and would throw if that encountered ambiguities. That makes sense for an immutable object, where the construction is also the final state of the object. In this PR, we define modifier methods, that would presumably be called a number of times. These methods just "make it fit" instead of throwing. E.g., if you add an element to the allow list, it'll simply remove it from the remove-list. Initialization from a dictionary is now re-written in terms of those manipulation methods. The consequence of that is that it doesn't throw anymore at all. (There's an ISSUE: about that in the draft.)

If we don't want that, I can see several ways:

We could make the manipulation methods throw. (E.g., removeElement would throw if that particular element is in the allow list.) I'm skeptical here, because I think that makes it very easy to write code that will fail in practice if the underlying sanitizer has changed from what the code writer was expecting.

We could keep the manipulation methods als always succeeding, but throw when constructing from a dictionary. I guess if we have an allow-list then any removeElement entry is redundant. I guess we could throw for any config that has both allow and remove lists.

@annevk
Copy link
Collaborator

annevk commented Sep 19, 2024

I was expecting that dictionary construction would still throw for the cases we discussed:

  • Duplicate entries in the same list (perhaps with mismatching attribute lists).
  • Lists that are redundant with each other (such as attributes and removeAttributes).
  • Actual syntax errors (although maybe we don't have any limitations on the strings themselves, that's prolly for the best).

I agree that ideally the manipulation methods always succeed, even if they might not do what you expect. But they would still guarantee some kind of "canonical model" underneath and not allow for duplicates or the ability to end up with redundant lists internally.

@otherdaniel
Copy link
Collaborator Author

I now added checks & throwing back in for the constructor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants