Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SimpleHtmlSanitizer should support self-closing variants of whitelisted tags #9994

Open
CrushaKRool opened this issue Aug 22, 2024 · 3 comments

Comments

@CrushaKRool
Copy link

CrushaKRool commented Aug 22, 2024

GWT version: 2.11
Browser (with version): Firefox 129.0.2
Operating System: Windows 10


Description

The SimpleHtmlSanitizer supports the <br> and <hr> tags in a whitelist to leave them unchanged. But this does not work for self-closing variants of these void elements.

Depending on the browser implementation, it will likely simply discard the trailing slash on start elements. But the sanitizer should still support those variants, since they are used if you aim to create valid XHTML instead of HTML5, and the input you are sanitizing may not be fully under your control in that regard.

Of the list of tags that are already being whitelisted by the SimpleHtmlSanitizer, <br> and <hr> are the only ones that are also listed as official void elements in the documentation above. So these are probably all that we need to support.
A trivial solution would be to add "br/", "br /", "hr/" and "hr /" to the existing whitelist. Alternatively, you could try to change the code to handle these trailing slashes with and without whitespace properly.

Steps to reproduce
  @Test
  void testBr()
  {
    assertAll(
        () -> assertEquals( "a<br>b", SimpleHtmlSanitizer.sanitizeHtml( "a<br>b" ).asString() ), // success
        () -> assertEquals( "a<br/>b", SimpleHtmlSanitizer.sanitizeHtml( "a<br/>b" ).asString() ), // fail -> a&lt;br/&gt;b
        () -> assertEquals( "a<br />b", SimpleHtmlSanitizer.sanitizeHtml( "a<br />b" ).asString() ) // fail -> a&lt;br /&gt;b
    );
  }
  
  @Test
  void testHr()
  {
    assertAll(
        () -> assertEquals( "a<hr>b", SimpleHtmlSanitizer.sanitizeHtml( "a<hr>b" ).asString() ), // success
        () -> assertEquals( "a<hr/>b", SimpleHtmlSanitizer.sanitizeHtml( "a<hr/>b" ).asString() ), // fail -> a&lt;hr/&gt;b
        () -> assertEquals( "a<hr />b", SimpleHtmlSanitizer.sanitizeHtml( "a<hr />b" ).asString() ) // fail -> a&lt;hr /&gt;b
    );
  }
Known workarounds
  • Perform a text replacement via regex to turn the tags with trailing slashes into those without before performing the sanitization. But that can be a lot of boilerplate throughout the application.
  • Implement your own HtmlSanitizer. But it would be preferable to use an officially maintained solution.
@niloc132
Copy link
Contributor

Thanks for the report - I do note the following from your link:

Self-closing tags () do not exist in HTML.

If a trailing / (slash) character is present in the start tag of an HTML element, HTML parsers ignore that slash character.

That paragraph goes on to discuss ways that different implementations might interpret such malformed html differently.

Then this note:

Self-closing tags are required in void elements in XML, XHTML, and SVG (e.g., <circle cx="50" cy="50" r="50" />).

As such, it seems very reasonable that the SimpleHtmlSanitizer does not support XML or XHTML, as these are not the same thing. Am I missing something here that would make it correct for valid HTML to contain self-closing tags? We wouldn't want this to be ambiguous, but it seems it could be safe for void elements? I'll leave this point open to more discussion.


For the rest of this, let's assume that we've either elected to add a SimpleXhtmlSanitizer, or have found a valid way in which HTML can contain self-closing tags:

I think solving the "allow a whitespace char" could be solved distinctly from "allow self-closing tags", but in the same block of code:
https://github.com/gwtproject/gwt/blob/22ee5a676cc2eae46ecb341f2746c67ba8170ec5/user/src/com/google/gwt/safehtml/shared/SimpleHtmlSanitizer.java
Matching for a / just before tagEnd could let us decrement tagEnd by one, which would let tag then be hr or hr depending on if there is a space char. We could then trim the string so that it correctly matches our allow-list.

If I'm reading this right, this would capture just the tag name itself so that either <hr /> or <hr/> would be normalized to <hr> - correct for void elements at least.

The most strict interpretation I can find here then is that if we had a specific list of void elements and only permit the above transformation for those (as you noted, this would be only br and hr).

@CrushaKRool
Copy link
Author

CrushaKRool commented Aug 23, 2024

You can of course make an argument that self-closing tags are not valid in the first place.
But even if they are used, the browser generally does the right thing, so you cannot always expect that people do "the right thing" if the wrong thing still works just the same and they don't know any better.

And since the purpose of the SimpleHtmlSanitizer seems to be to do a best effort to sanitize input that may not be entirely under your control (while still allowing it to make some use of basic formatting tags), I'd argue that it wouldn't hurt to also go the extra mile and support these cases that may not be fully conforming to the spec but may be encountered in the wild nonetheless.

A distinction between a separate sanitizer for HTML and XHTML just for these cases may not be suitable, because you may not know in advance whether the input you receive is aiming to be HTML, XHTML or even just some weird mix that would still work in modern browsers. And if the SimpleXhtmlSanitizer would basically do the same thing as the regular one except for also accounting for self-closing tags, it would rather boil down to the question of "how lenient do you want to be with input that does not follow the HTML5 spec to a tee?".


Side note: The lack of support for self-closing tags seems to have also been brought up in a comment way back when support for <br> was added, but it does not seem to have been discussed any further in that thread.
#7509 (comment)

@niloc132
Copy link
Contributor

I'm not disagreeing that this would be a good addition - that's why I outlined how the change could be made. But I think there is a solid case to be made that this is not what the tool intended, since it forbids lots of other valid HTML5 content too, and supports/normalized no invalid HTML5.

"Sanitizing output outside of your control" usually means a best effort has to be made - is there any example of a case that isn't fully conforming that is supported today?

My naming might not have been ideal - something that implies that like tools like jsoup it is rewriting the content to be valid and then sanitizing to a subset could make sense. I don't have a concrete suggestion here, but I think as I showed the implementation shouldn't be terribly difficult. What about either a subclass that adds this feature (and documents it accordingly), or a default property or ctor arg so that existing implementations continue to function as they did, but new uses can opt-in to slightly more flexible behavior?

Again, I'm not objecting to this, but I want there to be some discussion around suitability and design before loosening what is designed to be a safety feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants