SimpleHtmlSanitizer should support self-closing variants of whitelisted tags #9994

CrushaKRool · 2024-08-22T11:12:18Z

GWT version: 2.11
Browser (with version): Firefox 129.0.2
Operating System: Windows 10

Description

The SimpleHtmlSanitizer supports the <br> and <hr> tags in a whitelist to leave them unchanged. But this does not work for self-closing variants of these void elements.

Depending on the browser implementation, it will likely simply discard the trailing slash on start elements. But the sanitizer should still support those variants, since they are used if you aim to create valid XHTML instead of HTML5, and the input you are sanitizing may not be fully under your control in that regard.

Of the list of tags that are already being whitelisted by the SimpleHtmlSanitizer, <br> and <hr> are the only ones that are also listed as official void elements in the documentation above. So these are probably all that we need to support.
A trivial solution would be to add "br/", "br /", "hr/" and "hr /" to the existing whitelist. Alternatively, you could try to change the code to handle these trailing slashes with and without whitespace properly.

Steps to reproduce

  @Test
  void testBr()
  {
    assertAll(
        () -> assertEquals( "a<br>b", SimpleHtmlSanitizer.sanitizeHtml( "a<br>b" ).asString() ), // success
        () -> assertEquals( "a<br/>b", SimpleHtmlSanitizer.sanitizeHtml( "a<br/>b" ).asString() ), // fail -> a&lt;br/&gt;b
        () -> assertEquals( "a<br />b", SimpleHtmlSanitizer.sanitizeHtml( "a<br />b" ).asString() ) // fail -> a&lt;br /&gt;b
    );
  }
  
  @Test
  void testHr()
  {
    assertAll(
        () -> assertEquals( "a<hr>b", SimpleHtmlSanitizer.sanitizeHtml( "a<hr>b" ).asString() ), // success
        () -> assertEquals( "a<hr/>b", SimpleHtmlSanitizer.sanitizeHtml( "a<hr/>b" ).asString() ), // fail -> a&lt;hr/&gt;b
        () -> assertEquals( "a<hr />b", SimpleHtmlSanitizer.sanitizeHtml( "a<hr />b" ).asString() ) // fail -> a&lt;hr /&gt;b
    );
  }

Known workarounds

Perform a text replacement via regex to turn the tags with trailing slashes into those without before performing the sanitization. But that can be a lot of boilerplate throughout the application.
Implement your own HtmlSanitizer. But it would be preferable to use an officially maintained solution.

The text was updated successfully, but these errors were encountered:

niloc132 · 2024-08-22T14:29:41Z

Thanks for the report - I do note the following from your link:

Self-closing tags () do not exist in HTML.

If a trailing / (slash) character is present in the start tag of an HTML element, HTML parsers ignore that slash character.

That paragraph goes on to discuss ways that different implementations might interpret such malformed html differently.

Then this note:

Self-closing tags are required in void elements in XML, XHTML, and SVG (e.g., <circle cx="50" cy="50" r="50" />).

As such, it seems very reasonable that the SimpleHtmlSanitizer does not support XML or XHTML, as these are not the same thing. Am I missing something here that would make it correct for valid HTML to contain self-closing tags? We wouldn't want this to be ambiguous, but it seems it could be safe for void elements? I'll leave this point open to more discussion.

For the rest of this, let's assume that we've either elected to add a SimpleXhtmlSanitizer, or have found a valid way in which HTML can contain self-closing tags:

I think solving the "allow a whitespace char" could be solved distinctly from "allow self-closing tags", but in the same block of code:
https://github.com/gwtproject/gwt/blob/22ee5a676cc2eae46ecb341f2746c67ba8170ec5/user/src/com/google/gwt/safehtml/shared/SimpleHtmlSanitizer.java
Matching for a / just before tagEnd could let us decrement tagEnd by one, which would let tag then be hr or hr depending on if there is a space char. We could then trim the string so that it correctly matches our allow-list.

If I'm reading this right, this would capture just the tag name itself so that either <hr /> or <hr/> would be normalized to <hr> - correct for void elements at least.

The most strict interpretation I can find here then is that if we had a specific list of void elements and only permit the above transformation for those (as you noted, this would be only br and hr).

CrushaKRool · 2024-08-23T07:05:39Z

You can of course make an argument that self-closing tags are not valid in the first place.
But even if they are used, the browser generally does the right thing, so you cannot always expect that people do "the right thing" if the wrong thing still works just the same and they don't know any better.

And since the purpose of the SimpleHtmlSanitizer seems to be to do a best effort to sanitize input that may not be entirely under your control (while still allowing it to make some use of basic formatting tags), I'd argue that it wouldn't hurt to also go the extra mile and support these cases that may not be fully conforming to the spec but may be encountered in the wild nonetheless.

A distinction between a separate sanitizer for HTML and XHTML just for these cases may not be suitable, because you may not know in advance whether the input you receive is aiming to be HTML, XHTML or even just some weird mix that would still work in modern browsers. And if the SimpleXhtmlSanitizer would basically do the same thing as the regular one except for also accounting for self-closing tags, it would rather boil down to the question of "how lenient do you want to be with input that does not follow the HTML5 spec to a tee?".

Side note: The lack of support for self-closing tags seems to have also been brought up in a comment way back when support for <br> was added, but it does not seem to have been discussed any further in that thread.
#7509 (comment)

niloc132 · 2024-08-23T13:43:20Z

I'm not disagreeing that this would be a good addition - that's why I outlined how the change could be made. But I think there is a solid case to be made that this is not what the tool intended, since it forbids lots of other valid HTML5 content too, and supports/normalized no invalid HTML5.

"Sanitizing output outside of your control" usually means a best effort has to be made - is there any example of a case that isn't fully conforming that is supported today?

My naming might not have been ideal - something that implies that like tools like jsoup it is rewriting the content to be valid and then sanitizing to a subset could make sense. I don't have a concrete suggestion here, but I think as I showed the implementation shouldn't be terribly difficult. What about either a subclass that adds this feature (and documents it accordingly), or a default property or ctor arg so that existing implementations continue to function as they did, but new uses can opt-in to slightly more flexible behavior?

Again, I'm not objecting to this, but I want there to be some discussion around suitability and design before loosening what is designed to be a safety feature.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SimpleHtmlSanitizer should support self-closing variants of whitelisted tags #9994

SimpleHtmlSanitizer should support self-closing variants of whitelisted tags #9994

CrushaKRool commented Aug 22, 2024 •

edited

Loading

niloc132 commented Aug 22, 2024

CrushaKRool commented Aug 23, 2024 •

edited

Loading

niloc132 commented Aug 23, 2024

SimpleHtmlSanitizer should support self-closing variants of whitelisted tags #9994

SimpleHtmlSanitizer should support self-closing variants of whitelisted tags #9994

Comments

CrushaKRool commented Aug 22, 2024 • edited Loading

Description

Steps to reproduce

Known workarounds

niloc132 commented Aug 22, 2024

CrushaKRool commented Aug 23, 2024 • edited Loading

niloc132 commented Aug 23, 2024

CrushaKRool commented Aug 22, 2024 •

edited

Loading

CrushaKRool commented Aug 23, 2024 •

edited

Loading