Project Description
The XHTML Markup Sanitizer takes untrusted (X)HTML and massages it into real, trusted XHTML. While plenty of effort goes into preserving the original intent, markup validity and safety are the first priority.
It's particularly useful with content management systems where users control the markup but you want to target XHTML 1.1.
How To Get It
This project is only available on NuGet:
Install-Package MarkupSanitizer
Project Status
The current build supports most of the scenarios that you can throw at it. It includes a default tag profile, which is easy to update if you grab the code. The plan from here is to revamp the approach to nesting and to support multiple tag profiles (e.g. basic formatting, no tables).
Simple Examples
The sanitizer cleans up malformed tags, unclosed tags, and unescaped text. In doing so, it preserves correctly formatted tags and correctly escaped text.
| Input | Output |
| text | <p>text</p> |
| <p>text | <p>text</p> |
| <p>text<p> | <p>text</p> |
| <p>text<p>more text | <p>text</p><p>more text</p> |
| <p>some structured text</p>some free text<p>some more structured text</p> | <p>some structured text</p><p>some free text</p><p>some more structured text</p> |
| <p>text<li>bullet | <p>text</p><ul><li>bullet</li></ul> |
| <p>a broken</a> tag</p> | <p>a broken tag</p> |
| <p>a <strong>partial tag</p> | <p>a <strong>partial tag</strong></p> |
| dogs & cats or cats &amp; dogs | <p>dogs &amp; cats or cats &amp; dogs</p> |
| <p>some <b>bold</b> text</p> | <p>some <strong>bold</strong> text</p> |
| <p>some <i>italic</i> text</p> | <p>some <em>italic</em> text</p> |
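The clean-up shown in the table can be pictured as a stack of open tags. The following is a minimal Python sketch of that idea, not the project's actual code (the package itself is a .NET library): the `RENAME` mapping, the `TagBalancer` class and the `balance` function are all names invented for this illustration. It handles only stray closing tags, unclosed tags, and the `b`/`i` substitutions, and it drops attributes entirely.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical mapping of presentational tags to semantic equivalents.
RENAME = {"b": "strong", "i": "em"}


class TagBalancer(HTMLParser):
    """Illustrative only: balances tags with a stack of open elements."""

    def __init__(self):
        # Decoded text arrives in handle_data, so it is escaped exactly once.
        super().__init__(convert_charrefs=True)
        self.stack = []  # currently open tags, outermost first
        self.out = []

    def handle_starttag(self, tag, attrs):
        tag = RENAME.get(tag, tag)
        self.stack.append(tag)
        self.out.append(f"<{tag}>")  # attributes are dropped in this sketch

    def handle_endtag(self, tag):
        tag = RENAME.get(tag, tag)
        if tag not in self.stack:
            return  # closing tag with no matching open tag: drop it
        # Close anything still open inside the element being closed.
        while True:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def balance(markup):
    parser = TagBalancer()
    parser.feed(markup)
    parser.close()
    while parser.stack:  # close whatever is still open at end of input
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


print(balance("<p>a <strong>partial tag</p>"))
```

Paragraph wrapping of free text and splitting of nested paragraphs are deliberately left out here to keep the stack logic visible.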
Secure
The sanitizer is based on a whitelist approach and employs the AntiXSS library. Unknown tags and attributes are stripped. Special characters are encoded exactly once, so text that is already encoded is not double encoded. Encoding is also context-aware, so text within attributes is encoded more aggressively.
| Input | Output |
| dogs & cats & amp;   | <p>dogs &  cats & amp;  </p> |
| <p><a href="dogs & cats">text</a></p> | <p><a href="dogs & cats">text</a></p> |
| <form action="hacker"><p>yo</p></form> | <p>yo</p> |
| <p onclick="danger();">yo</p> | <p>yo</p> |
| <a href="http://mysite" onclick="danger();">yo</a> | <a href="http://mysite">yo</a> |
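To make the whitelist idea concrete, here is a rough Python sketch under stated assumptions. It does not use AntiXSS and is not how the package is implemented; the `ALLOWED` table, `WhitelistFilter` and `strip_untrusted` are hypothetical names, and the whitelist contents are chosen only to cover the rows above. Text is decoded by the parser and re-encoded once on output, which is one simple way to avoid double encoding.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attribute names allowed on that tag.
ALLOWED = {"p": set(), "strong": set(), "em": set(), "a": {"href"}}


class WhitelistFilter(HTMLParser):
    """Illustrative only: keeps whitelisted tags and attributes."""

    def __init__(self):
        # convert_charrefs=True delivers decoded text, so "&amp;" and a raw
        # "&" both arrive as "&" and are encoded exactly once on output.
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # unknown tag: drop the tag itself, keep its content
        kept = []
        for name, value in attrs:
            if name in ALLOWED[tag]:
                # Attribute context: quotes are encoded as well.
                encoded = escape(value or "", quote=True)
                kept.append(f' {name}="{encoded}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text context: only &, < and > need encoding.
        self.out.append(escape(data, quote=False))


def strip_untrusted(markup):
    parser = WhitelistFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(strip_untrusted('<a href="http://mysite" onclick="danger();">yo</a>'))
```

The sketch escapes quotes only in attribute values, which is a much weaker form of context-aware encoding than a dedicated library provides, so treat it as an explanation of the approach rather than something to deploy.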
Injections
The sanitizer will inject tags to ensure correct nesting. In doing so, it will also prevent certain nestings.
| Input | Output |
| <p>a <td>bit of</td> text</p> | <p>a </p><table><tr><td>bit of</td></tr></table><p> text</p> |
| <p>Hi <li>there</li></p> | <p>Hi</p><ul><li>there</li></ul> |
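One way to think about tag injection is as a set of "required parent" rules applied while walking the input. The Python sketch below illustrates that idea only; it is not the sanitizer's nesting algorithm, and `REQUIRED_PARENT`, `NO_TEXT`, `BREAKS_PARAGRAPH`, `NestingFixer` and `fix_nesting` are assumptions made for this example. Whitespace handling is also simpler than in the table above, so trailing spaces inside a closed paragraph are kept.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical nesting rules for this sketch, not the project's tag profile.
REQUIRED_PARENT = {"li": "ul", "td": "tr", "tr": "table"}
NO_TEXT = {"ul", "table", "tr"}  # containers that cannot hold bare text
BREAKS_PARAGRAPH = {"p", "ul", "li", "table", "tr", "td"}


class NestingFixer(HTMLParser):
    """Illustrative only: injects missing ancestors and splits paragraphs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # currently open tags, outermost first
        self.out = []

    def _open(self, tag):
        self.stack.append(tag)
        self.out.append(f"<{tag}>")

    def _close_through(self, tag):
        # Close open tags from the innermost outwards, up to and including tag.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_starttag(self, tag, attrs):
        if tag in BREAKS_PARAGRAPH and "p" in self.stack:
            self._close_through("p")  # block content may not sit inside <p>
        # Collect the missing ancestors, innermost first.
        missing, current = [], tag
        while current in REQUIRED_PARENT:
            parent = REQUIRED_PARENT[current]
            if self.stack and self.stack[-1] == parent:
                break
            missing.append(parent)
            current = parent
        for parent in reversed(missing):
            self._open(parent)
        self._open(tag)  # attributes are dropped in this sketch

    def handle_endtag(self, tag):
        if tag in self.stack:  # stray closing tags are ignored
            self._close_through(tag)

    def handle_data(self, data):
        # Bare text cannot live directly in a list or table container.
        while self.stack and self.stack[-1] in NO_TEXT:
            self._close_through(self.stack[-1])
        if not self.stack:
            self._open("p")
        self.out.append(escape(data, quote=False))


def fix_nesting(markup):
    fixer = NestingFixer()
    fixer.feed(markup)
    fixer.close()
    while fixer.stack:  # close whatever is still open at end of input
        fixer._close_through(fixer.stack[-1])
    return "".join(fixer.out)


print(fix_nesting("<p>a <td>bit of</td> text</p>"))
```

Keeping the rules in plain dictionaries and sets means a different tag profile could be expressed by swapping those tables, without touching the walking logic.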