Project Description

The XHTML Markup Sanitizer takes untrusted (X)HTML and massages it into valid, trusted XHTML. While plenty of effort goes into preserving the original intent, markup validity and safety are the first priority.
It's particularly useful with content management systems where users control the markup but you want to target XHTML 1.1.

How To Get It

This project is available only on NuGet. From the Package Manager Console, run:

    Install-Package MarkupSanitizer

Project Status

The current build supports most scenarios you can throw at it. It includes a default tag profile, which is easy to update if you grab the code. The plan from here is to revamp the approach to nesting and to support multiple tag profiles (e.g. basic formatting, no tables).

Simple Examples

The sanitizer cleans up malformed tags, unclosed tags, and unescaped text. In doing so, it preserves correctly formatted tags and correctly escaped text.
Input:  text
Output: <p>text</p>

Input:  <p>text
Output: <p>text</p>

Input:  <p>text<p>
Output: <p>text</p>

Input:  <p>text<p>more text
Output: <p>text</p><p>more text</p>

Input:  <p>some structured text</p>some free text<p>some more structured text</p>
Output: <p>some structured text</p><p>some free text</p><p>some more structured text</p>

Input:  <p>text<li>bullet
Output: <p>text</p><ul><li>bullet</li></ul>

Input:  <p>a broken</a> tag</p>
Output: <p>a broken tag</p>

Input:  <p>a <strong>partial tag</p>
Output: <p>a <strong>partial tag</strong></p>

Input:  dogs & cats or cats &amp; dogs
Output: <p>dogs &#38; cats or cats &#38; dogs</p>

Input:  <p>some <b>bold</b> text</p>
Output: <p>some <strong>bold</strong> text</p>

Input:  <p>some <i>italic</i> text</p>
Output: <p>some <em>italic</em> text</p>
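
A few of the cases above (unclosed tags, stray closing tags, and the b/i replacements) can be approximated with a small stack-based pass. The following is only an illustrative Python sketch built on the standard html.parser module, not the sanitizer's own code or API; the names TagBalancer, balance, and RENAMES are hypothetical. It ignores attributes and does not handle implied closes such as a second <p> inside an open <p>.

```python
from html import escape
from html.parser import HTMLParser

# Assumed mapping of presentational tags to semantic equivalents.
RENAMES = {"b": "strong", "i": "em"}


class TagBalancer(HTMLParser):
    """Toy balancer: closes open tags, drops stray closers, renames b/i."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments
        self.open_tags = []  # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        tag = RENAMES.get(tag, tag)
        self.open_tags.append(tag)
        self.out.append(f"<{tag}>")  # attributes are ignored in this sketch

    def handle_endtag(self, tag):
        tag = RENAMES.get(tag, tag)
        if tag not in self.open_tags:
            return  # stray closing tag such as </a>: drop it
        # Close everything opened after the matching tag, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()  # flush any buffered text
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def balance(markup):
    parser = TagBalancer()
    parser.feed(markup)
    return parser.result()


print(balance("<p>a <strong>partial tag</p>"))
# <p>a <strong>partial tag</strong></p>
```

The stack makes the "close what was left open" rule explicit: a closing tag unwinds everything opened after its match, and anything still open at the end of input is closed in reverse order.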

Secure

The sanitizer is based on a whitelist approach and employs the AntiXSS library. Unknown tags and attributes are stripped. Special characters are encoded, and text that is already escaped is not double-encoded. Encoding is also context-aware, so text within attributes is encoded more aggressively.
Input:  dogs &amp;&nbsp; cats & amp; &#160;
Output: <p>dogs &#38;&#160; cats &#38; amp&#59; &#160;</p>

Input:  <p><a href="dogs & cats">text</a></p>
Output: <p><a href="dogs&#32;&#38;&#32;cats">text</a></p>

Input:  <form action="hacker"><p>yo</p></form>
Output: <p>yo</p>

Input:  <p onclick="danger();">yo</p>
Output: <p>yo</p>

Input:  <a href="http://mysite" onclick="danger();">yo</a>
Output: <a href="http://mysite">yo</a>
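
The encode-once and context-aware behaviour described above can be sketched roughly as follows. This is a hedged Python illustration, not the AntiXSS library or the sanitizer's implementation: the safe-character sets, the ALLOWED table, and the function names are all assumptions chosen to reproduce the rows above. The idea is to decode any existing entities first, then encode exactly once, with a stricter safe set for attribute values. The sketch does not validate URL schemes.

```python
import html

# Assumed safe characters; everything else becomes a numeric reference.
ATTR_SAFE = set(".,-_:/")   # attribute values: even spaces are encoded
TEXT_SAFE = ATTR_SAFE | {" "}

# Assumed whitelist: tag name -> attribute names that may be kept.
ALLOWED = {"p": set(), "a": {"href"}}


def _encode(value, safe):
    # Decode first so already-escaped input is not encoded twice.
    decoded = html.unescape(value)
    return "".join(
        ch if (ch.isascii() and ch.isalnum()) or ch in safe else f"&#{ord(ch)};"
        for ch in decoded
    )


def encode_text(value):
    return _encode(value, TEXT_SAFE)


def encode_attribute(value):
    return _encode(value, ATTR_SAFE)


def filter_attributes(tag, attrs):
    """Keep only whitelisted attributes, encoding their values."""
    allowed = ALLOWED.get(tag, set())
    return {k: encode_attribute(v) for k, v in attrs.items() if k in allowed}


print(encode_text("dogs & cats or cats &amp; dogs"))
# dogs &#38; cats or cats &#38; dogs
print(encode_attribute("dogs & cats"))
# dogs&#32;&#38;&#32;cats
```

Because unknown tags map to an empty set, an attribute such as onclick is dropped by default rather than needing to be listed as dangerous.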

Injections

The sanitizer injects tags to ensure correct nesting. In doing so, it also prevents certain nestings, such as a table cell or list item inside a paragraph.
Input:  <p>a <td>bit of</td> text</p>
Output: <p>a </p><table><tr><td>bit of</td></tr></table><p> text</p>

Input:  <p>Hi <li>there</li></p>
Output: <p>Hi</p><ul><li>there</li></ul>
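
One way to picture the injection step is as a lookup of required parents plus a rule about what may not sit inside a paragraph. The Python sketch below is an assumption-laden illustration, not the sanitizer's actual nesting logic: REQUIRED_PARENT, NOT_INSIDE_P, and plan_open are hypothetical names, and the sketch covers only the moment a tag is opened (it does not reopen the paragraph afterwards, as the first row above does).

```python
# Assumed nesting rules for the tags used in the examples above.
REQUIRED_PARENT = {"li": "ul", "td": "tr", "tr": "table"}
NOT_INSIDE_P = {"ul", "table", "p"}


def plan_open(open_tags, tag):
    """Return (tags_to_close, tags_to_open) so that `tag` nests validly.

    `open_tags` is the stack of currently open tags, outermost first.
    """
    # Build the chain of missing ancestors, outermost first.
    chain = [tag]
    while chain[0] in REQUIRED_PARENT:
        parent = REQUIRED_PARENT[chain[0]]
        if parent in open_tags:
            break  # the required parent is already open
        chain.insert(0, parent)

    # If the outermost new tag may not live in a paragraph, close the
    # innermost open <p> and everything opened inside it.
    to_close = []
    if chain[0] in NOT_INSIDE_P and "p" in open_tags:
        idx = len(open_tags) - 1 - open_tags[::-1].index("p")
        to_close = open_tags[idx:][::-1]
    return to_close, chain


print(plan_open(["p"], "td"))
# (['p'], ['table', 'tr', 'td'])
```

For the second row, plan_open(["p"], "li") yields (['p'], ['ul', 'li']), which corresponds to emitting </p><ul><li> before the list item's text.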

Last edited Aug 3, 2011 at 12:24 AM by tathamoddie, version 8