Certain tags like <script> but also <title> and others switch an HTML5 parser
into raw mode, which causes the rest of the HTML string to be always parsed as
text, including any elements or entities that we do want to support (e.g. <p>).
As we're going to escape any of the raw text elements anyway (it's e.g. script,
style, title, xmp, noframes, and a couple of others) we can just switch of raw
text parsing by disabling it after each starting tag.
The sanitization code does not retain any particular escaped entities - it
parses the HTML and thus loses the information on what entities were in the
original. The result is correct UTF-8 HTML though.
Use an HTML5 compliant parser that interprets HTML as a browser would to parse
the Markdown result and then sanitize based on the result.
Escape unrecognized and disallowed HTML in the result.
Currently works with a hard coded whitelist of safe HTML tags and attributes.
Fixing:
55cd82008e
This commit introduced a html tag whitelist which does not include any table tags (<td>,<tr>,<thead>...). Therefore even tables the markdown parser itself generated will be removed.
Change firstPass() code that checks for fenced code blocks to check all
of them and properly keep track of lastFencedCodeBlockEnd.
This way, it won't misinterpret the end of a fenced code block as a
beginning of a new one.
The issue is that when there are more than 1 fenced code blocks with a
blank line before and after, the parser introduces a single extra new
line to all the fenced code blocks except the last one.
If autolink encounters a link which already has an escaped html entity,
it would escape the ampersand again, producing things like these:
& --> &amp;
" --> &quot;
This commit solves that by first looking for all entity-looking things
in the link and copying those ranges verbatim, only considering the rest
of the string for escaping.
Doesn't seem to have considerable performance impact.
The mailto: links are processed the old way.
This gives a ~10% slowdown of a full test run, which is tolerable.
Switch statement is still slightly slower (~5%). Using map turned out to
be unacceptably slow (~3x slowdown).