showdown/blog/rewriting-the-parser-challenges-and-opportunities.md
Estevao Soares dos Santos ca8c22d9bf update post 3
2018-10-18 03:20:08 +01:00

««« title: Rewriting the parser: Challenges and opportunities, author: Estevão Soares dos Santos, author_avatar: img/avatars/tivie.jpg, date: 2018-09-25, language: en, image: img/blog/2018.09.25.jpg, summary: There comes a time when tough (and exciting) choices need to be made. With a version 2.0 in the oven (and the exciting reverse parser already completed), it seemed an excellent opportunity to do something I had wanted to do for a long time: rewriting showdown's parser from scratch. But maybe that challenge was a lot tougher than I first anticipated! »»»

There comes a time when tough (and exciting) choices need to be made. With a version 2.0 in the oven (and the exciting reverse parser already completed), it seemed an excellent opportunity to do something I had wanted to do for a long time: rewriting showdown's parser from scratch.

But maybe that challenge was a lot tougher than I first anticipated!

The old parser

The old parser worked by replacing markdown text with HTML in place, directly in the original string, through a series of regular expressions. While this was true to showdown's origins in John Gruber's Markdown.pl, over the last 3 years it became clear that regex, at least by itself, was not suitable for the parsing job.
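To make the in-place approach concrete, here is a minimal sketch. The two rules and the `convertInPlace` helper are hypothetical and heavily simplified; they are not showdown's actual expressions. Each rule rewrites the whole string and hands the result to the next rule, so the outcome depends on the order in which the rules run.

```javascript
// Hypothetical, simplified rules: NOT showdown's real expressions.
const rules = [
  { pattern: /\*\*(\S[\s\S]*?)\*\*/g, replacement: '<strong>$1</strong>' },
  { pattern: /\*(\S[\s\S]*?)\*/g, replacement: '<em>$1</em>' },
];

// Apply every rule to the whole string, one after the other.
// Each rule only ever sees the output of the previous one.
function convertInPlace(text, ruleList) {
  return ruleList.reduce(
    (current, rule) => current.replace(rule.pattern, rule.replacement),
    text
  );
}

const input = 'some **bold** and *emphasised* text';

// Strong first, then emphasis: the intended result.
const inOrder = convertInPlace(input, rules);

// Emphasis first: the single-asterisk rule grabs part of "**bold**"
// before the strong rule ever gets a chance to see it.
const reversed = convertInPlace(input, [...rules].reverse());

console.log(inOrder);
console.log(reversed);
```

With the rules reversed, the output is no longer valid markup for the input, which is the kind of ordering constraint an in-place converter has to keep track of.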

Don't get me wrong, regex is a great tool: not only does it work for 99% of use cases, it also makes it extremely easy to develop extensions that complement showdown's features, even for beginner programmers.

The problem, however, is those pesky edge cases: that 1% of weird scenarios that ended up making showdown's regular expressions increasingly complex over time.

In fact, when you compare v1.0 to v1.8.6, it's plain to see that almost all bug fixes (with notable exceptions) were edge case fixes.

Regex Madness

Some regexes grew so unwieldy that I ended up needing to split them into chunks and do some convoluted stuff to keep them minimally sane. For instance, it takes 6 RegExps just to parse __foo__bar__baz__.

  if (options.literalMidWordUnderscores) {
    text = text.replace(/\b___(\S[\s\S]*)___\b/g, function (wm, txt) {
      return parseInside (txt, '<strong><em>', '</em></strong>');
    });
    text = text.replace(/\b__(\S[\s\S]*)__\b/g, function (wm, txt) {
      return parseInside (txt, '<strong>', '</strong>');
    });
    text = text.replace(/\b_(\S[\s\S]*?)_\b/g, function (wm, txt) {
      return parseInside (txt, '<em>', '</em>');
    });
  } else {
    text = text.replace(/___(\S[\s\S]*?)___/g, function (wm, m) {
      return (/\S$/.test(m)) ? parseInside (m, '<strong><em>', '</em></strong>') : wm;
    });
    text = text.replace(/__(\S[\s\S]*?)__/g, function (wm, m) {
      return (/\S$/.test(m)) ? parseInside (m, '<strong>', '</strong>') : wm;
    });
    text = text.replace(/_([^\s_][\s\S]*?)_/g, function (wm, m) {
      // !/^_[^_]/.test(m) - test if it doesn't start with __ (since it seems redundant, we removed it)
      return (/\S$/.test(m)) ? parseInside (m, '<em>', '</em>') : wm;
    });
  }

Others, which were inherited from when I took over the project, were simply poorly written and hid some nasty bugs.

JavaScript's regex limitations (such as no support for lookbehind or for atomic and possessive grouping) meant that I needed to rely on the language's quirks and look for hacks to overcome them. At some point I even had to adapt a recursive RegExp extension, which seems like overkill for a syntax that doesn't really care about balanced stuff.
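As an illustration of the kind of workaround this implies, here is a common way to emulate a negative lookbehind on engines that lack it: capture the preceding character (or the start of the input) and put it back in the replacement. The `emphasise` function and its pattern are assumptions made up for this example, not code taken from showdown.

```javascript
// Goal: wrap _word_ in <em> only when the opening underscore is NOT preceded
// by a word character, which is what (?<!\w)_ would say with lookbehind.
// Workaround: match the preceding character (or start of input) in a capture
// group and re-emit it, so it is not lost in the replacement.
// Hypothetical helper for illustration only.
function emphasise(text) {
  return text.replace(
    /(^|\W)_(\S[\s\S]*?)_(?!\w)/g,
    function (wholeMatch, before, inner) {
      return before + '<em>' + inner + '</em>';
    }
  );
}

const plain = emphasise('a _word_ here');
const identifier = emphasise('snake_case_name');
const adjacent = emphasise('_a_ _b_');

console.log(plain);      // underscores around "word" become <em>
console.log(identifier); // mid-word underscores are left alone
console.log(adjacent);   // both words are emphasised
```

The cost of this trick is that the captured prefix is consumed by the match, so every such rule needs a callback that remembers to restore it. Multiply that by a few dozen rules and the expressions get hard to maintain.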

And some stuff was just [impossible to fix](https://github.com/showdownjs/showdown/issues/549) without making the parser very VERY slow for everyone.

Enter the new parser

For all those reasons, and some more, I felt that version 2.0 should have a shiny and proper new parser that reads the input sequentially and is aware of context (more code logic and less RegExp wizardry 🎉). The new parser is, in fact, a lot more similar to a PEG parser than to a RegExp converter with in-place substitutions.

Some key features include:

  • Sequential

    The new parser reads the input sequentially. When a syntax element match is found, it "stores" the element without changing the original string. This means that Showdown no longer relies on (nor needs) a specific order in which sub parsers are invoked.

  • Full separation of the parsing and conversion steps

    Instead of making in-place substitutions in the original input, the new parser creates an intermediate abstract layer: an object that is a node tree of elements, similar to the DOM tree in the browser. This makes it easy to manipulate each node before outputting it as a string again, possibly in another format.

  • Output manipulation is extremely easy and customizable through templates

    The node tree (and each individual node) can be manipulated as you see fit since it's, in practice, an agnostic representation of an element. Each node has a template for each supported format: HTML and Markdown. But you can even add more formats if you wish.

    What is cool is that not only can you manipulate the tree (add, delete or swap nodes), you can also tweak or completely change the output of all nodes of a type, or even of a specific, individual node.

  • Extensions are a lot more powerful now

    The extension system is being completely redone and will tie in really really well with the new parser and Reverse Converter.

  • Faster (it seems)

    RegExps, especially the complex ones, are slow. So, moving away from them, and only using RegExp for quick checks, should speed up the parser a lot. Well, at least in theory.
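The ideas in the list above can be sketched in a few lines. This is a toy illustration only: the node shape, the `templates` table and the `parseInline` and `render` functions are hypothetical names invented for this example, not showdown's actual API, and the scanner handles just `*em*` and `**strong**`. It reads the input once from left to right, never modifies it, builds a node tree, and only then renders that tree through per-format templates.

```javascript
// Hypothetical template table: one template per node type and output format.
// HTML escaping is deliberately left out to keep the sketch short.
const templates = {
  html: {
    text: (node) => node.value,
    strong: (node, inner) => '<strong>' + inner + '</strong>',
    em: (node, inner) => '<em>' + inner + '</em>',
  },
  markdown: {
    text: (node) => node.value,
    strong: (node, inner) => '**' + inner + '**',
    em: (node, inner) => '*' + inner + '*',
  },
};

// Parsing step: scan left to right and return a list of nodes.
// The input string itself is never changed.
function parseInline(input) {
  const nodes = [];
  let pos = 0;
  let textStart = 0;

  // Store any plain text seen since the last element as a text node.
  const flushText = (end) => {
    if (end > textStart) {
      nodes.push({ type: 'text', value: input.slice(textStart, end) });
    }
  };

  while (pos < input.length) {
    if (input[pos] === '*') {
      const delimiter = input.startsWith('**', pos) ? '**' : '*';
      const contentStart = pos + delimiter.length;
      const close = input.indexOf(delimiter, contentStart);
      if (close > contentStart) {
        flushText(pos);
        nodes.push({
          type: delimiter === '**' ? 'strong' : 'em',
          children: parseInline(input.slice(contentStart, close)),
        });
        pos = close + delimiter.length;
        textStart = pos;
        continue;
      }
    }
    pos += 1;
  }
  flushText(input.length);
  return nodes;
}

// Conversion step: walk the tree and apply the template for each node.
// `overrides` lets a caller replace the template for a whole node type.
function render(nodes, format, overrides = {}) {
  return nodes
    .map((node) => {
      const inner = node.children ? render(node.children, format, overrides) : '';
      const template = overrides[node.type] || templates[format][node.type];
      return template(node, inner);
    })
    .join('');
}

const tree = parseInline('a **bold *nested* word** here');
const html = render(tree, 'html');
const markdown = render(tree, 'markdown');
const italics = render(tree, 'html', {
  em: (node, inner) => '<i>' + inner + '</i>',
});

console.log(html);
console.log(markdown);
console.log(italics);
```

Because the tree is independent of any output format, the same `tree` renders to HTML or back to Markdown, and changing how every `em` node is emitted is a matter of swapping one template rather than rewriting a regular expression.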

Not everything is roses, though

Unfortunately, to accomplish this and successfully move away from RegExp, 3 things were needed:

  1. Keep the extension logic simple (while making it more powerful)
  2. Time
  3. Accurately estimate number 2 (time needed vs free time)

While I feel that the first one is being accomplished quite nicely (if I may say so myself), I overestimated my free time and definitely failed at number 3. This meant I had to keep delaying the alpha release of v2.0, which I had intended to release in mid 2018.

Between work and family, I dedicate the little free time I get to this library. I do feel that things are still on the right track, albeit moving a lot more slowly than I expected (and wished for).

Regardless, I'm really excited about the new features for version 2.0, especially the reverse converter and how it will tie in together with the new parser and event system.

I would really, really, really appreciate if you could donate, you know?

For all those reasons, working on version 2.0 consumes a lot of my free time, which, unfortunately, I don't have much of lately.

I would really, REALLY appreciate it if you could donate. Your contribution will mean a lot to me and will really help me dedicate less time to my day job (and those annoying extra hours) and more time to developing this (awesome) library.

So... if you like my work and find our library useful, please donate through Patreon (coming soon) or directly through PayPal!!

Thank you!!