ConsumeParseRenderVsMatchTransform

MeatballWiki | RecentChanges | Random Page | Indices | Categories

Traditional markup languages are often created in a such a way that they may be parsed using standard tools such as Lex & Yacc. These are well known tools, so I'll skip straight to the point. Wiki languages as implemented by many wiki engines cannot be parsed correctly using Lex and Yacc because the languages they define often fall outside the realm of LALR(1) context-free languages, and often contain ambiguities.

Furthermore a key difference between lex/yacc consume/parse/render engines (e.g. a compiler) and match/transform engines (like wikis) is this:

consume/parse/render engines take a piece of text (from a lexer), when it's matched add to stack, periodically match a grammar rule against the stack and reduce the size of the stack, applying any code, and replacing the top of the stack with a new item. Two key points come out of this - the output of a lex doesn't directly affect any specific reduce action, and the result of a reduce doesn't result in the source text being changed earlier in the stream. (Text later in the source stream can be changed potentially, but not earlier) Given lexing rules generally result in a single match this means that one lexing rule cannot affect another. This means that the languages these parsers cope with best are context-free, non-ambiguous languages.

In match/transform, the entire text is matched against a pattern. Each match is put through a transform, throughout the entire text. This means that repeated rewrite rules can, and do impact the meaning of other language rules. (In some systems this can be useful - such as the substitution of a variable prior to WikiWord linking, in others it can be a pain - such as generating CDML and the resulting markup hitting forced link syntax rules) This means that the majority of languages wikis use are context-sensitive languages. Not only this some wiki languages are ambiguous when viewed from context-free system tools. This ambiguity is normally resolved by producing a wiki implementation and having that order of evaluation as the language definition.

This impacts a WikiInterchangeFormat in that essentially what a WIF needs to convey is meaning, which can be ignored by a second wiki. Traditionally the simplest way of doing this would be to exchange a parse tree - the abstract syntax tree for the document, and allow the second wiki to simplify the content it doesn't understand to just plain text. In a consume/parse/render system such a model is relatively simple since the AST will normally exist, and we just need to do a tree walk.

In the case of a Match/Transform system we need to modify every match/transform pair in the system to help us build the AST. Unlike a standard AST which would be created in parallel with the original datasource with a match/transform system, this decision has to be embedded in the transformation. The ability to "just dump the AST" doesn't apply - since it never gets built.

On the languages defined:

If I define 10 lexing rules, and 100 grammar rules in a lex/yacc system, I define 1 language
If I define 110 match/transform rules in a wiki style rewrite system, unless I define an evaluation order (rare - normally the code is the documentation here), I define O(factorialStyelFunction(110)) languages. Usemod has 185 match/transform rules (taking a build from last year), TWiki has well over 1000. Many of the implicitly defined languages will look and act the same, but border cases will often occur.

This means two wiki implementations can use precisely the same syntax, but have completely different languages due to evaluation sequence - as a result simply exchange the original markup between the two can cause problems. (This also affects upgrades and rewrites of wikis.)

An interesting side effect is that if you are able to extract the AST from a wiki through appropriate repeated match/transform rules (perhaps by interception), then you gain the ability to optimise the wiki rendering process. (Apply the match/transform context-senstive parsing on the way to disk to serialise the AST, and on the way out from disk using a context free consumer/parse/render system to transform to a target format). Furthermore such an approach future-proofs the content from changes to the parser. (Consider that changing the order of 10 markup rules in usemod/TWiki would result in a different language, and hence different parsing results. Simply inserting an extra rewrite rule somewhere in the middle changes the language significantly.)

I'm still not certain on how far this impacts things like a WikiInterchangeFormat and other things, but it struck me as significant enough to document seperately.

-- MichaelSamuels

Is the success of the Wookee engine relevant here? http://wiki.beyondunreal.com/wiki/Wookee --MatthewSimoneau

ConsumeParseRenderVsMatchTransform

Discussion