WikiParserModel

MeatballWiki | RecentChanges | Random Page | Indices | Categories

There are several types of models WikiSoftware varieties use for parsing.

In general WikiSoftware varieties store the WikiSyntax version of a page in a database someplace, then when the page is requested, the WikiSoftware uses one of the following models to translate the WikiSyntax to HTML.

Of course, this is not always true - the WikiSoftware may store the file as some kind of xml type, or it may even be stored as html - then translate it back to WikiSyntax!

At any rate, all WikiSoftware uses a parsing model at some point to convert WikiSyntax to html/xhtml/xml (unless it is a raw html wiki).

Free Form

The simplist form of a parser - the wiki scans the WikiSyntax a character at a time.

Because this model doesn't have any contraints, most things don't need to be hacked around like a traditional line-based model.

In theory, a free-form model sounds good - however, in practical use Free Form models with larger wikis (ones wiki more features), free form simply result in so much duplication of code that the wiki becomes unmanageable after a while.

Usually Free-Form models are found in early-stage wikis, such as the earlier versions of Eddie's wikiserver (Wiki:EddiesWiki).

Line Based

The traditional model, the WikiSoftware scans the file a line at a time.

For example, a typical way of handling paragraphs with line based models is to check at the beginning of a line to see if the line is currently in another paragraph or another HTML tag that functions like a paragraph, and if not, create the paragraph tag (<p> in HTML/XHTML).

There are a lot of benefits line based models have over basic free form models. Line based models work better with more modern object oriented languages (the paragraph example earlier works well with classes). In addition, code duplication is kept to a minimum, while in general having the same speed as a free form model.

However, there are some large drawbacks to a line based model. For one, certain syntax types like table of contents in openwiki would have to re-scan the entire file, thus being quite slow (this problem is shared by the free form model also). In addition, syntax types that span multiple lines (like <nowiki>) are tricky to implement and require quite a bit of inginuity, usually resulting in some hacks around the line based model.

The line based model is found in quite a few wikis.

State Machine

A more elaborate parsing model - WikiSoftware converts the WikiSyntax to a state machine (usually nfa), and then creates an html page off of the state machine.

An obvious drawback to this approach is the amount of time required just to convert the syntax into a state machine, and then the time required to convert that state machine into HTML can be quite slow - especially with interpretated languages like perl. In addition, state machine models can be time consuming to implement.

The state machine model, however, is the most flexible, and operations like the generation of a table of contents for a page require no lookahead and can be quickly generated from a state machine. In addition, the state machine approach doesn't suffer from any of the implementation problems a line based model has.

One approach to a state machine model that might work well is to save the state machine of a page into the database, rather than the wiki syntax. However, even in this case the state machine would have to be translated when the page is edited AND when the page is about to be shown.

Is the state machine approach the same as a grammar based parser approach (i.e., using things like yacc)?

Does anyone know of a place where wiki parsing is discussed frequently (like a mail list or wiki)?

This would be the place. The mailing lists tend to die out quickly. wiki@freelists.org is the mailing list for the WikiMarkupStandard. -- SunirShah

Hm, regular expressions are themselves state machines, and most wiki codebases use regular expressions to transform lines either using character escapes (like tick markup or freelinks), lines, or other pattern matching (WikiWords). I think the distinction you're looking for is formal vs ad hoc. Most wikis are ad hoc, and the ad hoc model just plain works most of the time, because it's an user interface, not a programming language. The advantage of a formal grammar is that it can typically be translated more easily into efficient low-level code (e.g. lex/yacc, lemon, antlr, etc), and anything out of specification (parse error) can be handled in an unambiguous way, such as hilighting the offending section. Both approaches have merits, and both can be combined within a single codebase without much difficulty. --ChuckAdams

True. My intention was that you would "compile" the page, then save the compiled version of the page instead of the wiki text of the page. Supposively you'd crunch it down to a smaller size too. This would make parsing it rather simple and loading quite fast. Of course, you have to recompile it every time the page is edited - or at least the diffed portions if you were really clever. Kind of inconvenient for the user though - 'course UseMod uses a "database" (text with funky seperators) anyway so that isn't much of a problem for a wiki like that. Really involved and complicated, and on a fast server page loading speeds arn't really a problem (after all 90% of them are in Perl for pete's sake :)). -- RyanNorton

WikiSyntax

WikiParserModel

Free Form

Line Based

State Machine

Discussion