WikiInterchangeFormat

MeatballWiki | RecentChanges | Random Page | Indices | Categories

Current effort is MagnusManske? (MediaWiki), SvenDowideit? (TWiki), EugeneEricKim. (maybe MurrayAltheim?)

If I were to offer any advice, stick to the W3C XML standards and best practices as closely as possible. They have solved a lot of the same problems

Use XHTML-Basic for basic style and structure.
Xlink for the links
- why not simply a with correct/extended usage of rel/class attributes? www.microformats.org is a nice read ... --anon.
- That is a great idea! I agree. microformats.org is currently FishBowled, by the way. --ss
Heavy use of xmlns for non-XHTML-Basic sections.

Also, for complex sections, store the XML source data and an xlink to a REST server that can translate that XML into some renderable XHTML. (Server may return XSLT.) Also, cache some XHTML to gracefully degrade when that server is inaccessible. If you do this as a minimum base standard, it will work, and the rest can evolve from there. -- SunirShah

Historical discussion

For now see WikiXmlDtd, MovingWikiParts?, and Wiki:WikiInterchangeFormat

RSS, using ModWiki and the Content module [1] may be an alternative starting point for a WikiInterchangeFormat.

A good WikiInterchangeFormat will probably be simplistic and simply define a syntax for extension with some defaults rather than have large amounts of detail. Examples of good interchange formats currently in use:

The core of MIME as used by email, NNTP, HTTP, RTSP
Perl/python hashes/dictionaries. (Provide interchange between code modules cleanly/extensibly)
HTML, XML, YAML all provide differing levels of interchange simplicity
...

A WikiInterchangeFormat essentially describes the parse tree of a wiki page as viewed by the wiki server. It has to be a lightweight format IMO to encourage implementors to add in support for it to their wikis. (After all the HTML output currently forms this)

Whilst many wikis limit themselves to basic things like simple formatting and simple linking, some wikis take the wiki model to what others may view as "bloaty" limits allowing in-wiki directives that perform anything TransClusion of content from an application server, processing of database requests, interpretting prolog, through to simply displaying the local data/time in a particular format. A WikiInterchangeFormat should allow these sorts of extensions to be added simply and easily.

The specification for a WikiInterchangeFormat should be small and easy to implement. Parsing and creation of WikiInterchangeFormat should be designed to be as simple for machines to handle as possible. Human readability would be nice, but the emphasis is on interchange between software systems.

Motivation for a WikiInterchangeFormat:

The ability for people to upgrade their wiki software from one wiki implementation to another.
- Avoids "vendor" lockin. (Whilst most wikis are OpenSource, having all your data in a format no other wiki recognises limits your choice of software even though the original developer probably didn't want that.)
The ability for a wiki to provide multiple front ends - allowing different syntaxes or editors. Ranging from text box, through a Wiki:WysiwygWiki through to editting using OpenOffice?. (That might sound like favouring FormOverContent but it's more about allowing interaction with the world outside wiki in a manner acceptable to both worlds.)
To simplify direct generation of multiple formats - eg WML, PDF, RTF...

In order to further acceptance of such a format I would suggest that aiming for an IETF standard might be sensible. This strikes me as appropriate since a good WikiInterchangeFormat has much in common with content encoding schemes and transport protocols.

Related toolsets: atox[2] ("XSLT for plain text"), ...

Some random thoughts.

What could a wiki interchange format communicate ?
How could this be used ?
Each WikiEngines has a different feature set

To my mind it all boils down to this:

Each wiki allows its users to type something that looks like email
The system then parses this to find structure of some kind for automatic formatting and linking
This parse result is then rendered

Steps 2 and 3 are often collapsed into one step, but logically this is one view of what happens. An altenative way of looking at this is step 2 consists of forming a parse tree, and then decorating it. In practice, the actual parse tree step is often skipped, but it does not need to be skipped. (This is directly akin in my eyes to writing a calculate just using lex - you'll end up normally with a reverse polish notation calculator) If these ephemeral parse trees were captured, then dumping those out would allow a system to reconstruct either the original source, or the rendered version. Thus the dumped parse tree would essentially form an interchange format if different engines could read the dumped parse tree.

Incidentally I don't think transforming wiki content into a parse tree should be *that* difficult. After all, all(?) wikis currently transform plain text into a serialised HTML DOM tree, "we" just don't think of it that way very often.

This sidesteps the *need* for any WikiMarkupStandard, whilst still allowing interchange. Furthermore, PostelsLaw would definitely need to be embodied by this. (Be strict in what you provide, lenient in what you accept.) A common way of handling this sort of issue in programming languages is to use named parameters, hashes, dictionaries etc, and defaulting to no information.

I doubt much of this is "right". Please be merciless :) I'll refactor ideas into groups when there's enough to make it sensible (unless someone beats me to it). ---

Since the above I've been rewriting part of the wiki engine I work on to seperate out the parse section from render. The simplest possible example would be something like:

$text =~ s/_([^_]*?)_/ &renderAsBold($1) /ge;
Which for the non-perl readers says for every possible shortest string starting and ending with "_" replace that entire string with the text returned after applying &renderAsBold to the text inside underscores.

In practice I'm not using single shot functions, I'm using a rendering object , and adding rendering methods (in this instance, the class is called Render::HTML). Originally I started off adding one rendering method per HTML, and in the past couple of days I've been changing that to add one rendering method per wiki rule. What does this give the user at the moment? Nothing. What does this give the programmer (who's also a user of course) ? The ability to change the markup without changing the parsing rules. For example, I was trying to track down why a piece of HTML was being generated as it was. (Interaction between 2 rules). As a result I create a new rendering class with the same methods and output, but decorating the output with caller information (call stack) inside HTML comments for every piece of generated markup. I quickly discovered that the place I *thought* was the source of the problem was NOT. Thus saving me time.

In essence, the choice of rendering class acts like a style sheet across the wiki text.

I know some wikis probably do this sort of thing already, but it's led me to a new conclusion:

There's no difference from a modern browser's perspective between an HTML "b" tag and an HTML "span" tag with appropriate rendering semantics attached via a style sheet.
DIV and SPAN together provide many of the advantages of languages like XML but do not suffer from the restriction that you need a fixed DTD.
Given decoupling of parsing from rendering, spitting out DIV and SPAN is a pretty simple exercise for a wiki, *and it retains the semantics given a useful subset*
Whilst allowing full HTML in a WikiInterchangeFormat is probably not such a wise idea due to the LandMine argument, allowing the simplest subset based on either DIV, SPAN or both, might be a way forward.
It would also mean that simplest tools required to read the format sensible would be a browser, and text editor. (Though many _normal_ HTML editors may well also work, and work better)
The simplest "wiki" renderer to present this to the user would then be a stylesheet.

I've had some further thoughts as to where this could go, but I'll stop at this point. There are a number of issues I can see with this already, hence why I want to stop. -- MichaelSamuels

Actually, there is a difference between an HTML "b" tag and a "span". Or rather, between an "em" and a "span". Please use semantic markup where it applies -- TarQuin

: OK, what I meant was there is little difference to the user between the "b" and "span" if they see text rendered as bold. However using span allows you to add arbitrary classes to add further semantic information, which degrades sensibly on systems which don't understand the content. (For example if I mark the first paragraph with span class="summary" - that might mean nothing to wiki engine 1, but wiki engine 2 might interpret the contents as being a summary of the document for markup.) This also deals with the fact that different wikis have different rules for wiki words etc. An interchange based on spans could deal with that issue as span class="wikiword" or similar. (Whether that's sensible or not is another matter, but it does give you far more infomation than a plain <a href link>. Whilst you can (and people do) attach classes to different tags (eg <a href="..." class="wikiword"> ) the advantage would be that you're not stuck with forcing every wiki capable of handling a WIF to parse HTML - you have a very small subset of flexibile tags.

: After all any interchange format between wikis is likely to be trying to communicate the intent of the markup, rather than the specific HTML rendering, or similar. Not all wiki sites by definition would want to offer such a format, but some would. In the corporate world, the ability to periodically migrate content out of a wiki into other formats is especially useful, and something like a span based system might help there (Some people like drafting some shared documents in the wiki, getting it 95% of the way there, snapshotting it and formatting it for sharing).

: BTW, there's two ways to read your comment. In my mind gives me more semantic information than . Since I suspect you meant the opposite (hence the above), I'd be very interested as to counter examples as to why the approach suggested above is bad. (Incidentally I've argued long and hard in other forums against using CSS in wiki pages right or wrongly, since it favours FormOverContent. The irony here is that I'm aiming to use them for semantic markup rather than presentation. If someone created a stylesheet for them, that'd almost be beside the point) -- MichaelSamuels

I've argued these points with people over the past eight years, and I've swung both directions on what constitutes "semantic" vs. "presentational" markup. Having now spent several years studying semiotics and knowledge representation, I realize the whole argument is misguided, that what we call "semantic" is a question of meaning, i.e., that's what "semantic" means: meaning. Meaning only comes with interpretation, and that only comes with humans. Computers don't do meaning. They can be programmed to "interpret" content according to fixed rules, but that's not the same as humans interpreting things. Now, without going any further down this rabbit hole, the point I'm trying to make (admittedly circuitously) is that is exactly the same as from a computer perspective. And honestly, from a markup perspective too. People should stop using the word "semantic" as a weapon and just use "meaning". It's a lot clearer. What do you mean by bold? What do you mean by emphasis? Isn't bold a form of emphasis? Do people honestly believe that 'b' is different from 'em'? The word "semantic" has become all but meaningless. All markup is semantic. It is a form of annotation on plain text. CSS merely allows markup to be presented differently than plain text. The link between the author's intention and the reader's interpretation doesn't exist explicitly; it happens across this chasm we call communication.

Now, I'm actually a lot more interested in simply getting things done rather than arguing about what they mean (I had this beaten into my noggin by a friend of mine who got tired of me arguing). In response to some of the markup language issues, I can speak from a great deal of experience that crafting a markup language is a huge amount of work, involves potentially years of versioning and bug fixes, and seldom captures the actual needs of its user base, since requirements are usually a compromise. So... what about not inventing a new markup language? Since wiki text can obviously be expressed as HTML or XHTML, why not use XHTML as the interchange syntax? What is actually missing that can't be expressed? Are there transclusion features or other linking features that can't? Could these be enumerated? I used to write DTDs for a living. I wrote the modular XHTML DTDs. I could create an XHTML DTD variant for wikis. But I don't know that this would be a good idea. It's often better to use existing markup, as tools already understand it, as do people. But I'm very curious to see a set of requirements for the interchange syntax, to see if XHTML couldn't fulfill them.

I do have one possible augmentation, written up in a spec: [Augmented Metadata in XHTML]. I'd rewrite this spec today to be a lot simpler, involving only one minor change to the XHTML DTD, namely, allowing the <meta> element anywhere XHTML inlines are allowed. What does this mean? Upon locating any <meta> element outside of <head>, it is interpreted to apply to its parent element. In this way, you can add metadata to paragraphs, document headings, divisions, etc. Nice thing is that this is completely invisible to and ignored by current browsers. I think this idea might be quite valuable in capturing some of the metadata in the wiki text content. I'm willing to write up the DTD if necessary -- it just involves one minor change to XHTML. I'll probably be doing this anyway, since I plan to use this for my own work. (I tried to get the W3C HTML WG to think about this after I'd quit working with them, but their interest in anything except XML Schema nowadays is minimal.)

Just some ideas... -- MurrayAltheim

: What difference between  and ? Text readers (for the visually impaired) can handle the latter, but the former makes no sense if there is no text to embolden. This is what "semantic mark-up" means.

Just a few thoughts:

I think that XML is not user-friendly enough (redundancies, intolerance against syntax errors, English-centeredness) to be directly used
I wonder what you think about CDML as a markup from a theoretical point of view (look into [SupportWiki:CdmlElements]
I think that semantic markup is possible and necessary for simplicity. [table markup] [question markup]

-- HelmutLeitner

The text reader argument is not a very good justification, as it's not true. I've been heavily involved in accessibility issues surrounding HTML and XHTML, and to my knowledge, despite nonsense like [3], all of the text readers know what  is. (Can you imagine for a moment someone developing a text reader than didn't? Really?) It means "bold", and even blind people know what bold means. They're blind, not stupid. Their text readers typically make the text louder, just as they do with . This is just left-over political correctness from the 1990s. If it makes people feel better to use a  rather than a , fine. But it makes no practical difference. I'm sorry, but don't need anyone telling me what "semantic markup" means. I know what people think it means. What I've tried to explain is that the difference is competely illusory, both in terms of actual semantics and in terms of actual practice. It's a whipping boy for people's misunderstood notion that "bold" has no meaning, but somehow "strong" does. People in general use  and  interchangeably. And like natural language, they acquire meaning via use, i.e., they are now synonyms. Put it this way: when someone is reading a text written by somebody else, do they as a reader know what the author meant by bolding or italicizing a word? Well, "semantic" = "meaning." If you don't know what the author meant, then you don't know the semantics/meaning of the emboldening, and therefore, you just know it's emboldened. And that means something -- you just don't know what exactly.  is probably less meaningful to most people -- they at least know what bold text looks like. To make matters worse, since about 1994 there's been ongoing confusion as to whether  should be typically rendered as bold or italic. I think the thing that really gets my goat on this issue is that all this nonsense over semantic markup obscures the fact that the W3C in general wouldn't know semantic if they got it shoved up their nose. They are in general horribly confused about where meaning resides. --MurrayAltheim

Ok, there is no practical difference between and , but take <address> and for example - there you see huge difference. You can search document (internet) for addresses in contrast to looking for italicized paragraphs.

It is logically correct to render emphased text underlined, but making bold text render as underlined is self-contradictory.

--KornelLesinski

"why not use XHTML as the interchange syntax?"

By itself, you lose meaning. For example, if I wish to denote overlapping sections for later transclusion, how can I do this with XHTML when XML specifically prohibits overlapping elements? (You would have to create a marker tags,for denoting start/end rather than a single <section> </section> tag.)
Also not many wikis that allow arbitrary HTML input are capable of producing valid XHTML. (TWiki for example claims to spit out XHTML, but fails on many many counts on achieving this goal, which is likely to be nigh on impossible for wikis that do allow arbitrary HTML.)
XHTML is not simple for all wikis to parse. Yes, libraries exist - but break PostelsLaw and are required by the XML spec to fail when provided "bad" XHTML. (This is partly a knockon problem from the previous one.) Furthermore a few (but widely deployed) wikis will not incorporate extra libraries of the extent required for parsing XHTML.
XHTML requires a DTD. This means that you get locked into the meaning/semantics used at that time. Consider:
- MIME style headers have been used for, what, 20 years now - with little overall change? Why? Not a rigid spec - it flexed originally using X- extensions until the requirement for X- was dropped.
- Why does object orientation more successful than plain old stucts? (Or unions?) Flexibility - you're not locked into one basic setup. (Which is also possibly why the X windowing system has gone on as long as it has - due to being OO internally, and from that gaining flexibility.)

The point for me here really is the fact that XHTML is not trivial to parse - far far from it. (It should be, but it fails that test.) It's difficult for wiki implementors to get right during creation if they allow arbitrary HTML etc. And it syntax parsing failure behaviour wrong. (A user will be more interested in reading badly marked up text - or better: plain text - than an error message.) If a WIF gets created I would personally like to see it as easy for wiki implementors to read as it is for them to create - or close. If it isn't then you're not going to get your 30 line wiki implementors using it.

"I think that XML is not user-friendly enough"

The point -- of an Interchange I would like to see :) -- is for it to be machine friendly not user friendly. On that front XML is great. However in order to get the flexibility a WIF would need you would need to invent a mechanism for creating further semantic meaning. HTML does this though the use of SPAN and DIV. MIME did it by inventing the concept of multipart MIME, and I'm sure there's further examples. The rigidity of XML is it's (sole) weakness here.

All the arguments against XML could be used against http and TCP/IP: wiki should be stored and transmitted on paper, because TCP/IP and HTTP are complex protocol that cannot be implemented in 100 lines. If the implementations are buggy, and different, the content will not go trhough. True, still this is the best solution we have so far, and having some way for wiki-to-wiki migration is better than nothing.

I personally don't see WIF as something that end users type in, I see it as something that a wiki generates for transmission to other wikis. It should be flexible enough for wiki creators to express the meaning inherent in their implementations, irrelevant of whether the receiver can understand this. This can be for purposes of TransClusion, or for transfer of content between wikis - since no-one ever chooses right first time round. If it is simple enough for user to manually create - then great (indeed that is probably a "nice to have") but it shouldn't be the goal IMO. The reasoning that led to the (abuse of) span/div idea (which I'm not married to ;) , just using it for illustration) is that the lowest common denominator for reading wiki output is a browser. A page marked up solely with span & div with no style sheet will look pretty much like plain text - but will still contain all the semantic information originally intended. I'm sure a similar reasoning process led to the XHTML idea. The problem is I'm used to using a wiki where I *can* express things like authors, structured data, local variables (which is really just another form of structured data), overlapping sections, etc should I wish to, none of which map cleanly to standard XHTML in this case.

CDML looks very interesting - simple syntax, support for parameterised markup, and fall back semantics as plain text, whilst simulataneously being just simple enough for it to be human friendly to people comfortable with markup. -- MichaelSamuels

Michael, from someone who was present during the creation of XHTML, I can tell you it wasn't out of any wisdom. The W3C wasn't even planning on using DTDs at all, people like Dave Raggett were advocating further (post HTML 4) definitions of HTML via "tag sets" (basically, lists of element types, Dave not being clear on the difference between element types and tags). I was one of the few arguing for DTDs. Now, in reading the marks against XHTML above, most of those would be against any XML-based markup language. If you need balanced tags, you need balanced tags. If you have any expectation of well-formedness (even something less than XML's version of that), you'll have to force people into that definition. If I understand correctly that there can be no expectation of well-formedness at any level (given people can hand author bad markup), then no WIF can be based in SGML or XML unless you take Dave Raggett's idea of "islands of validity" to heart. I don't buy it. Try implementing it. It'd be crazy. So the choice here is either to bite the bullet and go with XML, or not. It's not an easy choice. But any interchange syntax that allows bad markup is going to be a big problem. I'm not sure how you insulate a system from bad markup if you don't do it at author-creation time by flagging errors and refusing documents, which I know isn't going to win any favours, i.e., probably a non-starter. BTW, XHTML is a lot easier to parse than HTML, as it's cleaner. (And in this day and age, nobody should need to write a parser for XHTML -- there are a bunch of very good open source ones available, built into Java even.) I'll end with something I've said before: I've been involved in a number of real-life dramas in the creation of markup languages. They all had full-blown working groups that got together in face-to-face meetings in addition to email, and they all took years to create, even with industry markup experts present. XTM was done in one year of extremely hard work by some of the best markup experts in the world. I physically collapsed that year in Steve Newcomb's hotel room, if it's any indication. I'm not trying to be a wet blanket here, just inject a note of realism about goals. That's why I advocated XHTML. I'm not sure you all would be best served by XML, but if the wiki community is going to try an XML-based markup, use an existing one if at all possible. As the author of the modular XHTML DTDs, I may even be willing to help extend XHTML via modules as needed for wikis (if they are needed), though my time is pretty much taken up by my Ph.D. work for the next year or so. -- MurrayAltheim

"The W3C wasn't even planning on using DTDs"

An interesting point, however they do, and XHTML and XML use DTDs, and are therefore statically typed. This limits their expansion and usage, unless you create an expansion mechanism that works around the DTD. (Such as encoding the "real" element types in attributes, or namespaces.) I can easily see the relevance of having DTDs for XHTML & XML given the target audience. I think that is the point where our discussions diverge above.

Rather than argue the pros and cons of XML/XHTML/etc immediately, I think it makes sense for me to lay out the usage I would like to see from a WIF, and some of the problems facing a WIF. At present I feel DTD's add to these problems though I'm happy to be demonstrated wrong.

For me the target audience is this:

Suppose you have 10 different wikis, with 10 different implementations
Users of those wikis would like to share content between their wikis seemlessly. (This implicitly also gives you the situation whereby each user can just use which ever wiki markup they want - since they just choose the engine with the syntax they like)
When sharing content each site converts it's content to WIF (whatever it ends up being :) sends it to the next site, which then decodes this to it's local syntax.

This isn't a hypothetical example, but rather a simplified example of where a collection of wikis at our workplace are headed. (All run by different people with differing needs etc - much like public wikis.)

For me the crux of a WIF is this: it's a serialised parse tree representing the content as understood by a wiki engine. Many wiki's skip the parse tree creation stage - and indeed some have highly non-regular syntax - indeed many are essentially implemented as context sensitive grammars (whether they realise this or not). I do not think it should be viewed as something that is designed for creation, or serving of content. I would personally be in favour of something that does not require special tools for viewing, but a binary format is just as valid.

Inherent problems:

Not all the engines have the same richness due to different choices chosen by the wiki implementors based on their personal tastes.
The simplest wiki is one that has no markup - just text. Early versions of VoodooPad? and TarQuin's KiwiScript. The question is how to preserve the meaning relayed by a WIF. (Consider RicherWiki? -> WIF -> KiwiScript -> Disk -> KiwiScript -> WIF -> RicherWiki?) This is something I would like to see a WIF capable of handling.

At present I think we need to look at the characteristics needed by an interchange format, rather than go down the route of "use/don't use this/that". Some of XML's advantages for example stem from the fact that it's designed as a rigid machine/machine system. In a wiki world, where sub 100 line implementations are common, anything complex is awkward (and XML is not simple except on the surface).

As I said before some (major) wikis do not use external libraries. Simply stating "they can use an existing parser" is not a solution to the politics involved in the development of those wikis. Multipart-MIME for example is successful because whilst libraries exist for parsing it, it is simple enough to parse without using one. Furthermore, if it is processed by a mailer that doesn't understand it (e.g., mail forwarding), it is resilient enough to keep it's structure intact on the other side.

"If I understand correctly that there can be no expectation of well-formedness at any level"

I think that's a valid summary, however where you go next I feel is wrong. Consider the situation that I give you a page of text marked up in an unfamiliar syntax, but written in a language you understand (or one that the fish can translate for you). You as a person would still be able to read the text. It might be uncomfortable, awkward, ugly, etc, but you would be able to read it. Indeed you could put it into your wiki, and it would just work. It wouldn't look pretty etc, but it would work. You could change it, send it back, and things would just work (probably "just" and "work").

However, imagine that I export the text in XML, and my wiki embeds more semantic information than yours does (eg overlapping sections). Suppose this gets lost when stored in your system (due to the interchange not supporting the usecase, since technology was chosen first). You edit the document and send it back - having added something my wiki doesn't support. (eg wiki trails) Best case here, we lose information - since we've both put circles and triangles into square hole. Worst case both of use extended the DTD and both wiki's rejected the XML supplied as invalid, or worse started playing around with XML namespaces - significantly raising the requirements for readers/writers.

OK, this is unfair, but I've taken the extreme to illustrate the point:

Wiki text (and wikis) is essentially hinted natural language to help machine augmentation for rendering. If you break the hinting, the only thing that breaks is the human reader has to work slightly harder to understand what you wrote (e.g., bullet point hinting getting lost).
XML is essentially a human readable machine language - in the same vein as C, C++, etc Complete with rigid type mechanism. Just like other machine languages if anything goes wrong, the failure mode is to fail hard.

An interchange format for wikis needs (IMO obviously) to have characteristics of the former - due to the eventual target being people. The note I made about CDML being interesting is entirely down to it's failure mode. It fails back to ignoring the hinted semantics back to plain text.

I'm not saying that content should never be wellformed - far from it - it would be nice if content that was "broken" was still readable, but "prickly" (for want of a nice term). That way you users can still read what they want to read, but implementors still have an incentive to make things work better.

"I'm not trying to be a wet blanket here"

Me neither :) I like XML, but I'm not (yet) convinced that an interchange in this problem domain that has a hard failure mode and a rigid set of semantic information is wise. That is unless you force all implementations (and their plugins...) to wander off into namespaces for their extensions, or to abuse XML by essentially defining semantic elements in attributes (which is what span/div do afterall). -- MichaelSamuels

To try and find out how difficult it is to build a valid, complete CDML parser (and rendering system), I wrote one. The core was 75 lines including comments, and is pluggable and extensible. [Here] for the curious. -- MichaelSamuels

I've also just tried hacking up an CDML output module for OWikiClone. (This is the first non-HTML output module, so unsuprisingly it has some glitches at the moment). Oddly it's alot simpler than I expected. (Rendering module code can be seen [here] ) Sample output can be see [here ]. The node names used reflect the internal logic/parse representation inside OWikiClone. If this pans out I'll do the same for Usemod. (I tend to play with code to see if something works - from the "I do and I understand" ethos)-- MichaelSamuels

: "For me the crux of a WIF is this: it's a serialised parse tree representing the content as understood by a wiki engine. [...] would personally be in favour of something that does not require special tools for viewing, but a binary format is just as valid."

Well, I'm not sure where to begin. If the crux of a WIF is a serialized parse tree, then the parse tree exists prior to being serialized. And any parse tree can be represented in XML. I'm trying not to get bogged down in representation issues here. The syntax is almost completely immaterial; as you say, "a binary format is just as valid." The real question is the data model. Given that I think we'd both believe that there can be no common data model (apart from plain text) among all wikis, this leads me back to a point that I forgot to make very clear: all interchange syntaxes are by nature lossy. You can't expect a common syntax to capture the nuances of all wiki languages. So what you do is fish or cut bait. I propose we fish, we start with a small fish, and we provide a way for the fish to grow and morph. And our fish will not be diverse of flavour but may satisfy a lot of the hungry mouths. It will have specific kinds of features common to most fish, but not all. We don't want the Frankenstein fish. We want the digestible fish.

Since all wikis (that I'm aware of) are at least serializable to web documents, the most logical serialization syntax (IMO) would be either HTML or XHTML. Yes, specialized features, particularly presentational features, are likely to be lost in translation. What an interchange syntax can do is provide a way to capture most of the relevant content, the links, the WikiWords, and probably the simple formatting like bold and italics. Headings, maybe lists. It doesn't need to be much more than that for an interchange syntax. It's not (again, IMO) necessary for interchange to be able to represent all the content of a page such that it could from the interchange syntax rebuild the page looking identical to the original. That would be unreasonable. People interested in interchange (and not everyone is) would write importers and exporters to that common format. Now, this common format could be an augmented plaintext (like a wiki text), could be XML-based markup (like XHTML), or could be binary.

Plaintext has the problem that the thing we're trying to represent is itself plaintext. This means that there are no delimiters between what is markup and what is interchange content, it would all have to be inferred directly from the syntax. Locating errors would be a nightmare, if impossible.
Binary has the problem that it's impossible to read without a decoder, and likewise difficult to debug.
XML-style markup has the benefit of having a built-in lexical structure that can safely contain any kind of text content. Errors in markup are easily traced and generally easy to fix. The document format is still human-readable, and there are tools (many, many tools) available to process it. Wiki administrators wouldn't have to necessarily even develop conversion tools. It'd be extremely easy to either use downloadable applications, online translators, or customizable applets to do this kind of conversion. No, not everyone can modify such tools for their specific wiki syntax variant. But they could do this at least as easily as for plaintext-to-plaintext translation, and a lot easier than for binary format. This may be a radical notion to some, but in an interchange syntax, any existing HTML markup in wiki text could be escaped as character entities (say, stuff not understood or considered not-well-formed by the parser). For interchange (after transmission, it gets re-translated back into markup). But this eliminates any issues with well-formedness in the wiki source text.

To reiterate, it's important to note that an interchange syntax is not there to be extended -- that's the last thing one would want. What would be the point of it being for interchange if one couldn't assume the syntax matched the interchange specification? You don't need a mechanism to permit further "semantic meaning", you're essentially telling people: "if you want to interchange, here's the syntax, figure out how to read it and write to it." Now, if we're trying to develop a format that allows extension, anything goes. To me that runs counter to purpose: if I've got a wiki syntax that is richer than the interchange syntax, why would I expect that those receiving the interchange document could understand the richer syntax if they've written importers and processors according to the interchange specification? Now, if the real task here is interchange (and we both seem to agree that it is), it will be by its nature a subset of common wiki functionality. It will be lossy. Maybe we're talking about two different projects? A wiki interchange project based on wiki features would be almost done (and I'd be willing to do it), whereas it sounds like a project to develop an extensible representation/markup language capable of representing any wiki features via an extension mechanism could go on for years, as people continued to make changes, extensions, variants, etc. The Modularization of XHTML was designed for this kind of task, but I'm only interested in wiki interchange. -- MurrayAltheim

"It will be lossy."

This isn't acceptable (to me). Yes, I'm being unreasonable. (And yes, I know at some point I'll have to compromise :) However what I'm after in an interchange format is very specifically I need/want people to be able to edit a document created (say) in O'Wiki using it's handling of overlapping sectional markup to be passed to a Kwiki editor then through a usemod editor and to be able to come back to O'Wiki with it's semantic markup untouched. I'm aware that this is a difficult problem to solve, but I'm not interested in an interchange format that doesn't allow that use case, since it is a key usecase for me.

One point I'm probably not expressing well enough here is this:

Almost every wiki has features unique to them
If you choose "lowest common denominator" for interchange, it will always be lossy, and hence unattractive since it fails its task
Therefore any interchange format needs to be extensible out of the box
- This means any WIF parser will always come across the equivalent of tags/elements/flibbles it doesn't understand
- This means rather than failing hard it needs a sensible failure mode. "How to understand content I don't understand the container tags for"

Let me put another usecase of a WikiInterchangeFormat to you:

User installs wiki on their site
After 2 years of usage they decide to change to a different WikiEngine
They export their wiki in the WikiInterchangeFormat
They import this into their new Wiki
The user does not have the ability to add new library modules or compile code on their webserver (either ISP, or company)

In this scenario, XHTML as an interchange format fails. Either it needs to included verbatim into the new wiki (and many wikis don't allow pure HTML due to cross-site scripting issues), or it needs to be parsed and dumped. In the latter case this means the user needs to install libraries for XHTML parsing, and at that point they fail. I think XHTML and XML are extremely useful in their target problem domains, but I (personally) really do not think they're applicable in this scenario.

For an idea of features I (and many others) use a wiki for:

Use as a dumping ground, Discussion tool, Presentation tool (presentation plugins exist in several wikis); as a requirements capture tool, as a database; as a spreadsheet, shared whiteboard, for roughing out and refining final documentation, for book production, for figure and graphics editting

I don't want to be locked into a single platform and syntax forever just to be able to access those existing documents. This means the semantics must be able to be preserved _somehow_ . If that's in XHTML using class tags or meta data then fine, but that limits the audience. If it's in pure XML - again fine with similar caveat.

We could write an interchange format right here, right now, based on arbitrary criteria, and it'd make a fine document, and look very good. However it's worth noting that many wikis do not generate valid XHTML as is. Some very widely deployed wikis allow arbitrary HTML to be entered, and again, those fail to be validated by any parser. Simply defining XHTML to be good, because it's good, and wikis generate HTML without examining the quality of HTML that gets generated is for me missing a key problem. (unfortunately) This doesn't mean we can't use that tech, but it means it needs much more careful examination of what is practical versus ideal.

Maybe we need to separate this page out to WikiInterchangeUseCases? and WikiInterchangeCandidateTechnologies? ? -- MichaelSamuels

One might hold a copy the WIF text alongside the local text so that the O'Wiki text can be sent to a KWiki to a UseMod and back to O'Wiki with all of its original extensions. Once the page is edited, things get tricky. For wikis that don't understand that semantic information, it could be destroyed at that point. Arguably, that's reasonable, as the edit will result in a derived document that makes sense to at least one human being. Of course, if we use the parser level-based model I suggest, the wiki could theoretically understand that an edit is localized to only a block, and thus preserve the semantic markup for the other blocks. -- SunirShah

Depending on how wide your definition of wikis are, building a WikiInterchangeFormat that covers every wiki is theoretically impossible. You can only write interchange formats for wikis with similar or equivalent structures (aka rhetorics). So, for instance, wikis that are software manuals can exchange pages, and wikis that are Pattern Languages might even be able to exchange pages with them, but wikis that are weblogs or news archives might not, and something more like a social network probably cannot.

So, while the {phrase, line/paragraph, block, aggregate, page} model of wiki syntax works well for very print-like wikis, that doesn't mean it extends to every wiki.

That being said, I think one can simply ignore the non-print like wikis and make a useful language for the rest. Mostly since I'm limiting my definition of wiki to "clones of c2".

Note, when I say language, I mean "not a standard, but a data protocol for a sub-community." Of the 5 million wikis around, it's unlikely more than a very small subset will use this protocol, and amongst our local subsets, we can negotiate what is reasonable for the problem at hand.

The model I think we are moving towards for the wiki interchange format is likely to be based on building different levels for phrase-, line/paragraph-, block-, aggregate-, and page-level markup. So, we can all agree on what a link looks like (phrase level), that we will include standard vocabulary for bold, italic, underline (line/paragraph), that we will have various lists, horizontal rules (block), and simple tables (aggregate).

More to the point, by rationalizing the parsers by levels, it makes it much easier to add new syntaxes ad hoc to the language. So, if between your and my PIM wiki we wanted to exchange our Calendars, we can simply define a new Aggregate-level format and exchange it. Or if some people wanted strikethrough, they can just create a new line/paragraph expression and exchange it. That is, the interchange format is better as a framework, a parser model, rather than simply random tags. -- SunirShah

Wouldn't simply having an RSS feed version of every page accomplish a lot of what we're trying to do here? I'm assuming the entire text of the page would be in the RSS instead of just a link. If every wiki supported that simple feature, it would allow two cool things: (1) people to subscribe to specific pages and (2) other wikis to be able to transfer content. RSS is a simple XML schema. You would lose some formatting but the core information would be easily available for everyone.

I think thats unavoidable to use XML as WikiInterchangeFormat. XML can handle Wiki structure (if XHTML does, XML does too).

And IMO there is not point choosing XHTML over XML:

 * XHTML has limited set of features*, chosen for other purpose, by independent (not wiki-related) organization
 * You can transform XML to XHTML easily (using XSLT or even RegExps?)
 * You can apply styles directly to the XML document anyway

For me - as wiki implementator - it doesn't matter if you define interchange format as DTD, XML Schema or verbose RFC. In any case I'll have to get familiar with features it describes and implement them.

I realize that you could add new modueles/dtds to XHTML, but then you would use it just as XML.

-- KornelLesinski

This discussion is pointless intellectual masturbation unless someone just knuckles down and does it. It's very simple:

Write it for MediaWiki
Write it for TWiki
Write it for Trac
Write it for SocialText
Write it for Confluence
Test them with each other

For extra credit, make Trac and Confluence source/patch/issue tracking portable (any sane solution will probably mean just pointing back to the original site, since you lack the ticket/source repository to keep the state current).

If you pull it off with something other than XML, bully for you, but I wouldn't give you good odds up front.

Stop trying to arrive at perfection by consensus. If it's at all important to anyone, just go do it.

-- ChuckAdams

WikiInterchangeFormat

Historical discussion

Discussion