AllPagesStandard


1. Introduction
2. Restrictions
3. Implementation notes
4. Client implementation
5. Online test
6. Examples
7. Implementations



1. Introduction

For many purposes, it's useful to list all the pages of your wiki in an easily machine-readable format (cf. MachineInterface). To that end, publish an (X)HTML page on your wiki that contains links to all of its pages, and no other links. The text of each link is the page's display name, and the href attribute of each link is the full URI used to access the page.

2. Restrictions

3. Implementation notes

If you want to be XHTML-compliant, you need to wrap the links in a block element. The simplest method is a <div/>; if you use one, declare it as <div id="AllPages"/>. If you want to make the list readable by humans, you can use something like an unordered list (<ul/>), with one list item (<li/>) per title.

If you wish to include links that are not page titles (an alternate format that is not guaranteed to be parsable by all clients), you must wrap all and only the page title links in a <div id="AllPages"/>.

Internationalization. For the most part, internationalization is taken care of by (X)HTML + HTTP. However, if you are using a non-Latin-1 CharacterSet, it is strongly recommended to declare the character set with an embedded <meta http-equiv="Content-type" content="text/html; charset=charset"/>. Some clients may cache your AllPages list as files and thereby lose the character set declared in the HTTP header.
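
On the producing side, a minimal sketch might look like the following (Python 3; the base URL, page titles, and function name are placeholders for illustration, not part of the standard):

 # Minimal AllPages producer sketch: emits an (X)HTML fragment with one
 # link per page, wrapped in <div id="AllPages">.  The base URL and the
 # page titles below are placeholders.
 import html
 from urllib.parse import quote

 BASE_URL = "http://www.example.org/cgi-bin/wiki.pl"
 PAGE_TITLES = ["AbandonPage", "AbbeNormal", "AbréviationsSontInfernales"]

 def all_pages_div(titles):
     links = "\n".join(
         '<a href="%s?%s">%s</a>'
         % (BASE_URL, quote(title.encode("utf-8")), html.escape(title))
         for title in sorted(titles))
     return '<div id="AllPages">\n%s\n</div>' % links

 print("Content-Type: text/html; charset=utf-8")
 print()
 print(all_pages_div(PAGE_TITLES))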

4. Client implementation

Assume that the simplest clients will be very simple RegularExpression parsers matching this pattern:

 href=['"](.*?)['"][^>]*?>(.*?)<\/a>
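
For illustration only, a minimal Python sketch of such a client, assuming the listing has already been fetched into a string:

 import re

 # Extract (URL, display name) pairs from an AllPages listing already
 # fetched into the string html_text.
 LINK_RE = re.compile(r'''href=['"](.*?)['"][^>]*?>(.*?)</a>''', re.S)

 def parse_all_pages(html_text):
     return LINK_RE.findall(html_text)   # list of (url, title) tuples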

More sophisticated clients should recognize links only within <div id="AllPages"/>. This chunk can be found with this regular expression:

 <div[^>]*?id="AllPages".*?>(.*?)<\/div>
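
Continuing the Python sketch above (reusing LINK_RE), a stricter client might first narrow the document to that div and fall back to the whole page if it is absent:

 DIV_RE = re.compile(r'<div[^>]*?id="AllPages".*?>(.*?)</div>', re.S)

 def parse_all_pages_strict(html_text):
     match = DIV_RE.search(html_text)
     # Fall back to the whole document if no AllPages div is present.
     chunk = match.group(1) if match else html_text
     return LINK_RE.findall(chunk)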

Clients that support internationalization should support discovery and use of the Content-type MetaTag:

 <meta http-equiv="Content-type" content="text/html; charset=charset"/>

Note that this tag tends to vary in case. A regular expression that can find it (when matching case-insensitively) is:

 <meta\s*(?:(?:http-equiv=['"]content-type['"]|content=['"]text/html;?[^'"]*?charset=(\S+).*['"]|\S+=['"].*?['"])\s*)+\/?>

Caveat: I haven't unit tested these regexes yet. Just placing them here for now. --ss
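
As a rough, likewise-untested alternative sketch in Python, a client might locate the charset with a much simpler case-insensitive pattern and decode the raw response bytes accordingly, falling back to Latin-1 (the HTTP default for text/html):

 import re

 CHARSET_RE = re.compile(rb'charset=([\w-]+)', re.IGNORECASE)

 def decode_all_pages(raw_bytes):
     # Look for a charset declaration anywhere in the undecoded page;
     # if none is found, assume ISO-8859-1 (Latin-1).
     match = CHARSET_RE.search(raw_bytes)
     charset = match.group(1).decode("ascii") if match else "iso-8859-1"
     return raw_bytes.decode(charset, errors="replace")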

5. Online test

You can test your implementation at ...

To be written.

6. Examples

This represents a very basic conforming implementation. The UserAgent will presume the content is HTML in a Latin-1 character set.

<a href="http://www.usemod.com/cgi-bin/mb.pl?AGroupIsItsOwnWorstEnemy">AGroupIsItsOwnWorstEnemy</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AbandonPage">AbandonPage</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AbbeNormal">AbbeNormal</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AbbreviationsAreEvil">AbbreviationsAreEvil</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AboutThisSite">AboutThisSite</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AboveTheFold">AboveTheFold</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AbréviationsSontInfernales">AbréviationsSontInfernales</a>

This version represents a completely specified AllPages listing. Note that it is encoded in UTF-8.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
    <meta http-equiv="Content-type" content="text/html; charset=utf-8"/>
</head>
<body>
<div id="AllPages">
<a href="http://www.usemod.com/cgi-bin/mb.pl?AGroupIsItsOwnWorstEnemy">AGroupIsItsOwnWorstEnemy</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AbandonPage">AbandonPage</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AbbeNormal">AbbeNormal</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AbbreviationsAreEvil">AbbreviationsAreEvil</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AboutThisSite">AboutThisSite</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AboveTheFold">AboveTheFold</a>
</div>
</body></html>

If you would like to use the same display for humans as well as machines, you can, for instance, use an unordered list. Note the lack of a <div/> element; it is unnecessary here.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html><head></head><body>
<ul>
<li><a href="http://www.usemod.com/cgi-bin/mb.pl?AGroupIsItsOwnWorstEnemy">AGroupIsItsOwnWorstEnemy</a></li>
<li><a href="http://www.usemod.com/cgi-bin/mb.pl?AbandonPage">AbandonPage</a></li>
<li><a href="http://www.usemod.com/cgi-bin/mb.pl?AbbeNormal">AbbeNormal</a></li>
<li><a href="http://www.usemod.com/cgi-bin/mb.pl?AbbreviationsAreEvil">AbbreviationsAreEvil</a></li>
<li><a href="http://www.usemod.com/cgi-bin/mb.pl?AboutThisSite">AboutThisSite</a></li>
<li><a href="http://www.usemod.com/cgi-bin/mb.pl?AboveTheFold">AboveTheFold</a></li>
<li><a href="http://www.usemod.com/cgi-bin/mb.pl?AbréviationsSontInfernales">AbréviationsSontInfernales</a></li>
</ul>
</body></html>

If you would like to use the same display for humans as well as machines, and add other links on the page, you must use a <div id="AllPages"/> element. This format is not guaranteed to work with simpler clients.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html><head></head><body>
<p><a href="http://www.usemod.com/cgi-bin/mb.pl">HomePage</a> | <a href="http://www.usemod.com/cgi-bin/mb.pl?action=rc">RecentChanges</a></p>
<hr/>
<div id="AllPages">
<ul>
<li><a class="titlelink" href="http://www.usemod.com/cgi-bin/mb.pl?AGroupIsItsOwnWorstEnemy">AGroupIsItsOwnWorstEnemy</a></li>
<li><a class="titlelink" href="http://www.usemod.com/cgi-bin/mb.pl?AbandonPage">AbandonPage</a></li>
<li><a class="titlelink" href="http://www.usemod.com/cgi-bin/mb.pl?AbbeNormal">AbbeNormal</a></li>
<li><a class="titlelink" href="http://www.usemod.com/cgi-bin/mb.pl?AbbreviationsAreEvil">AbbreviationsAreEvil</a></li>
<li><a class="titlelink" href="http://www.usemod.com/cgi-bin/mb.pl?AboutThisSite">AboutThisSite</a></li>
<li><a class="titlelink" href="http://www.usemod.com/cgi-bin/mb.pl?AboveTheFold">AboveTheFold</a></li>
<li><a class="titlelink" href="http://www.usemod.com/cgi-bin/mb.pl?AbréviationsSontInfernales">AbréviationsSontInfernales</a></li>
</ul>
</div>
<hr/>
<p>The footer goes here.</p>
</body></html>

7. Implementations

To be determined.

CategoryWikiStandard


Discussion

I recommend WikiPedia:JSON, as one can present the entire MachineInterface without creating cross-site compatibility problems. Compare and contrast:

http://www.srcf.ucam.org/~cjp39/Peri/RecentChanges RecentChanges http://www.srcf.ucam.org/~cjp39/Current/RecentChanges http://www.srcf.ucam.org/~cjp39/Peri/RecentChanges.rss

{ "name": "RecentChanges", "url": "http://www.srcf.ucam.org/~cjp39/Peri/RecentChanges", "unstable-url": "http://www.srcf.ucam.org/~cjp39/Current/RecentChanges", "rss": "http://www.srcf.ucam.org/~cjp39/Peri/RecentChanges.rss" }

One need only standardize the common keys. Hooks to all the bizarre functionality the site supports can be added without problem, because they are automatically discarded if not understood. JSON parsers are available for all major languages, and trivial to write in others. -- ChrisPurcell
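
For what it's worth, consuming such a listing is about as short; a Python sketch, where the endpoint URL is hypothetical and the key names simply follow the example above:

 import json
 from urllib.request import urlopen

 # Hypothetical URL and key names, following the example above.
 with urlopen("http://www.example.org/wiki?action=allpages;format=json") as response:
     pages = json.load(response)

 for page in pages:
     print(page["name"], page["url"])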

I say the plaintext should be the "most-recommended" base standard, and the JSON an optional add-on standard. Why?

I want to elaborate on the importance of making the production of AllPages trivial. My theory is that engine developers are much more likely to implement a standardized format if it takes them 5 minutes to write than if it takes them 10. The plaintext format is so simple that people might just implement it the minute they hear about it -- it takes less time than reading a single wiki page, for goodness sakes! When they see JSON, half of them will stop and go, "hmm, JSON, I'll have to check that out -- better put it off until I have time". That's what I would do if I developed a wiki engine. -- BayleShanks

Fair enough - as long as we do standardize both. That way, we solve the problem of exposing the MachineInterface. One doesn't need a library to output a trivial JSON stream, just "[ { \"name\": \"" . join("\" }, { \"name\": \"", @allPageNames) . "\" } ]" for instance, if page names can't include backslashes or quotes - as you say, 10 minutes rather than 5. -- ChrisPurcell

Certainly, let's standardize both. -- BayleShanks

Stupid question: Character set standard? We have wikis with mixed characters as well as wikis that are nearly entirely non-ASCII (e.g., Russian or Chinese wikis). An explicit mention of UTF-8 wouldn't hurt, I think. (But there may be wikis still out there using national character sets for the URL encodings...)

Further, am I supplying displayable names or URL fragments? If URL fragments, then charset interpretation is less important (I give the fragments in the way I expect to receive them back), but of course now you have opaque, illegible names for many languages (e.g. %D7%A2%D7%A8%D7%A9%D7%98%D7%A2_%D7%96%D7%B2%D6%B7%D7%98). -- BrionVibber

But HTTP already specifies the character set, doesn't it? If you don't specify anything, it's Latin-1. If it is not, you need to use something like the following:

    Content-Type: text/plain; charset=UTF-8

I need URL fragments for my implementation of NearLinks. [1] Thus, no display names for me.

I think we need to define how to get at the various pages given the AllPages URL; otherwise we also need to provide the wiki root URL. As far as I am concerned, I would like a convention whereby the page names can be used as keywords or parameters for GET. Examples:

http://usemod.com/cgi-bin/mb.pl?action=index lists AllPages, therefore use http://usemod.com/cgi-bin/mb.pl?AllPages
http://www.oddmuse.org/cgi-bin/oddmuse?action=index;raw=1 lists a Chinese page, therefore URL-encode using the appropriate character set for the source system and use http://www.oddmuse.org/cgi-bin/oddmuse?%e7%b0%a1%e4%bb%8b (since the source system uses UTF-8)

As for the complexities of remembering whether the source system uses Latin-1, UTF-8, or something like Big5: I personally convert all the page name results to UTF-8 before I save them locally and convert them back to the source charset when linking to them via proxy scripts. Example:

http://www.emacswiki.org/cgi-bin/latin-1.pl?url=http://www.usemod.com/cgi-bin/mb.pl?R%c3%a8glesDeMiseEnPage

That is, whatever is listed on AllPages should be either HTTP keywords (wiki?All+Pages) or HTTP parameters (wiki?action=browse;id=All%20Pages). This particular proposal would not allow listing pages via path_info (wiki/namespace/all_pages).

Here's how it would look for a wiki supporting namespaces: The namespace is part of path_info, but the pagename is not.

    http://www.communitywiki.org/odd/SandWiki?action=index;raw=1 -> http://www.communitywiki.org/odd/SandWiki?OddWiki

-- AlexSchroeder

The purpose of AllPages is not to precook the titles for a script to append to a URL, but to serve agents that need the actual page titles. An InterMap would describe how to build a URL, so I'd like to avoid putting in the full path since a generic agent won't be able to discover the page titles from that format. I don't understand internationalization well enough to talk about the encoding question, although I can see they are related. URL-encoded titles may be necessary--even if we were to accept titles with newlines in them. I'll read up on internationalization. I'm quite rusty on that front. -- SunirShah

I'm fine with dropping the proposal of using large parts of the request URL to build the page URLs. The important question is whether you agree on listing URL fragments or display names -- one of them is used to retrieve pages from the web by an agent, the other is read by humans (something we do not need, I believe). You need to specify what "the actual page titles" means. -- AlexSchroeder

By the actual page titles I mean precisely and only the data that constitutes the name of the page, not any extra URL fragments that help build the action. The titles are the same text that users would use to link to the pages, through either a LinkPattern or FreeLinks. This follows the NameAsLocation concept. So, this page would be AllPages, and never action=browse&id=AllPages nor AllPages.html nor 42.html (think SwikiClone).

Aside: Regarding wikis like SwikiClone that use numbers for page locations, they are broken. URLs should not be SystemCodes. If they are, they should expose a MachineInterface to redirect from a given page title to its actual URL. If they can write one of those, they can get rid of the SystemCode URLs. Although, I understand there are reasons to use non-semantic URLs, such as ensuring that links never change even if pages are renamed. On a wiki where text is fluid, I'm not sure that matters, but this standard isn't limited to wikis.

Now, I think the encoding of those titles could easily and usefully be URL-encoded. I'm just confused about one point. If I published an AllPages list in Big5, and you had an agent that operates in UTF-8, are the URL encodings the same for the title? Is something like %42%42 always the same character, no matter what the encoding is? That's what I'm going to be reading about. I can see the bytewise encodings being different depending on the charset, and thus the URL encodings. UTF-8 is a good standard, in that case. -- SunirShah

There are several things that need to be considered. I'll explain how Oddmuse does it, then we can look at the various issues, and consider what would happen if we ignored them. Perhaps that will give us a good feel for the things that we need to solve and the things that can be optional.

  1. The wiki serves pages encoded in UTF-8. When a user edits the page, the browser knows this, and usually sends back your edit as a UTF-8 encoded parameter, using a Content-Type header telling the CGI script that the parameters are UTF-8 encoded.
  2. If the user makes a link to a page called "Übersicht", then the Ü will be encoded using two bytes. If this links to an edit page using an ordinary URL (i.e. GET), then the name of the page has to be part of the URL somehow. The URL does not allow non-ASCII characters, therefore an additional layer of encoding has to be used. But there is usually no way to specify the first encoding that you used. (This is different for Subject headers in mail, for example, where the first encoding used is part of the second encoding! Example from WikiPedia:MIME: Subject: =?utf-8?Q?=C2=A1Hola,=20se=C3=B1or!?=) On a wiki where the link points back to the same CGI script, this is not a problem. In our case, we can use the UTF-8 encoded string, URL-encode it, and use a link such as wiki?%c3%9cbersicht. The wiki will know that it needs to URL-decode the parameter (the CGI library usually does this) and to interpret the first two bytes as the character Ü.
  3. The URL fragment (%c3%9cbersicht) is not readable by humans. If you want to provide a list of page names to a user, you need to provide the URL-decoded string on a page with the header Content-Type: text/plain; charset=UTF-8. There you can encode the Ü using UTF-8, resulting in at least two bytes (since there are two ways to encode Ü).
  4. Given this human-readable string, and the implicit assumption that the wiki will have UTF-8 encoded page names in requests, we can URL-encode this string: This allows users to link to the page, read the page, etc. The agent will have to translate the UTF-8 encoded Übersicht (using two bytes for Ü) back to %c3%9cbersicht using URL encoding.
  5. We can also return a page with Content-Type: text/html; charset=UTF-8 and use a link whose text is the UTF-8 encoded page name and whose URL contains the UTF-8 and URL-encoded page name. That is the basic HTML list of all pages most of us already have anyway.

If you don't want to make the charset assumption, either because you want to ignore the content-type header or because you do not care about human readability, then we can list all URL fragments identifying pages (e.g. %c3%9cbersicht) without further ado. It will be possible to construct links to these pages using simple text replacement. Without knowing the encoding used, however, it will not be possible to display this page name to humans, and thus it will not be possible for humans to link to it. That is why I oppose such a simplistic approach.

Thus, in my scheme you need to know the encoding used by both the source and the target wiki before you can write an interlink referring to a page name with non-ASCII characters. CommunityWiki:Übersicht produces a URL ending in %DCbersicht, which will not work: Übersicht was first encoded using Latin-1 and then URL-encoded.

On Community Wiki, however, linking to MeatBall:ÜberSicht will link to a proxy: http://www.emacswiki.org/cgi-bin/latin-1.pl?url=http://www.usemod.com/cgi-bin/mb.pl?%C3%9CberSicht which will then redirect to http://www.usemod.com/cgi-bin/mb.pl?%DCberSicht. See how the encoding is correct.

The script is really simple, by the way. ([source], [test])

Remember when I said that there are two ways to encode Ü using UTF-8? These two forms are called NFC and NFD -- one prefers composed characters, where Ü is one code point translated to two bytes; the other prefers decomposed characters, so Ü is decomposed into U and the diaeresis, which results in three bytes. The standard on the web is NFC. The standard on the HFS+ filesystem is NFD. Thus, if you use the page name bytes as a file name without translating to Unicode (using the internal representation of the system instead of an encoding such as UTF-8), you'll run into trouble... But that's just an aside.

-- AlexSchroeder
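
For illustration, the encoding difference described above can be reproduced directly in Python; the percent-encodings match the figures in the discussion, apart from upper/lower case:

 from urllib.parse import quote

 name = "Übersicht"
 print(quote(name.encode("utf-8")))    # %C3%9Cbersicht
 print(quote(name.encode("latin-1")))  # %DCbersicht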

Discussion continued to a resolution between AlexSchröder and SunirShah at irc://irc.freenode.net/#wiki, logged to http://sunir.org/meatball/AllPages/01-Nov-2005-FreeNode-%23wiki.txt

I'm missing some steps in the logic of the linked argument. How did we go from "we must use plain-text because it's simpler" to "we must use HTML"? -- ChrisPurcell

I'm pretty confused, but this is because I know nothing about internationalization. My first reaction is to agree with Chris; how is HTML simpler than plaintext? I gather it has something to do with internationalization, so I'm trying to focus my ignorance into questions. Here are my questions:

What is wrong with having a text/plain document listing, on each line: a human-readable name, then a delimiter, and then the associated URL? If you did that, would a client that didn't care about showing the page titles to humans be able to ignore the charset and easily use the URLs, or is knowing the charset necessary to use the URLs? (I'm assuming knowing the charset is necessary, since I'm assuming that different charsets use different numbers of bytes per character.)

If something is wrong with that, then would it be possible to have TWO documents, both text/plain, one with one human readable page name on each line, and the other with a machine-usable URL on each line? The first document could be any charset, and the second one must be ASCII?

So, you can see what I'm proposing (if my understanding is correct): it may be a sad truth that you have to be concerned about charsets if you're displaying page titles to a user, but you could just use ASCII if you just want the links, right? So why not separate the page names and the URLs into two separate documents -- then clients who just want the URLs can use ASCII, and clients who need the page names can worry about encodings. In addition, nothing is HTML, so both of these documents are easy to produce and easy to parse.

Sorry for my denseness (and is there any reasonably short intro to this stuff that I should read?).

If HTML is required, however, then I suggest that the spec just explicitly say, "the list of page names must be recoverable by the regular expression >(.*)?</a>" (or whatever the regex should be), and "the list of URLs must be recoverable by the regular expression <a href="([^"]*)"" (or whatever the regex should be).

thanks, -- BayleShanks

The long and the short of it is that URL encoding is not sufficient information to understand what a string means if it is 'internationalized'. You also need the CharacterSet. You also need the URL since the Title may be encoded in a different CharacterSet than how the server accepts URLs (consider an aggregator that canonicalizes the CharacterSets to UTF8). We decided it would be easier to use an existing standard for encoding that information (i.e. HTML/HTTP) than invent a wholly new one. The minimal representation is not that much more complicated than Title URL. It's simply <a href="URL">Title</a>.

The second point was that the list can also be useful for human beings, which means that it may be surrounded by interface crud. Since a <div/> is required for an XHTML representation, we might as well use it. Further, it is easier for some implementations to just stuff a <div/> in their templates than write a whole new event handler. -- SunirShah

As I understand it, URLs are and must be written solely in restricted ASCII (octets 20-7E) as per RFC 1738. There are no character set issues unless you're given an incorrectly-formed URL and have to guess how to %-escape it, which one should simply rule out. -- ChrisPurcell

URLs must be in ASCII, but the encoding of the bytes can be in any CharacterSet. Bytes not in the restricted set are encoded in the %xx format, where x is a hexadecimal digit. So, if I had a string encoded in Big5 and another in UTF-8, their URL encodings would be different (they would have different %xx values). -- SunirShah

But that encoding depends on the charset the server uses to decode, not the charset of the file you send the URLs in. Hence, using ASCII is fine. -- ChrisPurcell


There's been discussion about producing several AllPages standards (plaintext, JSON), and it seems a bit arbitrary to come to an out-of-line resolution summarily discarding them. -- ChrisPurcell

As always, the discussion continues. The resolution was regarding the CharacterSet issue. I promised to write it up. I've posted it here publicly the same day because I strongly dislike making decisions in other channels.

Remember, it isn't a "standard" until everyone agrees to implement it, and I have no doubt that agreement will be impossible if we just make decisions arbitrarily. However, all that being said, I agree to this standard enough that I will write a little code to test it out. The standardization process I prefer is to come to a small, brief agreement over a specification amongst a few developers, then implement a small prototype to see if it will work. Then expand and adapt to accommodate more and more people until the specification becomes de facto standardized. -- SunirShah

Okay. Confusion over definition of "resolution", then. -- ChrisPurcell


(moved from SisterSitesImplementationGuide)

Typically SisterSite and NearLink implementers work with small page lists (perhaps a few hundred or a few thousand pages) and are quite happy to update once a day, typically during the night. But working across wikis this isn't efficient, because the display lags behind for hours (e.g. the begging page on near linking).

One should think about sites containing 10^5 pages and keeping page lists pretty current (e.g. updating every 5-10 minutes).

This is easy to do but needs a different, incremental API: give me the pages added or deleted since a given (not necessarily Unix) second.

I do this by keeping a "page index log" of page additions (+) and deletions (-):

  second[+|-]pagename
  ....

When you update, you call something like "action=pagelog&since=1129883609" and the server transfers only the most recent lines; the last transferred line naturally contains the timestamp from which to continue on the next update. This way only a few page names have to be transferred during updating, and one can just append the lines to the local "page index log", which duplicates the server log. The method is independent of availability problems and, since it works with server timestamps, needs no synchronized time. From the log, the page index can easily be extracted by adding and deleting entries through a hash table.

-- HelmutLeitner
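
For illustration, a client-side Python sketch of this scheme; the log format and the pagelog action follow the description above and are not standardized:

 import re

 # One log line per change: <second><+|-><pagename>
 LOG_LINE = re.compile(r'^(\d+)([+-])(.+)$')

 def apply_page_log(pages, log_lines):
     """Apply 'second[+|-]pagename' lines to a set of page names and
     return the last timestamp, to be sent as ?since= next time."""
     last_second = None
     for line in log_lines:
         match = LOG_LINE.match(line.strip())
         if not match:
             continue
         last_second = int(match.group(1))
         if match.group(2) == '+':
             pages.add(match.group(3))
         else:
             pages.discard(match.group(3))
     return last_second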

I think using UNIX diff makes more sense as you can then use UNIX patch. This limits the coding challenge. -- SunirShah

UNIX diff and patch are also available for Windows.

I don't understand the reference to diff. How should that work? -- HelmutLeitner

The server sends a diff between the current AllPages file and what it looked like the last time the client got the file.

To be honest, it's not clear if this is computationally easier or pragmatic. We should try the diff method out first before agreeing to it. It does have the advantage that it will apply equally well to many formats, like the LinkDatabase. -- SunirShah

Using diff means that the server has to keep state for clients. That would be ugly. -- HelmutLeitner

Alternatively, the client could send the time of its last check. -- JohnAbbe

A UNIX diff would not require the server to keep any more state than a +/- change log. Simply take today's page list; then, using the change log, reconstruct the page list as it stood at the requested time; and then compute the diff. This requires twice the memory and more computation time. It may be simpler to just use a +/- change log. -- SunirShah
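
For comparison, a rough server-side sketch of the diff approach in Python, with difflib standing in for the external diff/patch tools; the file labels are arbitrary:

 import difflib

 def page_list_diff(old_titles, new_titles):
     # Unified diff between the page list as the client last saw it and
     # the current one; the client can apply it with patch or equivalent.
     return "".join(difflib.unified_diff(
         [t + "\n" for t in old_titles],
         [t + "\n" for t in new_titles],
         fromfile="AllPages.old", tofile="AllPages.new"))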


At RecentChangesCamp:SisterSites2006, ChrisPurcell suggested using Rest:HttpEvents to keep sister wikis synchronized with low latency and low overheads.

