For many purposes, it's useful to list all the pages of your wiki in an easily machine-readable format (cf. MachineInterface). To that end, publish an (X)HTML page on your wiki that links to every page on your wiki, and contains no other links. The text of each link is the page's display name, and its href attribute is the full URI used to access the page.
If you want to be XHTML compliant, you need to wrap the links in a block element. The simplest method is to use a <div/>; if so, declare it as <div id="AllPages"/>. If you want to make the list readable by humans, you can use something like an unordered list (<ul/>), one list item (<li/>) per title.
If you wish to include links that are not page titles (an alternate format that is not guaranteed to be parsable by all clients), you must wrap all and only the page title links in a <div id="AllPages"/>.
Internationalization. For the most part, internationalization is taken care of by (X)HTML + HTTP. However, if you are using a non-Latin-1 CharacterSet, it is strongly recommended to declare the character set with an embedded <meta http-equiv="Content-type" content="text/html; charset=charset"/>. Some clients may cache your AllPages list as files and thereby lose the character set given in the HTTP header.
Assume the simplest clients will be very simple RegularExpression parsers following this format:
href=['"](.*?)['"][^>]*?>(.*?)<\/a>
More sophisticated clients should recognize links only within <div id="AllPages"/>. This chunk can be found with this regular expression:
<div[^>]*?id="AllPages".*?>(.*?)<\/div>
Clients that support internationalization should support discovery and use of the Content-type MetaTag:
<meta http-equiv="Content-type" content="text/html; charset=charset"/>
Note that this tag tends to vary in case. A regular expression that can find it (when matching case-insensitively) is:
<meta\s*(?:(?:http-equiv=['"]content-type['"]|content=['"]text/html;?[^'"]*?charset=(\S+).*['"]|\S+=['"].*?['"])\s*)+\/?>
Caveat: I haven't unit tested these regexes yet. Just placing them here for now. --ss
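An alternative that is easier to test is to locate the <meta> tag first and then pull the charset out of it. A Python sketch along those lines (it assumes the tag follows the common pattern above, only inspects the first couple of kilobytes, and defaults to Latin-1):

import re

META_RE = re.compile(r"""<meta[^>]*http-equiv=['"]content-type['"][^>]*>""", re.IGNORECASE)
CHARSET_RE = re.compile(r'charset=([\w.:-]+)', re.IGNORECASE)

def detect_charset(html_bytes, default="latin-1"):
    """Guess the character set from an embedded Content-type meta tag."""
    head = html_bytes[:2048].decode("ascii", errors="ignore")  # the tag itself is ASCII-safe
    meta = META_RE.search(head)
    if meta:
        charset = CHARSET_RE.search(meta.group(0))
        if charset:
            return charset.group(1)
    return default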
You can test your implementation at ...
To be written.
This represents a very basic conforming implementation. The UserAgent will presume the content is HTML in the Latin-1 character set.
<a href="http://wwww.usemod.com/cgi-bin/mb.pl?AGroupIsItsOwnWorstEnemy">AGroupIsItsOwnWorstEnemy</a> <a href="http://wwww.usemod.com/cgi-bin/mb.pl?AbandonPage">AbandonPage</a> <a href="http://wwww.usemod.com/cgi-bin/mb.pl?AbbeNormal">AbbeNormal</a> <a href="http://wwww.usemod.com/cgi-bin/mb.pl?AbbreviationsAreEvil">AbbreviationsAreEvil</a> <a href="http://wwww.usemod.com/cgi-bin/mb.pl?AboutThisSite">AboutThisSite</a> <a href="http://wwww.usemod.com/cgi-bin/mb.pl?AboveTheFold">AboveTheFold</a> <a href="http://wwww.usemod.com/cgi-bin/mb.pl?AbréviationsSontInfernales">AbréviationsSontInfernales</a>
This version represents a completely specified AllPages listing. Note that it is served as UTF-8.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>AllPages</title>
<meta http-equiv="Content-type" content="text/html; charset=utf-8"/>
</head>
<body>
<div id="AllPages">
<a href="http://www.usemod.com/cgi-bin/mb.pl?AGroupIsItsOwnWorstEnemy">AGroupIsItsOwnWorstEnemy</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AbandonPage">AbandonPage</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AbbeNormal">AbbeNormal</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AbbreviationsAreEvil">AbbreviationsAreEvil</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AboutThisSite">AboutThisSite</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AboveTheFold">AboveTheFold</a>
<a href="http://www.usemod.com/cgi-bin/mb.pl?AbréviationsSontInfernales">AbréviationsSontInfernales</a>
</div>
</body>
</html>
If you would like to use the same display for humans as well as machines, you can, for instance, use an unordered list. Note the lack of a <div/> element, which is unnecessary here.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>AllPages</title></head>
<body>
<ul>
<li><a href="http://www.usemod.com/cgi-bin/mb.pl?AGroupIsItsOwnWorstEnemy">AGroupIsItsOwnWorstEnemy</a></li>
<li><a href="http://www.usemod.com/cgi-bin/mb.pl?AbandonPage">AbandonPage</a></li>
<li><a href="http://www.usemod.com/cgi-bin/mb.pl?AbbeNormal">AbbeNormal</a></li>
<li><a href="http://www.usemod.com/cgi-bin/mb.pl?AbbreviationsAreEvil">AbbreviationsAreEvil</a></li>
<li><a href="http://www.usemod.com/cgi-bin/mb.pl?AboutThisSite">AboutThisSite</a></li>
<li><a href="http://www.usemod.com/cgi-bin/mb.pl?AboveTheFold">AboveTheFold</a></li>
<li><a href="http://www.usemod.com/cgi-bin/mb.pl?AbréviationsSontInfernales">AbréviationsSontInfernales</a></li>
</ul>
</body>
</html>
If you would like to use the same display for humans as well as machines, and add other links on the page, you must use a <div id="AllPages"/> element. This format is not guaranteed to work with simpler clients.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>AllPages</title></head>
<body>
<p><a href="http://www.usemod.com/cgi-bin/mb.pl">HomePage</a> | <a href="http://www.usemod.com/cgi-bin/mb.pl?action=rc">RecentChanges</a></p>
<hr/>
<div id="AllPages">
<ul>
<li><a class="titlelink" href="http://www.usemod.com/cgi-bin/mb.pl?AGroupIsItsOwnWorstEnemy">AGroupIsItsOwnWorstEnemy</a></li>
<li><a class="titlelink" href="http://www.usemod.com/cgi-bin/mb.pl?AbandonPage">AbandonPage</a></li>
<li><a class="titlelink" href="http://www.usemod.com/cgi-bin/mb.pl?AbbeNormal">AbbeNormal</a></li>
<li><a class="titlelink" href="http://www.usemod.com/cgi-bin/mb.pl?AbbreviationsAreEvil">AbbreviationsAreEvil</a></li>
<li><a class="titlelink" href="http://www.usemod.com/cgi-bin/mb.pl?AboutThisSite">AboutThisSite</a></li>
<li><a class="titlelink" href="http://www.usemod.com/cgi-bin/mb.pl?AboveTheFold">AboveTheFold</a></li>
<li><a class="titlelink" href="http://www.usemod.com/cgi-bin/mb.pl?AbréviationsSontInfernales">AbréviationsSontInfernales</a></li>
</ul>
</div>
<hr/>
<p>The footer goes here.</p>
</body>
</html>
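Producing such a listing is just as simple as consuming it. A sketch in Python of a generator for the fully specified form; the base URL and page names are placeholders, titles are HTML-escaped, and hrefs are URL-encoded in the declared charset:

from html import escape
from urllib.parse import quote

def all_pages_html(base_url, page_names, charset="utf-8"):
    """Render a minimal AllPages document for the given page names."""
    links = "\n".join(
        '<a href="%s?%s">%s</a>'
        % (base_url, quote(name, safe="", encoding=charset), escape(name))
        for name in page_names
    )
    return (
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"'
        ' "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
        '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>AllPages</title>\n'
        '<meta http-equiv="Content-type" content="text/html; charset=%s"/>\n'
        '</head><body>\n<div id="AllPages">\n%s\n</div>\n</body></html>'
        % (charset, links)
    )

# For example:
# print(all_pages_html("http://www.example.org/cgi-bin/wiki.pl", ["AbandonPage", "AbbeNormal"]))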
To be determined.
An entry in the proposed plaintext format (one page per line, fields separated by whitespace):
http://www.srcf.ucam.org/~cjp39/Peri/RecentChanges RecentChanges http://www.srcf.ucam.org/~cjp39/Current/RecentChanges http://www.srcf.ucam.org/~cjp39/Peri/RecentChanges.rss
The same entry in the proposed JSON format:
{ "name": "RecentChanges", "url": "http://www.srcf.ucam.org/~cjp39/Peri/RecentChanges", "unstable-url": "http://www.srcf.ucam.org/~cjp39/Current/RecentChanges", "rss": "http://www.srcf.ucam.org/~cjp39/Peri/RecentChanges.rss" }
I say the plaintext be the "most-recommended" base standard, and the JSON be an optional add-on standard. Why?
I want to elaborate on the importance of making the production of AllPages trivial. My theory is that engine developers are much more likely to implement a standardized format if it takes them 5 minutes to write than if it takes them 10. The plaintext format is so simple that people might just implement it the minute they hear about it -- it takes less time than reading a single wiki page, for goodness' sake! When they see JSON, half of them will stop and go, "hmm, JSON, I'll have to check that out -- better put it off until I have time". That's what I would do if I developed a wiki engine. -- BayleShanks
"[ { name : \"" . join(" }, { name : \"", @allPageNames) . "\" } ]"
This works, for instance, if pagenames can't include backslashes or quotes - as you say, 10 minutes rather than 5. -- ChrisPurcell
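With a JSON library the escaping problem disappears entirely. A Python sketch of the same idea; the field names follow the proposed format above, and the URL scheme is only a placeholder:

import json
from urllib.parse import quote

def all_pages_json(base_url, page_names):
    """Emit the proposed JSON listing with page names safely escaped."""
    entries = [
        {"name": name, "url": "%s?%s" % (base_url, quote(name, safe=""))}
        for name in page_names
    ]
    return json.dumps(entries, ensure_ascii=False)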
Stupid question: is there a character set standard? We have wikis with both mixed characters and nearly totally non-ASCII content (e.g., Russian or Chinese wikis). An explicit mention of UTF-8 wouldn't hurt, I think. (But there may be wikis still out there using national character sets for the URL encodings...)
Further, am I supplying displayable names or URL fragments? If URL fragments then charset interpretation is less important (I give the fragments in the way I expect to receive them back), but of course now you have opaque illegible names for many languages (eg %D7%A2%D7%A8%D7%A9%D7%98%D7%A2_%D7%96%D7%B2%D6%B7%D7%98). -- BrionVibber
But HTTP already specifies the character set, doesn't it? If you don't specify anything, it's Latin-1. If it is not, you need to use something like the following:
Content-Type: text/plain; charset=UTF-8
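A client can simply honour that header rather than hard-coding a charset. A minimal Python sketch (it falls back to Latin-1 when no charset is declared):

import urllib.request

def fetch_listing(url):
    """Fetch an AllPages listing, decoding it with the charset from the HTTP header."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "latin-1"
        return response.read().decode(charset)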
I need URL fragments for my implementation of NearLinks. [1] Thus, no display names for me.
I think we need to define how to get at the various pages given the AllPages URL, lest we need to provide the wiki root URL fragment, too. As far as I am concerned, I would like a convention whereby we can use it as keywords or parameters for GET. Examples:
As for the complexities of remembering whether the source system uses Latin-1, UTF-8, or something like Big5: I personally convert all the page name results to UTF-8 before I save them locally and convert them back to the source charset when linking to them via proxy scripts. Example:
That is, whatever is listed on AllPages should be either HTTP keywords (wiki?All+Pages) or HTTP parameters (wiki?action=browse;id=All%20Pages). This particular proposal would not allow listing something that uses path_info (wiki/namespace/all_pages).
Here's how it would look for a wiki supporting namespaces: The namespace is part of path_info, but the pagename is not.
http://www.communitywiki.org/odd/SandWiki?action=index;raw=1 -> http://www.communitywiki.org/odd/SandWiki?OddWiki
The purpose of AllPages is not to precook the titles for a script to append to a URL, but for agents that need the actual page titles. An InterMap would describe how to build a URL, so I'd like to avoid putting in the full path since a generic agent won't be able to discover the page titles from that format. I don't understand internationalization well enough to talk about the encoding question, although I can see they are related. URL-encoded titles may be necessary--even if we were to accept titles with newlines in them. I'll read up on internationalization. I'm quite rusty on that front. -- SunirShah
I'm fine with dropping the proposal of using large parts of the request URL to build the page URLs. The important question is whether you agree on listing URL fragments or display names -- one of them is used to retrieve pages from the web by an agent, the other is read by humans (something we do not need, I believe). You need to specify what "the actual page titles" means. -- AlexSchroeder
I mean by "the actual page titles" precisely and only the data that constitutes the name of the page, not any extra URL fragments that help build the action. The titles are the same text that users would use to link to the pages, through either a LinkPattern or FreeLinks. This follows the NameAsLocation concept. So, this page would be AllPages, and never action=browse&id=AllPages nor AllPages.html nor 42.html (think SwikiClone).
Aside: Regarding wikis like SwikiClone that use numbers for page locations: they are broken. URLs should not be SystemCodes. If they are, the wiki should expose a MachineInterface to redirect from a given page title to its actual URL. If they can write one of those, they can get rid of the SystemCode URLs. Although I understand there are reasons to use non-semantic URLs, such as ensuring that links never change even if pages are renamed. On a wiki where text is fluid, I'm not sure that matters, but this standard isn't limited to wikis.
Now, the titles themselves, I think, could easily and usefully be URL-encoded. I'm just confused about one point. If I published an AllPages list in Big5, and you had an agent that operates in UTF-8, are the URL encodings the same for the title? Is something like %42%42 always the same character, no matter what the encoding is? That's what I'm going to be reading about. I can see the bytewise encodings being different depending on the charset, and thus the URL encodings. UTF-8 is a good standard, in that case. -- SunirShah
There are several things that need to be considered. I'll explain how Oddmuse does it, then we can look at the various issues and consider what would happen if we ignored them. Perhaps that will give us a good feel for the things that we need to solve and the things that can be optional.
If you don't want to make the charset assumption, either because you want to ignore the content-type header or because you do not care about human readability, then we can list all URL fragments identifying pages (e.g. %c3%9cbersicht) without further ado. It will be possible to construct links to these pages using simple text replacement. Without knowing the encoding used, however, it will not be possible to display this page name to humans, and thus it will not be possible for humans to link to it. That is why I oppose such a simplistic approach.
Thus, in my scheme you need to know the encoding used by both the source and the target wiki before you can write an interlink referring to a page name with non-ASCII characters. CommunityWiki:Übersicht produces a URL ending in %DCbersicht, which will not work: Übersicht was first encoded using Latin-1 and then URL-encoded.
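The difference is easy to reproduce; a quick Python check using the page name from the example:

from urllib.parse import quote

title = "Übersicht"
print(quote(title, encoding="utf-8"))    # %C3%9Cbersicht -- what a UTF-8 wiki expects
print(quote(title, encoding="latin-1"))  # %DCbersicht    -- what a Latin-1 wiki expects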
On Community Wiki, however, linking to MeatBall:ÜberSicht will link to a proxy: http://www.emacswiki.org/cgi-bin/latin-1.pl?url=http://www.usemod.com/cgi-bin/mb.pl?%C3%9CberSicht which will then redirect to http://www.usemod.com/cgi-bin/mb.pl?%DCberSicht. See how the encoding is correct.
The script is really simple, by the way. ([source], [test])
Remember when I said that there are two ways to encode Ü using UTF-8? These two forms are called NFC and NFD -- one prefers composed characters, where Ü is one code point translated to two bytes; the other prefers decomposed characters, so Ü is decomposed into U and the combining diaeresis, which results in three bytes. The standard on the web is NFC. The standard on the HFS+ filesystem is NFD. Thus, if you use the page name bytes as a file name without translating to Unicode (using the internal representation of the system instead of an encoding such as UTF-8), you'll run into trouble... But that's just an aside.
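The two forms are easy to compare directly; a quick Python illustration:

import unicodedata

nfc = unicodedata.normalize("NFC", "Übersicht")
nfd = unicodedata.normalize("NFD", "Übersicht")
print(len(nfc.encode("utf-8")))  # 10 bytes: Ü is one code point, two UTF-8 bytes
print(len(nfd.encode("utf-8")))  # 11 bytes: U plus combining diaeresis, three bytes
print(nfc == nfd)                # False, although both display as "Übersicht"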
Discussion continued to a resolution between AlexSchröder and SunirShah at irc://irc.freenode.net/#wiki, logged to http://sunir.org/meatball/AllPages/01-Nov-2005-FreeNode-%23wiki.txt
I'm missing some steps in the logic of the linked argument. How did we go from "we must use plain-text because it's simpler" to "we must use HTML"? -- ChrisPurcell
I'm pretty confused, but this is because I know nothing about internationalization. My first reaction is to agree with Chris; how is HTML simpler than plaintext? I gather it has something to do with internationalization, so I'm trying to focus my ignorance into questions. Here are my questions:
What is wrong with having a text/plain document listing, on each line: a readable human name, then a delimiter, and then the associated URL? If you did that, would a client that didn't care about showing the page titles to humans be able to ignore the charset and easily use the URLs, or is knowing the charset necessary to use the URLs? (I'm assuming knowing the charset is necessary, since I'm assuming that different charsets use different numbers of bytes per character.)
If something is wrong with that, then would it be possible to have TWO documents, both text/plain, one with one human readable page name on each line, and the other with a machine-usable URL on each line? The first document could be any charset, and the second one must be ASCII?
So, you can see what I'm proposing (if my understanding is correct): it may be a sad truth that you have to be concerned about charsets if you're displaying page titles to a user, but you could just use ASCII if you just want the links, right? So why not separate the page names and the URLs into two separate documents -- then clients who just want the URLs can use ASCII, and clients who need the page names can worry about encodings. In addition, nothing is HTML, so both of these documents are easy to produce and easy to parse.
Sorry for my denseness (and is there any reasonably short intro to this stuff that I should read?).
If HTML is required, however, then I suggest that the spec just explicitly say, "the list of page names must be recoverable by the regular expression >(.*)?</a>" (or whatever the regex should be), and "the list of URLs must be recoverable by the regular expression <a href="([^"]*)"" (or whatever the regex should be).
thanks, -- BayleShanks
<a href="URL">Title</a>
.
<div/>
is required for an XHTML representation, we might as well use it. Further, it is easier for some implementations to just stuff a <div/>
in their templates than write a whole new event handler. -- SunirShah
As I understand it, URLs are and must be written solely in restricted ASCII (octets 0x20-0x7E) as per RFC 1738. There are no character set issues unless you're given an incorrectly formed URL and have to guess how to %-escape it, which one should simply rule out. -- ChrisPurcell
But that encoding depends on the charset the server uses to decode, not the charset of the file you send the URLs in. Hence, using ASCII is fine. -- ChrisPurcell
There's been discussion about producing several AllPages standards (plaintext, JSON), and it seems a bit arbitrary to come to an out-of-line resolution summarily discarding them. -- ChrisPurcell
Okay. Confusion over definition of "resolution", then. -- ChrisPurcell
(moved from SisterSitesImplementationGuide)
Typically, SisterSite and NearLink implementers work with small page lists (perhaps a few hundred or a few thousand pages) and are quite happy to update once a day, typically at night. But working across wikis this isn't efficient, because the display lags behind for hours (e.g. the begging page on near linking).
One should think about sites containing 10^5 pages and keeping page lists pretty current (e.g. updating every 5-10 minutes).
This is easy to do, but it needs a different, incremental API: give me the pages added or deleted since a given (not necessarily Unix) second.
I do this by having a "page index log" that records page additions (+) and deletions (-), one entry per line:
second[+|-]pagename ....
When you update, you call something like "action=pagelog&since=1129883609" and the server transfers only the most recent lines; the last transferred line naturally contains the timestamp from which to continue on the next update. This way, only a few page names have to be transferred per update, and one can just append the lines to the local "page index log", which duplicates the server log. The method is independent of availability problems and - working with server timestamps - needs no synchronized time. From the log, the page index can be extracted easily, adding and deleting entries through a hash table.
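A client-side sketch of this scheme in Python; the action name, the since parameter, and the line format follow the description above, and the log is assumed to be UTF-8:

import re
import urllib.request

LOG_LINE = re.compile(r'^(\d+)([+-])(.+)$')

def update_page_index(base_url, pages, last_second):
    """Apply page-log lines newer than last_second to the local page set.

    pages is a set of page names; returns the timestamp to use on the next call.
    Lines are assumed to look like 1129883609+PageName or 1129883609-PageName.
    """
    url = "%s?action=pagelog&since=%d" % (base_url, last_second)
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    newest = last_second
    for line in text.splitlines():
        match = LOG_LINE.match(line)
        if not match:
            continue
        second, op, name = int(match.group(1)), match.group(2), match.group(3)
        if op == "+":
            pages.add(name)
        else:
            pages.discard(name)
        newest = max(newest, second)
    return newest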
I think using UNIX diff makes more sense as you can then use UNIX patch. This limits the coding challenge. -- SunirShah
UNIX diff and patch are also available for Windows.
I don't understand the reference to diff. How should that work? -- HelmutLeitner
The server sends a diff between the current AllPages file and what it looked like the last time the client got the file.
To be honest, it's not clear if this is computationally easier or pragmatic. We should try the diff method out first before agreeing to it. It does have the advantage that it will apply equally well to many formats, like the LinkDatabase. -- SunirShah
Using diff means that the server has to keep states for clients. That would be ugly. -- HelmutLeitner
Alternately, the client could send the time of its last check. -- JohnAbbe
A UNIX diff would not require the server to keep any more state than a +/- change log. Simply take the page list for today; then, using the change log, reconstruct the page list as it was at the requested time; and then compute the diff. This requires twice the memory and more computation time. It may be simpler to just use a +/- change log. -- SunirShah
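To make the trade-off concrete, here is a rough Python sketch of the diff approach, using difflib in place of the external UNIX diff; the reconstruction of the old list from the change log is assumed to have happened already:

import difflib

def allpages_diff(old_pages, new_pages):
    """Return a unified diff between the reconstructed old listing and the current one."""
    old_lines = sorted(old_pages)
    new_lines = sorted(new_pages)
    return "\n".join(
        difflib.unified_diff(old_lines, new_lines, "AllPages@then", "AllPages@now", lineterm="")
    )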
At RecentChangesCamp:SisterSites2006, ChrisPurcell suggested using Rest:HttpEvents to keep sister wikis synchronized with low latency and low overheads.