WikiNameCanonicalization


Since different wikis use different LinkPatterns and different people spell things differently (e.g. CategoryHomepage vs. CategoryHomePage), it's necessary to canonicalize the names when moving from one namespace to another.

MetaWiki does this when doing TwinPages matching. The latest and best code is always available at http://sunir.org/src/meatball/IndexingSchemes/meta.pl, but the excerpt below is useful for discussion purposes.

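sub Canonicalize
{
    my $title = shift;
    # Decode %XX URL escapes into single characters.
    $title =~ s/%([0-9A-Fa-f]{2})/pack('C', hex($1))/ge;
    # FoxWiki namespaces: keep only the text before the last '~'.
    $title =~ s/(.*)~.*/$1/;
    # Lowercase and fold German umlauts. This runs before the \W strip,
    # which can otherwise delete the umlauts on byte strings.
    $title =~ tr/A-ZÖÄÜöäü/a-zoauoau/;
    # Remove non-word characters; \W does not match '_', so list it too.
    $title =~ s/[\W_]+//g;
    return $title;
}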
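So, essentially it decodes %XX escapes, drops everything from a FoxWiki '~' onward, removes all non-word characters and underscores, and lowercases what is left, folding German umlauts to their base vowels.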
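To see the effect on the spellings mentioned at the top of the page, here is a small self-contained sketch. It carries a lightly tidied copy of the routine so it runs on its own; the `use utf8` pragma and the list of sample spellings are assumptions made for illustration, not part of meta.pl.

```perl
use strict;
use warnings;
use utf8;    # assumption: source and titles are handled as Unicode characters

# Lightly tidied copy of the routine quoted above, so the example runs alone.
sub Canonicalize {
    my $title = shift;
    $title =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # decode %XX escapes
    $title =~ s/(.*)~.*/$1/;                         # drop FoxWiki '~' suffix
    $title =~ tr/A-ZÖÄÜöäü/a-zoauoau/;               # lowercase, fold umlauts
    $title =~ s/[\W_]+//g;                           # strip spaces, punctuation, '_'
    return $title;
}

# Hypothetical spellings of one page, as different wikis might link it.
my @spellings = (
    'CategoryHomepage',
    'CategoryHomePage',
    'Category_Home_Page',
    'Category%20Home%20Page',
);

# Collect the distinct canonical names; all four collapse to a single key.
my %seen;
$seen{ Canonicalize($_) }++ for @spellings;
print join(', ', sort keys %seen), "\n";
```

All four spellings end up under the one key `categoryhomepage`, which is what TwinPages matching needs.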


This method has a flaw. For example (stolen from PhpWiki), BrigAndJail and BrigandJail are different titles, yet both canonicalize to the same name. A solution would be to place an underscore at each word boundary, thus eliminating the ambiguity. (Spaces in free links should also be converted to underscores, with adjacent underscores then reduced to one.) -- IanBollinger
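The underscore proposal could be sketched roughly as follows. The function name `CanonicalizeWithBoundaries` and the rule used to detect a CamelCase boundary (a lowercase letter or digit followed by an uppercase letter) are assumptions for illustration; nothing like this appears in meta.pl.

```perl
use strict;
use warnings;

# Hypothetical variant: keep word boundaries as single underscores.
sub CanonicalizeWithBoundaries {
    my $title = shift;
    $title =~ s/([a-z0-9])([A-Z])/$1_$2/g;   # mark CamelCase boundaries: BrigAnd -> Brig_And
    $title =~ s/[\W_]+/_/g;                  # spaces, punctuation, underscore runs -> one '_'
    $title =~ s/^_+|_+$//g;                  # trim leading and trailing underscores
    return lc $title;
}

printf "%-14s => %s\n", $_, CanonicalizeWithBoundaries($_)
    for 'BrigAndJail', 'BrigandJail', 'Brig And Jail';
```

Under this scheme BrigAndJail and the free link "Brig And Jail" both become `brig_and_jail`, while BrigandJail becomes `brigand_jail`. The cost is that CategoryHomepage and CategoryHomePage no longer collapse to one name.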

I can only regard the above example as extremely contrived. Furthermore, the reason to remove spaces rather than simply convert them to underscores is so that "WikiWord" and "[[Wiki Word]]" canonicalize to the same node. It's a good first step, but I'd also consider adding language-specific canonical transforms that are applied when a node is not found. For English, one such transform would be to strip the plural off (chopping off the 's' usually works, though one can make it more intelligent). German would probably want to transform the eszett (ß) to double-s and, once the umlaut is stripped from a vowel, append an 'e' after it. Mind you, you'll annoy plenty of Germans by canonicalizing the eszett in the page title, so you may then want to keep a disconnect between the page identifier and the title... complexities abound. -- ChuckAdams
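A minimal sketch of the "transform only when the node is not found" idea, under stated assumptions: page identifiers are already lowercased, the `%pages` hash stands in for whatever index the wiki really uses, and `FindNode`, `%transforms`, and the sample pages are hypothetical names invented here. The plural rule is deliberately naive.

```perl
use strict;
use warnings;
use utf8;                                  # assumption: titles are Unicode strings
binmode STDOUT, ':encoding(UTF-8)';

# Page identifier => display title, kept separate so the title keeps its ß.
my %pages = (
    wikiword => 'WikiWord',
    strasse  => 'Straße',
    muenchen => 'München',
);

my %umlaut = ('ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue');

# Hypothetical per-language fallbacks, tried in order after a failed lookup.
my %transforms = (
    en => [
        sub { my $t = shift; $t =~ s/s$//; return $t },   # naive plural strip
    ],
    de => [
        sub {                                             # ß -> ss, ü -> ue, ...
            my $t = shift;
            $t =~ s/ß/ss/g;
            $t =~ s/([äöü])/$umlaut{$1}/g;
            return $t;
        },
    ],
);

# Return the identifier of an existing page, or nothing if no variant matches.
sub FindNode {
    my ($id, $lang, $pages) = @_;
    return $id if exists $pages->{$id};                   # exact hit
    for my $transform (@{ $transforms{$lang} || [] }) {
        my $candidate = $transform->($id);
        return $candidate if exists $pages->{$candidate};
    }
    return;
}

for my $query (['wikiwords', 'en'], ['straße', 'de'], ['münchen', 'de']) {
    my $id = FindNode(@$query, \%pages);
    print "$query->[0] => ", (defined $id ? "$id ($pages{$id})" : 'not found'), "\n";
}
```

Because the transforms run only after a failed lookup, a page that really is named with a trailing 's' still wins over its singular form.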

Canonicalization may have different goals. It might help to specify them. Typically canonicalization will make sense only within one language, so talking here about foreign-language pages doesn't make much sense; I would leave that to the German or Sanskrit speakers. Maybe some languages will prefer to canonicalize to Unicode, so some issues will look quite different. There is a word boundary issue that has different aspects (keep boundaries alive vs. unify word boundaries vs. "identify HomePage with Homepage"). There may be semantic issues (X PRO AND CON and X DISCUSSION, JAVAVERSUSCSHARP and JAVAVSCSHARP, CATEGORYHOMEPAGE and FOLDERCONTRIBUTORS). There may be regional issues (like COLOUR and COLOR, GRAY and GREY). -- HelmutLeitner


Perhaps the best way to address the Wiki:WikiNamePluralProblem is to use the Porter stemmer algorithm as part of WikiNameCanonicalization. -- SunirShah


CategoryWikiTechnology
