MetaWiki does this when doing TwinPages matching. The latest and best code is always available at http://sunir.org/src/meatball/IndexingSchemes/meta.pl; the copy here is reproduced for discussion purposes.
 sub Canonicalize {
     my $title = shift;
     $title =~ s/\%(..)/pack('C', hex($1))/ge;
     $title =~ s/(.*)\~.*/$1/g; # FoxWiki namespaces
     $title =~ s/\W+//g;
     $title =~ s/_//g;
     $title =~ tr/[A-Z]ÖÄÜöäü/[a-z]oauoau/;
     return $title;
 }
So, essentially it
This method has a flaw. An example (stolen from PhpWiki): BrigAndJail and BrigandJail are different titles, yet both canonicalize to the same string, brigandjail. A solution would be to place an underscore at each word boundary, thus eliminating the ambiguity. (Spaces in free links should also be converted to underscores, with adjacent underscores then reduced to one.) -- IanBollinger
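A minimal sketch of the underscore proposal, for discussion only. The name CanonicalizeWithBoundaries and the rule "a word boundary is a lowercase letter or digit followed by a capital" are assumptions made for illustration; they are not taken from meta.pl or PhpWiki.

```perl
use strict;
use warnings;

# Hypothetical variant of Canonicalize that keeps word boundaries.
# The boundary rule below is an assumption, not MetaWiki's behaviour.
sub CanonicalizeWithBoundaries {
    my $title = shift;
    $title =~ s/%([0-9A-Fa-f]{2})/pack('C', hex($1))/ge; # decode %XX escapes
    $title =~ s/([a-z0-9])([A-Z])/$1_$2/g;  # mark lower-to-upper transitions
    $title =~ s/\s+/_/g;                    # spaces in free links become underscores
    $title =~ s/[^A-Za-z0-9_]+//g;          # drop remaining punctuation
    $title =~ s/_+/_/g;                     # reduce adjacent underscores to one
    $title =~ s/^_|_$//g;                   # trim leading/trailing underscores
    return lc $title;
}

print CanonicalizeWithBoundaries('BrigAndJail'), "\n";    # brig_and_jail
print CanonicalizeWithBoundaries('BrigandJail'), "\n";    # brigand_jail
print CanonicalizeWithBoundaries('brig and  jail'), "\n"; # brig_and_jail
```

With this rule the two titles stay distinct, while the free link "brig and jail" still matches BrigAndJail. The price is that HomePage and Homepage no longer match.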
Canonicalization may have different goals. It might help to specify them. Typically canonicalization will make sense only within one language, so talking here about foreign language pages doesn't make much sense; I would leave that to the German or Sanskrit speakers. Some languages may prefer to canonicalize to Unicode, so some issues will look quite different. There is a word boundary issue that has different aspects (keep boundaries alive vs. unify word boundaries vs. "identify HomePage with Homepage"). There may be semantic issues (X PRO AND CON and X DISCUSSION, JAVAVERSUSCSHARP and JAVAVSCSHARP, CATEGORYHOMEPAGE and FOLDERCONTRIBUTORS). There may be regional issues (like COLOUR and COLOR, GRAY and GREY). -- HelmutLeitner
Perhaps the best way to address the Wiki:WikiNamePluralProblem is to use the [Porter stemmer] algorithm as part of WikiNameCanonicalization. -- SunirShah
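To make the stemming idea concrete, here is a rough sketch that applies only step 1a of the Porter stemmer (the plural-suffix rules) to each word of a WikiName. It is not the full algorithm. The helper names PorterStep1a and CanonicalizePlural, and the choice to split words at capital letters, are assumptions made for illustration.

```perl
use strict;
use warnings;

# Step 1a of the Porter stemmer only (plural suffixes); not the full algorithm.
sub PorterStep1a {
    my $word = lc(shift);
    return $word if $word =~ s/sses$/ss/;  # caresses -> caress
    return $word if $word =~ s/ies$/i/;    # ponies   -> poni
    return $word if $word =~ /ss$/;        # caress   -> caress
    $word =~ s/s$//;                       # cats     -> cat
    return $word;
}

# Hypothetical helper: split a WikiName at capital letters and stem each word.
sub CanonicalizePlural {
    my $title = shift;
    my @words = $title =~ /([A-Z][a-z0-9]*|[a-z0-9]+)/g;
    return join '', map { PorterStep1a($_) } @words;
}

print CanonicalizePlural('TwinPages'), "\n";  # twinpage
print CanonicalizePlural('TwinPage'), "\n";   # twinpage
```

Step 1a alone does not unify pairs such as Category and Categories (category vs. categori); the later steps of the full stemmer would be needed for that.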