This is a cheap hack to avoid doing a lot of complicated math to figure out what the clusters are. This IndexingScheme essentially picks a random page in the database and follows its BackLinks, keeping track of the number of visits to each page as it does this. The current walk depth is 3 (I don't think a lot of people are likely to travel further than 3 pages before getting bored), and the current sample size is 5000 trials.
The reason it follows BackLinks is that we want to find pages that people would like to start from, not go to. Consequently, we reverse the direction of the walk, essentially doing what a typical user would do, but in reverse.
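A minimal sketch of this walk in Perl, assuming a page list @pages and a backlinks() function that returns the pages linking to a given page (both are stand-ins for illustration, not taken from the actual source linked below):

    # Accumulate visit counts over many short backwards walks.
    my %visits;
    my $depth  = 3;       # walk depth
    my $trials = 5000;    # sample size

    for my $trial (1 .. $trials) {
        my $page = $pages[ int rand @pages ];    # start at a random page
        for my $step (1 .. $depth) {
            my @back = backlinks($page);
            last unless @back;                   # dead end: no BackLinks
            $page = $back[ int rand @back ];     # follow a random BackLink
            $visits{$page}++;                    # accumulate the visit
        }
    }

    # The most-visited pages are the candidate centres.
    my @centres = (sort { $visits{$b} <=> $visits{$a} } keys %visits)[0 .. 24];

Whether the real code also counts the starting page, or picks its steps differently, I haven't checked; this only shows the shape of the walk.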
The principal failure in this model, besides the statistical cheating, is that it initializes with a random page. Not all pages are equally likely to be accessed randomly. But weighting this with, say, statistics from HitCounts might unnecessarily bias it towards existing practice. If existing practice is poor, the centre detection will be poor. Maybe that isn't much of a loss anyway.
Another failure is that it considers all links equal, and all collections of links equal, which is why EricScheidRandomThoughts is on the list. What would happen if I put the list of "centres" all on one page -- would that page float to the top of the list?
Source code at http://sunir.org/src/meatball/IndexingSchemes.
By the way, there is a perfectly accurate way that may be computationally feasible for some wikis, especially smaller ones. For each node, recursively follow every child link, every child's child's link, etc. to a depth of d, counting the nodes as you go. The nodes that root the largest hypertrees are the best centres.
    Centres(graph G)
        for each node N in G
            sizes[N] <-- countChildren(N, d)
        end for
        for the 25 largest sizes, print their respective nodes.
    end Centres

    countChildren(node N, depth d)
        size <-- 1
        return size if d = 0
        for each child C of N
            size <-- size + countChildren(C, d-1)
        end for
        return size
    end countChildren
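For concreteness, the recursion might look like this in Perl, assuming a links() function that returns the pages a given page links to (its children); this is just a transcription of the pseudocode, not tested against any particular wiki engine:

    sub countChildren {
        my ($page, $depth) = @_;
        my $size = 1;                          # count the node itself
        return $size if $depth == 0;
        for my $child (links($page)) {
            $size += countChildren($child, $depth - 1);
        }
        return $size;
    }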
However, this is superexpensive on a large wiki, or for a large d, or for a wiki with dense linking (as are most wikis). The AccumulatedRandomWalk is cheaper, yet probably accurate.
Assuming I'm not confused (not necessarily a safe assumption), here is a non-recursive algorithm that computes the same numbers as countChildren (pardon the pseudo-Perl):
    foreach my $page (@pages) {
        $count{$page}[0] = 1;
    }
    for (my $d = 1; $d <= 3; $d++) {
        foreach my $page (@pages) {
            $count{$page}[$d] = 0;
        }
        foreach my $child (@pages) {
            foreach my $parent (backlinks($child)) {
                $count{$parent}[$d] += $count{$child}[$d - 1];
            }
        }
    }

Then, I think,
    countChildren($page, 3) == sum(@{$count{$page}}[0 .. 3])
Note that $count{$page}[0] (normalized appropriately) is the pdf for your random walk at the zeroth step (i.e., your initial uniform distribution across all pages). $count{$page}[1] gives the pdf describing where you'll be one (backwards) step into your walk. (And so on...)
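A sketch of that normalization, reusing %count and @pages from above; it just implements the interpretation described in the previous paragraph:

    use List::Util qw(sum);

    for my $d (0 .. 3) {
        my $total = sum map { $count{$_}[$d] } @pages;
        next unless $total;    # avoid division by zero on a linkless wiki
        my %pdf = map { $_ => $count{$_}[$d] / $total } @pages;
        # $pdf{$page} is then the claimed probability of being at $page
        # after $d backwards steps of the walk.
    }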
-- JeffDairiki
For reference, the "starting points" on 2001-09-30 are:
CategoryGraphTheory CategoryIndexingScheme