This is a cheap hack to avoid doing a lot of complicated math to figure out what the clusters are. This IndexingScheme essentially picks a random page in the database and follows its BackLinks, keeping track of the number of visits to each page as it does this. The current walk depth is 3 (I don't think a lot of people are likely to travel further than 3 pages before getting bored), and the current sample size is 5000 trials.
The reason it follows BackLinks is that we want to find pages that people would like to start from, not go to. Consequently, we reverse the direction of the walk, essentially doing what a typical user would do, but in reverse.
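A minimal sketch of this walk in Perl, assuming a page list @pages and a backlinks() function that returns the pages linking to a given page (both are stand-ins for illustration, not taken from the actual source linked below):

    # Accumulate visit counts over many short backwards walks.
    my %visits;
    my $depth  = 3;       # walk depth
    my $trials = 5000;    # sample size

    for my $trial (1 .. $trials) {
        my $page = $pages[ int rand @pages ];    # start at a random page
        for my $step (1 .. $depth) {
            my @back = backlinks($page);
            last unless @back;                   # dead end: no BackLinks
            $page = $back[ int rand @back ];     # follow a random BackLink
            $visits{$page}++;                    # accumulate the visit
        }
    }

    # The most-visited pages are the candidate centres.
    my @centres = (sort { $visits{$b} <=> $visits{$a} } keys %visits)[0 .. 24];

Whether the real code also counts the starting page, or picks its steps differently, I haven't checked; this only shows the shape of the walk.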
The principal failure in this model, besides the statistical cheating, is that it initializes with a random page. Not all pages are equally likely to be accessed randomly. But weighting this with, say, statistics from HitCounts might unnecessarily bias it towards existing practice. If existing practice is poor, the centre detection will be poor. Maybe that isn't much of a loss anyway.
Another failure is that it considers all links equal, and all collections of links equal, which is why EricScheidRandomThoughts is on the list. What would happen if I put the list of "centres" all on one page -- would that page float to the top of the list?
Source code at http://sunir.org/src/meatball/IndexingSchemes.
By the way, there is a perfectly accurate way that may be computationally feasible for some wikis, especially smaller ones. For each node, recursively follow every child link, every child's child's link, etc. to a depth of d, counting the nodes as you go. The nodes that root the largest hypertrees are the best centres.
    Centres(graph G)
        for each node N in G
            sizes[N] <-- countChildren(N, d)
        end for
        for the 25 largest sizes, print their respective nodes.
    end Centres

    countChildren(node N, depth d)
        size <-- 1
        return size if d = 0
        for each child C of N
            size <-- size + countChildren(C, d-1)
        end for
        return size
    end countChildren
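For concreteness, the recursion might look like this in Perl, assuming a links() function that returns the pages a given page links to (its children); this is just a transcription of the pseudocode, not tested against any particular wiki engine:

    sub countChildren {
        my ($page, $depth) = @_;
        my $size = 1;                          # count the node itself
        return $size if $depth == 0;
        for my $child (links($page)) {
            $size += countChildren($child, $depth - 1);
        }
        return $size;
    }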
However, this is superexpensive on a large wiki, or for a large d, or for a wiki with dense linking (as are most wikis). The AccumulatedRandomWalk is cheaper, yet probably accurate.
Assuming I'm not confused (not necessarily a safe assumption), here is a non-recursive algorithm that computes the same numbers as countChildren (pardon the pseudo-Perl):
    foreach my $page (@pages) {
        $count{$page}[0] = 1;
    }
    for (my $d = 1; $d <= 3; $d++) {
        foreach my $page (@pages) {
            $count{$page}[$d] = 0;
        }
        foreach my $child (@pages) {
            foreach my $parent (backlinks($child)) {
                $count{$parent}[$d] += $count{$child}[$d - 1];
            }
        }
    }

Then, I think,
    countChildren($page, 3) == sum(@{$count{$page}}[0 .. 3])
Note that $count{$page}[0] (normalized appropriately) is the pdf for your random walk at the zeroth step (i.e., your initial uniform distribution across all pages). $count{$page}[1] gives the pdf describing where you'll be one (backwards) step into your walk. (And so on...)
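A sketch of that normalization, reusing %count and @pages from above; it just implements the interpretation described in the previous paragraph:

    use List::Util qw(sum);

    for my $d (0 .. 3) {
        my $total = sum map { $count{$_}[$d] } @pages;
        next unless $total;    # avoid division by zero on a linkless wiki
        my %pdf = map { $_ => $count{$_}[$d] / $total } @pages;
        # $pdf{$page} is then the claimed probability of being at $page
        # after $d backwards steps of the walk.
    }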
-- JeffDairiki
For reference, the "starting points" on 2001-09-30 are:
CategoryGraphTheory CategoryIndexingScheme