That is, for any given page P, find the shortest path to each other page and sum the path lengths. Sort the pages based on this sum.
As an implementation detail, you would probably use some sort of all-pairs shortest-path algorithm.
See also MostReferencedPages which is a close cousin.
I am thinking of something like finding the set of k (10?) pages that minimizes the total distance between any given page and the closest page in the set. That is, these pages would be the ones distributed most evenly across the page graph. However, I think the computation would take forever: C(n,k) * T(ShortestPathPages). T(ShortestPathPages) ~= 5 minutes, which means the computation could take several years. There must be a better way to calculate this. Perhaps some mutated GraphBalancing? algorithm. -- SunirShah
What is C(n,k)? Is it the number of possible choices of k elements out of a set of n (in this case, k pages out of a total of n)? Yes: n choose k.
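To get a feel for why brute-force enumeration of C(n,k) candidate sets is hopeless, here is a small sketch that computes the binomial coefficient iteratively. The n = 500, k = 10 figures come from the discussion; the `choose` helper is mine, not part of any wiki script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Binomial coefficient C(n,k), computed iteratively so we never
# form the huge intermediate factorials.
sub choose {
    my ($n, $k) = @_;
    return 0 if $k < 0 || $k > $n;
    my $result = 1;
    for my $i (1 .. $k) {
        # Multiplying in this order keeps each intermediate value exact
        # (it is itself a binomial coefficient).
        $result = $result * ($n - $k + $i) / $i;
    }
    return $result;
}

# With n = 500 pages and k = 10, the search space is astronomical:
printf "C(500,10) = %g\n", choose(500, 10);   # roughly 2.5e20 candidate sets
```

At five minutes per evaluation, even a vanishing fraction of that search space dwarfs "several years", which is why a heuristic is needed.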
If the goal is to avoid getting a lot of "duplicate" good centres in the set, preferring instead to get pages which are close to sortof-different regions of the wiki, maybe a greedy approach would work: start with an empty set, find the best centre, now each time you add an element to the set, find the element with the most "shortening" of paths relative to the best that your current set can do. -- BayleShanks
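The greedy idea above can be sketched directly: given a table of pairwise distances, repeatedly add whichever page most reduces the total distance from every page to its nearest chosen centre. This is a hypothetical illustration, not Meatball's actual code; the `greedy_centres` name and the distance-table layout (`$dist->{$a}{$b}` = shortest path length) are my own assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Greedy centre selection: each round, pick the page whose addition
# most shrinks the sum over all pages of (distance to nearest centre).
sub greedy_centres {
    my ($dist, $k) = @_;
    my @pages   = sort keys %$dist;
    my @centres = ();
    # Best known distance from each page to the centre set; "infinite" at start.
    my %best = map { $_ => 1e9 } @pages;
    while (@centres < $k) {
        my ($winner, $winner_cost);
        foreach my $cand (@pages) {
            next if grep { $_ eq $cand } @centres;
            # Total cost if we were to add $cand to the set.
            my $cost = 0;
            foreach my $p (@pages) {
                my $d = $dist->{$cand}{$p};
                $cost += ($d < $best{$p}) ? $d : $best{$p};
            }
            if (!defined $winner || $cost < $winner_cost) {
                ($winner, $winner_cost) = ($cand, $cost);
            }
        }
        push @centres, $winner;
        # Fold the new centre into the nearest-centre distances.
        foreach my $p (@pages) {
            my $d = $dist->{$winner}{$p};
            $best{$p} = $d if $d < $best{$p};
        }
    }
    return @centres;
}
```

Each round is O(n^2) over the distance table, so k rounds cost O(k * n^2) — a far cry from enumerating all C(n,k) subsets, at the price of only approximating the optimum.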
[August 19, 2000] I've implemented this for MeatballWiki. It is generated once a day because of the load it exacts on my ISP's server. It's not as if things really change all that much from day to day.
This script is non-obvious, so I'm going to paste some Perl code here. Assume there are two hash tables, %forward and %back, that track the forward links and back links of pages respectively. @{$forward{ShortestPathPages}} is the list of forward links from this page, for instance.
A "cluster" is actually a component of the link graph (i.e. a maximal connected subgraph). The link graph isn't guaranteed to be connected (and, it isn't), so this allows us to give more weight to pages in larger components. We need to do that because pages in very small components will have very small path lengths, though they are anything but centres!
The algorithm below breadth-first searches the link database from each node. It might have been more efficient to use the Floyd-Warshall all-pairs shortest-path algorithm, but that would quickly consume all the memory on my ISP's machine. This will have to do. The script takes about 30 seconds to run with 500 pages in the database.
    my %clusters;
    foreach my $page (keys %forward) {
        # Breadth-first search outward from $page, following links
        # in both directions (forward links, then back links).
        my %distances = ($page => 0);
        my %visited   = ($page => 1);
        my @queue     = ($page);
        while (scalar @queue) {
            my $current = shift @queue;
            foreach my $link (@{$forward{$current}}, @{$back{$current}}) {
                if (!$visited{$link}) {
                    $visited{$link}++;
                    $distances{$link} = $distances{$current} + 1;
                    push @queue, $link;
                }
            }
        }

        # Sum the distances from $page to every reachable page.
        my $distance = 0;
        foreach my $to (keys %distances) {
            $distance += $distances{$to};
        }

        # Ignore orphans
        next if !$distance;

        # The sorted list of reachable pages identifies the component ("cluster").
        my $cluster = join '|', sort keys %distances;
        $clusters{$cluster} = {} if !exists $clusters{$cluster};
        $clusters{$cluster}->{$page} = $distance;
    }
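The script above only builds %clusters; to produce the actual index you still have to sort. Here is a hedged sketch of that final step — my guess at what "give more weight to pages in larger components" means, wrapped in a hypothetical `rank_pages` helper (the real Meatball report may order things differently): list larger components first, and within a component rank pages by ascending distance sum, smallest (most central) first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rank pages from a %clusters structure as built by the BFS script:
# keys are '|'-joined member lists, values are { page => distance-sum }.
sub rank_pages {
    my ($clusters) = @_;

    # Component size = number of member pages in the cluster key.
    my %size;
    for my $cluster (keys %$clusters) {
        my @members = split /\|/, $cluster;
        $size{$cluster} = scalar @members;
    }

    my @ranked;
    foreach my $cluster (sort { $size{$b} <=> $size{$a} } keys %$clusters) {
        my $pages = $clusters->{$cluster};
        # Within a component, the smallest distance sum is the most central page.
        push @ranked, sort { $pages->{$a} <=> $pages->{$b} } keys %$pages;
    }
    return @ranked;
}
```

A usage example: for `%clusters = ('A|B|C' => { A => 3, B => 2, C => 5 }, 'X|Y' => { X => 1, Y => 2 })`, the three-page component is listed before the two-page one, with B first as its centre.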
-- SunirShah
what a great idea! I'm really impressed by all the IndexingSchemes you have made for this site -- thanks a lot. I am sure that I will use a few of them to find central content after I have finished my own protracted breadth-first-search personal exploration of the site.
i was wondering if the data on the wiki that your scripts use is publicly available? As in, is there a place on the site I can get something like a %forward hashtable?
See the LinkDatabase for the data source. And no, there are no options as it is run offline, but you can modify the script if you'd like. -- SunirShah
thanks. Wow, the LinkDatabase is sweet! -- bs
Script to do
Script updates
I don't see the point of path distance. Shortest path would only mean something if the link structure were in near-perfect semantic condition, but it never is. In this wiki any page is a maximum of 4 clicks away (Indices => AllPages => action=index => Page). It is trivial (and we have done it in DseWiki) to put "Index" ("action=index") on that link bar, so that any page is a maximum of 2 clicks away. -- HelmutLeitner
Well, two clicks with an awful lot of scrolling in between to find the page you want. -- EricScheid
Does the script ignore indexing pages when calculating ShortestPathPages? -- bs
There are no indexing pages except for RecentChanges. It includes the text at the top of RecentChanges, but not the rclog.
Also, for anyone interested in this stuff, you might be interested in [Jon Kleinberg] & friends, who do research in this sort of thing. From what I've heard, his ["hubs and authorities"] idea is all the rage. -- BayleShanks
A good indicator of the quality of this index is this grouping, which is the first long stretch of non-meta pages. (2004-04-20)
Quite enlightening.