ShortestPathPages

MeatballWiki | RecentChanges | Random Page | Indices | Categories

In the style of SixDegreesOfKevinBacon?, an IndexingScheme that lists pages that have the lowest sum total shortest paths to other pages first. This scheme hopefully will expose natural centres or nodal points in the PageDatabase.

That is, for any given page P, find the shortest path to each other page and sum the path lengths. Sort the pages based on this sum.

As an implementation detail, you would probably use some sort of all pairs shortest path algorithm.

See also MostReferencedPages which is a close cousin.

I am thinking of something like finding the set of k (10?) pages that has the lowest sum total distances between any given page and the closest page in the set. That is, these pages would be the ones distributed across the page graph the most evenly. However, I think the computation would take forever. C(n,k) * T(ShortestPathPages). T(ShortestPathPages) ~= 5 minutes, which means the computation could eventually take several years. There must be a better way to calculate this. Perhaps some mutated GraphBalancing? algorithm. -- SunirShah

: Somehow I have a vision of Sunir *finally* solving this problem, and then realizing it's equivalent to something famous like the "Travelling Salesman Problem". :-) (I have no idea whether the problems are equivalent, but it's beginning to sound somewhat familiar.) --CliffordAdams

what is C(n,k)? Is it the number of possible choices of k elements out of a set of n (in this case k pages out of a total of n)? Yes. n choose k.

If the goal is to avoid getting a lot of "duplicate" good centres in the set, preferring instead to get pages which are close to sortof-different regions of the wiki, maybe a greedy approach would work: start with an empty set, find the best centre, now each time you add an element to the set, find the element with the most "shortening" of paths relative to the best that your current set can do. -- BayleShanks

CategoryGraphTheory

It would also be interesting to know what the shortest path is, given two input pages. --SebPaquet

[August 19, 2000] I've implemented this for MeatballWiki. It is generated once a day because of the load it extolls on my ISP's server. Not like things are really going to change all that much from day to day.

: http://sunir.org/meatball/path.html

This script is non obvious, so I'm going to paste some Perl code here. Assume there are two hash tables, %forward and %back that track forward and back links of pages respectively. @(%forward{ShortestPathPages}} is the list of forward links from this page, for instance.

A "cluster" is actually a component of the link graph (i.e. a maximal connected subgraph). The link graph isn't guaranteed to be connected (and, it isn't), so this allows us to give more weight to pages in larger components. We need to do that because pages in very small components will have very small path lengths, though they are anything but centres!

The algorithm below breadth first searches the link database from each node. It might have been more efficient to use Floyd-Warshall all pairs shortest path algorithm, but that would quickly consume all memory on my ISP's machine. This will have to do. The script takes about 30 seconds to run with 500 pages in the database.

my %clusters;

foreach $page (keys %forward) {
    my %distances = ($page => 0);
    my %visited = ($page => 1);
    my @queue = ($page);

    while( scalar @queue ) {
        my $current = shift @queue;

        foreach $link (@{$forward{$current}}) {
            if( !$visited{$link} ) {
                $visited{$link}++;
                $distances{$link} = $distances{$current} + 1;
                push @queue, $link;
            }
        }

        foreach $link (@{$back{$current}}) {
            if( !$visited{$link} ) {
                $visited{$link}++;
                $distances{$link} = $distances{$current} + 1;
                push @queue, $link;
            }
        }
    }

    my $distance;
    foreach $to (keys %distances) {
        $distance += $distances{$to};
    }

    # Ignore orphans
    next if !$distance;

    my $cluster = join '|', sort keys %distances;
    $clusters{$cluster} = {} if !exists $clusters{$cluster};
    $clusters{$cluster}->{$page} = $distance;
}

-- SunirShah

what a great idea! I'm really impressed by all the IndexingSchemes you have made for this site -- thanks a lot. I am sure that I will use a few of them to find central content after I have finished my own protracted breadth-first-search personal exploration of the site.

i was wondering if the data on the wiki that your scripts use is publically available? As in, is there a place on the site I can get something like a %forward hashtable?

See the LinkDatabase for the data source. And no, there are no options as it is run offline, but you can modify the script if you'd like. -- SunirShah

thanks. Wow, the LinkDatabase is sweet! -- bs

Script to do

[CategoryToDo]

Script updates

[February 24, 2001] Upgraded to Standard.pm.

[October 9, 2000] Fixed bug where the CategoryHomePage search slurps up RecentChanges and MeatballWiki. SlashDot has moved done again!

[August 22, 2000] I've changed the script the exclude pages with the CategoryHomePage marker. The listing is now more useful to the casual browser, I think. -- SunirShah

I don't see the problem of path distance. Shortest path would only mean something if you have a link structure that is semantically in a near perfect condition. But you never have. In this wiki any page is a maximum of 4 clicks away (Indices => AllPages => action=index => Page). It is trivial (and we have in DseWiki) to put "Index"("action=index") on that link bar, so any page is a maximum of 2 clicks away. -- HelmutLeitner

Well, two clicks with an awfully lot of scrolling in between to find the page you want. -- EricScheid

Does the script ignore indexing pages when calculating ShortestPathPages? -- bs

There are no indexing pages except for RecentChanges. It includes the text at the top of RecentChanges, but not the rclog.

Also, for anyone interested in this stuff, you might be interested in [Jon Kleinberg] & friends, who do research in this sort of thing. From what I've heard, his ["hubs and authorities"] idea is all the rage. -- BayleShanks

A good indicator for the quality of this index is this grouping, which is the first longer stretch of non-meta pages. (2004-04-20)

12 BarnRaising (5492)
13 StartingPoints (5494)
14 GodKing (5539)
15 ForgiveAndForget (5549)
16 SpaghettiWiki (5587)

Quite enlightening.

[CategoryIndexingScheme]

ShortestPathPages

Discussion