PageRank


Summary

Google's first incarnation famously used an algorithm TradeMarked as PageRank, described in the somewhat unclear paper [(Brin and Page, 1998)], though you can find better [explanations]. While the current version of Google's ranking algorithm has changed many times and reportedly incorporates over one hundred factors, PageRank remains central.

Explanation

There are numerous folk explanations of PageRank that serve particular agendas. The most prosaic and accurate descriptions explain that when you link to a page, you are casting a vote for that page. Popular descriptions that align more closely with the ClueTrainManifesto? often claim PageRank democratically reflects what the public at large thinks, and that the public knows best. You might hear that Google benefits from TheWisdomOfCrowds?, or that it benefits from PreferentialAttachment. Bloggers might say that Google reflects the conversation.

In the original paper, Brin and Page (1998) describe PageRank in terms of "a 'random surfer' who is given a web page at random and keeps clicking on links, never hitting 'back' but eventually gets bored and starts on another random page." The probability that the random surfer visits a page is its PageRank, and the damping factor d is the probability that at each page the random surfer will get bored and request another random page. PageRank is therefore defined as:

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks? form a probability distribution over web pages, so the sum of all web pages' PageRanks? will be one.
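The random-surfer process can be simulated directly. Below is a minimal sketch in Python, assuming a toy three-page web; the graph, page names, and step count are illustrative inventions, not anything from the paper.

 import random
 from collections import Counter

 def random_surfer(graph, d=0.85, steps=100000):
     # Follow a random outbound link with probability d;
     # otherwise get bored and jump to a random page.
     pages = list(graph)
     visits = Counter()
     page = random.choice(pages)
     for _ in range(steps):
         visits[page] += 1
         if graph[page] and random.random() < d:
             page = random.choice(graph[page])
         else:
             page = random.choice(pages)
     # Visit frequencies approximate the (normalized) PageRank.
     return {p: visits[p] / steps for p in pages}

 web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
 print(random_surfer(web))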

To translate the formula into English: every page has a PageRank, which is distributed evenly amongst the targets of its outbound links. A page with 3 outbound links will assign 1/3 of its PageRank to each target; a page with 100 outbound links will assign 1/100 of its PageRank to each.
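Putting the formula and the even distribution together, here is a minimal iterative sketch, again assuming a toy graph in which every page appears as a key and has at least one outbound link (dangling pages would need special handling):

 def pagerank(graph, d=0.85, tol=1e-8):
     # Iterate PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
     # until the values stop changing.
     pr = {page: 1.0 for page in graph}
     while True:
         new = {}
         for page in graph:
             inbound = (t for t in graph if page in graph[t])
             new[page] = (1 - d) + d * sum(pr[t] / len(graph[t]) for t in inbound)
         if max(abs(new[p] - pr[p]) for p in graph) < tol:
             return new
         pr = new

 web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
 print(pagerank(web))  # C, with two inbound links, ends up highest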

The Google Toolbar's PageRank feature displays a visited page's PageRank as a whole number between 0 and 10. The most popular websites have a PageRank of 10; the least, a PageRank of 0. Google has not disclosed the precise method for determining a Toolbar PageRank value. Google representative Matt Cutts has publicly indicated that the Toolbar PageRank values are republished about once every three months, meaning they are historical rather than real-time values. [Google PageRank]

The implications are that, on average, a page with a lot of inbound links from external sites will have a higher PageRank. However, a smaller number of links from pages with higher PageRanks may be worth more. Additionally, the fewer links on a page, the more valuable a link from that page. Finally, since the PageRank for a whole site is the total of the PageRank of all the pages within the site, links away from the site will penalize its PageRank. It's better to create many pages with a lot of links to other pages on the same site, and relatively few links to pages on other sites. In short, PageRank leaks, as the sketch below illustrates.
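Continuing the pagerank() sketch above, here is a toy comparison of a site that keeps its links internal against an otherwise identical site that links out; the page names are invented.

 closed = {"s1": ["s2"], "s2": ["s1"], "ext": ["s1"]}
 leaky  = {"s1": ["s2", "ext"], "s2": ["s1", "ext"], "ext": ["s1"]}
 for name, graph in [("closed", closed), ("leaky", leaky)]:
     pr = pagerank(graph)
     # Total PageRank held by the site's own pages, s1 and s2:
     print(name, round(pr["s1"] + pr["s2"], 3))
 # The closed site keeps noticeably more total PageRank (~2.85 vs ~2.0);
 # the leaky site's outbound links gave the difference away to ext.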

Criticisms

PageRank is a relatively simple algorithm that can easily be gamed by SearchEngineOptimization?. A common gaming strategy is to create LinkFarm?s, which are websites that have many pages that link heavily to one another. In 2003, Google experienced a Spam War that nearly defeated Google entirely. Their response was to penalize sites with little distinct content. In 2004 and beyond, LinkSpam has become an increasing problem, albeit primarily for secondary victims such as the wikis and WebLogs being hit, rather than Google itself. Google announced NoFollow as their 'solution' to LinkSpam, which hasn't worked for obvious reasons. NoFollow serves, rather, to break PageRank completely. What does it mean to have a WorldWideWeb where no website 'links' to another? Further, NoFollow plugs PageRank leaks, increasing your website's own PageRank--at least until all the links pointing to your website are NoFollowed themselves.
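To make the mechanics concrete: a NoFollowed link carries rel="nofollow", and a PageRank-style indexer is expected to discard it before it ever enters the link graph. Below is a minimal sketch using Python's standard HTMLParser; the class and the example markup are illustrative, not Google's actual pipeline.

 from html.parser import HTMLParser

 class LinkExtractor(HTMLParser):
     def __init__(self):
         super().__init__()
         self.links = []
     def handle_starttag(self, tag, attrs):
         a = dict(attrs)
         if tag == "a" and "href" in a:
             if "nofollow" in (a.get("rel") or "").split():
                 return  # the link is cast, but the "vote" is discarded
             self.links.append(a["href"])

 parser = LinkExtractor()
 parser.feed('<a href="/wiki">kept</a> <a rel="nofollow" href="/spam">dropped</a>')
 print(parser.links)  # ['/wiki']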

Wikis are particularly good at optimizing for PageRank. They are often large websites with much distinct content with mostly internal links and relatively few external links. Strong community-oriented wikis are also good at attracting links from other people's wikis and weblogs. Their editability also makes them attractive to LinkSpammers.

Conversely, PageRank is particularly bad at understanding a fluid conversation, such as with SocialSoftware. PageRank is built on the presumption that the Web is made of static, stable documents that have already been written, and thus the links from them carry significant semantic weight. However, the Web is not static, but quite often an ongoing conversation. Results like GoogleBomb?ing can happen when WebLogs as a collective decide (explicitly or implicitly) to all talk about the same thing at once. PageRank may experience large swings as it lacks the ability to dampen its input with TemporalContext--although Google may have adapted to this challenge by now. Even if it were desirable to know what people are talking about now (e.g. Google:paris+fashion might suggest what to wear this season), you might not want those search results to invisibly mix with what people were talking about then (unless you want to be a fashion victim).
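Purely as a hypothetical illustration of TemporalContext--not anything Google is known to do--a ranker could weight each link's vote by its age, so that a sudden burst of fresh links cannot swing a rank as violently. The half-life constant below is an invented assumption.

 def link_weight(age_days, half_life=90):
     # A link loses half its voting power every half_life days.
     return 0.5 ** (age_days / half_life)

 # A link from yesterday versus one from a year ago:
 print(link_weight(1), link_weight(365))  # ~0.99 vs ~0.06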

Additionally, consider that PageRank is a global computation, measured across all the pages on the Web, irrespective of search terms, and static between indexing sessions. Therefore, PageRank benefits those with a large LifeInText on the WorldWideWeb over what may make the most sense in your search's context. Google:Sunir lists SunirShah higher than Sunir Kapoor, who by contrast is more powerful and 'relevant' to more people. This behaviour made sense when the Web was only a few million pages, as the Web itself was the context of the search ("I want to find the fishcam."). As the Web takes on more and more of the world's total information, the context has to become more specialized, based on the surrounding context of use. If AdSense? can use a page's context, so should Google's search.

CategoryGoogle? CategorySpam

References

Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, April 1998, 107-117. Available from http://www-db.stanford.edu/~backrub/google.html.


Although their terms of service suggest the opposite at first glance, one might argue that Google in fact benefits spammers. They explicitly disallow automated queries that investigate rankings, so public research on that topic is made impossible. Spammers, on the other hand, already have to operate secretly, so they suffer no disadvantage in that regard. As a result, they enjoy a scientific advantage granted by Google's policy. -- AlanMcCoy?

My claim that Google works best with static information is weakly described here, I admit. I have a strong suspicion that PageRank benefits corporations with brochureware more than wikis and weblogs that often rapidly rotate the text and links on/to their sites. -- SunirShah


Discussion
