PageRankFr

This page started at MeatBall:PageRank

Summary

The first incarnation of Google famously used an algorithm trademarked as PageRank, which is described in the not especially clear paper (Brin and Page, 1998), although you can find better explanations. While the current version of Google's ranking algorithm has changed many times and reportedly incorporates more than a hundred factors, PageRank remains central.

Explanation

There are many explanations of PageRank, serving different agendas. The most prosaic and pertinent descriptions explain that when you link to a page, you cast a vote for that page. Popular descriptions that align more closely with the ClueTrainManifesto often claim that PageRank democratically reflects what the public thinks in general, and what the public knows best. You might hear that Google benefits from LaSagesseDesFoules, or that it benefits from AttachementPréférentiel. Bloggers might say that Google reflects the conversation.

In the original paper, Brin and Page (1998) describe PageRank in terms of a "random surfer" who "is given a web page at random and keeps clicking on links, never hitting 'back' but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the 'random surfer' will get bored and request another random page." PageRank is therefore defined as:

: We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also, C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

: Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
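The definition above can be turned into a small iterative computation. What follows is a minimal Python sketch, not Google's implementation: the function name `pagerank`, the dictionary-based graph, the fixed iteration count, and the three-page `toy_web` are all assumptions made for illustration. It applies the formula exactly as written, starting every page at 1.0. One caveat: with the (1-d) term as written, the values on a graph where every page has at least one outgoing link add up to the number of pages rather than to one; dividing (1-d) by the number of pages would give values that sum to one.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).

    links maps each page name to the list of distinct pages it links to.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Assumption: every page starts with a value of 1.0.
    pr = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Every page receives the (1-d) base term.
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                # Assumption: a page without outgoing links passes nothing on.
                continue
            # C(source) is the number of outgoing links of the source page.
            share = d * pr[source] / len(targets)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

In this toy graph C collects links from both A and B, so it ends up with the highest value, while B, which receives only half of A's vote, ends up lowest.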

To put this in plainer terms, each page has a PageRank, which is then distributed evenly among the targets of its outgoing links. A page with three links will assign 1/3 of its PageRank to each of the targets of its outgoing links. A page with 100 links will assign 1/100 of its PageRank to each of the targets of its outgoing links.

The implications are that, on average, a page with many incoming links from external sites will have a higher PageRank. However, the fewer links there are on a page, the more a link from that page is worth. Finally, because the PageRank of a site as a whole is measured by the total PageRank of all the pages in the site, links leading off the site will penalize its PageRank. It is better to create many pages within the site with many links to other pages on the same site, and relatively few links to pages on other sites. In short, PageRank leaks.
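The leak can be made concrete with a little arithmetic. The helper below is a hypothetical illustration (the name `retained_fraction` and the link counts are assumptions, not taken from the paper). Given that each outgoing link receives an equal share, it computes what fraction of the PageRank a page passes on stays within its own site.

```python
def retained_fraction(internal_links, external_links):
    """Fraction of a page's passed-on PageRank that goes to same-site targets."""
    total = internal_links + external_links
    if total == 0:
        raise ValueError("the page has no outgoing links")
    # Every outgoing link gets an equal 1/total share, so the site keeps
    # internal_links of those shares and the rest leaks to other sites.
    return internal_links / total


# Hypothetical pages: one links out sparingly, the other half of the time.
mostly_internal = retained_fraction(internal_links=9, external_links=1)
half_external = retained_fraction(internal_links=5, external_links=5)
print(mostly_internal, half_external)  # prints: 0.9 0.5
```

A page with nine internal links and one external link keeps nine tenths of what it passes on within the site; a page that splits its links evenly keeps only half.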

Criticisms

PageRank is a relatively simple algorithm that can easily be gamed by SearchEngineOptimization. A common gaming strategy is to create LinkFarms, which are websites that have many pages that link heavily to one another. In 2003, Google experienced a Spam War that nearly defeated Google entirely. Their response was to penalize sites with little distinct content. In 2004 and beyond, LinkSpam has become an increasing problem, albeit primarily for secondary victims such as wikis and WebLogs that are being hit, rather than Google itself. Google announced NoFollow as their 'solution' to LinkSpam, which hasn't worked for obvious reasons. NoFollow serves, rather, to break PageRank completely. What does it mean to have a WorldWideWeb where no website 'links' to another? Further, NoFollow plugs PageRank leaks, increasing your website's own PageRank--at least until all the links pointing to your website are NoFollowed themselves.
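To see why heavily interlinked pages can inflate a score under the basic formula, here is a minimal Python sketch of a deliberately simplified link farm. Everything in it is an assumption for illustration: the `pagerank` helper, the page names `target` and `farmN`, and the farm layout in which the target links to every farm page and every farm page links back. Real link farms, and Google's countermeasures, are more complicated than this.

```python
def pagerank(links, d=0.85, iterations=200):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A."""
    pages = set(links).union(*links.values())
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            # Pages without outgoing links pass nothing on (an assumption).
            for target in targets:
                new_pr[target] += d * pr[source] / len(targets)
        pr = new_pr
    return pr


def farm_target_rank(farm_size):
    """Value of a hypothetical 'target' page backed by farm_size farm pages."""
    farm = [f"farm{i}" for i in range(farm_size)]
    # The target links to every farm page, and every farm page links back.
    links = {"target": farm}
    for page in farm:
        links[page] = ["target"]
    return pagerank(links)["target"]


for size in (0, 1, 10):
    print(size, round(farm_target_rank(size), 3))
```

With no inbound links the target sits at the (1-d) floor. Each farm page added contributes its own base value and hands almost all of it to the target, so the target's value keeps growing with the size of the farm.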

Wikis are particularly good at optimizing for PageRank. They are often large websites with much distinct content with mostly internal links and relatively few external links. Strong community-oriented wikis are also good at attracting links from other people's wikis and weblogs. Their editability also makes them attractive to LinkSpammers.

Conversely, PageRank is particularly bad at understanding a fluid conversation, such as with SocialSoftware. PageRank is built on the presumption that the Web is made from static, stable documents that have already been written, and thus the links from them have some significant semantic weight. However, the Web is not static, but quite often an ongoing conversation. Results like GoogleBombing can happen when WebLogs as a collective decide (explicitly or implicitly) to all talk about the same thing at once. PageRank may experience large swings as it lacks the ability to dampen its input with TemporalContext--although Google may have adapted to this challenge by now. Even if it were desirable to know what people are talking about now (e.g. Google:paris+fashion might suggest what to wear this season), you might not want those search results to invisibly mix with what people were talking about then (unless you want to be a fashion victim).

Additionally, consider that PageRank is a global computation, measured across all the pages on the Web, irrespective of search terms, and static between indexing sessions. Therefore, PageRank benefits those with a large LifeInText on the WorldWideWeb over what may make the most sense in your search's context. Google:Sunir lists SunirShah higher than Sunir Kapoor, who by contrast is more powerful and 'relevant' to more people. This behaviour made sense when the web was only a few million web pages as the Web itself was the context of the search ("I want to find the fishcam."). As the Web takes on more and more of the world's total information, the context has to become more specialized, based on the surrounding context of use. If AdSense can use a page's context, so should Google's search.

DossierGoogle DossierSpam

References

Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, April 1998, 107-117. Available from http://www-db.stanford.edu/~backrub/google.html.

Discussion

Although their terms of service suggest the opposite at first glance, one might argue that Google in fact benefits spammers. They explicitly disallow automated queries with the goal of investigating rankings, so public research on that topic is made impossible. Spammers, on the other hand, have to operate secretly anyway, so they are at no disadvantage in that regard. As a result, Google's policy grants them a scientific advantage. -- AlanMcCoy

My claim that Google works best with static information is weakly described here, I admit. I have a strong suspicion that PageRank benefits corporations with brochureware more than wikis and weblogs that often rapidly rotate the text and links on/to their sites. -- SunirShah

PageTranslation LangueFrançaise PageRank
