PageRank

from Wikipedia, the free encyclopedia

The PageRank - Algorithm is a process, a lot of linked documents, such as the World Wide Web , based on their structure to assess and weigh . Each element is assigned a weight, the PageRank, based on its link structure. The algorithm was developed by Larry Page (hence the name PageRank) and Sergei Brin at Stanford University , who applied for a patent. It served the search engine Google of the company Google Inc. founded by Brin and Page as the basis for the evaluation of pages.

The PageRank algorithm is a special method to determine the link popularity of a page or a document. The basic principle is: the more links refer to a page, the higher the weight of that page. The higher the weight of the referring pages, the greater the effect. The aim of the method is to sort the links according to their weight, in order to produce a result order in a search query, i. H. Show links to more important pages further up in the results list.

The PageRank algorithm simulates a random user surfing the net. The probability with which it will come across a certain website correlates with the PageRank of the website.

The PageRank algorithm

construction

Visualization of the PageRank for a simple graph with damping factor . According to the random surfer model , the size of the circles is roughly proportional to the probability that a surfer is on that page. So page C is only linked from a single, but more important page and consequently has a higher PageRank than page E, although it is linked from a total of six pages.

The principle of the PageRank algorithm is that each page has a weight (PageRank) that is greater, the more pages (with the highest possible weight) refer to this page. The weight of a page is calculated from the weight of the linked pages . If linked to a total of different pages, the weight of is divided proportionally to these pages. The following recursive formula can be seen as the definition of the PageRank algorithm:

Here, the total number of pages and an attenuation factor between 0 and 1, with a small proportion of the weight ( disconnected) of each side and is distributed evenly to all detected by the algorithm sides. This is necessary so that the weight does not "flow off" to pages that do not refer to other pages. The above formula is often also given without the normalization factor.

The equation can also be used as a line-by-line notation of the linear system of equations

with the column vector , the constant column vector and the matrix with coefficients

interpreted, whereby the Kronecker delta denotes and the matrix by

is defined.

This equation leads to the eigenvalue problem

,

where is the so-called Google matrix and denotes the transpose of the matrix.

For the solution of the system of equations is unique ( Perron-Frobenius theorem ). A smaller damping factor leads to a more homogeneous distribution of the PageRank.

The solution of the linear system of equations can be carried out analytically or numerically. Using the Jacobi iteration for the numerical solution results in the above recursive equation. However, other numerical methods for matrix inversion, such as the minimum residual, the over- relaxation or the Gauss-Seidel method , usually converge more quickly.

Random surfer model

The random surfer model offers an alternative interpretation of the page rank algorithm, which comes from stochastics . If the PageRank is normalized to 1, the weight of a page can be interpreted as the probability that a random surfer ( random path ) is on this page. A random surfer moves through the net by randomly choosing one of the outgoing links of the current page with a probability . With probability he will choose any new side. The model can be understood as a Markov chain , the normalized page rank is then the stationary distribution of this chain.

Rational surfer model

The Rational Surfer model is a patent filed by Google in 2010. It represents a further development of the random surfer model. Here, the importance of a link is differentiated depending on its placement according to empirical data. The aim is to give more weight to links that are more likely to be clicked by a rational surfer. Thus, link buying should be counteracted.

history

The idea of ​​the PageRank algorithm originally came from sociometry and was first shown in the specialist literature in 1953 by Katz. As early as 1949, Seeley used the procedure to explain how the status of an individual came about, but in his description there is still no standardization of the number of outgoing edges and no damping term. The latter was introduced in 1965 by Charles H. Hubbell.

Brin and Page developed the algorithm in 1996 at Stanford University . Page applied for a patent in 1997, which was registered at Stanford University. Brin and Page published the algorithm together in 1998. In their original work, they cite Massimo Marchiori ( University of Padua , developer of Hyper Search), Eugene Garfield , who developed citation analysis in the 1950s , and Jon Kleinberg , who at about the same time as Brin and Page “ Hubs and Authorities ”(HITS).

In addition to Brin and Page, not only Kleinberg, but also Robin Li developed a similar algorithm (RankDex) in China around 1996, which he used in the search engine Baidu (patent 1999).

After Google was founded, Stanford University received 1.8 million shares from Google for the patent, which went exclusively to Google. In 2005 they sold the shares for $ 336 million.

Washington State University researchers state that Google's PageRank algorithm can also be used to determine the geometrical orientation of water molecules relative to other molecules in a solution, e.g. B. to calculate approximately those of toxic chemicals.

Adjustments

The algorithm used by Google today probably no longer has this exact form, but is based on this formula. Alternative algorithms are the method of the hubs and authorities of Jon Kleinberg , the Hilltop and the TrustRank algorithm.

In 2013 the ranking was replaced by the new Hummingbird algorithm . PageRank is just one aspect that Hummingbird includes in its rating.

Toolbar and directory values

Information about the PageRank can be found in the Google Toolbar and the Google Directory. The PageRank displayed by Google in the toolbar is between 0 and 10. The value given in the Google directory was between 0 and 7 until the beginning of 2008, but now corresponds to the value displayed in the toolbar. The displayed values ​​represent the real PageRank on a logarithmic scale and show the result as a rounded integer value.

The PageRank displayed in the Google toolbar used to be updated every 30 days. In the meantime, the interval between the updates is carried out very irregularly, the interval length fluctuates between a little less than thirty to over a hundred days. PageRank was last updated by Google on December 6, 2013.

Google has now finally abolished the PageRank toolbar and stopped delivering the relevant data. This means that the PageRank is no longer publicly visible to website operators. Internally, however, Google will continue to use the data for the algorithms.

manipulation

In the meantime, due to its economic importance, targeted manipulations and forgeries have occurred. So the system has been in practice Suchmaschinenoptimierern by search engine spamming in guest books , blogs and forums , the operation of link farms subvert and other dubious methods. This includes, among other things, the possibility of mirroring the PageRank displayed in the toolbar of a low-ranking page by forwarding it to an existing page with a high PageRank. The forwarding causes the display of the high PageRank of the target page to be copied with the following update. If the forwarding is subsequently removed, the actual page content in connection with the mirrored PageRank is presented to the visitor for the duration of the then running interval. The actual PageRank value and the ranking in the search algorithm are not affected by this, only the display is manipulated. This can be used fraudulently, for example, to obtain a higher price when selling the domain or links .

At the beginning of 2005, Google implemented a new attribute for referrals with rel = "nofollow" as an attempt to take action against spam. Links with this attribute are not taken into account for the PageRank calculation. By marking outgoing links, for example, guest book, blog and forum spamming can be counteracted. However, this method is controversial because, on the one hand, not all search engines take the attribute into account and, on the other hand, the links are not taken into account for the PageRank calculation, but the linked pages are still crawled by most search engines .

disadvantage

The disadvantages of PageRank at a glance:

  • The decisive factor is not the interest of the readers, but only that of other website operators.
  • Financially strong site operators can buy backlinks and are thus positioned higher in search results. As a result, instead of high-quality content, financial options often decide the order of the search results.
  • Webmasters often see PageRank as the only evaluation criterion for the link exchange. The content of the linked pages takes a back seat.
  • The PageRank does not contribute to the qualitative measurement of websites.

See also

Commons : PageRank  - collection of images, videos and audio files

literature

  • Amy N. Langville and Carl D. Meyer: Google's pagerank and beyond: the science of search engine rankings , Princeton, NJ: Princeton University Press 2006. ISBN 978-1-4008-3032-9 ( sample chapter )
  • Arvind Arasu, Junghoo Cho, Héctor García-Molina , Andreas Paepcke and Sriram Raghavan: Searching the Web , Technical Report, Stanford University, USA, 2000 ( online ; PDF; 1.7 MB)
  • Sergei Brin, Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine . In: Computer Networks and ISDN Systems, Volume 30, 1998, pp. 107-117
  • Charles H. Hubbell: An input-output approach to clique identification . In: Sociometry, number 28, pages 377-399, 1965
  • Leo Katz: A new status index derived from sociometric analysis . In: Psychometrika, number 18 (1), pages 39–43, 1953 ( PDF )
  • J. Seeley: The net of reciprocal influence: A problem in treating sociometric data , Canadian Journal of Psychology, 3, 234-240, 1949.

Web links

Individual evidence

  1. Patent US6285999 : Method for node ranking in a linked database. Filed January 10, 1997 , inventor: Lawrence Page.
  2. Method for node ranking in a linked database , US Patent 6285999. Priority date January 10, 1997, granted January 9, 1998, published September 4, 2001. Page is named as inventor
  3. Page also quotes these in his patent as well as Katz, Hubbell and others
  4. ^ Stanford earns 335 million off google stocks, Redorbit 2005
  5. Wired : Researchers Fight Toxic Waste With Google PageRank February 17, 2012
  6. webmagazin.de about Hummingbird ( Memento from October 3, 2013 in the Internet Archive ); Hummingbird includes PageRank.
  7. Google has confirmed it is removing Toolbar PageRank. Searchengineland.co, March 8, 2016, accessed March 10, 2016 .