Google matrix

[Figure: excerpt from the Google matrix of English-language Wikipedia articles (2009)]

The Google matrix is the square matrix that arises in the construction of the PageRank algorithm. Since it is often very large (with many millions of rows and columns), the numerical and algebraic properties of this matrix are of great importance for computing the PageRanks quickly and accurately.

Definition

The normalized Google matrix of a network or directed graph with $n$ nodes is the real $n \times n$ matrix

$$G := \alpha \left( A + \frac{1}{n}\, d\, \mathbf{1}^T \right) + (1-\alpha)\, \frac{1}{n}\, \mathbf{1}\mathbf{1}^T .$$

The individual components of the Google matrix are defined as follows (a short code sketch of the construction follows the list):

  • The link matrix $A$ is the adjacency matrix of the examined graph, normalized row by row:
    $$A_{ij} = \begin{cases} 1/a_i & \text{if there is an edge from node } i \text{ to node } j, \\ 0 & \text{otherwise,} \end{cases}$$
    where $a_i$ is the out-degree of node $i$, i.e. the number of edges that leave node $i$.
  • The vector $d \in \{0,1\}^n$ is defined component-wise as
    $$d_i = \begin{cases} 1 & \text{if } a_i = 0, \\ 0 & \text{otherwise.} \end{cases}$$
    It therefore contains a one exactly when the out-degree of a page, i.e. of a node, is zero. These nodes are also called dangling nodes. There are several methods in the literature for treating these nodes, the most common being the one discussed here.
  • $\alpha$ is a real number between $0$ and $1$ called the damping factor.
  • $\mathbf{1}$ is the ones vector of length $n$, i.e. a vector that has only ones as entries. The matrix $\mathbf{1}\mathbf{1}^T$ is therefore exactly the ones matrix $E$.

The row-stochastic matrix $S := A + \frac{1}{n}\, d\, \mathbf{1}^T$ appearing in this construction is referred to repeatedly below.
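The following is a minimal sketch of this construction in Python with NumPy; the function name build_google_matrix, the edge-list input, and the dense representation are illustrative assumptions (real web-scale matrices are never formed densely), not part of the original definition.

    import numpy as np

    def build_google_matrix(n, edges, alpha=0.85):
        """Dense Google matrix of a directed graph given as a list of (i, j) edges."""
        A = np.zeros((n, n))
        for i, j in edges:
            A[i, j] = 1.0                    # raw adjacency matrix
        out_deg = A.sum(axis=1)              # out-degree a_i of every node
        d = (out_deg == 0).astype(float)     # dangling-node indicator vector
        nz = out_deg > 0
        A[nz] = A[nz] / out_deg[nz][:, None]           # row-wise normalization
        S = A + np.outer(d, np.ones(n)) / n            # S = A + (1/n) d 1^T
        return alpha * S + (1 - alpha) / n * np.ones((n, n))

For instance, build_google_matrix(3, [(0, 1), (1, 2)]) treats node 2 as a dangling node and returns a strictly positive, row-stochastic 3×3 matrix.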

Properties

PageRank

For the calculation of the PageRanks one is particularly interested in the existence and multiplicity of left eigenvectors of the matrix $G$ for the eigenvalue $1$. These correspond exactly to the ordinary (right) eigenvectors of the transposed matrix $G^T$ for the eigenvalue $1$. If one interprets the eigenvalue problem

$$G^T \pi = \pi$$

as the calculation of the stationary distribution of a Markov chain, the vector $\pi$ is a stochastic vector consisting of the PageRanks. Inserting the definition of $G$ and using $\mathbf{1}^T \pi = 1$ reduces the eigenvector problem to the linear system of equations

$$\left( I - \alpha S^T \right) \pi = \frac{1-\alpha}{n}\, \mathbf{1}.$$

In order to solve this linear system of equations efficiently, the questions of the regularity of the matrix $I - \alpha S^T$ and of its condition number arise.
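For small graphs this system can be solved directly; a minimal sketch, assuming NumPy and the row-stochastic matrix $S$ from the definition (the helper name solve_pagerank_linear is hypothetical):

    import numpy as np

    def solve_pagerank_linear(S, alpha=0.85):
        """Solve (I - alpha S^T) pi = (1 - alpha)/n * 1 for the PageRank vector."""
        n = S.shape[0]
        rhs = (1 - alpha) / n * np.ones(n)
        pi = np.linalg.solve(np.eye(n) - alpha * S.T, rhs)
        return pi            # stochastic: positive entries summing to 1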

Norms

Both the matrix $A$ and the matrix $\frac{1}{n} d \mathbf{1}^T$ are in general only substochastic. Adding the two yields the row-stochastic matrix $S$, since the non-zero rows of the two matrices complement each other. Since $\frac{1}{n} \mathbf{1}\mathbf{1}^T$ is also row-stochastic (strictly speaking, even doubly stochastic) and the damping parameter $\alpha$ only forms convex combinations (under which the stochastic matrices are closed), the Google matrix is likewise a row-stochastic matrix. Hence, for the row-sum norm of the Google matrix,

$$\| G \|_\infty = 1,$$

and thus also for the column-sum norm of the transpose,

$$\| G^T \|_1 = 1.$$
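A quick numerical illustration of these two identities (a sketch assuming NumPy; any row-stochastic $S$ will do here):

    import numpy as np

    n, alpha = 4, 0.85
    S = np.full((n, n), 1.0 / n)          # some row-stochastic matrix
    G = alpha * S + (1 - alpha) / n * np.ones((n, n))
    assert np.isclose(np.linalg.norm(G, ord=np.inf), 1.0)   # row-sum norm of G
    assert np.isclose(np.linalg.norm(G.T, ord=1), 1.0)      # column-sum norm of G^T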

Eigenvectors and eigenvalues

The existence of an eigenvector of $G^T$ for the eigenvalue $1$ follows directly from the fact that $G$ is a stochastic matrix. That $1$ is in fact the largest positive eigenvalue, for which there exists a simple, strictly positive eigenvector, follows from the Perron–Frobenius theorem, since $\lambda_{\max} = \| G \|_\infty = 1$ holds. It is important here that the introduction of the damping parameter guarantees the entrywise positivity of the matrix and thus the unique solvability of the eigenvalue problem.

Furthermore, it can be shown that $|\lambda_i| \le \alpha$ holds for all other eigenvalues $\lambda_i$. The separation of the eigenvalues is thus determined solely by the damping parameter. This guarantees a good speed of convergence for many of the numerical methods for calculating eigenvalues, such as the power method, as long as the damping factor is not chosen too close to $1$. Usually $\alpha = 0.85$ is chosen.
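The separation can be observed numerically on a toy graph; a sketch assuming NumPy's dense eigenvalue routine, which is feasible only for small examples:

    import numpy as np

    # Chain 0 -> 1 -> 2; node 2 is dangling.
    n, alpha = 3, 0.85
    A = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, 0.0]])
    d = np.array([0.0, 0.0, 1.0])
    S = A + np.outer(d, np.ones(n)) / n
    G = alpha * S + (1 - alpha) / n * np.ones((n, n))

    lams = sorted(np.linalg.eigvals(G), key=abs, reverse=True)
    assert np.isclose(abs(lams[0]), 1.0)                       # Perron eigenvalue
    assert all(abs(lam) <= alpha + 1e-12 for lam in lams[1:])  # |lambda_i| <= alpha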

Regularity and condition

Since

$$\| \alpha S^T \|_1 = \alpha < 1$$

holds, the Neumann series yields the invertibility of the matrix $I - \alpha S^T$:

$$\left( I - \alpha S^T \right)^{-1} = \sum_{k=0}^{\infty} \left( \alpha S^T \right)^k .$$

Thus the problem can be solved as a linear system of equations. At the same time, the Neumann series also gives, for the norm of the inverse,

$$\left\| \left( I - \alpha S^T \right)^{-1} \right\|_1 \le \sum_{k=0}^{\infty} \alpha^k = \frac{1}{1-\alpha},$$

and thus the estimate for the condition number

$$\kappa_1\left( I - \alpha S^T \right) = \left\| I - \alpha S^T \right\|_1 \left\| \left( I - \alpha S^T \right)^{-1} \right\|_1 \le \frac{1+\alpha}{1-\alpha}.$$

Thus the condition depends only on the choice of the damping parameter, which should not be too close to $1$.
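For small matrices the bound can be checked directly, since np.linalg.cond with p=1 computes exactly the condition number in the column-sum norm; a sketch under the same toy assumptions as above:

    import numpy as np

    n, alpha = 4, 0.85
    S = np.full((n, n), 1.0 / n)          # some row-stochastic matrix
    M = np.eye(n) - alpha * S.T
    assert np.linalg.cond(M, p=1) <= (1 + alpha) / (1 - alpha) + 1e-12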

Numerical calculation of the eigenvector

The eigenvector of the Google matrix for the largest eigenvalue is usually determined approximately using the power method. Starting from an initial approximation $\pi^{(0)}$, in each iteration step the matrix-vector product of the transposed Google matrix with the current approximation of the eigenvector is formed. In every iteration step, therefore,

$$\pi^{(k+1)} = G^T \pi^{(k)}$$

is to be calculated. If the starting approximation is a stochastic vector, then every subsequent approximation vector is also stochastic. Since the eigenvalues of the Google matrix are well separated, a slow convergence speed of the power method is ruled out.
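A minimal power-method sketch (assuming NumPy; the function name pagerank_power is illustrative). No renormalization step is needed, because every iterate remains stochastic:

    import numpy as np

    def pagerank_power(G, tol=1e-10, max_iter=1000):
        """Power method for the eigenvector of G^T to the eigenvalue 1."""
        n = G.shape[0]
        pi = np.ones(n) / n                 # stochastic starting approximation
        for _ in range(max_iter):
            pi_next = G.T @ pi              # one step: pi^(k+1) = G^T pi^(k)
            if np.linalg.norm(pi_next - pi, 1) < tol:
                return pi_next
            pi = pi_next
        return pi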

The special structure of the Google matrix can be exploited in this calculation. For a stochastic vector $\pi^{(k)}$, the product expands to

$$G^T \pi^{(k)} = \alpha A^T \pi^{(k)} + \frac{\alpha\, d^T \pi^{(k)} + (1-\alpha)}{n}\, \mathbf{1}.$$

The link matrix $A$ is usually extremely sparse, that is, almost all of its entries are zero. As a result, it can be stored in a very space-saving manner on the one hand and multiplied by a vector very efficiently on the other. The vector $d$ is also usually sparsely populated, which means that the term $d^T \pi^{(k)}$ can likewise be calculated very quickly.
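The following sketch exploits this structure with SciPy's sparse matrices; it assumes a stochastic iterate pi, the row-normalized link matrix stored as a scipy.sparse.csr_matrix, and the dangling vector d, and it never forms the dense Google matrix:

    import numpy as np
    from scipy.sparse import csr_matrix

    def google_matvec(A_sparse, d, pi, alpha=0.85):
        """One power-method step G^T pi for a sparse link matrix, using
        G^T pi = alpha A^T pi + (alpha d^T pi + (1 - alpha)) / n * 1
        (valid for stochastic pi)."""
        n = A_sparse.shape[0]
        return alpha * (A_sparse.T @ pi) + (alpha * (d @ pi) + 1 - alpha) / n * np.ones(n)

    # Chain 0 -> 1 -> 2 with dangling node 2, as in the sketch above.
    A_sparse = csr_matrix(np.array([[0.0, 1.0, 0.0],
                                    [0.0, 0.0, 1.0],
                                    [0.0, 0.0, 0.0]]))
    d = np.array([0.0, 0.0, 1.0])
    pi = np.ones(3) / 3
    pi = google_matvec(A_sparse, d, pi)     # one sparse iteration step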

Example

[Figure: the directed graph treated in the example]

If one takes the directed graph with 8 nodes shown in the figure as an example, nodes 5 and 6 are dangling nodes. The row-wise normalized adjacency matrix $A$ is obtained from the graph as described above, and the dangling vector is

$$d = (0, 0, 0, 0, 1, 1, 0, 0)^T .$$

With the above construction and a damping parameter of $\alpha = 0.85$, one obtains the Google matrix $G$; the eigenvector $\pi$ of $G^T$ for the eigenvalue $1$ then contains the PageRanks of the eight nodes.

Nodes 7 and 8 thus have the highest PageRanks (0.2825 and 0.2654) and nodes 1 and 6 the lowest (0.0675 each). The second-largest eigenvalue in absolute value is $|\lambda_2| = \alpha$, so the above estimate is sharp. Furthermore, the condition number is

$$\kappa_1\left( I - \alpha S^T \right) = \frac{1+\alpha}{1-\alpha} \approx 12.33,$$

so this estimate is also sharp.

References

  1. Amy N. Langville, Carl D. Meyer: Deeper Inside PageRank. Retrieved August 30, 2013.
  2. T. H. Haveliwala, S. D. Kamvar: The Second Eigenvalue of the Google Matrix. Technical Report, Stanford University, 2003. Retrieved August 30, 2013.
