PageRank Algorithm in Web Search

Ljiljana Raja
Ljiljana Raja
čić
čić
Web as a directed graph
Nodes: Web pages
Edges: Hyperlinks
2
 / 25
Ljiljana
 
Raja
čić
 
Two challenges of web search
1.
Web contains many sources of information
Who to trust?
2.
What is the “best” answer to a query?
No single right answer
Not all web pages are equally “important”
Ljiljana Raja
čić
3
 / 25
Link analysis approaches
Rank pages (nodes)
by analyzing topology of the web graph
Idea: 
Links as votes
-
Page is more important if
it has more links adjacent to it
Incoming links? 
Outgoing links?
Links from important pages have higher
weight => 
recursive problem!
Ljiljana Raja
čić
4
 / 25
Ljiljana Raja
čić
5
 / 25
Link weight proportional to
the 
importance
 of its 
source page
If page 
j
 
 with importance 
r
j
  has 
n
 out-links,
each link gets 
r
j
 
/ n
  
votes
Page 
j
 ‘s own importance is
the sum of the votes on its in-links
Ljiljana Raja
čić
6
 / 25
A page is important if
it is pointed to by other important pages
Rank 
r
j
  of page 
j
 :
d
i
  out-degree of node 
i
Ljiljana Raja
čić
7
 / 25
Ljiljana Raja
čić
8
 / 25
Ljiljana Raja
čić
9
 / 25
Since
Flow equasion in the matrix form:
     
Ljiljana Raja
čić
10
 / 25
M ∙ r = r 
Page i links to 3 pages, 
including j
 
x
 is an eigenvector with the
corresponding eigenvalue 
λ
 if
Since
Rank vector 
r
 is an eigenvector of web matrix 
M
,
with corresponding eigenvalue 
1
We can now efficiently find 
r
 
!
Power iteration method
Ljiljana Raja
čić
11
 / 25
Mx = 
λ
x
M ∙ r = r
Ljiljana Raja
čić
12
 / 25
d
i
out-degree of node 
i
Page rank simulates a random web surfer:
At any time 
t
, surfer is on some page 
i
At 
t + 1
, he follows an out-link from 
i
uniformly at random
Ends up on some page 
j
  linked from 
i
Rank vector r is a stationary distribution
of probabilities that a random walker is
on page 
i
 
 at arbitrary time 
t
Ljiljana Raja
čić
13
 / 25
Ljiljana Raja
čić
14
 / 25
Does this converge?
Does it converge to what we want?
Are the results reasonable?
Ljiljana Raja
čić
15
 / 25
All out-links are within an isolated group
Spider traps absorbe all rank eventually
At each step, random surfer has 2 options:
Follow a random link with probability 
β
Jump to random page with probability 
1 – 
β
β
 is usually in range 0.8 – 0.9
Ljiljana Raja
čić
16
 / 25
Ljiljana Raja
čić
17
 / 25
A dead end is a page with no out-links
They cause rank “leaking out”
All 0 in 
b
’s column
Always jump to random page
from a dead end
Ljiljana Raja
čić
18
 / 25
PageRank equation [Brin – Page, 1998]:
Google matrix A:
Ljiljana Raja
čić
19
 / 25
e – vector of all 1s
Key step is matrix – vector multiplication
A is dense – no 0 elements
M was sparse
only ~ 10 – 100 non-zero elements per column
We want to work with M
It’s possible!
Ljiljana Raja
čić
20
 / 25
Ljiljana Raja
čić
21
 / 25
Ljiljana Raja
čić
22
 / 25
CPU
Graph representation: Adjecency list
O(m) per iteration, where
m is the number of edges
m = O(n) => O(n) per iteration
CUDA
Graph representation: Adjecency matrix
O(n
2
) per iteration
Ljiljana Raja
čić
23
 / 25
Ljiljana Raja
čić
24
 / 25
Thanks for the
Thanks for the
attention!
attention!
Ljiljana Raja
čić
25
 / 25
Slide Note
Embed
Share

Dive into the intricacies of the PageRank algorithm, a key component of web search, which ranks web pages based on their importance and influence. Explore topics like link analysis, recursive formulation, flow model, and matrix formulation to grasp how PageRank determines the relevance and credibility of web pages in search results.

  • PageRank Algorithm
  • Web Search
  • Link Analysis
  • Importance Score
  • Web Graph

Uploaded on Aug 28, 2024 | 1 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Page Rank Ljiljana Raja i

  2. Web as a Graph Web as a directed graph Nodes: Web pages Edges: Hyperlinks Page Rank 2 / 25 Ljiljana Raja i

  3. Web Search: Challenges Two challenges of web search 1. Web contains many sources of information Who to trust? 2. What is the best answer to a query? No single right answer Not all web pages are equally important Page Rank 3 / 25 Ljiljana Raja i

  4. Link analysis Link analysis approaches Rank pages (nodes) by analyzing topology of the web graph Idea: Links as votes - Page is more important if it has more links adjacent to it Incoming links? Outgoing links? Links from important pages have higher weight => recursive problem! Page Rank 4 / 25 Ljiljana Raja i

  5. Example: Page Rank scores Page Rank 5 / 25 Ljiljana Raja i

  6. Recursive formulation Link weight proportional to the importance of its source page If page j with importance rjhas n out-links, each link gets rj/ n votes Page j s own importance is the sum of the votes on its in-links Page Rank 6 / 25 Ljiljana Raja i

  7. The Flow model A page is important if it is pointed to by other important pages Rank rjof page j : diout-degree of node i Page Rank 7 / 25 Ljiljana Raja i

  8. The Flow equations No single solution Additional constraint forces uniqueness: ry+ ra+ rm= 1 Solution: ry= 2 Gaussian elimination > O(n3) Bad for large graphs 5, ra= 2 5, rm= 1 5 Page Rank 8 / 25 Ljiljana Raja i

  9. Matrix formulation Stochastic adjacency matrix M Page i has diout-links If i j, then Mji = 1 Each column sums to 1 Rank vector r : one entry for each page riis the importance score of page i ?ri= 1 di, else Mji = 0 Page Rank 9 / 25 Ljiljana Raja i

  10. Matrix formulation Since M r = r Flow equasion in the matrix form: Page i links to 3 pages, including j Page Rank 10 / 25 Ljiljana Raja i

  11. Eigenvector formulation x is an eigenvector with the corresponding eigenvalue if Since Rank vector r is an eigenvector of web matrix M, with corresponding eigenvalue 1 We can now efficiently find r ! Power iteration method Mx = x M r = r Page Rank 11 / 25 Ljiljana Raja i

  12. The power iteration Suppose there are N web pages Initialize r(0) = [? Iterate r(t+1) = M r(t) Stop when ?| r(t+1) r(t)|j< ?, , ? ?]T di out-degree of node i Page Rank 12 / 25 Ljiljana Raja i

  13. Random walk interpretation Page rank simulates a random web surfer: At any time t, surfer is on some page i At t + 1, he follows an out-link from i uniformly at random Ends up on some page j linked from i Rank vector r is a stationary distribution of probabilities that a random walker is on page i at arbitrary time t Page Rank 13 / 25 Ljiljana Raja i

  14. Page rank: three questions Does this converge? Does it converge to what we want? Are the results reasonable? Page Rank 14 / 25 Ljiljana Raja i

  15. Spider trap problem All out-links are within an isolated group Spider traps absorbe all rank eventually Page Rank 15 / 25 Ljiljana Raja i

  16. Spider traps: Google solution At each step, random surfer has 2 options: Follow a random link with probability Jump to random page with probability 1 is usually in range 0.8 0.9 Page Rank 16 / 25 Ljiljana Raja i

  17. Dead ends problem A dead end is a page with no out-links They cause rank leaking out All 0 in b s column Page Rank 17 / 25 Ljiljana Raja i

  18. Dead ends: Google solution Always jump to random page from a dead end Page Rank 18 / 25 Ljiljana Raja i

  19. The Google matrix PageRank equation [Brin Page, 1998]: Google matrix A: e vector of all 1s Page Rank 19 / 25 Ljiljana Raja i

  20. Computing page rank Key step is matrix vector multiplication A is dense no 0 elements M was sparse only ~ 10 100 non-zero elements per column We want to work with M It s possible! Page Rank 20 / 25 Ljiljana Raja i

  21. Rearranging the equation Page Rank 21 / 25 Ljiljana Raja i

  22. Complete algorithm Page Rank 22 / 25 Ljiljana Raja i

  23. Implementation CPU Graph representation: Adjecency list O(m) per iteration, where m is the number of edges m = O(n) => O(n) per iteration CUDA Graph representation: Adjecency matrix O(n2) per iteration Page Rank 23 / 25 Ljiljana Raja i

  24. CUDA vs CPU Number of pages CPU CUDA 300 290 ms 340 ms 400 570 ms 380 ms 500 860 ms 550 ms >850000 ~6.5 s Memory overflow Page Rank 24 / 25 Ljiljana Raja i

  25. Questions? Thanks for the attention! Page Rank 25 / 25 Ljiljana Raja i

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#