Text Processing: Indexing, Zipf's Law, and Vocabulary Growth

 
Chapter 4
 
Processing Text
 
Processing Text

- Modifying/converting documents to index terms
  - Convert the many forms of words into more consistent index terms that represent the content of a document
- What are the problems?
  - Matching the exact string of characters typed by the user is too restrictive (e.g., case-sensitivity, punctuation, stemming); it doesn't work very well in terms of effectiveness
  - Sometimes it is not clear where words begin and end; it is not even clear what a word is in some languages, e.g., Chinese and Korean
  - Not all words are of equal value in a search, and understanding the statistical nature of text is critical
 
Indexing Process

(Diagram of the indexing process:)
- Identifies and stores documents (text + metadata: doc type, structure, features, size, etc.) for indexing
- Transforms documents into index terms or features ("text processing")
- Takes index terms and creates data structures (indexes) to support fast searching
 
Zipf's Law

- Distribution of word frequencies is very skewed
  - Few words occur very often, many hardly ever occur
  - e.g., "the" and "of", two common words, make up about 10% of all word occurrences in text documents
- Zipf's law: the frequency f of a word in a corpus is inversely proportional to its rank r (assuming words are ranked in order of decreasing frequency):

      f · r = k    (equivalently, f = k / r)

  where k is a constant for the corpus
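As a rough illustration of the rank-frequency relationship, the sketch below (not part of the original slides) counts word frequencies in an arbitrary text file and prints rank × relative frequency for the top words; if Zipf's law holds, the product stays roughly constant. The file name corpus.txt is a placeholder for any large plain-text collection.

```python
# Rough check of Zipf's law: rank * relative frequency should stay roughly constant.
# "corpus.txt" is a placeholder for any large plain-text file.
import re
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z]+", f.read().lower())

counts = Counter(words)
total = sum(counts.values())

for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
    # k ~ r * (f / total) when frequency is expressed as a probability
    print(f"{rank:>3}  {word:<12}  f={freq:>8}  r*f/total={rank * freq / total:.3f}")
```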
Top 50 Words from AP89

(Table of the 50 most frequent words in the AP89 collection, omitted here.)
 
Zipf's Law

Example. Zipf's law for AP89, with problems at high and low frequencies (plot of frequency vs. rank, omitted here).

According to [Ha 02], Zipf's law:
- does not hold for rank > 5,000
- is valid when considering single words as well as n-gram phrases, combined in a single curve

[Ha 02] Ha et al. Extension of Zipf's Law to Words and Phrases. In Proc. of Int. Conf. on Computational Linguistics. 2002.
 
Vocabulary Growth

- Heaps' Law: another prediction of word occurrence
- As the corpus grows, so does vocabulary size; however, fewer new words appear when the corpus is already large
- Observed relationship (Heaps' Law):

      v = k × n^β

  where
  - v is the vocabulary size (number of unique words)
  - n is the total number of words in the corpus
  - k, β are parameters that vary for each corpus (typical values given are 10 ≤ k ≤ 100 and β ≈ 0.5)
- Predicts that the number of new words increases very rapidly when the corpus is small
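A minimal sketch of how Heaps' law can be used in practice: fit k and β from observed (n, v) pairs by linear regression in log space (log v = log k + β · log n), then predict vocabulary size for a larger corpus. The observation points below are made up for illustration; real values would come from scanning a collection such as AP89.

```python
# Fit Heaps' law v = k * n^beta from observed (corpus size, vocabulary size)
# pairs, then predict vocabulary growth. The data points are illustrative only.
import math

observations = [          # (n = total words seen, v = unique words seen)
    (100_000, 12_000),
    (1_000_000, 48_000),
    (10_000_000, 95_000),
]

# Least-squares fit of log v = log k + beta * log n
xs = [math.log(n) for n, _ in observations]
ys = [math.log(v) for _, v in observations]
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
       sum((x - mean_x) ** 2 for x in xs)
k = math.exp(mean_y - beta * mean_x)

def predict_vocabulary(n_words: int) -> int:
    """Predicted number of unique words after seeing n_words total words."""
    return round(k * n_words ** beta)

print(f"k = {k:.2f}, beta = {beta:.3f}")
print("predicted vocabulary at 40M words:", predict_vocabulary(40_000_000))
```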
AP89 Example (40 million words)

(Plot of vocabulary growth for the AP89 collection against the Heaps' law fit v = k × n^β, omitted here.)
 
Heaps' Law Predictions

- Number of new words increases very rapidly when the corpus is small, and continues to increase indefinitely
- Predictions for TREC collections are accurate for large numbers of words, e.g.:
  - First 10,879,522 words of the AP89 collection scanned
  - Prediction is 100,151 unique words
  - Actual number is 100,024
- Predictions for small numbers of words (i.e., < 1000) are much worse
 
Heaps' Law on the Web

- Heaps' Law works with very large corpora
  - New words occurring even after seeing 30 million!
  - Parameter values different than typical TREC values
- New words come from a variety of sources
  - Spelling errors, invented words (e.g., product and company names), code, other languages, email addresses, etc.
- Search engines must deal with these large and growing vocabularies
 
Heaps' Law vs. Zipf's Law

As stated in [French 02]:
- The observed vocabulary growth has a positive correlation with Heaps' law
- Zipf's law, on the other hand, is a poor predictor of high-ranked terms, i.e., Zipf's law is adequate for predicting medium- to low-ranked terms
- While Heaps' law is a valid model for vocabulary growth of web data, Zipf's law is not strongly correlated with web data

[French 02] J. French. Modeling Web Data. In Proc. of Joint Conf. on Digital Libraries (JCDL). 2002.
 
Estimating Result Set Size

- Word occurrence statistics can be used to estimate the size of the results from a web search
- How many pages (in the results) contain all of the query terms (based on word occurrence statistics)?
- Example. For the query "a b c":

      f_abc = N × (f_a/N) × (f_b/N) × (f_c/N) = (f_a × f_b × f_c) / N²

  - f_abc: estimated size of the result set, using the joint probability
  - f_a, f_b, f_c: the number of documents that terms a, b, and c occur in, respectively
  - N: the total number of documents in the collection
  - Assumes that the terms occur independently
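A small sketch of the independence-based estimate above. The document frequencies passed in at the bottom are placeholders for illustration, not GOV2 values; only the collection size N is taken from the slides.

```python
# Estimate the number of documents containing all query terms, assuming
# the terms occur independently: f_abc = (f_a * f_b * f_c) / N^2
def independent_estimate(doc_freqs, num_docs):
    """doc_freqs: number of documents each query term occurs in."""
    prob = 1.0
    for f in doc_freqs:
        prob *= f / num_docs          # P(term) under the independence assumption
    return prob * num_docs            # expected number of matching documents

N = 25_205_179                        # GOV2 collection size from the slides
# Placeholder document frequencies for a three-term query "a b c".
print(independent_estimate([120_000, 45_000, 9_000], N))
```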
TREC GOV2 Example

(Table of estimated vs. actual result set sizes, omitted here: poor estimation due to the independence assumption.)
- Collection size (N) is 25,205,179
 
Result Set Size Estimation

- Poor estimates because words are not independent
- Better estimates are possible if co-occurrence information is available:

      P(a ∩ b ∩ c) = P(a ∩ b) × P(c | a ∩ b)
                   = P(a ∩ b) × (P(b ∩ c) / P(b))

      f_(tropical ∩ aquarium ∩ fish) = f_(tropical ∩ aquarium) × f_(aquarium ∩ fish) / f_(aquarium)
                                     = 1921 × 9722 / 26480
                                     = 705 (1,529 actual)

      f_(tropical ∩ breeding ∩ fish) = f_(tropical ∩ breeding) × f_(breeding ∩ fish) / f_(breeding)
                                     = 5510 × 36427 / 81885
                                     = 2,451 (3,629 actual)
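The same calculation expressed as code, a sketch only, using the pairwise co-occurrence counts quoted on the slide for GOV2.

```python
# Estimate f(a AND b AND c) from pairwise co-occurrence counts:
#   f_abc ~ f_ab * f_bc / f_b
# (from P(a,b,c) = P(a,b) * P(c | a,b) ~ P(a,b) * P(b,c) / P(b))
def cooccurrence_estimate(f_ab, f_bc, f_b):
    return f_ab * f_bc / f_b

# Numbers from the slide (GOV2): tropical ∩ aquarium ∩ fish
print(cooccurrence_estimate(f_ab=1921, f_bc=9722, f_b=26480))    # ~705 (actual 1,529)
# tropical ∩ breeding ∩ fish
print(cooccurrence_estimate(f_ab=5510, f_bc=36427, f_b=81885))   # ~2,451 (actual 3,629)
```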
 
Result Set Estimation

- Even better estimates are possible using the initial result set (word frequency + current result set)
- The estimate is simply C/s, where s is the proportion of the total number of documents that have been ranked (i.e., processed) and C is the number of documents found so far that contain all the query words
- Example. tropical fish aquarium in GOV2:
  - After processing 3,000 of the 26,480 documents that contain "aquarium" (about 11%), C = 258:
        f_(tropical ∩ fish ∩ aquarium) = 258 / (3000 ÷ 26480) = 2,277 (vs. 1,529 actual)
  - After processing 20% of the documents, where C = 356 and 5,296 documents have been ranked:
        f_(tropical ∩ fish ∩ aquarium) = 1,778 (1,529 is the real value)
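A sketch of the running C/s estimate, using the numbers from this slide.

```python
# Estimate the final result set size while ranking is still in progress:
#   estimate = C / s, where C = matching documents found so far and
#   s = fraction of the candidate documents processed so far.
def result_size_estimate(found_so_far, processed, total_candidates):
    s = processed / total_candidates
    return found_so_far / s

# After 3,000 of the 26,480 documents containing "aquarium" have been
# processed, 258 contain all three query terms.
print(result_size_estimate(258, 3_000, 26_480))    # ~2,277 (actual 1,529)
print(result_size_estimate(356, 5_296, 26_480))    # ~1,780 (slide reports 1,778)
```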
 
Tokenizing Problems

- Small words can be important in some queries, usually in combinations
  - xp, bi, pm, cm, el paso, kg, ben e king, master p, world war II
- Both hyphenated and non-hyphenated forms of many words are common
  - Sometimes the hyphen is not needed
    - e-bay, wal-mart, active-x, cd-rom, t-shirts
  - At other times, hyphens should be considered either as part of the word or as a word separator
    - winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking
 
Tokenizing Problems

- Special characters are an important part of tags, URLs, and code in documents
- Capitalized words can have a different meaning from lowercase words
  - Bush, Apple, House, Senior, Time, Key
- Apostrophes can be part of a word, part of a possessive, or just a mistake
  - rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's
 
Tokenizing Problems

- Numbers can be important, including decimals
  - Nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358
- Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
  - I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
- Note: tokenizing steps for queries must be identical to the steps for documents
 
Tokenizing Process

- Not that different from the simple tokenizing process used in the past
- Examples of rules used with TREC (see the sketch below)
  - Apostrophes in words ignored: o'connor → oconnor, bob's → bobs
  - Periods in abbreviations ignored: I.B.M. → ibm, Ph.D. → phd
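A minimal sketch of normalization in the spirit of the TREC-style rules above (lowercasing, dropping apostrophes and periods); it is not the full TREC tokenizer, and real tokenizers handle many more cases (hyphens, URLs, numbers, etc.).

```python
# Minimal tokenizer sketch: lowercase, drop apostrophes inside words,
# drop periods so abbreviations collapse (I.B.M. -> ibm, Ph.D. -> phd).
import re

def tokenize(text):
    text = text.lower()
    text = text.replace("'", "").replace("\u2019", "")   # o'connor -> oconnor
    tokens = re.findall(r"[a-z0-9.]+", text)
    return [t.replace(".", "") for t in tokens if t.replace(".", "")]

print(tokenize("O'Connor received his Ph.D. from I.B.M.'s lab."))
# ['oconnor', 'received', 'his', 'phd', 'from', 'ibms', 'lab']
```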
 
Stopping

- Function words (conjunctions, prepositions, articles) have little meaning on their own
- High occurrence frequencies
- Treated as stopwords (i.e., detected and removed during text processing)
  - Reduces index space
  - Increases efficiency (i.e., improves response time)
  - Can improve effectiveness
- Stopwords can still be important in combinations
  - e.g., "to be or not to be"
 
Stopping

- A stopword list can be created from high-frequency words or based on a standard list
- Lists are customized for applications, domains, and even parts of documents
  - e.g., "click" is a good stopword for anchor text
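A small sketch showing how a stopword list might be applied after tokenization. The list here is a tiny placeholder, not a standard or customized list.

```python
# Remove stopwords after tokenization. The list below is a tiny placeholder;
# real systems use high-frequency words from the collection or a standard list,
# possibly customized per field (e.g., adding "click" for anchor text).
STOPWORDS = {"the", "of", "to", "and", "a", "in", "is", "it", "or", "not", "be"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))   # [] -- why combinations matter
print(remove_stopwords(["tropical", "fish", "in", "the", "aquarium"]))
# ['tropical', 'fish', 'aquarium']
```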
 
Stemming

- Many morphological variations of words convey a single idea
  - Inflectional (plurals, tenses)
  - Derivational (making verbs into nouns, etc.)
- In most cases, these have the same or very similar meanings
- Stemmers attempt to reduce morphological variations of words to a common stem
  - Usually involves removing suffixes
- Can be done at indexing time or as part of query processing (like stopwords)
 
Stemming

- Two basic types
  - Dictionary-based: uses lists of related words
  - Algorithmic: uses a program to determine related words
- Algorithmic stemmers
  - Suffix-s: remove 's' endings, assuming plurals
    - e.g., cats → cat, lakes → lake, wiis → wii
    - Some false positives: ups → up (finds a relationship where none exists)
    - Many false negatives: countries → countrie (fails to find a term relationship)
 
Porter Stemmer

- Algorithmic stemmer used in IR experiments since the 70's
- Consists of a series of rules designed to extract the longest possible suffix at each step, e.g., Step 1a:
  - Replace sses by ss (e.g., stresses → stress)
  - Delete s if the preceding word contains a vowel not immediately before the s (e.g., gaps → gap, gas → gas)
  - Replace ied or ies by i if preceded by more than one letter; otherwise by ie (e.g., ties → tie, cries → cri)
- Effective in TREC
- Produces stems, not words
- Makes a number of errors and is difficult to modify
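A sketch of the Step 1a rules as described above, not the full Porter/Porter2 stemmer. The guard against words ending in "ss" or "us" is an added assumption borrowed from the published algorithm so that the slide's examples come out right.

```python
# Sketch of Porter Step 1a as described above; not the full stemmer.
VOWELS = set("aeiouy")

def step_1a(word):
    if word.endswith("sses"):
        return word[:-4] + "ss"                          # stresses -> stress
    if word.endswith("ied") or word.endswith("ies"):
        stem = word[:-3]
        return stem + ("i" if len(stem) > 1 else "ie")   # cries -> cri, ties -> tie
    if word.endswith("ss") or word.endswith("us"):
        return word                                      # guard from the published algorithm
    if word.endswith("s"):
        # delete s only if a vowel occurs before the letter preceding the s
        if any(c in VOWELS for c in word[:-2]):
            return word[:-1]                             # gaps -> gap
    return word                                          # gas -> gas

for w in ["stresses", "ties", "cries", "gaps", "gas"]:
    print(w, "->", step_1a(w))
```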
 
Errors of Porter Stemmer

- It is difficult to capture all the subtleties of a language in a simple algorithm
  (Table of example errors omitted here: cases where the stemmer finds a relationship when none exists, and cases where it fails to find a relationship.)
- The Porter2 stemmer addresses some of these issues
- The approach has been used with other languages
 
Link Analysis

- Links are a key component of the Web
- Important for navigation, but also for search
  - e.g., <a href="http://example.com">Example website</a>
  - "Example website" is the anchor text
  - "http://example.com" is the destination link
  - Both are used by search engines
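A small sketch of extracting the destination link and anchor text from HTML using only the Python standard library; a production crawler would use a more robust parser and handle malformed markup.

```python
# Extract (destination link, anchor text) pairs from HTML.
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []          # list of (href, anchor_text) pairs
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = AnchorExtractor()
parser.feed('<a href="http://example.com">Example website</a>')
print(parser.links)   # [('http://example.com', 'Example website')]
```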
 
Anchor Text

- Describes the content of the destination page
  - i.e., the collection of anchor text in all links pointing to a page is used as an additional text field
- Anchor text tends to be short, descriptive, and similar to query text
- Retrieval experiments have shown that anchor text has a significant impact on effectiveness for some types of queries
  - i.e., more than PageRank
 
PageRank

- Billions of web pages, some more informative than others
- Links can be viewed as information about the popularity (authority?) of a web page
  - Can be used by ranking algorithms
- Inlink count could be used as a simple measure
- Link analysis algorithms like PageRank provide more reliable ratings that are less susceptible to link spam
- The PageRank of a page is the probability that the "random surfer" will be looking at that page
  - Links from popular pages increase the PageRank of the pages they point to, i.e., links tend to point to popular pages
 
PageRank

- PageRank (PR) of page C = PR(A)/2 + PR(B)/1
- More generally:

      PR(u) = Σ_{v ∈ B_u} PR(v) / L_v

  where
  - u is a web page
  - B_u is the set of pages that point to u
  - L_v is the number of outgoing links from page v (not counting duplicate links)
 
PageRank

- PageRank values are not known at the start
- Example. Assume equal values of 1/3, then (reproduced in the sketch below):
  - 1st iteration: PR(C) = 0.33/2 + 0.33/1 = 0.5;  PR(A) = 0.33/1 = 0.33;  PR(B) = 0.33/2 = 0.17
  - 2nd iteration: PR(C) = 0.33/2 + 0.17/1 = 0.33; PR(A) = 0.5/1 = 0.5;   PR(B) = 0.33/2 = 0.17
  - 3rd iteration: PR(C) = 0.5/2 + 0.17/1 = 0.42;  PR(A) = 0.33/1 = 0.33; PR(B) = 0.5/2 = 0.25
- Converges to PR(C) = 0.4, PR(A) = 0.4, PR(B) = 0.2
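A sketch reproducing the three-page example above with the basic update PR(u) = Σ PR(v)/L_v (no damping factor). The link structure A → {B, C}, B → {C}, C → {A} is inferred from the slide's arithmetic rather than stated explicitly.

```python
# Iterative PageRank for the three-page example (no damping factor).
# Link structure inferred from the slide: A -> {B, C}, B -> {C}, C -> {A}.
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

pr = {page: 1 / len(outlinks) for page in outlinks}     # start with equal values (1/3)

for iteration in range(50):
    new_pr = {page: 0.0 for page in outlinks}
    for v, targets in outlinks.items():
        share = pr[v] / len(targets)                     # PR(v) / L_v
        for u in targets:
            new_pr[u] += share
    pr = new_pr

print({p: round(x, 2) for p, x in pr.items()})   # {'A': 0.4, 'B': 0.2, 'C': 0.4}
```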