Cosine Similarity in an Inverted Index for Querying

 
Calculating Cosine Similarity in an Inverted Index
 
Dr. Claudia Pearce
CMSC476/676
 
In the lecture on InvertedIndexHowToBuild.pptx, we built an inverted index for these documents.

How can we query from it? How do we calculate the cosine similarity from the inverted index?
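For reference, the score that the rest of the deck assembles piece by piece is the cosine similarity; in the deck's notation, with the sums running over the query terms term(j), it is:

```latex
\mathrm{SIM}(Q,i) \;=\; \frac{\mathrm{DOT}(Q,i)}{\mathrm{DEN}(Q,i)}
  \;=\; \frac{\sum_{j} wt(Q,\mathrm{term}(j)) \cdot wt(i,\mathrm{term}(j))}
             {\sqrt{\sum_{j} wt(Q,\mathrm{term}(j))^{2}} \cdot
              \sqrt{\sum_{j} wt(i,\mathrm{term}(j))^{2}}}
```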
 
[Figure: the inverted index built in InvertedIndexHowToBuild.pptx. The dictionary lists each term with its document frequency (ink 3, pink 2, likes 5, wink 2, drink 5, thing 1, he 5), and each entry points to a posting list of doc|weight pairs. The two lists used by the query below are:

    likes (df 5):  1|.33   2|.2   3|.2   4|.2   5|.17
    wink  (df 2):  1|.42   5|.42

D(i) = number of terms in Doc(i):  1|6   2|5   3|5   4|5   5|6]
 
Query {likes, wink}
 
Term-at-a-Time
 
N(i) = sum of the dot-product terms for Doc(i) that are also in the query. N is for Numerator; initialize every N(i) to 0.

DOT(Q, i) = SUM[ wt(Q, term(j)) * wt(i, term(j)) ]
    for all term(j) in Q
    for all Doc(i) in the corpus

N(i) = DOT(Q, i) once we have processed each posting list.
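A minimal sketch of this term-at-a-time loop, assuming the index is a dict mapping each term to its posting list of (doc, weight) pairs; the function and variable names are illustrative, not from the slides:

```python
from collections import defaultdict

def dot_term_at_a_time(index, query):
    """Term-at-a-time: accumulate DOT(Q, i) for every document i.

    index : dict mapping term -> posting list of (doc, weight) pairs
    query : dict mapping term -> query weight (1 when none is given)
    """
    N = defaultdict(float)                 # the N(i) accumulators, init to 0
    for term, q_wt in query.items():       # bring up one posting list at a time
        for doc, d_wt in index.get(term, []):
            N[doc] += q_wt * d_wt          # update the running sum for this doc
    return dict(N)                         # N(i) = DOT(Q, i) after all lists

# Example index holding just the two posting lists used in these slides:
index = {
    "likes": [(1, .33), (2, .2), (3, .2), (4, .2), (5, .17)],
    "wink":  [(1, .42), (5, .42)],
}
print(dot_term_at_a_time(index, {"likes": 1, "wink": 1}))
# approximately {1: .75, 2: .2, 3: .2, 4: .2, 5: .59}
```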
 
Query {likes, wink}

Term-at-a-Time: bring up the posting list for likes. Assume no weights on the query terms for now; if a weight is provided in the query, use it in place of the 1.

N(i) = sum of the dot-product terms for Doc(i) that are also in the query, initialized to 0. Update the sums by adding the corresponding value as you process each document in the list, for each term in the query. In this case the term is likes:

N(1, likes) = 1*.33 = .33
N(2, likes) = 1*.2  = .2
N(3, likes) = 1*.2  = .2
N(4, likes) = 1*.2  = .2
N(5, likes) = 1*.17 = .17
 
Query {likes, wink}

Term-at-a-Time: bring up the posting list for wink, still assuming no weights on the query terms.

DOT(Q, i) = N(i) = sum of the dot-product terms for Doc(i) that are also in the query. Update the sums by adding the corresponding values for wink:

N(1, wink) = 1*.42 = .42, so N(1) = .33 + .42 = .75
N(5, wink) = 1*.42 = .42, so N(5) = .17 + .42 = .59
Docs 2, 3, and 4 do not appear in the wink list, so each gets +0.
 
DOT(Q, i) = wt(Q,likes)*wt(i,likes) + wt(Q,wink)*wt(i,wink)

DOT(Q,1) = wt(Q,likes)*wt(1,likes) + wt(Q,wink)*wt(1,wink)
         = 1*.33 + 1*.42
         = .75
DOT(Q,2) = wt(Q,likes)*wt(2,likes) + wt(Q,wink)*wt(2,wink)
         = 1*.2 + 1*0
         = .2
...
DOT(Q,5) = wt(Q,likes)*wt(5,likes) + wt(Q,wink)*wt(5,wink)
         = 1*.17 + 1*.42
         = .59
 
Once both lists are processed, N(i) holds the finished dot products. In Phase 4, it says: "(the sum of the term weights in the document)". This is just the DOT product with a wt of 1 for each query term.
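Spelled out, the identity that note is pointing at (assuming a weight of 1 on every query term):

```latex
\mathrm{DOT}(Q,i)\Big|_{wt(Q,\cdot)=1}
  \;=\; \sum_{\mathrm{term}(j)\in Q} 1 \cdot wt(i,\mathrm{term}(j))
  \;=\; \sum_{\mathrm{term}(j)\in Q} wt(i,\mathrm{term}(j))
```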
 
 
Term-at-a-Time recap: N(i) held the sum of the dot-product terms for Doc(i) that were also in the query, just using 1 as the query weight since we didn't provide anything else.

What if the query has weights, as in Query2 {.5 likes, .4 wink}?
 
Query2 {.5 likes, .4 wink}

Term-at-a-Time again: N(i) holds the sum of the dot-product terms for Doc(i) that are also in the query, now using the supplied query weights.

DOT(Q2, i) = wt(Q2,likes)*wt(i,likes) + wt(Q2,wink)*wt(i,wink)

DOT(Q2,1) = wt(Q2,likes)*wt(1,likes) + wt(Q2,wink)*wt(1,wink)
          = .5*.33 + .4*.42
          = .165 + .168 = .333 = .33
DOT(Q2,2) = wt(Q2,likes)*wt(2,likes) + wt(Q2,wink)*wt(2,wink)
          = .5*.2 + .4*0
          = .1
...
DOT(Q2,5) = wt(Q2,likes)*wt(5,likes) + wt(Q2,wink)*wt(5,wink)
          = .5*.17 + .4*.42
          = .085 + .168 = .253 = .25

The DOT(Q2, i) are complete for all i at this point, and you can then divide each by the square root of the sum of squares to get the normalized cosine similarity score.
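In the term-at-a-time sketch above, query weights need no extra machinery; Query2 is just a different weight dict:

```python
# Query2 from this slide: weight .5 on likes, .4 on wink
print(dot_term_at_a_time(index, {"likes": .5, "wink": .4}))
# approximately {1: .33, 2: .1, 3: .1, 4: .1, 5: .25}
```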
 
Query {likes, wink}

Doc-at-a-Time

DOT(Q, i) = SUM[ wt(Q, term(j)) * wt(i, term(j)) ]
    for all term(j) in Q
    for all Doc(i) in the corpus

Linear merge of the likes and wink lists.

Start with pointers at the top of each list. Do both documents match?
Yes:
    DOT(Q,1) = 1*.33 + 1*.42 = .75
    Advance both pointers.

DOT(Q,1) is complete at this point, and you can then divide it by the square root of the sum of squares to get the normalized cosine similarity score.
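A sketch of this two-pointer linear merge, using the same illustrative posting-list representation as before (a real system would generalize this to k lists, for example with a heap):

```python
def dot_doc_at_a_time(list_a, list_b, q_wt_a=1, q_wt_b=1):
    """Doc-at-a-time: linear merge of two posting lists sorted by doc id.
    Each DOT(Q, i) is finished the moment the merge moves past doc i."""
    scores = {}
    i = j = 0
    while i < len(list_a) or j < len(list_b):
        doc_a = list_a[i][0] if i < len(list_a) else float("inf")
        doc_b = list_b[j][0] if j < len(list_b) else float("inf")
        if doc_a == doc_b:                        # both documents match
            scores[doc_a] = q_wt_a * list_a[i][1] + q_wt_b * list_b[j][1]
            i += 1                                # advance both pointers
            j += 1
        elif doc_a < doc_b:                       # take the smallest doc number
            scores[doc_a] = q_wt_a * list_a[i][1]
            i += 1                                # advance that list's pointer
        else:
            scores[doc_b] = q_wt_b * list_b[j][1]
            j += 1
    return scores

# Merge the likes and wink lists from the example index above:
print(dot_doc_at_a_time(index["likes"], index["wink"]))
# approximately {1: .75, 2: .2, 3: .2, 4: .2, 5: .59}
```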
 
The merge continues. Do the documents match? No: take the weight from the smallest document number and increment that list's pointer. This same step handles documents 2, 3, and 4, each of which appears only in the likes list:

    DOT(Q,2) = 1*.2 = .2
    DOT(Q,3) = 1*.2 = .2
    DOT(Q,4) = 1*.2 = .2

Each of these DOT(Q,i) is complete as soon as the merge moves past document i, and you can then divide it by the square root of the sum of squares to get its normalized cosine similarity score.
 
Do the documents match? Yes:
    DOT(Q,5) = 1*.17 + 1*.42 = .59
    Advance the pointers. No more docs, so we are done.

DOT(Q,5) is complete at this point, and you can then divide it by the square root of the sum of squares to get the normalized cosine similarity score.
 
Here is the denominator of the cosine similarity. Keep an accumulator of these sums of squares while you are processing the terms in the numerator.

DEN(Q, i) = SQRT(wt(Q,likes)^2 + wt(Q,wink)^2) * SQRT(wt(i,likes)^2 + wt(i,wink)^2)

DEN(Q,1) = SQRT(wt(Q,likes)^2 + wt(Q,wink)^2) * SQRT(wt(1,likes)^2 + wt(1,wink)^2)
         = SQRT(1*1 + 1*1) * SQRT(.33*.33 + .42*.42)
         = SQRT(2) * SQRT(.11 + .18) = SQRT(2) * SQRT(.29)
         = (1.41)*(.54)
         = .76
DEN(Q,2) = SQRT(wt(Q,likes)^2 + wt(Q,wink)^2) * SQRT(wt(2,likes)^2 + wt(2,wink)^2)
         = SQRT(1*1 + 1*1) * SQRT(.2*.2 + 0*0)
         = SQRT(2) * SQRT(.04)
         = (1.41)*(.2)
         = .28
...
DEN(Q,5) = SQRT(wt(Q,likes)^2 + wt(Q,wink)^2) * SQRT(wt(5,likes)^2 + wt(5,wink)^2)
         = SQRT(1*1 + 1*1) * SQRT(.17*.17 + .42*.42)
         = SQRT(2) * SQRT(.03 + .18) = SQRT(2) * SQRT(.21)
         = (1.41)*(.46)
         = .65
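A sketch of the denominator under the same illustrative representation, accumulating the sum of squares over the query-term weights exactly as the slide does (note that both norms are taken over the query terms only, following the DEN(Q, i) formula above):

```python
import math
from collections import defaultdict

def denominator(index, query):
    """DEN(Q, i): the query norm times the document norm, both taken
    over the query terms only, matching the DEN(Q, i) formula above."""
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    sum_sq = defaultdict(float)                  # sum-of-squares accumulators
    for term in query:                           # walk the same posting lists
        for doc, d_wt in index.get(term, []):    # as the numerator pass
            sum_sq[doc] += d_wt * d_wt
    return {doc: q_norm * math.sqrt(s) for doc, s in sum_sq.items()}

# Reusing the example index from the term-at-a-time sketch:
print(denominator(index, {"likes": 1, "wink": 1}))
# approximately {1: .76, 2: .28, 3: .28, 4: .28, 5: .64}
# (the slide's .65 for doc 5 comes from rounding intermediate values)
```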
 
Calculate the Cosine Similarity

SIM(Q,i) = DOT(Q,i) / DEN(Q,i)

SIM(Q,1) = DOT(Q,1)/DEN(Q,1) = .75/.76 = .99
SIM(Q,2) = DOT(Q,2)/DEN(Q,2) = .2/.28 = .71
...
SIM(Q,5) = DOT(Q,5)/DEN(Q,5) = .59/.65 = .91
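Putting the two halves together, a compact end-to-end sketch (same illustrative index layout) that accumulates the numerator and the sum of squares in a single pass and then normalizes; doc 5 prints .92 here, while the slide's .91 comes from rounded intermediate values:

```python
import math

index = {
    "likes": [(1, .33), (2, .2), (3, .2), (4, .2), (5, .17)],
    "wink":  [(1, .42), (5, .42)],
}
query = {"likes": 1, "wink": 1}

dot, sum_sq = {}, {}
for term, q_wt in query.items():                     # one pass per posting list
    for doc, d_wt in index.get(term, []):
        dot[doc] = dot.get(doc, 0.0) + q_wt * d_wt           # DOT(Q, i)
        sum_sq[doc] = sum_sq.get(doc, 0.0) + d_wt * d_wt     # for DEN(Q, i)

q_norm = math.sqrt(sum(w * w for w in query.values()))
for doc in sorted(dot):
    sim = dot[doc] / (q_norm * math.sqrt(sum_sq[doc]))       # SIM(Q, i)
    print(doc, round(sim, 2))
# 1 0.99
# 2 0.71
# 3 0.71
# 4 0.71
# 5 0.92
```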