Understanding Cosine Similarity in Inverted Index for Querying
In this presentation, Dr. Claudia Pearce explains how to query an inverted index, focusing on calculating the cosine similarity: computing the dot product of the term weights shared by each document and the query, updating per-document sums as each posting list is processed, and normalizing by a sum-of-squares denominator. The step-by-step walkthrough shows how to retrieve relevant documents efficiently from an inverted index.
Presentation Transcript
Calculating Cosine Similarity in an Inverted Index
Dr. Claudia Pearce, CMSC476/676
In the lecture InvertedIndexHowToBuild.pptx, we built an inverted index for these documents. How can we query from it? How do we calculate the cosine similarity from the inverted index?
[Slide: the inverted index built in InvertedIndexHowToBuild.pptx. The dictionary lists each term with its document frequency: ink 3, pink 2, likes 5, wink 2, drink 5, thing 1, he 5. Each dictionary entry points to a posting list of (document, weight) pairs; the two lists used below are likes: (1, .33), (2, .2), (3, .2), (4, .2), (5, .17) and wink: (1, .42), (5, .42). D(i) = number of terms in document i: D(1)=6, D(2)=5, D(3)=5, D(4)=5, D(5)=6.]
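As a concrete picture of that structure, here is a minimal sketch of the index as Python dicts. This is a hypothetical in-memory representation, not one prescribed by the lecture; only the likes and wink posting lists, the two the example query needs, are filled in from the slide.

```python
# A minimal in-memory inverted index (hypothetical representation).
# Each term maps to (document_frequency, posting_list), where the
# posting list holds (doc_id, weight) pairs sorted by doc_id.
INDEX = {
    "likes": (5, [(1, 0.33), (2, 0.2), (3, 0.2), (4, 0.2), (5, 0.17)]),
    "wink":  (2, [(1, 0.42), (5, 0.42)]),
    # ink, pink, drink, thing, he omitted: their posting-list weights
    # are not recoverable from the slide.
}

# D(i): number of terms in document i, from the slide.
DOC_LENGTHS = {1: 6, 2: 5, 3: 5, 4: 5, 5: 6}
```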
N(i) = sum of the dot-product terms for Doc(i) that are also in the query. Query {likes, wink}. Term-at-a-Time: initialize an accumulator to 0 for every document: 1|0, 2|0, 3|0, 4|0, 5|0. Once we have processed each query term's posting list, N(i) = DOT(Q, i), where

DOT(Q, i) = SUM [ wt(Q, term(j)) * wt(i, term(j)) ] over all term(j) in Q, for all Doc(i) in the corpus

The posting lists involved are likes: (1, .33), (2, .2), (3, .2), (4, .2), (5, .17) and wink: (1, .42), (5, .42). N is for Numerator.
Term-at-a-Time, first term: bring up the posting list for likes and update the sums by adding the corresponding value as you process each document in the list. Assume no weights on the query terms for now, so wt(Q, likes) = 1; if a weight is provided in the query, use that instead of 1.

N(1, likes) = 1 * .33 = .33
N(2, likes) = 1 * .2 = .2
N(3, likes) = 1 * .2 = .2
N(4, likes) = 1 * .2 = .2
N(5, likes) = 1 * .17 = .17

Accumulators after likes: 1|.33, 2|.2, 3|.2, 4|.2, 5|.17. N is for Numerator.
Term-at-a-Time, second term: bring up the posting list for wink and update the sums by adding the corresponding values, again with a query weight of 1:

N(1, wink) = 1 * .42 = .42
N(5, wink) = 1 * .42 = .42

Accumulators after wink: 1|.75, 2|.2, 3|.2, 4|.2, 5|.59. Written out in full:

DOT(Q, i) = wt(Q, likes) * wt(i, likes) + wt(Q, wink) * wt(i, wink)
DOT(Q, 1) = 1 * .33 + 1 * .42 = .75
DOT(Q, 2) = 1 * .2 + 1 * 0 = .2
...
DOT(Q, 5) = 1 * .17 + 1 * .42 = .59

In Phase 4, where it says "the sum of the term weights in the document", this is just the dot product with a weight of 1 for each query term.
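A short sketch of term-at-a-time scoring under these assumptions; POSTINGS and dot_term_at_a_time are hypothetical names, and the query dict supplies the query weights (1.0 each for the unweighted query here).

```python
from collections import defaultdict

# Posting lists from the slides: term -> [(doc_id, weight), ...]
POSTINGS = {
    "likes": [(1, 0.33), (2, 0.2), (3, 0.2), (4, 0.2), (5, 0.17)],
    "wink":  [(1, 0.42), (5, 0.42)],
}

def dot_term_at_a_time(query, postings):
    """Accumulate DOT(Q, i) term-at-a-time.

    query: dict of term -> query weight (1.0 for unweighted queries).
    Returns dict of doc_id -> N(i), the numerator of the cosine score.
    """
    n = defaultdict(float)                 # accumulators, init to 0
    for term, q_wt in query.items():       # one posting list at a time
        for doc_id, d_wt in postings.get(term, []):
            n[doc_id] += q_wt * d_wt       # update this document's sum
    return dict(n)

print(dot_term_at_a_time({"likes": 1.0, "wink": 1.0}, POSTINGS))
# ≈ {1: 0.75, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.59}
```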
Recap: term-at-a-time leaves N(i) holding the sum of the dot-product terms for Doc(i) that were also in the query (1|.75, 2|.2, 3|.2, 4|.2, 5|.59), just using 1 as the query weight since we didn't provide anything else. What if the query has weights, as in Query2 {.5 likes, .4 wink}?
Query2 {.5 likes, .4 wink}. Term-at-a-time works exactly as before, but each posting-list weight is multiplied by the query term's weight:

DOT(Q2, i) = wt(Q2, likes) * wt(i, likes) + wt(Q2, wink) * wt(i, wink)
DOT(Q2, 1) = .5 * .33 + .4 * .42 = .165 + .168 = .333 ≈ .33
DOT(Q2, 2) = .5 * .2 + .4 * 0 = .1
...
DOT(Q2, 5) = .5 * .17 + .4 * .42 = .085 + .168 = .253 ≈ .25

Accumulators: 1|.33, 2|.1, 3|.1, 4|.1, 5|.25. The DOT(Q2, i) are complete for all i at this point; you can then divide each by the square root of the sum of squares to get the normalized cosine similarity score.
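The same accumulation for the weighted Query2, as a standalone sketch (again with hypothetical names; the expected output reflects the arithmetic above):

```python
POSTINGS = {
    "likes": [(1, 0.33), (2, 0.2), (3, 0.2), (4, 0.2), (5, 0.17)],
    "wink":  [(1, 0.42), (5, 0.42)],
}
QUERY2 = {"likes": 0.5, "wink": 0.4}   # weighted query from the slide

scores = {}
for term, q_wt in QUERY2.items():
    for doc_id, d_wt in POSTINGS[term]:
        # add this term's weighted contribution to the running sum
        scores[doc_id] = scores.get(doc_id, 0.0) + q_wt * d_wt

print(scores)   # ≈ {1: 0.33, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.25}
```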
Doc-at-a-Time: do a linear merge of the likes and wink posting lists (likes: (1, .33), (2, .2), (3, .2), (4, .2), (5, .17); wink: (1, .42), (5, .42)), computing

DOT(Q, i) = SUM [ wt(Q, term(j)) * wt(i, term(j)) ] over all term(j) in Q

one document at a time. Start with a pointer at the top of each list.

Step 1: do the documents under the pointers match? Yes, both are document 1, so DOT(Q, 1) = 1 * .33 + 1 * .42 = .75; advance both pointers. DOT(Q, 1) is complete at this point, so you can already divide it by the square root of the sum of squares to get its normalized cosine similarity score.

Step 2: do the documents match? No (2 vs. 5), so take the weight at the smallest document number: DOT(Q, 2) = 1 * .2 = .2, and increment the pointer at the smallest document. DOT(Q, 2) is now complete and can be normalized the same way.

Step 3: no match again (3 vs. 5): DOT(Q, 3) = 1 * .2 = .2; increment the pointer at the smallest document.

Step 4: no match (4 vs. 5): DOT(Q, 4) = 1 * .2 = .2; increment the pointer at the smallest document.

Step 5: do the documents match? Yes, both are document 5, so DOT(Q, 5) = 1 * .17 + 1 * .42 = .59. There are no more documents, so the merge is done. Each DOT(Q, i) was complete as soon as its document was passed, and could be normalized immediately.
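A sketch of the doc-at-a-time linear merge for a two-term query, assuming posting lists sorted by document number and query weights of 1; dot_doc_at_a_time is a hypothetical helper. Each document's score is complete the moment it is emitted, which is why it can be normalized right away.

```python
POSTINGS = {
    "likes": [(1, 0.33), (2, 0.2), (3, 0.2), (4, 0.2), (5, 0.17)],
    "wink":  [(1, 0.42), (5, 0.42)],
}

def dot_doc_at_a_time(list_a, list_b):
    """Linear merge of two sorted posting lists; yields (doc_id, DOT)."""
    i = j = 0
    while i < len(list_a) or j < len(list_b):
        doc_a = list_a[i][0] if i < len(list_a) else float("inf")
        doc_b = list_b[j][0] if j < len(list_b) else float("inf")
        if doc_a == doc_b:               # both lists contain this doc
            yield doc_a, list_a[i][1] + list_b[j][1]
            i += 1
            j += 1
        elif doc_a < doc_b:              # only list_a has this doc
            yield doc_a, list_a[i][1]
            i += 1
        else:                            # only list_b has this doc
            yield doc_b, list_b[j][1]
            j += 1

for doc_id, dot in dot_doc_at_a_time(POSTINGS["likes"], POSTINGS["wink"]):
    print(doc_id, round(dot, 2))   # 1 0.75, 2 0.2, 3 0.2, 4 0.2, 5 0.59
```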
Here is the denominator of the cosine similarity; keep an accumulator of these sums of squares while you are processing the terms in the numerator.

DEN(Q, i) = SQRT(wt(Q, likes)^2 + wt(Q, wink)^2) * SQRT(wt(i, likes)^2 + wt(i, wink)^2)

DEN(Q, 1) = SQRT(1*1 + 1*1) * SQRT(.33*.33 + .42*.42) = SQRT(2) * SQRT(.11 + .18) = SQRT(2) * SQRT(.29) = 1.41 * .54 = .76
DEN(Q, 2) = SQRT(1*1 + 1*1) * SQRT(.2*.2 + 0*0) = SQRT(2) * SQRT(.04) = 1.41 * .2 = .28
...
DEN(Q, 5) = SQRT(1*1 + 1*1) * SQRT(.17*.17 + .42*.42) = SQRT(2) * SQRT(.03 + .18) = SQRT(2) * SQRT(.21) = 1.41 * .46 = .65
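A sketch of that denominator accumulator, under the slide's convention that the document norm is taken over the query terms only (names are hypothetical):

```python
import math

POSTINGS = {
    "likes": [(1, 0.33), (2, 0.2), (3, 0.2), (4, 0.2), (5, 0.17)],
    "wink":  [(1, 0.42), (5, 0.42)],
}
QUERY = {"likes": 1.0, "wink": 1.0}

# Accumulate the sum of squared document weights while walking the
# posting lists -- the same pass that builds the numerator.
doc_sq = {}
for term, q_wt in QUERY.items():
    for doc_id, d_wt in POSTINGS[term]:
        doc_sq[doc_id] = doc_sq.get(doc_id, 0.0) + d_wt * d_wt

q_norm = math.sqrt(sum(w * w for w in QUERY.values()))   # SQRT(2) here
den = {d: q_norm * math.sqrt(sq) for d, sq in doc_sq.items()}
print({d: round(v, 2) for d, v in den.items()})
# ≈ {1: 0.76, 2: 0.28, 3: 0.28, 4: 0.28, 5: 0.64}
# (the slide's two-decimal intermediate rounding gives .65 for doc 5)
```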
Calculate the cosine similarity:

SIM(Q, i) = DOT(Q, i) / DEN(Q, i)
SIM(Q, 1) = DOT(Q, 1) / DEN(Q, 1) = .75 / .76 = .98
SIM(Q, 2) = DOT(Q, 2) / DEN(Q, 2) = .2 / .28 = .71
...
SIM(Q, 5) = DOT(Q, 5) / DEN(Q, 5) = .59 / .65 = .91
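Putting it together, a self-contained sketch of the whole query path: one term-at-a-time pass builds both the numerator and the sum-of-squares accumulators, then each document is normalized. The values match the worked example up to rounding; cosine_scores is a hypothetical helper.

```python
import math
from collections import defaultdict

POSTINGS = {
    "likes": [(1, 0.33), (2, 0.2), (3, 0.2), (4, 0.2), (5, 0.17)],
    "wink":  [(1, 0.42), (5, 0.42)],
}

def cosine_scores(query, postings):
    """query: term -> weight. Returns doc_id -> SIM(Q, i)."""
    dot = defaultdict(float)     # DOT(Q, i) accumulators (numerator)
    doc_sq = defaultdict(float)  # per-doc sum of squared weights (denominator)
    for term, q_wt in query.items():
        for doc_id, d_wt in postings.get(term, []):
            dot[doc_id] += q_wt * d_wt
            doc_sq[doc_id] += d_wt * d_wt
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    return {d: dot[d] / (q_norm * math.sqrt(doc_sq[d])) for d in dot}

scores = cosine_scores({"likes": 1.0, "wink": 1.0}, POSTINGS)
for doc_id, sim in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(doc_id, round(sim, 2))   # 1 0.99, 5 0.92, then docs 2-4 at 0.71
```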