Cosine Similarity in an Inverted Index for Querying

 
Calculating Cosine Similarity in an Inverted Index
 
Dr. Claudia Pearce
CMSC476/676
 
In the lecture on InvertedIndexHowToBuild.pptx, we built an inverted index for these documents.

How can we query from it? How do we calculate the cosine similarity from the inverted index?
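For reference, the score that the rest of the deck assembles piece by piece is the cosine similarity; in the deck's notation, with the sums running over the query terms term(j), it is:

```latex
\mathrm{SIM}(Q,i) \;=\; \frac{\mathrm{DOT}(Q,i)}{\mathrm{DEN}(Q,i)}
  \;=\; \frac{\sum_{j} wt(Q,\mathrm{term}(j)) \cdot wt(i,\mathrm{term}(j))}
             {\sqrt{\sum_{j} wt(Q,\mathrm{term}(j))^{2}} \cdot
              \sqrt{\sum_{j} wt(i,\mathrm{term}(j))^{2}}}
```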
 
[Figure: the inverted index built in InvertedIndexHowToBuild.pptx. The dictionary lists each term with its document frequency (ink 3, pink 2, likes 5, wink 2, drink 5, thing 1, he 5), and each entry points to a posting list of doc|weight pairs. The two lists used by the query below are:

    likes (df 5):  1|.33   2|.2   3|.2   4|.2   5|.17
    wink  (df 2):  1|.42   5|.42

D(i) = number of terms in Doc(i):  1|6   2|5   3|5   4|5   5|6]
 
Query {likes, wink}
 
Term-at-a-Time
 
N(i) = sum of the dot-product terms for Doc(i) that are also in the query. N is for Numerator; initialize every N(i) to 0.

DOT(Q, i) = SUM[ wt(Q, term(j)) * wt(i, term(j)) ]
    for all term(j) in Q
    for all Doc(i) in the corpus

N(i) = DOT(Q, i) once we have processed each posting list.
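A minimal sketch of this term-at-a-time loop, assuming the index is a dict mapping each term to its posting list of (doc, weight) pairs; the function and variable names are illustrative, not from the slides:

```python
from collections import defaultdict

def dot_term_at_a_time(index, query):
    """Term-at-a-time: accumulate DOT(Q, i) for every document i.

    index : dict mapping term -> posting list of (doc, weight) pairs
    query : dict mapping term -> query weight (1 when none is given)
    """
    N = defaultdict(float)                 # the N(i) accumulators, init to 0
    for term, q_wt in query.items():       # bring up one posting list at a time
        for doc, d_wt in index.get(term, []):
            N[doc] += q_wt * d_wt          # update the running sum for this doc
    return dict(N)                         # N(i) = DOT(Q, i) after all lists

# Example index holding just the two posting lists used in these slides:
index = {
    "likes": [(1, .33), (2, .2), (3, .2), (4, .2), (5, .17)],
    "wink":  [(1, .42), (5, .42)],
}
print(dot_term_at_a_time(index, {"likes": 1, "wink": 1}))
# approximately {1: .75, 2: .2, 3: .2, 4: .2, 5: .59}
```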
 
Query {likes, wink}

Term-at-a-Time: bring up the posting list for likes. Assume no weights on the query terms for now; if a weight is provided in the query, use it in place of the 1.

N(i) = sum of the dot-product terms for Doc(i) that are also in the query, initialized to 0. Update the sums by adding the corresponding value as you process each document in the list, for each term in the query. In this case the term is likes:

N(1, likes) = 1*.33 = .33
N(2, likes) = 1*.2  = .2
N(3, likes) = 1*.2  = .2
N(4, likes) = 1*.2  = .2
N(5, likes) = 1*.17 = .17
 
Query {likes, wink}

Term-at-a-Time: bring up the posting list for wink, still assuming no weights on the query terms.

DOT(Q, i) = N(i) = sum of the dot-product terms for Doc(i) that are also in the query. Update the sums by adding the corresponding values for wink:

N(1, wink) = 1*.42 = .42, so N(1) = .33 + .42 = .75
N(5, wink) = 1*.42 = .42, so N(5) = .17 + .42 = .59
Docs 2, 3, and 4 do not appear in the wink list, so each gets +0.
 
DOT(Q, i) = wt(Q,likes)*wt(i,likes) + wt(Q,wink)*wt(i,wink)

DOT(Q,1) = wt(Q,likes)*wt(1,likes) + wt(Q,wink)*wt(1,wink)
         = 1*.33 + 1*.42
         = .75
DOT(Q,2) = wt(Q,likes)*wt(2,likes) + wt(Q,wink)*wt(2,wink)
         = 1*.2 + 1*0
         = .2
...
DOT(Q,5) = wt(Q,likes)*wt(5,likes) + wt(Q,wink)*wt(5,wink)
         = 1*.17 + 1*.42
         = .59
 
Once both lists are processed, N(i) holds the finished dot products. In Phase 4, it says: "(the sum of the term weights in the document)". This is just the DOT product with a wt of 1 for each query term.
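Spelled out, the identity that note is pointing at (assuming a weight of 1 on every query term):

```latex
\mathrm{DOT}(Q,i)\Big|_{wt(Q,\cdot)=1}
  \;=\; \sum_{\mathrm{term}(j)\in Q} 1 \cdot wt(i,\mathrm{term}(j))
  \;=\; \sum_{\mathrm{term}(j)\in Q} wt(i,\mathrm{term}(j))
```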
 
 
Term-at-a-Time recap: N(i) held the sum of the dot-product terms for Doc(i) that were also in the query, just using 1 as the query weight since we didn't provide anything else.

What if the query has weights, as in Query2 {.5 likes, .4 wink}?
 
Query2 {.5 likes, .4 wink}

Term-at-a-Time again: N(i) holds the sum of the dot-product terms for Doc(i) that are also in the query, now using the supplied query weights.

DOT(Q2, i) = wt(Q2,likes)*wt(i,likes) + wt(Q2,wink)*wt(i,wink)

DOT(Q2,1) = wt(Q2,likes)*wt(1,likes) + wt(Q2,wink)*wt(1,wink)
          = .5*.33 + .4*.42
          = .165 + .168 = .333 = .33
DOT(Q2,2) = wt(Q2,likes)*wt(2,likes) + wt(Q2,wink)*wt(2,wink)
          = .5*.2 + .4*0
          = .1
...
DOT(Q2,5) = wt(Q2,likes)*wt(5,likes) + wt(Q2,wink)*wt(5,wink)
          = .5*.17 + .4*.42
          = .085 + .168 = .253 = .25

The DOT(Q2, i) are complete for all i at this point, and you can then divide each by the square root of the sum of squares to get the normalized cosine similarity score.
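In the term-at-a-time sketch above, query weights need no extra machinery; Query2 is just a different weight dict:

```python
# Query2 from this slide: weight .5 on likes, .4 on wink
print(dot_term_at_a_time(index, {"likes": .5, "wink": .4}))
# approximately {1: .33, 2: .1, 3: .1, 4: .1, 5: .25}
```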
 
Query {likes, wink}

Doc-at-a-Time

DOT(Q, i) = SUM[ wt(Q, term(j)) * wt(i, term(j)) ]
    for all term(j) in Q
    for all Doc(i) in the corpus

Linear merge of the likes and wink lists.

Start with pointers at the top of each list. Do both documents match?
Yes:
    DOT(Q,1) = 1*.33 + 1*.42 = .75
    Advance both pointers.

DOT(Q,1) is complete at this point, and you can then divide it by the square root of the sum of squares to get the normalized cosine similarity score.
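A sketch of this two-pointer linear merge, using the same illustrative posting-list representation as before (a real system would generalize this to k lists, for example with a heap):

```python
def dot_doc_at_a_time(list_a, list_b, q_wt_a=1, q_wt_b=1):
    """Doc-at-a-time: linear merge of two posting lists sorted by doc id.
    Each DOT(Q, i) is finished the moment the merge moves past doc i."""
    scores = {}
    i = j = 0
    while i < len(list_a) or j < len(list_b):
        doc_a = list_a[i][0] if i < len(list_a) else float("inf")
        doc_b = list_b[j][0] if j < len(list_b) else float("inf")
        if doc_a == doc_b:                        # both documents match
            scores[doc_a] = q_wt_a * list_a[i][1] + q_wt_b * list_b[j][1]
            i += 1                                # advance both pointers
            j += 1
        elif doc_a < doc_b:                       # take the smallest doc number
            scores[doc_a] = q_wt_a * list_a[i][1]
            i += 1                                # advance that list's pointer
        else:
            scores[doc_b] = q_wt_b * list_b[j][1]
            j += 1
    return scores

# Merge the likes and wink lists from the example index above:
print(dot_doc_at_a_time(index["likes"], index["wink"]))
# approximately {1: .75, 2: .2, 3: .2, 4: .2, 5: .59}
```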
 
The merge continues. Do the documents match? No: take the weight from the smallest document number and increment that list's pointer. This same step handles documents 2, 3, and 4, each of which appears only in the likes list:

    DOT(Q,2) = 1*.2 = .2
    DOT(Q,3) = 1*.2 = .2
    DOT(Q,4) = 1*.2 = .2

Each of these DOT(Q,i) is complete as soon as the merge moves past document i, and you can then divide it by the square root of the sum of squares to get its normalized cosine similarity score.
 
Do the documents match? Yes:
    DOT(Q,5) = 1*.17 + 1*.42 = .59
    Advance the pointers. No more docs, so we are done.

DOT(Q,5) is complete at this point, and you can then divide it by the square root of the sum of squares to get the normalized cosine similarity score.
 
Here is the denominator of the cosine similarity. Keep an accumulator of these sums of squares while you are processing the terms in the numerator.

DEN(Q, i) = SQRT(wt(Q,likes)^2 + wt(Q,wink)^2) * SQRT(wt(i,likes)^2 + wt(i,wink)^2)

DEN(Q,1) = SQRT(wt(Q,likes)^2 + wt(Q,wink)^2) * SQRT(wt(1,likes)^2 + wt(1,wink)^2)
         = SQRT(1*1 + 1*1) * SQRT(.33*.33 + .42*.42)
         = SQRT(2) * SQRT(.11 + .18) = SQRT(2) * SQRT(.29)
         = (1.41)*(.54)
         = .76
DEN(Q,2) = SQRT(wt(Q,likes)^2 + wt(Q,wink)^2) * SQRT(wt(2,likes)^2 + wt(2,wink)^2)
         = SQRT(1*1 + 1*1) * SQRT(.2*.2 + 0*0)
         = SQRT(2) * SQRT(.04)
         = (1.41)*(.2)
         = .28
...
DEN(Q,5) = SQRT(wt(Q,likes)^2 + wt(Q,wink)^2) * SQRT(wt(5,likes)^2 + wt(5,wink)^2)
         = SQRT(1*1 + 1*1) * SQRT(.17*.17 + .42*.42)
         = SQRT(2) * SQRT(.03 + .18) = SQRT(2) * SQRT(.21)
         = (1.41)*(.46)
         = .65
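A sketch of the denominator under the same illustrative representation, accumulating the sum of squares over the query-term weights exactly as the slide does (note that both norms are taken over the query terms only, following the DEN(Q, i) formula above):

```python
import math
from collections import defaultdict

def denominator(index, query):
    """DEN(Q, i): the query norm times the document norm, both taken
    over the query terms only, matching the DEN(Q, i) formula above."""
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    sum_sq = defaultdict(float)                  # sum-of-squares accumulators
    for term in query:                           # walk the same posting lists
        for doc, d_wt in index.get(term, []):    # as the numerator pass
            sum_sq[doc] += d_wt * d_wt
    return {doc: q_norm * math.sqrt(s) for doc, s in sum_sq.items()}

# Reusing the example index from the term-at-a-time sketch:
print(denominator(index, {"likes": 1, "wink": 1}))
# approximately {1: .76, 2: .28, 3: .28, 4: .28, 5: .64}
# (the slide's .65 for doc 5 comes from rounding intermediate values)
```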
 
Calculate the Cosine Similarity

SIM(Q,i) = DOT(Q,i) / DEN(Q,i)

SIM(Q,1) = DOT(Q,1)/DEN(Q,1) = .75/.76 = .99
SIM(Q,2) = DOT(Q,2)/DEN(Q,2) = .2/.28 = .71
...
SIM(Q,5) = DOT(Q,5)/DEN(Q,5) = .59/.65 = .91
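Putting the two halves together, a compact end-to-end sketch (same illustrative index layout) that accumulates the numerator and the sum of squares in a single pass and then normalizes; doc 5 prints .92 here, while the slide's .91 comes from rounded intermediate values:

```python
import math

index = {
    "likes": [(1, .33), (2, .2), (3, .2), (4, .2), (5, .17)],
    "wink":  [(1, .42), (5, .42)],
}
query = {"likes": 1, "wink": 1}

dot, sum_sq = {}, {}
for term, q_wt in query.items():                     # one pass per posting list
    for doc, d_wt in index.get(term, []):
        dot[doc] = dot.get(doc, 0.0) + q_wt * d_wt           # DOT(Q, i)
        sum_sq[doc] = sum_sq.get(doc, 0.0) + d_wt * d_wt     # for DEN(Q, i)

q_norm = math.sqrt(sum(w * w for w in query.values()))
for doc in sorted(dot):
    sim = dot[doc] / (q_norm * math.sqrt(sum_sq[doc]))       # SIM(Q, i)
    print(doc, round(sim, 2))
# 1 0.99
# 2 0.71
# 3 0.71
# 4 0.71
# 5 0.92
```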