Retrieval Models in Information Retrieval

 
Chapter 7
 
Retrieval Models
1
 
Retrieval Models
Provide a mathematical framework for defining the search process
Includes explanation of assumptions
Basis of many ranking algorithms
Progress in retrieval models has corresponded with improvements in effectiveness
Theories about relevance
2
 
Relevance
Complex concept that has been studied for some time
Many factors to consider
People often disagree when making relevance judgments
Retrieval models make various assumptions about relevance to simplify the problem
e.g., topical vs. user relevance
e.g., binary vs. multi-valued relevance
3
 
Retrieval Model Overview
Older models
  Boolean retrieval
  Vector Space model
Probabilistic Models
  BM25
  Language models
Combining evidence
  Inference networks
  Learning to Rank
4
 
Searching Based on Retrieved Documents
Sequence of queries driven by the number of retrieved documents
Example. "Lincoln" search of news articles:
  president AND Lincoln
  president AND Lincoln AND NOT (automobile OR car)
  president AND Lincoln AND biography AND life AND birthplace AND gettysburg AND NOT (automobile OR car)
  president AND Lincoln AND (biography OR life OR birthplace OR gettysburg) AND NOT (automobile OR car)
5
 
Boolean Retrieval
Two possible outcomes for query processing: TRUE and FALSE
"Exact-match" retrieval; no ranking at all
Query usually specified using Boolean operators: AND, OR, NOT
6
 
Boolean Retrieval
Advantages
  Results are predictable, relatively easy to explain
  Many different document features, such as metadata (e.g., type/date), can be incorporated
  Efficient processing, since many documents can be eliminated from the search
Disadvantages
  Simple queries usually don't work well
  Complex queries are difficult to construct
  Effectiveness depends entirely on users
7
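The exact-match semantics above can be sketched directly as set operations over an inverted index. A minimal Python sketch; the three-document collection and its index are invented for illustration:

```python
# Minimal sketch of exact-match Boolean retrieval over a toy inverted index.
# Documents are invented for illustration.
docs = {
    1: "president lincoln biography",
    2: "lincoln automobile review",
    3: "president lincoln birthplace",
}

# Build the inverted index: term -> set of document IDs containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)

# Boolean operators map to set operations; NOT is the complement
# with respect to the whole collection.
def AND(a, b): return a & b
def OR(a, b): return a | b
def NOT(a): return all_ids - a

# president AND lincoln AND NOT automobile
result = AND(AND(index["president"], index["lincoln"]), NOT(index["automobile"]))
print(sorted(result))  # [1, 3]: every match is simply TRUE; no ranking among them
```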
 
Vector Space Model
Simple and intuitively appealing
Documents and query represented by a vector of term weights
Collection represented by a matrix of term weights
8
 
Vector Space Model
[Figure: Vector representation of stemmed documents w/o stopwords]
9
 
Vector Space Model
3-D pictures useful, but can be misleading for high-dimensional space
10
 
Vector Space Model
Documents ranked by differences between points representing query and documents
Using a similarity measure rather than a distance or dissimilarity measure
e.g., Cosine correlation
11
 
Similarity Calculation
Consider two documents D1 and D2, and a query Q
  D1 = (0.5, 0.8, 0.3)
  D2 = (0.9, 0.4, 0.2)
  Q  = (1.5, 1.0, 0)
(D2 is more similar to Q than D1; see the sketch below)
12
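A quick check of this example in Python, computing the cosine correlation between the query and each document (the vectors come from the slide; D2 comes out closer to Q):

```python
import math

# Term-weight vectors from the slide.
D1 = (0.5, 0.8, 0.3)
D2 = (0.9, 0.4, 0.2)
Q  = (1.5, 1.0, 0)

def cosine(x, y):
    """Cosine correlation between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(f"cos(Q, D1) = {cosine(Q, D1):.3f}")  # ~0.868
print(f"cos(Q, D2) = {cosine(Q, D2):.3f}")  # ~0.966, so D2 ranks higher
```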
 
Term Weights
TF-IDF Weight
Term frequency weight measures importance in the document:
  tf_ik = f_ik / Σ_k f_ik
Inverse document frequency measures importance in the collection:
  idf_k = log (N / n_k)
Some heuristic modifications (e.g., to ensure non-zero weights)
13
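A short sketch of these two weights under the definitions above; the three-document collection is invented for illustration:

```python
import math
from collections import Counter

# Toy collection, invented for illustration.
docs = [
    "tropical fish tank water",
    "tropical fish forum",
    "water tank maintenance",
]
N = len(docs)
tokenized = [d.split() for d in docs]

# n_k: number of documents containing term k.
doc_freq = Counter(t for toks in tokenized for t in set(toks))

def tf_idf(term, toks):
    f = Counter(toks)
    tf = f[term] / len(toks)            # tf_ik = f_ik / sum of f_ik in doc
    idf = math.log(N / doc_freq[term])  # idf_k = log(N / n_k)
    return tf * idf

print(tf_idf("tropical", tokenized[1]))  # ~0.135 (common term, low idf)
print(tf_idf("forum", tokenized[1]))     # ~0.366 (rare term, high idf)
```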
 
Vector Space Model
Advantages
  Simple computational framework for ranking
  Any similarity measure or term weighting scheme could be used
Disadvantages
  Assumption of term independence
  No assumption on whether relevance is binary/multivalued
14
 
Probabilistic Models
According to [Greiff 02]:
  In probabilistic approaches to IR, the occurrence of a query term in a document D contributes to the probability that D will be judged as relevant
  The weight assigned to a query term should be based on the expected value of that contribution
15
[Greiff 02] Greiff, et al. The Role of Variance in Term Weighting for Probabilistic Information Retrieval. In Proc. of Intl. Conf. on Information and Knowledge Management (CIKM). 2002.
IR as Classification
[Figure: the Bayes decision rule viewed as classifying a document D into the relevant or non-relevant class via the conditional probabilities P(R | D) and P(NR | D)]
16
 
Bayes Classifier
Bayes Decision Rule
  A document D is relevant if P(R | D) > P(NR | D), where P(R | D) and P(NR | D) are conditional probabilities
Estimating probabilities
  Use Bayes Rule:  P(R | D) = P(D | R) P(R) / P(D)
Classify a document as relevant if
  P(D | R) / P(D | NR) > P(NR) / P(R)
  L.H.S. is the likelihood ratio of D being relevant
  (Based on the Bayes Decision Rule P(R | D) > P(NR | D))
17
 
Estimating P(D | R)
Assume word independence (Naïve Bayes assumption) and use individual term/word (d_i) probabilities
Binary (weights in doc) independence (of word) model
  Document represented by a vector of binary features indicating term occurrence (or non-occurrence)
  p_i is the probability that term i occurs in a relevant document,
  s_i is the probability of occurrence of i in a non-relevant document
Example. Given a document representation (1, 0, 0, 1, 1), the probability of the document occurring in the relevant set is
  p_1 × (1 − p_2) × (1 − p_3) × p_4 × p_5
18
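A small sketch of this product; the p_i values here are invented for illustration:

```python
# Binary independence: P(D | R) for D = (1, 0, 0, 1, 1).
# The per-term probabilities p_i are invented for illustration.
p = [0.6, 0.1, 0.2, 0.7, 0.5]   # p_i = P(term i occurs | relevant)
D = [1, 0, 0, 1, 1]             # binary document representation

prob = 1.0
for d_i, p_i in zip(D, p):
    # A present term contributes p_i; an absent term contributes (1 - p_i).
    prob *= p_i if d_i else (1 - p_i)

print(prob)  # 0.6 * 0.9 * 0.8 * 0.7 * 0.5 = 0.1512
```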
 
Binary Independence Model
Computing the likelihood ratio of D using p_i and s_i:
  P(D | R) / P(D | NR) = ∏_{i: d_i=1} (p_i / s_i) × ∏_{i: d_i=0} ((1 − p_i) / (1 − s_i))
(The second product, over words not in D, can be rearranged so that the part common to all documents drops out of the ranking)
19
 
Binary Independence Model
Scoring function is
  Σ_{i: d_i = q_i = 1} log ( p_i (1 − s_i) / (s_i (1 − p_i)) )
Using log to avoid multiplying lots of small numbers
If there is no other information on the relevant set, i.e., neither R nor NR exists, then
  p_i is set to be a constant (= 0.5)
  s_i is approximated as n_i / N, where n_i is the number of documents in the collection N that include i
  which gives a weight of log ((N − n_i) / n_i), similar to the IDF weight
20
 
Contingency Table, If (non-)Relevant Sets available
[Contingency table of term occurrence vs. relevance counts: r_i, R − r_i, n_i − r_i, N − n_i − R + r_i]
Gives the scoring function:
  Σ_i log ( ((r_i + 0.5) / (R − r_i + 0.5)) / ((n_i − r_i + 0.5) / (N − n_i − R + r_i + 0.5)) )
where
  r_i = number of relevant documents containing term i
  n_i = number of documents in a collection N containing i
  N = number of documents in the collection
  R = number of relevant documents in the collection
21
 
BM25 Ranking Algorithm
A ranking algorithm based on probabilistic arguments and experimental validation, but it is not a formal model
Adds document D and query Q term weights:
  Σ_{i∈Q} log ( ((r_i + 0.5)/(R − r_i + 0.5)) / ((n_i − r_i + 0.5)/(N − n_i − R + r_i + 0.5)) ) × ((k_1 + 1) f_i) / (K + f_i) × ((k_2 + 1) qf_i) / (k_2 + qf_i)
f_i is the frequency of term i in a document D
k_1, a constant, plays the role of tf along with f_i
qf_i is the frequency of term i in query Q, much lower than f_i
k_2 plays the same role as k_1 but is applied to the query term weight qf_i
K = k_1 ((1 − b) + b · dl / avdl), where dl is the document length & avdl is the average length of a doc, regulated by b (0 ≤ b ≤ 1)
k_1, k_2 & b are set empirically as 1.2, 0..1000, & 0.75, resp.
22
 
BM25 Example
Given a query with two terms, "president lincoln" (qf = 1)
No relevance information (r_i and R are zero)
N = 500,000 documents
"president" occurs in 40,000 documents (n_1 = 40,000)
"lincoln" occurs in 300 documents (n_2 = 300)
"president" occurs 15 times in doc D (f_1 = 15)
"lincoln" occurs 25 times in doc D (f_2 = 25)
document length is 90% of the average length (dl / avdl = 0.9)
k_1 = 1.2, b = 0.75, and k_2 = 100
K = 1.2 × (0.25 + 0.75 × 0.9) = 1.11
23
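A sketch that plugs these numbers into the BM25 formula above. With r_i = R = 0 the relevance part reduces to an IDF-like log term:

```python
import math

# BM25 with no relevance information (r_i = R = 0), using natural log.
N = 500_000
k1, k2, b = 1.2, 100, 0.75
dl_over_avdl = 0.9
K = k1 * ((1 - b) + b * dl_over_avdl)   # 1.2 * (0.25 + 0.75*0.9) = 1.11

def bm25_term(n_i, f_i, qf_i):
    idf_part = math.log((N - n_i + 0.5) / (n_i + 0.5))
    tf_part = (k1 + 1) * f_i / (K + f_i)
    qtf_part = (k2 + 1) * qf_i / (k2 + qf_i)
    return idf_part * tf_part * qtf_part

score = bm25_term(40_000, 15, 1) + bm25_term(300, 25, 1)  # president + lincoln
print(round(score, 2))  # ~20.6; "lincoln" dominates because it is much rarer
```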
 
BM25 Example
[Figure: BM25 scores computed for the "president lincoln" example]
24
 
BM25 Example
Effect of term frequencies, especially on "lincoln"
The large occurrence of a single important query term can yield a higher score than the occurrence of both query terms
25
 
Language Model (LM)
A language model is associated with each doc & has been successfully applied to search applications to represent the topical content of a document & query
LM is a probability distribution over sequences of words to represent docs, i.e., puts a probability measure over strings from some vocabulary V: Σ_{s∈V*} P(s) = 1
The original and basic method for using LMs in IR is the query likelihood model, in which a document d is constructed from a language model M_d, i.e.,
  Estimate a LM for each document d
  Rank documents by the likelihood of the query according to the estimated LM, i.e., P(Q | M_d)
26
 
LMs for Retrieval
Three possible models of topical relevance:
Probability of generating the query text from a document (language) model, i.e., the query likelihood model
  P_mle(Q | M_d) = ∏_{t∈Q} P_mle(t | M_d), where P_mle(t | M_d) = tf_{t,d} / dl_d
Probability of generating the document text from a query (language) model, i.e.,
  P_mle(D | M_q) = P_mle(D | R), where R is the set of relevant documents of a query q
Comparing the language models representing the query and document topics
27
 
LMs for Retrieval
Three ways of developing the LM modeling approach:
a) The Query Likelihood Model: P(Q | M_d) or P(Q | D)
b) The Document Likelihood Model: P(D | M_Q) or P(D | R)
c) Comparison of the query and document LMs using KL-Divergence
28
 
LMs for Retrieval
The Query Likelihood Model
  Documents are ranked based on the probability that the query Q could be generated by the document model (i.e., on the same topic) based on doc D, i.e., P(Q | D)
  Given a query Q = q_1 … q_n and a document D = d_1 … d_m, compute the conditional probability P(D | Q):
    P(D | Q) ∝ P(Q | D) P(D)
  where P(Q | D) is the query likelihood given D & P(D) is the prior belief that D is relevant to any query (assumed to be uniform)
29
 
Estimating Probabilities
Obvious estimate for unigram probabilities is the maximum likelihood estimate (mle):
  P(q_i | D) = f_{q_i,D} / |D|
Makes the observed value of f_{q_i,D} most likely
If any query words are missing from the document, the score will be zero
  Missing 1 out of 4 query words same as missing 3 out of 4
30
 
LMs for Retrieval
Example. The Query Likelihood Model
Given the two (unigram) LMs, M1 and M2, as shown below:

Q:  Frog   said   that   toad   likes  that   dog
M1: 0.01   0.03   0.04   0.01   0.02   0.04   0.005
M2: 0.0002 0.03   0.04   0.0001 0.04   0.04   0.01

P(Q | M1) = 0.00000000000048
P(Q | M2) = 0.00000000000000034
P(Q | M1) > P(Q | M2)
31
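A quick check of these two likelihoods, multiplying the per-term probabilities straight from the table:

```python
# Per-term probabilities for Q = "Frog said that toad likes that dog".
M1 = [0.01, 0.03, 0.04, 0.01, 0.02, 0.04, 0.005]
M2 = [0.0002, 0.03, 0.04, 0.0001, 0.04, 0.04, 0.01]

def query_likelihood(probs):
    p = 1.0
    for x in probs:
        p *= x
    return p

print(query_likelihood(M1))  # 4.8e-13
print(query_likelihood(M2))  # 3.84e-16, so M1 is the better model for Q
```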
 
The Language Models
Example. Using language models, we estimate whether "Barbie car" (i.e., treated as Q) is generated from content (i.e., treated as M) targeting younger audiences or adults

Words     P(w) on Simple Wiki   P(w) on Wikipedia
Blue      0.0042                0.0043
Car       0.0039                0.0048
Barbie    0.0056                0.0002
Auburn    0.00000001            0.0439
Software  0.0003                0.0756
Toy       0.0076                0.0037

P(Simple Wiki | "Barbie car") ∝ P("Barbie car" | Simple Wiki) = 0.0056 × 0.0039 = 0.00002184
P(Wikipedia | "Barbie car") ∝ P("Barbie car" | Wikipedia) = 0.0048 × 0.0002 = 0.00000096
Based on the results, it is more likely that "Barbie car" is generated from content written for younger audiences
32
 
LMs for Retrieval
The Document Likelihood Model
  Instead of using the probability of a document language model (M_d) generating the query, use the probability of a query language model (M_q) generating the document
  Creating a document likelihood model is less appealing, since there is much less text available to estimate a LM based on the query text
  Solution: incorporate relevance feedback into the model & update the language model M_q
  e.g., the relevance model of Lavrenko & Croft (2001) is an instance of a document likelihood model that incorporates pseudo relevance feedback into an LM approach
33
 
Language Model (LM) - Example
Consider the query Q = "Frozen Movie" & the following documents, where R is relevant & NR is non-relevant:
  "Frozen Character Anna" (R), "Disney Movie Elsa" (R), "Frozen Disney Sven" (R), "Frozen Food from Fridge" (NR)
The relevant set yields the query language model M_q:

Word       P(w)
Frozen     2/9
Character  1/9
Anna       1/9
Disney     2/9
Movie      1/9
Elsa       1/9
Sven       1/9
34
 
Language Model (LM) - Example
Using the following query language model (M_q), we estimate the probability of docs D1 and D2 on their relevance for query Q = "Frozen Movie"

D1: "Sven, Anna, Disney, Frozen"
  P(D1 | M_q) = ∏_{t∈D1} P(t | M_q) = 1/9 × 1/9 × 2/9 × 2/9 = 4/6561 = 0.00061

D2: "Frozen ice cream"
  P(D2 | M_q) = ∏_{t∈D2} P(t | M_q) = 2/9 × 0 × 0 = 0
35
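A sketch reproducing this document-likelihood computation from the M_q table; terms outside the relevant set fall back to probability 0, which is exactly the zero-probability problem that smoothing later addresses:

```python
from fractions import Fraction

# Query language model M_q estimated from the relevant set (previous slide).
Mq = {"frozen": Fraction(2, 9), "character": Fraction(1, 9), "anna": Fraction(1, 9),
      "disney": Fraction(2, 9), "movie": Fraction(1, 9), "elsa": Fraction(1, 9),
      "sven": Fraction(1, 9)}

def doc_likelihood(doc):
    p = Fraction(1)
    for t in doc.lower().split():
        p *= Mq.get(t, Fraction(0))   # unseen word -> probability 0
    return p

print(float(doc_likelihood("Sven Anna Disney Frozen")))  # 4/6561 ~ 0.00061
print(float(doc_likelihood("Frozen ice cream")))         # 0.0
```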
 
LMs for Retrieval
Comparing the Query Likelihood Model & Document Likelihood Model
  Rather than generating either a document language model (M_d) or a query language model (M_q), create an LM from both the document & query & find the difference
  To model the risk of returning a document d as relevant to a query q, use the KL-divergence:
    R(d; q) = KL(M_d || M_q) = Σ_{t∈V} P(t | M_q) log ( P(t | M_q) / P(t | M_d) )
  KL-divergence measures how bad the probability distribution M_q is at modeling M_d
  It has been shown that the comparison model outperforms both query-likelihood & document-likelihood models and is useful for ad hoc retrieval
36
 
KL-Divergence
Example. Consider the probability distributions generated using texts from Simple Wikipedia (a simplified version of Wikipedia) and Wikipedia
Probabilities generated using text targeting younger versus more mature audiences:

Words     P(w) based on Simple Wiki   P(w) based on Wikipedia
Blue      0.0042                      0.0043
Car       0.0039                      0.0048
Barbie    0.0056                      0.0002
Auburn    0.00000001                  0.0439
Software  0.0003                      0.0756
Toy       0.0076                      0.0037
37
 
KL-Divergence
Example (Cont.). Using K-L Divergence on the vocabulary (base-10 logs):
KL(Wikipedia || Simple Wiki)
  = 0.0043 × log (0.0043/0.0042) + 0.0048 × log (0.0048/0.0039)
  + 0.0002 × log (0.0002/0.0056) + 0.0439 × log (0.0439/0.00000001)
  + 0.0756 × log (0.0756/0.0003) + 0.0037 × log (0.0037/0.0076)
  = 0.47218
The vocabulary distribution for children versus more mature audiences is indeed different
38
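A sketch reproducing this number; the slide's arithmetic works out with base-10 logarithms:

```python
import math

# P(w | Wikipedia) and P(w | Simple Wiki) for Blue, Car, Barbie, Auburn, Software, Toy.
wiki   = [0.0043, 0.0048, 0.0002, 0.0439, 0.0756, 0.0037]
simple = [0.0042, 0.0039, 0.0056, 0.00000001, 0.0003, 0.0076]

# KL(Wikipedia || Simple Wiki) over this six-word vocabulary.
kl = sum(p * math.log10(p / q) for p, q in zip(wiki, simple))
print(round(kl, 5))  # 0.47218
```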
 
Smoothing
Document texts are a sample from the language model
  Missing words (in a document) should not have zero probability of occurring
Smoothing is a technique for estimating probabilities for missing (or unseen) words, i.e., the words that are not seen in the text
  Lower (or discount) the probability estimates for words that are seen in the document text
  Assign that "left-over" probability to the estimates for unseen words
39
 
Smoothing
As stated in [Zhai 04], smoothing is the problem of adjusting the maximum likelihood estimator to compensate for data sparseness, and it is required to avoid assigning zero probability to unseen words
Smoothing accuracy is directly related to retrieval performance:
  The retrieval performance is generally sensitive to smoothing parameters
Smoothing plays two different roles:
  Making the estimated document LM more accurate
  Explaining the non-informative words in the query
40
 
Estimating Probabilities
Estimate for unseen words is α_D P(q_i | C)
  P(q_i | C) is the probability for query word i in the collection language model for collection C (background probability)
  α_D, which can be a constant, is a coefficient of the probability assigned to unseen words, depending on the document D
Estimate for words that occur is
  (1 − α_D) P(q_i | D) + α_D P(q_i | C)
Different forms of estimation come from different α_D
41
 
Estimating Probabilities
Different forms of estimation come from different α_D
  p(w | d) = p_s(w | d) if word w is seen in d, and α_d p(w | C) otherwise
where
  p_s(w | d) is the smoothed probability of a word seen in d,
  p(w | C) is the collection language model,
  α_d is a coefficient controlling the probability mass assigned to unseen words,
  all probabilities sum to one
42
 
Jelinek-Mercer Smoothing
α_D is a constant, λ (= 0.1 for short queries or = 0.7 for long queries in TREC evaluations)
Gives estimate of
  p(q_i | D) = (1 − λ) f_{q_i,D} / |D| + λ c_{q_i} / |C|
where c_{q_i} is the number of occurrences of q_i in the collection and |C| is the total number of word occurrences in the collection
Ranking score
  log P(Q | D) = Σ_{i=1..n} log ( (1 − λ) f_{q_i,D} / |D| + λ c_{q_i} / |C| )
Use logs for convenience: solves the accuracy problem of multiplying small numbers
43
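A minimal sketch of the JM-smoothed ranking score; the document and collection statistics reuse the numbers from the Query Likelihood example later in the chapter and are otherwise illustrative:

```python
import math

def jm_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.1):
    """log P(Q | D) with Jelinek-Mercer smoothing (lambda = 0.1 for short queries)."""
    score = 0.0
    for q in query:
        p_doc = doc_tf.get(q, 0) / doc_len      # f_{qi,D} / |D|
        p_coll = coll_tf.get(q, 0) / coll_len   # c_{qi} / |C|
        score += math.log((1 - lam) * p_doc + lam * p_coll)
    return score

# Statistics borrowed from the "President Lincoln" example for illustration.
doc_tf = {"president": 15, "lincoln": 25}
coll_tf = {"president": 160_000, "lincoln": 2_400}
print(jm_score(["president", "lincoln"], doc_tf, 1_800, coll_tf, 10**9))  # ~ -9.27
```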
 
Where is tf-idf Weight?
Rewriting the JM-smoothed score:
  log P(Q | D) = Σ_{i: f_{q_i,D} > 0} log ( ((1 − λ) f_{q_i,D}/|D| + λ c_{q_i}/|C|) / (λ c_{q_i}/|C|) ) + Σ_{i=1..n} log (λ c_{q_i}/|C|)
The second sum is the same for all documents and can be ignored for ranking
The remaining term is proportional to the term frequency (TF), inversely proportional to the collection frequency (IDF)
44
 
Dirichlet Smoothing
α_D depends on document length:
  α_D = μ / (|D| + μ)
where μ is a parameter value set empirically
Gives probability estimation of
  p(q_i | D) = (f_{q_i,D} + μ c_{q_i} / |C|) / (|D| + μ)
and document score
  log P(Q | D) = Σ_{i=1..n} log ( (f_{q_i,D} + μ c_{q_i} / |C|) / (|D| + μ) )
45
 
Query Likelihood
Example. Given the Dirichlet-smoothed estimate of P(q_i | D) above,
Let Q be "President Lincoln", D (a doc) in C(ollection)
For the term "President", let f_{q_i,D} = 15, c_{q_i} = 160,000
For the term "Lincoln", let f_{q_i,D} = 25, c_{q_i} = 2,400
Let the no. of word occurrences in D (i.e., |D|) be 1,800
Let the no. of word occurrences in C (i.e., |C|) be 10^9 = 500,000 (docs) × 2,000 (average no. of words in a doc)
μ = 2,000

log P(Q | D) = log ((15 + 2,000 × 160,000/10^9) / (1,800 + 2,000)) + log ((25 + 2,000 × 2,400/10^9) / (1,800 + 2,000))
(Negative due to summing logs of small numbers)
46
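A sketch that evaluates this example with the Dirichlet-smoothed score (μ = 2,000):

```python
import math

def dirichlet_score(query_stats, doc_len, coll_len, mu=2000):
    """log P(Q | D) with Dirichlet smoothing; query_stats maps term -> (f_qiD, c_qi)."""
    score = 0.0
    for f_qiD, c_qi in query_stats.values():
        score += math.log((f_qiD + mu * c_qi / coll_len) / (doc_len + mu))
    return score

stats = {"president": (15, 160_000), "lincoln": (25, 2_400)}
print(round(dirichlet_score(stats, doc_len=1_800, coll_len=10**9), 2))  # ~ -10.54
```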
 
Query Likelihood
[Figure: query likelihood scores for the "President Lincoln" example]
47
 
Set Theoretic Models
The Boolean model imposes a binary criterion for deciding relevance
The question of how to extend the Boolean model to accommodate partial matching and ranking has attracted considerable attention in the past
Two set theoretic models for this extension are
  Fuzzy Set Model
  Extended Boolean Model
48
 
Fuzzy Queries
Binary queries are based on binary logic, in which a document either satisfies or does not satisfy a query.
Fuzzy queries are based on fuzzy logic, in which a query term is considered as a fuzzy set & each document is given a degree of relevance with respect to a query, usually [0, 1].
Example.
  In binary logic: the set of "tall people" is defined as those > 6 feet. Thus, a person 5' 9.995" is not considered tall.
  In fuzzy logic: there is no sharp cutoff for "tall people". Every person is given a degree of membership to the set of tall people, e.g.,
    a person 7'0" will have a grade 1.0
    a person 4'5" will have a grade 0.0
    a person 6'2" will have a grade 0.85
    a person 5'8" will have a grade 0.7
49
 
Fuzzy Queries
Binary Logic vs. Fuzzy Logic
[Figure: membership in the set TALL as a step function at 6.5 feet under binary (crisp) logic versus a graded ramp between 4.5 and 6.5 feet under fuzzy logic]
50
Fuzzy Set Model
Queries and documents are represented by sets of index terms & matching is approximate from the start
This vagueness can be modeled using a fuzzy framework, as follows:
  with each (query) term is associated a fuzzy set
  each document has a degree of membership (0 ≤ X ≤ 1) in this fuzzy set
This interpretation provides the foundation for many models for IR based on fuzzy theory
The model proposed by Ogawa, Morita & Kobayashi (1991) is discussed
51
Fuzzy Information Retrieval
Fuzzy sets are modeled based on a thesaurus
This thesaurus is built as follows:
  Let c be a term-term correlation matrix (or keyword connection matrix)
  Let c(i, l) be a normalized correlation factor for (k_i, k_l):
    c(i, l) = n(i, l) / (n_i + n_l − n(i, l))
  n_i: number of documents which contain k_i
  n_l: number of documents which contain k_l
  n(i, l): number of documents which contain both k_i and k_l
The notion of proximity among indexed terms
52
Fuzzy Information Retrieval
The correlation factor c(i, l) can be used to define fuzzy set membership for a document d_j in the fuzzy set for query term k_i as follows:
  μ(i, j) = 1 − ∏_{k_l ∈ d_j} (1 − c(i, l))
μ(i, j): the degree of membership of document d_j in the fuzzy subset associated with k_i
The above expression computes an algebraic sum over all terms in d_j with respect to k_i, implemented as the complement of a negated algebraic product
A document d_j belongs to the fuzzy set for k_i if its own terms are associated with k_i
53
Fuzzy Information Retrieval
  μ(i, j) = 1 − ∏_{k_l ∈ d_j} (1 − c(i, l))
μ(i, j): membership of document d_j in the fuzzy subset associated with query term k_i
If document d_j contains any index term k_l which is closely related to query term k_i, we have
  if ∃ l such that c(i, l) ~ 1, then μ(i, j) ~ 1, and query index k_i is a good fuzzy index for document d_j
54
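A minimal sketch of this membership computation; the correlation factors c(i, l) are invented for illustration:

```python
# Fuzzy membership of document d_j in the fuzzy set of query term k_i:
#   mu(i, j) = 1 - prod over terms k_l in d_j of (1 - c(i, l)).
# Correlation factors below are invented for illustration.
c = {("fish", "tropical"): 0.6, ("fish", "aquarium"): 0.8, ("fish", "water"): 0.3}

def membership(k_i, doc_terms):
    prod = 1.0
    for k_l in doc_terms:
        prod *= 1 - c.get((k_i, k_l), 0.0)   # unrelated terms contribute factor 1
    return 1 - prod

# One strongly correlated term is enough to push membership close to 1.
print(membership("fish", ["tropical", "aquarium", "water"]))  # 0.944
```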
Fuzzy IR
Example.
  q = k_a ∧ (k_b ∨ ¬k_c)
  q_dnf = (1,1,1) + (1,1,0) + (1,0,0)   (binary weighted)
        = cc_1 + cc_2 + cc_3            (conjunctive components)
  μ(q, d_j) = μ(cc_1 + cc_2 + cc_3, j)
            = 1 − (1 − μ(a, j) μ(b, j) μ(c, j))
                × (1 − μ(a, j) μ(b, j) (1 − μ(c, j)))
                × (1 − μ(a, j) (1 − μ(b, j)) (1 − μ(c, j)))
55
Fuzzy Information Retrieval
Fuzzy sets are useful for representing vagueness and imprecision
Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
Experiments with standard test collections are not available
Difficult to compare because previously conducted experiments used only small collections
56