Retrieval Models in Information Retrieval

 
Chapter 7
 
Retrieval Models
1
 
Retrieval Models
Provide a mathematical framework for defining the search process
Includes explanation of assumptions
Basis of many ranking algorithms
Progress in retrieval models has corresponded with improvements in effectiveness
Theories about relevance
2
 
Relevance
Complex concept that has been studied for some time
Many factors to consider
People often disagree when making relevance judgments
Retrieval models make various assumptions about relevance to simplify the problem
e.g., topical vs. user relevance
e.g., binary vs. multi-valued relevance
3
 
Retrieval Model Overview
Older models
  Boolean retrieval
  Vector Space model
Probabilistic Models
  BM25
  Language models
Combining evidence
  Inference networks
  Learning to Rank
4
 
Searching Based on Retrieved Documents
Sequence of queries driven by the number of retrieved documents
Example. "Lincoln" search of news articles:
  president AND Lincoln
  president AND Lincoln AND NOT (automobile OR car)
  president AND Lincoln AND biography AND life AND birthplace AND gettysburg AND NOT (automobile OR car)
  president AND Lincoln AND (biography OR life OR birthplace OR gettysburg) AND NOT (automobile OR car)
5
 
Boolean Retrieval
Two possible outcomes for query processing: TRUE and FALSE
"Exact-match" retrieval; no ranking at all
Query usually specified using Boolean operators: AND, OR, NOT
6
 
Boolean Retrieval
Advantages
  Results are predictable, relatively easy to explain
  Many different document features, such as metadata (e.g., type/date), can be incorporated
  Efficient processing, since many documents can be eliminated from the search
Disadvantages
  Simple queries usually don't work well
  Complex queries are difficult to construct
  Effectiveness depends entirely on users
7
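The exact-match semantics above can be sketched directly as set operations over an inverted index. A minimal Python sketch; the three-document collection and its index are invented for illustration:

```python
# Minimal sketch of exact-match Boolean retrieval over a toy inverted index.
# Documents are invented for illustration.
docs = {
    1: "president lincoln biography",
    2: "lincoln automobile review",
    3: "president lincoln birthplace",
}

# Build the inverted index: term -> set of document IDs containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)

# Boolean operators map to set operations; NOT is the complement
# with respect to the whole collection.
def AND(a, b): return a & b
def OR(a, b): return a | b
def NOT(a): return all_ids - a

# president AND lincoln AND NOT automobile
result = AND(AND(index["president"], index["lincoln"]), NOT(index["automobile"]))
print(sorted(result))  # [1, 3]: every match is simply TRUE; no ranking among them
```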
 
Vector Space Model
Simple and intuitively appealing
Documents and query represented by a vector of term weights
Collection represented by a matrix of term weights
8
 
Vector Space Model
[Figure: Vector representation of stemmed documents w/o stopwords]
9
 
Vector Space Model
3-D pictures useful, but can be misleading for high-dimensional space
10
 
Vector Space Model
Documents ranked by differences between points representing query and documents
Using a similarity measure rather than a distance or dissimilarity measure
e.g., Cosine correlation
11
 
Similarity Calculation
Consider two documents D1 and D2, and a query Q
  D1 = (0.5, 0.8, 0.3)
  D2 = (0.9, 0.4, 0.2)
  Q  = (1.5, 1.0, 0)
(D2 is more similar to Q than D1; see the sketch below)
12
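A quick check of this example in Python, computing the cosine correlation between the query and each document (the vectors come from the slide; D2 comes out closer to Q):

```python
import math

# Term-weight vectors from the slide.
D1 = (0.5, 0.8, 0.3)
D2 = (0.9, 0.4, 0.2)
Q  = (1.5, 1.0, 0)

def cosine(x, y):
    """Cosine correlation between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(f"cos(Q, D1) = {cosine(Q, D1):.3f}")  # ~0.868
print(f"cos(Q, D2) = {cosine(Q, D2):.3f}")  # ~0.966, so D2 ranks higher
```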
 
Term Weights
TF-IDF Weight
Term frequency weight measures importance in the document:
  tf_ik = f_ik / Σ_k f_ik
Inverse document frequency measures importance in the collection:
  idf_k = log (N / n_k)
Some heuristic modifications (e.g., to ensure non-zero weights)
13
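A short sketch of these two weights under the definitions above; the three-document collection is invented for illustration:

```python
import math
from collections import Counter

# Toy collection, invented for illustration.
docs = [
    "tropical fish tank water",
    "tropical fish forum",
    "water tank maintenance",
]
N = len(docs)
tokenized = [d.split() for d in docs]

# n_k: number of documents containing term k.
doc_freq = Counter(t for toks in tokenized for t in set(toks))

def tf_idf(term, toks):
    f = Counter(toks)
    tf = f[term] / len(toks)            # tf_ik = f_ik / sum of f_ik in doc
    idf = math.log(N / doc_freq[term])  # idf_k = log(N / n_k)
    return tf * idf

print(tf_idf("tropical", tokenized[1]))  # ~0.135 (common term, low idf)
print(tf_idf("forum", tokenized[1]))     # ~0.366 (rare term, high idf)
```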
 
Vector Space Model
Advantages
  Simple computational framework for ranking
  Any similarity measure or term weighting scheme could be used
Disadvantages
  Assumption of term independence
  No assumption on whether relevance is binary/multivalued
14
 
Probabilistic Models
According to [Greiff 02]:
  In probabilistic approaches to IR, the occurrence of a query term in a document D contributes to the probability that D will be judged as relevant
  The weight assigned to a query term should be based on the expected value of that contribution
15
[Greiff 02] Greiff, et al. The Role of Variance in Term Weighting for Probabilistic Information Retrieval. In Proc. of Intl. Conf. on Information and Knowledge Management (CIKM). 2002.
IR as Classification
[Figure: the Bayes decision rule viewed as classifying a document D into the relevant or non-relevant class via the conditional probabilities P(R | D) and P(NR | D)]
16
 
Bayes Classifier
Bayes Decision Rule
  A document D is relevant if P(R | D) > P(NR | D), where P(R | D) and P(NR | D) are conditional probabilities
Estimating probabilities
  Use Bayes Rule:  P(R | D) = P(D | R) P(R) / P(D)
Classify a document as relevant if
  P(D | R) / P(D | NR) > P(NR) / P(R)
  L.H.S. is the likelihood ratio of D being relevant
  (Based on the Bayes Decision Rule P(R | D) > P(NR | D))
17
 
Estimating P(D | R)
Assume word independence (Naïve Bayes assumption) and use individual term/word (d_i) probabilities
Binary (weights in doc) independence (of word) model
  Document represented by a vector of binary features indicating term occurrence (or non-occurrence)
  p_i is the probability that term i occurs in a relevant document,
  s_i is the probability of occurrence of i in a non-relevant document
Example. Given a document representation (1, 0, 0, 1, 1), the probability of the document occurring in the relevant set is
  p_1 × (1 − p_2) × (1 − p_3) × p_4 × p_5
18
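A small sketch of this product; the p_i values here are invented for illustration:

```python
# Binary independence: P(D | R) for D = (1, 0, 0, 1, 1).
# The per-term probabilities p_i are invented for illustration.
p = [0.6, 0.1, 0.2, 0.7, 0.5]   # p_i = P(term i occurs | relevant)
D = [1, 0, 0, 1, 1]             # binary document representation

prob = 1.0
for d_i, p_i in zip(D, p):
    # A present term contributes p_i; an absent term contributes (1 - p_i).
    prob *= p_i if d_i else (1 - p_i)

print(prob)  # 0.6 * 0.9 * 0.8 * 0.7 * 0.5 = 0.1512
```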
 
Binary Independence Model
Computing the likelihood ratio of D using p_i and s_i:
  P(D | R) / P(D | NR) = ∏_{i: d_i=1} (p_i / s_i) × ∏_{i: d_i=0} ((1 − p_i) / (1 − s_i))
(The second product, over words not in D, can be rearranged so that the part common to all documents drops out of the ranking)
19
 
Binary Independence Model
Scoring function is
  Σ_{i: d_i = q_i = 1} log ( p_i (1 − s_i) / (s_i (1 − p_i)) )
Using log to avoid multiplying lots of small numbers
If there is no other information on the relevant set, i.e., neither R nor NR exists, then
  p_i is set to be a constant (= 0.5)
  s_i is approximated as n_i / N, where n_i is the number of documents in the collection N that include i
  which gives a weight of log ((N − n_i) / n_i), similar to the IDF weight
20
 
Contingency Table, If (non-)Relevant Sets available
[Contingency table of term occurrence vs. relevance counts: r_i, R − r_i, n_i − r_i, N − n_i − R + r_i]
Gives the scoring function:
  Σ_i log ( ((r_i + 0.5) / (R − r_i + 0.5)) / ((n_i − r_i + 0.5) / (N − n_i − R + r_i + 0.5)) )
where
  r_i = number of relevant documents containing term i
  n_i = number of documents in a collection N containing i
  N = number of documents in the collection
  R = number of relevant documents in the collection
21
 
BM25 Ranking Algorithm
A ranking algorithm based on probabilistic arguments and experimental validation, but it is not a formal model
Adds document D and query Q term weights:
  Σ_{i∈Q} log ( ((r_i + 0.5)/(R − r_i + 0.5)) / ((n_i − r_i + 0.5)/(N − n_i − R + r_i + 0.5)) ) × ((k_1 + 1) f_i) / (K + f_i) × ((k_2 + 1) qf_i) / (k_2 + qf_i)
f_i is the frequency of term i in a document D
k_1, a constant, plays the role of tf along with f_i
qf_i is the frequency of term i in query Q, much lower than f_i
k_2 plays the same role as k_1 but is applied to the query term weight qf_i
K = k_1 ((1 − b) + b · dl / avdl), where dl is the document length & avdl is the average length of a doc, regulated by b (0 ≤ b ≤ 1)
k_1, k_2 & b are set empirically as 1.2, 0..1000, & 0.75, resp.
22
 
BM25 Example
Given a query with two terms, "president lincoln" (qf = 1)
No relevance information (r_i and R are zero)
N = 500,000 documents
"president" occurs in 40,000 documents (n_1 = 40,000)
"lincoln" occurs in 300 documents (n_2 = 300)
"president" occurs 15 times in doc D (f_1 = 15)
"lincoln" occurs 25 times in doc D (f_2 = 25)
document length is 90% of the average length (dl / avdl = 0.9)
k_1 = 1.2, b = 0.75, and k_2 = 100
K = 1.2 × (0.25 + 0.75 × 0.9) = 1.11
23
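A sketch that plugs these numbers into the BM25 formula above. With r_i = R = 0 the relevance part reduces to an IDF-like log term:

```python
import math

# BM25 with no relevance information (r_i = R = 0), using natural log.
N = 500_000
k1, k2, b = 1.2, 100, 0.75
dl_over_avdl = 0.9
K = k1 * ((1 - b) + b * dl_over_avdl)   # 1.2 * (0.25 + 0.75*0.9) = 1.11

def bm25_term(n_i, f_i, qf_i):
    idf_part = math.log((N - n_i + 0.5) / (n_i + 0.5))
    tf_part = (k1 + 1) * f_i / (K + f_i)
    qtf_part = (k2 + 1) * qf_i / (k2 + qf_i)
    return idf_part * tf_part * qtf_part

score = bm25_term(40_000, 15, 1) + bm25_term(300, 25, 1)  # president + lincoln
print(round(score, 2))  # ~20.6; "lincoln" dominates because it is much rarer
```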
 
BM25 Example
[Figure: BM25 scores computed for the "president lincoln" example]
24
 
BM25 Example
Effect of term frequencies, especially on "lincoln"
The large occurrence of a single important query term can yield a higher score than the occurrence of both query terms
25
 
Language Model (LM)
A language model is associated with each doc & has been successfully applied to search applications to represent the topical content of a document & query
LM is a probability distribution over sequences of words to represent docs, i.e., puts a probability measure over strings from some vocabulary V: Σ_{s∈V*} P(s) = 1
The original and basic method for using LMs in IR is the query likelihood model, in which a document d is constructed from a language model M_d, i.e.,
  Estimate a LM for each document d
  Rank documents by the likelihood of the query according to the estimated LM, i.e., P(Q | M_d)
26
 
LMs for Retrieval
Three possible models of topical relevance:
Probability of generating the query text from a document (language) model, i.e., the query likelihood model
  P_mle(Q | M_d) = ∏_{t∈Q} P_mle(t | M_d), where P_mle(t | M_d) = tf_{t,d} / dl_d
Probability of generating the document text from a query (language) model, i.e.,
  P_mle(D | M_q) = P_mle(D | R), where R is the set of relevant documents of a query q
Comparing the language models representing the query and document topics
27
 
LMs for Retrieval
Three ways of developing the LM modeling approach:
a) The Query Likelihood Model: P(Q | M_d) or P(Q | D)
b) The Document Likelihood Model: P(D | M_Q) or P(D | R)
c) Comparison of the query and document LMs using KL-Divergence
28
 
LMs for Retrieval
The Query Likelihood Model
  Documents are ranked based on the probability that the query Q could be generated by the document model (i.e., on the same topic) based on doc D, i.e., P(Q | D)
  Given a query Q = q_1 … q_n and a document D = d_1 … d_m, compute the conditional probability P(D | Q):
    P(D | Q) ∝ P(Q | D) P(D)
  where P(Q | D) is the query likelihood given D & P(D) is the prior belief that D is relevant to any query (assumed to be uniform)
29
 
Estimating Probabilities
Obvious estimate for unigram probabilities is the maximum likelihood estimate (mle):
  P(q_i | D) = f_{q_i,D} / |D|
Makes the observed value of f_{q_i,D} most likely
If any query words are missing from the document, the score will be zero
  Missing 1 out of 4 query words same as missing 3 out of 4
30
 
LMs for Retrieval
Example. The Query Likelihood Model
Given the two (unigram) LMs, M1 and M2, as shown below:

Q:  Frog   said   that   toad   likes  that   dog
M1: 0.01   0.03   0.04   0.01   0.02   0.04   0.005
M2: 0.0002 0.03   0.04   0.0001 0.04   0.04   0.01

P(Q | M1) = 0.00000000000048
P(Q | M2) = 0.00000000000000034
P(Q | M1) > P(Q | M2)
31
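A quick check of these two likelihoods, multiplying the per-term probabilities straight from the table:

```python
# Per-term probabilities for Q = "Frog said that toad likes that dog".
M1 = [0.01, 0.03, 0.04, 0.01, 0.02, 0.04, 0.005]
M2 = [0.0002, 0.03, 0.04, 0.0001, 0.04, 0.04, 0.01]

def query_likelihood(probs):
    p = 1.0
    for x in probs:
        p *= x
    return p

print(query_likelihood(M1))  # 4.8e-13
print(query_likelihood(M2))  # 3.84e-16, so M1 is the better model for Q
```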
 
The Language Models
Example. Using language models, we estimate whether "Barbie car" (i.e., treated as Q) is generated from content (i.e., treated as M) targeting younger audiences or adults

Words     P(w) on Simple Wiki   P(w) on Wikipedia
Blue      0.0042                0.0043
Car       0.0039                0.0048
Barbie    0.0056                0.0002
Auburn    0.00000001            0.0439
Software  0.0003                0.0756
Toy       0.0076                0.0037

P(Simple Wiki | "Barbie car") ∝ P("Barbie car" | Simple Wiki) = 0.0056 × 0.0039 = 0.00002184
P(Wikipedia | "Barbie car") ∝ P("Barbie car" | Wikipedia) = 0.0048 × 0.0002 = 0.00000096
Based on the results, it is more likely that "Barbie car" is generated from content written for younger audiences
32
 
LMs for Retrieval
The Document Likelihood Model
  Instead of using the probability of a document language model (M_d) generating the query, use the probability of a query language model (M_q) generating the document
  Creating a document likelihood model is less appealing, since there is much less text available to estimate a LM based on the query text
  Solution: incorporate relevance feedback into the model & update the language model M_q
  e.g., the relevance model of Lavrenko & Croft (2001) is an instance of a document likelihood model that incorporates pseudo relevance feedback into an LM approach
33
 
Language Model (LM) - Example
Consider the query Q = "Frozen Movie" & the following documents, where R is relevant & NR is non-relevant:
  "Frozen Character Anna" (R), "Disney Movie Elsa" (R), "Frozen Disney Sven" (R), "Frozen Food from Fridge" (NR)
The relevant set yields the query language model M_q:

Word       P(w)
Frozen     2/9
Character  1/9
Anna       1/9
Disney     2/9
Movie      1/9
Elsa       1/9
Sven       1/9
34
 
Language Model (LM) - Example
Using the following query language model (M_q), we estimate the probability of docs D1 and D2 on their relevance for query Q = "Frozen Movie"

D1: "Sven, Anna, Disney, Frozen"
  P(D1 | M_q) = ∏_{t∈D1} P(t | M_q) = 1/9 × 1/9 × 2/9 × 2/9 = 4/6561 = 0.00061

D2: "Frozen ice cream"
  P(D2 | M_q) = ∏_{t∈D2} P(t | M_q) = 2/9 × 0 × 0 = 0
35
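A sketch reproducing this document-likelihood computation from the M_q table; terms outside the relevant set fall back to probability 0, which is exactly the zero-probability problem that smoothing later addresses:

```python
from fractions import Fraction

# Query language model M_q estimated from the relevant set (previous slide).
Mq = {"frozen": Fraction(2, 9), "character": Fraction(1, 9), "anna": Fraction(1, 9),
      "disney": Fraction(2, 9), "movie": Fraction(1, 9), "elsa": Fraction(1, 9),
      "sven": Fraction(1, 9)}

def doc_likelihood(doc):
    p = Fraction(1)
    for t in doc.lower().split():
        p *= Mq.get(t, Fraction(0))   # unseen word -> probability 0
    return p

print(float(doc_likelihood("Sven Anna Disney Frozen")))  # 4/6561 ~ 0.00061
print(float(doc_likelihood("Frozen ice cream")))         # 0.0
```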
 
LMs for Retrieval
Comparing the Query Likelihood Model & Document Likelihood Model
  Rather than generating either a document language model (M_d) or a query language model (M_q), create an LM from both the document & query & find the difference
  To model the risk of returning a document d as relevant to a query q, use the KL-divergence:
    R(d; q) = KL(M_d || M_q) = Σ_{t∈V} P(t | M_q) log ( P(t | M_q) / P(t | M_d) )
  KL-divergence measures how bad the probability distribution M_q is at modeling M_d
  It has been shown that the comparison model outperforms both query-likelihood & document-likelihood models and is useful for ad hoc retrieval
36
 
KL-Divergence
Example. Consider the probability distributions generated using texts from Simple Wikipedia (a simplified version of Wikipedia) and Wikipedia
Probabilities generated using text targeting younger versus more mature audiences:

Words     P(w) based on Simple Wiki   P(w) based on Wikipedia
Blue      0.0042                      0.0043
Car       0.0039                      0.0048
Barbie    0.0056                      0.0002
Auburn    0.00000001                  0.0439
Software  0.0003                      0.0756
Toy       0.0076                      0.0037
37
 
KL-Divergence
Example (Cont.). Using K-L Divergence on the vocabulary (base-10 logs):
KL(Wikipedia || Simple Wiki)
  = 0.0043 × log (0.0043/0.0042) + 0.0048 × log (0.0048/0.0039)
  + 0.0002 × log (0.0002/0.0056) + 0.0439 × log (0.0439/0.00000001)
  + 0.0756 × log (0.0756/0.0003) + 0.0037 × log (0.0037/0.0076)
  = 0.47218
The vocabulary distribution for children versus more mature audiences is indeed different
38
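A sketch reproducing this number; the slide's arithmetic works out with base-10 logarithms:

```python
import math

# P(w | Wikipedia) and P(w | Simple Wiki) for Blue, Car, Barbie, Auburn, Software, Toy.
wiki   = [0.0043, 0.0048, 0.0002, 0.0439, 0.0756, 0.0037]
simple = [0.0042, 0.0039, 0.0056, 0.00000001, 0.0003, 0.0076]

# KL(Wikipedia || Simple Wiki) over this six-word vocabulary.
kl = sum(p * math.log10(p / q) for p, q in zip(wiki, simple))
print(round(kl, 5))  # 0.47218
```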
 
Smoothing
Document texts are a sample from the language model
  Missing words (in a document) should not have zero probability of occurring
Smoothing is a technique for estimating probabilities for missing (or unseen) words, i.e., the words that are not seen in the text
  Lower (or discount) the probability estimates for words that are seen in the document text
  Assign that "left-over" probability to the estimates for unseen words
39
 
Smoothing
As stated in [Zhai 04], smoothing is the problem of adjusting the maximum likelihood estimator to compensate for data sparseness, and it is required to avoid assigning zero probability to unseen words
Smoothing accuracy is directly related to retrieval performance:
  The retrieval performance is generally sensitive to smoothing parameters
Smoothing plays two different roles:
  Making the estimated document LM more accurate
  Explaining the non-informative words in the query
40
 
Estimating Probabilities
Estimate for unseen words is α_D P(q_i | C)
  P(q_i | C) is the probability for query word i in the collection language model for collection C (background probability)
  α_D, which can be a constant, is a coefficient of the probability assigned to unseen words, depending on the document D
Estimate for words that occur is
  (1 − α_D) P(q_i | D) + α_D P(q_i | C)
Different forms of estimation come from different α_D
41
 
Estimating Probabilities
Different forms of estimation come from different α_D
  p(w | d) = p_s(w | d) if word w is seen in d, and α_d p(w | C) otherwise
where
  p_s(w | d) is the smoothed probability of a word seen in d,
  p(w | C) is the collection language model,
  α_d is a coefficient controlling the probability mass assigned to unseen words,
  all probabilities sum to one
42
 
Jelinek-Mercer Smoothing
α_D is a constant, λ (= 0.1 for short queries or = 0.7 for long queries in TREC evaluations)
Gives estimate of
  p(q_i | D) = (1 − λ) f_{q_i,D} / |D| + λ c_{q_i} / |C|
where c_{q_i} is the number of occurrences of q_i in the collection and |C| is the total number of word occurrences in the collection
Ranking score
  log P(Q | D) = Σ_{i=1..n} log ( (1 − λ) f_{q_i,D} / |D| + λ c_{q_i} / |C| )
Use logs for convenience: solves the accuracy problem of multiplying small numbers
43
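A minimal sketch of the JM-smoothed ranking score; the document and collection statistics reuse the numbers from the Query Likelihood example later in the chapter and are otherwise illustrative:

```python
import math

def jm_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.1):
    """log P(Q | D) with Jelinek-Mercer smoothing (lambda = 0.1 for short queries)."""
    score = 0.0
    for q in query:
        p_doc = doc_tf.get(q, 0) / doc_len      # f_{qi,D} / |D|
        p_coll = coll_tf.get(q, 0) / coll_len   # c_{qi} / |C|
        score += math.log((1 - lam) * p_doc + lam * p_coll)
    return score

# Statistics borrowed from the "President Lincoln" example for illustration.
doc_tf = {"president": 15, "lincoln": 25}
coll_tf = {"president": 160_000, "lincoln": 2_400}
print(jm_score(["president", "lincoln"], doc_tf, 1_800, coll_tf, 10**9))  # ~ -9.27
```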
 
Where is tf-idf Weight?
Rewriting the JM-smoothed score:
  log P(Q | D) = Σ_{i: f_{q_i,D} > 0} log ( ((1 − λ) f_{q_i,D}/|D| + λ c_{q_i}/|C|) / (λ c_{q_i}/|C|) ) + Σ_{i=1..n} log (λ c_{q_i}/|C|)
The second sum is the same for all documents and can be ignored for ranking
The remaining term is proportional to the term frequency (TF), inversely proportional to the collection frequency (IDF)
44
 
Dirichlet Smoothing
α_D depends on document length:
  α_D = μ / (|D| + μ)
where μ is a parameter value set empirically
Gives probability estimation of
  p(q_i | D) = (f_{q_i,D} + μ c_{q_i} / |C|) / (|D| + μ)
and document score
  log P(Q | D) = Σ_{i=1..n} log ( (f_{q_i,D} + μ c_{q_i} / |C|) / (|D| + μ) )
45
 
Query Likelihood
Example. Given the Dirichlet-smoothed estimate of P(q_i | D) above,
Let Q be "President Lincoln", D (a doc) in C(ollection)
For the term "President", let f_{q_i,D} = 15, c_{q_i} = 160,000
For the term "Lincoln", let f_{q_i,D} = 25, c_{q_i} = 2,400
Let the no. of word occurrences in D (i.e., |D|) be 1,800
Let the no. of word occurrences in C (i.e., |C|) be 10^9 = 500,000 (docs) × 2,000 (average no. of words in a doc)
μ = 2,000

log P(Q | D) = log ((15 + 2,000 × 160,000/10^9) / (1,800 + 2,000)) + log ((25 + 2,000 × 2,400/10^9) / (1,800 + 2,000))
(Negative due to summing logs of small numbers)
46
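A sketch that evaluates this example with the Dirichlet-smoothed score (μ = 2,000):

```python
import math

def dirichlet_score(query_stats, doc_len, coll_len, mu=2000):
    """log P(Q | D) with Dirichlet smoothing; query_stats maps term -> (f_qiD, c_qi)."""
    score = 0.0
    for f_qiD, c_qi in query_stats.values():
        score += math.log((f_qiD + mu * c_qi / coll_len) / (doc_len + mu))
    return score

stats = {"president": (15, 160_000), "lincoln": (25, 2_400)}
print(round(dirichlet_score(stats, doc_len=1_800, coll_len=10**9), 2))  # ~ -10.54
```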
 
Query Likelihood
[Figure: query likelihood scores for the "President Lincoln" example]
47
 
Set Theoretic Models
The Boolean model imposes a binary criterion for deciding relevance
The question of how to extend the Boolean model to accommodate partial matching and ranking has attracted considerable attention in the past
Two set theoretic models for this extension are
  Fuzzy Set Model
  Extended Boolean Model
48
 
Fuzzy Queries
Binary queries are based on binary logic, in which a document either satisfies or does not satisfy a query.
Fuzzy queries are based on fuzzy logic, in which a query term is considered as a fuzzy set & each document is given a degree of relevance with respect to a query, usually [0, 1].
Example.
  In binary logic: the set of "tall people" is defined as those > 6 feet. Thus, a person 5' 9.995" is not considered tall.
  In fuzzy logic: there is no sharp cutoff for "tall people". Every person is given a degree of membership to the set of tall people, e.g.,
    a person 7'0" will have a grade 1.0
    a person 4'5" will have a grade 0.0
    a person 6'2" will have a grade 0.85
    a person 5'8" will have a grade 0.7
49
 
Fuzzy Queries
Binary Logic vs. Fuzzy Logic
[Figure: membership in the set TALL as a step function at 6.5 feet under binary (crisp) logic versus a graded ramp between 4.5 and 6.5 feet under fuzzy logic]
50
Fuzzy Set Model
Queries and documents are represented by sets of index terms & matching is approximate from the start
This vagueness can be modeled using a fuzzy framework, as follows:
  with each (query) term is associated a fuzzy set
  each document has a degree of membership (0 ≤ X ≤ 1) in this fuzzy set
This interpretation provides the foundation for many models for IR based on fuzzy theory
The model proposed by Ogawa, Morita & Kobayashi (1991) is discussed
51
Fuzzy Information Retrieval
Fuzzy sets are modeled based on a thesaurus
This thesaurus is built as follows:
  Let c be a term-term correlation matrix (or keyword connection matrix)
  Let c(i, l) be a normalized correlation factor for (k_i, k_l):
    c(i, l) = n(i, l) / (n_i + n_l − n(i, l))
  n_i: number of documents which contain k_i
  n_l: number of documents which contain k_l
  n(i, l): number of documents which contain both k_i and k_l
The notion of proximity among indexed terms
52
Fuzzy Information Retrieval
The correlation factor c(i, l) can be used to define fuzzy set membership for a document d_j in the fuzzy set for query term k_i as follows:
  μ(i, j) = 1 − ∏_{k_l ∈ d_j} (1 − c(i, l))
μ(i, j): the degree of membership of document d_j in the fuzzy subset associated with k_i
The above expression computes an algebraic sum over all terms in d_j with respect to k_i, implemented as the complement of a negated algebraic product
A document d_j belongs to the fuzzy set for k_i if its own terms are associated with k_i
53
Fuzzy Information Retrieval
  μ(i, j) = 1 − ∏_{k_l ∈ d_j} (1 − c(i, l))
μ(i, j): membership of document d_j in the fuzzy subset associated with query term k_i
If document d_j contains any index term k_l which is closely related to query term k_i, we have
  if ∃ l such that c(i, l) ~ 1, then μ(i, j) ~ 1, and query index k_i is a good fuzzy index for document d_j
54
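A minimal sketch of this membership computation; the correlation factors c(i, l) are invented for illustration:

```python
# Fuzzy membership of document d_j in the fuzzy set of query term k_i:
#   mu(i, j) = 1 - prod over terms k_l in d_j of (1 - c(i, l)).
# Correlation factors below are invented for illustration.
c = {("fish", "tropical"): 0.6, ("fish", "aquarium"): 0.8, ("fish", "water"): 0.3}

def membership(k_i, doc_terms):
    prod = 1.0
    for k_l in doc_terms:
        prod *= 1 - c.get((k_i, k_l), 0.0)   # unrelated terms contribute factor 1
    return 1 - prod

# One strongly correlated term is enough to push membership close to 1.
print(membership("fish", ["tropical", "aquarium", "water"]))  # 0.944
```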
Fuzzy IR
Example.
  q = k_a ∧ (k_b ∨ ¬k_c)
  q_dnf = (1,1,1) + (1,1,0) + (1,0,0)   (binary weighted)
        = cc_1 + cc_2 + cc_3            (conjunctive components)
  μ(q, d_j) = μ(cc_1 + cc_2 + cc_3, j)
            = 1 − (1 − μ(a, j) μ(b, j) μ(c, j))
                × (1 − μ(a, j) μ(b, j) (1 − μ(c, j)))
                × (1 − μ(a, j) (1 − μ(b, j)) (1 − μ(c, j)))
55
Fuzzy Information Retrieval
Fuzzy sets are useful for representing vagueness and imprecision
Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
Experiments with standard test collections are not available
Difficult to compare because previously conducted experiments used only small collections
56