Introduction to Information Retrieval by Dr. Huda Abdulaali 2017-2018

undefined
1
L
e
c
t
u
r
e
 
1
:
 
I
n
t
r
o
d
u
c
t
i
o
n
 
a
n
d
 
t
h
e
 
B
o
o
l
e
a
n
 
M
o
d
e
l
I
n
f
o
r
m
a
t
i
o
n
 
R
e
t
r
i
e
v
a
l
B
y
D
r
.
 
H
u
d
a
 
A
b
d
u
l
a
a
l
i
2
0
1
7
-
2
0
1
8
I
n
t
r
o
d
u
c
t
i
o
n
 
t
o
 
I
n
f
o
r
m
a
t
i
o
n
 
R
e
t
r
i
e
v
a
l
.
 
M
a
n
n
i
n
g
 
e
t
 
a
l
.
,
2
I
n
f
o
r
m
a
t
i
o
n
 
r
e
t
r
i
e
v
a
l
 
(
I
R
)
 
i
s
 
f
i
n
d
i
n
g
 
m
a
t
e
r
i
a
l
 
.
 
.
 
.
 
o
f
 
a
n
 
u
n
s
t
r
u
c
t
u
r
e
d
n
a
t
u
r
e
 
.
 
.
 
.
 
t
h
a
t
 
s
a
t
i
s
f
i
e
s
 
a
n
 
i
n
f
o
r
m
a
t
i
o
n
 
n
e
e
d
 
f
r
o
m
 
w
i
t
h
i
n
 
l
a
r
g
e
c
o
l
l
e
c
t
i
o
n
s
 
.
 
.
 
.
 
.
I
N
T
R
O
D
U
C
T
I
O
N
»
Information retrieval (IR) is finding material
(usually documents)
»
of an unstructured nature . . . that satisfies an
information need
»
from within large collections (usually stored on
computers).
3
D
o
c
u
m
e
n
t
 
C
o
l
l
e
c
t
i
o
n
:
 
t
e
x
t
 
u
n
i
t
s
 
w
e
 
h
a
v
e
 
b
u
i
l
t
 
a
n
 
I
R
s
y
s
t
e
m
.
Usually documents
But could be
book chapters
paragraphs
scenes of a movie
turns in a conversation...
I
N
T
R
O
D
U
C
T
I
O
N
4
S
t
r
u
c
t
u
r
e
d
 
v
s
 
U
n
s
t
r
u
c
t
u
r
e
d
 
D
a
t
a
 
structured data
 refers to information with a high
degree of organization, such that inclusion in a relational
database is seamless and readily searchable by simple,
straightforward search engine algorithms or other search
operations;
 
unstructured data
 is essentially the opposite.
I
N
T
R
O
D
U
C
T
I
O
N
5
I
n
f
o
r
m
a
t
i
o
n
 
N
e
e
d
s
 
a
n
d
 
R
e
l
e
v
a
n
c
e
An 
information need 
is the topic about which the user
desires 
to know more about.
A 
query
 is what the user conveys to the computer in
an
attempt to communicate the information need.
A 
document is relevant 
if the user perceives that it
contains information of value with respect to their
personal information need.
I
N
T
R
O
D
U
C
T
I
O
N
6
The field of IR also covers supporting users in browsing
or filtering document collections or further processing a
set of retrieved documents.
Given a set of documents, clustering is the task of coming
up with a good grouping of the documents based on their
contents.
I
N
T
R
O
D
U
C
T
I
O
N
7
In 
web search
, the system has to provide search over billions of
documents stored on millions of computers. Distinctive issues are
needing to gather documents for indexing, being able to build
systems that work efficiently
In 
personal information retrieval
, 
operating systems have
integrated information retrieval (such as Apple’s Mac OS X
Spotlight or Windows Vista’s Instant Search). Email programs
usually not only provide search but also text classification: they
at least provide a spam (junk mail) filter, and commonly also
provide either manual or automatic means for classifying mail
so that it can be placed directly into particular 
folders
I
N
T
R
O
D
U
C
T
I
O
N
In 
Enterprise 
and institutional search e.g company’s
documentation, patents, research articles often domain-specific,
centralized storage; dedicated machines for search        
 
.
8
A
 
s
h
o
r
t
 
h
i
s
t
o
r
y
 
o
f
 
I
R
9
B
o
o
l
e
a
n
 
R
e
t
r
i
e
v
a
l
In the Boolean retrieval model we can pose any query in
the form of a Boolean expression of term i.e., one in which
terms are combined with the operators and, or, and not.
Shakespeare example
The Boolean model of information retrieval (BIR) is a
classical information retrieval (IR) model and, at the same
time, the first and most adopted one. It is used by many IR
systems to this day
»
An index term is either present(1) or absent(0) in the
document
»
All index terms provide equal evidence with respect to
information needs.
»
Queries are Boolean combinations of index terms.
»
X AND Y: represents doc that contains both X and Y
»
X OR Y: represents doc that contains either X or Y
»
NOT X: represents the doc that do not contain X
10
10
Basic Assumption of Boolean Model
11
11
A
n
 
e
x
a
m
p
l
e
 
i
n
f
o
r
m
a
t
i
o
n
 
r
e
t
r
i
e
v
a
l
 
p
r
o
b
l
e
m
B
r
u
t
u
s
 
A
N
D
 
C
a
e
s
a
r
 
A
N
D
 
N
O
T
 
C
a
l
p
u
r
n
i
a
fat book that many people own is 
Shakespeare’s Collected
Works
. Suppose you wanted to determine which plays of
Shakespeare contain the words Brutus and Caesar and not
Calpurnia. One way to do that is to start at the beginning and
to read through all the text
,
The simplest form of document retrieval is for a computer
to do this sort of linear scan through documents. This
process is 
commonly referred to as 
GREPPING
 through
text
12
12
GREP
 is a command-line utility for searching plain-text data sets
for lines that match a regular expression. Its name comes from
the command g/re/p (globally search a regular expression and
print), which has the same effect: doing a global search with the
regular expression and printing all matching lines. 
GREP
 was
originally developed for the Unix operating system, but later
available for all Unix-like systems
.
A
n
 
e
x
a
m
p
l
e
 
i
n
f
o
r
m
a
t
i
o
n
 
r
e
t
r
i
e
v
a
l
 
p
r
o
b
l
e
m
13
13
1. To process large document collections quickly. The amount of
online data has grown at least as quickly as the speed of computers,
and we would now like to be able to search collections that total in
the order of billions to trillions of words.
2. To allow more flexible matching operations. For example, it is
impractical to perform the query Romans near countrymen with
GREP, where near might be defined as “within 5 words” or “within
the same sentence.”
3. To allow ranked retrieval. In many cases, you want the best
answer to an information need among many documents that
contain certain words.
A
n
 
e
x
a
m
p
l
e
 
i
n
f
o
r
m
a
t
i
o
n
 
r
e
t
r
i
e
v
a
l
 
p
r
o
b
l
e
m
14
14
The way to avoid linearly scanning the texts for each query is to
index the documents in advance
The result is a binary term-document incidence matrix
the information retrieval literature normally speaks of terms
(NOT WORDS)
The result is a vector for each term
Retrieval model can be categorize as
Boolean retrieval model
Vector space model
Probabilistic model
Model based on belief net
A
n
 
e
x
a
m
p
l
e
 
i
n
f
o
r
m
a
t
i
o
n
 
r
e
t
r
i
e
v
a
l
 
p
r
o
b
l
e
m
15
15
r
e
s
u
l
t
 
o
f
 
b
i
n
a
r
y
 
t
e
r
m
-
d
o
c
u
m
e
n
t
i
n
c
i
d
e
n
c
e
 
m
a
t
r
i
x
A
n
 
e
x
a
m
p
l
e
 
i
n
f
o
r
m
a
t
i
o
n
 
r
e
t
r
i
e
v
a
l
 
p
r
o
b
l
e
m
Matrix element (
t, d
) is 1 if the play in column 
d
 contains
the word in row 
t
, and is 0 otherwise.
Main idea
: record for each document whether it contains each
word out of all the different words Shakespeare used.
16
16
We compute the results for our query as the bitwise AND
between vectors for Brutus, Caesar and complement (Calpurnia):
A
n
 
e
x
a
m
p
l
e
 
i
n
f
o
r
m
a
t
i
o
n
 
r
e
t
r
i
e
v
a
l
 
p
r
o
b
l
e
m
To answer the query Brutus and Caesar and not Calpurnia,
we take the vectors for Brutus, Caesar and Calpurnia,
complement the last, and then do a bitwise and:
110100 and 110111 and 101111 = 100100
The answers for this query are thus Antony and Cleopatra
and Hamlet
17
17
A
 
f
i
r
s
t
 
t
a
k
e
 
a
t
 
b
u
i
l
d
i
n
g
 
a
n
 
i
n
v
e
r
t
e
d
 
i
n
d
e
x
The inverted index consists of a dictionary of terms (also:
lexicon, vocabulary) and a postings list for each term, i.e., a list
that records which documents the term occurs in.
The inverted index data structure is a central component of a
typical search engine indexing algorithm. A goal of a search
engine implementation is to optimize the speed of the query: find
the documents where word X occurs.
To gain the speed benefits of indexing at retrieval time, we have
to build the index in advance.
18
18
A
 
f
i
r
s
t
 
t
a
k
e
 
a
t
 
b
u
i
l
d
i
n
g
 
a
n
 
i
n
v
e
r
t
e
d
 
i
n
d
e
x
19
19
A
 
f
i
r
s
t
 
t
a
k
e
 
a
t
 
b
u
i
l
d
i
n
g
 
a
n
 
i
n
v
e
r
t
e
d
 
i
n
d
e
x
20
20
A
 
f
i
r
s
t
 
t
a
k
e
 
a
t
 
b
u
i
l
d
i
n
g
 
a
n
 
i
n
v
e
r
t
e
d
 
i
n
d
e
x
Within a document collection,
we assume that each document
has a unique docID serial
number, known as the document
identifier (docID).
Then, collect the documents to
be indexed. The core indexing
step is sorting this list so that
the terms are alphabetical.
21
21
Multiple occurrences of the
same term from the same
document are then merged.
Instances of the same term are
then grouped, and the result is
split into a dictionary and
postings
The dictionary records some
statistics, such as the number of
document documents which
contain each term
The postings are secondarily
sorted by docID. This provides
the basis for efficient query
processing.
A
 
f
i
r
s
t
 
t
a
k
e
 
a
t
 
b
u
i
l
d
i
n
g
 
a
n
 
i
n
v
e
r
t
e
d
 
i
n
d
e
x
22
22
Slide Note
Embed
Share

Overview of information retrieval, including finding material in unstructured data collections, structured vs. unstructured data, information needs and relevance, and tasks such as clustering in the field of IR. It also touches on web search challenges, institutional search, and personal information retrieval system integration.

  • Information Retrieval
  • Dr. Huda Abdulaali
  • Structured Data
  • Unstructured Data
  • Web Search

Uploaded on Feb 25, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. 1 Lecture 1: Introduction and the Boolean Model Information Retrieval By Dr. Huda Abdulaali 2017-2018 Introduction to Information Retrieval. Manning et al.,

  2. INTRODUCTION Information retrieval (IR) is finding material . . . of an unstructured nature . . . that satisfies an information need from within large collections . . . . 2

  3. INTRODUCTION Information retrieval (IR) is finding material (usually documents) of an unstructured nature . . . that satisfies an information need from within large collections (usually stored on computers). Document Collection: text units we have built an IR system. Usually documents But could be book chapters paragraphs scenes of a movie turns in a conversation... 3

  4. INTRODUCTION Structured vs Unstructured Data structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and readily searchable by simple, straightforward search engine algorithms or other search operations; unstructured data is essentially the opposite. 4

  5. INTRODUCTION Information Needs and Relevance An information need is the topic about which the user desires to know more about. A query is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if the user perceives that it contains information of value with respect to their personal information need. 5

  6. INTRODUCTION The field of IR also covers supporting users in browsing or filtering document collections or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. 6

  7. INTRODUCTION In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues are needing to gather documents for indexing, being able to build systems that work efficiently In and institutional search e.g company s Enterprise documentation, patents, research articles often domain-specific, centralized storage; dedicated machines for search . In personal information retrieval, operating systems have integrated information retrieval (such as Apple s Mac OS X Spotlight or Windows Vista s Instant Search). Email programs usually not only provide search but also text classification: they at least provide a spam (junk mail) filter, and commonly also provide either manual or automatic means for classifying mail so that it can be placed directly into particular folders 7

  8. A short history of IR 8

  9. Boolean Retrieval The Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most adopted one. It is used by many IR systems to this day In the Boolean retrieval model we can pose any query in the form of a Boolean expression of term i.e., one in which terms are combined with the operators and, or, and not. Shakespeare example 9

  10. Basic Assumption of Boolean Model An index term is either present(1) or absent(0) in the document All index terms provide equal evidence with respect to information needs. Queries are Boolean combinations of index terms. XAND Y: represents doc that contains both X and Y X OR Y: represents doc that contains either X or Y NOT X: represents the doc that do not contain X 10

  11. An example information retrieval problem Brutus AND Caesar AND NOT Calpurnia fat book that many people own is Shakespeare s Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus and Caesar and not Calpurnia. One way to do that is to start at the beginning and to read through all the text, The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly referred to as GREPPING through text 11

  12. An example information retrieval problem GREP is a command-line utility for searching plain-text data sets for lines that match a regular expression. Its name comes from the command g/re/p (globally search a regular expression and print), which has the same effect: doing a global search with the regular expression and printing all matching lines. GREP was originally developed for the Unix operating system, but later 12 available for all Unix-like systems.

  13. An example information retrieval problem 1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words. 2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans near countrymen with GREP, where near might be defined as within 5 words or within the same sentence. 3. To allow ranked retrieval. In many cases, you want the best answer to an information need among many documents that contain certain words. 13

  14. An example information retrieval problem The way to avoid linearly scanning the texts for each query is to index the documents in advance The result is a binary term-document incidence matrix the information retrieval literature normally speaks of terms (NOT WORDS) The result is a vector for each term Retrieval model can be categorize as Boolean retrieval model Vector space model Probabilistic model 14 Model based on belief net

  15. An example information retrieval problem result of binary term-document incidence matrix Main idea: record for each document whether it contains each word out of all the different words Shakespeare used. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise. 15

  16. An example information retrieval problem We compute the results for our query as the bitwise AND between vectors for Brutus, Caesar and complement (Calpurnia): To answer the query Brutus and Caesar and not Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise and: 110100 and 110111 and 101111 = 100100 The answers for this query are thus Antony and Cleopatra and Hamlet 16

  17. A first take at building an inverted index The inverted index consists of a dictionary of terms (also: lexicon, vocabulary) and a postings list for each term, i.e., a list that records which documents the term occurs in. The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. 17

  18. A first take at building an inverted index 18

  19. A first take at building an inverted index 19

  20. A first take at building an inverted index Within a document collection, we assume that each document has a unique docID serial number, known as the document identifier (docID). Then, collect the documents to be indexed. The core indexing step is sorting this list so that the terms are alphabetical. 20

  21. A first take at building an inverted index Multiple same document Instances of the same term are then grouped, and the result is split into a postings The dictionary records some statistics, such as the number of document documents contain each term The postings are secondarily sorted by docID. This provides the basis for efficient query processing. occurrences term from are of the the same then merged. dictionary and which 21

  22. 22

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#