Introduction to Information Retrieval
Introducing Information Retrieval and Web Search
Information Retrieval

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

These days we frequently think first of web search, but there are many other cases:
- E-mail search
- Searching your laptop
- Corporate knowledge bases
- Legal information retrieval
Unstructured (text) vs. structured (database) data in the mid-nineties

[Chart: data volume and market cap, unstructured vs. structured data]
Unstructured (text) vs. structured (database) data today

[Chart: data volume and market cap, unstructured vs. structured data]
Basic assumptions of Information Retrieval (Sec. 1.1)

Collection: A set of documents
- Assume it is a static collection for the moment

Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task
The classic search model

[Diagram: User task ("Get rid of mice in a politically correct way") leads, possibly with a misconception, to an Info need ("Info about removing mice without killing them"), which leads, possibly with a misformulation, to a Query ("how trap mice alive"). The query goes to the Search engine, which runs over the Collection and returns Results; query refinement feeds back from the results to the query.]
How good are the retrieved docs? (Sec. 1.1)

Precision: Fraction of retrieved docs that are relevant to the user's information need
Recall: Fraction of relevant docs in the collection that are retrieved
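To make the two measures concrete, here is a minimal Python sketch; the document IDs and relevance judgments are invented for the example:

```python
# Toy illustration of precision and recall.
# 'retrieved' is what the engine returned; 'relevant' is the ground truth
# for the user's information need (both invented for this example).
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}

true_positives = retrieved & relevant               # relevant docs we actually returned
precision = len(true_positives) / len(retrieved)    # 2/4 = 0.50
recall = len(true_positives) / len(relevant)        # 2/3 ≈ 0.67

print(f"precision={precision:.2f}, recall={recall:.2f}")
```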
Term-document incidence matrices
Unstructured data in 1620 (Sec. 1.1)

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is that not the answer?
- Slow: for large corpora, a linear scan over the whole collection for every query is usually not feasible
- NOT Calpurnia is non-trivial
- Other operations (e.g., find the word Romans near countrymen) are not feasible
- Ranked retrieval (returning the best documents) is not supported
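To see how the grep-style approach plays out, here is a minimal sketch of that linear scan for Brutus AND Caesar but NOT Calpurnia; the play texts below are placeholders standing in for the full plays, and note that every new query has to reread the entire collection:

```python
# Naive linear scan: reread every play for every query.
# The texts here are tiny placeholders, not the real plays.
plays = {
    "Antony and Cleopatra": "... Brutus ... Caesar ...",
    "Julius Caesar": "... Brutus ... Caesar ... Calpurnia ...",
    "Hamlet": "... Brutus ... Caesar ...",
}

def matches(text: str) -> bool:
    # Brutus AND Caesar but NOT Calpurnia
    return "Brutus" in text and "Caesar" in text and "Calpurnia" not in text

answers = [title for title, text in plays.items() if matches(text)]
print(answers)  # ['Antony and Cleopatra', 'Hamlet']
```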
Term-document incidence matrices (Sec. 1.1)

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony               1                  1             0          0       0        1
Brutus               1                  1             0          1       0        0
Caesar               1                  1             0          1       1        1
Calpurnia            0                  1             0          0       0        0
Cleopatra            1                  0             0          0       0        0
mercy                1                  0             1          1       1        1
worser               1                  0             1          1       1        0

1 if play contains word, 0 otherwise

Brutus AND Caesar BUT NOT Calpurnia
Incidence vectors (Sec. 1.1)

So we have a 0/1 vector for each term.

To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), then bitwise AND them:

110100 AND 110111 AND 101111 = 100100
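The same computation as a minimal Python sketch, treating each term's row of the incidence matrix as a bit vector (leftmost bit = Antony and Cleopatra, rightmost bit = Macbeth):

```python
# 0/1 incidence vectors over the six plays, in the order:
# Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND NOT Calpurnia, done bitwise.
mask = (1 << 6) - 1                      # keep only the six play positions
result = brutus & caesar & (~calpurnia & mask)
print(f"{result:06b}")                   # 100100 -> Antony and Cleopatra, Hamlet
```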
Answers to query (Sec. 1.1)

Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
    When Antony found Julius Caesar dead,
    He cried almost to roaring; and he wept
    When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Bigger collections (Sec. 1.1)

Consider N = 1 million documents, each with about 1000 words.
- Average of 6 bytes/word, including spaces and punctuation
- About 6 GB of data in the documents

Say there are M = 500K distinct terms among these.
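Spelling out the arithmetic behind that figure: 10^6 documents × 1,000 words/document × 6 bytes/word = 6 × 10^9 bytes, i.e. roughly 6 GB of raw text.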
Can't build the matrix (Sec. 1.1)

A 500K x 1M matrix has half a trillion 0's and 1's.

But it has no more than one billion 1's, so the matrix is extremely sparse. Why?

What's a better representation?
- We only record the 1 positions.
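Why at most a billion 1's: each of the 10^6 documents contains about 1,000 word occurrences, so at most about 10^9 (term, document) cells can be 1; the remaining ~499 billion cells are 0. Below is a minimal sketch of recording only the 1 positions, one list of document IDs per term; the tiny corpus is invented for illustration:

```python
from collections import defaultdict

# Record only the 1 positions: for each term, the sorted list of
# documents that contain it. The corpus below is made up for the example.
docs = {
    1: "brutus killed caesar",
    2: "caesar praised calpurnia",
    3: "brutus and caesar at philippi",
}

term_to_docs = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        term_to_docs[term].append(doc_id)

print(term_to_docs["brutus"])   # [1, 3]
print(term_to_docs["caesar"])   # [1, 2, 3]
```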
Term-document incidence matrices