Introduction to Information Retrieval
Introducing Information Retrieval and Web Search
Information Retrieval

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

These days we frequently think first of web search, but there are many other cases:
- E-mail search
- Searching your laptop
- Corporate knowledge bases
- Legal information retrieval
Unstructured (text) vs. structured (database) data in the mid-nineties

[Chart: data volume and market cap, unstructured vs. structured data]
Unstructured (text) vs. structured (database) data today

[Chart: data volume and market cap, unstructured vs. structured data]
Basic assumptions of Information Retrieval (Sec. 1.1)

Collection: A set of documents
- Assume it is a static collection for the moment

Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task
The classic search model

[Diagram: User task ("Get rid of mice in a politically correct way") leads, possibly with a misconception, to an Info need ("Info about removing mice without killing them"), which leads, possibly with a misformulation, to a Query ("how trap mice alive"). The query goes to the Search engine, which runs over the Collection and returns Results; query refinement feeds back from the results to the query.]
How good are the retrieved docs? (Sec. 1.1)

Precision: Fraction of retrieved docs that are relevant to the user's information need
Recall: Fraction of relevant docs in the collection that are retrieved
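To make the two measures concrete, here is a minimal Python sketch; the document IDs and relevance judgments are invented for the example:

```python
# Toy illustration of precision and recall.
# 'retrieved' is what the engine returned; 'relevant' is the ground truth
# for the user's information need (both invented for this example).
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}

true_positives = retrieved & relevant               # relevant docs we actually returned
precision = len(true_positives) / len(retrieved)    # 2/4 = 0.50
recall = len(true_positives) / len(relevant)        # 2/3 ≈ 0.67

print(f"precision={precision:.2f}, recall={recall:.2f}")
```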
Term-document incidence matrices
Unstructured data in 1620 (Sec. 1.1)

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is that not the answer?
- Slow: for large corpora, a linear scan over the whole collection for every query is usually not feasible
- NOT Calpurnia is non-trivial
- Other operations (e.g., find the word Romans near countrymen) are not feasible
- Ranked retrieval (returning the best documents) is not supported
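To see how the grep-style approach plays out, here is a minimal sketch of that linear scan for Brutus AND Caesar but NOT Calpurnia; the play texts below are placeholders standing in for the full plays, and note that every new query has to reread the entire collection:

```python
# Naive linear scan: reread every play for every query.
# The texts here are tiny placeholders, not the real plays.
plays = {
    "Antony and Cleopatra": "... Brutus ... Caesar ...",
    "Julius Caesar": "... Brutus ... Caesar ... Calpurnia ...",
    "Hamlet": "... Brutus ... Caesar ...",
}

def matches(text: str) -> bool:
    # Brutus AND Caesar but NOT Calpurnia
    return "Brutus" in text and "Caesar" in text and "Calpurnia" not in text

answers = [title for title, text in plays.items() if matches(text)]
print(answers)  # ['Antony and Cleopatra', 'Hamlet']
```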
Term-document incidence matrices (Sec. 1.1)

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony               1                  1             0          0       0        1
Brutus               1                  1             0          1       0        0
Caesar               1                  1             0          1       1        1
Calpurnia            0                  1             0          0       0        0
Cleopatra            1                  0             0          0       0        0
mercy                1                  0             1          1       1        1
worser               1                  0             1          1       1        0

1 if play contains word, 0 otherwise

Brutus AND Caesar BUT NOT Calpurnia
Incidence vectors (Sec. 1.1)

So we have a 0/1 vector for each term.

To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), then bitwise AND them:

110100 AND 110111 AND 101111 = 100100
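The same computation as a minimal Python sketch, treating each term's row of the incidence matrix as a bit vector (leftmost bit = Antony and Cleopatra, rightmost bit = Macbeth):

```python
# 0/1 incidence vectors over the six plays, in the order:
# Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND NOT Calpurnia, done bitwise.
mask = (1 << 6) - 1                      # keep only the six play positions
result = brutus & caesar & (~calpurnia & mask)
print(f"{result:06b}")                   # 100100 -> Antony and Cleopatra, Hamlet
```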
Answers to query (Sec. 1.1)

Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
    When Antony found Julius Caesar dead,
    He cried almost to roaring; and he wept
    When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Bigger collections (Sec. 1.1)

Consider N = 1 million documents, each with about 1000 words.
- Average of 6 bytes/word, including spaces and punctuation
- About 6 GB of data in the documents

Say there are M = 500K distinct terms among these.
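Spelling out the arithmetic behind that figure: 10^6 documents × 1,000 words/document × 6 bytes/word = 6 × 10^9 bytes, i.e. roughly 6 GB of raw text.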
Can't build the matrix (Sec. 1.1)

A 500K x 1M matrix has half a trillion 0's and 1's.

But it has no more than one billion 1's, so the matrix is extremely sparse. Why?

What's a better representation?
- We only record the 1 positions.
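Why at most a billion 1's: each of the 10^6 documents contains about 1,000 word occurrences, so at most about 10^9 (term, document) cells can be 1; the remaining ~499 billion cells are 0. Below is a minimal sketch of recording only the 1 positions, one list of document IDs per term; the tiny corpus is invented for illustration:

```python
from collections import defaultdict

# Record only the 1 positions: for each term, the sorted list of
# documents that contain it. The corpus below is made up for the example.
docs = {
    1: "brutus killed caesar",
    2: "caesar praised calpurnia",
    3: "brutus and caesar at philippi",
}

term_to_docs = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        term_to_docs[term].append(doc_id)

print(term_to_docs["brutus"])   # [1, 3]
print(term_to_docs["caesar"])   # [1, 2, 3]
```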
Term-document incidence matrices