Introduction to Information Retrieval by Dr. Huda Abdulaali 2017-2018
Overview of information retrieval, including finding material in unstructured data collections, structured vs. unstructured data, information needs and relevance, and tasks such as clustering in the field of IR. It also touches on web search challenges, institutional search, and personal information retrieval system integration.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
1 Lecture 1: Introduction and the Boolean Model Information Retrieval By Dr. Huda Abdulaali 2017-2018 Introduction to Information Retrieval. Manning et al.,
INTRODUCTION Information retrieval (IR) is finding material . . . of an unstructured nature . . . that satisfies an information need from within large collections . . . . 2
INTRODUCTION Information retrieval (IR) is finding material (usually documents) of an unstructured nature . . . that satisfies an information need from within large collections (usually stored on computers). Document Collection: text units we have built an IR system. Usually documents But could be book chapters paragraphs scenes of a movie turns in a conversation... 3
INTRODUCTION Structured vs Unstructured Data structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and readily searchable by simple, straightforward search engine algorithms or other search operations; unstructured data is essentially the opposite. 4
INTRODUCTION Information Needs and Relevance An information need is the topic about which the user desires to know more about. A query is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if the user perceives that it contains information of value with respect to their personal information need. 5
INTRODUCTION The field of IR also covers supporting users in browsing or filtering document collections or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. 6
INTRODUCTION In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues are needing to gather documents for indexing, being able to build systems that work efficiently In and institutional search e.g company s Enterprise documentation, patents, research articles often domain-specific, centralized storage; dedicated machines for search . In personal information retrieval, operating systems have integrated information retrieval (such as Apple s Mac OS X Spotlight or Windows Vista s Instant Search). Email programs usually not only provide search but also text classification: they at least provide a spam (junk mail) filter, and commonly also provide either manual or automatic means for classifying mail so that it can be placed directly into particular folders 7
Boolean Retrieval The Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model and, at the same time, the first and most adopted one. It is used by many IR systems to this day In the Boolean retrieval model we can pose any query in the form of a Boolean expression of term i.e., one in which terms are combined with the operators and, or, and not. Shakespeare example 9
Basic Assumption of Boolean Model An index term is either present(1) or absent(0) in the document All index terms provide equal evidence with respect to information needs. Queries are Boolean combinations of index terms. XAND Y: represents doc that contains both X and Y X OR Y: represents doc that contains either X or Y NOT X: represents the doc that do not contain X 10
An example information retrieval problem Brutus AND Caesar AND NOT Calpurnia fat book that many people own is Shakespeare s Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus and Caesar and not Calpurnia. One way to do that is to start at the beginning and to read through all the text, The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly referred to as GREPPING through text 11
An example information retrieval problem GREP is a command-line utility for searching plain-text data sets for lines that match a regular expression. Its name comes from the command g/re/p (globally search a regular expression and print), which has the same effect: doing a global search with the regular expression and printing all matching lines. GREP was originally developed for the Unix operating system, but later 12 available for all Unix-like systems.
An example information retrieval problem 1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words. 2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans near countrymen with GREP, where near might be defined as within 5 words or within the same sentence. 3. To allow ranked retrieval. In many cases, you want the best answer to an information need among many documents that contain certain words. 13
An example information retrieval problem The way to avoid linearly scanning the texts for each query is to index the documents in advance The result is a binary term-document incidence matrix the information retrieval literature normally speaks of terms (NOT WORDS) The result is a vector for each term Retrieval model can be categorize as Boolean retrieval model Vector space model Probabilistic model 14 Model based on belief net
An example information retrieval problem result of binary term-document incidence matrix Main idea: record for each document whether it contains each word out of all the different words Shakespeare used. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise. 15
An example information retrieval problem We compute the results for our query as the bitwise AND between vectors for Brutus, Caesar and complement (Calpurnia): To answer the query Brutus and Caesar and not Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise and: 110100 and 110111 and 101111 = 100100 The answers for this query are thus Antony and Cleopatra and Hamlet 16
A first take at building an inverted index The inverted index consists of a dictionary of terms (also: lexicon, vocabulary) and a postings list for each term, i.e., a list that records which documents the term occurs in. The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. 17
A first take at building an inverted index Within a document collection, we assume that each document has a unique docID serial number, known as the document identifier (docID). Then, collect the documents to be indexed. The core indexing step is sorting this list so that the terms are alphabetical. 20
A first take at building an inverted index Multiple same document Instances of the same term are then grouped, and the result is split into a postings The dictionary records some statistics, such as the number of document documents contain each term The postings are secondarily sorted by docID. This provides the basis for efficient query processing. occurrences term from are of the the same then merged. dictionary and which 21