Understanding Core Concepts in Information Retrieval: Lexical and Semantic Gaps, Retrieval Models, and Algorithms

Explore the core concepts in Information Retrieval (IR) including lexical gaps like 'say' vs. 'said', semantic gaps, ranking models vs. retrieval methods, special data structures for efficient access, and algorithms for finding relevant documents. Understand the differences between IR and databases, search engine architectures, visiting strategies, statistical properties of language, and Zipf's law. Dive into complexity analysis in IR for time complexity evaluations.



Presentation Transcript


  1. Core concepts in IR
     - Query representation
       - Lexical gap: "say" vs. "said"
       - Semantic gap: "ranking model" vs. "retrieval method"
     - Document representation
       - Special data structures for efficient access
       - Lexical gap and semantic gap
     - Retrieval model
       - Algorithms that find the most relevant documents for the given information need

  2. IR vs. DBs
     - Information Retrieval
       - Unstructured data
       - Semantics of objects are subjective
       - Simple keyword queries
       - Relevance-driven retrieval
       - Effectiveness is the primary issue, though efficiency is also important
     - Database systems
       - Structured data
       - Semantics of each object are well defined
       - Structured query languages (e.g., SQL)
       - Exact retrieval
       - Emphasis on efficiency

  3. Abstraction of search engine architecture
     [Diagram: a crawler builds the indexed corpus; a doc analyzer produces document representations, which the indexer turns into the index; the user's query becomes a query representation; the ranker matches the query representation against the index to produce results, with evaluation and feedback informing the ranking procedure.]

  4. Visiting strategy
     - Breadth-first
       - Uniformly explore from the entry page
       - Memorize all nodes on the previous level, as in the sketch below
     - Depth-first
       - Explore the web branch by branch
       - Biased crawling, since the web is not a tree structure
     - Focused crawling
       - Prioritize new links by predefined strategies
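
  A minimal sketch of the breadth-first strategy above, using a FIFO queue and a visited set; fetch_links is a hypothetical helper (not from the slides) that returns the outgoing links of a page.

      from collections import deque

      def bfs_crawl(entry_url, fetch_links, max_pages=1000):
          """Breadth-first crawl: uniformly explore outward from the entry page."""
          visited = set([entry_url])
          frontier = deque([entry_url])      # memorizes all nodes on the current level
          while frontier and len(visited) <= max_pages:
              url = frontier.popleft()
              for link in fetch_links(url):  # fetch_links is assumed to return out-links
                  if link not in visited:
                      visited.add(link)
                      frontier.append(link)
          return visited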

  5. Statistical property of language
     - Word frequency follows a discrete version of the power law
     [Figure: log-log plot of word frequency vs. word rank by frequency, computed from Wikipedia (Nov 27, 2006).]

  6. Zipf's law tells us
     - Head words account for a large portion of occurrences, but they are semantically meaningless
       - E.g., the, a, an, we, do, to
     - Tail words account for the major portion of the vocabulary, but they rarely occur in documents
       - E.g., dextrosinistral
     - The rest are the most representative words, to be included in the controlled vocabulary
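
  A small sketch (not from the slides) that counts word frequencies in a toy corpus and inspects the head and tail of the rank-frequency curve; by Zipf's law, frequency falls roughly as the inverse of rank.

      from collections import Counter

      def rank_frequency(documents):
          """Count word occurrences and return words sorted by descending frequency."""
          counts = Counter(w for doc in documents for w in doc.lower().split())
          return counts.most_common()

      docs = ["the cat sat on the mat", "the dog chased the cat", "a cat and a dog"]
      ranked = rank_frequency(docs)
      print(ranked[:3])    # head words ("the", "cat", ...) cover a large share of occurrences
      print(ranked[-3:])   # tail words occur only once but dominate the vocabulary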

  7. Complexity analysis
     - Time complexity of a naive scan: O(|q| x |D| x |d|), where |q| is the length of the query, |d| is the length of a document, and the scan repeats over every document in the collection D
     - The inner loop is the bottleneck, since most words won't match:

       doclist = []
       for wi in q:                  # each query word
           for d in D:               # each document in the collection
               for wj in d:          # each word in the document
                   if wi == wj:      # bottleneck: most comparisons fail
                       doclist.append(d)
                       break
       return doclist

  8. Solution: inverted index
     - Build a look-up table for each word in the vocabulary: from word to documents!
     - Dictionary plus postings; example for the query "information retrieval":
       - information -> Doc1, Doc2
       - retrieval   -> Doc1
       - retrieved   -> Doc2
       - is          -> Doc1, Doc2
       - helpful     -> Doc1, Doc2
       - for         -> Doc1, Doc2
       - you         -> Doc2
       - everyone    -> Doc1
     - Time complexity: O(|q| x |L|), where |L| is the average length of a posting list
     - By Zipf's law, |L| is much smaller than the collection size
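
  A minimal in-memory sketch of the look-up table above, built on the small two-document example; the query is answered by intersecting the posting lists of its terms (a simplification, not the slides' disk-based implementation).

      from collections import defaultdict

      def build_inverted_index(docs):
          """Map each word to the sorted list of docIDs that contain it."""
          index = defaultdict(set)
          for doc_id, text in docs.items():
              for word in text.lower().split():
                  index[word].add(doc_id)
          return {w: sorted(ids) for w, ids in index.items()}

      docs = {"Doc1": "information retrieval is helpful for everyone",
              "Doc2": "information retrieved is helpful for you"}
      index = build_inverted_index(docs)

      query = "information retrieval"
      postings = [set(index.get(w, [])) for w in query.split()]
      print(sorted(set.intersection(*postings)))   # only Doc1 contains both query terms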

  9. Structures for inverted index
     - Dictionary: modest size
       - Needs fast random access
       - Stays in memory
       - Hash table, B-tree, trie, ...
     - Postings: huge
       - Sequential access is expected
       - Stay on disk
       - Contain docID, term frequency, term position, ...
       - Compression is needed
     - "Key data structure underlying modern IR" - Christopher D. Manning

  10. Sorting-based inverted index construction
      - Documents are parsed and counted into <termID, docID, count> tuples
      - Each run of tuples is sorted locally, then the runs are merge-sorted by termID so that all information about a term (e.g., term 1 across doc1, doc2, ..., doc300) becomes contiguous
      - A term lexicon maps terms to termIDs (e.g., 1 the, 2 cold, 3 days, 4 a, ...) and a docID lexicon maps documents to docIDs (e.g., 1 doc1, 2 doc2, 3 doc3, ...)
      - Pipeline: parse & count -> local sort -> merge sort
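
  A toy sketch of the parse-count / local-sort / merge pipeline, assuming one run per document; real systems spill sorted runs to disk and merge them, but the ordering logic is the same.

      import heapq
      from collections import Counter

      def parse_and_count(doc_id, text, term_ids):
          """Emit <termID, docID, count> tuples for one document (the parse & count step)."""
          counts = Counter(text.lower().split())
          return [(term_ids.setdefault(w, len(term_ids) + 1), doc_id, c) for w, c in counts.items()]

      def build_runs(docs):
          term_ids = {}
          runs = []
          for doc_id, (_, text) in enumerate(docs, start=1):
              runs.append(sorted(parse_and_count(doc_id, text, term_ids)))  # local sort of each run
          return runs, term_ids

      def merge_runs(runs):
          """Merge-sort the runs by termID so all tuples of a term become contiguous."""
          return list(heapq.merge(*runs))

      docs = [("doc1", "the cold days"), ("doc2", "the cold")]
      runs, term_ids = build_runs(docs)
      print(merge_runs(runs))   # [(1, 1, 1), (1, 2, 1), (2, 1, 1), (2, 2, 1), (3, 1, 1)]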

  11. Dynamic index update
      - Periodically rebuild the index
        - Acceptable if change is small over time and the penalty of missing new documents is negligible
      - Auxiliary index
        - Keep the index for new documents in memory
        - Merge into the main index when its size exceeds a threshold
        - Increases I/O operations
        - Solution: multiple auxiliary indices on disk, logarithmic merging (see the sketch below)
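
  A rough sketch of the auxiliary-index idea: new documents go into a small in-memory index that is merged into the main index once it exceeds a threshold, and queries consult both. The threshold on vocabulary size and the single auxiliary index are simplifying assumptions; logarithmic merging would keep several auxiliary indices of doubling size instead.

      class DynamicIndex:
          def __init__(self, threshold=1000):
              self.main = {}          # large index (on disk in a real system)
              self.aux = {}           # small in-memory index for new documents
              self.threshold = threshold

          def add(self, doc_id, text):
              for word in text.lower().split():
                  self.aux.setdefault(word, set()).add(doc_id)
              if len(self.aux) >= self.threshold:   # merge when the auxiliary index grows too big
                  self._merge()

          def _merge(self):
              for word, ids in self.aux.items():
                  self.main.setdefault(word, set()).update(ids)
              self.aux = {}

          def postings(self, word):
              # query both indices so newly added documents are not missed
              return self.main.get(word, set()) | self.aux.get(word, set())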

  12. Index compression
      - Observations about posting files
        - Instead of storing docIDs in a posting list, store the gaps between docIDs, since they are ordered
        - Zipf's law again: the more frequent a word is, the smaller its gaps are; the less frequent a word is, the shorter its posting list is
      - A heavily biased distribution gives us a great opportunity for compression!
      - Information theory: entropy measures compression difficulty
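
  A minimal sketch of the two ideas above: replace sorted docIDs by their gaps, then variable-byte encode the (mostly small) gaps. The posting list used here is a made-up example; frequent words produce small gaps and therefore compress well.

      def to_gaps(doc_ids):
          """Replace sorted docIDs with the differences between consecutive ones."""
          return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

      def vbyte_encode(numbers):
          """Variable-byte encoding: 7 payload bits per byte, high bit marks the last byte."""
          out = bytearray()
          for n in numbers:
              chunk = [n & 0x7F]
              n >>= 7
              while n:
                  chunk.append(n & 0x7F)
                  n >>= 7
              chunk[0] |= 0x80                 # flag the low-order byte as the terminator...
              out.extend(reversed(chunk))      # ...and emit high-order bytes first
          return bytes(out)

      postings = [824, 829, 215406]
      print(to_gaps(postings))                       # [824, 5, 214577]
      print(len(vbyte_encode(to_gaps(postings))))    # 6 bytes instead of three 4-byte integers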

  13. Phrase query
      - Generalized postings matching
        - Equality condition check with a required position pattern between two query terms
        - E.g., T2.pos - T1.pos = 1 (T1 must be immediately before T2 in any matched document)
        - Proximity query: |T2.pos - T1.pos| <= k
      - Scan the postings, e.g.:
        - Term1: 2, 4, 8, 16, 32, 64, 128
        - Term2: 1, 2, 3, 5, 8, 13, 21, 34
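
  A small sketch of the generalized postings match for a two-term phrase query, assuming positional postings of the form {docID: [positions]} (the data below is hypothetical); the equality condition is T2.pos - T1.pos == 1, and swapping it for abs(p2 - p1) <= k gives the proximity variant.

      def phrase_match(postings1, postings2):
          """Return docIDs where term2 appears immediately after term1."""
          matches = []
          for doc_id in sorted(postings1.keys() & postings2.keys()):   # docs containing both terms
              pos2 = set(postings2[doc_id])
              if any(p + 1 in pos2 for p in postings1[doc_id]):        # T2.pos - T1.pos == 1
                  matches.append(doc_id)
          return matches

      # Hypothetical positional postings: term -> {docID: [positions]}
      term1 = {1: [3, 17], 2: [5]}
      term2 = {1: [4, 60], 3: [2]}
      print(phrase_match(term1, term2))    # [1]: position 4 follows position 3 in doc 1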

  14. Classical IR evaluation
      - Three key elements for IR evaluation:
        1. A document collection
        2. A test suite of information needs, expressible as queries
        3. A set of relevance judgments, e.g., a binary assessment of either relevant or nonrelevant for each query-document pair

  15. Evaluation of unranked retrieval sets
      - In a Boolean retrieval system:
        - Precision: fraction of retrieved documents that are relevant, i.e., p(relevant|retrieved)
        - Recall: fraction of relevant documents that are retrieved, i.e., p(retrieved|relevant)
      - Contingency table:
        - retrieved & relevant: true positive (TP); retrieved & nonrelevant: false positive (FP)
        - not retrieved & relevant: false negative (FN); not retrieved & nonrelevant: true negative (TN)
      - Precision: P = TP / (TP + FP)
      - Recall: R = TP / (TP + FN)
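
  A tiny sketch of the two set-based measures, assuming binary relevance judgments on a made-up result set: precision = TP / (TP + FP), recall = TP / (TP + FN).

      def precision_recall(retrieved, relevant):
          """Compute precision and recall for an unranked retrieved set."""
          retrieved, relevant = set(retrieved), set(relevant)
          tp = len(retrieved & relevant)            # true positives
          precision = tp / len(retrieved) if retrieved else 0.0
          recall = tp / len(relevant) if relevant else 0.0
          return precision, recall

      # Hypothetical judgments: 3 of the 4 retrieved docs are relevant, 1 relevant doc is missed
      print(precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d3", "d5"]))  # (0.75, 0.75)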

  16. Evaluation of ranked retrieval results
      - Ranked results are the core output of an IR system
      - Precision, recall, and F-measure are set-based measures that cannot assess ranking quality
      - Solution: evaluate precision at every recall point
      [Figure: precision-recall curves for System1 and System2 - which system is better?]

  17. Statistical significance tests
      - How confident are you that an observed difference doesn't simply result from the particular queries you chose?
      - Experiment 1 (queries 1-7):
        - System A: 0.20, 0.21, 0.22, 0.19, 0.17, 0.20, 0.21 (average 0.20)
        - System B: 0.40, 0.41, 0.42, 0.39, 0.37, 0.40, 0.41 (average 0.40)
      - Experiment 2 (queries 11-17):
        - System A: 0.02, 0.39, 0.26, 0.38, 0.14, 0.09, 0.12 (average 0.20)
        - System B: 0.76, 0.07, 0.17, 0.31, 0.02, 0.91, 0.56 (average 0.40)
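
  A sketch of a paired significance test on the per-query scores above (values copied from the slide); scipy's paired t-test is one common choice. Experiment 1's consistent per-query differences yield a tiny p-value, while Experiment 2's noisy differences do not, even though the averages are identical.

      from scipy.stats import ttest_rel

      # Per-query scores from the two experiments
      exp1_A = [0.20, 0.21, 0.22, 0.19, 0.17, 0.20, 0.21]
      exp1_B = [0.40, 0.41, 0.42, 0.39, 0.37, 0.40, 0.41]
      exp2_A = [0.02, 0.39, 0.26, 0.38, 0.14, 0.09, 0.12]
      exp2_B = [0.76, 0.07, 0.17, 0.31, 0.02, 0.91, 0.56]

      # Same averages (0.20 vs. 0.40), very different confidence in the difference
      print(ttest_rel(exp1_A, exp1_B).pvalue)   # very small: the difference is consistent
      print(ttest_rel(exp2_A, exp2_B).pvalue)   # much larger: could be due to query choice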

  18. Challenge the assumptions in classical IR evaluations
      - Assumption 1: satisfaction = result relevance
      - Assumption 2: relevance = independent topical relevance
        - Documents are judged independently and then ranked (that is how we get the ideal ranking)
      - Assumption 3: sequential browsing from top to bottom

  19. A/B test
      - Two-sample hypothesis testing
        - Two versions (A and B) are compared; they are identical except for one variation that might affect a user's behavior
        - E.g., indexing with or without stemming
      - Randomized experiment
        - Separate the population into equal-size groups
        - E.g., 10% of random users for system A and 10% of random users for system B
      - Null hypothesis: no difference between systems A and B
        - Z-test, t-test

  20. Interleave test
      - Design principle from sensory analysis
        - Instead of asking for absolute ratings, ask for a relative comparison between alternatives
        - E.g., is A better than B?
      - Randomized experiment
        - Interleave results from both A and B
        - Give the interleaved results to the same population and ask for their preference
      - Hypothesis test over preference votes

  21. Relevance = Similarity
      - Assumptions
        - Query and documents are represented in the same form; a query can be regarded as a document
        - Relevance(d,q) is approximated by similarity(d,q)
      - R(q) = {d in C | rel(d,q) > theta}, where rel(q,d) = sim(Rep(q), Rep(d))
      - Key issues
        - How to represent the query and the documents?
        - How to define the similarity measure sim(.,.)?

  22. Vector space model
      - Represent both documents and queries as concept vectors
        - Each concept defines one dimension; k concepts define a high-dimensional space
        - Each vector element is a concept weight, e.g., d = (x1, ..., xk), where xi is the importance of concept i
      - Measure relevance by the distance between the query vector and the document vector in this concept space

  23. What is a good basic concept?
      - Orthogonal
        - Linearly independent basis vectors
        - Non-overlapping in meaning, no ambiguity
      - Weights can be assigned automatically and accurately
      - Existing solutions
        - Terms or N-grams, i.e., bag-of-words
        - Topics, i.e., topic models

  24. How to assign weights?
      - Important! Why?
        - Query side: not all terms are equally important
        - Document side: some terms carry more information about the content
      - How? Two basic heuristics
        - TF (Term Frequency) = within-document frequency
        - IDF (Inverse Document Frequency)

  25. Cosine similarity
      - With TF-IDF vectors, measure the angle between the two vectors:
        cosine(V_q, V_d) = (V_q . V_d) / (||V_q||_2 ||V_d||_2)
      - Equivalent to the dot product of the document-length-normalized (unit) vectors
      [Figure: query and documents D1, D2 as vectors in a TF-IDF space with axes "Finance" and "Sports".]
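
  A compact sketch of vector space scoring as described in the last few slides: raw TF with a standard IDF variant (log of N over document frequency) builds sparse vectors, and cosine similarity normalizes away document length. The corpus and the exact TF/IDF variant are illustrative assumptions, not the slides' definitions.

      import math
      from collections import Counter

      def tf_idf_vectors(docs):
          """Build sparse TF-IDF vectors using raw term frequency and idf = log(N / df)."""
          tokenized = [doc.lower().split() for doc in docs]
          n_docs = len(tokenized)
          df = Counter(term for doc in tokenized for term in set(doc))
          idf = {t: math.log(n_docs / df[t]) for t in df}
          return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in tokenized], idf

      def cosine(u, v):
          """Cosine of the angle between two sparse vectors represented as dicts."""
          dot = sum(u[t] * v.get(t, 0.0) for t in u)
          norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
          return dot / norm if norm else 0.0

      docs = ["stock market and finance news", "sports news football", "finance and stock tips"]
      vectors, idf = tf_idf_vectors(docs)
      query_vec = {t: idf.get(t, 0.0) for t in "finance stock".split()}   # treat the query as a document
      print([round(cosine(query_vec, v), 3) for v in vectors])            # highest for the finance docs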

  26. Justification from decision theory
      - Two types of loss:
        - Loss(retrieved | non-relevant) = a1
        - Loss(not retrieved | relevant) = a2
      - p(d_i, q): probability of d_i being relevant to q
      - Expected loss of the decision about including d_i in the final results:
        - Retrieve: (1 - p(d_i, q)) * a1
        - Not retrieve: p(d_i, q) * a2

  27. Justification from decision theory (cont.)
      - Decision rule: if (1 - p(d_i, q)) * a1 < p(d_i, q) * a2, retrieve d_i; otherwise, do not retrieve d_i
      - Equivalently, check whether p(d_i, q) > a1 / (a1 + a2)
      - Ranking documents by descending order of p(d_i, q) minimizes the expected loss

  28. Conditional models for P(R=1|Q,D)
      - Basic idea: relevance depends on how well a query matches a document
        - P(R=1|Q,D) = g(Rep(Q,D), theta)
        - Rep(Q,D): feature representation of the query-document pair, e.g., number of matched terms, highest IDF of a matched term, document length
      - Use training data (with known relevance judgments) to estimate the parameters theta
      - Apply the model to rank new documents
      - Special case: logistic regression as the functional form

  29. Generative models for P(R=1|Q,D)
      - Basic idea: compute the odds Odd(R=1|Q,D) using Bayes' rule:
        Odd(R=1|Q,D) = P(R=1|Q,D) / P(R=0|Q,D) = [P(Q,D|R=1) P(R=1)] / [P(Q,D|R=0) P(R=0)]
        (the prior ratio P(R=1)/P(R=0) is ignored for ranking)
      - Assumption: relevance is a binary variable
      - Variants
        - Document generation: P(Q,D|R) = P(D|Q,R) P(Q|R)
        - Query generation: P(Q,D|R) = P(Q|D,R) P(D|R)

  30. Maximum likelihood vs. Bayesian estimation
      - Maximum likelihood estimation (frequentist's point of view)
        - "Best" means the data likelihood reaches its maximum: theta-hat = argmax_theta P(X|theta)
        - Issue: small sample size
      - Bayesian estimation
        - "Best" means being consistent with our prior knowledge and explaining the data well:
          theta-hat = argmax_theta P(theta|X) = argmax_theta P(X|theta) P(theta)
        - A.k.a. Maximum a Posteriori (MAP) estimation (Bayesian's point of view)
        - Issue: how to define the prior?

  31. Robertson-Sparck Jones model (Robertson & Sparck Jones 76)
      - RSJ model (rank-equivalent form):
        log O(R=1|Q,D) = sum over terms with d_i = q_i = 1 of log [ p_i (1 - u_i) / (u_i (1 - p_i)) ]
      - Two parameters for each term A_i:
        - p_i = P(A_i=1|Q,R=1): probability that term A_i occurs in a relevant document
        - u_i = P(A_i=1|Q,R=0): probability that term A_i occurs in a non-relevant document
      - How to estimate these parameters? Suppose we have relevance judgments:
        - p_i = (#(rel. docs with A_i) + 0.5) / (#(rel. docs) + 1)
        - u_i = (#(nonrel. docs with A_i) + 0.5) / (#(nonrel. docs) + 1)
      - The +0.5 and +1 can be justified by Bayesian estimation as priors
      - Per-query estimation!

  32. BM25/Okapi approximation (Robertson et al. 94)
      - Idea: model p(D|Q,R) with a simpler function that approximates the 2-Poisson mixture model
      - Observations:
        - log O(R=1|Q,D) is a sum of weights of terms occurring in both the query and the document
        - Term weight W_i = 0 if TF_i = 0; W_i increases monotonically with TF_i; W_i has an asymptotic limit
      - The simple approximating function is:
        W_i = log [ p_i (1 - u_i) / (u_i (1 - p_i)) ] * (k_1 + 1) TF_i / (K + TF_i)

  33. Adding doc. length
      - Motivation: the 2-Poisson model assumes equal document lengths
      - Implementation: penalize long documents
        W_i = log [ p_i (1 - u_i) / (u_i (1 - p_i)) ] * (k_1 + 1) TF_i / (K + TF_i)
        where K = k_1 ((1 - b) + b |d| / avg|d|)
      - Pivoted document length normalization

  34. Adding query TF
      - Motivation: natural symmetry between document and query
      - Implementation: a similar TF transformation as on the document side
        W_i = log [ p_i (1 - u_i) / (u_i (1 - p_i)) ] * (k_1 + 1) TF_i / (K + TF_i) * (k_3 + 1) QTF_i / (k_3 + QTF_i)
      - The final formula is called BM25 (BM: best match), achieving top TREC performance
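
  A sketch of a commonly used simplified BM25 scoring function with the document-length component K described above; it replaces the RSJ relevance weight with a standard IDF-style term, omits the query-TF factor (which only matters for long queries), and uses the usual default parameters. The corpus is a made-up example, and this is not claimed to be the exact formula from the slides.

      import math
      from collections import Counter

      def bm25_score(query, doc, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
          """Score one document for a query with a simplified BM25 (no query-TF component)."""
          tf = Counter(doc)
          score = 0.0
          for term in query:
              if term not in tf:
                  continue
              idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
              K = k1 * ((1 - b) + b * len(doc) / avg_len)          # document length normalization
              score += idf * (k1 + 1) * tf[term] / (K + tf[term])  # saturating TF component
          return score

      corpus = [d.split() for d in ["information retrieval retrieval systems", "database systems"]]
      df = Counter(t for d in corpus for t in set(d))
      avg_len = sum(len(d) for d in corpus) / len(corpus)
      for d in corpus:
          print(round(bm25_score("information retrieval".split(), d, df, len(corpus), avg_len), 3))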

  35. Query generation models
      - Starting from the odds ratio:
        O(R=1|Q,D) = P(Q,D|R=1) / P(Q,D|R=0) = [P(Q|D,R=1) P(D|R=1)] / [P(Q|D,R=0) P(D|R=0)]
      - Assume P(Q|D,R=0) = P(Q|R=0), i.e., the query does not depend on the document when it is non-relevant:
        O(R=1|Q,D) is rank-equivalent to P(Q|D,R=1) * P(D|R=1) / P(D|R=0)
        i.e., query likelihood p(q|d) times a document prior
      - Assuming a uniform document prior, O(R=1|Q,D) is proportional to P(Q|D,R=1)
      - Now, the question is how to compute P(Q|D,R=1). This generally involves two steps:
        (1) estimate a language model based on D; (2) compute the query likelihood according to the estimated model
      - Language models!

  36. Language model for text
      - Probability distribution over word sequences; the chain rule turns conditional probabilities into a joint probability:
        p(w1 w2 ... wn) = p(w1) p(w2|w1) p(w3|w1,w2) ... p(wn|w1,...,wn-1)
      - Complexity: O(V^n), where n is the maximum sentence/document length
      - How large is this? There are 475,000 main headwords in Webster's Third New International Dictionary, and the average English sentence length is 14.3 words; a rough estimate of 475000^14 parameters at 8 bytes each comes to roughly 3.38 x 10^66 TB
      - We need independence assumptions!

  37. Illustration of language model smoothing
      - Maximum likelihood estimate: p_ML(w|d) = count of w in d / count of all words in d
      - The smoothed LM assigns nonzero probabilities to unseen words by discounting the probabilities of seen words
      [Figure: P(w|d) over words w for the maximum likelihood estimate vs. the smoothed language model.]

  38. Refine the idea of smoothing
      - Should all unseen words get equal probabilities?
      - We can use a reference language model to discriminate among unseen words
      - Discounted ML estimate:
        p(w|d) = p_seen(w|d)          if w is seen in d
               = alpha_d * p(w|REF)   otherwise
        where alpha_d = (1 - sum over seen w of p_seen(w|d)) / (sum over unseen w of p(w|REF))
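
  A sketch of query-likelihood scoring with one concrete smoothing choice, Jelinek-Mercer style linear interpolation with a collection (reference) model; the slides leave the exact smoothing method open, and the documents and lambda value here are illustrative assumptions.

      import math
      from collections import Counter

      def query_log_likelihood(query, doc, collection, lam=0.1):
          """log p(q|d) with linear interpolation: (1 - lam) * p_ml(w|d) + lam * p(w|C)."""
          doc_counts, col_counts = Counter(doc), Counter(collection)
          doc_len, col_len = len(doc), len(collection)
          score = 0.0
          for w in query:
              p_ml = doc_counts[w] / doc_len                 # seen-word probability from the document
              p_ref = col_counts[w] / col_len                # reference (collection) language model
              score += math.log((1 - lam) * p_ml + lam * p_ref)  # unseen words fall back to p_ref
          return score

      docs = [d.split() for d in ["information retrieval model", "database query language"]]
      collection = [w for d in docs for w in d]
      for d in docs:
          print(round(query_log_likelihood("information retrieval".split(), d, collection), 3))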

  39. Understanding smoothing
      - Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain:
        log p(q|d) = sum over words w in both q and d of log [ p_seen(w|d) / (alpha_d p(w|C)) ]
                     + |q| log alpha_d + sum over words w in q of log p(w|C)
      - The first term acts like TF-IDF weighting (smoothing with p(w|C) plays the IDF role)
      - The |q| log alpha_d term gives document length normalization (a longer document is expected to have a smaller alpha_d)
      - The last term does not depend on d and can be ignored for ranking
      - Net effect: TF-IDF weighting plus document length normalization

  40. Learning to Rank
      - General solution in an optimization framework
        - Input: training pairs (X_{q,d_1}, y_1), (X_{q,d_2}, y_2), ..., (X_{q,d_n}, y_n), where X_{q,d} is a feature vector for the query-document pair and y in {0, ..., k} is a relevance label
        - Objective: O = {P@k, MAP, NDCG}
        - Output: a scoring function f(q,d), s.t. f = argmax_f sum O(f(q,d), y)
      - Example feature table:
        DocID  BM25  LM   PageRank  Label
        0001   1.6   1.1  0.9       0
        0002   2.7   1.9  0.2       1

  41. Approximate the objective function!
      - Pointwise: fit the relevance labels individually
      - Pairwise: fit the relative orders
      - Listwise: fit the whole order

  42. Pointwise Learning to Rank
      - Ideally, perfect relevance prediction leads to perfect ranking: score -> order -> metric
      - Reduces the ranking problem to:
        - Regression: O(f(q,d), y) = sum_i |f(q,d_i) - y_i|
          (Subset Ranking using Regression, D. Cossock and T. Zhang, COLT 2006)
        - (Multi-)classification: O(f(q,d), y) = sum_i delta(f(q,d_i) = y_i)
          (Ranking with Large Margin Principles, A. Shashua and A. Levin, NIPS 2002)
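
  A toy sketch of the regression reduction above: fit relevance labels from per-document features with least squares, then rank by predicted score. The first two rows follow the feature table from the Learning to Rank slide; the other two rows and the use of numpy's lstsq are illustrative assumptions.

      import numpy as np

      # Feature vectors (BM25, LM, PageRank) and relevance labels
      X = np.array([[1.6, 1.1, 0.9],
                    [2.7, 1.9, 0.2],
                    [0.5, 0.3, 0.1],
                    [2.0, 1.5, 0.4]])
      y = np.array([0.0, 1.0, 0.0, 1.0])

      # Pointwise regression: choose w minimizing sum_i (f(q, d_i) - y_i)^2 with f(q, d) = w . x
      A = np.c_[X, np.ones(len(X))]                      # append a bias term
      w, *_ = np.linalg.lstsq(A, y, rcond=None)
      scores = A @ w

      ranking = np.argsort(-scores)                      # rank documents by predicted relevance
      print(ranking)                                     # documents with label 1 should come first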

  43. Deficiencies of Pointwise Learning to Rank
      - Cannot directly optimize IR metrics: predicting (label 0 -> score 1, label 2 -> score 0) is worse for ranking than (label 0 -> score -2, label 2 -> score 4), despite the smaller regression error
      - Positions of documents are ignored; the penalty on documents at higher positions should be larger
      - Favors queries with more documents

  44. Pairwise Learning to Rank
      - Ideally, a perfect partial order leads to a perfect ranking: partial order -> order -> metric
      - Ordinal regression: minimize the number of mis-ordered pairs,
        O(f(q,d), y) = sum over pairs (i,j) of 1(y_i > y_j) * 1(f(q,d_i) < f(q,d_j))
      - The relative ordering between different documents is what matters
        - E.g., (label 0 -> score -2, label 2 -> score 4) is better than (label 0 -> score 1, label 2 -> score 0)
      - Large body of research

  45. Deficiency of Pairwise Learning to Rank
      - Does minimizing mis-ordered pairs maximize IR metrics? Not necessarily:
      [Figure: two example rankings - one with 6 mis-ordered pairs, AP = 5/8, DCG = 1.333; the other with 4 mis-ordered pairs, AP = 5/12, DCG = 0.931.]
      - Position is crucial!

  46. Listwise Learning to Rank
      - Can we directly optimize the ranking? f -> order -> metric
      - Tackles the challenge of optimization without a gradient
