Evaluation of Information Retrieval Systems and User Satisfaction
Information retrieval systems are evaluated on aspects such as query assistance, retrieval speed, resource requirements, and the relevance of results. Measuring user satisfaction typically relies on the relevance of search results, which requires a benchmark document collection, a benchmark suite of queries, and (usually binary) relevance assessments. Human-labeled corpora and metrics such as precision and recall play a central role in assessing system performance.
Evaluation of Information Retrieval Systems CSC 575 Intelligent Information Retrieval
Information Retrieval as a Process
[Diagram: an information need is represented as a query and documents are represented as indexed objects; the two representations are compared to produce retrieved objects, which are assessed for relevance, with evaluation/feedback refining the process.]
Information Retrieval Evaluation
IR systems are often components of larger systems. Several aspects might be evaluated:
- assistance in formulating queries
- speed of retrieval
- resources required
- presentation of documents
- ability to find relevant documents
Evaluation is generally comparative (system A vs. system B, etc.). The most common evaluation is retrieval effectiveness.
Measuring User Satisfaction? (Sec. 8.1)
The most common proxy for user satisfaction is the relevance of search results. But how do you measure relevance? Relevance measurement requires three elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. An assessment, usually binary, of Relevant or Non-relevant for each query-document pair
There is some work on more-than-binary judgments, but binary assessment remains the standard.
Human-Labeled Corpora (Gold Standard)
- Start with a corpus of documents.
- Collect a set of queries for this corpus.
- Have one or more human experts exhaustively label the relevant documents for each query.
- Typically assumes binary relevance judgments.
- Requires considerable human effort for large document/query corpora.
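To make this concrete, here is a minimal sketch of how such a gold standard with binary judgments might be represented in Python; it is not tied to any particular collection's format, and the query and document IDs are made up for illustration.

```python
# Minimal sketch: a gold standard maps each query ID to the set of document
# IDs that human assessors judged relevant (binary judgments).
qrels = {
    "q1": {"d3", "d7", "d12"},   # hypothetical IDs, for illustration only
    "q2": {"d5"},
}

def is_relevant(query_id: str, doc_id: str) -> bool:
    """Binary relevance judgment for a (query, document) pair."""
    return doc_id in qrels.get(query_id, set())
```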
Unranked Retrieval Evaluation: Precision and Recall (Sec. 8.3)
Precision: the fraction of retrieved documents that are relevant = P(relevant | retrieved).
Recall: the fraction of relevant documents that are retrieved = P(retrieved | relevant).

                 Relevant   Nonrelevant
Retrieved        tp         fp
Not retrieved    fn         tn

Precision P = tp / (tp + fp)
Recall    R = tp / (tp + fn)
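A minimal Python sketch of these set-based definitions (the function name and arguments are illustrative, not from the slides):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Unranked precision and recall from a retrieved set and a relevant set."""
    tp = len(retrieved & relevant)   # relevant documents that were retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: 1 of 3 retrieved docs is relevant, 1 of 2 relevant docs is found.
print(precision_recall({"d1", "d2", "d3"}, {"d2", "d4"}))  # (0.333..., 0.5)
```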
Retrieved vs. Relevant Documents
Precision = |Ret ∩ Rel| / |Ret|
Recall = |Ret ∩ Rel| / |Rel|
[Diagram: Venn diagrams of the Retrieved and Relevant sets — a high-recall retrieval covers most of the Relevant set; a high-precision retrieval lies mostly inside it.]
Computing Recall/Precision
1. For a given query, produce the ranked list of retrievals. Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures.
2. Mark each document in the ranked list that is relevant according to the gold-standard test collection (pre-labeled corpora).
3. Compute a recall/precision pair for each position in the ranked list that contains a relevant document.
4. Repeat for many test queries and find mean recall/precision values.
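The procedure above can be sketched in a few lines of Python (illustrative code, not from the slides); applied to the example ranking on the next slide it reproduces the recall/precision values shown there.

```python
def recall_precision_points(ranked_docs, relevant, total_relevant=None):
    """Recall/precision pair at every position of a ranked list.

    `total_relevant` lets you count relevant documents that the system never
    retrieves (as in the example below, where only 5 of 6 appear in the list).
    """
    total = total_relevant if total_relevant is not None else len(relevant)
    hits, points = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / total, hits / rank))  # (recall, precision)
    return points
```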
Computing Recall/Precision Points: Example 1
Ranked list of 14 retrieved documents (doc #s, best first): 588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990. The documents marked relevant appear at ranks 1, 2, 4, 6, and 13; let the total number of relevant documents be 6.
Check each new recall point (each rank where a relevant document appears):
- Rank 1 (doc 588): R = 1/6 = 0.167; P = 1/1 = 1
- Rank 2 (doc 589): R = 2/6 = 0.333; P = 2/2 = 1
- Rank 4 (doc 590): R = 3/6 = 0.5; P = 3/4 = 0.75
- Rank 6 (doc 592): R = 4/6 = 0.667; P = 4/6 = 0.667
- Rank 13 (doc 772): R = 5/6 = 0.833; P = 5/13 = 0.38
One relevant document is never retrieved, so 100% recall is never reached.
Example from Raymond Mooney, University of Texas.
Computing Recall/Precision Points: Example 1 (all ranks)

Rank  Doc #  Relevant  Recall  Precision
1     588    x         0.17    1.00
2     589    x         0.33    1.00
3     576              0.33    0.67
4     590    x         0.50    0.75
5     986              0.50    0.60
6     592    x         0.67    0.67
7     984              0.67    0.57
8     988              0.67    0.50
9     578              0.67    0.44
10    985              0.67    0.40
11    103              0.67    0.36
12    591              0.67    0.33
13    772    x         0.83    0.38
14    990              0.83    0.36
Computing Recall/Precision Points: Example (continued)
[Plot: recall and precision from the table above as a function of the number of retrieved documents (1-14); recall rises in steps at ranks 1, 2, 4, 6, and 13 and plateaus at 0.83.]
Mean Average Precision (MAP)
Average Precision: the average of the precision values at the points at which each relevant document is retrieved; relevant documents that are never retrieved contribute a precision of 0.
Example (from the ranking above): (1 + 1 + 0.75 + 0.667 + 0.38 + 0) / 6 = 0.633.
Mean Average Precision: the average of the average precision values over a set of queries.
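A sketch of average precision and MAP that follows the example's convention of dividing by the total number of relevant documents, so that never-retrieved relevant documents contribute 0 (names are illustrative):

```python
def average_precision(ranked_docs, relevant, total_relevant=None):
    """Average of the precision values at the ranks where relevant docs occur.

    Dividing by the total number of relevant documents makes each relevant
    document that is never retrieved contribute 0, as in the slide's example.
    """
    total = total_relevant if total_relevant is not None else len(relevant)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total if total else 0.0

def mean_average_precision(per_query_aps):
    """MAP: mean of the per-query average precision values."""
    return sum(per_query_aps) / len(per_query_aps)
```

On the example ranking (relevant docs at ranks 1, 2, 4, 6, and 13, with 6 relevant in total) this yields (1 + 1 + 0.75 + 0.667 + 0.385) / 6 ≈ 0.633, matching the slide.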
Precision/Recall Curves
There is a tradeoff between precision and recall, so precision is measured at different levels of recall.
[Plot: a precision/recall curve with precision on the y-axis and recall on the x-axis, showing measured points along the curve.]
Precision/Recall Curves
It is difficult to determine which of two hypothetical results is better from their curves alone.
[Plot: two hypothetical precision/recall curves that are hard to compare.]
Cumulative Gain
With graded relevance judgments, we can compute the gain at each rank. Cumulative Gain at rank n:

CG_n = rel_1 + rel_2 + ... + rel_n

where rel_i is the graded relevance of the document at position i.

Rank  Doc #  rel (gain)  CG_n
1     588    1.0         1.0
2     589    0.6         1.6
3     576    0.0         1.6
4     590    0.8         2.4
5     986    0.0         2.4
6     592    1.0         3.4
7     984    0.0         3.4
8     988    0.0         3.4
9     578    0.0         3.4
10    985    0.0         3.4
11    103    0.0         3.4
12    591    0.0         3.4
13    772    0.2         3.6
14    990    0.0         3.6

Drawn from a lecture by Raymond Mooney, University of Texas.
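A minimal sketch of cumulative gain, applied to the graded relevance values of the example ranking:

```python
def cumulative_gain(gains, n):
    """CG_n: sum of the graded relevance values of the top n results."""
    return sum(gains[:n])

# Graded relevance of the example ranking (docs 588, 589, 576, ... in order).
gains = [1.0, 0.6, 0.0, 0.8, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0]
print([round(cumulative_gain(gains, n), 1) for n in range(1, 15)])
# [1.0, 1.6, 1.6, 2.4, 2.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.6, 3.6]
```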
Discounting Based on Position
Users care more about highly ranked documents, so the gain at rank i is discounted by 1/log2(i) (the top-ranked document is left undiscounted). Discounted Cumulative Gain at rank n:

DCG_n = rel_1 + sum over i = 2..n of rel_i / log2(i)

Rank  Doc #  rel (gain)  CG_n  log2(rank)  DCG_n
1     588    1.0         1.0   -           1.00
2     589    0.6         1.6   1.00        1.60
3     576    0.0         1.6   1.58        1.60
4     590    0.8         2.4   2.00        2.00
5     986    0.0         2.4   2.32        2.00
6     592    1.0         3.4   2.58        2.39
7     984    0.0         3.4   2.81        2.39
8     988    0.0         3.4   3.00        2.39
9     578    0.0         3.4   3.17        2.39
10    985    0.0         3.4   3.32        2.39
11    103    0.0         3.4   3.46        2.39
12    591    0.0         3.4   3.58        2.39
13    772    0.2         3.6   3.70        2.44
14    990    0.0         3.6   3.81        2.44
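A sketch of DCG using the slide's formulation (top result undiscounted, 1/log2(i) discount from rank 2 onward); it reproduces the DCG_6 ≈ 2.39 and DCG_14 ≈ 2.44 values in the table:

```python
import math

def dcg(gains, n):
    """DCG_n = rel_1 + sum over i = 2..n of rel_i / log2(i)."""
    total = gains[0] if n >= 1 else 0.0
    for i in range(2, n + 1):
        total += gains[i - 1] / math.log2(i)
    return total

gains = [1.0, 0.6, 0.0, 0.8, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0]
print(round(dcg(gains, 6), 2), round(dcg(gains, 14), 2))  # 2.39 2.44
```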
Normalized Discounted Cumulative Gain (NDCG)
To compare DCG values, normalize them so that an ideal ranking would have a Normalized DCG of 1.0. The ideal ranking orders the same documents by decreasing graded relevance:

Rank  Doc # (ideal order)  rel (gain)  CG_n  log2(rank)  IDCG_n
1     588                  1.0         1.0   -           1.00
2     592                  1.0         2.0   1.00        2.00
3     590                  0.8         2.8   1.58        2.50
4     589                  0.6         3.4   2.00        2.80
5     772                  0.2         3.6   2.32        2.89
6     576                  0.0         3.6   2.58        2.89
7     986                  0.0         3.6   2.81        2.89
8     984                  0.0         3.6   3.00        2.89
9     988                  0.0         3.6   3.17        2.89
10    578                  0.0         3.6   3.32        2.89
11    985                  0.0         3.6   3.46        2.89
12    103                  0.0         3.6   3.58        2.89
13    591                  0.0         3.6   3.70        2.89
14    990                  0.0         3.6   3.81        2.89

The DCG_n of the actual ranking (previous table) is 1.00, 1.60, 1.60, 2.00, 2.00, 2.39, 2.39, 2.39, 2.39, 2.39, 2.39, 2.39, 2.44, 2.44.
Normalized Discounted Cumulative Gain (NDCG)
Normalize by the DCG of the ideal ranking: NDCG_n = DCG_n / IDCG_n.

Rank  Doc #  DCG_n  IDCG_n  NDCG_n
1     588    1.00   1.00    1.00
2     589    1.60   2.00    0.80
3     576    1.60   2.50    0.64
4     590    2.00   2.80    0.71
5     986    2.00   2.89    0.69
6     592    2.39   2.89    0.83
7     984    2.39   2.89    0.83
8     988    2.39   2.89    0.83
9     578    2.39   2.89    0.83
10    985    2.39   2.89    0.83
11    103    2.39   2.89    0.83
12    591    2.39   2.89    0.83
13    772    2.44   2.89    0.84
14    990    2.44   2.89    0.84

NDCG ≤ 1 at all ranks, and NDCG is comparable across different queries.
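A self-contained sketch of NDCG: the ideal ranking is obtained by sorting the graded relevance values in decreasing order, and the results match the 0.71 (rank 4) and 0.84 (rank 14) values in the table above.

```python
import math

def dcg(gains, n):
    """DCG_n with the top result undiscounted and a 1/log2(i) discount for i >= 2."""
    return (gains[0] if n >= 1 else 0.0) + sum(
        gains[i - 1] / math.log2(i) for i in range(2, n + 1))

def ndcg(gains, n):
    """NDCG_n: DCG of the actual ranking divided by the DCG of the ideal one."""
    idcg = dcg(sorted(gains, reverse=True), n)
    return dcg(gains, n) / idcg if idcg > 0 else 0.0

gains = [1.0, 0.6, 0.0, 0.8, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0]
print(round(ndcg(gains, 4), 2), round(ndcg(gains, 14), 2))  # 0.71 0.84
```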
Standard Benchmarks
A benchmark collection contains:
- A set of standard documents and queries/topics.
- A list of relevant documents for each query.
Standard collections for traditional IR:
- SMART collection: ftp://ftp.cs.cornell.edu/pub/smart
- TREC: http://trec.nist.gov/
[Diagram: the algorithm under test is run on the standard document collection with the standard queries; the retrieved result is compared against the standard result in an evaluation step that produces precision and recall.]
Early Test Collections
Previous experiments were based on the SMART collections, which are fairly small (ftp://ftp.cs.cornell.edu/pub/smart):

Collection  Documents  Queries  Raw size (MB)
CACM        3,204      64       1.5
CISI        1,460      112      1.3
CRAN        1,400      225      1.6
MED         1,033      30       1.1
TIME        425        83       1.5

Different researchers used different test collections and evaluation techniques.
The TREC Benchmark
TREC: Text REtrieval Conference (http://trec.nist.gov/). It originated from the TIPSTER program sponsored by the Defense Advanced Research Projects Agency (DARPA) and became an annual conference in 1992, co-sponsored by the National Institute of Standards and Technology (NIST) and DARPA. Participants submit precision/recall values for the final document and query corpus and present their results at the conference.
Characteristics of the TREC Collection
Both long and short documents (from a few hundred to over one thousand unique terms per document). The test documents consist of:

Source  Description                                    Size
WSJ     Wall Street Journal articles (1986-1992)       550 MB
AP      Associated Press Newswire (1989)               514 MB
ZIFF    Computer Select disks (Ziff-Davis Publishing)  493 MB
FR      Federal Register                               469 MB
DOE     Abstracts from Department of Energy reports    190 MB
Issues with Relevance
Marginal relevance: Do later documents in the ranking add new information beyond what is already given in higher-ranked documents? The choice of the retrieved set should encourage diversity and novelty.
Coverage ratio: the proportion of relevant items retrieved out of the total relevant documents known to the user prior to the search, i.e. coverage = |retrieved ∩ known relevant| / |known relevant|. It is appropriate when the user wants to locate documents they have seen before (e.g., the budget report for Year 2000).