Basic Approach to Understanding User Tasks from the Web Anchor Text Graph
This study presents a basic approach to task understanding using publicly available resources, focusing on identifying key phrases and queries related to user tasks from anchor text data. Subtasks of an example task, such as low-budget wedding planning, are explored, along with methods to retrieve relevant documents. The use of anchor text for keyphrase identification and comparisons with search engine logs are discussed. Data statistics from the ClueWeb12 Anchor Text Graph are analyzed to establish the approach's effectiveness in task understanding.
Presentation Transcript
Mining Tasks from the Web Anchor Text Graph
TREC 2015, Nov 19, 2015
Paul N. Bennett and Ryen White, Microsoft Research
Tasks Track
- Task Understanding: identify all key phrases or alternate queries that might be involved in any task a user might have which would give rise to the observed query for a topic.
- Task Completion: document retrieval scored for usefulness to any task a user with the query might be performing; the completion measure combines relevance and usefulness.
- Ad hoc: scored by relevance.
Task Understanding
Query: [low wedding budget] (Task ID: 3)
Task description (formulated during assessment): I am planning a low budget wedding.
Subtasks:
- Low budget wedding venues
- Tips for wedding planning on a low budget
- How to organize your own wedding
- How to have a low budget wedding / low budget wedding ideas
- Low budget bridal shower ideas
- Delegate tasks to family
- Buy a used wedding gown
- Cheap wedding cake
- Make your own invitations
- Find low cost catering or cater it yourself
- Buy budget wedding bands
- Hold the ceremony at an inexpensive location
Use Anchor Text to Identify Keyphrases
Similar to the query reformulations approach taken by Dang & Croft (WSDM, 2010). Treat anchor text pointing to the same document as co-occurrences in a session, e.g. clueweb12-0204wb-43-19185. The ID of the document pointed to is used as a session ID.
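A minimal sketch (not the authors' code) of how this could be done: group anchor phrases by the document they point to, so each target document ID acts as a pseudo-session of co-occurring "queries". The tab-separated input format (anchor text, then target document ID) is an assumption for illustration.

```python
from collections import defaultdict

def build_sessions(anchor_lines):
    """Map each target document ID (the session ID) to the anchor phrases pointing at it."""
    sessions = defaultdict(list)
    for line in anchor_lines:
        # Assumed line format: "<anchor text>\t<target doc id>"
        anchor_text, target_doc_id = line.rstrip("\n").split("\t")
        sessions[target_doc_id].append(anchor_text)
    return sessions

# Example: two links pointing at the same ClueWeb12 document form one session.
lines = [
    "living in india\tclueweb12-0204wb-43-19185",
    "cost of living india\tclueweb12-0204wb-43-19185",
]
print(dict(build_sessions(lines)))
```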
Goals
Establish a very basic approach to task understanding using publicly available resources.
Data source: ClueWeb12 Anchor Text Graph.
Compare to the exact same technique applied to search engine logs:
- Longitudinal logs: comScore logs of ~2 years from multiple search engines (publicly available with a subscription).
- Short-burst logs: Bing sub-sample of less than one week (subsampled to be similar in size to the comScore logs).
Privacy concerns prevented us from submitting both search engine log runs, but we have a post-hoc analysis which lower-bounds performance (subject to missing relevance judgments).
Data Source Statistics

Source | Queries | Sessions | Avg. Queries per Session | Queries Relative to Anchor Text
Anchor Text | 678,999,678 | 18,962,877 | 35.81 | 100%
Longitudinal SE (comScore, ~2y) | 576,457,197 | 162,532,366 | 3.55 | 84.9%
Short Burst SE (Bing, < 1 week subsample) | 611,258,530 | 170,722,865 | 3.58 | 90.0%

Sources are of roughly comparable size by number of queries. On average, many more links (10x) point at the same document than there are queries within a standard search session.
Approach Overview
1) Remove the globally most frequent top 1K queries in the data source.
2) Normalize the input query to a filter phrase.
3) Match the filter phrase to candidate seeds.
4) Expand candidate seeds to related candidates using session co-occurrence counts.
5) Filter related candidates by length and similarity to the filter phrase.
6) Order by a score combining co-occurrence and similarity to the filter phrase.
Design goals: a simple approach using common IR techniques, easy to reproduce, establishing a baseline as a starting point for others.
[Pipeline diagram: raw data source sessions of queries -> normalize the raw query to a filter phrase and remove globally frequent queries -> reduced data source -> seed matching produces a matching seed list -> expand to co-occurring candidate key phrases -> filter -> rank -> ranked key phrases.]
Remove globally frequent queries
Intuition: the step out to frequent co-occurrences often catches these because they are globally frequent, not because they are related to the session.
Sample of frequent queries:
- Anchor: [next], [permalink], [prev], [read more], etc.
- comScore/Bing: [facebook], [youtube], etc.
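A minimal sketch of this step, assuming sessions are held as a dict of session ID to list of query/anchor strings (an illustrative data structure, not from the original run):

```python
from collections import Counter

def remove_globally_frequent(sessions, top_k=1000):
    """Drop the top_k globally most frequent phrases in the source from every session."""
    counts = Counter(q for queries in sessions.values() for q in queries)
    frequent = {q for q, _ in counts.most_common(top_k)}
    return {sid: [q for q in queries if q not in frequent]
            for sid, queries in sessions.items()}
```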
Input Query Normalization
Lower-casing, trimming, etc. Stop-wording using English function words, e.g. remove a, about, the, to, who, ... Used the 221 words at http://homepage.ntlworld.com/vivian.c/Words/StructureWordsList.htm. Broader than the 25 semantically non-selective words common in Reuters-RCV1 from Manning et al. (2008), but not as aggressive as frequent-word lists.
Example: [living in india] -> [living india]
Intuition: the main failure mode is not finding a matching seed. As the data source (anchor graph or logs) grows, the need for stop-wording should lessen.
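A minimal sketch of the normalization step; the eight-word stop list below is only a stand-in for the 221-word structure-word list cited on the slide:

```python
# Illustrative subset of the stop list; the real run uses 221 structure words.
STOP_WORDS = {"a", "about", "an", "in", "of", "the", "to", "who"}

def normalize(query):
    """Lower-case, trim, and drop function words to produce the filter phrase."""
    return [t for t in query.lower().strip().split() if t not in STOP_WORDS]

print(normalize("Living in India"))  # -> ['living', 'india']
```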
Match filter phrase to candidate seeds
Intuition 1: be restrictive when matching seeds; the step out to co-occurrences will give other candidates. Require all terms in the filter phrase to occur in candidate seeds.
Intuition 2: tasks often involve specifying additional aspects. Allow additional terms to occur in candidate seeds.
Example: the filter phrase [living india] matches seeds such as [cost of living india], [living in india], [living expenditures in india], [all india man email address living and working in uae].
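A minimal sketch of seed matching under the same assumed data structure: every filter-phrase term must appear in the candidate seed, and extra terms are allowed:

```python
def match_seeds(filter_phrase, sessions):
    """Return {session_id: [seed phrases containing every filter-phrase term]}."""
    required = set(filter_phrase)
    matches = {}
    for sid, queries in sessions.items():
        seeds = [q for q in queries if required <= set(q.lower().split())]
        if seeds:
            matches[sid] = seeds
    return matches

sessions = {"doc-1": ["cost of living india", "seeking friendship india"],
            "doc-2": ["youtube"]}
print(match_seeds(["living", "india"], sessions))  # doc-1 matches via its first phrase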
Expand Candidate Seeds to Related Candidates
Intuition: similar to query reformulation; related queries occur in the same session or point to the same document.
Task 12: I am thinking about moving to India but need to gather more information.
Example co-occurring candidates: [seeking friendship india], [cost of living india], [living in india], [living expenditures in india], [all india man email address living and working in uae].
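A minimal sketch of the expansion step: every phrase that co-occurs in a session containing a matching seed becomes a related candidate, with its session co-occurrence count accumulated (data structures again illustrative):

```python
from collections import Counter

def expand_candidates(sessions, seed_sessions):
    """Count, for each phrase, how many seed-matching sessions it co-occurs in."""
    counts = Counter()
    for sid in seed_sessions:
        for phrase in set(sessions[sid]):  # de-duplicate within a session
            counts[phrase] += 1
    return counts
```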
Filter related candidates by length and similarity to filter phrase
Intuition 1: overly long candidates will match randomly. Practically, many long candidates are anchor text parse failures from malformed HTML. Require the candidate to be no more than k times longer than the normalized filter phrase (k = 4, chosen to be generous).
Intuition 2: the step out to co-occurrences can be risky; require at least one word of overlap with the normalized filter phrase.
Example: [all india man email address living and working in uae] is dropped because Length(candidate) = 10 > k x Length(filter phrase) = 4 x 2 = 8, while [seeking friendship india], [cost of living india], [living in india], and [living expenditures in india] pass.
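A minimal sketch of the length and overlap filter, using the slide's generous length factor k = 4 and a one-word-overlap requirement:

```python
def keep_candidate(candidate, filter_phrase, k=4):
    """Keep candidates that are not overly long and share at least one word with the filter phrase."""
    tokens = candidate.lower().split()
    if len(tokens) > k * len(filter_phrase):
        return False
    return len(set(tokens) & set(filter_phrase)) >= 1

# The slide's example: a 10-word candidate vs. a 2-word filter phrase (10 > 4 * 2 = 8).
print(keep_candidate("all india man email address living and working in uae",
                     ["living", "india"]))  # -> False
```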
Rank by Frequency and Similarity
Intuition 1: each remaining unique session is ideally a relevant session, so the count across all relevant sessions is a measure of relevance.
Intuition 2: each query should also be related to the original query.
Score(keyphrase) = cos(filter phrase, keyphrase) x SessionCount(keyphrase, relevant sessions)
Note the filter phrase itself can be surpassed if it doesn't occur as a query in all sessions.
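A minimal sketch of the scoring formula; the cosine here is over binary word vectors, which is an assumption, since the slide does not specify the term weighting:

```python
import math

def cosine(tokens_a, tokens_b):
    """Cosine similarity over binary word vectors (assumed weighting)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def score(keyphrase, filter_phrase, session_count):
    """Combine similarity to the filter phrase with the relevant-session count."""
    return cosine(keyphrase.lower().split(), filter_phrase) * session_count

# E.g. a candidate that co-occurs in 3 seed-matching sessions:
print(round(score("cost of living india", ["living", "india"], 3), 3))  # ~2.121
```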
Coverage
Track: 50 topics, 34 topics judged (limited assessor time). Covered: a topic where a matching seed is found; this can be determined pre-submission.

Source | Type | All Topics | Judged Topics | Unjudged Topics
Anchor Text | Covered | 23 (46%) | 19 (56%) | 4 (25%)
Anchor Text | Not Covered | 27 (54%) | 15 (44%) | 12 (75%)
Longitudinal SE (comScore) | Covered | 32 (64%) | 25 (74%) | 7 (44%)
Longitudinal SE (comScore) | Not Covered | 18 (36%) | 9 (26%) | 9 (56%)
Short Burst SE (Bing) | Covered | 30 (60%) | 23 (68%) | 7 (44%)
Short Burst SE (Bing) | Not Covered | 20 (40%) | 11 (32%) | 9 (56%)

Observations:
- Between half and two-thirds of topics are covered.
- Higher coverage for search engine log sources (despite anchor text having the most queries).
- Longitudinal and short burst are similar, given the small sample of topics.
- Judged topics were more likely to be covered regardless of source; possibly an effect of prioritization of assessor time (topics covered by more systems get judged first).
Anchor Text Absolute Results

Mean performance over all 34 judged topics:
Method | ERR-IA@10 | ERR-IA@20 | ERR-IA@1000 | α-nDCG@10 | α-nDCG@20 | α-nDCG@1000
Anchor Text | 0.2925 | 0.2969 | 0.2986 | 0.3564 | 0.3557 | 0.3594
Mean All Runs | 0.3164 | 0.3246 | 0.3260 | 0.3754 | 0.4036 | 0.4305

Mean performance over the 19 judged topics with coverage by anchor text:
Method | ERR-IA@10 | ERR-IA@20 | ERR-IA@1000 | α-nDCG@10 | α-nDCG@20 | α-nDCG@1000
Anchor Text | 0.5234 | 0.5313 | 0.5343 | 0.6377 | 0.6365 | 0.6431
Mean All Runs | 0.3678 | 0.3771 | 0.3786 | 0.4438 | 0.4740 | 0.5040
Anchor Text Relative Results

Average difference from mean performance over all 34 judged topics:
Source | ERR-IA@10 | ERR-IA@20 | ERR-IA@1000 | α-nDCG@10 | α-nDCG@20 | α-nDCG@1000
Anchor Text | -0.0191 | -0.0255 | -0.0268 | -0.0213 | -0.0481 | -0.0715

Average difference from mean performance over the 19 judged topics with coverage:
Source | ERR-IA@10 | ERR-IA@20 | ERR-IA@1000 | α-nDCG@10 | α-nDCG@20 | α-nDCG@1000
Anchor Text | 0.1642 | 0.1580 | 0.1568 | 0.1898 | 0.1622 | 0.1383

When a topic is covered, anchor text is the best performer 37% of the time and above the mean 79% of the time (counts out of the 19 covered topics):
Comparison | ERR-IA@10 | ERR-IA@20 | ERR-IA@1000 | α-nDCG@10 | α-nDCG@20 | α-nDCG@1000
Max | 7 | 7 | 7 | 7 | 6 | 6
> Mean | 15 | 15 | 15 | 15 | 15 | 14
Lower Bound on Longitudinal SE

Difference from mean performance and counts of topics where best (Max) or above mean:
Type | ERR-IA@10 | ERR-IA@20 | ERR-IA@1000 | α-nDCG@10 | α-nDCG@20 | α-nDCG@1000
Longitudinal All | -0.0529 | -0.0574 | -0.0570 | -0.0703 | -0.0988 | -0.1211
Longitudinal Cov. | 0.0097 | 0.0055 | 0.0065 | 0.0021 | -0.0288 | -0.0519
Max | 3 | 3 | 3 | 4 | 2 | 2
> Mean | 12 | 13 | 13 | 12 | 11 | 10

Mean performance:
Type | ERR-IA@10 | ERR-IA@20 | ERR-IA@1000 | α-nDCG@10 | α-nDCG@20 | α-nDCG@1000
Longitudinal All | 0.2635 | 0.2672 | 0.2690 | 0.3051 | 0.3048 | 0.3094
Longitudinal Cov. | 0.3584 | 0.3633 | 0.3659 | 0.4149 | 0.4145 | 0.4208

Because this run is being analyzed post-hoc, there are missing relevance judgments (assumed non-relevant when missing). Thus these are lower bounds for using Longitudinal SE.
Lower Bound on Short Burst SE

Difference from mean performance and counts of topics where best (Max) or above mean:
Type | ERR-IA@10 | ERR-IA@20 | ERR-IA@1000 | α-nDCG@10 | α-nDCG@20 | α-nDCG@1000
Short Burst All | -0.0988 | -0.1053 | -0.1046 | -0.1278 | -0.1574 | -0.1779
Short Burst Cov. | -0.0179 | -0.0248 | -0.0233 | -0.0370 | -0.0700 | -0.0911
Max | 4 | 4 | 4 | 4 | 2 | 2
> Mean | 9 | 9 | 10 | 9 | 9 | 9

Mean performance:
Type | ERR-IA@10 | ERR-IA@20 | ERR-IA@1000 | α-nDCG@10 | α-nDCG@20 | α-nDCG@1000
Short Burst All | 0.2176 | 0.2193 | 0.2214 | 0.2476 | 0.2461 | 0.2526
Short Burst Cov. | 0.3216 | 0.3242 | 0.3273 | 0.3660 | 0.3639 | 0.3734

Because this run is being analyzed post-hoc, there are missing relevance judgments (assumed non-relevant when missing). Thus these are lower bounds for using Short Burst SE.
Summary
- Simple method, easy to implement even without full indexing of ClueWeb (use the ClueWeb12 anchor text as in Hiemstra & Hauff, 2010): http://wwwhome.ewi.utwente.nl/~hiemstra/2013/anchor-text-for-clueweb12.html
- Good baseline compared to the mean of all runs. When a topic is covered, best performer 37% of the time and above the mean 79% of the time; best performer on 20% of all judged topics. Taken together, this implies good potential for combining with nearly every run.
- Lower bounds for the same method using search engine logs (weakly) imply anchor text is a reasonable surrogate for logs for this task.
- Many potential extensions to investigate: full random walk; seed matching methods; filtering at each step of the walk; diversity/cohesion in the walk.