Term-weighting Functions for Similarity Measures

Scott Wen-tau Yih
Microsoft Research
© 2006 Microsoft Corporation. All rights reserved.
Presentation Transcript


  1. Learning Term-weighting Functions for Similarity Measures (Scott Wen-tau Yih, Microsoft Research)

  2. Applications of Similarity Measures: Query Suggestion. Query: mariners. How similar are they? mariners vs. seattle mariners; mariners vs. 1st mariner bank.

  3. Applications of Similarity Measures: Ad Relevance. Query: movie theater tickets.

  4. Similarity Measures based on TFIDF Vectors. Example document Dp, a digital camera review: "The new flagship of Canon's S-series, PowerShot S80 digital camera, incorporates 8 megapixels for shooting still images and a movie mode that records an impressive 1024 x 768 pixels." Term-vector: vp = { digital: 1.35, camera: 0.89, review: 0.32, ... }, where each weight is built from tf("review", Dp) and idf("review"). Similarity: Sim(Dp, Dq) = fsim(vp, vq); fsim could be cosine, overlap, Jaccard, etc.
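The TFIDF vector and cosine fsim above can be sketched in Python. This is a minimal illustration, not the exact weighting variant the slides use; `df` (document frequencies) and `n_docs` (corpus size) are assumed inputs:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, df, n_docs):
    """Build a sparse TFIDF term-vector: tf(t, D) * log(N / df(t))."""
    tf = Counter(doc_tokens)
    return {t: count * math.log(n_docs / df[t])
            for t, count in tf.items() if df.get(t)}

def cosine(vp, vq):
    """fsim(vp, vq): cosine similarity between two sparse term-vectors."""
    dot = sum(w * vq.get(t, 0.0) for t, w in vp.items())
    norm_p = math.sqrt(sum(w * w for w in vp.values()))
    norm_q = math.sqrt(sum(w * w for w in vq.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0
```

Other choices for fsim (overlap, Jaccard) would only change the `cosine` function; the vector construction stays the same.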

  5. Vector-based Similarity Measures: Pros & Cons. Advantages: simple & efficient; concise representation; effective in many applications. Issues: not trivial to adapt to target domain; lots of variations of TFIDF formulas; not clear how to incorporate other information (e.g., term position, query log frequency, etc.).

  6. Approach: Learn Term-weighting Functions. TWEAK: Term-weighting Learning Framework. Instead of a fixed TFIDF formula, learn the term-weighting functions; preserve the engineering advantages of the vector-based similarity measures; able to incorporate other term information and fine-tune the similarity measure; flexible in choosing various loss functions to match the true objectives in the target applications.

  7. Outline: Introduction; Problem Statement & Model (formal definition, loss functions); Experiments (query suggestion, ad page relevance); Conclusions.

  8. Vector-based Similarity Measures: Formal Definition. Compute the similarity between Dp and Dq. Vocabulary: V = {t1, t2, ..., tn}. Term-vector: vp = (sp1, sp2, ..., spn), where each term-weighting score is spi = tw(ti, Dp). Similarity: Sim(Dp, Dq) = fsim(vp, vq).

  9. TFIDF Cosine Similarity. fsim(vp, vq) = (vp · vq) / (|vp| |vq|). TFIDF term-weighting: tw(ti, Dp) = tf(ti, Dp) · log(N / df(ti)). TWEAK uses the same fsim(·, ·) (i.e., cosine) but replaces the fixed formula with a linear term-weighting function tw_w(ti, Dp) = w · φ(ti, Dp): a learned weight vector w applied to the features φ of each term.
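The linear term-weighting function tw_w(t, D) = w · φ(t, D) can be sketched as below; the feature names (`tf`, `idf`) are illustrative placeholders, not the deck's actual feature set:

```python
def tw_linear(w, phi):
    """Linear term-weighting: tw_w(t, D) = w . phi(t, D).
    A learned weight per feature replaces a fixed TFIDF formula."""
    return sum(w.get(name, 0.0) * value for name, value in phi.items())

def term_vector(w, doc_phi):
    """Score every term of a document from its feature map phi(t, D)."""
    return {t: tw_linear(w, phi) for t, phi in doc_phi.items()}
```

With w = {"tf": 1.0} and phi carrying raw term frequencies, this reduces to a plain tf weighting, which is why the learned model can subsume TFIDF-style formulas.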

  10. Learning Similarity Metric. Training examples: labeled document pairs {(y1, (Dp1, Dq1)), ..., (ym, (Dpm, Dqm))}. Loss functions: sum-of-squares error, Lsse(w) = (1/2) Σk (yk − Sw(Dpk, Dqk))²; log-loss, Llog(w) = −Σk [yk log Sw(Dpk, Dqk) + (1 − yk) log(1 − Sw(Dpk, Dqk))]. Smoothing.
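The two loss functions on this slide can be written directly from their formulas. In this sketch `sims` holds the model's similarity scores Sw(Dp_k, Dq_k) and `labels` the targets y_k; the smoothing term is omitted:

```python
import math

def sse_loss(sims, labels):
    """Sum-of-squares error: (1/2) * sum_k (y_k - S_w(Dp_k, Dq_k))^2."""
    return 0.5 * sum((y - s) ** 2 for s, y in zip(sims, labels))

def log_loss(sims, labels, eps=1e-12):
    """Log-loss: -sum_k [y_k log s_k + (1 - y_k) log(1 - s_k)],
    treating each similarity score s_k as a probability in (0, 1)."""
    return -sum(y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps)
                for s, y in zip(sims, labels))
```

Log-loss requires scores in (0, 1), so it fits similarity functions like cosine on non-negative vectors; sum-of-squares has no such restriction.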

  11. Learning Preference Ordering. Training examples: pairs of document pairs, where each example indicates that one pair (Dpk1, Dqk1) should score higher than the other (Dpk2, Dqk2). LogExpLoss [Dekel et al. NIPS-03]: with Δk = Sw(Dpk1, Dqk1) − Sw(Dpk2, Dqk2) and yk ∈ {+1, −1}, L(w) = Σk log(1 + exp(−yk Δk)). This loss upper bounds the pairwise error.
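LogExpLoss is a one-liner once the score differences Δ_k are computed; this sketch takes the differences as given rather than recomputing Sw:

```python
import math

def logexp_loss(deltas, prefs):
    """LogExpLoss: sum_k log(1 + exp(-y_k * delta_k)), where delta_k is the
    score difference between two document pairs and y_k in {+1, -1}
    encodes which pair should be ranked higher."""
    return sum(math.log1p(math.exp(-y * d)) for d, y in zip(deltas, prefs))
```

A correctly ordered pair with a large margin contributes almost nothing, while a misordered pair contributes more than log 2, which is what makes the (scaled) loss an upper bound on the pairwise 0/1 error.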

  12. Outline: Introduction; Problem Definition & Model (term-weighting functions, objective functions); Experiments (query suggestion, ad page relevance); Conclusions.

  13. Experiment: Query Suggestion. Data: query suggestion dataset [Metzler et al. '07; Yih&Meek '07]. Example query/suggestion pairs, each labeled Excellent, Good, Fair, or Bad: shell gas cards → shell oil credit card; texaco credit card → shell oil credit card; fresno city college → tarrant county college; dallas county schools → tarrant county college. |Q| = 122, |(Q,S)| = 4852; {Ex,Good} vs. {Fair,Bad}.

  14. Term Vector Construction and Features. Query expansion of x using a search engine: issue the query x to a search engine, then concatenate the top-n search result snippets (titles and summaries of the top-n returned documents). Features (of each term w.r.t. the document): Term Frequency, Capitalization, Location, Document Frequency, Query Log Frequency.
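A per-term feature map φ(t, D) covering the feature families on this slide might look as follows; the feature names, encodings, and scalings here are illustrative assumptions, not the paper's definitions:

```python
def term_features(term, doc_tokens, df, query_log_freq):
    """Sketch of phi(t, D): one scalar per feature family listed above.
    doc_tokens is the expanded document; df and query_log_freq are
    assumed lookup tables (document frequency, query-log frequency)."""
    t = term.lower()
    positions = [i for i, tok in enumerate(doc_tokens) if tok.lower() == t]
    return {
        "tf": float(len(positions)),                       # term frequency
        "capitalized": 1.0 if any(doc_tokens[i][0].isupper()
                                  for i in positions) else 0.0,
        "location": positions[0] / len(doc_tokens) if positions else 1.0,
        "df": float(df.get(t, 0)),                         # document frequency
        "query_log_freq": float(query_log_freq.get(t, 0)),
    }
```

These feature vectors are exactly what the linear term-weighting function consumes, so adding a new information source means adding one more entry here.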

  15. Results: Query Suggestion. [Bar chart comparing term-weighting methods; the recoverable values include 0.597 and 0.782.] 10-fold CV; smoothing parameter selected on dev set.

  16. Experiment: Ad Page Relevance. Data: a random sample of queries and ad landing pages collected during 2008; 13,341 query/page pairs with reliable labels (8,309 relevant; 5,032 irrelevant). Apply the same query expansion on queries. Additional HTML features: Hypertext, URL, Title, Meta-keywords, Meta-Description.

  17. Results: Ad Page Relevance. [ROC curves for preference order learning on different feature sets; AUC: TFIDF 0.794, TF&DF 0.806, Plaintext 0.832, HTML 0.855.]

  18. Results: Ad Page Relevance. Preference order learning on different feature sets:
      Features:  TFIDF   TF&DF   Plaintext   HTML
      AUC:       0.794   0.806   0.832       0.855

  19. Related Work. "Siamese" neural network framework: vectors of the objects being compared are generated by two-layer neural networks; applications include fingerprint matching and face matching. TWEAK can be viewed as a single-layer neural network with many (vocabulary size) output nodes. Learning the term-weighting scores directly [Bilenko&Mooney '03]: may work for limited vocabulary size. Learning to combine multiple similarity measures [Yih&Meek '07]: features of each pair are similarity scores from different measures; complementary to TWEAK.

  20. Future Work: Other Applications. Near-duplicate detection: existing methods (e.g., shingles, I-Match) create hash codes of n-grams in a document as fingerprints and detect duplicates when identical fingerprints are found; learn which fingerprints are important. Paraphrase recognition: vector-based similarity for surface matching; deep NLP analysis may be needed and encoded as features for sentence pairs.
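The shingling/fingerprint pipeline mentioned on this slide can be sketched in a few lines. This shows plain hashed shingles (not I-Match specifically), and the "learn which fingerprints are important" step is not implemented; the threshold value is an arbitrary choice:

```python
def shingles(tokens, n=4):
    """All n-gram shingles of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def fingerprints(tokens, n=4):
    """Hash each shingle to a compact fingerprint."""
    return {hash(s) & 0xFFFFFFFF for s in shingles(tokens, n)}

def near_duplicate(tokens_a, tokens_b, threshold=0.9):
    """Flag near-duplicates by Jaccard overlap of fingerprint sets."""
    fa, fb = fingerprints(tokens_a), fingerprints(tokens_b)
    if not fa or not fb:
        return False
    return len(fa & fb) / len(fa | fb) >= threshold
```

Replacing the uniform Jaccard overlap with learned per-fingerprint weights is the direction this slide proposes.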

  21. Future Work: Model Improvement. Learn additional weights on terms: create an indicator feature for each term, or create a two-layer neural network where each term is a node and learn the weight of each term as well. A joint model for term-weighting learning and similarity function (e.g., kernel) learning: the final similarity function combines multiple similarity functions and incorporates pair-level features, while the vector construction and term-weighting scores are trained using TWEAK.

  22. Conclusions. TWEAK: a term-weighting learning framework for improving vector-based similarity measures. Given labels of text pairs, it learns the term-weighting function; it offers a principled way to incorporate more information and adapt to target applications; it can replace existing TFIDF methods directly; it is flexible in using various loss functions; and it has potential for more applications and model enhancement.
