Term-weighting Functions for Similarity Measures

Scott Wen-tau Yih
Microsoft Research
© 2006 Microsoft Corporation. All rights reserved.
Presentation Transcript


  1. Learning Term-weighting Functions for Similarity Measures (Scott Wen-tau Yih, Microsoft Research)

  2. Applications of Similarity Measures: Query Suggestion. Query: mariners. How similar are they? mariners vs. seattle mariners; mariners vs. 1st mariner bank.

  3. Applications of Similarity Measures: Ad Relevance. Query: movie theater tickets.

  4. Similarity Measures based on TFIDF Vectors. Example document Dp, a digital camera review: "The new flagship of Canon's S-series, PowerShot S80 digital camera, incorporates 8 megapixels for shooting still images and a movie mode that records an impressive 1024 x 768 pixels." Term-vector: vp = { digital: 1.35, camera: 0.89, review: 0.32, ... }, where each weight is built from tf("review", Dp) and idf("review"). Similarity: Sim(Dp, Dq) = fsim(vp, vq); fsim could be cosine, overlap, Jaccard, etc.
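The TFIDF vector and cosine fsim above can be sketched in Python. This is a minimal illustration, not the exact weighting variant the slides use; `df` (document frequencies) and `n_docs` (corpus size) are assumed inputs:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, df, n_docs):
    """Build a sparse TFIDF term-vector: tf(t, D) * log(N / df(t))."""
    tf = Counter(doc_tokens)
    return {t: count * math.log(n_docs / df[t])
            for t, count in tf.items() if df.get(t)}

def cosine(vp, vq):
    """fsim(vp, vq): cosine similarity between two sparse term-vectors."""
    dot = sum(w * vq.get(t, 0.0) for t, w in vp.items())
    norm_p = math.sqrt(sum(w * w for w in vp.values()))
    norm_q = math.sqrt(sum(w * w for w in vq.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0
```

Other choices for fsim (overlap, Jaccard) would only change the `cosine` function; the vector construction stays the same.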

  5. Vector-based Similarity Measures: Pros & Cons. Advantages: simple & efficient; concise representation; effective in many applications. Issues: not trivial to adapt to target domain; lots of variations of TFIDF formulas; not clear how to incorporate other information (e.g., term position, query log frequency, etc.).

  6. Approach: Learn Term-weighting Functions. TWEAK: Term-weighting Learning Framework. Instead of a fixed TFIDF formula, learn the term-weighting functions; preserve the engineering advantages of the vector-based similarity measures; able to incorporate other term information and fine-tune the similarity measure; flexible in choosing various loss functions to match the true objectives in the target applications.

  7. Outline: Introduction; Problem Statement & Model (formal definition, loss functions); Experiments (query suggestion, ad page relevance); Conclusions.

  8. Vector-based Similarity Measures: Formal Definition. Compute the similarity between Dp and Dq. Vocabulary: V = {t1, t2, ..., tn}. Term-vector: vp = (sp1, sp2, ..., spn), where each term-weighting score is spi = tw(ti, Dp). Similarity: Sim(Dp, Dq) = fsim(vp, vq).

  9. TFIDF Cosine Similarity. fsim(vp, vq) = (vp · vq) / (|vp| |vq|). TFIDF term-weighting: tw(ti, Dp) = tf(ti, Dp) · log(N / df(ti)). TWEAK uses the same fsim(·, ·) (i.e., cosine) but replaces the fixed formula with a linear term-weighting function tw_w(ti, Dp) = w · φ(ti, Dp): a learned weight vector w applied to the features φ of each term.
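The linear term-weighting function tw_w(t, D) = w · φ(t, D) can be sketched as below; the feature names (`tf`, `idf`) are illustrative placeholders, not the deck's actual feature set:

```python
def tw_linear(w, phi):
    """Linear term-weighting: tw_w(t, D) = w . phi(t, D).
    A learned weight per feature replaces a fixed TFIDF formula."""
    return sum(w.get(name, 0.0) * value for name, value in phi.items())

def term_vector(w, doc_phi):
    """Score every term of a document from its feature map phi(t, D)."""
    return {t: tw_linear(w, phi) for t, phi in doc_phi.items()}
```

With w = {"tf": 1.0} and phi carrying raw term frequencies, this reduces to a plain tf weighting, which is why the learned model can subsume TFIDF-style formulas.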

  10. Learning Similarity Metric. Training examples: labeled document pairs {(y1, (Dp1, Dq1)), ..., (ym, (Dpm, Dqm))}. Loss functions: sum-of-squares error, Lsse(w) = (1/2) Σk (yk − Sw(Dpk, Dqk))²; log-loss, Llog(w) = −Σk [yk log Sw(Dpk, Dqk) + (1 − yk) log(1 − Sw(Dpk, Dqk))]. Smoothing.
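The two loss functions on this slide can be written directly from their formulas. In this sketch `sims` holds the model's similarity scores Sw(Dp_k, Dq_k) and `labels` the targets y_k; the smoothing term is omitted:

```python
import math

def sse_loss(sims, labels):
    """Sum-of-squares error: (1/2) * sum_k (y_k - S_w(Dp_k, Dq_k))^2."""
    return 0.5 * sum((y - s) ** 2 for s, y in zip(sims, labels))

def log_loss(sims, labels, eps=1e-12):
    """Log-loss: -sum_k [y_k log s_k + (1 - y_k) log(1 - s_k)],
    treating each similarity score s_k as a probability in (0, 1)."""
    return -sum(y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps)
                for s, y in zip(sims, labels))
```

Log-loss requires scores in (0, 1), so it fits similarity functions like cosine on non-negative vectors; sum-of-squares has no such restriction.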

  11. Learning Preference Ordering. Training examples: pairs of document pairs, where each example indicates that one pair (Dpk1, Dqk1) should score higher than the other (Dpk2, Dqk2). LogExpLoss [Dekel et al. NIPS-03]: with Δk = Sw(Dpk1, Dqk1) − Sw(Dpk2, Dqk2) and yk ∈ {+1, −1}, L(w) = Σk log(1 + exp(−yk Δk)). This loss upper bounds the pairwise error.
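LogExpLoss is a one-liner once the score differences Δ_k are computed; this sketch takes the differences as given rather than recomputing Sw:

```python
import math

def logexp_loss(deltas, prefs):
    """LogExpLoss: sum_k log(1 + exp(-y_k * delta_k)), where delta_k is the
    score difference between two document pairs and y_k in {+1, -1}
    encodes which pair should be ranked higher."""
    return sum(math.log1p(math.exp(-y * d)) for d, y in zip(deltas, prefs))
```

A correctly ordered pair with a large margin contributes almost nothing, while a misordered pair contributes more than log 2, which is what makes the (scaled) loss an upper bound on the pairwise 0/1 error.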

  12. Outline: Introduction; Problem Definition & Model (term-weighting functions, objective functions); Experiments (query suggestion, ad page relevance); Conclusions.

  13. Experiment: Query Suggestion. Data: query suggestion dataset [Metzler et al. '07; Yih&Meek '07]. Example query/suggestion pairs, each labeled Excellent, Good, Fair, or Bad: shell gas cards → shell oil credit card; texaco credit card → shell oil credit card; fresno city college → tarrant county college; dallas county schools → tarrant county college. |Q| = 122, |(Q,S)| = 4852; {Ex,Good} vs. {Fair,Bad}.

  14. Term Vector Construction and Features. Query expansion of x using a search engine: issue the query x to a search engine, then concatenate the top-n search result snippets (titles and summaries of the top-n returned documents). Features (of each term w.r.t. the document): Term Frequency, Capitalization, Location, Document Frequency, Query Log Frequency.
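A per-term feature map φ(t, D) covering the feature families on this slide might look as follows; the feature names, encodings, and scalings here are illustrative assumptions, not the paper's definitions:

```python
def term_features(term, doc_tokens, df, query_log_freq):
    """Sketch of phi(t, D): one scalar per feature family listed above.
    doc_tokens is the expanded document; df and query_log_freq are
    assumed lookup tables (document frequency, query-log frequency)."""
    t = term.lower()
    positions = [i for i, tok in enumerate(doc_tokens) if tok.lower() == t]
    return {
        "tf": float(len(positions)),                       # term frequency
        "capitalized": 1.0 if any(doc_tokens[i][0].isupper()
                                  for i in positions) else 0.0,
        "location": positions[0] / len(doc_tokens) if positions else 1.0,
        "df": float(df.get(t, 0)),                         # document frequency
        "query_log_freq": float(query_log_freq.get(t, 0)),
    }
```

These feature vectors are exactly what the linear term-weighting function consumes, so adding a new information source means adding one more entry here.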

  15. Results: Query Suggestion. [Bar chart comparing term-weighting methods; the recoverable values include 0.597 and 0.782.] 10-fold CV; smoothing parameter selected on dev set.

  16. Experiment: Ad Page Relevance. Data: a random sample of queries and ad landing pages collected during 2008; 13,341 query/page pairs with reliable labels (8,309 relevant; 5,032 irrelevant). Apply the same query expansion on queries. Additional HTML features: Hypertext, URL, Title, Meta-keywords, Meta-Description.

  17. Results: Ad Page Relevance. [ROC curves for preference order learning on different feature sets; AUC: TFIDF 0.794, TF&DF 0.806, Plaintext 0.832, HTML 0.855.]

  18. Results: Ad Page Relevance. Preference order learning on different feature sets:
      Features:  TFIDF   TF&DF   Plaintext   HTML
      AUC:       0.794   0.806   0.832       0.855

  19. Related Work. "Siamese" neural network framework: vectors of the objects being compared are generated by two-layer neural networks; applications include fingerprint matching and face matching. TWEAK can be viewed as a single-layer neural network with many (vocabulary size) output nodes. Learning the term-weighting scores directly [Bilenko&Mooney '03]: may work for limited vocabulary size. Learning to combine multiple similarity measures [Yih&Meek '07]: features of each pair are similarity scores from different measures; complementary to TWEAK.

  20. Future Work: Other Applications. Near-duplicate detection: existing methods (e.g., shingles, I-Match) create hash codes of n-grams in a document as fingerprints and detect duplicates when identical fingerprints are found; learn which fingerprints are important. Paraphrase recognition: vector-based similarity for surface matching; deep NLP analysis may be needed and encoded as features for sentence pairs.
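The shingling/fingerprint pipeline mentioned on this slide can be sketched in a few lines. This shows plain hashed shingles (not I-Match specifically), and the "learn which fingerprints are important" step is not implemented; the threshold value is an arbitrary choice:

```python
def shingles(tokens, n=4):
    """All n-gram shingles of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def fingerprints(tokens, n=4):
    """Hash each shingle to a compact fingerprint."""
    return {hash(s) & 0xFFFFFFFF for s in shingles(tokens, n)}

def near_duplicate(tokens_a, tokens_b, threshold=0.9):
    """Flag near-duplicates by Jaccard overlap of fingerprint sets."""
    fa, fb = fingerprints(tokens_a), fingerprints(tokens_b)
    if not fa or not fb:
        return False
    return len(fa & fb) / len(fa | fb) >= threshold
```

Replacing the uniform Jaccard overlap with learned per-fingerprint weights is the direction this slide proposes.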

  21. Future Work: Model Improvement. Learn additional weights on terms: create an indicator feature for each term, or create a two-layer neural network where each term is a node and learn the weight of each term as well. A joint model for term-weighting learning and similarity function (e.g., kernel) learning: the final similarity function combines multiple similarity functions and incorporates pair-level features, while the vector construction and term-weighting scores are trained using TWEAK.

  22. Conclusions. TWEAK: a term-weighting learning framework for improving vector-based similarity measures. Given labels of text pairs, it learns the term-weighting function; it offers a principled way to incorporate more information and adapt to target applications; it can replace existing TFIDF methods directly; it is flexible in using various loss functions; and it has potential for more applications and model enhancement.
