Similarity Recognition in the Web of Data


Alfio Ferrara

Lorenzo Genta

Stefano Montanelli

4th Int. Workshop on Linked Web Data Management (LWDM 2014)

Athens, Greece - March 28th, 2014

Context and motivations



In the web of data, similarity recognition is the basis for a variety of resource-consuming activities and applications, such as:

– data recommendation
– data aggregation
– data analysis

A lot of research effort has focused on data matching, with specific attention to instance matching in the framework of the Semantic Web

LWDM 2014 - Athens, Greece - March, 28th 2014

Why data matching again?

Focus on the scalability issue

Traditional comparison-based approaches work on reducing the $O(n^2)$ number of matching operations

Many optimization strategies have been proposed with interesting results

Matching is still an expensive step, and offline execution is required when very large datasets have to be matched
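To make the quadratic cost concrete, the following minimal sketch counts the pairwise comparisons a naive all-against-all matcher would perform. The function name `pairwise_comparisons` is hypothetical, and the two dataset sizes are the ones quoted in the experimental slides (reading "85.144" as 85,144 individuals is an assumption).

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered resource pairs among n resources: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Dataset sizes quoted in the experimental slides (58 and 85,144 individuals).
for n in (58, 85_144):
    print(f"{n} resources -> {pairwise_comparisons(n)} comparisons")
```

For 58 resources this gives 1,653 pairs, which matches the mappings figure reported for the quality experiment; for 85,144 resources the count is already in the billions.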

Why data matching again?

We are interested in data matching under specific constraints:

Resources are defined according to a common schema (extraction from a single repository)

The set of features considered for matching is known in advance (dimensional matching)

Features have a “dense” distribution over the resources to match

Feature values are specified according to a controlled vocabulary

Example of resources to match



Resources are represented as sets of feature-value pairs

Set of considered features:
– Profession
– Nationality
– Type

muhammad ali (r1)
profession - athlete
profession - professional boxer
type - olympic athlete
nationality - United States of America

michael jordan (r2)
profession - athlete
type - olympic athlete
type - celebrity

george foreman (r3)
profession - professional boxer
type - olympic athlete
nationality - United States of America

Examples taken from the Freebase repository (http://www.freebase.com)

HMatch4



What is HMatch4:
– It is a data matching algorithm
– It is natively conceived for working in the web of data
– It has been developed on the basis of HMatch3

How HMatch4 works:
– The similarity degree of two resources is proportional to the number of common feature-value pairs
– An index-based approach is used to detect the common feature-value pairs (resource comparison over features is not required)

HM4 at a glance

Given a resource r1:

Index all the relevant subsets of feature-value pairs belonging to r1

A resource r2 is similar to r1 if it shares an index entry with r1, meaning that r1 and r2 have a common subset of feature-value pairs

The similarity degree of r1 and r2 is assessed by measuring the size (i.e., cardinality) of the shared subset of feature-value pairs

HM4 with an example

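muhammad ali (r1)
profession - athlete
profession - professional boxer
type - olympic athlete
nationality - United States of America

Index entries
Subset of feature-value pairs | Resources
(profession - athlete), (type - ol. athlete) | r1
(profession - prof. boxer), (type - ol. athlete) | r1
(profession - athlete), (nat. - USA) | r1
(profession - prof. boxer), (nat. - USA) | r1
(type - ol. athlete), (nat. - USA) | r1
(profession - athlete), (type - ol. athlete), (nat. - USA) | r1
(profession - prof. boxer), (type - ol. athlete), (nat. - USA) | r1

HM4 with an example

michael jordan (r2)
profession - athlete
type - olympic athlete
type - celebrity

Index entries
Subset of feature-value pairs | Resources
(profession - athlete), (type - ol. athlete) | r1, r2
(profession - athlete), (type - celebrity) | r2
(profession - prof. boxer), (type - ol. athlete) | r1
(profession - athlete), (nat. - USA) | r1
(profession - prof. boxer), (nat. - USA) | r1
(type - ol. athlete), (nat. - USA) | r1
(profession - athlete), (type - ol. athlete), (nat. - USA) | r1
(profession - prof. boxer), (type - ol. athlete), (nat. - USA) | r1

HM4 with an example

george foreman (r3)
profession - professional boxer
type - olympic athlete
nationality - United States of America

Index entries
Subset of feature-value pairs | Resources
(profession - athlete), (type - ol. athlete) | r1, r2
(profession - athlete), (type - celebrity) | r2
(profession - prof. boxer), (type - ol. athlete) | r1, r3
(profession - athlete), (nat. - USA) | r1
(profession - prof. boxer), (nat. - USA) | r1, r3
(type - ol. athlete), (nat. - USA) | r1, r3
(profession - athlete), (type - ol. athlete), (nat. - USA) | r1
(profession - prof. boxer), (type - ol. athlete), (nat. - USA) | r1, r3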
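The index built up across these example slides can be approximated with a small inverted index. This is a sketch under stated assumptions, not the authors' implementation: the names `relevant_subsets` and `build_index` are hypothetical, and the rule that a relevant subset holds at most one value per feature and at least two pairs is inferred from the entries shown in the slides.

```python
from collections import defaultdict
from itertools import combinations, product

# The three Freebase-derived resources from the slides, as (feature, value) pairs.
RESOURCES = {
    "r1": [("profession", "athlete"), ("profession", "professional boxer"),
           ("type", "olympic athlete"),
           ("nationality", "United States of America")],
    "r2": [("profession", "athlete"), ("type", "olympic athlete"),
           ("type", "celebrity")],
    "r3": [("profession", "professional boxer"), ("type", "olympic athlete"),
           ("nationality", "United States of America")],
}


def relevant_subsets(pairs, min_size):
    """Yield subsets with one value per feature and at least min_size pairs.

    Assumed rule, inferred from the index entries listed in the slides.
    """
    by_feature = defaultdict(list)
    for feature, value in pairs:
        by_feature[feature].append((feature, value))
    features = sorted(by_feature)
    for size in range(min_size, len(features) + 1):
        for chosen in combinations(features, size):
            # Pick exactly one value for each chosen feature.
            for combo in product(*(by_feature[f] for f in chosen)):
                yield frozenset(combo)


def build_index(resources, min_size=2):
    """Map each relevant subset to the set of resource ids that contain it."""
    index = defaultdict(set)
    for rid, pairs in resources.items():
        for subset in relevant_subsets(pairs, min_size):
            index[subset].add(rid)
    return index


index = build_index(RESOURCES)
# Entries listing more than one resource identify candidate matches.
shared = {entry: ids for entry, ids in index.items() if len(ids) > 1}
for entry, ids in shared.items():
    print(sorted(entry), "->", sorted(ids))
```

Under these assumptions the sketch produces the same eight entries as the final table, and matching candidates are found by reading the index rather than by comparing resources pairwise.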

HM4 with an example



What is the similarity of r1, r2, r3?

Take the index entry with the highest number of common feature-value pairs

Use Dice's coefficient:

$$\mathrm{sim}(r_i, r_j) = \frac{2 \cdot |\mathrm{rfs}(r_i, r_j)|}{|\mathrm{fs}(r_i)| + |\mathrm{fs}(r_j)|}$$

• $\mathrm{rfs}(r_i, r_j)$ is the relevant feature set of $r_i$ and $r_j$ (the index entry of $r_i$ and $r_j$)

• $\mathrm{fs}(r_i)$ is the feature set of $r_i$

sim(r1, r2) = 2*2 / (3+2) = 0.8

sim(r2, r3) = 0

sim(r1, r3) = 2*3 / (3+3) = 1.0
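The three similarity values on this slide can be checked with a direct transcription of Dice's coefficient. The function name `dice_similarity` is hypothetical. The feature-set sizes passed in (3 for r1 and r3, 2 for r2) are the ones used in the slide's arithmetic; they appear to count distinct features rather than raw feature-value pairs, which is an inference from the numbers and not something the slides state.

```python
def dice_similarity(shared_size: int, size_i: int, size_j: int) -> float:
    """Dice's coefficient: 2 * |rfs(ri, rj)| / (|fs(ri)| + |fs(rj)|)."""
    if size_i + size_j == 0:
        return 0.0  # Two empty feature sets: define the similarity as 0.
    return 2 * shared_size / (size_i + size_j)


# Sizes taken from the slide's worked examples.
print("sim(r1, r2) =", dice_similarity(2, 3, 2))
print("sim(r2, r3) =", dice_similarity(0, 2, 3))  # no shared index entry
print("sim(r1, r3) =", dice_similarity(3, 3, 3))
```

Note that r2 and r3 do share the single pair (type - olympic athlete), but a one-pair subset is not an index entry in the example, so their relevant feature set is empty and the similarity is 0.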

What is a relevant feature set?



Define a matching threshold th (the minimum level of similarity for two resources to be considered matching resources)

Given two resources ri and rj, how many feature-value pairs do they need to have in common to satisfy th (and thus to be a relevant feature set)?

$$X \geq th \cdot |F|$$

X is the number of needed common feature-value pairs

Sets with fewer feature-value pairs than X are not considered (in the example, th=0.5 and |F|=3)
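As a quick check of the threshold rule, the sketch below computes the smallest integer X satisfying X ≥ th·|F|. The function name `min_common_pairs` is hypothetical, and both the use of a ceiling and the reading of |F| as the number of considered features (3 in the example) are assumptions; the slides give only the inequality and the example values.

```python
import math


def min_common_pairs(th: float, n_features: int) -> int:
    """Smallest integer X with X >= th * n_features (assumed interpretation)."""
    return math.ceil(th * n_features)


# Example values from the slide: th = 0.5 and |F| = 3.
print(min_common_pairs(0.5, 3))
```

With th=0.5 and |F|=3 this gives X=2, which agrees with the example index, where only subsets of two or three feature-value pairs appear.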

Issues for the discussion session



Experimental results
– Quality?
– Scalability?

Application scenarios for HMatch4

Is it possible to relax some constraints?
– Support for approximate string matching
– Integration with hashing techniques preserving “nearby” positions for similar values

Experimental results - quality



Comparison of HMatch4 against HMatch3

Dataset of 58 Freebase individuals (1,653 mappings)

Experimental results - scalability



Comparison of HMatch4 against LogMap and SLINT+

Growing number of individuals (from 58 to 85,144)

Conclusions



Extensive experimentation against very large datasets (millions of instances)

Support for approximate string matching

Integration with hashing techniques preserving “nearby” positions for similar values

Thanks



The ISLab team: http://islab.di.unimi.it/

Alfio Ferrara – alfio.ferrara@unimi.it

Lorenzo Genta – lorenzo.genta@unimi.it

Stefano Montanelli – stefano.montanelli@unimi.it


This presentation explores the importance of similarity recognition in web data applications, the scalability challenges of data matching, and the specific constraints and features that play a role in the matching process. Examples from the Freebase repository demonstrate how resources are represented and matched based on defined features.


Presentation Transcript

Similarity Recognition in the Web of Data. Alfio Ferrara, Lorenzo Genta, Stefano Montanelli. 4th Int. Workshop on Linked Web Data Management (LWDM 2014), Athens, Greece - March 28th, 2014. Dipartimento di Informatica, Università degli Studi di Milano.

Context and motivations. In the web of data, similarity recognition is the basis for a variety of resource-consuming activities and applications, such as data recommendation, data aggregation, and data analysis. A lot of research effort has focused on data matching, with specific attention to instance matching in the framework of the Semantic Web.

Why data matching again? Focus on the scalability issue. Traditional comparison-based approaches work on reducing the $O(n^2)$ number of matching operations. Many optimization strategies have been proposed with interesting results. Matching is still an expensive step, and offline execution is required when very large datasets have to be matched.

Why data matching again? We are interested in data matching under specific constraints. Resources are defined according to a common schema (extraction from a single repository). The set of features considered for matching is known in advance (dimensional matching). Features have a dense distribution over the resources to match. Feature values are specified according to a controlled vocabulary.

Example of resources to match. Resources are represented as sets of feature-value pairs. Set of considered features: Profession, Nationality, Type. muhammad ali (r1): profession - athlete; profession - professional boxer; type - olympic athlete; nationality - United States of America. michael jordan (r2): profession - athlete; type - olympic athlete; type - celebrity. george foreman (r3): profession - professional boxer; type - olympic athlete; nationality - United States of America. Examples taken from the Freebase repository (http://www.freebase.com).

HMatch4. What is HMatch4: it is a data matching algorithm; it is natively conceived for working in the web of data; it has been developed on the basis of HMatch3. How HMatch4 works: the similarity degree of two resources is proportional to the number of common feature-value pairs; an index-based approach is used to detect the common feature-value pairs (resource comparison over features is not required).

HM4 at a glance. Given a resource r1: index all the relevant subsets of feature-value pairs belonging to r1. A resource r2 is similar to r1 if it shares an index entry with r1, meaning that r1 and r2 have a common subset of feature-value pairs. The similarity degree of r1 and r2 is assessed by measuring the size (i.e., cardinality) of the shared subset of feature-value pairs.

HM4 with an example. muhammad ali (r1): profession - athlete; profession - professional boxer; type - olympic athlete; nationality - United States of America. Index entries (subset of feature-value pairs: resources): (profession - athlete), (type - ol. athlete): r1; (profession - prof. boxer), (type - ol. athlete): r1; (profession - athlete), (nat. - USA): r1; (profession - prof. boxer), (nat. - USA): r1; (type - ol. athlete), (nat. - USA): r1; (profession - athlete), (type - ol. athlete), (nat. - USA): r1; (profession - prof. boxer), (type - ol. athlete), (nat. - USA): r1.

HM4 with an example. michael jordan (r2): profession - athlete; type - olympic athlete; type - celebrity. Index entries (subset of feature-value pairs: resources): (profession - athlete), (type - ol. athlete): r1, r2; (profession - athlete), (type - celebrity): r2; (profession - prof. boxer), (type - ol. athlete): r1; (profession - athlete), (nat. - USA): r1; (profession - prof. boxer), (nat. - USA): r1; (type - ol. athlete), (nat. - USA): r1; (profession - athlete), (type - ol. athlete), (nat. - USA): r1; (profession - prof. boxer), (type - ol. athlete), (nat. - USA): r1.

HM4 with an example. george foreman (r3): profession - professional boxer; type - olympic athlete; nationality - United States of America. Index entries (subset of feature-value pairs: resources): (profession - athlete), (type - ol. athlete): r1, r2; (profession - athlete), (type - celebrity): r2; (profession - prof. boxer), (type - ol. athlete): r1, r3; (profession - athlete), (nat. - USA): r1; (profession - prof. boxer), (nat. - USA): r1, r3; (type - ol. athlete), (nat. - USA): r1, r3; (profession - athlete), (type - ol. athlete), (nat. - USA): r1; (profession - prof. boxer), (type - ol. athlete), (nat. - USA): r1, r3.

HM4 with an example. What is the similarity of r1, r2, r3? Take the index entry with the highest number of common feature-value pairs. Use Dice's coefficient: $\mathrm{sim}(r_i, r_j) = \frac{2 \cdot |\mathrm{rfs}(r_i, r_j)|}{|\mathrm{fs}(r_i)| + |\mathrm{fs}(r_j)|}$, where $\mathrm{rfs}(r_i, r_j)$ is the relevant feature set of $r_i$ and $r_j$ (the index entry of $r_i$ and $r_j$) and $\mathrm{fs}(r_i)$ is the feature set of $r_i$. sim(r1, r2) = 2*2 / (3+2) = 0.8; sim(r2, r3) = 0; sim(r1, r3) = 2*3 / (3+3) = 1.0.

What is a relevant feature set? Define a matching threshold th (the minimum level of similarity for two resources to be considered matching resources). Given two resources ri and rj, how many feature-value pairs do they need to have in common to satisfy th (and thus to be a relevant feature set)? X is the number of needed common feature-value pairs: $X \geq th \cdot |F|$. Sets with fewer feature-value pairs than X are not considered (in the example, th=0.5 and |F|=3).

Issues for the discussion session. Experimental results: quality? scalability? Application scenarios for HMatch4. Is it possible to relax some constraints? Support for approximate string matching; integration with hashing techniques preserving nearby positions for similar values.

Thanks. The ISLab team - http://islab.di.unimi.it/. Alfio Ferrara: alfio.ferrara@unimi.it. Lorenzo Genta: lorenzo.genta@unimi.it. Stefano Montanelli: stefano.montanelli@unimi.it.
