Similarity Recognition in the Web of Data

Alfio Ferrara
Lorenzo Genta
Stefano Montanelli
4th Int. Workshop on Linked Web Data Management (LWDM 2014)
Athens, Greece - March 28th, 2014
Context and motivations
In the web of data, similarity recognition is the basis for a variety of resource-consuming activities and applications, for example:
data recommendation
data aggregation
data analysis
Considerable research effort has focused on data matching, with specific attention to instance matching in the framework of the Semantic Web
Why data matching again?
Focus on the scalability issue
Traditional comparison-based approaches work on reducing the O(n^2) number of matching operations
Many optimization strategies have been proposed with interesting results
Matching is still an expensive step, and offline execution is required when very large datasets have to be matched
Why data matching again?
We are interested in data matching under specific constraints:
Resources are defined according to a common schema (extraction from a single repository)
The set of features considered for matching is known in advance (dimensional matching)
Features have a “dense” distribution over the resources to match
Feature values are specified according to a controlled vocabulary
Example of resources to match
Resources are represented as sets of feature-value pairs
Set of considered features: Profession, Nationality, Type
muhammad ali (r1)
profession - athlete
profession - professional boxer
type - olympic athlete
nationality - United States of America
george foreman (r3)
profession - professional boxer
type - olympic athlete
nationality - United States of America
michael jordan (r2)
profession - athlete
type - olympic athlete
type - celebrity
Examples taken from the Freebase repository (http://www.freebase.com)
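As a minimal sketch, the three example resources can be written as sets of feature-value pairs. The Python encoding (frozensets of tuples) and the helper name `features` are assumptions made for illustration; they are not part of HMatch4 itself.

```python
# Each resource is a set of (feature, value) pairs, as in the Freebase example.
# The names r1, r2, r3 follow the slides; the encoding is an assumption.
r1 = frozenset({
    ("profession", "athlete"),
    ("profession", "professional boxer"),
    ("type", "olympic athlete"),
    ("nationality", "United States of America"),
})
r2 = frozenset({
    ("profession", "athlete"),
    ("type", "olympic athlete"),
    ("type", "celebrity"),
})
r3 = frozenset({
    ("profession", "professional boxer"),
    ("type", "olympic athlete"),
    ("nationality", "United States of America"),
})


def features(resource):
    """Return the set of distinct features used by a resource."""
    return {feature for feature, _ in resource}


# Common feature-value pairs are plain set intersections.
print(sorted(r1 & r3))
print(sorted(features(r2)))
```

With this encoding a feature may carry several values (r1 has two professions), which is why the resources are sets of pairs rather than dictionaries.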
HMatch4
What is HMatch4:
It is a data matching algorithm
It is natively conceived for working in the web of data
It builds on HMatch3
How HMatch4 works:
The similarity degree of two resources is proportional to the number of common feature-value pairs
An index-based approach is used to detect the common feature-value pairs (no resource-by-resource comparison over features is required)
HM4 at a glance
Given a resource r1:
Index all the relevant subsets of feature-value pairs belonging to r1
A resource r2 is similar to r1 if it shares an index entry with r1, meaning that r1 and r2 have a common subset of feature-value pairs
The similarity degree of r1 and r2 is assessed by measuring the size (i.e., cardinality) of the shared subset of feature-value pairs
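The indexing idea can be sketched as an inverted index that maps each subset of feature-value pairs to the resources containing it. This is an illustrative sketch, not the authors' implementation: the function names, the `min_size` parameter, and the choice to index only subsets with at most one value per feature (which is what the index entries in the example suggest) are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def relevant_subsets(pairs, min_size):
    """Yield subsets with at least min_size pairs and at most one value
    per feature (the one-value-per-feature rule is an assumption)."""
    ordered = sorted(pairs)
    for size in range(min_size, len(ordered) + 1):
        for subset in combinations(ordered, size):
            feats = [feature for feature, _ in subset]
            if len(set(feats)) == len(feats):
                yield frozenset(subset)


def build_index(resources, min_size):
    """Map every relevant subset to the names of the resources holding it."""
    index = defaultdict(set)
    for name, pairs in resources.items():
        for subset in relevant_subsets(pairs, min_size):
            index[subset].add(name)
    return index


resources = {
    "r1": {("profession", "athlete"), ("profession", "professional boxer"),
           ("type", "olympic athlete"),
           ("nationality", "United States of America")},
    "r2": {("profession", "athlete"), ("type", "olympic athlete"),
           ("type", "celebrity")},
    "r3": {("profession", "professional boxer"), ("type", "olympic athlete"),
           ("nationality", "United States of America")},
}

index = build_index(resources, min_size=2)
print(len(index))  # 8 distinct index entries for the three example resources

# Resources that share an entry are candidate matches; no pairwise
# comparison of resources is performed.
for subset, names in index.items():
    if len(names) > 1:
        print(sorted(subset), sorted(names))
```

Candidate matches fall out of the index lookup itself, which is the point of the approach: the cost moves from comparing resources to enumerating subsets per resource.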
HM4 with an example
muhammad ali (r1)
profession - athlete
profession - professional boxer
type - olympic athlete
nationality - United States of America
Index entries
HM4 with an example
Index entries
michael jordan (r2)
profession - athlete
type - olympic athlete
type - celebrity
HM4 with an example
Index entries
george foreman (r3)
profession - professional boxer
type - olympic athlete
nationality - United States of America
HM4 with an example
What is the similarity of r1, r2, r3?
Take the index entry with the highest number of common feature-value pairs
Use Dice's coefficient:
sim(ri, rj) = 2*|rfs(ri, rj)| / (|fs(ri)| + |fs(rj)|)
rfs(ri, rj) is the relevant feature set of ri and rj (the index entry of ri and rj)
fs(ri) is the feature set of ri
sim(r1, r2) = 2*2 / (3+2) = 0.8
sim(r2, r3) = 0
sim(r1, r3) = 2*3 / (3+3) = 1.0
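The three similarity values can be reproduced with a short sketch of Dice's coefficient, 2*|rfs(ri, rj)| / (|fs(ri)| + |fs(rj)|). It assumes that |fs(r)| counts the distinct features of a resource (3 for r1 and r3, 2 for r2), which is the reading consistent with the numbers on the slide, and that shared sets with fewer than two pairs are discarded, as in the threshold example of these slides. For brevity it intersects the two resources directly instead of looking up an index; all function names are hypothetical.

```python
def feature_count(pairs):
    """Number of distinct features appearing in a set of pairs."""
    return len({feature for feature, _ in pairs})


def dice_similarity(a, b, min_size=2):
    """Dice's coefficient over the largest shared subset of pairs.

    Shared subsets smaller than min_size are treated as "no index entry",
    so the similarity is 0 (an assumption of this sketch).
    """
    shared = feature_count(a & b)
    if shared < min_size:
        return 0.0
    return 2 * shared / (feature_count(a) + feature_count(b))


r1 = {("profession", "athlete"), ("profession", "professional boxer"),
      ("type", "olympic athlete"),
      ("nationality", "United States of America")}
r2 = {("profession", "athlete"), ("type", "olympic athlete"),
      ("type", "celebrity")}
r3 = {("profession", "professional boxer"), ("type", "olympic athlete"),
      ("nationality", "United States of America")}

print(dice_similarity(r1, r2))  # 0.8
print(dice_similarity(r2, r3))  # 0.0
print(dice_similarity(r1, r3))  # 1.0
```

Note that r2 and r3 do share one pair (type - olympic athlete), but a single pair is below the minimum size, so no index entry exists for it and the similarity is 0.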
What is a relevant feature set?
Define a matching threshold th (the minimum level of similarity for two resources to be considered matching resources)
Given two resources ri and rj, how many feature-value pairs do they need to have in common to satisfy th (and thus to form a relevant feature set)?
X ≥ th*|F|, where X is the number of needed common feature-value pairs
Sets with fewer feature-value pairs than X are not considered (in the example th=0.5, |F|=3)
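As a small worked example of the threshold, the sketch below assumes the relation X ≥ th*|F| and takes the smallest integer that satisfies it. The function name is hypothetical, and exact fractions are used only to avoid floating-point rounding in the product.

```python
import math
from fractions import Fraction


def min_common_pairs(th, num_features):
    """Smallest integer X with X >= th * num_features."""
    # Fraction(str(th)) keeps decimal thresholds such as 0.7 exact.
    return math.ceil(Fraction(str(th)) * num_features)


# Example from the slide: th = 0.5 and |F| = 3 give th*|F| = 1.5.
print(min_common_pairs(0.5, 3))  # 2
```

With th=0.5 and |F|=3, only subsets of two or more feature-value pairs are indexed, which matches the index entries of the example (none of them contains a single pair).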
Issues for the discussion session
Experimental results
Quality?
Scalability?
Application scenarios for HMatch4
Is it possible to relax some constraints?
Support for approximate string matching
Integration with hashing techniques that preserve “nearby” positions for similar values
Experimental results - quality
Comparison of HMatch4 against HMatch3
Dataset of 58 Freebase individuals (1,653 mappings)
Experimental results - scalability
Comparison of HMatch4 against LogMap and SLINT+
Growing number of individuals (from 58 to 85,144)
Conclusions
Extensive experimentation against very large
datasets (millions of instances)
Support for approximate string matching
Integration with hashing techniques that preserve “nearby” positions for similar values
Thanks
The ISLab team - http://islab.di.unimi.it/
Alfio Ferrara – alfio.ferrara@unimi.it
Lorenzo Genta – lorenzo.genta@unimi.it
Stefano Montanelli – stefano.montanelli@unimi.it

This presentation explores the role of similarity recognition in web-of-data applications, the scalability challenges of data matching, and the constraints and features that shape the matching process. Examples from the Freebase repository show how resources are represented and matched on a defined set of features.

  • Web Data Management
  • Data Matching
  • Semantic Web
  • Resource Matching
  • Data Analysis

Presentation Transcript


  1. Similarity Recognition in the Web of Data. Alfio Ferrara, Lorenzo Genta, Stefano Montanelli, Dipartimento di Informatica, Università degli Studi di Milano. 4th Int. Workshop on Linked Web Data Management (LWDM 2014), Athens, Greece, March 28th, 2014.


  8. HM4 with an example
     muhammad ali (r1): profession - athlete; profession - professional boxer; type - olympic athlete; nationality - United States of America
     Index entries (subset of feature-value pairs: resources)
       (profession - athlete), (type - ol. athlete): r1
       (profession - prof. boxer), (type - ol. athlete): r1
       (profession - athlete), (nat. - USA): r1
       (profession - prof. boxer), (nat. - USA): r1
       (type - ol. athlete), (nat. - USA): r1
       (profession - athlete), (type - ol. athlete), (nat. - USA): r1
       (profession - prof. boxer), (type - ol. athlete), (nat. - USA): r1

  9. HM4 with an example
     michael jordan (r2): profession - athlete; type - olympic athlete; type - celebrity
     Index entries (subset of feature-value pairs: resources)
       (profession - athlete), (type - ol. athlete): r1, r2
       (profession - athlete), (type - celebrity): r2
       (profession - prof. boxer), (type - ol. athlete): r1
       (profession - athlete), (nat. - USA): r1
       (profession - prof. boxer), (nat. - USA): r1
       (type - ol. athlete), (nat. - USA): r1
       (profession - athlete), (type - ol. athlete), (nat. - USA): r1
       (profession - prof. boxer), (type - ol. athlete), (nat. - USA): r1

  10. HM4 with an example
      george foreman (r3): profession - professional boxer; type - olympic athlete; nationality - United States of America
      Index entries (subset of feature-value pairs: resources)
        (profession - athlete), (type - ol. athlete): r1, r2
        (profession - athlete), (type - celebrity): r2
        (profession - prof. boxer), (type - ol. athlete): r1, r3
        (profession - athlete), (nat. - USA): r1
        (profession - prof. boxer), (nat. - USA): r1, r3
        (type - ol. athlete), (nat. - USA): r1, r3
        (profession - athlete), (type - ol. athlete), (nat. - USA): r1
        (profession - prof. boxer), (type - ol. athlete), (nat. - USA): r1, r3

  11. HM4 with an example
      What is the similarity of r1, r2, r3? Take the index entry with the highest number of common feature-value pairs and use Dice's coefficient:
      sim(ri, rj) = 2*|rfs(ri, rj)| / (|fs(ri)| + |fs(rj)|)
      rfs(ri, rj) is the relevant feature set of ri and rj (the index entry of ri and rj); fs(ri) is the feature set of ri
      sim(r1, r2) = 2*2 / (3+2) = 0.8
      sim(r2, r3) = 0
      sim(r1, r3) = 2*3 / (3+3) = 1.0

  12. What is a relevant feature set?
      Define a matching threshold th (the minimum level of similarity for two resources to be considered matching resources)
      Given two resources ri and rj, how many feature-value pairs do they need to have in common to satisfy th (and thus to form a relevant feature set)?
      X ≥ th*|F|, where X is the number of needed common feature-value pairs
      Sets with fewer feature-value pairs than X are not considered (in the example th=0.5, |F|=3)

