Challenges and Solutions in Data Integration

 
Laure Berti (Universite de Rennes 1), Anish Das Sarma (Stanford), Xin Luna Dong (AT&T), Amelie Marian (Rutgers), Divesh Srivastava (AT&T)
 
 
Challenges that Data Integration Faces

Structure heterogeneity: schema matching, model management, query answering using views, information extraction
Instance heterogeneity: string matching (edit distance, token-based, etc.), object matching (a.k.a. record linkage, reference reconciliation, …)
Data conflicts: data fusion, truth discovery
Existing Solutions Assume Independence of Data Sources

However, advanced technologies such as the Web ease the copying of data between data sources.
Such copying can significantly affect the effectiveness of existing techniques for schema matching, model management, query answering using views, information extraction, string matching, object matching, data fusion, and truth discovery.
False Information on the Web
UA’s bankruptcy
Chicago Tribune, 2002
Sun-Sentinel.com
Google News
Bloomberg.com
The UAL stock plummeted to $3 from $12.5
 
How to Find the Truth?
Naïve voting: among conflicting values, choose the one asserted by the largest number of data sources.
However, “A lie told often enough becomes the truth.” (Vladimir Lenin)
Identify dependence between data sources:
One source copies from other sources
Opinion by one source is influenced by others
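
To make the weakness of naïve voting concrete, here is a minimal Python sketch. The source names S1 to S6 and their claims are hypothetical, made up for illustration; the only point is that a few verbatim copies of one wrong value are enough to flip the majority.

```python
from collections import Counter


def naive_vote(claims):
    """Return the value asserted by the largest number of sources.

    `claims` maps a (hypothetical) source name to the value it asserts
    for a single object. Ties go to the value seen first, which is an
    arbitrary choice made for this sketch.
    """
    counts = Counter(claims.values())
    return counts.most_common(1)[0][0]


# Three independent sources: two are right, one is wrong.
independent = {
    "S1": "Barack Obama",
    "S2": "Barack Obama",
    "S3": "John McCain",
}

# Three further sources copy S3's wrong value verbatim.
with_copiers = dict(
    independent, S4="John McCain", S5="John McCain", S6="John McCain"
)

print(naive_vote(independent))   # majority among independent sources
print(naive_vote(with_copiers))  # majority once copies are counted as votes
```

With only the independent sources the correct value wins; once the copies are counted as separate votes, the wrong value wins, which is why dependence between sources has to be identified first.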
 
I. Identifying Dependence bet. Sources
Intuition I: decide dependence (w/o direction)
Let D1, D2 be data from two sources. D1 and D2 are dependent if
Pr(D1, D2) ≠ Pr(D1) * Pr(D2).
 
Dependence?

         Source 1 on USA Presidents   Source 2 on USA Presidents
1st      George Washington            George Washington
2nd      John Adams                   John Adams
3rd      Thomas Jefferson             Thomas Jefferson
4th      James Madison                James Madison
41st     George H.W. Bush             George H.W. Bush
42nd     William J. Clinton           William J. Clinton
43rd     George W. Bush               George W. Bush
44th     Barack Obama                 Barack Obama

Are Source 1 and Source 2 dependent? Not necessarily.
 
 
 
 
 
 
 
 
Dependence? -- Common Errors

         Source 1 on USA Presidents   Source 2 on USA Presidents
1st      George Washington            George Washington
2nd      Benjamin Franklin            Benjamin Franklin
3rd      Tom Jefferson                Tom Jefferson
4th      Abraham Lincoln              Abraham Lincoln
41st     George W. Bush               George W. Bush
42nd     Hillary Clinton              Hillary Clinton
43rd     Mickey Mouse                 Mickey Mouse
44th     Barack Obama                 John McCain

Are Source 1 and Source 2 dependent? Very likely.
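
The contrast between shared true values and shared false values can be sketched numerically. The Python below is an illustrative model, not the authors' algorithm: it assumes both sources have the same accuracy `acc`, that each object has `n_false` equally likely wrong values, and that a copier copies each value with probability `copy_rate`. All parameter names and default values are assumptions made for this sketch.

```python
def dependence_posterior(k_true, k_false, k_diff,
                         acc=0.8, n_false=100, copy_rate=0.8, prior=0.5):
    """Posterior probability that two sources are dependent.

    k_true  -- objects on which both sources give the same true value
    k_false -- objects on which both sources give the same false value
    k_diff  -- objects on which the sources disagree
    All other parameters are assumed values for this sketch.
    """
    # Per-object probabilities when the two sources are independent.
    ind_true = acc * acc
    ind_false = (1 - acc) ** 2 / n_false
    ind_diff = 1 - ind_true - ind_false

    # Per-object probabilities when one source copies with rate copy_rate
    # and otherwise provides its value independently.
    dep_true = copy_rate * acc + (1 - copy_rate) * ind_true
    dep_false = copy_rate * (1 - acc) + (1 - copy_rate) * ind_false
    dep_diff = (1 - copy_rate) * ind_diff

    like_ind = ind_true ** k_true * ind_false ** k_false * ind_diff ** k_diff
    like_dep = dep_true ** k_true * dep_false ** k_false * dep_diff ** k_diff

    # Bayes' rule with prior probability `prior` of dependence.
    return prior * like_dep / (prior * like_dep + (1 - prior) * like_ind)


# Eight shared true values, as in the first presidents example.
p_all_true = dependence_posterior(k_true=8, k_false=0, k_diff=0)

# One shared true value, six shared false values, one disagreement,
# as in the "Common Errors" example.
p_common_errors = dependence_posterior(k_true=1, k_false=6, k_diff=1)

print(p_all_true, p_common_errors)
```

Under these assumptions, agreeing on true values is only weak evidence of dependence, because accurate independent sources agree anyway, whereas agreeing on the same false values is very unlikely by chance and pushes the posterior close to 1.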
 
 
 
 
 
 
 
 
I. Identifying Dependence bet. Sources
Intuition I: decide dependence (w/o direction)
Let D1, D2 be data from two sources. D1 and D2 are dependent if
Pr(D1, D2) ≠ Pr(D1) * Pr(D2).

Intuition II: decide copying direction
Let F be a property function of the data; e.g., accuracy of data. D1 is likely to be dependent on D2 if
|F(D1 ∩ D2) - F(D1 - D2)| > |F(D1 ∩ D2) - F(D2 - D1)|.
Dependence? -- Different Accuracy

         Source 1 on USA Presidents   Source 2 on USA Presidents
1st      George Washington            George Washington
2nd      John Adams                   Benjamin Franklin
3rd      Thomas Jefferson             Tom Jefferson
4th      Abraham Lincoln              Abraham Lincoln
41st     George W. Bush               George W. Bush
42nd     Hillary Clinton              Hillary Clinton
43rd     George W. Bush               Mickey Mouse
44th     John McCain                  John McCain

Are Source 1 and Source 2 dependent? S1 more likely to be a copier.
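
The direction test of Intuition II can be worked through on this example. The Python sketch below takes F to be accuracy and assumes the true values are known, which is done here only for illustration; in practice the truth is what is being sought. Values are compared as exact strings, so "Tom Jefferson" counts as different from "Thomas Jefferson". The names `TRUTH`, `S1`, `S2`, `accuracy`, and `accuracy_gaps` are introduced for this sketch.

```python
TRUTH = {1: "George Washington", 2: "John Adams", 3: "Thomas Jefferson",
         4: "James Madison", 41: "George H.W. Bush",
         42: "William J. Clinton", 43: "George W. Bush", 44: "Barack Obama"}

S1 = {1: "George Washington", 2: "John Adams", 3: "Thomas Jefferson",
      4: "Abraham Lincoln", 41: "George W. Bush", 42: "Hillary Clinton",
      43: "George W. Bush", 44: "John McCain"}

S2 = {1: "George Washington", 2: "Benjamin Franklin", 3: "Tom Jefferson",
      4: "Abraham Lincoln", 41: "George W. Bush", 42: "Hillary Clinton",
      43: "Mickey Mouse", 44: "John McCain"}


def accuracy(source, positions):
    """Fraction of the given positions where `source` matches TRUTH."""
    positions = list(positions)
    if not positions:
        return 0.0  # arbitrary convention for an empty set
    return sum(source[p] == TRUTH[p] for p in positions) / len(positions)


def accuracy_gaps(d1, d2):
    """Return (|F(D1 ∩ D2) - F(D1 - D2)|, |F(D1 ∩ D2) - F(D2 - D1)|)."""
    shared = [p for p in d1 if p in d2 and d1[p] == d2[p]]
    only1 = [p for p in d1 if p not in shared]  # values only S1 provides
    only2 = [p for p in d2 if p not in shared]  # values only S2 provides
    f_shared = accuracy(d1, shared)
    return (abs(f_shared - accuracy(d1, only1)),
            abs(f_shared - accuracy(d2, only2)))


gap1, gap2 = accuracy_gaps(S1, S2)
print(gap1, gap2)
print("S1 is the likelier copier" if gap1 > gap2 else "S2 is the likelier copier")
```

The shared values are mostly wrong (1 of 5 correct), S1's own values are all correct, and S2's own values are all wrong. S1's accuracy therefore changes sharply between shared and own data (a gap of about 0.8 against about 0.2 for S2), which is the signal that S1 is more likely the copier.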
 
 
 
 
 
 
 
 
 
 
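II. Applying Dependence bet. Sources in DI
Data conflicts (data fusion): truth discovery; integrating probabilistic data
Instance heterogeneity (record linkage): improve record linkage; distinguish between wrong values and alternative representations
Structure heterogeneity (query answering): query optimization; improve schema matching
Source recommendation: recommend trustworthy, up-to-date, and independent sources

Research Agenda: Solomon
Discovery: discovery of copying for snapshots of data; discovery of copying for update history; discovery of opinion influence in reviews
Applications: truth discovery; record linkage; query optimization; source recommendation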
Related Work
Data provenance [Buneman et al., PODS’08]
Assume knowledge of provenance/lineage
Focus on effective presentation and retrieval
Opinion pooling [Clemen & Winkler, 1985]
Combine probability distributions from multiple experts
Again, assume knowledge of dependence
Detect plagiarism of programs [Schleimer, SIGMOD’03]
Unstructured data
 
Discovering Dependence Between Sources
 
Challenges
Accurate sources: independently provide true values
Different coverage and expertise: specialist sources vs. generalist sources
Lazy copiers and slow providers
Partial dependence: copy only a subset of data, reformat some of the copied values, provide some info independently, etc.
Correlated information: common interest/belief system
Incomplete observations: hidden data, undiscovered sources, missing updates, etc.

Sub-problems
Discovery of copying for snapshots of data
  Sharing common false data
  Different accuracy on common data and distinct data
Discovery of copying for update history
  Same updates in a close enough time frame
  Different accuracy on pre-provided data and post-provided data
Discovery of opinion influence in ratings
 
App I. Data Fusion w. Source Dependence
 
Truth discovery
Decide one true value for each object.
Challenge: interdependence between truth discovery and dependence detection.
Integrating probabilistic data
Generate a probability distribution of possible values for each object.
Challenge: the dependence between sources may also be probabilistic.
Finding consensus opinions in recommendation systems.
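
One way to picture truth discovery that takes dependence into account is a vote in which likely copies count for less. The Python sketch below is an assumption-laden illustration, not the method of the slides: it presumes the copy probabilities are already known (`copy_prob` is a hypothetical input), whereas the challenge named above is precisely that they must be estimated together with the truth.

```python
from collections import defaultdict


def dependence_aware_vote(claims, copy_prob):
    """Vote in which a source's weight shrinks if it likely copied its value.

    claims    -- dict: source name -> asserted value (insertion order matters:
                 earlier sources are treated as possible originals)
    copy_prob -- dict: (copier, original) -> assumed probability of copying
    Returns (winning value, dict of scores per value).
    """
    scores = defaultdict(float)
    counted = []
    for source, value in claims.items():
        weight = 1.0
        for earlier in counted:
            if claims[earlier] == value:
                # Discount by the chance that this value was copied.
                weight *= 1.0 - copy_prob.get((source, earlier), 0.0)
        scores[value] += weight
        counted.append(source)
    return max(scores, key=scores.get), dict(scores)


# Hypothetical claims: S4, S5, S6 are believed to copy from S3.
claims = {"S1": "Barack Obama", "S2": "Barack Obama", "S3": "John McCain",
          "S4": "John McCain", "S5": "John McCain", "S6": "John McCain"}
copy_prob = {("S4", "S3"): 0.9, ("S5", "S3"): 0.9, ("S6", "S3"): 0.9}

winner, scores = dependence_aware_vote(claims, copy_prob)
print(winner, scores)
```

Here the four sources asserting the wrong value contribute roughly 1.3 votes in total instead of 4, so the two independent sources prevail. A fuller treatment would alternate between estimating the truth and re-estimating the copy probabilities.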
 
 
App II. Record Linkage w. Source Dependence
 
Record linkage
Knowledge of dependence between sources can improve record linkage.
Challenges
Again, interdependence between record linkage and dependence detection.
Distinguish alternative representations and wrong values; e.g.,
Xin Dong (official name)
Luna Dong (alternative)
Xin Deng (wrong value)
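
The name example shows why string similarity alone cannot separate alternative representations from wrong values. A minimal Python sketch of Levenshtein edit distance (one of the string-matching measures mentioned earlier) makes this visible; the function name `edit_distance` is introduced here for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Xin Dong", "Xin Deng"))   # wrong value
print(edit_distance("Xin Dong", "Luna Dong"))  # alternative representation
```

The wrong value "Xin Deng" is at distance 1 from the official name, while the legitimate alternative "Luna Dong" is at distance 3, so a pure edit-distance threshold would prefer the wrong value. Knowing which sources copy from which is one extra signal that could help tell the two cases apart.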
 
 
App III. Query Answering w. Source Dependence
 
Query Answering
Optimization: avoid visiting sources that depend on, or have been copied by, sources already visited.
Online query answering: first return partially computed answers, then update them as more sources are queried; sources need to be ordered so as to provide complete and accurate answers from the beginning.
Schema matching
Knowledge of dependence between sources can improve schema matching.
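
The query optimization idea can be sketched as a simple filter over a ranked source list. The Python below assumes the copying relationships are already known and given as a hypothetical `copies_from` mapping; how the sources are ranked in the first place is outside this sketch.

```python
def plan_sources(ranked_sources, copies_from):
    """Pick the sources to visit, in rank order, skipping redundant ones.

    ranked_sources -- source names, best first (ranking is assumed given)
    copies_from    -- dict: source -> set of sources it is believed to copy
    A source is skipped if it copies from a visited source, or if a
    visited source copied from it.
    """
    visited = []
    for src in ranked_sources:
        redundant = any(
            v in copies_from.get(src, set()) or src in copies_from.get(v, set())
            for v in visited
        )
        if not redundant:
            visited.append(src)
    return visited


# Hypothetical dependence knowledge: B copies from A, D copies from C.
copies_from = {"B": {"A"}, "D": {"C"}}

print(plan_sources(["A", "B", "C", "D"], copies_from))
print(plan_sources(["B", "A"], copies_from))
```

In the first call only A and C are visited; in the second, B is visited first, so A is skipped because B already copied from it. Since copying is often partial, a skipped source may still hold values the visited one lacks, so a real planner would weigh the expected gain rather than skip outright.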
 

Data integration faces three challenges: data conflicts, instance heterogeneity, and structure heterogeneity, addressed by techniques such as schema matching, model management, and query answering. Existing solutions assume that data sources are independent, an assumption undermined by technologies such as the Web that make copying data easy. False information on the Web further complicates truth finding: naïve voting can be misled by copied values, which motivates identifying dependence between data sources.


Uploaded on Sep 07, 2024


