Data Cleaning and Integration
Computing Systems: 1/30/18 Lecture
Professor Eugene Wu
By Kathy Lin (kl2615), Xiaohui Guo (xg2225), and Aria Kumar (sk4345)
Motivation for Data Integration

Reduce unused inventory costs
Assess the supply and demand of different products
Market Basket Analysis: used by large retailers to see if there are associations between items. It is an association analysis technique that tells you whether buying one group of items makes you more or less likely to buy another group of items. In class we were told the famous diapers-and-beer example, where a retailer found that those two items were frequently bought together (a small illustrative sketch follows below).
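As a toy illustration of the association idea (not code from the lecture), the sketch below computes the support and confidence of a hypothetical rule {diapers} -> {beer} over a list of made-up market baskets; the items and numbers are assumptions for illustration only.

    # Toy market-basket example: support and confidence for {diapers} -> {beer}.
    # The baskets below are made-up data for illustration only.
    baskets = [
        {"diapers", "beer", "chips"},
        {"diapers", "beer"},
        {"milk", "bread"},
        {"diapers", "milk"},
        {"beer", "chips"},
    ]

    def support(itemset, baskets):
        """Fraction of baskets that contain every item in `itemset`."""
        return sum(itemset <= b for b in baskets) / len(baskets)

    antecedent, consequent = {"diapers"}, {"beer"}
    supp = support(antecedent | consequent, baskets)
    conf = supp / support(antecedent, baskets)
    print(f"support={supp:.2f}, confidence={conf:.2f}")  # 0.40 and 0.67 here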
Complications for Data Integration

Q: What happens when you have data coming from more than one source / schema?
Consider a conglomerate of stores where each store has a different schema -- this makes it very difficult to integrate the data.
Different item codes, different syntax, and different items at different stores are a few examples of the difficulties of combining data from different stores into one database.
In the example below, we see one of these complications: different stores may have different identifiers for the same item.

    Store #    Item Code (Diapers)
    Store 1    pd10123
    Store 2    S310122
ETL: Extract, Transform, and Load

Three database functions used to pull data out of one database and place it in the global database.
Extract: the process of collecting and reading data from a database. Data is often collected from multiple sources.
Transform: the process of converting the extracted data into the appropriate form to be placed in the global database. Rules, lookup tables, and queries are used to accomplish this.
Load: the process of writing the data into the global database. A minimal end-to-end sketch follows.
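As a rough sketch of the three steps (not the lecture's code), the example below extracts rows from a CSV file, transforms them with a lookup table so they fit an assumed global schema, and loads them into a SQLite table. The file name, column names, and item-code mapping are all hypothetical.

    import csv
    import sqlite3

    # Extract: read raw rows from a (hypothetical) per-store export.
    with open("store1_sales.csv", newline="") as f:
        rows = list(csv.DictReader(f))   # assumed columns: item_code, qty

    # Transform: map store-specific item codes to global product ids
    # using a lookup table, and convert quantities to integers.
    code_to_global = {"pd10123": "DIAPERS", "S310122": "DIAPERS"}  # assumed mapping
    cleaned = [
        (code_to_global.get(r["item_code"], "UNKNOWN"), int(r["qty"]))
        for r in rows
    ]

    # Load: write the transformed rows into the global database.
    conn = sqlite3.connect("global.db")
    conn.execute("CREATE TABLE IF NOT EXISTS sales (product TEXT, qty INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
    conn.commit()
    conn.close()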
ETL: Extract Transform Load (also known as master data management)

1. Define a global schema
2. ETL data from each source to fit the global schema

[Slide diagram: source data -- Sales Orders, Product Information, Marketing Campaign -- from Store1, Store2, and Store3 flow into a single MASTER database organized under the GLOBAL SCHEMA.]

Potential problems may arise during data importing: because each store has a different schema, it is challenging to fit the global schema when importing.
Potential Data Integration Problems

Humans define the global schema:
Since humans define and create the schema, employee turnover and poor documentation can lead to ambiguous schema values and cause difficulties in matching schemas.
Programmers don't scale:
It is both time-inefficient and expensive to send programmers onsite to conduct interviews to decipher ambiguous schemas. Updating schemas requires a lot of software engineering manpower, and it may be hard to find software engineers interested in doing this particular work.
Potential Data Integration Problems (continued)

Data changes constantly:
Restaurants can change addresses or go out of business, requiring potential updates to the schema. Business needs can also grow over time, which requires updates to the schema that could take months; by the time you update, there are potentially more changes that need to be integrated.
Data could be wrong (expect 10% wrong data on average):
As we will see in the second half of the lecture, human errors or scripting errors can be prevalent in data without warning.
Other Potential Data Integration Problems

Unstructured data
Missing values in data
Data may be too large
Outdated schema due to changing business needs
Employee turnover / poor documentation can lead to ambiguous schema values
Examples of limitations of a global schema for business use

A beer company wants to know how weather affects beer sales. However, weather is not in the data warehouse within the global schema, so the business question can't be answered, since weather data is hard to add.
"How many employees are there?" can be hard to answer due to company mergers and acquisitions.
Steps of Data Integration

1. Ingest: getting/saving/downloading data
2. Transform: unit conversions, simple data extraction/wrangling, dict syntax
3. Clean: dealing with nulls and outliers
4. Schema Integration: merging multiple tables into 1 table
5. Deduplication: merging multiple rows into 1 unique row

Note: steps 1-5 above can occur in any order.

Pipeline: Ingest -> Transform -> Clean -> Schema Integration -> Dedup

1. Ingest: getting/saving/downloading data
Parsing
Converting file formats
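As a minimal illustration of the ingest step (parsing one file format and saving another), the sketch below reads a hypothetical CSV export with the standard library and re-saves it as JSON; the file names and columns are assumptions, not from the lecture.

    import csv
    import json

    # Parse a (hypothetical) CSV export downloaded from a store.
    with open("store1_export.csv", newline="") as f:
        records = list(csv.DictReader(f))

    # Convert the file format: save the parsed rows as JSON for later steps.
    with open("store1_export.json", "w") as f:
        json.dump(records, f, indent=2)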
2. Transform

Data Wrangling: simple transformations to data (see example below) to change its formatting
Regular Expressions (RegEx): a search pattern used to help find the relevant data
Extraction / Value Filling: selecting/filtering certain values from the input / replacing NULLs with 0s
Dictionaries: transforming starting inputs/keys into dictionary-mapped output using a dictionary/lookup table
String -> Tuple transformations: splitting a data string "abc" -> ("a", "b", "c")

Data Wrangling example (no semantics, only formatting changes):
Old version of table: Blah; blah2, blah3
New version of table: Blah, blah2, blah3
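To make these transformations concrete, the sketch below applies a regular expression, a lookup dictionary, value filling, and a string-to-tuple split to a made-up input row; the patterns and mappings are illustrative assumptions, not the lecture's data.

    import re

    raw = "Store 2;S310122; qty=15"

    # Regular expression: pull out the quantity field from the raw string.
    qty = int(re.search(r"qty=(\d+)", raw).group(1))

    # Dictionary / lookup table: map a store-specific item code to a global name.
    item_lookup = {"pd10123": "diapers", "S310122": "diapers"}  # assumed mapping
    item = item_lookup.get("S310122", "unknown")

    # Value filling: replace a NULL (None) with 0.
    discount = None
    discount = 0 if discount is None else discount

    # String -> tuple transformation: split the delimited string into fields.
    fields = tuple(part.strip() for part in raw.split(";"))

    print(item, qty, discount, fields)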
3. Clean

The first step with semantics: are these values "correct"?
Cleans up nulls
Cleans up outliers
Define constraints (e.g., an Age column must take an integer > 0)
Functional dependencies: given one variable's value, you can infer another's

Example of functional dependence (Zipcode -> State):

    Zipcode   State
    10027     NY
    10027     NYC
    10028     NY
    foo       NY

We can infer that, given 10027 maps to both NY and NYC, NY and NYC most likely correspond to the same state. We can potentially transform NYC -> NY to be more consistent with the rest of the rows of the table.
Given that Zipcode mainly consists of 5-digit numbers, we can infer that "foo" is an invalid entry in the Zipcode column.
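A small sketch of these ideas applied to the table above: it flags rows that violate a simple constraint (zip codes must be five digits) and repairs the NYC value so the Zipcode -> State dependency holds. The majority-value repair rule is a simplification for illustration, not the lecture's method.

    import re
    from collections import Counter

    rows = [
        {"zipcode": "10027", "state": "NY"},
        {"zipcode": "10027", "state": "NYC"},
        {"zipcode": "10028", "state": "NY"},
        {"zipcode": "foo",   "state": "NY"},
    ]

    # Constraint check: a zipcode must be exactly five digits.
    invalid = [r for r in rows if not re.fullmatch(r"\d{5}", r["zipcode"])]
    print("constraint violations:", invalid)

    # Functional-dependency repair: force zipcode -> state to be single-valued
    # by rewriting minority spellings (here NYC) to the majority value (NY).
    by_zip = {}
    for r in rows:
        by_zip.setdefault(r["zipcode"], []).append(r["state"])
    majority = {z: Counter(states).most_common(1)[0][0] for z, states in by_zip.items()}
    for r in rows:
        r["state"] = majority[r["zipcode"]]

    print(rows)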
4. Schema Matching/Integration

Understanding which attributes in a table map to attributes in an alternate source's tables (merging multiple tables into a single table).

Example: Table 1 has First Name and Last Name columns (e.g., "Jane", "Doe"); Table 2 has a single Contact column (e.g., "Jane Doe"). The information in Table 1's first two columns and Table 2's first column can be associated and combined into the Customer column of Table 3 (e.g., "Jane Doe").
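A minimal sketch of this mapping using pandas (assuming the library is available): both source schemas are normalized onto an assumed global Customer attribute and stacked into one table, using the slide's toy Jane Doe rows.

    import pandas as pd

    # Table 1: first and last name stored in separate columns.
    table1 = pd.DataFrame({"First Name": ["Jane"], "Last Name": ["Doe"]})
    # Table 2: the same information stored in a single Contact column.
    table2 = pd.DataFrame({"Contact": ["Jane Doe"]})

    # Schema matching: map both source schemas onto the global Customer attribute.
    t1 = pd.DataFrame({"Customer": table1["First Name"] + " " + table1["Last Name"]})
    t2 = table2.rename(columns={"Contact": "Customer"})

    # Schema integration: merge the mapped tables into one table (Table 3).
    table3 = pd.concat([t1, t2], ignore_index=True)
    print(table3)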
5. Deduplication

Also known as clustering and entity matching (merging multiple rows into 1 unique row). Done naively by comparing every pair of rows, it is an O(N^2) algorithm.

e.g.: Amazon may have multiple entries of the same item under different names. It needs to fix the duplicates in order to accurately display the stock of an item.

[Slide example: two tables with Seller, Product, and Qty columns, where the same product appears under different names (e.g., "iPhone X" and "GREAT! iPhone X"); deduplication merges those rows so each seller/product pair appears once.]
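The sketch below shows the naive O(N^2) pairwise comparison: every pair of rows is checked with a string-similarity score, and sufficiently similar product names from the same seller are flagged as duplicates. The rows and the 0.6 threshold are illustrative assumptions, not the lecture's numbers.

    from difflib import SequenceMatcher
    from itertools import combinations

    rows = [
        {"seller": "A", "product": "iPhone 6", "qty": 80},
        {"seller": "B", "product": "iPhone X", "qty": 100},
        {"seller": "B", "product": "GREAT! iPhone X", "qty": 50},
    ]

    def similar(a, b):
        """Crude name similarity in [0, 1]."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # O(N^2): compare every pair of rows and flag likely duplicate entities.
    duplicates = []
    for r1, r2 in combinations(rows, 2):
        if r1["seller"] == r2["seller"] and similar(r1["product"], r2["product"]) > 0.6:
            duplicates.append((r1, r2))

    print(duplicates)  # the two 'iPhone X' listings from seller B should match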
Truth Finding on the Deep Web
How trustworthy is web-scraped data?

Deep Web Data: data stored in underlying databases and queried using Web forms.

[Slide diagram: a Webpage queries the underlying Database through an app view / API.]
Introduction to Truth Finding on the Deep Web

Using web data in two influential and presumably "clean" domains, Stock and Flight, researchers[1] wanted to answer the following questions:
1. Is there a lot of redundant data on the Web?
2. Are the data consistent?
3. Does each source provide data of high quality in terms of correctness, and is the quality consistent over time?
4. Is there any copying?

[1] Li, Xian, et al. "Truth Finding on the Deep Web." Proceedings of the VLDB Endowment, vol. 6, no. 2, 2012, pp. 97-108. doi:10.14778/2535568.2448943.
Data Model

For each domain (Stocks and Flights):
1. Objects: e.g., a flight on a particular day
2. Attributes: an object is described by a set of attributes (e.g., a particular flight can be described by its scheduled departure time and actual departure time)
3. Data Item: a particular attribute of a particular object, which has a single true value that reflects the real world (e.g., the actual departure time of a flight: the minute the airplane leaves the gate on that specific day)

Given m = 38 sources, for a data item you obtain the values [v_1, ..., v_m], where each v_i is a (stock, day, price) observation. A reported value can be incorrect: are the values consistent?
Data Redundancy Results

Stock Domain:
16% of the sources provide all 1000 observed stocks (objects)
All sources provide over 90% of the stocks
50% of sources have all stocks
83% of the stocks have full redundancy (provided by all sources)

Flight Domain:
36% of sources cover 90% of the flights
60% of sources cover more than 50% of the flights
87% of the flights have a redundancy of over 50%
29% of attributes appear in 50% of the sources
Data Consistency Metrics

Metrics used to measure data consistency:
Entropy
Variance
# of unique values
Dominant value (D) (similar to majority vote) -- researchers proceeded with this metric due to its ease of implementation
Deviation from the dominant value
Dominance factor: % of data sources that return D

Dominance factor example: given the reported departure delays for flight UA101 on 1/21/18: [0, 0, 0, 10, 15, 15, 20, 5]
Dominant value: 0
Dominance factor: 3/8
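A minimal sketch of how the dominant value and dominance factor could be computed from the slide's example list (plain Python, nothing domain-specific):

    from collections import Counter
    from fractions import Fraction

    # Reported departure delays for one data item (flight UA101 on 1/21/18).
    values = [0, 0, 0, 10, 15, 15, 20, 5]

    # Dominant value D: the most frequently reported value (like a majority vote).
    counts = Counter(values)
    dominant, freq = counts.most_common(1)[0]

    # Dominance factor: fraction of sources that report D.
    dominance_factor = Fraction(freq, len(values))

    print(dominant, dominance_factor)  # 0 and 3/8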
Data Consistency Results

Stocks: 3.7 values per attribute on average; 17% of data items have 1 value; 30% of data items have 2 values
Flights: 1.45 values per attribute on average; 61% of data items have 1 value; 93% of attributes have <= 2 values

Note: even though there is less value inconsistency for the Flight domain, there is an observed larger deviation for departure delay values.
Potential Reasons for Inconsistency

Stocks:
(46%) Semantic ambiguity: words can have multiple meanings (e.g., different sources define Dividend over different periods -- year, quarter, half-year)
(34%) Out-of-date data: data that was true at one point but may not be true anymore
(11%) Purely erroneous data: completely incorrect data, possibly due to human error, etc.
(6%) Instance ambiguity: a source interprets one stock symbol differently from the majority of sources, e.g., when stock symbols are terminated
(3%) Unit error: incorrect units

Flights:
(56%) Pure errors: completely incorrect data, possibly due to human error, etc.
(33%) Semantic ambiguity: words can have multiple meanings (e.g., flight delay can mean taxiing delay or wheels-off delay)
(11%) Out-of-date data: data that was true at one point but may not be true anymore
Precision vs. Dominance Factor

Summary:
In the Stock domain, 73% of data items have a dominance factor > 0.5, and 98% of those dominant values are consistent with the gold standard.
In the Flight domain, more data items have a high dominance factor (82% of data items have a dominance factor > 0.5), but these dominant values have lower precision (88% of dominant values are consistent with the gold standard).
Overall, dominant values have a precision of 0.908 for Stock and 0.864 for Flight.
Why does Flight have lower precision for dominant values? Data copying.
Data Copying Results

Both domains exhibit copying between deep-web sources:
o Copying could be claimed explicitly
o Or detected via query redirection
Copying between sources with slightly different schemas still provided almost the same objects and the same values.
Accuracy of the original sources ranged:
Stock: 0.75-0.92
Flight: 0.53-0.93
Note: because the Flight domain contains more low-accuracy sources with copying, removing these copied sources IMPROVES the precision of the dominant values more significantly than in the Stock domain.
Data Fusion

Picking the right fusion algorithm still MATTERS!

    Domain    Vote    Accuracy    AccuCopy (Fusion)
    Stock     88%     92%         99%
    Flight    88%     95%         98%

Researchers tested 15 approaches and ranked the performance of each algorithm: Majority vote < Best source < Best Fusion.
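To make the weakest baseline concrete, here is a toy majority-vote fusion over per-source values for one data item; the source readings are made up for illustration, and the paper's stronger fusion algorithms (such as AccuCopy) additionally model source accuracy and copying.

    from collections import Counter

    # Values reported by different (hypothetical) sources for one data item,
    # e.g. the closing price of a stock on a given day.
    source_values = {"src1": 10.5, "src2": 10.5, "src3": 10.7, "src4": 10.5, "src5": 10.7}

    def majority_vote(values):
        """Fuse by picking the most commonly reported value."""
        return Counter(values).most_common(1)[0][0]

    print(majority_vote(source_values.values()))  # 10.5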