Data Warehousing and Data Mining Concepts

DATA WAREHOUSING & DATA MINING

Prepared by:
Anita Parmar

DATA MINING: CONCEPTS AND TECHNIQUES
CHAPTER 2

Introduction to Data Mining
 
CHAPTER 2. INTRODUCTION

Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Classification of data mining systems
Data mining task primitives
Major issues in data mining
 
WHY DATA MINING?

The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, the Web, computerized society
Major sources of rich data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”
Data mining: automated analysis of massive data sets
 
EVOLUTION OF DATABASE TECHNOLOGY

1960s: Data collection, database creation, file creation
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s: Data mining, data warehousing, multimedia databases, and Web databases
2000s: Data mining and its applications; Web technology (XML, data integration)
 
Data rich but information poor

We want to know ...
 
Which types of transactions are likely to be fraudulent, given the transactional history of a particular customer?
If I raise the price of my product by Rs. 2, what is the effect on my business?
If I offer only 2,500 as an incentive to purchase rather than 5,000, how many lost responses will result?
If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
Which of my customers are likely to be the most loyal?
Data mining helps extract such information.
 
WHAT IS DATA MINING?

Data mining (knowledge discovery from data)
Extraction of interesting patterns or knowledge from huge amounts of data.
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data dredging (searching), information harvesting (gathering), business intelligence, etc.
 
KNOWLEDGE DISCOVERY FROM DATA (KDD) PROCESS

Data mining: the core of the knowledge discovery process

[Figure: the KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection and transformation -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

KDD PROCESS: SEVERAL KEY STEPS

1. Data cleaning: to remove noise and inconsistent data (may take 60% of the effort!).
2. Data integration: where multiple data sources may be combined.
3. Data selection: where data relevant to the analysis task are retrieved from the database.
4. Data transformation: where data are transformed or consolidated into forms appropriate for mining, for example by performing summary or aggregation operations.
5. Data mining: the search for patterns of interest; an essential process where intelligent methods are applied in order to extract data patterns.
6. Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on some interestingness measures.
7. Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user.
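
The seven KDD steps can be sketched as a small pipeline. This is only an illustration under assumed toy data: the record fields (`customer`, `amount`), the two in-memory sources, the spending cut-offs, and every function name are hypothetical and do not come from the slides. The "mining" step is a trivial labelling rule standing in for a real mining method.

```python
from collections import defaultdict

# Hypothetical toy sources; field names and values are assumptions.
SOURCE_A = [
    {"customer": "c1", "amount": 120.0},
    {"customer": "c2", "amount": None},   # incomplete record
    {"customer": "c1", "amount": 80.0},
]
SOURCE_B = [
    {"customer": "c3", "amount": 15.0},
    {"customer": "c2", "amount": 300.0},
]

def clean(records):
    """Step 1: drop records with missing values."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Step 2: combine multiple sources into one collection."""
    return [r for source in sources for r in source]

def select(records, min_amount):
    """Step 3: keep only the data relevant to the analysis task."""
    return [r for r in records if r["amount"] >= min_amount]

def transform(records):
    """Step 4: aggregate into a form suitable for mining (total per customer)."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)

def mine(totals):
    """Step 5: stand-in mining step that labels each customer by total spend."""
    return {c: ("high" if t >= 200 else "low") for c, t in totals.items()}

def evaluate(patterns):
    """Step 6: keep only the patterns deemed interesting (here: 'high')."""
    return {c: label for c, label in patterns.items() if label == "high"}

def present(patterns):
    """Step 7: render the result for the user."""
    return [f"{c}: {label} spender" for c, label in sorted(patterns.items())]

def run_pipeline():
    data = integrate(clean(SOURCE_A), clean(SOURCE_B))
    data = select(data, min_amount=50.0)
    return present(evaluate(mine(transform(data))))

if __name__ == "__main__":
    for line in run_pipeline():
        print(line)
```

Keeping each step as its own function mirrors the slide's point that mining is only one stage; most of the code, like most of the effort, sits in preparation.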
 
 
DATA MINING AND BUSINESS INTELLIGENCE

[Figure: pyramid of increasing potential to support business decisions; layers listed from top to bottom]
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles shown alongside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA

ARCHITECTURE OF A DATA MINING SYSTEM

Database, data warehouse, WWW, or other information repository:
One or more databases, data warehouses, spreadsheets, or other kinds of information repositories.
Data cleaning and data integration techniques may be performed on the data.

Database or data warehouse server:
Responsible for fetching the relevant data, based on the user’s data mining request.

Knowledge base:
Domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. For example:
Concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
User beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness.
Additional interestingness constraints or thresholds, and metadata.

Data mining engine:
Essential to the data mining system.
Consists of a set of functional modules.

Pattern evaluation module:
Employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.
It may use interestingness thresholds to filter out discovered patterns.
In many systems, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used.

User interface:
Communicates between users and the data mining system.
Allows the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.
Allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
 
DATA MINING: ON WHAT KINDS OF DATA?

Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Object-relational databases
Temporal data, sequence data (incl. bio-sequences), time-series data
Time-related data, customer shopping sequences, sequences of values measured repeatedly over time (hourly, daily, monthly)
Spatial data and spatiotemporal data
Geographic databases, VLSI data, satellite images, etc.
Heterogeneous databases and legacy databases
E.g., information on student performance at different schools
Data streams
Multimedia databases
Text databases
The World-Wide Web
 
DATA MINING FUNCTIONALITIES: WHAT KINDS OF PATTERNS CAN BE MINED

Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics
E.g., find the characteristics of customers who spend more than 10,000 per month
E.g., compare customers who shop regularly versus those who shop rarely
Frequent patterns, association, correlation
computer => printer [support = 0.5%, confidence = 75%]
Classification and prediction
Construct models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage
Predict some unknown or missing numerical values
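
The two bracketed numbers attached to the computer/printer association rule are its support and confidence. A minimal sketch of how they might be computed follows; the basket data and the function names are made up for illustration, and the toy numbers are chosen so that the confidence happens to come out at 75%.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return support(transactions, set(antecedent) | set(consequent)) / base

# Hypothetical market-basket data (not from the slides).
TRANSACTIONS = [
    {"computer", "printer"},
    {"computer", "printer", "paper"},
    {"computer", "printer"},
    {"computer"},
    {"paper"},
    {"paper", "ink"},
    {"ink"},
    {"printer", "ink"},
]

if __name__ == "__main__":
    s = support(TRANSACTIONS, {"computer", "printer"})
    c = confidence(TRANSACTIONS, {"computer"}, {"printer"})
    print(f"computer => printer [support = {s:.1%}, confidence = {c:.1%}]")
```

Support says how often the whole rule applies across all transactions; confidence says how often the consequent appears once the antecedent is present.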
 
DATA MINING FUNCTIONALITIES (2)

Cluster analysis
Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity and minimizing interclass similarity
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
Noise or exception? Useful in fraud detection and rare-event analysis
Trend and evolution analysis
Regularities or trends for objects whose behavior changes over time, e.g., stock exchange data
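
Outlier analysis can be illustrated with a simple z-score check. The slides do not name a particular method, so the z-score rule, the threshold of 2.0, and the sample amounts below are all assumptions chosen for brevity; real fraud detection would use far more robust techniques.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; one is far from the rest.
AMOUNTS = [52, 48, 50, 47, 53, 49, 51, 500]

if __name__ == "__main__":
    print(zscore_outliers(AMOUNTS))
```

Whether a flagged value is noise to discard or an exception worth investigating is the judgment the slide points to with "Noise or exception?".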
 
ARE ALL THE “DISCOVERED” PATTERNS INTERESTING?

Data mining may generate thousands of patterns: not all of them are interesting
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on user’s belief in the data
 
FIND ALL AND ONLY INTERESTING PATTERNS?

Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
Association vs. classification vs. clustering
Search for only interesting patterns: An optimization problem
Can a data mining system find only the interesting patterns?
Approaches
First generate all the patterns and then filter out the uninteresting ones
Generate only the interesting patterns: mining query optimization
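
The two approaches can be contrasted on frequent-itemset mining, taking "interesting" to mean "appears in at least `min_count` transactions". That definition, the toy transactions, and the function names are assumptions for this sketch. The second function uses Apriori-style level-wise pruning, a technique the slides do not name, as one possible way to avoid generating candidates that cannot be frequent.

```python
from itertools import combinations

# Hypothetical transactions over items a, b, c.
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]

def count_support(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def generate_then_filter(transactions, min_count):
    """Approach 1: enumerate every possible itemset, then filter by support."""
    items = sorted(set().union(*transactions))
    frequent = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            if count_support(candidate, transactions) >= min_count:
                frequent.add(candidate)
    return frequent

def levelwise(transactions, min_count):
    """Approach 2: only extend itemsets that are already known to be frequent."""
    items = sorted(set().union(*transactions))
    level = {frozenset([i]) for i in items
             if count_support(frozenset([i]), transactions) >= min_count}
    frequent = set(level)
    while level:
        size = len(next(iter(level))) + 1
        # Join: merge pairs of frequent itemsets that differ in exactly one item.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune: every subset one item smaller must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, size - 1))}
        level = {c for c in candidates
                 if count_support(c, transactions) >= min_count}
        frequent |= level
    return frequent

if __name__ == "__main__":
    for itemset in sorted(levelwise(TRANSACTIONS, 3), key=lambda s: (len(s), sorted(s))):
        print(sorted(itemset))
```

Both functions return the same itemsets; the difference is how many candidates each one has to count, which grows quickly with the number of items for the first approach.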
 
CLASSIFICATION OF DATA MINING SYSTEMS

[Figure: data mining at the confluence of database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines]

Kinds of Databases to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
Kinds of Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Kinds of Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
 
PRIMITIVES THAT DEFINE A DATA MINING TASK

Task-relevant data
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
Type of knowledge to be mined
Characterization, discrimination, association, classification,
prediction, clustering, outlier analysis, other data mining tasks
Background knowledge
Pattern interestingness measurements
Visualization/presentation of discovered patterns
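
One way to picture the five primitives is as the fields of a task specification. The sketch below is hypothetical throughout: the class name `MiningTask`, its field names, and the example values (`sales_dw`, `transactions`, the thresholds) are invented for illustration and do not correspond to any particular data mining query language.

```python
from dataclasses import dataclass, field

@dataclass
class MiningTask:
    # Primitive 1: task-relevant data
    database: str
    tables: list
    selection_condition: str
    relevant_attributes: list
    group_by: list
    # Primitive 2: type of knowledge to be mined
    knowledge_type: str
    # Primitive 3: background knowledge (e.g., concept hierarchies)
    concept_hierarchies: dict = field(default_factory=dict)
    # Primitive 4: pattern interestingness measurements
    interestingness: dict = field(default_factory=dict)
    # Primitive 5: visualization/presentation of discovered patterns
    presentation: str = "table"

    def describe(self):
        """One-line summary of what the task asks for."""
        return (f"Mine {self.knowledge_type} from {', '.join(self.tables)} "
                f"in {self.database} where {self.selection_condition}")

# Hypothetical example task.
EXAMPLE_TASK = MiningTask(
    database="sales_dw",
    tables=["transactions"],
    selection_condition="amount > 10000",
    relevant_attributes=["age", "income", "item_category"],
    group_by=["item_category"],
    knowledge_type="association",
    concept_hierarchies={"age": {"20-39": "young", "40-59": "middle_aged"}},
    interestingness={"min_support": 0.005, "min_confidence": 0.75},
    presentation="rules",
)

if __name__ == "__main__":
    print(EXAMPLE_TASK.describe())
```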
 
PRIMITIVE 3: BACKGROUND KNOWLEDGE

A typical kind of background knowledge: Concept hierarchies
Schema hierarchy
E.g., street < city < province_or_state < country
Set-grouping hierarchy
E.g., {20-39} = young, {40-59} = middle_aged
Operation-derived hierarchy
email address: hagonzal@cs.uiuc.edu
login-name < department < university < country
Rule-based hierarchy
low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
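
The four kinds of concept hierarchy can each be sketched as a small function or lookup. The function names are hypothetical, and the email parser assumes addresses of the form `login@department.university.tld`; it stops at the university level because the sketch makes no attempt to map a top-level domain to a country.

```python
# Schema hierarchy: street < city < province_or_state < country
SCHEMA_HIERARCHY = ["street", "city", "province_or_state", "country"]

def roll_up(level):
    """Return the next more general schema level, or None at the top."""
    i = SCHEMA_HIERARCHY.index(level)
    return SCHEMA_HIERARCHY[i + 1] if i + 1 < len(SCHEMA_HIERARCHY) else None

def age_group(age):
    """Set-grouping hierarchy: {20-39} = young, {40-59} = middle_aged."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    return None  # no group is defined outside 20-59 in the example

def email_levels(address):
    """Operation-derived hierarchy: levels obtained by parsing the address."""
    login, domain = address.split("@", 1)
    parts = domain.split(".")
    return {"login_name": login, "department": parts[0], "university": parts[1]}

def low_profit_margin(price, cost):
    """Rule-based hierarchy: an item is low-margin if (price - cost) < 50."""
    return (price - cost) < 50

if __name__ == "__main__":
    print(roll_up("city"))
    print(age_group(25), age_group(45))
    print(email_levels("hagonzal@cs.uiuc.edu"))
    print(low_profit_margin(120, 90))
```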
 
INTEGRATION OF DATA MINING AND DATA WAREHOUSING

Coupling of data mining systems with DBMS and data warehouse systems
No coupling, loose coupling, semi-tight coupling, tight coupling
 
COUPLING DATA MINING WITH DB/DW SYSTEMS

No coupling: flat file processing, not recommended
Loose coupling
Fetching data from DB/DW
Semi-tight coupling: enhanced DM performance
Provide efficient implementations of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
Tight coupling: a uniform information processing environment
DM is smoothly integrated into a DB/DW system; mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.
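
The difference between loose and semi-tight coupling can be hinted at with an in-memory SQLite database. This is a loose analogy rather than a real data mining system: SQLite stands in for the DB/DW, the `sales` table and its rows are invented, and "semi-tight" is represented only by letting the database run an aggregation primitive instead of shipping raw rows to the application.

```python
import sqlite3

def make_db():
    """Create a hypothetical in-memory sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("computer", 900.0), ("printer", 150.0),
         ("computer", 1100.0), ("printer", 250.0)],
    )
    return conn

def totals_loose(conn):
    """Loose coupling: fetch raw rows, then aggregate in the application."""
    totals = {}
    for item, amount in conn.execute("SELECT item, amount FROM sales"):
        totals[item] = totals.get(item, 0.0) + amount
    return totals

def totals_in_db(conn):
    """Semi-tight style: the aggregation primitive runs inside the database."""
    rows = conn.execute("SELECT item, SUM(amount) FROM sales GROUP BY item")
    return dict(rows)

if __name__ == "__main__":
    conn = make_db()
    print(totals_loose(conn))
    print(totals_in_db(conn))
```

Both functions produce the same totals; what changes is where the work happens and how much data has to leave the database.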
 
MAJOR ISSUES IN DATA MINING

Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed, and incremental mining methods
Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
 
SUMMARY

Data mining: Discovering interesting patterns from large amounts
of data
A natural evolution of database technology, in great demand, with
wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis,
etc.
Data mining systems and architectures
Major issues in data mining
 

This presentation delves into the realms of data warehousing and data mining, discussing concepts, techniques, and the importance of mining data for valuable insights. It covers topics such as the evolution of database technology, the significance of data mining, and the necessity for extracting knowledge from vast amounts of data available today.

Uploaded on Jul 26, 2024

