Modern Analysis Practices for Big Data in Enterprise Data Warehousing

MAD skills:
New Analysis Practices for Big Data
2009
 
CS598 
 
Yue Sun (yuesun3)
 
Trend:
collect and leverage data in multiple organizational units

Data:
Fox Audience Network, using the Greenplum parallel database system
 
Motivation?

1. Cheap storage
2. Growing number of large-scale databases
3. Value of data analysis
4. Complement statisticians with software skills
 
Background

EDW: enterprise data warehouse

Problems?
Discourages integration of uncleaned new data sources
“Architected environment”: based on long-range design and planning
Limited statistical functionality
Requires data to fit in RAM

OLAP & data cubes
Provide descriptive statistics
 
 
Background, continued…

MapReduce and parallel programming:
data-parallel fashion via summations

Data mining and analytics:
correspond only to the statistical libraries that ship with a stat package

Statistical packages

Spreadsheets
 
MAD database
design
 
Objectives:
1. get the data into the warehouse as soon as possible
2. intelligent cleaning of data
3. intelligent integration of data
 
3-layer approach
1. staging schema
2. production data warehouse schema
3. reporting schema
 
Fourth class of
schema:
Sandbox
 
Why sandbox?
used for managing experimental processes
1. track and record work and work products
2. materialize query results and reuse the
results later
 
Data parallel
statistics
 
Layers of abstraction in traditional SQL
database
1. Simple arithmetic
2. Vector arithmetic
3. Functions
4. Functionals
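As a rough sketch of how these layers stack, a user-defined aggregate can lift SQL's built-in scalar arithmetic to vector arithmetic. This illustration uses Python's sqlite3 as a stand-in for the parallel DBMS (the table name `v` and aggregate name `dot` are made up for the example):

```python
import sqlite3

class DotProduct:
    """User-defined aggregate: accumulates sum(a_i * b_i) row by row,
    lifting SQL's scalar arithmetic (layer 1) to vector arithmetic (layer 2)."""
    def __init__(self):
        self.total = 0.0
    def step(self, a, b):
        self.total += a * b
    def finalize(self):
        return self.total

conn = sqlite3.connect(":memory:")
conn.create_aggregate("dot", 2, DotProduct)
conn.executescript("""
    CREATE TABLE v (i INTEGER, a REAL, b REAL);
    INSERT INTO v VALUES (1, 1.0, 4.0), (2, 2.0, 5.0), (3, 3.0, 6.0);
""")
(result,) = conn.execute("SELECT dot(a, b) FROM v").fetchone()
# result is 1*4 + 2*5 + 3*6 = 32.0
```

Because the aggregate only ever sees one row per `step` call, the engine is free to scan partitions in parallel and merge partial sums, which is what makes this style data-parallel.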
 
Vectors and Matrices

Matrix addition: (A + B)_ij = A_ij + B_ij

Matrix transpose: (Aᵀ)_ij = A_ji

Matrix-matrix multiplication: (AB)_ik = Σ_j A_ij · B_jk
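The paper expresses these operations in SQL over (row, column, value) tables; as a sketch of the same idea, matrix multiplication becomes a join on the inner index plus a GROUP BY. Python's sqlite3 stands in for the parallel DBMS here, and the table names `A` and `B` are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (i INTEGER, j INTEGER, v REAL);
    CREATE TABLE B (i INTEGER, j INTEGER, v REAL);
    -- A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] in (row, col, value) form
    INSERT INTO A VALUES (0,0,1),(0,1,2),(1,0,3),(1,1,4);
    INSERT INTO B VALUES (0,0,5),(0,1,6),(1,0,7),(1,1,8);
""")

# (AB)_ik = sum_j A_ij * B_jk: a join on the shared inner index, then GROUP BY.
# (Transpose is even simpler: SELECT j AS i, i AS j, v FROM A.)
product = conn.execute("""
    SELECT A.i, B.j, SUM(A.v * B.v) AS v
    FROM A JOIN B ON A.j = B.i
    GROUP BY A.i, B.j
    ORDER BY A.i, B.j
""").fetchall()
# product holds [[19, 22], [43, 50]] as (i, j, v) rows
```

Joins and grouped aggregations are exactly the operations a parallel database already knows how to partition, so the matrix product inherits the engine's parallelism for free.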
 
Tf-idf
document
similarity
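The tf-idf similarity the slide refers to can be sketched in plain Python using the standard weighting tf(t, d) · log(N / df(t)) and cosine similarity (function and variable names here are mine, not the paper's):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each whitespace-tokenized document to {term: tf * idf},
    with idf = log(N / df): rare terms get high weight."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(d.split()).items()}
            for d in docs]

def cosine(u, v):
    """Cosine similarity of two sparse tf-idf vectors (dicts)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = lambda vec: math.sqrt(sum(w * w for w in vec.values()))
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["big data analysis", "big data warehouse", "cooking recipes"]
vecs = tfidf_vectors(docs)
```

In the paper's setting the term counts, document frequencies, and dot products are all sums over rows, so each step maps onto the same SQL aggregation pattern as the matrix examples.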
 
Matrix based
analytical
methods
 
Ordinary least squares

Closed form: β* = (XᵀX)⁻¹ Xᵀy

XᵀX = Σᵢ xᵢxᵢᵀ and Xᵀy = Σᵢ yᵢxᵢ are sums over rows, so both can be accumulated in a single data-parallel pass.
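A minimal sketch of the one-pass idea, for simple linear regression y = b0 + b1·x (the function name and the 2×2 Cramer's-rule solve are my choices for illustration):

```python
def ols_streaming(rows):
    """Fit y = b0 + b1*x in one pass over (x, y) rows.
    Only running sums are kept, and sums are order-independent,
    so partitions can be scanned in parallel and merged."""
    n = sx = sxx = sy = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sxx += x * x
        sy += y
        sxy += x * y
    # Normal equations [[n, sx], [sx, sxx]] @ [b0, b1] = [sy, sxy],
    # solved by Cramer's rule for the 2x2 case.
    det = n * sxx - sx * sx
    b0 = (sy * sxx - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1
```

For more regressors the same pass accumulates the full XᵀX matrix and Xᵀy vector; only the final small solve is serial.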
Matrix based
analytical
methods
continue…
 
Conjugate gradient
 
update alpha(r_i, p_i, A)
update x(x_i, alpha_i, v_i)
update r(x_i, alpha_i, v_i, A)
update v(r_i, alpha_i, v_i, A)
 
Construction of these functions allows the user to
insert one full row at a time
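The four update steps on the slide can be sketched as a plain-Python conjugate gradient solver; each iteration touches the data only through one matrix–vector product and a few dot products, all of which are data-parallel sums (this is the textbook method, not the paper's SQL formulation):

```python
def conjugate_gradient(A, b, iters=50):
    """Solve A x = b for symmetric positive-definite A (list of rows).
    The loop body mirrors the slide's update_alpha / update_x /
    update_r / update_v steps."""
    dot = lambda u, w: sum(ui * wi for ui, wi in zip(u, w))
    x = [0.0] * len(b)
    r = list(b)            # residual r_0 = b - A x_0, with x_0 = 0
    p = list(b)            # initial search direction
    for _ in range(iters):
        rr = dot(r, r)
        if rr < 1e-20:     # residual is (numerically) zero: converged
            break
        Ap = [dot(row, p) for row in A]
        alpha = rr / dot(p, Ap)                          # update_alpha
        x = [xi + alpha * pi for xi, pi in zip(x, p)]    # update_x
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]   # update_r
        beta = dot(r, r) / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]     # update_v
    return x
```

Feeding the matrix in one full row at a time, as the slide notes, is exactly what the row-wise `dot(row, p)` products need.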
 
Functionals:
data-parallel implementations of comparative statistics expressed in SQL

Mann-Whitney U Test
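A sketch of the U statistic in plain Python (the paper expresses this in SQL; the rank-then-scan structure is the same either way):

```python
def mann_whitney_u(a, b):
    """U statistic for sample a against sample b.  Rank the pooled values
    (average ranks for ties) -- a sort plus one scan, both of which
    parallelize -- then U_a = R_a - n_a(n_a + 1)/2, R_a = a's rank sum."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1                    # [i, j) is one run of tied values
        avg_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1 .. j
        rank_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    return rank_a - len(a) * (len(a) + 1) / 2.0
```

U ranges from 0 (every value in `a` below every value in `b`) to n_a·n_b (the reverse); values near the middle indicate similar distributions.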
 
Log-likelihood
ratios
useful for
comparing a
subpopulation to
an overall
population on a
particular
attribute
 
Example: multinomial distribution
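For the multinomial case, one common form of this comparison is the G statistic 2·Σₖ Oₖ·ln(Oₖ/Eₖ), where the expected counts Eₖ scale the overall population's category proportions to the subpopulation's size. A sketch (the function name and dict-of-counts interface are my choices):

```python
import math

def log_likelihood_ratio(sub_counts, pop_counts):
    """G statistic comparing a subpopulation's category counts to the
    proportions of the overall population.  Zero means the subpopulation
    matches the whole; large values flag a distinctive subpopulation."""
    n_sub = sum(sub_counts.values())
    n_pop = sum(pop_counts.values())
    g = 0.0
    for k, observed in sub_counts.items():
        expected = n_sub * pop_counts[k] / n_pop
        if observed > 0:           # 0 * log(0) is taken as 0
            g += observed * math.log(observed / expected)
    return 2.0 * g
```

Since the counts themselves are GROUP BY aggregations, the whole computation reduces to sums over rows, fitting the data-parallel pattern of the other examples.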
 
Resampling
techniques
 
2 standard resampling techniques:

1. bootstrap
from a population of size N, pick k members randomly, compute the statistic; replace the subsample with another k random members…

2. jackknife
repeatedly compute the statistic by leaving out one or more data items
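Both techniques can be sketched in a few lines of Python (the seeded generator is my addition, for reproducibility; each resample's statistic is an independent aggregation, so the trials run in parallel trivially):

```python
import random
from statistics import mean

def bootstrap(data, stat, k, trials, seed=0):
    """Draw k members with replacement, compute the statistic, repeat;
    the spread of the results estimates the statistic's sampling error."""
    rng = random.Random(seed)
    return [stat([rng.choice(data) for _ in range(k)]) for _ in range(trials)]

def jackknife(data, stat):
    """Recompute the statistic leaving one data item out at a time."""
    return [stat(data[:i] + data[i + 1:]) for i in range(len(data))]
```

Example: `bootstrap(data, mean, k=10, trials=200)` yields 200 resampled means whose spread approximates the standard error of the mean.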
 
MAD DBMS requirements?

1. Easy and efficient to get new data sources into the warehouse
2. Make physical storage evolution easy and efficient
3. Provide a powerful, flexible programming environment
 
Storage and
partitioning
 
Require multiple storage mechanisms
 
1. [early stage]
iterate over tasks frequently; load databases
quickly; allow users to run queries directly
against external tables
2. [for “detail tables”]
well served by traditional transactional
storage techniques
3. [for fact tables]
better served by compressed storage
because it handles appends and reads more
efficiently
 
MAD DBMS
 
loading & unloading
Parallel access for external tables via Scatter/Gather
Streaming
Require coordination with external processes to
“feed” in parallel
 
More
Support transformations written in SQL
Support MapReduce scripting in DBMS
 
MAD DBMS
 
Support external tables
Traditional “heap” storage format for frequently updated data
Highly-compressed “append-only” (AO) table feature for data with no updates
With compression off: bulk loads run quickly
With most aggressive compression: use as little
space as possible
With medium compression: improved table scan
time with slower loads
 
MAD DBMS
 
Multiple ways to partition tables:
distribution policy
partitioning policy (for a table)
range partitioning policy
list partitioning policy
 
 
 
 
 
 
 
 
    Note that partitioning structure is completely mutable
 


Uploaded on Oct 05, 2024