Modern Analysis Practices for Big Data in Enterprise Data Warehousing

MAD skills:
New Analysis Practices for Big Data
2009
 
CS598 
 
Yue Sun (yuesun3)
 
Trend:
collect and leverage data in multiple organizational units

Data:
Fox Audience Network, using the Greenplum parallel database system
 
Motivation?

1. Cheap storage
2. Growing number of large-scale databases
3. Value of data analysis
4. Complement statisticians with software skills
 
Background

EDW: enterprise data warehouse

Problems?
Discourages integration of uncleaned new data sources
“Architected environment”: based on long-range design and planning
Limited statistical functionality
Requires data to fit in RAM

OLAP & data cubes
Provide descriptive statistics
 
 
Background, continued…

MapReduce and parallel programming:
data-parallel fashion via summations

Data mining and analytics:
correspond only to the statistical libraries that ship with a stat package

Statistical packages

Spreadsheets
 
MAD database
design
 
Objectives:
1. get the data into the warehouse as soon as possible
2. intelligent cleaning of data
3. intelligent integration of data
 
3-layer approach
1. staging schema
2. production data warehouse schema
3. reporting schema
 
Fourth class of
schema:
Sandbox
 
Why sandbox?
used for managing experimental processes
1. track and record work and work products
2. materialize query results and reuse the
results later
 
Data parallel
statistics
 
Layers of abstraction in traditional SQL
database
1. Simple arithmetic
2. Vector arithmetic
3. Functions
4. Functionals
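As a rough sketch of how these layers stack, a user-defined aggregate can lift SQL's built-in scalar arithmetic to vector arithmetic. This illustration uses Python's sqlite3 as a stand-in for the parallel DBMS (the table name `v` and aggregate name `dot` are made up for the example):

```python
import sqlite3

class DotProduct:
    """User-defined aggregate: accumulates sum(a_i * b_i) row by row,
    lifting SQL's scalar arithmetic (layer 1) to vector arithmetic (layer 2)."""
    def __init__(self):
        self.total = 0.0
    def step(self, a, b):
        self.total += a * b
    def finalize(self):
        return self.total

conn = sqlite3.connect(":memory:")
conn.create_aggregate("dot", 2, DotProduct)
conn.executescript("""
    CREATE TABLE v (i INTEGER, a REAL, b REAL);
    INSERT INTO v VALUES (1, 1.0, 4.0), (2, 2.0, 5.0), (3, 3.0, 6.0);
""")
(result,) = conn.execute("SELECT dot(a, b) FROM v").fetchone()
# result is 1*4 + 2*5 + 3*6 = 32.0
```

Because the aggregate only ever sees one row per `step` call, the engine is free to scan partitions in parallel and merge partial sums, which is what makes this style data-parallel.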
 
Vectors and Matrices

Matrix addition: (A + B)_ij = A_ij + B_ij

Matrix transpose: (Aᵀ)_ij = A_ji

Matrix-matrix multiplication: (AB)_ik = Σ_j A_ij · B_jk
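The paper expresses these operations in SQL over (row, column, value) tables; as a sketch of the same idea, matrix multiplication becomes a join on the inner index plus a GROUP BY. Python's sqlite3 stands in for the parallel DBMS here, and the table names `A` and `B` are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (i INTEGER, j INTEGER, v REAL);
    CREATE TABLE B (i INTEGER, j INTEGER, v REAL);
    -- A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] in (row, col, value) form
    INSERT INTO A VALUES (0,0,1),(0,1,2),(1,0,3),(1,1,4);
    INSERT INTO B VALUES (0,0,5),(0,1,6),(1,0,7),(1,1,8);
""")

# (AB)_ik = sum_j A_ij * B_jk: a join on the shared inner index, then GROUP BY.
# (Transpose is even simpler: SELECT j AS i, i AS j, v FROM A.)
product = conn.execute("""
    SELECT A.i, B.j, SUM(A.v * B.v) AS v
    FROM A JOIN B ON A.j = B.i
    GROUP BY A.i, B.j
    ORDER BY A.i, B.j
""").fetchall()
# product holds [[19, 22], [43, 50]] as (i, j, v) rows
```

Joins and grouped aggregations are exactly the operations a parallel database already knows how to partition, so the matrix product inherits the engine's parallelism for free.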
 
Tf-idf
document
similarity
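The tf-idf similarity the slide refers to can be sketched in plain Python using the standard weighting tf(t, d) · log(N / df(t)) and cosine similarity (function and variable names here are mine, not the paper's):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each whitespace-tokenized document to {term: tf * idf},
    with idf = log(N / df): rare terms get high weight."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(d.split()).items()}
            for d in docs]

def cosine(u, v):
    """Cosine similarity of two sparse tf-idf vectors (dicts)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = lambda vec: math.sqrt(sum(w * w for w in vec.values()))
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["big data analysis", "big data warehouse", "cooking recipes"]
vecs = tfidf_vectors(docs)
```

In the paper's setting the term counts, document frequencies, and dot products are all sums over rows, so each step maps onto the same SQL aggregation pattern as the matrix examples.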
 
Matrix based
analytical
methods
 
Ordinary least squares

Closed form: β* = (XᵀX)⁻¹ Xᵀy

XᵀX = Σᵢ xᵢxᵢᵀ and Xᵀy = Σᵢ yᵢxᵢ are sums over rows, so both can be accumulated in a single data-parallel pass.
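A minimal sketch of the one-pass idea, for simple linear regression y = b0 + b1·x (the function name and the 2×2 Cramer's-rule solve are my choices for illustration):

```python
def ols_streaming(rows):
    """Fit y = b0 + b1*x in one pass over (x, y) rows.
    Only running sums are kept, and sums are order-independent,
    so partitions can be scanned in parallel and merged."""
    n = sx = sxx = sy = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sxx += x * x
        sy += y
        sxy += x * y
    # Normal equations [[n, sx], [sx, sxx]] @ [b0, b1] = [sy, sxy],
    # solved by Cramer's rule for the 2x2 case.
    det = n * sxx - sx * sx
    b0 = (sy * sxx - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1
```

For more regressors the same pass accumulates the full XᵀX matrix and Xᵀy vector; only the final small solve is serial.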
Matrix based
analytical
methods
continue…
 
Conjugate gradient
 
update alpha(r_i, p_i, A)
update x(x_i, alpha_i, v_i)
update r(x_i, alpha_i, v_i, A)
update v(r_i, alpha_i, v_i, A)
 
Construction of these functions allows the user to
insert one full row at a time
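The four update steps on the slide can be sketched as a plain-Python conjugate gradient solver; each iteration touches the data only through one matrix–vector product and a few dot products, all of which are data-parallel sums (this is the textbook method, not the paper's SQL formulation):

```python
def conjugate_gradient(A, b, iters=50):
    """Solve A x = b for symmetric positive-definite A (list of rows).
    The loop body mirrors the slide's update_alpha / update_x /
    update_r / update_v steps."""
    dot = lambda u, w: sum(ui * wi for ui, wi in zip(u, w))
    x = [0.0] * len(b)
    r = list(b)            # residual r_0 = b - A x_0, with x_0 = 0
    p = list(b)            # initial search direction
    for _ in range(iters):
        rr = dot(r, r)
        if rr < 1e-20:     # residual is (numerically) zero: converged
            break
        Ap = [dot(row, p) for row in A]
        alpha = rr / dot(p, Ap)                          # update_alpha
        x = [xi + alpha * pi for xi, pi in zip(x, p)]    # update_x
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]   # update_r
        beta = dot(r, r) / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]     # update_v
    return x
```

Feeding the matrix in one full row at a time, as the slide notes, is exactly what the row-wise `dot(row, p)` products need.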
 
Functionals:
data-parallel implementations of comparative statistics expressed in SQL

Mann-Whitney U Test
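A sketch of the U statistic in plain Python (the paper expresses this in SQL; the rank-then-scan structure is the same either way):

```python
def mann_whitney_u(a, b):
    """U statistic for sample a against sample b.  Rank the pooled values
    (average ranks for ties) -- a sort plus one scan, both of which
    parallelize -- then U_a = R_a - n_a(n_a + 1)/2, R_a = a's rank sum."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1                    # [i, j) is one run of tied values
        avg_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1 .. j
        rank_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    return rank_a - len(a) * (len(a) + 1) / 2.0
```

U ranges from 0 (every value in `a` below every value in `b`) to n_a·n_b (the reverse); values near the middle indicate similar distributions.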
 
Log-likelihood
ratios
useful for
comparing a
subpopulation to
an overall
population on a
particular
attribute
 
Example: multinomial distribution
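For the multinomial case, one common form of this comparison is the G statistic 2·Σₖ Oₖ·ln(Oₖ/Eₖ), where the expected counts Eₖ scale the overall population's category proportions to the subpopulation's size. A sketch (the function name and dict-of-counts interface are my choices):

```python
import math

def log_likelihood_ratio(sub_counts, pop_counts):
    """G statistic comparing a subpopulation's category counts to the
    proportions of the overall population.  Zero means the subpopulation
    matches the whole; large values flag a distinctive subpopulation."""
    n_sub = sum(sub_counts.values())
    n_pop = sum(pop_counts.values())
    g = 0.0
    for k, observed in sub_counts.items():
        expected = n_sub * pop_counts[k] / n_pop
        if observed > 0:           # 0 * log(0) is taken as 0
            g += observed * math.log(observed / expected)
    return 2.0 * g
```

Since the counts themselves are GROUP BY aggregations, the whole computation reduces to sums over rows, fitting the data-parallel pattern of the other examples.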
 
Resampling
techniques
 
2 standard resampling techniques:

1. bootstrap
from a population of size N, pick k members randomly, compute the statistic; replace the subsample with another k random members…

2. jackknife
repeatedly compute the statistic by leaving out one or more data items
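Both techniques can be sketched in a few lines of Python (the seeded generator is my addition, for reproducibility; each resample's statistic is an independent aggregation, so the trials run in parallel trivially):

```python
import random
from statistics import mean

def bootstrap(data, stat, k, trials, seed=0):
    """Draw k members with replacement, compute the statistic, repeat;
    the spread of the results estimates the statistic's sampling error."""
    rng = random.Random(seed)
    return [stat([rng.choice(data) for _ in range(k)]) for _ in range(trials)]

def jackknife(data, stat):
    """Recompute the statistic leaving one data item out at a time."""
    return [stat(data[:i] + data[i + 1:]) for i in range(len(data))]
```

Example: `bootstrap(data, mean, k=10, trials=200)` yields 200 resampled means whose spread approximates the standard error of the mean.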
 
MAD DBMS requirements?

1. Easy and efficient to get new data sources into the warehouse
2. Make physical storage evolution easy and efficient
3. Provide a powerful, flexible programming environment
 
Storage and
partitioning
 
Require multiple storage mechanisms
 
1. [early stage]
iterate over tasks frequently; load databases
quickly; allow users to run queries directly
against external tables
2. [for “detail tables”]
well served by traditional transactional
storage techniques
3. [for fact tables]
better served by compressed storage
because it handles appends and reads more
efficiently
 
MAD DBMS
 
loading & unloading
Parallel access for external tables via Scatter/Gather
Streaming
Require coordination with external processes to
“feed” in parallel
 
More
Support transformations written in SQL
Support MapReduce scripting in DBMS
 
MAD DBMS
 
Support external tables
Traditional “heap” storage format for frequently updated data
Highly-compressed “append-only” (AO) table feature for data with no updates
With compression off: bulk loads run quickly
With most aggressive compression: use as little
space as possible
With medium compression: improved table scan
time with slower loads
 
MAD DBMS
 
Multiple ways to partition tables:
distribution policy
partitioning policy (for a table)
range partitioning policy
list partitioning policy
 
 
 
 
 
 
 
 
    Note that partitioning structure is completely mutable
 


Uploaded on Oct 05, 2024