Advanced Dimensional Modelling: Strategies for Effective Data Handling

SQLBits 8, 9
th
 April 2011, Brighton
Vincent Rainardi
vrainardi@gmail.com
Blog: dwbi1.wordpress.com
Advanced Dimensional Modelling
 
1. Dimensions - Structure
SCD Type 6
1 or 2 Dimensions
When To Snowflake
A Dimension with Only 1 Attribute
Transaction Level Dimension
2. Fact Tables
Fact Table Primary Key
Snapshotting Transaction Fact Tables
Aggregate Fact Tables
Vertical Fact Tables
3. Dimensions - Behavior
Rapidly Changing Dimension
Very Large Dimensions
Banding Dimension Rows
Stamping Dimension Rows
Dimensions with Multi Valued
Attributes
4. Combinations
Real Time Fact Table
Dealing with Currency Rates
Dealing with Status
4 sections: 2 dims, 1 fact, 1 combi. Lots of material, may not able to finish.
44 slides, some slides we may have to touch 
lightly
.
Questions between sections, available after.
SCD Type 6 is a combination of Type 1, 2 & 3
e.g. type
 2 + type 1 : DimAccount (telco example)
6 = 1 + 2 + 3   (Ref: Ross 
& 
Kimball 2005, Wikipedia)
http://www.rkimball.com/html/articles_search/articles%202005/0503IE.html
http://en.wikipedia.org/wiki/Slowly_changing_dimension
Business/Natural Key
SCD Type 6
1/2
Used for “As Was” reporting
e.g. balances by tariff (price plan) at the end of last year,
 
if the
customers were on today’s tariff.
Fact
Dim
SCD Type 6
“Type 12”
Natural Key
2/2
1 or 2 
dimensions
Simplicity, 1 dim
Hierarchy from customer
attribute &account attribute
Use when we don’t have fact
tables requiring customer grain.
We can get the customer
attributes without knowing the
account key
Disadvantage: can’t go from
account to customer without
going through the fact table -
performance
1/4
customer
attributes 
Fact
Table
Dim
Account
a) One Dimension
b) Two Dimensions
Fact
Table
Dim
Account
Dim
Customer
1 or 2 
dimensions
Dim customer is needed by another fact table
Modular: 2 separate dim tables but we can combine
them easily to create a bigger dimension
To get the breakdown of a measure by a customer
attribute is a bit more complicated than a)
select c. attribute, sum(f.measure1) from fact1 f
inner join dim_account a on f.account_key = a.account_key
inner join dim_customer c on a.customer_key = c.customer_key
group by c. attribute
2/4
c) Snowflake
Fact
Table
Dim
Account
Dim
Customer
1 or 2 
dimensions
Try to fix weakness on b and c:
We can “go” direct from account
dim to customer dim
We can access dim customer
directly from the fact table.
Weakness: maintain customer key in 2
places: fact table and dim account.
a.k.a. “Star with a Back Door”
3/4
d) Two Dimensions with inter-dimension link
Fact
Table
Dim
Account
Dim
Customer
1 or 2 
dimensions
Try to fix weakness of a:
unable to build a fact table with
grain = customer.
Add a column in dim account:
customer key
Not as popular as c) and d) in solving Dim Customer issue. It is “indecisive” :
trying to create Dim Customer but doesn’t want to create Dim Customer.
4/4
e) One Dimension with Customer Key
Fact
Table
Dim
Account
Fact
Table
Disadvantage: Dim Customer is hidden inside Dim Account, making it:
a) more difficult to maintain (especially for a type 2), and
b) less modular/flexible
When to Snowflake
1. When the sub dim is used by several dims
City-Country-Region columns exist in
DimBroker, DimPolicy, DimOffice and
DimInsured
Replaced by Location/GeoKey pointing
to DimLocation / DimGeography
Advantage: 
consistent
 hierarchy, i.e. relationship between
City, Country & Region.
Weakness: we would lose 
flexibility
. City to Country are
more or less fixed, but the grouping of countries might be
different between dimensions.
1/3
When to Snowflake
2. When the sub dim is used by both the main dim and the
fact table(s)
DimCustomer is used in DimAccount, and
is also used in the fact table.
DimManufacturer is used in DimProduct,
and is also used in the fact table.
DimProductGroup is used in DimProduct,
and is also used in some fact table.
The alternative is maintaining two
full dimensions (star classic).
2/3
4. To enrich a date attribute
When to Snowflake
3. To make “base dim” and “detail dim”
Insurance classes, account types
(banking), product lines, diagnosis,
treatment (health care)
Policies for marine, aviation & property classes have different attributes.
Pull common attributes into 1 dim: DimBasePolicy
Put class-specific attributes into DimMarine, DimProperty, DimAviation
Month, Quarter, Year, etc.
Like #1, a sub dim used by several dims.
Ref: Kimball DW Toolkit 2
nd
 edition page 213
3/3
A dimension with only 1 attribute
Reasons for putting single attribute in its own dim:
Keep fact table slim  (4 bytes int not 100 bytes varchar)
When the value changes, we don’t have to update the 
BIG
fact table – ETL performance
Grain is much lower than fact table – small dim
Yes it’s only 1 attribute today, but in the future there could
be another attribute. Could become a junk dim.
Should we put the attribute in the fact table?
(like DD = Degenerate Dim)
Probably, if the grain = fact table, and
it’s short or it’s a number.
1/2
A dimension with only 1 attribute
Exception: snapshot month (or day/week/quarter)
Snapshot month is used in periodic snapshot fact table.
Snapshot month is in the form of an integer (201104 for
April 2011). Doesn’t violate the 3 points above.
It is an integer, not char(6).
The value never changes, April 2011 will be April 2011
forever
There will not be other attributes in the dim
2/2
Transaction Level Dimension
A dim with grain = the transaction fact table
Most granular event in any business process
Examples:
IT Helpdesk DW: Dim Ticket
Telco DW: Dim Call
Banking/Asset Mgt DW: Dim Trade
Insurance DW: Dim Premium
Transaction
Level Dim
Transaction, not accumulative or periodic snapshot
1/5
Transaction Level Dimension
1.
Query Performance
DD columns are moved to a dim
, away from the heavy traffic in fact
tables. DW queries don’t touch those DD columns unless they need to
– performance. DD attributes totalling 30 bytes, replaced by 4 bytes int
column. Slimmer fact table, better for queries.
2.
Periodic Snapshot Fact Table
For periodic snapshot fact table, saving is even greater. Monthly
snapshot fact, 10 years / 120 months. Rather than specifying the DDs
repeatedly 120x, they are specified once in the transaction dim. All that
is left on the fact table is a slim 1 int col: the transaction key.
Advantages:
2/5
3.
Some fact tables have grains greater than the transaction
A payment from a customer is posted into 4 accounts in the GL fact table.
That single financial transaction becomes 4 fact rows but only has 1 row
in the trans dim. Fact table with 10m rows, trans dim only 3 million rows.
4.
Related Transactions
Some transactions are related, e.g. in retail, a purchase of a kitchen might
need to be created as 2 related orders, because the worktop is made-to-
order. Rather than creating a ‘related order’ column on the fact tables, it
might be better (depends on how it’s used) to create it on the trans dim
because:
a) an order can consist of many fact rows (1 row per item) so the “related
order number” will be duplicated across these fact rows
b) slimmer fact table
c) the transaction could be on many fact tables, not only one.
Transaction Level Dimension
3/5
Transaction Level Dimension
Transaction fact table and the grain of the trans dim = grain of the fact
table, and only 1 DD column: perhaps better leave the DD in the fact
table. Not a lot of space/speed gain by putting it on trans dim.
Mart/DW only used for SSAS: there is little point of having trans dim
physically. In SSAS we can create the transaction dimension “on the fly”
from the fact table (“fact dimension”).
Disadvantages/not suitable:
Using trans dim to put attributes as opposed to
put them in the main dimensions, with the
argument of: that’s the value of the attribute
when the transaction happened – this is not
right, use type 2 SCD for this.
Main
Trans
Acct type
Location
4/5
Transaction Level Dimension
Any dim with grain = fact table (like trans dim) is questionable
Do we really need this dim at this grain?
Perhaps it should be divided into several dims instead?
A dim with grain = fact table - potential performance issue (unless the
fact table is small).
e.g. fact table = 10m rows, trans dim = 10m rows. Joining 10m to 10m
potentially slow, especially if the physical ordering of the trans dim is
not the joining column.
Disadvantages/not suitable:
5/5
1. Dimensions - Structure
SCD Type 6
1 or 2 Dimensions
When To Snowflake
A Dimension with Only 1 Attribute
Transaction Level Dimension
2. Fact Tables
Fact Table Primary Key
Snapshotting Transaction Fact Tables
Aggregate Fact Tables
Vertical Fact Tables
3. Dimensions - Behavior
Rapidly Changing Dimension
Very Large Dimensions
Banding Dimension Rows
Stamping Dimension Rows
Dimensions with Multi Valued
Attributes
4. Combinations
Real Time Fact Table
Dealing with Currency Rates
Dealing with Status
25%
Time, Questions
Fact Table Primary Key
Should we have a PK?
Yes, if we need to be able to identify each fact row
1.
Need to refer to a fact row from another fact row e.g. chain of events
2.
Many identical fact rows and we need to update/delete only one
3.
To link the fact table to another fact table
Some experts totally disagree
PK
FK (no RI)
PK
FK
(not enforced)
Related Trans 
Header - Detail
PK
Uniqueness
previous/next transaction
1/3
Fact Table Primary Key
Single or Multi Column?
    Single Column: Generated Identity
    Multi Column: Dimension Keys
Single-column PK is better than multi-column PK because :
1) A multi-column PK may not be unique. A single-column PK guarantees
that the PK is unique, because it is an identity column.
2) A single-column PK is slimmer than a multi-column PK, better query
performance. To do a self join in the fact table (e.g. to link the current fact
row to the previous fact row), we join on a single integer column.
2/3
Fact Table Primary Key
Advantage: Prevent duplicate rows, query performance
Disadvantage: loading performance
Indexing the PK: cluster or not?
Cluster the PK if: the PK is an identity column
Don’t cluster the PK if: the PK is a composite, or when you need the
cluster index for query performance (with partitioning)
Example of not having a PK
If duplicate fact rows are allowed.
e.g. retail DW: Store Key, Date Key, Product Key, Customer Key
Same customer buying the same milk in the same shop on the
same day twice
--- Order Line ID as DD to make it unique (not all EPOS has it)
3/3
Snapshotting Transaction Fact Tables
Potentially huge – billions rows
Only take what you need
Smart date key/month, e.g. 20110409
Monthly or daily
Trunc-reload of current month/day
Daily (4 wk), Weekly (1 yr), Monthly (10 yr)
Purging & Archiving
Load from staging (cached)
Index/partition on snapshot date
Trans
Snapshot
Staging
1/1
Aggregate Fact Tables
What are they?
High level aggregation of base fact tables
A “select group by” query on a 2 billion rows fact
table can take 30 mins if it joins with two big
fact tables, even with indexes in place
So we do this query in advance as part of the DW
load and store it as an Aggregate Fact Table
The report only takes 1 second to run.
Aggregate
Fact Table
Base Fact Tables
Report
30 mins
1 sec
1/2
Aggregate Fact Tables
What For?
For report performance (group by is costly)
BO: aggregate aware
Not SSAS: aggregate in cubes, not tables
Loading & indexing:
Best to load from staging (at the same time as loading the
main fact table) not from the main fact table (this would be
working 2x)
Partition for data distribution or narrow query
Indexing: by the main dim keys
2/2
Vertical Fact Tables
Normalised
1 measure column
The meaning of that measure column
depends on “measure type” column
Used for Finance/GL mart
Advantage: flexibility:
using accounts, balance, Dr Cr
Disadvantage: non additive
“Normal”
Fact Table
Vertical
Fact Table
many measures
1 measure
(actual & budget)
1/1
Measure Type
Dim Key
1. Dimensions - Structure
SCD Type 6
1 or 2 Dimensions
When To Snowflake
A Dimension with Only 1 Attribute
Transaction Level Dimension
2. Fact Tables
Fact Table Primary Key
Snapshotting Transaction Fact Tables
Aggregate Fact Tables
Vertical Fact Tables
3. Dimensions - Behavior
Rapidly Changing Dimension
Very Large Dimensions
Banding Dimension Rows
Stamping Dimension Rows
Dimensions with Multi Valued
Attributes
4. Combinations
Real Time Fact Table
Dealing with Currency Rates
Dealing with Status
50%
Time, Questions
Rapidly Changing Dimension
Why is it a problem
Large SCD2 dim – Attributes change every day
Slow query when join with large fact tables
What to do
Put into a separate dim, link direct to fact table.
Just store the latest, type 1 attributes (or dual)
Store in the fact table (for small attribute, e.g. indicator)
Type2
Type1
Type2
Type2
Type2
1/1
Very Large Dimension
Why is it a problem
SSAS: 4 GB string store limit for dimension
SSAS: dim is “select distinct” on each attribute – long
processing time
“Valid date” join on SCD2 for as was
Usually customer dim where the “quality stamp” changes
daily or because of high number of distinct values
Difficult to browse high cardinality attribute
Join with fact tables – performance
1/2
What to do
Split into 2 dims, same grain. Always cut vertically.
Remove SCD2, or at least only certain columns.
Most common: separate the attributes with high
cardinality/change frequency
Bucketing/banding, group values into ranges
Very Large Dimension
VLD
2/2
Banding Dimension Rows
It is grouping numerical values (numerical
attributes, not measure) into several bands,
e.g. engine size, distance from station,
amount purchased (last complete year).
Benefits: easier for analysis & reporting,
comparing between categories.
Issue/problem: limit e.g. bucketing criteria
1 hour to implement, 3 months to argue
1/1
Stamping Dimension Row
Calculate internally or buy data from
outside
Customer categories (loyalty programme) e.g. A,
B, C of customer class.
To reflect c0nsumer interest on the product
(product categorisation based on customer
interest level)
Any other dates or measures summarized as
stamped attribute, i.e. “new customer”, “big
spender”, or results from recommendation
analysis/algorithm e.g. customer behaviour based
on previous purchases.
Used for analysis / reporting
“Stamped”
Attributes
1/1
Dimensions with Multi Valued Attributes
What is a Multi Valued Attribute?
An attribute which has more than 1 value per dimension row.
MV Attribute or MV Dimension?
MV Dim = For each fact row there could be more than 1 dimension row
Why do I need to know this?
To be able to model it
If wrong, difficult at BI/report
1/4
Dimensions with Multi Valued Attributes
Approaches to deal with MV Attributes
  1. Lower the grain of the dim
2. Put the MV attributes in a separate dim, link direct to the fact table
Before
After
Before
After
Fact table requires that the product dimension is at Product Code grain,
e.g. no sales info per size, but only per product code.
Often we don’t have the allocation information e.g. 50-50 or 30-70, we only
know that product
1
 has 2 sizes
2/4
Dimensions with Multi Valued Attributes
3. Use a bridge table to link the 2 dims
Fact Table
Dim Product
Bridge Table
Dim Size
4. Have several columns in the dim for that attribute
If the number of attributes is small and fixed, this is a popular approach. But if the
number of attributes is large (e.g. >10) or if it’s variable (e.g. sometimes 2,
sometimes 20), approach 2 and 3 above are more popular, and more appropriate.
3/4
5. Put the attribute in a snowflake sub dim
Dimensions with Multi Valued Attributes
We can’t really do this, as it is 1 to many (1 row in the main dim corresponds to
many rows in the sub dim). So we need a bridge table, which brings us back to
approach 3.
6. Keep in one column using delimiters
e.g. “Small|Medium″. A crazy idea. More flexible than having several columns
(approach 4) and simpler than approach 3 or 2.
If the purpose of the attribute is “display only” on a report (rather than analyse or
slice & dice), there is an argument for using this approach, particularly if the
number of attributes is small (e.g. 1 to 4).
4/4
Dimensions - Structure
SCD Type 6
1 or 2 Dimensions
When To Snowflake
A Dimension with Only 1 Attribute
Transaction Level Dimension
Fact Tables
Fact Table Primary Key
Snapshotting Transaction Fact Tables
Aggregate Fact Tables
Vertical Fact Tables
Dimensions - Behavior
Rapidly Changing Dimension
Very Large Dimensions
Banding Dimension Rows
Stamping Dimension Rows
Dimensions with Multi Valued
Attributes
Combinations
Real Time Fact Table
Dealing with Currency Rates
Dealing with Status
75%
Time, Questions
Real Time Fact Table
Reporting the transaction system in real time
View to union with the normal fact table, or use partitions
Freezing the dims for key lookup, -3 unknown key
Key corrections next day
Real time partition
(intraday today)
Dims as of
yesterday
Main partition
(up to last night)
1/1
-1 null in source
-2 not in dim table
-3 not in dim table as dim was frozen
    to be resolved next batch
Unknown keys:
dim
key
Dealing with Currency Rates
What for/background/requirements
Report in 3 reporting currencies, using today rates or past
Analyse over time without the impact of currency rates (using fixed
currency rates, e.g. 2010 EOY rates)
Had the transactions happened today
Currency rates historical analysis
Transaction
Currency
DW
Currency
Reporting
Currency
Transaction
Rates
Reporting
Rates
(many transaction
dates)
( 1 reporting
date)
100 countries
40 currencies
1 currency
3-4 currencies
GBP, USD, EUR,
Original
1/3
e.g. GBP
Dealing with Currency Rates
Approaches
Store in original currencies, convert to DW currency at runtime.
Or convert at load, store in DW currency – inaccuracy.
Or store in both original and DW currency
Currency rate fact table (date, currency, rate)
Or store rates in the fact table
On report/cube: date input at run time (default = today)
Fact Tables
FX Fact Table
2/3
Rate
Dealing with Currency Rates
Concept of FX Rate Type/Profile
in original currency, DW currency or both
3/3
Dealing with Status
What/background
Workflow (policies, contracts, documents)
Bottleneck analysis (no of days between stages)
How many on each stage
Status 1
Status 3
Status 4
Status 5
Status 6
Status 2
date1
date4
date3
date2
1/2
Dealing with Status
Approaches
Accumulative Snapshot Fact, 1 row per application
SCD2 on DimApp
App Status fact table
2/2
Thanks
Email: vrainardi@gmail.com
Blog: http://dwbi1.wordpress.com
Covers many of the topics in this presentation
This PowerPoint: in my blog, scroll to bottom, click on “SQLBit8”
Special thanks to Guang Ming Xing and Simon Jensen 
who
helped reviewing this presentation and provided useful
comments (doesn’t mean that they agree with the content)
Slide Note
Embed
Share

Explore advanced techniques in dimensional modelling such as handling rapidly changing dimensions, snowflaking dimensions, and dealing with very large dimensions. Learn about different types of dimension structures, fact tables, and their combinations to optimize data storage and retrieval efficiently.

  • Dimensional Modelling
  • Snowflaking
  • Fact Tables
  • Rapidly Changing Dimensions
  • Data Handling

Uploaded on Sep 16, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. SQLBits 8, 9thApril 2011, Brighton Vincent Rainardi vrainardi@gmail.com Blog: dwbi1.wordpress.com

  2. Advanced Dimensional Modelling 1. Dimensions - Structure SCD Type 6 1 or 2 Dimensions When To Snowflake A Dimension with Only 1 Attribute Transaction Level Dimension 3. Dimensions - Behavior Rapidly Changing Dimension Very Large Dimensions Banding Dimension Rows Stamping Dimension Rows Dimensions with Multi Valued Attributes 2. Fact Tables Fact Table Primary Key Snapshotting Transaction Fact Tables Aggregate Fact Tables Vertical Fact Tables 4. Combinations Real Time Fact Table Dealing with Currency Rates Dealing with Status 4 sections: 2 dims, 1 fact, 1 combi. Lots of material, may not able to finish. 44 slides, some slides we may have to touch lightly. Questions between sections, available after.

  3. SCD Type 6 1/2 SCD Type 6 is a combination of Type 1, 2 & 3 e.g. type 2 + type 1 : DimAccount (telco example) Account Key Phone Number Tariff Current Tariff 1 07789111111 2 07789111111 Business/Natural Key E35 E40 E40 E40 6 = 1 + 2 + 3 (Ref: Ross & Kimball 2005, Wikipedia) http://www.rkimball.com/html/articles_search/articles%202005/0503IE.html http://en.wikipedia.org/wiki/Slowly_changing_dimension

  4. SCD Type 6 2/2 Used for As Was reporting e.g. balances by tariff (price plan) at the end of last year, if the customers were on today s tariff. SnapshotMonth AccountKey Other Balance 201012 201104 Type 12 Fact 1 2 40 30 AccountKey PhoneNumber Tariff CurrentTariff Other 1 07789111111 2 07789111111 Natural Key Dim E35 E40 E40 E40

  5. 1 or 2 dimensions 1/4 a) One Dimension b) Two Dimensions Dim Account Dim Account Fact Table Fact Table customer attributes Dim Customer We can get the customer attributes without knowing the account key Disadvantage: can t go from account to customer without going through the fact table - performance Simplicity, 1 dim Hierarchy from customer attribute &account attribute Use when we don t have fact tables requiring customer grain.

  6. 1 or 2 dimensions 2/4 c) Snowflake Dim Account Dim Customer Fact Table Dim customer is needed by another fact table Modular: 2 separate dim tables but we can combine them easily to create a bigger dimension To get the breakdown of a measure by a customer attribute is a bit more complicated than a) select c. attribute, sum(f.measure1) from fact1 f inner join dim_account a on f.account_key = a.account_key inner join dim_customer c on a.customer_key = c.customer_key group by c. attribute

  7. 1 or 2 dimensions 3/4 d) Two Dimensions with inter-dimension link Try to fix weakness on b and c: We can go direct from account dim to customer dim We can access dim customer directly from the fact table. Dim Account Fact Table Dim Customer Weakness: maintain customer key in 2 places: fact table and dim account. a.k.a. Star with a Back Door

  8. 1 or 2 dimensions 4/4 e) One Dimension with Customer Key Try to fix weakness of a: unable to build a fact table with grain = customer. Dim Account Fact Table Fact Table Add a column in dim account: customer key Not as popular as c) and d) in solving Dim Customer issue. It is indecisive : trying to create Dim Customer but doesn t want to create Dim Customer. Disadvantage: Dim Customer is hidden inside Dim Account, making it: a) more difficult to maintain (especially for a type 2), and b) less modular/flexible

  9. When to Snowflake 1/3 1. When the sub dim is used by several dims City-Country-Region columns exist in DimBroker, DimPolicy, DimOffice and DimInsured Replaced by Location/GeoKey pointing to DimLocation / DimGeography Advantage: consistent hierarchy, i.e. relationship between City, Country & Region. Weakness: we would lose flexibility. City to Country are more or less fixed, but the grouping of countries might be different between dimensions.

  10. When to Snowflake 2/3 2. When the sub dim is used by both the main dim and the fact table(s) DimCustomer is used in DimAccount, and is also used in the fact table. DimManufacturer is used in DimProduct, and is also used in the fact table. DimProductGroup is used in DimProduct, and is also used in some fact table. The alternative is maintaining two full dimensions (star classic).

  11. When to Snowflake 3/3 3. To make base dim and detail dim Insurance classes, account types (banking), product lines, diagnosis, treatment (health care) Policies for marine, aviation & property classes have different attributes. Pull common attributes into 1 dim: DimBasePolicy Put class-specific attributes into DimMarine, DimProperty, DimAviation Ref: Kimball DW Toolkit 2nd edition page 213 4. To enrich a date attribute Month, Quarter, Year, etc. Like #1, a sub dim used by several dims.

  12. A dimension with only 1 attribute 1/2 Should we put the attribute in the fact table? (like DD = Degenerate Dim) Probably, if the grain = fact table, and it s short or it s a number. Reasons for putting single attribute in its own dim: Keep fact table slim (4 bytes int not 100 bytes varchar) When the value changes, we don t have to update the BIG fact table ETL performance Grain is much lower than fact table small dim Yes it s only 1 attribute today, but in the future there could be another attribute. Could become a junk dim.

  13. A dimension with only 1 attribute 2/2 Exception: snapshot month (or day/week/quarter) Snapshot month is used in periodic snapshot fact table. Snapshot month is in the form of an integer (201104 for April 2011). Doesn t violate the 3 points above. It is an integer, not char(6). The value never changes, April 2011 will be April 2011 forever There will not be other attributes in the dim

  14. Transaction Level Dimension 1/5 A dim with grain = the transaction fact table Transaction, not accumulative or periodic snapshot Examples: IT Helpdesk DW: Dim Ticket Telco DW: Dim Call Banking/Asset Mgt DW: Dim Trade Insurance DW: Dim Premium Transaction Level Dim Most granular event in any business process

  15. Transaction Level Dimension 2/5 Advantages: 1. Query Performance DD columns are moved to a dim, away from the heavy traffic in fact tables. DW queries don t touch those DD columns unless they need to performance. DD attributes totalling 30 bytes, replaced by 4 bytes int column. Slimmer fact table, better for queries. 2. Periodic Snapshot Fact Table For periodic snapshot fact table, saving is even greater. Monthly snapshot fact, 10 years / 120 months. Rather than specifying the DDs repeatedly 120x, they are specified once in the transaction dim. All that is left on the fact table is a slim 1 int col: the transaction key.

  16. Transaction Level Dimension 3/5 3. Some fact tables have grains greater than the transaction A payment from a customer is posted into 4 accounts in the GL fact table. That single financial transaction becomes 4 fact rows but only has 1 row in the trans dim. Fact table with 10m rows, trans dim only 3 million rows. 4.Related Transactions Some transactions are related, e.g. in retail, a purchase of a kitchen might need to be created as 2 related orders, because the worktop is made-to- order. Rather than creating a related order column on the fact tables, it might be better (depends on how it s used) to create it on the trans dim because: a) an order can consist of many fact rows (1 row per item) so the related order number will be duplicated across these fact rows b) slimmer fact table c) the transaction could be on many fact tables, not only one.

  17. Transaction Level Dimension 4/5 Disadvantages/not suitable: Transaction fact table and the grain of the trans dim = grain of the fact table, and only 1 DD column: perhaps better leave the DD in the fact table. Not a lot of space/speed gain by putting it on trans dim. Mart/DW only used for SSAS: there is little point of having trans dim physically. In SSAS we can create the transaction dimension on the fly from the fact table ( fact dimension ). Using trans dim to put attributes as opposed to put them in the main dimensions, with the argument of: that s the value of the attribute when the transaction happened this is not right, use type 2 SCD for this. Main Acct type Location Trans

  18. Transaction Level Dimension 5/5 Disadvantages/not suitable: Any dim with grain = fact table (like trans dim) is questionable Do we really need this dim at this grain? Perhaps it should be divided into several dims instead? A dim with grain = fact table - potential performance issue (unless the fact table is small). e.g. fact table = 10m rows, trans dim = 10m rows. Joining 10m to 10m potentially slow, especially if the physical ordering of the trans dim is not the joining column.

  19. 1. Dimensions - Structure SCD Type 6 1 or 2 Dimensions When To Snowflake A Dimension with Only 1 Attribute Transaction Level Dimension 3. Dimensions - Behavior Rapidly Changing Dimension Very Large Dimensions Banding Dimension Rows Stamping Dimension Rows Dimensions with Multi Valued Attributes 2. Fact Tables Fact Table Primary Key Snapshotting Transaction Fact Tables Aggregate Fact Tables Vertical Fact Tables 4. Combinations Real Time Fact Table Dealing with Currency Rates Dealing with Status Time, Questions

  20. Fact Table Primary Key 1/3 Should we have a PK? Yes, if we need to be able to identify each fact row 1. Need to refer to a fact row from another fact row e.g. chain of events 2. Many identical fact rows and we need to update/delete only one 3. To link the fact table to another fact table Some experts totally disagree Uniqueness Related Trans Header - Detail PK PK FK PK FK (no RI) (not enforced) previous/next transaction

  21. Fact Table Primary Key 2/3 Single or Multi Column? Single Column: Generated Identity Multi Column: Dimension Keys Single-column PK is better than multi-column PK because : 1) A multi-column PK may not be unique. A single-column PK guarantees that the PK is unique, because it is an identity column. 2) A single-column PK is slimmer than a multi-column PK, better query performance. To do a self join in the fact table (e.g. to link the current fact row to the previous fact row), we join on a single integer column.

  22. Fact Table Primary Key 3/3 Advantage: Prevent duplicate rows, query performance Disadvantage: loading performance Indexing the PK: cluster or not? Cluster the PK if: the PK is an identity column Don t cluster the PK if: the PK is a composite, or when you need the cluster index for query performance (with partitioning) Example of not having a PK If duplicate fact rows are allowed. e.g. retail DW: Store Key, Date Key, Product Key, Customer Key Same customer buying the same milk in the same shop on the same day twice --- Order Line ID as DD to make it unique (not all EPOS has it)

  23. Snapshotting Transaction Fact Tables 1/1 Potentially huge billions rows Only take what you need Smart date key/month, e.g. 20110409 Monthly or daily Trunc-reload of current month/day Daily (4 wk), Weekly (1 yr), Monthly (10 yr) Purging & Archiving Load from staging (cached) Index/partition on snapshot date Trans Staging Snapshot

  24. Aggregate Fact Tables 1/2 What are they? High level aggregation of base fact tables A select group by query on a 2 billion rows fact table can take 30 mins if it joins with two big fact tables, even with indexes in place So we do this query in advance as part of the DW load and store it as an Aggregate Fact Table The report only takes 1 second to run. Base Fact Tables 30 mins Aggregate Fact Table 1 sec Report

  25. Aggregate Fact Tables 2/2 What For? For report performance (group by is costly) BO: aggregate aware Not SSAS: aggregate in cubes, not tables Loading & indexing: Best to load from staging (at the same time as loading the main fact table) not from the main fact table (this would be working 2x) Partition for data distribution or narrow query Indexing: by the main dim keys

  26. Vertical Fact Tables 1/1 Normalised 1 measure column The meaning of that measure column depends on measure type column Used for Finance/GL mart Advantage: flexibility: using accounts, balance, Dr Cr Disadvantage: non additive Normal Fact Table many measures (actual & budget) Measure Type Dim Key Vertical Fact Table 1 measure

  27. 1. Dimensions - Structure SCD Type 6 1 or 2 Dimensions When To Snowflake A Dimension with Only 1 Attribute Transaction Level Dimension 3. Dimensions - Behavior Rapidly Changing Dimension Very Large Dimensions Banding Dimension Rows Stamping Dimension Rows Dimensions with Multi Valued Attributes 2. Fact Tables Fact Table Primary Key Snapshotting Transaction Fact Tables Aggregate Fact Tables Vertical Fact Tables 4. Combinations Real Time Fact Table Dealing with Currency Rates Dealing with Status Time, Questions

  28. Rapidly Changing Dimension 1/1 Why is it a problem Large SCD2 dim Attributes change every day Slow query when join with large fact tables What to do Put into a separate dim, link direct to fact table. Just store the latest, type 1 attributes (or dual) Store in the fact table (for small attribute, e.g. indicator) Type2 Type2 Type2 Type2 Type1

  29. Very Large Dimension 1/2 Why is it a problem SSAS: 4 GB string store limit for dimension SSAS: dim is select distinct on each attribute long processing time Valid date join on SCD2 for as was Usually customer dim where the quality stamp changes daily or because of high number of distinct values Difficult to browse high cardinality attribute Join with fact tables performance

  30. Very Large Dimension 2/2 What to do Split into 2 dims, same grain. Always cut vertically. Remove SCD2, or at least only certain columns. Most common: separate the attributes with high cardinality/change frequency Bucketing/banding, group values into ranges VLD

  31. Banding Dimension Rows 1/1 It is grouping numerical values (numerical attributes, not measure) into several bands, e.g. engine size, distance from station, amount purchased (last complete year). Benefits: easier for analysis & reporting, comparing between categories. Issue/problem: limit e.g. bucketing criteria 1 hour to implement, 3 months to argue

  32. Stamping Dimension Row 1/1 Calculate internally or buy data from outside Customer categories (loyalty programme) e.g. A, B, C of customer class. To reflect c0nsumer interest on the product (product categorisation based on customer interest level) Any other dates or measures summarized as stamped attribute, i.e. new customer , big spender , or results from recommendation analysis/algorithm e.g. customer behaviour based on previous purchases. Used for analysis / reporting Stamped Attributes

  33. Dimensions with Multi Valued Attributes 1/4 What is a Multi Valued Attribute? An attribute which has more than 1 value per dimension row. Product Key Product Code Colour 1 2 3 Size Small Small, Medium A B C Red Blue Green, Yellow Small MV Attribute or MV Dimension? MV Dim = For each fact row there could be more than 1 dimension row Why do I need to know this? To be able to model it If wrong, difficult at BI/report

  34. Dimensions with Multi Valued Attributes 2/4 Approaches to deal with MV Attributes 1. Lower the grain of the dim Before After Product Key Product Code Size 1 2 2 A B B Small Small Medium Product Key Product Code Size 1 2 Fact table requires that the product dimension is at Product Code grain, e.g. no sales info per size, but only per product code. A B Small Small, Medium 2. Put the MV attributes in a separate dim, link direct to the fact table Before After Product Key Measure1 1 9 2 10 2 1 2 2 Product Key Size Key Measure1 1 Size Key Size 1 2 1 9 5 5 Small Medium Often we don t have the allocation information e.g. 50-50 or 30-70, we only know that product1 has 2 sizes

  35. Dimensions with Multi Valued Attributes 3/4 3. Use a bridge table to link the 2 dims Fact Table Dim Size Bridge Table Dim Product Product Key 1 2 2 Size key 1 1 2 Size Key Size 1 2 Product Key Measure1 1 2 Product Key Product Code 1 2 Small Medium 9 10 A B 4. Have several columns in the dim for that attribute Product Key Product Code Size1 1 2 Size2 A B Small Small Medium If the number of attributes is small and fixed, this is a popular approach. But if the number of attributes is large (e.g. >10) or if it s variable (e.g. sometimes 2, sometimes 20), approach 2 and 3 above are more popular, and more appropriate.

  36. Dimensions with Multi Valued Attributes 4/4 5. Put the attribute in a snowflake sub dim We can t really do this, as it is 1 to many (1 row in the main dim corresponds to many rows in the sub dim). So we need a bridge table, which brings us back to approach 3. 6. Keep in one column using delimiters e.g. Small|Medium . A crazy idea. More flexible than having several columns (approach 4) and simpler than approach 3 or 2. If the purpose of the attribute is display only on a report (rather than analyse or slice & dice), there is an argument for using this approach, particularly if the number of attributes is small (e.g. 1 to 4).

  37. Dimensions - Structure SCD Type 6 1 or 2 Dimensions When To Snowflake A Dimension with Only 1 Attribute Transaction Level Dimension Dimensions - Behavior Rapidly Changing Dimension Very Large Dimensions Banding Dimension Rows Stamping Dimension Rows Dimensions with Multi Valued Attributes Fact Tables Fact Table Primary Key Snapshotting Transaction Fact Tables Aggregate Fact Tables Vertical Fact Tables Combinations Real Time Fact Table Dealing with Currency Rates Dealing with Status Time, Questions

  38. Real Time Fact Table 1/1 Reporting the transaction system in real time View to union with the normal fact table, or use partitions Freezing the dims for key lookup, -3 unknown key Key corrections next day Dims as of yesterday Main partition (up to last night) Unknown keys: -1 null in source -2 not in dim table -3 not in dim table as dim was frozen to be resolved next batch Real time partition (intraday today) dim key

  39. Dealing with Currency Rates 1/3 What for/background/requirements Report in 3 reporting currencies, using today rates or past Analyse over time without the impact of currency rates (using fixed currency rates, e.g. 2010 EOY rates) Had the transactions happened today Currency rates historical analysis Reporting Currency DW Transaction Currency Currency Transaction Rates (many transaction dates) Reporting Rates ( 1 reporting date) 100 countries 40 currencies 3-4 currencies GBP, USD, EUR, Original 1 currency e.g. GBP

  40. Dealing with Currency Rates 2/3 Approaches Store in original currencies, convert to DW currency at runtime. Or convert at load, store in DW currency inaccuracy. Or store in both original and DW currency Currency rate fact table (date, currency, rate) Or store rates in the fact table On report/cube: date input at run time (default = today) Fact Tables FX Fact Table Rate

  41. Dealing with Currency Rates 3/3 Fact Table Date 01/04/2011 1 02/04/2011 2 03/04/2011 3 Fact FX Rates Date Key From Currency To Currency FX Rate 1 1 1 2 1 3 Currency Key Measure1 10,000 20,000 30,000 1 1 1 1.000000000 0.620385881 0.883057262 in original currency, DW currency or both Concept of FX Rate Type/Profile

  42. Dealing with Status 1/2 What/background Workflow (policies, contracts, documents) Bottleneck analysis (no of days between stages) How many on each stage Status 1 Status 2 Status 4 Status 6 date2 date3 date4 date1 Status 3 Status 5

  43. Dealing with Status 2/2 Approaches Accumulative Snapshot Fact, 1 row per application SCD2 on DimApp AppKey AppID StsKey StsDate Current 1 1 1 2 1 2 3 1 3 4 2 1 5 2 2 1/3/11 3/3/11 7/3/11 6/3/11 7/3/11 N N Y N Y App Status fact table AppKey StsKey StsDateKey 1 1 1 2 1 3 2 1 2 2 61 63 67 66 67 AppKey Sts1Date Sts1Ind Sts2Date Sts2Ind Sts3Date Sts3Ind 1 1/3/11 1 3/3/11 2 6/3/11 1 7/3/11 1 1 7/3/11 1 0

  44. Thanks Email: vrainardi@gmail.com Blog: http://dwbi1.wordpress.com Covers many of the topics in this presentation This PowerPoint: in my blog, scroll to bottom, click on SQLBit8 Special thanks to Guang Ming Xing and Simon Jensen who helped reviewing this presentation and provided useful comments (doesn t mean that they agree with the content)

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#