Slowly Changing Dimensions in SQL Server 2008

 
Choices, Choices - Implementing Slowly
Changing Dimensions in SQL Server
2008 – Level 200
Colin Hardie
Agenda…
…or how long do I have to sit here for?
 
 Who am I?
 Introduction – What is a Slowly Changing Dimension (SCD)?
 Types of SCD
 The Kimball Approach to SCDs – modelling an SCD in the Data Warehouse
 Alternative designs and why you might want to use them
 SCDs in the OLAP layer
 Loading an SCD 1 – The SSIS SCD Wizard
 Loading an SCD 2 – BYO (Build Your Own!)
Loading an SCD 3 – BYO Reloaded - The one that works! (I hope)
 Loading an SCD 4 – SQL Server 2008 TSQL Approach – the MERGE statement!
 Questions (and possibly answers)
Who am I?
The name is familiar from somewhere…
 
 
Senior Business Intelligence Consultant with IMGROUP
 Specialising in SSIS and SSAS with a smattering of RDBMS
 MCTS & MCITP qualified for Business Intelligence
 Co-Author of the most popular post on Chris Webb’s Blog – OLAP Jokes
 
http://cwebbbi.spaces.live.com/Blog/cns!1pi7ETChsJ1un_2s41jm9Iyg!290.entry
 
 
A girl goes speed dating and after a run of duff prospects a dimension sits down
opposite her and they engage in sprightly conversation.
 
After the event is over, her friend asks "Did you get that dimension's number? You
two really seemed to hit it off"
 
"No" replies the girl "He was really deep at times - had a lot of levels - but you know
parent-child dimensions, just too much of a mummy's boy"
 
 
Subject Matter Specific Humour…
…otherwise known as the tenuous link
Q: Why did the dimension take all day to take
off its suit and put on a pair of jeans?
 
A: It was a slowly-changing dimension!
 
(Yeah it was Chris’s)
What is a Slowly Changing Dimension?
Time to Ask Kimball…or not
 
 Somewhat cryptically, Kimball has started defining SCDs as “nearly constant dimensions”*
 A more helpful definition might be “a dimension that records change over time”
 For most people in the BI community there are nominally 3 different types of SCD, each
with a different method of handling change
Type 1 where change is handled by overwriting old data, hence you always get the latest view
Type 2 where change is stored using a combination of effective dates and new rows of data
(more to follow!)
Type 3 where change is handled by adding a new column!
Wikipedia also defines Types 0, 4 and 6
 
Type 0 – where nothing is done at all (so this is a dimension then???)
 Type 4 – where “history tables” are used to store old records
 Type 6 – which is a hybrid (invented by Kimball) of Types 1 + 2 + 3 = 6
 
 
 
* The Data Warehouse Toolkit, Second Edition (Ralph Kimball et al, 2002)
 
 
TYPES OF SCD
Types of SCD
Type 1 SCD Handling
       TECHNIQUE: Replace an old dimension attribute value with a new one
 Example is the Customer Dimension in the Adventure Works DW database
 
 
After a freak accident one day, Jon Yang becomes Jane Yang and the row then
becomes
Types of SCD
Type 2 Handling
TECHNIQUE: Add a new row with all of the same values apart from the changed
one
Kimball states that “A type 2 response is the predominant technique for supporting [the
representation of prior data correctly]”
 Example is the Employee dimension in the Adventure Works Data Warehouse database
 
 One day after an unfortunate incident at the office party, Guy get re-assigned to a new
department with a new boss and a new title.
 
 Note that the new row receives a new surrogate key, the old row now has CurrentFlag
status of 0 and a non-null EndDate
Types of SCD
Type 3 Handling
TECHNIQUE: Add a new column to capture the previous state of the attribute
 
 This technique is used in those cases when we want the ability to compare results using
both the old and current values of an attribute
 Type 2 cannot support this as it partitions the Fact table by attribute value. It is
equivalent to a UNION of results for all attribute values.
 Type 3 is the equivalent of a JOIN of results for current and prior attribute values.
Types of SCD
Type 6 Handling
TECHNIQUE: Add new rows and columns with updates (1 + 2 + 3)
 To be used solely in the case where the solution is required to preserve all historical
changes whilst also allowing you to compare historical data with the current view
Summary
What do we know so far?
 
 Slowly Changing Dimensions are a method of handling change in
dimensional data over time
 There are various “Types” of SCD that each involve one or more
techniques including updates, new rows and new columns
 The most commonly used are Type 1 and Type 2
 For the remainder of the session we will be focusing on Type 2 and
how they can be implemented
 
 
 
SCDs IN THE RELATIONAL LAYER
Modelling Type 2 SCDs
The Relational Layer – The Kimball Approach
 
 SCDs like all other Dimensions in the Kimball approach to Data Warehouse design are
predominantly modelled using a de-normalized star schema.
 This can lead to some data redundancy (e.g. repeated attribute values) but generally on
the basis of ease of use and query performance (TSQL that is) this is acceptable.
 Kimball specifically recommends against the use of normalized or snowflake schemas on
the grounds that “since dimension tables are geometrically smaller than fact tables,
improving storage efficiency…has virtually no impact“*
* Data Warehouse Toolkit, Second Edition (Ralph Kimball et al, 2002)
Modelling Type 2 SCDs
The Relational Layer – The Kimball Approach
 
 An important point to note about the Dimension table is the use of a surrogate key
(EmployeeKey) rather than the natural business key
 Without a surrogate key in place, Type 2 changes would not be possible as the business key
could not be repeated in a new row without change, at which point it would represent
something else
 The surrogate key becomes the basis for the join to the fact table and for the partitioning
of the fact table data based on dimensional change
 Note that in a single dimension table, both Type 1 and Type 2 attributes can be present
Modelling Type 2 SCDs
When one table is never enough…
 
 In general, using a Star-Schema dimension is the best approach
 They are easier to understand and in a relational sense perform better
 ETL is more straightforward
 Hierarchy relationships can be easily seen by reading across a row
 However, it should be noted that Kimball tends to exclusively talk about relational BI
systems
 The use of an OLAP layer will significantly mitigate against the problems of a multi-table
design
 In particular there are two main scenarios where a multi-table approach offers a better
solution
 
i.
A long chain of Type 2 hierarchy attributes
ii.
A large, rapidly changing Type 2 dimension
Modelling Type 2 SCDs
A Long Type 2 Hierarchy Chain
 Consider the following scenario…
 A large multinational has implemented a central BI tool to manage its global sales force
 The sales force has the following structure
Global Head of Sales
Regional Head
Country Manager
Area Manager
Team Manager
Sales Rep
Modelling Type 2 SCDs
A Long Type 2 Hierarchy Chain
 In a Star Schema format this type of chain is modelled as one table
 
 If all of the managers and efficiency targets are
modelled as type 2, then data explosion can occur very
quickly, especially for large numbers of granularity
members
 Large amounts of data redundancy is introduced
Modelling Type 2 SCDs
A Long Type 2 Hierarchy Chain
 In a Snowflake Scheme this is modelled as a series of tables
 
 This approach has the advantage
that if a change occurs in a mid-
hierarchy attribute, then the
subsequent changes are restricted to
that attribute and below
 This not only reduces storage space
but also the burden placed on the
ETL system
 Analysis Services also tends to
process slightly  more performantly
against Snowflake schemas than Star
schemas
Modelling Type 2 SCDs
A Long Type 2 Hierarchy Chain
 Suppose we have the following row of data from the Star schema implementation of the
dimension
Modelling Type 2 SCDs
A Long Type 2 Hierarchy Chain
Modelling Type 2 SCDs
A Rapidly Changing Type 2 Dimension
 
 Kimball describes a scenario where you find yourself with what he terms a “rapidly
changing dimension” - essentially very large dimensions with many Type 2 attributes that
change frequently  e.g. a Customer dimension
 For multi-million row dimension tables, the standard Type 2 handling techniques fall
down due to the data explosion they generate
 Kimball’s suggested solution is to split the dimensions in two, with 2 keys attached to the
fact table instead of one
 In one table the original surrogate key and all of the slowly changing Type 2 attributes
are left.
In the other, all of the remaining rapidly changing Type 2 attributes are stored with
another surrogate key. The relationship between the two is then maintained through the
fact table
 This technique should only be considered for those scenarios described as “rapidly
changing monster dimension”
Modelling Type 2 SCDs
A Rapidly Changing Type 2 Dimension
 The split dimension technique involves taking one dimension with a combination of
slowly and rapidly changing Type 2 attributes and creating two
 
 
SCDs IN THE OLAP LAYER
Modelling Type 2 SCDs
The OLAP Layer
 
 Modelling Type 2 SCDs in SSAS is extremely straight forward
 Because we use a surrogate key for the granularity of the dimension in the relational
layer, all of the hard work has already been done
 The changing surrogate key means that over time the “parent” attributes to which this
granularity attribute belongs will change over time seamlessly and with no extra effort on
our part
 If we want to see a view across time for a particular member then we can also create an
attribute hierarchy based on the business key of the dimension
Modelling Type 2 SCDs
The OLAP Layer – An Example
 
 Suppose we are a pharmaceutical company who want to track sales of a particular drug
over time
 We have a Type 2 SCD Product dimension that records details of the drug over time
 And a Fact table detailing drug sales over time
 
 
Modelling Type 2 SCDs
 
The OLAP Layer – An Example
 
DEMO
 
 
LOADING SCDs
Loading a Hybrid Type1/Type 2 SCD
Our example dimension
 
 Over the next few slides and demos we’ll be looking at a fairly realistic scenario. What
are the methods we can use to load an SCD and in this case a hybrid Type 1/ Type 2 SCD
 The dimension table I’ll using as my test subject is one I created by using the new Red
Gate SQL Generator.
 It looks like this
 
 
 
 
 
 
 
 It has 1 million rows
 Each method of loading the table will employ the same data “feed”, a SQL table
containing
 100, 000 new rows
 100, 000 existing rows
 100, 000 update rows
Loading a Hybrid Type1/Type 2 SCD
The SCD Wizard
 Out of the box, SSIS comes with the Slowly Changing Dimension task
 It allows you to rapidly develop Slowly Changing Dimension handling in a very short
amount of time
 It has one major drawback though…
 
 
I
T
S
R
U
B
B
I
S
H
!
!
!
Loading a Hybrid Type1/Type 2 SCD
The SCD Wizard
 
 In all but the smallest of tables with very few updates per cycle the task does not
perform at all
 The reason for this poor performance is twofold:
1.
The Algorithm - column by column comparison between the destination and
incoming rows can quickly become incredibly inefficient and a performance
bottleneck
2.
Row by Row Processing - When a row has been identified as an update row it is sent
to an OLE DB Command which performs a single update statement for that row. All
the traditional advantages of SQL set based processing are lost
 
Loading a Hybrid Type1/Type 2 SCD
 
The SCD Wizard
 
DEMO
Loading a Hybrid Type1/Type 2 SCD
Replacing the SCD Task
 
 
Any solution to the performance problem of the SCD wizard must be able to
solve the indicated problems, whilst still retaining the accuracy of the logic
 Essentially this means that any solution must be able to do the following:
1.
Be able to differentiate between new or existing rows
2.
Be able to remove the need for column by column comparison
3.
Be able to leverage set based rather than row by row based processing
Loading a Hybrid Type1/Type 2 SCD
Use MERGE JOINS & HASHING
 
 
We can create our own solution using MERGE JOIN tasks that will mimic the
behaviour of the SCD Task
  We can combine this with a workflow that allows us to use set based rather
than row based processing
 We can also use CHECKSUM or HASHBYTES to ensure we only have to compare
one column
Loading a Hybrid Type1/Type 2 SCD
CHECKSUM or HASHBYTES
 
 
CHECKSUM and HASHBYTES are both hashing algorithms within TSQL
 Used for combining many columns together in one “hash” value and building
hash indexes over
 For our purposes this is important – it allows us to combine all of the incoming
attributes into one column, thus instead of checking multiple columns we only
need to check one
 There is a problem though, the number of “collisions” with CHECKSUM – the
number of times the same hash value will be created by different combinations
of values is too high
 Microsoft recommends the use of the HASHBYTES function instead, along with
the SHA1 algorithm
 
Loading a Hybrid Type1/Type 2 SCD
 
Use Lookups
 
DEMO
Loading a Hybrid Type1/Type 2 SCD
Use Lookups
 
 
There are obviously problems with this technique!
 
1.
HASHBYTES, the VARBINARY(8000) it creates and the function used to
transform it is a major bottleneck..
2.
MERGE JOIN is not a particularly efficient JOIN algorithm
Loading a Hybrid Type1/Type 2 SCD
Tweaking the Solution
 
 
To get around the second point we can use Lookup tasks
 The Lookup task essentially caches a whole table or query in memory at
runtime and then compares incoming rows with the result of that cache
 This means queries can be very quickly resolved
 However, in very large dimensions the MERGE JOIN may perform better due to
the memory limitation of the Lookup
 As for the first point, in practice due to the size of hashing output we are better
off comparing the separate columns (assuming they total less than 8000 bytes!)
 
Loading a Hybrid Type1/Type 2 SCD
 
Tweaking the Solution
 
DEMO
Loading a Hybrid Type1/Type 2 SCD
SQL Server 2008 New TSQL Features
 
 
In SQL Server 2008 (as you have doubtless seen elsewhere today) there are
additions to Transact SQL (TSQL) that are relevant for Slowly Changing
Dimension handling
 Specifically these are
1.
The MERGE statement
2.
The INSERT from DML output
Loading a Hybrid Type1/Type 2 SCD
SQL Server 2008 New TSQL Features - MERGE
 
 The MERGE statement performs insert, update, or delete operations on a target table
based on the results of a join with a source table.
 The MERGE statement provides three types of WHEN clauses:
1.
WHEN MATCHED enables you to UPDATE or DELETE the given row in the target
table when the source and target rows match some criteria or criterion.
2.
WHEN NOT MATCHED [BY TARGET] enables you to INSERT a row into the target
when it exists in the source but not in the target.
3.
WHEN NOT MATCHED BY SOURCE enables you to UPDATE or DELETE the given
row in the target table when it exists in the target but not in the source.
 You can specify a search condition with each of the WHEN clauses to choose which type
of DML operation should be performed on the row.
 The OUTPUT clause for the MERGE statement includes a new virtual column called
$action
 that you can use to identify the DML action that was performed on each row.
Loading a Hybrid Type1/Type 2 SCD
SQL Server 2008 New TSQL Features - INSERT
 
 
In SQL Server 2008, INSERT has the new ability to consume the results of DML
statements
If this new capability is combined with the output of a MERGE, we have the
ability to perform Type 2 inserts based on Type 2 updates performed in the
MERGE
 If these pieces of syntax are nested in the right way, then we can perform
hybrid updates as we have been doing previously
 
 
Loading a Hybrid Type1/Type 2 SCD
 
SQL Server 2008 New TSQL Features - INSERT
 
DEMO
Summary
What have we learned?
 
 Slowly Changing Dimensions and in particular Type 2 SCDs are a powerful tool in tracking
history in a data warehouse
 There are various ways of modelling Type 2 SCDs in the relational layer
 The easiest to understand is the Star Schema approach advocated by Kimball and
generally this is the best choice. However in certain scenarios, a Snowflake or Split
Dimension technique is more suitable
 Whichever method you choose to model, in the OLAP layer the hard work has already
been done and the Type 2 model should just drop out
 SQL Server 2008 offers several options in loading SCDs
 The in built SCD Wizard should only be used for the smallest of dimensions
 It is possible to construct a performant alternative using Lookups (or MERGE JOINS for
very large dimensions) and set based UPDATE statements
 MERGE and INSERT functionality offers a viable alternative for loading in environments
where a code based ETL approach has been taken
 
THE END
 
 
QUESTIONS?
Slide Note
Embed
Share

Learn about the concept of Slowly Changing Dimensions (SCD) in SQL Server 2008, different types of SCD, the Kimball approach to modeling SCD in a Data Warehouse, loading strategies using SSIS and T-SQL, and subject-specific humor related to SCDs. This presentation by Colin Hardie provides valuable insights for Business Intelligence professionals.

  • Slowly Changing Dimensions
  • SQL Server 2008
  • Data Warehouse
  • Kimball Approach
  • Business Intelligence

Uploaded on Oct 08, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 Level 200 Colin Hardie

  2. Agenda or how long do I have to sit here for? Who am I? Introduction What is a Slowly Changing Dimension (SCD)? Types of SCD The Kimball Approach to SCDs modelling an SCD in the Data Warehouse Alternative designs and why you might want to use them SCDs in the OLAP layer Loading an SCD 1 The SSIS SCD Wizard Loading an SCD 2 BYO (Build Your Own!) Loading an SCD 3 BYO Reloaded - The one that works! (I hope) Loading an SCD 4 SQL Server 2008 TSQL Approach the MERGE statement! Questions (and possibly answers)

  3. Who am I? The name is familiar from somewhere Senior Business Intelligence Consultant with IMGROUP Specialising in SSIS and SSAS with a smattering of RDBMS MCTS & MCITP qualified for Business Intelligence Co-Author of the most popular post on Chris Webb s Blog OLAP Jokes http://cwebbbi.spaces.live.com/Blog/cns!1pi7ETChsJ1un_2s41jm9Iyg!290.entry A girl goes speed dating and after a run of duff prospects a dimension sits down opposite her and they engage in sprightly conversation. After the event is over, her friend asks "Did you get that dimension's number? You two really seemed to hit it off" "No" replies the girl "He was really deep at times - had a lot of levels - but you know parent-child dimensions, just too much of a mummy's boy"

  4. Subject Matter Specific Humour otherwise known as the tenuous link Q: Why did the dimension take all day to take off its suit and put on a pair of jeans? A: It was a slowly-changing dimension! (Yeah it was Chris s)

  5. What is a Slowly Changing Dimension? Time to Ask Kimball or not Somewhat cryptically, Kimball has started defining SCDs as nearly constant dimensions * A more helpful definition might be a dimension that records change over time For most people in the BI community there are nominally 3 different types of SCD, each with a different method of handling change Type 1 where change is handled by overwriting old data, hence you always get the latest view Type 2 where change is stored using a combination of effective dates and new rows of data (more to follow!) Type 3 where change is handled by adding a new column! Wikipedia also defines Types 0, 4 and 6 Type 0 where nothing is done at all (so this is a dimension then???) Type 4 where history tables are used to store old records Type 6 which is a hybrid (invented by Kimball) of Types 1 + 2 + 3 = 6 * The Data Warehouse Toolkit, Second Edition (Ralph Kimball et al, 2002)

  6. TYPES OF SCD

  7. Types of SCD Type 1 SCD Handling TECHNIQUE: Replace an old dimension attribute value with a new one Example is the Customer Dimension in the Adventure Works DW database After a freak accident one day, Jon Yang becomes Jane Yang and the row then becomes

  8. Types of SCD Type 2 Handling TECHNIQUE: Add a new row with all of the same values apart from the changed one Kimball states that A type 2 response is the predominant technique for supporting [the representation of prior data correctly] Example is the Employee dimension in the Adventure Works Data Warehouse database One day after an unfortunate incident at the office party, Guy get re-assigned to a new department with a new boss and a new title. Note that the new row receives a new surrogate key, the old row now has CurrentFlag status of 0 and a non-null EndDate

  9. Types of SCD Type 3 Handling TECHNIQUE: Add a new column to capture the previous state of the attribute This technique is used in those cases when we want the ability to compare results using both the old and current values of an attribute Type 2 cannot support this as it partitions the Fact table by attribute value. It is equivalent to a UNION of results for all attribute values. Type 3 is the equivalent of a JOIN of results for current and prior attribute values.

  10. Types of SCD Type 6 Handling TECHNIQUE: Add new rows and columns with updates (1 + 2 + 3) To be used solely in the case where the solution is required to preserve all historical changes whilst also allowing you to compare historical data with the current view

  11. Summary What do we know so far? Slowly Changing Dimensions are a method of handling change in dimensional data over time There are various Types of SCD that each involve one or more techniques including updates, new rows and new columns The most commonly used are Type 1 and Type 2 For the remainder of the session we will be focusing on Type 2 and how they can be implemented

  12. SCDs IN THE RELATIONAL LAYER

  13. Modelling Type 2 SCDs The Relational Layer The Kimball Approach SCDs like all other Dimensions in the Kimball approach to Data Warehouse design are predominantly modelled using a de-normalized star schema. This can lead to some data redundancy (e.g. repeated attribute values) but generally on the basis of ease of use and query performance (TSQL that is) this is acceptable. Kimball specifically recommends against the use of normalized or snowflake schemas on the grounds that since dimension tables are geometrically smaller than fact tables, improving storage efficiency has virtually no impact * * Data Warehouse Toolkit, Second Edition (Ralph Kimball et al, 2002)

  14. Modelling Type 2 SCDs The Relational Layer The Kimball Approach An important point to note about the Dimension table is the use of a surrogate key (EmployeeKey) rather than the natural business key Without a surrogate key in place, Type 2 changes would not be possible as the business key could not be repeated in a new row without change, at which point it would represent something else The surrogate key becomes the basis for the join to the fact table and for the partitioning of the fact table data based on dimensional change Note that in a single dimension table, both Type 1 and Type 2 attributes can be present

  15. Modelling Type 2 SCDs When one table is never enough In general, using a Star-Schema dimension is the best approach They are easier to understand and in a relational sense perform better ETL is more straightforward Hierarchy relationships can be easily seen by reading across a row However, it should be noted that Kimball tends to exclusively talk about relational BI systems The use of an OLAP layer will significantly mitigate against the problems of a multi-table design In particular there are two main scenarios where a multi-table approach offers a better solution i. ii. A long chain of Type 2 hierarchy attributes A large, rapidly changing Type 2 dimension

  16. Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain Consider the following scenario A large multinational has implemented a central BI tool to manage its global sales force The sales force has the following structure Global Head of Sales Regional Head Country Manager Area Manager Team Manager Sales Rep

  17. Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain In a Star Schema format this type of chain is modelled as one table If all of the managers and efficiency targets are modelled as type 2, then data explosion can occur very quickly, especially for large numbers of granularity members Large amounts of data redundancy is introduced

  18. Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain In a Snowflake Scheme this is modelled as a series of tables This approach has the advantage that if a change occurs in a mid- hierarchy attribute, then the subsequent changes are restricted to that attribute and below This not only reduces storage space but also the burden placed on the ETL system Analysis Services also tends to process slightly more performantly against Snowflake schemas than Star schemas

  19. Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain Suppose we have the following row of data from the Star schema implementation of the dimension 2 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster 1 Bill Bloggs London 30 1 Bruce Bloggs London 25 1 Jack Bloggs UK 25 1 Jean-Claude Bloggs Europe 25 1 Wayne Bloggs 25 SalesRepKey SalesRepName PostalCodeAssignment AssignedProduct1 AssignedProduct2 TeamManagerKey TeamManagerName DistrictName TeamEfficiencyTarget AreaManagerKey AreaManagerName AreaName AreaEfficiencyTarget CountryManagerKey CountryManagerName CountryName CountryEfficiencyTarget 25 RegionManagerKey RegionManagerName RegionName RegionEfficiencyTarget GlobalHeadKey GlobalHeadName GlobalEfficiencyTarget 1 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster 1 Bill Bloggs London 25 1 Bruce Bloggs London 25 1 Jack Bloggs UK 1 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster 1 Bill Bloggs London 30 1 Bruce Bloggs London 25 1 Jill Bloggs UK 25 1 Jean-Claude Bloggs Europe 25 1 Wayne Bloggs 25 1 Jean-Claude Bloggs Europe 25 1 Wayne Bloggs 25

  20. Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain SalesRepKey SalesRepName PostalCodeAssignment AssignedProduct1 AssignedProduct2 TeamManagerKey TeamManagerKey TeamManagerKey SalesRepKey SalesRepName PostalCodeAssignment AssignedProduct1 AssignedProduct2 AssignedProduct2 SalesRepKey SalesRepName PostalCodeAssignment AssignedProduct1 1 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster 1 1 1 1 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster ACME Dust Buster 1 Joe Bloggs WC1 ACME Vacuum Cleaner 2 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster 2 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster 2 2 3 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster 3 TeamManagerKey TeamManagerName DistrictName TeamEfficiencyTarget AreaManagerKey AreaManagerKey AreaManagerKey 1 Bill Bloggs London 25 1 1 1 2 Bill Bloggs London 30 1 1 TeamManagerKey TeamManagerName DistrictName TeamEfficiencyTarget TeamEfficiencyTarget 1 Bill Bloggs London 25 25 2 Bill Bloggs London 30 3 Bill Bloggs London 30 2 TeamManagerKey TeamManagerName DistrictName 1 Bill Bloggs London AreaManagerKey AreaManagerName AreaName AreaEfficiencyTarget CountryManagerKey CountryManagerKey AreaManagerKey AreaManagerName AreaName AreaEfficiencyTarget 1 Bruce Bloggs London 25 1 1 1 Bruce Bloggs London 25 2 Bruce Bloggs London 25 2 CountryManagerKey CountryManagerName CountryName CountryEfficiencyTarget RegionManagerKey RegionManagerKey CountryManagerKey CountryManagerName CountryName CountryEfficiencyTarget 1 Jack Bloggs UK 25 1 Jack Bloggs UK 25 1 1 2 Jill Bloggs UK 25 1 RegionManagerKey RegionManagerName RegionName RegionEfficiencyTarget GlobalHeadKey 1 Jean-Claude Bloggs Europe 25 1 GlobalHeadKey GlobalHeadName GlobalEfficiencyTarget 1 Wayne Bloggs 25

  21. Modelling Type 2 SCDs A Rapidly Changing Type 2 Dimension Kimball describes a scenario where you find yourself with what he terms a rapidly changing dimension - essentially very large dimensions with many Type 2 attributes that change frequently e.g. a Customer dimension For multi-million row dimension tables, the standard Type 2 handling techniques fall down due to the data explosion they generate Kimball s suggested solution is to split the dimensions in two, with 2 keys attached to the fact table instead of one In one table the original surrogate key and all of the slowly changing Type 2 attributes are left. In the other, all of the remaining rapidly changing Type 2 attributes are stored with another surrogate key. The relationship between the two is then maintained through the fact table This technique should only be considered for those scenarios described as rapidly changing monster dimension

  22. Modelling Type 2 SCDs A Rapidly Changing Type 2 Dimension The split dimension technique involves taking one dimension with a combination of slowly and rapidly changing Type 2 attributes and creating two

  23. SCDs IN THE OLAP LAYER

  24. Modelling Type 2 SCDs The OLAP Layer Modelling Type 2 SCDs in SSAS is extremely straight forward Because we use a surrogate key for the granularity of the dimension in the relational layer, all of the hard work has already been done The changing surrogate key means that over time the parent attributes to which this granularity attribute belongs will change over time seamlessly and with no extra effort on our part If we want to see a view across time for a particular member then we can also create an attribute hierarchy based on the business key of the dimension

  25. Modelling Type 2 SCDs The OLAP Layer An Example Suppose we are a pharmaceutical company who want to track sales of a particular drug over time We have a Type 2 SCD Product dimension that records details of the drug over time And a Fact table detailing drug sales over time

  26. Modelling Type 2 SCDs The OLAP Layer An Example DEMO

  27. LOADING SCDs

  28. Loading a Hybrid Type1/Type 2 SCD Our example dimension Over the next few slides and demos we ll be looking at a fairly realistic scenario. What are the methods we can use to load an SCD and in this case a hybrid Type 1/ Type 2 SCD The dimension table I ll using as my test subject is one I created by using the new Red Gate SQL Generator. It looks like this It has 1 million rows Each method of loading the table will employ the same data feed , a SQL table containing 100, 000 new rows 100, 000 existing rows 100, 000 update rows

  29. Loading a Hybrid Type1/Type 2 SCD The SCD Wizard Out of the box, SSIS comes with the Slowly Changing Dimension task It allows you to rapidly develop Slowly Changing Dimension handling in a very short amount of time It has one major drawback though IT S RUBBISH!!!

  30. Loading a Hybrid Type1/Type 2 SCD The SCD Wizard In all but the smallest of tables with very few updates per cycle the task does not perform at all The reason for this poor performance is twofold: 1. The Algorithm - column by column comparison between the destination and incoming rows can quickly become incredibly inefficient and a performance bottleneck 2. Row by Row Processing - When a row has been identified as an update row it is sent to an OLE DB Command which performs a single update statement for that row. All the traditional advantages of SQL set based processing are lost

  31. Loading a Hybrid Type1/Type 2 SCD The SCD Wizard DEMO

  32. Loading a Hybrid Type1/Type 2 SCD Replacing the SCD Task Any solution to the performance problem of the SCD wizard must be able to solve the indicated problems, whilst still retaining the accuracy of the logic Essentially this means that any solution must be able to do the following: 1. Be able to differentiate between new or existing rows 2. Be able to remove the need for column by column comparison 3. Be able to leverage set based rather than row by row based processing

  33. Loading a Hybrid Type1/Type 2 SCD Use MERGE JOINS & HASHING We can create our own solution using MERGE JOIN tasks that will mimic the behaviour of the SCD Task We can combine this with a workflow that allows us to use set based rather than row based processing We can also use CHECKSUM or HASHBYTES to ensure we only have to compare one column

  34. Loading a Hybrid Type1/Type 2 SCD CHECKSUM or HASHBYTES CHECKSUM and HASHBYTES are both hashing algorithms within TSQL Used for combining many columns together in one hash value and building hash indexes over For our purposes this is important it allows us to combine all of the incoming attributes into one column, thus instead of checking multiple columns we only need to check one There is a problem though, the number of collisions with CHECKSUM the number of times the same hash value will be created by different combinations of values is too high Microsoft recommends the use of the HASHBYTES function instead, along with the SHA1 algorithm

  35. Loading a Hybrid Type1/Type 2 SCD Use Lookups DEMO

  36. Loading a Hybrid Type1/Type 2 SCD Use Lookups There are obviously problems with this technique! 1. HASHBYTES, the VARBINARY(8000) it creates and the function used to transform it is a major bottleneck.. MERGE JOIN is not a particularly efficient JOIN algorithm 2.

  37. Loading a Hybrid Type1/Type 2 SCD Tweaking the Solution To get around the second point we can use Lookup tasks The Lookup task essentially caches a whole table or query in memory at runtime and then compares incoming rows with the result of that cache This means queries can be very quickly resolved However, in very large dimensions the MERGE JOIN may perform better due to the memory limitation of the Lookup As for the first point, in practice due to the size of hashing output we are better off comparing the separate columns (assuming they total less than 8000 bytes!)

  38. Loading a Hybrid Type1/Type 2 SCD Tweaking the Solution DEMO

  39. Loading a Hybrid Type1/Type 2 SCD SQL Server 2008 New TSQL Features In SQL Server 2008 (as you have doubtless seen elsewhere today) there are additions to Transact SQL (TSQL) that are relevant for Slowly Changing Dimension handling Specifically these are 1. The MERGE statement 2. The INSERT from DML output

  40. Loading a Hybrid Type1/Type 2 SCD SQL Server 2008 New TSQL Features - MERGE The MERGE statement performs insert, update, or delete operations on a target table based on the results of a join with a source table. The MERGE statement provides three types of WHEN clauses: 1. WHEN MATCHED enables you to UPDATE or DELETE the given row in the target table when the source and target rows match some criteria or criterion. 2. WHEN NOT MATCHED [BY TARGET] enables you to INSERT a row into the target when it exists in the source but not in the target. WHEN NOT MATCHED BY SOURCE enables you to UPDATE or DELETE the given row in the target table when it exists in the target but not in the source. You can specify a search condition with each of the WHEN clauses to choose which type of DML operation should be performed on the row. The OUTPUT clause for the MERGE statement includes a new virtual column called $action that you can use to identify the DML action that was performed on each row. 3.

  41. Loading a Hybrid Type1/Type 2 SCD SQL Server 2008 New TSQL Features - INSERT In SQL Server 2008, INSERT has the new ability to consume the results of DML statements If this new capability is combined with the output of a MERGE, we have the ability to perform Type 2 inserts based on Type 2 updates performed in the MERGE If these pieces of syntax are nested in the right way, then we can perform hybrid updates as we have been doing previously

  42. Loading a Hybrid Type1/Type 2 SCD SQL Server 2008 New TSQL Features - INSERT DEMO

  43. Summary What have we learned? Slowly Changing Dimensions and in particular Type 2 SCDs are a powerful tool in tracking history in a data warehouse There are various ways of modelling Type 2 SCDs in the relational layer The easiest to understand is the Star Schema approach advocated by Kimball and generally this is the best choice. However in certain scenarios, a Snowflake or Split Dimension technique is more suitable Whichever method you choose to model, in the OLAP layer the hard work has already been done and the Type 2 model should just drop out SQL Server 2008 offers several options in loading SCDs The in built SCD Wizard should only be used for the smallest of dimensions It is possible to construct a performant alternative using Lookups (or MERGE JOINS for very large dimensions) and set based UPDATE statements MERGE and INSERT functionality offers a viable alternative for loading in environments where a code based ETL approach has been taken

  44. THE END QUESTIONS?

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#