Understanding Slowly Changing Dimensions in SQL Server 2008
Learn about the concept of Slowly Changing Dimensions (SCD) in SQL Server 2008, different types of SCD, the Kimball approach to modeling SCD in a Data Warehouse, loading strategies using SSIS and T-SQL, and subject-specific humor related to SCDs. This presentation by Colin Hardie provides valuable insights for Business Intelligence professionals.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 Level 200 Colin Hardie
Agenda or how long do I have to sit here for? Who am I? Introduction What is a Slowly Changing Dimension (SCD)? Types of SCD The Kimball Approach to SCDs modelling an SCD in the Data Warehouse Alternative designs and why you might want to use them SCDs in the OLAP layer Loading an SCD 1 The SSIS SCD Wizard Loading an SCD 2 BYO (Build Your Own!) Loading an SCD 3 BYO Reloaded - The one that works! (I hope) Loading an SCD 4 SQL Server 2008 TSQL Approach the MERGE statement! Questions (and possibly answers)
Who am I? The name is familiar from somewhere Senior Business Intelligence Consultant with IMGROUP Specialising in SSIS and SSAS with a smattering of RDBMS MCTS & MCITP qualified for Business Intelligence Co-Author of the most popular post on Chris Webb s Blog OLAP Jokes http://cwebbbi.spaces.live.com/Blog/cns!1pi7ETChsJ1un_2s41jm9Iyg!290.entry A girl goes speed dating and after a run of duff prospects a dimension sits down opposite her and they engage in sprightly conversation. After the event is over, her friend asks "Did you get that dimension's number? You two really seemed to hit it off" "No" replies the girl "He was really deep at times - had a lot of levels - but you know parent-child dimensions, just too much of a mummy's boy"
Subject Matter Specific Humour otherwise known as the tenuous link Q: Why did the dimension take all day to take off its suit and put on a pair of jeans? A: It was a slowly-changing dimension! (Yeah it was Chris s)
What is a Slowly Changing Dimension? Time to Ask Kimball or not Somewhat cryptically, Kimball has started defining SCDs as nearly constant dimensions * A more helpful definition might be a dimension that records change over time For most people in the BI community there are nominally 3 different types of SCD, each with a different method of handling change Type 1 where change is handled by overwriting old data, hence you always get the latest view Type 2 where change is stored using a combination of effective dates and new rows of data (more to follow!) Type 3 where change is handled by adding a new column! Wikipedia also defines Types 0, 4 and 6 Type 0 where nothing is done at all (so this is a dimension then???) Type 4 where history tables are used to store old records Type 6 which is a hybrid (invented by Kimball) of Types 1 + 2 + 3 = 6 * The Data Warehouse Toolkit, Second Edition (Ralph Kimball et al, 2002)
Types of SCD Type 1 SCD Handling TECHNIQUE: Replace an old dimension attribute value with a new one Example is the Customer Dimension in the Adventure Works DW database After a freak accident one day, Jon Yang becomes Jane Yang and the row then becomes
Types of SCD Type 2 Handling TECHNIQUE: Add a new row with all of the same values apart from the changed one Kimball states that A type 2 response is the predominant technique for supporting [the representation of prior data correctly] Example is the Employee dimension in the Adventure Works Data Warehouse database One day after an unfortunate incident at the office party, Guy get re-assigned to a new department with a new boss and a new title. Note that the new row receives a new surrogate key, the old row now has CurrentFlag status of 0 and a non-null EndDate
Types of SCD Type 3 Handling TECHNIQUE: Add a new column to capture the previous state of the attribute This technique is used in those cases when we want the ability to compare results using both the old and current values of an attribute Type 2 cannot support this as it partitions the Fact table by attribute value. It is equivalent to a UNION of results for all attribute values. Type 3 is the equivalent of a JOIN of results for current and prior attribute values.
Types of SCD Type 6 Handling TECHNIQUE: Add new rows and columns with updates (1 + 2 + 3) To be used solely in the case where the solution is required to preserve all historical changes whilst also allowing you to compare historical data with the current view
Summary What do we know so far? Slowly Changing Dimensions are a method of handling change in dimensional data over time There are various Types of SCD that each involve one or more techniques including updates, new rows and new columns The most commonly used are Type 1 and Type 2 For the remainder of the session we will be focusing on Type 2 and how they can be implemented
Modelling Type 2 SCDs The Relational Layer The Kimball Approach SCDs like all other Dimensions in the Kimball approach to Data Warehouse design are predominantly modelled using a de-normalized star schema. This can lead to some data redundancy (e.g. repeated attribute values) but generally on the basis of ease of use and query performance (TSQL that is) this is acceptable. Kimball specifically recommends against the use of normalized or snowflake schemas on the grounds that since dimension tables are geometrically smaller than fact tables, improving storage efficiency has virtually no impact * * Data Warehouse Toolkit, Second Edition (Ralph Kimball et al, 2002)
Modelling Type 2 SCDs The Relational Layer The Kimball Approach An important point to note about the Dimension table is the use of a surrogate key (EmployeeKey) rather than the natural business key Without a surrogate key in place, Type 2 changes would not be possible as the business key could not be repeated in a new row without change, at which point it would represent something else The surrogate key becomes the basis for the join to the fact table and for the partitioning of the fact table data based on dimensional change Note that in a single dimension table, both Type 1 and Type 2 attributes can be present
Modelling Type 2 SCDs When one table is never enough In general, using a Star-Schema dimension is the best approach They are easier to understand and in a relational sense perform better ETL is more straightforward Hierarchy relationships can be easily seen by reading across a row However, it should be noted that Kimball tends to exclusively talk about relational BI systems The use of an OLAP layer will significantly mitigate against the problems of a multi-table design In particular there are two main scenarios where a multi-table approach offers a better solution i. ii. A long chain of Type 2 hierarchy attributes A large, rapidly changing Type 2 dimension
Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain Consider the following scenario A large multinational has implemented a central BI tool to manage its global sales force The sales force has the following structure Global Head of Sales Regional Head Country Manager Area Manager Team Manager Sales Rep
Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain In a Star Schema format this type of chain is modelled as one table If all of the managers and efficiency targets are modelled as type 2, then data explosion can occur very quickly, especially for large numbers of granularity members Large amounts of data redundancy is introduced
Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain In a Snowflake Scheme this is modelled as a series of tables This approach has the advantage that if a change occurs in a mid- hierarchy attribute, then the subsequent changes are restricted to that attribute and below This not only reduces storage space but also the burden placed on the ETL system Analysis Services also tends to process slightly more performantly against Snowflake schemas than Star schemas
Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain Suppose we have the following row of data from the Star schema implementation of the dimension 2 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster 1 Bill Bloggs London 30 1 Bruce Bloggs London 25 1 Jack Bloggs UK 25 1 Jean-Claude Bloggs Europe 25 1 Wayne Bloggs 25 SalesRepKey SalesRepName PostalCodeAssignment AssignedProduct1 AssignedProduct2 TeamManagerKey TeamManagerName DistrictName TeamEfficiencyTarget AreaManagerKey AreaManagerName AreaName AreaEfficiencyTarget CountryManagerKey CountryManagerName CountryName CountryEfficiencyTarget 25 RegionManagerKey RegionManagerName RegionName RegionEfficiencyTarget GlobalHeadKey GlobalHeadName GlobalEfficiencyTarget 1 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster 1 Bill Bloggs London 25 1 Bruce Bloggs London 25 1 Jack Bloggs UK 1 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster 1 Bill Bloggs London 30 1 Bruce Bloggs London 25 1 Jill Bloggs UK 25 1 Jean-Claude Bloggs Europe 25 1 Wayne Bloggs 25 1 Jean-Claude Bloggs Europe 25 1 Wayne Bloggs 25
Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain SalesRepKey SalesRepName PostalCodeAssignment AssignedProduct1 AssignedProduct2 TeamManagerKey TeamManagerKey TeamManagerKey SalesRepKey SalesRepName PostalCodeAssignment AssignedProduct1 AssignedProduct2 AssignedProduct2 SalesRepKey SalesRepName PostalCodeAssignment AssignedProduct1 1 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster 1 1 1 1 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster ACME Dust Buster 1 Joe Bloggs WC1 ACME Vacuum Cleaner 2 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster 2 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster 2 2 3 Joe Bloggs WC1 ACME Vacuum Cleaner ACME Dust Buster 3 TeamManagerKey TeamManagerName DistrictName TeamEfficiencyTarget AreaManagerKey AreaManagerKey AreaManagerKey 1 Bill Bloggs London 25 1 1 1 2 Bill Bloggs London 30 1 1 TeamManagerKey TeamManagerName DistrictName TeamEfficiencyTarget TeamEfficiencyTarget 1 Bill Bloggs London 25 25 2 Bill Bloggs London 30 3 Bill Bloggs London 30 2 TeamManagerKey TeamManagerName DistrictName 1 Bill Bloggs London AreaManagerKey AreaManagerName AreaName AreaEfficiencyTarget CountryManagerKey CountryManagerKey AreaManagerKey AreaManagerName AreaName AreaEfficiencyTarget 1 Bruce Bloggs London 25 1 1 1 Bruce Bloggs London 25 2 Bruce Bloggs London 25 2 CountryManagerKey CountryManagerName CountryName CountryEfficiencyTarget RegionManagerKey RegionManagerKey CountryManagerKey CountryManagerName CountryName CountryEfficiencyTarget 1 Jack Bloggs UK 25 1 Jack Bloggs UK 25 1 1 2 Jill Bloggs UK 25 1 RegionManagerKey RegionManagerName RegionName RegionEfficiencyTarget GlobalHeadKey 1 Jean-Claude Bloggs Europe 25 1 GlobalHeadKey GlobalHeadName GlobalEfficiencyTarget 1 Wayne Bloggs 25
Modelling Type 2 SCDs A Rapidly Changing Type 2 Dimension Kimball describes a scenario where you find yourself with what he terms a rapidly changing dimension - essentially very large dimensions with many Type 2 attributes that change frequently e.g. a Customer dimension For multi-million row dimension tables, the standard Type 2 handling techniques fall down due to the data explosion they generate Kimball s suggested solution is to split the dimensions in two, with 2 keys attached to the fact table instead of one In one table the original surrogate key and all of the slowly changing Type 2 attributes are left. In the other, all of the remaining rapidly changing Type 2 attributes are stored with another surrogate key. The relationship between the two is then maintained through the fact table This technique should only be considered for those scenarios described as rapidly changing monster dimension
Modelling Type 2 SCDs A Rapidly Changing Type 2 Dimension The split dimension technique involves taking one dimension with a combination of slowly and rapidly changing Type 2 attributes and creating two
Modelling Type 2 SCDs The OLAP Layer Modelling Type 2 SCDs in SSAS is extremely straight forward Because we use a surrogate key for the granularity of the dimension in the relational layer, all of the hard work has already been done The changing surrogate key means that over time the parent attributes to which this granularity attribute belongs will change over time seamlessly and with no extra effort on our part If we want to see a view across time for a particular member then we can also create an attribute hierarchy based on the business key of the dimension
Modelling Type 2 SCDs The OLAP Layer An Example Suppose we are a pharmaceutical company who want to track sales of a particular drug over time We have a Type 2 SCD Product dimension that records details of the drug over time And a Fact table detailing drug sales over time
Modelling Type 2 SCDs The OLAP Layer An Example DEMO
Loading a Hybrid Type1/Type 2 SCD Our example dimension Over the next few slides and demos we ll be looking at a fairly realistic scenario. What are the methods we can use to load an SCD and in this case a hybrid Type 1/ Type 2 SCD The dimension table I ll using as my test subject is one I created by using the new Red Gate SQL Generator. It looks like this It has 1 million rows Each method of loading the table will employ the same data feed , a SQL table containing 100, 000 new rows 100, 000 existing rows 100, 000 update rows
Loading a Hybrid Type1/Type 2 SCD The SCD Wizard Out of the box, SSIS comes with the Slowly Changing Dimension task It allows you to rapidly develop Slowly Changing Dimension handling in a very short amount of time It has one major drawback though IT S RUBBISH!!!
Loading a Hybrid Type1/Type 2 SCD The SCD Wizard In all but the smallest of tables with very few updates per cycle the task does not perform at all The reason for this poor performance is twofold: 1. The Algorithm - column by column comparison between the destination and incoming rows can quickly become incredibly inefficient and a performance bottleneck 2. Row by Row Processing - When a row has been identified as an update row it is sent to an OLE DB Command which performs a single update statement for that row. All the traditional advantages of SQL set based processing are lost
Loading a Hybrid Type1/Type 2 SCD The SCD Wizard DEMO
Loading a Hybrid Type1/Type 2 SCD Replacing the SCD Task Any solution to the performance problem of the SCD wizard must be able to solve the indicated problems, whilst still retaining the accuracy of the logic Essentially this means that any solution must be able to do the following: 1. Be able to differentiate between new or existing rows 2. Be able to remove the need for column by column comparison 3. Be able to leverage set based rather than row by row based processing
Loading a Hybrid Type1/Type 2 SCD Use MERGE JOINS & HASHING We can create our own solution using MERGE JOIN tasks that will mimic the behaviour of the SCD Task We can combine this with a workflow that allows us to use set based rather than row based processing We can also use CHECKSUM or HASHBYTES to ensure we only have to compare one column
Loading a Hybrid Type1/Type 2 SCD CHECKSUM or HASHBYTES CHECKSUM and HASHBYTES are both hashing algorithms within TSQL Used for combining many columns together in one hash value and building hash indexes over For our purposes this is important it allows us to combine all of the incoming attributes into one column, thus instead of checking multiple columns we only need to check one There is a problem though, the number of collisions with CHECKSUM the number of times the same hash value will be created by different combinations of values is too high Microsoft recommends the use of the HASHBYTES function instead, along with the SHA1 algorithm
Loading a Hybrid Type1/Type 2 SCD Use Lookups DEMO
Loading a Hybrid Type1/Type 2 SCD Use Lookups There are obviously problems with this technique! 1. HASHBYTES, the VARBINARY(8000) it creates and the function used to transform it is a major bottleneck.. MERGE JOIN is not a particularly efficient JOIN algorithm 2.
Loading a Hybrid Type1/Type 2 SCD Tweaking the Solution To get around the second point we can use Lookup tasks The Lookup task essentially caches a whole table or query in memory at runtime and then compares incoming rows with the result of that cache This means queries can be very quickly resolved However, in very large dimensions the MERGE JOIN may perform better due to the memory limitation of the Lookup As for the first point, in practice due to the size of hashing output we are better off comparing the separate columns (assuming they total less than 8000 bytes!)
Loading a Hybrid Type1/Type 2 SCD Tweaking the Solution DEMO
Loading a Hybrid Type1/Type 2 SCD SQL Server 2008 New TSQL Features In SQL Server 2008 (as you have doubtless seen elsewhere today) there are additions to Transact SQL (TSQL) that are relevant for Slowly Changing Dimension handling Specifically these are 1. The MERGE statement 2. The INSERT from DML output
Loading a Hybrid Type1/Type 2 SCD SQL Server 2008 New TSQL Features - MERGE The MERGE statement performs insert, update, or delete operations on a target table based on the results of a join with a source table. The MERGE statement provides three types of WHEN clauses: 1. WHEN MATCHED enables you to UPDATE or DELETE the given row in the target table when the source and target rows match some criteria or criterion. 2. WHEN NOT MATCHED [BY TARGET] enables you to INSERT a row into the target when it exists in the source but not in the target. WHEN NOT MATCHED BY SOURCE enables you to UPDATE or DELETE the given row in the target table when it exists in the target but not in the source. You can specify a search condition with each of the WHEN clauses to choose which type of DML operation should be performed on the row. The OUTPUT clause for the MERGE statement includes a new virtual column called $action that you can use to identify the DML action that was performed on each row. 3.
Loading a Hybrid Type1/Type 2 SCD SQL Server 2008 New TSQL Features - INSERT In SQL Server 2008, INSERT has the new ability to consume the results of DML statements If this new capability is combined with the output of a MERGE, we have the ability to perform Type 2 inserts based on Type 2 updates performed in the MERGE If these pieces of syntax are nested in the right way, then we can perform hybrid updates as we have been doing previously
Loading a Hybrid Type1/Type 2 SCD SQL Server 2008 New TSQL Features - INSERT DEMO
Summary What have we learned? Slowly Changing Dimensions and in particular Type 2 SCDs are a powerful tool in tracking history in a data warehouse There are various ways of modelling Type 2 SCDs in the relational layer The easiest to understand is the Star Schema approach advocated by Kimball and generally this is the best choice. However in certain scenarios, a Snowflake or Split Dimension technique is more suitable Whichever method you choose to model, in the OLAP layer the hard work has already been done and the Type 2 model should just drop out SQL Server 2008 offers several options in loading SCDs The in built SCD Wizard should only be used for the smallest of dimensions It is possible to construct a performant alternative using Lookups (or MERGE JOINS for very large dimensions) and set based UPDATE statements MERGE and INSERT functionality offers a viable alternative for loading in environments where a code based ETL approach has been taken
THE END QUESTIONS?