Slowly changing dimension

Dimension is a term in data management and data warehousing that refers to logical groupings of data such as geographical location, customer information, or product information. Slowly Changing Dimensions (SCD) are dimensions that have data that slowly changes.

For example, you may have a Dimension in your database that tracks the sales records of your company's salesmen. Creating sales reports seems simple enough, until a sales person is transferred from one regional office to another. How do you record such a change in your sales Dimension?

You could sum or average the sales by salesperson, but if you use that to compare the performance of sales people, that might give misleading information. If the sales person that was tranferred used to work in a hot market where sales were easy, and now works in a market where sales are infrequent, his totals will look much stronger than the other salespeople in his new region, even if they are just as good. Or you could create a second sales-person record and treat the transferred person as a new sales person, but that creates problems also.

Dealing with these issues involves SCD management methodologies referred to as Type 0, 1, 2, 3, 4, and 6. Type 6 SCDs are also sometimes called Hybrid SCDs.

Type 0
Type 0 is used somewhat infrequently, to refer to an SCD where no effort has been made to deal with the issues.

Type 1
The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is most appropriate when correcting certain types of data errors, such as the spelling of a name. (Assuming you won't ever need to know how it used to be misspelled in the past.)

Another example would be of a database table that keeps supplier information.

Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply overwrite this record:

The obvious disadvantage to this method of managing SCDs is that there is no historical record kept in the data warehouse. You can't tell if your suppliers are tending to move to the Midwest, for example. But an advantage to this is that these are very easy to maintain.

Type 2
The Type 2 method tracks historical data by creating multiple records in the dimensional tables with separate keys.

In the same example, if the supplier moves to Illinois, the table would look like this:

Another popular method for tuple versioning is to add effective date columns.

Null End_Date signifies current tuple version.

Type 3
The Type 3 method track changes using separate columns. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 will add additional columns to the tables:

Note that this record can not track all historical changes, such as when a supplier moves twice.

Type 4
The Type 4 method is usually just referred to as using "history tables", where one table keeps the current data, and an additional table is used to keep a record of some or all changes.

Following the example above, the original table might be called Supplier and the history table might be called Supplier_History.

Type 6 / Hybrid
The Type 6 method is one that combines the approaches of types 1,2 and 3 (1 + 2 + 3 = 6). It is not frequently used because it has the potential to complicate end user access, but has some advantages over the other approaches especially when techniques are employed to mitigate the downstream complexity.

The approach is to use a Type 1 slowly changing dimension, but adding an additional pair of date columns to indicate the date range at which a particular row in the dimension applies.

This approach has a number of advantages:
 * the user can choose to query using the current values of the dimensional table by restricting the rows in the Dimension table using a filter to only select current values
 * alternatively the user can use the "as at the time of the transaction" values by using one of the date fields on the transaction as a constraint on the dimension table.
 * if there are a number of date columns on the transaction (e.g. Order Date, Shipping Date, Confirmation Date) then the user can choose which date to analyze the fact data by - something not possible using other approaches.

This is how the Supplier table would look using Type 6 Slowly Changing Dimensions:

Alternative implementations of Type 6 can include a blank end date, or add a flag to indicate the current row.

Example SQL:

To query the Star Schema using the historic reference data, the query looks like this:

To query the Star Schema using the current reference data, the query looks like this:

Notes
 * NOW is SQL Server syntax, for Oracle use SYSDATE


 * Caution - If the WHERE clause restricting the rows in the dimension table is not present, then the query will potentially return duplicate rows and give the wrong answers, so this technique should be used with care.


 * Some Business Intelligence tools do not handle generating complex joins such as this well.


 * The Etl processes needed to create this table also need to be carefully designed to ensure that there are no overlaps in the time periods for each distinct item of reference data.


 * The >= and < clause is required to ensure no time period is omitted.


 * Often a view is created over the table which can filter out the rows or columns. This simplifies joins to the table if only the current rows are needed for certain queries. This view could be materialized to a physical table if storage space is not a problem. This can be done automatically by most modern DBMS's and automatically kept up to date.

The software product Kalido DIW uses this approach for managing slowly changing dimensions.

Combining Types
Different types can be applied to different columns of a table. For example, we can apply Type 1 to Supplier Name column and Type 2 to Supplier State column of the same table, the Supplier table.

Alternative Typing
Todd Schraml has a well-written article available at Data Management Review.