Before jumping into creating a cube or tabular model in Analysis Service, the database used as source data should be well structured using best practices for data modeling. Some might say use Dimensional Modeling or Inmon’s data warehouse concepts while others say go with the future, Data Vault. No matter what conceptual path is taken, the tables can be well structured with the proper data types, sizes and constraints.
This article is going to use a scaled down example of the Adventure Works Data Warehouse. The full version of the Adventure Works DW can be downloaded from CodePlex.
Taking on a complete data warehouse project can be overwhelming. The best path from experience is to create the minimum structure that yields a good result. Not the best, but something of value where management will approve and extend other data warehouse projects.
The beginning can be a data mart that might answer the following questions:
- Compare Sales of Mountain Bikes for years 2015 and 2016
- What happens to profits when you remove the Bike Category?
- Show average sales for subcategories within categories in the USA?
- Calculate Gross Profit Margin for Internet Sales region by Year, Quarter and Month
From these specific inquiries, we can get an idea of the data required. There will be a Data dimension because of the need for Year, Quarter and Month. The main numerical value is sales, but there is a calculation for Gross Profit Margin. Sales Line Item Cost is required to compute Gross profit, then Gross Profit Margin.
There is reference to a Category value of Bike and slicing into Subcategories. Products for Sales Line Items will be relate to Subcategory and Category. Average Sales will be another calculation and is sliced by Country.
The Country is not specific, so the question where is the Country located, from Customers or from Sales Territories. The Reseller Sales does not have a different customer for this company but Internet Sales does. So, Sales Territory and Geography will get the country along with Group and Region.
Figure 2 shows a business database diagram of the full Adventure Works DW with just the fact table for Internet Sales.
There are a lot of columns that can be very confusing. Figure 3 scales down the model to something simpler.
Since all the dimension tables have IDs as a primary key, the use of the integer data type needs to be looked at. There are different uses like bigint, int, smallint and tinyint. You could even use Boolean if there was only 2 possible integer values – 0 or 1. This design uses an int data type for the Date dimension key because it uses values like 20170124, 20170125, etc.
The Category and Subcategory dimensions can use tinyint because there are less than 255 possible rows. Smallint would work for Customer because there will not be more than 32,000 rows. The int data type will allow 2,000,000+ rows if you use all positive values. You can also use negative values which work well when you have early arriving facts.
The fact table will have these ID columns as foreign keys, and because the fact table grows daily, reducing the size used for these integer type IDs will help keep the table in check as size can be a problem.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
CREATE TABLE [dbo].[FactInternetSales]( [ProductKey] [SMALLINT] NOT NULL, [OrderDateKey] [INT] NOT NULL, [CustomerKey] [SMALLINT] NOT NULL, [SalesOrderNumber] [VARCHAR](20) NOT NULL, --[SalesOrderNumber] [NVARCHAR](20) NOT NULL, [SalesOrderLineNumber] [TINYINT] NOT NULL, [OrderQuantity] [SMALLINT] NOT NULL, [UnitPrice] [MONEY] NOT NULL, [ExtendedAmount] [MONEY] NOT NULL, [DiscountAmount] [MONEY] NOT NULL, [SalesAmount] [MONEY] NOT NULL, [TaxAmt] [MONEY] NOT NULL, [Freight] [MONEY] NOT NULL, [TotalProductCost] [MONEY] NOT NULL, CONSTRAINT [PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber] PRIMARY KEY CLUSTERED ( [SalesOrderNumber] ASC, [SalesOrderLineNumber] ASC) |
Figure 4 show the creation statement for the Internet Sales fact table. The SalesOrderNumber was originally a NVARCHAR, but it is known to only contain numeric text data with maybe some leading zeroes. The case for using NVARCHAR over VARCHAR usually depends on international sales. The system might need to use multiple accounting systems to be combined into one data warehouse. If this is the case, use NVARCHAR.
The numerical fields are using MONEY data, but SMALLMONEY could be used for some of them. Since they are dollar values, there is no need to use decimal or numeric were the number of decimal places need to be specified.
Once the fact table is created, foreign keys are created to relate to the dimension tables. Figure 5 shows the foreign create statements for the Internet Sales fact table.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
ALTER TABLE [dbo].[FactInternetSales] WITH CHECK ADD CONSTRAINT [FK_FactInternetSales_DimCustomer] FOREIGN KEY([CustomerKey]) REFERENCES [dbo].[DimCustomer] ([CustomerKey]) GO ALTER TABLE [dbo].[FactInternetSales] CHECK CONSTRAINT [FK_FactInternetSales_DimCustomer] GO ALTER TABLE [dbo].[FactInternetSales] WITH CHECK ADD CONSTRAINT [FK_FactInternetSales_DimDate] FOREIGN KEY([OrderDateKey]) REFERENCES [dbo].[DimDate] ([DateSKey]) GO ALTER TABLE [dbo].[FactInternetSales] CHECK CONSTRAINT [FK_FactInternetSales_DimDate] GO ALTER TABLE [dbo].[FactInternetSales] WITH CHECK ADD CONSTRAINT [FK_FactInternetSales_DimProduct] FOREIGN KEY([ProductKey]) REFERENCES [dbo].[DimProduct] ([ProductSKey]) GO ALTER TABLE [dbo].[FactInternetSales] CHECK CONSTRAINT [FK_FactInternetSales_DimProduct] GO |
Some might say that the ETL should enforce foreign key constraints. If that is the case, then there will be no problem adding the foreign keys to the database.
There are cases of early arriving facts. This indicates that the fact arrives before the dimension value when merging data from separate systems. The ETL for the fact table population will need to have intelligence to add the missing dimension first, then use the ID for the fact row, and later update the new temporary dimension row with the late arriving dimension data.
The scaling down of the first data mart will make creating a new model must easier to get a start on a new data warehouse project. The thought to include more floods the mind. Try to put those ideas in a reminder for the second interaction of the project. Remember to check the data types and not be afraid with a more challenging path.
Note
The Relationship between Product, Subcategory and Category can be folded into the Product table in this example. That would mean the Subcategory Name and Category Name would become columns in the Product table. In his case, that is ok and some would say it removes the snowflake in the schema so the model becomes a star schema.
But, in other cases, like Sales Territory, there is a different table between the territories and the fact table. The Reseller Sales needs to go through Reseller dimension while the Internet Sales needs to go through the Customer dimension. Both use the Geography dimension to get the country with links to the Sales Territory dimension.
Next articles in this series:
- Connecting to Data Source(s) with SQL Server Analysis Services
- Initial Attributes and Measures in SSAS Multidimensional Cubes
- Analysis Services (SSAS) Cubes – Dimension Attributes and Hierarchies
Useful links
- Performance tuning – Nested and Merge SQL Loop with Execution Plans - April 2, 2018
- Time Intelligence in Analysis Services (SSAS) Tabular Models - March 20, 2018
- How to create Intermediate Measures in Analysis Services (SSAS) - February 19, 2018