In this article, I am going to demonstrate about implementing the Modular ETL in SSIS practically. In my previous article on Designing a Modular ETL Architecture, I have explained in theory what a modular ETL solution is and how to design one. We have also understood the concepts behind a modular ETL solution and the benefits of it in the world of data warehousing. We have also related the concept of microservices architecture in software development to that of the modular ETL solution.
In this tutorial, I am going to design the entire ETL package and will provide step-by-step guidelines on how to implement the same. Please note that the primary focus of this article is to implement the Modular ETL in SSIS and not to tutor creating the SSIS packages. It is assumed that you have a sound knowledge of developing SSIS packages and the control flow and data flow components within SSIS. For the sake of clear understanding, I have created the following three databases.
Database Name |
Description |
ApplicationDB |
This database is considered as the OLTP database. |
DataWarehouse |
This can be considered as the warehouse database. |
ControlDB |
This database is used by the master package to log the executions. |
You can use the following script to create the databases and the relevant data.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 |
USE master GO CREATE DATABASE ApplicationDB GO CREATE DATABASE DataWarehouse GO CREATE DATABASE ControlDB GO USE ApplicationDB GO CREATE TABLE Orders( OrderID INT IDENTITY(1,1), CustomerFirstName VARCHAR(100), CustomerLastName VARCHAR(100), ProductLine VARCHAR(100), ProductName VARCHAR(100), Quantity INT, UnitPrice DECIMAL(9,2), TotalPrice DECIMAL(9,2) ) GO CREATE TABLE Sales( SalesID INT IDENTITY(1,1), OrderID INT, SalesRegion VARCHAR(100), ProductName VARCHAR(100), TotalSalesAmount DECIMAL(9,2) ) GO INSERT INTO Orders (CustomerFirstName,CustomerLastName,ProductName,ProductLine,Quantity,UnitPrice,TotalPrice) VALUES ('Harry','Potter','Magic Wand','Accessories',10,20,200) INSERT INTO Orders (CustomerFirstName,CustomerLastName,ProductName,ProductLine,Quantity,UnitPrice,TotalPrice) VALUES ('Ron','Weasley','Flying Car','Others',1,400,400) INSERT INTO Orders (CustomerFirstName,CustomerLastName,ProductName,ProductLine,Quantity,UnitPrice,TotalPrice) VALUES ('Hermoine','Granger','Potions Book','Study',2,50,100) INSERT INTO Orders (CustomerFirstName,CustomerLastName,ProductName,ProductLine,Quantity,UnitPrice,TotalPrice) VALUES ('Draco','Malfoy','Magic Wand','Accessories',3,20,60) GO INSERT INTO Sales (OrderID,SalesRegion,ProductName,TotalSalesAmount) VALUES (1,'London','Magic Wand',200) INSERT INTO Sales (OrderID,SalesRegion,ProductName,TotalSalesAmount) VALUES (2,'London','Flying Car',400) INSERT INTO Sales (OrderID,SalesRegion,ProductName,TotalSalesAmount) VALUES (3,'London','Potions Book',100) INSERT INTO Sales (OrderID,SalesRegion,ProductName,TotalSalesAmount) VALUES (4,'London','Magic Wand',60) GO USE DataWarehouse GO CREATE SCHEMA stage GO CREATE TABLE stage.Orders( OrderID INT, CustomerFirstName VARCHAR(100), CustomerLastName VARCHAR(100), ProductLine VARCHAR(100), ProductName VARCHAR(100), Quantity INT, UnitPrice DECIMAL(9,2), TotalPrice DECIMAL(9,2) ) GO CREATE TABLE stage.Sales( SalesID INT, OrderID INT, SalesRegion VARCHAR(100), ProductName VARCHAR(100), TotalSalesAmount DECIMAL(9,2) ) GO CREATE TABLE DimCustomer( CustomerID INT IDENTITY(1,1), CustomerFirstName VARCHAR(100), CustomerLastName VARCHAR(100), CustomerFullName VARCHAR(255) ) GO CREATE TABLE DimProduct( ProductID INT IDENTITY(1,1), ProductLine VARCHAR(100), ProductName VARCHAR(100), UnitPrice DECIMAL(9,0) ) GO CREATE TABLE DimRegion( RegionID INT IDENTITY(1,1), RegionName VARCHAR(100) ) GO CREATE TABLE FactOrders( FactOrderID INT IDENTITY(1,1), CustomerID INT, ProductID INT, UnitPrice DECIMAL(9,0), Quantity INT, TotalPrice DECIMAL(9,0) ) GO CREATE TABLE FactSales( FactSalesID INT IDENTITY(1,1), OrderID INT, RegionID INT, ProductID INT, SalesAmount DECIMAL(9,0) ) GO CREATE PROCEDURE usp_Populate_Orders AS INSERT INTO DimCustomer (CustomerFirstName,CustomerLastName,CustomerFullName) SELECT DISTINCT CustomerFirstName, CustomerLastName, CustomerFirstName + ' ' + CustomerLastName AS CustomerFullName FROM stage.Orders INSERT INTO DimProduct (ProductLine,ProductName,UnitPrice) SELECT DISTINCT ProductLine, ProductName, UnitPrice FROM stage.Orders INSERT INTO FactOrders(CustomerID,ProductID,UnitPrice,Quantity,TotalPrice) SELECT dc.CustomerID, dp.ProductID, dp.UnitPrice, ord.Quantity, ord.TotalPrice FROM stage.Orders ord LEFT OUTER JOIN DimCustomer dc ON dc.CustomerFirstName = ord.CustomerFirstName LEFT OUTER JOIN DimProduct dp ON dp.ProductName = ord.ProductName GO CREATE PROCEDURE usp_Populate_Sales AS INSERT INTO DimRegion (RegionName) SELECT DISTINCT SalesRegion FROM stage.Sales INSERT INTO FactSales(OrderID,RegionID,ProductID,SalesAmount) SELECT sales.OrderID, dr.RegionID, dp.ProductID, sales.TotalSalesAmount FROM stage.Sales sales LEFT OUTER JOIN DimRegion dr ON dr.RegionName = sales.SalesRegion LEFT OUTER JOIN DimProduct dp ON dp.ProductName = sales.ProductName GO USE ControlDB GO CREATE TABLE ExecutionLog( ExecutionID INT IDENTITY(1,1), PackageName VARCHAR(100), ProcessName VARCHAR(100), CreatedDateTime DATETIME, ) GO |
Now that the databases are ready, let’s go ahead and start creating the packages in SSIS. We will create three packages in SSIS as follows.
Package Name |
Description |
Orders.dtsx |
Extracts and loads the data for the orders module. |
Sales.dtsx |
Extracts and loads the data for the sales module. |
Master.dtsx |
Controls the execution flow between the various components of the modules. |
Creating the connection managers in SSIS
As we have created the above three packages in SSIS, let us now create the connection managers that we will be using in order to connect and communicate with the database.
Figure 1 – Connection Managers
Building the Orders and Sales packages
Now that we have our connection managers ready, we can start creating the SSIS packages for Orders and Sales. For the sake of simplicity, I am not going to use any transformations. It will be just extracted and load for both the packages.
Figure 2 – Orders Package
As you can see in the above package, we have two Sequence Control – “Extract” and “Load” which defines the extract and the load components within the package. Let us now try to dive deep and understand the contents of each of the components.
Extract
Data Flow Task – This is used to connect to the source database (ApplicationDB) and load data into the stage tables of the DataWarehouse.
Update ControlDB – This is an audit component that logs the execution to the ExecutionLogs table in the ControlDB.
1 2 |
INSERT INTO ExecutionLog(PackageName,ProcessName,CreatedDateTime) VALUES('Orders','Extract',GETDATE()) |
Load
Execute SQL Task – Runs the stored procedure “usp_Populate_Orders” in the DataWarehouse.
Update ControlDB – This is an audit component that logs the execution to the ExecutionLogs table in the ControlDB.
1 2 |
INSERT INTO ExecutionLog(PackageName,ProcessName,CreatedDateTime) VALUES('Orders','Load',GETDATE()) |
Once all the components are added, add the control flow as shown in the above figure.
Adding parameters in the packages
One of the important features that we will be using to implement the modular ETL in SSIS is to be leveraging the parameters in SSIS. These parameters can be used to control the flow and execution of the individual modules from the master package also.
Figure 3 – Parameters – ETL in SSIS
As you can see in the above figure, we have added two parameters, namely p_Enable_Orders_Extract and p_Enable_Orders_Load. These are Boolean parameters, the default value of which is set to True.
Adding Expressions to the Sequence Control
So far, we have created the parameter in SSIS, however did not use them in the package. In this step, we will use the value of these parameters to enable or disable the Sequence Control components in the package. The reason behind doing this is that we can externally control which component to execute and which one to exclude from the execution pipeline.
Click the Extract component and select Expressions from Properties.
Figure 4 – Selecting Expressions
In the Expressions window, select Disable as the property from the dropdown and in the Expression, add the following.
@[$Package::p_Enable_Orders_Extract] ?False:True
Figure 5 – Adding Extract property to Disable
Basically, what we are trying to achieve in the above is to set the Disabled property on or off based on the parameter value. If the parameter value is set to True, which means the component is enabled, then the Disabled property will be set to False. Repeat the same for the load component, just by altering the parameter name.
Figure 6 – Adding Load property to Disable
Repeat the above steps for the Sales module and create the parameters. Also, add the expression controls to the individual extract and load components. An important point to note will be the names of the parameters should represent the name of the package and the name of the component. Otherwise, when you will work with multiple packages and components, it will be difficult to keep a track of all the parameters from various packages.
Figure 7 – Parameters for the Sales Package
Creating the Master Package
At this moment, you should have created both the Orders and Sales SSIS packages and now we are good to go ahead with the development of the Master package. You can consider this as a parent package that will execute the Orders and Sales package from within it. In this case, both the Orders and the Sales packages can be considered as child packages.
Figure 8 – Master Package
If you look at the package above, there are four Execute Package Tasks contained within two Sequence Control components, one each for extract and load. Connect the individual components as shown in the figure and add the package names to be executed.
Within the master package, we are going to create four variables, which will be used to bind to the parameters from the child packages. By default, we are going to set the value for all the variables to False.
Figure 9 – Variables in the master package
Now, when binding the parameters, keep in mind which component should be executed and which one to be disabled. For example, if we want to execute the Extract component of the Orders module, this is how we should bind the parameters.
- p_Enable_Orders_Extract = True
- p_Enable_Orders_Load = False
If you recollect, in the previous steps, we have already set the default value for p_Enable_Orders_Extract as True, so now, we will only set the value of p_Enable_Orders_Load to be False.
Figure 10 – Binding the variables
Repeat the same for the other Execute Package Tasks as well. Please remember to exactly specify which component should be disabled or enabled while executing the packages.
Executing the master package
Now that our development of all three packages has been completed, we can now go ahead and start the execution of the master package. Once you start the execution, you will see that the individual child packages are loaded and executed by the master package.
Figure 11 – Executing the master package
Also, for your understanding, you can disable and play with the values of the variables in the master package to learn more about the execution sequence. Alternatively, you can also monitor the execution log of the packages by querying the ExecutionLogs table in the ControlDB.
Figure 12 – Viewing the ExecutionLogs
Conclusion
In this article, we have implemented the concept of Modular ETL in SSIS. Designing a modular ETL in SSIS is a vast topic and cannot be covered in one article alone. However, I have tried my best to explain the same with simple examples. I have provided the sample solution for a better understanding. You can now get started with this concept and start modifying the above-explained solution as per your requirements. The main idea is to separate the various components of the individual modules such that all those modules can also be executed independently of one another after deployment. This solution explained here is also scalable and you can execute multiple packages in parallel.
- Getting started with PostgreSQL on Docker - August 12, 2022
- Getting started with Spatial Data in PostgreSQL - January 13, 2022
- An overview of Power BI Incremental Refresh - December 6, 2021