Introduction
The Python programming language is object oriented, easy to use and, widely used programming language. We can write simple Python SQL Scripts and it performs the task that is not easy to do with t-SQL.
In my previous article, Using Python SQL scripts for Importing Data from Compressed files, we explored that the combination of Python SQL Scripts and SQL Server can be an exciting combination like Peanut Butter and Chocolate ?
We might need to represent data on a different frequency and need to write t-SQL code to get data at various samples. Suppose we have data on yearly frequency. We need to represent data in monthly distribution. It is not an easy task to do with a t-SQL programming language. We can use Python SQL Scripts and use different modules to do frequency conversion. In this article, we will understand this in a combination of disk capacity data in SQL Server.
Disk space monitoring and proactively planning for space requirements are the core tasks of DBA. It is one of the essential elements of capacity planning. We must monitor the database growth over time to get requirements of disk space. Usually, we configure alerts to notify DBA in case of free disk space is less than a specified threshold. It is a good approach to maintain free space threshold (around 20% free space), but we should try to be more proactive and give the requirements to storage administrators in advance. We do not want to come in a situation where you want to increase space on a particular disk, but data store does not have sufficient free space to give it to you. We might need to do other stuff such as moving drive around data store that might require downtime.
We might need to consider many parameters for planning disk space requirements such as free space inside the data and log file, auto growth, disk space growth. In this article, we will focus on disk space growth.
We can gather disk space data using the DMV sys.dm_os_volume_stats with Join on sys.master_files. It captures the disk space for all disks consisting of database files. Usually, we do not place database files in the system drive. System administrators should maintain these drives.
1 2 3 4 5 6 7 8 9 |
SELECT DISTINCT volume_mount_point [Disk Mount Point], file_system_type [File System Type], logical_volume_name AS [Logical Drive Name], CONVERT(DECIMAL(18, 2), total_bytes / 1073741824.0) AS [Total Size in GB], CONVERT(DECIMAL(18, 2), available_bytes / 1073741824.0) AS [Available Size in GB], CAST(CAST(available_bytes AS FLOAT) / CAST(total_bytes AS FLOAT) AS DECIMAL(18, 2)) * 100 AS [Space Free %] FROM sys.master_files CROSS APPLY sys.dm_os_volume_stats(database_id, file_id); |
We get the output in the following format.
We should capture this data in a SQL table regularly. Suppose we do not have disk space data available with us. We only have few values on which basis we want to resample this data on a daily or monthly basis. We can write SQL queries to do this task for us. It might be complicated tasks for you if you are not much familiar with t-SQL. Let’s understand the concept of data resampling in the upcoming section.
Data resampling overview
We can increase or decrease the frequency of data based on the time series and generate values for it with different methods. Suppose we have a monthly distribution of data for stock prices in share market. We want to generate samples at a weekly or daily basis. Similarly, if we have weekly data, we might wish to data resampling on a monthly or quarterly basis.
We can resample data in two ways
- Upsampling: We increase the date-time frequency in Upsampling. Example of Upsampling is as follows.
- Convert Yearly to Quarterly data
- Convert Quarterly to Monthly data
- Convert Monthly data into Weekly data
- Convert Weekly to Daily Convert
- Convert Daily to Hourly data
- Downsampling: It is a reverse process to Upsampling. In this, we decrease the data-time frequency of data.
- Convert Hourly to Daily data
- Convert Daily to Weekly data
- Convert Weekly data to Monthly data
- Convert Monthly to Quarterly Convert
- Convert Quarterly to Yearly data
In the following sections, let see how to use Python Scripts to do data resampling.
Using Python SQL scripts for data Resampling
We can use SQL Server 2017 or above to use Python with SQL Server. We do not install all features in SQL Server while doing the installation. To use Python Scripts, we can use following Prerequisites. You can refer articles in TOC at the bottom to get detailed information on this.
Prerequisites
- SQL Server 2017 or later version
-
Install the Machine Learning Services (Python) to run the Python SQL scripts
-
Enable external scripts enabled using sp_configure command.
-
Restart SQL Server Service and SQL Server Launchpad service. Both of the services should be in running state to execute Python Scripts.
- Test Python SQL Script – We can use simple Python SQL Script to test if it is working fine or not. It would avoid later environmental issues.
1 2 3 4 5 6 7 8 9 |
Execute sp_execute_external_script @language = N'Python', @script = N' a = 9 b = 3 c = a/b d = a*b print(c, d) ' |
Once we have prepared the environment, let’s create a sample data and insert data into it.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
CREATE TABLE dbo.DiskSpace ( CaptureDate DATETIME NOT NULL, SpaceinGB int NOT NULL ) GO INSERT INTO dbo.DiskSpace VALUES ('20180401', 355 ), ('20190428', 255 ); |
We have the following data in the table. In this data, the first row has value for 1st April 2018 while the second row shows 28th April 2019 value.
Python SQL Script for Upsampling data
Suppose we want to resample this data daily. Execute the following Python Script to get Upsampling data weekly.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
EXEC sp_execute_external_script @language = N'Python' ,@script = N' import pandas as pd from datetime import datetime df["CaptureDate"] = pd.to_datetime(df["CaptureDate"]) df = df.set_index("CaptureDate", drop = True) df = df.resample("w").interpolate("linear") df = df.reset_index() df["CaptureDate"] = df["CaptureDate"].astype(str) OutputDataSet = df ' ,@input_data_1 = N'SELECT [CaptureDate] ,[SpaceinGB] FROM [DemoSQL].[dbo].[DiskSpace]' ,@input_data_1_name = N'df' WITH RESULT SETS(("CaptureDate" DATETIME, "DiskSpace" DECIMAL(18,2))) GO |
We get the following output of Python SQL Script. In this output, data is equally distributed into a weekly basis based on the actual values we have in the table.
Similar we can run the following Python SQL Script to get data daily.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
EXEC sp_execute_external_script @language = N'Python' ,@script = N' import pandas as pd from datetime import datetime df["CaptureDate"] = pd.to_datetime(df["CaptureDate"]) df = df.set_index("CaptureDate", drop = True) df = df.resample("d").interpolate("linear") df = df.reset_index() df["CaptureDate"] = df["CaptureDate"].astype(str) OutputDataSet = df ' ,@input_data_1 = N'SELECT [CaptureDate] ,[SpaceinGB] FROM [DemoSQL].[dbo].[DiskSpace]' ,@input_data_1_name = N'df' WITH RESULT SETS(("CaptureDate" DATETIME, "DiskSpace" DECIMAL(18,2))) GO |
Let’s truncate the table and insert records on a yearly basis.
1 2 3 4 5 6 7 8 9 10 11 |
INSERT INTO [DemoSQL].[dbo].[DiskSpace] VALUES ('20170131', 689 ), ('20180131', 412 ), ('20190131', 356 ); |
We want to resample data into monthly basis, execute the following Python SQL Script.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
EXEC sp_execute_external_script @language = N'Python' ,@script = N' import pandas as pd from datetime import datetime df["CaptureDate"] = pd.to_datetime(df["CaptureDate"]) df = df.set_index("CaptureDate", drop = True) df = df.resample("m").interpolate("linear") df = df.reset_index() df["CaptureDate"] = df["CaptureDate"].astype(str) OutputDataSet = df ' ,@input_data_1 = N'SELECT [CaptureDate] ,[SpaceinGB] FROM [DemoSQL].[dbo].[DiskSpace]' ,@input_data_1_name = N'df' WITH RESULT SETS(("CaptureDate" DATETIME, "DiskSpace" DECIMAL(18,2))) GO |
We can see monthly distribution of data with Python Script.
Explore a Python SQL Script
Let’s understand a Python script in detail.
Part 1: Import Python Module: in SQL Server, we can execute Python SQL Script with stored procedure sp_execute_external_script and specify Python language as a parameter.
We can use Pandas module in Python Script to resample data. We can use resample() function in Pandas module. We also need to convert data type date to datetime by the Python module datatime. We can use the resampling module on DateTime column only.
1 2 3 4 5 6 7 |
EXEC sp_execute_external_script @language = N'Python' ,@script = N' import pandas as pd from datetime import datetime |
Part 2: Resample the frequency of data :
In this part, we specify the index column of the pandas data frame. Index column is the column to use in data sampling.
In the first line, we convert data type date to datetime using function to_datetime. Once data type conversion finishes, we can drop existing column and set index on new column with datetime data type.
1 2 3 |
df["CaptureDate"] = pd.to_datetime(df["CaptureDate"]) df = df.set_index("CaptureDate", drop = True) |
In the next line of code, we want to resample data using monthly frequency and linear interpolation. Linear interpolation equally distributes the values among time range.
1 2 3 |
df = df.resample("m").interpolate("linear") df = df.reset_index() |
We can have the following values in the resample function.
- W: Weekly frequency
- M: month end frequency
- Y: Yearly frequency
- D: Daily frequency
- Q: quarter end frequency
Part 3: Specify input and output in a Python SQL Query
In this part, we will specify the input (SQL query to get records from the existing table) and output to display data with an appropriate data type.
Previously we defined disk space column as an integer. In the output, we can choose an appropriate data type. In this example, we defined the data type as decimal up to 2 points.
1 2 3 4 5 6 7 8 |
OutputDataSet = df ' ,@input_data_1 = N'SELECT [CaptureDate] ,[SpaceinGB] FROM [DemoSQL].[dbo].[DiskSpace]' ,@input_data_1_name = N'df' WITH RESULT SETS(("CaptureDate" DATETIME, "DiskSpace" DECIMAL(18,2))) GO |
Python SQL Script to Downscaling data
Suppose we have data in sample table daily. We want to resample data from daily to monthly using SQL Python Scripts.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
EXEC sp_execute_external_script @language = N'Python' ,@script = N' import pandas as pd from datetime import datetime df["CaptureDate"] = pd.to_datetime(df["CaptureDate"]) df = df.set_index("CaptureDate", drop = True) df = df.resample("M").mean() df = df.reset_index() df["CaptureDate"] = df["CaptureDate"].astype(str) OutputDataSet = df ' ,@input_data_1 = N'SELECT [CaptureDate] ,[DiskSpace] FROM [DemoSQL].[dbo].[diskspace_2]' ,@input_data_1_name = N'df' WITH RESULT SETS(("CaptureDate" DATETIME, "DiskSpace" DECIMAL(18,2))) GO |
It performs the aggregation on the daily data and gives us month data.
Similarly, we can run the code to get aggregated data quarterly.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
EXEC sp_execute_external_script @language = N'Python' ,@script = N' import pandas as pd from datetime import datetime df["CaptureDate"] = pd.to_datetime(df["CaptureDate"]) df = df.set_index("CaptureDate", drop = True) df = df.resample("Q").mean() df = df.reset_index() #df["CaptureDate"] = df["CaptureDate"].astype(str) OutputDataSet = df ' ,@input_data_1 = N'SELECT [CaptureDate] ,[DiskSpace] FROM [DemoSQL].[dbo].[diskspace_2]' ,@input_data_1_name = N'df' WITH RESULT SETS(("CaptureDate" DATETIME, "DiskSpace" DECIMAL(18,2))) GO |
If we do not have supported data types, we get the following error message.
Supported types: bit, tinyint, smallint, int, bigint, uniqueidentifier, real, float, char, varchar, nchar, nvarchar, varbinary, date, DateTime, smalldatetime.
Conclusion
Python is simple and easy to understand programming language, and we can perform many important tasks using it. In this article, we explored the use case of Python SQL Scripts to resample data on a different frequency. I will continue exploring mode Python SQL Scripts and share my knowledge in further articles. I hope you like this article.
- Understanding PostgreSQL SUBSTRING function - September 21, 2024
- How to install PostgreSQL on Ubuntu - July 13, 2023
- How to use the CROSSTAB function in PostgreSQL - February 17, 2023