This article will cover the process of analyzing and predicting/forecasting the size of a SQL backup as a means to better handle/manage retention of backups.
One of the main database management tenets is “Do not lose your data”. According to this; a database administrator incurs huge responsibilities to protect data. Under these circumstances, taking database backups and archiving SQL backup files is a key task for database administrators. In data protection strategy, taking database backups and archiving backup file processes play the leading role. Especially, backup planning is very significant for disaster recovery scenarios because backup file will be used for restore operation after any failure or data corruption. For this reason, every dba must generate recovery strategies for possible disaster scenarios and ensure that these scenarios can be solvable. At the same time, these backup files must be tested for data integrity; thus process provides to evaluate the recovery time and integrity of backup files. In this Backup and Restore (or Recovery) strategies for SQL Server database article you can find all details about backup and restore strategies.
After this brief introduction about backup and restore operation significance, we need to highlight one more term about backup files life cycle which is backup retention.
What is SQL backup retention?
According to the organization requirement, backup files must be retained and ready for use at the end of the expired date. This data protection task is named backup/data retention. The amount of time which elapsed to protect to data is backup retention time. This time period and number of backup files can be changed by organization retention policies. There are several factors which affect the retention periods but the main factor which affects this period is organization legal requirements. If your SQL Server databases are host on-premise you have to make disk and media capacity planning for backup retention times. But if your SQL Server runs in Azure, you can take advantage of long-term backup retention feature. This feature allows us to retain Azure SQL database backups for more than 10 years. Of course, this feature will be added as a cost to your Azure bills.
How to estimate the next backup size?
When we want to estimate the next backup size we can use the following stored procedure which gives some information about the backup size. The reserved column value approximately gives information about size of uncompressed backup size.
1 |
EXEC sp_spaceused @updateusage = 'true' |
Reserved column defines the database size which consists of data and log files. Unused column defines a specific part of reserved data but it is not yet used by database. According to these details, we can calculate the backup size with the following formula;
Backup Size (MB) = ((Reserved (KB) – Unused (KB))/1024)/1024
For our scenario, the estimation of backup size is approximately equal to ((543328 (KB)) – (21512 (KB))/1024)/1024 = 509 (MB).This calculation method is very useful for calculating next backup size but the disadvantage of this calculation method is that it does not give any idea about the backup file growing trend or acceleration for a particular term.
Forecast database backup size
If your concern is about storage or disk requirement for long terms (week, month, quarter, etc.) we need more sophisticated calculation methods. We can handle this issue with some statistics calculation methodologies. Simple linear regression method can be first potential candidate because when we look at the definition of simple linear regression; Simple linear regression is a statistical method that specifies the relationship between two quantitative data’s. The first variable that is shown on the x-axis; is the argument of the descriptive, and the second variable shown on the y-axis is the predicted output of the dependent variable. y = a*x +b; this equation that specifies a simple linear regression in which y is dependent variable, x is an independent variable and b is a constant. This statistical method will be very useful to find out the correlation between time period and backup size also with the linear regression formula we can forecast the next period’s backup sizes. The backup information is stored in a special table which is backup_history_table and this table is placed in msdb database. When we execute the following query, it gives average backup size of the month.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
SELECT DATEPART(MONTH,backup_finish_date) AS [BackupMonth] , (AVG(msdb.dbo.backupset.backup_size)/1048576) as [BackupSize (MB)] , DATEPART(YEAR,backup_finish_date) AS [BackupYear], msdb.dbo.backupset.database_name FROM msdb.dbo.backupmediafamily INNER JOIN msdb.dbo.backupset ON msdb.dbo.backupmediafamily.media_set_id = msdb.dbo.backupset.media_set_id WHERE msdb..backupset.type='D' AND database_name='Adventureworks' and DATEPART(YEAR,backup_finish_date)=DATEPART(YEAR,GETDATE()) GROUP BY msdb.dbo.backupset.database_name , DATEPART(MONTH,backup_finish_date) , DATEPART(YEAR,backup_finish_date) order by DATEPART(MONTH,backup_finish_date) Asc |
With the help of this data set, we will apply simple linear regression formula and then we can forecast the following (11 and 12 ) month’s backup sizes. First of all, we need to find out if there is any association between backup month and backup size. The easiest solution to this problem is to calculate the correlation coefficient. A correlation coefficient is a value used to represent the relation between two variables. It gets a value range from -1 through 1. Negative values indicate a negative relationship and positive values are positive relationships. When the correlation coefficient value is close to 1 or-1, there is a strong relationship between two variable. The purpose of following calculation is to find correlation coefficient value between month period and backup size. In this calculation, we will take advantage of R (wiki: R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing) support of SQL Server.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
execute sp_execute_external_script @language = N'R', @script = N' mybackupdata <- SQLIn; SQLOut <- data.frame(cor(mybackupdata)) ', @input_data_1 = N'SELECT DATEPART(MONTH,backup_finish_date) AS [X] , ROUND((AVG(msdb.dbo.backupset.backup_size)/1048576),0) as [Y] FROM msdb.dbo.backupmediafamily INNER JOIN msdb.dbo.backupset ON msdb.dbo.backupmediafamily.media_set_id = msdb.dbo.backupset.media_set_id WHERE msdb..backupset.type=''D'' AND database_name=''Adventureworks'' AND DATEPART(YEAR,backup_finish_date)=DATEPART(YEAR,GETDATE()) GROUP BY msdb.dbo.backupset.database_name , DATEPART(MONTH,backup_finish_date) , DATEPART(YEAR,backup_finish_date) ', @input_data_1_name = N'SQLIn', @output_data_1_name = N'SQLOut' with result sets ((XCof Int, Ycof Int)); |
Correlation coefficient value specifies that there is a strong and positive correlation between month period and backup size because 0.98 is very close to 1. After this exploration, we can go through to calculate simple linear regression. The below chart illustrates the main idea of simple linear regression namely the red dotted line shows the calculated values of linear regression and blue line shows the history values of backup sizes. Red dotted line values are calculated based on this formula “y=96,594*x + 5355,5 y = Backup Size , x= month period “
When we apply the formula for 11 and 12 months the result will be like the bellow table.
Month | Estimated Backup Size Formula | Estimated Backup Size |
11 | =96,594*11 + 5355,5 | 6418,034 |
12 | =96,594*12 + 5355,5 | 6514,628 |
After all these descriptions about linear regression, we will create a scalar-valued function which calculates the simple linear regression formula in SQL Server and returns the result of linear regression equation value.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
Create FUNCTION CalculateEstimatedBackupSize (@Month INT,@DbName VARCHAR(100)) RETURNS FLOAT AS begin declare @RowCount as float declare @xvariable as float declare @bconstant as float declare @sumx as float declare @sumy as float declare @sumxx as float declare @sumyy as float declare @sumxy as float declare @result as float ; WITH BackupLinearReg AS ( SELECT DATEPART(MONTH,backup_finish_date) AS [X] , ROUND((AVG(msdb.dbo.backupset.backup_size)/1048576),0) as [Y] FROM msdb.dbo.backupmediafamily INNER JOIN msdb.dbo.backupset ON msdb.dbo.backupmediafamily.media_set_id = msdb.dbo.backupset.media_set_id WHERE msdb..backupset.type='D' AND database_name=@DbName AND DATEPART(YEAR,backup_finish_date)=DATEPART(YEAR,GETDATE()) GROUP BY msdb.dbo.backupset.database_name , DATEPART(MONTH,backup_finish_date) , DATEPART(YEAR,backup_finish_date) ) SELECT @RowCount=COUNT(*), @sumx = SUM(x), @sumy = SUM(y), @sumxx = SUM(x*x), @sumyy = SUM(y*y), @sumxy = SUM(x*y) FROM BackupLinearReg SET @xvariable = IIF(@RowCount=1,0,@RowCount * @sumXY - @sumX * @sumY) / (@RowCount * @sumXX - POWER(@sumX,2)) SET @bconstant = (@sumY - (@xvariable*@sumX)) / @RowCount set @result= (@Xvariable*@Month + @bconstant) RETURN @result END |
After this backup function creation, we will execute the function for 11 and 12 months. Additionally, if you are using compressed backup you can change backup_size column to compressed_backup_size column in this function.
1 2 3 |
select dbo.CalculateEstimatedBackupSize(11,'AdventureWorks') as [EstimatedBackupSize_11] select dbo.CalculateEstimatedBackupSize(12,'AdventureWorks') as [EstimatedBackupSize_12] |
As you can see the above image, the result is the same with formula calculation and it proves that the calculation methodology is true for scalar-valued function.
Conclusions
In this article, we reviewed some tips and advice about SQL backup and restore strategies and then we discussed how to estimate the backup size of database. In particular, we discussed linear regression methodologies and how this approach is used to forecasts the SQL backup sizes. The advantage of this method is common usage and simplicity. With linear regression method, we created a scalar-valued function which helps to forecast following month’s backup sizes and also we ensured that there was a strong correlation between months and backup size. In addition, you can develop and transform the sampled scalar valued function according to your needs.
- SQL Performance Tuning tips for newbies - April 15, 2024
- SQL Unit Testing reference guide for beginners - August 11, 2023
- SQL Cheat Sheet for Newbies - February 21, 2023