Background
One of the most common complaints developers raise to DBAs involves alerting, whether from third-party tools or from alerts built by other developers or DBAs in the environment. Building or using alerts for important applications, data layers, or processes within a SQL Server environment offers everyone benefits, but alerts can become noisy if they're architected poorly or their purpose isn't considered. In this article, we look at considerations for building effective alerts that tell us when something is wrong without creating situations where we learn to disregard them. We want to respond when we need to, not live on high alert when there is no issue.
Discussion
Imagine SQL Server writing to the error log every five seconds, "The database, DatabaseName, is online and good to go!" In one minute, we would have twelve messages telling us that our database was fine. The very nature of such an alert adds overhead, such as the space to store it, along with questions as to why anything is being written to an error log when everything is running smoothly.
In the popular children's fable, The Boy Who Cried Wolf, we see a story that cautions us about too many alerts, especially when some of them are false alarms: they slowly train us to ignore them, so when something genuinely problematic occurs, we don't respond. For example, in an OLTP environment, we may run a test transaction that processes identically to our application, with a timeout that alerts us if it doesn't finish in time. If our timeout is much shorter than the application's, we may receive too many alerts indicating that the transaction is taking too long, possibly training us to ignore this alert.
Ideas
In the case of alerts, fewer is better, especially when we have limited time or resources to respond.
Alerts with only one next step should be eliminated entirely, with that next step automated. An example is a SQL Server Agent job whose only remedy on failure is "restart the job" (assuming the job is built so that this is always the correct response). There is no reason to send an alert that such a job failed; simply automate the next step, which is restarting the job. If a DBA or developer wants to keep a record of failures, that's one thing, but alerting when there is only one next step wastes time and resources (noise).
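SQL Server Agent can automate this "one next step" natively through step retry settings, so no human ever sees an alert for a failure that a retry resolves. A minimal sketch, assuming a job named NightlyETLLoad (a hypothetical name) whose first step should simply be retried:

```sql
-- Hypothetical job name; adjust to your environment.
-- Configure the step to retry itself three times, five minutes apart,
-- instead of alerting a human whose only action would be "restart the job."
EXEC msdb.dbo.sp_update_jobstep
    @job_name       = N'NightlyETLLoad',  -- assumed job name
    @step_id        = 1,
    @retry_attempts = 3,                  -- automate the one next step
    @retry_interval = 5;                  -- minutes between retries
```

With this in place, an alert or notification only makes sense after all retries are exhausted, which is a genuinely actionable event.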
There may be times when we want to log failures or errors, even when there is a single next step, in order to build a baseline. Tracking baselines differs from alerting, however, because we're statistically collecting data before we assert that an error exists or is abnormal. Also, logging an error or issue doesn't have to generate noise until we determine that a problem exists or is about to begin. To put this in question form: suppose I asked DBAs who use replication for the average number of replication errors they see each day, and they replied, "I don't know." If they can't answer that question, then two errors generated on one day could be little to no issue, or a major one; for instance, a query timeout error may be of little consequence, whereas "the row was not found at the Subscriber" indicates a real problem. Likewise, if DBCC CHECKDB fails, I want to know immediately.
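To answer the "how many errors per day is normal?" question for replication, one option is to baseline from the distribution database's own error history. A sketch, assuming the distribution database uses its default name:

```sql
-- Run against the distribution database (assumed default name).
-- Counts replication errors per day so we can establish what "normal"
-- looks like before deciding what should trigger an alert.
USE distribution;

SELECT CONVERT(date, [time]) AS ErrorDate,
       COUNT(*)              AS ErrorCount
FROM dbo.MSrepl_errors
GROUP BY CONVERT(date, [time])
ORDER BY ErrorDate;
```

Once a few weeks of this data exist, "two errors today" becomes a comparison against a known baseline rather than a guess.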
This leads to the point that some alerts may be better built around a threshold being crossed, such as a consecutive number of heartbeats failing rather than a single failed heartbeat (if the baseline shows that single failures happen). We can also expand this to cover a period of time, such as too many failed logins within a five-minute window instead of over a whole day, relative to what our baseline shows.
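The consecutive-failure idea can be sketched with a simple check over a logging table. Here, dbo.HeartbeatLog is a hypothetical table with one row per check (a CheckTime and a Succeeded bit); the alert fires only when the last N heartbeats all failed:

```sql
-- Hypothetical HeartbeatLog table (CheckTime datetime2, Succeeded bit).
-- Alert only when the last @Threshold consecutive heartbeats all failed,
-- rather than on any single failure.
DECLARE @Threshold int = 3;

IF NOT EXISTS (
    SELECT 1
    FROM (
        SELECT TOP (@Threshold) Succeeded
        FROM dbo.HeartbeatLog
        ORDER BY CheckTime DESC
    ) AS LastChecks
    WHERE Succeeded = 1   -- any recent success cancels the alert
)
BEGIN
    RAISERROR('Last %d heartbeats failed.', 16, 1, @Threshold);
END;
```

The same pattern extends to time windows: count failed logins where the event time is within the last five minutes and compare that count to the baseline threshold.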
Some Useful Tips and Questions
Consider alerts with steps to solve them initially, and automated solutions later. With software, we often release in versions, such as 1.0, 1.1, 1.2, etc., and later versions of our solution can automate the steps we first put in the email. Putting a summarized solution in the email also allows us to pass the task to junior or mid-level DBAs/developers, since the alert includes the steps for solving the problem.
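An alert email that carries its own runbook can be sent with Database Mail. A sketch, where the profile name, recipient, and the specific failure scenario are all assumptions for illustration:

```sql
-- Assumed Database Mail profile and recipient. The body carries the
-- summarized fix so a junior DBA can act on the alert directly.
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = N'AlertProfile',            -- assumed profile
    @recipients   = N'dba-team@example.com',
    @subject      = N'Log shipping restore failed on ServerB',
    @body         = N'Next steps: 1) Check the restore job history. '
                  + N'2) Verify the latest .trn file copied. '
                  + N'3) Re-run the restore job if the file is present.';
```

In a later "version" of the alert, steps 1 through 3 become the automated response, and the email only fires when they fail.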
Sometimes use noisy alerting to get a feel for an environment. When starting with a client, I will often use noisy alerting for the first month so that I learn the subtleties of the environment. This is my own preference and it helps me learn faster; after a month, I build the tools based on what I've learned. The exception here is the client: if they also want to be included on the alerts, I don't want to create too much noise for them. An example is a replication row count report where, at first, I see the count difference regardless of whether it's 0 and, later, only receive a report when it catches a value greater than 0 where it shouldn't.
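That row count report might look like the sketch below. The linked server, database, and table names are all hypothetical; during the noisy first month the result is reported unconditionally, and later the SELECT is wrapped in a check so it only reports a non-zero difference:

```sql
-- [SubscriberServer] is an assumed linked server; table names are examples.
-- Noisy-first version: report the difference regardless of its value.
DECLARE @PubCount bigint, @SubCount bigint;

SELECT @PubCount = COUNT_BIG(*) FROM dbo.Orders;
SELECT @SubCount = COUNT_BIG(*) FROM [SubscriberServer].SalesDb.dbo.Orders;

SELECT @PubCount             AS PublisherRows,
       @SubCount             AS SubscriberRows,
       @PubCount - @SubCount AS Difference;  -- later: only when <> 0
```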
Consider using behavioral reporting, if you're familiar with behavioral statistics. Environments, servers, and SQL Server instances all have behavioral attributes that can be learned, with reports generated when something falls outside of normal behavior. This is one of the most accurate and least noisy ways of reporting, though it requires understanding what threshold falls outside of normal. If unfamiliar with this approach, use some of the built-in SQL Server tools, like restarting a job after it fails a number of times and only then alerting, and build your own tools in a manner that makes retries possible, if applicable, and alerting easy. For example, one of the easiest ways to report failure is a logging table that all failures go to; you can then use that table to report on backup failures, job failures, checkpoint failures, DBCC failures, etc., and create one solution to report on the log, while using a similar approach to insert the different failures into the table.
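A simple form of behavioral reporting is flagging values that fall outside the mean plus or minus two standard deviations of a trailing window. A sketch, assuming a hypothetical dbo.DailyMetrics table (MetricName, MetricDate, MetricValue) that we populate daily:

```sql
-- Hypothetical DailyMetrics table (MetricName, MetricDate, MetricValue).
-- Flags today's value only when it falls outside mean +/- 2 standard
-- deviations of the trailing 30 days: a simple behavioral threshold.
;WITH Baseline AS (
    SELECT AVG(MetricValue)   AS MeanValue,
           STDEV(MetricValue) AS StdevValue
    FROM dbo.DailyMetrics
    WHERE MetricName = 'FailedLogins'
      AND MetricDate >= DATEADD(day, -30, CAST(GETDATE() AS date))
)
SELECT m.MetricDate, m.MetricValue
FROM dbo.DailyMetrics AS m
CROSS JOIN Baseline AS b
WHERE m.MetricName = 'FailedLogins'
  AND m.MetricDate = CAST(GETDATE() AS date)
  AND ABS(m.MetricValue - b.MeanValue) > 2 * b.StdevValue;
```

An empty result means today is within normal behavior and no alert is needed; a returned row is what feeds the alert.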
Consider a logging table with useful columns that allow for filtering, such as the part of the application (for alerting the right team) and a date and time (for retrieving the latest alerts). If there's only one solution to a problem, like a log file that grows too much and must be shrunk because the environment has no alternatives, then either build the solution immediately or create a proactive approach to avoid the problem, such as requiring that developers batch transactions in this example. It makes no sense to report a failure when there's only one solution to the problem, or when the problem could easily be prevented.
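A minimal sketch of such a table; every name here is a suggestion rather than a standard:

```sql
-- One central failure log; column names are suggestions. ApplicationPart
-- lets us route alerts to the right team, FailureTime supports "latest
-- alerts" filtering, and FailureType covers backups, jobs, DBCC, etc.
CREATE TABLE dbo.FailureLog (
    FailureLogId    bigint IDENTITY(1,1) PRIMARY KEY,
    FailureTime     datetime2(0)   NOT NULL DEFAULT SYSDATETIME(),
    ApplicationPart varchar(100)   NOT NULL,  -- routes to the right team
    FailureType     varchar(50)    NOT NULL,  -- 'Backup', 'Job', 'DBCC', ...
    ServerName      sysname        NOT NULL DEFAULT @@SERVERNAME,
    ErrorMessage    nvarchar(4000) NULL
);
```

Each failure source (backup jobs, DBCC checks, Agent jobs) inserts into this one table, and one reporting query filters by ApplicationPart and FailureTime to notify the right team about the most recent failures.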
For summaries or assessments, like an ETL audit or a health check, consider a time-based email or summary delivered at a time when you will stop and review it. While I find summaries helpful, and they can sometimes serve as warnings, it's easy to start ignoring them if they arrive randomly or at a bad time. Built right, they provide a good overview of what's happening and are worth keeping an eye on, as they are often useful in detecting problems early.
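A time-based summary can be a Database Mail call scheduled in a SQL Server Agent job at a fixed review time. A sketch, where the profile, recipient, and the central failure-log table it queries are all assumptions:

```sql
-- Assumed Database Mail profile and recipient; schedule this in an Agent
-- job at a fixed time you will actually review (e.g., 8:00 AM). @query
-- embeds the last 24 hours of rows from a hypothetical failure log table.
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = N'AlertProfile',
    @recipients   = N'dba-team@example.com',
    @subject      = N'Daily health summary',
    @query        = N'SELECT FailureTime, FailureType, ErrorMessage
                      FROM YourDb.dbo.FailureLog
                      WHERE FailureTime >= DATEADD(day, -1, SYSDATETIME());',
    @attach_query_result_as_file = 0;
```

Because the email arrives at the same time every day, it becomes part of a routine review rather than random noise.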