Lease Timeouts and Health Checks in SQL Server Always On Availability Groups

In the 42^nd article for SQL Server Always On Availability Groups, we are going to explore lease timeouts and health checks.

Introduction

SQL Server Always On Availability Groups provides resilience for high availability and disaster recovery solution in a multi-node architecture. In the article, Session timeouts in SQL Server Always On Availability Groups, we explained the session timeout configurations to safeguard AG from soft errors.

Suppose you have an availability group configured for automatic failover. In this synchronized availability group, we can have the following types of failover.

Automatic
Manual
You can do forced failover with possible data loss

In the case of automatic failover, the availability group fails over to a synchronized secondary replica without any data loss. However, an automatic failover requires the satisfied conditions defined in the flexible failover policy.

The flexible failover policy depends on the Health-check timeout threshold, Failure-Condition Level and Cluster timeouts.

It is an important aspect to determine the auto-failover causes in a production database environment. The logs are useful factors for monitoring and investigating the failures.

SQL Server error logs
Windows cluster logs
Cluster event logs
SQL Server failover diagnostics (sp_server_diagnostics) logs
AlwaysOn_Health extended events output
System_health extended events
Application and system logs

The availability group feature requires an underlying Windows Failover Clusters. The cluster requires cluster services on all participating nodes.

The resource host acts as an interface between the cluster resource and service. SQL Server uses resource DLLs for the resource monitor. These DLLs are responsible for health monitoring and offline-online resource management.

In the below image, we see two levels of monitoring in an AG configuration.

Looks-Alive
Is-Alive

Looks-Alive mechanism

In a Windows failover cluster, the cluster service continuously transmits heartbeat to other cluster nodes and waits for the acknowledgment. If it does not receive a heartbeat acknowledgment after a series of heartbeats, it assumes that the corresponding node is down. Subsequently, it broadcasts a message to other cluster nodes, and a node takes over the cluster ownership based on the quorum votes and majority configuration. If due to any reason, the cluster could not determine the quorum majority, all nodes in a cluster go into the resolving state and all cluster resources are taken offline. This process is known as Looks-Alive and runs every 5 seconds. The Windows failover cluster uses the Looks-Alive mechanism for an integration between the cluster resource host and SQL process. On the primary replica, SQL Server uses a dedicated lease worker thread.

Go to Windows Failover Cluster and right-click on the cluster resource. In the properties, you get configured values for LeaseTimeout. By default, it is set to 20000 milliseconds (20 seconds).

The lease worker thread and resource host work circularly, as shown in the below image.

Image reference: Microsoft docs

In the SQL Server logs on the primary replica, you get a message from the lease worker. The lease worker thread and resource host maintains a time-to-leave (TTL), and it gets updated each time threads wait up after a signal from another resource.

If the lease timeout period elapsed while waiting for the signal, the AG resource goes into the resolving state. In the case of AG failover, SQL Server logs another entry to stop the lease renewal.

Is-Alive mechanism

The SQL Server Always on resource DLL uses the sp_server_diagnostics stored procedure for the health of SQL Service. The stored procedure reports the status for the system, resource, IO subsystem, query processing and events. It uses a FAILURE_CONDITION_LEVEL for implementing defining a condition in which automatic failover can happen.

Failover Condition Level	Description
1	Automatic failover occurred due to server down or lease expiry.
2	No data returned from sp_server_diagnostics. It can be related to the hung server issue. The automatic failover can happen due to SQL Server is not connected to the cluster or due to user-defined health check timeout.
3 (Default level)	System error ( stack dumps, orphaned spinlocks, serious write-access)
4	Resource error (For example- persistent out-of-memory condition)
5	Query processing (for example – Scheduler deadlock or unresolvable deadlock)

A failover condition encompasses conditions lower than configured conditions as well. For example, the default condition 3 also uses condition levels 1 and 2.

You can use the ALTER AVAILABILITY GROUP to set a specific failover condition level.

1	ALTER AVAILABILITY GROUP AG1 SET (FAILURE_CONDITION_LEVEL = 1);

But how long resource DLL with wait for the output from the stored procedure sp_server_diagnostics? The resource DLL uses the timeout properties for this purpose. By default, the timeout is 30,000 milliseconds, i.e. 30 seconds. You can set the value for a timeout from 15 seconds to 4294967 seconds.

You can modify the health check timeout from the following ALTER AVAILABILITY GROUP command.

1	ALTER AVAILABILITY GROUP AG1 SET (HEALTH_CHECK_TIMEOUT = 60000);

We can query sys.dm_xe_session to check the extended events for the sp_server_diagnostics session.

The output of these events is stored in the default log directory. For example, in my environment, these files are available in the C:\Program Files\Microsoft SQL Server\MSSQL15.INST1\MSSQL\Log directory.

Lease Time out in SQL Server Always On Availability Groups

As shown below, the lease timeout is 20 seconds. By default, the Windows failover cluster renewals the lease every 5 seconds, i.e. ¼ time of the lease timeout.

If lease timeout occurs ( >20 seconds), the rhs.exe reports an error in the Windows cluster, and it starts taking preventive action.

You can get the following error messages for a least timeout issues:

select message_id,text from sys.messages

where message_id in(19407,19419,19421,19422) and

language_id=1033

SQL Server Always On Availability Groups Health Check timeout

SQL Server Always On Availability Group performs a health check of the primary replica using the sp_server_diagnostics stored procedure. The sp_server_diagnostics executes every 10 seconds. The health check timeout is 30 seconds.

Therefore, the stored procedure returns the result on 1/3 * health checks current threshold. If the SP does not return any results, AG refers to the previous state for determining the instance health until the health-check timeout threshold. The next Is-Alive determines that the primary replica is unresponsive, and it initiates the automatic failover.

Note:
The sp_server_diagnostics does not perform health checks at the database level

It does not consider DB level health detection until we explicitly enabled it. Refer to the article, Database-level health detection in SQL Server Always On Availability Groups for more details

WSFC configuration and cluster timeouts for SQL Server Always On Availability Groups

In a Windows failover cluster, the following parameters control the cluster timeout values:

SameSubnetDelay
SameSubnetThreshold
CrossSubnetDelay
CrossSubnetThreshold

You can check these parameter values using the Windows PowerShell cmdlet get-cluster().

In my lab environment, the servers are located in the same subnets. Therefore, the following table shows the default values for Windows Server 2012 and 2016.

Parameter	Windows Server 2012 R2	Windows Server 2016
SameSubnetDelay	1 second	1 Second
SameSubnetThreshold	5 heartbeats	10 heartbeats

The SameSubnetDelay defines the wait time between the heartbeats. It is 1 second for both Windows Server 2012 and 2016.

The SameSubnetThreshold defines the number of missing heartbeats a cluster can tolerate, and if it does not receive the acknowledgement, the target node is declared dead.

Recap of LeaseTimeout, Session-Timeout and Health check timeouts

Lease timeout:
- Default value: 20000 milliseconds, i.e. 20 seconds
- It is required to prevent a split-brain scenario in the Windows failover cluster
- It can trigger an AG failover or offline-online
- It is used in both Is-Alive and the Looks-alive mechanism
Session Timeout:
- It is used to safeguard against soft errors between AG replicas
- default value: 10000 milliseconds or 10 seconds
- It is not part of the Is-Alive or the Looks-alive mechanism
- Secondary replica becomes DISCONNECTED status due to session timeout
- You can configure the session timeouts in the availability group properties from the primary replica instance
Health Check timeout
- Default value: 30000 milliseconds, i.e. 30 seconds
- It is the timeout if the sp_server_diagnostics does not return any data or reports errors in the health check
- It is used in both Is-Alive and the Looks-alive mechanism
- It also depends on the failover condition levels defined from 1-5, as explained earlier. The default configuration is level 3

Conclusion

In this article, we figured SQL Server Always On Availability Groups flexible failover policy using the lease, health check timeout and Failover Condition Levels. You should review the logs to determine the scenarios where the automatic failover occurred or failed.

A comprehensive guide to SQL Server Always On Availability Groups on Windows Server 2016

Configure Domain Controller and Active Directory for SQL Server Always On Availability Groups

Configure failover clusters, storage controllers and quorum configurations for SQL Server Always On Availability Groups

Install SQL Server 2019 on Windows Server 2016 with SQL Server Always On Availability Groups

Add a new node into existing SQL Server Always On Availability Groups

Configure Managed Service Accounts for SQL Server Always On Availability Groups

Add or remove a node from SQL Server Always On Availability Groups using T-SQL scripts

Database-level health detection in SQL Server Always On Availability Groups

Automatic Page Repair in SQL Server Always On Availability Groups

Deploy a domain-independent Windows Failover Cluster for SQL Server Always On Availability Groups

Configure a SQL Server Always On Availability Group on the domain-independent Failover Cluster

An overview of distributed SQL Server Always On Availability Groups

Deploy a distributed SQL Server Always On Availability Group

Monitor and failover a Distributed SQL Server Always On Availability Group

Transparent Data Encryption for SQL Server Always On Availability Groups

Configure SQL Server replication for a database in SQL Server Always On Availability Groups

Configuring SQL Server replication for distribution databases in SQL Server Always On Availability Groups

Explore Cross-database MSDTC for distributed transactions in SQL Server Always On Availability Groups

Deploy MSDTC for distributed transactions in SQL Server Always On Availability Groups

Restore an existing availability group database participating in SQL Server Always On Availability Groups

Exploring AG dashboards for monitoring SQL Server Always On Availability Groups

Backup compression in TDE enabled databases in SQL Server Always On Availability Groups

Impact of dropping a login in the active directory tied to SQL Server Always On Availability Groups

Refresh SQL Server Always On Availability Group databases using DBATools PowerShell

Configure SQL Server Reporting Services databases in SQL Server Always On Availability Groups

Suspend and Resume Data Movement in SQL Server Always On Availability Groups

Explore failover types in SQL Server Always On Availability Groups

Explore SQL Server PolyBase External Tables in SQL Server Always On Availability Groups

SQL Server Always On Availability Groups for SQL Server Linux instances

Column-level SQL Server encryption with SQL Server Always On Availability Groups

Make the most of secondary replicas in SQL Server Always On Availability Groups

Apply SQL Server patches or cumulative updates in SQL Server Always On Availability Groups

Monitor SQL Server Always On Availability groups using extended events

The Hub and Spoke model of policy-based management for SQL Server Always On Availability Groups

Custom policies for AG dashboards of SQL Server Always On Availability Groups

Explore dynamic management views for monitoring SQL Server Always On Availability Groups

Dynamic management views for monitoring availability replicas and databases for SQL Server Always On Availability

Configure SQL Server Always On Availability Groups using Windows PowerShell scripts

Configure Integration Services Catalog Database SSISDB in SQL Server Always On Availability Groups

Synchronize logins between Availability replicas in SQL Server Always On Availability Groups

Session timeouts in SQL Server Always On Availability Groups

Lease Timeouts and Health Checks in SQL Server Always On Availability Groups

Author
Recent Posts

Rajendra Gupta

Hi! I am Rajendra Gupta, Database Specialist and Architect, helping organizations implement Microsoft SQL Server, Azure, Couchbase, AWS solutions fast and efficiently, fix related issues, and Performance Tuning with over 14 years of experience.

I am the author of the book "DP-300 Administering Relational Database on Microsoft Azure". I published more than 650 technical articles on MSSQLTips, SQLShack, Quest, CodingSight, and SeveralNines.

I am the creator of one of the biggest free online collections of articles on a single topic, with his 50-part series on SQL Server Always On Availability Groups.

Based on my contribution to the SQL Server community, I have been recognized as the prestigious Best Author of the Year continuously in 2019, 2020, and 2021 (2nd Rank) at SQLShack and the MSSQLTIPS champions award in 2020.

Personal Blog: https://www.dbblogger.com
I am always interested in new challenges so if you need consulting help, reach me at rajendra.gupta16@gmail.com

View all posts by Rajendra Gupta

Latest posts by Rajendra Gupta (see all)

Understanding PostgreSQL SUBSTRING function - September 21, 2024
How to install PostgreSQL on Ubuntu - July 13, 2023
How to use the CROSSTAB function in PostgreSQL - February 17, 2023