Monitor SQL Server Always On Availability groups using extended events

In this 33^rd article of SQL Server Always On Availability Group series, we will use extended events to monitor the availability group.

Introduction

Database professionals’ primary role is to do proactive monitoring for ensuring system availability. DBA must investigate the problem root cause analysis in case any critical event or downtime occurs. You can use various things for investigation purposes such as SQL Server logs, Windows & Cluster logs, and dynamic management views output, profiler and extended events.

SQL Server Always On Availability Group provide robust high availability and disaster recovery solution. It is equally essential and beneficial to monitor the AG group ownership, their synchronization states, failover, critical parameters. Let’s go ahead and see how extended events are used in monitoring the availability groups.

Environment details

You can use earlier articles in this series (see TOC at the bottom) and configure two-node SQL Server Always On Availability group replicas as shown below. The availability group databases are in the synchronized commit mode.

Primary Replica: SQLNode1\INST1
Secondary Replica: SQLNode2\INST1
Database: [MyNewDB]

SQL Server Always On Monitoring using the extended events

SQL Server Extended events are a lightweight monitoring tool that captures useful SQL instance and database parameters. We can collect the required database and use it for monitoring and troubleshooting purpose. For example, in an article, Monitoring SQL Server deadlocks using the system_health extended event, we use it to get deadlock XML and graph using the default system_health session. We can use SSMS GUI or t-SQL to create, configure and retrieve data from extended event files.

Once we configure the availability group, the create Availability Group Wizard automatically creates the AlwaysOn_health extended event session on all replicas participating in the AG configuration. You can view it in SSMS navigating to Management -> Extended Events -> AlwaysOn_health.

If you configure the availability group using t-SQL or new availability group, it does not start the alwayson_health event session. Therefore, you should verify that the extended event session is running and collecting the required data.

To understand the AlwaysOn_health event session, it is a good idea to script it out. Right-click on the event session-> Script Session as -> Create To-> New Query editor window.

As shown above, it has several extended events along with the error numbers to capture. We use the sys.dm_xe_objects to retrieve a brief description of these extended events.

SELECT name, description

FROM sys.dm_xe_objects

WHERE NAME IN (

'alwayson_ddl_executed',

'availability_group_lease_expired',

'availability_replica_automatic_failover_validation',

'availability_replica_manager_state_change',

'availability_replica_state_change',

'error_reported',

'lock_redo_blocked')

alwayson_ddl_executed:

This event occurs when we execute a data definition language (DDL) statement such as CREATE, ALTER, DROP on an availability group database. It does not capture detailed information about DDL. For example, you cannot retrieve an execution plan. Its main purpose is to indicate you with the DDL followed by planned or unexpected failover.

availability_group_lease_expired:

If the Windows failover cluster and availability group have a connectivity issue, it might cause the lease expired messages in the event logs. You might face automatic failover for the synchronous replica’s in case it occurs on the primary replica.

availability_replica_automatic_failover_validation:

For an automatic failover, SQL Server validates that the primary target replica is synchronized or not. It provides an AG failover point for failovers. DBA’s can use the information to investigate the cause of an automatic failover using this extended event.

availability_replica_manager_state_change:

This event occurs if the availability replica manager states changes. It is also a useful event session to investigate the reason for a different state.

availability_replica_state_change:

We can use this event to monitor the state of the availability group replica. It is also helpful to understand the internal of an AG failover.

error_reported:

It tracks various error numbers to find out the connectivity errors for availability group endpoints. You can query the sys.messages table to check the description for these different error id’s.

SELECT message_id, [text] as [Description]

FROM sys.messages AS m

WHERE m.language_id = SERVERPROPERTY('LCID')

AND (m.message_id=(9691)

OR m.message_id=(35204)

OR m.message_id=(9693)

OR m.message_id=(26024)

OR m.message_id=(28047)

OR m.message_id=(26023)

OR m.message_id=(9692)

OR m.message_id=(28034)

OR m.message_id=(28036)

OR m.message_id=(28048)

OR m.message_id=(28080)

OR m.message_id=(28091)

OR m.message_id=(26022)

OR m.message_id=(9642)

OR m.message_id=(35201)

OR m.message_id=(35202)

OR m.message_id=(35206)

OR m.message_id=(35207)

OR m.message_id=(26069)

OR m.message_id=(26070)

OR m.message_id>(41047)

AND m.message_id<(41056)

OR m.message_id=(41142)

OR m.message_id=(41144)

OR m.message_id=(1480)

OR m.message_id=(823)

OR m.message_id=(824)

OR m.message_id=(829)

OR m.message_id=(35264)

OR m.message_id=(35265)

)

Order by message_id

In the query output, you get the error id and its corresponding description.

To summaries these error ids you can refer to the following table for errors:

Category	Error id	description
Consistency issues\ Corruption	823 824 829	Logical consistency error, disk corruption, database consistency issues reported by DBCC CHECKDB.
Changing roles	1480	It gives you a message once the AG replica changes its role, such as secondary to primary.
Endpoint	9691 9692 9693	If SQL Server could not listen for an IP address or port, it logs the message in the extended event. For example, in the case of the issue of SQL listener, we get the corresponding event.
Connection handshake	28034 28036 28047 28048 28080	You get these errors in case of connection handshake failure. For example, if your AG owner or service account does not have to connect permissions for the endpoint, it raises a connection handshake message.
Connection timeouts	35201 35202 35204 35206 35207	In case your availability group replicas get any connection timeouts, it logs the information in the extended event session.
Windows failover informational messages	41048 41049 41050 41051 41052 41053 41054 41055	These are the informational messages in different states of replicas during failover. It is useful to understand the failover process in SQL Server Always On Availability Groups.
AG failover issues	41142 41144	Suppose your SQL Server Always On Availability group replica tries for automatic failover, but the new primary replica is not ready for it, it logs the corresponding event in the extended event.

hadr_db_partner_set_sync_state

This event is to track the HADR partner sync state changes.

lock_redo_blocked

This event is useful to track the redo thread blocks that might cause latency in the data synchronizations.

Data Storage for Alwayson_health extended event

Open the Alwayson_health extended event properties and navigate to Data Storage. By default, it stores the collected data in an event file. This event file gets created in the instance log directory.

Maximum file size: 5 MB
Maximum number of files: 4
Enable file rollover: It rollover the files once it extended event collected data fills the 4 files with 5 MB each. It again starts overwriting the data from the first event file

Browse to your Log directory for SQL instance, and here you see the event files.

Monitor SQL Server Always On Availability Groups using the Extended event

We can extract data from the AlwaysOn_health extended event using GUI and T-SQL script.

You can expand the AlwaysOn_health in SSMS, right-click on the package0.event_data file and select View targeted data. If you want to view the live data, click on the Watch live data on the root Alwayson_health folder in SSMS.

It opens the event viewer to show the collected information. You can filter, aggregate, and choose columns in this window for your required data.

Sometimes, it is not convenient to get information using the graphical method. You can use t-SQL as well to get data from the extended event files and choose the required data.

To demonstrate the data collection for SQL Server Always On Availability Group, I suspend the data movement from the secondary replica. You can refer to series articles for understanding the suspend and resume data movement.

Expand Availability Group Database -> Suspend Data Movement.

Once the data movement suspends, wait for some time and again resume the data movement using the Resume Data movement wizard.

Use the following query to filter the extended event collected data for error 35264(suspend),35265 (resume).

In the below query, we use sys.fn_xe_file_target_read_file to read the event file and filter the data.

DECLARE @FileName NVARCHAR(4000)

SELECT @FileName = target_data.value('(EventFileTarget/File/@name)[1]',

'nvarchar(4000)')

FROM (

SELECT

CAST(target_data AS XML) target_data

FROM sys.dm_xe_sessions s

JOIN sys.dm_xe_session_targets t

ON s.address = t.event_session_address

WHERE s.name = N'AlwaysOn_health'

) ft

SELECT

XEData.value('(event/@timestamp)[1]','datetime2(3)') AS event_timestamp,

XEData.value('(event/data[@name="error_number"]/value)[1]', 'int')

AS error_number,

XEData.value('(event/data[@name="message"]/value)[1]', 'varchar(max)')

AS message

FROM (

SELECT CAST(event_data AS XML) XEData, *

FROM sys.fn_xe_file_target_read_file(@FileName, NULL, NULL, NULL)

WHERE object_name = 'error_reported'

) event_data

WHERE XEData.value('(event/data[@name="error_number"]/value)[1]', 'int') IN ( 35264, 35265)

ORDER BY event_timestamp DESC;

In the output, you can see the suspend and resume data movement for the availability group database. In the suspended message, it specifies that the database [MyNewDB] is suspended due to user activity.

Similarly, you can filter the extended event session for error for troubleshooting SQL Server Always On Availability Group.

Apart from the default availability group, you can configure additional event sessions as per your requirement. To check the extended events for HADR, you can filter events from the sys.dm_xe_objects as shown below.

1 2	SELECT Name, object_type, description FROM sys.dm_xe_objects WHERE name LIKE '%hadr%'

It gives you a list of 178 events. Here, you can see a snippet of all corresponding extended events.

Conclusion

In this article, we learned the extended events for monitor SQL Server Always On Availability group related events and errors. You can add additional monitoring events to collect the relevant data as per your requirement.

A comprehensive guide to SQL Server Always On Availability Groups on Windows Server 2016

Configure Domain Controller and Active Directory for SQL Server Always On Availability Groups

Configure failover clusters, storage controllers and quorum configurations for SQL Server Always On Availability Groups

Install SQL Server 2019 on Windows Server 2016 with SQL Server Always On Availability Groups

Add a new node into existing SQL Server Always On Availability Groups

Configure Managed Service Accounts for SQL Server Always On Availability Groups

Add or remove a node from SQL Server Always On Availability Groups using T-SQL scripts

Database-level health detection in SQL Server Always On Availability Groups

Automatic Page Repair in SQL Server Always On Availability Groups

Deploy a domain-independent Windows Failover Cluster for SQL Server Always On Availability Groups

Configure a SQL Server Always On Availability Group on the domain-independent Failover Cluster

An overview of distributed SQL Server Always On Availability Groups

Deploy a distributed SQL Server Always On Availability Group

Monitor and failover a Distributed SQL Server Always On Availability Group

Transparent Data Encryption for SQL Server Always On Availability Groups

Configure SQL Server replication for a database in SQL Server Always On Availability Groups

Configuring SQL Server replication for distribution databases in SQL Server Always On Availability Groups

Explore Cross-database MSDTC for distributed transactions in SQL Server Always On Availability Groups

Deploy MSDTC for distributed transactions in SQL Server Always On Availability Groups

Restore an existing availability group database participating in SQL Server Always On Availability Groups

Exploring AG dashboards for monitoring SQL Server Always On Availability Groups

Backup compression in TDE enabled databases in SQL Server Always On Availability Groups

Impact of dropping a login in the active directory tied to SQL Server Always On Availability Groups

Refresh SQL Server Always On Availability Group databases using DBATools PowerShell

Configure SQL Server Reporting Services databases in SQL Server Always On Availability Groups

Suspend and Resume Data Movement in SQL Server Always On Availability Groups

Explore failover types in SQL Server Always On Availability Groups

Explore SQL Server PolyBase External Tables in SQL Server Always On Availability Groups

SQL Server Always On Availability Groups for SQL Server Linux instances

Column-level SQL Server encryption with SQL Server Always On Availability Groups

Make the most of secondary replicas in SQL Server Always On Availability Groups

Apply SQL Server patches or cumulative updates in SQL Server Always On Availability Groups

Monitor SQL Server Always On Availability groups using extended events

The Hub and Spoke model of policy-based management for SQL Server Always On Availability Groups

Custom policies for AG dashboards of SQL Server Always On Availability Groups

Explore dynamic management views for monitoring SQL Server Always On Availability Groups

Dynamic management views for monitoring availability replicas and databases for SQL Server Always On Availability

Configure SQL Server Always On Availability Groups using Windows PowerShell scripts

Configure Integration Services Catalog Database SSISDB in SQL Server Always On Availability Groups

Synchronize logins between Availability replicas in SQL Server Always On Availability Group

Session timeouts in SQL Server Always On Availability Groups

Lease Timeouts and Health Checks in SQL Server Always On Availability Groups

Author
Recent Posts

Rajendra Gupta

Hi! I am Rajendra Gupta, Database Specialist and Architect, helping organizations implement Microsoft SQL Server, Azure, Couchbase, AWS solutions fast and efficiently, fix related issues, and Performance Tuning with over 14 years of experience.

I am the author of the book "DP-300 Administering Relational Database on Microsoft Azure". I published more than 650 technical articles on MSSQLTips, SQLShack, Quest, CodingSight, and SeveralNines.

I am the creator of one of the biggest free online collections of articles on a single topic, with his 50-part series on SQL Server Always On Availability Groups.

Based on my contribution to the SQL Server community, I have been recognized as the prestigious Best Author of the Year continuously in 2019, 2020, and 2021 (2nd Rank) at SQLShack and the MSSQLTIPS champions award in 2020.

Personal Blog: https://www.dbblogger.com
I am always interested in new challenges so if you need consulting help, reach me at rajendra.gupta16@gmail.com

View all posts by Rajendra Gupta

Latest posts by Rajendra Gupta (see all)

Understanding PostgreSQL SUBSTRING function - September 21, 2024
How to install PostgreSQL on Ubuntu - July 13, 2023
How to use the CROSSTAB function in PostgreSQL - February 17, 2023