In this 33rd article of SQL Server Always On Availability Group series, we will use extended events to monitor the availability group.
Introduction
Database professionals’ primary role is to do proactive monitoring for ensuring system availability. DBA must investigate the problem root cause analysis in case any critical event or downtime occurs. You can use various things for investigation purposes such as SQL Server logs, Windows & Cluster logs, and dynamic management views output, profiler and extended events.
SQL Server Always On Availability Group provide robust high availability and disaster recovery solution. It is equally essential and beneficial to monitor the AG group ownership, their synchronization states, failover, critical parameters. Let’s go ahead and see how extended events are used in monitoring the availability groups.
Environment details
You can use earlier articles in this series (see TOC at the bottom) and configure two-node SQL Server Always On Availability group replicas as shown below. The availability group databases are in the synchronized commit mode.
- Primary Replica: SQLNode1\INST1
- Secondary Replica: SQLNode2\INST1
- Database: [MyNewDB]
SQL Server Always On Monitoring using the extended events
SQL Server Extended events are a lightweight monitoring tool that captures useful SQL instance and database parameters. We can collect the required database and use it for monitoring and troubleshooting purpose. For example, in an article, Monitoring SQL Server deadlocks using the system_health extended event, we use it to get deadlock XML and graph using the default system_health session. We can use SSMS GUI or t-SQL to create, configure and retrieve data from extended event files.
Once we configure the availability group, the create Availability Group Wizard automatically creates the AlwaysOn_health extended event session on all replicas participating in the AG configuration. You can view it in SSMS navigating to Management -> Extended Events -> AlwaysOn_health.
If you configure the availability group using t-SQL or new availability group, it does not start the alwayson_health event session. Therefore, you should verify that the extended event session is running and collecting the required data.
To understand the AlwaysOn_health event session, it is a good idea to script it out. Right-click on the event session-> Script Session as -> Create To-> New Query editor window.
As shown above, it has several extended events along with the error numbers to capture. We use the sys.dm_xe_objects to retrieve a brief description of these extended events.
1 2 3 4 5 6 7 8 9 10 |
SELECT name, description FROM sys.dm_xe_objects WHERE NAME IN ( 'alwayson_ddl_executed', 'availability_group_lease_expired', 'availability_replica_automatic_failover_validation', 'availability_replica_manager_state_change', 'availability_replica_state_change', 'error_reported', 'lock_redo_blocked') |
alwayson_ddl_executed:
This event occurs when we execute a data definition language (DDL) statement such as CREATE, ALTER, DROP on an availability group database. It does not capture detailed information about DDL. For example, you cannot retrieve an execution plan. Its main purpose is to indicate you with the DDL followed by planned or unexpected failover.
availability_group_lease_expired:
If the Windows failover cluster and availability group have a connectivity issue, it might cause the lease expired messages in the event logs. You might face automatic failover for the synchronous replica’s in case it occurs on the primary replica.
availability_replica_automatic_failover_validation:
For an automatic failover, SQL Server validates that the primary target replica is synchronized or not. It provides an AG failover point for failovers. DBA’s can use the information to investigate the cause of an automatic failover using this extended event.
availability_replica_manager_state_change:
This event occurs if the availability replica manager states changes. It is also a useful event session to investigate the reason for a different state.
availability_replica_state_change:
We can use this event to monitor the state of the availability group replica. It is also helpful to understand the internal of an AG failover.
error_reported:
It tracks various error numbers to find out the connectivity errors for availability group endpoints. You can query the sys.messages table to check the description for these different error id’s.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
SELECT message_id, [text] as [Description] FROM sys.messages AS m WHERE m.language_id = SERVERPROPERTY('LCID') AND (m.message_id=(9691) OR m.message_id=(35204) OR m.message_id=(9693) OR m.message_id=(26024) OR m.message_id=(28047) OR m.message_id=(26023) OR m.message_id=(9692) OR m.message_id=(28034) OR m.message_id=(28036) OR m.message_id=(28048) OR m.message_id=(28080) OR m.message_id=(28091) OR m.message_id=(26022) OR m.message_id=(9642) OR m.message_id=(35201) OR m.message_id=(35202) OR m.message_id=(35206) OR m.message_id=(35207) OR m.message_id=(26069) OR m.message_id=(26070) OR m.message_id>(41047) AND m.message_id<(41056) OR m.message_id=(41142) OR m.message_id=(41144) OR m.message_id=(1480) OR m.message_id=(823) OR m.message_id=(824) OR m.message_id=(829) OR m.message_id=(35264) OR m.message_id=(35265) ) Order by message_id |
In the query output, you get the error id and its corresponding description.
To summaries these error ids you can refer to the following table for errors:
Category |
Error id |
description |
Consistency issues\ Corruption |
823 824 829 |
Logical consistency error, disk corruption, database consistency issues reported by DBCC CHECKDB. |
Changing roles |
1480 |
It gives you a message once the AG replica changes its role, such as secondary to primary. |
Endpoint |
9691 9692 9693 |
If SQL Server could not listen for an IP address or port, it logs the message in the extended event. For example, in the case of the issue of SQL listener, we get the corresponding event. |
Connection handshake |
28034 28036 28047 28048 28080 |
You get these errors in case of connection handshake failure. For example, if your AG owner or service account does not have to connect permissions for the endpoint, it raises a connection handshake message. |
Connection timeouts |
35201 35202 35204 35206 35207 |
In case your availability group replicas get any connection timeouts, it logs the information in the extended event session. |
Windows failover informational messages |
41048 41049 41050 41051 41052 41053 41054 41055 |
These are the informational messages in different states of replicas during failover. It is useful to understand the failover process in SQL Server Always On Availability Groups. |
AG failover issues |
41142 41144 |
Suppose your SQL Server Always On Availability group replica tries for automatic failover, but the new primary replica is not ready for it, it logs the corresponding event in the extended event. |
hadr_db_partner_set_sync_state
This event is to track the HADR partner sync state changes.
lock_redo_blocked
This event is useful to track the redo thread blocks that might cause latency in the data synchronizations.
Data Storage for Alwayson_health extended event
Open the Alwayson_health extended event properties and navigate to Data Storage. By default, it stores the collected data in an event file. This event file gets created in the instance log directory.
- Maximum file size: 5 MB
- Maximum number of files: 4
- Enable file rollover: It rollover the files once it extended event collected data fills the 4 files with 5 MB each. It again starts overwriting the data from the first event file
Browse to your Log directory for SQL instance, and here you see the event files.
Monitor SQL Server Always On Availability Groups using the Extended event
We can extract data from the AlwaysOn_health extended event using GUI and T-SQL script.
You can expand the AlwaysOn_health in SSMS, right-click on the package0.event_data file and select View targeted data. If you want to view the live data, click on the Watch live data on the root Alwayson_health folder in SSMS.
It opens the event viewer to show the collected information. You can filter, aggregate, and choose columns in this window for your required data.
Sometimes, it is not convenient to get information using the graphical method. You can use t-SQL as well to get data from the extended event files and choose the required data.
To demonstrate the data collection for SQL Server Always On Availability Group, I suspend the data movement from the secondary replica. You can refer to series articles for understanding the suspend and resume data movement.
Expand Availability Group Database -> Suspend Data Movement.
Once the data movement suspends, wait for some time and again resume the data movement using the Resume Data movement wizard.
Use the following query to filter the extended event collected data for error 35264(suspend),35265 (resume).
In the below query, we use sys.fn_xe_file_target_read_file to read the event file and filter the data.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
DECLARE @FileName NVARCHAR(4000) SELECT @FileName = target_data.value('(EventFileTarget/File/@name)[1]', 'nvarchar(4000)') FROM ( SELECT CAST(target_data AS XML) target_data FROM sys.dm_xe_sessions s JOIN sys.dm_xe_session_targets t ON s.address = t.event_session_address WHERE s.name = N'AlwaysOn_health' ) ft SELECT XEData.value('(event/@timestamp)[1]','datetime2(3)') AS event_timestamp, XEData.value('(event/data[@name="error_number"]/value)[1]', 'int') AS error_number, XEData.value('(event/data[@name="message"]/value)[1]', 'varchar(max)') AS message FROM ( SELECT CAST(event_data AS XML) XEData, * FROM sys.fn_xe_file_target_read_file(@FileName, NULL, NULL, NULL) WHERE object_name = 'error_reported' ) event_data WHERE XEData.value('(event/data[@name="error_number"]/value)[1]', 'int') IN ( 35264, 35265) ORDER BY event_timestamp DESC; |
In the output, you can see the suspend and resume data movement for the availability group database. In the suspended message, it specifies that the database [MyNewDB] is suspended due to user activity.
Similarly, you can filter the extended event session for error for troubleshooting SQL Server Always On Availability Group.
Apart from the default availability group, you can configure additional event sessions as per your requirement. To check the extended events for HADR, you can filter events from the sys.dm_xe_objects as shown below.
1 2 |
SELECT Name, object_type, description FROM sys.dm_xe_objects WHERE name LIKE '%hadr%' |
It gives you a list of 178 events. Here, you can see a snippet of all corresponding extended events.
Conclusion
In this article, we learned the extended events for monitor SQL Server Always On Availability group related events and errors. You can add additional monitoring events to collect the relevant data as per your requirement.
Table of contents
- Understanding PostgreSQL SUBSTRING function - September 21, 2024
- How to install PostgreSQL on Ubuntu - July 13, 2023
- How to use the CROSSTAB function in PostgreSQL - February 17, 2023