This article will show how to catalog data assets and then tag business glossaries at the attribute level of data assets in Azure Purview.
Introduction
Data cataloging is pivotal to a growing data ecosystem, and maintaining a business glossary within a data catalog is an efficient way of standardizing business-critical terminology. We covered the importance as well as the setup of this functionality in my last article. Organizing the business terminology in a data catalog, associating synonyms and related terms, and linking terms to their stakeholders is not sufficient on its own. The end purpose of this exercise is to hold the essential domain knowledge and terms in a systematic manner in the data catalog and use them to tag the data assets that are held in the catalog. This allows data assets to be discovered as well as organized based on business terminology.
In the previous article, Defining Business Glossary with Azure Purview, we learned to create an Azure Purview account, create new terms using the Glossary feature of Azure Purview, hierarchically interlink different terms, as well as evolve the status of these terms as they pass through various data governance workflows. In this article, we will focus on the next step of cataloging data assets and then tagging business glossaries at the attribute level of data assets in Azure Purview.
Pre-requisites
We would need an Azure account with an Azure Purview account already set up with administrative privileges. We also need some business terms already defined in the glossary section of the Azure Purview account, as we would use them to tag the data assets once they are registered.
We would also need at least one data repository, such as Azure SQL Database, Azure Blob Storage, Azure Data Lake Storage, or any other supported data repository, with at least one data object or data file hosted in it. This data object would be registered as the data asset, which would then be tagged with the business glossary terms. It is assumed that this entire setup is in place before proceeding with the following exercise.
Registering Data Assets in Azure Purview
Assuming that the Azure Purview account is already in place, click on the Open Purview Studio link. This would open Purview Studio in a new window. By default, it would have no data assets registered in it. Click on the Sources section to view the list of data sources and it would look as shown below.
Click on the Register button to list and choose one of the supported data sources. It would open a list of supported data sources as shown below. Let’s say that we have an Azure SQL Server data source that would have an Azure SQL Database attached to it. This database instance would have one or multiple data objects hosted in it. Our intention is to register this data source and scan the database objects in it so that it gets registered as a data asset, which we will then tag with the business terms. So, select Azure SQL Database as the data source and move to the next step.
In this step, we need to provide the name of the source that we are going to register. Then once we select the Azure subscription, Azure SQL Databases in this subscription will be listed and we can select the appropriate one from the list as shown below. Once we click on the Register button it would get registered.
Once the data source gets registered, it would look as shown below. This is the data map view in which we can view all the data sources as well as organize them categorically as required.
From the icons listed under the name of the data source, click on the item titled New Scan. This would invoke the new scan wizard as shown below. Here we need to provide the name of the scan being scheduled and the integration runtime that it would use to conduct the scan, which by default is “AutoResolveIntegrationRuntime”. The server endpoint would be populated by default based on our earlier selection of the Azure SQL Database instance. We need to select the database from which we intend to scan data objects and configure the credential with which Azure Purview would access the data objects in the data repository. We can use the managed identity, i.e., Purview MSI, or create a specific service credential for Purview to access a specific database. More configuration details can be found by clicking the See More link in this wizard.
Once the configuration is complete, finish the setup and initiate the scan. Once the scan completes successfully, the data objects would get registered as data assets in the Azure Purview account. Navigate to the home page of Azure Purview and click on the Browse Assets tile. Navigate through the data source and schema list to reach the inventory of tables that are scanned from the data source as shown below.
Let’s say that we intend to associate business glossary terms with the table named Customer under the SalesLT schema. Click on this table to explore the glossary terms that may already be associated with it. When you open the table, the bottom-right corner of the screen shows whether any glossary terms are already associated. In our case, no terms are associated with this table yet.
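The same table can also be located programmatically. The following is a minimal sketch, not part of the original walkthrough, that only builds the lookup URL and does not send any request. It assumes the Apache Atlas v2 unique-attribute lookup exposed under the Purview catalog endpoint, the `azure_sql_table` type name, and the `mssql://` qualified-name pattern. The account, server, and database names are hypothetical placeholders, so verify all of these against the current Purview REST documentation before relying on them.

```python
from urllib.parse import urlencode


def table_qualified_name(server: str, database: str, schema: str, table: str) -> str:
    """Build the qualified name assumed for an Azure SQL table asset."""
    return f"mssql://{server}.database.windows.net/{database}/{schema}/{table}"


def entity_lookup_url(endpoint: str, type_name: str, qualified_name: str) -> str:
    """Build (but do not send) an Atlas-style lookup URL for a single entity."""
    query = urlencode({"attr:qualifiedName": qualified_name})
    return (
        f"{endpoint.rstrip('/')}/catalog/api/atlas/v2/entity/"
        f"uniqueAttribute/type/{type_name}?{query}"
    )


# Hypothetical account, server, and database names.
# The SalesLT schema and Customer table come from the walkthrough.
qualified_name = table_qualified_name("myserver", "mydb", "SalesLT", "Customer")
url = entity_lookup_url(
    "https://myaccount.purview.azure.com", "azure_sql_table", qualified_name
)
print(url)
```

A GET request to such a URL, sent with a valid bearer token, would be expected to return the entity, including its GUID, under these assumptions.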
Let’s take a look at the schema of this table to explore the classification as well as the tagging of glossary terms at an attribute level. In this case, it looks as shown below. Attributes like email addresses are auto-classified using the built-in classification templates available in Azure Purview. But associating business terms is a business-specific activity that needs to be performed by data stewards, at least initially.
The business terms glossary that we created in the last article is as shown below. Here we have defined the terms Client and Engagement. We have stated in the term description that the term Client stands for similar terms like Customer, User, etc. In this case, the table is named Customer, which in business terminology means “Client”. So, the task at hand is to associate this business term from the glossary with this table and/or its attributes.
Navigate back to the table in question and click on the edit button. It would open an editable table screen as shown below. If we expand the Glossary terms drop-down, we will find the terms listed along with the hierarchy to which they belong. We can select one or more terms from the list as per the business relevance of the data object. Once done, click on the Save button.
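The association made through the edit screen can also be expressed as an API call. The sketch below is an illustration only and is not taken from the original walkthrough. It assumes the Apache Atlas v2 glossary endpoint (`.../glossary/terms/{termGuid}/assignedEntities`) exposed under the Purview catalog path, and it builds the request without sending it. The account name, GUIDs, and token are hypothetical placeholders.

```python
import json
from urllib.request import Request


def build_assign_term_request(endpoint, term_guid, entity_guids, token):
    """Build (but do not send) a request assigning one glossary term to entities."""
    url = (
        f"{endpoint.rstrip('/')}/catalog/api/atlas/v2/glossary/terms/"
        f"{term_guid}/assignedEntities"
    )
    # Each assigned entity (a table or a column) is referenced by its GUID.
    body = [{"guid": guid} for guid in entity_guids]
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only; real GUIDs come from the catalog.
request = build_assign_term_request(
    "https://myaccount.purview.azure.com",
    term_guid="client-term-guid",
    entity_guids=["customer-table-guid"],
    token="access-token-placeholder",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing a column GUID instead of a table GUID would, under the same assumptions, tag the term at the attribute level.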
Once the glossary terms are associated with the intended attributes, it would look as shown below. Clicking on the business term would navigate us to the definition of the term in the glossary.
As we have now associated the term from the glossary with this data asset, it should also be discoverable when we browse the catalog for assets related to a specific standard term defined in the glossary. To test this, navigate to the home page, key in the word “Client” in the top search bar, and hit the Enter key to search for this term. The result would list the Customer table even though we searched for the word Client. The reason is that we associated this glossary term with this data object, so the search identified the table as a relevant result, which is exactly the objective that we wanted to achieve.
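To make the reasoning concrete, here is a toy in-memory model of term-based discovery. It is a simplified illustration and not how Azure Purview implements search. The `Asset` class and `search` function are hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    glossary_terms: set = field(default_factory=set)


def search(assets, keyword):
    """Return names of assets whose name or glossary terms match the keyword."""
    needle = keyword.lower()
    return [
        asset.name
        for asset in assets
        if needle in asset.name.lower()
        or any(needle == term.lower() for term in asset.glossary_terms)
    ]


catalog = [Asset("Customer"), Asset("Product")]

# Before tagging, the keyword "Client" matches nothing in the catalog.
before = search(catalog, "Client")

# After tagging the Customer table with the term, the same search finds it.
catalog[0].glossary_terms.add("Client")
after = search(catalog, "Client")

print(before, after)
```

The table name never contains the word Client; the match comes entirely from the associated glossary term, which mirrors the behavior observed in the search above.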
In this way, we can tag data assets with glossary terms in Azure Purview and intelligently discover data assets cataloged in the Azure Purview account.
Conclusion
In this article, we continued from the last article and registered a data source, along with the data objects hosted in it, as data assets in the Azure Purview account. We then associated glossary terms with the data assets and verified that the assets are discoverable by searching for them using the business term.