In this article, we will discuss a number of questions about the Azure Data Factory service that you may be asked when applying for an Azure Data Engineer role.
Q1: Briefly describe the purpose of the ADF Service
ADF is used mainly to orchestrate copying data between different relational and non-relational data sources, hosted in the cloud or locally in your datacenters. ADF can also be used to transform the ingested data to meet your business requirements. It is the ETL, or ELT, tool for data ingestion in most Big Data solutions.
- For more information, check Starting your journey with Microsoft Azure Data Factory
Q2: Data Factory consists of a number of components. Mention these components briefly
- Pipeline: The logical container for the activities
- Activity: An execution step in the Data Factory pipeline that can be used for data ingestion and transformation
- Mapping Data Flow: A visually designed data transformation logic
- Dataset: A pointer to the data used in the pipeline activities
- Linked Service: The connection information used to connect to the data sources referenced by the pipeline activities
- Trigger: Specifies when the pipeline will be executed
- Control flow: Controls the execution flow of the pipeline activities
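To see how these components fit together, here is a minimal sketch using the azure-mgmt-datafactory Python SDK that creates a pipeline containing a single Copy activity. The subscription, resource group, factory, and dataset names are hypothetical, the two blob datasets are assumed to already exist, and model signatures can vary slightly between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
)

# Hypothetical identifiers; the two blob datasets are assumed to exist already.
SUBSCRIPTION_ID, RESOURCE_GROUP, FACTORY = "<subscription-id>", "adf-rg", "adf-demo"
adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Activity: a single Copy step that reads from a source dataset and writes to a sink dataset.
copy_step = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="SourceBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="SinkBlobDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Pipeline: the logical container that groups the activities.
adf.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY, "DemoPipeline",
    PipelineResource(activities=[copy_step]),
)
```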
- For more information, check Starting your journey with Microsoft Azure Data Factory
Q3: What is the difference between the Dataset and Linked Service in Data Factory?
Linked Service is a description of the connection string that is used to connect to the data stores. For example, when ingesting data from a SQL Server instance, the linked service contains the name for the SQL Server instance and the credentials used to connect to that instance.
Dataset is a reference to the data store that is described by the linked service. When ingesting data from a SQL Server instance, the dataset points to the name of the table that contains the target data or the query that returns data from different tables.
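As an illustration, the following sketch (azure-mgmt-datafactory Python SDK, with hypothetical server, database, and table names) creates an Azure SQL linked service that holds the connection information, and a dataset that points to one specific table exposed through that linked service:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureSqlDatabaseLinkedService,
    DatasetResource, AzureSqlTableDataset, LinkedServiceReference,
)

SUBSCRIPTION_ID, RESOURCE_GROUP, FACTORY = "<subscription-id>", "adf-rg", "adf-demo"
adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Linked service: *how* to connect (server, database, credentials).
sql_ls = AzureSqlDatabaseLinkedService(
    connection_string="Server=tcp:myserver.database.windows.net;Database=SalesDB;User ID=admin;Password=<password>;"
)
adf.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY, "AzureSqlLinkedService",
    LinkedServiceResource(properties=sql_ls),
)

# Dataset: *which* data inside that connection (a specific table).
sales_table = AzureSqlTableDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureSqlLinkedService"
    ),
    table_name="dbo.Sales",
)
adf.datasets.create_or_update(
    RESOURCE_GROUP, FACTORY, "SalesTableDataset",
    DatasetResource(properties=sales_table),
)
```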
- For more information, check Starting your journey with Microsoft Azure Data Factory
Q4: What is Data Factory Integration Runtime?
Integration Runtime is a secure compute infrastructure used by Data Factory to provide data integration capabilities across different network environments, and to ensure that the activities are executed in the region closest to the data store.
- For more information, check Copy data between Azure data stores using Azure Data Factory
Q5: Data Factory supports three types of Integration Runtimes. Mention these supported types with a brief description for each
- Azure Integration Runtime: used for copying data from or to data stores accessed publicly via the internet
- Self-Hosted Integration Runtime: used for copying data from or to an on-premises data store or networks with access control
- Azure SSIS Integration Runtime: used to run SSIS packages in the Data Factory
- For more information, check Copy data between Azure data stores using Azure Data Factory
Q6: When copying data from or to an Azure SQL Database using Data Factory, what is the firewall option that we should enable to allow the Data Factory to access that database?
The "Allow Azure services and resources to access this server" firewall option.
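Outside the portal, the same option corresponds to a special firewall rule covering the 0.0.0.0 address range. Below is a small sketch using the azure-mgmt-sql Python SDK with hypothetical resource names; the rule name and range shown are what the portal checkbox creates, but treat the exact call signature as version-dependent:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import FirewallRule

SUBSCRIPTION_ID, RESOURCE_GROUP, SERVER = "<subscription-id>", "sql-rg", "myserver"
sql = SqlManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# The rule name "AllowAllWindowsAzureIps" with the 0.0.0.0-0.0.0.0 range is the
# programmatic equivalent of the "Allow Azure services and resources to access
# this server" checkbox in the portal.
sql.firewall_rules.create_or_update(
    RESOURCE_GROUP, SERVER, "AllowAllWindowsAzureIps",
    FirewallRule(start_ip_address="0.0.0.0", end_ip_address="0.0.0.0"),
)
```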
Q7: If we need to copy data from an on-premises SQL Server instance using Data Factory, which Integration Runtime should be used and where should it be installed?
The Self-Hosted Integration Runtime should be used, installed on the on-premises machine where the SQL Server instance is hosted, or on a machine in the same network that can access that instance.
- For more information, check Copy data from On-premises data store to an Azure data store using Azure Data Factory.
Q8: After installing the Self-Hosted Integration Runtime on the machine where the SQL Server instance is hosted, how can we associate it with the SH-IR created from the Data Factory portal?
We need to register it using the authentication key provided by the ADF portal.
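For reference, the Self-Hosted IR can also be created and its authentication keys retrieved programmatically. A minimal sketch with the azure-mgmt-datafactory Python SDK and hypothetical names; the retrieved key is then pasted into the Integration Runtime Configuration Manager on the on-premises machine:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, SelfHostedIntegrationRuntime,
)

SUBSCRIPTION_ID, RESOURCE_GROUP, FACTORY = "<subscription-id>", "adf-rg", "adf-demo"
adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Create the Self-Hosted IR definition inside the Data Factory.
adf.integration_runtimes.create_or_update(
    RESOURCE_GROUP, FACTORY, "OnPremSqlIR",
    IntegrationRuntimeResource(properties=SelfHostedIntegrationRuntime()),
)

# Retrieve the authentication keys; either key is used to register the
# on-premises node with this IR.
keys = adf.integration_runtimes.list_auth_keys(RESOURCE_GROUP, FACTORY, "OnPremSqlIR")
print(keys.auth_key1)
```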
- For more information, check Copy data from On-premises data store to an Azure data store using Azure Data Factory
Q9: What is the difference between the Mapping data flow and Wrangling data flow transformation activities in Data Factory?
The Mapping data flow activity is a visually designed data transformation activity that allows us to build a graphical data transformation logic without needing to be an expert developer. It is executed as an activity within the ADF pipeline on an ADF fully managed, scaled-out Spark cluster.
The Wrangling data flow activity is a code-free data preparation activity that integrates with Power Query Online in order to make the Power Query M functions available for data wrangling using Spark execution.
- For more information, check Transform data using a Mapping Data Flow in Azure Data Factory
Q10: Data Factory supports two types of compute environments to execute the transform activities. Mention these two types briefly
On-demand compute environment, using a compute environment fully managed by ADF. In this type, a cluster is created to execute the transform activity and is removed automatically when the activity is completed.
Bring Your Own environment, in which the compute environment is managed by you and registered in ADF as a linked service.
- For more information, check Transform data using a Mapping Data Flow in Azure Data Factory
Q11: What is Azure SSIS Integration Runtime?
A fully managed cluster of virtual machines hosted in Azure and dedicated to running SSIS packages in the Data Factory. The SSIS IR nodes can be scaled up by configuring the node size, or scaled out by configuring the number of nodes in the VM cluster.
- For more information, check Run SSIS packages in Azure Data Factory
Q12: What is required to execute an SSIS package in Data Factory?
We need to create an SSIS IR and an SSISDB catalog hosted in Azure SQL Database or Azure SQL Managed Instance.
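The sketch below shows provisioning an Azure-SSIS IR with an SSISDB catalog using the azure-mgmt-datafactory Python SDK; the node size and node count settings also illustrate the scale-up/scale-out options mentioned in Q11. All names, sizes, and the pricing tier are hypothetical, and model and method names can differ slightly between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, ManagedIntegrationRuntime,
    IntegrationRuntimeComputeProperties, IntegrationRuntimeSsisProperties,
    IntegrationRuntimeSsisCatalogInfo, SecureString,
)

SUBSCRIPTION_ID, RESOURCE_GROUP, FACTORY = "<subscription-id>", "adf-rg", "adf-demo"
adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

ssis_ir = ManagedIntegrationRuntime(
    compute_properties=IntegrationRuntimeComputeProperties(
        location="WestEurope",
        node_size="Standard_D4_v3",   # scale up: larger node size
        number_of_nodes=2,            # scale out: more nodes in the cluster
    ),
    ssis_properties=IntegrationRuntimeSsisProperties(
        catalog_info=IntegrationRuntimeSsisCatalogInfo(
            catalog_server_endpoint="myserver.database.windows.net",
            catalog_admin_user_name="ssisadmin",
            catalog_admin_password=SecureString(value="<password>"),
            catalog_pricing_tier="S1",  # SSISDB hosted in Azure SQL Database
        )
    ),
)
adf.integration_runtimes.create_or_update(
    RESOURCE_GROUP, FACTORY, "AzureSsisIR",
    IntegrationRuntimeResource(properties=ssis_ir),
)

# Starting the IR provisions the VM cluster; in recent SDK versions this
# long-running operation is exposed as begin_start.
adf.integration_runtimes.begin_start(RESOURCE_GROUP, FACTORY, "AzureSsisIR").result()
```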
- For more information, check Run SSIS packages in Azure Data Factory
Q13: Which Data Factory activity is used to run an SSIS package in Azure?
Execute SSIS Package activity.
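A minimal sketch of a pipeline containing an Execute SSIS Package activity, using the azure-mgmt-datafactory Python SDK; the package path and IR name are hypothetical:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, ExecuteSSISPackageActivity,
    SSISPackageLocation, IntegrationRuntimeReference,
)

SUBSCRIPTION_ID, RESOURCE_GROUP, FACTORY = "<subscription-id>", "adf-rg", "adf-demo"
adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

run_package = ExecuteSSISPackageActivity(
    name="RunSalesPackage",
    # Path of the package inside the SSISDB catalog: folder/project/package.
    package_location=SSISPackageLocation(
        package_path="SalesFolder/SalesProject/LoadSales.dtsx"
    ),
    # The Azure-SSIS IR that will run the package.
    connect_via=IntegrationRuntimeReference(
        type="IntegrationRuntimeReference", reference_name="AzureSsisIR"
    ),
)

adf.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY, "RunSsisPipeline",
    PipelineResource(activities=[run_package]),
)
```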
- For more information, check Run SSIS packages in Azure Data Factory
Q14: Which Data Factory activity can be used to get the list of all source files in a specific storage account and the properties of each file located in that storage?
Get Metadata activity.
- For more information, check How to use iterations and conditions activities in Azure Data Factory
Q15: Which Data Factory activities can be used to iterate through all files stored in a specific storage account, making sure that the files smaller than 1KB will be deleted from the source storage account?
- ForEach activity for iteration
- Get Metadata to get the size of all files in the source storage
- If Condition to check the size of the files
- Delete activity to delete all files smaller than 1KB
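The sketch below wires these activities together with the azure-mgmt-datafactory Python SDK (it also shows the Get Metadata activity from Q14). It assumes two hypothetical datasets: SourceFolderDataset pointing at the source folder, and SingleFileDataset parameterized by a fileName parameter; expression syntax and model signatures may vary slightly between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, GetMetadataActivity, ForEachActivity,
    IfConditionActivity, DeleteActivity, DatasetReference,
    Expression, ActivityDependency,
)

SUBSCRIPTION_ID, RESOURCE_GROUP, FACTORY = "<subscription-id>", "adf-rg", "adf-demo"
adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# 1. Get Metadata: list the files (childItems) in the source folder.
get_files = GetMetadataActivity(
    name="GetFiles",
    dataset=DatasetReference(type="DatasetReference", reference_name="SourceFolderDataset"),
    field_list=["childItems"],
)

# Dataset reference for a single file, parameterized by the current item's name.
per_file_dataset = DatasetReference(
    type="DatasetReference",
    reference_name="SingleFileDataset",
    parameters={"fileName": {"value": "@item().name", "type": "Expression"}},
)

# 2. Inside the loop: read the size of the current file.
get_size = GetMetadataActivity(name="GetFileSize", dataset=per_file_dataset, field_list=["size"])

# 3-4. If Condition: delete the file when it is smaller than 1 KB (1024 bytes).
delete_small = IfConditionActivity(
    name="DeleteIfSmall",
    expression=Expression(value="@less(activity('GetFileSize').output.size, 1024)"),
    if_true_activities=[DeleteActivity(name="DeleteFile", dataset=per_file_dataset)],
    depends_on=[ActivityDependency(activity="GetFileSize", dependency_conditions=["Succeeded"])],
)

# ForEach: iterate over the files returned by the Get Metadata activity.
for_each = ForEachActivity(
    name="ForEachFile",
    items=Expression(value="@activity('GetFiles').output.childItems"),
    activities=[get_size, delete_small],
    depends_on=[ActivityDependency(activity="GetFiles", dependency_conditions=["Succeeded"])],
)

adf.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY, "CleanupSmallFilesPipeline",
    PipelineResource(activities=[get_files, for_each]),
)
```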
- For more information, check How to use iterations and conditions activities in Azure Data Factory
Q16: Data Factory supports three types of triggers. Mention these types briefly
- The Schedule trigger that is used to execute the ADF pipeline on a wall-clock schedule
- The Tumbling window trigger that is used to execute the ADF pipeline on a periodic interval, and retains the pipeline state
- The Event-based trigger that responds to a blob related event, such as adding or deleting a blob from an Azure storage account
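For example, a Schedule trigger that runs a pipeline every hour can be created as follows (azure-mgmt-datafactory Python SDK, hypothetical pipeline and trigger names; note that triggers are created in a stopped state and must be started explicitly):

```python
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference,
)

SUBSCRIPTION_ID, RESOURCE_GROUP, FACTORY = "<subscription-id>", "adf-rg", "adf-demo"
adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# A wall-clock schedule: run DemoPipeline once every hour.
schedule = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Hour",
        interval=1,
        start_time=datetime(2021, 2, 1, tzinfo=timezone.utc),
        time_zone="UTC",
    ),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="DemoPipeline"
        )
    )],
)
adf.triggers.create_or_update(
    RESOURCE_GROUP, FACTORY, "HourlyTrigger", TriggerResource(properties=schedule)
)

# Activate the schedule; in recent SDK versions this is a begin_start poller.
adf.triggers.begin_start(RESOURCE_GROUP, FACTORY, "HourlyTrigger").result()
```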
- For more information, check How to schedule Azure Data Factory pipeline executions using Triggers
Q17: Any Data Factory pipeline can be executed using three methods. Mention these methods
- Under Debug mode
- Manual execution using Trigger now
- Using an added scheduled, tumbling window or event trigger
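The SDK equivalent of the manual "Trigger now" execution is the create_run call, which returns a run id that can then be checked through the pipeline runs API; a small sketch with hypothetical names:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID, RESOURCE_GROUP, FACTORY = "<subscription-id>", "adf-rg", "adf-demo"
adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Start a single on-demand run of the pipeline (the "Trigger now" equivalent).
run = adf.pipelines.create_run(RESOURCE_GROUP, FACTORY, "DemoPipeline", parameters={})
print("Run id:", run.run_id)

# Check the run status (Queued / InProgress / Succeeded / Failed / Cancelled).
status = adf.pipeline_runs.get(RESOURCE_GROUP, FACTORY, run.run_id)
print("Status:", status.status)
```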
Q18: Data Factory supports four types of execution dependencies between the ADF activities. Which dependency guarantees that the next activity will be executed regardless of the status of the previous activity?
Completion dependency.
Q19: Data Factory supports four types of execution dependencies between the ADF activities. Which dependency guarantees that the next activity will be executed only if the previous activity was not executed?
Skipped dependency.
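The dependency condition is set on the depends_on property of the next activity. The sketch below (azure-mgmt-datafactory Python SDK, using simple Wait activities as placeholders) shows both the Completed condition from Q18 and the Skipped condition from Q19:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource, WaitActivity, ActivityDependency

SUBSCRIPTION_ID, RESOURCE_GROUP, FACTORY = "<subscription-id>", "adf-rg", "adf-demo"
adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

first = WaitActivity(name="FirstStep", wait_time_in_seconds=1)

# Completed: runs whether FirstStep succeeds or fails.
always_runs = WaitActivity(
    name="RunsOnCompletion", wait_time_in_seconds=1,
    depends_on=[ActivityDependency(activity="FirstStep", dependency_conditions=["Completed"])],
)

# Skipped: runs only when FirstStep was never executed (for example, because
# an upstream dependency of FirstStep was not met).
runs_if_skipped = WaitActivity(
    name="RunsOnSkipped", wait_time_in_seconds=1,
    depends_on=[ActivityDependency(activity="FirstStep", dependency_conditions=["Skipped"])],
)

adf.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY, "DependencyDemoPipeline",
    PipelineResource(activities=[first, always_runs, runs_if_skipped]),
)
```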
- For more information, check Dependencies in ADF
Q20: From where can we monitor the execution of a pipeline that is executed under the Debug mode?
The Output tab of the pipeline; the Pipeline runs and Trigger runs views under the ADF Monitor window cannot be used to monitor Debug runs.