Microsoft Fabric Data Factory Interview Questions
Microsoft Fabric Data Factory Interview Questions – Master Data Pipelines, Dataflow Gen2, On-Prem connectivity, and migration strategies for Senior ETL Developers.
What are the top Fabric Data Factory interview questions?
The most common Microsoft Fabric Data Factory interview questions focus on the architectural differences between Data Pipelines (orchestration) and Dataflow Gen2 (transformation). Candidates are tested on optimizing ingestion using “Fast Copy,” handling On-Premises Data Gateways, and designing parameterized pipelines for dynamic execution.
If you are an ETL Developer or Data Engineer, preparing for Microsoft Fabric Data Factory interview questions is critical for your success. Fabric re-imagines data integration by combining the orchestration power of Azure Data Factory (ADF) with the user-friendly transformation logic of Power Query (Dataflow Gen2). Therefore, to succeed in senior interviews, you must demonstrate when to use each tool, how to optimize data movement, and how to migrate legacy SSIS packages efficiently.
This comprehensive guide provides 40 deep-dive questions organized into 6 modules. We have integrated insights from our Data Pipelines Guide to help you master production scenarios.
Module A: Fabric Data Factory Architecture
Understanding the dual nature of Fabric’s ETL tools is fundamental. These Microsoft Fabric Data Factory interview questions compare the two main engines.
Data Factory Engines
Beginner Q1: What is Data Factory in Fabric?
Fabric Data Factory is a unified data integration experience that combines Data Pipelines (based on Azure Data Factory) and Dataflow Gen2 (based on Power Query). It provides both code-free and code-first capabilities to ingest, prepare, and transform data at scale.
Intermediate Q2: Pipelines vs. Dataflow Gen2?
This is the most common question. Pipelines are for orchestration and high-speed data movement (Copy Activity). They handle control flow (If/Else, Loops) but do not transform data row-by-row internally. In contrast, Dataflow Gen2 is for data transformation. It allows you to clean, reshape, and merge data using a visual Power Query interface. Use Pipelines to move data; use Dataflows to change data.
Advanced Q3: When to use Spark vs. Dataflow?
Use Spark (Notebooks) for complex transformations on massive datasets (TB scale), unstructured data processing, or when Python/Scala libraries are needed. Use Dataflow Gen2 for low-code/no-code transformations, especially when the logic is already defined in Power Query or for small to medium-sized datasets where ease of maintenance is a priority.
Integration Patterns
Intermediate Q4: Can a Pipeline trigger a Dataflow?
Yes. A Data Pipeline can include a “Dataflow” activity. This allows you to orchestrate the execution. For example, a Pipeline can first copy raw data from an on-premises SQL Server to OneLake (Copy Activity) and then trigger a Dataflow Gen2 to clean and load that data into a Warehouse.
Advanced Q5: What is “Fast Copy”?
Fast Copy is a capability within Dataflow Gen2 that allows it to ingest large volumes of data rapidly. Instead of processing row-by-row during ingestion, it orchestrates a backend copy operation (similar to ADF Copy) to land data into the SQL endpoint or Lakehouse efficiently before applying transformations.
Intermediate Q6: Does Data Factory use Spark?
Under the hood, Dataflow Gen2 runs on Fabric’s managed compute: the Power Query (mashup) engine evaluates the transformations, backed by staging Lakehouse and Warehouse items. This allows it to scale better than Gen1 (which ran on shared Power BI capacity). The underlying compute is abstracted from the user, who interacts only with the Power Query UI.
Module B: Dataflow Gen2 ETL
Dataflow Gen2 is the evolution of Power Query in the cloud. These questions cover destinations, staging, and write-back.
Dataflow Capabilities
Intermediate Q7: What are output destinations?
Unlike Gen1, Dataflow Gen2 supports Output Destinations. You can write the transformed data directly to a Lakehouse, Warehouse, Azure SQL Database, or KQL Database. This makes it a true ETL tool, whereas Gen1 was often limited to internal storage for Power BI.
Intermediate Q8: What is “Staging” in Dataflow?
Staging is enabled by default. It loads query results into a staging Lakehouse before downstream transformation logic is applied. This improves performance for heavy operations such as merges and joins (the staged data can be folded by the Warehouse compute engine), but it can add latency for simple loads. You can disable staging for specific queries when folding directly against the source is preferred.
Advanced Q9: How to optimize Dataflow Gen2?
To optimize, ensure Query Folding happens as much as possible (pushing logic to the source DB). Use “Fast Copy” for ingestion. Avoid complex row-by-row custom functions. Furthermore, separate ingestion logic (Bronze) from transformation logic (Silver) into different dataflows. See our Dataflow Gen2 Guide.
Troubleshooting Dataflows
Advanced Q10: Fixing “Error 20302” (Internal Error)?
This generic error often indicates a schema mismatch or a timeout during the staging write. Fixes include checking for column name special characters, reducing batch sizes, or disabling staging for that specific query. See our specific fix for Dataflow Gen2 Error 20302.
Intermediate Q11: Handling incremental refresh?
Dataflow Gen2 supports incremental refresh if the output destination (like a Warehouse) supports it. However, typically you implement incremental logic by using parameters (RangeStart/RangeEnd) in the Power Query logic to filter source data based on the last execution time.
Intermediate Q12: Write-back limitations?
When writing to a Warehouse, Dataflow Gen2 performs a “Replace” or “Append” operation. It does not natively support “Upsert” (Merge) logic out of the box. For Upserts, you often need to load data into a staging table via Dataflow and then run a T-SQL MERGE statement via a Pipeline Script Activity.
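As a rough sketch, the staged-upsert pattern looks like this. It assumes a hypothetical staging table `stg.Customers` (loaded by the Dataflow) and a target table `dbo.Customers`; the statement would typically run from a Script Activity after the Dataflow activity succeeds, and it assumes `MERGE` is available on your target (otherwise an UPDATE plus INSERT pair achieves the same result).

```sql
-- Hypothetical staged upsert: merge rows landed by Dataflow Gen2 into the target table.
-- Table and column names are illustrative; run this from a Pipeline Script Activity.
MERGE dbo.Customers AS tgt
USING stg.Customers AS src
    ON tgt.CustomerID = src.CustomerID
WHEN MATCHED THEN
    UPDATE SET tgt.CustomerName = src.CustomerName,
               tgt.Email        = src.Email,
               tgt.ModifiedDate = src.ModifiedDate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, CustomerName, Email, ModifiedDate)
    VALUES (src.CustomerID, src.CustomerName, src.Email, src.ModifiedDate);
```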
Module C: Data Factory Pipelines & Activities
Pipelines act as the traffic controller. These questions focus on orchestration activities.
Orchestration Activities
Beginner Q13: What is the Copy Activity?
The Copy Activity is the primary engine for moving data. It supports 100+ connectors (AWS, GCP, Salesforce, SAP, etc.). It is highly optimized for throughput and does not transform data; it simply moves it from Source to Sink (Destination).
Intermediate Q14: Lookup vs. Get Metadata?
Lookup retrieves the actual data (rows) from a dataset (e.g., config table values). Get Metadata retrieves structural information about files (e.g., file name, size, last modified date). Use Get Metadata to iterate over a list of files in a folder.
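For instance, a Lookup activity frequently runs a small query like the sketch below against a control table (the `etl.IngestionControl` name and columns are purely illustrative); the returned rows then drive a ForEach loop, whereas Get Metadata would instead return file properties such as item name or last modified date.

```sql
-- Hypothetical Lookup query: return the list of source tables a pipeline should process.
SELECT SourceSchema,
       SourceTable,
       TargetTable
FROM   etl.IngestionControl   -- illustrative control table (see Q29)
WHERE  IsActive = 1;          -- only pick up enabled entries
```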
Advanced Q15: The “Script” Activity usage?
The Script Activity allows you to execute SQL (T-SQL) or DML commands against a Warehouse or Lakehouse SQL Endpoint. It replaces the older “Stored Procedure” activity. Use it to run `MERGE` statements, truncate tables, or update control logs.
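A minimal sketch of what a Script Activity body might contain, assuming a hypothetical staging table and audit table; the `@{...}` placeholders are pipeline dynamic content that Fabric resolves at runtime before the script reaches the SQL endpoint.

```sql
-- Hypothetical Script Activity body: reset a staging table and record the run in a control log.
-- stg.Customers and etl.PipelineRunLog are illustrative names.
TRUNCATE TABLE stg.Customers;

INSERT INTO etl.PipelineRunLog (RunId, PipelineName, StartedAt, Status)
VALUES ('@{pipeline().RunId}', '@{pipeline().Pipeline}', SYSUTCDATETIME(), 'Started');
```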
Control Flow
Intermediate Q16: ForEach vs. Switch Activity?
ForEach iterates over a collection (e.g., list of tables) and executes activities in parallel or sequentially. Switch evaluates an expression and executes one specific branch of activities based on the result (similar to a Case statement).
Intermediate Q17: Triggering Notebooks?
Use the “Notebook” activity to call a Fabric Spark Notebook. You can pass dynamic parameters from the pipeline to the notebook using the `parameters` cell toggle in the notebook settings. This enables reusing a single notebook for multiple datasets.
Advanced Q18: What is “Invoke Pipeline”?
The “Invoke Pipeline” activity allows a parent pipeline to trigger a child pipeline. This promotes modular design. For example, you can have a “Master Orchestrator” pipeline that calls separate “Ingest” and “Process” child pipelines.
Intermediate Q19: Web Activity usage?
The Web Activity is used to call REST APIs. In Fabric, it is often used to trigger external processes (e.g., Azure Functions, Logic Apps) or to call the Fabric REST API itself to manage workspace items dynamically.
Intermediate Q20: Validation Activity?
The Validation Activity ensures that a dataset (file or table) exists and meets criteria before the pipeline proceeds. It “waits” until the file appears or a timeout occurs. Use this to prevent jobs from failing when source files arrive late.
Module D: Connectivity & Gateways
Connecting to on-premises and private network data is a key interview topic. These questions cover gateways and security.
Gateway Architecture
Beginner Q21: What is an On-Premises Data Gateway?
An On-Premises Data Gateway acts as a bridge between your local network (behind a firewall) and the cloud (Fabric). It allows Data Factory to securely connect to on-prem SQL Servers, File Shares, or Oracle databases without opening inbound ports.
Intermediate Q22: On-Prem vs. VNet Data Gateway?
On-Prem Gateway requires installing software on a local VM/Server. VNet Data Gateway is a managed service that injects the gateway into your Azure Virtual Network (VNet). Use the VNet gateway to connect to Azure PaaS services (like SQL MI) that are secured behind Private Endpoints. See our Gateway Comparison Guide.
Advanced Q23: High Availability for Gateways?
To ensure High Availability (HA), you should install the On-Premises Gateway on multiple servers and cluster them. This removes the single point of failure. If one gateway node goes down, Fabric automatically routes traffic to the other active nodes in the cluster.
Security & Network
Intermediate Q24: How are credentials managed?
Credentials for connections are stored securely in the Fabric “Manage Connections and Gateways” setting. However, the best practice is to store secrets in Azure Key Vault and reference them in the connection definition, so passwords can be rotated centrally without editing every connection.
Advanced Q25: Troubleshooting Gateway connectivity?
Common issues include firewall blocking outbound traffic or outdated gateway versions. Ensure port 443 (HTTPS) and specific Azure Service Bus ports are open outbound. Use the “Diagnostics” tool in the Gateway app to test network connectivity.
Intermediate Q26: Connecting to AWS/GCP?
Fabric Data Factory includes native connectors for AWS S3 and Google Cloud Storage. You do not need a gateway for these public cloud services. However, you must manage authentication using Access Keys or Identity Federation.
Module E: Advanced Factory Scenarios
Senior roles require mastering dynamic pipelines. These Microsoft Fabric Data Factory interview questions cover parameters and error handling.
Dynamic Pipelines
Advanced Q27: Parameterizing a Pipeline?
You can define parameters at the pipeline level (e.g., `TableName`, `DateRange`). These parameters can be passed into activities dynamically using the expression language `@pipeline().parameters.TableName`. This allows one pipeline to serve multiple tables.
Advanced Q28: Dynamic Content Expressions?
Fabric supports a rich expression language. For example, to generate a file path dynamically based on the current date, you would use: `@concat('landing/', formatDateTime(utcnow(), 'yyyy/MM/dd'), '/file.csv')`.
Intermediate Q29: Metadata-Driven Ingestion?
This pattern involves storing a list of tables to copy in a control table (SQL or Excel). The pipeline first does a Lookup to get the list, then uses a ForEach loop to iterate through the list, invoking a parameterized Copy Activity for each table. This scales to hundreds of tables.
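A hedged sketch of what such a control table could look like (the `etl.IngestionControl` name, columns, and seed rows are illustrative and match the Lookup sketch in Q14, and an `etl` schema is assumed to exist); the ForEach passes each row’s columns into a parameterized Copy Activity.

```sql
-- Hypothetical control table driving metadata-based ingestion.
CREATE TABLE etl.IngestionControl
(
    SourceSchema  VARCHAR(128),
    SourceTable   VARCHAR(128),
    TargetTable   VARCHAR(128),
    IsActive      BIT,
    LastWatermark DATETIME2(3)   -- high-water mark used for incremental loads (see Q37)
);

INSERT INTO etl.IngestionControl (SourceSchema, SourceTable, TargetTable, IsActive, LastWatermark)
VALUES ('sales', 'Orders',    'stg_Orders',    1, NULL),
       ('sales', 'Customers', 'stg_Customers', 1, NULL);
```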
Error Handling
Advanced Q30: Implementing Retry Logic?
Most activities expose a “Retry” policy setting. You can configure it to retry, for example, 3 times with a 30-second interval. This handles transient network blips automatically. For logic failures, use the “On Failure” path to trigger alerting.
Intermediate Q31: Sending Alerts on Failure?
Connect the “On Failure” path of an activity to a Web Activity (calling a Logic App to send email/Teams) or an Office 365 Outlook activity (if supported in your region). This ensures the ops team is notified immediately.
Advanced Q32: Troubleshooting BOM characters?
CSV files sometimes contain invisible Byte Order Marks (BOM) that corrupt column headers. Fix: In the Copy Activity Source settings, confirm the encoding is set to “UTF-8” (not UTF-8 with BOM). See our fix for CSV Column Not Found Errors.
Module F: Migration & SSIS
Real-world migration strategies from legacy tools.
SSIS & ADF Migration
Advanced Q33: Migrating SSIS to Fabric?
Fabric does not support the SSIS Integration Runtime (unlike ADF). You cannot “Lift and Shift” SSIS packages directly. Strategy: You must rewrite the logic. Use Dataflow Gen2 for transformations and Pipelines for orchestration. For complex custom .NET scripts in SSIS, rewrite them in Spark Notebooks.
Intermediate Q34: Migrating ADF to Fabric?
While similar, there is no “1-click” migration tool yet. You typically recreate the pipelines manually. However, since the JSON definitions are similar, you can often copy the JSON code from ADF and paste it into the Fabric Pipeline JSON editor, making adjustments for unsupported activities (like SSIS).
Intermediate Q35: Migrating Dataflows Gen1 to Gen2?
If migrating from Power BI Dataflows (Gen1), you can export the `.pqt` (Power Query Template) file and import it into Fabric Dataflow Gen2. Remember to configure the new Output Destination to point to a Lakehouse or Warehouse.
Migration Strategy
Advanced Q36: Pipeline vs. Shortcut for Migration?
For initial migration, do not copy everything. Use Shortcuts to expose historical data in OneLake instantly. Only build Pipelines to copy “active” or “new” data. This reduces migration time and storage costs significantly.
Intermediate Q37: Handling CDC in Pipelines?
Pipelines support “Incremental Copy” using a watermark column (LastModifiedDate). For true Change Data Capture (log-based), consider using the Mirroring feature instead of building complex pipeline logic to track deletes and updates.
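As a sketch, the classic high-watermark pattern uses queries like the ones below (table, column, and activity names are illustrative): a Lookup reads the stored watermark, the Copy Activity source query pulls only newer rows, and a Script Activity advances the watermark once the copy succeeds.

```sql
-- 1) Lookup: read the last successfully loaded watermark for this table.
SELECT LastWatermark
FROM   etl.IngestionControl
WHERE  SourceTable = 'Orders';

-- 2) Copy Activity source query: pull only rows changed since that watermark.
--    The @{...} placeholder is pipeline dynamic content referencing the Lookup above.
SELECT *
FROM   sales.Orders
WHERE  LastModifiedDate > '@{activity('GetWatermark').output.firstRow.LastWatermark}';

-- 3) Script Activity after a successful copy: advance the watermark.
UPDATE etl.IngestionControl
SET    LastWatermark = (SELECT MAX(LastModifiedDate) FROM sales.Orders)
WHERE  SourceTable = 'Orders';
```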
Advanced Q38: Orchestrating across Workspaces?
A pipeline in Workspace A can trigger a Notebook in Workspace B if the calling identity has permissions. However, it is architecturally cleaner to keep orchestration local. If a cross-workspace dependency is needed, consider using Eventstream or Logic Apps as the glue.
Intermediate Q39: Cost Management for Pipelines?
Pipelines consume Capacity Units (CU). Copy Activity data movement is billed based on run duration and the intelligent throughput optimization used (the Fabric equivalent of ADF’s Data Integration Units). To optimize costs, avoid frequent small file movements; batch data into larger chunks where possible.
Advanced Q40: Version Control in Fabric?
Always connect your workspace to a Git repository (Azure DevOps or GitHub). This allows you to commit Pipeline and Dataflow definitions. You can then use Deployment Pipelines to promote these artifacts from Dev -> Test -> Prod, ensuring a robust SDLC process.



