Microsoft Fabric Data Engineering
Interview Questions
Master Spark Pools, V-Order optimization, Notebook development, and advanced troubleshooting with these Microsoft Fabric Data Engineering interview questions for Senior Engineers.
What are the top Fabric Data Engineering interview questions?
The most common Microsoft Fabric Data Engineering interview questions heavily focus on Spark Optimization techniques like V-Order, Z-Order, and Coalescing. Candidates are also tested on troubleshooting specific errors like the “Py4JJavaError” and debugging “Out of Memory” (OOM) issues in Spark Pools. Furthermore, you must understand the nuances of the Lakehouse architecture, specifically when to use managed vs. unmanaged tables.
Are you preparing for Microsoft Fabric Data Engineering interview questions? Succeeding in these interviews requires more than just knowing PySpark syntax; it demands a deep understanding of the Fabric compute runtime, library management, and performance tuning. Therefore, to secure a Senior Data Engineer or Spark Developer role, you must demonstrate the ability to debug complex production failures efficiently.
This guide provides 40 deep-dive questions organized into 6 modules. We have integrated solutions from our Fabric Notebooks Tutorial and performance guides to give you a competitive edge.
Module A: Spark Compute & Data Engineering Pools
Understanding how Fabric allocates compute is essential for cost management and performance. These Microsoft Fabric Data Engineering interview questions cover Starter Pools versus Custom Pools.
Pool Architecture
Beginner Q1: What is a Fabric Starter Pool?
In Azure Synapse, Spark Pools typically took 3-5 minutes to start. In contrast, Fabric Starter Pools use pre-provisioned, shared compute managed by Microsoft. This architecture allows them to drastically reduce cluster startup latency, often initializing in less than 10 seconds. Consequently, you do not need to provision dedicated hardware manually; instead, you simply attach a workspace to a capacity.
Intermediate Q2: Starter vs. Custom Pools?
Fabric offers two compute configurations. Starter Pools are fast-start, default clusters that are best for ad-hoc exploration and dev/test environments. On the other hand, Custom Pools allow you to define node size (Small to XX-Large), auto-scale limits, and libraries. Therefore, Custom Pools are best for production ETL jobs requiring resource isolation.
Advanced Q3: How do Single Node Pools work?
For smaller workloads, Fabric supports Single Node pools where the Spark Driver and Executor run on the same VM. This eliminates network shuffle overhead and significantly reduces costs. However, it is limited by the memory of a single machine; consequently, it is not suitable for TB-scale data processing.
Configuration & Limits
Intermediate Q4: How does “High Concurrency” mode work?
High Concurrency allows multiple Notebook sessions to share the same Spark session (and resources). This reduces startup overhead for users running small queries. However, it does not guarantee workload isolation. As a result, it is generally not recommended for heavy production ETL pipelines where one job could starve others.
Advanced Q5: What causes “Cold Start” vs “Warm Start”?
Cold Start occurs when the Spark session initializes from scratch, allocating new VMs from Azure. In Custom Pools, this can take several minutes. In contrast, Warm Start occurs when a session is already active or when using Starter Pools with pre-warmed nodes. Fabric keeps sessions alive for a configurable “Time to Live” (TTL) to allow rapid subsequent query execution.
Advanced Q6: How does Fabric throttle Spark jobs?
Fabric uses a capacity-based throttling model known as “Smoothing.” Usage is averaged over time. If your interactive operations exceed capacity limits, Fabric may delay execution. Furthermore, background jobs (ETL) are smoothed over 24 hours, meaning momentary spikes won’t immediately fail your pipeline unless the sustained load is too high.
Module B: Notebooks & Development
Notebooks are the primary tool for Data Engineers. These questions test your familiarity with the Fabric developer experience.
Dev Environment
Beginner Q7: What is MSSparkUtils?
MSSparkUtils is a built-in utility library available in Fabric notebooks. It provides utilities that standard PySpark lacks. For example, you can use mssparkutils.fs.mount() to mount OneLake paths as local file systems, or mssparkutils.credentials.getSecret() to securely retrieve Key Vault secrets. Additionally, it includes mssparkutils.notebook.run() to chain notebooks together.
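A minimal sketch of these utilities run inside a Fabric notebook; the ABFS path, mount point, child notebook name, and parameters are hypothetical placeholders.

```python
# Mount a OneLake path so it can be read like a local file system
# (source path and mount point are placeholders)
mssparkutils.fs.mount(
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Files",
    "/lakehouse_files"
)
print(mssparkutils.fs.getMountPath("/lakehouse_files"))

# Chain another notebook in the same workspace and pass parameters
result = mssparkutils.notebook.run("nb_load_customers", 600, {"run_date": "2024-01-01"})
print(result)
```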
Intermediate Q8: Fabric Runtime vs. Databricks Runtime?
Fabric Runtime is a curated set of open-source packages (Spark, Delta, Python) managed by Microsoft. It is updated periodically but is generally less flexible than Databricks. In contrast, Databricks Runtime is highly optimized, proprietary (Photon engine), and often offers bleeding-edge Spark features before they hit open source.
Intermediate Q9: How to use VS Code with Fabric?
Fabric supports a “VS Code for the Web” integration directly in the browser. Furthermore, you can use the Synapse VS Code Extension on your desktop. This allows you to write code locally with full IntelliSense and Git integration, then execute it against the remote Fabric Spark cluster.
Library Management
Intermediate Q10: Environments vs. Inline Libraries?
Inline Installation involves using %pip install inside a notebook. While good for testing, it adds installation time to every run. Alternatively, Environments are reusable artifacts where you define libraries (PyPI/Conda) once. Attaching an Environment to a workspace ensures consistent versions and faster startup times across all jobs.
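For illustration, an inline install at the top of a notebook (the package and version are just examples); in production the same library would be pinned once in an Environment artifact instead.

```python
# Inline install: adds startup cost to every run and is scoped to this session only
%pip install openpyxl==3.1.2

import openpyxl  # available for the current session
```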
Advanced Q11: Securely handling secrets?
You should never hardcode passwords in notebook cells. Instead, the correct approach is to create an Azure Key Vault, link it to the Fabric Workspace via Cloud Connections, and then use mssparkutils.credentials.getSecret() to retrieve the value at runtime. This ensures credentials never appear in source control.
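A hedged sketch of this pattern; the Key Vault URL, secret name, and server name are placeholders.

```python
# Retrieve a secret from a linked Azure Key Vault at runtime
jdbc_password = mssparkutils.credentials.getSecret(
    "https://my-keyvault.vault.azure.net/", "sql-admin-password"
)

# Use the secret without ever writing it to the notebook or source control
jdbc_url = f"jdbc:sqlserver://myserver.database.windows.net;password={jdbc_password}"
```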
Beginner Q12: How to reference other notebooks?
You can use the %run magic command to include functions from another notebook in the same workspace. This promotes code reusability. For example, you can create a common utility notebook for logging or date formatting and call it from your main ETL notebooks.
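A tiny illustration, assuming a hypothetical utility notebook named nb_common_utils that defines a log_step() helper.

```python
# Pull in functions defined in a sibling notebook (name is a placeholder)
%run nb_common_utils

# Functions defined in nb_common_utils are now in scope
log_step("Starting daily load")
```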
Module C: Optimization & Performance
Performance tuning separates junior engineers from lead engineers. These Microsoft Fabric Data Engineering interview questions focus on V-Order and Shuffle.
Write Optimization
Intermediate Q13: What is V-Order?
V-Order is a write-time optimization that sorts and compresses Parquet files to make them highly efficient for Power BI Direct Lake reads. However, it introduces additional CPU and write latency overhead. Therefore, you should disable it for write-heavy staging layers but enable it for reporting layers. See our Direct Lake Optimization Guide.
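A sketch of toggling V-Order at the session level; df_staging and df_gold are placeholder DataFrames, and the exact property name should be verified against your Fabric runtime version.

```python
# Disable V-Order for a write-heavy staging session
spark.conf.set("spark.sql.parquet.vorder.enabled", "false")
df_staging.write.mode("append").saveAsTable("stg_sales")

# Re-enable it for reporting-layer tables consumed by Direct Lake
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
df_gold.write.mode("overwrite").saveAsTable("gold_sales_summary")
```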
Intermediate Q14: Solving the “Small File Problem”?
Frequent inserts create thousands of tiny files, killing read performance. Fabric has “Auto-Compaction” enabled by default. However, for massive datasets, you must manually run the OPTIMIZE command or use coalesce() before writing to merge files.
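A short sketch of both fixes; the table name and target partition count are illustrative.

```python
# Reduce the number of output files before writing
df.coalesce(8).write.mode("append").saveAsTable("silver_events")

# Compact existing small files into larger ones
spark.sql("OPTIMIZE silver_events")
```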
Advanced Q15: Z-Order vs V-Order?
V-Order optimizes file compression for the VertiPaq engine (Power BI). In contrast, Z-Order is a data layout optimization that colocates related data to enable data skipping during queries. You can use both together to maximize performance for specific filter columns.
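For example, a Z-Order compaction on a frequently filtered column (table and column names are hypothetical); V-Order itself is applied at write time, as shown in the session setting above.

```python
# Compact files and colocate rows by a common filter column to enable data skipping
spark.sql("OPTIMIZE gold_sales_summary ZORDER BY (CustomerRegion)")
```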
Read Optimization
Advanced Q16: Tuning Shuffle Partitions?
The default spark.sql.shuffle.partitions is 200. For TB-scale data, this causes spill-to-disk. For small data, it causes overhead. A good rule of thumb is to size partitions so each task processes ~128MB – 200MB. See our Shuffle Optimization Guide.
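A back-of-the-envelope sketch of that rule of thumb; the shuffle volume estimate is illustrative.

```python
# Size shuffle partitions so each task handles roughly 128-200 MB
shuffle_input_gb = 500            # estimated shuffle data volume (illustrative)
target_mb_per_task = 200
num_partitions = int(shuffle_input_gb * 1024 / target_mb_per_task)

spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
```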
Advanced Q17: Handling Data Skew?
Skew happens when one partition is much larger than others. Common fixes include: 1) Salting keys to redistribute data. 2) Using Broadcast Joins for small tables. 3) Enabling Adaptive Query Execution (AQE), which Fabric supports by default.
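A compact sketch of fixes 1 and 3, assuming hypothetical fact_df and dim_df DataFrames joined on a join_key column.

```python
from pyspark.sql import functions as F

# AQE is on by default in Fabric; these settings make skew handling explicit
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Salting sketch: spread a hot key across N buckets before joining
N = 16
salted_fact = fact_df.withColumn("salt", (F.rand() * N).cast("int"))
salted_dim = dim_df.crossJoin(spark.range(N).withColumnRenamed("id", "salt"))
joined = salted_fact.join(salted_dim, ["join_key", "salt"])
```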
Intermediate Q18: Cache vs Persist?
Use cache() (which for DataFrames defaults to MEMORY_AND_DISK) when you reuse a DataFrame multiple times in a notebook. However, if the data is too large for executor memory, use persist(StorageLevel.DISK_ONLY) to skip caching in RAM and avoid OOM errors.
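A minimal illustration; table names are placeholders.

```python
from pyspark import StorageLevel

# Reused several times downstream: keep in memory if it fits
df_small = spark.table("silver_customers").cache()

# Too large for executor memory: spill-safe persistence
df_large = spark.table("silver_events").persist(StorageLevel.DISK_ONLY)

df_small.count()   # an action materializes the cache
```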
Advanced Q19: Broadcast Joins?
When joining a large table with a small table (lookup), use broadcast(small_df). This sends a copy of the small table to every executor, eliminating the need for a shuffle operation on the large table. Consequently, join performance improves dramatically.
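A small sketch of the pattern; the table and key names are hypothetical.

```python
from pyspark.sql.functions import broadcast

fact = spark.table("gold_sales")     # large table
dim = spark.table("dim_product")     # small lookup table

# Ship the small table to every executor; no shuffle of the large side
result = fact.join(broadcast(dim), on="ProductKey", how="left")
```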
Intermediate Q20: Optimizing Delta Merges?
Merge operations can be slow if the target table is not optimized. To improve performance, ensure the join condition matches the partition key of the target table. Additionally, run OPTIMIZE periodically to compact files before merging.
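For reference, a basic Delta merge using the DeltaTable API; the target table, updates_df source, and join column are placeholders.

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver_customers")

(target.alias("t")
    .merge(updates_df.alias("s"), "t.CustomerID = s.CustomerID")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())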
Module D: Lakehouse Architecture
Understanding storage formats and integration is key. These questions explore the nuances of Delta Lake.
Table Formats
Beginner Q21: Managed vs. Unmanaged Tables?
Managed Tables: Spark controls both the metadata and the physical data. Dropping the table deletes the data. Unmanaged Tables: Spark controls only the metadata. Dropping the table leaves the physical files intact. Unmanaged tables are preferred for the Bronze layer where data might be shared with other tools.
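A sketch of how each is created from a notebook; table names and paths are illustrative, and the external LOCATION may need a full ABFSS path depending on your setup.

```python
# Managed table: Spark owns data and metadata; DROP TABLE deletes the files
df.write.format("delta").saveAsTable("silver_orders")

# Unmanaged (external) table: metadata only; DROP TABLE leaves files in place
df.write.format("delta").save("Files/bronze/orders")
spark.sql("CREATE TABLE bronze_orders USING DELTA LOCATION 'Files/bronze/orders'")
```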
Intermediate Q22: Delta Lake vs. Iceberg in Fabric?
Fabric uses Delta Lake as its primary open table format. While Iceberg is supported via shortcuts, Fabric’s “Direct Lake” mode is specifically optimized to read Delta Logs for high-performance BI. For a detailed comparison, see Iceberg vs Delta Lake.
Advanced Q23: Spark Integration with Direct Lake?
When Spark writes to a Delta table, it updates the Delta Log. Power BI Direct Lake reads this log. To ensure consistency, Spark jobs should minimize “small file” creation, as excessive metadata can slow down the Direct Lake framing process.
Maintenance
Intermediate Q24: What does VACUUM do?
VACUUM removes old versions of Parquet files that are no longer needed by the Delta log. This reclaims storage space. However, running VACUUM prevents you from using Time Travel to versions older than the retention period (default 7 days).
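For example, keeping the default 7-day retention (the table name is a placeholder):

```python
# Remove data files no longer referenced by the Delta log and older than 7 days
spark.sql("VACUUM silver_customers RETAIN 168 HOURS")
```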
Intermediate Q25: Partitioning Strategy?
Do not partition small tables (under 10GB). For larger tables, choose a column with low cardinality (e.g., Year or Month). Avoid high cardinality columns like UserID or DateTime, as this creates millions of tiny partitions, severely degrading performance.
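A small illustration of partitioning on low-cardinality columns; the columns and table name are hypothetical.

```python
# Low-cardinality partition columns keep the file and folder count manageable
(df.write
   .partitionBy("Year", "Month")
   .mode("append")
   .saveAsTable("gold_sales_by_month"))
```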
Advanced Q26: Schema Evolution?
Delta Lake supports schema evolution. By setting .option("mergeSchema", "true"), Spark can automatically add new columns to the table schema when writing data. This is essential for handling drifting data sources.
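A minimal sketch; df_new and the table name are placeholders.

```python
# Allow new columns arriving from the source to be added to the target schema
(df_new.write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .saveAsTable("silver_events"))
```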
Module E: Troubleshooting
This is where Senior Engineers are tested. Use the STAR method to answer these failure scenarios.
Common Errors
Advanced Q27: Fixing “Py4JJavaError”?
This generic error usually masks file locking or schema issues. Fix: Check if the destination path is locked by another stream. Additionally, verify schema consistency between the DataFrame and the Delta table. See Py4JJavaError Fix Guide.
Advanced Q28: Debugging OOM Errors?
Driver OOM: Usually caused by collect() fetching too much data to the driver. Fix: Keep the work distributed and write results to the Lakehouse instead, as shown below.
Executor OOM: Caused by large partitions or inefficient joins. Fix: Increase shuffle partitions or upgrade to a Memory Optimized node size.
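A small illustration of the driver-side fix; the table name and sample size are placeholders.

```python
# Anti-pattern: materializes the whole DataFrame on the driver
# rows = df.collect()

# Safer: keep the work distributed and write results to the Lakehouse
df.write.mode("overwrite").saveAsTable("gold_results")

# If you only need a peek for debugging, bound the amount fetched
sample = df.limit(100).toPandas()
```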
Intermediate Q29: “Container killed by YARN”?
This error typically means an executor exceeded its memory overhead limit. It often happens when processing large records (e.g., XML/JSON blobs) or when UDFs create high object churn. Fix: Increase spark.executor.memoryOverhead.
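One possible way to apply this, assuming the session-level %%configure magic with a conf block; the value is illustrative, and in practice you may instead set the property on a Custom Pool or Environment.

```python
%%configure -f
{
    "conf": {
        "spark.executor.memoryOverhead": "4g"
    }
}
```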
Intermediate Q30: AnalysisException: Path not found?
This often occurs with Shortcuts. If the source file in ADLS/S3 is deleted or renamed, the Shortcut becomes a broken link. Fix: Validate the source path exists and that the identity executing the notebook has permissions to access it.
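A quick validation sketch before reading through a Shortcut; the path is a placeholder.

```python
shortcut_path = "Files/external/sales_raw"

# Fails fast with a clear message if the Shortcut target is missing or inaccessible
try:
    mssparkutils.fs.ls(shortcut_path)
except Exception as e:
    raise RuntimeError(f"Shortcut target missing or inaccessible: {shortcut_path}") from e

df = spark.read.parquet(shortcut_path)
```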
Monitoring
Intermediate Q31: Monitoring Spark Jobs?
Use the Monitoring Hub in Fabric. It provides a unified view of the Spark UI, logs, and resource usage (CU consumption). You can drill down into specific stages (DAG visualization) to find bottlenecks like long-running stages or skew.
Advanced Q32: What is “Spill to Disk”?
Spill occurs when there is not enough RAM to hold data during a shuffle, forcing Spark to write temporary data to disk. This severely slows down the job. Fix: Increase the number of shuffle partitions to reduce the size of each task.
Module F: Scenarios & Architecture
Real-world scenario-based Microsoft Fabric Data Engineering interview questions.
Advanced Scenarios
Advanced Q33: Checkpointing in Fabric?
For Structured Streaming, you must specify a checkpoint location in OneLake (/Files/checkpoints/). This allows the job to recover from failure. Warning: Ensure the path is persistent (Unmanaged area) to avoid accidental deletion during table maintenance.
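A minimal streaming sketch with a persistent OneLake checkpoint; the source table, target table, and checkpoint path are illustrative.

```python
# Structured Streaming write with a checkpoint under Files/checkpoints/
stream_df = (spark.readStream
    .format("delta")
    .table("bronze_events"))

query = (stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "Files/checkpoints/bronze_to_silver")
    .outputMode("append")
    .toTable("silver_events"))
```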
Advanced Q34: Agentic Data Engineering?
This is an emerging pattern where AI agents (using Copilot or custom LLMs) autonomously monitor, validate, and fix data pipelines. Fabric’s integration with Azure OpenAI allows you to build agents that can interpret error logs and trigger self-healing jobs. See our Agentic Tutorial.
Intermediate Q35: CI/CD for Notebooks?
Fabric integrates with Azure DevOps via Git. You should commit your Notebooks to a Git repository. Use Deployment Pipelines to promote content from Dev to Test to Prod workspaces. Best Practice: Use parameterized notebooks so the same code can run against different Lakehouses (Dev vs Prod).
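A sketch of the parameterization pattern; notebook names, parameter names, and values are hypothetical.

```python
# --- Parameter cell in the child notebook (defaults for Dev; overridden at runtime) ---
lakehouse_name = "LH_Dev"
run_date = "2024-01-01"

# --- From an orchestration notebook, pass environment-specific values ---
mssparkutils.notebook.run(
    "nb_daily_load", 1800,
    {"lakehouse_name": "LH_Prod", "run_date": "2024-06-30"}
)
```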
Advanced Q36: Handling GDPR Deletions?
In a Delta Lake, deleting a row (DELETE FROM) is logical; the physical data remains in old Parquet files until you run VACUUM. Therefore, for GDPR compliance, you must run a DELETE followed by a VACUUM command to ensure the data is physically erased.
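A compact sketch of the two-step erase; the table, key, and retention window are placeholders.

```python
# Logical delete: rows disappear from queries but remain in old Parquet files
spark.sql("DELETE FROM silver_customers WHERE CustomerID = '12345'")

# Physical erase: remove files outside the retention window (default 7 days);
# shortening retention below the default requires relaxing Delta's retention check
spark.sql("VACUUM silver_customers RETAIN 168 HOURS")
```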
Advanced Q37: Unit Testing Spark Code?
You can use pytest or unittest libraries within Fabric Notebooks. However, a better approach is to modularize your logic into Python files (Libraries) and unit test them locally in VS Code before deploying them to the Fabric Environment.
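A local pytest sketch under that approach; transformations.add_net_amount is a hypothetical module and function under test, assumed to compute net = gross * (1 - tax_rate).

```python
# test_transformations.py - run locally with pytest before deploying the library
import pytest
from pyspark.sql import SparkSession
from transformations import add_net_amount  # hypothetical module under test

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()

def test_add_net_amount(spark):
    df = spark.createDataFrame([(100.0, 0.2)], ["gross", "tax_rate"])
    result = add_net_amount(df).collect()[0]
    assert result["net"] == pytest.approx(80.0)
```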
Architecture Decisions
Advanced Q38: Spark vs. T-SQL in Fabric?
Use Spark for heavy ETL, complex transformations, unstructured data processing, and machine learning. In contrast, use T-SQL (Warehouse) for serving data to BI tools, enforcing security (RLS), and handling strict schema-on-write scenarios. Often, Spark does the prep, and SQL does the serving.
Intermediate Q39: Medallion Arch with Spark?
Bronze: Raw ingestion (append-only). Silver: Cleaned, deduplicated, and validated data (Merge). Gold: Aggregated business-level tables (Overwrite). Spark is typically used to move data from Bronze to Silver, while Silver to Gold can be Spark or SQL stored procedures.
Advanced Q40: Optimizing Costs?
Monitor your “Smoothing” metrics. If your interactive jobs are causing throttling, move heavy batch jobs to off-peak hours. Additionally, use Starter Pools for dev to avoid paying for idle startup time, and aggressively auto-scale down Custom Pools.
Ready for SQL?
Now that you have mastered these Microsoft Fabric Data Engineering interview questions, let’s explore Data Warehousing.
Start Warehouse Questions →



