Transform Data Using Notebooks – Microsoft Fabric Tutorial Series

Transform Data Using Notebooks

Definitive guide: practical patterns, production-ready recipes, UI practices for collaboration, performance tuning, security, and an FAQ for transforming data using notebooks in Microsoft Fabric.

Part of the Microsoft Fabric Tutorial Series
Read time: ~10–14 minutes

Why notebooks in Fabric – Transform Data Using Notebooks

Microsoft Fabric notebooks provide an interactive, narrative-first surface that combines code, visualizations, and documentation while running on Spark for distributed scale. They let data engineers, analysts, and scientists iterate rapidly, debug transformations, and persist outcomes directly into the Fabric Lakehouse without context switching. This model reduces friction between exploration and production by enabling parameterization, scheduling, and integration with Fabric pipelines.

In short: choose notebooks when you need iterative development, reproducible transformations, and deep integration with Lakehouse storage and Fabric compute.

Quick start: create, attach, read – Transform Data Using Notebooks

Create notebook

In your Fabric workspace, choose New → Notebook. Pick a kernel (Python, Spark SQL, Scala) and attach Spark compute. Name the notebook with context: domain_action_version (e.g., sales_transform_v1).

Attach compute & libraries

Attach to a job or interactive Spark cluster. Install or attach libraries via the UI (PyPI, Maven) or reference wheel/jar artifacts for consistent environments.
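
For ad-hoc needs, Fabric notebooks also support inline, session-scoped installs with %pip; the package below is only an example, and a shared environment is the better choice for reproducible team setups:

# session-scoped install; applies only to the current Spark session
%pip install openpyxl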

Read Delta table (Python)
df = spark.read.format("delta").load("abfss://lakehouse@yourstorage.dfs.core.windows.net/curated/sales")
df.printSchema()
df.show(5)

Switch languages inline with magic commands (e.g., %%sql) for Spark SQL blocks. Persist transformed data back to Delta for downstream consumption and lineage.
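
A minimal sketch of that flow, assuming the DataFrame from the previous cell and a region column in the data (paths and names are placeholders; each snippet below is its own notebook cell):

# PySpark cell: expose the DataFrame to SQL cells
df.createOrReplaceTempView("sales_raw")

# separate Spark SQL cell (the %%sql magic must be the first line of its own cell)
%%sql
SELECT region, COUNT(*) AS order_count
FROM sales_raw
GROUP BY region

# PySpark cell: persist the transformed result back to Delta for downstream consumption
df.write.format("delta").mode("overwrite").save("/mnt/lake/curated/sales_clean")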

Related reading: review your Lakehouse setup for best practices before heavy transforms: Fabric Lakehouse Tutorial.

Transformation patterns and code – Transform Data Using Notebooks

Below are repeatable, production-safe patterns used when you transform data using notebooks in Microsoft Fabric. Each pattern includes intent, code, and practical notes.

1. Schema enforcement and cleaning

Intent: prevent silent schema drift. Explicit casting, null handling, and validation reduce downstream failures.

# enforce types and capture bad rows
from pyspark.sql.types import IntegerType, DoubleType
from pyspark.sql.functions import col

df = df.withColumn("quantity", col("quantity").cast(IntegerType()))
df = df.withColumn("price", col("price").cast(DoubleType()))

bad = df.filter(col("quantity").isNull() | col("price").isNull())
bad.write.format("delta").mode("append").save("/mnt/lake/errors/sales_schema_issues")

Note: write errors to a small errors table for review and remediation. Keep your schema contract in source control.
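
A minimal sketch of such a contract, assuming the column names used above; keep this definition in source control and apply it at read time:

# schema contract kept in source control and applied on read
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

sales_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("quantity", IntegerType(), nullable=True),
    StructField("price", DoubleType(), nullable=True),
])

# enforce the contract at read time instead of relying on schema inference
raw = spark.read.schema(sales_schema).option("header", "true").csv("/mnt/raw/sales/")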

2. Idempotent transforms & upserts

Intent: make notebooks safe to rerun. Use Delta merge for upserts, and avoid append-only writes for incremental pipelines.

# delta upsert (merge)
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/lake/curated/sales")
updates = transformed_df.alias("u")
target.alias("t").merge(updates, "t.key = u.key") \
  .whenMatchedUpdateAll() \
  .whenNotMatchedInsertAll() \
  .execute()

3. Window functions & rolling metrics

Intent: generate time-based features and trends used for analytics and ML.

from pyspark.sql.window import Window
from pyspark.sql.functions import avg

w = Window.partitionBy("product").orderBy("date").rowsBetween(-6,0)
df = df.withColumn("rolling_avg_7", avg("sales").over(w))

4. Feature engineering & reproducible feature tables

Intent: produce deterministic features persisted as versioned Delta tables so training and scoring use identical data.

# example feature: binary flag and bucketing
from pyspark.sql.functions import when

df = df.withColumn("is_top_region", when(df.region == "North America",1).otherwise(0))
df.write.format("delta").mode("overwrite").option("partitionBy","date").save("/mnt/lake/features/sales_v1")

Best practice: separate exploratory cells from the canonical transformation pipeline; the bottom of the notebook should contain the deterministic pipeline that writes the final table.
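
A minimal sketch of that layout; clean_sales and add_features stand in for helper functions you would define earlier in the notebook, and the paths are placeholders:

# --- canonical pipeline: deterministic steps only, kept at the bottom of the notebook ---
def run_pipeline(run_date: str) -> None:
    # read the raw partition for the given run date (placeholder path)
    raw = spark.read.format("delta").load(f"/mnt/lake/raw/sales/date={run_date}")

    # reuse the transformations exercised in the exploratory cells above
    cleaned = clean_sales(raw)        # hypothetical helper defined earlier
    features = add_features(cleaned)  # hypothetical helper defined earlier

    # single, final write of the curated output
    features.write.format("delta").mode("overwrite").save("/mnt/lake/curated/sales")

run_pipeline(run_date="2025-10-20")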

Performance & optimization – Transform Data Using Notebooks

When you transform large datasets, optimize to reduce compute cost and time. Apply these practical rules while developing in notebooks.

  • Filter early — push predicates before joins to reduce scanned data.
  • Partition on query patterns — choose partition columns (date, region) that align with filters.
  • Broadcast small tables — use broadcast(dim_df) for small dimension joins to reduce shuffle.
  • Cache judiciously — persist intermediate DataFrames only when reused multiple times.
  • Avoid wide transformations without planning — monitor shuffle through Job UI and adjust partition sizes.

# example: broadcast join
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_dim), "dim_key")

Measure with Spark UI and tune executor sizes. For repeatable ETL, choose a job cluster profile optimized for batch throughput rather than interactive latency.
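
For example, filtering on the partition column at read time lets Spark prune partitions and columns before any join or shuffle (a sketch with placeholder paths and columns):

# filter early: predicates on the partition column prune files before any join or shuffle
sales = (spark.read.format("delta").load("/mnt/lake/curated/sales")
              .filter("date >= '2025-10-01'")                       # partition pruning
              .select("order_id", "product", "region", "price"))    # column pruning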

Top UI / UX notebook practices – Transform Data Using Notebooks

A polished notebook improves team onboarding and collaboration. Apply these UI-focused practices inside Fabric notebooks to make them readable and maintainable.

  • Header block — include purpose, inputs, outputs, usage, and quick run commands at the very top.
  • Mini-TOC — add intra-notebook anchors for fast navigation.
  • Collapse helpers — keep helper functions collapsed and expose only pipeline steps.
  • Annotation — annotate non-obvious business rules and data assumptions with markdown and small diagrams.
  • Pin results — pin important visuals or tables for reviewers and dashboards.

Visual polish tips: use a consistent color palette, keep charts small for EDA, and include a short interpretation line under every chart explaining the insight.

Orchestration, scheduling, CI – Transform Data Using Notebooks

Notebooks are great interactively; to run them reliably in production, wrap them in Fabric Data Pipelines and apply CI practices.

  1. Parameterize — accept run_date, mode, and source path as parameters so the same notebook runs for many partitions (a minimal sketch follows this list).
  2. Pipeline activity — add the Notebook activity to a pipeline and supply parameters from pipeline variables.
  3. Schedule and alert — schedule runs (daily/hourly), capture logs, and configure alerts for failures.
  4. CI & testing — keep notebooks in source control; use small integration tests that run on sample data in PR pipelines.

Automation example: nightly pipeline triggers the transform notebook which merges results into the curated Delta table and emits success metrics to monitoring.

For pipeline patterns and orchestration details, see our companion guide: Data Pipelines in Fabric.

Security, governance & observability

Enterprise-grade governance is essential. Use Microsoft Entra for identity and role-based access control, enforce table-level and column-level security, and enable audit logging. Mask PII at transform time and store a data catalog with lineage metadata for discoverability.
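
A minimal sketch of masking at transform time, assuming hypothetical email and customer_name columns; hashing keeps a joinable surrogate while redaction removes the value entirely:

# mask PII before writing the shared curated table
from pyspark.sql.functions import sha2, col, lit

masked = (df
          .withColumn("email_hash", sha2(col("email"), 256))   # keep a joinable surrogate
          .drop("email")                                       # drop the raw value
          .withColumn("customer_name", lit("REDACTED")))       # or drop the column entirely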

  • RBAC — limit write/modify privileges to transformation authors and CI jobs.
  • Column masking — mask or drop sensitive columns before sharing curated outputs.
  • Lineage — register datasets and their upstream notebook jobs in your governance catalog for compliance.
  • Observability — capture run metadata, durations, and row counts; alert on anomalies in volume or latency (a minimal sketch follows this list).
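
A minimal sketch of capturing run metrics to a small Delta table; the metrics path, job name, and curated_df are assumptions for illustration:

# record simple run metrics so anomalies in volume are easy to spot
from pyspark.sql.functions import current_timestamp

row_count = curated_df.count()   # curated_df: the DataFrame you just wrote

metrics = spark.createDataFrame(
    [(row_count, "sales_transform_v1")],
    ["row_count", "job_name"],
).withColumn("run_ts", current_timestamp())

metrics.write.format("delta").mode("append").save("/mnt/lake/ops/run_metrics")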

Reproducibility: snapshot the notebook environment (packages, versions) and use versioned Delta tables for repeatable re-runs.

Production recipes & use cases

Practical recipes you can reuse immediately when you transform data using notebooks in Microsoft Fabric.

Recipe A — Incremental ingest + dedupe + curated merge

# 1. read incremental files
from pyspark.sql.functions import current_date

raw = spark.read.option("header","true").csv("/mnt/raw/sales/2025-10-20/*")

# 2. normalize and dedupe
clean = (raw.withColumn("quantity", raw.quantity.cast("int"))
            .dropDuplicates(["order_id"])
            .withColumn("ingest_date", current_date()))

# 3. merge to curated
from delta.tables import DeltaTable
target = DeltaTable.forPath(spark, "/mnt/lake/curated/sales")
target.alias("t").merge(clean.alias("u"), "t.order_id = u.order_id") \
  .whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

Recipe B — Feature table for ML

# compute features and persist as versioned table
from pyspark.sql.functions import sum as sum_, avg

features = (transactions.groupBy("user_id")
             .agg(sum_("amount").alias("total_spend"), avg("amount").alias("avg_spend")))
features.write.format("delta").mode("overwrite").save("/mnt/lake/features/purchase_v1")

Use these as templates; parameterize paths and table names for reuse across environments.

Frequently asked questions

Is the code safe to run as-is?

The code samples illustrate patterns and must be adapted to your environment. Replace storage paths, table names, and credentials with your project-specific values before executing.

Will running a notebook in Fabric incur costs?

Yes. Notebook execution uses Spark compute resources and storage I/O; cost depends on cluster size, runtime duration, and data scanned. Optimize with partition pruning and appropriate cluster sizing to reduce costs.

Can I version-control notebooks?

Yes. Store notebooks in Git, use CI to validate changes, and prefer lightweight test datasets for PR validation. Keep deterministic transformation logic separate from exploratory cells for cleaner versioning.

How do I make runs idempotent and repeatable?

Use Delta merge for upserts, write partitioned and versioned Delta tables, avoid append-only patterns for incremental pipelines, and parameterize run inputs to control scope.

What are the security considerations?

Use Microsoft Entra for RBAC, apply least privilege to storage and tables, mask or avoid persisting PII in shared curated tables, and enable audit logging and lineage capture for compliance.

How accurate is the information in this article?

The article is hand-written using established Fabric and Delta Lake patterns and links to authoritative resources. Validate commands and storage paths in your environment; confirm platform updates on Microsoft Learn before production runs.

Where can I learn more?

Refer to the Fabric Lakehouse and Data Pipelines guides included in the references below and Microsoft Learn for the most current product guidance.

References & links

Last updated: October 2025
Final note: keep exploration and deterministic pipelines separate, parameterize and version outputs, and validate any environment-specific code before production runs.

