Friday, February 27, 2026

Processing Nested JSON with PySpark in Microsoft Fabric

I have been working on a project in Microsoft Fabric, where I am calling an API using a pipeline and getting JSON files into the Files section of the Lakehouse. These JSON files track every insert, update, and delete that happens in the source database.

In this post, I will walk through this real-world example step by step. The pipeline lands these JSON files in the Lakehouse Files folder, each containing nested CDC data.

Our goal was to:

Read each JSON file using PySpark inside a Fabric Notebook

Navigate the nested structure to extract column names and their values

Pivot those values so each column name becomes its own table column

Add metadata columns (timestamp, start_lsn and operation type)

Save the result as a Delta table — automatically named after the source file

Loop through all files in the folder so the whole process is automated


By the end of this post you will have a fully parameterised, reusable notebook that you can trigger from a Fabric Pipeline.

Understanding the Source Data

Before we write any code, it helps to understand what our data looks like. Each JSON file has a structure like this:


{
  "data": {
    "currentItemCount": 20,
    "items": [
      {
        "changes": [
          {
            "__$operation": 3,
            "data": [
              { "columnName": "PKID",  "value": "3834523" },
              { "columnName": "Name", "value": "John Doe" }
            ]
          }
        ]
      }
    ]
  }
}

The key things to notice in this CDC structure are:
  • data.items — an array of change sets
  • items[].changes — an array of individual changes, each with an operation code
  • __$operation — a number representing the CDC operation (1=Delete, 2=Insert, 3=Update Before, 4=Update After)
  • changes[].data — an array of key-value pairs where columnName is the field name and value is the data

Our task is to turn those key-value pairs into proper columns in a Delta table.

Setting Up Your Fabric Notebook

Open a new Notebook in your Fabric workspace and make sure it is connected to your default Lakehouse. The notebook will be organised into clearly labelled cells.

 

💡 Tip: In Fabric Notebooks you can mark a cell as a Parameters cell by clicking the three-dot menu (...) on the cell and selecting Toggle Parameter Cell. This lets a Pipeline pass in values at runtime without changing your code.

 

Cell 1: Parameters

This cell defines the folder path where your JSON files are stored. Mark this as a Parameters cell so a Pipeline can override it.

 

# Default parameter - override this from a Fabric Pipeline
source_folder = "Files/your_folder_name"

 

💡 Tip: The source_folder value is a relative path inside the Files section of your Lakehouse. Right-click the folder in your Lakehouse explorer and choose Properties to confirm the exact path.

 

Cell 2: Imports

We need a small set of PySpark functions. Import them all upfront to keep things tidy.

 

from pyspark.sql.functions import (
    col, explode, monotonically_increasing_id,
    current_timestamp, when
)
import os


Cell 3: List the JSON Files

We use mssparkutils — a utility built into Fabric Notebooks — to list all the files in our folder. We then filter to only JSON files.

 

# List all files in the source folder
files = mssparkutils.fs.ls(f'/{source_folder}')

# Keep only .json files
json_files = [f for f in files if f.name.endswith('.json')]

print(f'Found {len(json_files)} JSON files:')
for f in json_files:
    print(f'  {f.name}')

 

💡 Tip: mssparkutils.fs.ls() returns a list of file objects. Each object has a .name property (just the filename) and a .path property (the full path you need for spark.read).

 

Cell 4: Operation Code Helper Function

The __$operation field in CDC data is a number. This helper function maps those numbers to human-readable labels, which makes the final table much easier to understand.

 

def map_operation(df_in):
    return df_in.withColumn('operation',
        when(col('operation') == 1, 'Delete')
       .when(col('operation') == 2, 'Insert')
       .when(col('operation') == 3, 'Update Before')
       .when(col('operation') == 4, 'Update After')
       .otherwise('Unknown')
    )

 

Cell 5: The Main Loop

This is the heart of the notebook. For each JSON file, the loop will:

  1. Derive the Delta table name from the filename
  2. Read the JSON file into a Spark DataFrame
  3. Explode the nested arrays step by step
  4. Pivot the key-value pairs into wide-format columns
  5. Add inserted_date and operation columns
  6. Append the result to a Delta table

 

for file in json_files:
    # Derive table name from filename (strip .json extension)
    table_name = os.path.splitext(file.name)[0]

    # Sanitise: replace hyphens and spaces with underscores
    table_name_clean = table_name.replace('-', '_').replace(' ', '_')

    print(f'\nProcessing: {file.name} -> Table: {table_name_clean}')

    try:
        # Step 1: Read the JSON file
        df = spark.read.option('multiline', 'true').json(file.path)

        # Step 2: Explode the items array
        df_items = df.select(explode(col('data.items')).alias('item'))

        # Step 3: Explode changes, then select __$operation
        #         Note: backticks are required around __$operation
        #         because the $ sign is a special character in Spark
        df_changes = df_items.select(
            explode(col('item.changes')).alias('change')
        ).select(
            monotonically_increasing_id().alias('change_id'),
            col('change.`__$operation`').alias('operation'),
            col('change.data').alias('change_data')
        )

        # Step 4: Explode the data array (columnName + value pairs)
        df_data = df_changes.select(
            col('change_id'),
            col('operation'),
            explode(col('change_data')).alias('record')
        )

        # Step 5: Pull out the columnName and value fields
        df_kv = df_data.select(
            col('change_id'),
            col('operation'),
            col('record.columnName').alias('column_name'),
            col('record.value').alias('value')
        )

        # Step 6: Pivot so each unique columnName becomes its own column
        df_wide = df_kv.groupBy('change_id', 'operation') \
                       .pivot('column_name') \
                       .agg({'value': 'first'})

        # Step 7: Add metadata columns and drop the surrogate key
        df_final = df_wide \
            .withColumn('inserted_date', current_timestamp()) \
            .drop('change_id')

        # Step 8: Map operation codes to readable labels
        df_final = map_operation(df_final)

        # Step 9: Append to Delta table named after the file
        df_final.write \
            .format('delta') \
            .mode('append') \
            .option('mergeSchema', 'true') \
            .saveAsTable(table_name_clean)

        row_count = df_final.count()
        print(f'  Done: {row_count} rows appended to {table_name_clean}')

    except Exception as e:
        print(f'  ERROR processing {file.name}: {str(e)}')
        continue

print('\nAll files processed.')

 

Understanding Each Step in Detail

If you are new to PySpark, the explode and pivot steps can look a little daunting. Here is a plain-English explanation of what is happening at each stage.

 

What does explode() do?

When your data has an array (a list of items inside a field), explode() turns each element of that array into its own separate row. Think of it like unzipping a zip file — each item gets its own space.

In our data, we need to call explode() three times because the arrays are nested inside arrays:

  • First on data.items to get one row per change set
  • Then on item.changes to get one row per individual change
  • Then on change.data to get one row per column name/value pair
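As a rough analogy in plain Python (no Spark required, with made-up sample data), exploding an array is like flattening one level of nesting into separate rows:

```python
# Plain-Python analogy of Spark's explode(): each element of a
# nested array becomes its own output "row" (illustration only)
rows = [
    {"item": 1, "changes": ["a", "b"]},
    {"item": 2, "changes": ["c"]},
]

# "Explode" the changes array: one output row per array element
exploded = [
    {"item": r["item"], "change": c}
    for r in rows
    for c in r["changes"]
]

print(exploded)
# [{'item': 1, 'change': 'a'}, {'item': 1, 'change': 'b'}, {'item': 2, 'change': 'c'}]
```

Each call to explode() in the notebook does exactly this, one nesting level at a time.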

 

What does pivot() do?

After exploding, our data is in a long format — each row has a column_name and a value. The pivot() function rotates these rows so that each unique column_name becomes its own column and the value moves into the corresponding cell. This is exactly the wide format you would expect in a database table.
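The same idea can be sketched in plain Python (hypothetical data, not the Spark API): rotating long-format column_name/value rows into one wide record per change:

```python
# Plain-Python sketch of a long-to-wide pivot (illustration only).
# Each input row carries a change_id, a column_name and a value;
# the output has one dict per change_id with column names as keys.
long_rows = [
    {"change_id": 0, "column_name": "PKID", "value": "3834523"},
    {"change_id": 0, "column_name": "Name", "value": "John Doe"},
    {"change_id": 1, "column_name": "PKID", "value": "3834524"},
]

wide = {}
for row in long_rows:
    wide.setdefault(row["change_id"], {})[row["column_name"]] = row["value"]

print(wide[0])  # {'PKID': '3834523', 'Name': 'John Doe'}
```

Spark's groupBy().pivot().agg() does this at scale, with the groupBy keys playing the role of change_id here.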

 

Why use backticks around __$operation?

The __$operation field name contains special characters (underscores and a dollar sign). Spark's column expression parser would misinterpret these, so wrap the name in backtick characters to tell Spark to treat the whole string as a single column name.

 

What is mergeSchema?

When appending to a Delta table, if a new JSON file introduces a new column that did not exist before, Spark would normally throw an error. The mergeSchema option tells Delta to automatically add the new column to the table schema rather than failing. This makes the notebook robust to schema changes in your source data.
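Conceptually, schema merging is just a union of the old and new column sets. A plain-Python illustration (the column names here are hypothetical):

```python
# Columns already in the Delta table
existing_schema = {"PKID", "Name", "operation", "inserted_date"}

# Columns arriving in a new JSON file, including a new 'Email' column
incoming_schema = {"PKID", "Name", "Email", "operation", "inserted_date"}

# mergeSchema behaviour: new columns are added, nothing is dropped
merged = existing_schema | incoming_schema
print(sorted(merged))
```

Rows written before the new column existed will simply contain nulls for it.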

 

Tips and lessons learnt

 

⚠️ 1: Do not call from_json() on a column that is already a struct. When Spark reads a JSON file, nested objects are automatically parsed into structs. Calling from_json() on a struct causes an AnalysisException because it expects a string. Navigate the struct directly using dot notation like col('data.items').

 

⚠️ 2: You cannot reference a column created by explode() in the same select() call. Always chain a second .select() after the explode so that the new column is fully resolved before you try to access its fields.

 

⚠️ 3: Column names with special characters like $ or spaces must be wrapped in backticks inside col() expressions. Without backticks, Spark will fail to parse the column name and throw an UNRESOLVED_COLUMN error.

 

⚠️ 4: Use mode('append') with option('mergeSchema', 'true') — not option('overwriteSchema', 'true'). overwriteSchema only applies when you are using overwrite mode. Using it with append mode will cause an error.

 

💡 5: Use os.path.splitext(file.name)[0] to cleanly strip the .json extension from the filename. This is more reliable than using .replace('.json', '') because splitext handles edge cases like filenames that contain the word json elsewhere.
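For example (a quick check you can run anywhere Python is available; the filename is made up to show the edge case):

```python
import os

name = "orders.json_archive.json"

# splitext strips only the final extension
print(os.path.splitext(name)[0])   # orders.json_archive

# replace removes every occurrence of '.json', mangling the name
print(name.replace('.json', ''))   # orders_archive
```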

 

Triggering from a Fabric Pipeline

Because the source_folder cell is marked as a Parameters cell, this notebook can be called from a Fabric Data Pipeline. In the Pipeline, add a Notebook activity and under the Settings tab you will find a Base parameters section where you can pass in the folder path dynamically.

 

Parameter Name : source_folder

Value          : Files/cdc_data/2024-01-15

 

This means you can have a Pipeline that processes a different date folder every day simply by passing the folder name as a parameter — no code changes needed.

 

Summary

In this post I have explained a complete, beginner-friendly solution for processing nested CDC JSON files in Microsoft Fabric. Here is a quick recap of what we covered:

  • How to use mssparkutils.fs.ls() to list files in your Lakehouse
  • How to navigate nested JSON structures using dot notation and explode()
  • Why backticks are needed for column names with special characters
  • How to pivot key-value pairs into a wide table format
  • How to add inserted_date and a human-readable operation label
  • How to append to Delta tables with automatic schema evolution using mergeSchema
  • How to parameterise the notebook so it works with any folder from a Pipeline

 

The full notebook is reusable — just drop new JSON files into the folder and run the Pipeline again. Each file gets its own Delta table named after the source file, making it easy to trace data back to its origin.



Tuesday, September 30, 2025

Medallion Architecture in Fabric

Medallion Architecture is a design pattern that has become the industry standard for Lakehouse environments. It is not a Fabric-specific feature; rather, it describes a way to incrementally improve the quality and structure of data as it moves through three distinct layers: Bronze, Silver, and Gold.

Layer 1: The Bronze Layer (Raw Data)

The Bronze layer is your landing zone. The goal here is data preservation. You bring data in from your source systems (SQL Server, ERP, CRM, IoT logs, or APIs) and store it exactly as it was found.

  • Format: Typically stored as files or Delta tables.
  • The Rule: No transformations allowed! If the source has a typo or a weird date format, you keep it.
  • Why? If you ever need to "replay" your data processing due to a bug in your logic, you have the original source of truth right here in OneLake.

Layer 2: The Silver Layer (Cleansed & Conformed)

This is where the heavy lifting happens. In the Silver layer, you take your raw Bronze data and "clean" it. This is often where Data Engineering personas thrive using Spark Notebooks or Dataflow Gen2.

  • Activities: Filtering out nulls, standardizing date formats (YYYY-MM-DD), joining related tables, and deduplicating records.
  • The Result: You now have a "Single Source of Truth." If two different source systems have different IDs for the same customer, you resolve that here.
  • Storage: Always Delta tables for performance and ACID compliance.
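As an illustrative sketch of the kind of cleansing done in Silver (plain Python with hypothetical records and rules, rather than a full Spark job):

```python
from datetime import datetime

# Hypothetical raw Bronze records: mixed date formats and a duplicate
bronze = [
    {"customer_id": 1, "signup": "01/15/2024"},
    {"customer_id": 1, "signup": "01/15/2024"},   # duplicate
    {"customer_id": 2, "signup": "2024-02-03"},
]

def standardise_date(value):
    """Normalise to YYYY-MM-DD, trying a couple of known source formats."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable -> flag for review

seen = set()
silver = []
for row in bronze:
    key = row["customer_id"]
    if key in seen:
        continue  # deduplicate on customer_id
    seen.add(key)
    silver.append({"customer_id": key, "signup": standardise_date(row["signup"])})

print(silver)
# [{'customer_id': 1, 'signup': '2024-01-15'}, {'customer_id': 2, 'signup': '2024-02-03'}]
```

In Fabric you would express the same logic with Spark DataFrame operations or Dataflow Gen2 steps, writing the result to a Silver Delta table.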

Layer 3: The Gold Layer (Business Ready)

The Gold layer is designed for consumption. This data is curated for specific business departments or use cases (like Sales, Finance, or Marketing).

  • Structure: Often organized in a Star Schema (Facts and Dimensions).
  • The Goal: Speed and ease of use. A Power BI user should be able to connect to a Gold table and understand exactly what the columns mean without doing any further joining or cleaning.
  • Storage: This is often where you might choose between a Gold Lakehouse or a Gold Data Warehouse for final reporting.

Why Does Medallion Architecture Work Better in Fabric?

In traditional architectures, moving data between these layers meant complex ETL jobs and moving data between different servers. 

In Microsoft Fabric, this process is significantly streamlined:

Zero Data Movement: Because everything is in OneLake, moving from Bronze to Silver is just a matter of a Spark job reading one folder and writing to another.

Shortcuts: You can "shortcut" your Bronze data from an external AWS S3 bucket directly into your Fabric Lakehouse, meaning your "Bronze" layer doesn't even have to physically reside in Azure!

Unified Governance: You can apply security and sensitivity labels across the entire chain, ensuring that raw PII (Personally Identifiable Information) in Bronze is masked by the time it reaches Gold.

Summary of the Medallion Architecture Flow:

Layer  | Quality                        | Audience          | Typical Tool
Bronze | Raw / Dirty                    | Data Engineers    | Data Factory Pipelines
Silver | Clean / Integrated             | Data Scientists   | Spark Notebooks / Dataflows
Gold   | Aggregated / Highly Structured | Business Analysts | SQL Endpoint / Power BI


Tuesday, August 19, 2025

The Power of Lakehouses and Warehouses in Microsoft Fabric

At some point in every Microsoft Fabric journey, you reach a critical decision: where should your data live?

In Fabric, you have two primary items or "workhorses" for storing and querying your structured data: the Lakehouse and the Warehouse. Because both store data in the open Delta Lake format and live in OneLake, the choice isn't about storage—it's about the engine you want to use and the persona of the person doing the work. 

The Lakehouse: The Data Engineer's Playground

The Lakehouse is the "Swiss Army Knife" of Fabric. It is designed for flexibility, allowing you to store structured tables alongside semi-structured (JSON, XML) and unstructured data (images, PDFs, logs).

  • Primary Engine: Apache Spark.
  • Primary Languages: Python (PySpark), Scala, R, and Spark SQL.
  • Best For: Data engineering, big data processing, and Data Science/ML workloads.

Key Feature: The SQL Analytics Endpoint Every Lakehouse automatically comes with a SQL Analytics Endpoint. This is a read-only gateway that allows SQL-savvy users to query the Lakehouse tables using familiar T-SQL, without needing to know a line of Python.

The Data Warehouse: The SQL Pro’s Powerhouse

If the Lakehouse is a "playground," the Data Warehouse is a high-performance relational engine. It is designed for developers who want the full power of T-SQL (Transact-SQL) and traditional data warehousing capabilities like primary keys, foreign keys, and complex transactions.

  • Primary Engine: Distributed SQL Engine (Polaris).
  • Primary Language: T-SQL (Transact-SQL).
  • Best For: Enterprise data warehousing, complex SQL transformations, and users migrating from SQL Server or Synapse Dedicated Pools.

Key Feature: Unlike the Lakehouse SQL Endpoint, which is read-only, the Data Warehouse is fully read-write. You can use INSERT, UPDATE, and DELETE commands directly against your tables using standard SQL scripts.

The beauty of Fabric is that both tools use the Delta Lake format. This means a Power BI report can look at data in a Lakehouse and a Warehouse simultaneously without any "data movement."

Feature                  | Lakehouse                      | Data Warehouse
Primary Persona          | Data Engineers / Scientists    | SQL Developers / Analysts
Primary Engine           | Apache Spark                   | SQL Engine
Write Support            | Spark (Python/Scala), Dataflow | T-SQL (DML), Dataflow, Pipelines
Read Support             | Spark & T-SQL (Read-only)      | T-SQL (Read-Write)
Structure                | Files + Folders + Tables       | Relational Tables only
Multi-table Transactions | Limited (Spark level)          | Full ACID Support

In a real-world Fabric implementation, you don't usually pick just one. 

Most architects follow this flow:

  • Lakehouse for the "Bronze" and "Silver" layers (raw ingestion and Spark-based cleaning).
  • Data Warehouse for the "Gold" layer (curated, highly structured data ready for business users to query with SQL).
Here is a tip:
  • Choose Lakehouse if: You have semi-structured data (JSON), love Python/Notebooks, or are doing Machine Learning.
  • Choose Warehouse if: You are building a star schema, need full SQL "Write" capabilities, or are migrating a traditional SQL data warehouse.


Wednesday, August 06, 2025

Everything You Need to Know About On-Premise SQL Server Mirroring in Microsoft Fabric

Microsoft Fabric is rapidly changing the data landscape, bringing together analytics, data science, and business intelligence into a single, unified platform. One of its most exciting features for organisations with existing on-premises SQL Server deployments is Mirroring. This capability offers a near real-time, low-friction way to bring your transactional data into the Fabric ecosystem for immediate analytics, reporting, and AI workloads.  

At the time of writing, this feature is still in Preview, which means it is not supported for Production scenarios.

But what exactly is it, how does it work, and what should you be aware of before diving in? 

Let's break it down.

What is SQL Server Mirroring in Microsoft Fabric?

At its core, SQL Server Mirroring in Fabric provides a near real-time, low-latency replication mechanism for your on-premises SQL Server databases directly into Microsoft OneLake, Fabric's unified data lake. It creates a "Mirrored Database" item in your Fabric workspace, which then exposes a SQL Analytics Endpoint and a Default Semantic Model. 

This means your live transactional data is instantly available for the following use cases:

  • Querying with T-SQL using the SQL Analytics Endpoint.
  • Building Power BI reports with Direct Lake mode for unparalleled performance.
  • Leveraging Spark notebooks for advanced analytics and transformations.
  • Creating operational dashboards that reflect the latest business transactions.

The magic behind this is that Fabric automatically uses Change Data Capture (CDC) for SQL Server 2016-2022 or a native change feed for SQL Server 2025 to intelligently track and synchronise changes, avoiding the need for complex ETL pipelines for initial data ingestion.

The Power of Mirroring: Key Benefits

  • You can say goodbye to nightly batch jobs. Mirroring ensures your analytics are always based on the freshest possible data.
  • No more manual pipeline creation for initial loads and incremental updates. Fabric handles the replication automatically.
  • Your SQL Server data lands directly in OneLake, making it accessible across all Fabric personas (Data Engineering, Data Science, Data Warehousing, BI).
  • The actual data replication process itself (the initial snapshot and ongoing CDC) does not consume your Fabric Capacity Units (CU). You only pay for the compute when you query or process the data in Fabric.
  • For supported tables, Power BI datasets automatically leverage Direct Lake, offering the performance of Import Mode with the freshness of DirectQuery.
  • The replication is designed to be low impact on your production SQL Server, using CDC mechanisms that are highly optimised.

What You Need to Know: Requirements & Limitations

While incredibly powerful, mirroring isn't a silver bullet; understanding its current requirements and limitations is crucial for a successful implementation.

On-Premises SQL Server Requirements:

  • SQL Server 2016 through 2022 (using CDC) and SQL Server 2025 (using native change feed).
  • Mirroring is supported only on the primary replica of an Always On Availability Group.
  • SQL Server on Windows is fully supported. SQL Server on Linux (2017 CU18+ to 2022) is also supported.
  • A Microsoft Data Gateway (Standard or VNET) is required for secure communication between Fabric and your on-premises SQL Server.

Database & Table Limitations:

  • Single Mirror Target: A database can only be mirrored to one Fabric workspace at a time.
  • No Redundant CDC (SQL Server 2025): If you're on SQL Server 2025, do not manually enable CDC on the source database; Fabric handles this for you.
  • No Delayed Transaction Durability: Databases with this setting cannot be mirrored.
  • 500 Table Limit: Each mirrored database can replicate a maximum of 500 tables. Fabric will pick the first 500 alphabetically if you choose "Mirror all data."
  • Unsupported Data Types: This is a big one! Tables containing columns with types like image, text, ntext, xml, timestamp/rowversion, sql_variant, geometry, geography, hierarchyid, json, vector, or User-Defined Types (UDT) cannot be fully mirrored. The presence of these types will cause the entire column, or sometimes the whole table, to be excluded.
  • Precision Loss: datetime2(7) and datetimeoffset(7) columns will lose their 7th digit of precision (becomes 6 digits) when mirrored to OneLake's Delta Lake format.
  • No Row/Column Filtering: Mirroring is an "all or nothing" proposition for selected tables. You cannot filter specific rows or exclude individual columns during replication. 
  • DDL Changes: Significant schema changes (like altering a Primary Key) will often trigger a full re-sync of the entire table, leading to higher latency for that specific table until the re-sync completes.
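The 500-table selection rule described above ("first 500 alphabetically") can be pictured like this (illustrative Python with made-up table names, not the Fabric API):

```python
# Hypothetical list of 750 source tables; Fabric mirrors at most 500,
# chosen alphabetically when "Mirror all data" is selected, so the
# remaining tables are silently excluded
tables = [f"table_{i:04d}" for i in range(750)]

mirrored = sorted(tables)[:500]

print(len(mirrored))   # 500
print(mirrored[-1])    # table_0499  (table_0500 onwards are not mirrored)
```

If you have more than 500 tables, select them explicitly rather than relying on "Mirror all data".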

Measuring Capacity Usage: Understanding the Costs

  • Replication is FREE (Compute-wise) -- The initial snapshot and continuous Change Data Capture (CDC) processing itself does not consume your Fabric Capacity Units (CU). This is a significant advantage over other replication methods.
  • You Pay for Data Usage -- Your Fabric Capacity Units are consumed when you -

    • Query the mirrored data via the SQL Analytics Endpoint.
    • Interact with Power BI reports built on the mirrored data (especially in Direct Lake mode).
    • Process the data using Spark notebooks or other Fabric experiences.
  • Storage Costs -- You get a generous 1 TB of free mirroring storage for every 1 Capacity Unit (CU) you own. For example, an F64 capacity gets 64 TB of free storage for mirrored data. You only start paying for OneLake storage if you exceed this allowance or if your Fabric Capacity is paused (storage charges still apply).

How to Monitor?

Use the Microsoft Fabric Capacity Metrics App to track your CU consumption and storage usage for "Mirrored Database" items. This app will clearly show you what's driving your costs. You can also leverage Workspace Monitoring (via KQL Databases) to track MirroredDatabaseTableExecution logs for metrics like ProcessedBytes, ProcessedRows, and Latency to understand the operational health and volume of your replication.

Prerequisites to Getting Started with Mirroring

  • Verify your SQL Server version, Always On Availability Group configuration, and install a Data Gateway if needed.
  • Fabric will automatically enable CDC on your source database (or use the native change feed for SQL Server 2025). Ensure your SQL Server instance has the necessary permissions.
  • In your Fabric workspace, select "New > More Options > Mirrored Database (SQL Server)."
  • Provide your SQL Server connection details and gateway information.
  • Choose which tables you want to mirror. Fabric will automatically identify and flag any tables with unsupported data types.
  • Keep an eye on the replication status in Fabric and use the Capacity Metrics App to track usage.

Conclusion

On-premises SQL Server Mirroring in Microsoft Fabric is a game-changer for organisations looking to bridge the gap between their transactional systems and modern analytics. By providing near real-time, low-cost data ingestion directly into OneLake, it unlocks a world of possibilities for faster insights and more agile data operations. While you need to be mindful of its current limitations, the benefits of quick, seamless integration make it an invaluable tool in your Fabric journey.


Friday, July 18, 2025

Fabric Pipelines vs. Dataflow Gen2

Within Data Factory in Microsoft Fabric, you'll encounter two primary tools for data movement and transformation tasks: Data Pipelines and Dataflow Gen2. While both achieve similar goals, they cater to different use cases and user preferences. 

Understanding when to use which is key to building efficient and scalable data solutions in Fabric.

What is Data Factory in Fabric?

Data Factory in Microsoft Fabric provides a modern, cloud-based data integration service that allows you to create, schedule, and orchestrate your data movement and transformation workflows. It is essentially the engine that helps you bring data into OneLake and prepare it for analytics.

You'll find the Data Factory experience directly integrated into your Fabric workspace, allowing seamless interaction with other items like Lakehouses and Data Warehouses.


When you click on New Item in your Fabric workspace, you will find both Pipelines and Dataflow Gen2 listed as available items.
Data Pipelines: The Orchestration Maestro

Data Pipelines in Fabric are the evolution of Azure Data Factory and Synapse Pipelines. They are designed for robust orchestration, control flow, and high-scale data movement. If you need to copy data from various sources to OneLake, execute notebooks, trigger stored procedures, or chain together a complex sequence of activities, Pipelines are your primary tool.

Key Characteristics of Data Pipelines:

  • Orchestration: Excellent for defining a sequence of activities, handling dependencies, and scheduling complex workflows.
  • Data Movement: Highly optimized for copying data between a vast array of data sources (databases, SaaS applications, file systems, cloud storage) to OneLake.
  • Control Flow: Provides activities for conditional logic, looping, error handling, and parallel execution.
  • Code-First & Low-Code Activities: While they primarily involve dragging and dropping activities, many activities (like calling a stored procedure or running a notebook) involve writing or pointing to code.
  • Monitoring: Comprehensive monitoring tools to track pipeline runs, identify failures, and troubleshoot.


When to Use Data Pipelines:

  • Ingesting large volumes of data from various sources into your Lakehouse or Warehouse.
  • Orchestrating end-to-end data workflows that involve multiple steps (e.g., ingest raw data, run a Spark notebook to transform it, then load it into a data warehouse).
  • Triggering other Fabric items, such as Spark notebooks, KQL queries, or dataflow refreshes.
  • Implementing robust error handling and retry mechanisms.
  • Scheduling batch data loads (e.g., daily, hourly).


A typical Data Pipeline showing various activities like Copy Data, Notebook, and Dataflow execution.


Dataflow Gen2: The Low-Code Transformation Powerhouse

Dataflow Gen2 in Fabric is the next generation of Power Query Online, familiar to anyone who has used Power BI or Power Apps. It's a low-code, visual tool primarily focused on data transformation and cleansing, designed for data engineers, analysts, and even business users who prefer a graphical interface.

Dataflow Gen2 excels at shaping, cleaning, and preparing data from a multitude of sources before loading it into your Fabric Lakehouse or Warehouse.


Key Characteristics of Dataflow Gen2:

  • Low-Code/No-Code: The primary interaction is through a visual Power Query editor, allowing users to apply transformations without writing a single line of code.
  • Intuitive Interface: Easy to learn for users familiar with Excel or Power BI's Power Query.
  • Data Cleansing & Shaping: Strong capabilities for common data preparation tasks like merging, splitting, pivoting, unpivoting, type conversion, and error handling.
  • Schema On Write: It writes directly to your Lakehouse or Warehouse in Delta Parquet format, creating or updating tables.
  • Scalability: Leverages Spark compute under the hood for scalable transformations.

When to Use Dataflow Gen2:

  • Quickly ingest and transform smaller to medium-sized datasets into a Lakehouse or Warehouse.
  • When your team prefers a visual, low-code experience for data preparation.
  • Performing common data cleansing and shaping tasks (e.g., standardizing formats, removing duplicates, simple joins).
  • When you need to get data ready for Power BI semantic models with minimal coding effort.
  • For "citizen data integrators" who are comfortable with Power Query.


Pipelines vs. Dataflow Gen2: A Quick Comparison

Feature         | Data Pipelines                                     | Dataflow Gen2
Primary Focus   | Orchestration, Control Flow, Large-Scale Movement  | Visual Transformation, Data Cleansing, Shaping
User Experience | Activity-based canvas, JSON definition             | Power Query Editor (visual)
Best For        | Complex ETL/ELT workflows, orchestration, varied activity types, high-volume ingestion | Agile data prep, smaller-to-medium datasets, business user transformations, quick data landing
Code Level      | Low-code (configuring activities), code-first for some activities (Notebook, SP) | No-code/Low-code (M language generated behind the scenes)
Output Target   | Can write to various destinations via Copy activity | Primarily writes to Fabric Lakehouse/Warehouse (Delta)
Dependencies    | Can orchestrate other Fabric items                 | Can be orchestrated by Pipelines


How They Work Together: A Powerful Synergy

The real power of Data Factory in Fabric comes when you use Pipelines and Dataflow Gen2 together.

You might use a Pipeline to:

  1. Copy raw CSV files from an external Blob Storage account into the "Files" area of your Lakehouse.
  2. Then, trigger a Dataflow Gen2 to read those raw CSVs, apply transformations (e.g., parse dates, clean text, merge with a lookup table), and write the cleaned data as a Delta table into the "Tables" area of your Lakehouse.
  3. Finally, use the same Pipeline to trigger a Spark notebook for more advanced transformations or machine learning tasks.


This combination allows you to leverage the strengths of both tools: Pipelines for robust orchestration and large-scale movement, and Dataflow Gen2 for efficient, visual data preparation.

