Friday, July 18, 2025

Fabric Pipelines vs. Dataflow Gen2

Within Data Factory in Microsoft Fabric, you'll encounter two primary tools for data movement and transformation tasks: Data Pipelines and Dataflow Gen2. While both achieve similar goals, they cater to different use cases and user preferences. 

Understanding when to use which is key to building efficient and scalable data solutions in Fabric.

What is Data Factory in Fabric?

Data Factory in Microsoft Fabric provides a modern, cloud-based data integration service that allows you to create, schedule, and orchestrate your data movement and transformation workflows. It is essentially the engine that helps you bring data into OneLake and prepare it for analytics.

You'll find the Data Factory experience directly integrated into your Fabric workspace, allowing seamless interaction with other items like Lakehouses and Data Warehouses.


When you click on New Item in your Fabric workspace, you will find both Data Pipelines and Dataflow Gen2, as shown below:




Data Pipelines: The Orchestration Maestro

Data Pipelines in Fabric are the evolution of Azure Data Factory and Synapse Pipelines. They are designed for robust orchestration, control flow, and high-scale data movement. If you need to copy data from various sources to OneLake, execute notebooks, trigger stored procedures, or chain together a complex sequence of activities, Pipelines are your primary tool.

Key Characteristics of Data Pipelines:

  • Orchestration: Excellent for defining a sequence of activities, handling dependencies, and scheduling complex workflows.
  • Data Movement: Highly optimized for copying data between a vast array of data sources (databases, SaaS applications, file systems, cloud storage) to OneLake.
  • Control Flow: Provides activities for conditional logic, looping, error handling, and parallel execution.
  • Code-First & Low-Code Activities: Pipelines are built primarily by dragging and dropping activities onto a canvas, but many of those activities (such as calling a stored procedure or running a notebook) involve writing or pointing to code.
  • Monitoring: Comprehensive monitoring tools to track pipeline runs, identify failures, and troubleshoot.


When to Use Data Pipelines:

  • Ingesting large volumes of data from various sources into your Lakehouse or Warehouse.
  • Orchestrating end-to-end data workflows that involve multiple steps (e.g., ingest raw data, run a Spark notebook to transform it, then load it into a data warehouse).
  • Triggering other Fabric items, such as Spark notebooks, KQL queries, or dataflow refreshes (see the notebook sketch after this list).
  • Implementing robust error handling and retry mechanisms.
  • Scheduling batch data loads (e.g., daily, hourly).
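
To make the notebook-triggering scenario concrete, here is a minimal sketch of a PySpark notebook that a pipeline Notebook activity could run. It assumes the notebook has a default Lakehouse attached and that the first cell is marked as a parameter cell in the Fabric UI; the names source_folder, target_table, and run_date are illustrative placeholders, not part of any existing pipeline.

```python
from pyspark.sql import functions as F

# --- Parameter cell (marked as a "parameter cell" in the Fabric notebook UI) ---
# A pipeline Notebook activity can override these defaults at run time.
source_folder = "Files/raw/sales"   # relative to the default Lakehouse
target_table = "sales_bronze"
run_date = "2025-07-18"

# --- Transformation cell ---
# Read the raw CSV files that the pipeline's Copy activity landed in the
# Lakehouse "Files" area ("spark" is pre-created in Fabric notebooks).
df = spark.read.option("header", True).csv(source_folder)

# Stamp each row with the load date and write the result as a Delta table
# in the Lakehouse "Tables" area.
(df.withColumn("load_date", F.lit(run_date))
   .write.format("delta")
   .mode("overwrite")
   .saveAsTable(target_table))
```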


A typical Data Pipeline showing various activities like Copy Data, Notebook, and Dataflow execution.


Dataflow Gen2: The Low-Code Transformation Powerhouse

Dataflow Gen2 in Fabric is the next generation of Power Query Online, familiar to anyone who has used Power BI or Power Apps. It's a low-code, visual tool primarily focused on data transformation and cleansing, designed for data engineers, analysts, and even business users who prefer a graphical interface.

Dataflow Gen2 excels at shaping, cleaning, and preparing data from a multitude of sources before loading it into your Fabric Lakehouse or Warehouse.


Key Characteristics of Dataflow Gen2:

  • Low-Code/No-Code: The primary interaction is through a visual Power Query editor, allowing users to apply transformations without writing a single line of code.
  • Intuitive Interface: Easy to learn for users familiar with Excel or Power BI's Power Query.
  • Data Cleansing & Shaping: Strong capabilities for common data preparation tasks like merging, splitting, pivoting, unpivoting, type conversion, and error handling.
  • Schema on Write: It writes directly to your Lakehouse or Warehouse in the open Delta (Parquet) format, creating or updating tables.
  • Scalability: Leverages scalable Fabric compute under the hood for large transformations.

When to Use Dataflow Gen2:

  • Quickly ingesting and transforming small to medium-sized datasets into a Lakehouse or Warehouse.
  • When your team prefers a visual, low-code experience for data preparation.
  • Performing common data cleansing and shaping tasks (e.g., standardizing formats, removing duplicates, simple joins).
  • When you need to get data ready for Power BI semantic models with minimal coding effort.
  • For "citizen data integrators" who are comfortable with Power Query.


Pipelines vs. Dataflow Gen2: A Quick Comparison

  • Primary Focus: Pipelines handle orchestration, control flow, and large-scale data movement; Dataflow Gen2 handles visual transformation, data cleansing, and shaping.
  • User Experience: Pipelines use an activity-based canvas backed by a JSON definition; Dataflow Gen2 uses the visual Power Query editor.
  • Best For: Pipelines suit complex ETL/ELT workflows, orchestration, varied activity types, and high-volume ingestion; Dataflow Gen2 suits agile data prep, smaller-to-medium datasets, business-user transformations, and quick data landing.
  • Code Level: Pipelines are low-code (configuring activities) with code-first options for some activities (Notebook, stored procedure); Dataflow Gen2 is no-code/low-code (M language generated behind the scenes).
  • Output Target: Pipelines can write to a wide range of destinations via the Copy activity; Dataflow Gen2 primarily writes to a Fabric Lakehouse or Warehouse (Delta).
  • Dependencies: Pipelines can orchestrate other Fabric items; Dataflow Gen2 can be orchestrated by Pipelines.


How They Work Together: A Powerful Synergy

The real power of Data Factory in Fabric comes when you use Pipelines and Dataflow Gen2 together.

You might use a Pipeline to:

  1. Copy raw CSV files from an external Blob Storage account into the "Files" area of your Lakehouse.
  2. Then, trigger a Dataflow Gen2 to read those raw CSVs, apply transformations (e.g., parse dates, clean text, merge with a lookup table), and write the cleaned data as a Delta table into the "Tables" area of your Lakehouse.
  3. Finally, use the same Pipeline to trigger a Spark notebook for more advanced transformations or machine learning tasks (a minimal sketch of such a notebook follows below).
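
As a rough illustration of step 3, the notebook might read the Delta table that the Dataflow produced and derive an aggregate from it. This is only a sketch; the table and column names (sales_cleaned, customer_id, order_date, amount) are assumptions for the example.

```python
from pyspark.sql import functions as F

# Read the cleaned Delta table that the Dataflow Gen2 wrote to the
# Lakehouse "Tables" area.
cleaned = spark.read.table("sales_cleaned")

# A simple "advanced transformation": monthly revenue per customer.
monthly = (cleaned
    .withColumn("month", F.date_trunc("month", F.col("order_date")))
    .groupBy("customer_id", "month")
    .agg(F.sum("amount").alias("revenue")))

# Persist the result as another Delta table for downstream reporting.
monthly.write.format("delta").mode("overwrite").saveAsTable("sales_monthly_revenue")
```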


This combination allows you to leverage the strengths of both tools: Pipelines for robust orchestration and large-scale movement, and Dataflow Gen2 for efficient, visual data preparation.


Friday, July 04, 2025

OneLake: The "OneDrive" for Your Data

In this blog post I would like to dive into the foundational component that makes Microsoft Fabric's unified experience truly revolutionary: OneLake.

Think of OneLake as the "OneDrive" for your entire organisation's data. Just as OneDrive centralises your personal documents, OneLake centralises all your data assets, making them accessible to every engine and every persona within Microsoft Fabric. This isn't just another storage account; it's a paradigm shift in how we manage, access, and utilise data.


Problem: Data Silos and Duplication

Before OneLake, enterprise data environments often looked like this:

  • Your data engineers might land data in Azure Data Lake Storage (ADLS Gen2).
  • Your data scientists might copy some of that data into a separate environment for their experiments.
  • Your data warehousing team might ingest and transform data into a SQL Data Warehouse.
  • Power BI users might import data into their own models.

Every copy, every movement, introduced complexity, increased storage costs, created potential for inconsistencies, and slowed down development. It was a fragmented, costly, and often frustrating experience.

This is what a typical pre-Fabric data landscape often resembled: 





OneLake to the Rescue: One Copy, Many Experiences

OneLake fundamentally changes this by enforcing the principle of "One Copy of Data." Instead of copying data between services or creating separate data lakes for different departments, all your organisational data lives in a single, logical data lake within Fabric.

Imagine a central hub where all data naturally flows and resides: 


 
 
Here's how it works:

  1. Open Data Formats: OneLake stores all data in an open format, primarily Delta Lake (Parquet-based). This means the data isn't locked into a proprietary system.
  2. Engine Agnostic: Whether you're using a Spark notebook for data engineering, a SQL endpoint for warehousing, or Power BI for analytics, all these engines access the exact same underlying data files in OneLake. There's no need to move, convert, or duplicate (see the sketch after this list).
  3. Hierarchical Namespace: OneLake automatically organizes your data within a tenant, workspace, and item structure. This provides a clear and intuitive way to manage your data assets, much like folders and files in OneDrive.
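
As a small illustration of the "engine agnostic" point, a Spark notebook (even one in a different workspace) can read a table straight from its OneLake path, the same files the SQL endpoint and Power BI read. The workspace, Lakehouse, and table names below are placeholders, and the path follows OneLake's ADLS-style addressing.

```python
# OneLake exposes every workspace and item through an ADLS-compatible URI.
# All names below (SalesWorkspace, SalesLakehouse, sales_cleaned) are placeholders.
path = (
    "abfss://SalesWorkspace@onelake.dfs.fabric.microsoft.com/"
    "SalesLakehouse.Lakehouse/Tables/sales_cleaned"
)

# Spark reads the same Delta files that the SQL endpoint and Power BI query;
# no copy of the data is made.
df = spark.read.format("delta").load(path)
df.show(5)
```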

Understanding Workspaces and Items

Within Fabric, your data is organised into Workspaces. A Workspace is a collaborative environment where teams can manage their data assets. Inside a Workspace, you create Items, such as Lakehouses, Data Warehouses, KQL Databases, or Power BI semantic models.

OneLake automatically provisions storage for every item you create. 

For example, when you create a Lakehouse, OneLake creates a dedicated folder structure for it, including Tables (for Delta tables) and Files (for raw files).
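
From a notebook with that Lakehouse attached as its default, you can see this layout directly. A minimal sketch, assuming the mssparkutils helper that Fabric notebooks expose; the folder contents shown are simply whatever exists in your Lakehouse.

```python
# List the two top-level areas of the default Lakehouse.
# mssparkutils is pre-loaded in Fabric notebooks.
for entry in mssparkutils.fs.ls("Files"):
    print("Files :", entry.path)    # raw/unstructured files

for entry in mssparkutils.fs.ls("Tables"):
    print("Tables:", entry.path)    # each folder is a Delta table
                                    # (Parquet files + _delta_log)
```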

The Magic of Shortcuts: Virtualizing Data In-Place

What if some of your data already lives outside of Fabric in an existing ADLS Gen2 account, Amazon S3 bucket, or Google Cloud Storage? Do you have to ingest it all into OneLake? No! This is where Shortcuts come in.

Shortcuts allow you to create a virtual link from OneLake to external data sources. The data itself remains in its original location, but it appears as if it's part of OneLake. This means you can:
  • Query external data using Spark or SQL endpoints in Fabric without moving it (see the sketch after this list).
  • Combine external data with data already in OneLake seamlessly.
  • Start leveraging Fabric's powerful compute engines immediately, even with existing data estates.
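
For example, once a shortcut appears under a Lakehouse's Tables area (and points at Delta-formatted data), Spark can query it exactly like a local table and join it with native OneLake tables. The table names external_orders and customers below are assumptions for the sketch.

```python
# "external_orders" is assumed to be a shortcut in the Lakehouse's Tables area
# pointing at Delta data in, say, an ADLS Gen2 or S3 account; "customers" is a
# native Lakehouse table. Spark treats both the same way.
enriched = spark.sql("""
    SELECT o.order_id, o.amount, c.customer_name, c.region
    FROM   external_orders AS o
    JOIN   customers       AS c
      ON   o.customer_id = c.customer_id
""")

enriched.show(5)
```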


Visualizing a Shortcut: Data stays external, but Fabric treats it as local: 
 




Key Benefits of Shortcuts:

  • No Data Movement: Reduces ingestion time, cost, and complexity.
  • Compliance: Data can remain in its original sovereign cloud if needed.
  • Unified View: Provides a single pane of glass for all your data, regardless of its physical location.

Why OneLake Matters

OneLake is more than just a storage layer; it's the foundation for true data democratization within an organization. It simplifies the data landscape, reduces complexity, and accelerates the journey from raw data to actionable insights.

  • Cost Savings: No more duplicating data across multiple services.
  • Improved Governance: A single source of truth makes security and compliance easier to manage.
  • Faster Time-to-Insight: Data is immediately available to all Fabric experiences.
  • Reduced Complexity: Less data movement and fewer integration points to manage.


In essence, OneLake empowers every data professional in your organisation to work with the same trusted data, fostering collaboration and innovation.
