Tuesday, September 30, 2025

Medallion Architecture in Fabric

Medallion Architecture is a design pattern that has become the industry standard for Lakehouse environments. It is not a Fabric feature. It describes a way to incrementally improve the quality and structure of data as it moves through three distinct layers: Bronze, Silver, and Gold.

Layer 1: The Bronze Layer (Raw Data)

The Bronze layer is your landing zone. The goal here is data preservation. You bring data in from your source systems (SQL Server, ERP, CRM, IoT logs, or APIs) and store it exactly as it was found.

  • Format: Typically stored as files or Delta tables.
  • The Rule: No transformations allowed! If the source has a typo or a weird date format, you keep it.
  • Why? If you ever need to "replay" your data processing due to a bug in your logic, you have the original source of truth right here in OneLake.
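To make this concrete, here is a minimal sketch of a Bronze ingestion step in a Fabric Spark notebook. The paths, table name, and columns are placeholders, and the snippet assumes the notebook is attached to a Lakehouse (where a spark session is predefined):

```python
from pyspark.sql import functions as F

# Read the raw extract exactly as it arrived (placeholder path under the
# Lakehouse "Files" area). No cleansing, no type fixes, no filtering.
raw_df = spark.read.option("header", "true").csv("Files/landing/customers/")

# Stamp lineage metadata so a future "replay" can identify each load.
bronze_df = (raw_df
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_file", F.input_file_name()))

# Append to a Bronze Delta table; the source data itself stays untouched.
bronze_df.write.format("delta").mode("append").saveAsTable("bronze_customers")
```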

Layer 2: The Silver Layer (Cleansed & Conformed)

This is where the heavy lifting happens. In the Silver layer, you take your raw Bronze data and "clean" it. This is often where Data Engineering personas thrive using Spark Notebooks or Dataflow Gen2.

  • Activities: Filtering out nulls, standardizing date formats (YYYY-MM-DD), joining related tables, and deduplicating records.
  • The Result: You now have a "Single Source of Truth." If two different source systems have different IDs for the same customer, you resolve that here.
  • Storage: Always Delta tables for performance and ACID compliance.
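As a hedged illustration of that cleansing step, this PySpark sketch promotes the Bronze table from the previous example into Silver; the column names (customer_id, order_date) are hypothetical:

```python
from pyspark.sql import functions as F

bronze_df = spark.read.table("bronze_customers")

silver_df = (bronze_df
    # Filter out records missing the business key.
    .filter(F.col("customer_id").isNotNull())
    # Standardize dates to yyyy-MM-dd (assuming the source used M/d/yyyy).
    .withColumn("order_date", F.to_date("order_date", "M/d/yyyy"))
    # Deduplicate, keeping one row per customer.
    .dropDuplicates(["customer_id"]))

# Overwrite the Silver Delta table with cleansed, conformed data.
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver_customers")
```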

Layer 3: The Gold Layer (Business Ready)

The Gold layer is designed for consumption. This data is curated for specific business departments or use cases (like Sales, Finance, or Marketing).

  • Structure: Often organized in a Star Schema (Facts and Dimensions).
  • The Goal: Speed and ease of use. A Power BI user should be able to connect to a Gold table and understand exactly what the columns mean without doing any further joining or cleaning.
  • Storage: This is often where you might choose between a Gold Lakehouse or a Gold Data Warehouse for final reporting.
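For example, a Gold-layer build might aggregate Silver data to the grain the business actually reports on. This sketch assumes a hypothetical silver_orders table with customer_id, order_id, order_date, and amount columns:

```python
from pyspark.sql import functions as F

orders = spark.read.table("silver_orders")

# Aggregate to a reporting-friendly grain: revenue per customer per month,
# ready to serve as a fact table in a star schema.
gold_fact_sales = (orders
    .withColumn("order_month", F.date_trunc("month", F.col("order_date")))
    .groupBy("customer_id", "order_month")
    .agg(F.sum("amount").alias("total_revenue"),
         F.count("order_id").alias("order_count")))

gold_fact_sales.write.format("delta").mode("overwrite").saveAsTable("gold_fact_sales")
```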

Why Does Medallion Architecture Work Better in Fabric?

In traditional architectures, promoting data between these layers meant complex ETL jobs and physically moving data between different servers.

In Microsoft Fabric, this process is significantly streamlined:

Zero Data Movement: Because everything is in OneLake, moving from Bronze to Silver is just a matter of a Spark job reading one folder and writing to another.

Shortcuts: You can "shortcut" your Bronze data from an external AWS S3 bucket directly into your Fabric Lakehouse, meaning your "Bronze" layer doesn't even have to physically reside in Azure!

Unified Governance: You can apply security and sensitivity labels across the entire chain, ensuring that raw PII (Personally Identifiable Information) in Bronze is masked by the time it reaches Gold.

Summary of the Medallion Architecture Flow:

Layer | Quality | Audience | Typical Tool
Bronze | Raw / Dirty | Data Engineers | Data Factory Pipelines
Silver | Clean / Integrated | Data Scientists | Spark Notebooks / Dataflows
Gold | Aggregated / Highly Structured | Business Analysts | SQL Endpoint / Power BI


Tuesday, August 19, 2025

The Power of Lakehouses and Warehouses in Microsoft Fabric

Early in every Microsoft Fabric journey, you reach a critical decision point -- where should your data live?

In Fabric, you have two primary items or "workhorses" for storing and querying your structured data: the Lakehouse and the Warehouse. Because both store data in the open Delta Lake format and live in OneLake, the choice isn't about storage—it's about the engine you want to use and the persona of the person doing the work. 

The Lakehouse: The Data Engineer's Playground

The Lakehouse is the "Swiss Army Knife" of Fabric. It is designed for flexibility, allowing you to store structured tables alongside semi-structured (JSON, XML) and unstructured data (images, PDFs, logs).

  • Primary Engine: Apache Spark.
  • Primary Languages: Python (PySpark), Scala, R, and Spark SQL.
  • Best For: Data engineering, big data processing, and Data Science/ML workloads.

Key Feature: The SQL Analytics Endpoint

Every Lakehouse automatically comes with a SQL Analytics Endpoint. This is a read-only gateway that allows SQL-savvy users to query the Lakehouse tables using familiar T-SQL, without needing to know a line of Python.

The Data Warehouse: The SQL Pro’s Powerhouse

If the Lakehouse is a "playground," the Data Warehouse is a high-performance relational engine. It is designed for developers who want the full power of T-SQL (Transact-SQL) and traditional data warehousing capabilities like primary keys, foreign keys, and complex transactions.

  • Primary Engine: Distributed SQL Engine (Polaris).
  • Primary Language: T-SQL (Transact-SQL).
  • Best For: Enterprise data warehousing, complex SQL transformations, and users migrating from SQL Server or Synapse Dedicated Pools.

Key Feature: Unlike the Lakehouse SQL Endpoint, which is read-only, the Data Warehouse is fully read-write. You can use INSERT, UPDATE, and DELETE commands directly against your tables using standard SQL scripts.
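As a sketch of that read-write capability, here is how you might run T-SQL DML against a Warehouse from Python with pyodbc. The server name, database, and table are placeholders; in practice you would copy the SQL connection string from your Warehouse settings and sign in with Microsoft Entra ID:

```python
import pyodbc

# Placeholder connection string; copy the real one from the Warehouse settings.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-warehouse.datawarehouse.fabric.microsoft.com;"
    "Database=SalesDW;"
    "Authentication=ActiveDirectoryInteractive;"
)
cursor = conn.cursor()

# Full DML support: INSERT, UPDATE, and DELETE all work against Warehouse tables.
cursor.execute("INSERT INTO dbo.DimRegion (RegionName) VALUES (?)", "EMEA")
cursor.execute("UPDATE dbo.DimRegion SET RegionName = ? WHERE RegionName = ?",
               "Europe", "EMEA")
conn.commit()
conn.close()
```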

The beauty of Fabric is that both tools use the Delta Lake format. This means a Power BI report can look at data in a Lakehouse and a Warehouse simultaneously without any "data movement."

Feature | Lakehouse | Data Warehouse
Primary Persona | Data Engineers / Scientists | SQL Developers / Analysts
Primary Engine | Apache Spark | SQL Engine
Write Support | Spark (Python/Scala), Dataflow | T-SQL (DML), Dataflow, Pipelines
Read Support | Spark & T-SQL (Read-only) | T-SQL (Read-Write)
Structure | Files + Folders + Tables | Relational Tables only
Multi-table Transactions | Limited (Spark level) | Full ACID Support

In a real-world Fabric implementation, you don't usually pick just one. 

Most architects follow this flow:

  • Lakehouse for the "Bronze" and "Silver" layers (raw ingestion and Spark-based cleaning).
  • Data Warehouse for the "Gold" layer (curated, highly structured data ready for business users to query with SQL).
Here is a quick rule of thumb:
  • Choose Lakehouse if: You have semi-structured data (JSON), love Python/Notebooks, or are doing Machine Learning.
  • Choose Warehouse if: You are building a star schema, need full SQL "Write" capabilities, or are migrating a traditional SQL data warehouse.


Wednesday, August 06, 2025

Everything You Need to Know About On-Premises SQL Server Mirroring in Microsoft Fabric

Microsoft Fabric is rapidly changing the data landscape, bringing together analytics, data science, and business intelligence into a single, unified platform. One of its most exciting features for organisations with existing on-premises SQL Server deployments is Mirroring. This capability offers a near real-time, low-friction way to bring your transactional data into the Fabric ecosystem for immediate analytics, reporting, and AI workloads.  

At the time of writing, this feature is still in Preview, which means it should not be used in production scenarios.

But what exactly is it, how does it work, and what should you be aware of before diving in? 

Let's break it down.

What is SQL Server Mirroring in Microsoft Fabric?

At its core, SQL Server Mirroring in Fabric provides a near real-time, low-latency replication mechanism for your on-premises SQL Server databases directly into Microsoft OneLake, Fabric's unified data lake. It creates a "Mirrored Database" item in your Fabric workspace, which then exposes a SQL Analytics Endpoint and a Default Semantic Model. 

This means your live transactional data is immediately available for the following use cases:

  • Querying with T-SQL using the SQL Analytics Endpoint.
  • Building Power BI reports with Direct Lake mode for unparalleled performance.
  • Leveraging Spark notebooks for advanced analytics and transformations.
  • Creating operational dashboards that reflect the latest business transactions.

The magic behind this is that Fabric automatically uses Change Data Capture (CDC) for SQL Server 2016-2022 or a native change feed for SQL Server 2025 to intelligently track and synchronise changes, avoiding the need for complex ETL pipelines for initial data ingestion.
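If you want to see what Fabric enabled on the source, a quick check of the CDC flag from Python might look like this (the connection details are placeholders):

```python
import pyodbc

# Placeholder connection to the on-premises SQL Server instance.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=MyOnPremServer;Database=master;"
    "Trusted_Connection=yes;TrustServerCertificate=yes;"
)

# sys.databases exposes which databases have Change Data Capture turned on.
for name, cdc_enabled in conn.execute(
        "SELECT name, is_cdc_enabled FROM sys.databases"):
    print(f"{name}: CDC {'enabled' if cdc_enabled else 'disabled'}")

conn.close()
```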

The Power of Mirroring: Key Benefits

  • You can say goodbye to nightly batch jobs. Mirroring ensures your analytics are always based on the freshest possible data.
  • No more manual pipeline creation for initial loads and incremental updates. Fabric handles the replication automatically.
  • Your SQL Server data lands directly in OneLake, making it accessible across all Fabric personas (Data Engineering, Data Science, Data Warehousing, BI).
  • The actual data replication process itself (the initial snapshot and ongoing CDC) does not consume your Fabric Capacity Units (CU). You only pay for the compute when you query or process the data in Fabric.
  • For supported tables, Power BI datasets automatically leverage Direct Lake, offering the performance of Import Mode with the freshness of DirectQuery.
  • The replication is designed to be low impact on your production SQL Server, using CDC mechanisms that are highly optimised.

What You Need to Know: Requirements & Limitations

While incredibly powerful, mirroring isn't a silver bullet. Understanding its current requirements and limitations is crucial for a successful implementation.

On-Premises SQL Server Requirements:

  • SQL Server 2016 through 2022 (using CDC) and SQL Server 2025 (using native change feed).
  • Mirroring is supported only on the primary replica of an Always On Availability Group.
  • SQL Server on Windows is fully supported. SQL Server on Linux (2017 CU18+ to 2022) is also supported.
  • An on-premises data gateway (or VNet data gateway) is required for secure communication between Fabric and your on-premises SQL Server.

Database & Table Limitations:

  • Single Mirror Target: A database can only be mirrored to one Fabric workspace at a time.
  • No Redundant CDC (SQL Server 2025): If you're on SQL Server 2025, do not manually enable CDC on the source database; Fabric handles this for you.
  • No Delayed Transaction Durability: Databases with this setting cannot be mirrored.
  • 500 Table Limit: Each mirrored database can replicate a maximum of 500 tables. Fabric will pick the first 500 alphabetically if you choose "Mirror all data."
  • Unsupported Data Types: This is a big one! Tables containing columns with types like image, text, ntext, xml, timestamp/rowversion, sql_variant, geometry, geography, hierarchyid, json, vector, or User-Defined Types (UDT) cannot be fully mirrored. The presence of these types will cause the entire column, or sometimes the whole table, to be excluded (see the pre-flight check script after this list).
  • Precision Loss: datetime2(7) and datetimeoffset(7) columns will lose their 7th digit of precision (becomes 6 digits) when mirrored to OneLake's Delta Lake format.
  • No Row/Column Filtering: Mirroring is an "all or nothing" proposition for selected tables. You cannot filter specific rows or exclude individual columns during replication. 
  • DDL Changes: Significant schema changes (like altering a Primary Key) will often trigger a full re-sync of the entire table, leading to higher latency for that specific table until the re-sync completes.
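Here is the pre-flight check mentioned above: a hedged sketch that scans INFORMATION_SCHEMA for columns using types from the unsupported list, so you can spot problem tables before enabling mirroring (connection details are placeholders):

```python
import pyodbc

# Types called out as unsupported by mirroring (rowversion appears as
# "timestamp" in INFORMATION_SCHEMA).
UNSUPPORTED = ("image", "text", "ntext", "xml", "timestamp", "sql_variant",
               "geometry", "geography", "hierarchyid", "json", "vector")

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=MyOnPremServer;Database=SalesDB;"
    "Trusted_Connection=yes;TrustServerCertificate=yes;"
)

placeholders = ",".join("?" * len(UNSUPPORTED))
rows = conn.execute(
    f"SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE "
    f"FROM INFORMATION_SCHEMA.COLUMNS WHERE DATA_TYPE IN ({placeholders})",
    *UNSUPPORTED).fetchall()

for schema, table, column, dtype in rows:
    print(f"{schema}.{table}.{column} uses unsupported type '{dtype}'")

conn.close()
```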

Measuring Capacity Usage: Understanding the Costs

  • Replication is FREE (Compute-wise) -- The initial snapshot and continuous Change Data Capture (CDC) processing itself does not consume your Fabric Capacity Units (CU). This is a significant advantage over other replication methods.
  • You Pay for Data Usage -- Your Fabric Capacity Units are consumed when you:
    • Query the mirrored data via the SQL Analytics Endpoint.
    • Interact with Power BI reports built on the mirrored data (especially in Direct Lake mode).
    • Process the data using Spark notebooks or other Fabric experiences.
  • Storage Costs -- You get a generous 1 TB of free mirroring storage for every 1 Capacity Unit (CU) you own. For example, an F64 capacity gets 64 TB of free storage for mirrored data. You only start paying for OneLake storage if you exceed this allowance; note that storage charges continue to apply while your Fabric capacity is paused.

How to Monitor?

Use the Microsoft Fabric Capacity Metrics App to track your CU consumption and storage usage for "Mirrored Database" items. This app will clearly show you what's driving your costs. You can also leverage Workspace Monitoring (via KQL Databases) to track MirroredDatabaseTableExecution logs for metrics like ProcessedBytes, ProcessedRows, and Latency to understand the operational health and volume of your replication.

Getting Started with Mirroring

  1. Verify your SQL Server version and Always On Availability Group configuration, and install a Data Gateway if needed.
  2. Ensure your SQL Server instance has the necessary permissions; Fabric will automatically enable CDC on your source database (or use the native change feed for SQL Server 2025).
  3. In your Fabric workspace, select "New > More Options > Mirrored Database (SQL Server)."
  4. Provide your SQL Server connection details and gateway information.
  5. Choose which tables you want to mirror. Fabric will automatically identify and flag any tables with unsupported data types.
  6. Keep an eye on the replication status in Fabric and use the Capacity Metrics App to track usage.

Conclusion

On-premises SQL Server Mirroring in Microsoft Fabric is a game-changer for organisations looking to bridge the gap between their transactional systems and modern analytics. By providing near real-time, low-cost data ingestion directly into OneLake, it unlocks a world of possibilities for faster insights and more agile data operations. While you need to be mindful of its current limitations, the benefits of quick, seamless integration make it an invaluable tool in your Fabric journey.


Friday, July 18, 2025

Fabric Pipelines vs. Dataflow Gen2

Within Data Factory in Microsoft Fabric, you'll encounter two primary tools for data movement and transformation tasks: Data Pipelines and Dataflow Gen2. While both achieve similar goals, they cater to different use cases and user preferences. 

Understanding when to use which is key to building efficient and scalable data solutions in Fabric.

What is Data Factory in Fabric?

Data Factory in Microsoft Fabric provides a modern, cloud-based data integration service that allows you to create, schedule, and orchestrate your data movement and transformation workflows. It is essentially the engine that helps you bring data into OneLake and prepare it for analytics.

You'll find the Data Factory experience directly integrated into your Fabric workspace, allowing seamless interaction with other items like Lakehouses and Data Warehouses.


When you click New Item in your Fabric workspace, you will find both Data Pipelines and Dataflow Gen2 listed as Data Factory item types.

Data Pipelines: The Orchestration Maestro

Data Pipelines in Fabric are the evolution of Azure Data Factory and Synapse Pipelines. They are designed for robust orchestration, control flow, and high-scale data movement. If you need to copy data from various sources to OneLake, execute notebooks, trigger stored procedures, or chain together a complex sequence of activities, Pipelines are your primary tool.

Key Characteristics of Data Pipelines:

  • Orchestration: Excellent for defining a sequence of activities, handling dependencies, and scheduling complex workflows.
  • Data Movement: Highly optimized for copying data between a vast array of data sources (databases, SaaS applications, file systems, cloud storage) to OneLake.
  • Control Flow: Provides activities for conditional logic, looping, error handling, and parallel execution.
  • Code-First & Low-Code Activities: While pipelines are primarily built by dragging and dropping activities onto a canvas, many of those activities (like calling a stored procedure or running a notebook) involve writing or pointing to code.
  • Monitoring: Comprehensive monitoring tools to track pipeline runs, identify failures, and troubleshoot.


When to Use Data Pipelines:

  • Ingesting large volumes of data from various sources into your Lakehouse or Warehouse.
  • Orchestrating end-to-end data workflows that involve multiple steps (e.g., ingest raw data, run a Spark notebook to transform it, then load it into a data warehouse).
  • Triggering other Fabric items, such as Spark notebooks, KQL queries, or dataflow refreshes.
  • Implementing robust error handling and retry mechanisms.
  • Scheduling batch data loads (e.g., daily, hourly).


A typical Data Pipeline showing various activities like Copy Data, Notebook, and Dataflow execution.


Dataflow Gen2: The Low-Code Transformation Powerhouse

Dataflow Gen2 in Fabric is the next generation of Power Query Online, familiar to anyone who has used Power BI or Power Apps. It's a low-code, visual tool primarily focused on data transformation and cleansing, designed for data engineers, analysts, and even business users who prefer a graphical interface.

Dataflow Gen2 excels at shaping, cleaning, and preparing data from a multitude of sources before loading it into your Fabric Lakehouse or Warehouse.


Key Characteristics of Dataflow Gen2:

  • Low-Code/No-Code: The primary interaction is through a visual Power Query editor, allowing users to apply transformations without writing a single line of code.
  • Intuitive Interface: Easy to learn for users familiar with Excel or Power BI's Power Query.
  • Data Cleansing & Shaping: Strong capabilities for common data preparation tasks like merging, splitting, pivoting, unpivoting, type conversion, and error handling.
  • Schema On Write: It writes directly to your Lakehouse or Warehouse in Delta Parquet format, creating or updating tables.
  • Scalability: Scales transformations through Fabric's enhanced compute engine and staging, going well beyond desktop Power Query.

When to Use Dataflow Gen2:

  • Quickly ingest and transform smaller to medium-sized datasets into a Lakehouse or Warehouse.
  • When your team prefers a visual, low-code experience for data preparation.
  • Performing common data cleansing and shaping tasks (e.g., standardizing formats, removing duplicates, simple joins).
  • When you need to get data ready for Power BI semantic models with minimal coding effort.
  • For "citizen data integrators" who are comfortable with Power Query.


Pipelines vs. Dataflow Gen2: A Quick Comparison

Feature | Data Pipelines | Dataflow Gen2
Primary Focus | Orchestration, Control Flow, Large-Scale Movement | Visual Transformation, Data Cleansing, Shaping
User Experience | Activity-based canvas, JSON definition | Power Query Editor (visual)
Best For | Complex ETL/ELT workflows, orchestration, varied activity types, high-volume ingestion | Agile data prep, smaller-to-medium datasets, business user transformations, quick data landing
Code Level | Low-code (configuring activities), code-first for some activities (Notebook, SP) | No-code/Low-code (M language generated behind the scenes)
Output Target | Can write to various destinations via Copy activity | Primarily writes to Fabric Lakehouse/Warehouse (Delta)
Dependencies | Can orchestrate other Fabric items | Can be orchestrated by Pipelines


How They Work Together: A Powerful Synergy

The real power of Data Factory in Fabric comes when you use Pipelines and Dataflow Gen2 together.

You might use a Pipeline to:

  1. Copy raw CSV files from an external Blob Storage account into the "Files" area of your Lakehouse.
  2. Then, trigger a Dataflow Gen2 to read those raw CSVs, apply transformations (e.g., parse dates, clean text, merge with a lookup table), and write the cleaned data as a Delta table into the "Tables" area of your Lakehouse.
  3. Finally, use the same Pipeline to trigger a Spark notebook for more advanced transformations or machine learning tasks.
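If you chose a notebook instead of a Dataflow for step 2, a rough PySpark equivalent of that transformation might look like this (the paths, columns, and lookup table are hypothetical):

```python
from pyspark.sql import functions as F

# Read the raw CSVs the pipeline landed in the Lakehouse "Files" area.
raw = spark.read.option("header", "true").csv("Files/raw/orders/")

lookup = spark.read.table("dim_country")  # hypothetical lookup table

cleaned = (raw
    .withColumn("order_date", F.to_date("order_date", "yyyy/MM/dd"))  # parse dates
    .withColumn("customer_name", F.trim(F.initcap("customer_name")))  # clean text
    .join(lookup, on="country_code", how="left"))                     # merge lookup

# Write the cleaned result as a Delta table into the "Tables" area.
cleaned.write.format("delta").mode("overwrite").saveAsTable("orders_clean")
```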


This combination allows you to leverage the strengths of both tools: Pipelines for robust orchestration and large-scale movement, and Dataflow Gen2 for efficient, visual data preparation.


Friday, July 04, 2025

OneLake: The "OneDrive" for Your Data

In this blog post I would like to dive into the foundational component that makes Microsoft Fabric's unified experience truly revolutionary: OneLake.

Think of OneLake as the "OneDrive" for your entire organisation's data. Just as OneDrive centralises your personal documents, OneLake centralises all your data assets, making them accessible to every engine and every persona within Microsoft Fabric. This isn't just another storage account; it's a paradigm shift in how we manage, access, and utilise data.


Problem: Data Silos and Duplication

Before OneLake, enterprise data environments often looked like this:

  • Your data engineers might land data in Azure Data Lake Storage (ADLS Gen2).
  • Your data scientists might copy some of that data into a separate environment for their experiments.
  • Your data warehousing team might ingest and transform data into a SQL Data Warehouse.
  • Power BI users might import data into their own models.

Every copy, every movement, introduced complexity, increased storage costs, created potential for inconsistencies, and slowed down development. It was a fragmented, costly, and often frustrating experience.


OneLake to the Rescue: One Copy, Many Experiences

OneLake fundamentally changes this by enforcing the principle of "One Copy of Data." Instead of copying data between services or creating separate data lakes for different departments, all your organisational data lives in a single, logical data lake within Fabric.

Imagine a central hub where all data naturally flows and resides.

Here's how it works:

  1. Open Data Formats: OneLake stores all data in an open format, primarily Delta Lake Parquet. This means the data isn't locked into a proprietary system.
  2. Engine Agnostic: Whether you're using a Spark notebook for data engineering, a SQL endpoint for warehousing, or Power BI for analytics, all these engines access the exact same underlying data files in OneLake. There's no need to move, convert, or duplicate.
  3. Hierarchical Namespace: OneLake automatically organizes your data within a tenant, workspace, and item structure. This provides a clear and intuitive way to manage your data assets, much like folders and files in OneDrive.
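To see the hierarchy in action, here is a hedged sketch of addressing the same Delta table two ways from a Fabric notebook; the workspace and Lakehouse names are placeholders:

```python
# 1. Relative path, when the notebook is attached to the Lakehouse:
df = spark.read.format("delta").load("Tables/sales")

# 2. Fully qualified OneLake URI following the workspace/item hierarchy
#    (placeholder workspace and Lakehouse names):
df2 = spark.read.format("delta").load(
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
    "MyLakehouse.Lakehouse/Tables/sales"
)

# Both point at the exact same physical Delta files in OneLake.
df.show(5)
```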

Understanding Workspaces and Items

Within Fabric, your data is organised into Workspaces. A Workspace is a collaborative environment where teams can manage their data assets. Inside a Workspace, you create Items, such as Lakehouses, Data Warehouses, KQL Databases, or Power BI semantic models.

OneLake automatically provisions storage for every item you create. 

For example, when you create a Lakehouse, OneLake creates a dedicated folder structure for it, including Tables (for Delta tables) and Files (for raw files).

The Magic of Shortcuts: Virtualizing Data In-Place

What if some of your data already lives outside of Fabric in an existing ADLS Gen2 account, Amazon S3 bucket, or Google Cloud Storage? Do you have to ingest it all into OneLake? No! This is where Shortcuts come in.

Shortcuts allow you to create a virtual link from OneLake to external data sources. The data itself remains in its original location, but it appears as if it's part of OneLake. This means you can:
  • Query external data using Spark or SQL endpoints in Fabric without moving it.
  • Combine external data with data already in OneLake seamlessly.
  • Start leveraging Fabric's powerful compute engines immediately, even with existing data estates.
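For example, once a shortcut exists under your Lakehouse, querying it is indistinguishable from querying local data. This sketch assumes a hypothetical shortcut named s3_sales in the Files area, pointing at Parquet files in an S3 bucket:

```python
# Spark reads the shortcut like any other Lakehouse folder; the bytes stay in S3.
external_df = spark.read.parquet("Files/s3_sales/")

# Combine the external data with a table already in OneLake (hypothetical names).
local_df = spark.read.table("dim_customer")
external_df.join(local_df, on="customer_id", how="inner").show(10)
```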


Visualizing a shortcut: the data stays external, but Fabric treats it as local.

Key Benefits of Shortcuts:

  • No Data Movement: Reduces ingestion time, cost, and complexity.
  • Compliance: Data can remain in its original sovereign cloud if needed.
  • Unified View: Provides a single pane of glass for all your data, regardless of its physical location.

Why OneLake Matters

OneLake is more than just a storage layer; it's the foundation for true data democratization within an organization. It simplifies the data landscape, reduces complexity, and accelerates the journey from raw data to actionable insights.

  • Cost Savings: No more duplicating data across multiple services.
  • Improved Governance: A single source of truth makes security and compliance easier to manage.
  • Faster Time-to-Insight: Data is immediately available to all Fabric experiences.
  • Reduced Complexity: Less data movement and fewer integration points to manage.


In essence, OneLake empowers every data professional in your organisation to work with the same trusted data, fostering collaboration and innovation.

Tuesday, June 24, 2025

How to download files from a Microsoft Fabric Lakehouse?

Challenge

Yesterday I was trying to download a Parquet file from a Microsoft Fabric Lakehouse onto my laptop, so I went searching for a download link.

But I could not find one.

Solution

The easiest way is to download the OneLake file explorer from this link: OneLake explorer. Use it to browse and download Parquet files; it works much like OneDrive.


You can also use Python in a Fabric notebook -- Link here
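For reference, here is a hedged sketch of a Python route run from your laptop (rather than a notebook), using the Azure Data Lake SDK against the OneLake endpoint. It requires the azure-identity and azure-storage-file-datalake packages, and the workspace, Lakehouse, and file names are placeholders:

```python
from azure.identity import InteractiveBrowserCredential
from azure.storage.filedatalake import DataLakeServiceClient

# OneLake exposes an ADLS Gen2-compatible endpoint: the workspace acts as
# the file system, and the Lakehouse item is the top-level directory.
service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",
    credential=InteractiveBrowserCredential(),
)
fs = service.get_file_system_client("MyWorkspace")  # placeholder workspace
file = fs.get_file_client("MyLakehouse.Lakehouse/Files/data/sales.parquet")

# Stream the Parquet file down to the local machine.
with open("sales.parquet", "wb") as local_file:
    file.download_file().readinto(local_file)
```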


Tuesday, June 03, 2025

Free Power BI Classes Week 5

Yesterday I completed the Power BI class for week 5.

We continued with Lab 3 and completed both Lab 3 and 4.

The participants learnt about formatting page backgrounds, inserting images, using bookmarks, publishing the report to Power BI Service, creating mobile layouts, and more.

Here is the recording of the class for week 5:


Free Power BI Classes Week 5

