Microsoft Fabric Updates Blog

Fabric Change the Game: Exploring the data

Microsoft Fabric offers a set of different tools to explore and prepare data for analysis, and from that perspective, notebooks have become one of the quickest ways to get started with data exploration. This post draws its inspiration from the world of experimentation, exploration, and seamless integration in Microsoft Fabric: Data Science in Microsoft Fabric documentation – Microsoft Fabric | Microsoft Learn.

We’ll begin by reading data from Azure Data Lake Storage (ADLS) through a shortcut and, taking advantage of OneLake’s flexibility, organize that raw data into structured tables. With this foundation in place, we’ll take the next step in our journey: basic data exploration.

Our data exploration uses the diverse and captivating city of London as its source, with information extracted from https://data.london.gov.uk/, an open data-sharing platform about London.

Have you ever wondered how you can unlock the potential of a data Lakehouse?

Step by step:

  1. For this example, the ADLS account is connected by a shortcut to a workspace in Fabric; let’s call it Workspace 1. Inside the same tenant, the goal is to copy that data from ADLS to a new workspace, which we will call Workspace 2, as Fig 1 illustrates. Because OneLake sits between the workspaces, that should be an easy task. If you want to read more: Fabric Changing the game – OneLake integration | Microsoft Fabric Blog | Microsoft Fabric
Fig 1

As with everything in life, there is more than one way to accomplish this, especially in Fabric. The choice here will be mssparkutils, my favorite. Read more here: Introduction to Microsoft Spark utilities – Azure Synapse Analytics | Microsoft Learn
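If you have not used it before, a quick way to see which file system helpers are available (this is the standard mssparkutils API; the call below is just a pointer, not part of the copy logic) is:

# Lists the file system utilities (cp, mv, ls, mkdirs, rm, ...) available in the notebook
mssparkutils.fs.help()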

The following block is written in Python and does a recursive search of the subfolders inside the main folder. The folder structure can be checked in Fig 2 – Copy.

import os

# Define the source and destination (destiny) paths
vpathsource = 'abfss://Workspace1@msit-onelake.dfs.fabric.microsoft.com/Nameofthelakehouse_source.Lakehouse/Files/ADLSstorage/Folderwherethedatais/'

vpathdestiny = 'abfss://Workspace2@msit-onelake.dfs.fabric.microsoft.com/nameoftheLakehouse_destiny.Lakehouse/Files/Folderwheredatais/'

With the source and destination defined as Fig 2 – Copy illustrates, we will recursively read the data from the source folder and copy it to the destination, keeping the same structure.

Fig 2 – Copy

# List files and directories in vpathsource
files_and_dirs = mssparkutils.fs.ls(str(vpathsource))

for item in files_and_dirs:

    if item.isFile:
        # If it's a file, copy it straight to the destination
        vpathsource_file = os.path.join(vpathsource, item.name)
        vpathdestiny_file = os.path.join(vpathdestiny, item.name)
        mssparkutils.fs.cp(vpathsource_file, vpathdestiny_file)

    elif item.isDir:
        # If it's a directory, create a corresponding directory in the destination
        vfolder = item.name
        vpathsource_dir = os.path.join(vpathsource, vfolder)
        vpathdestiny_dir = os.path.join(vpathdestiny, vfolder)
        mssparkutils.fs.mkdirs(vpathdestiny_dir)

        # List files in the source directory
        files_in_dir = mssparkutils.fs.ls(str(vpathsource_dir))
        for file_in_dir in files_in_dir:

            if file_in_dir.isFile:
                # Copy files from the source directory to the destination directory
                vfile = file_in_dir.name
                vpathsource_file = os.path.join(vpathsource_dir, vfile)
                vpathdestiny_file = os.path.join(vpathdestiny_dir, vfile)
                mssparkutils.fs.cp(vpathsource_file, vpathdestiny_file)

Note: You can also wrap this block in a try/except and log the output to OneLake with the logging library, following this example: Fabric changing the game: Logging your workload using Notebooks. | Microsoft Fabric Blog | Microsoft Fabric
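As a minimal sketch of that idea (assuming the vpathsource_file and vpathdestiny_file variables from the loop above and Python's standard logging module; persisting the log to OneLake follows the linked post):

import logging

# Hypothetical logger name; write the records to OneLake as shown in the linked post
logger = logging.getLogger("copy_job")

try:
    # Log failures per file instead of stopping the whole copy
    mssparkutils.fs.cp(vpathsource_file, vpathdestiny_file)
except Exception as e:
    logger.error(f"Failed to copy {vpathsource_file}: {e}")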

2. Once the data is copied, note that it is CSV data. We will read it and save it as Delta tables inside the Lakehouse, using the same folder-and-subfolder idea we just used for the copy. Fig 3 – Change the table’s name shows the result.


import os

# Using the relative path name here
files = mssparkutils.fs.ls('Files/mainfolderwherethedatais/')

for file in files:

    vfolder = file.name

    if file.isDir:
        files_folder = mssparkutils.fs.ls(f'Files/mainfolderwherethedatais/{vfolder}/')

        for file_in_folder in files_folder:
            vfile = file_in_folder.name
            print(vfile)

            # Remove the extension
            file_name_without_extension = os.path.splitext(vfile)[0]

            # Replace blank spaces with _
            new_file_name = file_name_without_extension.replace(" ", "_")

            df = spark.read.format("csv").option("header", "true")\
                     .load(f'Files/mainfolderwherethedatais/{vfolder}/{vfile}')

            df.write.mode("overwrite").format("delta").saveAsTable(f'{new_file_name}')
   
Fig 3 – Change the table’s name

After creating the tables and adding the Z-order optimization to cim_2023_music_all, the next step is to do some basic data exploration using a specific ward_2022_name.

Note1: Z-ordering is a technique to colocate related information in the same set of files.

V-order was also applied since I enabled this configuration at the Lakehouse Level. Please check it in the docs.

Note2: V-Order is a write-time optimization to the Parquet file format that enables lightning-fast reads under the Microsoft Fabric compute engine.

Some more information here: Delta Lake table optimization and V-Order – Microsoft Fabric | Microsoft Learn and Data skipping with Z-order indexes for Delta Lake – Azure Databricks | Microsoft Learn

Note3: When ZORDER and VORDER are used together, Apache Spark performs bin-compaction, ZORDER, and VORDER sequentially.

%%sql

OPTIMIZE cim_2023_music_all ZORDER BY (ward_2022_name)
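If you want to confirm the optimization ran, one option (a minimal sketch using standard Delta Lake SQL from the notebook, nothing specific to this post) is to check the table history, where the OPTIMIZE operation and its parameters show up:

# Inspect the Delta table history; the OPTIMIZE entry lists the Z-order columns in operationParameters
history = spark.sql("DESCRIBE HISTORY cim_2023_music_all")
history.select("version", "timestamp", "operation", "operationParameters").show(truncate=False)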

Basic Data Exploration

3. Tables were created. Now let’s look at the data. First, let me list the data we collected:

If we look at the Ward_2022 column, we can see some popular places in London like King’s Cross and Shepherd’s Bush Green. Hence, a Ward can be understood here as a neighborhood-style unit used to organize the data.

Fig 2 Wards
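To reproduce that listing directly in the notebook, a minimal sketch (cim_2023_music_all is the table used in the rest of this post; the other table names depend on your CSV file names) is:

# List the Delta tables created in the previous step
spark.sql("SHOW TABLES").show(truncate=False)

# Peek at a few distinct wards in the music table
spark.sql("SELECT DISTINCT ward_2022_name FROM cim_2023_music_all LIMIT 10").show(truncate=False)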

4. Using Ward_2022 as a reference, let’s get the minimum, maximum, mean, and standard deviation of the number of music events per Ward, for one hundred of the “Wards”. That way we would know which places in London have more (or fewer) events related to music. Using the standard deviation, mean, min, and max we can, for example, measure how much the data points vary around the average (mean): what is the variation of music events across the “Wards”? For that, we need to check which places have more or fewer events, which ones are about average, and how much those counts vary from one Ward to another.

Reference here: https://en.wikipedia.org/wiki/Standard_deviation

import pandas as pd
import matplotlib.pyplot as plt

df_music = spark.sql("SELECT * FROM cim_2023_music_all LIMIT 100")

# Creating a pandas DataFrame from the results
df_music = df_music.toPandas()

# How many events (venues) per ward
venue_count = df_music["ward_2022_name"].value_counts()

# Calculate the standard deviation, min, max, and mean of the number of venues per ward
std_deviation = venue_count.std()
mean_events = venue_count.mean()
max_venues = venue_count.max()
min_venues = venue_count.min()

# Print the results
print(f"Standard Deviation of Music Events: {std_deviation:.2f}")
print(f"Mean of Music Events: {mean_events:.2f}")
print(f"Max Venues per Ward: {max_venues}")
print(f"Min Venues per Ward: {min_venues}")
Fig 3 Stats Events

Given that there is an average of 2.44 music events per Ward, and with the standard deviation already calculated, let’s try to find “Wards” that are outliers, i.e., significantly above the mean. We also removed the limit of one hundred rows, so the standard deviation should be higher this time. This can help you understand the variability of music events across “Wards” and spot areas with unusual event counts; the rule of thumb used here is mean + 1.5 times the standard deviation (you can change that). Fig 4 – STD, Min, Max shows the results, demonstrated with a histogram:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df_music = spark.sql("SELECT * FROM cim_2023_music_all")

# Creating a pandas DataFrame from the results
df_music = df_music.toPandas()

# How many events (venues) per ward
venue_count = df_music["ward_2022_name"].value_counts()

# Calculate the standard deviation, min, max, and mean of the number of venues per ward
std_deviation = venue_count.std()
mean_events = venue_count.mean()
max_venues = venue_count.max()
min_venues = venue_count.min()

# Print the results
print(f"Standard Deviation of Music Events: {std_deviation:.2f}")
print(f"Mean of Music Events: {mean_events:.2f}")
print(f"Max Venues per Ward: {max_venues}")
print(f"Min Venues per Ward: {min_venues}")

# Create a histogram of the number of events per ward using seaborn's histplot
sns.histplot(venue_count, bins=30, kde=True)

# Add title and labels
plt.title('Histogram of Music Events per Ward in London')
plt.xlabel('Number of Events')
plt.ylabel('Frequency')

# Outliers: wards whose event count is more than 1.5 standard deviations above the mean
outliers = venue_count[venue_count > (mean_events + 1.5 * std_deviation)]

print("Wards with Outliers:")
print(outliers)

Fig 4 – STD, Min, Max

For more information on this type of analysis, please check the docs: Visualize data in notebooks – Training | Microsoft Learn

As the figure above illustrates, some “Wards” have significantly more music events than the others, like the West End, which makes sense since that is where most of the theater and musical events happen. Other places like Hoxton East, Camden Town, and St James were also considered outliers in this case, as they also have significantly more events.
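If you want to see those outlier “Wards” side by side rather than in the histogram, a small follow-up sketch (reusing the outliers Series produced by the block above) is a simple bar chart:

import matplotlib.pyplot as plt

# Bar chart of the outlier wards found above ('outliers' maps ward name to event count)
outliers.sort_values().plot(kind="barh")
plt.title("Outlier Wards by Number of Music Events")
plt.xlabel("Number of Events")
plt.ylabel("Ward")
plt.show()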

Summary: Using mssparkutils, we copied the data from an ADLS shortcut in Workspace 1 to Workspace 2 through the OneLake structure. The copy was recursive, since the data was spread across different folders, and it is quite simple thanks to the way Fabric works. Once the data was where it was supposed to be, the same recursive logic was applied to create Delta tables from the CSV files. From the Delta tables, we started some basic data exploration using the Data Science concepts available in the docs.
