Microsoft Fabric Updates Blog

Building a Custom Sparklens JAR for Microsoft Fabric

Problem Statement

In the previous blog, Profiling Microsoft Fabric Spark Notebooks with Sparklens, we covered how to run Sparklens to profile and tune the performance of your Spark notebooks in Microsoft Fabric. That blog used a custom Sparklens JAR because the Sparklens JARs available in the Maven Central repository support only Spark 2.X, which is not compatible with Microsoft Fabric. In this blog, you will learn how to build a Sparklens JAR for Spark 3.X that can be used in Microsoft Fabric.

Prerequisite Reading

To learn what Sparklens is and how to run it in a Microsoft Fabric Spark notebook to optimize performance, please check out this blog: Profiling Microsoft Fabric Spark Notebooks with Sparklens

Discussion

Sparklens is an open-source tool for profiling Spark jobs and notebooks. The latest JARs in the Maven Central repository support Spark 2.X and do not work with Spark 3.X. The steps below describe the modifications you need to make to run it on Spark 3.X.

Note: Sparklens is not owned or maintained by Microsoft, so it is crucial that you implement all necessary security measures, as you would when using any third-party package or library. Please check out the Sparklens license details here.

Steps to run Sparklens on Spark 3.X:

1. Set Up the Build Tool:

Sparklens is developed in Scala. To package a Scala project, you can use build tools like sbt (simple build tool). Ensure you have sbt installed on your local machine. This blog uses sbt version 0.13.18.
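The sbt launcher reads the version to build with from the sbt.version key in project/build.properties, so the version a project actually uses can differ from the one installed on your machine. The sketch below shows one way to check that pin once the repository is on disk. The helper name pinned_sbt_version is made up for this example, and whether the Sparklens repository ships this file is an assumption, not something this blog confirms.

```shell
#!/bin/sh
# Hypothetical helper: print the sbt version pinned by a project, if any.
# The sbt launcher reads the "sbt.version" key from project/build.properties.
pinned_sbt_version() {
  props="$1/project/build.properties"
  # No properties file means there is nothing to read.
  [ -f "$props" ] || return 1
  # Print the text after "=", dropping spaces and any carriage return.
  sed -n 's/^[[:space:]]*sbt\.version[[:space:]]*=[[:space:]]*//p' "$props" | tr -d ' \r'
}

# Example, run from the directory that contains the cloned repository:
#   pinned_sbt_version ./sparklens
```

If the helper prints nothing, the project does not pin a version and sbt falls back to the version of the installed launcher.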

2. Clone the Repository:

Clone the Sparklens GitHub repository to your local machine from the following link: qubole/sparklens: Qubole Sparklens tool for performance tuning Apache Spark (github.com).

git clone https://github.com/qubole/sparklens.git

3. Prepare Your Development Environment:

Use your preferred IDE to make the necessary changes. For this blog, Visual Studio Code is used. Open the terminal and navigate to the cloned sparklens directory:

cd sparklens

4. Modify plugins.sbt:

Update the plugins.sbt file (in the project directory) to comment out the existing sbt-spark-package plugin line, addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.4"), as shown here:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")

resolvers += "Spark Package Main Repo" at "https://dl.bintray.com/spark-packages/maven"

// addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.4")
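If you prefer to script this edit rather than make it by hand, the following is a minimal sketch. The helper name comment_out_spark_package is hypothetical, and the sketch assumes the plugin is declared on a single line that starts with addSbtPlugin, as in the snippet above.

```shell
#!/bin/sh
# Hypothetical helper: comment out the sbt-spark-package plugin line in a
# plugins.sbt file. Lines that are already commented are left unchanged.
comment_out_spark_package() {
  file="$1"
  tmp="$file.tmp"
  # On lines that start with addSbtPlugin( and mention sbt-spark-package,
  # insert "// " at the start of the line. Write to a temporary file first
  # so the original is only replaced if sed succeeds.
  sed '/^[[:space:]]*addSbtPlugin(.*sbt-spark-package/s|^|// |' "$file" > "$tmp" &&
    mv "$tmp" "$file"
}

# Example, run from the repository root:
#   comment_out_spark_package project/plugins.sbt
```

Because the pattern only matches lines that begin with addSbtPlugin, running the helper a second time does not add another comment marker.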

5. Update build.sbt:

Make the following changes to the build.sbt file:

  • Comment out the spName, sparkVersion, and spAppendScalaVersion settings. These are setting keys provided by the sbt-spark-package plugin, which is no longer loaded after the previous step, so assigning them with the := operator no longer compiles. Instead, declare these three as plain val variables.
  • Comment out the line that uses sparkVersion.version and replace it with sparkVersion, since sparkVersion is now a String and does not have a version property.
  • Change the Scala version to 2.12.0 and the Spark version to 3.0.0. Add the spark-sql 3.0.0 library dependency.

Here are the updated sections of the build.sbt file:

name := "sparklens"
organization := "com.qubole"

scalaVersion := "2.12.0"

crossScalaVersions := Seq("2.10.6", "2.12.0")

// spName := "qubole/sparklens"

// sparkVersion := "2.0.0"

// spAppendScalaVersion := true

val spName = "qubole/sparklens"

val sparkVersion = "3.0.0"

val spAppendScalaVersion = true


// libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion.version % "provided"

libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0"

6. Update QuboleJobListener.scala:

In QuboleJobListener.scala (src/main/scala/com/qubole/sparklens/QuboleJobListener.scala), change attemptId to attemptNumber() as shown in this code snippet:

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val stageTimeSpan = stageMap(stageCompleted.stageInfo.stageId)
    if (stageCompleted.stageInfo.completionTime.isDefined) {
      stageTimeSpan.setEndTime(stageCompleted.stageInfo.completionTime.get)
    }
    if (stageCompleted.stageInfo.submissionTime.isDefined) {
      stageTimeSpan.setStartTime(stageCompleted.stageInfo.submissionTime.get)
    }

    if (stageCompleted.stageInfo.failureReason.isDefined) {
      //stage failed
      val si = stageCompleted.stageInfo
      failedStages += s""" Stage ${si.stageId} attempt ${si.attemptNumber()} in job ${stageIDToJobID(si.stageId)} failed.
                      Stage tasks: ${si.numTasks}
                      """
      stageTimeSpan.finalUpdate()
    } else {
      val jobID = stageIDToJobID(stageCompleted.stageInfo.stageId)
      val jobTimeSpan = jobMap(jobID)
      jobTimeSpan.addStage(stageTimeSpan)
      stageTimeSpan.finalUpdate()
    }
  }
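To confirm that no other references to attemptId remain before compiling, a quick search of the sources can help. The sketch below is one way to do it; the helper name find_attempt_id_uses is hypothetical, and the --include option assumes GNU or BSD grep. Not every match necessarily needs changing, so review each hit in context.

```shell
#!/bin/sh
# Hypothetical helper: list remaining whole-word uses of attemptId in the
# Scala sources under the given project directory.
find_attempt_id_uses() {
  # -r recurses, -n prints line numbers, -w matches attemptId as a whole
  # word. grep exits with a non-zero status when nothing matches.
  grep -rnw --include='*.scala' 'attemptId' "$1/src/main/scala"
}

# Example, run from the repository root:
#   find_attempt_id_uses .
```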

7. Update HDFSConfigHelper.scala:

In HDFSConfigHelper.scala (src/main/scala/com/qubole/sparklens/helper/HDFSConfigHelper.scala), the helper relies on the SparkHadoopUtil class, which is private in Spark 3. Obtain the Hadoop configuration from a SparkSession instead, as shown below:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object HDFSConfigHelper {
  def getHadoopConf(sparkConfOptional: Option[SparkConf]): Configuration = {
    if (sparkConfOptional.isDefined) {
      val spark = SparkSession.builder.config(sparkConfOptional.get).getOrCreate()
      spark.sparkContext.hadoopConfiguration
    } else {
      val spark = SparkSession.builder.getOrCreate()
      spark.sparkContext.hadoopConfiguration
    }
  }
}

8. Compile the Revised Code: Run “sbt compile” to compile the project.

9. Package the Compiled Code: Run “sbt package” to package the project as a JAR file.
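After packaging, it is worth checking that the JAR was actually produced before uploading it anywhere. The following is a minimal sketch of that check. The helper name packaged_jar is hypothetical, and the path pattern is taken from the JAR name used in this blog, so adjust it if your Scala or Sparklens version differs.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the packaged Sparklens JAR, or fail
# if the build has not produced one yet.
packaged_jar() {
  for jar in "$1"/target/scala-2.12/sparklens_2.12-*.jar; do
    # An unmatched glob stays as literal text, so test that the file exists.
    [ -f "$jar" ] && { printf '%s\n' "$jar"; return 0; }
  done
  return 1
}

# Typical use from the repository root, after compiling and packaging:
#   sbt compile && sbt package && packaged_jar .
```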

10. Use the JAR: You can now use the JAR (target/scala-2.12/sparklens_2.12-0.3.2.jar) to run profiling in a Microsoft Fabric notebook, as described in Profiling Microsoft Fabric Spark Notebooks with Sparklens.

Further Reading

qubole/sparklens: Qubole Sparklens tool for performance tuning Apache Spark (github.com)

Profiling Microsoft Fabric Spark Notebooks with Sparklens (Microsoft Fabric Blog)
