Microsoft Fabric Updates Blog

Enhancing Open Source: Fabric’s Contributions to FLAML for Scalable AutoML

At Fabric, we’re passionate about contributing to the open-source community, particularly in areas that advance the usability and scalability of machine learning tools. One of our recent endeavors has been making substantial contributions back to the FLAML (Fast and Lightweight AutoML) project, a robust library designed to automate the tedious and complex process of machine learning model selection and hyperparameter tuning.

What is FLAML?

FLAML (Fast and Lightweight AutoML) is an open-source library designed to streamline the process of automating machine learning tasks. AutoML, one of FLAML’s key capabilities, automates the often-complex process of model selection, hyperparameter tuning, and training. This automation makes FLAML particularly valuable for quickly building and optimizing models, even with minimal machine learning expertise. With its lightweight design and flexibility, FLAML empowers users to efficiently create high-performing models across a wide range of applications.

Scaling FLAML for Apache Spark Workloads

Recognizing the growing need for scalable solutions in large-scale data processing, we focused on enhancing FLAML’s capabilities for Spark workloads. Apache Spark is a powerhouse for big data processing, and integrating it seamlessly with AutoML processes is crucial for many enterprises looking to accelerate their machine learning pipelines. To address this, we’ve contributed several new Spark and non-Spark estimators to the FLAML project.

With `use_spark=True`, you can now parallelize training and explore a broader range of non-Spark models, which gives you more flexibility in model selection. We’ve also added more Spark learners, so you can experiment with a wider variety of Spark models when working directly with Spark DataFrames.
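As a rough illustration, the sketch below shows one way to turn on Spark-parallel trials for non-Spark learners. The helper names `parallel_automl_settings` and `run_parallel_automl` are hypothetical and not part of FLAML. The keyword arguments (`use_spark`, `n_concurrent_trials`, `force_cancel`, `estimator_list`) are `AutoML.fit` options, and the default learners listed are long-standing FLAML short names chosen for illustration. Check the FLAML documentation for the short names of the newly contributed estimators.

```python
from typing import Any, Dict, List, Optional


def parallel_automl_settings(
    time_budget_s: int = 120,
    n_concurrent_trials: int = 2,
    estimator_list: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for AutoML.fit that run trials in parallel on Spark."""
    if n_concurrent_trials < 1:
        raise ValueError("n_concurrent_trials must be at least 1")
    return {
        "task": "classification",
        "metric": "accuracy",
        "time_budget": time_budget_s,  # total search budget in seconds
        # Illustrative non-Spark learners; replace with the ones you want to search.
        "estimator_list": estimator_list or ["lgbm", "rf", "xgboost"],
        "use_spark": True,  # run trials as Spark jobs
        "n_concurrent_trials": n_concurrent_trials,
        "force_cancel": True,  # cancel Spark jobs that run past the time budget
    }


def run_parallel_automl(X_train, y_train, **overrides):
    """Fit FLAML AutoML with Spark-parallel trials.

    Assumes flaml[spark] is installed and a Spark session is available.
    """
    from flaml import AutoML  # lazy import so the settings helper has no dependencies

    automl = AutoML()
    automl.fit(X_train=X_train, y_train=y_train, **parallel_automl_settings(**overrides))
    return automl
```

In a notebook with an active Spark session, calling `run_parallel_automl(X_train, y_train, n_concurrent_trials=4)` on in-memory training data would start the search with up to four trials running at once.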

| New Apache Spark model estimators | New non-Apache Spark model estimators |
| --- | --- |
| SparkAFTSurvivalRegressionEstimator | ElasticNetEstimator |
| SparkGBTEstimator | LassoLarsEstimator |
| SparkGLREstimator | SGDEstimator |
| SparkLGBMEstimator | SVCEstimator |
| SparkLinearRegressionEstimator | Average |
| SparkLinearSVCEstimator | LassoLars_TS |
| SparkNaiveBayesEstimator | Naïve |
| SparkRandomForestEstimator | SeasonalAverage |
| | SeasonalNaive |
| | TCNEstimator |

These contributions significantly enhance FLAML’s versatility, making it a more powerful tool for users dealing with large datasets. Whether you’re handling complex workloads or managing distributed data processing, FLAML can now better support your needs, thanks to these new learners.

Improved MLflow Integration for Better Collaboration

In addition to expanding FLAML’s capabilities for Apache Spark, we’ve focused on improving its integration with MLflow, a widely used open-source platform for managing the entire machine learning lifecycle. We’ve enhanced this integration by adding support for automatically capturing key metrics, parameters, and models, even when autologging is disabled. Some of these metrics and parameters are unique to AutoML and aren’t captured by standard MLflow autologging. Additionally, we’ve streamlined the process by removing redundant intermediate runs that are typically logged by standard MLflow autologging but aren’t necessary for AutoML trials.

This improvement is crucial for ensuring the reproducibility of models. By automatically capturing the key details of each AutoML trial, users can easily track the parameters and metrics that were critical to the model’s performance. This not only promotes greater transparency but also improves collaboration, as teams can more effectively share insights and build upon each other’s work within the AutoML process.
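The following sketch shows one way this tracking might be used in practice: run AutoML inside an MLflow run with FLAML's `mlflow_logging` option enabled, then pick the best trial from the logged metrics. Both function names are hypothetical, and the metric key passed to `best_trial` is an assumption. Inspect your own MLflow runs to see which metric names FLAML actually logs in your version.

```python
from typing import Any, Dict, Iterable, Optional


def best_trial(
    trials: Iterable[Dict[str, Any]],
    metric: str,
    lower_is_better: bool = True,
) -> Optional[Dict[str, Any]]:
    """Return the trial record with the best value of `metric`, or None if no trial has it."""
    scored = [t for t in trials if metric in t.get("metrics", {})]
    if not scored:
        return None
    pick = min if lower_is_better else max
    return pick(scored, key=lambda t: t["metrics"][metric])


def fit_with_mlflow_logging(X_train, y_train, experiment_name: str, time_budget_s: int = 60):
    """Run FLAML AutoML inside an MLflow run (assumes flaml and mlflow are installed)."""
    import mlflow  # lazy imports keep best_trial usable without these packages
    from flaml import AutoML

    mlflow.set_experiment(experiment_name)
    automl = AutoML()
    with mlflow.start_run() as run:
        automl.fit(
            X_train=X_train,
            y_train=y_train,
            task="classification",
            time_budget=time_budget_s,
            mlflow_logging=True,  # ask FLAML to log trial details to MLflow
        )
    return automl, run.info.run_id
```

The records passed to `best_trial` are plain dictionaries with a `metrics` mapping. They can be built from the runs returned by `mlflow.search_runs(..., output_format="list")`, using each run's `info.run_id` and `data.metrics`.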

Support for Python 3.11

We’ve also extended FLAML’s support to include Python 3.11, in addition to the previously supported versions, Python 3.8 and Python 3.10. This contribution ensures that FLAML remains accessible and compatible with the latest advancements in the Python ecosystem, allowing users to leverage the performance improvements and new features introduced in Python 3.11.

A Commitment to the Community

Our contributions to the FLAML project are rooted in a commitment to helping the community. We believe that by improving open-source tools, we empower more people to leverage advanced technologies in their work. Whether you’re a data scientist, a data analyst, or a researcher, these enhancements to FLAML are designed to make your workflow easier, enabling you to scale your machine learning projects with greater ease and confidence.

As we continue to innovate and collaborate, we invite you to try out these new features and improvements, and we look forward to seeing how you use them in your machine learning projects.

Learn more

To dive deeper into what FLAML has to offer and get started with its AutoML capabilities, check out the official FLAML documentation. You’ll find comprehensive guides, examples, and resources to help you make the most of this powerful tool.

You can also try these capabilities within Fabric Data Science, where FLAML’s features are seamlessly integrated for an enhanced machine learning experience.
