Machine Learning at Scale with Databricks and Kubernetes
Overview
Machine Learning Operationalisation (ML Ops) is a set of practices for building, deploying, and monitoring machine learning applications quickly and reliably. Many organizations standardize on a particular set of tools to build a platform that supports these goals.
One combination of tools uses Databricks to build and manage machine learning models and Kubernetes to deploy them. This article explores how to design this solution on Microsoft Azure, followed by step-by-step instructions for implementing it as a proof-of-concept.
This article is targeted towards:
Organizations looking to build and manage machine learning models on Databricks.
Organizations that have experience deploying and managing Kubernetes workloads.
Organizations looking to deploy workloads that require low latency and interactive model predictions (e.g. a product recommendation API).
A GitHub repository with more details can be found here.
Design
This high-level design uses Azure Databricks and Azure Kubernetes Service to develop an ML Ops platform for real-time model inference. This solution can manage the end-to-end machine learning life cycle and incorporates important ML Ops principles when developing, deploying, and monitoring machine learning models at scale.
At a high level, this solution design addresses each stage of the machine learning lifecycle:
Data Preparation: this includes sourcing, cleaning, and transforming the data for processing and analysis. Data can live in a data lake or data warehouse and be stored in a feature store after it’s curated.
Model Development: this includes core components of the model development process, such as experiment tracking and model registration, using MLflow.
Model Deployment: this includes implementing a CI/CD pipeline to containerize machine learning models as API services. These services will be deployed to an Azure Kubernetes cluster for end-users to consume.
Model Monitoring: this includes monitoring the API performance and data drift by analyzing log telemetry with Azure Monitor.
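Data drift can be quantified by comparing the distribution of live features against the training data. As a minimal sketch of the idea, here is a Population Stability Index (PSI) check implemented with plain numpy; this is an illustration of the concept, not the monitoring logic used in this solution:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Measure how far a live sample's distribution has shifted from the
    reference (training) sample, using quantile bins of the reference."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip live values into the reference range so every row lands in a bin.
    actual = np.clip(actual, edges[0], edges[-1])
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a_pct = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 5_000)   # training-time feature values
no_drift = rng.normal(0, 1, 5_000)    # live data from the same distribution
drifted = rng.normal(0.5, 1, 5_000)   # live data with a shifted mean

print(population_stability_index(reference, no_drift))  # small, near zero
print(population_stability_index(reference, drifted))   # noticeably larger
```

A common rule of thumb treats PSI above roughly 0.1 as a signal worth investigating; in practice the same idea would be computed over log telemetry rather than in-memory arrays.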
Keep in mind this high-level diagram does not depict the security features large organizations would require when adopting cloud services (e.g. firewall, virtual networks, etc.). Moreover, ML Ops is an organizational shift that requires changes in people, processes, and technology. This may influence which services, features, or workflows your organization adopts, considerations that are beyond the scope of this design. The Machine Learning DevOps guide from Microsoft provides guidance on best practices to consider.
Build
Next, we will walk through an end-to-end proof of concept illustrating how an MLflow model can be trained on Databricks, packaged as a web service, deployed to Kubernetes via CI/CD, and monitored within Microsoft Azure.
Detailed step-by-step instructions describing how to implement the solution can be found in the Implementation Guide of the GitHub repository. This article will focus on what actions are being performed and why.
A high-level workflow of this proof-of-concept is shown below:
Follow the Implementation Guide to implement the proof-of-concept in your Azure subscription.
Infrastructure Setup
The services required to implement this proof-of-concept include:
Azure Databricks workspace to build machine learning models, track experiments, and manage machine learning models.
Azure Kubernetes Service (AKS) to deploy containers exposing a web service to end-users (one cluster each for the staging and production environments).
GitHub to store code for the project and enable automation by building and deploying artifacts.
This proof-of-concept deploys all resources into a single resource group. For production scenarios, however, multiple resource groups across multiple subscriptions would be preferred for security and governance purposes (see Azure Enterprise-Scale Landing Zones), with services deployed using infrastructure as code (IaC).
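To make the IaC idea concrete, a hypothetical Terraform fragment for one of the AKS clusters might look like the following (resource names, node sizes, and region are placeholders, and attribute details vary across azurerm provider versions):

```hcl
resource "azurerm_resource_group" "mlops" {
  name     = "rg-mlops-poc"
  location = "westeurope"
}

resource "azurerm_kubernetes_cluster" "staging" {
  name                = "aks-mlops-staging"
  location            = azurerm_resource_group.mlops.location
  resource_group_name = azurerm_resource_group.mlops.name
  dns_prefix          = "mlops-staging"

  default_node_pool {
    name       = "default"
    node_count = 2
    vm_size    = "Standard_DS2_v2"
  }

  identity {
    type = "SystemAssigned"
  }
}
```

Capturing the environment this way makes the staging and production clusters reproducible and reviewable through pull requests.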
Some services have been further configured as part of this proof-of-concept:
Azure Kubernetes Service: container insights has been enabled to collect metrics and logs from containers running on AKS. This will be used to monitor API performance and analyze logs.
Azure Databricks: the Files in Repos feature has been enabled (it was not enabled by default at the time this proof-of-concept was developed) and a cluster has been created for Data Scientists, Machine Learning Engineers, and Data Analysts to use when developing models.
GitHub: two GitHub Environments have been created for Staging and Production environments along with GitHub Secrets to be used during the CI/CD pipeline.
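How a CI/CD pipeline consumes these Environments and Secrets can be sketched with a hypothetical GitHub Actions fragment (workflow, job, and secret names below are illustrative, not taken from the repository):

```yaml
name: build-deploy-model-api

on:
  push:
    branches: [main]

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: staging            # GitHub Environment created earlier
    steps:
      - uses: actions/checkout@v3
      - name: Log in to Azure
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}   # stored as a GitHub Secret
      # build the container image and deploy it to the staging AKS cluster

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production         # approval rules can gate this Environment
    steps:
      - uses: actions/checkout@v3
      # promote the same image to the production AKS cluster
```

Scoping secrets to each Environment keeps staging credentials out of production jobs, and required reviewers on the production Environment provide a manual promotion gate.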
In practice, a Cloud Administrator will provision and configure this infrastructure; the Data Scientists and Machine Learning Engineers who build, deploy, and monitor machine learning models will not be responsible for these activities.
Model Development
Once the infrastructure is provisioned and data is sourced, a Data Scientist can begin developing machine learning models. Using Databricks Repos, the Data Scientist can add a Git repository to the Databricks workspace for each project they (or the team) are working on.
For this proof-of-concept, the model development process has been encapsulated in a single notebook called train_register_model.
This notebook will train and register the following MLflow models:
a model used to make predictions from inference data.
a model used to detect outliers for monitoring purposes.
a model used to detect data drift for monitoring purposes.
Training notebook in Azure Databricks
After the notebook is executed, the machine learning models are registered in the MLflow Model Registry and the training metrics are captured in the Experiments tracker.
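Of the three models, the outlier detector is conceptually the simplest: flag rows whose features sit far from what was seen during training. A pure-numpy sketch of that idea (an illustration only, not the notebook's actual implementation):

```python
import numpy as np

class MedianOutlierDetector:
    """Flags rows whose features sit far from the training medians,
    measured in units of the median absolute deviation (MAD)."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.median_ = np.median(X, axis=0)
        mad = np.median(np.abs(X - self.median_), axis=0)
        # 1.4826 * MAD approximates the standard deviation for normal data.
        self.scale_ = np.where(mad > 0, mad, 1.0) * 1.4826
        return self

    def predict(self, X, threshold=3.5):
        z = np.abs((np.asarray(X, dtype=float) - self.median_) / self.scale_)
        return (z.max(axis=1) > threshold).astype(int)  # 1 = outlier

rng = np.random.default_rng(1)
X_train = rng.normal(0, 1, size=(1_000, 3))
detector = MedianOutlierDetector().fit(X_train)

print(detector.predict(np.array([[0.1, -0.2, 0.3]])))  # typical row
print(detector.predict(np.array([[0.1, 12.0, 0.3]])))  # extreme value in one feature
```

A model like this can be logged and versioned alongside the prediction model so the API can score every incoming request for both a prediction and an outlier flag.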
In practice, the model development process requires more effort than illustrated in this notebook and will often span multiple notebooks. Important aspects of well-developed ML Ops processes, such as explainability, performance profiling, and pipelines, have been omitted from this proof-of-concept, but foundational components such as experiment tracking, model registration, and versioning have been included.
Experiment metrics for hyperparameter tuning in Azure Databricks
Registered models in Azure Databricks
This notebook has been adapted from a Databricks tutorial available here; the dataset is available from the UCI Machine Learning Repository here.
A JSON configuration file is used to define which version of each model from the MLflow Model Registry should be deployed as part of the API. All three models need to be referenced since they perform different functions (predictions, outlier detection, and drift detection).
Data Scientists can edit this file once models are ready to be deployed and commit it to the Git repository. The configuration file service/configuration.json is structured as follows:
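The exact contents depend on the repository, but the shape of such a file can be illustrated with a hypothetical example (field names, model names, and versions below are placeholders, not the repository's actual schema):

```json
{
  "model": { "name": "wine-quality", "version": 2 },
  "outlier_model": { "name": "wine-quality-outlier", "version": 1 },
  "drift_model": { "name": "wine-quality-drift", "version": 1 }
}
```

Pinning explicit versions here makes a deployment reproducible: the CI/CD pipeline resolves each name/version pair against the Model Registry rather than silently picking up the latest model.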