Monday, November 8, 2021

Azure Synapse Analytics - Operationalize your Spark ML model in a Data Explorer pool for scoring

With the integration of Azure Data Explorer pools in Azure Synapse Analytics, you get a simplified user experience for scenarios that integrate with Spark and SQL.

 

In this blog we will focus on the Spark integration.

 

Two use cases (among many others) stand out where Spark is a good choice:

  1. Batch training of machine learning models
  2. Data migration to Data Explorer with complex, long-running ETL pipelines

In the following example we will focus on the first use case (based on a previous blog post from @adieldar), demonstrating the tight integration of the Azure Synapse Analytics runtimes. We will train a model in Spark, deploy the model to a Data Explorer pool using the Open Neural Network Exchange (ONNX) format, and finally do the model scoring in Data Explorer.

 

Prerequisites:

  1. You need a deployed Azure Synapse workspace, see here.
  2. In the workspace create a Spark pool.
  3. Create a Data Explorer Pool with the Python plugin enabled.
  4. Create a database in the Data Explorer Pool.
  5. Add a linked service for the Data Explorer pool, pointing to the database you created in the previous step.

There is no need to deploy any additional libraries.

 

We build a logistic regression model to predict room occupancy based on Occupancy Detection data, a public dataset from the UCI Repository. This model is a binary classifier that predicts occupied or empty rooms based on temperature, humidity, light, and CO2 sensor measurements.

The example contains code snippets from a Synapse Spark notebook and a KQL script showing the full process: retrieving the data from Data Explorer, building the model, converting it to ONNX, and pushing it to ADX. Finally, a Synapse Analytics KQL-script scoring query is run on your Synapse Data Explorer pool.

 


 

All Python code is available here. See other examples for the Data Explorer Spark connector here.

 

The solution is built from these steps:

  1. Ingest the dataset from the Data Explorer sample database into our own database
  2. Fetch the training data from Data Explorer into Spark using the integrated connector
  3. Train an ML model in Spark
  4. Convert the model to ONNX
  5. Serialize and export the model to Data Explorer using the same Spark connector
  6. Score in the Data Explorer pool using the inline python() plugin, which contains the onnxruntime

 

1. Ingest the sample dataset into our Data Explorer database

We ingest the Occupancy Detection dataset into the database we created in our Data Explorer pool by running a cross-cluster copy KQL script:

 

KQL script: table creation
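A minimal sketch of such a cross-cluster copy, assuming the OccupancyDetection table in the Samples database on the public help cluster and a target table of the same name:

```kusto
// Copy the public sample data from the help cluster into our database.
// .set-or-append creates the target table if it does not exist yet.
.set-or-append OccupancyDetection <|
    cluster('help.kusto.windows.net').database('Samples').OccupancyDetection
```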

 

2. Construct a Spark DataFrame from the Data Explorer table

In the Spark notebook we read the Data Explorer table with the built-in Spark connector:

Constructing a DataFrame from the Data Explorer table
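A sketch of the read, assuming a linked service named AzureDataExplorer1 and a database named occupancydb (both hypothetical names for this example):

```python
# Read the OccupancyDetection table through the Synapse-integrated
# Kusto connector; linked service and database names are placeholders.
df = spark.read \
    .format("com.microsoft.kusto.spark.synapse.datasource") \
    .option("spark.synapse.linkedService", "AzureDataExplorer1") \
    .option("kustoDatabase", "occupancydb") \
    .option("kustoQuery", "OccupancyDetection") \
    .load()
```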

 

3. Train the machine learning model

Training the ML model
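A minimal training sketch using scikit-learn's logistic regression, assuming the OccupancyDetection schema (a boolean Test column separates training rows from test rows):

```python
from sklearn.linear_model import LogisticRegression

# Collect the (small) dataset to the driver and split on the Test flag.
pdf = df.toPandas()
features = ['Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio']
train = pdf[pdf['Test'] == False]

# Binary classifier: occupied vs. empty room.
clf = LogisticRegression(solver='liblinear')
clf.fit(train[features], train['Occupancy'])
```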

 

4. Convert the model to ONNX

The model must be converted to the ONNX format:

Converting the model to ONNX
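A sketch of the conversion with skl2onnx, which ships with the Synapse Spark runtime (hence no additional libraries to deploy):

```python
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Declare a single float tensor input with one column per feature.
initial_types = [('float_input', FloatTensorType([None, len(features)]))]
onnx_model = convert_sklearn(clf, initial_types=initial_types)
```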

and then serialized in the next step:

Serializing the model
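A sketch of the serialization: the ONNX bytes are hex-encoded into a string so the model can be stored in a Data Explorer string column, matching the decoding predict_onnx_fl() performs on the scoring side:

```python
import binascii

# Hex-encode the ONNX protobuf so it fits in a Kusto string column.
bmodel = onnx_model.SerializeToString()
smodel = binascii.hexlify(bmodel).decode('ascii')
```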

 

5. Export the model to Data Explorer

Finally, we export the model to the Data Explorer table models_tbl:

Exporting the model to a Data Explorer table
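A sketch of the export, wrapping the serialized model in a one-row DataFrame and appending it through the same connector (linked service and database names as above; the model name ONNX-Occupancy is a hypothetical label):

```python
from datetime import datetime
from pyspark.sql import Row

# One row per model version: name, timestamp, and the hex-encoded model.
models_df = spark.createDataFrame(
    [Row(name='ONNX-Occupancy', timestamp=datetime.utcnow(), model=smodel)])

models_df.write \
    .format("com.microsoft.kusto.spark.synapse.datasource") \
    .option("spark.synapse.linkedService", "AzureDataExplorer1") \
    .option("kustoDatabase", "occupancydb") \
    .option("kustoTable", "models_tbl") \
    .option("tableCreateOptions", "CreateIfNotExist") \
    .mode("Append") \
    .save()
```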

 

6. Score the model in Data Explorer

This is done by calling predict_onnx_fl(). You can either install this function in your database or define it ad hoc in a KQL script on your Data Explorer pool.

 

KQL script for scoring
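A sketch of the scoring query, assuming predict_onnx_fl() is already installed in the database and the model was stored in models_tbl under the hypothetical name ONNX-Occupancy:

```kusto
// Score the held-out rows (Test == true) with the stored ONNX model;
// the python() plugin runs onnxruntime under the hood.
OccupancyDetection
| where Test == true
| extend pred_Occupancy = bool(0)
| invoke predict_onnx_fl(models_tbl, 'ONNX-Occupancy',
    pack_array('Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio'),
    'pred_Occupancy')
```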

 

Conclusion

In this blog we showed how to train an ML model in Spark and use it for scoring in Data Explorer. This is done by converting the model trained in Spark to ONNX, a common ML model exchange format, so it can be consumed for scoring by the Data Explorer python() plugin.

This workflow is common for Azure Synapse Analytics customers who build machine learning models by batch training in Spark on big data stored in a data lake.

With the new Data Explorer integration in Azure Synapse Analytics, everything can be done in a unified environment.

 

 

Posted at https://sl.advdat.com/3CWvXRc