Overview
The Azure Machine Learning team is excited to announce the public preview refresh of the Azure Machine Learning (AML) CLI v2. This refresh builds on the CLI public preview announced at Build and brings many additions to the CLI v2.
Azure Machine Learning currently exposes most of its functionality through the Python SDK. The previous versions of the AML command-line interface (CLI) and REST APIs were limited in functionality. The machine learning lifecycle involves handoffs between data scientists and ML engineers (such as deployment specialists and data engineers). Data scientists create models and are usually experts in Python. ML engineers do not create models but provide data, deploy workloads to production, and so on; they are not necessarily experts in Python. A Python-heavy Azure ML without a good CLI or REST API made adoption harder for data engineers involved in the ML lifecycle and for data scientists who do not favor Python.
To address this issue, the revised CLI v2 and REST API use YAML to describe all assets and resources. Actions, including management of these assets and resources, are possible through simple commands (CLI v2) or the REST API. Users can use the CLI or the REST API to:
- Manage AML resources: workspace, compute, datastores
- Manage AML assets: datasets, environments, models
- Run standalone jobs locally to develop/test and then move them to the cloud
- Run a series of jobs in a pipeline (New)
- Infer on trained models with Managed Online inferencing or Batch Inferencing
- Create and use reusable components in pipelines (New)
To improve usability further, the VS Code Azure ML extension has increased support for the CLI v2. The consistent YAML representation of all assets and resources enables GitOps as well as sharing scenarios. The REST APIs can also be used via ARM templates. Together, these features simplify the experience for all team members through an easy transition from local work to cloud work, one-click deployment of samples, tooling support, and more, all without any dependency on a specific programming language such as Python.
How it works
Some definitions
To start with, let us define a few terms:
- Resources are platform-level capabilities needed to run machine learning on Azure ML. These include the AML workspace, which holds everything else; the compute on which tasks run; and datastores, which are pointers to where data is stored.
- Assets are artifacts consumed or produced by jobs. These include datasets, environments, and models.
- A job is a task that runs on a chosen compute. It defines what to run, how to run it, and where to run it, as well as which inputs are consumed and which outputs are produced.
- A pipeline is a collection of jobs that run in an order determined by the dependencies (connections) between them. Each job in a pipeline can run on a different compute.
We will look at a scenario where we start with a job, run it on the local machine, move it to the cloud, and then stitch together a series of jobs into a pipeline.
Run a Job locally
To start with, let us run a job on the local machine since we want to develop and do some basic testing before using up cloud resources. A job is defined using a YAML file. Let us examine the YAML:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: python train.py --data ${{inputs.the_data}} --save_model ${{outputs.the_model}}
environment:
  image: pytorch/pytorch
compute: local
code:
  local_path: ./src
inputs:
  the_data:
    dataset:
      local_path: ./data
outputs:
  the_model:
    mode: upload
The YAML specifies that the command python train.py runs on the local machine using a pytorch image. The job uses inputs from the local folder called data. This job can be run using the following CLI command:
az ml job create -f local-job.yml
With a few lines in a YAML and one command line, we were able to run a job locally.
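The placeholders in the command, ${{inputs.the_data}} and ${{outputs.the_model}}, are resolved to concrete locations before the script starts, so the script itself only sees ordinary command-line arguments. The post does not show train.py; the following is a minimal sketch of how such a script might read its arguments. The function names and the placeholder model.txt file are assumptions made for illustration, not part of the original sample.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the two arguments that the job command passes to the script."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    parser.add_argument("--data", type=Path, required=True, help="folder with training data")
    parser.add_argument("--save_model", type=Path, required=True, help="folder for the trained model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Count the input files; real training logic would go here.
    n_files = sum(1 for p in args.data.rglob("*") if p.is_file())
    # Write a placeholder artifact so the output folder is not empty (assumption).
    args.save_model.mkdir(parents=True, exist_ok=True)
    (args.save_model / "model.txt").write_text(f"trained on {n_files} files\n")
    return n_files
```

Calling main() with no arguments reads them from the command line, which matches how the job command invokes the script. Because the script depends only on its arguments, the same file can run unchanged whether the paths point to a local folder or to data provided by the cloud job.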
Run a job on the cloud
Now that our job runs as expected on the local machine, let us make this job run on the cloud. To do this let us look at the revised YAML:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: python train.py --data ${{inputs.the_data}} --save_model ${{outputs.the_model}}
environment:
  image: pytorch/pytorch
compute: azureml:cpu-cluster
code:
  local_path: ./src
inputs:
  the_data:
    dataset:
      local_path: ./data
outputs:
  the_model:
    mode: upload
The only change in the YAML is that compute now points to a resource called cpu-cluster in the AML workspace. The job uses data from the same local folder, which is uploaded to the cloud for execution. This job can be run using the following CLI command:
az ml job create -f cloud-job.yml
Using data from the cloud
In the above example, the data is uploaded from the local machine. In real-life scenarios, however, uploading local data does not scale. To use data that is already in the cloud, the YAML can be modified to point to cloud storage. Here are two examples:
the_data:
  dataset: # use a folder in the AML blob storage
    paths:
      - folder: azureml://datastores/workspaceblobstore/paths/my-data/

the_data:
  dataset: # use a folder in the cloud via HTTPS
    paths:
      - folder: https://mainstorage.blob.core.windows.net/example-data/
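The azureml:// form used in the first example names a datastore and a path inside it. As a rough illustration of that structure, the sketch below splits such a URI into its two parts. The helper name is hypothetical, and the parsing rule is inferred only from the example URI above, not from a published grammar.

```python
def parse_datastore_uri(uri):
    """Split an azureml:// datastore URI into (datastore_name, relative_path)."""
    prefix = "azureml://datastores/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a datastore URI: {uri}")
    # Everything before /paths/ is the datastore name; the rest is the path.
    datastore, sep, path = uri[len(prefix):].partition("/paths/")
    if not datastore or not sep:
        raise ValueError(f"missing datastore name or /paths/ segment: {uri}")
    return datastore, path


print(parse_datastore_uri("azureml://datastores/workspaceblobstore/paths/my-data/"))
# ('workspaceblobstore', 'my-data/')
```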
Use a curated environment from the cloud
The environment can also be picked from the curated environments available in the AML workspace. For example:
environment: azureml:AzureML-Minimal:18
With a few changes in a YAML file, we have been able to move a job from the local machine to the cloud.
Run a series of jobs in a pipeline
Now that we have a job running on the cloud, let us examine how to stitch together jobs into a pipeline. Here is the YAML that combines two jobs, prepare and train, into a pipeline. The prepare job uses data from a dataset in the workspace and writes its output to processed_data. This data is then used in the train step, which creates a model.
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
jobs:
  prepare:
    command: python prep.py --raw_data ${{inputs.raw_data}} --prep_data ${{outputs.processed_data}}
    code:
      local_path: ./src
    environment: azureml:AzureML-Minimal:18
    compute: azureml:cpu-cluster
    inputs:
      raw_data:
        dataset:
          paths:
            - folder: azureml://datastores/workspaceblobstore/paths/data/my_raw_data/
    outputs:
      processed_data:
        mode: upload
  train:
    command: python train.py --data ${{inputs.the_data}} --save_model ${{outputs.the_model}}
    code:
      local_path: ./src
    environment:
      image: docker.io/pytorch/pytorch
    compute: azureml:cpu-cluster
    inputs:
      the_data: ${{jobs.prepare.outputs.processed_data}}
    outputs:
      the_model:
        mode: upload
Since a pipeline is also a job, it can be run in the same way using the CLI.
az ml job create -f 2step-pipeline.yml
The YAML of each job inside the pipeline is very similar to the YAML of a standalone job. The only change is that the output of one job is used as the input to another. With very few changes, multiple standalone jobs can be stitched into a pipeline.
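The connection between the two jobs is expressed only through the reference ${{jobs.prepare.outputs.processed_data}}, and that reference is enough to derive the execution order. The sketch below illustrates the idea with the pipeline inputs reduced to plain dictionaries. It is not how the service is implemented; the function name and the regular expression are assumptions, and graphlib requires Python 3.9 or later.

```python
import re
from graphlib import TopologicalSorter

# Matches references such as ${{jobs.prepare.outputs.processed_data}}.
JOB_REF = re.compile(r"\$\{\{\s*jobs\.([A-Za-z0-9_]+)\.outputs\.[A-Za-z0-9_]+\s*\}\}")


def execution_order(jobs):
    """Return job names ordered so that every job runs after the jobs it reads from."""
    graph = {}
    for name, spec in jobs.items():
        upstream = set()
        for value in spec.get("inputs", {}).values():
            # Only string inputs can reference another job's output.
            if isinstance(value, str):
                upstream.update(JOB_REF.findall(value))
        graph[name] = upstream
    return list(TopologicalSorter(graph).static_order())


# The inputs of the two-step pipeline, listed out of order on purpose.
pipeline_jobs = {
    "train": {"inputs": {"the_data": "${{jobs.prepare.outputs.processed_data}}"}},
    "prepare": {
        "inputs": {
            "raw_data": {
                "dataset": {
                    "paths": [
                        {"folder": "azureml://datastores/workspaceblobstore/paths/data/my_raw_data/"}
                    ]
                }
            }
        }
    },
}

print(execution_order(pipeline_jobs))  # ['prepare', 'train']
```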
Run a pipeline with components
Let us look at how to create components that can be used in a pipeline. Consider the training job in the above example, and suppose that this job is to be used many times by many people across your team. Instead of sharing the YAML definition of the job for team members to copy and use, you can create a reusable component, register it in a workspace, and reference it easily.
A reusable component is similar to a (Python) function. It defines what it takes in and what it gives out; the consumer does not need to see the logic itself. Let us look at the YAML definition of a component.
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: my-training-component
command: python train.py --data ${{inputs.the_data}} --save_model ${{outputs.the_model}}
code:
  local_path: ./train_src
environment:
  image: docker.io/pytorch/pytorch
inputs:
  the_data:
    type: path
outputs:
  the_model:
    type: path
The YAML is similar to that of the train job, with some differences. Instead of specifying the exact inputs and outputs, it defines only their names and types. The compute is not defined within the component.
This component can be registered in the AML workspace using the CLI, as shown below. This registers the component with the name my-training-component:
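Because the command refers to inputs and outputs by name, every placeholder in the command needs a declaration with the same name. A quick way to catch a mismatch before registering a component is a check like the sketch below. The helper is hypothetical and not part of the CLI, and the second call uses a deliberately mismatched input name to show what a failure looks like.

```python
import re

# Matches ${{inputs.NAME}} and ${{outputs.NAME}} placeholders in a command string.
PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.([A-Za-z0-9_]+)\s*\}\}")


def undeclared_placeholders(command, inputs, outputs):
    """Return the placeholders used in command that have no matching declaration."""
    declared = {"inputs": set(inputs), "outputs": set(outputs)}
    return sorted(
        f"{kind}.{name}"
        for kind, name in PLACEHOLDER.findall(command)
        if name not in declared[kind]
    )


command = "python train.py --data ${{inputs.the_data}} --save_model ${{outputs.the_model}}"

print(undeclared_placeholders(command, ["the_data"], ["the_model"]))       # []
print(undeclared_placeholders(command, ["training_data"], ["the_model"]))  # ['inputs.the_data']
```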
az ml component create -f mytraincomponent.yml
Registered components can be used in a pipeline using azureml:component-name. Users can also provide a YAML definition of a component to be used in a pipeline. Now let us change our pipeline to use components instead of jobs.
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
jobs:
  prepare:
    type: component
    component: file:./prep.yml # use a YAML definition for the component
    compute: azureml:cpu-cluster
    inputs:
      raw_data:
        dataset:
          paths:
            - folder: azureml://datastores/workspaceblobstore/paths/data/my_raw_data/
    outputs:
      processed_data:
        mode: upload
  train:
    type: component
    component: azureml:my-training-component # use the registered component
    compute: azureml:cpu-cluster
    inputs:
      the_data: ${{jobs.prepare.outputs.processed_data}}
    outputs:
      the_model:
        mode: upload
This pipeline now uses two components: one defined in a YAML file and one registered in the workspace. This enables reuse and keeps the definition of individual steps separate from their orchestration.
How to get started?
To get started, first install the CLI v2, and follow along with our docs and samples:
- Tutorials:
- Train models with CLI v2
- Create pipelines with components
- Samples:
- AML CLI Samples
- Running single step jobs
- Running pipelines with jobs
- Running pipelines with components
- Reference documentation:
- YAML reference for CLI v2
- REST API documentation
Coming soon
The new CLI v2 and REST API have some features that are still a work in progress. The following will be released in the coming months as a follow-up release:
- Support for automated ML (AutoML)
- Ability to run a sweep job within a pipeline
- Support for Parallel Jobs (aka Parallel Run Step)
- Support for Tabular Datasets
- Ability to schedule pipelines
Why should I use the new CLI?
The new CLI v2 and REST API provide some key benefits:
Easily move from local to cloud
Users can easily move from local workloads to cloud (remote) workloads. They start with their workload in a container for local execution. Data can either be uploaded, or cloud data can be connected to the job. The execution itself can then be moved to cloud compute (initially AML Compute and Compute Instance, later also AMLK8S). The nature of the workload does not matter here. The user controls what goes into the container image for these workloads by using curated environments, by building their own container with AzureML Environments (including a Docker context), or by some other means.
Directly move from training to orchestration
Once the workload has been containerized and moved to the cloud, it can be orchestrated in a pipeline without adaptation. To provide scalability (for larger workflows) and reusability (when sharing parts of a pipeline), users can define shareable components that capture one or more steps of a pipeline. Components can be shared as source code (for example, via a Git repo) or as Python packages.
Directly move from training to deployment
Once a model has been trained (be it locally or in the cloud) and saved in MLflow format, the user can take the model to deployment without having to write additional code. Thanks to managed inferencing, the user does not have to manage a cluster, but can go straight to deploying their model – via the AzureML control plane operations. The same model, without modification, can also be scored in batch.
In addition, the user can bring their own container to be deployed, allowing them to entirely control the serving technology, enabling much simpler integration of other languages (like R) and other serving stacks (like Triton).
Standards-based data plane operations
Logging of metrics, parameters, artifacts, and models is supported in a standards-based way via MLflow. This means that users can run their workloads locally and log metrics and other values, which they can then visualize in the local MLflow UI. When the same workload runs in the cloud, the logged artifacts and metrics land in the AzureML Run History. For the model format, AML supports MLflow, which lets users save a model together with its scoring code and environment and test the model locally before deploying it.
Integrate with AzureML
Users have a much larger set of options when integrating with AzureML. The simplest option will often be the CLI (for instance, when starting a job or pipeline from a GitHub workflow or an ADF pipeline), since the CLI now supports all operations offered by the platform and CLI authentication is well supported across the board. For deeper integrations, users can choose ARM templates or REST. Another improvement is that all resources and assets in AzureML now have a well-defined serialization to JSON/YAML. This allows for new sharing and migration scenarios, as well as GitOps-style operations in which the state of the system is managed in Git and deployed to AzureML at certain sync points.
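As one illustration of the CLI-based integration, an automation script could shell out to the same az ml job create command used earlier. The sketch below assumes that the az CLI with the ml extension is installed on the machine and already authenticated. The function names are hypothetical, and the --resource-group and --workspace-name options are standard az ml options that are not shown elsewhere in this post.

```python
import subprocess


def job_create_command(yaml_file, resource_group=None, workspace=None):
    """Build the argument list for az ml job create."""
    cmd = ["az", "ml", "job", "create", "-f", yaml_file]
    # Target an explicit workspace instead of relying on configured defaults.
    if resource_group:
        cmd += ["--resource-group", resource_group]
    if workspace:
        cmd += ["--workspace-name", workspace]
    return cmd


def submit(yaml_file, **kwargs):
    """Run the command and raise if the CLI exits with a non-zero status."""
    return subprocess.run(job_create_command(yaml_file, **kwargs), check=True)


print(job_create_command("cloud-job.yml"))
# ['az', 'ml', 'job', 'create', '-f', 'cloud-job.yml']
```

Keeping the command construction separate from the call to subprocess.run makes the script easy to check without contacting Azure.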