Wednesday, July 28, 2021

Simply, Just Python in the Cloud

 

simply_python_banner.png

 

tl;dr

This is a "lean" tutorial on the basics of running your code in Azure. With your data residing in storage alongside a VM in the cloud, without exploring the labyrinthine complexity of Azure, and using the newly-released VS-Code "Azure Machine Learning - Remote" extension, programming on the VM is as simple as developing code on your local machine, but with the scaling benefits of the Cloud. It's the first step to exploiting what cloud computing with Azure Machine Learning (AML) has to offer.

 

Cloud services offer many options, Azure more than others; but with such variety it's hard to know where to start. Let's assume you've already subscribed to Azure, and have started to learn your way around Azure's web portal, `portal.azure.com`, its universal graphical interface for setting up and managing web resources. The general management term for Cloud components, services, and products is "resources." If this is not familiar to you, search for an online course in Azure Fundamentals that covers things like subscriptions, resource groups, regions, etc. (Harvard's DS intro, for example), so that you feel comfortable using the Azure web portal. Keep in mind most on-line training is targeted at enterprise users, with more detail than you'll need to know about to start. Presumably you're interested in fast results at reasonable cost; the enterprise user's overriding concerns, which also include availability, reliability and other "ilities," are not your immediate concern. You'll get there eventually. In this tutorial you will create a few Azure ML resources interactively in the Azure Portal, then do development with a combination of VS-Code and command line tools. I assume you are facile with programming and working at the command line. I'll target specifically our task of moving your Python application to run in the Cloud, infused with my strong personal biases, based on my experience, to help you avoid common pitfalls.

 

Why would you want to move your code to a machine in the Cloud? For scale. Azure has some huge machines, vast network bandwidth, and almost infinite memory. Surprisingly, applications with large computational loads may be faster on a single machine than on a cluster, and easier to set up. Of course, if your aim is to process petabytes of data or train billion-node neural networks, there are suitable VM-cluster solutions. Whether via cluster or large VM, it's a great boon for any large scale problem: not only conventional machine learning applications, but also large scale MCMC tasks or training on GPUs.

 

Cloud resources

Cloud services consist essentially of three kinds of resources: compute, storage, and networking. The collection of your application resources is designated a "resource group," a logical construct for creating, tracking and tearing down a cloud application. Your resource group needs to include a Machine Learning Workspace, a convenient collection of resources that itself is a kind of resource. Find it here in Azure's graphical interface, portal.azure.com:

portal_machine_learning.png

Go ahead and create one. The Portal prompts are kind enough to offer to create a new resource group and new component resources for its Storage account and others. Take the defaults and go ahead and create new component resources. Just remember the Workspace name, and use the same resource group name for additional resources you create in your application. If you make a mistake it's easy to delete the whole resource group and start again.

 

Compute

In simple terms, compute consists of virtual machines (VMs), from tiny VM images to VMs with enormous RAM and GPUs, bigger than anything you would ever have at your desk. The Workspace doesn't come with any default compute resources, since it gives you a choice of creating individual VMs, VM compute clusters, or using your existing compute resources. For this tutorial you need to create a "Compute instance" (a single VM) by choosing a size that meets your need. These Ubuntu VMs created in Azure Machine Learning Studio (AML) are managed VMs, with almost all the software you need already installed and with updates managed for you. This takes the work of setting up and configuring the VM off your shoulders. Alternatively, Azure still offers legacy "Data Science VMs" that come with extensive pre-installed ML tools, but except for some special purposes (such as running Windows OS), AML Compute Instances can take their place.

 

Storage

Storage also is available at scale. But unlike with your desktop, primary storage is not a file system running on a disk local to the VM, but a separate component with an access API. One reason is that storage is permanent, but VMs come and go; start them when you need work done, then tear them down to save money. Storage consists of "blobs" with strong guarantees of persistence and redundancy. A key point in this tutorial is demonstrating how to integrate existing Azure Storage, that might come already loaded with your data.


Networking

Networking services connect compute and storage. Moreover networking in the Cloud consists of software-defined virtual networks that connect to your local "on-premise" equipment and your cloud resources. With cloud resources, Networking glues components together with combinations of local subnets, security services (firewalls), and public-facing IP addresses. For the simple case of a VM with connected storage, the associated network looks like your home network, with a public IP address, and NAT-ed addresses on a subnet. You won't need to know much about this, since it is set up for you when you create a VM. We'll get to that; but the place to start is storage, and getting your data onto it.

 

Setting up your development environment.

The tools you'll need to run your code in the cloud are:

  • VS-Code - Microsoft's open source IDE
  • git: Open source version control

And two Azure resources:

  • Azure Storage
  • An AML Workspace with a Compute Instance.

 

The title illustration shows how these are cobbled together. An AML Workspace brings together an integrated collection of resources. You'll need just two: the managed VM, and attached storage where it keeps the file system with your code. Additionally you'll set up an Azure Storage account, or use an existing one, for your data. This isn't strictly necessary, but learning to manage storage is good experience. Resource setup and management can be done entirely in the Portal. As for code, you move that to AML and version it with git. Then you run your edit-run-debug cycle with two copies of VS-Code running locally. The magic is that one copy is attached to the AML compute instance, making it appear as if it's just another local instance.

My focus is on using VS-Code for code development, along with two of its Extensions, the Azure ML Extension and the Azure ML - Remote Extension. Install Extensions from the VS-Code "Activity Bar", found in a column on the far left, whose icons look like this:
extensions_icons.png
The Azure icon will be used to launch remote sessions.

As an open-source project, the momentum to improve and expand VS-Code is worth noting, as told by its dedicated blog. Hard-core Python developers may have a preference for other dedicated Python tools such as PyCharm or Spyder, however in this tutorial I'm using Microsoft's VS-Code  for its elegant project management, symbolic debugging, and Extensions, which integrate nicely with Azure. For these reasons I prefer it to Microsoft's legacy Visual Studio IDE Tools for Python.

 

So install VS-Code if you don't have it already.

 

Besides your Azure subscription, git, and VS-Code, you'll need the az command line package, known as the Azure command-line interface (Azure CLI), and its extension for Azure Machine Learning, which work equivalently on either Windows or Unix shells. There are installers for it on the webpage.

The az package manages every Azure resource, and its az reference page is a complete catalog of Azure. It's implemented as a wrapper for Azure's RESTful management API. You rarely need to refer to it directly, since VS-Code Extensions will call into az for you, and offer to install it if it's missing. And the Portal exposes the same functionality. But each of these is syntactic sugar over command line tools, and you should be aware of both.

In the long run, the other reason to know about az is for writing scripts to automate management steps. Every time you create a resource in the Portal, it gives you the option to download a "template". A template is just a description in JSON of the resource you've created. Using az you can invoke ARM (the Azure Resource Manager) to automate creating it. ARM is a declarative language for resource management that reduces resource creation to a single command. Look into the az deployment group create... command for details.
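As a rough illustration of that kind of automation, here is a minimal Python sketch that assembles and runs an `az deployment group create` call. The helper names, the resource group name and the template file name are hypothetical placeholders, and the sketch assumes az is installed and you have already authenticated.

```python
import subprocess


def build_deploy_command(resource_group, template_file):
    """Return the az CLI argument list that deploys an ARM template."""
    return [
        "az", "deployment", "group", "create",
        "--resource-group", resource_group,
        "--template-file", template_file,
    ]


def deploy(resource_group, template_file):
    """Run the deployment. Assumes az is on the PATH and you are logged in.

    On Windows the executable may be az.cmd, so this call might need adjusting.
    """
    subprocess.run(build_deploy_command(resource_group, template_file), check=True)


if __name__ == "__main__":
    # Hypothetical names: substitute your own resource group and template file.
    print(" ".join(build_deploy_command("my-resource-group", "template.json")))
```

Keeping the command construction separate from the call that runs it makes it easy to check what will be sent to az before anything is created.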

Commands specific to Azure Machine Learning (AML) in the az extension need to be installed separately. az will prompt you to do this the first time you use az ml.

Don't confuse the az CLI with the Azure python SDK that has overlapping functionality. Below is an example using the SDK to read from Storage programmatically. Most things you can do with the az CLI are also exposed as an SDK in the python azureml package.

Finally, to check that az is working, try to authenticate your local machine to your Cloud subscription. Follow the directions when you run

$ az login

You do this once at the start. Subsequent az commands will not need authentication.

 

The Application

Ok, you've got your code and data, and installed tools. Let's begin! Assuming you've gone ahead and created a new AML Workspace, the steps are

  1.  Set up Storage,
  2.  Prepare your code,
  3.  Create a compute instance, and
  4.  Launch a remote VS-Code session.

One option when you are done with your code is to deploy it by wrapping it in a "RESTful" API call as an Azure Function Application. This exposes your application as an "endpoint", callable from the Web. Building an Azure Function could be the subject of another tutorial.

 

First thing - Prepare your data. It pays to keep your data in the Cloud.

Storage, such as Azure blob storage, is cheap, fast and seemingly infinite in capacity; it costs about two cents a gigabyte per month, with minor charges for moving data around. For both cost and network bandwidth reasons you are better off moving your data one time to the cloud, rather than running your code in the Cloud against data stored locally. In short, you'll create a storage account, then create a container (the equivalent of a hierarchical file system) of "Azure Data Lake Generation 2" (ADL2) blob storage. There are numerous alternatives to blob storage, and even more premium options to run pretty much any database software on Azure you can imagine, but for the batch computations you intend to run, the place to start is with ADL2 blob storage.

Blob storage appears as files organized in a hierarchical directory structure. As the name implies, it treats data as just a chunk of bytes, oblivious to file format. Use csv files if you want, but I suggest you use a binary format like parquet for its compression, speed and preservation of defined data types.

Perhaps your data is already in the Cloud. If not, the easiest way to create storage is using the Portal: "create a resource", then follow the prompts. It doesn't need to be in the same resource group as your Workspace; the choice is yours. The one critical choice you need to make is to select the Data Lake Storage Gen2 option: enable_dls_gen2.PNG

Essentially this is blob storage that supersedes the original blob storage that was built on a flat file system. Note that "Azure Data Lake Gen 1" storage is not blob storage for reasons not worth going into and I will ignore it.

You can upload data interactively using this Portal page. Here's an example for a container named "data-od." It takes a few clicks from the top level storage account page to navigate to the storage container and path to get here. It contains file and directory manipulation commands much like the desktop "File Explorer."

portal_storage_panel.png

Later I will show how to programmatically read and write data to storage.

 

In the lean style of this tutorial, I've included only the absolute minimum of the Byzantine collection of features that make up Azure ML. Azure ML has its own data storage, or more accurately, thinly wraps existing Azure Storage components as "DataSets," adding metadata to your tables, and authenticating to Storage for you, which is nice. However, the underlying storage resource is accessible if you bother to look. This tutorial, an end run around these advanced features, is good practice even if only done once, to understand how Azure ML storage is constituted.

 

Prepare your code

Going against the common wisdom, I recommend you convert your notebook code to Python modules once you get past initial exploratory analysis. If you are wed to using notebooks, VMs work just fine: Both Azure ML workspaces and Spark clusters let you run notebooks, but you will be limited going forward in managing, debugging, testing, and deploying your code. Notebooks are great for exploring your data, but when developing code for subsequent use one gets tired of their limitations:

  • They don't play well with `git`. If there are merge conflicts between versions, it's a real headache to diff them and edit the conflicts.
  • They get huge once filled with data and graphics. Either strip them of output before you commit them, to keep them manageable, or convert them to html as a sharable, self-documenting document.
  • As your code grows, all the goodness of modular, debuggable object-oriented code is missing. Honestly the symbolic debugger in VS-Code is worth the effort of conversion, not to mention the manageability of keeping code organized by function in separate modules.
  • Python modules encourage you to build tests, a necessary practice for reliable extension, reuse, sharing, and productionizing code.

 

So relegate notebooks to run-once experiments, and promote blocks of code to separate files that can be tested and debugged easily.
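For notebooks you do keep under version control, stripping output before a commit can be scripted. The following is a minimal sketch using only the standard library; it assumes the usual version 4 notebook layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count"), and the function names are my own.

```python
import json


def strip_outputs(notebook):
    """Clear outputs and execution counts from a notebook dict, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def strip_file(path):
    """Rewrite the .ipynb file at `path` without its outputs."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(strip_outputs(notebook), f, indent=1)
        f.write("\n")
```

Markdown cells and the source of code cells are left untouched, so the stripped notebook still runs and diffs cleanly in git.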

Jupyter generates a .py file from a notebook for you at the command line with

$ jupyter nbconvert --to python <your notebook>.ipynb

Rely on git

git deserves true homage. As your projects scale and you collaborate with others (the things that make you valuable to your organization), git is the comfortable home for your work. As any SW Engineer knows, when you've "broken the build" there's no greater solace and respite from the wrath of your colleagues than being able to revert to a working state with just a few commands. Having a documented history is also a gift you give to posterity. Arguably the complexity of git dwarfs the complexity of a hierarchical file system, but there's no shame in relying on Stack Overflow each time you step out of your git comfort zone. And believe me, there is a git command for any conceivable task.

Assume your code is in a git repository of your choice. git will be your conveyance to move code back and forth from cloud to local filesystems.

 

Setting up cloud resources - an Azure Machine Learning (AML) Workspace

I assume you've created an AML Workspace. To launch the web version, navigate to the Workspace and click the launch_studio.PNG button.

 

A tour of the Workspace

The Workspace collects a universe of machine learning-related functions in one place. I'll just touch on a few that this tutorial uses. However the beauty of using VS-Code is that you don't need to go here: Everything you need to do in the Studio can be done from the Azure Remote Extension.

  • Compute instances: Once inside the Workspace pane, on its left-side menu, there's a "compute" menu item, which brings you to a panel to create a new compute instance. Create one. You have a wide choice of sizes. Can you say 64 cores w/128G RAM? Or with 4 NVIDIA Teslas? Otherwise it's fine to just accept the defaults. You can start and stop instances to save money when they are not being used.
  • Code editing: Under "notebooks" (a misnomer) there's a view of the file system that's attached to the Workspace. It is actually a web-based IDE for editing and running code in a pinch. It can include notebooks, plain python files, local git repositories and so on. This file system is hidden in the workspace's associated AML storage, so it persists even when compute instances are deleted. Any new compute instance you create will share this file system.
  • AML storage: The Studio does not reveal its connected AML Storage Account, so consider it a black box where the remote file system and other artifacts are stored. For now, work with external Azure storage. Why? So you see how to integrate with other Azure services via storage.
    The brave among you can use the Portal to browse the default storage that comes with your workspace. The storage account can be found by its name, which starts with a munged version of the workspace name followed by a random string.

Remember the nice thing is that all this Workspace functionality you need is exposed at your desktop by VS-Code Extensions.

 

The development cycle

Let's get to the fun part---how these components work together and give a better experience. Notably the local edit-run-debug cycle for VS-Code is almost indistinguishable when using cloud resources remotely. Additionally both local and remote VS-Code instances are running simultaneously on your desktop, a much smoother experience than running a remote desktop server ("rdp"), or trying to work remotely entirely in the browser. How cool is that!

 

Let's go through the steps.

 

Running locally

Authentication to the cloud is managed by the Azure ML extension to VS-Code; once you are up and running and connected to your Workspace you will not need to authenticate again.

  • Start VS-Code in the local directory in your `git` repository where VS-Code's `.vscode` folder resides. (The launch.json file there configures run-debug sessions for each file. Learn to add configurations to it.)
  • Once you are satisfied the code runs locally, use `git` to commit and push it. If you commit notebooks, it's better to "clean" them of output first, to keep them manageably-sized.
  • Start the Azure ML compute instance from the Azure ML extension. First connect to your Workspace using the Azure Icon (see above) in the VS-Code Activity Bar. You can spawn a remote-attached instance of VS-Code from the local VS-Code instance using the well-hidden right-click menu item "Connect to Compute Instance":

connect_to_compute_instance.PNG

This will both start the VM, and spawn a second VS-Code instance on your desktop connected to your remote Workspace file system!

 

Running remotely

Note that the terminal instance that appears in the remotely-connected VS-Code terminal window is running on your VM! No need for setting up a public key for "putty" or ssh to bring up a remote shell. Similarly the file system visible in VS-Code is the Workspace file system. How cool is that!

  • First thing of course is to bring the repository in the Workspace file system current by using git to pull your recent changes. Then it's just edit-run-debug as normal, but with all compute and file access in the cloud.
  • With both VS-Code instances running on your local machine you can work back and forth between local and remote versions simultaneously, transferring code changes with git.
  • In the remote instance, the Azure ML Extension menus are disabled, naturally. It doesn't make sense to spawn a "remote" to a "remote".
  • When you are done, spin down the VM to save money. Your remote file system is preserved in Storage without you needing to do anything.

If you are sloppy about moving files with git (you forgot something, or you need to move a secrets file), note that the remote-attached instance of VS-Code can view both the cloud and local file systems, making it a "back door" to move files between them. It's advisable to rely on git to maintain consistency between code in the two file systems, so use this back door sparingly.

Alternatively, the remote session can be invoked in the Portal. Try opening a remote VS-Code session on your local machine from the AML Studio notebooks pane.
notebook_studio.png
If your compute instance is running, you can launch a local instance of VS-Code attached remotely to the Workspace's cloud file system (not just the file that's open in the Portal editor) and a shell running on the compute instance. But I think you'll find it's easier to spawn your remote session directly from your local VS-Code instance than to go to the Studio pane in the Azure ML page in the Portal to launch it.

 

Programmatic adl2 file access

The primary change you'll need to make to run your code in the cloud is to replace local file system IO calls with calls to the python SDK for Blob Storage. This works whether the code runs locally or in the VM, as you switch your application to work on data in the cloud.

There are several ways to access files from the VM. Here's one way with the Azure python SDK that works with minimal dependencies, although it is a lot of code to write. Wrap this in a module, then put your efforts into the computational task at hand. The python SDK defines numerous classes in different packages. This class, DataLakeServiceClient, is specific to adl2 blob storage. Locally you'll need to get these packages:

$ pip install azure-storage-file-datalake
$ pip install pyarrow

pyarrow is needed by pandas to read parquet files. The Apache project "arrow" is an in-memory data format that speaks many languages.

In this code snippet, the first function returns a connection to storage. The second function locates the file, retrieves file contents as a byte string, then converts that to a file object that can be read as a Pandas DataFrame. This method works the same whether you reference the default storage that comes with the Azure ML workspace you created, or some pre-existing storage in other parts of the Azure cloud.
For details see https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python

import io
import sys

import pandas as pd
from azure.storage.filedatalake import DataLakeServiceClient

# The local secrets.py file (kept out of git). Note that in this directory
# it shadows the standard library module of the same name.
import secrets


def connect_adl2():
    '''Create an ADL2 client for the blob storage container.
    '''
    service_client = DataLakeServiceClient(
        account_url=f"https://{secrets.AZURE_STORAGE_ACCOUNT}.dfs.core.windows.net",
        credential=secrets.STORAGE_ACCOUNT_KEY)
    # "file system" in this context means container.
    file_system_client = service_client.get_file_system_client(
        file_system=secrets.CONTAINER)
    return file_system_client


def download_a_file_into_df(absolute_file_path, file_system_client):
    '''Load a parquet file from storage into a df, from an absolute path in that container.
    '''
    try:
        file_client = file_system_client.get_file_client(absolute_file_path)
        # Download the whole blob as bytes, then let pandas parse it as parquet.
        blob_contents = file_client.download_file().readall()
        df = pd.read_parquet(io.BytesIO(blob_contents), engine='pyarrow')
    except Exception as e:
        print(f'ERROR: {e}\nNo blob downloaded for: '
              f'{secrets.CONTAINER}/{absolute_file_path}')
        return None
    return df


if __name__ == '__main__':
    # Usage: python <this file>.py <path to a parquet blob in the container>
    client = connect_adl2()
    df = download_a_file_into_df(sys.argv[1], client)
    if df is not None:
        print(df.describe(), '\nDone!')

You can use the __name__ == '__main__' section to embed test code. It doesn't get run when you import the file. But as you develop other scripts that use this one, you can invoke this file with the path to a known blob to test it hasn't broken.
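Writing goes the other way around. Below is a minimal sketch of an upload helper, assuming the same container client that connect_adl2() returns; the function name and the example path in the comments are hypothetical. It relies on the file client's upload_data method with overwrite=True, so an existing file at that path is replaced.

```python
def upload_bytes(file_system_client, absolute_file_path, data):
    """Write a byte string to a path in the container, replacing any existing file."""
    file_client = file_system_client.get_file_client(absolute_file_path)
    file_client.upload_data(data, overwrite=True)
    return len(data)


# Typical use with a DataFrame `df` and the client from connect_adl2():
#   buffer = io.BytesIO()
#   df.to_parquet(buffer, engine='pyarrow')
#   upload_bytes(client, 'some/dir/table.parquet', buffer.getvalue())
```

Serializing to an in-memory buffer first mirrors the download path: pandas never needs to know that the bytes end up in blob storage rather than on a local disk.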

The DataLakeServiceClient authenticates to Azure Storage using an "account key" string that you recover from the Portal, in the Storage Account "Access keys" pane. An account key is an 88 character random string that works as a "password" for your storage account. Copy it from this pane:

access_key.PNG

Since it's confidential, it doesn't get embedded in the code, but remains hidden in another file, secrets.py, in the local directory, which contains only global variable assignments, in this case AZURE_STORAGE_ACCOUNT, STORAGE_ACCOUNT_KEY, and CONTAINER.
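For illustration, such a file might look like the following. The account name and key are placeholders rather than real credentials, and the container name is the one from the Portal example earlier.

```python
# secrets.py: local settings only, never committed to git.
# All values below are placeholders; substitute your own.
AZURE_STORAGE_ACCOUNT = "mystorageaccount"
STORAGE_ACCOUNT_KEY = "<88-character key copied from the Access keys pane>"
CONTAINER = "data-od"
```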

Place the file name in your .gitignore file, since it's something you don't want to share. This is a simple way to enforce security, and there are several better, but more involved, alternatives. You can keep secrets in environment variables, or even better, use Azure "key vault" to store secrets, or just leave secrets management up to Azure ML. Granted, account keys are a simple solution if you manage them carefully when individually developing code, but you'll want to button down authentication when building code for enterprise systems.
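The environment variable alternative can be sketched in a few lines. This assumes you export variables with the same three names used above; the function name is hypothetical.

```python
import os


def load_storage_settings(environ=os.environ):
    """Read storage settings from environment variables, failing loudly if any is unset."""
    names = ("AZURE_STORAGE_ACCOUNT", "STORAGE_ACCOUNT_KEY", "CONTAINER")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Failing at startup with the names of the missing variables is friendlier than an authentication error surfacing later from deep inside the storage client.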

Similarly there are more involved, "code-lite" ways to retrieve blobs. As mentioned, Azure ML has its own thin wrapper and a default storage account created for the Workspace where it's possible to store your data. But as I've argued, you'll want to know how to connect to any Azure data, not just AML-managed data. To put it politely, AML's tendency to "simplify" fundamental cloud services for data science doesn't always make things simpler, and can hide what's under the hood, which makes for confusion when the "simple" services don't do what you want.

 

Other services, embellishments, and next steps

For computational tasks that benefit from even more processing power, the cloud offers clusters for distributed computing. But before jumping to a cluster for a distributed processing solution, e.g. dask, consider python's multiprocessing module, which offers a simple way to take advantage of multiple-core machines, and gets around python's single-threaded architecture. For an embarrassingly parallel task, use the module to run the same code on multiple datasets, so each task runs in its own process and returns a result that the parent process receives. The module import is

from multiprocessing import Process, Manager

The Process class spawns a new OS process for each instance, and the Manager class provides shared objects that collect the instances' results.
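Here is a minimal sketch of that pattern on toy data. The function names and datasets are made up for illustration, and the sketch assumes a Linux machine such as the Ubuntu compute instance, where the "fork" start method is available; the Process and Manager classes are reached through a context object rather than imported directly.

```python
import multiprocessing


def summarize(name, values, results):
    """Worker: runs in its own process and stores one result per dataset."""
    results[name] = sum(values) / len(values)


def run_parallel(datasets):
    """Run summarize() on each dataset in a separate OS process."""
    # Assumption: a Linux VM, where the "fork" start method is available.
    ctx = multiprocessing.get_context("fork")
    with ctx.Manager() as manager:
        results = manager.dict()  # shared dict that the parent can read
        workers = [
            ctx.Process(target=summarize, args=(name, values, results))
            for name, values in datasets.items()
        ]
        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()
        return dict(results)  # copy out before the manager shuts down


if __name__ == "__main__":
    toy = {"a": [1, 2, 3], "b": [10, 20, 30, 40]}
    print(run_parallel(toy))
```

One process per dataset is fine for a handful of tasks; with many more tasks than cores, a process pool that reuses workers would likely be a better fit.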

As for going beyond one machine, the sky's the limit. Apache Spark clusters seamlessly integrate coding in Python, R, SQL and Scala, to work interactively with huge datasets, with tools well-tailored for data science. Spark on Azure comes in several flavors, including third-party Databricks, and a database-centric integration of Spark and MSSQL clusters branded "Synapse."

 

More machine learning features

The point of this article has been the scaling opportunities available in Azure ML, but did I mention the ML features of Azure ML? In addition to the fun you'll have up-leveling your development to scale, you've started up the learning curve to using Azure's tools for data science. For tracking and comparing results of multiple runs, Azure ML integrates with `MLFlow`. See this article. For managing data, AML "DataSets" embellish datafiles with metadata. There is even an SDK for R similar to the python SDK. There are objects for experiments, runs, environments, and model management. "AutoML" automates supervised learning model selection and tuning. Those are just a few examples of how Azure ML repackages services to be a "one stop" shop for machine learning tasks.

 

Azure Function deployment and beyond

Azure can be used not only for running experiments at scale, but for building an application around your code. Cloud applications consist of collections of networked services, running on separate VMs. For convenience one avoids working at the "infrastructure" level and moves to "managed services" where the VMs are not exposed, just the interfaces are, as we did with managed compute instances. You can do the same thing with your code by publishing it to the cloud as a RESTful http endpoint. To your client it looks like a URL, except instead of returning a web page it returns the result of your computation. A basic way to build this is by wrapping your code in an Azure Function. An Azure Function typically runs in a few seconds, and doesn't save state from one call to the next. (Saving state is where blob storage comes in, remember?) There are several more featureful alternatives for implementing http endpoints, based on Docker containers for instance, some of them available from the Azure ML Workspace, but all work as managed services, making implementation independent of the platform "infrastructure" it's built on. How to use AFs deserves a separate tutorial, but to get you started, load the Azure Function extension into VS-Code and browse its features.
Remember when I said that local management for everything comes with az? Well, I wasn't telling you the whole truth. For AF you'll need to separately install the Azure Functions Core Tools for commands that start with func. VS-Code will prompt you for this install when you install the VS-Code Azure Functions extension.

Since you're not put off by writing code, you can proceed to integrate with the full gamut of cloud services: databases, web hosting, advanced AI, and security, much as your favorite on-line retail site is likely to have done when building an enterprise Azure application.

 

Notes

The VS-Code remote extension source

- https://github.com/microsoft/vscode-tools-for-ai
- https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/machine-learning/how-to-manage-resources-vscode.md


Python SDK reference

- https://azuresdkdocs.blob.core.windows.net/$web/python/azure-storage-file-datalake/12.0.0b7/azure.storage.filedatalake.html?highlight=datalakefileclient#azure.storage.filedatalake.DataLakeFileClient

Posted at https://sl.advdat.com/3iUXx8I