Advanced Data Solutions : Configuring Azure ML projects to run on GitHub Codespaces

Introduction

GitHub Codespaces is a cloud-based development environment that runs your code in a container. It essentially gives you the ability to move your development workflow to the cloud, and to configure your remote environment such that it looks and feels just like your local development environment.

Codespaces offers several advantages:

It enables you, the readers of my blog, to run all the code in this blog in the codespaces I have configured. Previously you would have to clone my repo to your local machine, create and activate the conda environment I provide, install the Azure CLI and Azure ML extension, and only then could you run the code. Now, you can simply start the codespace I provide, which already contains all the necessary setup, and go straight to running the code!
Similarly, people collaborating on a project can now focus on the project itself, instead of wrangling with the project setup on their local machine. This might not seem like a big advantage for simple projects, but it brings significant time savings for larger projects with many dependencies and collaborators.
For people who are on the go with lightweight, low-powered machines, it enables meaningful software development work by using a more powerful machine in the cloud for all of their running and debugging needs.
And finally, by moving development to the cloud, it frees up your local machine in case you (like me) want to try out the latest unstable version of some software. It’s easy to repave your machine if you get into trouble, and get back to work with little downtime.

In this post, I’ll start by guiding you through the steps needed to set up your Azure ML projects on Codespaces. Then I’ll discuss the different ways to configure your VS Code setup using settings, including tips on how to best set up your machine learning projects. And finally, I’ll discuss the issues I ran into while enabling Codespaces for my blog, and the solutions to those issues.

In order to get the most out of this post, it’s best that you already have some familiarity with Git, GitHub, VS Code, Python, Conda, and Azure ML.

You can find my Codespaces configuration for Azure ML in the .devcontainer folder of my Fashion-MNIST GitHub repo. I also use my SINDy GitHub repo and my Activity GitHub repo as examples to demonstrate various aspects of my configuration.

Setting up a codespace for an Azure ML project

At the time of writing, you have access to Codespaces if your repository belongs to a GitHub organization that has been set up using a “GitHub Team” or “GitHub Enterprise Cloud” plan, and the owner of that organization has enabled Codespaces. If you’re the owner of a GitHub organization, you can enable Codespaces by going to “Settings” in your GitHub page, “Codespaces,” “General,” and then choosing a “User permissions” setting other than “Disabled.” Codespaces pricing is explained in the documentation. Alternatively, if you want to use Codespaces in your individual account, you can join the waitlist for Codespaces beta — this is free for now! The rest of the post assumes that you have access to Codespaces, either through an organization or your individual account.

To start Codespaces for a GitHub repository, go to the main page of that repository, click on the green “Code” button, choose the “Codespaces” tab, and click on “New codespace”.

Screenshot of the popup where we can create a new codespace.

Depending on how Codespaces is set up, you may be given a choice of machine type. For example, if the owner of your organization decided that everyone within that organization gets a certain machine type (which can be done by going to “Settings,” “Codespaces,” “Policy”), then there is no choice to be made, and this dialog is skipped. If, on the other hand, you were given permission to use more than one machine type, a dialog similar to the one below is displayed:

Screenshot of a dialog asking us to choose a machine type.

After you make your choice, click on the “Create codespace” button, and a new codespace is created. A new window is displayed that looks like the following:

Screenshot of the window displayed while waiting for codespace creation.

After a little while, your codespace finishes setting up, and VS Code opens in the browser, together with the code in your repository. This is pretty exciting! You’re now running a fully-fledged version of VS Code, just like the one you’re used to running locally, but in a Docker container, in the cloud!

If you were developing a simple application with common requirements (without the need for Azure ML or any special packages), you’d be done. You could run, debug, and re-run your code as usual. That’s because the base Linux image used in the default Docker container includes many of the most popular languages, including Python, Node.js, and JavaScript. You can take a look at the Dockerfile for this base image to see everything that is included. (You may have noticed that a directory named .venv is created at this point, containing your virtual environment. If your .gitignore file doesn’t contain this directory already, this is a good time to add it.)

Our goal is to run an Azure ML application, so we need a bit more setup. Thankfully, Microsoft provides us with predefined containers with common configurations, and an intuitive user interface to install additional features. We can easily replace the default container with a more fully-featured container by going to the VS Code Command Palette (Ctrl + Shift + P), and choosing “Codespaces: Add Development Container Configuration Files…”

Screenshot of the command palette showing the command to add a dev container configuration file.

You will first be asked to select a container configuration definition. I use Miniconda for all my Python package management, so that’s what I’ll choose:

Screenshot of all the available container configuration definitions.

You will then be asked to select additional features to install. The Azure CLI needs to be installed before we can install Azure ML, so I’ll check that box:

Screenshot of the additional features available to install.

You’re now ready to rebuild the codespace with these settings. You’ll likely get a notification asking if you want to rebuild the container — go ahead and do so. Alternatively, you can go to the Command Palette and choose “Codespaces: Rebuild Container.”

Once the codespace finishes rebuilding, VS Code reappears in your browser, and your terminal shows a notice saying that you’re using a custom image defined in your devcontainer.json file:

Screenshot showing terminal with custom image.

Nice! We’re now running our code in a Docker container in the cloud, with conda and the Azure CLI installed! The configuration files for this codespace can be found under the .devcontainer folder. I like to keep just the parts I need and remove everything else, but you can keep everything that was generated and add to it as needed. Here are the essential parts of the Dockerfile:

https://github.com/bstollnitz/fashion-mnist/tree/master/.devcontainer/Dockerfile

# See here for image contents: https://github.com/microsoft/vscode-dev-containers/tree/v0.209.6/containers/python-3-miniconda/.devcontainer/base.Dockerfile
FROM mcr.microsoft.com/vscode/devcontainers/miniconda:0-3

# Update the conda environment according to the conda.yml file in the project.
COPY conda.yml /tmp/conda-tmp/
RUN /opt/conda/bin/conda env update -n base -f /tmp/conda-tmp/conda.yml && rm -rf /tmp/conda-tmp

As you can see, this file uses another Dockerfile as a base, and that base already contains the commands needed to install Miniconda. The choice you made earlier for the “container configuration definition” determined the base Dockerfile used here. In addition, the Dockerfile copies the conda.yml file in the root of the project to the container, and updates the base environment according to the contents of that file. (If you use a different name or location for your conda file, remember to update it here.) This is super handy — when you open your project in a codespace, all the packages needed to run the project will be installed automatically!

Now let’s look at the essential parts of the devcontainer.json file:

https://github.com/bstollnitz/fashion-mnist/tree/master/.devcontainer/devcontainer.json

// For format details, see https://aka.ms/devcontainer.json. For config options, see the README at:
// https://github.com/microsoft/vscode-dev-containers/tree/v0.209.6/containers/python-3-miniconda
{
    "name": "Miniconda (Python 3)",
    "build": {
        "context": "..",
        "dockerfile": "Dockerfile",
    },
    "settings": {
        "python.defaultInterpreterPath": "/opt/conda/bin/python",
    },
    "features": {
        "azure-cli": "latest"
    },
    "remoteUser": "vscode",
}

We’ll add a few more settings to this file as we go, but for now you can see that the location of the Dockerfile is specified first, followed by the location of the Python interpreter, additional features, and the name of the remote user. In particular, notice the “azure-cli” feature. This was added when you selected “Azure CLI” as an additional feature during the setup.

Just by choosing a more appropriate base image and additional features, we have Miniconda and the Azure CLI installed in our container. We’re most of the way there, but we still need to install Azure ML. It turns out that the devcontainer gives us a hook to execute a command after the container is created, through the onCreateCommand property. This is exactly what we need: we can simply set the value of this property to the installation command for Azure ML, as shown below:

https://github.com/bstollnitz/fashion-mnist/tree/master/.devcontainer/devcontainer.json

// For format details, see https://aka.ms/devcontainer.json. For config options, see the README at:
// https://github.com/microsoft/vscode-dev-containers/tree/v0.209.6/containers/python-3-miniconda
{
    "name": "Miniconda (Python 3)",
    "build": {
        "context": "..",
        "dockerfile": "Dockerfile",
    },
    "settings": {
        "python.defaultInterpreterPath": "/opt/conda/bin/python",
    },
    "features": {
        "azure-cli": "latest"
    },
    "remoteUser": "vscode",
    "onCreateCommand": "az extension add -n ml -y",
}

For a full list of all the properties you can add to the devcontainer file, check out the documentation.
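If you ever want to inspect a devcontainer.json file from a script, keep in mind that the file shown above contains // comments and trailing commas, which Python's json module rejects. Here is a minimal sketch of one way to read it, assuming that comments only appear on lines of their own (as they do above). The function name load_devcontainer is hypothetical, and the comma-stripping regular expression would also alter a string value that happens to contain a comma directly before a closing bracket.

```python
import json
import re


def load_devcontainer(text: str) -> dict:
    """Parse devcontainer.json text with // comment lines and trailing commas."""
    # Drop lines that consist only of a // comment. Comments that follow
    # other content on the same line are NOT handled by this sketch.
    lines = [
        line for line in text.splitlines()
        if not line.lstrip().startswith("//")
    ]
    without_comments = "\n".join(lines)
    # Remove any comma that is followed only by whitespace and a closing } or ].
    without_trailing_commas = re.sub(r",(\s*[}\]])", r"\1", without_comments)
    return json.loads(without_trailing_commas)


EXAMPLE = """
// For format details, see https://aka.ms/devcontainer.json.
{
    "name": "Miniconda (Python 3)",
    "features": {
        "azure-cli": "latest"
    },
    "remoteUser": "vscode",
    "onCreateCommand": "az extension add -n ml -y",
}
"""

config = load_devcontainer(EXAMPLE)
print(config["onCreateCommand"])
```

A check like this can be handy in a test that confirms the properties you rely on are still present after someone edits the file.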

Our last step in configuring a codespace for our machine learning project is to add the VS Code extensions we need to the devcontainer file. For machine learning projects on Azure, I highly recommend installing the Azure ML extension, which will give you the ability to train and deploy models on Azure directly from within VS Code, and will enable intellisense for your Azure ML YAML configuration files. Installing this extension adds a few other extensions that I find essential, such as the Python extension, and the Pylance extension. You can install an extension by clicking on the “Extensions” icon on the left side of VS Code, searching for its name, selecting the extension you’re looking for, and clicking its blue “Install in Codespaces” button.

Screenshot showing the Azure ML extension page in VS Code.

Then, to instruct the devcontainer to install the extension when the container starts up, click on the gear button to the right of the “Uninstall” button, and select “Add to devcontainer.json”.

Screenshot showing a popup that says "Add to devcontainer.json".

This adds to the “extensions” section of the devcontainer.json file:

https://github.com/bstollnitz/fashion-mnist/tree/master/.devcontainer/devcontainer.json

// For format details, see https://aka.ms/devcontainer.json. For config options, see the README at:
// https://github.com/microsoft/vscode-dev-containers/tree/v0.209.6/containers/python-3-miniconda
{
    "name": "Miniconda (Python 3)",
    "build": {
        "context": "..",
        "dockerfile": "Dockerfile",
    },
    "settings": {
        "python.defaultInterpreterPath": "/opt/conda/bin/python",
    },
    "extensions": [
        "ms-toolsai.vscode-ai",
    ],
    "features": {
        "azure-cli": "latest"
    },
    "remoteUser": "vscode",
    "onCreateCommand": "az extension add -n ml -y",
}

This is all you need to start using Azure ML in a codespace! You can now rebuild the container like you did before (Ctrl + Shift + P followed by “Codespaces: Rebuild Container”) and start using it!
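After the rebuild, a quick sanity check can confirm that the Azure CLI and the Azure ML extension actually made it into the container. The sketch below is one way to do that from Python. It assumes that "az extension list -o json" prints a JSON array of objects with a "name" field, and the helper names has_ml_extension and check_environment are hypothetical.

```python
import json
import shutil
import subprocess
import sys
from typing import List


def has_ml_extension(extension_list_json: str) -> bool:
    """Return True if the JSON from `az extension list` includes "ml"."""
    extensions = json.loads(extension_list_json)
    return any(ext.get("name") == "ml" for ext in extensions)


def check_environment() -> List[str]:
    """Collect human-readable problems with the codespace setup."""
    problems = []
    if shutil.which("az") is None:
        problems.append("Azure CLI (az) is not on the PATH.")
        return problems
    result = subprocess.run(
        ["az", "extension", "list", "-o", "json"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Only parse the output when the command succeeded.
    if result.returncode != 0 or not has_ml_extension(result.stdout):
        problems.append("Azure ML CLI extension (ml) is not installed.")
    return problems


if __name__ == "__main__":
    print(f"Python interpreter: {sys.executable}")
    for message in check_environment() or ["No problems found."]:
        print(message)
```

In the container configured above, the interpreter path printed should be /opt/conda/bin/python, matching the python.defaultInterpreterPath setting.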

With the work you’ve done, other people who have access to codespaces can now run your project in a container in the cloud, without doing any of the Azure ML setup! This is what they’ll see when they go to your repo on GitHub:

Screenshot of the popup where we can create a new codespace.

Followed by a choice of machine type:

Screenshot of a dialog asking us to choose a machine type.

And then your code opens in VS Code in the browser, in a pre-configured container, ready to be executed!

The Dockerfile and devcontainer.json configurations I show here contain all the steps you need to run the projects in this blog. If you’ve been following my Azure ML posts, you know that I have a preference for YAML configuration files over the Python SDK, and for .py files over notebooks. If you need support for the Azure ML Python SDK or for Jupyter, I recommend that you take a look at the setup provided by the Azure ML team, which you can find on the team’s GitHub. Alternatively, instead of choosing the Miniconda base earlier in these instructions, you could have selected “Show All Definitions,” followed by “Azure Machine Learning.” This gives you a container with full Anaconda and the Azure ML Python SDK, as you can see in their Dockerfile. Keep in mind that if you choose this base, you don’t get the Azure CLI and the Azure ML CLI extension already installed, so you still need to follow all the steps in this post. Also, this Dockerfile doesn’t update the base conda environment with your conda file, so you might want to add that to the Dockerfile. We’ve been looking into providing a better Azure ML base image experience.

In this section, I showed you how to get Azure ML up and running on Codespaces. In the next few sections, I’ll explain how to log in to Azure, I’ll give you some tips on configuring VS Code for machine learning projects on Codespaces, and I’ll discuss the issues I encountered while adding devcontainers to the projects in my blog.

Logging in to Azure

Once you start using your codespace, you’ll need to log in to Azure. There are two separate logins, each needed for different tasks:

You’ll need to log in using your terminal before you can execute “az” commands, and before you can execute YAML files in the cloud (which can be done by right-clicking on the opened file and choosing “Azure ML: Execute YAML”). You can also use the terminal to set your default account, resource group, and machine learning workspace, to avoid having to specify this information in every command.

az login --use-device-code
az account set -s "<YOUR_ACCOUNT>"
az configure --defaults group="<YOUR_RESOURCE_GROUP>" workspace="<YOUR_ML_WORKSPACE>"

You’ll need to log in using the Azure Account extension before you can browse your cloud resources within VS Code, or get intellisense while editing YAML files. You can log in by clicking the Azure “A” icon in VS Code’s left side bar, and selecting “Sign in to Azure…” You’ll then be presented with the list of Azure accounts available to you, and selecting an account will show a list of available workspaces. You can mark a workspace as default by clicking on the pin to its right.

We’re looking into ways to consolidate the two login experiences.

Configuring VS Code settings

VS Code is highly customizable through the use of a wide range of settings. There are a few different locations where you can add these settings, though, and choosing the right place can make a big difference in how efficient you are at working across projects and machines.

Any time I want to add a new setting to VS Code, I choose one of the following three locations for the setting:

.devcontainer/devcontainer.json — I add to this file settings that are specific to running the project in a codespace, as we saw in the previous section. This is a good place to set the default Python interpreter path because that path is specific to Codespaces.
.vscode/settings.json — I add to this file settings that are specific to the project, regardless of whether I run the project on my local machine or in a codespace. This is where I add my linter and formatter choices, as we’ll see in the next section.
VS Code user settings with “Settings Sync” — If I want a particular setting to apply to all projects across machines, I add it to my VS Code settings and enable “Settings Sync.” This is what I’ll talk about next.

To change your VS Code settings, click on the button with a gear-shaped icon at the bottom-left of the VS Code window, and choose “Settings”:

Screenshot of the gear button with popup menu open and a "Settings" option highlighted.

Once you’ve opened the settings editor, you can browse and search for all VS Code settings. By default, any setting that you change in this editor will apply to VS Code on the machine you are currently using. For a consistent development experience across your local machine, Codespaces, and any other machine where you use VS Code, I recommend turning on “Settings Sync.” This enables all your settings to be associated with your GitHub (or Microsoft) account, and it causes them to sync every time you open VS Code as that user. You can turn on “Settings Sync” by clicking again on the VS Code gear button, and then “Turn on Settings Sync…”:

Screenshot of the gear button with popup menu open and a "Settings Sync" option highlighted.

You’ll then be taken to the following dialog, where you can configure which settings you want to sync. I like to keep all the checkboxes checked:

Screenshot of the "Settings Sync" options.

Next click on the “Sign in & Turn on” button, select your account (GitHub or Microsoft), and your settings will be synced. If you have conflicts in settings from different machines, you’ll be given a chance to select which settings you want to prevail. In addition to seeing your preferences in the “Settings” editor, you can also see them in json format, by going to the Command Palette and selecting “Preferences: Open Settings (JSON).” I like to use the “Settings” tab to browse all the settings that are available to me, and the json file to quickly glance at all the settings I have customized.

My VS Code user settings are not specific to machine learning projects — they apply to every project! For example, this is where I set the color theme for the VS Code user interface (“workbench.colorTheme”: “Default Dark+”), and where I instruct VS Code to show me differences in whitespace when diffing two files (“diffEditor.ignoreTrimWhitespace”: false).

I’ve been a fan of Settings Sync for a while, because it enables me to re-install VS Code and immediately start working in a familiar environment. But with Codespaces, it’s more important than ever. It plays a big role in ensuring that your cloud environment feels as comfortable as your local one.

In this section, I explained how you can configure your VS Code settings, depending on where you want them to apply. In the next section, I will talk about a few of the VS Code settings I use for my machine learning projects, with a special focus on linting and formatting.

Linting and automatic formatting

A good choice of linter and formatter makes a world of difference when writing Python code in VS Code!

A typical Python linter analyzes your source code, verifies that it follows PEP8, the official style guide for Python, and warns you of any instances where it doesn’t. The PEP8 style guide provides guidance on matters such as indentation, maximum line length, and variable naming conventions. I like to use Pylint as my linter because in addition to PEP8 style checks, it also does error checking, such as detecting when I’ve used a module without importing it. Pylint is one of the most popular linters for Python at the time I’m writing this.

A typical Python formatter auto-formats your Python code according to the PEP8 standard. For example, imagine that you have a line of code that’s longer than the maximum line length recommended by PEP8. Running the linter will give you a warning, but it won’t fix the issue for you. That’s where the formatter comes in: when you run it, it breaks up the code onto multiple lines automatically. I like to use YAPF from Google, because in addition to making sure your code conforms to PEP8, it also makes it look good. In the example I mentioned, YAPF won’t just break up the line so that it doesn’t violate PEP8’s max line length, it also breaks it up so that it’s as easy as possible to read.
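To make the line-length example concrete, here is a small standard-library sketch of the kind of check a linter performs. It is only an illustration of the idea, not how Pylint or YAPF are implemented, and the function name find_long_lines is hypothetical.

```python
from typing import List, Tuple

MAX_LINE_LENGTH = 79  # PEP8's recommended maximum line length


def find_long_lines(
    source: str, max_length: int = MAX_LINE_LENGTH
) -> List[Tuple[int, int]]:
    """Return (line_number, length) for every line longer than max_length."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > max_length
    ]


# A two-line sample: the first line is short, the second is far too long.
sample = "x = 1\n" + "y = " + " + ".join(["value"] * 20) + "\n"
for number, length in find_long_lines(sample):
    print(f"line {number}: {length} characters (limit {MAX_LINE_LENGTH})")
```

A linter stops at reporting the offending line. A formatter goes one step further and rewrites the line so that it fits.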

I set my linter and formatter settings in the .vscode/settings.json file within each project, because I may want to customize them per project. Applying them to every project is not the best choice for me because I have some projects that rely on TypeScript and Node.js (such as this blog), and some C# .NET projects, too (like my old blog). But if you write Python exclusively, adding these settings to your user-level VS Code settings might be your best choice. I don’t recommend including them in your devcontainer.json file because you typically want your development environment to be the same locally and in Codespaces.

Here are the contents of my .vscode/settings.json file for the Fashion-MNIST project:

https://github.com/bstollnitz/fashion-mnist/tree/master/.vscode/settings.json

{
    "python.linting.pylintEnabled": true,
    "python.formatting.provider": "yapf",
    "editor.rulers": [
        80
    ],
    "editor.formatOnSave": true,
}

The first two lines specify my choices of linter (Pylint) and formatter (YAPF). The third line instructs VS Code to display a thin vertical line at character 80, since the max line length recommended by PEP8 is 79 characters. This just helps me to visualize where my code should wrap. The fourth line tells VS Code to run YAPF every time I save my code. This is super handy! I can write my code without worrying about making it pretty, and a simple “Ctrl + S” formats it exactly the way I want it!

Now that we’ve enabled Pylint and YAPF, we need to configure their settings, which we typically do by adding a .pylintrc file and a .style.yapf file to the root of the project. The .style.yapf file that I add to all my projects has the following contents:

https://github.com/bstollnitz/sindy/blob/main/.style.yapf

[style]
based_on_style = google

The Formatting style section of YAPF’s documentation lists the four base styles supported by YAPF: “pep8”, “google”, “yapf”, and “facebook”. I chose Google’s style because before using YAPF I was already following their very comprehensive Google Python style guide. The YAPF docs contain a lot more information to further customize how you want YAPF to work.

The .pylintrc file that I use also follows the Google Python style guide, and can be found in this location. Occasionally, I don’t want a particular rule to be enforced, and so I disable it in one of two ways:

If I want it to be disabled for the whole project, I add it to the “disable” section in the .pylintrc file. For example:

https://github.com/bstollnitz/sindy/blob/main/.pylintrc

disable=abstract-method,
        ...
        invalid-name,

If I want it to be disabled just for a particular instance, I add a comment immediately above the line with the lint warning. For example:

https://github.com/bstollnitz/sindy/blob/main/sindy/lorenz-pysindy/src/2_fit.py

# pylint: disable=unused-import
from pysindy.differentiation import FiniteDifference, SINDyDerivative

We discussed here two of the files I add to the root of every project I create. Before we move on, let’s take a brief look at the overall structure I use for all my machine learning projects. They all contain a .devcontainer folder with a Dockerfile and devcontainer.json, as we discussed earlier in this post. They also contain a .vscode folder containing a launch.json with the launch configuration(s) I want (which determines what happens when I press F5), and a settings.json containing any VS Code settings specific to the project. In addition, they contain a folder with the name of the repo, where you can find all my code. And finally, they contain the following files:

A .gitignore file specifying all files and directories that I want git to ignore.
A .pylintrc file and a .style.yapf file, which configure the linter and formatter I use for my projects, as I explained earlier in this section.
A conda.yml file listing all the packages I need to be installed to run the code. As we saw earlier in this post, we configured the Dockerfile to install these automatically when the container image is built.
A LICENSE file. I use an MIT License for all my code because I want to allow everyone to use it for all purposes, including commercial projects.
A README.md file, containing instructions to run the code and links to blog posts that explain the code in detail.

Screenshot of the structure of my fashion-mnist project on VS Code.
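If you'd like to check that a new project follows this layout, a short script can report what's missing. The sketch below assumes the file names listed above. The names EXPECTED_PATHS and missing_paths are hypothetical, and the folder named after the repo is left out because its name differs per project.

```python
from pathlib import Path
from typing import List

# Files described in this section, relative to the project root.
EXPECTED_PATHS = [
    ".devcontainer/Dockerfile",
    ".devcontainer/devcontainer.json",
    ".vscode/launch.json",
    ".vscode/settings.json",
    ".gitignore",
    ".pylintrc",
    ".style.yapf",
    "conda.yml",
    "LICENSE",
    "README.md",
]


def missing_paths(project_root: Path) -> List[str]:
    """Return the expected files that do not exist under project_root."""
    return [
        relative for relative in EXPECTED_PATHS
        if not (project_root / relative).is_file()
    ]


if __name__ == "__main__":
    for relative in missing_paths(Path.cwd()):
        print(f"missing: {relative}")
```

Running it from the root of a repo prints one line per missing file, and nothing if the layout is complete.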

This is by no means the only way to structure machine learning projects, but it has worked well for me over the years. If you have suggestions on how to improve it, please do reach out.

Troubleshooting

In this section, I will cover the issues I encountered while adding devcontainers to the projects in my blog. I will keep adding to this list as I find and solve new issues.

libGL.so.1: cannot open shared object file: No such file or directory

One issue I ran into was the following exception on an import cv2 line:

Exception has occurred: ImportError
libGL.so.1: cannot open shared object file: No such file or directory

The import cv2 line imports the OpenCV computer vision library, and OpenCV requires the libGL library. It turns out that this library doesn’t come pre-installed in my container, although it came pre-installed in my local environment (I use WSL2 locally). Installing the library was easily accomplished by adding the following lines to the Dockerfile:

https://github.com/bstollnitz/sindy/blob/main/.devcontainer/Dockerfile

    # Install dependencies of OpenCV that don't come with it.
    # Update and install in one RUN so a cached package index is never reused.
    RUN apt-get update && apt-get install -y libgl1

OutOfMemory exception

Another issue I encountered was an out-of-memory exception when running one of my projects on an 8 GB machine. The solution was to select a more powerful machine while creating the codespace (16 GB did the trick for me). However, I don’t want my readers to run into the same exception, so I added the following to the devcontainer.json:

https://github.com/bstollnitz/activity/blob/main/.devcontainer/devcontainer.json

    "hostRequirements": {
        "memory": "16gb",
    },

This ensures that anyone who creates a codespace for this repo will only be presented with machine choices with 16 GB of memory or more.

Matplotlib graphs aren’t displayed

And finally, I have several files that display matplotlib graphs in popup windows when run locally. This doesn’t work in Codespaces, because a VS Code instance running in the browser can’t open a Windows-style popup window. One workaround is to display the graphs right within VS Code using a Python Interactive Window. You can run a snippet of code in an Interactive Window by placing a # %% line right before the code and then clicking “Run Cell.” In my scenario, I wanted to run an entire Python file, so I navigated to the Command Palette and selected “Jupyter: Run Current File in Interactive Window.”

I encountered one issue while running my code in an Interactive Window though. The following code threw an exception, because the IPython kernel passes several unexpected command-line arguments to my main function:

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", dest="data_dir", default=DATA_DIR)
    parser.add_argument("--output_dir", dest="output_dir", default=OUTPUT_DIR)
    args = parser.parse_args()
    ...

Since there’s no way to prevent VS Code from passing IPython arguments when launching an Interactive Window, I wrote a bit of code that detects this situation and sidesteps the problem. One way to see if we’re running in an Interactive Window is to check whether the IPython class name is “ZMQInteractiveShell”, as you can see in the code below:

https://github.com/bstollnitz/sindy/blob/main/sindy/lorenz-custom/src/3_predict.py

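import argparse
import sys

from IPython import get_ipython


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", dest="data_dir", default=DATA_DIR)
    parser.add_argument("--output_dir", dest="output_dir", default=OUTPUT_DIR)
    # get_ipython() returns None outside IPython, so the class name is
    # "NoneType" when this file runs as a regular script.
    shell = get_ipython().__class__.__name__
    argv = [] if (shell == "ZMQInteractiveShell") else sys.argv[1:]
    args = parser.parse_args(argv)
    ...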

And the problem is fixed! Now if I’m running locally, I can press F5 and the graphs show up in popup windows. And if I’m on Codespaces, I can run in an Interactive Window and the graphs are displayed there. Either way, my matplotlib graphs are displayed as expected.
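If several scripts need the same check, it can be factored into a small helper. The following is a sketch of one possible variation, not the code from the repo. The names in_interactive_window and parse_args are hypothetical, the "data" and "output" defaults are placeholders for DATA_DIR and OUTPUT_DIR, and the try/except additionally covers environments where IPython isn't installed.

```python
import argparse
import sys
from typing import List, Optional


def in_interactive_window() -> bool:
    """Return True when running under an IPython kernel."""
    try:
        from IPython import get_ipython
    except ImportError:
        # IPython isn't installed, so this can't be an Interactive Window.
        return False
    # get_ipython() returns None when the code runs as a plain script.
    return get_ipython().__class__.__name__ == "ZMQInteractiveShell"


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse command-line arguments, ignoring the kernel's own arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", dest="data_dir", default="data")
    parser.add_argument("--output_dir", dest="output_dir", default="output")
    if argv is None:
        argv = [] if in_interactive_window() else sys.argv[1:]
    return parser.parse_args(argv)


if __name__ == "__main__":
    print(parse_args())
```

Accepting an optional argv also makes the parsing easy to exercise in a test, for example with parse_args(["--data_dir", "my_data"]).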

Interactive Windows are exciting, and I expect that they’ll get a lot more use as GitHub Codespaces grows in popularity!

Conclusion

In this post, you learned how to configure your Azure ML projects to run on GitHub Codespaces, where to configure your VS Code settings depending on where you want them to apply, and how to use VS Code settings to set up a linter and formatter for your Python projects. You also saw some of the issues I encountered while configuring my machine learning projects to run on Codespaces, and the solutions I found for them. I hope that you learned something useful, and that you’ll give GitHub Codespaces a try!

I want to thank Banibrata De and Daniel Schneider from the Azure ML team, Rong Lu, Sid Unnithan and Rich Chiodo from the Visual Studio team, and Tanmayee Prakash Kamath from the GitHub team, for helpful discussions about many of the topics in this post.

About the author

Bea Stollnitz is a principal developer advocate at Microsoft, focusing on Azure ML. See her blog for more in-depth articles about Azure ML and other machine learning topics.

Posted at https://sl.advdat.com/36iwdOT

Tuesday, March 15, 2022
