Monday, May 2, 2022

Integrating Terraform and Azure DevOps to manage Azure Databricks

As continuous integration and continuous delivery (CI/CD) culture has grown in popularity, it has brought the challenge of automating everything, with the aim of making processes easier and more maintainable for everyone.

 

One of the most valuable aspects of CI/CD is the integration of the Infrastructure as Code (IaC) concept. With IaC we can version our infrastructure, save money and create new environments in minutes, among many other benefits. I won't go deeper into IaC here, but if you want to learn more, visit: The benefits of Infrastructure as Code

 

IaC can also bring challenges when creating the resources a project needs. This is mostly because writing the infrastructure scripts is a task usually assigned to infrastructure engineers, and sometimes, for whatever reason, their help isn't available.

 

As a data engineer, I would like to help you understand the CI/CD process with a hands-on example. You'll learn how to create an Azure Databricks workspace through Terraform and Azure DevOps, whether you are building projects on your own or supporting your infrastructure team.

 

In this article, you'll learn how to integrate Azure Databricks with Terraform and Azure DevOps. The main reason I wrote it is that I had difficulty finding information that covers these three technologies together.

 

First of all, you'll need some prerequisites:

 

  • Azure Subscription
  • Azure Resource Group (you can use an existing one)
  • Azure DevOps account
  • Azure Storage Account with a container named "tfstate"
  • Visual Studio Code (it's up to you)

So, let's start and have some fun.

 

Go ahead and download or clone the GitHub repository databrick-tf-ado and check out the demo-start branch.

In the root folder you'll see a file named main.tf, and two more files in the folder modules/databricks-workspace.

 

Vanessa_Segovia_0-1651505246300.png

 

Note that this is a basic example; you can find more information about all the features of the Databricks provider at this link: https://registry.terraform.io/providers/databrickslabs/databricks/latest/docs

 

Now, go to the main.tf file in the root folder and find line 8, where the azurerm backend declaration starts:

 

 

  backend "azurerm" {
    resource_group_name  = "demodb-rg"
    storage_account_name = "demodbtfstate"
    container_name       = "tfstate"
    key                  = "dev.terraform.tfstate"
  }

 

 

There you need to change the values of resource_group_name and storage_account_name to match your subscription. You can find those values in the Azure portal; both resources must already exist.
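For context, here is a minimal sketch of how a backend declaration like this usually sits inside the top-level terraform block. It is not a copy of the repository's file: the provider sources are assumptions based on the registry link above, and the resource group and storage account names are hypothetical placeholders for your own values.

```hcl
# Sketch only: layout and provider sources are assumptions,
# not a copy of the repository's main.tf.
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
    databricks = {
      # Source matching the registry link in this article; newer
      # provider releases are published under "databricks/databricks".
      source = "databrickslabs/databricks"
    }
  }

  # Remote state is stored in the storage account listed in the prerequisites.
  backend "azurerm" {
    resource_group_name  = "my-existing-rg"    # hypothetical: your resource group
    storage_account_name = "myexistingtfstate" # hypothetical: your storage account
    container_name       = "tfstate"
    key                  = "dev.terraform.tfstate"
  }
}

provider "azurerm" {
  features {}
}
```

Backend blocks cannot reference Terraform variables, which is why these values are written as literals and have to be edited by hand.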

 

storageaccount.png

 

 

The main.tf file in the root folder references a module called "databricks-workspace". In that module's folder you can see two more files: main.tf and variables.tf.
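A module reference of this kind generally looks like the sketch below. The input names (resource_group_name, location, workspace_name) and their values are hypothetical; in the real repository they have to match whatever the module's variables.tf declares.

```hcl
# Hypothetical root-level module call; input names and values are illustrative.
module "databricks-workspace" {
  # Relative path to the folder that holds the module's main.tf and variables.tf.
  source = "./modules/databricks-workspace"

  resource_group_name = "my-existing-rg" # hypothetical: your resource group
  location            = "eastus"         # hypothetical: your Azure region
  workspace_name      = "demodb-workspace"
}
```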

 

The module's main.tf contains the definitions, in the format that Terraform requires, to create a Databricks workspace, a cluster, a secret scope, a secret and a notebook. variables.tf holds the values that can change depending on the environment.
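To give an idea of what such a module can look like, here is a minimal sketch under stated assumptions. Every resource name, variable name and default below is hypothetical and may differ from the repository; the two files are shown in one block, separated by comments.

```hcl
# --- variables.tf (hypothetical names and defaults) ---
variable "resource_group_name" {
  type        = string
  description = "Existing resource group that will hold the workspace"
}

variable "location" {
  type    = string
  default = "eastus"
}

variable "workspace_name" {
  type    = string
  default = "demodb-workspace"
}

variable "secret_value" {
  type      = string
  sensitive = true # keep the secret out of plan output
}

# --- main.tf (sketch) ---
terraform {
  required_providers {
    databricks = {
      source = "databrickslabs/databricks"
    }
  }
}

resource "azurerm_databricks_workspace" "this" {
  name                = var.workspace_name
  resource_group_name = var.resource_group_name
  location            = var.location
  sku                 = "standard"
}

# Point the Databricks provider at the workspace created above.
provider "databricks" {
  azure_workspace_resource_id = azurerm_databricks_workspace.this.id
}

# Look up a runtime version and node type once the workspace exists.
data "databricks_spark_version" "latest_lts" {
  long_term_support = true
  depends_on        = [azurerm_databricks_workspace.this]
}

data "databricks_node_type" "smallest" {
  local_disk = true
  depends_on = [azurerm_databricks_workspace.this]
}

resource "databricks_cluster" "demo" {
  cluster_name            = "demo-cluster"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 20
  num_workers             = 1
}

resource "databricks_secret_scope" "demo" {
  name = "demo-scope"
  # The standard tier has no secret ACLs, so MANAGE goes to all users.
  initial_manage_principal = "users"
}

resource "databricks_secret" "demo" {
  scope        = databricks_secret_scope.demo.name
  key          = "demo-key"
  string_value = var.secret_value
}

resource "databricks_notebook" "demo" {
  path           = "/Shared/demo-notebook"
  language       = "PYTHON"
  content_base64 = base64encode("print('hello from terraform')")
}
```

Keeping environment-specific values in variables means the same module can later be reused for other stages by passing different inputs.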

 

Now that you have changed the values mentioned above, upload the code to a GitHub or Azure DevOps repository. If you need assistance with that, visit these pages: GitHub or DevOps.

 

At this point we have our GitHub or Azure DevOps repository configured with the names we need, so let's create the pipeline that deploys our Databricks environment into our Azure subscription.

 

First, go to your Azure subscription and check that you don't already have a Databricks workspace called demodb-workspace.

 

portalazurebefore.png

 

 

You'll need to install an extension so that Azure DevOps can run Terraform commands, so go to Terraform Extension.

 

Once it is installed, go to your project in Azure DevOps, click Pipelines > Releases and create a "New pipeline". You get the option of creating the pipeline with YAML or with the editor; I'll choose the editor so we can see each step more clearly.

 

Vanessa_Segovia_3-1651505246308.png

 

 

In the Artifacts section of the pipeline, click Add an artifact, select your source type (the provider where you uploaded your repository), fill in all the required information as in the image below, and click "Add".

 

addartifact.png

 

 

Then click Add stage in the Stages section, choose Empty job and name the stage "DEV".

 

addstage.png

 

After that, click on Jobs below the name of the stage.

Vanessa_Segovia_6-1651505246314.png

 

In the Agent job, press the "+" button, search for "terraform" and select "Terraform tool installer".

 

addinstallterraform.png

Leave the default settings.

 

Then add three more tasks of type "Terraform".

 

addterraformtask.png

 

Name the second task, the one right after the installer, "Init" and fill in the required information as shown in the image:

 

init.png

 

 

For all three of these tasks, set your subscription, resource group, storage account and container. There is also a field labeled Key; set it to "dev.terraform.tfstate". This is the name of the state file that Terraform uses to keep track of your infrastructure changes.

 

suscription.png

 

Name the next task "Plan".

 

plan.png

 

Name the last task "Apply".

 

apply.png

 

Now change the name of your pipeline and save it.

 

namepipeline.png

 

Now we only need to create a release to test it.

 

You can monitor the progress:

 

progress.png

 

 

When it finishes, if everything went well, you'll see your pipeline marked as successful.

 

success.png

 

Lastly, let's confirm in the Azure portal that everything was created correctly.

 

finalportal.png

 

Then log in to your workspace and run the notebook to test that the cluster, the scope, the secret and the notebook are working correctly.

 

workspace.png

 

 

With that, you can easily keep your environments safe from ad hoc changes by contributors, because there is only one way to get modifications into your infrastructure: the pipeline.

 

Let us know if you have any comments or questions.

Posted at https://sl.advdat.com/3s6hkrb