Thursday, November 11, 2021

Azure Databricks Artifacts Deployment

This article describes how to deploy JAR files, JSON configuration files, Python wheel (.whl) files, and global init scripts to a Databricks workspace.

 


Overview:

  • In a Databricks workspace, we have notebooks, clusters, and data stores. These notebooks run on Databricks clusters and use data stores when they need to refer to custom configuration on the cluster.
  • Developers need environment-specific configurations, mapping files, and custom functions packaged as libraries to run the notebooks in the Databricks workspace.
  • Developers need global init scripts, which run on every cluster created in the workspace. Global init scripts are useful when you want to enforce organization-wide library configurations or security settings.
  • The pipelines described here automate deploying these artifacts to the Databricks workspace.

 

Purpose of this Pipeline:

  • The purpose of this pipeline is to pick up the Databricks artifacts from the repository, upload them to a DBFS location in the Databricks workspace, and upload the global init scripts using REST APIs.
  • The CI pipeline builds the wheel (.whl) file using setup.py and publishes the required files (.whl file, global init scripts, .jar files, etc.) as a build artifact.
  • The CD pipeline uploads all the artifacts to the DBFS location and uploads the global init scripts using the REST APIs.

 

Pre-Requisites:

  1. Developers need to make sure that all the artifacts to be uploaded to the Databricks workspace are present in the repository (main branch). The location of the artifacts in the repository should be fixed (let us consider '/Artifacts' as the location, matching the CI pipeline's working directory). The CI process will create the build artifact from this folder.


  2. The Databricks PAT token and the target Databricks workspace URL should be present in the key vault (the CD pipeline reads them as the secrets databricks-pat and databricks-url); a sketch for adding them is shown below.
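
If these secrets are not already in the key vault, they can be added with the Az PowerShell module. The following is only a minimal sketch: it assumes you are signed in with Connect-AzAccount, the vault name and secret values are placeholders, and the secret names match the ones the CD pipeline's Key Vault task reads (databricks-pat, databricks-url).

# Store the Databricks PAT and workspace URL as key vault secrets
# (placeholders for the vault name, PAT value, and workspace URL)
$pat = ConvertTo-SecureString '<databricks-pat-token>' -AsPlainText -Force
$url = ConvertTo-SecureString 'https://<workspace-instance>.azuredatabricks.net' -AsPlainText -Force

Set-AzKeyVaultSecret -VaultName '<keyvault-name>' -Name 'databricks-pat' -SecretValue $pat
Set-AzKeyVaultSecret -VaultName '<keyvault-name>' -Name 'databricks-url' -SecretValue $url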

   

Continuous Integration (CI) pipeline:

 

  • The CI pipeline builds a wheel (.whl) file using a setup.py file and creates a build artifact from all files in the Artifacts/ folder, such as configuration files (.json), packages (.jar and .whl), and shell scripts (.sh).
  • It has the following tasks:
  1. Building the wheel file using the setup.py file (subtasks below; a local check of this build is sketched after this list):
    • Using the specified Python version
    • Upgrading pip
    • Installing the wheel package
    • Building the wheel file with the command "python setup.py sdist bdist_wheel"
    • The setup.py file can be replaced with any Python build script that produces .whl files

 

  2. Copying all the artifacts (JAR, JSON config, .whl file, shell script) to the artifact staging directory
  3. Publishing the artifacts from the staging directory
  4. Triggering the CD pipeline after a successful run
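
Before committing, the wheel build can be reproduced locally with the same commands the pipeline runs. This is a minimal sketch that assumes Python is on PATH and setup.py sits in the repository's Artifacts folder.

# Reproduce the CI wheel build locally (run from the repository root)
Set-Location ./Artifacts
python -m pip install --upgrade pip
python -m pip install wheel
python setup.py sdist bdist_wheel

# setuptools writes the built wheel (and sdist) to the dist/ folder
Get-ChildItem ./dist -Filter *.whl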

 

  • The YAML code for this pipeline, with all the steps included, is shown below.

CI Pipeline YAML Code:

 

 

name: Release-$(rev:r)

trigger: none

variables:
  workingDirectory: '$(System.DefaultWorkingDirectory)/Artifacts'
  pythonVersion: 3.7

stages:
- stage: Build
  displayName: Build stage

  jobs:
  - job: Build
    displayName: Build
    steps:
    - task: UsePythonVersion@0
      displayName: 'Use Python version'
      inputs:
        versionSpec: $(pythonVersion)
    - task: CmdLine@2
      displayName: 'Upgrade Pip'
      inputs:
        script: 'python -m pip install --upgrade pip'
    - task: CmdLine@2
      displayName: 'Install wheel'
      inputs:
        script: 'python -m pip install wheel'
    - task: CmdLine@2
      displayName: 'Build wheel'
      inputs:
        script: 'python setup.py sdist bdist_wheel'
        workingDirectory: '$(workingDirectory)'
    - task: CopyFiles@2
      displayName: 'Copy Files to: $(build.artifactstagingdirectory)'
      inputs:
        SourceFolder: '$(workingDirectory)'
        TargetFolder: '$(build.artifactstagingdirectory)'
    - task: PublishBuildArtifacts@1
      displayName: 'Publish Artifact: DatabricksArtifacts'
      inputs:
        ArtifactName: DatabricksArtifacts

 

 

Continuous Deployment (CD) pipeline:

 

The CD pipeline uploads all the artifacts (JAR, JSON config, .whl file) built by the CI pipeline to the Databricks File System (DBFS). It also creates or updates global init scripts in the Databricks workspace from any .sh files in the build artifact.

 

It has the following tasks:

  1. Key Vault task to fetch the Databricks secrets (PAT token, workspace URL)
  2. Upload Databricks Artifacts

  • Script Name : DatabricksArtifactsUpload.ps1

Arguments:

  • Databricks PAT token to access the Databricks workspace
  • Databricks workspace URL
  • Pipeline working directory path where the files (JAR, JSON config, .whl file) are present

 

  3. Upload Global Init Scripts

  • Script Name : DatabricksGlobalInitScriptUpload.ps1

Arguments:

  • Databricks PAT token to access the Databricks workspace
  • Databricks workspace URL
  • Pipeline working directory path where the global init scripts are present

 

  • The YAML code for this CD pipeline, along with the scripts for uploading the artifacts, is included below with all the steps.

CD Pipeline YAML Code:

 

 

name: Release-$(rev:r)

trigger: none

resources:
  pipelines:
    - pipeline: DatabricksArtifacts
      source: DatabricksArtifacts-CI
      trigger:
        branches:
          - main

variables:
  - group: Sample-Variable-Group
  - name: azureSubscription
    value: 'Sample-Azure-Service-Connection'
  - name: workingDirectory_utilities
    value: '$(Pipeline.Workspace)/DatabricksArtifacts/DatabricksArtifacts'

stages:
  - stage: Release
    displayName: Release stage

    jobs:
      - deployment: DeployDatabricksArtifacts
        displayName: Deploy Databricks Artifacts
        environment: 'Sample-Environment' # deployment jobs target an environment; replace with your own
        strategy:
          runOnce:
            deploy:
              steps:
                - checkout: self
                - task: AzureKeyVault@1
                  inputs:
                    azureSubscription: "$(azureSubscription)"
                    KeyVaultName: $(keyvault_name)
                    SecretsFilter: "databricks-pat,databricks-url"
                    RunAsPreJob: true
                - task: AzurePowerShell@5
                  displayName: Upload Databricks Artifacts
                  inputs:
                    azureSubscription: '$(azureSubscription)'
                    ScriptType: 'FilePath'
                    ScriptPath: '$(System.DefaultWorkingDirectory)/Pipelines/Scripts/DatabricksArtifactsUpload.ps1'
                    ScriptArguments: '-databricksPat $(databricks-pat) -databricksUrl $(databricks-url) -workingDirectory $(workingDirectory_utilities)'
                    azurePowerShellVersion: 'LatestVersion'
                - task: AzurePowerShell@5
                  displayName: Upload Global Init Scripts
                  inputs:
                    azureSubscription: '$(azureSubscription)'
                    ScriptType: 'FilePath'
                    ScriptPath: '$(System.DefaultWorkingDirectory)/Pipelines/Scripts/DatabricksGlobalInitScriptUpload.ps1'
                    ScriptArguments: '-databricksPat $(databricks-pat) -databricksUrl $(databricks-url) -workingDirectory $(workingDirectory_utilities)'
                    azurePowerShellVersion: 'LatestVersion'

 

 

DatabricksArtifactsUpload.ps1

 

 

param(
    [String] [Parameter (Mandatory = $true)] $databricksPat,
    [String] [Parameter (Mandatory = $true)] $databricksUrl,
    [String] [Parameter (Mandatory = $true)] $workingDirectory
)

Function UploadFile {
    param (
        [String] [Parameter (Mandatory = $true)] $sourceFilePath,
        [String] [Parameter (Mandatory = $true)] $fileName,
        [String] [Parameter (Mandatory = $true)] $targetFilePath
    )

    #Grab bytes of source file
    $BinaryContents = [System.IO.File]::ReadAllBytes($sourceFilePath);
    $enc = [System.Text.Encoding]::GetEncoding("ISO-8859-1");
    $fileEnc = $enc.GetString($BinaryContents);

    #Create body of request
    $LF = "`r`n";
    $boundary = [System.Guid]::NewGuid().ToString();
    $bodyLines = (
        "--$boundary",
        "Content-Disposition: form-data; name=`"path`"$LF",
        $targetFilePath,
        "--$boundary",
        "Content-Disposition: form-data; name=`"contents`";filename=`"$fileName`"",
        "Content-Type: application/octet-stream$LF",
        $fileEnc,
        "--$boundary",
        "Content-Disposition: form-data; name=`"overwrite`"$LF",
        "true",
        "--$boundary--$LF"
    ) -join $LF;

    #Create Request
    $params = @{
        Uri         = "$databricksUrl/api/2.0/dbfs/put"
        Body        = $bodyLines
        Method      = 'Post'
        Headers     = @{
            Authorization = "Bearer $databricksPat"
        }
        ContentType = "multipart/form-data; boundary=$boundary"
    }

    Invoke-RestMethod @params;
}

Function GetTargetFilePath {
    param (
        [System.IO.FileInfo] [Parameter (Mandatory = $true)] $sourceFile
    )

    switch ($sourceFile.extension)
    {
        ".json" {return "/FileStore/config/$($sourceFile.Name)"}
        ".jar" {return "/FileStore/jar/$($sourceFile.Name)"}
        ".whl" {return "/FileStore/whl/$($sourceFile.Name)"}
    }
}

#Loop through all files and upload to dbfs
$filenames = get-childitem $workingDirectory -recurse;
$filenames | ForEach-Object {
    if ( $_.extension -eq ".json" -OR $_.extension -eq ".whl" -OR $_.extension -eq ".jar") {
        $targetFilePath = GetTargetFilePath -sourceFile $_;
        Write-Host "Uploading $($_.FullName) to dbfs at $targetFilePath.";
        UploadFile -sourceFilePath $_.FullName -fileName $_.Name -targetFilePath $targetFilePath;
    }
}
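
After a deployment, the uploads can be spot-checked with the DBFS list API. The snippet below is a minimal sketch; the PAT token, workspace URL, and DBFS folder are placeholders.

# List the uploaded wheel files in DBFS (placeholders for the PAT token and workspace URL)
$databricksPat = '<databricks-pat-token>'
$databricksUrl = 'https://<workspace-instance>.azuredatabricks.net'

$params = @{
    Uri     = "$databricksUrl/api/2.0/dbfs/list?path=/FileStore/whl"
    Method  = 'Get'
    Headers = @{ Authorization = "Bearer $databricksPat" }
}
(Invoke-RestMethod @params).files | Select-Object path, file_size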

 

 

DatabricksGlobalInitScriptUpload.ps1

 

 

param(
    [String] [Parameter (Mandatory = $true)] $databricksPat,
    [String] [Parameter (Mandatory = $true)] $databricksUrl,
    [String] [Parameter (Mandatory = $true)] $workingDirectory
)

Function UploadFile {
    param (
        [String] [Parameter (Mandatory = $true)] $uri,
        [String] [Parameter (Mandatory = $true)] $restMethod,
        [String] [Parameter (Mandatory = $true)] $sourceFilePath,
        [String] [Parameter (Mandatory = $true)] $fileName
    )

    #Grab bytes of source file
    $base64string = [Convert]::ToBase64String([IO.File]::ReadAllBytes($sourceFilePath))

    #Create body of request
    $body = @{
        name     = $fileName
        script   = $base64string
        position = 1
        enabled  = $false   # upload disabled; enable the script after review
    }

    #Create Request
    $params = @{
        Uri         = $uri
        Body        = $body | ConvertTo-Json
        Method      = $restMethod
        Headers     = @{
            Authorization = "Bearer $databricksPat"
        }
        ContentType = "application/json"
    }

    Invoke-RestMethod @params;
}

Function GetAllScripts {
    #Create Request
    $params = @{
        Uri         = "$databricksUrl/api/2.0/global-init-scripts"
        Method      = "GET"
        Headers     = @{
            Authorization = "Bearer $databricksPat"
        }
        ContentType = "application/json"
    }

    return Invoke-RestMethod @params;
}

#Loop through all files and upload to databricks global init
$scripts = GetAllScripts
$filenames = get-childitem $workingDirectory -recurse;
$filenames | ForEach-Object {
    if ( $_.extension -eq ".sh") {
        #Check if a script with the same name already exists in Databricks
        $fileName = $_.Name
        $scriptId = ($scripts.scripts | Where-Object { $_.name -eq $fileName }).script_id
        if (!$scriptId){
            #Create Global init script
            Write-Host "Uploading $($_.FullName) as a global init script with name $($_.Name) to databricks";
            UploadFile -uri "$databricksUrl/api/2.0/global-init-scripts" -restMethod "POST" -sourceFilePath $_.FullName -fileName $_.Name;
        } else{
            #Update Global init script
            Write-Host "Updating global init script with name $($_.Name) to databricks";
            UploadFile -uri "$databricksUrl/api/2.0/global-init-scripts/$scriptId" -restMethod "PATCH" -sourceFilePath $_.FullName -fileName $_.Name;
        }
    }
}
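
The script above creates global init scripts in a disabled state so they can be reviewed before they affect clusters. Once verified, a script can be enabled through the same update endpoint. This is a minimal sketch, assuming the PATCH call accepts a partial body with just the enabled flag; the PAT token, workspace URL, and script id are placeholders.

# Enable a previously uploaded global init script
$databricksPat = '<databricks-pat-token>'
$databricksUrl = 'https://<workspace-instance>.azuredatabricks.net'
$scriptId      = '<script-id-from-the-global-init-scripts-api>'

$params = @{
    Uri         = "$databricksUrl/api/2.0/global-init-scripts/$scriptId"
    Body        = (@{ enabled = $true } | ConvertTo-Json)
    Method      = 'Patch'
    Headers     = @{ Authorization = "Bearer $databricksPat" }
    ContentType = 'application/json'
}
Invoke-RestMethod @params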

 

 

End Result of Successful Pipeline Runs:

[screenshot]

Global Init Script Upload:

[screenshot]

Conclusion:

Using this CI/CD approach, we successfully uploaded the artifacts to the Databricks File System and the global init scripts to the Databricks workspace.

 

References:

  1. https://docs.databricks.com/dev-tools/api/latest/dbfs.html#create
  2. https://docs.databricks.com/dev-tools/api/latest/global-init-scripts.html#operation/create-script

 

Posted at https://sl.advdat.com/3DaMSzE