Monday, September 20, 2021

Operationalize and Provide SLA for Data Pipelines

Azure Data Factory is loved and trusted by corporations around the world. As Azure's native cloud ETL service for scale-out serverless data integration and data transformation, it is widely used to implement data pipelines that prepare, process, and load data into an enterprise data warehouse or data lake.


Typically, data pipelines run on autopilot, scheduled on a predefined timetable using a Schedule Trigger or Tumbling Window Trigger. Sometimes these pipelines run mission-critical workloads, preparing data for business reports or for data analytics and machine learning projects.


There are two major challenges in delivering service level agreements (SLAs) for these data pipelines:

  1. The compute environment for one activity, for instance SQL for a Stored Procedure activity, may throttle, slowing down the whole data pipeline and causing it to miss its SLA.
  2. Pipeline developers aren't always actively monitoring the factory or proactively seeking out long-running pipelines that will miss their SLAs.


To address these issues, today we are introducing the Elapsed Time Pipeline Run metric. Combined with Data Factory Alerts, it empowers data pipeline developers to better deliver SLAs to their customers: you tell us how long a pipeline should run, and we will notify you, proactively, when the pipeline runs longer than expected.


For each pipeline you want to create alerts on, go to the pipeline settings during the authoring phase (by clicking on the blank space in the pipeline canvas).

[Image: Set Elapsed Time Metric]

Under the Settings tab, check Elapsed time metric and specify the expected pipeline run duration. We strongly recommend that you set this to your business SLA: the amount of time the pipeline can take and still meet your business needs. Once the pipeline duration exceeds this setting, Data Factory will log an Elapsed Time Pipeline Run metric (metric id: PipelineElapsedTimeRuns) in Azure Monitor. In other words, you will get notified about long-running pipelines proactively, before they eventually finish.
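To make the timing behavior concrete, here is a minimal Python sketch of the comparison described above. It is only an illustration of the idea, not Data Factory's actual implementation; the function name `exceeds_expected_duration` and the sample times are assumptions made up for this example.

```python
from datetime import datetime, timedelta, timezone


def exceeds_expected_duration(
    run_start: datetime,
    now: datetime,
    expected_duration: timedelta,
) -> bool:
    """Return True once a run has lasted longer than its expected duration.

    The check uses the current time, not the finish time, so it can fire
    while the pipeline is still running.
    """
    return now - run_start > expected_duration


# Hypothetical run that started at 08:00 UTC with a 30-minute business SLA.
start = datetime(2021, 9, 20, 8, 0, tzinfo=timezone.utc)
sla = timedelta(minutes=30)

# 20 minutes in: still within the expected duration.
print(exceeds_expected_duration(start, start + timedelta(minutes=20), sla))

# 45 minutes in: over the expected duration, even though the run has not finished.
print(exceeds_expected_duration(start, start + timedelta(minutes=45), sla))
```

The key point the sketch tries to capture is that the comparison happens against the clock while the run is in progress, which is what allows engineers to intervene before the SLA is irrecoverably missed.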


Next, follow the steps for Data Factory Alerts to set up alerts on the metric. Engineers will then get notified by email or SMS so they can intervene and take steps to meet the SLAs.


We understand that some pipelines will naturally take more time to finish than others (like those with more steps, or those moving more data), and that there is no one-size-fits-all definition of a long-running pipeline. We kindly ask you to define the threshold for every pipeline that you need an SLA on. When logging the metric for a particular pipeline, we will compare its run duration against its own user-defined expected run duration.


Note: this is an opt-in feature. No metric will ever be logged for a pipeline if no expected run duration is specified for that pipeline.
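The per-pipeline threshold and opt-in behavior can be sketched in a few lines of Python. This is a hedged illustration only: the pipeline names, the `expected_durations` mapping, and the `pipelines_over_threshold` helper are all hypothetical and do not correspond to any Data Factory API.

```python
from datetime import timedelta
from typing import Dict, List, Optional

# Hypothetical per-pipeline settings. None means no expected run duration
# was specified, i.e. the pipeline has not opted in.
expected_durations: Dict[str, Optional[timedelta]] = {
    "LoadSalesWarehouse": timedelta(hours=2),
    "RefreshDailyReport": timedelta(minutes=15),
    "AdHocCleanup": None,
}


def pipelines_over_threshold(
    elapsed: Dict[str, timedelta],
    thresholds: Dict[str, Optional[timedelta]],
) -> List[str]:
    """Return the names of pipelines whose elapsed time exceeds their own threshold."""
    over = []
    for name, run_time in elapsed.items():
        threshold = thresholds.get(name)
        if threshold is None:
            # Opt-in: without a user-defined threshold, nothing is ever flagged.
            continue
        if run_time > threshold:
            over.append(name)
    return sorted(over)


# Hypothetical elapsed times for runs currently in progress.
current_elapsed = {
    "LoadSalesWarehouse": timedelta(hours=1, minutes=30),
    "RefreshDailyReport": timedelta(minutes=40),
    "AdHocCleanup": timedelta(hours=5),
}

print(pipelines_over_threshold(current_elapsed, expected_durations))
```

In this sketch only `RefreshDailyReport` is flagged: `LoadSalesWarehouse` is still within its two-hour threshold, and `AdHocCleanup` is ignored despite its long run time because it never opted in.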

