Advanced Data Solutions : Azure Synapse Analytics October Update

Welcome to the Azure Synapse October 2021 update! We’ve got some key improvements for you this month.

Table of Contents

General

Manage your cost with Azure Synapse pre-purchase plans
Move your Azure Synapse workspace across Azure regions

Apache Spark for Synapse

Spark performance optimizations

Security

All Synapse RBAC roles are now generally available for use in production
Leverage User-Assigned Managed Identities for Double Encryption
Synapse Administrators now have elevated access to dedicated SQL pools

Governance

Synapse workspaces can now automatically push lineage data to Azure Purview

Integrate

Use Stringify in data flows to easily transform complex data types to strings
Control Spark session time-to-live (TTL) in data flows

CI/CD & Git

Deploy Synapse workspaces using GitHub Actions
More control creating Git branches in Synapse Studio

Developer Experience

Enhanced Markdown editing in Synapse notebooks preview
Pandas dataframes automatically render as nicely formatted HTML tables
Use IPython widgets in Synapse Notebooks
Mssparkutils runtime context now available for Python and Scala

General updates

Manage your cost with Azure Synapse pre-purchase plans

A pre-purchase plan lets you pre-pay for a bundle of Azure Synapse commit units (SCUs), which are applied over the following one-year term to usage of all generally available Azure Synapse services except data storage. You can use the SCUs at any time and apply them to any Synapse workload during the one-year term. This gives you the freedom to get the insights you need with Azure Synapse and control your cost up-front.

Another benefit of SCUs is that you can buy once and mix and match your workloads in Azure Synapse with data integration and ETL pipelines using code-free data flows, big data and machine learning with Apache Spark, data lake exploration using serverless SQL, and data warehousing with our industry-leading SQL engine. Customers that use this pricing option can save up to 28 percent compared to pay-as-you-go pricing.

Azure Synapse pre-purchase plans are generally available across all global regions except China.

For more about these pre-purchase plans, and to understand how the different analytic engines in Synapse use SCUs, read the blog post Save money and accelerate end-to-end analytics with NEW Azure Synapse Analytics pre-purchase plans and the document Optimize Azure Synapse Analytics costs with a Pre-Purchase Plan.

Move your Azure Synapse workspace across Azure regions

Customers can now move their Synapse workspace to another Azure region. Here are two scenarios where moving a workspace is useful for customers. First, organizations may need to place data and applications that process that data in specific regions to meet compliance requirements. In this case, customers can move their Synapse workspace to the same region where their data resides to collocate applications and data. Second, customers may find that a new region launched by Azure is a more optimal location for their resources. In such cases, customers can move their Synapse workspace to the newly available Azure region.

You can learn more about how to move workspaces by reading these instructions: Move an Azure Synapse Analytics workspace from one region to another.

Apache Spark for Synapse

Spark performance optimizations

We are constantly making performance improvements to Apache Spark in Synapse. Recently, we’ve made several enhancements that further improved the performance of Spark 3.1 in Azure Synapse by 13% on standard benchmarks, making it up to 202% faster than open-source Apache Spark.

Here are three key improvements we made:

Limit pushdown – For top-k queries, we eliminated the compute cycles spent processing rows that are not part of the top-k within a partition. Note that Statistics must be enabled to trigger this optimization.
Optimized sorts – Sorting can now take more advantage of data that has previously been partitioned. This is very useful for queries that require window operations, such as getting the 100 highest-paid employees in each department or the 100 best-selling products in different categories. This optimization happens automatically and does not need to be enabled.
Bloom filter enhancements – In this release, we have extended support for Bloom filters to sort merge joins, in addition to the broadcast hash joins we talked about previously. For example, given a fact table ‘Sales’ and a dimension table ‘Items’, applying a Bloom filter will drastically improve performance when we want to get total sales for selected items. Note that Statistics must be enabled to trigger this optimization.
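
To make the Bloom filter idea concrete, here is a minimal pure-Python sketch of pre-filtering a fact table before a join. It is only an illustration of the general technique, not how Synapse implements it: the BloomFilter class, the total_sales_for_items function, the filter sizes, and the sample rows are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def total_sales_for_items(sales, selected_item_ids):
    """Sum sales for the selected items, pre-filtering rows with a Bloom filter."""
    bloom = BloomFilter()
    for item_id in selected_item_ids:
        bloom.add(item_id)

    selected = set(selected_item_ids)
    total = 0.0
    rows_reaching_join = 0
    for item_id, amount in sales:
        if not bloom.might_contain(item_id):
            continue  # cheap rejection before the (more expensive) join
        rows_reaching_join += 1
        if item_id in selected:  # the exact join condition removes false positives
            total += amount
    return total, rows_reaching_join


# Hypothetical data: 1000 (item_id, amount) rows of 'Sales', two selected 'Items'.
sales_rows = [(i % 100, 1.0) for i in range(1000)]
total, survivors = total_sales_for_items(sales_rows, [7, 42])
print(f"total={total}, rows reaching the join={survivors} of {len(sales_rows)}")
```

The point of the sketch is that most fact rows are discarded by a cheap membership test, so far fewer rows reach the join itself.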

To turn on Statistics, which in turn enables the limit pushdown optimization and the Bloom filter enhancements, see the document Analyze Table.
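
As a rough sketch of what collecting statistics can look like, the helper below builds standard Spark SQL ANALYZE TABLE statements. The helper name and the table and column names are hypothetical, and which statistics a given optimization actually needs is not stated here, so treat the linked Analyze Table document as the authority.

```python
def analyze_table_statements(table, columns=None):
    """Build Spark SQL statements that collect table and column statistics."""
    statements = [f"ANALYZE TABLE {table} COMPUTE STATISTICS"]
    if columns:
        column_list = ", ".join(columns)
        statements.append(
            f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {column_list}"
        )
    return statements


# Hypothetical table and column names, in the spirit of the 'Sales' example.
statements = analyze_table_statements("sales", ["item_id", "amount"])
for statement in statements:
    print(statement)
    # In a Synapse notebook you would run each one with: spark.sql(statement)
```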

For more information about these performance enhancements, see Speed up your data workloads with performance updates to Apache Spark 3.1.2 in Azure Synapse.

Security

All Synapse RBAC roles are now generally available for use in production

Synapse supports Role Based Access Control (Synapse RBAC). Using Synapse RBAC, you can control access to your Synapse resources by assigning a Synapse RBAC role to your users. At Synapse GA, we previewed a set of Synapse RBAC roles that provided fine-grained access control. The following preview roles have been generally available since September 15th, 2021.

Synapse Contributor
Synapse Artifact Publisher
Synapse Artifact User
Synapse Compute Operator
Synapse Linked Data Manager
Synapse Credential User
Synapse User

You can assign Synapse RBAC roles to users in Synapse Studio by selecting Manage from the left navigation and then selecting Access control.

Learn more about these roles and how to use them by reading Azure Synapse Analytics RBAC roles are now Generally Available.

Leverage User-Assigned Managed Identities for Double Encryption

Double Encryption is one of the many layers of protection that Synapse offers to secure customer data. Data at rest is always encrypted using Microsoft-managed keys. Customers can add a second layer of encryption using their customer-managed keys stored in Azure Key Vault. Synapse uses a managed identity to access the customer-managed key in Azure Key Vault.

In addition to the system-assigned managed identity, Synapse now supports user-assigned managed identities for encryption. Unlike system-assigned managed identities, user-assigned managed identities are not tied to the lifecycle of a Synapse workspace. A customer can grant Azure Key Vault privileges to a user-assigned managed identity and use it to configure encryption for multiple Synapse workspaces, reducing operational overhead.

You can easily use your user-assigned managed identity with your Synapse workspace once permissions are granted. When creating a Synapse workspace in the Azure portal, first Enable Double Encryption using customer-managed key. Once that has been enabled, you can choose between User-assigned and System-assigned managed identities.

If you pick User-assigned managed identity, then you’ll be prompted to select the identity to use.

For more about using User-assigned managed identities and how to correctly configure your UAMI, see Encryption for Azure Synapse Analytics workspaces.

Synapse Administrators now have elevated access to dedicated SQL pools

Previously, Synapse Administrators could grant serverless SQL pool access to other users, but only an Active Directory Admin (generally the creator of the workspace) could grant dedicated SQL pool access to users. The Synapse Administrator role also did not have data plane access parity across the serverless and dedicated SQL pools. Based on customer feedback, the Synapse Administrator role now has full access to the data in dedicated SQL pools in a Synapse workspace. Synapse administrators can perform all configuration and maintenance activities on dedicated pools, except for dropping the databases. They also have the ability to grant access to other users.

For more about mechanisms to control access to Azure Synapse, see Azure Synapse access control.

Governance

Synapse workspaces can now automatically push lineage data to Azure Purview

Earlier this year we enabled you to link your Synapse workspace to an Azure Purview account – the unified data governance service in Azure that became generally available this past month. This enables developers to use the search bar at the top of Synapse Studio to discover and explore organization data using Purview. In this update, we’ve enhanced this integration. If you’ve linked your Synapse workspace to Azure Purview, your Synapse workspace will now automatically send lineage data to Azure Purview without you having to configure anything else. When monitoring pipeline runs in the Integrate hub, clicking the Lineage status icon for a pipeline run allows you to explore the lineage information.

The image shows how the lineage will appear: (1) the original data comes from a SQL table, (2) a Copy activity then copies the data into (3) a file in Azure Data Lake Storage called Output.csv.

Learn more by reading How to get lineage from Azure Synapse Analytics into Azure Purview.

Integrate

Use Stringify in data flows to easily transform complex data types to strings

Mapping data flows help you perform code-free data transformation in your Synapse pipelines. When you work with complex data types such as structures, arrays, and maps, you may need to transform them into strings. The new Stringify data transformation simplifies this common task.

Here’s an example of an expression that uses the stringify function to transform a JSON array containing a list of authors into a single string.

stringify(mydata = body.properties.authors ? string, format: 'json') ~> authors_str
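
For readers more familiar with Python, the following is a loose analogy rather than the data flow implementation: serializing a nested value into one JSON string with the standard json module. The record contents are hypothetical and only mirror the body.properties.authors path used in the expression.

```python
import json

# Hypothetical record shaped like the path in the expression: body.properties.authors
record = {
    "body": {
        "properties": {
            "authors": [
                {"name": "Ada", "role": "editor"},
                {"name": "Lin", "role": "writer"},
            ]
        }
    }
}

# Serialize the complex value (a list of dicts) into a single JSON string.
authors_str = json.dumps(record["body"]["properties"]["authors"])
print(authors_str)
```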

Learn more by reading Stringify transformation in mapping data flow.

Control Spark session time-to-live (TTL) in data flows

When data flows execute, they spin up a Spark session, and the Spark session takes a few minutes to start. Spark sessions have a timeout, and once it is reached the session automatically terminates. Auto-termination is good because it saves you money. However, with data flows it may be the case that a session ends just before another is needed. In these cases, the data flow must start another session, which itself takes a few minutes to start. If this happens frequently, your data flows can spend a lot of time just restarting Spark sessions, causing them to take more time to run.

We have introduced the ability to control the Spark session TTL for data flows inside the Azure Integration Runtime. You can now configure how long sessions live, so that you can avoid unnecessarily starting up Spark sessions, which can improve the overall runtime of your data flows.
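
The effect of the TTL can be illustrated with a small, simplified model. It assumes a single reusable session whose idle timer restarts when a run ends, and it ignores startup time; the function name and the schedule are hypothetical, and the real Integration Runtime behavior may differ in its details.

```python
def count_cold_starts(run_windows, ttl_minutes):
    """Count how many runs must start a new Spark session.

    run_windows: list of (start_minute, end_minute) tuples, sorted by start.
    ttl_minutes: how long an idle session stays alive after a run ends.
    """
    cold_starts = 0
    session_expires_at = None  # no session exists yet
    for start, end in run_windows:
        if session_expires_at is None or start > session_expires_at:
            cold_starts += 1  # the previous session is gone, pay the startup cost
        # The session idles from the end of this run until the TTL elapses.
        expiry = end + ttl_minutes
        if session_expires_at is None or expiry > session_expires_at:
            session_expires_at = expiry
    return cold_starts


# Hypothetical schedule: four data flow runs with 12-minute gaps between them.
runs = [(0, 5), (17, 22), (34, 39), (51, 56)]
short_ttl = count_cold_starts(runs, ttl_minutes=10)
long_ttl = count_cold_starts(runs, ttl_minutes=15)
print(f"TTL 10 min: {short_ttl} cold starts, TTL 15 min: {long_ttl} cold start(s)")
```

In this model a TTL slightly longer than the gap between runs removes every restart after the first, at the cost of keeping the session alive while it is idle.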

Learn more by reading Optimizing performance of the Azure Integration Runtime.

CI/CD & Git

Deploy Synapse workspaces using GitHub Actions

GitHub Actions help you automate tasks within your software development life cycle. GitHub Actions are event-driven, meaning that you can run a series of commands after a specified event has occurred. For example, every time someone creates a pull request for a repository, you can automatically run a command that executes a software testing script. We introduced a new GitHub Action called Synapse workspace deployment. This new Action allows you to deploy your workspace automatically instead of relying on manual processes. Anyone developing CI/CD pipelines that involve Synapse will find that this greatly simplifies their life.

Learn more by reading Continuous integration and delivery for an Azure Synapse Analytics workspace.

More control creating Git branches in Synapse Studio

When integrated with Git, Azure Synapse has a notion of a branch that is used for collaboration. Previously, you could make a new branch in Synapse Studio, but that branch would always be based on the collaboration branch. Many organizations have specific policies and practices for how branches are used, and always creating a new branch from the collaboration branch doesn’t work for them. In this release, we’ve given you full control over this branching strategy. Now, when you create a branch in Synapse Studio, the branch can be based on ANY branch in your Git repo. This lets organizations comply with their branching policies and makes it more convenient for developers to track changes and lineage in their branches.

Learn more by reading Source control in Synapse Studio.

Developer Experience

Enhanced Markdown editing in Synapse notebooks preview

Previously, the notebooks preview (‘Preview Features’) in Synapse Studio offered only a very primitive experience for editing Markdown cells. With this update, we’ve added a Markdown toolbar to the notebooks preview that makes formatting text quick and easy.

Learn more by reading Create, develop, and maintain Synapse notebooks in Azure Synapse Analytics.

Pandas dataframes automatically render as nicely formatted HTML tables

Pandas is a popular library for data analysis that is available in Synapse notebooks. You’ve always been able to display Pandas DataFrames in Synapse notebooks, but they were rendered as plain text, and these rendered tables could be very difficult to read, especially when they had a lot of columns.

This month’s update renders Pandas DataFrames as well-formatted HTML tables. Just display the DataFrame as you always have and you automatically get the improved table rendering.

import pandas as pd
import numpy as np

# Two rows (actual labels) by six columns (each model crossed with each predicted label).
df = pd.DataFrame([[38.0, 2.0, 18.0, 22.0, 21, np.nan], [19, 439, 6, 452, 226, 232]],
                  index=pd.Index(['Tumour (Positive)', 'Non-Tumour (Negative)'], name='Actual Label:'),
                  columns=pd.MultiIndex.from_product([['Decision Tree', 'Regression', 'Random'], ['Tumour', 'Non-Tumour']], names=['Model:', 'Predicted:']))
# The cell ends with the DataFrame itself, which is what the notebook renders.
df

Learn more about Pandas and visualization by reading Visualizations - Azure Synapse Analytics | Microsoft Docs.

Use IPython widgets in Synapse Notebooks

IPython widgets are used extensively by developers working in Jupyter notebooks. These widgets are eventful Python objects that appear in the browser as interactive UI controls such as sliders, text boxes, and radio buttons.

Until now, Synapse Notebooks could not make use of the vast ecosystem of IPython widgets. With this update, you can use IPython widgets in your Synapse Notebooks just as you can in Jupyter Notebooks. Currently these widgets are supported only for Python in Synapse notebooks. Eventually we will add support for other languages in Synapse, such as Scala and C#.

Learn more about how to use IPython widgets in Synapse by reading How to use Synapse notebooks - Azure Synapse Analytics | Microsoft Docs.

Mssparkutils runtime context now available for Python and Scala

The mssparkutils runtime utilities expose three runtime properties. Use the mssparkutils runtime context to get the properties listed below:

Notebookname - The name of the current notebook. Always returns a value, in both interactive mode and pipeline mode.
Pipelinejobid - The pipeline run ID. Returns a value in pipeline mode and an empty string in interactive mode.
Activityrunid - The notebook activity run ID. Returns a value in pipeline mode and an empty string in interactive mode.

Currently runtime context supports both Python and Scala.
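
One way these properties might be used is to tell interactive runs apart from pipeline runs. The sketch below works on a plain dictionary so it can run anywhere; the describe_run helper and the sample values are hypothetical, and the assumption that the context behaves like a mapping with these property names as keys should be checked against the linked documentation.

```python
def describe_run(context):
    """Summarize where a notebook is running, given a runtime-context mapping."""
    # Normalize key casing, since the exact casing of the keys is an assumption.
    ctx = {str(key).lower(): value for key, value in context.items()}
    pipeline_run_id = ctx.get("pipelinejobid", "")
    # An empty pipeline run ID means the notebook is running interactively.
    mode = "pipeline" if pipeline_run_id else "interactive"
    return {
        "notebook": ctx.get("notebookname", ""),
        "mode": mode,
        "pipeline_run_id": pipeline_run_id,
        "activity_run_id": ctx.get("activityrunid", ""),
    }


# Hypothetical contexts; in Synapse the mapping would come from mssparkutils.
interactive = describe_run(
    {"Notebookname": "demo", "Pipelinejobid": "", "Activityrunid": ""}
)
pipeline = describe_run(
    {"Notebookname": "demo", "Pipelinejobid": "run-1", "Activityrunid": "act-1"}
)
print(interactive["mode"], pipeline["mode"])
```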

Learn more about runtime context by reading Introduction to Microsoft Spark Utilities | Microsoft Docs.

Posted at https://sl.advdat.com/3EdDknD

Friday, October 22, 2021