Challenges with data are common across industries: there’s more and more data; it needs to be analyzed faster than ever; and it’s coming in an increasing variety of formats. To respond to the challenges, organizations are looking for cost-efficient and reasonably fast ways to modernize their data platform and make it more flexible and future ready. In many cases the preferred approach is the one allowing you to achieve the goals while using data engine you’re familiar with and the language, frameworks, and tools of your choice.
Data virtualization capabilities, now in preview in Azure SQL Managed Instance, enable you to execute Transact-SQL (T-SQL) queries against data from files stored in Azure Data Lake Storage Gen2 or Azure Blob Storage and combine it with relational data stored locally in the managed instance using logical joins. This way you can transparently access external data while keeping it in its original format and location. There is no data duplication or need to run and maintain ETL processes, which means that you can extract and deliver insights faster. Currently supported file formats are Parquet, CSV, and JSON.
When to use data virtualization
Typical use cases for data virtualization include:
- Providing always up-to-date relational abstraction on top of your raw or disparate data without relocating or transforming it. This way any application capable of running T-SQL queries can consume the data: from BI solutions like Power BI, to line of business applications, to client tools like SQL Server Management Studio or Azure Data Studios. This is an easy and elegant way to expand the list of data sources for your operational reporting solutions.
- Reducing managed instance storage consumption and total cost of ownership (TCO) by archiving cold data to Azure Data Lake Storage, keeping it still within the reach of interactive queries and joins.
- Exploratory data analysis of data sets stored in the most common file formats. This approach is typically used by data scientists and data analysts to collect insights about the data set, from basic ones like number and structure of records, to extracting important variables, detecting outliers and anomalies, and testing underlying assumptions.
There are also scenarios that don’t fall into data virtualization category but can benefit from ability to query files using T-SQL. For example, inserting results of an external query to a local table can simplify existing ETL pipelines or even remove the need for dedicated data integration tools.
Getting started
If you are familiar with PolyBase feature of SQL Server, you may have already recognized the scenarios and underlying concepts. Data virtualization capabilities of Azure SQL Managed Instance use the same syntax as PolyBase and enrich it further with new options. PolyBase queries running on your SQL Server instance and targeting files stored in Azure Data Lake Storage or Blob Storage will continue working on your managed instance with minimal intervention to specify location prefix corresponding to the type of external source and endpoint, like abs:// instead of the generic https:// location prefix.
To enable data virtualization capabilities on your managed instance, run the following commands:
exec sp_configure 'polybase_enabled', 1;
go
reconfigure;
go
For simplicity, we are going to use publicly available Bing COVID-19 dataset that allows anonymous access.
First, create external data source encapsulating data related to the file location:
CREATE EXTERNAL DATA SOURCE DemoPublicExternalDataSource
WITH (
LOCATION = 'abs://public@pandemicdatalake.blob.core.windows.net/curated/covid-19/bing_covid-19_data/latest'
)
You’re now ready to run the first query:
--Number of confirmed Covid-19 cases per country/territory during 2020:
SELECT countries_and_territories, sum(cases) FROM
OPENROWSET(
BULK 'abs://public@pandemicdatalake.blob.core.windows.net/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',
FORMAT='PARQUET'
) AS [CovidCaseExplorer]
WHERE year = '2020'
group by countries_and_territories
order by sum(cases) desc
Find more sample queries in the feature documentation referenced at the end of the article.
Summary
Data virtualization capability, now in preview in Azure SQL Managed Instance, introduces a data abstraction concept enabling you to transparently query data stored in files in Azure Data Lake Storage and Azure Blob Storage. It enables any application capable of running T-SQL queries to consume that data by simply connecting to managed instance. It can help you optimize the TCO thanks to lower storage costs compared to storing data in a database. Finally, it improves time-to-insights by offering instant access to up-to-date data without the need to transform or relocate it.
Next steps
Explore the full set of data virtualization capabilities in Azure SQL Managed Instance.
Posted at https://sl.advdat.com/3pPON86https://sl.advdat.com/3pPON86