At-scale data processing systems typically store a single table as multiple files. The Azure Purview data catalog represents this concept with resource sets: a resource set is a single object in the catalog that represents many assets in storage.
For example, suppose your Spark cluster has persisted a DataFrame into an Azure Data Lake Storage (ADLS) Gen2 data source. In Spark, the table looks like a single logical resource, but on disk there are likely thousands of Parquet files, each of which represents a partition of the DataFrame's total contents.
IoT data and web log data pose the same challenge. Imagine you have a sensor that outputs log files several times per second. It won't take long until you have hundreds of thousands of log files from that single sensor. In Azure Purview, resource sets let these partitions be handled as a single data asset, which keeps the data easy to consume and prevents oversaturation of the data catalog.
How Azure Purview detects resource sets
Azure Purview supports resource sets in Azure Blob Storage, ADLS Gen1, ADLS Gen2, Azure Files, and Amazon S3.
Azure Purview automatically detects resource sets when scanning. This feature looks at all the data that's ingested via scanning and compares it to a set of defined patterns.
For example, suppose you scan a data source whose URL is https://myaccount.blob.core.windows.net/mycontainer/machinesets/23/foo.parquet. Azure Purview looks at the path segments and determines if they match any built-in patterns. It has built-in patterns for GUIDs, numbers, date formats, localization codes (for example, en-us), and so on. In this case, the number pattern matches 23. Azure Purview assumes this file is part of a resource set named https://myaccount.blob.core.windows.net/mycontainer/machinesets/{N}/foo.parquet.
Or, for a URL such as https://myaccount.blob.core.windows.net/mycontainer/weblogs/en_au/23.json, Azure Purview matches both the localization pattern and the number pattern, producing a resource set named https://myaccount.blob.core.windows.net/mycontainer/weblogs/{LOC}/{N}.json.
Using this strategy, Azure Purview would map the following resources to the same resource set, https://myaccount.blob.core.windows.net/mycontainer/weblogs/{LOC}/{N}.json:
- https://myaccount.blob.core.windows.net/mycontainer/weblogs/cy_gb/1004.json
- https://myaccount.blob.core.windows.net/mycontainer/weblogs/cy_gb/234.json
- https://myaccount.blob.core.windows.net/mycontainer/weblogs/de_Ch/23434.json
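The grouping described above can be approximated in a few lines of Python. This is only a minimal sketch of the idea, not Azure Purview's implementation: the regular expressions, the placeholder table, and the function names (generalize_token, generalize_segment, resource_set_name) are all assumptions made for illustration, and the real built-in pattern set is larger.

```python
import re
from urllib.parse import urlsplit

# Hypothetical stand-ins for a few of the built-in patterns (assumption:
# the real patterns are broader and are not published in this form).
PATTERNS = [
    ("{GUID}", re.compile(
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}")),
    ("{N}", re.compile(r"\d+")),
    ("{LOC}", re.compile(r"[A-Za-z]{2}[-_][A-Za-z]{2}")),
]


def generalize_token(token):
    """Return a placeholder if the whole token matches a pattern."""
    for placeholder, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return placeholder
    return token


def generalize_segment(segment):
    """Generalize one path segment, keeping any file extension as-is."""
    stem, dot, ext = segment.rpartition(".")
    if dot and stem:
        return generalize_token(stem) + "." + ext
    return generalize_token(segment)


def resource_set_name(url):
    """Map a file URL to the name of the resource set it would fall into."""
    parts = urlsplit(url)
    segments = [generalize_segment(s) for s in parts.path.split("/")]
    return f"{parts.scheme}://{parts.netloc}" + "/".join(segments)


base = "https://myaccount.blob.core.windows.net/mycontainer"
urls = [
    f"{base}/weblogs/cy_gb/1004.json",
    f"{base}/weblogs/cy_gb/234.json",
    f"{base}/weblogs/de_Ch/23434.json",
]
# All three URLs collapse to the same resource set name.
names = {resource_set_name(u) for u in urls}
print(names)
```

Because each path segment is generalized independently, any two files whose paths differ only in matched segments end up with the same name, which is what lets many files appear as one asset in the catalog.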
Note: Azure Purview intentionally doesn’t try to classify document file types such as Microsoft Word, Microsoft Excel, and PDFs as resource sets.
Advanced resource sets
Azure Purview can customize and further enrich your resource set assets through the Advanced Resource Sets capability. When advanced resource sets are enabled, Azure Purview runs extra aggregations to compute the following information about resource set assets:
- Up-to-date schema and classifications to accurately reflect schema drift from changing metadata.
- Sample file paths of assets that comprise the resource set.
- A partition count that shows how many files make up the resource set.
- A schema count that shows how many unique schemas were found. The value is shown as a number from 1 to 5, or as 5+ when more than five schemas are found.
- A list of partition types when more than a single partition type is included in the resource set. For example, an IoT sensor might output both XML and JSON files, although both are logically part of the same resource set.
- The total size of all files that comprise the resource set.
These properties can be found on the asset details page of the resource set.
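To make these properties concrete, here is a rough sketch of how such aggregates could be derived from the files in a resource set. It is an illustration under stated assumptions, not how Azure Purview computes them: the FileInfo record, the summarize function, and the output key names are hypothetical, and a schema is simplified to a tuple of column names.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical record for one file (partition) in a resource set."""
    path: str
    size_bytes: int
    schema: tuple  # assumption: a schema is reduced to its column names


def schema_count_label(count):
    """Show counts up to 5 as-is and anything larger as '5+'."""
    return str(count) if count <= 5 else "5+"


def summarize(files, sample_size=3):
    """Compute illustrative aggregates over the files of one resource set."""
    extensions = {
        posixpath.splitext(f.path)[1].lstrip(".").lower() for f in files
    }
    extensions.discard("")  # ignore files without an extension
    return {
        "partition_count": len(files),
        "schema_count": schema_count_label(len({f.schema for f in files})),
        "partition_types": sorted(extensions),
        "total_size_bytes": sum(f.size_bytes for f in files),
        "sample_paths": [f.path for f in files[:sample_size]],
    }


files = [
    FileInfo("sensor/1.json", 100, ("id", "temp")),
    FileInfo("sensor/2.json", 150, ("id", "temp")),
    FileInfo("sensor/3.xml", 50, ("id", "temp", "unit")),
]
print(summarize(files))
```

In this example the three files yield a partition count of 3, a schema count of 2, and two partition types (json and xml), mirroring the IoT sensor case mentioned in the list.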
Enabling advanced resource sets also allows for the creation of resource set pattern rules that customize how Azure Purview groups resource sets during scanning.
Enabling advanced resource sets
The advanced resource sets feature is off by default in all new Azure Purview instances. Advanced resource sets can be enabled from Account information in the management hub.
After you enable advanced resource sets, the additional enrichments apply to all newly ingested assets. The Azure Purview team recommends waiting an hour after turning the feature on before scanning new data lake data.
Customizing resource set grouping using pattern rules
When scanning a storage account, Azure Purview uses a set of defined patterns to determine if a group of assets is a resource set. In some cases, Azure Purview's resource set grouping might not accurately reflect your data estate. These issues can include:
- Incorrectly marking an asset as a resource set
- Putting an asset into the wrong resource set
- Incorrectly marking an asset as not being a resource set
To customize or override how Azure Purview detects which assets are grouped as resource sets and how they are displayed in the catalog, you can define pattern rules in the management center. Pattern rules are available only when the advanced resource sets feature is enabled. For step-by-step instructions and syntax, see resource set pattern rules.
Get started today!
- Quickly and easily create an Azure Purview account to try the generally available features.
- Learn more about how to create resource set pattern rules.
Posted at https://sl.advdat.com/3oXTaOr