With the general availability of Microsoft Purview for Data Governance (formerly Azure Purview), it is a great time to review some of the key features and provide a learning path made up of free online resources to help you get up to speed.
This article will take you through a curated overview of the application focusing on Data Governance and providing links to more detailed material.
What is Azure Purview?
Azure Purview is a cloud-based data governance service that helps you catalog, manage, and govern your on-premises, multi-cloud, and software-as-a-service (SaaS) data. You can create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification, and end-to-end data lineage.
Data Governance requires a business process first and foremost, but that business process needs an application that simplifies the implementation. If the system is too difficult to use, people will not use it, and you will end up with shadow processes that avoid the rules and expose the organization to compliance trouble. So, Data Governance is a team sport that needs a flexible tool to bring this all together, especially in the hybrid environments companies use today.
The key benefit of Purview is that it is a cloud-native tool that discovers, automatically catalogs, and tags data, which helps you build a process around data governance of your Azure data estate.
Pictured below is the landing page.
- What is Azure Purview? – Great summary of the service and the issues facing data governance.
Looking for Azure Purview demo videos?
The following resources provide extensive demonstrations of the platform. The second one, by the Microsoft Security Community, goes into detail, exploring the Microsoft 365 sensitivity labels that come through the connection to Microsoft Compliance.
- Demo: Azure Purview webinar: Introduction to Azure Purview – A 50-minute webinar covering Azure Purview which also includes an extensive demo.
- (NEW) Azure Purview – YouTube Video Channel – Full YouTube channel by a team of creators covering various topics and demos.
Setting the Stage: The Data Governance Problem
In the simplest terms, data governance is about managing data as a strategic asset. It involves ensuring that there are controls around data, such as content, structure, use, and safety. A great example of this is the need to track and provide guidance around personally identifiable information, which must be kept secure to satisfy compliance requirements and regulators.
Data Growth and Complexity
As modern business data usage evolves, it embraces advanced analytics, artificial intelligence, and machine learning. This need is driving up the amount, velocity, and variety of data in play. With all that data comes a wealth of new possibilities and a new set of challenges. What matters here is our ability to optimize the management and governance of ever-greater amounts of data so that we are successful. Especially with regulations such as GDPR, mistakes can be costly both in reputation and financially.
As we have continued our move to the cloud, the amount of data we are willing to keep has grown. With blob storage being far cheaper than a new SAN, we keep not only data with a high business value but also data that may have value later on.
With machine learning, AI, and more analytics opportunities, we are keeping data that we want to use to solve business problems that we do not yet know about.
Opportunity vs Risk of Data
I remember one of my former manager’s favorite comments back in the day: “We do not know today what the questions are that we want our data to answer.” Well, now, with AI, Machine Learning, and cheap storage, we can keep more data for longer. But again, this is a balancing act. We have to balance the opportunities we see now and in the future with the risks of more data accessible to more people.
When we were setting up clients with data cataloging, we had a couple of searches that we would take management aside and review, such as executive salary, layoffs, popular movies, and Napster content. I was always able to find something that shocked them.
On the data side, typical examples were how many copies of the customer table they had, how outdated those copies could be, and the shock of what personal customer data happened to be shared around the company. Remember, you are only an Excel download away from a data breach!
The key takeaway is that without a plan, you invite issues. Unfortunately, this was usually the best way to start the governance discussion.
Data Governance Resources
- Microsoft Data Governance Blog
- Microsoft Guide to Data Governance – Building a Roadmap
- Data governance matters now more than ever – Microsoft 365 Records Management
Let’s review the parts that make up Azure Purview.
Purview Data Map
The Data Map is the processing heart of the service. It provides the automation, scanning, and classification of data sources you wish to catalog. The service is multi-cloud, with Amazon S3 coming soon. The listing below shows the current Azure sources available in preview, with other connectors to be added as time goes on.
The following sources are currently available (Feb 2021) in preview. The self-hosted integration runtime (SHIR) allows on-premises data sources to be scanned.
- On-premises SQL Server (SQL authentication)
- Azure Synapse Analytics (formerly SQL DW)
- Azure SQL Database (DB)
- Azure SQL Database Managed Instance
- Azure Blob Storage
- Azure Data Explorer
- Azure Data Lake Storage Gen1 (ADLS Gen1)
- Azure Data Lake Storage Gen2 (ADLS Gen2)
- Azure Cosmos DB
In addition to these sources, the following file types are supported for scanning, schema extraction, and classification where applicable:
- Structured file formats supported by extension: AVRO, ORC, PARQUET, CSV, JSON, PSV, SSV, TSV, TXT, XML
- Document file formats supported by extension: DOC, DOCM, DOCX, DOT, ODP, ODS, ODT, PDF, POT, PPS, PPSX, PPT, PPTM, PPTX, XLC, XLS, XLSB, XLSM, XLSX, XLT
- Purview also supports custom file extensions and custom parsers.
Azure Purview will also scan within certain files, sampling the data to provide metadata and data types.
Purview has three scanning levels:
- L1 scan: Extracts basic information and metadata like file name, size and fully qualified name
- L2 scan: Extracts schema for structured file types and database tables
- L3 scan: Extracts schema where applicable and subjects the sampled file to system and custom classification rules
For many data files, such as structured file types with a specific format, Purview samples 128 rows in each column or 1 MB, whichever is lower. For document file formats, it samples 20 MB of each file. Document files larger than 20 MB are not subject to a deep scan (that is, they are not classified). In that case, Purview captures only basic metadata like file name and fully qualified name.
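The scanning levels and sampling rules above can be summarized in a few lines of Python. This is only an illustrative sketch of the rules as described in this article, not Purview code: the names `plan_scan` and `ScanPlan` are made up, 1 MB is assumed to mean 1024 × 1024 bytes, a full (L3) scan is assumed to have been requested, and files with an unlisted extension are assumed to get basic metadata only.

```python
from dataclasses import dataclass

# Extensions copied from the supported file type lists in this article.
STRUCTURED = {"avro", "orc", "parquet", "csv", "json", "psv", "ssv", "tsv", "txt", "xml"}
DOCUMENT = {"doc", "docm", "docx", "dot", "odp", "ods", "odt", "pdf", "pot", "pps",
            "ppsx", "ppt", "pptm", "pptx", "xlc", "xls", "xlsb", "xlsm", "xlsx", "xlt"}

MB = 1024 * 1024            # assumption: binary megabytes
DOCUMENT_LIMIT = 20 * MB    # documents above this size are not deep scanned


@dataclass
class ScanPlan:
    level: str      # "L1" (basic metadata) or "L3" (schema plus classification)
    sampling: str   # human-readable description of what gets sampled


def plan_scan(file_name: str, size_bytes: int) -> ScanPlan:
    """Return the scan depth this article describes for a single file."""
    ext = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    if ext in STRUCTURED:
        return ScanPlan("L3", "128 rows per column or 1 MB, whichever is lower")
    if ext in DOCUMENT and size_bytes <= DOCUMENT_LIMIT:
        return ScanPlan("L3", "up to 20 MB of the file")
    # Oversized documents and (by assumption) unlisted extensions: metadata only.
    return ScanPlan("L1", "none (file name and fully qualified name only)")


for name, size in [("customers.csv", 5 * MB), ("report.pdf", 25 * MB), ("notes.docx", 2 * MB)]:
    print(name, plan_scan(name, size))
```

The useful takeaway is the 20 MB cut-off: a large PDF or spreadsheet will still show up in the catalog, but it will not carry classifications.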
- Map your data estate with Azure Purview
- Supported data sources and file types in Azure Purview
- Asset management in the Microsoft Purview Data Catalog – Delete Assets
- Understanding resource sets
- Tutorial: Scan data with Azure Purview (Preview)
Purview Data Catalog
Once the scan has gathered the metadata and discovery is complete, the data catalog is built. Each scan discovers the metadata attached to a file, which is used to help users find data in their data estate through search. The Purview landing page provides various paths to information, including a search bar. As pictured below, entering a search term brings up multiple suggestions to select from; hitting Enter shows a complete set of results on a filterable page.
For example, you can easily find a dataset called DimCustomer in the SQL database. As shown below, various filters, such as the Browse by Asset Type experience, narrow your navigation down to the SQL Server. You can then select the DimCustomer object, as pictured below, to see the record entry.
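The same DimCustomer search can also be expressed as a REST request. The sketch below only builds the URL and JSON body and does not call the service. Treat the endpoint path, the API version, the `entityType` filter key, and the `azure_sql_table` type name as assumptions to confirm against the Purview REST API reference; the account name `contoso-purview` and the helper `build_search_request` are made up for illustration.

```python
import json
from typing import Optional, Tuple

# Assumption: the API version string; check the REST reference for the current one.
DEFAULT_API_VERSION = "2022-03-01-preview"


def build_search_request(account_name: str,
                         keywords: str,
                         entity_type: Optional[str] = None,
                         limit: int = 10,
                         api_version: str = DEFAULT_API_VERSION) -> Tuple[str, dict]:
    """Build (url, body) for a catalog keyword search; nothing is sent."""
    # Assumption: the search endpoint path under the account's Purview host.
    url = (f"https://{account_name}.purview.azure.com"
           f"/catalog/api/search/query?api-version={api_version}")
    body = {"keywords": keywords, "limit": limit}
    if entity_type is not None:
        # Rough equivalent of narrowing by asset type in the portal.
        body["filter"] = {"entityType": entity_type}
    return url, body


url, body = build_search_request("contoso-purview", "DimCustomer",
                                 entity_type="azure_sql_table")
print(url)
print(json.dumps(body, indent=2))
```

To actually run the search, the body would be sent as a POST request with a bearer token in the `Authorization` header.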
A data consumer can discover data using the familiar hierarchical namespace for each data source in an explorer view. Once the data source is registered and scanned, the Data Map extracts information about its structure; the hierarchical namespace is shown below. This information is used to build the browsing experience for data discovery.
Data Lineage Example
Seeing the data workflow that brings data from the source through the transformations to the final dashboards will help you better understand your data.
You can scan your Power BI environment and Azure Synapse Analytics workspaces, which automatically publishes all discovered assets and their lineage to the Purview Data Map. You can also connect Azure Purview to Azure Data Factory instances to automatically collect data integration lineage.
As pictured below, you can get a view of what reports and visualizations are created. This allows you to determine which analytics and reports exist and examine the data flow from source to destination.
- Data lineage in Azure Purview Data Catalog client – This article provides an overview of data lineage in Azure Purview Data Catalog. It also details how data systems can integrate with the catalog to capture the lineage of data. Purview can capture lineage for data in different parts of your organization’s data estate and at varying levels of preparation.
- Azure Purview Data Catalog lineage user guide – One of the platform features of Azure Purview is the ability to show the lineage between datasets created by data processes. Systems like Data Factory, Data Share, and Power BI capture data lineage as it moves. Custom lineage reporting is also supported via Atlas hooks and REST API.
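For custom lineage through the Atlas-style REST API, lineage is typically modeled as a `Process` entity whose `inputs` and `outputs` point at existing assets. The following is a minimal sketch of such a payload in the Apache Atlas v2 entity format. The helper names, the qualified names, and the choice of `azure_sql_table` and `azure_blob_path` as asset types are illustrative assumptions, and the code only builds the JSON without submitting it.

```python
import json
from typing import Dict, List


def atlas_ref(type_name: str, qualified_name: str) -> Dict:
    """Reference an existing catalog asset by its type and qualifiedName."""
    return {"typeName": type_name,
            "uniqueAttributes": {"qualifiedName": qualified_name}}


def build_lineage_entity(name: str, qualified_name: str,
                         inputs: List[Dict], outputs: List[Dict]) -> Dict:
    """Build an Atlas v2 entity payload for a Process linking inputs to outputs."""
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "name": name,
                "qualifiedName": qualified_name,
                "inputs": inputs,
                "outputs": outputs,
            },
        }
    }


# Hypothetical assets: a SQL table copied to a curated Parquet file.
source = atlas_ref("azure_sql_table",
                   "mssql://contoso.database.windows.net/sales/dbo/DimCustomer")
target = atlas_ref("azure_blob_path",
                   "https://contoso.blob.core.windows.net/curated/DimCustomer.parquet")
payload = build_lineage_entity("copy_dim_customer",
                               "custom://etl/copy_dim_customer",
                               [source], [target])
print(json.dumps(payload, indent=2))
```

Both referenced assets need to exist in the catalog already (for example, from a scan) for the lineage edge to resolve to them.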
Purview Data Insights
Insights is one of Purview’s key pillars, where reporting, scanning, and logging reside, and it allows you to surface what is happening within your data estate.
Let’s say you are responsible for your data security. You can extend Microsoft 365 sensitivity labels to assets in Azure Purview and create or select the labels you want to apply to your data. Then, in the Insights reports, you can use different filters to slice and dice this information in different ways. This gives a detailed overview of your data estate from a compliance standpoint.
The sensitive information types in the Microsoft 365 Compliance Center are the same sensitive information types that are now being brought to Azure.
The feature provides customers with a single-pane-of-glass view into their catalog and further aims to provide specific insights to data source administrators, business users, data stewards, data officers, and security administrators. Currently, Purview has the following Insights reports available to customers in public preview. Follow the links below for more detail and sample Insights reports.
- Glossary insights on your data in Azure Purview – This how-to guide describes how to access, view, and filter Purview Glossary insight reports for your data.
- Scan insights on your data in Azure Purview – This how-to guide describes how to access, view, and filter Azure Purview scan insight reports for your data.
- Classification insights about your data from Azure Purview – This how-to guide describes how to access, view, and filter Purview Classification insight reports for your data.
- Sensitivity label insights about your data in Azure Purview – This how-to guide describes how to access, view, and filter security insights provided by sensitivity labels applied to your data.
- File extension insights about your data from Azure Purview – This how-to guide describes how to access, view, and filter insights about the file extensions, or file types, found in your data.
Azure Purview Quick Starts and Tutorials
- Tutorial: Scan data with Azure Purview (Preview) – This is a longer tutorial that includes a Starter Kit to help you set up data in your Purview instance.
- Tutorial: Register and scan an on-premises SQL Server – Using a self-hosted integration runtime, this tutorial will show you how to set up and scan an on-premises SQL Server data source.
- Tutorial: Use the REST APIs – In this tutorial, you learn how to use the Azure Purview REST APIs. Anyone who wants to submit data to an Azure Purview Catalog, include the catalog as part of an automated process, or build their own user experience on the catalog can use the REST APIs to do so.
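Before calling the REST APIs from a script, you need an access token. One common approach is the Azure AD client-credentials flow with a service principal. The sketch below builds that token request using only the Python standard library and does not send it. It assumes the v2.0 token endpoint and the scope `https://purview.azure.net/.default`; the tutorial itself may use a different variant, and the tenant ID, client ID, and secret shown are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) a client-credentials token request for Purview."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # Assumption: the Purview resource scope for the v2.0 endpoint.
        "scope": "https://purview.azure.net/.default",
    }).encode("ascii")
    return Request(url, data=form, method="POST",
                   headers={"Content-Type": "application/x-www-form-urlencoded"})


# Placeholder values; in practice these come from your service principal.
req = build_token_request("my-tenant-id", "my-client-id", "my-client-secret")
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` returns a JSON document whose `access_token` field is then passed as `Authorization: Bearer <token>` on subsequent catalog calls. Keep the client secret out of source control.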