Ingesting and transforming data in Microsoft’s Azure platform covers a wide variety of topics a Microsoft Azure Data Engineer needs to know. Using the rubric from Exam DP-203: Data Engineering on Microsoft Azure – Learn, this post covers some must-have documentation, tutorials, and demos to help you through the topic.
The exam is the only requirement for the role-based certification Microsoft Certified: Azure Data Engineer Associate – Learn.
Note: In the online exam listing, Exam DP-203: Data Engineering on Microsoft Azure, there is a section at the bottom called Two Ways to Prepare. This link takes you to the MSLearn course, pictured below, with several modules covering the exam topics, complete with labs and knowledge checks.
What is Exam DP-203: Data Engineering on Microsoft Azure?
The exam DP-203 covers what a Data Engineer needs to know to be effective on the Azure data platform. They should be a subject matter expert in integrating, transforming, and consolidating data from various structured and unstructured data systems into a format that other users can use downstream for further analysis in other analytical systems.
What Skills Should I have to Pass Exam DP-203?
One of the most important functions of an Azure Data Engineer is the building and maintenance of secure and compliant data processing pipelines by using different tools and techniques. In addition, you also help ensure that data pipelines and data stores are high-performing, secure, efficient, organized, and reliable, given an ever-changing set of business requirements and constraints. With maintenance, you need to deal with unanticipated issues and optimize data platforms to meet the data pipeline’s needs.
To be a successful Data Engineer in Microsoft Azure, you must have a solid knowledge of data processing languages such as SQL, Python, or Scala. One is usually fine, but on most of the projects I work on, SQL and Python are the most important, with Scala coming up once in a while.
The role requires the use of various Azure Data services and languages. Once you learn, you get to teach 🙂 . Azure Data Engineers need to be a resource for stakeholders by knowing their data through exploration.
How can I learn about Azure Data Architecture Patterns?
From an architecture point of view, you need to understand parallel processing with data architecture patterns. Some of the key patterns include the Data Warehousing and Analytics Pattern pictured below. Your first stop in the Azure Data learning journey should be the documentation, Azure Data Architecture Guide – Azure Architecture Center | Microsoft Docs, which has a full listing of various data architecture patterns and samples.
- Data warehousing and analytics – Azure Architecture Center | Microsoft Docs – This example scenario demonstrates a data pipeline that integrates large amounts of data from multiple sources into a unified analytics platform in Azure. This specific scenario is based on a sales and marketing solution. Still, the design patterns are relevant for many industries requiring advanced analytics of large datasets, such as e-commerce, retail, and healthcare.
- Choosing an analytical data store – Azure Architecture Center | Microsoft Docs – In a big data architecture, an analytical data store serves processed data in a structured format that can be queried using analytical tools. You may need to support querying both hot-path and cold-path data through the serving layer, or data serving storage.
- Most important for the exam, the Analytics end-to-end with Azure Synapse (Analytics end-to-end with Azure Synapse – Azure Example Scenarios | Microsoft Docs) – This example scenario demonstrates how to use the extensive family of Azure Data Services to build a modern data platform capable of handling the most common data challenges in an organization. The solution described in this article combines a range of Azure services that will ingest, store, process, enrich, and serve data and insights from different sources (structured, semi-structured, unstructured, and streaming).
Exam DP-203 Skills Outline: Ingest and transform data
These references can take you deeper than what you need for the exam but provide a great reference on the topic. For the exam, review from a high level, but make sure you get some hands-on data wrangling experience.
Note: Some sections have a couple of sources, with a Read Me First: marking a higher-priority topic.
How To Transform data by using Apache Spark
When looking at this topic as a Data Engineer, you are mainly transforming data. The primary way to do this is using DataFrames, which also covers Apache Spark pools in Azure Synapse Analytics.
- Read Me First: Transform data with DataFrames in Apache Spark Pools in Azure Synapse Analytics – Learn | Microsoft Docs – This is a nine-unit MSLearn module that includes some hands-on lab material.
- Transform data by using Spark in Azure Data Factory – Azure Data Factory | Microsoft Docs -In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline. This pipeline transforms data by using a Spark activity and an on-demand Azure HDInsight linked service.
- Data wrangling with Apache Spark pools (preview) – Azure Machine Learning | Microsoft Docs – This is a tutorial in the Machine Learning documentation.
- Apache Spark in Python: Beginner’s Guide – DataCamp – This provides an overview listing of PySpark commands.
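To make the DataFrame idea concrete, here is a minimal plain-Python sketch of the kind of filter-and-aggregate transform you would express in PySpark with `filter` and `groupBy`. No Spark is required to run it, and the column names and sample rows are hypothetical.

```python
# Plain-Python sketch of a Spark-style filter + groupBy aggregation.
# In PySpark this would roughly be:
#   df.filter(df.status == "complete").groupBy("region").sum("amount")
sales = [  # hypothetical sample rows
    {"region": "west", "amount": 100, "status": "complete"},
    {"region": "west", "amount": 50, "status": "pending"},
    {"region": "east", "amount": 75, "status": "complete"},
]

def transform(rows):
    """Keep completed sales, then total the amounts per region."""
    totals = {}
    for row in rows:
        if row["status"] == "complete":  # filter step
            totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
    return totals

print(transform(sales))  # {'west': 100, 'east': 75}
```

The point is the shape of the work, row-level filtering followed by a grouped aggregate, which is exactly what Spark distributes across a pool.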
Transform data by using Transact-SQL
These are basic T-SQL transformations. The following resources provide an overview of data engineering with T-SQL commands.
- Apply SQL Transformation – Azure Machine Learning | Microsoft Docs – How to apply SQL transformation using a module of Azure Machine Learning designer.
- Modify data with T-SQL – Learn | Microsoft Docs – A nine-unit MSLearn module with introductory T-SQL.
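As a quick illustration of the style of SQL transformation involved, the sketch below derives a new column with a CASE expression. It uses Python's built-in `sqlite3` (SQLite's dialect, not full T-SQL), and the table and column names are hypothetical, but the CASE-based transformation pattern carries over directly.

```python
import sqlite3

# Illustrative only: SQLite's dialect, not full T-SQL, but the CASE
# pattern is the same. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 20.0), (2, 150.0), (3, 90.0)])

# Derive a size category per order, a typical SQL-style transform.
rows = conn.execute("""
    SELECT id,
           CASE WHEN amount >= 100 THEN 'large' ELSE 'small' END AS size
    FROM orders
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 'small'), (2, 'large'), (3, 'small')]
```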
Transform data by using Azure Data Factory
Azure Data Factory is a deep topic, but keeping this high level, you need to know how to ingest data using the service. The main documentation landing page can be found here: Azure Data Factory Documentation – Azure Data Factory | Microsoft Docs. In addition, there are sections on data transformation and ingestion. As pictured below, a high-level overview of Transform Data with Mapping Data Flows would cover most of what is needed.
- Read Me First: Transform data – Azure Data Factory | Microsoft Docs – This article explains data transformation activities in Azure Data Factory that you can use to transform and process your raw data into predictions and insights at scale. A transformation activity executes in a computing environment such as Azure Databricks or Azure HDInsight. In addition, it provides links to articles with detailed information on each transformation activity.
- Data ingestion with Azure Data Factory – Azure Machine Learning | Microsoft Docs – In this article, you learn about the available options for building a data ingestion pipeline with Azure Data Factory. This Azure Data Factory pipeline is used to ingest data for use with Azure Machine Learning – ML as a service. Data Factory allows you to easily extract, transform, and load (ETL) data. Once the data has been transformed and loaded into storage, it can be used to train your machine learning models in Azure Machine Learning.
- Mapping data flows – Azure Data Factory | Microsoft Docs – Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. Data flow activities can be operationalized using existing Azure Data Factory scheduling, control, flow, and monitoring capabilities.
Mapping data flow provides an entirely visual experience with no coding required. Your data flows run on ADF-managed execution clusters for scaled-out data processing. Azure Data Factory handles all the code translation, path optimization, and execution of your data flow jobs.
- Tutorial: Create a mapping data flow – Azure Data Factory | Microsoft Docs – Mapping Data Flows in ADF provides a way to transform data at scale without any coding required. You can design a data transformation job in the data flow designer by constructing a series of transformations. Start with any number of source transformations followed by data transformation steps. Then, complete your data flow with a sink to land your results in a destination.
One of the major aspects of moving data to Azure is how to do this incrementally. In a data integration solution, incrementally (or delta) loading data after an initial full data load is a widely used scenario. A good overview is provided in the following two sources.
- ** Important **: Incrementally copy data from a source data store to a destination data store – Azure Data Factory | Microsoft Docs – The tutorials in this section show you different methods for incrementally loading data using Azure Data Factory.
- Tutorial: Incrementally copy data using Change Tracking using PowerShell – Azure Data Factory | Microsoft Docs – In this tutorial, you create an Azure data factory with a pipeline that loads delta data, based on change tracking information in the source database in Azure SQL Database, to Azure Blob storage. For some solutions, operationalizing with PowerShell is a given for scale; it depends on the client or customer.
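The core of every incremental-load pattern is a watermark: remember the high-water mark from the last run, copy only rows changed since then, then advance it. Here is a minimal plain-Python sketch of that loop; the rows and column names are hypothetical, and in ADF the watermark would live in a control table rather than a variable.

```python
from datetime import datetime

# Hypothetical source rows with a last-modified timestamp column.
source = [
    {"id": 1, "modified": datetime(2021, 1, 1)},
    {"id": 2, "modified": datetime(2021, 1, 5)},
    {"id": 3, "modified": datetime(2021, 1, 9)},
]

def incremental_load(rows, watermark):
    """Copy only rows changed since the last watermark, then advance it."""
    delta = [r for r in rows if r["modified"] > watermark]
    new_watermark = max((r["modified"] for r in delta), default=watermark)
    return delta, new_watermark

delta, wm = incremental_load(source, datetime(2021, 1, 3))
print([r["id"] for r in delta])  # [2, 3]
```

On the next run you would pass `wm` back in, so already-copied rows are skipped.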
Transform data by using Azure Synapse Pipelines
The experience of Azure Synapse Pipelines is essentially the same as Azure Data Factory. The main difference is the addition of Spark pools.
- Data wrangling with Apache Spark pools (preview) – Azure Machine Learning | Microsoft Docs – This covers Azure Synapse Spark Pools, also referenced above.
- Tutorial: Get started integrate with pipelines – Azure Synapse Analytics – In this tutorial, you’ll learn how to integrate pipelines and activities using Synapse Studio.
- Tutorial: Transform data using a mapping data flow – Azure Synapse Analytics | Microsoft Docs – In this quickstart, you’ll use Azure Synapse Analytics to create a pipeline that transforms data from an Azure Data Lake Storage Gen2 (ADLS Gen2) source to an ADLS Gen2 sink using mapping data flow. The configuration pattern in this quickstart can be expanded upon when transforming data using mapping data flow.
- Read Me First: For the exam, it might be beneficial to review the documentation, Differences from Azure Data Factory – Azure Synapse Analytics | Microsoft Docs. In Azure Synapse Analytics, the data integration capabilities such as Synapse pipelines and data flows are based on Azure Data Factory; however, there are some differences.
Transform data by using Stream Analytics
There are two areas with information. You should have a general knowledge of both, but the MSLearn module covers the topic well.
- Transform data by using Azure Stream Analytics – Learn | Microsoft Docs – MSLearn modules that cover the topic quite well. Has exercises and theory topics.
- Introduction to Azure Stream Analytics | Microsoft Docs – General documentation also has links to quick starts. The documentation is more for a full overview of Streaming Analytics.
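A core Stream Analytics concept worth internalizing is the windowed aggregate. Here is a plain-Python sketch of a tumbling window, the fixed-size, non-overlapping window that Stream Analytics expresses as `TumblingWindow` in its query language; the event stream is hypothetical.

```python
from collections import defaultdict

# Hypothetical event stream: (epoch_seconds, value) pairs.
events = [(0, 1), (10, 2), (65, 3), (70, 4), (130, 5)]

def tumbling_window_sum(stream, window_seconds):
    """Group events into fixed, non-overlapping windows and sum values,
    like a Stream Analytics TumblingWindow aggregate."""
    windows = defaultdict(int)
    for ts, value in stream:
        windows[ts // window_seconds] += value
    return dict(windows)

print(tumbling_window_sum(events, 60))  # {0: 3, 1: 7, 2: 5}
```

Each event lands in exactly one window, which is what distinguishes tumbling windows from hopping or sliding windows.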
Cleanse data
Knowing how to cleanse data as it flows around your Azure data architecture is different from other areas. The first of two great community articles is a go-to piece by James Serra, who reviews where you should do this work in a general Azure data architecture. Next, Cathrine Wilhelmsen covers a review and comparison of Azure Data Factory flows.
- Read Me First: Where should I clean my data? | James Serra’s Blog
- Comparing Mapping and Wrangling Data Flows in Azure Data Factory | Cathrine Wilhelmsen
- Video: ADF Data Lake Clean-up with Data Flow Fuzzy Lookups – YouTube – How to use ADF Mapping Data Flows with fuzzy lookups for data lake cleaning with delimited text in your lake.
- Data Cleansing – Data Quality Services (DQS) | Microsoft Docs – Overview of Data Cleansing techniques and theory.
- Source transformation in mapping data flow – Azure Data Factory | Microsoft Docs – Covers how to configure source transformations in mapping data flows.
Split data
This topic relates to partitioning the rows of a dataset into two distinct sets. Many tools allow this, and each has its own way. For example, splitting data into two flows is used in machine learning but also in Azure Data Factory.
- Split Data: Module reference – Azure Machine Learning | Microsoft Docs – Overview of the topic for machine learning
- Mapping data flow transformation overview – Azure Data Factory | Microsoft Docs – Reviews the various sets of functionality available.
- Read Me First: Check out the Conditional split transformation in mapping data flow – Azure Data Factory | Microsoft Docs, which covers how the conditional split transformation routes data rows to different streams based on matching conditions. The conditional split transformation is similar to a CASE decision structure in a programming language: the transformation evaluates expressions and, based on the results, directs the data row to the specified stream.
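The routing logic described above can be sketched in a few lines of plain Python. This is not ADF's implementation, just the idea: evaluate a condition per row and send each row to the matched or default stream. The sample rows are hypothetical.

```python
def conditional_split(rows, condition):
    """Route each row to a matched or default stream, like ADF's
    conditional split transformation (a CASE-style decision per row)."""
    matched, default = [], []
    for row in rows:
        (matched if condition(row) else default).append(row)
    return matched, default

orders = [{"amount": 20}, {"amount": 150}, {"amount": 90}]  # hypothetical
big, rest = conditional_split(orders, lambda r: r["amount"] >= 100)
print(big)   # [{'amount': 150}]
print(rest)  # [{'amount': 20}, {'amount': 90}]
```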
Shred JSON
Many of the tools can handle JSON files.
- JSON format in Azure Data Factory – Azure Data Factory | Microsoft Docs – This article covers support for JSON in ADF.
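For a feel of the format itself, here is a minimal sketch of reading a JSON Lines payload (one JSON document per line, a common big-data file layout and one of the patterns the ADF JSON format supports) using only the standard library; the payload is hypothetical.

```python
import json

# Hypothetical JSON Lines payload: one JSON document per line.
payload = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}'

# Shred the file into row-shaped dictionaries, one per line.
rows = [json.loads(line) for line in payload.splitlines()]
print([r["id"] for r in rows])  # [1, 2]
```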
Encode and Decode data
With Big Data files and transferring files between systems, you will come across different encoded data files.
- Read Me First: Handling data encoding issues while loading data to SQL Data Warehouse | Azure Blog and Updates | Microsoft Azure – Provides insight into some of the data encoding issues you may encounter while using PolyBase to load data to SQL Data Warehouse. The article also provides some options that you can use to overcome such issues and load the data successfully.
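The classic failure mode is decoding a file with the wrong codec. This small sketch shows why: the same UTF-8 bytes read as Latin-1 silently turn into mojibake rather than raising an error, which is exactly the kind of corruption that surfaces later in a load.

```python
raw = "café".encode("utf-8")  # bytes as they might arrive in a file

# Wrong codec: UTF-8 bytes decoded as Latin-1 become mojibake
# with no exception raised.
print(raw.decode("latin-1"))  # cafÃ©

# Right codec: the original text round-trips cleanly.
print(raw.decode("utf-8"))    # café
```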
Configure error handling for the transformation
- Handle error rows with mapping data flows in Azure Data Factory – Azure Data Factory | Microsoft Docs
- Troubleshoot mapping data flows – Azure Data Factory | Microsoft Docs
- Troubleshoot Azure Data Factory connectors – Azure Data Factory | Microsoft Docs
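The pattern behind ADF's error-row handling is to route failing rows to an error sink instead of failing the whole run. Here is a minimal plain-Python sketch of that idea, not ADF's actual mechanism; the input rows and transform are hypothetical.

```python
def run_with_error_rows(rows, transform):
    """Apply a transform per row, routing rows that raise an exception
    to an error sink instead of aborting the whole run."""
    good, errors = [], []
    for row in rows:
        try:
            good.append(transform(row))
        except Exception as exc:
            errors.append({"row": row, "error": str(exc)})
    return good, errors

rows = [{"amount": "10"}, {"amount": "oops"}]  # hypothetical input
good, errors = run_with_error_rows(rows, lambda r: int(r["amount"]))
print(good, len(errors))  # [10] 1
```

The error sink (`errors` here; a file or table in practice) preserves the bad rows for later inspection and reprocessing.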
Normalize and denormalize values
- Flatten transformation in mapping data flow – Azure Data Factory | Microsoft Docs – Use the flatten transformation to take array values inside hierarchical structures such as JSON and unroll them into individual rows. This process is known as denormalization.
- (ML) – Normalize Data: Module Reference – Azure Machine Learning | Microsoft Docs – Use this module to transform a dataset through normalization. Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale without distorting differences in the ranges of values or losing information. Normalization is also required for some algorithms to model the data correctly.
- ML Studio (classic): Normalize Data – Azure | Microsoft Docs – This article describes how to use the Normalize Data module in Azure Machine Learning Studio (classic) to transform a dataset through normalization; it covers the same concepts as the designer module.
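The flatten/denormalize operation described above is easy to sketch in plain Python: unroll an array column into one output row per element, copying the parent fields onto each. This is an illustration of the idea, not ADF's implementation, and the record shape is hypothetical.

```python
# Hypothetical hierarchical record with an array, as in a JSON document.
order = {"id": 1, "items": [{"sku": "a"}, {"sku": "b"}]}

def flatten(record, array_key):
    """Unroll an array inside a record into individual rows
    (denormalization), like ADF's flatten transformation."""
    parent = {k: v for k, v in record.items() if k != array_key}
    return [{**parent, **item} for item in record[array_key]]

print(flatten(order, "items"))
# [{'id': 1, 'sku': 'a'}, {'id': 1, 'sku': 'b'}]
```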
General information on normalization in machine learning:
- How to De-Normalize and De-Standardize data in Python | by Tracyrenee | Python in Plain English
- Data normalization in machine learning | by Mahbubul Alam | Towards Data Science
- Non-ML – Data Denormalization and Normalization of Big Data | XenonStack
- Denormalizing Normalized Data For Machine Learning (orapub.com) – This has a good Python example.
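The normalization these articles describe is often simple min-max scaling: rescale a numeric column to the [0, 1] range without distorting the relative differences between values. A minimal sketch, with hypothetical input values:

```python
def min_max_normalize(values):
    """Rescale numeric values to the [0, 1] range: the common
    min-max normalization used in ML data preparation."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```

Reversing the formula (multiply by the original range, add the original minimum) is how you de-normalize predictions back to real units.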
Transform data by using Scala
- Azure Databricks for Scala developers – Azure Databricks – Workspace – This article provides a guide to developing notebooks and jobs in Azure Databricks using the Scala language.
- Tutorial: Perform ETL operations using Azure Databricks | Microsoft Docs – This tutorial has code examples in Scala.
Perform data exploratory analysis
- Perform exploratory analytics over your data lake (microsoft.com) – Video – Achieve the freedom to query data on your terms using either serverless or provisioned resources.
- Exploratory Data Analysis with Azure Synapse Analytics – Great community walk-through on this topic.
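A typical first step in exploratory analysis, whatever the engine, is pulling summary statistics on a column to spot skew and outliers. Here is a plain-Python sketch using the standard library; the column values are hypothetical.

```python
import statistics

# Hypothetical numeric column pulled from a data lake query result.
amounts = [12.0, 15.0, 14.0, 100.0, 13.0]

# Summary statistics: the gap between mean and median flags an outlier.
summary = {
    "count": len(amounts),
    "mean": statistics.mean(amounts),
    "median": statistics.median(amounts),
    "stdev": round(statistics.stdev(amounts), 2),
}
print(summary)
```

Here the mean (30.8) sits far above the median (14.0), immediately pointing at the 100.0 outlier, which is the kind of signal exploration is meant to surface.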
I hope this learning guide has helped you find resources to get you up to speed with ingesting and transforming data in Microsoft’s Azure platform. The topic can cover a wide variety of solutions a Microsoft Azure Data Engineer needs to know.