
Learning Guide: Microsoft Azure Data Engineering – Ingest and Transform Data

Ingesting and transforming data on Microsoft's Azure platform covers a wide variety of topics a Microsoft Azure Data Engineer needs to know. Using the rubric from Exam DP-203: Data Engineering on Microsoft Azure – Learn, this post covers some must-have documentation, tutorials, and demos to help you through the topic.

The exam is the only requirement for the role-based certification Microsoft Certified: Azure Data Engineer Associate – Learn.

Note: In the online exam listing, Exam DP-203: Data Engineering on Microsoft Azure, there is a section at the bottom called Two Ways to Prepare. The link takes you to the MS Learn course, which has several modules covering the exam topics with labs and knowledge checks.


What is Exam DP-203: Data Engineering on Microsoft Azure?

Exam DP-203 covers what a Data Engineer needs to know to be effective on the Azure data platform. Candidates should be subject matter experts in integrating, transforming, and consolidating data from various structured and unstructured data systems into a format that downstream users can consume for further analysis in other analytical systems.

What Skills Should I Have to Pass Exam DP-203?

One of the most important functions of an Azure Data Engineer is building and maintaining secure, compliant data processing pipelines using different tools and techniques. You also help ensure that data pipelines and data stores are high-performing, secure, efficient, organized, and reliable given an ever-changing set of business requirements and constraints. On the maintenance side, you deal with unanticipated issues and optimize data platforms to meet the pipeline's needs.

To be a successful Data Engineer in Microsoft Azure, you must have solid knowledge of data processing languages such as SQL, Python, or Scala. Knowing one is usually fine, but on most of the projects I work on, SQL and Python matter most, with Scala coming up once in a while.

The role requires the use of various Azure data services and languages. Once you learn, you get to teach 🙂. Azure Data Engineers need to be a resource for stakeholders, which means getting to know their data through exploration.

How can I learn about Azure Data Architecture Patterns?

From an architecture point of view, you need to understand parallel processing and common data architecture patterns. Some of the key patterns include the Data Warehousing and Analytics pattern pictured below. Your first stop in the Azure data learning journey should be the Azure Data Architecture Guide – Azure Architecture Center | Microsoft Docs, which has a full listing of data architecture patterns and samples.

Source: Data warehousing and analytics – Azure Architecture Center | Microsoft Docs
  • Data warehousing and analytics – Azure Architecture Center | Microsoft Docs – This example scenario demonstrates a data pipeline that integrates large amounts of data from multiple sources into a unified analytics platform in Azure. The specific scenario is based on a sales and marketing solution, but the design patterns are relevant for many industries requiring advanced analytics of large datasets, such as e-commerce, retail, and healthcare.
  • Choosing an analytical data store – Azure Architecture Center | Microsoft Docs – In a big data architecture, an analytical data store serves processed data in a structured format that can be queried using analytical tools. You may also need to support querying hot-path and cold-path data through the serving layer, or data serving storage.
  • Most important for the exam, Analytics end-to-end with Azure Synapse (Analytics end-to-end with Azure Synapse – Azure Example Scenarios | Microsoft Docs) – This example scenario demonstrates how to use the extensive family of Azure Data Services to build a modern data platform capable of handling the most common data challenges in an organization. The solution combines a range of Azure services that ingest, store, process, enrich, and serve data and insights from different sources (structured, semi-structured, unstructured, and streaming).

Exam DP-203 Skills Outline: Ingest and transform data

These references can take you deeper than what you need for the exam, but they provide a great reference on each topic. For the exam, review from a high level, but make sure you get some hands-on data wrangling.

Note: Some sections have a couple of sources, with a Read Me First: marking the higher-priority one.

Transform data by using Apache Spark

When looking at this topic as a Data Engineer, you are mainly transforming data. The primary way to do this is with DataFrames, which also covers Apache Spark pools in Azure Synapse Analytics.
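
To make this concrete, here is a minimal PySpark sketch of a typical DataFrame transformation. The storage paths, column names, and schema are hypothetical; the same code runs on an Azure Synapse Spark pool or Azure Databricks.

    # A minimal DataFrame transformation: cast, filter, aggregate, write.
    # Paths and column names below are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("transform-demo").getOrCreate()

    # Read raw sales data from a hypothetical ADLS Gen2 container
    raw = (spark.read
           .option("header", True)
           .csv("abfss://raw@mystorage.dfs.core.windows.net/sales.csv"))

    # Cast, filter out bad rows, and aggregate by region
    sales = (raw
             .withColumn("amount", F.col("amount").cast("double"))
             .filter(F.col("amount") > 0)
             .groupBy("region")
             .agg(F.sum("amount").alias("total_amount")))

    # Land the result in a curated zone as Parquet
    sales.write.mode("overwrite").parquet("abfss://curated@mystorage.dfs.core.windows.net/sales_by_region")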


Transform data by using Transact-SQL

These are basic T-SQL transformations. The following resources provide an overview of data engineering with T-SQL commands.
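
For a flavor of the pattern, here is a minimal sketch that runs a T-SQL CREATE TABLE AS SELECT (CTAS) transformation from Python with pyodbc against a Synapse dedicated SQL pool. The server, database, credentials, and table names are hypothetical.

    # Run a T-SQL CTAS transformation over ODBC; connection details are hypothetical.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=mydw;"
        "UID=myuser;PWD=mypassword"
    )

    # CTAS creates a new, distributed table from the result of a SELECT
    ctas = """
    CREATE TABLE dbo.SalesByRegion
    WITH (DISTRIBUTION = HASH(Region))
    AS
    SELECT Region, SUM(Amount) AS TotalAmount
    FROM dbo.RawSales
    WHERE Amount > 0
    GROUP BY Region;
    """

    cursor = conn.cursor()
    cursor.execute(ctas)
    conn.commit()
    conn.close()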

Transform data by using Azure Data Factory

Azure Data Factory is a deep topic, but keeping this high level, you need to know how to ingest data using the service. The main documentation landing page is Azure Data Factory Documentation – Azure Data Factory | Microsoft Docs. In addition, there are sections on data transformation and ingestion. As pictured below, a high-level overview of the Transform Data with Mapping Data Flows section covers most of what is needed.

Transform Data with Mapping Data Flows section of the documentation.
  • Read Me First: Transform data – Azure Data Factory | Microsoft Docs – This article explains data transformation activities in Azure Data Factory that you can use to transform and process your raw data into predictions and insights at scale. A transformation activity executes in a computing environment such as Azure Databricks or Azure HDInsight. In addition, it provides links to articles with detailed information on each transformation activity.
  • Data ingestion with Azure Data Factory – Azure Machine Learning | Microsoft Docs – In this article, you learn about the available options for building a data ingestion pipeline with Azure Data Factory. This Azure Data Factory pipeline is used to ingest data for use with Azure Machine Learning (ML as a service). Data Factory allows you to easily extract, transform, and load (ETL) data. Once the data has been transformed and loaded into storage, it can be used to train your machine learning models in Azure Machine Learning.
  • Mapping data flows – Azure Data Factory | Microsoft Docs –  Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. Data flow activities can be operationalized using existing Azure Data Factory scheduling, control, flow, and monitoring capabilities.
    Mapping data flow provides an entirely visual experience with no coding required. Your data flows run on ADF-managed execution clusters for scaled-out data processing. Azure Data Factory handles all the code translation, path optimization, and execution of your data flow jobs.
  • Tutorial: Create a mapping data flow – Azure Data Factory | Microsoft Docs – Mapping Data Flows in ADF provides a way to transform data at scale without any coding required. You can design a data transformation job in the data flow designer by constructing a series of transformations. Start with any number of source transformations followed by data transformation steps. Then, complete your data flow with a sink to land your results in a destination.

One of the major aspects of moving data to Azure is how to do this incrementally. In a data integration solution, incrementally (or delta) loading data after an initial full data load is a widely used scenario. A good overview is provided in the following two sources.
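
As a rough illustration, here is a minimal PySpark sketch of the high-watermark pattern at the heart of incremental loading. The watermark store, paths, and column names are hypothetical.

    # Incremental (delta) load using a stored high watermark; names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("incremental-load").getOrCreate()

    # 1. Read the high-watermark value saved by the previous run
    wm = (spark.read.parquet("abfss://meta@mystorage.dfs.core.windows.net/watermark")
          .agg(F.max("last_loaded_at").alias("wm"))
          .collect()[0]["wm"])

    # 2. Pull only rows modified since the last load
    source = spark.read.parquet("abfss://raw@mystorage.dfs.core.windows.net/orders")
    delta = source.filter(F.col("modified_at") > F.lit(wm))

    # 3. Append the delta, then persist the new watermark for the next run
    delta.write.mode("append").parquet("abfss://curated@mystorage.dfs.core.windows.net/orders")
    (delta.agg(F.max("modified_at").alias("last_loaded_at"))
          .write.mode("overwrite")
          .parquet("abfss://meta@mystorage.dfs.core.windows.net/watermark"))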


Transform data by using Azure Synapse Pipelines

The experience of Azure Synapse pipelines is much the same as Azure Data Factory; the main difference is the addition of Spark pools.

Transform data by using Stream Analytics

There are two areas with information here. You should have a general knowledge of both, but the MS Learn module covers the topic well.


Cleanse data

Knowing how and where to cleanse data as it flows through your Azure data architecture is different from the other areas. The first of two community articles is a go-to piece by James Serra, who reviews where you should do this work in a general Azure data architecture. Next, Cathrine Wilhelmsen reviews and compares Azure Data Factory data flows.
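
Wherever you decide to do the work, the mechanics look similar. Here is a minimal PySpark cleansing sketch (trim and lower-case strings, fill nulls, de-duplicate); the column names and paths are hypothetical.

    # Typical cleansing steps; column names and paths are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cleanse-demo").getOrCreate()
    df = spark.read.parquet("abfss://raw@mystorage.dfs.core.windows.net/customers")

    clean = (df
             .withColumn("email", F.lower(F.trim(F.col("email"))))  # standardize strings
             .na.fill({"country": "Unknown"})                       # handle missing values
             .dropDuplicates(["customer_id"]))                      # remove duplicate rows

    clean.write.mode("overwrite").parquet("abfss://curated@mystorage.dfs.core.windows.net/customers")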


Split data

This topic relates to partitioning the rows of a dataset into two distinct sets. Many tools allow this, and each has its own way. For example, splitting data into two flows is common in machine learning, but it also comes up in Azure Data Factory.
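
For instance, here is a minimal PySpark sketch that partitions a dataset into two disjoint sets with randomSplit; the 80/20 ratio, seed, and path are hypothetical.

    # Split one dataset into two disjoint sets; ratio and path are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("split-demo").getOrCreate()
    df = spark.read.parquet("abfss://curated@mystorage.dfs.core.windows.net/customers")

    # randomSplit partitions rows by the given weights
    train, test = df.randomSplit([0.8, 0.2], seed=42)
    print(train.count(), test.count())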

Shred JSON

Shredding JSON means flattening nested JSON documents into relational rows and columns, and many of the tools can handle JSON files.
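
As a quick illustration, here is a minimal PySpark sketch that shreds a nested JSON document into flat rows with explode; the file path and field names are hypothetical.

    # Flatten nested JSON into relational rows; path and fields are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("shred-json").getOrCreate()

    # multiLine handles JSON documents that span multiple lines
    orders = (spark.read
              .option("multiLine", True)
              .json("abfss://raw@mystorage.dfs.core.windows.net/orders.json"))

    # Explode the nested items array so each line item becomes its own row
    flat = (orders
            .select("order_id", F.explode("items").alias("item"))
            .select("order_id", "item.sku", "item.quantity", "item.price"))

    flat.show()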

Encode and Decode data

With big data files and file transfers between systems, you will come across data files in different encodings.
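
As a minimal sketch, here is how you might read a non-UTF-8 file with Spark's encoding option and write it back out as UTF-8; the paths and the Latin-1 source encoding are hypothetical.

    # Decode a Latin-1 CSV and re-encode it as UTF-8; paths are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("encoding-demo").getOrCreate()

    # Spark's CSV reader can decode non-UTF-8 files via the encoding option
    df = (spark.read
          .option("header", True)
          .option("encoding", "ISO-8859-1")  # assumed source encoding
          .csv("abfss://raw@mystorage.dfs.core.windows.net/legacy.csv"))

    # Write back out as UTF-8 (Spark's default) for downstream consumers
    (df.write.mode("overwrite")
       .option("header", True)
       .csv("abfss://curated@mystorage.dfs.core.windows.net/legacy_utf8"))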

Configure error handling for the transformation
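
On the Spark side, one common approach is to route malformed rows aside instead of failing the whole job. Here is a minimal sketch using Spark's PERMISSIVE read mode with a corrupt-record column; the paths and schema are hypothetical.

    # Route malformed rows to a corrupt-record column instead of failing the job.
    # Paths and schema are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("error-handling").getOrCreate()

    schema = StructType([
        StructField("region", StringType()),
        StructField("amount", DoubleType()),
        StructField("_corrupt_record", StringType()),  # captures unparseable rows
    ])

    df = (spark.read
          .schema(schema)
          .option("mode", "PERMISSIVE")
          .option("columnNameOfCorruptRecord", "_corrupt_record")
          .csv("abfss://raw@mystorage.dfs.core.windows.net/sales.csv"))

    df.cache()  # cache so the corrupt-record column can be referenced in filters
    good = df.filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")
    bad = df.filter(F.col("_corrupt_record").isNotNull())

    good.write.mode("overwrite").parquet("abfss://curated@mystorage.dfs.core.windows.net/sales")
    bad.write.mode("overwrite").json("abfss://errors@mystorage.dfs.core.windows.net/sales_bad_rows")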


Normalize and denormalize values

Although this item sounds like machine learning, in a transformation context it generally refers to scaling values into a common range and reversing that scaling.
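
Here is a minimal PySpark sketch of min-max normalizing a numeric column into [0, 1] and then reversing (denormalizing) it; the column names and path are hypothetical.

    # Min-max normalize a column, then reverse it; names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("normalize-demo").getOrCreate()
    df = spark.read.parquet("abfss://curated@mystorage.dfs.core.windows.net/sales")

    stats = df.agg(F.min("amount").alias("mn"), F.max("amount").alias("mx")).collect()[0]
    mn, mx = stats["mn"], stats["mx"]

    # Normalize into [0, 1], then denormalize back to the original scale
    normalized = df.withColumn("amount_norm", (F.col("amount") - mn) / (mx - mn))
    restored = normalized.withColumn("amount_back", F.col("amount_norm") * (mx - mn) + mn)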


Transform data by using Scala


Perform data exploratory analysis
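
A quick, hands-on way to get to know your data is a first-pass profile. Here is a minimal PySpark sketch; the path and column name are hypothetical.

    # First-pass data profiling; path and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("explore-demo").getOrCreate()
    df = spark.read.parquet("abfss://raw@mystorage.dfs.core.windows.net/orders")

    df.printSchema()                     # column names and types
    df.describe().show()                 # count, mean, stddev, min, max for numeric columns
    df.groupBy("status").count().show()  # distribution of a categorical column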

Conclusion

I hope this learning guide has helped you find resources to get up to speed on ingesting and transforming data on Microsoft's Azure platform. The topic covers a wide variety of solutions a Microsoft Azure Data Engineer needs to know.
