How to use Azure Data Factory V2 Sliding Windows

By Steve Young · Updated on May 8, 2023

The following article reviews the process of using Azure Data Factory V2 sliding window triggers to archive fact data from an Azure SQL Database. The process automatically exports records to Azure Data Lake as CSV files on a recurring schedule, providing a historical archive that is available to routines such as Azure Machine Learning, U-SQL Data Lake Analytics, or other big-data-style workloads.

Scenario

In the following scenario, an application is writing records to a fact table that includes a datetime column, which will be used to drive a sliding window export. These records need to be taken out of the production system and archived for future analysis and historical requirements before another process trims the table. The team wants an automated process that captures the records and copies them to Azure Data Lake Store at a predefined interval. An Azure Data Factory V2 pipeline run by a tumbling window trigger fits the bill.

Architecture

A number of moving parts make up the architecture of the overall solution. The numbered points are highlighted below and are part of a larger project.

  1. The developer uses the Azure Portal, Visual Studio, or PowerShell to develop the required objects.
  2. An Azure SQL Database acts as the source for the example.
  3. Azure Data Lake Store is created with a folder to house the output. The process stores the output files in a cascading directory structure for each Year, Month, Day, and Hour, creating the folders as needed; the wildcarding expression is included below.
  4. Azure Data Factory V2 controls the process. At the time this article was written, the product was still in preview and could change before release.

A Short Note About Time

The goal is to have all records for each time frame, here each hour, organized into folders in Azure Data Lake. As pictured below, the data lake folder structure is Root/History/RawFilesByDate/Year/Month/Day/Hour/<<Filename>>.csv. The browser is in the Eastern Time Zone, but the schedule runs on Coordinated Universal Time (UTC). In the example below, the folder is for 04 UTC, the data captured comes from the table at 04 UTC, and the file's display time in the browser is 1 a.m. Eastern.

Each run's window is one hour, defined in UTC. If you need the exported file to contain, for example, only the Eastern Time records for March 30th, select the corresponding UTC times when setting up your window.

Most systems should store times in UTC and let downstream systems translate to local time for reporting later in the process. More information on the time slices used in Azure Data Factory is available in the article Scheduling & Execution with Data Factory.

Source Table

For this example, the following table has been created. The key column is DateTimeEvent, which drives the window selection used by the process.

CREATE TABLE [dbo].[FactWindowHour](
     [FactKey] [int] IDENTITY(1,1) NOT NULL,
     [DateTimeEvent] [datetime] NOT NULL,
     [Dimension1Key] [int] NOT NULL,
     [Dimension2Key] [int] NOT NULL,
     [MyNotes] [varchar](50) NOT NULL
) ON [PRIMARY]

Sample data

To keep this example simple, one record was added to the table for each hour over a couple of days, so each exported file should contain exactly one record.
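As a rough sketch only (the dates, dimension keys, and note text here are made up for illustration), a simple T-SQL loop such as the following could load one row per hour for two days into the table above:

-- Hypothetical sample data: one row per hour for two days (48 rows).
DECLARE @i INT = 0;
WHILE @i < 48
BEGIN
    INSERT INTO [dbo].[FactWindowHour] ([DateTimeEvent], [Dimension1Key], [Dimension2Key], [MyNotes])
    VALUES (DATEADD(HOUR, @i, '2018-03-29T00:00:00'), 1, 1, CONCAT('Sample row for hour ', @i));
    SET @i = @i + 1;
END;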

The goal is to capture an export for each hour. The process is driven by the StartTime and EndTime of the window you create. If you need an exact timeframe in each file, say an accounting application that needs all the records from the 30th, make sure your window covers that exact timeframe.

Setting up the Process

You can script this out with PowerShell, which is especially useful when migrating these processes from Development to Production, but the wizard is by far the easiest way when you are just starting out. This walkthrough also assumes that the connections to the Azure SQL Database and your Azure Data Lake Store are already set up.

Starting the Process

Start by going to the Data Factory blade in the Azure Portal and select Author & Monitor from the quick links. The options reviewed here will change depending on the schedule you choose, such as a one-time run or a schedule that runs every hour.

The wizard has a number of options; for this example, we are going to use Copy Data.

Properties

The first blade holds the Properties selections for the pipeline. Make sure you change the generic default names of the objects created during the process; this will make editing and maintenance easier later on. A naming convention will help you find the schedule and objects you need once you have a number of pipelines.

The Properties section will create the Trigger.

Task name: Change to something more meaningful while using a naming convention to allow for clarity later on.

Task Description: Have something meaningful here also; once you have twenty or thirty of these, commenting will make you more productive and solve future issues. Your future self will thank you.

Task Cadence or Task Schedule: You can run this once or on a schedule; we will choose a schedule here.

Trigger Type: This creates either a Schedule trigger, which runs and produces the same output each run, or a Tumbling Window trigger, which lets us capture activity in a specific timeframe, as in this example.

Start Date (UTC): Our earlier conversation about time comes into play here. The schedule is set up in UTC. If your data is in the Eastern Time Zone and needs to shift as daylight saving time comes into effect, you will need to adjust the time here as the year goes on. Try to keep to UTC for your selections; it becomes easier. For example, if you want to capture a day of Eastern Time activity, remember that UTC is 4 hours ahead of Eastern Daylight Time and 5 hours ahead of Eastern Standard Time. A good calculator can be found online. Test, test, test!

Recurrence: This sets up how often you want the schedule to run and what the size of the window will be.

End: This allows you to set a specific end time for the schedule or let it run forever.

The important point is that you cannot change the start time or interval of a tumbling window trigger once it is created; you have to create a new one. This makes sense: if the window could be changed, data could be missed, and the windows are most likely computed in the background.
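For reference, the JSON behind a tumbling window trigger looks roughly like the sketch below; the trigger and pipeline names are hypothetical, frequency, interval, and startTime correspond to the Recurrence and Start Date choices above, and the trigger hands its window boundaries to the pipeline as the windowStart and windowEnd parameters used later in this article:

{
    "name": "HourlyFactArchiveTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 1,
            "startTime": "2018-03-30T00:00:00Z",
            "delay": "00:00:00",
            "maxConcurrency": 1,
            "retryPolicy": { "count": 0, "intervalInSeconds": 30 }
        },
        "pipeline": {
            "pipelineReference": {
                "type": "PipelineReference",
                "referenceName": "CopyFactWindowHourPipeline"
            },
            "parameters": {
                "windowStart": "@trigger().outputs.windowStartTime",
                "windowEnd": "@trigger().outputs.windowEndTime"
            }
        }
    }
}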

Select the Source

Hitting Next brings up the next input set, the Source. You can create a new connection or use one listed on the screen. For this example, we select a previously created Azure SQL DB source.

Highlighting a source shows some detail and also gives you the ability to edit it; however, sensitive information will not be displayed.

Select Next at the bottom of the screen.

Selecting the Data Set

In the next step, we will select the data set we want to capture.

The capture below shows the option to select a table. The Preview displays once a table is selected, and selecting the Schema shows the discovered format of the table. Choosing a table this way produces a SELECT *, so any new columns added to the table are included automatically; with an explicit query, you have to change the query as the table changes. Which approach you take depends on your downstream requirements.

The discovered Schema will display.

You could select the table directly, but we want to bring in a specific query. With an explicit query, you will have to update it if more columns are added; the default table selection is effectively a SELECT *. In this example, we will paste in a query. A query could join a number of tables, but to keep the example simple, we will select from a single table.

We selected the tumbling window, so the following metadata has been added.

Edit the Query Window

This screen gives us a couple of options for our query.

We paste in the SELECT query and hit the Validate button. In the <<columnName>> placeholder, put the datetime column you will be using to drive the window. The capture below shows the validated query with the discovered schema.
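For the table in this example, the pasted query might look like the sketch below, with DateTimeEvent standing in for the <<columnName>> placeholder; the windowStart and windowEnd pipeline parameters are supplied by the tumbling window trigger, and formatDateTime is used so the interpolated values convert cleanly to a SQL datetime:

SELECT [FactKey], [DateTimeEvent], [Dimension1Key], [Dimension2Key], [MyNotes]
FROM [dbo].[FactWindowHour]
WHERE [DateTimeEvent] >= '@{formatDateTime(pipeline().parameters.windowStart, 'yyyy-MM-dd HH:mm:ss')}'
  AND [DateTimeEvent] <  '@{formatDateTime(pipeline().parameters.windowEnd, 'yyyy-MM-dd HH:mm:ss')}'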

Now hit Next.

Select the Destination

Now select Destination and hit Next. Highlighting the icon will show the details and allow you to edit the Destination Connection if required.

Dynamic Folder Paths

Selecting Next allows you to select the Folder Path.

You can browse to the folder you want to use to store the files, which gives you a static location, but because the default behavior is to overwrite the file, in this example we will provide a dynamic folder location instead.

The expression below creates the cascading folders for each run. This version goes down to the hour, so the file name itself is the same for every run; use the same kind of expression for the File Name if you want to carry the window down to that level (a sample file-name expression follows the folder path below).

@{concat('/SystemRun/RawDateHour/', 
    formatDateTime(pipeline().parameters.windowStart, 'yyyy'), '/', 
    formatDateTime(pipeline().parameters.windowStart, 'MM'), '/', 
    formatDateTime(pipeline().parameters.windowStart, 'dd'), '/', 
    formatDateTime(pipeline().parameters.windowStart, 'HH'))}
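If you want the window reflected in the file name as well, a similar expression works for the File Name field; this is only a sketch, and the FactWindowHour_ prefix is arbitrary:

@{concat('FactWindowHour_',
    formatDateTime(pipeline().parameters.windowStart, 'yyyyMMdd_HH'),
    '.csv')}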

Note: You can also use “Partition by” when writing to Azure Blob Storage. See Example.

It is very important to note that the capital "HH" gives the 24-hour clock hour and the capital "MM" gives the zero-padded month ("05", for example); lowercase "mm" would give minutes instead.

The next screen provides settings for the file. I usually add headers, but not always; it depends on the requirements of the downstream workflows. In U-SQL, for example, you skip the headers on import. Just be consistent across the project!

Setting up Fault Settings if Things Go Wrong

On the Settings tab, you can set what happens to the copy if there are problems with the rows. Normally, Skip and log incompatible rows is the best way to handle this, but you may have something specific you need to do in the flow when an error occurs.

The various fault settings are shown below.

Note: See Rerunning Failed Data Slices and Automatic Data Dependency Tracking, which includes how to monitor the execution of slices visually and how to re-run failed slices.
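Under the covers, the skip-and-log choice ends up as fault-tolerance settings on the copy activity, along the lines of the fragment below; the linked service name and log path are hypothetical, and the log location must point at a storage account:

"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
    "linkedServiceName": {
        "referenceName": "MyAzureBlobStorageLinkedService",
        "type": "LinkedServiceReference"
    },
    "path": "errorlogs/factwindowhour"
}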

Summary

Now hit Next to see the creation summary. Note the names of the data sets and destination items; select Edit to give anything you missed during creation a more descriptive name.

Deployment

Hitting Next will deploy the objects and provide the success or failure of each object.

After Your First Run!

On the Monitor tab, you will see the runs based on your schedule and whether they were successful.

In the Parameters column, selecting an entry shows the values that were used for that run.
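For an hourly tumbling window, the parameter values for a single run look something like the following (illustrative values, matching the 04 UTC example earlier):

windowStart: 2018-03-30T04:00:00Z
windowEnd:   2018-03-30T05:00:00Z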

Looking at the file, you will see the records pulled during that window.

Conclusion

The ability to easily set up an archive routine with a sliding window to copy data from source systems to Azure Data Lake is a real time-saver. The dynamic file names and folder structure allow a data developer to create these objects and automate them easily.

Resources

  • Azure Integration Runtime: The Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide data integration capabilities across different network environments: Data Movement, Activity Dispatch, and SSIS Package Execution.
  • Power BI Data Gateway: With the on-premises gateways, you can keep your data fresh by connecting to your on-premises data sources without the need to move the data.
  • Pipeline Execution Triggers
  • How to Create a Schedule Trigger
  • How to Create a Tumbling Window Trigger
  • How to Create an Event Trigger
  • Azure Data Factory V1 Scheduling
  • Azure Data Factory Scheduling and Execution
  • Copy data to or from a file system by using Azure Data Factory
