Power BI gives you the ability to use a wide variety of data sources in your data visualization project. As more data visualization workloads include big data sources, having the ability to use sources such as Azure Data Lake (ADL) is becoming more common. This walkthrough shows you how to access files in a Data Lake and import them into a Power BI data model.
Scenario-Daily Files Drop to ADL
The real-world scenario in this example has a series of files produced each day, streamed into a time-based folder structure as pictured below.
The log files generated each hour are captured in a folder using the format, Year\Month\Day\Hour\*.csv.
These raw files are then combined into analytic extracts using U-SQL or Data Bricks and placed into an area within the Data Lake for consumption in other tools and processes.
This allows Power BI to use the same location to consume the data as the sources are refreshed. This keeps the heavy lifting to a minimum in the ingestion and automation layer.
The folder is stored in the data lake, secured by an Azure Active Directory account.
Formatting of Files
There are a couple of things to keep in mind when creating files for import.
First, many files generated for big data consumption do not include column headers. If you do not have column headers, you will have to add column names later, which can become a lot of work.
Second, the format of the files being in comma or at least delimited format can be a big help. You can import and transform unstructured or report-style files using the Power BI Query Editor, but the more formatted the inputs, the higher the productivity boost of your group.
Note that there is a move to parquet files. Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Think of it as a text Excel file. The file retains the data format information and is compressed, saving space and memory.
If you have a group of people who consume the files for different projects, each group would have to duplicate the transformation process when using the files. Our example pictured below has a storage account that contains a number of folders. Capturing these files in this location will allow us to perform refreshes to the data model as these files are updated over time.
Step 1: Getting the Azure Data Lake Path Folder Path
- In the Azure Administrative portal, bring up your Azure Data Lake instance.
- Open the Data Explorer and navigate to the folder containing your source data.
- Open the Folder Properties blade by clicking on the three dots, then select the option.
On the Folder Properties blade, the URL of the folder you will need for setting up the Power BI datasource is listed under Path.
Copy this path to the clipboard.
Step 2: Use Power BI “Get Data” to Link Files
The next step,
- From Power BI Desktop, connect Power BI to the data using the Get Data toolbar item
- The Azure section has the Azure Data Lake connector,
- Select and then Connect.
- Paste Path from the above step 1 into the URL
- Select the FileSystemView option
- and select OK.
You now need to sign into the resource.
If this is your first time connecting, you must sign in using your Azure Active Directory Organizational Account, or the account your Operations area has provided.
- Select Sign-In, which will bring up the standard Azure sign-on.
Step 3: Load this query into Power BI
This brings in the listing of all the files located in the folder.
There are two choices, Load or Edit. At this point, I prefer to Load as this will bring the file listing into Power BI as a data source with the associated file metadata.
I can save the Power BI book and carry on and load each file into the data model.
Step 4: Edit the Queries – Loading and Shaping
With the source listing loaded, we go back into the Edit Queries to show the file listing Query.
As shown below, each file is listed in the Query. Power BI shows the files as Binary.
There are two different ways to load the file. If you double-click on the “Binary” link, an assumption on the file as to what type it is loads into the model.
This may or may not be how you want to deal with the file, depending on the format that is assumed. Instead, for our example, we right-click on one of the files and add it as a new query, as shown below.
This will also make the properties of the files available for reporting, to show when the file was last modified for example.
The file will then be displayed as a new query and is presented as a file in the query window, as shown below. You then right-click on the file and select how you want the data to be handled on the import. This is rather straightforward with a .CSV formatted file, but you may have another extension you want to make sure is handled in a specific way.
The files will appear as separate queries ready for transformations in the Query Editor.
Notice the Applied Steps pictured below. You can continue to make any modifications, shape or transform the data into a more analysis-ready format, then load it into your model.
Now repeat for each file you want to load into the model, which will give you a set of queries as pictured below.
When ready, select Close & Apply to load into the model.
Step 5: Ready for Clean-Up
Once you have completed the steps, your data is ready for the data modeling process.
I hope you found this quick introduction useful. I am seeing more clients wanting to do some quick ad-hoc analysis with their data lake. Power Bi provides that quick end-to-end ad-hoc tool.