From Chaos to Clarity: The 8 Stages of Data Preparation
Raw data can be challenging to analyze due to its incomplete and disorganized nature. Data preparation is crucial in data analysis and machine learning projects. This article will explore basic concepts and provide details on the various aspects of how to prepare data. Despite potentially being a laborious process, it is necessary for acquiring precise and dependable insights from data.
What is Data Preparation?
Data Preparation involves collecting, cleaning, transforming, and organizing raw data into a format suitable for analysis or modeling. Let’s break this into 8 stages of Data Preparation:
- Why is Data Preparation Important? – The more you can prepare the data for analysis, the less work downstream will be required, especially for each analyst that will use it.
- Collecting Data – Bring data from various sources such as databases, websites, and even Excel spreadsheets from end users.
- Exploring the Data – Used to gain insights into the data, identify patterns and relationships between variables, and detect any potential issues or anomalies that may require further investigation.
- Cleaning Data – In this step, you mainly remove errors, duplicates, and irrelevant data to ensure the data’s accuracy and consistency based on the exploration phase results.
- Transforming Data – Converting data into a form that can be easily analyzed, such as normalization, scaling, or dimensionality reduction.
- Data Integration – You can combine multiple sources once the data is clean. However, you still need to ensure it is in a consistent format and structure.
- Data Enrichment – Adding new data to the existing dataset through external sources or data generation techniques.
- Operationalizing – Automating and streamlining the tasks to improve efficiency, repeatability, and scalability.
Stage 1. Why is Data Preparation Important?
When you get deeper into a data analysis project, your chance of failure or delay is significantly higher if the data you are working with has any issues. If you discover data problems once you have started designing or building your solution, these issues could be catastrophic to your timeline and budget.
A number of studies examine the importance of Data Preparation:
- Data preparation activities such as loading and cleaning accounted for 45% of data scientists’ time, according to a survey from Anaconda. Source: alteryx.com
- Data preparation accounts for about 80% of the work of data scientists. Source: forbes.com
Data preparation is essential to the data process and can be challenging to complete correctly. A Talend.com survey found that 76% of data scientists considered data preparation the least desirable aspect of their job, yet it is also critical for making well-informed and accurate business decisions.
If you make a mistake, your end users will lose faith and trust in your work. One of the most important aspects of my analytical projects was being careful and finicky when preparing raw data to ensure high-quality resulting data.
Having the ability to reuse data is very important. You spend a great deal of time collecting and processing data. If you do a good job on this data set, you can use your output as input in subsequent analysis or as a source later in your data chain.
Stage 2. Critical Elements of Data Collection
Data collection is a critical aspect of any data preparation. You bring data from various sources such as databases, websites, and even Excel spreadsheets from end users.
“The success of your end product will depend on the decisions you make in the data collection phase.”
Here are some key aspects to consider when collecting data:
- Defining the Goals and Requirements – Define the purpose of your data collection, including the research question, goals, and objectives. Clear definitions of requirements, goals, and narratives can enhance the significance of your analysis while minimizing the effort required to obtain relevant information.
- Data sources – Identify the sources of data, including primary sources, such as surveys and interviews, and secondary sources, such as public records and databases. In larger environments, you may need to obtain security clearance or permission to use the data, which might take time. How volatile the data is, what its update schedule is, and whether you must set up a refresh routine are all things to consider.
- Data Discovery – The first step is determining what data you have through Data Profiling. You are evaluating and examining the data to see any patterns or trends, looking at labels, null values, and data types, and looking for data that could cause errors later in the process. You can sample the data and provide a profile.
There are many ways to do this manually or develop a more automated process. The main output of this step is to examine, analyze and create summaries to learn more about the data source, data sample, or data set.
Power Query in Power BI, for example, provides the user with data profiling functionality within the tool.
- Instrumentation – Develop and test data collection routines or orchestrations to automate the collection of data. You could be using tools, such as questionnaires or sensors, and you need to ensure they capture the data needed. Even if you think this analysis or project is a one-time endeavor, always keep notes and sources; I was regularly asked to reproduce or update the analytics, even a year or so later.
- Data quality – This is perhaps the most important to your reputation. You must ensure that the data collected is accurate, reliable, and valid through careful design and monitoring.
- Data storage – Choosing an appropriate method for storing, managing, and securing your data, such as databases or cloud storage, MUST be done in such a way as to protect confidentiality and privacy. How long you store the data and other data governance concerns must be examined and documented at some point.
- Data analysis – Plan for data analysis from the start, including selecting appropriate statistical methods, software, and tools. You also need to make sure you have all the base data for calculations. If you want to look at “This Year vs. Last Year,” you need last year’s data.
If you want to examine calculations in the source data, such as margin percentage, you need not only the base data but also the right grain, or level of detail, at which the calculation was made in the source system. This way, as you roll up the data, your calculations will be correct.
- Ethical considerations – Ensure that the data collection follows ethical guidelines and principles, such as informed consent, confidentiality, and transparency. An article by Catherine Cote of the Harvard Business School offers a great definition:
“Data ethics encompasses the moral obligations of gathering, protecting, and using personally identifiable information and how it affects individuals.” Source: 5 PRINCIPLES OF DATA ETHICS FOR BUSINESS.
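The Data Discovery item above describes profiling a sample by looking at null values and data types. As a minimal sketch of that idea, assuming pandas is available, the helper below builds a small profile table. The `sample` data and the `profile` function are made up for illustration and are not part of any tool mentioned in this article.

```python
import pandas as pd

# Hypothetical sample; in practice this would be read from your source.
sample = pd.DataFrame(
    {
        "store": ["A", "B", "C", "D"],
        "sales": [120.0, None, 95.5, 130.0],
        "opened": ["2020-01-15", "2019-07-01", None, "2021-03-10"],
    }
)


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: data type, null count, and distinct count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "nulls": df.isna().sum(),
            "distinct": df.nunique(dropna=True),
        }
    )


summary = profile(sample)
print(summary)
```

A table like this is a quick way to spot columns with unexpected types or many missing values before any cleaning starts.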
Your data collection must consider key aspects to obtain relevant, accurate, and reliable data for meaningful analysis.
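The point about grain in the Data analysis item can be illustrated with margin percentage. The sketch below, assuming pandas and two made-up stores, compares averaging stored percentages with recomputing the percentage from rolled-up base data. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical store-level base data.
detail = pd.DataFrame(
    {
        "store": ["A", "B"],
        "sales": [1000.0, 100.0],
        "cost": [900.0, 50.0],
    }
)
detail["margin_pct"] = (detail["sales"] - detail["cost"]) / detail["sales"] * 100

# Misleading: averaging the percentages ignores each store's sales volume.
average_of_percentages = detail["margin_pct"].mean()

# Safer: roll up the base data first, then recompute the percentage.
total_sales = detail["sales"].sum()
total_cost = detail["cost"].sum()
rolled_up_margin_pct = (total_sales - total_cost) / total_sales * 100

print(f"average of percentages: {average_of_percentages:.1f}")
print(f"recomputed from totals: {rolled_up_margin_pct:.1f}")
```

The two results differ because the large store dominates the totals, which is why keeping sales and cost at the source grain matters.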
Stage 3. Data Exploration – What Do You Have?
Data exploration is crucial in data analysis as it allows analysts to understand the data better and make informed decisions about data preprocessing and required methods for further analysis.
Unstructured data makes up 80% of data today. Source: aws.amazon.com
This can also be called Data Profiling, which involves analyzing the dataset to identify patterns, relationships, and anomalies, which can be used to enhance the dataset and provide additional insights. Here are some of the ways you can use data exploration to discover what you have:
- Identify the variables – By exploring the data, you can identify the variables in your dataset and understand their distribution, range, and possible values. This will help design and develop any spreadsheets, databases, or models. Determining any data you do not require is important at this stage as it may interfere with your analysis.
- Check for missing values – Data exploration helps you identify missing values in the dataset, which can be critical for data cleaning and inference. There are a few questions to consider when you are looking at missing values:
- Can you replace, estimate or fill in any missing values?
- Can you find other sources?
- Do the missing values impact the validity of your analysis?
- Examine distributions – You can use visualizations and descriptive statistics to examine the distributions of the variables in the dataset. The distribution of a variable refers to how the different values are spread across your dataset. You are looking for what patterns you see in the data. Is there anything unusual? Some tools include mean, median, minimum and maximum, and standard deviation statistics.
Remember that you cannot replace the human element; even just scanning the values, you can notice patterns that might not be apparent to a machine. Ask questions such as: Why is December so large? Oh yes, that was Christmas. Why are this store’s sales zero? Ah, yes, the renovation.
- Identify patterns and relationships – Data exploration allows you to identify patterns and relationships in the dataset, such as correlations or trends. Are there comparisons you can use in visualizations or trends that you can use to help your analysis?
We use Python to examine 3 data sets in the following example diagram. We calculate the correlation coefficients using numpy.corrcoef and plot the data points using matplotlib.pyplot.scatter. We can see the strong and weak correlations.
- Determine outliers – You can use data exploration techniques to identify outliers, which are values significantly different from other values in the dataset. These are not necessarily errors but may be items you need to explain if they throw off or mislead possible results. If your observations are all between 1 and 5 but one is 10, this could require more information. Is this an error or a possible result?
- Evaluate data quality – By exploring the data, you can identify potential errors or inconsistencies and take steps to correct them. With these explorations, you can determine the data’s quality and fitness for use. Data may be of poor quality but still useful in explaining some information.
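The correlation check mentioned in the list above can be sketched as follows. This is an illustration only: the three data sets are randomly generated here, not the ones from the original diagram, and the noise levels are assumptions chosen to produce a strong, a weak, and a near-zero correlation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable
x = np.linspace(0.0, 10.0, 200)

# Three made-up data sets with decreasing dependence on x.
datasets = {
    "strong": 2.0 * x + rng.normal(0.0, 1.0, x.size),
    "weak": 2.0 * x + rng.normal(0.0, 20.0, x.size),
    "none": rng.normal(0.0, 10.0, x.size),
}

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
correlations = {name: np.corrcoef(x, y)[0, 1] for name, y in datasets.items()}

for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
```

To see the same relationships visually, each pair could be passed to matplotlib.pyplot.scatter(x, y); plotting is left out here to keep the sketch dependent on NumPy only.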
Overall, Data Exploration allows individuals to comprehensively understand their dataset, which includes assessing data quality, identifying variable relationships, and detecting patterns suitable for analysis and decision-making. This is one of the more important steps for data preparation for machine learning.
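Two of the exploration checks above, counting missing values and flagging outliers, can be sketched with pandas. The ratings below are invented to match the "between 1 and 5, but one is 10" example, and the 1.5 × IQR rule is one common convention, not the only way to define an outlier.

```python
import pandas as pd

# Hypothetical ratings that should fall between 1 and 5.
ratings = pd.Series([3, 4, 2, 5, 1, 4, 3, None, 10, 2], name="rating")

# Missing values: how many observations have no rating at all?
missing_count = int(ratings.isna().sum())

# Descriptive statistics: count, mean, std, min, quartiles, max.
print(ratings.describe())

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = ratings.quantile(0.25), ratings.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ratings[(ratings < lower) | (ratings > upper)]

print(f"missing: {missing_count}, flagged: {outliers.tolist()}")
```

A flagged value is only a prompt for a question; whether it is an error or a genuine result still needs a human decision.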
Stage 4. What is Involved in the Data Cleaning?
In the Data Cleaning step, you mainly remove errors, duplicates, and irrelevant data to ensure accuracy and consistency based on the exploration phase results. This can be one of the most challenging yet important steps if you are evaluating the data for the first time.
The outcome is correcting inaccurate data or problems with the data feed. Always go back to the source. Could a data collection issue be corrected for future data feeds?
Be sure to make notes and detail what you are doing, as you always need to show your work and back up any decisions you make at this stage. Decisions here could affect the results of the analytics, and you always need to explain your work!
Here are some of the steps that can be involved in cleaning the data:
- Identify and remove duplicates – Check for any duplicate data points in the dataset and remove them to avoid skewing the analysis. You need to be careful how you define a duplicate to avoid removing important information. Things to watch for include data entered twice, records covering a partial or missing range, and values recorded on a different scale or unit of measurement.
- Handle missing data – When dealing with missing data, it’s important to consider factors such as the type and extent of the missing data. ALWAYS think of how it’s likely to affect the analysis. Based on these considerations, you may choose an appropriate strategy, such as assigning a value based on the existing data or even deletion.
- Address outliers – Identify outliers significantly different from other data points and determine whether to remove or adjust them. The key is not to remove meaningful data. Is it erroneous data or a result you need to evaluate?
- Standardize data & Resolve inconsistencies – Ensure the data is in a consistent format and standardize the data types to avoid confusion or errors in the analysis. Creating a data table with the same data type will allow functions to evaluate. You cannot perform math functions if you have a column of ages, but some values are text, not numbers. If some of your measurements are in Fahrenheit (°F) scale, Celsius (°C) scale, and Kelvin (K) scale, for example, you may need to standardize to allow comparisons and analysis.
- Check for accuracy – Verify the accuracy by comparing it with external sources, such as official statistics or other datasets. Keep a critical eye on information or a source you are not familiar with. It is always great to have your own first-hand research to use as a spot check.
Remember to use your end-users or data stewards for assistance. They have experience with the datasets and can be invaluable.
- Validate data – Validate the data by running quality control checks to ensure that the data meets the defined criteria for accuracy and completeness. Always spot-check calculations and information. Have various teams been involved in writing testing criteria?
Do your totals include all the information? Do you have a filter? One great check is to have a counter. If I have 25 values in one list and 25 in the second, I should have 50 when I combine them. Balance or Cross-Footing totals are always a great validation. See great visuals for more information – What is footing and cross-footing? – Universal CPA Review
- Document the process – If I had a dollar for every time I needed to revisit how something was done! Document the steps taken to clean the data, including any decisions made and any assumptions or modifications made in the process. You should also get a sign-off and review of your decisions with data owners.
Make that documentation available to others who use your data. This is a great beginning for handover documentation. This helps increase trust in your process and is backed up when discussions about issues arise.
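Several of the cleaning steps above (removing duplicates, standardizing data types, and converting mixed temperature scales) can be sketched with pandas. The `raw` table, its column names, and the `to_celsius` helper are hypothetical, and treating only fully identical rows as duplicates is an assumption you would need to review for your own data.

```python
import pandas as pd

# Hypothetical raw readings with a duplicate row, a text age, and mixed units.
raw = pd.DataFrame(
    {
        "id": [1, 2, 2, 3, 4],
        "age": ["34", "41", "41", "unknown", "29"],
        "temp": [98.6, 37.0, 37.0, 310.15, 36.5],
        "unit": ["F", "C", "C", "K", "C"],
    }
)

# 1. Remove exact duplicates (every column equal).
clean = raw.drop_duplicates().reset_index(drop=True)

# 2. Coerce age to numeric; values that cannot be parsed become NaN for review.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")


# 3. Standardize temperatures to Celsius.
def to_celsius(value: float, unit: str) -> float:
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    if unit == "K":
        return value - 273.15
    return value  # assumed to be Celsius already


clean["temp_c"] = [to_celsius(v, u) for v, u in zip(clean["temp"], clean["unit"])]
print(clean)
```

Keeping the original `temp` and `unit` columns alongside `temp_c` makes it easier to show your work when someone asks how a value was derived.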
Ensuring data accuracy, reliability, and consistency through these steps is crucial for conducting analysis.
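The counter and balance checks described under "Validate data" can be sketched as below, assuming pandas. The two lists and the small store-by-month grid are invented for illustration.

```python
import pandas as pd

# Two hypothetical lists to be combined.
first = pd.DataFrame({"store": ["A", "B"], "sales": [100.0, 250.0]})
second = pd.DataFrame({"store": ["C", "D", "E"], "sales": [75.0, 125.0, 50.0]})

combined = pd.concat([first, second], ignore_index=True)

# Counter check: rows in must equal rows out.
rows_match = len(combined) == len(first) + len(second)

# Balance check: the combined total must equal the sum of the source totals.
source_total = first["sales"].sum() + second["sales"].sum()
totals_match = abs(combined["sales"].sum() - source_total) < 1e-9

# Cross-footing: the sum of the column totals must equal the sum of the row totals.
grid = pd.DataFrame({"Jan": [10, 20], "Feb": [30, 40]}, index=["A", "B"])
cross_foots = grid.sum(axis=0).sum() == grid.sum(axis=1).sum()

print(rows_match, totals_match, cross_foots)
```

Checks like these are cheap to run after every combine step and catch dropped rows or an accidental filter early.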
Stage 5. What is Data Transformation?
Data transformation involves converting data from its initial format or structure to another so that it can be used for analysis. This includes a process of filtering, restructuring, and combining different data sources to create data suitable to answer the questions at hand.
Here are some examples of data transformation:
- Data aggregation – Aggregating data involves summarizing data by grouping it into categories or subsets. Be careful when you do calculations and summaries. Downstream analytics may need the detail that makes up the new totals or calculations. Getting rid of detail is always a hard decision, because it is difficult to recover later.
- Data normalization – Normalizing data involves transforming it to a common scale or format so it looks and reads the same across the set. This may involve:
- Scaling data to a common range, such as between 0 and 1
- Converting categorical data to numerical values.
- Converting phone numbers from xxxxxxxxxx to xxx-xxx-xxxx by putting in dashes.
- Creating a common date format may be the most difficult aspect. This article from 2003 never gets old, Normalizing Dates – SQLServerCentral, and well worth the read.
- Data Restructuring – Restructuring data involves changing the structure or format of the data. This may involve pivoting data to a wide or long format, transposing data from rows to columns, or vice versa.
Think of a table of Rows of Stores by Months as Columns. You could transform this to Months on rows and Stores on Columns to match how your users will analyze this data. Always think of how your analysts will use the data to determine the best format.
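Three of the transformations above (scaling to a 0 to 1 range, adding dashes to phone numbers, and restructuring stores and months) can be sketched with pandas. The `sales` table and the `format_phone` helper are hypothetical, and the sketch assumes ten-digit phone numbers and a non-constant amount column.

```python
import pandas as pd

# Hypothetical long-format sales data.
sales = pd.DataFrame(
    {
        "store": ["A", "A", "B", "B"],
        "month": ["Jan", "Feb", "Jan", "Feb"],
        "amount": [100.0, 150.0, 200.0, 300.0],
        "phone": ["5551234567", "5551234567", "5559876543", "5559876543"],
    }
)

# Min-max scaling to the range 0..1 (assumes the column is not constant).
low, high = sales["amount"].min(), sales["amount"].max()
sales["amount_scaled"] = (sales["amount"] - low) / (high - low)


def format_phone(digits: str) -> str:
    """Convert xxxxxxxxxx to xxx-xxx-xxxx; leave anything else untouched."""
    if len(digits) == 10 and digits.isdigit():
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return digits


sales["phone"] = sales["phone"].map(format_phone)

# Restructuring: stores on rows and months on columns...
stores_by_month = sales.pivot(index="store", columns="month", values="amount")
# ...then transposed so months are on rows and stores are on columns.
months_by_store = stores_by_month.T

print(stores_by_month)
print(months_by_store)
```

Which of the two layouts to publish depends on how your analysts will read the data, as noted above.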
Stage 6. What is Data Integration?
Once the data is clean, you can combine it from multiple sources and ensure it is in a consistent format and structure. Integrating data means combining data from different sources into fewer datasets, or a single one. This may involve merging datasets based on common variables or creating new variables based on the combined data.
It is always good practice to process the data as close to the source as possible. The source system is usually more powerful than the Excel worksheet the analyst may be using. Watch out for downstream analysts that would have to repeat the same process steps to use the same dataset. This is almost always a group effort.
A good example of this is Employee and Sales system data. Creating a view or data set that includes both sales and employee data allows bonus calculations and payments. In this example, be careful of the different security and rules of each system you work with. Some users have rights to Employee data, and some may have rights to Sales data. Access to the new data set may require the combination of both sets of rights.
The following are important elements of data integration.
- Data sources – Identify the different sources of data that need to be integrated, such as databases, spreadsheets, or APIs. The key to success is getting each data set into a unified format and selecting tools to connect and orchestrate the process.
For instance, it may be necessary to integrate an Azure SQL store and an Oracle database. The most difficult aspect may be ensuring security and access. The data could be exported to a common format, for example, CSV or Apache Parquet, an open-source column-based data file format designed for efficient storage and retrieval.
- Data mapping – This is the process of connecting fields from one database to another. It must be completed with accuracy, as mistakes could cause a domino effect on the data flow. For example, one table could hold a Delivery Date and another an Order Date, yet both fields are simply labeled date in their source systems.
Every element must be mapped in a common format or schema for analysis by recognizing similar variables and resolving discrepancies. Field names and formats must be considered, especially on columns that map or join the data.
- Data Transformation – The data you combine may have been transformed already, making your job easier. However, handling further transformations as you process the exchange is a common task. The goal becomes ensuring that your result remains consistent and accurate.
This may involve cleaning the data, handling missing values, and normalizing the data based on the new combination of data. For example, how do you handle missing values? Do they become a zero, null or other value? The fact that there is no data point may be a useful analysis point downstream.
- Data loading – This step involves loading the transformed data into a single repository, such as a data warehouse or data lake. This process can be the most complex and must be robust and traceable. One of the gotchas that people fall into is not planning for repeatability.
Several important things to remember are:
- What do you do with the original files?
- What happens when you have an error? Do you skip it or stop the process?
- Do you need to have incremental updates? If so, how do you determine what data is new?
- Do you need to track the changes from one load to another? This can be important not only for auditing but for your Data Scientists that need to explain changes in models.
- Do you have to change the data structure of each system to include a load flag or date loaded?
- Data quality – Ensure that the integrated data is of high quality by verifying the accuracy, completeness, and consistency of the data. Doing checks on the joined data is very important. Did the joins come together correctly?
- Scalability and Performance – Design your data loading process to handle large volumes of data and accommodate future growth. Optimize performance by using techniques like parallel processing, batching, and indexing.
- Check if there will be an initial history load. I have found teams that design for regular loads, but when the system is deployed, there could be years of data that need to be loaded. Can the system you design handle that? Will the transaction log balloon?
- It is important to monitor system performance and proactively address bottlenecks or other issues that may slow the process as the data grows.
- Error Handling and Recovery – Implement robust error handling procedures to manage issues that may arise during the data loading. This includes logging errors, retrying failed operations, and providing clear error messages to users. Develop a recovery plan to handle potential data loss or corruption, ensuring minimal impact on your system and data integrity.
Create logs during your process. Life is so much easier when you can trace your process. Be careful, however: exposure of sensitive data in error messages, log files, and database backups has affected a significant number of individuals.
- Security and Compliance – Defend your data loading process against unauthorized access and breaches. Implement proper access controls, encryption, and data masking as needed. The location of your source data should be secured and protected the same way as your production data. Payroll data that is secured in the source system must not be put at risk by, for example, giving the Data Lake wider access.
Ensure your process complies with relevant data protection regulations and industry best practices, such as GDPR or HIPAA. Regularly review and update security measures as threats evolve.
- Data governance – Establish policies and procedures to ensure the integrated data is managed and used securely and ethically. As data usage and sharing continue to grow within organizations, it is important to balance governance with the need for innovation and impactful solutions, whether they are on-premises or in the cloud.
Establishing a set of Data Governance Principles is crucial for ensuring data’s safe and appropriate use in achieving success. This will help your team not to reinvent the wheel every time they start a project.
More information on Data Governance can be found on this site: Why You Need a Data Governance Process Now and Introduction to Microsoft Purview for Data Governance.
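The Error Handling and Recovery item above mentions logging errors and retrying failed operations. A minimal sketch of that idea, using only the Python standard library, might look like this. The `with_retries` helper and the `flaky_load` function are hypothetical stand-ins for a real load step, and logging only the exception type is one assumed way to keep sensitive values out of the log.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("data_load")


def with_retries(operation: Callable[[], T], attempts: int = 3,
                 delay_seconds: float = 0.0) -> T:
    """Run operation, logging each failure and retrying up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Log the error type only; messages may contain sensitive values.
            logger.warning("Attempt %d of %d failed: %s",
                           attempt, attempts, type(exc).__name__)
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(f"Load failed after {attempts} attempts") from last_error


# Hypothetical load step that fails twice before succeeding.
calls = {"count": 0}


def flaky_load() -> int:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated outage")
    return 25  # number of rows loaded


rows_loaded = with_retries(flaky_load, attempts=3)
print(f"loaded {rows_loaded} rows after {calls['count']} attempts")
```

Whether to retry, skip, or stop the whole process on an error is a design decision; this sketch stops with an exception once the attempts are used up.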
Data integration is critical in data analysis because it allows analysts to access and analyze data from multiple sources. By integrating data, analysts can gain a more comprehensive view and identify patterns or relationships that may not be apparent from individual sources.
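The Employee and Sales example, together with the question "Did the joins come together correctly?", can be sketched with a pandas merge. The tables and column names below are made up. The `validate` and `indicator` arguments are standard pandas merge options used here to check the join.

```python
import pandas as pd

# Hypothetical extracts from an Employee system and a Sales system.
employees = pd.DataFrame(
    {"employee_id": [1, 2, 3], "name": ["Ana", "Ben", "Cho"]}
)
sales = pd.DataFrame(
    {"employee_id": [1, 1, 2, 4], "amount": [500.0, 250.0, 300.0, 125.0]}
)

# validate="many_to_one" raises if employee_id is duplicated in employees;
# indicator=True adds a _merge column showing where each row came from.
joined = sales.merge(
    employees, on="employee_id", how="left", validate="many_to_one", indicator=True
)

# A left join on a unique key must not change the number of sales rows.
assert len(joined) == len(sales)

# Sales rows with no matching employee need follow-up, not silent removal.
unmatched = joined[joined["_merge"] == "left_only"]
print(unmatched[["employee_id", "amount"]])
```

Here the sale recorded against employee 4 has no match in the Employee extract, which is exactly the kind of mapping issue worth raising with the data owners.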
Stage 7. What is Data Enrichment?
Data enrichment is the process of adding to existing data with additional information to provide a more complete and useful dataset for analysis. This can involve adding new variables or features to the dataset.
Here are some examples of data enrichment:
- Adding External Data – Adding data from external sources, such as demographic or economic data, can provide additional context and insights into the dataset. As one example, I produced a set of retail analyses by store and combined it with weather data from each location. This allowed for more detailed analysis and forecasting and better results.
- Geocoding – Geocoding involves adding geographic coordinates to the dataset, which can be used to analyze based on location. Combining customer location with other demographic, social, or environmental data allows the exploration of relationships and trends that might not otherwise be visible.
I was recommended this article by one customer who really put the importance of this into practice. Geocoding large population-level administrative datasets at highly resolved spatial scales – PMC (nih.gov)
- Text Mining – Since the birth of Big Data and Data Mining, Text Analytics has been the mainstay of machine learning. One of the most important uses is sentiment analysis, which involves analyzing text data, such as social media posts or customer feedback, to extract insights and gauge sentiment.
Another example is a Recommender System, which predicts user preferences or interests based on their historical data, behavior, and interactions with a platform. The system uses machine learning algorithms to analyze user data and provide recommendations for various items, including movies, music, books, products, and services. The key is to help users find relevant and engaging content.
- Image or video Analysis – Analyzing image or video data can provide additional insights into the dataset, such as identifying objects or patterns.
Data enrichment can help organizations better understand their data and make more informed decisions based on the insights gained. Analysts can identify patterns and trends that may not be apparent in the original dataset, which can lead to new opportunities and strategies for the organization.
Stage 8. Operationalization – Let’s Do This Again
Operationalizing Data Preparation involves automating and streamlining the tasks to improve efficiency, repeatability, and scalability. When you are successful, people will come back and ask for updates or for you to do it again.
There are several methods to operationalize data preparation, including:
- Great Documentation – Having your process documented, online, and available will not only cut down on the questions but will allow you to gain trust by having an open process. See Wikimedia Blog/Converting wiki pages to blog posts – Meta for one approach to converting wiki pages to blog posts.
- Data pipelines – Data pipelines are automated workflows that move data from source systems to target systems while transforming and processing the data along the way. They are essential for any data analyst as they allow for the efficient and accurate movement of data from source systems to target systems. Data pipelines can help streamline the data preparation process.
- Data integration platforms – Data integration platforms are software tools that help organizations integrate data from various sources and prepare it for analysis. They enable organizations to access, transform, and consolidate data from multiple sources into a single repository. This is done by providing an easy-to-use interface that allows users to connect different data sources and manipulate the data to their needs. Users can then use this prepped data for various purposes, such as analytics, reporting, or machine learning.
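The data pipeline idea above, an automated workflow that transforms data step by step, can be sketched in plain Python. This is an illustration only, not how any particular pipeline product works: the step functions, the `run_pipeline` helper, and the sample rows are all hypothetical.

```python
from typing import Callable

Rows = list[dict]
Step = Callable[[Rows], Rows]


def drop_incomplete(rows: Rows) -> Rows:
    """Cleaning step: drop rows with no sales value."""
    return [row for row in rows if row.get("sales") is not None]


def add_margin_pct(rows: Rows) -> Rows:
    """Transformation step: derive margin percentage (assumes sales is non-zero)."""
    return [
        {**row, "margin_pct": (row["sales"] - row["cost"]) / row["sales"] * 100}
        for row in rows
    ]


def run_pipeline(rows: Rows, steps: list[Step]) -> tuple[Rows, list[tuple[str, int, int]]]:
    """Apply each step in order and record row counts before and after."""
    audit = []
    for step in steps:
        rows_in = len(rows)
        rows = step(rows)
        audit.append((step.__name__, rows_in, len(rows)))
    return rows, audit


# Hypothetical source rows.
source = [
    {"store": "A", "sales": 200.0, "cost": 150.0},
    {"store": "B", "sales": None, "cost": 80.0},
    {"store": "C", "sales": 100.0, "cost": 60.0},
]

prepared, audit = run_pipeline(source, [drop_incomplete, add_margin_pct])
for name, rows_in, rows_out in audit:
    print(f"{name}: {rows_in} rows in, {rows_out} rows out")
```

Because the steps are listed in one place and every run records its row counts, the same preparation can be repeated and traced when someone asks for an update.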
Data preparation is an important step in the data analysis process that involves cleaning, transforming, and integrating data to create a dataset that is suitable for analysis. It requires careful planning, attention to detail, and the use of appropriate tools and techniques.
Proper data preparation allows analysts to obtain comprehensive insights and make well-informed choices. This iterative process may require analysts to revisit it several times during analysis. By investing time and effort into data preparation, organizations can improve the quality of their data and gain a competitive edge in their industry.