Python Data Skills 14: Automated Data Cleaning Part 1
Automated Inspection, Automated Summary, Automated Outlier Detection, Easy Merges
The good thing about data cleaning is that it's the same step-by-step process over and over again. Start by looking for blanks in your data. Then look at some correlations to spot patterns. Then use what you saw, plus some data skills, to fix the messy stuff. Then use your domain knowledge to add extra features if needed. And… voilà, the data is ready for machine learning predictions.
Anyway, this post is about how to automate that process. If you follow the steps the right way, congrats: the work that takes most people an entire week, you can get done in 3 hours. Cheers
Table of Contents
Automated Inspection
Automated Summary
Automated Outlier Detection
Easy Merges
FYI, I recommend you copy the functions below and store them in a GitHub repo somewhere. It will save you plenty of time, since you'll be reusing them over and over again.
Out of everything below, my favorite is the "Automated Scatter Plot for outlier detection". Works great!
1 - Automated Inspection
Getting an initial snapshot of your data is essential, and you should do it before diving into data cleaning. Knowing what you're dealing with saves time and steers your cleaning strategy. This section focuses on user-defined functions that automate a quick look at the crucial aspects of your dataset.
1.1 Number of Rows/Columns
Your first step should be to understand the dimensions of your dataset. Knowing the number of rows and columns gives you an idea of the scale of the data you are working with. This is especially important when you are dealing with large datasets. It helps you optimize your data cleaning techniques for performance.
def get_dataframe_shape(df):
    rows, cols = df.shape
    print(f"The DataFrame has {rows} rows and {cols} columns.")
Use this function immediately after loading your data to understand its size. It will help you decide whether you need to sample the data for quicker exploratory data analysis (EDA).
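A minimal sketch of that workflow. The small in-memory DataFrame stands in for whatever file you actually load, and the added return value (not in the original function) just makes the result easy to reuse:

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv("your_file.csv")
df = pd.DataFrame({"id": [1, 2, 3], "price": [9.99, 12.50, 7.25]})

def get_dataframe_shape(df):
    rows, cols = df.shape
    print(f"The DataFrame has {rows} rows and {cols} columns.")
    return rows, cols  # returned here so the caller can act on the size

rows, cols = get_dataframe_shape(df)

# Example decision: sample large datasets before heavy EDA
if rows > 1_000_000:
    df = df.sample(frac=0.1, random_state=42)
```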
1.2 Data Types for each column
Understanding the data types of each column is essential. Incorrect data types can cause errors during the cleaning process. They can also mislead you during analysis and visualization.
def get_column_types(df):
    types = df.dtypes
    print("Data Types for Each Column:")
    print(types)
If the data types do not align with what you expect (e.g., numerical data interpreted as strings), consider converting them as an initial cleaning step.
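For example, a numeric column that arrived as strings can be converted with pandas' `to_numeric`. The column name and values here are made up for illustration:

```python
import pandas as pd

# Hypothetical column of numbers stored as strings, with one bad value
df = pd.DataFrame({"amount": ["10", "20", "oops", "40"]})

# errors="coerce" turns unparseable entries into NaN instead of raising,
# so the conversion succeeds and the bad rows become visible as missing data
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

print(df["amount"].dtype)         # now a float dtype
print(df["amount"].isna().sum())  # count of values that failed to convert
```

Coercing to NaN is usually preferable during cleaning, because it lets you find and handle the bad values later instead of crashing on the first one.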
1.3 Identify the pkey
It's crucial to identify a unique identifier, or 'primary key', for each dataset before merging. This column aligns the rows correctly during the merge process.
def identify_primary_key(df):
    # Return the first column whose values are all unique
    for col in df.columns:
        if df[col].is_unique:
            print(f"The primary key for the DataFrame is: {col}")
            return col
    print("No primary key found.")
    return None
Identifying the primary key before merging datasets can save you from headaches later on. If no primary key is found, you might need to create one or take steps to ensure the data can be merged accurately.
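If no column is unique on its own, one common fix is to create a surrogate key. A sketch of two options, using a made-up DataFrame where `store` and `day` are only unique in combination:

```python
import pandas as pd

# Hypothetical data: no single column is unique by itself
df = pd.DataFrame({"store": ["A", "A", "B"], "day": [1, 2, 1]})

# Option 1: promote the row index to a surrogate key column
df = df.reset_index().rename(columns={"index": "row_id"})

# Option 2: combine columns that are unique together into one key
df["store_day"] = df["store"].astype(str) + "_" + df["day"].astype(str)

print(df["row_id"].is_unique)     # True
print(df["store_day"].is_unique)  # True
```

Option 2 is the safer choice before a merge, since a composite key built from real columns will match across both datasets, while a bare row index generally won't.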