Python Data Skills 6 - Anomalies Part 1
Everything you wanted to know about detecting outliers (in Python)
This post is specifically about detecting outliers in your dataset. Slow down, and re-read it several times if you have to. For those of you who are already working in a data role, see how many of these techniques you already knew.
In Part 2, we’ll talk about the art of using Machine Learning algorithms to automatically detect and fix the outliers within your dataset.
I highly recommend adding this page to your bookmarks. You’ll be coming back to this several times, anytime you get a new dataframe to work with.
Table of Contents:
Outliers & NAs
Frequencies
One Variable Outliers
Bivariate Relationships
Subsets, and Logical Inconsistencies
1 - Outliers & NAs
Outliers and NAs are both anomalies within your dataset. Both can degrade the quality of your ML model's predictions and of your analysis. I explain the difference below.
1.1 Outliers
Outliers are data points that differ markedly from the other observations in a dataset. They're the extreme values that diverge from the overall pattern of the data distribution. Identifying outliers is a critical step in data analysis because their presence can skew your data and lead to a biased or inaccurate model. Outliers can arise for various reasons, including measurement errors and data entry errors. They could also be genuine, extreme observations.
For example, in a dataset of human heights, a value like 1.9 meters (about 6.2 feet) may be an outlier. It would count as one if it is found to be much higher than the other heights in that dataset.
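To make the height example concrete, here is a minimal sketch using one common rule of thumb, the 1.5 × IQR fences. The heights are made-up values for illustration, and this rule is just one reasonable choice, not necessarily the method used later in this series.

```python
import pandas as pd

# Hypothetical heights in meters (made-up values for illustration only).
heights = pd.Series([1.62, 1.68, 1.70, 1.71, 1.73, 1.75, 1.78, 1.90])

# Interquartile range: the spread of the middle 50% of the data.
q1 = heights.quantile(0.25)
q3 = heights.quantile(0.75)
iqr = q3 - q1

# Fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = heights[(heights < lower) | (heights > upper)]
print(outliers)
```

With these particular values, only the 1.90 m entry falls outside the fences. In a dataset of basketball players, the same height would likely not be flagged, which is why "outlier" always depends on the rest of the data.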
1.2 NAs
In contrast to outliers, NAs or Missing Values refer to the absence of data in your dataset. In a perfect world, we'd have all the data we need, but unfortunately, data often comes with holes. An NA marks a data point that was not observed (due to oversight, loss, or other reasons).
Missing values pose a significant challenge in data analysis because they can distort interpretation, leading to weaker statistical power and biased estimates. They can also hurt the performance of machine learning models.
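Before deciding what to do about missing values, it helps to see where they are. The sketch below counts NAs per column with pandas; the DataFrame and its column names are hypothetical, made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with holes in it (column names are illustrative).
df = pd.DataFrame({
    "height_m": [1.70, np.nan, 1.82, 1.65],
    "city": ["Lagos", "Lima", None, "Oslo"],
})

# Number of missing values in each column.
na_counts = df.isna().sum()
print(na_counts)

# Rows that contain at least one missing value.
rows_with_na = df[df.isna().any(axis=1)]
print(rows_with_na)
```

Note that pandas treats both `np.nan` and `None` as missing, so `isna()` catches either one.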
For more information on the impact of bad data, and where your future is headed, click here.
1.3 Difference
Although both outliers and NAs present their own challenges in data analysis, they are different. Outliers are unusual values that exist within your data, while NAs represent a lack of data. Their impact and the ways to handle them also vary.
Outliers can provide valuable information about the variability in your data, and those you should keep. But if they are due to errors, they could bias your results and should be treated or removed, depending on the context.
The onus is on you to figure out whether the outlier is legit, or whether it’s an error and should be dealt with.
You should try to handle NAs during pre-processing. This is because many mathematical operations and most machine learning algorithms cannot work with missing data. There are 2 main strategies to handle missing data:
Deletion: You drop the rows (or columns) that contain the missing values.
Imputation: You use statistical models to try to predict what value the missing value would have been.
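The two strategies above can be sketched in pandas as follows. The DataFrame is hypothetical, and filling with the column median is only the simplest form of imputation, chosen here for illustration; a model-based approach would predict each missing value from the other columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data (values and column names are made up).
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50.0, 62.0, np.nan, 58.0],
})

# Deletion: drop every row that has at least one missing value.
deleted = df.dropna()

# Imputation (simplest form): fill each NA with its column's median.
imputed = df.fillna(df.median(numeric_only=True))

print(deleted)
print(imputed)
```

Deletion is simple but here it throws away half of the rows. Imputation keeps every row, at the cost of inserting values that were never actually observed.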
In essence, it is important to understand both outliers and missing values because handling them well is fundamental to robust and reliable data analysis. Now that we know the major anomalies to look out for, let’s talk about the strategies we can use to detect them.
2 - Frequencies
In data analysis, frequencies refer to the number of times each distinct value