Look-ahead Bias

A short follow-up to the original post on look-ahead bias. It covers a few related biases and offers a few more fixes to apply once you have spotted the problem.

Jul 15, 2022

Let's talk about some common biases you will encounter.

Look ahead bias
Time period bias
Past performance

Look ahead bias

Look-ahead bias is a methodological error in which an analysis or backtest uses information that would not actually have been available at the time the decision was made. It makes results look better than they could have been in practice.

Look ahead bias example

A simple example of look ahead bias is this:

Let's say today is Jan 1, 2022, and you are analyzing the price of stock X.
If you want to predict what the price of the stock will be on Jan 2, 2022, you can only use data that was actually available by Jan 1, 2022.
So you go to FRED (Federal Reserve Economic Data), find a metric such as DFF (Federal Funds Effective Rate), and use the value recorded for Dec 31, 2021.
Congrats, you just committed look-ahead bias for the trading day of Jan 1, 2022, PIT (point in time).

Here's why this produces look-ahead bias: the government needs a few days to publish its data, so the Dec 31 value was not yet available to the public on Jan 1. You are using data that you should not have had access to.

An example of look-ahead bias in the DFF data, which would have produced inaccurate results.

On a second look, the data for Dec 31 only became available 2 days later, on Jan 2, 2022, not on Jan 1, 2022 like we were hoping. Let's talk about a few more ways to keep look-ahead bias out of our analysis, and how to fix it once spotted.
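A rough sketch of how you might guard against this in pandas. The flat 2-day publication lag, the column names, the helper `latest_available`, and the rate values are all illustrative assumptions, not real FRED data or a FRED API:

```python
import pandas as pd

# Hypothetical observations: the value recorded *for* each date (not real DFF numbers).
observations = pd.DataFrame(
    {
        "observation_date": pd.to_datetime(["2021-12-29", "2021-12-30", "2021-12-31"]),
        "rate": [0.08, 0.08, 0.07],
    }
)

# Assumption: every value becomes public 2 calendar days after its observation date.
PUBLICATION_LAG = pd.Timedelta(days=2)
observations["available_date"] = observations["observation_date"] + PUBLICATION_LAG


def latest_available(as_of: str) -> pd.Series:
    """Return the newest observation that was already public on `as_of`."""
    visible = observations[observations["available_date"] <= pd.Timestamp(as_of)]
    return visible.sort_values("available_date").iloc[-1]


print(latest_available("2022-01-01")["observation_date"].date())  # 2021-12-30
print(latest_available("2022-01-02")["observation_date"].date())  # 2021-12-31
```

The key idea is to filter on the date the value became public rather than the date it describes, so on Jan 1 the Dec 31 number is simply not visible yet.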

Data Mining

Using future information that you are not supposed to have in your analysis is sometimes loosely referred to as data mining, although the more precise term is data leakage (data mining bias usually means testing many ideas on the same data until one appears to work). If you are not careful, it can wreck your analysis, because any result built on leaked data cannot be trusted.

Avoid look ahead bias

There are a few things you can do to avoid look ahead bias in your data analysis.

Make sure that you are only considering data that is available at the time of your decision making
Use conservative estimates when making forecasts based on your data
Be aware of how you might be unintentionally introducing bias into your analysis by cherry picking data or using ad hoc methods
Continue to monitor your results over time and adjust your process as needed to ensure that this bias is not skewing your results.
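The first point in the list above is the easiest one to get wrong in code. Here is a minimal sketch, assuming pandas and a made-up price series, of a signal that accidentally uses the same day's return and a shifted version that only uses the previous day's. The names `leaky_signal` and `safe_signal` are just for illustration:

```python
import pandas as pd

# Made-up closing prices on five business days.
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0, 107.0],
    index=pd.date_range("2022-01-03", periods=5, freq="B"),
    name="close",
)
daily_return = prices.pct_change()

# Leaky: the position for day t is decided from day t's own return,
# which is only known after the close of day t.
leaky_signal = (daily_return > 0).astype(int)

# Safer: shift by one bar so the position held on day t depends on day t-1.
safe_signal = leaky_signal.shift(1).fillna(0).astype(int)

leaky_pnl = (leaky_signal * daily_return).sum()
safe_pnl = (safe_signal * daily_return).sum()
print(f"leaky: {leaky_pnl:.4f}  safe: {safe_pnl:.4f}")
```

On this toy data the leaky version looks far better than the shifted one, which is exactly the kind of too-good-to-be-true result that should make you suspicious.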

If you've determined that your data has look-ahead bias, here's how you can fix it.

How to fix it

How a simple backtest can reveal look-ahead bias in a forecast of a stock's price at a future date.

FYI: Feel free to google the above report from the Refinitiv quant research company.

Unfortunately, look-ahead bias can invalidate your entire analysis and remove any confidence people had in your research strategies. If you're determined to get rid of this bias once and for all, here are some things you can try:

Use cross validation that respects time order (for example, walk-forward validation) to test how well your analysis holds up and to make it more robust
Try resampling the data to a different frequency, for example from daily to monthly, or from monthly to daily
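One caveat on the cross validation item: with time series, an ordinary shuffled k-fold split trains on future rows, which is itself a form of look-ahead bias. Below is a small sketch of an expanding-window walk-forward split in plain Python. The function `walk_forward_splits` and its parameters are my own naming for illustration; scikit-learn's `TimeSeriesSplit` offers a similar ready-made splitter.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int, n_splits: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs where training rows always precede test rows."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))  # everything before the test window
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


for train_idx, test_idx in walk_forward_splits(n_samples=10, n_splits=3, test_size=2):
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
# train 0..3  test 4..5
# train 0..5  test 6..7
# train 0..7  test 8..9
```

Each fold only ever tests on rows that come after everything it trained on, which mirrors how the model would be used live.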

You can find more information on this by going to this post here.

Time period bias

Time period bias is a sampling error in which the results depend on the particular time period that was sampled rather than on what you are actually trying to study. This can happen either because the data collection process itself spans multiple time periods, or because data from one period is used to draw conclusions about another (e.g. using data from 2010 to study trends in 2017).

Time period bias can be problematic for data science in two ways.

It can introduce inaccuracies and other biases into the dataset => false conclusions
It can easily lead to overfitting => you'll end up over-tweaking your analysis and get the wrong conclusion
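A quick way to check for this is to score the same strategy separately on several sub-periods instead of on one hand-picked window. Here is a minimal sketch in plain Python. The period labels and the return numbers are made up for illustration:

```python
# Hypothetical monthly strategy returns, six months per period (made-up numbers).
monthly_returns = {
    "2019-H1": [0.03, 0.02, 0.04, 0.01, 0.03, 0.02],
    "2020-H1": [-0.05, -0.08, 0.01, -0.02, -0.04, 0.00],
    "2021-H1": [0.02, 0.01, 0.03, 0.02, 0.00, 0.01],
}


def compound(returns):
    """Total return from compounding a list of simple period returns."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0


per_period = {period: compound(r) for period, r in monthly_returns.items()}
for period, total in per_period.items():
    print(f"{period}: {total:+.1%}")
```

If the conclusion flips depending on which period you look at, as it does with these toy numbers, the result is probably telling you more about the window than about the strategy.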

An example of a trading strategy that would have fallen apart if you had started funding your trading account just before the start of the red region.

Past performance

Past performance is an important metric to consider when assessing a data model. By understanding how a model has performed in the past, we can gain insights into its accuracy and effectiveness. This information can help us to make better decisions about how to use the model going forward.

There are a number of reasons why studying past performance is so important in data science. I have them listed below:

It can help us understand the limits of a particular model (for example, where it under-predicts)
It can help identify potential areas for improvement (it can provide simulation data to tweak behavior)
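As a small illustration of the under-prediction point, here is a sketch in plain Python that compares past predictions against what actually happened. The numbers are invented, and mean error and mean absolute error are just two simple metrics I picked for the example:

```python
# Hypothetical past predictions and the values that were later observed.
predicted = [10.0, 12.0, 9.0, 11.0, 13.0]
actual = [11.0, 13.5, 9.5, 12.0, 14.0]

errors = [p - a for p, a in zip(predicted, actual)]

# The sign of the mean error shows the direction of the bias:
# negative means the model tends to under-predict.
mean_error = sum(errors) / len(errors)

# The mean absolute error shows the size of a typical miss.
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

print(f"mean error: {mean_error:.2f}, MAE: {mean_abs_error:.2f}")
# mean error: -1.00, MAE: 1.00
```

A consistently negative mean error like this one is the kind of past-performance signal that tells you where the model needs tweaking.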

Class imbalance, and what happens if we are not able to find the best solution for solving this risk.

Survivorship Bias

Survivorship bias is the statistical phenomenon whereby information about survivors is given more weight than non-survivors in a given population.

Note: Another bias similar to this one is called backfill bias

Survivorship bias and time series analysis

Survivorship bias can lead to erroneous conclusions when studying time series data, since survivors are more likely to be represented in the data than non-survivors. It can thus lead to an overestimation of growth rates and other trends over time.

There are several ways to adjust for survivorship bias when studying time series data. One approach is to use data from multiple sources, including both survivor and non-survivor populations. Another approach is to use available information about attrition rates in order to extrapolate from the survivor population to the broader population.
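To make the first approach concrete, here is a toy sketch in plain Python that compares the average return of the survivors against the average of the full universe, including funds that no longer exist. The fund names and returns are invented for illustration:

```python
# Hypothetical fund universe: annual return and whether the fund still exists.
funds = [
    {"name": "A", "annual_return": 0.12, "survived": True},
    {"name": "B", "annual_return": 0.08, "survived": True},
    {"name": "C", "annual_return": -0.15, "survived": False},
    {"name": "D", "annual_return": 0.05, "survived": True},
    {"name": "E", "annual_return": -0.30, "survived": False},
]


def average_return(rows):
    """Equal-weighted average annual return of the given funds."""
    return sum(r["annual_return"] for r in rows) / len(rows)


survivors_only = average_return([f for f in funds if f["survived"]])
full_universe = average_return(funds)

print(f"survivors only: {survivors_only:+.1%}")
print(f"full universe:  {full_universe:+.1%}")
```

With these made-up numbers the survivors alone show a healthy positive average, while the full universe is negative, which is the overestimation described above.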

A simple example of how a company's trading strategy is only tracked if it survived; if it performed badly, it is forgotten.

Data Science & Machine Learning 101
