Feature Engineering Part 1: Getting More Value from Your Regression
Feature Engineering, Binary Columns, Quadratic Regression
Table of Contents:
Where You Would Use This
The Original Model and Data
Categorical Data
Finishing Thoughts
1 - Where You Would Use This
If you are doing any sort of analysis, there are a few little tricks you can pull off to squeeze more value out of your current data. One of those tricks is called feature engineering. The idea is that you take your current dataset and use some data wrangling to create additional columns, and in theory some of these newly generated columns will be more useful to the model than the old ones.
Hence the quality of the model's predictions increases.
Here is a simple visualization of the potential of this technique.
FYI: Factors, Features, Independent Variables, X-variables, and the X columns all literally just mean the same thing. Whichever one you end up saying depends on the domain of your business: quants like to say factors, statistical researchers like to say independent variables. Literally just the same thing.
In other words, Feature Engineering literally just means tweaking the X columns a bit.
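To make "tweaking the X columns" concrete, here is a minimal sketch in pandas. The column names come from the possum dataset used below; the derived head_ratio column is purely illustrative, not something from the original post:

```python
import pandas as pd

# Stand-in for your existing dataset (hdlngth and skullw are
# measurement columns from the possum data).
df = pd.DataFrame({
    "hdlngth": [94.1, 92.5, 94.0],   # head length (mm)
    "skullw":  [60.4, 57.6, 60.0],   # skull width (mm)
})

# "Tweaking the X columns": derive a brand new feature from the old ones.
df["head_ratio"] = df["hdlngth"] / df["skullw"]
print(df)
```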
2 - The Original Model and Data
We will continue from the results produced by the possum linear regression. Here is a quick preview of the old model's predictions vs the actual results.
You can read the post on linear regression and the old model here.
You can grab the data here if you want to follow along.
Old Model:
3 - Categorical Data
One simple thing we can do with data is to create what’s called binary columns. The idea is we can make a simple True/False marker using a 0 or a 1. For example, if we look at the Gender column, we can say Gender_Male=1 when the gender says male, or Gender_Male=0 when it does not. Rinse and repeat for several columns. Here are the results of tweaking some data and then running another linear regression.
R
First, we’ll target the sex and Pop columns and make some binary categorical data from them. This is what the new data looks like:
Now let’s run the regression, and see the old results vs the new side by side.
Although there is one prediction that became slightly worse, generally speaking you can see that feature engineering has an overall net positive effect. In this case, we only ran feature engineering on 2 columns; imagine if we were to run it on several columns, and created new categorical columns for some of the numeric ones as well.
You can see that just by tweaking the columns a little bit, you can squeeze a lot more bang for your buck out of your data.
Python
First, we’ll target the sex and Pop columns and make some binary categorical data from them. This is what the new data looks like:
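A minimal sketch of this step, assuming the data has been saved as possum.csv with the usual possum column names (sex taking values f/m, Pop taking values Vic/other):

```python
import pandas as pd

possum = pd.read_csv("possum.csv")

# Turn sex and Pop into 0/1 binary columns, e.g. sex_m = 1 when
# sex == "m" and 0 otherwise, and likewise for Pop_Vic / Pop_other.
possum = pd.get_dummies(possum, columns=["sex", "Pop"], dtype=int)
print(possum.head())
```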
Now, let’s run the regression and compare the new results with the old ones.
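Here is a hedged sketch of the comparison, assuming totlngth (total length) was the target in the old model and that the old model used a handful of the numeric measurement columns; swap in whichever target and features the original post actually used:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same assumptions as above: possum.csv with the usual possum columns.
possum = pd.get_dummies(pd.read_csv("possum.csv"), columns=["sex", "Pop"], dtype=int)

target = "totlngth"
old_features = ["hdlngth", "skullw", "taill"]        # original numeric X columns (assumed)
new_features = old_features + ["sex_m", "Pop_Vic"]   # plus the engineered binary columns

data = possum.dropna(subset=new_features + [target])
X_train, X_test, y_train, y_test = train_test_split(
    data[new_features], data[target], random_state=0
)

# Fit one model on the old columns and one on old + engineered columns.
old_model = LinearRegression().fit(X_train[old_features], y_train)
new_model = LinearRegression().fit(X_train, y_train)

print("old model R^2:", old_model.score(X_test[old_features], y_test))
print("new model R^2:", new_model.score(X_test, y_test))
```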
4 - Finishing Thoughts
The above showcases just one way to get more bang for your buck by creating some new binary columns, and you can see that overall it ends up being a net benefit.
In the next post, we’ll continue with a part 2 and talk about something called Quadratic Regression.
Feature engineering intensifies