Data Wrangling 2: Create new columns based on a condition
Create a new column based on another column and a condition.
Required Readings
How to load data.table in R and pandas in Python (libraries)
Table of Contents:
Introduction
Why You Need to Learn This
SQL
Python
R
1 - Introduction
Welcome back to another data wrangling post. From an earlier post, you already know that you will spend about 80 to 90% of your time wrangling data. Last time, we talked about how to do basic subsets; now let’s talk about something called feature engineering. This post focuses on one aspect of feature engineering.
Features are the X columns we use when we try to predict a specific Y column. Feature engineering means tweaking the X columns to squeeze more value out of the data. One example of feature engineering is creating a new column based on the values of another column. This post focuses on how to create a new column from an existing column in SQL, R, and Python.
2 - Why You Need to Learn This
Let’s say you are looking at the income data located here. One useful feature we can engineer is an age category: those under 20 as ‘still in school’, those from 20 up to 30 as ‘early adults’, and those 30 and over as ‘peak income levels’. To build it, we create a brand new column from the Age column in our dataset.
We will create a new column called Age_Indicator. It will be set to 0 for ages under 20, 1 for ages from 20 up to (but not including) 30, and 2 for ages 30 and over. You can use the same technique for whatever problem you wish to solve.
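Before moving on to the tools, here is a minimal sketch of the Age_Indicator rule in plain Python, so the logic is clear on its own. The function name `age_indicator` and the sample rows are hypothetical and made up for illustration; they are not taken from the income dataset.

```python
def age_indicator(age):
    """Map an age to the 0/1/2 bucket used for Age_Indicator."""
    if age < 20:
        return 0  # 'still in school'
    if age < 30:
        return 1  # 'early adults' (20 up to, but not including, 30)
    return 2      # 'peak income levels' (30 and over)


# Hypothetical sample rows standing in for the Age column of the dataset.
rows = [{"Age": 17}, {"Age": 20}, {"Age": 29}, {"Age": 30}, {"Age": 45}]

# Create the new column from the existing one, row by row.
for row in rows:
    row["Age_Indicator"] = age_indicator(row["Age"])

indicators = [row["Age_Indicator"] for row in rows]
print(indicators)  # [0, 1, 1, 2, 2]
```

If the data were already in a pandas DataFrame called `df` (an assumed name), the same function could be applied with `df["Age_Indicator"] = df["Age"].apply(age_indicator)`.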
3 - SQL
We will use this tool: csv-sql live. It’s useful for uploading a csv and practicing some