Data Wrangling 2: Create new columns based on a condition

Create a new column based on another column and a condition.

BowTied_Raptor

May 08, 2022


Required Readings

Basic SQL/R/Python wrangling

How to load data.table in R and pandas in Python (libraries)

1 - Introduction

Welcome back to another data wrangling post. From an earlier post, you already know that you will spend about 80 to 90% of your time wrangling the data. Last time, we talked about how to do basic subsets; now let’s talk about something called feature engineering. This post focuses on one aspect of feature engineering.

Features are the X columns we use when we try to predict a specific Y column. Feature engineering means tweaking the X columns to squeeze more value out of the data. One example is creating a new column based on the values of an existing column. This post focuses on how to create a new column from an existing column in SQL, R, and Python.

2 - Why You Need to Learn This

Let’s say you are looking at the income data located here. One useful metric we can engineer is an age category: those under 20 as ‘still in school’, those from 20 up to (but not including) 30 as ‘early adults’, and those 30+ as ‘peak income levels’. To create this metric, we build a brand new column from the Age column in our dataset.

We will create a new column called Age_Indicator. It will be set to 0 for ages less than 20, 1 for ages from 20 up to (but not including) 30, and 2 for ages 30 and above. You can use this same technique for whatever problem you wish to solve.
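To make the rule concrete, here is a minimal pandas sketch of the Age_Indicator logic. The column names Age and Age_Indicator come from the description above; the five sample ages are made up for illustration and are not taken from the income dataset. Using numpy.select is one reasonable way to express the rule, not necessarily the approach used later in this post.

```python
import numpy as np
import pandas as pd

# Made-up sample ages for illustration (not the real income dataset).
df = pd.DataFrame({"Age": [17, 20, 29, 30, 45]})

# Conditions are checked in order and the first match wins, so the
# second condition only applies to ages that are already 20 or above.
conditions = [
    df["Age"] < 20,  # still in school -> 0
    df["Age"] < 30,  # early adults (20 up to, not including, 30) -> 1
]
choices = [0, 1]

# Any row that matches neither condition (30 and above) gets the default, 2.
df["Age_Indicator"] = np.select(conditions, choices, default=2)

print(df)
```

Because the conditions are evaluated in order, the boundaries fall exactly where the rule says: an age of 20 lands in group 1 and an age of 30 lands in group 2.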

3 - SQL

We will use this tool: csv-sql live. It’s useful for uploading a csv and practicing some

Data Science & Machine Learning 101