SQL4: Grouping & Aggregating Data

Grouping, Aggregation, Filtering with aggregation, exercises & interview questions

Jan 22, 2023

Oh cool, here’s a job by TikTok, and see if you can spot the SQL:

Another Data Science job that requires SQL skillset

Don’t let the salary get to your head, this is in California. Run a COLA calculation, and you’ll realize this is around $60k everywhere else. In other words it's a data analyst position.

But, important thing here. Notice they want you to also know R, and Python. This is because with a few years under your belt, the odds are they’ll promote you to a Senior DS, or a MLE.

This is the best role to get your foot into the Data industry, especially college grads.

Grouping
Aggregation
Filtering with aggregation
Exercises & Interview Questions

Note: All these commands covered in this post is for data wrangling. None of these commands will make any permanent changes to your tables/database.

1 - Grouping

When you are working with a table, sometimes you will want to combine certain rows together. This is referred to as grouping.

Example of the GROUP BY clause to group rows together. — Example of grouping data together.

We use the GROUP BY clause when we want to group up our data.

1.1 GROUP BY

When you are fetching data (SELECT), you can use the GROUP BY clause to group certain rows together. For the rows to be grouped together, they must have the same value. Keep in mind that the GROUP BY clause will show up after the WHERE clause from our last SQL post.
Below is an example of how all these SQL statements we’ve covered so far will fit together:

Template for how SQL statements are to be structured.

The group by clause is great, because it acts as another way to SELECT DISTINCT. For some of you who are autists, although SELECT DISTINCT & GROUP BY will give the same result. The main difference is when they are executed.

You can think of these as like SQL’s version of BEDMAS (Order of operations).
Diagram below:

When GROUP BY is called (without aggregations), SQL realizes what you were after, and gives you the same result. Whereas with SELCT DISTINCT, it doesn’t need to think, and gives you what you were after. That tiny little difference adds in a performance boost for SELECT DISTINCT > GROUP BY. Observe below:

SELECT DISTINCT vs GROUP BY Performance — Visual proof that SELECT DISTINCT is faster than GROUP BY

1.2 Practice

From the Titanic dataset, let’s say I wanted to fetch all the unique ages of people what were on the ship. I can use the SELECT DISTINCT (previous example), or I can use the GROUP BY, and achieve the same effect.
Here is an example of using the group by clause to achieve it:

Examining Distinct ages on the titanic with GROUP BY

2 - Aggregation

Recall from this post that we can actually run function on our numeric data. Using the group by, we can get SQL to group up the relevant rows. Then run a mathematical function on the groups of rows, 1 group at a time.

When we run a function on a column of numerical entries, to get 1 numerical answer. This process is referred to as aggregation. If you forgot, here is the list of functions you can use for aggregation:

COUNT() - calculate and return the total number of Non NULL rows.
AVG() - calculate and return the average
SUM() - calculate and return the total
MIN() - returns the lowest value
MAX() - returns the highest value

Let’s use SQL to group all the sex, and survived. Then we can run a AVG() on the age to get some details on the titanic survivors.

Aggregating with AVG(), and the GROUP BY function

We now have an interesting observation from the titanic disaster. For the females, the older ones tended to survive. For the men, it was the opposite.
Neat.

3 - Filtering with Aggregation

3.1 GROUP BY

Let’s say we wanted to add in some extra criteria. What if we wanted to remove some rows FIRST, and then run a group by aggregation on the remaining rows. This is an example of filtering with the WHERE clause, and then aggregating.

Let’s kick out all the entries, where AGE < 18, since they weren’t even adults. Example below:

Using the HAVING clause to filter after the aggregation — Using the WHERE clause to filter the data, and then perform aggregation

By doing this simple little tweak, we can see that age was not a useful factor for whether a man survived or not. But, for females it still is.

Now, let’s crank it up a notch. What if you wanted to run another filter on your table. But, you wanted to apply to filter, after the GROUP BY, and the aggregation was done. This is where the HAVING clause comes in.

Note: Go back to the last 2 SQL queries I ran. If I only show you the first output, you’ll think younger men tend to be better survivalists than their older counterparts. If I show you the 2nd output, you now know the full story.
This right here is an example of how you can lie with statistics. Remember that, the next time you read some statistics BS on the news. Are they telling you the full story? Or, only what they want you to believe.

Do you see how the left one could be lying with statistics?

3.2 HAVING

After the above GROUP BY, and WHERE filter was applied. If I was now interested in only looking at females, and I was only interested in looking at those, where avg(age) > 30. I could use the HAVING clause for this.

The HAVING clause would come after the GROUP BY clause.

Using the HAVING clause to run another filter, after the aggregation

4 - Exercises & Interview Questions

4.1 - Exercises

Here are some exercises you can try, and put your knowledge to the test.

Use 1 SQL query, and give me the breakdown of: the males that survived/died, and the females that survived/died
Use 1 SQL query, and give me the breakdown of those under the age of 18 that died, based off: gender, and PClass.
Now go back to 2.), and kick out all the entries where PClass = ‘1st’

4.2 - Interview Questions

Here are the most common interview questions you’ll get, based off this specific topic.

What’s the difference between the GROUP BY, and HAVING clause?
Ans: The GROUP BY clause let’s you combine rows which have similar values together into a group. The HAVING clause can then be used to filter the newly generated table even further.
If the GROUP BY clause does the same thing as SELECT DISTINCT, what’s the point of even bothering, and using the SELECT DISTINCT?
Ans: In terms of efficiency, the SELECT DISTINCT tends to run faster than using the GROUP BY clause by itself. This is useful for us, as when we put code into production, we will want to make sure it runs as fast as possible.
What are Aggregate functions, and list some examples?
Ans: When we have a column of data, we can use aggregate functions to get 1 output that sums up all the data into a single value. Some aggregate functions are: AVG, COUNT, SUM, MAX, MIN.
Do aggregate functions work with NULL values? What do they do if they encounter one?
Ans: When aggregate functions encounter a NULL value, they ignore the data point, as if it didn’t exist in the first place.

You’ll notice the questions are starting to get a bit tougher, and rely on the stuff you learned from the previous SQL posts. You can click here to continue to SQL 5: Insert & Update data on tables.

Data Science & Machine Learning 101

Discussion about this post