SQL 9 - Subqueries
One of the most popular questions asked during interviews, *AND* it actually gets used a lot in your role. Added in a real world example for you too.
If you don’t know SQL, no company will hire you for any data related role.
Subqueries are one of the most popular topics that interviewers like to ask you. They ask you things like what are the 2 different types of subqueries. They ask you to do some SQL data wrangling, which requires the usage of subqueries.
You will also use subqueries on a daily basis at your role as well. So, let’s get it.
Subqueries are used a lot. If you already know them, hop on over to Section 5. I have some real world examples, and some tricks that you’ll find useful. They’ll make prod run better, and will save you time when working with SQL.
Table of Contents
Real World Use Case & Tips
Data Science & Machine Learning 101 is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
1 - Reader Questions
1.) This question comes from jbnews:
Should I ever use APPLY or are there better strategies to accomplish the same thing?
Only when dealing with table-valued functions or situations where you need to reference a column from an outer query within a subquery. Other than that, I haven’t found any real use case for it.
2.) This question comes from BowTiedCelt:
Is there a trick to remembering what Correlated Subqueries are? I keep forgetting them
The word correlated means: “Highly connected”. When we say high correlation, it means “highly connected”. Low correlation means “not connected”. Thus, the word Correlated Subqueries would mean: “Highly connected subqueries”.
What would the inner query be connected to? The outer Query.
In other words, the inner query is dependent on the outer query.
2 - Subquery
A query that is embedded inside another SQL statement is known as a SubQuery.
See diagram below:
In order to use a subquery, you’ll need to wrap it around in brackets. The general idea in SQL is that the Subquery will get executed first. Then, followed by the Outer Query.
When working with Subqueries, you have them return:
A Value (1 Row, and 1 Column)
A Series (Multiple Rows, and 1 Column)
A Table (Multiple Rows, and Multiple Columns)
The one it returns will of course depend on what your specific needs are. Let’s run a simple scenario:
I have a table called Model_IDs. It has 2 columns: Model_ID, and Series_ID.
I have a table called Model_Scores. It has 3 columns: ID, Name, and Score.
The Goal is to return a table of Name, and Scores. However, the ones we want are ID = Series_ID.
One way to achieve this is to use Joins.
SELECT ID, Name, Score FROM Model_Scores s JOIN Model_ID i on s.ID=i.Model_ID;
Another way to achieve this is to return all of the Model_IDs from Model_Scores first (subquery).
SELECT Model_ID FROM Model_Scores;
Then, make an outer query that has a WHERE clause for this:
SELECT ID, Name, Score FROM Model_Scores WHERE Model_ID IN (…)
And combine the two together to make a subquery:
SELECT ID, Name, Score FROM Model_Scores WHERE Model_ID IN (SELECT Model_ID FROM Model_Scores);
Congrats, you just made your first subquery. You also saw that for 1 problem, there are multiple ways of handling the solution. Now, let’s focus on the 2 main types of subqueries:
3 - Non-correlated Subquery
In a Non-correlated Subquery, the inner query is completely independent of the outer query. Let’s run an example.
Let’s pretend this excel file is my database (yes really)
Let’s say my task is to get the Model_Scores table, however I need to kick out all rows where ID = Model_ID. To handle this easily, we can use a subquery:
The Inner Query should get me all of the Model_ID in the Model_ID table:
SELECT Model_ID FROM Model_ID;
Once I have this, I can use the NOT Operator with an IN to say NOT IN.
Here is the outer query:
SELECT * FROM Model_Scores WHERE ID NOT IN (...);
Now, we combine the pieces together to make:
SELECT * FROM Model_Scores WHERE ID NOT IN (SELECT Model_ID FROM Model_ID);
4 - Correlated Subquery
In a Correlated Subquery, the inner query is dependant in some way of the outer query. Let’s use our “Excel Database”, and do another example:
Let’s say my task is to get the Model_Scores table, however I need to kick out all rows where ID = Model_ID, AND I also want to to kick our rows where ID is bigger than Series_ID.
Let’s start off by building the Inner Query
SELECT MODEL_ID FROM Model_ID WHERE Model_ID.Series_ID < Model_Scores.ID;
Now, let’s build the Outer Query
SELECT * FROM Model_Scores WHERE ID NOT IN (...);
Now, let’s combine them together to build our Correlated Subquery
SELECT * FROM Model_Scores WHERE ID NOT IN (SELECT MODEL_ID FROM Model_ID WHERE Model_ID.Series_ID < Model_Scores.ID);
5 - Real World Use Case & Tips
When you are looking at SQL code in the real world, you can easily see people’s preferences shin in their code. Sometimes, you’ll see people have huge subqueries. Other times, you’ll see people do some of the most convoluted Joins, and WHERE clauses… just to avoid using 1 single subquery.
Best thing to do is to learn to work with both, so that you can save some time.
5.1 Real World Use Case
When you are working in the real world, people generally keep their SQL schemas following this format below:
Table which is used for merging: This would have the numeric Ids for merging.
Table which has the actual values: This would have the actual numeric values you are interested in.
This image below is probably one of the most important images in understand the current design process for SQL schemas.
If you understand this, you’ll work very efficiently:
Keep reading with a 7-day free trial