In my career, I’ve heard many times, things like “How to write a complex SELECT query?”, “Where to start from?” or “This query looks so complex. How you’ve learned to write such complex queries?”. While I would like to think of myself as of a brilliant mind or genius or add something like “query magician” to my social network profiles, well, writing complex SQL wouldn’t be the only thing required to do that. Therefore, in this article, I’ll try to cover the “magic” behind writing complex SELECT statements.
The Model
As always, I’ll start with the data model we’ll be using. Before you start to write (complex) queries you should understand what is where – which tables stored what data. Also, you should understand the nature of relations between these tables.
If you don’t have these two on disposal, you have 3 options:
- Ask somebody who created the model for the documentation (if that person is available). Same stands for understanding the business logic behind the data
- Create documentation yourself. This takes time, but is really very useful, especially if you jump in the middle of an undocumented project
- You can always do it without the documentation, but you should be pretty sure you know what you’re doing. E.g. I wouldn’t recommend you driving a car where I’ve repaired brakes. I mean, you can try it, but…
All these tips can be used regardless of what you are doing with your database. Having the overall picture will spare you a lot of time in the long-run, so invest some time when you’re starting.
Let’s Start with the Complex Query
In case I spent too many words so far, let’s remind ourselves of the original question – “How to write a complex SELECT query?”. And let’s start with a complex query.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
SELECT country.country_name_eng, SUM(CASE WHEN call.id IS NOT NULL THEN 1 ELSE 0 END) AS calls, AVG(ISNULL(DATEDIFF(SECOND, call.start_time, call.end_time),0)) AS avg_difference FROM country LEFT JOIN city ON city.country_id = country.id LEFT JOIN customer ON city.id = customer.city_id LEFT JOIN call ON call.customer_id = customer.id GROUP BY country.id, country.country_name_eng HAVING AVG(ISNULL(DATEDIFF(SECOND, call.start_time, call.end_time),0)) > (SELECT AVG(DATEDIFF(SECOND, call.start_time, call.end_time)) FROM call) ORDER BY calls DESC, country.id ASC; |
And this is what query returns:
As you can see, we have a complex query and 2 rows in the result. Without any comments, we can’t easily say what does this query does and how it works. Let’s change that now.
How to Write a Complex SELECT Query & Where is the Data?
We’re back to the original question. Now, we’ll answer this step by step. I’ll tell you what was the desired result of the query (assignment given to us).
Return all countries together with the number of related calls and their average duration in seconds. In the result display only countries where average call duration is greater than the average call duration of all calls.
The first thing we’ll do is to determine which tables we’ll be using in the process. In the data model, I’ve added colors to the tables we need to use.
And how to determine which tables should be? The answer has two parts:
- Use all tables containing data you need to display in your result. In our case, the tables in question are country (we need country_name) and call (we need start_time and end_time to calculate the average call duration)
- If the tables from the previous bullet point are not directly related, you’ll also need to include all the tables between them (in our case that would be – how to get from the country table to the call table)
After this analysis we know we must use the following tables: country, city, customer, and call. If we want to use them properly, we need to JOIN these tables using foreign keys. Without even thinking about the final query, we now know it will contain this part:
1 2 3 4 5 6 7 |
SELECT ... FROM country LEFT JOIN city ON city.country_id = country.id LEFT JOIN customer ON city.id = customer.city_id LEFT JOIN call ON call.customer_id = customer.id ...; |
We could do one thing, and that is to test what the query like this would return:
1 2 3 4 5 6 |
SELECT * FROM country LEFT JOIN city ON city.country_id = country.id LEFT JOIN customer ON city.id = customer.city_id LEFT JOIN call ON call.customer_id = customer.id; |
I won’t post the picture of the whole result because it simply has too many columns. Still, you can check it. I always advise that you test parts of your queries. While they won’t be displayed in the final results, they will be used in the background. By testing these parts you’ll get the idea of what is happening in the background, and could assume what the final result should be. But still, we have to answer on “How to write a complex SELECT query?”.
How to Write a Complex SELECT Query – Write Parts of the Query at the Time
We have already written part of the query and that’s a good practice. It will help you to build a complex query from simpler “blocks” but also, you’ll test your query along the way because you’ll be checking parts of it at a time as well, check how the query works when certain parts are added or executed.
I would start with this part “where average call duration is greater than the average call duration of all calls”. It’s obvious that we need to calculate the average duration from all calls (in seconds). So let’s do that.
1 |
SELECT AVG(DATEDIFF(SECOND, call.start_time, call.end_time)) FROM call |
We’ve explained the aggregate functions in the previous article. So far, we haven’t talked about date & time functions, but it’s enough to say that the DATEDIFF function calculates the difference in the units of the given time period (we are after seconds here) between the start time and end time. The result returned implies that the average call duration was 354 seconds.
Now we’ll write down the query which returns aggregated values for all countries.
1 2 3 4 5 6 7 8 9 10 11 12 |
SELECT country.country_name_eng, SUM(CASE WHEN call.id IS NOT NULL THEN 1 ELSE 0 END) AS calls, AVG(ISNULL(DATEDIFF(SECOND, call.start_time, call.end_time),0)) AS avg_difference FROM country LEFT JOIN city ON city.country_id = country.id LEFT JOIN customer ON city.id = customer.city_id LEFT JOIN call ON call.customer_id = customer.id GROUP BY country.id, country.country_name_eng ORDER BY calls DESC, country.id ASC; |
I would like to point out two things here:
- SUM(CASE WHEN call.id IS NOT NULL THEN 1 ELSE 0 END) – This will sum up only existing calls. Since we’ve used LEFT JOIN, we’ll also join countries without any call. In case we’ve used COUNT, we would have value 1 returned for countries without any call, and we want 0 there (we want to see that info)
- AVG(ISNULL(DATEDIFF(SECOND, call.start_time, call.end_time),0)) – This is very similar to the previously mentioned AVG. The difference here is that I’ve used ISNULL(…,0). This simply tests if the calculated value IS NULL, and if so, replaces it with 0. Calculated value could be NULL if there is not data (we’ve used LEFT JOIN)
Let’s see what this query returns.
“How to write a complex SELECT query?” -> Now we’re really close to complete our query and get really close to this answer.
So, the result contains all countries with their number of calls and the average call duration. From this result, we’re interested only in these having average call duration greater than average call duration of all calls. That’s our original query, but with comments added.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
-- the query returns a call summary for countries having average call duration > average call duration of all calls SELECT country.country_name_eng, SUM(CASE WHEN call.id IS NOT NULL THEN 1 ELSE 0 END) AS calls, AVG(ISNULL(DATEDIFF(SECOND, call.start_time, call.end_time),0)) AS avg_difference FROM country -- we've used left join to include also countries without any call LEFT JOIN city ON city.country_id = country.id LEFT JOIN customer ON city.id = customer.city_id LEFT JOIN call ON call.customer_id = customer.id GROUP BY country.id, country.country_name_eng -- filter out only countries having an average call duration > average call duration of all calls HAVING AVG(ISNULL(DATEDIFF(SECOND, call.start_time, call.end_time),0)) > (SELECT AVG(DATEDIFF(SECOND, call.start_time, call.end_time)) FROM call) ORDER BY calls DESC, country.id ASC; |
You can see the query result in the picture below.
Compared to the previous query, we’ve just added the HAVING part. While in the WHERE part of the query we test “regular” values, HAVING part of the query is used to test aggregated values. We’re using it to compare AVG values.
Comments are a crucial thing, not only in databases but in programming in general. By adding these 3 comment lines, the query should become much more readable. Even somebody who looks at this query for the first time will see what you did and why. That somebody could even be you if you’re looking at the code you wrote some time ago. While it takes some time to write these comments, don’t be lazy and do it. You’ll probably save yourself much more time when revisiting old queries/code.
Let’s Wrap up Everything
So, the question was – “How to write a complex SELECT query?”. While there is no easy answer, I would suggest the following steps:
- Think of it as of LEGO bricks and build the query that way. Treat complex parts as black boxes – they will return what they need to and you’ll write (and incorporate into the main query) them later
- Identify all the tables you’ll need in the query
- Join tables containing the data you need to display or the data used in the WHERE part of the query
- Display all data to check if you’ve joined everything correctly and to see the result of such a query
- Create all subqueries separately. Test them to see do they return what they should. Add them to the main query
- Test everything
- Add comments
Could you give us your answer on “How to write a complex SELECT query?”. Which approach have you used?
Table of contents
- Learn SQL: How to prevent SQL Injection attacks - May 17, 2021
- Learn SQL: Dynamic SQL - March 3, 2021
- Learn SQL: SQL Injection - November 2, 2020