SQL - Order of Execution

Michael | Jul 4, 2022

Introduction

In my career as a data analyst, I recognized that SQL is still the most powerful tool. Most of the time, working on exploratory analysis and validating hypotheses with data were the main drivers of impact.

This article will walk you through some of the basic concepts of SQL order of execution. I believe understanding this can be helpful for anyone new to data analytics.

Who is the best Chris?

Let’s go through an example together. You are an analyst at Maverick Corp. and just received a task from the sales manager.

Some employees are debating who is the top sales named Chris in the company right after watching the Avengers movie. The sales manager thinks it will be way more efficient to have the analyst provide the data but adds a constraint that only sales that have completed more than 1 transaction can count.

The sales table contains the employees’ names and unique identification numbers, and the transaction table contains the data for each transaction the sales have made.

Table: sales

idname
1Chris Hamsvalue
2Chris Evians
3Benedict Lumberjack
4Chris P. Ratatouille
5Chris Pint

Table: transactions

transaction_idsales_idamount_usd
1110000
2220000
3420000
4140000
5580000
6320000
7240000
8430000

This is the query you will write to get the required answer:

SELECT name AS sales_name,
       SUM(amount_usd) AS ttl_sales
FROM employee e
JOIN transactions t
ON e.id = t.sales_id
WHERE name LIKE 'Chris %'
GROUP BY name
HAVING COUNT(transaction_id) > 1
ORDER BY ttl_sales DESC
LIMIT 1;

However, the order of execution/operation is different from how you write the code.

  1. FROM (JOIN) - Pick and join tables to get raw data.
  2. WHERE - Filter the raw data.
  3. GROUP BY - Aggregate the raw data.
  4. HAVING - Filter the aggregated data.
  5. SELECT - Return the filtered data.
  6. ORDER BY - Sort the filtered data.
  7. LIMIT - Limit the returned data to number of rows.

FROM/ JOIN

The first step is the statements in FROM clause followed by the JOIN clause. The outcome of these operations is combining the columns of the two tables.

FROM employee e
JOIN transactions t
idnametransaction_idsales_idamount_usd
1Chris Hamsvalue1110000
1Chris Hamsvalue2220000
1Chris Hamsvalue3420000
1Chris Hamsvalue4140000
1Chris Hamsvalue5580000
1Chris Hamsvalue6320000
1Chris Hamsvalue7240000
1Chris Hamsvalue8430000

But this is not the data set you want. We work with relational databases. You need to specify the relation between tables. In this case, we state in ON clause that we only want the rows where the employee id matches the sales id.

ON e.id = t.sales_id
idnametransaction_idsales_idamount_usd
1Chris Hamsvalue1110000
1Chris Hamsvalue4140000
2Chris Evians2220000
2Chris Evians7240000
3Benedict Lumberjack6320000
4Chris P. Ratatouille8430000
4Chris P. Ratatouille3420000
5Chris Pint5580000

WHERE

Now we have the qualified rows, and the next step is the WHERE clause. WHERE filters each row with the stated conditions. When rows do not return a true value, they will be removed from the data set. Benedict unfortunatly did not meet the requirements.

WHERE name LIKE 'Chris %'
idnametransaction_idsales_idamount_usd
1Chris Hamsvalue1110000
1Chris Hamsvalue4140000
2Chris Evians2220000
2Chris Evians7240000
4Chris P. Ratatouille8430000
4Chris P. Ratatouille3420000
5Chris Pint5580000

GROUP BY

GROUP BY clause aggregates rows of the data that have the same distinct value. You will notice that if we group by name, we can not directly group the other columns and this is quite often the reason why your query does not work. This clause is used with aggregation functions such as sum() or count() to show one value per aggregated field. Is this example, we will add the count() functions to the other columns, the function will return the number of values(rows) this column has.

GROUP BY name
count(id)namecount(transaction_id)count(sales_id)count(amount_usd)
2Chris Hamsvalue222
2Chris Evians222
2Chris P. Ratatouille222
1Chris Pint111

HAVING

The HAVING clause is similar to the WHERE clause. But instead of filtering data by individual rows, it filters data by the aggregated rows created after the GROUP BY clause. Since Chris Pint only has one sales record, even with the highest sales, he is still filtered out.

HAVING COUNT(transaction_id) > 1
count(id)namecount(transaction_id)count(sales_id)sum(amount_usd)
2Chris Hamsvalue2250000
2Chris Evians2260000
2Chris P. Ratatouille2250000

SELECT

In the SELECT clause, we decide what columns will display as the results of the query, and if there are some functions like sum() or count(), etc to run.

SELECT 
	name AS sales_name,
	SUM(amount_usd) AS ttl_sales
sales_namettl_sales
Chris Hamsvalue50000
Chris Evians60000
Chris P. Ratatouille50000

ORDER BY

The last is the ORDER BY clause. The top sales will logically have the highest sales number. We can set the ordering by ttl_sales in descending order to complete the objective.

ORDER BY ttl_sales DESC
sales_namettl_sales
Chris Evians60000
Chris Hamsvalue50000
Chris P. Ratatouille50000

LIMIT

We want to know who the best Chris is. Set the LIMIT to 1 to only show the first row.

LIMIT 1
sales_namettl_sales
Chris Evians60000

Conclusion

Not everything is covered here and there are a bunch of other details when writing queries. But these are the essential basics you should know to kickstart your learning. Hopefully, you find this article helpful!

comments powered by Disqus