What Is SQL GROUP BY and How to Use It?
Categories
The article discusses the seven most typical uses of SQL GROUP BY. We’ll go through its syntax and coding examples to give a taste of real-life SQL coding.
The GROUP BY clause is one of the most ubiquitous clauses in SQL. It’s an interesting one, as its use is a basic one but also extends to advanced SQL knowledge. It seems like you have to learn it very early, and you just can’t get rid of it, no matter how complex your SQL queries become. On the contrary – the more complex they are, the more you’ll use GROUP BY.
This article will teach you what SQL GROUP BY is and show you its syntax. But the main part will be showing you how GROUP BY works. We’ll use the interview questions from our platform to show you as many common and relatable examples of GROUP BY in an SQL query.
These examples include SQL GROUP BY with:
- Aggregate functions
- JOINs
- Multiple Columns
- WHERE
- HAVING
- ORDER BY
- Window functions
Once you come to the end of the article, we’re sure you’ll find even more day-to-day situations where you’ll want to use SQL GROUP BY.
What Is GROUP BY Clause in SQL?
The SQL GROUP BY definition is simple: It’s an SQL clause that groups all the rows with the same column value.
Its main purpose is to help with data aggregation, working on that together with the SQL aggregate functions. You know them – COUNT(), SUM(), AVG(), MIN(), MAX(), to mention those most commonly used.
SQL GROUP BY Syntax
When data is grouped in SQL, it is done so by one or several columns. The GROUP BY syntax reflects that, as you need to specify which column(s) you want to group by.
The SQL GROUP BY syntax is, therefore, rather straightforward. However, you have to keep in mind where GROUP BY comes in relation to other SQL clauses. That’s why we’ll use them, too, in the GROUP BY syntax to show where GROUP BY stands.
SELECT column_1,
column_2,
…
FROM table
WHERE …
GROUP BY column_1, column_2, …
HAVING …
ORDER BY column_1, column_2, …;
The above syntax shows that GROUP BY always comes after FROM and WHERE but before HAVING and ORDER BY.
The best way to internalize these rules is to write SQL code that requires GROUP BY, and that’s what we’ll do in the following section.
Practical Examples for Mastering SQL GROUP BY
Using SQL GROUP BY With Aggregate Functions
When using it with aggregate functions, GROUP BY categorizes aggregated data into groups and gives it labels. Labels are distinct values from the column you group by.
The aggregate functions mustn’t be used in GROUP BY because they are executed after GROUP BY.
Question: Total AdWords Earnings
Take a look at this question by Google.
Interview Question Date: July 2020
Find the total AdWords earnings for each business type. Output the business types along with the total earnings.
Link to the question: https://platform.stratascratch.com/coding/10164-total-adwords-earnings
It gives you the google_adwords_earnings table to work with.
business_type | n_employees | year | adwords_earnings |
---|---|---|---|
handyman | 2 | 2018 | 81 |
handyman | 10 | 2018 | 110 |
media | 10000 | 2018 | 123456789 |
handyman | 5000 | 2018 | 1001001 |
handyman | 2 | 2018 | 150 |
Code
To solve this problem, you need to sum the adwords earnings and then group them by the business type. That will give you the total earnings by each business type.
It translates to this in SQL.
SELECT business_type,
SUM(adwords_earnings) AS earnings
FROM google_adwords_earnings
GROUP BY business_type;
There are two columns in SELECT: business_type and earnings. This second column is calculated using the SUM() aggregate function with the column adwords_earnings.
The FROM clause refers to the only table we have.Then GROUP BY: it groups data by the business_type column, which is what the question asks.
Combined with SUM(), this will output the earnings by each business type.
business_type | earnings |
---|---|
media | 247914579 |
transport | 132323280 |
handyman | 6042187 |
Lookit!: Don’t Use the Aggregate Function in GROUP BY
It’s like kids trying to stick a screwdriver into a socket – we don’t know why you would do that, but you might be tempted.
We’ll rid you of the temptation: the aggregate functions are not allowed in GROUP BY. Why? Will you be electrocuted if you put aggregate functions there?
No, but you’ll get an error. Run this code and see for yourself.
SELECT business_type,
SUM(adwords_earnings) AS earnings
FROM google_adwords_earnings
GROUP BY business_type, SUM(adwords_earnings);
See!? And you wouldn’t believe us when we told you!
The aggregate functions are not allowed in GROUP BY
Using SQL GROUP BY With JOINs
When using JOINs in your query, it usually means you’ll have more than one table in the FROM clause. This doesn’t change anything regarding the use of GROUP BY: you simply state the columns by which you want to group your output.
The only change is that joining tables allows you to group by any columns from any table.
Question: Popularity of Hack
Here’s a good example by Meta.
Interview Question Date: March 2020
Meta/Facebook has developed a new programing language called Hack.To measure the popularity of Hack they ran a survey with their employees. The survey included data on previous programing familiarity as well as the number of years of experience, age, gender and most importantly satisfaction with Hack. Due to an error location data was not collected, but your supervisor demands a report showing average popularity of Hack by office location. Luckily the user IDs of employees completing the surveys were stored. Based on the above, find the average popularity of the Hack per office location. Output the location along with the average popularity.
Link to the question: https://platform.stratascratch.com/coding/10061-popularity-of-hack
You have two tables at your disposal. The first one is facebook_employees.
id | location | age | gender | is_senior |
---|---|---|---|---|
0 | USA | 24 | M | FALSE |
1 | USA | 31 | F | TRUE |
2 | USA | 29 | F | FALSE |
3 | USA | 33 | M | FALSE |
4 | USA | 36 | F | TRUE |
The second table is facebook_hack_survey.
employee_id | age | gender | popularity |
---|---|---|---|
0 | 24 | M | 6 |
1 | 31 | F | 4 |
2 | 29 | F | 0 |
3 | 33 | M | 7 |
4 | 36 | F | 6 |
Code
What you need to do here is to find the average popularity and group by office location. Compared to the previous code’s logic, the only addition is that there will be two tables.
Have a look.
SELECT e.location,
AVG(s.popularity) AS avg_popularity
FROM facebook_employees e
JOIN facebook_hack_survey s ON e.id = s.employee_id
GROUP BY e.location;
We select the location and calculate the hack popularity by applying the AVG() function to the popularity column.
Since these two columns are from one table each, we have to join the tables. We join the tables on the columns id and employee_id.
Finally, we group the output by the column that is required to solve the problem. In this case, it’s the column location from the table facebook_employees.
location | avg_popularity |
---|---|
UK | 4.33 |
USA | 4.6 |
India | 7.5 |
Switzerland | 1 |
Lookit!: Careful When Choosing the JOIN
This isn’t in direct relation to GROUP BY in SQL. But you need to be careful when choosing which join type to use, as they may return different results.
For instance, we used (INNER) JOIN for the previous question. The result would have been the same if we had used LEFT JOIN.
However, if we had chosen to use RIGHT JOIN, we wouldn’t have gotten the right (pun intended) result.
Run this code to see what it’ll return.
SELECT e.location,
AVG(s.popularity) AS avg_popularity
FROM facebook_employees e
RIGHT JOIN facebook_hack_survey s ON e.id = s.employee_id
GROUP BY e.location;
location | avg_popularity |
---|---|
6.67 | |
UK | 4.33 |
USA | 4.6 |
India | 7.5 |
Switzerland | 1 |
There’s one additional row in the output with the calculated average popularity but without the location. These are all the employees appearing in the table facebook_hack_survey, but missing from facebook_employees.
To be sure which JOIN to choose, read our article about types of SQL JOINs.
Using SQL GROUP BY With Multiple Columns
Yes, of course, you can group your output by more than one column. You simply list all the required columns in GROUP BY and separate them by a comma.
Question: Number Of Custom Email Labels
Let’s practice this in the Google interview question.
Find the number of occurrences of custom email labels for each user receiving an email. Output the receiver user id, label, and the corresponding number of occurrences.
Link to the question: https://platform.stratascratch.com/coding/10120-number-of-custom-email-labels?code_type=1
It gives us two tables: google_gmail_emails,
id | from_user | to_user | day |
---|---|---|---|
0 | 6edf0be4b2267df1fa | 75d295377a46f83236 | 10 |
1 | 6edf0be4b2267df1fa | 32ded68d89443e808 | 6 |
2 | 6edf0be4b2267df1fa | 55e60cfcc9dc49c17e | 10 |
3 | 6edf0be4b2267df1fa | e0e0defbb9ec47f6f7 | 6 |
4 | 6edf0be4b2267df1fa | 47be2887786891367e | 1 |
and google_gmail_labels.
email_id | label |
---|---|
0 | Shopping |
1 | Custom_3 |
2 | Social |
3 | Promotion |
4 | Social |
Code
This question will require you to group by email labels and the user receiving an email. Here’s how to do this.
SELECT to_user AS user_id,
label,
COUNT(*) AS n_occurences
FROM google_gmail_emails e
INNER JOIN google_gmail_labels l
ON e.id = l.email_id
AND l.label ILIKE 'custom%'
GROUP BY to_user, label;
The question asks us to output the receiver user ID, label, and the number of occurrences. For this third column, we simply use COUNT(*) to count the number of rows, which equals the number of occurrences.
Then we INNER JOIN the two available tables. We join them on the ID and email ID, and where the email label is custom.
To get the number of occurrences by user and email label, we list the columns to_user and label in the GROUP BY clause.
user_id | label | n_occurences |
---|---|---|
8bba390b53976da0cd | Custom_3 | 2 |
5dc768b2f067c56f77 | Custom_2 | 4 |
114bafadff2d882864 | Custom_3 | 1 |
32ded68d89443e808 | Custom_2 | 4 |
a84065b7933ad01019 | Custom_1 | 1 |
Lookit!: Group What You Select
When using GROUP BY with the aggregate functions, it’s important to remember one simple rule, which is especially easy to forget if you need to group by several columns.
And the rule is: the columns you include in SELECT must also be included in GROUP BY. (Except, of course, the columns with the aggregate functions. We learned they don’t go in GROUP BY.)
Run the modified code above, where we included only one column in GROUP BY.
SELECT to_user AS user_id,
label,
COUNT(*) AS n_occurences
FROM google_gmail_emails e
INNER JOIN google_gmail_labels l
ON e.id = l.email_id
AND l.label ILIKE 'custom%'
GROUP BY to_user;
Not only that the answer would be wrong, but the only output you’ll see will be an error.
Using SQL GROUP BY With ORDER BY
The use of ORDER BY doesn’t change anything in how you use GROUP BY. The only thing you need to remember is that ORDER BY comes after GROUP BY in SQL.
As ORDER BY is used for sorting your output, you can specify how you want it sorted. If you want it sorted ascendingly, write the keyword ASC after the columns in ORDER BY. If you omit the ASC keyword, the result will still be sorted ascendingly, as it is the default mode.
If you want descending sorting, use the keyword DESC.
Question: Expensive Projects
Let’s see how this works in a Microsoft interview question.
Interview Question Date: November 2020
Given a list of projects and employees mapped to each project, calculate by the amount of project budget allocated to each employee . The output should include the project title and the project budget rounded to the closest integer. Order your list by projects with the highest budget per employee first.
Link to the question: https://platform.stratascratch.com/coding/10301-expensive-projects?code_type=1
It gives us two tables. The first one is ms_projects.
id | title | budget |
---|---|---|
1 | Project1 | 29498 |
2 | Project2 | 32487 |
3 | Project3 | 43909 |
4 | Project4 | 15776 |
5 | Project5 | 36268 |
The second table is ms_emp_projects.
emp_id | project_id |
---|---|
10592 | 1 |
10593 | 2 |
10594 | 3 |
10595 | 4 |
10596 | 5 |
Code
To answer this question, you will need to group the result by the project title and its budget, then order it by the budget/employee ratio.
Let’s analyze the code more closely.
SELECT title AS project,
ROUND((budget/COUNT(emp_id)::FLOAT)::NUMERIC, 0) AS budget_emp_ratio
FROM ms_projects a
INNER JOIN ms_emp_projects b
ON a.id = b.project_id
GROUP BY title, budget
ORDER BY budget_emp_ratio DESC;
First, we select the project title. Then we calculate the amount of the project’s budget per employee working on that project. We do that by dividing the budget by the number of employees, which we get by using the COUNT() function.
As the required columns are from both tables, we need to join them in the FROM clause. We do that on the project ID columns.
Then we group the output by title and budget, as both columns appear in the SELECT statement.
The question wants you to show the output with the projects with the highest budget per employee on top. In other words, sort the result descendingly using ORDER BY.
project | budget_emp_ratio |
---|---|
Project8 | 24642 |
Project49 | 24387 |
Project15 | 24058 |
Project10 | 23794 |
Project19 | 22493 |
Lookit!: ORDER BY Must Come After GROUP BY
This one has to do with the order (pun intended) in which the SQL clauses are executed. ORDER BY is executed after the data is grouped, so if you put it before GROUP BY, the code won’t work.
Run this modified code to see that. It’s the same code as above. Only we put ORDER BY before GROUP BY.
SELECT title AS project,
ROUND((budget/COUNT(emp_id)::FLOAT)::NUMERIC, 0) AS budget_emp_ratio
FROM ms_projects a
INNER JOIN ms_emp_projects b
ON a.id = b.project_id
ORDER BY budget_emp_ratio
GROUP BY title, budget;
It throws an error.
Using SQL GROUP BY With WHERE Clause
The WHERE clause filters the data. It takes the logical condition state in WHERE, checks the data, and returns only the values that satisfy the condition. The conditions are set using logical operators or commands such as BETWEEN, LIKE, or IN.
If you don’t use SQL GROUP BY, you’re not bothered with the specific use of WHERE. However, using GROUP BY puts an important feature of WHERE to the front: it filters data before grouping. That’s why WHERE always comes before GROUP BY.
Again, knowing the execution order of the SQL clauses shows to be a valuable asset.
Question: Reviews of Hotel Arena
Let’s see how to use GROUP BY and WHERE in the question by Airbnb.
Find how many reviews exist for each review score given to 'Hotel Arena'. Output the hotel name ('Hotel Arena'), each review score, and the number of reviews for that score. Ensure the results only include 'Hotel Arena.'
Link to the question: https://platform.stratascratch.com/coding/10166-reviews-of-hotel-arena
There’s only one table to work with: hotel_reviews.
hotel_address | additional_number_of_scoring | review_date | average_score | hotel_name | reviewer_nationality | negative_review | review_total_negative_word_counts | total_number_of_reviews | positive_review | review_total_positive_word_counts | total_number_of_reviews_reviewer_has_given | reviewer_score | tags | days_since_review | lat | lng |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 Western Gateway Royal Victoria Dock Newham London E16 1AA United Kingdom | 359 | 2017-07-05 | 8.5 | Novotel London Excel | United Kingdom | coffee and tea at breakfast were not particularly hot Otherwise everything else was fine | 16 | 1158 | we were allocated the newly refurbished rooms and so everything was fresh and the bed was very comfortable the hotel is ideally situated near City Airport although eventually we travelled by train | 34 | 2 | 10 | [' Leisure trip ', ' Family with young children ', ' Standard Double Room with Two Single Beds ', ' Stayed 2 nights ', ' Submitted from a mobile device '] | 29 days | 51.51 | 0.02 |
35 Charles Street Mayfair Westminster Borough London W1J 5EB United Kingdom | 252 | 2015-08-29 | 9.1 | The Chesterfield Mayfair | Israel | No Negative | 0 | 1166 | We liked everything The hotel is simply a boutique the staff were all polite and helpfull The room was clean and been serviced daily Wifi was completely free Breakfast was simply great I so much want to get back | 41 | 8 | 10 | [' Leisure trip ', ' Couple ', ' Classic Double Room ', ' Stayed 4 nights ', ' Submitted from a mobile device '] | 705 day | 51.51 | -0.15 |
14 Rue Stanislas 6th arr 75006 Paris France | 40 | 2017-05-23 | 9.1 | Hotel Le Six | United States of America | There is currently utility construction taking place on the street in front of the hotel so a little noisy at times and barriers in place | 27 | 177 | Neat boutique hotel Some of the most comfortable hotel beds I have ever come across Staff was wonderful Loved the location Not too touristy Luxembourg gardens close by and a great place for a morning run walk | 39 | 3 | 9.2 | [' Leisure trip ', ' Family with young children ', ' Deluxe Double Room ', ' Stayed 4 nights ', ' Submitted from a mobile device '] | 72 days | 48.84 | 2.33 |
Gran V a De Les Corts Catalanes 570 Eixample 08011 Barcelona Spain | 325 | 2016-08-25 | 8.2 | Sunotel Central | United Kingdom | Coffee at breakfast could be better When you spend this amount in a hotel I expect better coffee in the morning | 22 | 3870 | Great bed nice to have a coffee machine in the room love the air conditioning and basically loved the attitude of the staff Really great | 26 | 2 | 9.2 | [' Leisure trip ', ' Group ', ' Comfort Double or Twin Room ', ' Stayed 1 night ', ' Submitted from a mobile device '] | 343 day | 41.38 | 2.16 |
Rathausstra e 17 01 Innere Stadt 1010 Vienna Austria | 195 | 2015-09-17 | 8.5 | Austria Trend Hotel Rathauspark Wien | United Kingdom | A bit out of the way location wise | 9 | 1884 | Clean modern rooms and bathroom well equipped | 9 | 2 | 7.5 | [' Leisure trip ', ' Couple ', ' Comfort Room ', ' Stayed 2 nights ', ' Submitted from a mobile device '] | 686 day | 48.21 | 16.36 |
Code
This question requires you to count the score occurrences for Hotel Arena only and show the hotel name and score.
SELECT hotel_name,
reviewer_score,
COUNT(*) AS score_occurrences
FROM hotel_reviews
WHERE hotel_name = 'Hotel Arena'
GROUP BY hotel_name, reviewer_score;
The code selects the hotel name and the review score. It then counts the number of each score occurrences using COUNT(*).
The condition in WHERE means the calculation is done only on the Hotel Arena hotels.
The output is grouped by hotel name and review score. You already learned how to group by multiple columns.
hotel_name | reviewer_score | count |
---|---|---|
Hotel Arena | 6.3 | 1 |
Hotel Arena | 5.4 | 1 |
Hotel Arena | 7.1 | 1 |
Hotel Arena | 4.2 | 1 |
Hotel Arena | 5.8 | 2 |
Lookit!: WHERE Must Come Before GROUP BY
Let’s see what happens if we use WHERE after GROUP BY.
SELECT hotel_name,
reviewer_score,
COUNT(*) AS score_occurrences
FROM hotel_reviews
GROUP BY hotel_name, reviewer_score
WHERE hotel_name = 'Hotel Arena';
The query again returns an error.
Using SQL GROUP BY With HAVING Clause
The HAVING clause is the same as WHERE, in that both clauses are used for filtering data. But there’s one significant difference: HAVING filters data after grouping.
The conditions in HAVING are set the same way as in WHERE. The difference is that HAVING accepts aggregate functions. And, of course, it always comes after GROUP BY.
Question: Cities With the Most Expensive Homes
Here’s a question by Zillow that asks you to use SQL GROUP BY with HAVING.
Interview Question Date: December 2020
Write a query that identifies cities with higher than average home prices when compared to the national average. Output the city names.
Link to the question: https://platform.stratascratch.com/coding/10315-cities-with-the-most-expensive-homes
We have to work with the table zillow_transactions.
id | state | city | street_address | mkt_price |
---|---|---|---|---|
1 | NY | New York City | 66 Trout Drive | 449761 |
2 | NY | New York City | Atwater | 277527 |
3 | NY | New York City | 58 Gates Street | 268394 |
4 | NY | New York City | Norcross | 279929 |
5 | NY | New York City | 337 Shore Ave. | 151592 |
Code
To give a correct answer, you need to group data by city and filter those cities with average home prices higher than the national average using the HAVING clause.
Here’s how to do it.
SELECT city
FROM zillow_transactions a
GROUP BY city
HAVING AVG(a.mkt_price) >
(SELECT AVG(mkt_price)
FROM zillow_transactions)
ORDER BY city ASC;
The easy first step is to select the cities from the table and group by the same column.
OK, we now have the list of cities. This list should contain only cities with average house prices higher than the national average.
Filtering is done in HAVING. The average housing price by city is calculated using the AVG() function. This is then compared with the subquery that, again, uses AVG() to return the total national average price of homes.
Finally, the output is sorted by city alphabetically.
city |
---|
Mountain View |
San Francisco |
Santa Clara |
Lookit!: HAVING Must Come After GROUP BY
Let’s write the same code, but put HAVING before GROUP BY.
Run the code below to see what happens.
SELECT city
FROM zillow_transactions a
HAVING AVG(a.mkt_price) >
(SELECT AVG(mkt_price)
FROM zillow_transactions)
GROUP BY city
ORDER BY city ASC;
An error, as you probably expected!
Using SQL GROUP BY With the Window Functions
The SQL window functions are the SQL functions that perform a calculation over a set of rows (window) that are related to the current row.
One of their most common uses is data aggregation. In that way, they are similar to aggregate functions. The significant difference is that the window functions don’t collapse the individual rows. In other words, they allow for the aggregated and non-aggregated values to be shown side by side.
The other common window functions use is ranking data.
When using the window functions with GROUP BY, it’s important to remember that the window functions can’t be included in GROUP BY, the same as the aggregate functions.
Question: Activity Rank
Let’s see how GROUP BY works alongside the window functions in this Google interview question.
Interview Question Date: July 2021
Find the email activity rank for each user. Email activity rank is defined by the total number of emails sent. The user with the highest number of emails sent will have a rank of 1, and so on. Output the user, total emails, and their activity rank.
• Order records first by the total emails in descending order. • Then, sort users with the same number of emails in alphabetical order by their username. • In your rankings, return a unique value (i.e., a unique rank) even if multiple users have the same number of emails.
Link to the question: https://platform.stratascratch.com/coding/10351-activity-rank
The question gives the table google_gmail_emails.
id | from_user | to_user | day |
---|---|---|---|
0 | 6edf0be4b2267df1fa | 75d295377a46f83236 | 10 |
1 | 6edf0be4b2267df1fa | 32ded68d89443e808 | 6 |
2 | 6edf0be4b2267df1fa | 55e60cfcc9dc49c17e | 10 |
3 | 6edf0be4b2267df1fa | e0e0defbb9ec47f6f7 | 6 |
4 | 6edf0be4b2267df1fa | 47be2887786891367e | 1 |
Code
To solve the question, you have to use the aggregate and window functions with GROUP BY.
SELECT from_user,
COUNT(*) as total_emails,
ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC, from_user ASC) AS ranking
FROM google_gmail_emails
GROUP BY from_user
ORDER BY 2 DESC, 1;
First, select the sender. Then use COUNT(*) to count the number of times this sender appears. This will equal the number of emails sent by the user.
We use the ROW_NUMBER() window function to rank the users by the number of sent emails.
The output is grouped by the sender. Finally, it is sorted by the number of sent emails descendingly and by the ranking ascendingly.
from_user | total_emails | row_number |
---|---|---|
32ded68d89443e808 | 19 | 1 |
ef5fe98c6b9f313075 | 19 | 2 |
5b8754928306a18b68 | 18 | 3 |
55e60cfcc9dc49c17e | 16 | 4 |
91f59516cb9dee1e88 | 16 | 5 |
Lookit!: Don’t Include the Window Functions in GROUP BY
You learned that the aggregate functions don’t go in GROUP BY. So, you might be tempted to think that the window functions aren’t the aggregate functions (correct!) and that they can go in GROUP BY (not correct!)
Let’s still try it.
SELECT from_user,
COUNT(*) as total_emails,
ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC, from_user ASC) AS ranking
FROM google_gmail_emails
GROUP BY from_user, ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC, from_user ASC)
ORDER BY 2 DESC, 1;
Run the code to make sure it returns an error.
Why the window functions aren’t allowed in GROUP BY? The answer: they are executed after GROUP BY. So grouping by the window functions would mean grouping by something that isn’t yet calculated. In other words: impossible!
Summary
In this article, we discussed the seven most common uses of SQL GROUP BY. As you’ve seen, GROUP BY is a clause that offers various ways of grouping and aggregating data.
Each use case was supported by a practical example, which allowed you to see and write the code. Hope you took that opportunity, as coding is the best way for the new SQL knowledge to sink in.
There are even more coding challenges on our platform. You’re welcome to browse through and solve as many coding interview questions as possible. The questions are categorized into different difficulty levels. That way, you can smoothly learn SQL GROUP BY or any other SQL concept from the basic to an advanced level.