A Comprehensive Guide to CASE WHEN statements in SQL
Categories
Let’s explore the basic and advanced syntax of CASE WHEN statements in SQL and show how to use CASE statements to answer difficult interview questions in SQL.
Humans use the word ‘if’ to describe the desired outcome for a hypothetical scenario. This is a very useful logical concept because, in life, there are many variables.
Programming languages like SQL have adopted this logic because the query can also have variables. In SQL, we use the CASE expression to define if/then logic.
CASE expressions can have many uses in SQL. They are often used with GROUP BY and ORDER statements and sometimes paired with WHERE and HAVING to filter the table conditionally. CASE is very useful for conditionally aggregating values as well. You could say that CASE expression makes these SQL features even more useful.
What is CASE WHEN statement in SQL?
In SQL, the CASE expression defines multiple conditions and the value to be returned if each condition is met. Simply put, CASE expression allows you to write if/then logic in SQL.
Data isn’t always as organized or clear-cut as we’d like. We can use the CASE expression to handle unpredictable scenarios and conditionally determine the output of the query. It even lets you set the ELSE clause to specify the outcome if none of the conditions are met.
CASE can help you solve complex data science tasks elegantly. It is useful for conditionally selecting, aggregating, filtering, and ordering data.
Because of its many use cases, it’s a good idea to master writing robust CASE expressions before you go into an interview.
Let’s look at the basic syntax for writing CASE expressions in SQL.
Syntax to Write CASE expressions in SQL
Let’s explore the basic structure of CASE expressions in SQL.
CASE and END keywords denote the beginning and end of the CASE expression. Between them, there must be at least one pair of WHEN and THEN blocks, which specify the condition and desired outcome if the condition is met.
Let’s look at an example:
CASE
WHEN condition THEN output
END
If the condition is found to be true, the CASE expression will return the corresponding output.
You can make conditions as simple or as complex as you’d like.
For example, use AND and OR logical operators to chain multiple conditions together.
CASE
WHEN condition_one AND condition_two THEN output
END
In this case, the CASE expression will return the output only if both conditions are true.
In the example, we have only one pair of WHEN/THEN blocks. Typically CASE expressions have multiple sets of WHEN/THEN that specify the condition and its outcome.
CASE
WHEN condition_one THEN output_one
WHEN condition_two THEN output_two
END
Final output of the CASE expression will be the output value of whichever condition is satisfied first.
You can add an optional (but very useful) ELSE clause to specify the return value if none of the conditions are met.
CASE
WHEN condition_one THEN output_one
WHEN condition_two THEN output_two
ELSE generic output
END
CASE expression will return ‘generic output’ (value specified by the ELSE clause in the example) if none of the two conditions are met.
If there is no ELSE, and none of the conditions are met, the final output will be NULL.
Use cases for the CASE expression
CASE expression can return values (text, number, date) as well as the reference to a column or even a condition.
CASE is most commonly used with SELECT, but it can be used to conditionally set parameters for ORDER BY, GROUP BY, HAVING, WHERE, and other SQL features. Or only aggregating values that meet the condition.
Knowing these use cases can help you easily solve complex questions during an interview. If you’re preparing for an interview, check out a long list of SQL Interview Questions on the StrataScratch blog. Also, take a look at SQL Query Interview Questions for an opportunity to practice writing SQL code.
Let’s see some use cases for the CASE expression, starting with most obvious.
CASE WHEN statement with SELECT
SELECT and CASE allow you to assess table data to create new values.
Let’s look at the example where we use CASE expression to create values for the new age_group column.
The new column will contain one of three strings: 'senior’, ‘middle age’, or ‘young’ depending on the existing value of the age column.
SELECT
*,
CASE
WHEN age < 30 THEN 'young'
WHEN 30 >= age < 60 THEN 'middle age'
ELSE 'senior'
END AS age_group
FROM customers
Records where age value is under 30 will have the new age_group value of ‘young’. If the age value is between 30 and 60, CASE will return ‘middle age’, and if none of these conditions are met, the ELSE clause will return ‘senior’.
CASE WHEN statement with ORDER BY
CASE can also be paired with ORDER BY to specify the column for ordering rows in the table.
For example, if you had a database of athletes from many different sports, you might want to order athletes by height, if the athlete is a basketballer, and by weight, if the athlete is a wrestler.
SELECT
full_name,
sport,
height,
weight
FROM athletes
ORDER BY
CASE
WHEN sport = 'basketball' THEN height
WHEN sport = 'wrestling' THEN weight
END DESC
If you do use the DESC keyword to specify order direction, it must come after the CASE expression.
CASE WHEN statement with WHERE and HAVING
You can use CASE with filtering clauses to conditionally determine conditions for filtering.
Let’s imagine we have product data grouped by product category. We can use HAVING and CASE together to set up custom filters for each group.
SELECT
SUM(sales),
category
FROM sales
GROUP BY category
HAVING CASE
WHEN category = 'furniture' THEN SUM(sales) > 100
WHEN category = 'office' THEN SUM(sales) > 30
ELSE SUM(sales) > 150
END
The ‘furniture' group must have a minimum aggregate sales of 100, whereas ‘office’ products have a lower minimum of 50. All other product categories must have more than 150 sales to pass the condition of the HAVING clause.
CASE WHEN statement with GROUP BY
Normally, the GROUP BY statement specifies a column that contains values to be grouped. You can use the CASE expression to specify the column conditionally.
Let’s imagine we have international sales data and want to aggregate it. To keep things simple, we want to create a manageable number of groups.
If there are less than 100 rows, group records by country column. This way, we’ll have more specific groups. Otherwise, group them by values in the region column, which will result in more general groups.
Let’s see how we can use the CASE expression to do this:
SELECT
country,
region
count(*)
FROM sales
GROUP BY
CASE
WHEN count(*) < 100 THEN country
ELSE region
END
CASE WHEN statement with aggregate functions
CASE can be particularly useful when paired with aggregate functions.
As you may know, COUNT(column) counts values that are not NULL. On the other hand, CASE returns a value if the condition is satisfied and NULL otherwise. Therefore, we can pass CASE as an argument to COUNT() and calculate how many values in a column satisfy certain criteria.
SELECT
price,
count(CASE WHEN price > 20 THEN product)
FROM products
GROUP BY price
In this example, COUNT() finds the number of products with a price value higher than 20.
To sum up, you can use CASE to make sure aggregate functions are only applied to records that satisfy criteria.
5 SQL Interview Questions on CASE Expressions
Let’s look at interview questions that require CASE expressions to find the answer.
Question 1: Olympic Medals By Chinese Athletes
First question comes from ESPN. Answering it will demonstrate how to couple CASE and SUM() aggregate functions to a great effect.
Find the number of medals earned in each category by Chinese athletes from the 2000 to 2016 summer Olympics. For each medal category, calculate the number of medals for each olympic games along with the total number of medals across all years. Sort records by total medals in descending order.
Link to the question: https://platform.stratascratch.com/coding/9959-olympic-medals-by-chinese-athletes
Understand the question
We have a list of all athletes from many different nationalities. We have to get the number of each type of medal (silver, bronze, gold) won by Chinese athletes in the Olympics from 2000 to 2016.
The initial table is very general. We’ll have to filter, group, and aggregate values to get a specific answer - the number of medals won by Chinese athletes in particular years.
Analyze data
Candidates are given a table that contains information about all Olympic athletes going back to the 20th century.
Each athlete is described in detail:
- Integer values in the id column uniquely identify athletes.
- name, sex, age, height, weight columns describe athlete’s full name, sex, age, height and weight, respectively.
- The team column refers to their national team.
- The noc column specifies the national olympic committee.
- games values refer to the official name of the Olympic event.
- Values in year and city columns refer to time and place where the event took place.
- sport, event and medal values refer to the sport practiced by the athlete, event where they participated and category of the medal they won (if any).
Let’s look at the available data to better understand it:
id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3520 | Guillermo J. Amparan | M | Mexico | MEX | 1924 Summer | 1924 | Summer | Paris | Athletics | Athletics Men's 800 metres | ||||
35394 | Henry John Finchett | M | Great Britain | GBR | 1924 Summer | 1924 | Summer | Paris | Gymnastics | Gymnastics Men's Rings | ||||
21918 | Georg Frederik Ahrensborg Clausen | M | 28 | Denmark | DEN | 1924 Summer | 1924 | Summer | Paris | Cycling | Cycling Men's Road Race Individual | |||
110345 | Marinus Cornelis Dick Sigmond | M | 26 | Netherlands | NED | 1924 Summer | 1924 | Summer | Paris | Football | Football Men's Football | |||
54193 | Thodore Tho Jeitz | M | 26 | Luxembourg | LUX | 1924 Summer | 1924 | Summer | Paris | Gymnastics | Gymnastics Men's Individual All-Around |
Everything looks normal. Except the fact that some records have empty age, height, and weight fields.
Another important detail is that the medal field is empty if the athlete did not win any medals.
Plan your approach
We need to filter records so that there are only Chinese athletes that participated in 2000, 2004, 2008, 2012 and 2016 olympics.
Once we have filtered results, we need to create groups for each unique combination of year and medal category. Gold medals won in 2016 and silver medals won in 2016 would be two different groups.
The next step is to find the total number of medals for each year. We can use CASE with SUM() to only aggregate values that meet a criteria (medals won in a specific year). We can set up SUM() aggregate functions for every year from 2000 to 2016.
Finally, we will need to group records by quality and order them in a descending order, from highest medal count to the lowest.
Write the code
1. Filter the table
To solve this question, we only need records of Chinese athletes that participated in the 2000, 2004, 2008, 2012 and 2016 Olympic games. All other nationalities and all other competitions need to be filtered out.
We can use the OR logical operator to provide a list of multiple competition years that are ‘acceptable’. A much easier solution is to use the IN operator.
We also need to filter out athletes (regardless of their nationality) that did not win any medals. In the olympic_athletes_events table, medal value for athletes that did not win is NULL.
We use IS NOT NULL to identify athletes who won a medal, and therefore have a medal value that is not NULL.
SELECT
year,
medal,
count(*) AS n_medals
FROM
olympics_athletes_events
WHERE
team = 'China' AND
year IN (2000, 2004, 2008, 2012, 2016) AND
medal IS NOT null
GROUP BY
year,
medal
We create groups for every unique combination of year and medal values and use COUNT(*) to get the number of records in each group.
We still need to work with data from this query, so in the following steps, we are going to use it as a subquery.
2. Aggregate medals for each year
Our subquery outputs groups of year and medal (bronze, silver, gold) values, and corresponding number of medals for each group.
Next, we need to find the number of medals won in each year (2000, 2004, 2008, 2012 and 2016) and total number of medals won in all these years.
We can use SUM() aggregate function and CASE expression to only add up values that meet the criteria. In this case, we’ll write five different SUM() aggregate functions to add up medals won in a specific year, from 2000 to 2016.
Our final SUM() function should unconditionally add up medals in all years.
SELECT
base.medal,
sum(CASE WHEN year = 2000 THEN n_medals ELSE 0 END) AS medals_2000,
sum(CASE WHEN year = 2004 THEN n_medals ELSE 0 END) AS medals_2004,
sum(CASE WHEN year = 2008 THEN n_medals ELSE 0 END) AS medals_2008,
sum(CASE WHEN year = 2012 THEN n_medals ELSE 0 END) AS medals_2012,
sum(CASE WHEN year = 2016 THEN n_medals ELSE 0 END) AS medals_2016,
sum(n_medals) AS total_medals
FROM
(subquery)
3. Combine same medals into one group and order them
The question asks us to find the aggregate number of each type of medals won in specified years, so we need to combine records for bronze, silver and gold medals into one group.
SELECT
base.medal,
sum(CASE WHEN year = 2000 THEN n_medals ELSE 0 END) AS medals_2000,
sum(CASE WHEN year = 2004 THEN n_medals ELSE 0 END) AS medals_2004,
sum(CASE WHEN year = 2008 THEN n_medals ELSE 0 END) AS medals_2008,
sum(CASE WHEN year = 2012 THEN n_medals ELSE 0 END) AS medals_2012,
sum(CASE WHEN year = 2016 THEN n_medals ELSE 0 END) AS medals_2016,
sum(n_medals) AS total_medals
FROM
(SELECT
year,
medal,
count(*) AS n_medals
FROM
olympics_athletes_events
WHERE
team = 'China' AND
year IN (2000, 2004, 2008, 2012, 2016) AND
medal IS NOT null
GROUP BY
year,
medal) AS base
GROUP BY
base.medal
ORDER BY
total_medals DESC
And finally, order groups by the total number of medals, from highest to lowest.
Output
The question should have the following output:
medal | medals_2000 | medals_2004 | medals_2008 | medals_2012 | medals_2016 | total_medals |
---|---|---|---|---|---|---|
Bronze | 0 | 0 | 0 | 0 | 2 | 2 |
Gold | 0 | 0 | 0 | 0 | 1 | 1 |
Question 2: Rows With Missing Values
In this question, we’ll use CASE expressions to do conditional filtering.
Interview Question Date: April 2022
The data engineering team at YouTube want to clean the dataset user_flags. In particular, they want to examine rows that have missing values in more than one column. List these rows.
Link to the question: https://platform.stratascratch.com/coding/2106-rows-with-missing-values
Understand the question
To solve this question, we have to write a SQL query that returns records with more than one empty column. In other words, records that have more than one NULL value.
Analyze data
user_firstname | user_lastname | video_id | flag_id |
---|---|---|---|
Richard | Hasson | y6120QOlsfU | 0cazx3 |
Mark | May | Ct6BUPvE2sM | 1cn76u |
Gina | Korman | dQw4w9WgXcQ | 1i43zk |
Mark | May | Ct6BUPvE2sM | 1n0vef |
Mark | May | jNQXAC9IVRw | 1sv6ib |
We need to find the number of NULL values for each record. We are not interested in values, only whether or not they are NULL.
Plan your approach
In this question, we have to return rows that satisfy criteria - have more than one NULL value.
To return these values, we’ll need to use the SELECT statement with the FROM and WHERE clauses to only return rows that meet the condition - have more than one NULL value.
Setting up the condition is the most challenging part of this question. You can use CASE to count the number of NULL values for each record and return rows where that count is more than 1.
Write the code
1. SELECT data from the table
First, let’s select all rows from the user_flags table.
SELECT *
FROM user_flags
2. Filter the table
To filter tables in SQL, we need to use the WHERE clause. We need to make sure the number of NULL columns is more than one.
We’ll need four CASE expressions to check if the value in each column is NULL and return 1 if it is.
Finally, we’ll add the results of four CASE expressions and get the total number of NULL values.
SELECT *
FROM user_flags
WHERE (CASE
WHEN user_firstname IS NULL THEN 1
ELSE 0
END) + (CASE
WHEN user_lastname IS NULL THEN 1
ELSE 0
END) + (CASE
WHEN video_id IS NULL THEN 1
ELSE 0
END) + (CASE
WHEN flag_id IS NULL THEN 1
ELSE 0
END) > 1
For instance, if two of the four CASE expressions return 1, four CASE expressions will add up to 2.
Output
Our final answer needs to include all rows that have two or more missing values.
user_firstname | user_lastname | video_id | flag_id |
---|---|---|---|
Courtney | dQw4w9WgXcQ | ||
Gina | Korman | ||
Greg | 5qap5aO4i9A | ||
Ct6BUPvE2sM |
Question 3: Churn rate of Lyft drivers
This question came up during an interview at the ride-sharing service Lyft. Let’s explore how CASE expressions can help us find the answer.
Calculate the overall churn rate for Lyft drivers across all years in the dataset. Churn is defined as the percentage of drivers who have stopped driving for Lyft, as indicated by a recorded end_date in the lyft_drivers table. In your answer, express the churn rate as a ratio, instead of a percentage. For example, 0.1 instead of 10%.
Link to the question: https://platform.stratascratch.com/coding/10016-churn-rate-of-lyft-drivers
Understand the question
Understanding churn rate is the main challenge of this question.
The churn rate essentially measures the share of users that left the platform compared to all users.
To calculate it, you need to find the number of users who left and divide that number by the number of all users.
Now you can write the formula and use values from the table to do the actual calculation.
Analyze data
To solve this question, we have to work with data in the lyft_drivers table, which has four columns.
- index integer value, which is probably a unique identifier for drivers.
- start_date datetime values most likely represent the date when driver first signed up for Lyft
- end_date values specify the date when the driver stopped working for Lyft.
- yearly_salary integer values stand for the driver’s yearly earnings in dollars.
Now, let’s see a preview of the table filled with data:
index | start_date | end_date | yearly_salary |
---|---|---|---|
0 | 2018-04-02 | 48303 | |
1 | 2018-05-30 | 67973 | |
2 | 2015-04-05 | 56685 | |
3 | 2015-01-08 | 51320 | |
4 | 2017-03-09 | 67507 |
Previewing the table reveals one crucial detail - some driver records do not have an end_date value, but others do.
It’s safe to assume that drivers with an end_date value have stopped working for Lyft, and those without this value are still driving.
Plan your approach
To calculate the churn rate, we need to divide the number of users who left (stopped driving) by the number of all users.
You can use the COUNT(*) aggregate function to get the number of all drivers. To find the number of drivers who left the platform, you’ll have to combine CASE with the COUNT() aggregate function.
Remember that records that have an end_date value represent drivers who stopped driving. Drivers without the end_date value are still driving.
Write the code
Try your hand here:
Output
The query should output just one value - the churn rate of Lyft drivers. It needs to be a decimal number between 0 and 1.
global_churn_rate |
---|
0.5 |
Question 4: Extremely Late Delivery
This question was asked during an interview at the food delivery service DoorDash.
Interview Question Date: June 2022
To remain competitive, the company you work with must reduce the number of extremely late deliveries.
A delivery is flagged as extremely late if the actual delivery time is more than 20 minutes (not inclusive) after the predicted delivery time.
You have been asked to calculate the percentage of orders that arrive extremely late each month.
Your output should include the month in the format 'YYYY-MM' and the percentage of extremely late orders as a percentage of all orders placed in that month.
Link to the question: https://platform.stratascratch.com/coding/2113-extremely-late-delivery
Understand the question
In this question, you have to work with delivery order data and find the number of extremely late deliveries.
The question clearly states that an order can be considered ‘extremely late’ if it arrives 20 minutes later than expected.
The question tells you to output the percentage of extremely late orders in every month. It also provides instructions about the output format.
Analyze data
Let’s look at values in delivery_orders, the only table available for this question.
- delivery_id contains integer values to identify each order
- order_placed_time contains the exact date and time when the order was placed.
- predicted_delivery_time values specify when the order was supposed to arrive. It’s safe to assume that we will look at values in this column to determine which orders are extremely late.
- Datetime values in the actual_delivery_time column specify when the order was actually delivered.
- Integer values in the delivery_rating column probably represent the customer’s feedback
- dasher_id values most likely identify the courier.
- restaurant_id identifies the restaurant that prepared the food
- consumer_id identifies the person receiving the order
We can preview the delivery_orders table to verify our assumptions:
delivery_id | order_placed_time | predicted_delivery_time | actual_delivery_time | delivery_rating | driver_id | restaurant_id | consumer_id |
---|---|---|---|---|---|---|---|
O2132 | 2021-11-17 04:45:00 | 2021-11-17 05:37:00 | 2021-11-17 05:58:00 | 4 | D239 | R633 | C1001 |
O2152 | 2021-12-09 19:09:00 | 2021-12-09 19:41:00 | 2021-12-09 19:41:00 | 3 | D238 | R635 | C1010 |
O2158 | 2022-01-04 02:31:00 | 2022-01-04 02:56:00 | 2022-01-04 03:21:00 | 4 | D239 | R634 | C1010 |
O2173 | 2022-02-09 00:45:00 | 2022-02-09 01:19:00 | 2022-02-09 01:33:00 | 0 | D239 | R633 | C1038 |
O2145 | 2021-12-04 17:20:00 | 2021-12-04 18:04:00 | 2021-12-04 18:31:00 | 1 | D239 | R634 | C1042 |
Based on our assessment, we’ll need to compare values in predicted_delivery_time and actual_delivery_time columns to find extremely late deliveries.
Plan your approach
Use CASE to set up a condition for identifying extremely late delivery. If the difference between predicted_delivery_time and actual_delivery_time values is more than 20, CASE should return 1; otherwise 0.
Pass it as an argument to the SUM() aggregate function, which will add up every value returned by the CASE expression.
To find the percentage, we need to divide the number of late orders by the number of all orders and multiply the ratio by 100.
SUM() and COUNT() aggregate functions should apply to each month, not the entire table. So we need to group records by month.
The output should show months in a specific format.
Write the code
Try your hand here:
Output
Final answer should show the percentage of late deliveries for each month. Note that the question has specific instructions on how the month needs to be formatted.
year_month | perc_extremely_delayed |
---|---|
2021-11 | 33.33 |
2021-12 | 13.33 |
2022-02 | 11.11 |
2022-01 | 36.36 |
Question 5: Find the percentage of shippable orders
In our last question, candidates have to calculate the percentage of all orders that are shippable.
Find the percentage of shipable orders. Consider an order is shipable if the customer's address is known.
Link to the question: https://platform.stratascratch.com/coding/10090-find-the-percentage-of-shipable-orders
Understand the question
The task is clear - find what percentage of all orders are shippable. We need to start by finding the number of all shippable orders.
The question description states that an order is shippable if it has an address.
Analyze data
Let’s look at values in two tables available for this question: orders and customers
- id column contains integer values to identify each order
- cust_id values identify the customer who placed the order
- order_date specifies the year, month, and day when the order was placed.
- Values in the order_details column describe the type of product ordered.
- total_order_cost column refers to order size in dollars.
Let’s preview the orders table:
id | cust_id | order_date | order_details | total_order_cost |
---|---|---|---|---|
1 | 3 | 2019-03-04 | Coat | 100 |
2 | 3 | 2019-03-01 | Shoes | 80 |
3 | 3 | 2019-03-07 | Skirt | 30 |
4 | 7 | 2019-02-01 | Coat | 25 |
5 | 7 | 2019-03-10 | Shoes | 80 |
The data in this table looks standard.
Let’s examine values in the customers table.
- id column identifies the customer
- first_name contains the customer’s first name
- last_name contains the customer’s last name
- city specifies the city the user is from
- address is likely the customer’s default shipping address
- phone_number column contains the customer’s phone number
Now, the preview of the customers table:
id | first_name | last_name | city | address | phone_number |
---|---|---|---|---|---|
8 | John | Joseph | San Francisco | 928-386-8164 | |
7 | Jill | Michael | Austin | 813-297-0692 | |
4 | William | Daniel | Denver | 813-368-1200 | |
5 | Henry | Jackson | Miami | 808-601-7513 | |
13 | Emma | Isaac | Miami | 808-690-5201 |
Everything looks okay, except some customers don’t have an address value.
Plan your approach
To answer this question, we need to check if the person who placed the order has provided a shipping address. Records in the orders table give us information about orders, including the identity of the user responsible for the order. However, there is no information about whether or not the user has provided a shipping address.
Information about users’ addresses is stored in the customers table. Both tables have values that reference customer identities. This is a shared dimension we can use to combine data from two tables.
We can use CASE to check the address value. If it is NULL, then CASE should return a boolean value ‘false’, because orders placed by these customers are not shippable. If it’s a non-NULL value, the expression should return a boolean value ‘true’, because orders are shippable.
Save the result of the CASE expression in a new column, which will indicate whether or not the order is shippable.
As a final step, find the total number of orders that are shippable. Then find the percentage of orders that are shippable.
Write the code
Try your hand here:
Output
Final answer for this question is just one value - the percentage of shippable orders.
percent_shipable |
---|
28 |
Summary
In this “CASE WHEN statements in SQL” guide, we discussed syntax and many use cases of the CASE expression. You won’t use it as frequently as WHERE, GROUP BY, ORDER BY, and other major SQL features, but CASE is still very important and worth your time. When applicable, it can help you find simple answers to difficult SQL questions.
To demonstrate, we selected five interview questions with answers that revolved around CASE expression. Looking at our solutions and trying to answer the questions yourself will definitely help you master CASE expressions and their use.
You can maximize your chances of landing a job by signing up on the StrataScratch platform, which has hundreds of questions involving CASE and other SQL features.
The StrataScratch platform is the perfect place to practice writing queries. You can answer questions and see if your query would be accepted during an actual interview. If you get stuck, the platform has hints to point you in the right direction. You can see various solutions to the same problem and train your mind to find the most efficient solutions to each question.