SELECT UNIQUE in SQL: Mastering Data Retrieval with DISTINCT
Categories
What is SELECT UNIQUE in SQL? What about SELECT DISTINCT? How do you retrieve unique data? We will answer these three and more questions in this article.
Databases contain a lot of data, but this data is not necessarily unique. With so many rows of data being stored, it’s realistic to expect that some of that data will have the same values in the columns as some other data—in short, duplicate values.
Having duplicate values goes against the requirement of many data handling tasks, which is to eliminate duplicate values. This is where the SQL clause DISTINCT comes into play.
What “Unique” Means in the Context of Database Records
In a database context, “unique” means that each database record (or row) is different from all others. It doesn’t mean that all the values (columns) need to be different from all other rows’ columns; it’s enough that only one column is different to proclaim the row unique.
So, what we’re looking at here is the unique combination of values. Take a look at this illustration.
The red rows are not unique because they have exactly the same values in all columns. We have duplicate rows for whatever reason – maybe it’s intentional or an error.
However, the green row is unique, despite the value in one column only (first_name) being different from the other two rows. This is a record of a different person (maybe Bob’s twin brother Leo Zimmerman?) than in the other two rows.
Why Retrieving Unique Records is Crucial
There are three main reasons why getting unique data is important.
1. Data Analysis
Many a data professional has made the mistake of including duplicate values when only unique values are required. By doing so, they made the entire data analysis and its conclusion utterly worthless. Not excluding duplicates from your data means you will artificially inflate whatever values you’re trying to calculate. This could lead to painfully wrong conclusions and business decisions. Yes, you have the power to run the businesses into the ground! Are you scared of such a responsibility? You should be.
Imagine you have a database that records each order as a new row whenever the order’s status changes.
Naively, you just sum all the rows and report $28,200 monthly revenue—well above the monthly plan of $18,000. A record! The boss is happy, every employee gets a $1,000 bonus, and the whole company goes on holiday for five days as a reward. What happens when you come back to work and realize you included duplicate orders?
Shit, the actual monthly revenue is $16,600, almost 50% below the monthly plan!
You’ll be happy if the boss only throws a chair your way. (With all the ‘respectful workplace’ and ‘code of conduct’ guidelines flying straight out the window!)
2. Data Integrity
The other side of the previous point is having duplicate values where there shouldn’t be. In short, data integrity. Compromise it, and you’re again risking producing data that will lead to erroneous decision-making.
For example, you now have a database where each transaction in the table should be unique. However, due to a system glitch, some transactions were erroneously recorded multiple times.
If you checked the data, you would see that only some transaction IDs appear once. By doing so, you could report the mistake, make corrections, and initiate an investigation to find the cause of the error. On top of that, you can temporarily work with such data by simply summing the unique transaction IDs. That way, you get a correct total transaction amount of $10,500, not $19,500, as you would by including duplicate transactions.
3. Querying Efficiency
Creating a sub-dataset that includes only unique values can significantly improve the efficiency of the queries used in the following steps of data processing. This is logical, as having only unique rows reduces the amount of data to be queried.
Take this, for example. Imagine this table tracks user login attempts.
Potentially, the table could have millions of login data rows. Imagine that you’re making some report for which it’s enough for the user to successfully log in only once to be considered for the report.
If that’s the main criterion, you would wisely reduce the data by selecting only unique values. Then, you can perform any further analysis much quicker and easier.
The DISTINCT Keyword
We’ve been talking about unique values, so what the heck is DISTINCT now? Well, SQL uses the DISTINCT keyword to select unique values. There’s no such thing as UNIQUE. OK, that’s not entirely true. So, let’s clear this out.
UNIQUE is an old command for selecting unique values in the Oracle database. Later, the SQL standard defined DISTINCT as the keyword for fetching unique (or distinct) values. So, this is now used in all databases, including Oracle. However, Oracle didn’t ditch UNIQUE. Even though it completely disappeared for collective consciousness, you can still, if you insist or feel very retro, use UNIQUE instead of DISTINCT in Oracle to get unique values.
The DISTINCT keyword is the most commonly used as a part of SELECT DISTINCT queries.
Examples of SELECT DISTINCT Queries
Since there’s nothing that special with the SELECT DISTINCT syntax, I’ll immediately go to simple examples that show its two widespread uses.
I’ll just run through queries and what they do. Don’t worry; very detailed explanations of applying DISTINCT in practice will follow in the next section.
Using SELECT DISTINCT With One Column
One typical use is making a list of unique names of customers. It’s expected that many customers have the same first name, so to create a list of names, you’d need SELECT DISTINCT.
SELECT DISTINCT first_name
FROM customers;
It’s a regular SELECT statement but with the keyword DISTINCT before the selected column. This way, the database knows you’re interested in showing each customer name only once.
Using SELECT DISTINCT With Aggregate Function
Another common use of DISTINCT is with the aggregate function.
For example, you’re tracking individual customer orders. Each row records the order ID, together with the customer ID and order value. Naturally, many customers placed multiple orders.
You’re interested in finding the number of unique order amounts across all customers. You can do that using DISTINCT in the aggregate function, COUNT() in this case.
SELECT COUNT(DISTINCT order_amount) AS number_of_unique_order_amounts
FROM orders;
When using DISTINCT with an aggregate function, remember that DISTINCT goes inside the aggregate function and before the column within its parentheses.
Using DISTINCT on Multiple Columns
When used with one column, DISTINCT returns the unique values of that column. However, when used with multiple columns, DISTINCT looks for unique value combinations across the selected columns.
Two important notes here. No, you don’t write the DISTINCT keyword in front of every column in SELECT; you write it only in front of the first selected column.Second, no, you can’t selectively choose which selected columns you want to apply DISTINCT to. In other words, you either select unique combinations of all the columns or no unique columns at all. In even ‘otherer’ words, whatever columns you list in SELECT will be included in the DISTINCT clause.
Here’s an example to make it clear.
This is a slightly more complicated example than the previous two, so I’ll show you an example dataset. That way, you’ll visualize what’s going on better.
What would happen if you wrote a query like this?
SELECT DISTINCT first_name,
last_name,
department
FROM employee;
It would return unique combinations of the employees’ first names, last names, and departments.
Let’s compare it to the original table we’re querying.
Remember, we’re talking about the combination of all columns. That’s why both Mark Johnson and Mark Wallace from Marketing appear: they have the same first name and department but different last names.
There are also two Mark Johnsons, but they are not duplicates, as they are from different departments.
As you can see, the only omitted row is the duplicate of Francesca Garibaldi from Sales. Rows 5 and 6 in the initial table are exactly the same, so they are shown only once in the query output.
Practical Applications of SELECT UNIQUE in SQL
Let’s now see how selecting unique values in SQL can be applied in the business context. There are some great examples in our coding interview questions. There it is, another practical application for you: passing the SQL job interview by practicing SQL interview questions.
Business Application: SELECT DISTINCT With One Column
This question is by Wine Magazine. It’s an example of a simple use of SELECT DISTINCT with one column.
Find all wine varieties which can be considered cheap based on the price. A variety is considered cheap if the price of a bottle lies between 5 to 20 USD. Output unique variety names only.
Link to the question: https://platform.stratascratch.com/coding/10022-find-all-wine-varieties-which-can-be-considered-cheap-based-on-the-price
The question asks you to query the table winemag_p1 and find all the unique cheap wine varieties.
id | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
---|---|---|---|---|---|---|---|---|---|---|
126576 | US | Rich and round, this offers plenty of concentrated blackberry notes enveloped in warm spices and supple oak. There's a hint of green tomato leaves throughout, but the lush fruit combined with sturdy grape tannins and high acidity would pair well with fatty short ribs or braised pork. | Estate Club | 87 | 32 | Virginia | Virginia | Merlot | Veramar | |
127077 | Italy | This luminous sparkler offers measured aromas of yellow fruit and honey and delivers a broad, wide approach in the mouth. The wine's texture is creamy and full and there is a spicy point of effervescence on the palate. | Extra Dry | 85 | 19 | Veneto | Prosecco di Valdobbiadene | Prosecco | Varaschin | |
143029 | US | There are some pretty features to this wine, namely the lovely tannins and rich fruit, but it's just too soft and sweet. Tastes like blackberry jam, with extra white sugar spinkled on top. | Signature | 83 | 45 | California | Paso Robles | Central Coast | Cabernet Sauvignon | Byington |
98813 | US | Lovely nose, which runs from violets and cherries right into coffee and sweet chocolate. In the mouth the wine coats the palate with Napa-like intensity, only buttressed with more vivid acidity. It's both smooth and deep. The aromas are more detailed, interesting and really clean, loaded with dark flavors of soy, coffee, molasses and black cherry liqueur. | Champoux & Loess Vineyards Reserve | 93 | 100 | Washington | Washington | Washington Other | Cabernet Sauvignon | Boudreaux Cellars |
43172 | US | Tastes sweeter and fruitier than the 2009, which was dry, so if you like orange marmalade, pineapple tart and vanilla cookie sugariness in your Sauvignon Blancs, it's for you. | Les Pionniers | 86 | 40 | California | Sonoma County | Sonoma | Sauvignon Blanc | Capture |
I used distinct on the wine variety column. After that, I only needed to filter data using where to output only wines with prices between $5 and $20.
SELECT DISTINCT variety
FROM winemag_p1
WHERE price BETWEEN 5 AND 20;
Here’s the output.
variety |
---|
Kuntra |
Riesling |
Sangiovese |
Assyrtiko |
Malbec-Syrah |
Just for fun, let’s see how to use SELECT UNIQUE in Oracle. I’ll take the above code, use UNIQUE instead of DISTINCT, and run it in the Oracle code editor.
SELECT UNIQUE variety
FROM winemag_p1
WHERE price BETWEEN 5 AND 20
The code works perfectly!
Business Application: SELECT DISTINCT With Aggregation
The following question was sourced from the Netflix and Google interviews.
Count the number of movies that Abigail Breslin was nominated for an oscar.
Link to the question: https://platform.stratascratch.com/coding/10128-count-the-number-of-movies-that-abigail-breslin-nominated-for-oscar
It requires you to count the number of movies for which Abigail Breslin was an Oscar nominee. The question gives you the table oscar_nominees to work with.
year | category | nominee | movie | winner | id |
---|---|---|---|---|---|
2006 | actress in a supporting role | Abigail Breslin | Little Miss Sunshine | FALSE | 1 |
1984 | actor in a supporting role | Adolph Caesar | A Soldier's Story | FALSE | 2 |
2006 | actress in a supporting role | Adriana Barraza | Babel | FALSE | 3 |
2002 | actor in a leading role | Adrien Brody | The Pianist | TRUE | 4 |
1942 | actress in a supporting role | Agnes Moorehead | The Magnificent Ambersons | FALSE | 5 |
To solve this question, use DISTINCT with the COUNT() function. Then, filter the nominee using WHERE.
SELECT COUNT(DISTINCT movie) AS n_movies_by_abi
FROM oscar_nominees
WHERE nominee = 'Abigail Breslin';
Abigail was nominated for only one movie.
n_movies_by_abi |
---|
1 |
Business Application: DISTINCT With Multiple Columns
Here’s an example of a business needing to use DISTINCT with multiple columns in SQL. The question is by the City of San Francisco.
Find the business names that scored less than 50 in inspections. Output the result along with the corresponding inspection date and the score.
Link to the question: https://platform.stratascratch.com/coding/9720-find-the-business-names-that-have-inspection-scores-of-less-than-50/official-solution
You’ll work with the sf_restaurant_health_violations table.
business_id | business_name | business_address | business_city | business_state | business_postal_code | business_latitude | business_longitude | business_location | business_phone_number | inspection_id | inspection_date | inspection_score | inspection_type | violation_id | violation_description | risk_category |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5800 | John Chin Elementary School | 350 Broadway St | San Francisco | CA | 94133 | 37.8 | -122.4 | {'longitude': '-122.403154', 'needs_recoding': False, 'latitude': '37.798358', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 5800_20171017 | 2017-10-17 | 98 | Routine - Unscheduled | 5800_20171017_103149 | Wiping cloths not clean or properly stored or inadequate sanitizer | Low Risk | |
64236 | Sutter Pub and Restaurant | 700 Sutter St | San Francisco | CA | 94102 | 37.79 | -122.41 | {'longitude': '-122.41188', 'needs_recoding': False, 'latitude': '37.78881', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 64236_20170725 | 2017-07-25 | 88 | Routine - Unscheduled | 64236_20170725_103133 | Foods not protected from contamination | Moderate Risk | |
1991 | SRI THAI CUISINE | 4621 LINCOLN Way | San Francisco | CA | 94122 | 37.76 | -122.51 | {'longitude': '-122.507779', 'needs_recoding': False, 'latitude': '37.764073', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 1991_20171129 | 2017-11-29 | 86 | Routine - Unscheduled | 1991_20171129_103139 | Improper food storage | Low Risk | |
3816 | Washington Bakery & Restaurant | 733 Washington St | San Francisco | CA | 94108 | 37.8 | -122.41 | {'longitude': '-122.405845', 'needs_recoding': False, 'latitude': '37.795174', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 3816_20160728 | 2016-07-28 | 67 | Routine - Unscheduled | 3816_20160728_103108 | Contaminated or adulterated food | High Risk | |
39119 | Brothers Restaurant | 4128 GEARY Blvd | San Francisco | CA | 94118 | 37.78 | -122.46 | {'longitude': '-122.463762', 'needs_recoding': False, 'latitude': '37.781148', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 39119_20160718 | 2016-07-18 | 79 | Routine - Unscheduled | 39119_20160718_103133 | Foods not protected from contamination | Moderate Risk |
This table records data on the health violation level. One inspection can have several health violations, so it’s highly likely that the same inspection will appear several times.
Having that in mind, as well as the requirement to output the business name, inspection date, and score, we know we need to use DISTINCT the following way.
SELECT DISTINCT business_name,
inspection_date,
inspection_score
FROM sf_restaurant_health_violations
WHERE inspection_score < 50;
This ensures that each inspection data appears only once for a particular business.
In the final code line, you also need to filter data so the output shows only inspections below the score of 50.
The output shows only two businesses and inspections.
business_name | inspection_date | inspection_score |
---|---|---|
Da Cafe | 2016-09-07 | 48 |
Lollipot | 2018-05-22 | 45 |
Advanced Topics in Selecting Unique Values in SQL
Now, I want to talk about some more complex (but not too much!) ways of using DISTINCT.
Business Application: SELECT DISTINCT With Aggregate and GROUP BY
In one of the previous examples, I used SELECT DISTINCT with an aggregate function. Because there was only an aggregated column in SELECT, there was no need to use GROUP BY in the query.
Let’s now see an example where GROUP BY is necessary. For instance, in this question by Airbnb.
Find the number of apartments per nationality that are owned by people under 30 years old.
Output the nationality along with the number of apartments.
Sort records by the apartments count in descending order.
Link to the question: https://platform.stratascratch.com/coding/10156-number-of-units-per-nationality
The question asks you to find the number of apartments by host nationality, but you should count only hosts younger than 30.
We work with two tables here. The first one is airbnb_hosts.
host_id | nationality | gender | age |
---|---|---|---|
0 | USA | M | 28 |
1 | USA | F | 29 |
2 | China | F | 31 |
3 | China | M | 24 |
4 | Mali | M | 30 |
The second is a table of accommodation units.
host_id | unit_id | unit_type | n_beds | n_bedrooms | country | city |
---|---|---|---|---|---|---|
0 | A1 | Room | 1 | 1 | USA | New York |
0 | A2 | Room | 1 | 1 | USA | New Jersey |
0 | A3 | Room | 1 | 1 | USA | New Jersey |
1 | A4 | Apartment | 2 | 1 | USA | Houston |
1 | A5 | Apartment | 2 | 1 | USA | Las Vegas |
Yes, you’re right; we will select nationality and use COUNT() with DISTINCT to find the number of units. The selected columns are from different tables, so we need to join them. (For a reminder how this works, read about types of SQL JOIN.
To get the count by nationality, it’s necessary to group the output by it. So, this is not really different from any other use of the aggregate functions. The principle is always the same: the columns you have in SELECT have to be in GROUP BY. The DISTINCT keyword doesn’t change anything in that sense; it only impacts how the aggregate function itself will behave.
Regarding our query, we need to fulfill some additional question requirements. We use WHERE to include only accommodations categorized as apartments and those owned by under-30s.
As a last step, we must sort the output by the number of apartments descendingly.
SELECT nationality,
COUNT(DISTINCT unit_id) AS apartment_count
FROM airbnb_units apartment
INNER JOIN airbnb_hosts host
ON apartment.host_id = host.host_id
WHERE host.age < 30 AND unit_type = 'Apartment'
GROUP BY host.nationality
ORDER BY apartment_count DESC;
The code output is this.
nationality | apartment_count |
---|---|
USA | 2 |
Business Application: DISTINCT in HAVING
DISTINCT is not limited to SELECT; you can use it in HAVING to filter the output based on unique values.
Let’s solve the question by the City of Los Angeles to see how.
Find all owners which have only a single facility. Output the owner_name and order the results alphabetically.
Link to the question: https://platform.stratascratch.com/coding/9694-single-facility-corporations
Here, we need to output the owners that have restaurants with a single facility.
We will query the los_angeles_restaurant_health_inspections table.
serial_number | activity_date | facility_name | score | grade | service_code | service_description | employee_id | facility_address | facility_city | facility_id | facility_state | facility_zip | owner_id | owner_name | pe_description | program_element_pe | program_name | program_status | record_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DAQHRSETQ | 2017-06-08 | MARGARITAS CAFE | 93 | A | 1 | ROUTINE INSPECTION | EE0000006 | 5026 S CRENSHAW BLVD | LOS ANGELES | FA0023656 | CA | 90043 | OW0004133 | BAZAN, ASCENCION | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | MARGARITAS CAFE | ACTIVE | PR0011718 |
DA2GQRJOS | 2017-03-07 | LAS MOLENDERAS | 97 | A | 1 | ROUTINE INSPECTION | EE0000997 | 2635 WHITTIER BLVD | LOS ANGELES | FA0160416 | CA | 90023 | OW0125379 | MARISOL FEREGRINO | RESTAURANT (0-30) SEATS HIGH RISK | 1632 | LAS MOLENDERAS | INACTIVE | PR0148504 |
DAMQTA46T | 2016-03-22 | SANDRA'S TAMALES | 93 | A | 1 | ROUTINE INSPECTION | EE0001049 | 5390 WHITTIER BLVD | LOS ANGELES | FA0171769 | CA | 90022-4032 | OW0178828 | SANDRA'S TAMALES INC. | RESTAURANT (0-30) SEATS MODERATE RISK | 1631 | SANDRA'S TAMALES | ACTIVE | PR0164225 |
DAXMBTIRZ | 2018-02-12 | CAFE GRATITUDE | 97 | A | 1 | ROUTINE INSPECTION | EE0000828 | 639 N LARCHMONT BLVD STE #102 | LOS ANGELES | FA0058921 | CA | 90004 | OW0005704 | CAFE GRATITUDE LARCHMONT LLC | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | CAFE GRATITUDE | ACTIVE | PR0019854 |
DAK8TBMS0 | 2015-09-10 | THE WAFFLE | 90 | A | 1 | ROUTINE INSPECTION | EE0000709 | 6255 W SUNSET BLVD STE #105 | LOS ANGELES | FA0051830 | CA | 90028 | OW0035796 | THE WAFFLE, LLC | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | THE WAFFLE | ACTIVE | PR0010922 |
We select the owner names from the table and group the results by the owner ID and name. This ensures we have unique owners, which are repeated across multiple inspections.
Now, we need to keep only the owners with a one-facility restaurant. We can do that using HAVING, which means we filter after grouping the output. In HAVING, we use COUNT() with DISTINCT to count the unique facility IDs.
SELECT owner_name
FROM los_angeles_restaurant_health_inspections
GROUP BY owner_id, owner_name
HAVING COUNT(DISTINCT facility_id) = 1
ORDER BY owner_name;
The code outputs this list.
owner_name |
---|
110 SUNSHINE SMOOTHIES, INC. |
110 SUNSHINE SMOOTHIES, INC. |
7111 MELROSE LLC |
99 CENT ONLY STORES, LLC |
9 STONES, LLC |
Business Application: DISTINCT With CASE Statement
The DISTINCT clause can also be used with the CASE statement, like with any regular column.
In this question by Salesforce and Asana, we use DISTINCT with the COUNT() function and CASE statement.
Interview Question Date: July 2021
Find the growth rate of active users for Dec 2020 to Jan 2021 for each account. The growth rate is defined as the number of users in January 2021 divided by the number of users in Dec 2020. Output the account_id and growth rate.
Link to the question: https://platform.stratascratch.com/coding/2052-user-growth-rate
We’ll work with a single table named sf_events.
date | account_id | user_id |
---|---|---|
2021-01-01 | A1 | U1 |
2021-01-01 | A1 | U2 |
2021-01-06 | A1 | U3 |
2021-01-02 | A1 | U1 |
2020-12-24 | A1 | U2 |
From the table, we need to find the growth rate of active users between December 2020 and January 2021 for each account.
How do we do that? Growth is defined as the number of users in January 2021 divided by the number in December 2020.
So, this is exactly what COUNT(), DISTINCT, and CASE do in the query below. The first instance (in the numerator) uses the CASE statement to find all active users between 1 January 2021 and 31 January 2021. DISTINCT ensures that only unique users are counted.
The calculation in the denominator does the same but for dates between 1 December 2020 and 31 December 2020.
In the end, the output is just grouped by account ID.
SELECT account_id,
COUNT(DISTINCT (CASE
WHEN date BETWEEN '2021-01-01' AND '2021-01-31' THEN user_id
ELSE NULL
END))::float / COUNT(DISTINCT (CASE
WHEN date BETWEEN '2020-12-01' AND '2020-12-31' THEN user_id
ELSE NULL
END))::float AS growth_rate
FROM sf_events
GROUP BY 1;
Here’s the code output.
account_id | ?column? |
---|---|
A1 | 1 |
A2 | 2 |
A3 | 0.5 |
Best Practices and Common Pitfalls
Writing code with DISTINCT is not complicated at all. Using DISTINCT to its strength is much more difficult. In doing that, try to apply some best practices to avoid common pitfalls in using DISTINCT.
Best Practices in Using DISTINCT in SQL
The overview below shows five best practices I advise you to follow. Familiarize yourself with the list, and then I’ll explain it.
1. Use DISTINCT Sparingly: You should use DISTINCT only when necessary. If the data in the table is already unique, don’t use DISTINCT. If you can get unique values by grouping data, use GROUP BY. DISTINCT can slow down your query, especially with large datasets, as it goes through all the data to find and eliminate duplicates.
2. Use Indexes: If the columns used in DISTINCT are indexed, this can improve your query performance.
3. Limit the Number of Columns in DISTINCT: Don’t go into an overkill with the number of columns in DISTINCT: the more columns, the slower the query. Think about what columns you need and include only them in DISTINCT.
4. Use Aggregation Instead of DISTINCT: In one of the examples, I talked about how distinct values are fetched using GROUP BY. Whenever you can do it with GROUP BY or GROUP BY + aggregation, do so.
5. Understand Data: If you know your data, you’ll also know if some is unique. This relates to the nature of the data itself. But it also refers to data integrity constraints that ensure data is unique. Being aware of them saves you from using DISTINCT, as it’s unnecessary.
You’ll get into pitfalls if you ignore these best practices: you’ll use DISTINCT all the time with too many columns on data that already is unique while completely ignoring the existence of data grouping and aggregation. As a result, you’ll write too much code that will work unnecessarily slowly.
Conclusion
Knowing how to query unique values in SQL is important for anyone working with data. Hence, mastering DISTINCT is, too.
While its syntax is nothing short of banal, knowing when and how to use DISTINCT is anything but. You’ll need some practice to get to that level. Practice and business level is what StrataScratch is all about: in our coding questions section; there are almost 150 interview questions focusing on DISTINCT.
More than enough to learn DISTINCT inside out!