Counting Distinct Values in SQL: Tips and Examples

What is DISTINCT in SQL?

DISTINCT is an SQL keyword used for removing duplicates from the query output. In other words, whether the data appears once or several times in a dataset, it doesn’t matter – DISTINCT always returns it only once. In even-more-other words, SQL SELECT DISTINCT returns unique values.

When applied to one column, DISTINCT will return unique values from that column. If applied to multiple columns, DISTINCT returns a unique combination of values from those columns.

The Importance of Counting Distinct Values in Various Data Analysis Scenarios

Focusing on unique entries in datasets gives data analysis the possibility to understand data diversity, frequency, and distribution.

By doing so, you can understand patterns, anomalies, and trends that might be hidden away in redundant data.

Why Focus on Counting Distinct Values in Sql

1. Understanding Diversity

The diversity of a dataset is a key metric in many fields, which signals its richness and variety. Some examples are:

Marketing – for understanding customer preferences’ diversity
Ecology – for assessing biodiversity
Human resources – for analyzing employees’ skill set diversity and developing training and development programs based on that

2. Frequency and Distribution Analysis

By counting distinct values, you can go beyond simple counts and understand the distribution of unique items within a dataset. For instance, aside from knowing that the company sold 50,000 products, you can see that these products are distributed among 1,000 distinct products, which tells you much more about product variety and customer choice.

3. Identifying Trends and Anomalies

Counting distinct values can reveal certain trends. For example, suppose you track the emergence of new distinct values, such as new products or customer segments, over time in your time-series dataset. In that case, you can discover changes in customer and market behavior.

4. Data Cleaning and Quality

Employing distinct values count is also common when you want to check data quality and clean the data. If your distinct count is higher or lower than expected, it can lead to further data exploration that will reveal duplicate data, missing data, or inconsistencies.

How to Use DISTINCT in SQL?

Using DISTINCT in SQL is really easy: simply place it after SELECT and before the column names.

Here’s a syntax.

SELECT DISTINCT column_1,
	 column_2,
 	 ...
FROM table;

Counting Distinct Values

Counting values is done using the COUNT() aggregate function. To count values in the specific column, simply write its name in the function’s parentheses.

SELECT COUNT(column_1) AS column_alias
FROM table;

Combined with DISTINCT, it will do what we were building to – counting distinct values.

The syntax prescribes writing DISTINCT in COUNT() before the columns that appear in the function.

SELECT COUNT(DISTINCT column_1,
			    column_2,
			    ...)
FROM table;

Practical Tips for Using DISTINCT and COUNT()

I’ll now give you several practical tips for using DISTINCT in COUNT().

Handling NULL Values

One important thing to remember is that using COUNT() with a specific column name excludes NULLs from the count. So, using it with DISTINCT means that NULLs won’t be counted as a separate category.

If you want to count NULL values as a distinct category, the easiest way to do it is to use COALESCE().

SELECT COUNT(DISTINCT COALESCE(column_name, 'NULL'))
FROM table;

This function allows you to replace the NULL values with a placeholder, 'NULL', as shown in the example above.

Performance Considerations

When dealing with large datasets, using COUNT(DISTINCT) can be computationally expensive. This is because the database engine needs to scan the entire dataset, sort it, and compare values to return unique values. Only after that can the count be performed.

All this can be additionally slowed down if the columns in COUNT(DISTINCT) are not indexed.

Here are some tips on how to improve the query performance.

Improving the Count Distinct Query Performance in SQL

Indexing: Create indexes on a column named in COUNT(DISTINCT). While they improve performance, bear in mind that indexes require additional storage space, which can have a detrimental impact on write operations, such as INSERT, UPDATE, and DELETE.
Optimize the Query Execution Plan: Use your database’s query optimizer and query hints.
GROUP BY With COUNT(): One alternative to COUNT(DISTINCT) is a query with GROUP BY and COUNT(). Grouping the dataset by required columns and then counting values can sometimes be more efficient than COUNT(DISTINCT). This is especially the case if the data is already well-partitioned or indexed.
Approximate Queries: If you’re dealing with a large dataset where an exact count is not necessary, use approximate algorithms (such as APPROX_COUNT_DISTINCT() in Apache Spark) to estimate the count of distinct values.
Break Down the Query: Use subqueries or CTEs to break down your query into smaller parts. In addition, you can use partitioning to limit the scope of COUNT(DISTINCT).
Parallel Processing: If you have the possibility, leverage parallel query execution of your database engine or use distributed databases, such as Apache Spark or Amazon Redshift.
Data Aggregation and Preprocessing: If a distinct count is required frequently, you could consider aggregating data during ETL processes and performing count operations on that data.

Basic Real-World Examples

It’s now time to get into using SQL COUNT(DISTINCT) in practice. We’ll help ourselves with the interview questions from our platform.

Simple Distinct Values Count

The simplest possible example of counting distinct values is this interview question by Postmates. It asks you to count the customers who placed an order and calculate the average order amount.

Customer Average Orders

Interview Question Date: February 2021

Postmates

EasyID 2013

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

How many customers placed an order and what is the average order amount?

Table: postmates_orders

Link to the question: https://platform.stratascratch.com/coding/2013-customer-average-orders

The question gives you the postmates_orders table. It’s a list of orders and their details.

Table: postmates_orders

id	customer_id	courier_id	seller_id	order_timestamp_utc	amount	city_id
1	102	224	79	2019-03-11 23:27:00	155.73	47
2	104	224	75	2019-04-11 04:24:00	216.6	44
3	100	239	79	2019-03-11 21:17:00	168.69	47
4	101	205	79	2019-03-11 02:34:00	210.84	43
5	103	218	71	2019-04-11 00:15:00	212.6	47

The solution to this question counts distinct customers. Why distinct? Because a customer can place more than one order. There’s no point in counting customers each time they order, as it will artificially inflate the total number of customers.

In addition, I use the AVG() function to calculate the average order amount.

SELECT COUNT(DISTINCT customer_id) AS number_of_customers,
       AVG(amount) AS average_order_amount
FROM postmates_orders;

The output shows there are, in total, five customers, and their average order amount is 139.224.

All required columns and the first 5 rows of the solution are shown

count	avg
5	139.224

Handling NULLs With COUNT(DISTINCT)

Another question, this time by Crunchbase, will teach you how to handle NULL values.

Number Of Acquisitions

Crunchbase

EasyID 10162

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the number of acquisitions that occurred in each quarter of each year. Output the acquired quarter in YYYY-Qq format along with the number of acquisitions and order results by the quarters with the highest number of acquisitions first.

Table: crunchbase_acquisitions

Link to the question: https://platform.stratascratch.com/coding/10162-number-of-acquisitions

I’ll write a solution for a reworked problem. The code will count unique company country codes, including those companies with NULLs instead of country codes.

The table you have at your disposal is named crunchbase_acquisitions.

Table: crunchbase_acquisitions

company_permalink	company_name	company_category_code	company_country_code	company_state_code	company_region	company_city	acquirer_permalink	acquirer_name	acquirer_category_code	acquirer_country_code	acquirer_state_code	acquirer_region	acquirer_city	acquired_at	acquired_month	acquired_quarter	acquired_year	price_amount	price_currency_code	id
/company/1form-com	1Form.com	other	AUS		Australia - Other		/company/rea-group	REA Group	news	AUS		Victoria	Victoria	2014-01-14	2014-01	2014-Q1	2014	15000000	USD	13
/company/24-7-real-media	24/7 Real Media	advertising	USA	NY	New York	New York	/company/wpp	WPP	public_relations	USA	NY	New York	New York	2008-05-17	2008-05	2008-Q2	2008	649000000	USD	21
/company/280-north	280 North	software			unknown		/company/motorola-mobility	Motorola Mobility	mobile	USA		United States - Other		2010-07-01	2010-07	2010-Q3	2010	20000000	USD	23
/company/280-north	280 North	software			unknown		/company/motorola-solutions	Motorola Solutions	enterprise	USA	IL	Chicago	Schaumburg	2010-07-01	2010-07	2010-Q3	2010	20000000	USD	24
/company/2nd-story-software-inc	2nd Story Software, Inc.	software	USA	IA	Cedar Rapids	Cedar Rapids	/company/blucora	Blucora	web	USA	WA	Seattle	Bellevue	2012-01-09	2012-01	2012-Q1	2012	287500000	USD	26

Here’s how to solve this problem. I use the DISTINCT clause in the COUNT() function as I did in the previous example. However, this would ignore NULL values, so I must embed the COALESCE() function in it. This is the function that goes through the column company_country_code. If the value in the column is non-NULL, the function returns the actual value from that row. If the value is NULL, the function replaces it with the string 'NULL'. That way, a distinct category named 'NULL' is assigned to NULL values so it can be included in the distinct count.

SELECT COUNT(DISTINCT COALESCE(company_country_code, 'NULL')) AS number_of_countries
FROM crunchbase_acquisitions;

Based on country code counting, the output tells us there are 65 unique countries in the dataset.

All required columns and the first 5 rows of the solution are shown

number_of_countries
65

I can check if the code really counted the NULLs as a separate category by rewriting the code without COALESCE().

SELECT COUNT(DISTINCT company_country_code) AS number_of_countries
FROM crunchbase_acquisitions;

As you can see, there’s one country fewer compared to the previous output; this is a NULL category missing.

All required columns and the first 5 rows of the solution are shown

number_of_countries
64

Advanced Techniques

Now, let me talk about the more advanced usage of DISTINCT with COUNT().

GROUP BY With COUNT(DISTINCT)

The GROUP BY clause organizes query output into categories. It is typically used with aggregate functions, which allows you to show values aggregated by groups.

In the case of COUNT(DISTINCT), it means counting distinct values by group.

The general syntax is shown here.

SELECT column_1,
	 COUNT(DISTINCT column_2) AS column_alias
FROM table_name
GROUP BY column_1;

Example

Here’s an interview question by Spotify.

Aggregate Listening Data

Interview Question Date: January 2024

Spotify

EasyID 10367

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

You're tasked with analyzing a Spotify-like dataset that captures user listening habits. For each user, calculate the total listening time and the count of unique songs they've listened to. In the database duration values are displayed in seconds. Round the total listening duration to the nearest whole minute.

The output should contain three columns: 'user_id', 'total_listen_duration', and 'unique_song_count'.

Table: listening_habits

Link to the question: https://platform.stratascratch.com/coding/10367-aggregate-listening-data

It wants you to calculate the total listening time per user and count the unique songs they’ve listened to.

You have to work with the listening_habits table to solve this problem.

Table: listening_habits

user_id	song_id	listen_duration
101	5001	240
101	5002	0
102	5001	300
102	5003	0
101	5001	240

To show these calculations by a user, you need to select the user_id column.

Next, use SUM() to calculate the total listening time. In addition, divide the result by 60 to get the minutes, and then round the result to the nearest minute using ROUND().

The number of unique songs by a user is calculated using COUNT(DISTINCT).

Finally, group the output by the user ID to get all these calculations for each user separately.

SELECT user_id, 
       ROUND(SUM (listen_duration) / 60.0) AS total_listening_time , 
       COUNT(DISTINCT song_id) AS number_of_songs
FROM listening_habits
GROUP BY user_id;

Here’s the output.

All required columns and the first 5 rows of the solution are shown

user_id	total_listen_duration	unique_song_count
101	8	2
102	5	2
103	6	1
104	6	2
105	4	1

Window Functions for Distinct Counts Within Partitions

The window functions allow you to perform calculations on the window of rows, i.e., rows preceding and/or following the current row. They are famous for their ability to show aggregate data while not collapsing individual rows.

In SQL, there’s an optional clause PARTITION BY used with window functions. It splits the dataset into subsets based on one or more columns, i.e., it partitions it. Doing that with the dataset means that the window function calculations will be executed for each partition separately.

It would be nice if you could use COUNT(DISTINCT) as a window function with PARTITION BY to count distinct values by partition, wouldn’t it? Unfortunately, DISTINCT is not allowed in the COUNT() window function!

The workaround depends on the particular problem you need to solve and your dataset. However, it mainly boils down to utilizing subqueries or CTEs to simulate using COUNT(DISTINCT) as a window function.

One option is to use a window function – namely DENSE_RANK() – for distinct counts within partitions. The general approach is to rank data within the partition by the column for which you want to know the distinct count. The last row in the partition (if ranked ascendingly) will contain the highest rank, which is the same as the unique count of the data in the partition.

The count will be unique because DENSE_RANK() assigns the same rank to duplicate values, so duplicates won’t artificially increase the rank (or count).

We’ll show how this works in an example.

Example
Take a look at this interview question by ESPN.

Median Age Of Gold Medal Winners

ESPN

MediumID 9958

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Software Engineer

Find the median age of gold medal winners across all Olympics.

Table: olympics_athletes_events

Link to the question: https://platform.stratascratch.com/coding/9958-median-age-of-gold-medal-winners

There’s a table named olympics_athletes_events.

Table: olympics_athletes_events

id	name	sex	age	team	noc	games	year	season	city	sport	event
3520	Guillermo J. Amparan	M		Mexico	MEX	1924 Summer	1924	Summer	Paris	Athletics	Athletics Men's 800 metres
35394	Henry John Finchett	M		Great Britain	GBR	1924 Summer	1924	Summer	Paris	Gymnastics	Gymnastics Men's Rings
21918	Georg Frederik Ahrensborg Clausen	M	28	Denmark	DEN	1924 Summer	1924	Summer	Paris	Cycling	Cycling Men's Road Race Individual
110345	Marinus Cornelis Dick Sigmond	M	26	Netherlands	NED	1924 Summer	1924	Summer	Paris	Football	Football Men's Football
54193	Thodore Tho Jeitz	M	26	Luxembourg	LUX	1924 Summer	1924	Summer	Paris	Gymnastics	Gymnastics Men's Individual All-Around

I’ll modify the question and write a query that shows an athlete’s ID, name, and team while also showing the total number of athletes who were team members, regardless of when they competed.

The first CTE ranks athletes. It does that with DENSE_RANK() window function ranking by ID and data being partitioned by the team, meaning it will rank for each team separately. With this CTE, you get the cumulative count of athletes in each team, where the last row within the partition contains the count of athletes in a team. The count is unique because DENSE_RANK() assigns the same rank to all athletes with the same ID, i.e., the athletes that appear several times in the dataset.

WITH ranked_athletes AS (
    SELECT id, 
           name, 
           team,
           DENSE_RANK() OVER (PARTITION BY team ORDER BY id) AS athlete_rank
    FROM olympics_athletes_events
)

Here’s the output snapshot of that first CTE.

All required columns and the first 5 rows of the solution are shown

id	name	team	athlete_rank
3325	Santiago Amat Cansino	Amolgavar	1
17489	Facundo Callioni	Argentina	1
27126	Jorge del Mazo	Argentina	2
66754	Roberto Larraz	Argentina	3
100656	Charles Riddy	Argonaut Rowing Club	1

Take the Argentina team, for example. There are three unique athletes on the team, and the cumulative count (or rank) in the athlete_rank column is also three, and rightly so.

However, the list of athletes is not unique, which you can see in the example of Pierre Tolar, who represented Luxembourg two times and appears two times in the output.

WITH ranked_athletes AS (
    SELECT id, 
           name, 
           team,
           DENSE_RANK() OVER (PARTITION BY team ORDER BY id) AS athlete_rank
    FROM olympics_athletes_events
)

SELECT *
FROM ranked_athletes
WHERE team = 'Luxembourg';

All required columns and the first 5 rows of the solution are shown

id	name	team	athlete_rank
14719	Rene Brasseur	Luxembourg	1
54193	Thodore Tho Jeitz	Luxembourg	2
120786	Pierre Tolar	Luxembourg	3
120786	Pierre Tolar	Luxembourg	3
129575	Jean Welter Sr.	Luxembourg	4

The data has to be deduplicated. I’ll solve that in one of the following steps.

The second CTE takes the first one to extract the max rank – the number of unique athletes per team – using the MAX() aggregate function.

The third CTE deduplicates the first CTE, so now I have a dataset containing only the unique combinations of the athlete’s ID, name, and team.

In the final SELECT, I join the second and third CTEs, i.e., the unique count with the list of unique athletes.

WITH ranked_athletes AS (
    SELECT id, 
           name, 
           team,
           DENSE_RANK() OVER (PARTITION BY team ORDER BY id) AS athlete_rank
    FROM olympics_athletes_events
),

max_rank_per_team AS (
    SELECT team, 
           MAX(athlete_rank) AS team_athlete_count
    FROM ranked_athletes
    GROUP BY team
),

distinct_athletes AS (
    SELECT DISTINCT id, 
           name, 
           team
    FROM ranked_athletes
)

SELECT da.name, 
       da.team, 
       mrt.team_athlete_count
FROM distinct_athletes da
JOIN max_rank_per_team mrt
ON da.team = mrt.team
ORDER BY da.team;

Here’s the output snapshot. I’ve achieved what I wanted, which was to show the unique athlete’s ID, name, and team, along with the count of unique athletes per team.

All required columns and the first 5 rows of the solution are shown

name	team	team_athlete_count
Santiago Amat Cansino	Amolgavar	1
Roberto Larraz	Argentina	3
Jorge del Mazo	Argentina	3
Facundo Callioni	Argentina	3
Charles Riddy	Argonaut Rowing Club	1

If you want to check whether it really shows only unique athletes, check it on the Luxembourg and Pierre Tolar example using this query.

WITH ranked_athletes AS (
    SELECT id, 
           name, 
           team,
           DENSE_RANK() OVER (PARTITION BY team ORDER BY id) AS athlete_rank
    FROM olympics_athletes_events
),

max_rank_per_team AS (
    SELECT team, 
           MAX(athlete_rank) AS team_athlete_count
    FROM ranked_athletes
    GROUP BY team
),

distinct_athletes AS (
    SELECT DISTINCT id, 
           name, 
           team
    FROM ranked_athletes
)

SELECT da.id,
       da.name, 
       da.team, 
       mrt.team_athlete_count
FROM distinct_athletes da
JOIN max_rank_per_team mrt
ON da.team = mrt.team
WHERE da.team = 'Luxembourg'
ORDER BY da.team;

The output shows Pierre Tolar only once, confirming this is a list of unique athletes.

All required columns and the first 5 rows of the solution are shown

id	name	team	team_athlete_count
14719	Rene Brasseur	Luxembourg	4
54193	Thodore Tho Jeitz	Luxembourg	4
120786	Pierre Tolar	Luxembourg	4
129575	Jean Welter Sr.	Luxembourg	4

Now, the solution I showed you above may be a little too elaborate. I can write a shorter query that achieves the same result without DENSE_RANK().

I immediately go to data deduplication in the first CTE in the query below.

The second CTE counts distinct athletes per team by using COUNT(DISTINCT) and GROUP BY, and that’s it – I now only need to JOIN these two CTEs in SELECT and specify the columns I want to output.

WITH distinct_athletes_teams AS (
    SELECT DISTINCT id, 
           name, 
           team
    FROM olympics_athletes_events
),

team_athlete_counts AS (
    SELECT team, 
           COUNT(DISTINCT id) AS team_athlete_count
    FROM distinct_athletes_teams
    GROUP BY team
)

SELECT dat.name, 
       dat.team, 
       tac.team_athlete_count
FROM distinct_athletes_teams dat
JOIN team_athlete_counts tac
ON dat.team = tac.team
ORDER BY team;

The output stays the same as with the first code.

All required columns and the first 5 rows of the solution are shown

name	team	team_athlete_count
Santiago Amat Cansino	Amolgavar	1
Jorge del Mazo	Argentina	3
Facundo Callioni	Argentina	3
Roberto Larraz	Argentina	3
Charles Riddy	Argonaut Rowing Club	1

Common Mistakes and How to Avoid Them

There are three common mistakes made when using COUNT() with DISTINCT.

Common Mistakes in Using Count With Distinct in SQL

1. Forgetting to Handle NULLs: First, you should know your data and whether it includes NULLs. Second, you should know whether your task requires including the NULL values in the count. If it does, you’ll have to include them using techniques such as COALESCE().

2. Misunderstanding DISTINCT With Multiple Columns: It’s a common misconception, especially with beginners, that DISTINCT applied on multiple columns returns unique values for each column separately. No, it doesn’t! It returns unique value combinations across all columns. Used like that in COUNT(), you get the count of unique value combinations.

3. Neglecting Indexing: Using COUNT() with DISTINCT can be computationally expensive, especially on large datasets. This is because it first has to scan the whole dataset to find all the values in the specified columns, then remove duplicates and, finally, count the unique records.

Conclusion

I covered in this article several topics relevant to counting distinct values in SQL using COUNT(DISTINCT). The most important is knowing what DISTINCT as a clause does and the syntax of using it with COUNT().

However, I also demonstrated how to handle NULL values and gave you a couple of more complex examples of using COUNT(DISTINCT).

Along with some tips about its performance and avoiding common mistakes when counting distinct values, you’re now ready to build on that knowledge by writing the code on your own. Feel free to experiment with the above questions and solve them in other SQL dialects.

There are still many coding problems requiring COUNT(DISTINCT) knowledge in our coding interview questions section. Use them to polish your knowledge of counting distinct values in SQL.

Counting Distinct Values in SQL: Tips and Examples

Categories

What is DISTINCT in SQL?

The Importance of Counting Distinct Values in Various Data Analysis Scenarios

1. Understanding Diversity

2. Frequency and Distribution Analysis

3. Identifying Trends and Anomalies

4. Data Cleaning and Quality

How to Use DISTINCT in SQL?

Counting Distinct Values

Practical Tips for Using DISTINCT and COUNT()

Handling NULL Values

Performance Considerations

Basic Real-World Examples

Simple Distinct Values Count

Customer Average Orders

Handling NULLs With COUNT(DISTINCT)

Number Of Acquisitions

Advanced Techniques

GROUP BY With COUNT(DISTINCT)

Aggregate Listening Data

Window Functions for Distinct Counts Within Partitions

Median Age Of Gold Medal Winners

Common Mistakes and How to Avoid Them

Conclusion

Latest Posts:

Counting Distinct Values in SQL: Tips and Examples

Commonly Asked Data Structure Interview Questions

Python Libraries for Data Clean-Up

Share

Follow

Categories