Utilizing DENSE_RANK for Data Deduplication in SQL

Data deduplication in SQL can be done in several ways. Using DENSE_RANK() is one. Today, we’ll talk about using this function to improve your data’s quality.

Working with data often requires deduplicating it. This is a regular step in ensuring data quality and accuracy.

The most obvious method that springs to mind when talking about data deduplication is using the DISTINCT keyword.

However, SQL’s DENSE_RANK() function is another of those methods, and it’s the topic of today’s article.

What is Data Deduplication in SQL?

Data deduplication is a task where duplicate data is identified and removed from the dataset.

Deduplicating data has several benefits in data management. The reduced size of the saved data leads to storage optimization, which, in turn, leads to cost reduction.

Another benefit of deduplication is improved data quality, which provides a foundation for more accurate data analysis.

Importance of Ranking Functions in Identifying Duplicate Records

Aside from their primary purpose, which is, obviously, ranking data, the SQL ranking functions are also excellent for deduplicating data.

The ranking functions are window functions, meaning they perform an operation (ranking) on the window frame, a set of rows related to the current row.

Three ranking window functions in SQL are:

  • ROW_NUMBER()
  • RANK()
  • DENSE_RANK()

Each of these functions ranks data in a distinct manner.

ROW_NUMBER()

The ROW_NUMBER() window function ranks rows sequentially, even if some rows share the same value. No rank is repeated, and there are no gaps between ranks. For rows with the same value, the order between them is arbitrary (non-deterministic).

Here’s the table song_streams that I’ll use as an example.

[Image: the song_streams example table]

Ranking data from the highest to the lowest streams using ROW_NUMBER() will return this.
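The query behind that output would look roughly like this (a sketch only; I’m assuming song_streams has id, song_title, and streams columns, so adjust the names to your table):

SELECT id,
       song_title,
       streams,
       ROW_NUMBER() OVER (ORDER BY streams DESC) AS ranking
FROM song_streams;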

[Image: song_streams ranked with ROW_NUMBER()]

The output shows all the rows are sequentially ranked. Even when rows have the same number of streams (duplicates), ROW_NUMBER() still ranks them sequentially, i.e., rows with the same value won’t get the same rank.

This feature is useful for removing duplicates from a dataset by identifying them within each partition. The logic is that if you partition the data by every column that should make a row unique, non-duplicated rows end up alone in their partitions, while duplicates share a partition and get row numbers higher than 1.
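Here’s a quick sketch of that logic (again assuming the hypothetical id, song_title, and streams columns, with id being the surrogate key that is left out of the partitioning): keeping only the first row of each partition removes the duplicates.

SELECT song_title,
       streams
FROM (SELECT song_title,
             streams,
             ROW_NUMBER() OVER (PARTITION BY song_title, streams ORDER BY id) AS rn
      FROM song_streams
     ) AS numbered
WHERE rn = 1;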

RANK()

Another ranking window function is RANK(). Its characteristic is assigning the same rank to rows with the same value. The next row with a different value then gets a new rank, but with a gap: the new rank is increased by the number of tied rows, so some rank values are skipped.

Using the same table as earlier, this is how RANK() would rank the values.
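Under the same assumptions about the table’s columns, the query would be:

SELECT id,
       song_title,
       streams,
       RANK() OVER (ORDER BY streams DESC) AS ranking
FROM song_streams;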

[Image: song_streams ranked with RANK()]

The three rows with the same values got the same rank. The next rank is not 8, but 10 since three rows are ranked 7.

This function, too, can be used to identify duplicates. However, it’s used less often due to its rank skipping, which can, in some cases, mess up deduplication.

DENSE_RANK()

The DENSE_RANK() function ranks the rows with the duplicates the same as RANK() – they all get the same rank. However, there’s no rank skipping when the following row has a different value and requires a new rank.

Here’s how it will look in our example table.
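The query is the same as before, only with DENSE_RANK() (still assuming the same song_streams columns):

SELECT id,
       song_title,
       streams,
       DENSE_RANK() OVER (ORDER BY streams DESC) AS ranking
FROM song_streams;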

[Image: song_streams ranked with DENSE_RANK()]

The ranking is the same as with RANK(), only there’s no rank skipping, so the last row is ranked 8, not 10.

This can be utilized to identify duplicate values whenever there should be no gaps between the ranks, and that is the core topic of this article.

DENSE_RANK() in Deduplication

The general principle of deduplication using DENSE_RANK() is to first identify duplicate data by ranking it. This is done by partitioning the output using the window function’s PARTITION BY clause. The partitioning should be done on the column, or combination of columns, that exposes the duplication.

You also need to specify a column in ORDER BY to determine by which values the data will be ranked and in which direction (ascending or descending). The ORDER BY column should uniquely identify each row, so the ID column is usually used.

The logic is this: rows with the same values in the PARTITION BY columns end up in the same partition, and because every row has a different ID, the rows within a partition get different ranks. Non-duplicate data therefore sits in single-row partitions ranked 1, while partitions with more than one row are ranked sequentially, and the rows ranked above 1 are the duplicates.

Then, you would just need to remove ranks higher than 1 from the partitions to deduplicate data.

Doing so requires knowing the DENSE_RANK() syntax.

Syntax and Basic Usage of DENSE_RANK()

The syntax of DENSE_RANK() is as follows.

SELECT …, 
       DENSE_RANK() OVER (PARTITION BY column_name ORDER BY column_name [ASC | DESC])
FROM table_name;

The parentheses of the DENSE_RANK() function remain empty.

The OVER() clause is mandatory and creates a window function.

In the ranking functions, including DENSE_RANK(), the ORDER BY clause is mandatory. It defines the order in which the data will be ranked, which can be ascending or descending. ORDER BY accepts one or more columns.

The PARTITION BY clause is optional. When used, data will be partitioned (divided into subsets) based on the column referenced in PARTITION BY. If you omit this clause, then the whole output is one partition. PARTITION BY, too, accepts more than one column.

Now, we will demonstrate how this works when identifying duplicates. The table below is named movie_streams.

[Image: the movie_streams example table]


The code here identifies duplicates using DENSE_RANK().

SELECT id,
       movie_title,
       streams,
       DENSE_RANK() OVER (PARTITION BY movie_title, streams ORDER BY id) AS ranking
FROM movie_streams;

The table is partitioned by the movie title and streams. These columns determine uniqueness; any rows with the same movie title and the same number of streams are duplicates. PARTITION BY will put such rows in the same partition.

After that, the rows are ordered by ID, which is the column by which the rows within each partition are ranked, in ascending order by default.

Here’s the output.

[Image: movie_streams with the DENSE_RANK() ranking column]

Wherever there’s a rank higher than one, those are duplicate values. Now, the only thing you need to do is to remove them completely from the database using the DELETE statement or simply exclude them from the output using SELECT. In both cases, you’d need to use the WHERE clause, excluding all the ranks higher than 1.
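For instance, here’s a sketch of the SELECT variant on movie_streams; the DELETE variant follows the same pattern, with the filtered IDs feeding a DELETE ... WHERE instead.

WITH ranked AS (
    SELECT id,
           movie_title,
           streams,
           DENSE_RANK() OVER (PARTITION BY movie_title, streams ORDER BY id) AS ranking
    FROM movie_streams
)

SELECT id,
       movie_title,
       streams
FROM ranked
WHERE ranking = 1;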

Practical Applications of DENSE_RANK() for Deduplication in SQL

It’s now time to use several SQL interview questions from our platform and show how to use DENSE_RANK() in data deduplication in SQL.

Example #1: Common Friends Script

We can use this question by Google to create a simple example of using DENSE_RANK() to deduplicate data.


Tables: users, friends

Link to the question: https://platform.stratascratch.com/coding/10365-common-friends-script

There’s a table named users.

Table: users
user_id | user_name
1       | Karl
2       | Hans
3       | Emma
4       | Emma
5       | Mike


We can use it to check for duplicate users and remove them, as we only need unique user names for our analysis.

This is done exactly as I showed you earlier. I partition data by the columns that will determine the uniqueness; in this case, it’s only the user_name column. Then, I rank the data by the user ID.

SELECT user_id,
       user_name,
       DENSE_RANK() OVER (PARTITION BY user_name ORDER BY user_id) AS ranking
FROM users;

The output shows users named Emma and Lucas have duplicates.


user_id | user_name | ranking
9       | Anna      | 1
3       | Emma      | 1
4       | Emma      | 2
2       | Hans      | 1
10      | John      | 1
1       | Karl      | 1
6       | Lucas     | 1
8       | Lucas     | 2
5       | Mike      | 1
7       | Sarah     | 1

To remove these duplicates, I can, for example, turn the above code into a CTE and then exclude all the rows that are not ranked 1. I also don’t need any other columns, so I select only user_name in the output.

WITH data_ranking AS (
    SELECT user_id,
           user_name,
           DENSE_RANK() OVER (PARTITION BY user_name ORDER BY user_id) AS ranking
    FROM users
)

SELECT user_name
FROM data_ranking
WHERE ranking = 1;

Here it is, the list of users without duplicates.


user_name
Anna
Emma
Hans
John
Karl
Lucas
Mike
Sarah

Example #2: Workers With The Highest Salaries

In this example, I’ll use the question from the Amazon and DoorDash interviews.


Tables: worker, title

Link to the question: https://platform.stratascratch.com/coding/10353-workers-with-the-highest-salaries

The question provides a table named worker.

Table: worker
worker_id | first_name | last_name | salary | joining_date | department
1         | Monika     | Arora     | 100000 | 2014-02-20   | HR
2         | Niharika   | Verma     | 80000  | 2014-06-11   | Admin
3         | Vishal     | Singhal   | 300000 | 2014-02-20   | HR
4         | Amitah     | Singh     | 500000 | 2014-02-20   | Admin
5         | Vivek      | Bhati     | 500000 | 2014-06-11   | Admin

I want to show a list of departments and the unique salary amounts paid in each department. In the code, I partition by department and salary. I order the data by worker_id so that the rows are ranked by this column in ascending order.

SELECT department, 
       salary, 
       DENSE_RANK() OVER (PARTITION BY department, salary ORDER BY worker_id) AS ranking
FROM worker;

Here’s the ranking for each partition. You can see one duplicated salary in the Account department and two in the Admin department.


department | salary | ranking
Account    | 75000  | 1
Account    | 75000  | 2
Account    | 200000 | 1
Admin      | 80000  | 1
Admin      | 90000  | 1
Admin      | 90000  | 2
Admin      | 500000 | 1
Admin      | 500000 | 2
HR         | 65000  | 1
HR         | 85000  | 1
HR         | 100000 | 1
HR         | 300000 | 1

To remove the duplicates in one step, you can rewrite the code as a CTE, like in the previous example. However, I’ll rewrite it as a subquery, just for variety.

SELECT department, 
       salary
FROM (SELECT department, 
             salary, 
             DENSE_RANK() OVER (PARTITION BY department, salary ORDER BY worker_id) AS ranking
      FROM worker
     ) AS rank_sq
WHERE ranking = 1;


Here’s the output.


department | salary
Account    | 75000
Account    | 200000
Admin      | 80000
Admin      | 90000
Admin      | 500000
HR         | 65000
HR         | 85000
HR         | 100000
HR         | 300000

Example #3: Find the Duplicate Records in the Dataset

The last example for deduplication in SQL practice is the interview question by Google and Amazon.


Table: title

Link to the question: https://platform.stratascratch.com/coding/9849-find-the-duplicate-records-in-the-dataset

As the question asks, we’ll find the duplicate records. However, we won’t output the number of times each duplicate appears; we’ll output the actual duplicate rows.

We have the table title at our disposal.

Table: title
worker_ref_id | worker_title  | affected_from
1             | Manager       | 2016-02-20
2             | Executive     | 2016-06-11
8             | Executive     | 2016-06-11
5             | Manager       | 2016-06-11
4             | Asst. Manager | 2016-06-11

The approach is similar to previous examples. There’s a SELECT statement that uses DENSE_RANK() to find duplicate records. The data is partitioned by the columns worker_title and affected_from and ordered by the worker ID.

SELECT worker_ref_id,
       worker_title, 
       affected_from, 
       DENSE_RANK() OVER (PARTITION BY worker_title, affected_from ORDER BY worker_ref_id) AS rank
FROM title;

I’ll write the above query as a CTE, then reference it and output only the records with the rank above 1, i.e., duplicate values.

WITH ranked_titles AS (
    SELECT worker_ref_id,
           worker_title, 
           affected_from, 
           DENSE_RANK() OVER (PARTITION BY worker_title, affected_from ORDER BY worker_ref_id) AS rank
    FROM title
)

SELECT *
FROM ranked_titles
WHERE rank > 1;

Here’s the output showing three rows that are duplicates.


worker_ref_id | worker_title | affected_from | rank
7             | Executive    | 2016-06-11    | 2
8             | Executive    | 2016-06-11    | 3
6             | Lead         | 2016-06-11    | 2

To spice things up, let’s remove these records from the database. This is what you will often do in your job, as the point of finding duplicates is to remove them and clean your data.

Here’s the example code that would do it. The CTE part is the same. The second SELECT statement is now a subquery in the WHERE clause of a DELETE statement.

WITH ranked_titles AS (
    SELECT worker_ref_id,
           worker_title, 
           affected_from, 
           DENSE_RANK() OVER (PARTITION BY worker_title, affected_from ORDER BY worker_ref_id) AS rank
    FROM title
)
DELETE FROM title
WHERE worker_ref_id IN (
    SELECT worker_ref_id
    FROM ranked_titles
    WHERE rank > 1
);

Optimizing Performance for Large Datasets

When you’re deduplicating large datasets, it’s important to optimize query performance. Not doing so can make your queries painfully slow and data deduplication more tedious than it should be.


1. Indexing

Indexing relevant columns is the most important technique for performance optimization. The columns that are frequently used in WHERE, JOIN, GROUP BY, and ORDER BY should be indexed.

Another tip is to consider using composite indexes in cases where your queries frequently involve multiple columns, which I would say will be the case.

If you imagine the table title we used earlier is large, we could’ve created a composite index before finding duplicates.

CREATE INDEX idx_worker_date ON title (worker_title, affected_from);

This will help the database execute the queries that partition the dataset by worker_title and affected_from.

2. Query Structuring

The way you structure your query can also improve (or degrade) its performance. Here are several optimization tips when deduplicating data in large datasets:

  • Avoid SELECT *: Instead of selecting all columns from the tables, specify only the required columns in SELECT. That way, the query won’t retrieve unnecessary data.
  • Use CTEs Moderately: CTEs are a great tool for improving your query’s readability. However, they can hinder the performance of your query if used excessively.
  • Filter Data: Whenever possible, filter data using WHERE so your deduplication query is fed a smaller dataset.

3. Partitioning Tables

You already learned what partitioning means in window functions. That partitioning is only logical and temporary, but you can also partition tables physically, for example with range partitions. By doing so, you split the table into several smaller tables, which can significantly reduce your query’s running time.

For example, if we imagine that the table title contains dates from 2016, 2017, and 2018 in the column affected_from, we could split it into three partitions when creating the table, one for each year (the syntax below is PostgreSQL’s declarative partitioning).

CREATE TABLE title (
    worker_ref_id INT,
    worker_title VARCHAR(255),
    affected_from DATE
)
PARTITION BY RANGE (affected_from);

CREATE TABLE title_p2016 PARTITION OF title
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');

CREATE TABLE title_p2017 PARTITION OF title
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

CREATE TABLE title_p2018 PARTITION OF title
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

If we, for instance, need to find and delete duplicates among the rows from 2016, we could rewrite the query from our last interview question like this.

WITH ranked_titles AS (
    SELECT worker_ref_id,
           worker_title, 
           affected_from, 
           DENSE_RANK() OVER (PARTITION BY worker_title, affected_from ORDER BY worker_ref_id) AS rank
    FROM title
    WHERE affected_from >= '2016-01-01' AND affected_from < '2017-01-01'
)

DELETE 
FROM title
WHERE affected_from >= '2016-01-01' AND affected_from < '2017-01-01'
  AND worker_ref_id IN (
    SELECT worker_ref_id
    FROM ranked_titles
    WHERE rank > 1
);

That way, the query will look through only one partition of the table, which will be much quicker than looking through the whole table.

4. Monitoring and Tuning

Whatever RDBMS you use, it has tools for monitoring query performance and database health. Use them regularly to identify bottlenecks and then remove them. For instance, query analyzers can give you detailed information about slow-running queries so you know what to optimize.
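As an illustration, in PostgreSQL you could profile the deduplication query with EXPLAIN ANALYZE (other databases offer similar commands, such as EXPLAIN in MySQL):

EXPLAIN ANALYZE
SELECT worker_ref_id,
       worker_title,
       affected_from,
       DENSE_RANK() OVER (PARTITION BY worker_title, affected_from ORDER BY worker_ref_id) AS rank
FROM title;

The plan it returns shows whether indexes are used and where the time goes, which tells you what to tune.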

Common Mistakes When Using DENSE_RANK() for Deduplication

The three most common mistakes in deduplicating data with DENSE_RANK() are given below.


1. Misunderstanding the PARTITION BY Clause

One of the most common mistakes is partitioning the data in a way that doesn’t expose the duplicates. You have to be aware that ranking is performed within each partition separately.

Also, deduplication very often requires partitioning by two or more columns, and they have to be chosen carefully. In general, the combination of columns should collectively represent a unique record.

2. Misunderstanding the ORDER BY Clause

First, how DENSE_RANK() ranks the data depends on ORDER BY, so you must choose between ascending and descending ranking. The distinction is important in some cases.

Second, it’s also important to choose an appropriate column to order the data by. The ordering should be deterministic; otherwise, you could end up with unexpected, inconsistent, and incorrect results. Make sure the column you order by is unique; that’s why the ID column is typically the best choice, as you’ve seen in our examples.

3. Overlooking NULL Values

If you have NULLs in the partitioning columns, DENSE_RANK() will treat them as identical values. In other words, all the NULL rows could end up in one partition, and every row except the first could be marked as a duplicate even though it isn’t. I don’t have to tell you this leads to wrong data deduplication.
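One way to guard against this is to decide explicitly how NULLs should be handled. As a sketch on the movie_streams example (assuming the streams column may contain NULLs), you could leave those rows out of the deduplication pass and review them separately:

-- Deduplicate only rows where the partitioning columns are fully known;
-- rows with NULL streams can be reviewed and handled separately.
SELECT id,
       movie_title,
       streams,
       DENSE_RANK() OVER (PARTITION BY movie_title, streams ORDER BY id) AS ranking
FROM movie_streams
WHERE streams IS NOT NULL;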

Best Practices to Ensure Accurate and Efficient Deduplication

The best practices are virtually a Venn diagram of optimization techniques and common mistakes.


1. Have clear criteria for partitioning – the columns used in PARTITION BY must represent the uniqueness of records.

2. Use indexes – they can improve deduplication query performance, especially with large datasets.

3. Test thoroughly – check if the query’s deduplication logic does what you want it to do before you apply it to the whole database.

Conclusion

One method for data deduplication in SQL is using window functions. Very often, the choice is DENSE_RANK().

Using it to rank data within partitions makes it easy to identify duplicates – all the rows with a ranking above 1 are duplicates.

While the principles of deduplicating data using DENSE_RANK() are generally simple, you still need to be careful about which columns you partition and order the data by.

This requires some code-writing practice. There’s plenty of opportunity for that in more than 1,000 SQL interview questions on our platform. Many of those questions cover window functions and, specifically, DENSE_RANK(), so help yourself.


