SQL PARTITION BY: Advanced Analytical Insights
Categories
SQL PARTITION BY makes SQL window functions – already a powerful analytical tool – even more suitable for advanced analytical insights from your data.
seller_id | total_sales | product_category | market_place | sales_date |
---|---|---|---|---|
s728 | 29158.51 | books | us | 2024-01-01 00:00:00 |
s918 | 24286.4 | books | uk | 2024-01-01 00:00:00 |
s217 | 23481.03 | books | in | 2024-01-01 00:00:00 |
s483 | 49361.62 | clothing | uk | 2024-01-01 00:00:00 |
s790 | 31050.13 | clothing | in | 2024-01-01 00:00:00 |
Expected Output
Official output:
[3, 2, 1]Official output:
[5, 6]Official output:
[7, 9]SQL is so ubiquitous in data analysis that even your grandmother knows how to aggregate data, sum it, or calculate the average values.
That’s all very nice and cute, but you want advanced analytical insights. SQL has a multitude of functions and clauses to handle complex queries. PARTITION BY is one of the more potent ones, enabling advanced data segmentation and analysis.
In this article, we’ll explore its syntax, use cases, and practical applications using examples from our platform.
In short, by the end of the article, you’ll be better at SQL than your granny.
What is the PARTITION BY Clause in SQL?
PARTITION BY is an optional clause in window functions. What are they, you might ask. Read a detailed answer in our SQL window functions guide.
In short, window functions are advanced analytics functions that perform operations on a window of rows and are able to aggregate and show individual rows simultaneously.
PARTITION BY adds another feature to the window functions: it divides a result set into partitions or subsets based on the criteria you define. In other words, PARTITION BY makes it possible to apply the window function on each data subset separately.
Basic Syntax and Usage PARTITION BY in SQL
In window functions, the OVER() clause is mandatory, as it defines the window function. PARTITION BY is a sub-clause of OVER(), as is ORDER BY. They are both optional clauses.
In the context of SELECT, here’s the PARTITION BY syntax.
SELECT column_1,
column_2,
window_function () OVER (PARTITION BY column_name ORDER BY column_name) AS column_alias
FROM table;
The syntax is simple: after PARTITION BY comes one or multiple column names by which you want to divide your data into subsets.
For example, if you have this table…
…and write this query…
SELECT employee_id,
department_id,
salary,
AVG(salary) OVER (PARTITION BY department_id) AS avg_department_salary
FROM employees;
… you’d get this result.
In the query, I partitioned the result by the column department_id. So, in the output, all the data where department with ID = 1 is one partition, ID = 2 is the second, and ID = 3 is the third partition.
Because data is partitioned, the average salary is calculated for each department separately. For example, the average salary for the first department is 5,896.47. It’s calculated like this: (9,415.11 + 4,115.54 + 4,158.77)/3 = 17,689.42/3 = 5,896.47.
For every subsequent partition, the calculation resets and computes the average independently from the previous results.
So, there you have it! A simple example of how you can show all the data and calculate the average salary for each department using PARTITION BY.
Examples of Scenarios Where SQL PARTITION BY is Useful
PARTITION BY in SQL provides advanced analytical options. Here are 15 examples of where you may find it useful.
It’s far from an exhaustive list, as PARTITION BY has broad use.
1. Sales Running Totals: Summing sales for each salesperson per month.
2. Ranking Students: Rank students within each class based on the scores.
3. Moving Averages: Calculate moving averages (e.g.,5-day moving average) for stock prices of each stock.
4. Quarterly Growth Rates: Calculate quarter-over-quarter revenue growth by region.
5. Top Salespersons: Find the top five employees within each subsidiary based on sales.
6. Monthly Expense Totals: Calculate monthly expenses for each department.
7. Cumulative Sales: Calculate cumulative sales for each product category day by day.
8. Employee Tenure: Rank employees within each department by the length of their employment at the company.
9. Customer Order Frequency: Count the number of orders by each customer within a time period (e.g., month, quarter, year)
10. New Subscribers: Monthly track the number of new users within each subscription plan.
11. Departmental Budgets: Calculate the average budget by each department over multiple years.
12. Website Traffic Analysis: Analyze page views by user session.
13. Loan Repayments: Sum the loan repayment amount for each client.
14. Product Ratings: Rank products within the product category by their average rating.
15. Daily Temperature Averages: Calculate the temperature rolling average for every city.
Use Cases for SQL PARTITION BY
The PARTITION BY clause always does the same thing – divides the result set into data subgroups. How these partitions are used depends on the window function applied.
For example, you can use PARTITION BY with aggregate window functions, such as COUNT(), SUM(), AVG(), MIN(), or MAX(). By doing so, you aggregate data within the partition while keeping all the individual rows. Of course, the aggregation is performed for each partition separately.
The above is very different from GROUP BY, as it also allows you to aggregate data when used with one of the regular aggregate functions but collapses data into a single row per group.
In other words, PARTITION BY allows you to display aggregate and analytical data at the same time, while GROUP BY doesn’t.
Another common use of PARTITION BY is with the ranking window functions, i.e., ROW_NUMBER(), RANK(), DENSE_RANK(), PERCENT_RANK(), and NTILE(). These will rank data within each partition. You also need to use ORDER BY with the ranking functions to specify whether the ranking should be in ascending or descending order.
PARTITION BY is also useful with analytical window functions. For example, you can use LAG() or LEAD() to compare each row with the preceding or following rows, respectively. With FIRST_VALUE(), LAST_VALUE(), and NTH_VALUE(), PARTITION BY can get you the value of the first, last, or nth row, respectively, within a partition.
PARTITION BY vs. GROUP BY
While PARTITION BY and GROUP BY might seem similar, they have very different uses due to their distinct characteristics.
SQL GROUP BY is primarily used with basic aggregate functions. This is great for summary statistics and reports, where you only need to show aggregated results without the row-level details.
On the other hand, PARTITION BY can be used with all the window functions (not only aggregate), so there’s a broader range of calculations possible. In conjunction with window functions, it’s ideal for analytical queries where complex calculations are required without losing row-level data.
To be more specific, you would use GROUP BY to show total sales per month and PARTITION BY to show daily sales while displaying monthly totals alongside.
Practical Applications of SQL PARTITION BY
Let’s now use several interview questions from our platform to demonstrate the SQL PARTITION BY clause’s practical applications.
Example #1: PARTITION BY With SUM() Window Function
The question by Amazon and Etsy wants you to calculate the percentage of the total spend a customer spent on each order. You should output the customer’s first name, order details, and percentage of the order cost to their total spend across all orders.
Interview Question Date: April 2019
Calculate the ratio of the total spend a customer spent on each order. Output the customer’s first name, order details, and ratio of the order cost to their total spend across all orders.
Assume each customer has a unique first name (i.e., there is only 1 customer named Karen in the dataset) and that customers place at most only 1 order a day.
Percentages should be represented as decimals
Link to the question: https://platform.stratascratch.com/coding/9899-percentage-of-total-spend
We’ll work with two tables. The first is orders.
id | cust_id | order_date | order_details | total_order_cost |
---|---|---|---|---|
1 | 3 | 2019-03-04 | Coat | 100 |
2 | 3 | 2019-03-01 | Shoes | 80 |
3 | 3 | 2019-03-07 | Skirt | 30 |
4 | 7 | 2019-02-01 | Coat | 25 |
5 | 7 | 2019-03-10 | Shoes | 80 |
The second one is customers.
id | first_name | last_name | city | address | phone_number |
---|---|---|---|---|---|
8 | John | Joseph | San Francisco | 928-386-8164 | |
7 | Jill | Michael | Austin | 813-297-0692 | |
4 | William | Daniel | Denver | 813-368-1200 | |
5 | Henry | Jackson | Miami | 808-601-7513 | |
13 | Emma | Isaac | Miami | 808-690-5201 |
Let’s first calculate the total amount each customer spent across all orders. We’ll do this using the SUM() window function with PARTITION BY.
In the SUM() function, we define the values we want to sum, total_order_cost, in this case. To make this function a window function, it’s necessary to use the OVER() clause. If left like this, the window function would return the total amount of all orders for all customers.
However, we need to do that for each customer separately, so we use PARTITION BY to split the result into partitions by the customer's first name.
SELECT c.first_name,
o.order_details,
SUM(o.total_order_cost) OVER (PARTITION BY c.first_name) AS total_cost_by_customer
FROM orders o
JOIN customers c ON c.id = o.cust_id;
Here are the first several rows of the output. It’s a list of all customers and their orders, with the total order value for each customer.
first_name | order_details | total_cost_by_customer |
---|---|---|
Eva | Coat | 205 |
Eva | Slipper | 205 |
Eva | Shirts | 205 |
Farida | Coat | 260 |
Farida | Shoes | 260 |
Farida | Skirt | 260 |
Farida | Shirts | 260 |
Henry | Shoes | 80 |
I can now write the final code. As I need a percentage of the total spend per order, I will divide each order's cost by the customer's total order cost. In addition, I’ll convert the result to FLOAT to show the result with decimal places.
SELECT c.first_name,
o.order_details,
o.total_order_cost/SUM(o.total_order_cost) OVER (PARTITION BY c.first_name)::FLOAT AS percentage_total_cost
FROM orders o
JOIN customers c ON c.id = o.cust_id;
There you have it: an output as required by the question.
first_name | order_details | percentage_total_cost |
---|---|---|
Eva | Coat | 0.61 |
Eva | Slipper | 0.1 |
Eva | Shirts | 0.29 |
Farida | Coat | 0.38 |
Farida | Shoes | 0.31 |
Example #2: PARTITION BY With MAX() Window Function
Next, we have a question by MetLife and Credit Karma.
Interview Question Date: February 2021
Write a query that joins this submissions table to the loans table and returns the total loan balance on each user’s most recent ‘Refinance’ submission. Return all users and the balance for each of them.
Link to the question: https://platform.stratascratch.com/coding/2003-recent-refinance-submissions
Here, we need to show all users and their loan balances for the most recent ‘Refinance’ submission.
We’ll again work with two tables. The first one is loans.
id | user_id | created_at | status | type |
---|---|---|---|---|
1 | 100 | 2017-04-21 | prequal_completd_offer | Refinance |
2 | 100 | 2017-04-27 | offer_accepted | Refinance |
3 | 101 | 2017-04-22 | prequal_completd_no_offer | Refinance |
4 | 101 | 2017-04-23 | offer_accepted | Refinance |
5 | 101 | 2017-04-25 | offer_accepted | Personal |
The second one is submissions.
id | balance | interest_rate | rate_type | loan_id |
---|---|---|---|---|
1 | 5229.12 | 8.75 | variable | 2 |
2 | 12727.52 | 11.37 | fixed | 4 |
3 | 14996.58 | 8.25 | fixed | 9 |
4 | 21149 | 4.75 | variable | 7 |
5 | 14379 | 3.75 | variable | 5 |
The solution contains a subquery that shows the date of the most recent ‘Refinance’ submission for each user. It does so by using the MAX() window function with PARTITION BY.
The MAX() window function goes through the column created_at and returns the highest value, i.e., the latest date. PARTITION BY is there to create partitions on two criteria: user ID and loan type.
Since we need only the refinancing submissions, we filter data using WHERE.
SELECT DISTINCT id,
user_id,
created_at,
MAX(created_at) OVER (PARTITION BY user_id) AS most_recent_refinance
FROM loans
WHERE type = 'Refinance';
Here’s the output.
id | user_id | created_at | most_recent_refinance |
---|---|---|---|
1 | 100 | 2017-04-21 | 2017-04-27 |
2 | 100 | 2017-04-27 | 2017-04-27 |
3 | 101 | 2017-04-22 | 2017-04-23 |
4 | 101 | 2017-04-23 | 2017-04-23 |
8 | 108 | 2017-04-21 | 2017-04-27 |
9 | 108 | 2017-04-27 | 2017-04-27 |
11 | 100 | 2015-04-21 | 2017-04-27 |
I’ll now turn this query into a subquery and write the rest of the code.
To select the user ID and the balance in the main query, we join the subquery with the table submissions.
As we need the balance of the latest ‘Refinance’ submission, we equal the most recent refinance date from the subquery with created_at from submissions.
In the final step, the output is grouped by user ID and balance.
SELECT r.user_id,
s.balance
FROM
(SELECT DISTINCT id,
user_id,
created_at,
MAX(created_at) OVER (PARTITION BY user_id) AS most_recent_refinance
FROM loans
WHERE type = 'Refinance') AS r
INNER JOIN submissions s
ON r.id = s.loan_id
WHERE most_recent_refinance = created_at
GROUP BY r.user_id, s.balance;
The output shows three users and their respective loan balances.
user_id | balance |
---|---|
100 | 5229.12 |
101 | 12727.52 |
108 | 14996.58 |
Combining PARTITION BY With Other SQL features
PARTITION BY can also be used in conjunction with some other SQL features, such as:
- FILTER
- ORDER BY
- Frame clauses (ROWS BETWEEN and RANGE BETWEEN)
- CASE WHEN
Example #3: PARTITION BY With FILTER
The FILTER clause extends the aggregate window functions by giving the possibility of conditional aggregation within each partition.
Let’s revisit the previous example and rewrite the code using FILTER.
The code stays virtually unchanged. The main difference is that the condition WHERE type = 'Refinance' from the subquery is now moved to a window function itself.
SELECT r.user_id,
s.balance
FROM
(SELECT DISTINCT id,
user_id,
created_at,
MAX(created_at) FILTER (WHERE type = 'Refinance') OVER (PARTITION BY user_id) AS most_recent_refinance
FROM loans) AS r
INNER JOIN submissions s
ON r.id = s.loan_id
WHERE r.most_recent_refinance = r.created_at
GROUP BY r.user_id, s.balance;
Example #4: PARTITION BY With ORDER BY
When I say ORDER BY, I mean the optional window function clause used for sorting the data within a partition.
By optional, I mean it’s not mandatory for some aggregate window functions, but it can be used in some instances. However, it is required for ranking and value window functions to work.
Here’s one such example in the question by Redfin.
Interview Question Date: January 2021
Redfin helps clients to find agents. Each client will have a unique request_id and each request_id has several calls. For each request_id, the first call is an “initial call” and all the following calls are “update calls”. What's the average call duration for all update calls?
Link to the question: https://platform.stratascratch.com/coding/2022-update-call-duration
We need to find the average call duration for all update calls.
Here, we have a table named redfin_call_tracking.
created_on | request_id | call_duration | id |
---|---|---|---|
2020-03-01 04:08:04 | 2 | 3 | 1 |
2020-03-01 05:28:47 | 1 | 28 | 2 |
2020-03-01 07:27:36 | 2 | 22 | 3 |
2020-03-01 13:18:21 | 1 | 12 | 4 |
2020-03-01 15:08:08 | 2 | 13 | 5 |
Let’s start writing the solution from the second subquery, this SELECT statement.
SELECT *,
RANK() OVER (PARTITION BY request_id ORDER BY created_on ASC) AS rk
FROM redfin_call_tracking;
It selects all the columns from the table and ranks the calls within each partition by date. The window function we use is RANK(). The result set is partitioned by the request ID. To make sure the calls for each request are ranked from the earliest to the latest date, we order the data within the partition in ascending order.
Why do we do all that? Because whichever call is ranked as first is an initial call. All other calls are update calls. You’ll see in the next step why we need this when we write the above SELECT as a subquery.
So, this is the complete code.
There’s another subquery referencing the above SELECT to get only the call IDs with a ranking above 1, which keeps only the update calls, as the initial call will always be ranked 1.
All this is then used in the main query to calculate the average duration of all these update calls.
SELECT AVG(call_duration) AS avg_call_duration
FROM redfin_call_tracking
WHERE id IN
(SELECT id
FROM
(SELECT *,
RANK() OVER (PARTITION BY request_id ORDER BY created_on ASC) AS rk
FROM redfin_call_tracking) AS sq
WHERE rk > 1);
The code returns an average of 17.882.
avg |
---|
17.88 |
Example #5: PARTITION BY With Frame Clauses
ROWS BETWEEN and RANGE BETWEEN are frame clauses in window functions used for defining a specific subset of rows within the partition.
To show what I mean by that, I’ll use the previous question as a basis, but I’ll calculate a moving average of call duration for each request.
The moving average is calculated using the AVG() window function, as you might expect. As in the previous example, I partitioned the data by the request ID and ordered it by the date.
I need the ROWS BETWEEN clause to turn this into a moving average. It tells the AVG() function to include the current row’s and the previous two rows’ call duration. You can specify any number of rows before and/or after the current row in your moving average calculation as long as it makes sense.
SELECT *,
AVG(call_duration) OVER (PARTITION BY request_id ORDER BY created_on ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_call_duration
FROM redfin_call_tracking;
Here are the first several rows of the output.
created_on | request_id | call_duration | id | moving_avg_call_duration |
---|---|---|---|---|
2020-03-01 05:28:47 | 1 | 28 | 2 | 28 |
2020-03-01 13:18:21 | 1 | 12 | 4 | 20 |
2020-03-01 16:27:23 | 1 | 19 | 6 | 19.67 |
2020-03-02 00:10:10 | 1 | 28 | 9 | 19.67 |
2020-03-02 04:52:52 | 1 | 13 | 12 | 20 |
2020-03-02 08:20:25 | 1 | 21 | 16 | 20.67 |
The moving average in the first row is the same as the call duration. Because there are no preceding rows, the calculation can only consider the current row: 28/1 = 28.
In the second, the moving average includes the current and the previous row: (28+12)/2 = 20. Only one preceding row is considered because there are no two preceding rows available.
The third row can finally involve the current and the preceding two rows: (28+12+19) = 19.667.
As we move further down the data, the window frame also moves to always include the current row and two previous rows. So, the next calculation is (12+19+28) = 19.667.
The next moving average is (19+28+13) = 20, and so on. Once we get to the next request ID, the calculation will reset as the moving average is calculated for each partition separately.
Example #6: PARTITION BY With CASE WHEN
The CASE WHEN statement is SQL’s way of handling the IF-THEN logic; it returns data that satisfies the specific criteria.
This can be utilized to perform conditional aggregation in the context of window functions. I’ll show you an example in this Uber interview question.
Find the advertising channel with the smallest maximum yearly spending that still brings in more than 1500 customers each year.
Link to the question: https://platform.stratascratch.com/coding/10013-positive-ad-channels
However, I’ll change the question’s requirements. So, imagine an advertising channel is considered high-yield when it acquires over 1,500 customers in a year.
The task is to show the total number of customers each channel acquired as a high-yielding channel. In other words, we want to include only values that are above 1,500.
We have the table uber_advertising at our disposal.
year | advertising_channel | money_spent | customers_acquired |
---|---|---|---|
2019 | celebrities | 10000000 | 1800 |
2019 | billboards | 1000000 | 2000 |
2019 | busstops | 1500 | 400 |
2019 | buses | 70000 | 2500 |
2019 | tv | 300000 | 5000 |
Here’s the code.
I used CASE WHEN to include in the sum only those rows with more than 1,500 customers acquired. Then, I partitioned the results by advertising channel.
SELECT *,
SUM(CASE WHEN customers_acquired > 1500 THEN customers_acquired ELSE 0 END) OVER (PARTITION BY advertising_channel) AS high_yield_customers_acquired
FROM uber_advertising
ORDER BY advertising_channel, year;
Take a look at the output.
year | advertising_channel | money_spent | customers_acquired | high_yield_customers_acquired |
---|---|---|---|---|
2017 | billboards | 200200 | 2100 | 5900 |
2018 | billboards | 500000 | 1800 | 5900 |
2019 | billboards | 1000000 | 2000 | 5900 |
2017 | buses | 3589000 | 2700 | 7500 |
2018 | buses | 550000 | 2300 | 7500 |
2019 | buses | 70000 | 2500 | 7500 |
2017 | busstops | 80000 | 800 | 0 |
2018 | busstops | 35000 | 600 | 0 |
2019 | busstops | 1500 | 400 | 0 |
2017 | celebrities | 300000 | 1900 | 5800 |
2018 | celebrities | 123555 | 2100 | 5800 |
2019 | celebrities | 10000000 | 1800 | 5800 |
2017 | radio | 80000 | 1200 | 0 |
2018 | radio | 2500 | 250 | 0 |
2019 | radio | 1500 | 51 | 0 |
2017 | tv | 80000 | 4700 | 15000 |
2018 | tv | 500000 | 5300 | 15000 |
2019 | tv | 300000 | 5000 | 15000 |
For example, billboards and buses acquired 5,900 and 7,500 customers in total, respectively. However, for bus stops, we have 0, as this channel never went above 1,500 customers in any of the three years. The same is true with the radio.
Handling NULL Values and Outliers
The NULL values and outliers can sometimes significantly skew your analysis. So, knowing how to exclude them when using PARTITION BY is important.
Example #7: Handling NULL Values
The NULL values are typically handled in the WHERE clause.
There’s a question by Google and Deloitte that shows this.
Interview Question Date: March 2022
The election is conducted in a city and everyone can vote for one or more candidates, or choose not to vote at all. Each person has 1 vote so if they vote for multiple candidates, their vote gets equally split across these candidates. For example, if a person votes for 2 candidates, these candidates receive an equivalent of 0.5 vote each. Find out who got the most votes and won the election. Output the name of the candidate or multiple names in case of a tie. To avoid issues with a floating-point error you can round the number of votes received by a candidate to 3 decimal places.
Link to the question: https://platform.stratascratch.com/coding/2099-election-results
To solve the question, we must find out who got the most votes and won the election.
We’re given the voting_results table.
voter | candidate |
---|---|
Kathy | |
Charles | Ryan |
Charles | Christine |
Charles | Kathy |
Benjamin | Christine |
The first select statement uses the COUNT() window function and PARTITION BY to calculate the number of votes each voter gave. This result then divides 1 to get a vote value, as each voter can vote for multiple candidates.
There are also the voters who didn’t vote for anybody; these will have the candidate value shown as NULL. We need to exclude those rows; otherwise, they will be treated as a voice for a NULL candidate. No, there’s no candidate named NULL! So, we exclude it in WHERE.
SELECT voter,
candidate,
1.0 / COUNT(*) OVER (PARTITION BY voter) AS vote_value
FROM voting_results
WHERE candidate IS NOT NULL;
I can now turn this into a subquery and write the rest of the code. There’s another subquery that builds on this one.
It sums the vote value and rounds it to the third decimal, as instructed by the question. Then, it takes this vote value and ranks the candidates. DENSE_RANK() is used, so the candidates with the same number of votes will be ranked the same, and there will be no gaps between the ranks in case of ties.
The main query then just filters data to show only the candidates who are ranked as the first.
SELECT candidate
FROM
(SELECT candidate,
ROUND(SUM(vote_value), 3) AS n_votes,
DENSE_RANK() OVER (ORDER BY ROUND(SUM(vote_value), 3) DESC) AS place
FROM
(SELECT voter,
candidate,
1.0 / COUNT(*) OVER (PARTITION BY voter) AS vote_value
FROM voting_results
WHERE candidate IS NOT NULL) AS value
GROUP BY candidate) AS results
WHERE place = 1;
The winner of these elections is Christine.
candidate |
---|
Christine |
Example #8: Handling Outliers
Here, we have a great example of how you can handle outliers using PARTITION BY.
Salesforce and Glassdoor ask you to compare each employee’s salary with the average salary of the corresponding department.
However, I want to make this calculation more realistic, so I will exclude the outliers from the calculation. Not doing that can skew the results, e.g., a manager with an extremely high salary can artificially increase the average salary of the generally lowly-paid department.
Interview Question Date: May 2019
Compare each employee's salary with the average salary of the corresponding department. Output the department, first name, and salary of employees along with the average salary of that department.
Link to the question: https://platform.stratascratch.com/coding/9917-average-salaries
We have only one table, namely employee.
id | first_name | last_name | age | sex | employee_title | department | salary | target | bonus | city | address | manager_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | Max | George | 26 | M | Sales | Sales | 1300 | 200 | 150 | Max@company.com | California | 2638 Richards Avenue | 1 |
13 | Katty | Bond | 56 | F | Manager | Management | 150000 | 0 | 300 | Katty@company.com | Arizona | 1 | |
11 | Richerd | Gear | 57 | M | Manager | Management | 250000 | 0 | 300 | Richerd@company.com | Alabama | 1 | |
10 | Jennifer | Dion | 34 | F | Sales | Sales | 1000 | 200 | 150 | Jennifer@company.com | Alabama | 13 | |
19 | George | Joe | 50 | M | Manager | Management | 100000 | 0 | 300 | George@company.com | Florida | 1003 Wyatt Street | 1 |
The first step is to use the PERCENTILE_CONT() function and WITHIN GROUP clause in a CTE to calculate each department's 95th percentile.
PERCENTILE_CONT() stands for continuous percentile and is an aggregate function for calculating the nth percentile you define in the function parentheses.
The WITHIN GROUP clause defines the order of the values in the percentile calculation; in this case, it’s employee salary sorted in ascending order.
WITH percentiles AS (
SELECT department,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) AS p95_salary
FROM employee
GROUP BY department
),
Running this CTE as a simple SELECT returns this output.
SELECT department,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) AS p95_salary
FROM employee
GROUP BY department;
department | p95_salary |
---|---|
Audit | 1085 |
Management | 242500 |
Sales | 2000 |
Now that we have 95th percentiles for each department, we can write another CTE that removes all the individual salaries that are above the 95th departmental percentile.
WITH percentiles AS (
SELECT department,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) AS p95_salary
FROM employee
GROUP BY department),
filtered_salaries AS (
SELECT e.department,
e.first_name,
e.salary
FROM employee e
JOIN percentiles p
ON e.department = p.department
WHERE e.salary <= p.p95_salary
)
To write the final solution, you just need to add SELECT with the AVG() window function and partition data by department.
WITH percentiles AS (
SELECT department,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) AS p95_salary
FROM employee
GROUP BY department),
filtered_salaries AS (
SELECT e.department,
e.first_name,
e.salary
FROM employee e
JOIN percentiles p
ON e.department = p.department
WHERE e.salary <= p.p95_salary
)
SELECT department,
first_name,
salary,
AVG(salary) OVER (PARTITION BY department) AS avg_salary_by_department
FROM filtered_salaries;
Here’s the output.
department | first_name | salary | avg_salary_by_department |
---|---|---|---|
Audit | Jason | 1000 | 900 |
Audit | Celine | 1000 | 900 |
Audit | Michale | 700 | 900 |
Management | Katty | 150000 | 150000 |
Management | Allen | 200000 | 150000 |
Management | George | 100000 | 150000 |
Sales | Britney | 1200 | 1295.24 |
Sales | Jack | 1300 | 1295.24 |
Sales | Ben | 1300 | 1295.24 |
Sales | Tom | 1200 | 1295.24 |
Sales | Antoney | 1300 | 1295.24 |
Sales | Morgan | 1200 | 1295.24 |
Sales | Molly | 1400 | 1295.24 |
Sales | Nicky | 1400 | 1295.24 |
Sales | Monika | 1000 | 1295.24 |
Sales | Adam | 1300 | 1295.24 |
Sales | Mark | 1200 | 1295.24 |
Sales | John | 1500 | 1295.24 |
Sales | Joe | 1000 | 1295.24 |
Sales | Henry | 2000 | 1295.24 |
Sales | Max | 1300 | 1295.24 |
Sales | Sam | 1000 | 1295.24 |
Sales | Jennifer | 1000 | 1295.24 |
Sales | Laila | 1000 | 1295.24 |
Sales | Sarrah | 2000 | 1295.24 |
Sales | Suzan | 1300 | 1295.24 |
Sales | Mandy | 1300 | 1295.24 |
Now, the above code can be simplified, but our database doesn’t allow using OVER() with PERCENTILE_CONT(). Nevertheless, here’s the simplified code that should give you the same output as the one above.
WITH percentiles AS (
SELECT department,
first_name,
salary,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) OVER (PARTITION BY department) AS p95_salary
FROM employee
)
SELECT
department,
first_name,
salary,
AVG(salary) OVER (PARTITION BY department) AS avg_salary_by_department
FROM
percentiles
WHERE
salary <= p95_salary;
Best Practices for Efficient Queries
Yes, SQL PARTITION BY does allow you to get very complex insights from your data. However, as your queries become more complex, there’s a danger of them becoming less efficient.
Here’s some advice on how to avoid that when using PARTITION BY.
1. Indexing: Leverage indexing on columns that you use in PARTITION BY.
2. Limited Partitioning: Don’t overdo the partitioning, as it can decrease the query performance. Use PARTITION BY only when necessary.
3. Combine PARTITION BY and WHERE: Whenever possible, filter data using WHERE before partitioning it. This reduces the size of a dataset PARTITION BY works with.
Limitations and Caveats
Everything has its limitations, and SQL PARTITION BY is not above that. These are the three most important limitations and caveats you should consider.
1. Performance: With very large datasets, PARTITION BY can be resource-intensive.
2. Complexity: Partitioning datasets is a complex operation. You should have this in mind and carefully design queries to avoid performance bottlenecks.
3. Database Support: Not all SQL databases fully support advanced window functions. You saw that in the previous example, where our PostgreSQL version doesn’t support using OVER() with PERCENTILE_CONT(). In addition, the syntax of the window functions between some databases can differ.
Conclusion
When the PARTITION BY and window functions work together, they enhance each other’s possibilities. Used together, they allow you to rank, aggregate, and analyze data, while also adding multiple dimensions to analysis with data partitioning.
The practical applications are almost infinite, accompanied by the unique ability you can’t have any other way in SQL. It is the ability to show analytical and aggregate data simultaneously.
Due to this, they are also very popular in the SQL interview questions. Don’t waste your time and start practicing PARTITION BY and window function; you can find them among the many coding questions we have.
Also, visit our blog and YouTube channel for more resources on PARTITION BY and other important SQL topics.