SQL PARTITION BY: Advanced Analytical Insights

SQL PARTITION BY makes SQL window functions – already a powerful analytical tool – even more suitable for advanced analytical insights from your data.

SQL is so ubiquitous in data analysis that even your grandmother knows how to aggregate data, sum it, or calculate the average values.

That’s all very nice and cute, but you want advanced analytical insights. SQL has a multitude of functions and clauses to handle complex queries. PARTITION BY is one of the more potent ones, enabling advanced data segmentation and analysis.

In this article, we’ll explore its syntax, use cases, and practical applications using examples from our platform.

In short, by the end of the article, you’ll be better at SQL than your granny.

What is the PARTITION BY Clause in SQL?

PARTITION BY is an optional clause in window functions. What are they, you might ask? You can read a detailed answer in our SQL window functions guide.

In short, window functions are advanced analytical functions that perform calculations across a window of rows and can show aggregated results alongside the individual rows.

PARTITION BY adds another feature to the window functions: it divides a result set into partitions or subsets based on the criteria you define. In other words, PARTITION BY makes it possible to apply the window function on each data subset separately.

Basic Syntax and Usage of PARTITION BY in SQL

In window functions, the OVER() clause is mandatory, as it defines the window function. PARTITION BY is a sub-clause of OVER(), as is ORDER BY. They are both optional clauses.

In the context of SELECT, here’s the PARTITION BY syntax.

SELECT column_1,
       column_2,
       window_function () OVER (PARTITION BY column_name ORDER BY column_name) AS column_alias
FROM table;

The syntax is simple: after PARTITION BY comes one or multiple column names by which you want to divide your data into subsets.

For example, if you have this table…

Syntax of Partition By Clause in Sql

…and write this query…

SELECT employee_id,
       department_id,
       salary,
       AVG(salary) OVER (PARTITION BY department_id) AS avg_department_salary
FROM employees;

… you’d get this result.

Syntax of Partition By Clause in Sql

In the query, I partitioned the result by the column department_id. So, in the output, all the rows where the department ID = 1 form one partition, the rows with ID = 2 the second, and those with ID = 3 the third.

Because data is partitioned, the average salary is calculated for each department separately. For example, the average salary for the first department is 5,896.47. It’s calculated like this: (9,415.11 + 4,115.54 + 4,158.77)/3 = 17,689.42/3 = 5,896.47.

For every subsequent partition, the calculation resets and computes the average independently from the previous results.

So, there you have it! A simple example of how you can show all the data and calculate the average salary for each department using PARTITION BY.

Examples of Scenarios Where SQL PARTITION BY is Useful

PARTITION BY in SQL provides advanced analytical options. Here are 15 examples of where you may find it useful.

It’s far from an exhaustive list, as PARTITION BY has broad use.

1. Sales Running Totals: Summing sales for each salesperson per month (see the sketch right after this list).

2. Ranking Students: Rank students within each class based on the scores.

3. Moving Averages: Calculate moving averages (e.g., a 5-day moving average) for each stock’s price.

4. Quarterly Growth Rates: Calculate quarter-over-quarter revenue growth by region.

5. Top Salespersons: Find the top five employees within each subsidiary based on sales.

6. Monthly Expense Totals: Calculate monthly expenses for each department.

7. Cumulative Sales: Calculate cumulative sales for each product category day by day.

8. Employee Tenure: Rank employees within each department by the length of their employment at the company.

9. Customer Order Frequency: Count the number of orders by each customer within a time period (e.g., month, quarter, year)

10. New Subscribers: Track the monthly number of new users within each subscription plan.

11. Departmental Budgets: Calculate the average budget by each department over multiple years.

12. Website Traffic Analysis: Analyze page views by user session.

13. Loan Repayments: Sum the loan repayment amount for each client.

14. Product Ratings: Rank products within the product category by their average rating.

15. Daily Temperature Averages: Calculate the temperature rolling average for every city.
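
To give you a taste of the first scenario, here’s a minimal sketch of a running monthly sales total per salesperson. It assumes a hypothetical sales table with salesperson_id, sale_date, and amount columns, so adjust the names to your own schema.

SELECT salesperson_id,
       sale_date,
       amount,
       SUM(amount) OVER (PARTITION BY salesperson_id, DATE_TRUNC('month', sale_date)
                         ORDER BY sale_date) AS running_monthly_total
FROM sales;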

Use Cases for SQL PARTITION BY

The PARTITION BY clause always does the same thing – divides the result set into data subgroups. How these partitions are used depends on the window function applied.

For example, you can use PARTITION BY with aggregate window functions, such as COUNT(), SUM(), AVG(), MIN(), or MAX(). By doing so, you aggregate data within the partition while keeping all the individual rows. Of course, the aggregation is performed for each partition separately.

The above is very different from GROUP BY. GROUP BY also aggregates data when used with one of the regular aggregate functions, but it collapses the data into a single row per group.

In other words, PARTITION BY allows you to display aggregate and analytical data at the same time, while GROUP BY doesn’t.

Another common use of PARTITION BY is with the ranking window functions, i.e., ROW_NUMBER(), RANK(), DENSE_RANK(), PERCENT_RANK(), and NTILE(). These will rank data within each partition. You also need to use ORDER BY with the ranking functions to specify whether the ranking should be in ascending or descending order.
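
For instance, here’s a minimal sketch that ranks employees by salary within each department, reusing the employees table from the syntax section above.

SELECT employee_id,
       department_id,
       salary,
       RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank_in_department
FROM employees;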

PARTITION BY is also useful with analytical window functions. For example, you can use LAG() or LEAD() to compare each row with the preceding or following rows, respectively. With FIRST_VALUE(), LAST_VALUE(), and NTH_VALUE(), PARTITION BY can get you the value of the first, last, or nth row, respectively, within a partition.
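
As a quick illustration on the same employees table, this sketch compares each salary with the previous one in the department and also pulls the department’s lowest salary into every row. LAG() returns NULL for the first row of each partition because there is no preceding row.

SELECT employee_id,
       department_id,
       salary,
       LAG(salary) OVER (PARTITION BY department_id ORDER BY salary) AS previous_salary,
       FIRST_VALUE(salary) OVER (PARTITION BY department_id ORDER BY salary) AS lowest_department_salary
FROM employees;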

PARTITION BY vs. GROUP BY

While PARTITION BY and GROUP BY might seem similar, they have very different uses due to their distinct characteristics.

SQL GROUP BY is primarily used with basic aggregate functions. This is great for summary statistics and reports, where you only need to show aggregated results without the row-level details.

On the other hand, PARTITION BY can be used with all the window functions (not only aggregate), so there’s a broader range of calculations possible. In conjunction with window functions, it’s ideal for analytical queries where complex calculations are required without losing row-level data.

To be more specific, you would use GROUP BY to show total sales per month and PARTITION BY to show daily sales while displaying monthly totals alongside.
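
Here’s a minimal sketch of that difference, assuming a hypothetical sales table with sale_date and amount columns.

-- GROUP BY: one row per month; the daily details are collapsed
SELECT DATE_TRUNC('month', sale_date) AS sale_month,
       SUM(amount) AS monthly_sales
FROM sales
GROUP BY DATE_TRUNC('month', sale_date);

-- PARTITION BY: every daily row is kept, with the monthly total alongside
SELECT sale_date,
       amount,
       SUM(amount) OVER (PARTITION BY DATE_TRUNC('month', sale_date)) AS monthly_sales
FROM sales;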

Practical Applications of SQL PARTITION BY

Let’s now use several interview questions from our platform to demonstrate the SQL PARTITION BY clause’s practical applications.

Example #1: PARTITION BY With SUM() Window Function

This question by Amazon and Etsy asks you to calculate the percentage of their total spend that a customer spent on each order. You should output the customer’s first name, the order details, and the percentage of the order cost relative to their total spend across all orders.


Tables: orders, customers

Link to the question: https://platform.stratascratch.com/coding/9899-percentage-of-total-spend

We’ll work with two tables. The first is orders.

Table: orders
id | cust_id | order_date | order_details | total_order_cost
1  | 3       | 2019-03-04 | Coat          | 100
2  | 3       | 2019-03-01 | Shoes         | 80
3  | 3       | 2019-03-07 | Skirt         | 30
4  | 7       | 2019-02-01 | Coat          | 25
5  | 7       | 2019-03-10 | Shoes         | 80

The second one is customers.

Table: customers
id | first_name | last_name | city          | address | phone_number
8  | John       | Joseph    | San Francisco |         | 928-386-8164
7  | Jill       | Michael   | Austin        |         | 813-297-0692
4  | William    | Daniel    | Denver        |         | 813-368-1200
5  | Henry      | Jackson   | Miami         |         | 808-601-7513
13 | Emma       | Isaac     | Miami         |         | 808-690-5201

Let’s first calculate the total amount each customer spent across all orders. We’ll do this using the SUM() window function with PARTITION BY.

In the SUM() function, we define the values we want to sum, total_order_cost, in this case. To make this function a window function, it’s necessary to use the OVER() clause. If left like this, the window function would return the total amount of all orders for all customers.

However, we need to do that for each customer separately, so we use PARTITION BY to split the result into partitions by the customer's first name.

SELECT c.first_name,
       o.order_details,
       SUM(o.total_order_cost) OVER (PARTITION BY c.first_name) AS total_cost_by_customer
FROM orders o
JOIN customers c ON c.id = o.cust_id;

Here are the first several rows of the output. It’s a list of all customers and their orders, with the total order value for each customer.

All required columns and the first 5 rows of the solution are shown

first_name | order_details | total_cost_by_customer
Eva        | Coat          | 205
Eva        | Slipper       | 205
Eva        | Shirts        | 205
Farida     | Coat          | 260
Farida     | Shoes         | 260
Farida     | Skirt         | 260
Farida     | Shirts        | 260
Henry      | Shoes         | 80

I can now write the final code. As I need the percentage of the total spend per order, I will divide each order’s cost by the customer’s total order cost. In addition, I’ll cast the sum to FLOAT so the division is done with decimal precision rather than as integer division.

SELECT c.first_name,
       o.order_details,
       o.total_order_cost/SUM(o.total_order_cost) OVER (PARTITION BY c.first_name)::FLOAT AS percentage_total_cost
FROM orders o
JOIN customers c ON c.id = o.cust_id;

There you have it: an output as required by the question.

All required columns and the first 5 rows of the solution are shown

first_name | order_details | percentage_total_cost
Eva        | Coat          | 0.61
Eva        | Slipper       | 0.098
Eva        | Shirts        | 0.293
Farida     | Coat          | 0.385
Farida     | Shoes         | 0.308

Example #2: PARTITION BY With MAX() Window Function

Next, we have a question by MetLife and Credit Karma.


Tables: loans, submissions

Link to the question: https://platform.stratascratch.com/coding/2003-recent-refinance-submissions

Here, we need to show all users and their loan balances for the most recent ‘Refinance’ submission.

We’ll again work with two tables. The first one is loans.

Table: loans
id | user_id | created_at | status                    | type
1  | 100     | 2017-04-21 | prequal_completd_offer    | Refinance
2  | 100     | 2017-04-27 | offer_accepted            | Refinance
3  | 101     | 2017-04-22 | prequal_completd_no_offer | Refinance
4  | 101     | 2017-04-23 | offer_accepted            | Refinance
5  | 101     | 2017-04-25 | offer_accepted            | Personal

The second one is submissions.

Table: submissions
id | balance  | interest_rate | rate_type | loan_id
1  | 5229.12  | 8.75          | variable  | 2
2  | 12727.52 | 11.37         | fixed     | 4
3  | 14996.58 | 8.25          | fixed     | 9
4  | 21149    | 4.75          | variable  | 7
5  | 14379    | 3.75          | variable  | 5

The solution contains a subquery that shows the date of the most recent ‘Refinance’ submission for each user. It does so by using the MAX() window function with PARTITION BY.

The MAX() window function goes through the created_at column and returns the highest value, i.e., the latest date. PARTITION BY is there to create partitions by user ID, so the latest date is found for each user separately.

Since we need only the refinancing submissions, we filter data using WHERE.

SELECT DISTINCT id, 
                user_id, 
                created_at,
                MAX(created_at) OVER (PARTITION BY user_id) AS most_recent_refinance
FROM loans
WHERE type = 'Refinance';

Here’s the output.

All required columns and the first 5 rows of the solution are shown

id | user_id | created_at | most_recent_refinance
1  | 100     | 2017-04-21 | 2017-04-27
2  | 100     | 2017-04-27 | 2017-04-27
3  | 101     | 2017-04-22 | 2017-04-23
4  | 101     | 2017-04-23 | 2017-04-23
8  | 108     | 2017-04-21 | 2017-04-27
9  | 108     | 2017-04-27 | 2017-04-27
11 | 100     | 2015-04-21 | 2017-04-27

I’ll now turn this query into a subquery and write the rest of the code.

To select the user ID and the balance in the main query, we join the subquery with the table submissions.

As we need the balance of the latest ‘Refinance’ submission, we keep only the rows where the loan’s created_at date equals the most recent refinance date from the subquery.

In the final step, the output is grouped by user ID and balance.

SELECT r.user_id,
       s.balance
FROM
  (SELECT DISTINCT id, 
       user_id, 
       created_at,
       MAX(created_at) OVER (PARTITION BY user_id) AS most_recent_refinance
   FROM loans
   WHERE type = 'Refinance') AS r
INNER JOIN submissions s 
ON r.id = s.loan_id
WHERE most_recent_refinance = created_at
GROUP BY r.user_id, s.balance;

The output shows three users and their respective loan balances.

All required columns and the first 5 rows of the solution are shown

user_id | balance
100     | 5229.12
101     | 12727.52
108     | 14996.58

Combining PARTITION BY With Other SQL Features

PARTITION BY can also be used in conjunction with some other SQL features, such as:

  • FILTER
  • ORDER BY
  • Frame clauses (ROWS BETWEEN and RANGE BETWEEN)
  • CASE WHEN

Example #3: PARTITION BY With FILTER

The FILTER clause extends the aggregate window functions by giving the possibility of conditional aggregation within each partition.
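
As a minimal sketch of the idea, this query on the loans table from the previous example counts only each user’s ‘Refinance’ loans while still listing every loan. Keep in mind that FILTER is PostgreSQL syntax and isn’t available in every database.

SELECT id,
       user_id,
       type,
       COUNT(*) FILTER (WHERE type = 'Refinance') OVER (PARTITION BY user_id) AS refinance_loans_per_user
FROM loans;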

Let’s revisit the previous example and rewrite the code using FILTER.

The code stays virtually unchanged. The main difference is that the condition WHERE type = 'Refinance' from the subquery is now moved into the window function itself.

SELECT r.user_id,
       s.balance
FROM
  (SELECT DISTINCT id, 
                   user_id, 
                   created_at,
                   MAX(created_at) FILTER (WHERE type = 'Refinance') OVER (PARTITION BY user_id) AS most_recent_refinance
   FROM loans) AS r
INNER JOIN submissions s 
ON r.id = s.loan_id
WHERE r.most_recent_refinance = r.created_at
GROUP BY r.user_id, s.balance;

Example #4: PARTITION BY With ORDER BY

When I say ORDER BY, I mean the optional window function clause used for sorting the data within a partition.

By optional, I mean it isn’t mandatory for the aggregate window functions, although it can be used with them in some instances. However, it is required for the ranking and value window functions to work.
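
To illustrate, here’s a small sketch built on the orders and customers tables from Example #1. Without ORDER BY, the aggregate is computed over the whole partition; adding ORDER BY turns it into a running total because of the default window frame.

SELECT c.first_name,
       o.order_date,
       o.total_order_cost,
       SUM(o.total_order_cost) OVER (PARTITION BY c.first_name) AS customer_total,
       SUM(o.total_order_cost) OVER (PARTITION BY c.first_name ORDER BY o.order_date) AS running_total
FROM orders o
JOIN customers c ON c.id = o.cust_id;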

Here’s one such example in the question by Redfin.


Table: redfin_call_tracking

Link to the question: https://platform.stratascratch.com/coding/2022-update-call-duration

We need to find the average call duration for all update calls.

Here, we have a table named redfin_call_tracking.

Table: redfin_call_tracking
created_on          | request_id | call_duration | id
2020-03-01 04:08:04 | 2          | 3             | 1
2020-03-01 05:28:47 | 1          | 28            | 2
2020-03-01 07:27:36 | 2          | 22            | 3
2020-03-01 13:18:21 | 1          | 12            | 4
2020-03-01 15:08:08 | 2          | 13            | 5

Let’s start writing the solution from the innermost subquery: this SELECT statement.

SELECT *,
       RANK() OVER (PARTITION BY request_id ORDER BY created_on ASC) AS rk
FROM redfin_call_tracking;

It selects all the columns from the table and ranks the calls within each partition by date. The window function we use is RANK(). The result set is partitioned by the request ID. To make sure the calls for each request are ranked from the earliest to the latest date, we order the data within the partition in ascending order.

Why do we do all that? Because whichever call is ranked first is the initial call. All other calls are update calls. You’ll see in the next step why we need this, when we write the above SELECT as a subquery.

So, this is the complete code.

There’s another subquery referencing the above SELECT to get only the call IDs with a ranking above 1, which keeps only the update calls, as the initial call will always be ranked 1.

All this is then used in the main query to calculate the average duration of all these update calls.

SELECT AVG(call_duration) AS avg_call_duration
FROM redfin_call_tracking
WHERE id IN
    (SELECT id
     FROM
       (SELECT *,
               RANK() OVER (PARTITION BY request_id ORDER BY created_on ASC) AS rk
        FROM redfin_call_tracking) AS sq
     WHERE rk > 1);

The code returns an average of 17.882.

All required columns and the first 5 rows of the solution are shown

avg
17.882

Example #5: PARTITION BY With Frame Clauses

ROWS BETWEEN and RANGE BETWEEN are frame clauses in window functions used for defining a specific subset of rows within the partition.

To show what I mean by that, I’ll use the previous question as a basis, but I’ll calculate a moving average of call duration for each request.

The moving average is calculated using the AVG() window function, as you might expect. As in the previous example, I partitioned the data by the request ID and ordered it by the date.

I need the ROWS BETWEEN clause to turn this into a moving average. It tells the AVG() function to include the current row’s and the previous two rows’ call duration. You can specify any number of rows before and/or after the current row in your moving average calculation as long as it makes sense.

SELECT *,
       AVG(call_duration) OVER (PARTITION BY request_id ORDER BY created_on ASC ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_call_duration
FROM redfin_call_tracking;

Here are the first several rows of the output.

All required columns and the first 5 rows of the solution are shown

created_on          | request_id | call_duration | id | moving_avg_call_duration
2020-03-01 05:28:47 | 1          | 28            | 2  | 28
2020-03-01 13:18:21 | 1          | 12            | 4  | 20
2020-03-01 16:27:23 | 1          | 19            | 6  | 19.667
2020-03-02 00:10:10 | 1          | 28            | 9  | 19.667
2020-03-02 04:52:52 | 1          | 13            | 12 | 20
2020-03-02 08:20:25 | 1          | 21            | 16 | 20.667

The moving average in the first row is the same as the call duration. Because there are no preceding rows, the calculation can only consider the current row: 28/1 = 28.

In the second, the moving average includes the current and the previous row: (28+12)/2 = 20. Only one preceding row is considered because there are no two preceding rows available.

The third row can finally involve the current and the two preceding rows: (28+12+19)/3 = 19.667.

As we move further down the data, the window frame also moves to always include the current row and the two previous rows. So, the next calculation is (12+19+28)/3 = 19.667.

The next moving average is (19+28+13)/3 = 20, and so on. Once we get to the next request ID, the calculation resets, as the moving average is calculated for each partition separately.
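
The example above uses ROWS BETWEEN, which counts physical rows. RANGE BETWEEN works with the values of the ORDER BY column instead. For instance, this sketch averages the durations of all calls made in the 24 hours up to and including the current call; it assumes PostgreSQL 11 or later, where RANGE frames accept interval offsets.

SELECT *,
       AVG(call_duration) OVER (PARTITION BY request_id
                                ORDER BY created_on
                                RANGE BETWEEN INTERVAL '1 day' PRECEDING AND CURRENT ROW) AS avg_last_day
FROM redfin_call_tracking;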

Example #6: PARTITION BY With CASE WHEN

The CASE WHEN statement is SQL’s way of handling the IF-THEN logic; it returns data that satisfies the specific criteria.

This can be utilized to perform conditional aggregation in the context of window functions. I’ll show you an example in this Uber interview question.


Table: uber_advertising

Link to the question: https://platform.stratascratch.com/coding/10013-positive-ad-channels

However, I’ll change the question’s requirements. So, imagine an advertising channel is considered high-yield when it acquires over 1,500 customers in a year.

The task is to show, for each channel, the total number of customers acquired in its high-yield years. In other words, we want to include in the sum only the yearly values above 1,500 customers.

We have the table uber_advertising at our disposal.

Table: uber_advertising
year | advertising_channel | money_spent | customers_acquired
2019 | celebrities         | 10000000    | 1800
2019 | billboards          | 1000000     | 2000
2019 | busstops            | 1500        | 400
2019 | buses               | 70000       | 2500
2019 | tv                  | 300000      | 5000

Here’s the code.

I used CASE WHEN to include in the sum only those rows with more than 1,500 customers acquired. Then, I partitioned the results by advertising channel.

SELECT *,
       SUM(CASE WHEN customers_acquired > 1500 THEN customers_acquired ELSE 0 END) OVER (PARTITION BY advertising_channel) AS high_yield_customers_acquired
FROM uber_advertising
ORDER BY advertising_channel, year;

Take a look at the output.

All required columns and the first 5 rows of the solution are shown

year | advertising_channel | money_spent | customers_acquired | high_yield_customers_acquired
2017 | billboards          | 200200      | 2100               | 5900
2018 | billboards          | 500000      | 1800               | 5900
2019 | billboards          | 1000000     | 2000               | 5900
2017 | buses               | 3589000     | 2700               | 7500
2018 | buses               | 550000      | 2300               | 7500
2019 | buses               | 70000       | 2500               | 7500
2017 | busstops            | 80000       | 800                | 0
2018 | busstops            | 35000       | 600                | 0
2019 | busstops            | 1500        | 400                | 0
2017 | celebrities         | 300000      | 1900               | 5800
2018 | celebrities         | 123555      | 2100               | 5800
2019 | celebrities         | 10000000    | 1800               | 5800
2017 | radio               | 80000       | 1200               | 0
2018 | radio               | 2500        | 250                | 0
2019 | radio               | 1500        | 51                 | 0
2017 | tv                  | 80000       | 4700               | 15000
2018 | tv                  | 500000      | 5300               | 15000
2019 | tv                  | 300000      | 5000               | 15000

For example, billboards and buses acquired 5,900 and 7,500 customers in total, respectively. However, bus stops show 0, as this channel never went above 1,500 customers in any of the three years. The same is true for radio.

Handling NULL Values and Outliers

NULL values and outliers can significantly skew your analysis, so knowing how to exclude them when using PARTITION BY is important.

Example #7: Handling NULL Values

NULL values are typically handled in the WHERE clause.

There’s a question by Google and Deloitte that shows this.


Table: voting_results

Link to the question: https://platform.stratascratch.com/coding/2099-election-results

To solve the question, we must find out who got the most votes and won the election.

We’re given the voting_results table.

Table: voting_results
voter    | candidate
Kathy    |
Charles  | Ryan
Charles  | Christine
Charles  | Kathy
Benjamin | Christine

The first SELECT statement uses the COUNT() window function and PARTITION BY to calculate the number of votes each voter cast. We then divide 1 by this count to get the value of each vote, since a voter who votes for several candidates has their vote split equally among them.

There are also voters who didn’t vote for anybody; these rows have a NULL candidate value. We need to exclude them; otherwise, they would be counted as votes for a NULL candidate. No, there’s no candidate named NULL! So, we exclude those rows in WHERE.

SELECT voter,
       candidate,
       1.0 / COUNT(*) OVER (PARTITION BY voter) AS vote_value
FROM voting_results
WHERE candidate IS NOT NULL;

I can now turn this into a subquery and write the rest of the code. There’s another subquery that builds on this one.

It sums the vote values and rounds the result to three decimal places, as instructed by the question. Then, it ranks the candidates by this total. DENSE_RANK() is used, so candidates with the same number of votes share the same rank, and there are no gaps between the ranks in case of ties.

The main query then just filters the data to show only the candidates ranked first.

SELECT candidate
FROM
  (SELECT candidate,
          ROUND(SUM(vote_value), 3) AS n_votes,
          DENSE_RANK() OVER (ORDER BY ROUND(SUM(vote_value), 3) DESC) AS place
   FROM
     (SELECT voter,
             candidate,
             1.0 / COUNT(*) OVER (PARTITION BY voter) AS vote_value
      FROM voting_results
      WHERE candidate IS NOT NULL) AS value
   GROUP BY candidate) AS results
WHERE place = 1;

The winner of these elections is Christine.

All required columns and the first 5 rows of the solution are shown

candidate
Christine

Example #8: Handling Outliers

Here, we have a great example of how you can handle outliers using PARTITION BY.

Salesforce and Glassdoor ask you to compare each employee’s salary with the average salary of the corresponding department.

However, I want to make this calculation more realistic, so I will exclude the outliers from the calculation. Not doing that can skew the results; e.g., a manager with an extremely high salary can artificially inflate the average salary of an otherwise low-paid department.


Table: employee

Link to the question: https://platform.stratascratch.com/coding/9917-average-salaries

We have only one table, namely employee.

Table: employee
id | first_name | last_name | age | sex | employee_title | department | salary | target | bonus | email | city | address | manager_id
5  | Max      | George | 26 | M | Sales   | Sales      | 1300   | 200 | 150 | Max@company.com      | California | 2638 Richards Avenue | 1
13 | Katty    | Bond   | 56 | F | Manager | Management | 150000 | 0   | 300 | Katty@company.com    | Arizona    |                      | 1
11 | Richerd  | Gear   | 57 | M | Manager | Management | 250000 | 0   | 300 | Richerd@company.com  | Alabama    |                      | 1
10 | Jennifer | Dion   | 34 | F | Sales   | Sales      | 1000   | 200 | 150 | Jennifer@company.com | Alabama    |                      | 13
19 | George   | Joe    | 50 | M | Manager | Management | 100000 | 0   | 300 | George@company.com   | Florida    | 1003 Wyatt Street    | 1

The first step is to use the PERCENTILE_CONT() function and the WITHIN GROUP clause in a CTE to calculate each department’s 95th salary percentile.

PERCENTILE_CONT() stands for continuous percentile and is an aggregate function for calculating the nth percentile you define in the function parentheses.

The WITHIN GROUP clause defines the order of the values in the percentile calculation; in this case, it’s employee salary sorted in ascending order.

WITH percentiles AS (
  SELECT department,
         PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) AS p95_salary
  FROM employee
  GROUP BY department
),

Running this CTE as a simple SELECT returns this output.

SELECT department,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) AS p95_salary
FROM employee
GROUP BY department;

All required columns and the first 5 rows of the solution are shown

department | p95_salary
Audit      | 1085
Management | 242500
Sales      | 2000

Now that we have 95th percentiles for each department, we can write another CTE that removes all the individual salaries that are above the 95th departmental percentile.

WITH percentiles AS (
  SELECT department,
         PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) AS p95_salary
  FROM employee
  GROUP BY department
),

filtered_salaries AS (
  SELECT e.department,
         e.first_name,
         e.salary
  FROM employee e
  JOIN percentiles p
  ON e.department = p.department
  WHERE e.salary <= p.p95_salary
)

To write the final solution, you just need to add SELECT with the AVG() window function and partition data by department.

WITH percentiles AS (
  SELECT department,
         PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) AS p95_salary
  FROM employee
  GROUP BY department
),

filtered_salaries AS (
  SELECT e.department,
         e.first_name,
         e.salary
  FROM employee e
  JOIN percentiles p
  ON e.department = p.department
  WHERE e.salary <= p.p95_salary
)

SELECT department,
       first_name,
       salary,
       AVG(salary) OVER (PARTITION BY department) AS avg_salary_by_department
FROM filtered_salaries;

Here’s the output.

All required columns and the first 5 rows of the solution are shown

department | first_name | salary | avg_salary_by_department
Audit      | Jason      | 1000   | 900
Audit      | Celine     | 1000   | 900
Audit      | Michale    | 700    | 900
Management | Katty      | 150000 | 150000
Management | Allen      | 200000 | 150000
Management | George     | 100000 | 150000
Sales      | Britney    | 1200   | 1295.238
Sales      | Jack       | 1300   | 1295.238
Sales      | Ben        | 1300   | 1295.238
Sales      | Tom        | 1200   | 1295.238
Sales      | Antoney    | 1300   | 1295.238
Sales      | Morgan     | 1200   | 1295.238
Sales      | Molly      | 1400   | 1295.238
Sales      | Nicky      | 1400   | 1295.238
Sales      | Monika     | 1000   | 1295.238
Sales      | Adam       | 1300   | 1295.238
Sales      | Mark       | 1200   | 1295.238
Sales      | John       | 1500   | 1295.238
Sales      | Joe        | 1000   | 1295.238
Sales      | Henry      | 2000   | 1295.238
Sales      | Max        | 1300   | 1295.238
Sales      | Sam        | 1000   | 1295.238
Sales      | Jennifer   | 1000   | 1295.238
Sales      | Laila      | 1000   | 1295.238
Sales      | Sarrah     | 2000   | 1295.238
Sales      | Suzan      | 1300   | 1295.238
Sales      | Mandy      | 1300   | 1295.238

Now, the above code could be simplified, but our database (PostgreSQL) doesn’t allow using PERCENTILE_CONT() with the OVER() clause. Nevertheless, here’s the simplified code, which should give you the same output in databases that support this syntax.

WITH percentiles AS (
  SELECT department,
         first_name,
         salary,
         PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) OVER (PARTITION BY department) AS p95_salary
  FROM employee
)

SELECT department,
       first_name,
       salary,
       AVG(salary) OVER (PARTITION BY department) AS avg_salary_by_department
FROM percentiles
WHERE salary <= p95_salary;

Best Practices for Efficient Queries

Yes, SQL PARTITION BY does allow you to get very complex insights from your data. However, as your queries become more complex, there’s a danger of them becoming less efficient.

Here’s some advice on how to avoid that when using PARTITION BY.

1. Indexing: Leverage indexing on the columns you use in PARTITION BY (see the sketch after this list).

2. Limited Partitioning: Don’t overdo the partitioning, as it can decrease the query performance. Use PARTITION BY only when necessary.

3. Combine PARTITION BY and WHERE: Whenever possible, filter data using WHERE before partitioning it. This reduces the size of the dataset PARTITION BY works with.
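
As an illustration of the first point, an index covering the PARTITION BY column (and, optionally, the ORDER BY column) might look like the sketch below. The index name is made up, and whether it actually helps depends on your database and the rest of the query, so treat it as a starting point rather than a rule.

CREATE INDEX idx_employees_department_salary ON employees (department_id, salary);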

Limitations and Caveats

Everything has its limitations, and SQL PARTITION BY is not above that. These are the three most important limitations and caveats you should consider.

1. Performance: With very large datasets, PARTITION BY can be resource-intensive.

2. Complexity: Partitioning datasets is a complex operation. Keep this in mind and design your queries carefully to avoid performance bottlenecks.

3. Database Support: Not all SQL databases fully support advanced window functions. You saw that in the previous example, where our PostgreSQL version doesn’t support using OVER() with PERCENTILE_CONT(). In addition, the syntax of the window functions between some databases can differ.

Conclusion

When PARTITION BY and window functions work together, they enhance each other’s capabilities. Used together, they allow you to rank, aggregate, and analyze data, while also adding multiple dimensions to the analysis with data partitioning.

The practical applications are almost endless, and they come with an ability you can’t get any other way in SQL: showing analytical and aggregated data simultaneously.

Because of this, they are also very popular in SQL interview questions. Don’t waste time: start practicing PARTITION BY and window functions; you can find them among the many coding questions on our platform.

Also, visit our blog and YouTube channel for more resources on PARTITION BY and other important SQL topics.
