Wine Magazine SQL Interview Question on RANK function

SQL Interview Question on RANK function from Wine Magazine

Published: January 16, 2023

Categories:

Written by:
Nathan Rosidi

An exhaustive walkthrough to a hard-level SQL question involving windows functions, PARTITION BY clauses, CASE statements, and Common Table Expressions.

Are you a wine enthusiast? If you are, then you’ve probably heard of Wine Magazine. Let us solve one of the advanced SQL interview questions on RANK function. The question is taken from Wine Magazine that involves clauses like PARTITION BY, UNION ALL, CASE statements, and window functions. We will use a 3-step framework to solve the problem that can be used to solve any coding problem.

Interview Question: Find the Cheapest and the Most Expensive Wine

Most Expensive And Cheapest Wine

Last Updated: March 2020

Wine Magazine

HardID 10041

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the cheapest and the most expensive variety in each region. Output the region along with the corresponding most expensive and the cheapest variety. Be aware that there are 2 region columns, the price from that row applies to both of them.

Note: The results set contains no ties.

Table: winemag_p1

Link to the question: https://platform.stratascratch.com/coding/10041-most-expensive-and-cheapest-wine

This is the question we also made a video for, so make sure you take a look.

The goal of the problem is to display the most expensive and cheapest wine for each region mentioned in the dataset. Speaking of the dataset, let us take a peek at the dataset itself.

1. Exploring the Dataset

The schema and the preview of the dataset are as follows:

Table: winemag_p1

The table winemag_p1 contains columns describing the wine, its designation, the price, the province, the variety, and the region from which it comes. You must note that there are two region columns. From closer inspection, the two regions, in some cases, seem to be interrelated. For instance, Virginia is listed as the region in both columns for the Merlot from the Veramar winery. For someone who has not understood the question, this can cause confusion.

Looking back at the question, it is clear that the price applies to both regions, which means that they are not interchangeable. They are two separate regions to which the wine and its price apply. In situations like these, it is best to communicate with the interviewer and clarify your assumptions about the dataset and your approach.

There also seem to be NULL values in the columns designation, price, region_1, and region_2. These need to be filtered out while coding, especially when performing windows functions over these columns. So, it is crucial to keep an eye out for those pesky NULL values.

2. Writing out the Approach

Before beginning to code, getting the approach figured out is a smart way to go about solving coding problems. It gives you an opportunity to think about and structure your code effectively. What’s more, you can discuss your approach with the interviewer and clarify any questions you might have regarding the dataset and how it is formed. Communicating such intricacies with the interviewer demonstrates your ability to solve the problem in a well-rounded manner.

As for this problem, here is the approach we will take.

Step 1: Merge the two region columns into one.

The first obstacle in solving this problem is the region conundrum. The dataset has provided us with two columns. The wine and its price apply to both the regions mentioned in the dataset. So, it only makes sense to merge the two columns into one. Here, we will make use of UNION ALL to merge the two subsets of data with respect to the regions.

Step 2: Use CTEs or Common Table Expressions for each subquery.

Although using CTEs might look like a lot of extra lines of code, it makes the code cascade seamlessly and is easy on the eyes. Any other developer or even you would find going back to such well-organized code easy to understand and maintain. So, we will section the subqueries into CTEs as we work on the resultant table of the latest CTE.

Step 3: Rank the most expensive and the cheapest wine.

Now, we get to the nitty-gritty of the problem. Ranking. Our objective is to find the most expensive and cheapest wine for each region. So, we will use two markers or columns named ‘expensive_rank’ and ‘cheap_rank’ to rank the wine appropriately for each region from the newly merged region column.

RANK() is a windows function that we will perform over the region column using the OVER(PARTITION BY) clause. Moreover, we will perform an order on the price column, either ascending or descending, in the same OVER(PARTITION BY) clause in order to rank the wine.

Step 4: Filter out the NULL values.

If you recall seeing NULL values in the dataset preview, this is where they will appear. Since we are ranking the regions based on price in both ascending and descending orders, rows with NULL values will inevitably occupy the resultant dataset. So, the obvious next step is to exclude the rows with NULL values.

Step 5: Use CASE statements to select the 1st rank in each of the ranking columns.

Once the NULL values are removed, we have a viable dataset to work with. Now, we can use the rank columns and pick the most expensive and cheapest wine variety for each region using CASE statements.

Step 6: Select the required output columns using the Max function.

Finally, we can select the required columns, as demanded by the question, with the help of the MAX() function. We would see two rows for each region from the latest resultant table since that was a CASE statement, not an aggregation function.

So, now it is time to use an aggregate function like MAX() or MIN() so that we can group them in the end to display just one row per region. In this case, we will use MAX() function.

3. Coding the Solution

Now, let’s get our hands dirty!

Step 1: Merge the two region columns into one.

We will select region_1, variety, and the price columns in one subset and region_2 and the same two columns in another subset and perform UNION ALL. This way, we will merge the two region columns into one, along with the duplicates.

SELECT region_1 AS region,
       variety,
       price
FROM winemag_p1
UNION ALL
SELECT region_2 AS region,
       variety,
       price
FROM winemag_p1

Let us run the query and check the output.

Our base table is ready.

Step 2: Use CTEs or Common Table Expressions for each subquery.

Let us turn this query into a CTE (Common Table Expression) such that it acts a temporary result set named ‘cte1’.

WITH cte1 AS
  (SELECT region_1 AS region,
          variety,
          price
   FROM winemag_p1
   UNION ALL SELECT region_2 AS region,
                    variety,
                    price
   FROM winemag_p1)

Step 3: Rank the most expensive and the cheapest wine.

From cte1, we will create two new columns named ‘expensive_rank’ and ‘cheap_rank’. We will use these columns to rank the wine for each region in terms of price. The column ‘expensive_rank’ will sort in descending order of price, and ‘cheap_rank’ will sort in ascending order. We are using the OVER(PARTITION BY) clause here.

WITH cte1 AS
  (SELECT region_1 AS region,
          variety,
          price
   FROM winemag_p1
   UNION ALL SELECT region_2 AS region,
                    variety,
                    price
   FROM winemag_p1)
SELECT region,
       variety,
       price,
       RANK() OVER(PARTITION BY region
                   ORDER BY price DESC) AS expensive_rank,
       RANK() OVER(PARTITION BY region
                   ORDER BY price) AS cheap_rank
FROM cte1

Run the query.

We can see that the rows with NULL values for region and price have been arranged at the top of the result set. Let us remove those in the following step.

Step 4: Filter out the NULL values.

Let us filter out rows that contain NULL values in the region and price columns.

WHERE region IS NOT NULL
  AND price IS NOT NULL

Let us run the query and look at the difference in output.

Here we are. Let us move on to picking the highest ranks in each of the ranking columns.

Step 5: Use CASE statements to select the 1st rank in each of the ranking columns.

Firstly, we need to make the current query into a CTE named ‘cte2’. Now, we can designate the most expensive and the cheapest wine of each region using CASE statements. And then, we will name the columns accordingly, as you can see in the query below.

WITH cte1 AS
  (SELECT region_1 AS region,
          variety,
          price
   FROM winemag_p1
   UNION ALL SELECT region_2 AS region,
                    variety,
                    price
   FROM winemag_p1),
     cte2 AS
  (SELECT region,
          variety,
          price,
          RANK() OVER(PARTITION BY region
                      ORDER BY price DESC) AS expensive_rank,
          RANK() OVER(PARTITION BY region
                      ORDER BY price) AS cheap_rank
   FROM cte1
   WHERE region IS NOT NULL
     AND price IS NOT NULL)
SELECT region,
       CASE
           WHEN expensive_rank =1 THEN variety
       END AS most_expensive,
       CASE
           WHEN cheap_rank =1 THEN variety
       END AS cheapest
FROM cte2
WHERE expensive_rank= 1
  OR cheap_rank=1

Run the query.

Now the resultant table contains two rows for each region, with each row showing the most expensive and the cheapest wine respectively. It is time to aggregate the rows to show the expected output.

Step 6: Select the required output columns using the Max function.

The final step is to use MAX() function to pick the NON-NULL value out of the two columns and assign the rank. To perform on the previous query, we will again section that into another CTE named ‘cte3’.

In the end, we will use the GROUP BY clause since we’re using an aggregate function in this query. And we will ORDER BY 1. Here is the final query.

WITH cte1 AS
  (SELECT region_1 AS region,
          variety,
          price
   FROM winemag_p1
   UNION ALL SELECT region_2 AS region,
                    variety,
                    price
   FROM winemag_p1),
     cte2 AS
  (SELECT region,
          variety,
          price,
          RANK() OVER(PARTITION BY region
                      ORDER BY price DESC) AS expensive_rank,
          RANK() OVER(PARTITION BY region
                      ORDER BY price) AS cheap_rank
   FROM cte1
   WHERE region IS NOT NULL
     AND price IS NOT NULL),
     cte3 AS
  (SELECT region,
          CASE
              WHEN expensive_rank =1 THEN variety
          END AS most_expensive,
          CASE
              WHEN cheap_rank =1 THEN variety
          END AS cheapest
   FROM cte2
   WHERE expensive_rank = 1
     OR cheap_rank=1)
SELECT region,
       MAX(most_expensive) AS most_expensive,
       MAX(cheapest) AS cheapest
FROM cte3
GROUP BY 1
ORDER BY 1

Let us run the query and see the final solution to our problem.

Conclusion

And, there it is! We have solved the problem with advanced SQL concepts like CASE statements, PARTITION BY clauses, as well as RANK() and MAX() functions. As we mentioned earlier, once you lay down your approach, the rest is a walk in the park. Hope you had fun solving this advanced problem. See you here again!

Wine Magazine SQL Interview Question on RANK function

Interview Question: Find the Cheapest and the Most Expensive Wine

Most Expensive And Cheapest Wine

1. Exploring the Dataset

2. Writing out the Approach

3. Coding the Solution

Conclusion

Latest Posts:

Using Principal Component Analysis in R for Real-World Data

Looping Through Lists in Python: A Comprehensive Tutorial

LLM Deep Dive: Practitioner vs. Researcher Path