Wine Magazine SQL Interview Question on RANK function
Categories
An exhaustive walkthrough to a hard-level SQL question involving windows functions, PARTITION BY clauses, CASE statements, and Common Table Expressions.
Are you a wine enthusiast? If you are, then you’ve probably heard of Wine Magazine. Let us solve one of the advanced SQL interview questions on RANK function. The question is taken from Wine Magazine that involves clauses like PARTITION BY, UNION ALL, CASE statements, and window functions. We will use a 3-step framework to solve the problem that can be used to solve any coding problem.
Interview Question: Find the Cheapest and the Most Expensive Wine
Interview Question Date: March 2020
Find the cheapest and the most expensive variety in each region. Output the region along with the corresponding most expensive and the cheapest variety. Be aware that there are 2 region columns, the price from that row applies to both of them.
Note: The results set contains no ties.
Link to the question: https://platform.stratascratch.com/coding/10041-most-expensive-and-cheapest-wine
This is the question we also made a video for, so make sure you take a look.
The goal of the problem is to display the most expensive and cheapest wine for each region mentioned in the dataset. Speaking of the dataset, let us take a peek at the dataset itself.
1. Exploring the Dataset
The schema and the preview of the dataset are as follows:
id | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
---|---|---|---|---|---|---|---|---|---|---|
126576 | US | Rich and round, this offers plenty of concentrated blackberry notes enveloped in warm spices and supple oak. There's a hint of green tomato leaves throughout, but the lush fruit combined with sturdy grape tannins and high acidity would pair well with fatty short ribs or braised pork. | Estate Club | 87 | 32 | Virginia | Virginia | Merlot | Veramar | |
127077 | Italy | This luminous sparkler offers measured aromas of yellow fruit and honey and delivers a broad, wide approach in the mouth. The wine's texture is creamy and full and there is a spicy point of effervescence on the palate. | Extra Dry | 85 | 19 | Veneto | Prosecco di Valdobbiadene | Prosecco | Varaschin | |
143029 | US | There are some pretty features to this wine, namely the lovely tannins and rich fruit, but it's just too soft and sweet. Tastes like blackberry jam, with extra white sugar spinkled on top. | Signature | 83 | 45 | California | Paso Robles | Central Coast | Cabernet Sauvignon | Byington |
98813 | US | Lovely nose, which runs from violets and cherries right into coffee and sweet chocolate. In the mouth the wine coats the palate with Napa-like intensity, only buttressed with more vivid acidity. It's both smooth and deep. The aromas are more detailed, interesting and really clean, loaded with dark flavors of soy, coffee, molasses and black cherry liqueur. | Champoux & Loess Vineyards Reserve | 93 | 100 | Washington | Washington | Washington Other | Cabernet Sauvignon | Boudreaux Cellars |
43172 | US | Tastes sweeter and fruitier than the 2009, which was dry, so if you like orange marmalade, pineapple tart and vanilla cookie sugariness in your Sauvignon Blancs, it's for you. | Les Pionniers | 86 | 40 | California | Sonoma County | Sonoma | Sauvignon Blanc | Capture |
The table winemag_p1 contains columns describing the wine, its designation, the price, the province, the variety, and the region from which it comes. You must note that there are two region columns. From closer inspection, the two regions, in some cases, seem to be interrelated. For instance, Virginia is listed as the region in both columns for the Merlot from the Veramar winery. For someone who has not understood the question, this can cause confusion.
Looking back at the question, it is clear that the price applies to both regions, which means that they are not interchangeable. They are two separate regions to which the wine and its price apply. In situations like these, it is best to communicate with the interviewer and clarify your assumptions about the dataset and your approach.
There also seem to be NULL values in the columns designation, price, region_1, and region_2. These need to be filtered out while coding, especially when performing windows functions over these columns. So, it is crucial to keep an eye out for those pesky NULL values.
2. Writing out the Approach
Before beginning to code, getting the approach figured out is a smart way to go about solving coding problems. It gives you an opportunity to think about and structure your code effectively. What’s more, you can discuss your approach with the interviewer and clarify any questions you might have regarding the dataset and how it is formed. Communicating such intricacies with the interviewer demonstrates your ability to solve the problem in a well-rounded manner.
As for this problem, here is the approach we will take.
Step 1: Merge the two region columns into one.
The first obstacle in solving this problem is the region conundrum. The dataset has provided us with two columns. The wine and its price apply to both the regions mentioned in the dataset. So, it only makes sense to merge the two columns into one. Here, we will make use of UNION ALL to merge the two subsets of data with respect to the regions.
Step 2: Use CTEs or Common Table Expressions for each subquery.
Although using CTEs might look like a lot of extra lines of code, it makes the code cascade seamlessly and is easy on the eyes. Any other developer or even you would find going back to such well-organized code easy to understand and maintain. So, we will section the subqueries into CTEs as we work on the resultant table of the latest CTE.
Step 3: Rank the most expensive and the cheapest wine.
Now, we get to the nitty-gritty of the problem. Ranking. Our objective is to find the most expensive and cheapest wine for each region. So, we will use two markers or columns named ‘expensive_rank’ and ‘cheap_rank’ to rank the wine appropriately for each region from the newly merged region column.
RANK() is a windows function that we will perform over the region column using the OVER(PARTITION BY) clause. Moreover, we will perform an order on the price column, either ascending or descending, in the same OVER(PARTITION BY) clause in order to rank the wine.
Step 4: Filter out the NULL values.
If you recall seeing NULL values in the dataset preview, this is where they will appear. Since we are ranking the regions based on price in both ascending and descending orders, rows with NULL values will inevitably occupy the resultant dataset. So, the obvious next step is to exclude the rows with NULL values.
Step 5: Use CASE statements to select the 1st rank in each of the ranking columns.
Once the NULL values are removed, we have a viable dataset to work with. Now, we can use the rank columns and pick the most expensive and cheapest wine variety for each region using CASE statements.
Step 6: Select the required output columns using the Max function.
Finally, we can select the required columns, as demanded by the question, with the help of the MAX() function. We would see two rows for each region from the latest resultant table since that was a CASE statement, not an aggregation function.
So, now it is time to use an aggregate function like MAX() or MIN() so that we can group them in the end to display just one row per region. In this case, we will use MAX() function.
3. Coding the Solution
Now, let’s get our hands dirty!
Step 1: Merge the two region columns into one.
We will select region_1, variety, and the price columns in one subset and region_2 and the same two columns in another subset and perform UNION ALL. This way, we will merge the two region columns into one, along with the duplicates.
SELECT region_1 AS region,
variety,
price
FROM winemag_p1
UNION ALL
SELECT region_2 AS region,
variety,
price
FROM winemag_p1
Let us run the query and check the output.
region | variety | price |
---|---|---|
Virginia | Merlot | 32 |
Prosecco di Valdobbiadene | Prosecco | 19 |
Paso Robles | Cabernet Sauvignon | 45 |
Washington | Cabernet Sauvignon | 100 |
Sonoma County | Sauvignon Blanc | 40 |
Our base table is ready.
Step 2: Use CTEs or Common Table Expressions for each subquery.
Let us turn this query into a CTE (Common Table Expression) such that it acts a temporary result set named ‘cte1’.
WITH cte1 AS
(SELECT region_1 AS region,
variety,
price
FROM winemag_p1
UNION ALL SELECT region_2 AS region,
variety,
price
FROM winemag_p1)
Step 3: Rank the most expensive and the cheapest wine.
From cte1, we will create two new columns named ‘expensive_rank’ and ‘cheap_rank’. We will use these columns to rank the wine for each region in terms of price. The column ‘expensive_rank’ will sort in descending order of price, and ‘cheap_rank’ will sort in ascending order. We are using the OVER(PARTITION BY) clause here.
WITH cte1 AS
(SELECT region_1 AS region,
variety,
price
FROM winemag_p1
UNION ALL SELECT region_2 AS region,
variety,
price
FROM winemag_p1)
SELECT region,
variety,
price,
RANK() OVER(PARTITION BY region
ORDER BY price DESC) AS expensive_rank,
RANK() OVER(PARTITION BY region
ORDER BY price) AS cheap_rank
FROM cte1
Run the query.
region | variety | price | expensive_rank | cheap_rank |
---|---|---|---|---|
Alexander Valley | Merlot | 19 | 3 | 1 |
Alexander Valley | Cabernet Sauvignon | 35 | 2 | 2 |
Alexander Valley | Cabernet Sauvignon | 45 | 1 | 3 |
Anderson Valley | Pinot Noir | 52 | 1 | 1 |
Barbaresco | Nebbiolo | 24 | 3 | 1 |
We can see that the rows with NULL values for region and price have been arranged at the top of the result set. Let us remove those in the following step.
Step 4: Filter out the NULL values.
Let us filter out rows that contain NULL values in the region and price columns.
WHERE region IS NOT NULL
AND price IS NOT NULL
Let us run the query and look at the difference in output.
region | variety | price | expensive_rank | cheap_rank |
---|---|---|---|---|
Alexander Valley | Merlot | 19 | 3 | 1 |
Alexander Valley | Cabernet Sauvignon | 35 | 2 | 2 |
Alexander Valley | Cabernet Sauvignon | 45 | 1 | 3 |
Anderson Valley | Pinot Noir | 52 | 1 | 1 |
Barbaresco | Nebbiolo | 24 | 3 | 1 |
Here we are. Let us move on to picking the highest ranks in each of the ranking columns.
Step 5: Use CASE statements to select the 1st rank in each of the ranking columns.
Firstly, we need to make the current query into a CTE named ‘cte2’. Now, we can designate the most expensive and the cheapest wine of each region using CASE statements. And then, we will name the columns accordingly, as you can see in the query below.
WITH cte1 AS
(SELECT region_1 AS region,
variety,
price
FROM winemag_p1
UNION ALL SELECT region_2 AS region,
variety,
price
FROM winemag_p1),
cte2 AS
(SELECT region,
variety,
price,
RANK() OVER(PARTITION BY region
ORDER BY price DESC) AS expensive_rank,
RANK() OVER(PARTITION BY region
ORDER BY price) AS cheap_rank
FROM cte1
WHERE region IS NOT NULL
AND price IS NOT NULL)
SELECT region,
CASE
WHEN expensive_rank =1 THEN variety
END AS most_expensive,
CASE
WHEN cheap_rank =1 THEN variety
END AS cheapest
FROM cte2
WHERE expensive_rank= 1
OR cheap_rank=1
Run the query.
region | most_expensive | cheapest |
---|---|---|
Alexander Valley | Merlot | |
Alexander Valley | Cabernet Sauvignon | |
Anderson Valley | Pinot Noir | Pinot Noir |
Barbaresco | Nebbiolo | |
Barbaresco | Nebbiolo |
Now the resultant table contains two rows for each region, with each row showing the most expensive and the cheapest wine respectively. It is time to aggregate the rows to show the expected output.
Step 6: Select the required output columns using the Max function.
The final step is to use MAX() function to pick the NON-NULL value out of the two columns and assign the rank. To perform on the previous query, we will again section that into another CTE named ‘cte3’.
In the end, we will use the GROUP BY clause since we’re using an aggregate function in this query. And we will ORDER BY 1. Here is the final query.
WITH cte1 AS
(SELECT region_1 AS region,
variety,
price
FROM winemag_p1
UNION ALL SELECT region_2 AS region,
variety,
price
FROM winemag_p1),
cte2 AS
(SELECT region,
variety,
price,
RANK() OVER(PARTITION BY region
ORDER BY price DESC) AS expensive_rank,
RANK() OVER(PARTITION BY region
ORDER BY price) AS cheap_rank
FROM cte1
WHERE region IS NOT NULL
AND price IS NOT NULL),
cte3 AS
(SELECT region,
CASE
WHEN expensive_rank =1 THEN variety
END AS most_expensive,
CASE
WHEN cheap_rank =1 THEN variety
END AS cheapest
FROM cte2
WHERE expensive_rank = 1
OR cheap_rank=1)
SELECT region,
MAX(most_expensive) AS most_expensive,
MAX(cheapest) AS cheapest
FROM cte3
GROUP BY 1
ORDER BY 1
Let us run the query and see the final solution to our problem.
region | most_expensive_variety | cheapest_variety |
---|---|---|
Alexander Valley | Cabernet Sauvignon | Merlot |
Anderson Valley | Pinot Noir | Pinot Noir |
Barbaresco | Nebbiolo | Nebbiolo |
Brunello di Montalcino | Sangiovese | Sangiovese |
California | Pinot Noir | Sauvignon Blanc |
Conclusion
And, there it is! We have solved the problem with advanced SQL concepts like CASE statements, PARTITION BY clauses, as well as RANK() and MAX() functions. As we mentioned earlier, once you lay down your approach, the rest is a walk in the park. Hope you had fun solving this advanced problem. See you here again!