Amazon SQL Interview Question Walkthrough: Caller History
Categories
A detailed introduction to using FIRST_VALUE(), explained in a step-by-step solution to Amazon’s medium-level SQL interview question.
As you prepare for data science interviews, you’ll notice that window functions come in handy when SQL interview questions require complicated calculations and manipulations. If you’re completely new to window functions, check out this “SQL Window Functions”. It will give you an overview of window functions and the variety available for different tasks.
In this article, we will focus only on one window function and use the case to understand some ins and outs of window functions that could easily catch you off guard in an interview!
The interview question, entitled ‘Caller History’, is one of the medium-level Amazon SQL interview questions.
Interview Question: Caller History
Interview Question Date: October 2022
Given a phone log table that has information about callers' call history, find out the callers whose first and last calls were to the same person on a given day. Output the caller ID, recipient ID, and the date called.
We also made a video walkthrough of this question. Check it out!
Using a call log table, you are asked to determine the callers whose last and first calls were to the same person on a given day. After determining these callers, you should return a table containing the caller ID, the recipient ID, and the date of the call.
Solution Approach Framework
Before coding the solution, I would advise you to take a structured approach to answer the question. The framework that’s worked well from experience follows:
- Exploring the dataset
- Writing out the code logic
- Coding up the solution
While it can be tempting to dive straight into the solution, it’s always helpful to explore the dataset first. It helps you identify your starting point with the data and pre-empt additional steps that may be required for the solution. While exploring the dataset, you’ll get an idea of whether you have all the information you need or whether you need to make some transformations of the data or calculations. You’ll also notice whether there are duplicates, missing values, or edge cases that need to be dealt with to get a robust solution.
After familiarizing yourself with the data, make a high-level plan and break this down step-by-step, communicating this to the interviewer. When you determine the subtasks required, you’ll be in a better place to manage your time better, focus on the proper coding of each part while allowing the interview to follow along with your approach!
Solution to This Amazon Sql Interview Question
Let’s go through these steps individually.
1. Exploring the dataset
The caller_history is a list of calls showing the ID of the caller, who they called (recorded as the recipient_id), and the exact time and date of when the call was made.
caller_id | recipient_id | date_called |
---|---|---|
1 | 2 | 2022-01-01 09:00:00 |
1 | 3 | 2022-01-01 17:00:00 |
1 | 4 | 2022-01-01 23:00:00 |
2 | 5 | 2022-07-05 09:00:00 |
2 | 5 | 2022-07-05 17:00:00 |
The table doesn’t directly tell us who the first and last call recipients are, so determining this is a crucial part of your task. Also, notice that you’ve been provided a datetime object, but the time detail will not always be relevant for the subtasks, so you’ll have to extract the date from the date_called column later.
2. Writing out the code logic
By exploring the dataset, we already identified some preliminary steps. Now, it’s time to formulate an overall plan to tackle the question.
The solution can be broken down into a few main subtasks, which are:
- Determining who the first and last callers are for each person per day - using a window function
- Checking if those two call recipients are the same - using a filter
- For cases where the two call recipients are the same, returning the ID of the call receiver, the ID of the caller, and the date when the call happened - using a CTE or a subquery
3. Coding the solution
1. Determine first and last caller per caller per day
Looking back at the dataset, we have a full history of calls across all users and during the entire period; however, our analysis is required at a more granular level.
What we want to do is to group the data for each of these callers and identify for every day that they made a call, what is the ID of the call recipient for the first and last call of that day.
caller_id | recipient_id | date_called |
---|---|---|
1 | 2 | 2022-01-01 09:00:00 |
1 | 3 | 2022-01-01 17:00:00 |
1 | 4 | 2022-01-01 23:00:00 |
2 | 5 | 2022-07-05 09:00:00 |
2 | 5 | 2022-07-05 17:00:00 |
For example, Caller 1 made a few calls in January. The first one was to #2, and the last one was to #4. Meanwhile, Caller 2 made calls on various dates. On July 5, there were 3 calls - the first one to #5 and the last one to #3. On July 6, however, there was only one call made to #3, so #3 becomes the first and the last call recipient.
We will create two new columns to extract this information so that we can eventually compare them in the next step. In a complex analysis like this, window functions are powerful as they allow us to cut the dataset into different levels of granularity, perform some transformations within those sections, and finally, calculate or retrieve a value based exclusively on each of these subsections.
For this problem, we can use the FIRST_VALUE() function, which follows the syntax below:
FIRST_VALUE(col_name) OVER(PARTITION BY col_name ORDER BY col_name)
And this reveals what’s really happening in the background, which is to:
1. Create subsets of the data based on the PARTITION BY clause
- First, it will look at the PARTITION BY clause and create subsets of data based on this value. For our task, we want to group the dataset at two levels: caller_id and then date_called so that we have a ‘partition’ for every unique caller_ID and date_called combination.
2. Sort each of the subsets based on the ORDER BY clause
- After partitioning the dataset, you can reorder the rows within these partitions. To find the first (and last) call recipient, we need to arrange each partition from the earliest to the latest call time. (This is the only time we need the time information in the date_called column!)
3. Performs the function/main operation on the ordered partition
- And finally, we can call our function FIRST_VALUE() to return the first recipient ID of the ordered partitions.
Thus, our window function is:
FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE(date_called)ORDER BY date_called ASC) AS first_call
We can also identify the recipient of the last call using the same logic as above. However, we will do the sort in reverse so that the latest call becomes the first value in each partition.
SELECT *,
FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE(date_called)ORDER BY date_called ASC) AS first_call,
FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE(date_called)ORDER BY date_called DESC) AS last_call
FROM caller_history
caller_id | recipient_id | date_called | first_call | last_call |
---|---|---|---|---|
1 | 2 | 2022-01-01 09:00:00 | 2 | 4 |
1 | 3 | 2022-01-01 17:00:00 | 2 | 4 |
1 | 4 | 2022-01-01 23:00:00 | 2 | 4 |
2 | 5 | 2022-07-05 09:00:00 | 5 | 3 |
2 | 5 | 2022-07-05 17:00:00 | 5 | 3 |
Notice here that window functions will return a value for every row in the partition. Meanwhile, GROUP BY() aggregations will return a single row for each aggregation, and that makes it easier to understand what operations were done on the data. Without this in window functions, it is easier to make mistakes without even realizing it, so be careful!
FIRST_VALUE() is actually a special variation of the NTH_VALUE() function where N = 1. We could also use the LAST_VALUE() to find the recipient of the last call where we do not have to do a reverse sort.
Syntax:
NTH_VALUE(col_name, N) OVER(PARTITION BY col_name ORDER BY col_name)
To illustrate this, check out the output of the two columns below:
SELECT *,
FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE(date_called) ORDER BY date_called ASC) AS first_val,
NTH_VALUE(recipient_id, 1) OVER(PARTITION BY caller_id, DATE(date_called) ORDER BY date_called ASC) as nth_val
FROM caller_history
2. Compare the IDs of the first and last call recipients
We already identified the IDs for the first and last calls, but we only want results where our first_call column is equal to the last_call column. While it would be convenient for us to specify this in the WHERE clause, window functions are only calculated after the where clause.
Therefore, putting it in the WHERE clause will return an error because the first_call and last_call columns have not yet been calculated. These values are not available when SQL reads through the WHERE clause, as you will see in the error below.
SELECT *,
FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE(date_called)ORDER BY date_called ASC) AS first_call,
FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE(date_called)ORDER BY date_called DESC) AS last_call
FROM caller_history
WHERE first_call = last_call
To use the results of a window function as a filter, a subquery or a CTE is required. Let’s create a CTE called first_and_last and select the entries where the two columns are equal.
WITH first_and_last AS
(SELECT *,
FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE(date_called)ORDER BY date_called ASC) AS first_call,
FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE(date_called)ORDER BY date_called DESC) AS last_call
FROM caller_history)
SELECT *
FROM first_and_last
WHERE first_call = last_call
caller_id | recipient_id | date_called | first_call | last_call |
---|---|---|---|---|
2 | 5 | 2022-07-06 17:00:00 | 5 | 5 |
2 | 3 | 2022-08-01 09:00:00 | 3 | 3 |
2 | 3 | 2022-08-01 17:00:00 | 3 | 3 |
2 | 4 | 2022-08-02 09:00:00 | 4 | 4 |
2 | 5 | 2022-08-02 10:00:00 | 4 | 4 |
3. Retrieve the caller ID, recipient ID, and the date_called fields
As the final step, we only want to retrieve the caller_ID, the recipient_ID, and the date_called information from the previous results.
Unlike aggregate functions like SUM() or AVG(), which output one row for each aggregation, window functions display a result for each row in the table because it provides additional information and does not change the underlying dataset. Imagine that calculations are only being done in the background when you use window functions.
Thus, if you don’t use DISTINCT() in the select clause, you’ll end up with duplicate results like this:
WITH first_and_last AS
(SELECT *,
FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE(date_called)ORDER BY date_called ASC) AS first_call,
FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE(date_called)ORDER BY date_called DESC) AS last_call
FROM caller_history)
SELECT
caller_id,
first_call AS recipient,
DATE(date_called)
FROM first_and_last
WHERE first_call = last_call
caller_id | recipient | date |
---|---|---|
2 | 5 | 2022-07-06 |
2 | 3 | 2022-08-01 |
2 | 3 | 2022-08-01 |
2 | 4 | 2022-08-02 |
2 | 4 | 2022-08-02 |
Correcting this, our final query now shows a summary of the results:
WITH first_and_last AS
(SELECT *,
FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE(date_called)ORDER BY date_called ASC) AS first_call,
FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE(date_called)ORDER BY date_called DESC) AS last_call
FROM caller_history)
SELECT
DISTINCT caller_id,
first_call AS recipient,
DATE(date_called)
FROM first_and_last
WHERE first_call = last_call
Conclusion
Window functions are a favorite topic in interviews, so it pays off to know them in depth.
In solving this Amazon SQL interview question, we learned a few key takeaways which are:
1. How window functions operate
- A window function will create partitions of the dataset, rearrange the rows within each partition and retrieve a value based on these ordered partitions.
2. When window functions are calculated
- Window functions are calculated after the FROM, JOIN, WHERE, GROUP BY, and HAVING clauses (but before DISTINCT)
- Hence, filtering based on the results of a window function often requires using a subquery or a CTE.
3. How it differs from a GROUP BY() function
- Aggregate functions return a summary table, whereas window functions show the original dataset and add a summary column to it.
Now that you know more about window functions, why don’t you test these skills in the coding platform? Good luck with your next interview!