Amazon Data Engineer Interview Questions
Categories
Amazon Data Engineer interview questions that will specifically focus on the SQL coding skills relevant from a data engineering point of view.
Data Engineering is one of the most rapidly evolving and highly paid areas in Data Science. Data Engineers take the research and curiosity of a data scientist and meld it with the teamwork, discipline, and test-driven approach of traditional software engineering. Data Engineering brings all the cutting-edge models to life and provides the input data and features to monitor and evolve the existing ones. It is reported to be one of the highest-paid areas in Data Science, with a median pay of around USD 90,000 per year.
What does a Data Engineer do?
A data engineer needs to wear multiple hats. She is expected to be resourceful to be able to deal with multiple technologies and platforms as the needs of the business change. We have an exhaustive article on Data Engineer Interview Questions here. A high-level overview of the tasks that a Data Engineer might need to perform is -
- Acquire data from multiple sources to fulfill business requirements.
- Build ETL pipelines to transform the data into information that can be used by Analysts and Data Scientists.
- Test, monitor, and evolve these data pipelines as the business needs evolve.
- Validate data integrity and shortlist data analysis tools to suit business size.
- Be the gatekeeper of data and ensure compliance with data governance and security policies.
- Capacity, Technology, and Resource Planning for future business needs.
In terms of technologies required, most data engineers are required to be proficient in SQL, Python, Bash Scripting, ETL tools like Airflow and Kafka, Git, Docker, and associated containerization tools and one or more of the top cloud platforms - GCP, AWS, Azure. In this article we will look at problems that appeared in Amazon Data Engineer Interviews. We will specifically focus on the SQL coding skills that are relevant from a data engineering point of view. We also have an article on Amazon SQL Interview Questions covering multiple areas. The topics that we will cover in this article are
- EDA and Validation
- Aggregation and Metrics
- JOINS
- Text and Datetime Manipulation
We will finish the article off with a case study that applies one or more of the aforementioned skills. Let's begin.
Amazon Data Engineer Interview Questions
EDA and Validation
One of the key responsibilities of a Data Engineer is to monitor and validate data integrity. The Data Engineer is responsible for the data quality and hence has to ensure that the data conforms to the agreed standards. Let us take an example of this with a simple problem from a past Amazon Data Engineer interview.
Amazon Data Engineer Interview Question #1: Primary Key Violation
Given a dim_customer table with a key (cust_id) identifying each customer, return all the customer IDs (cust_id) that are violating the primary key constraints. Output all the customer IDs that are present more than once in the table. Output the IDs and the number of times the key is present.
Link to the question: https://platform.stratascratch.com/coding/2107-primary-key-violation
The problem uses the dim_customer table with the following fields.
cust_id: | varchar |
cust_name: | varchar |
cust_city: | varchar |
cust_dob: | datetime |
cust_pin_code: | int |
The data is presented thus.
cust_id | cust_name | cust_city | cust_dob | cust_pin_code |
---|---|---|---|---|
C273 | Stephen V. Cooke | New York | 1996-11-28 | 8235 |
C274 | Peter P. Mankin | Mount Upton | 1984-06-25 | 6050 |
C274 | Juan C. Parker | Mertzon | 1989-07-07 | 6867 |
C274 | Eve E. McClure | Southfield | 1995-05-18 | 7791 |
C275 | Charles J. Stevens | Oakland | 1975-12-02 | 5930 |
Solution
The solution is pretty straightforward. To find the number of occurrences, we start by counting the number of occurrences for each cust_id using the GROUP BY clause.
SELECT
cust_id
, count(*)
FROM dim_customer
GROUP BY 1
;
We keep only those cust_id that are present more than once. To do this, we use the HAVING clause, not the WHERE clause, since we need to work on the aggregated data, not the underlying table.
SELECT
cust_id
, count(*)
FROM dim_customer
GROUP BY 1
HAVING count(*) > 1
;
Let us take this problem a notch further. We will use the same table.
Amazon Data Engineer Interview Question #2: Find the five highest salaries
Link to the question: https://platform.stratascratch.com/coding/9868-find-the-five-highest-salaries
The problem uses the worker table with the following fields.
worker_id: | int |
first_name: | varchar |
last_name: | varchar |
salary: | int |
joining_date: | datetime |
department: | varchar |
The data is presented in the following manner -
worker_id | first_name | last_name | salary | joining_date | department |
---|---|---|---|---|---|
1 | Monika | Arora | 100000 | 2014-02-20 | HR |
2 | Niharika | Verma | 80000 | 2014-06-11 | Admin |
3 | Vishal | Singhal | 300000 | 2014-02-20 | HR |
4 | Amitah | Singh | 500000 | 2014-02-20 | Admin |
5 | Vivek | Bhati | 500000 | 2014-06-11 | Admin |
Solution
This problem can be solved in multiple ways. To get an idea of how to solve this, let us explore the salary field a bit more.
SELECT
salary
FROM worker
ORDER BY salary DESC
;
This gives us the following output.
If we use the LIMIT clause, we will get the first five rows from the output
SELECT
salary
FROM worker
ORDER BY salary DESC
LIMIT 5
;
However, this would be incorrect as the salary values are repeated.
To get unique salary values, we use the DISTINCT clause.
SELECT
DISTINCT salary
FROM worker
ORDER BY salary DESC
LIMIT 5
;
We will also get the same output if we use the GROUP BY and LIMIT clauses together. It will simply group the identical salaries together and limit the top five values only.
SELECT
salary
FROM worker
GROUP BY salary
ORDER BY salary DESC
LIMIT 5
;
Using GROUP BY in this case results in an identical solution with almost identical performance times as discussed here.
Given the exploratory nature of work, sometimes Data Engineers might not have access to all the libraries and functionalities in the test environment. This is specifically the case while working with scripting languages like Python. The skill of the Data Engineer here is to use the barebones standard libraries and develop workarounds to help solve the problem. Let us try that in our next problem.
Amazon Data Engineer Interview Question #3: Find the second highest salary without using ORDER BY
Link to the question: https://platform.stratascratch.com/coding/9857-find-the-second-highest-salary-without-using-order-by
The problem uses the same worker table that we used in the previous problem.
Solution
The problem is relatively straightforward. However, we are prohibited from using the ORDER BY clause. So let us break this problem down into two parts.
- We start off by taking out the maximum salary
- Then, query the table again and take the maximum salary after excluding the maximum salary.
SELECT
MAX(salary)
FROM worker
;
We get the following output.
Now we get the maximum salary after excluding this salary.
SELECT
MAX(salary)
FROM worker
WHERE salary NOT IN
(
SELECT
MAX(salary)
FROM worker
)
;
Aggregations
Data Engineers are also required to constantly monitor and report metrics. They need to be on top of changing values, drift in the input, etc. To accomplish this, they need to be able to work with aggregate functions. Let us look at one such example.
Amazon Data Engineer Interview Question #4: Manager of the largest department
Find the name of the manager from the largest department. Output their first and last names.
Link to the question: https://platform.stratascratch.com/coding/2060-manager-of-the-largest-department
The problem uses the az_employees table with the following fields.
id: | int |
first_name: | varchar |
last_name: | varchar |
department_id: | int |
department_name: | varchar |
position: | varchar |
id | first_name | last_name | department_id | department_name | position |
---|---|---|---|---|---|
9 | Christy | Mitchell | 1001 | Marketing | Senior specialist |
13 | Julie | Sanchez | 1001 | Marketing | Intern |
14 | John | Coleman | 1001 | Marketing | Senior specialist |
15 | Anthony | Valdez | 1001 | Marketing | Junior specialist |
26 | Allison | Johnson | 1001 | Marketing | Senior specialist |
Solution
To solve this problem, we need to do the following -
- Find the number of employees in each department
- Output the manager from the department(s) with the maximum number of employees
We start by adding a column with the number of employees in that department. We use a window function over a GROUP BY aggregation as it saves us the aggregation-merge trip.
SELECT
*
, COUNT(*) OVER (PARTITION BY department_id) as num_employees
FROM az_employees
;
We get the following output.
Now we can easily subset the department based on the maximum number of employees. To do this, we use the ILIKE function to identify the managers.
WITH dept_emp AS (
SELECT
*
, COUNT(*) OVER (PARTITION BY department_id) as num_employees
FROM az_employees
)
SELECT
first_name
, last_name
FROM dept_emp
WHERE num_employees = (SELECT MAX(num_employees) FROM dept_emp)
AND position ILIKE '%manager%'
;
If the ILIKE function is not available like in MySQL, we can convert the ‘position’ column to lower or upper case and use the LIKE operator.
WITH dept_emp AS (
SELECT
*
, COUNT(*) OVER (PARTITION BY department_id) as num_employees
FROM az_employees
)
SELECT
first_name
, last_name
FROM dept_emp
WHERE num_employees = (SELECT MAX(num_employees) FROM dept_emp)
AND UPPER(position) LIKE '%MANAGER%'
;
Both of them will give us the same result. You can know more about aggregation functions here. And about window functions in SQL here.
Joining Tables
Till now we have worked with only a single table. In real life however, databases are highly normalized and require merging multiple tables to get the required output. Let us look at a few problems. We start with a relatively easy one.
Amazon Data Engineer Interview Question #5: Customer Order and Details
List the number of orders, customers, and the total order cost for each city. Include only those cities where at least five orders have been placed but count all the customers in these cities even if they did not place any orders.
Link to the question: https://platform.stratascratch.com/coding/9908-customer-orders-and-details
This problem uses the customers and orders tables. The fields in the customers table are
id: | int |
first_name: | varchar |
last_name: | varchar |
city: | varchar |
address: | varchar |
phone_number: | varchar |
The data therein looks like this.
id | first_name | last_name | city | address | phone_number |
---|---|---|---|---|---|
8 | John | Joseph | San Francisco | 928-386-8164 | |
7 | Jill | Michael | Austin | 813-297-0692 | |
4 | William | Daniel | Denver | 813-368-1200 | |
5 | Henry | Jackson | Miami | 808-601-7513 | |
13 | Emma | Isaac | Miami | 808-690-5201 |
The following are the fields in the orders table.
id: | int |
cust_id: | int |
order_date: | datetime |
order_details: | varchar |
total_order_cost: | int |
The data in the orders table is presented thus.
id | cust_id | order_date | order_details | total_order_cost |
---|---|---|---|---|
1 | 3 | 2019-03-04 | Coat | 100 |
2 | 3 | 2019-03-01 | Shoes | 80 |
3 | 3 | 2019-03-07 | Skirt | 30 |
4 | 7 | 2019-02-01 | Coat | 25 |
5 | 7 | 2019-03-10 | Shoes | 80 |
Solution
The problem is not too difficult. However, one needs to be careful about following the instructions provided in the question properly. Here is the outline to solve this problem
- Merge the two tables
- Aggregate the number of orders, total order value, and the number of customers for each city.
- Subset only those cities that have five or more orders.
We start off by merging the two tables. We need a LEFT JOIN because we need to count even those customers who have not placed any orders.
SELECT
customers.city
, customers.id
, orders.cust_id
, orders.id
, orders.total_order_cost
FROM
customers
LEFT JOIN orders
ON customers.id = orders.cust_id
ORDER BY 1
;
We get the following output.
Note: We do not need the cust_id field from the orders table. We used that to illustrate how the merge results will look.
Now that we have all the fields, we can aggregate our metrics.
- To find the number of orders and the value of the orders, we use the fields from the orders table.
- To find the overall number of customers in a city, we use the cust_id field from the customers table.
Further, we need to be careful not to count duplicates while counting the number of customers. The actual query is not too complicated.
SELECT
customers.city
, COUNT(orders.id) AS orders_per_city
, COUNT(DISTINCT customers.id) AS customers_per_city
, SUM(orders.total_order_cost) AS orders_cost_per_city
FROM
customers
LEFT JOIN orders
ON customers.id = orders.cust_id
GROUP BY 1
;
We get the following output.
We can now subset the relevant cities by using the HAVING clause.
SELECT
customers.city
, COUNT(orders.id) AS orders_per_city
, COUNT(DISTINCT customers.id) AS customers_per_city
, SUM(orders.total_order_cost) AS orders_cost_per_city
FROM
customers
LEFT JOIN orders
ON customers.id = orders.cust_id
GROUP BY 1
HAVING COUNT(orders.id) >=5
;
Let us dial up the difficulty a bit with the next one. This uses information from three tables.
Amazon Data Engineer Interview Question #6: Exclusive Amazon Products
Find the products that are exclusive only to Amazon and not available at TopShop or Macy’s. A product is exclusive if the combination of product name and maximum retail price is not available in the other stores. Output the product name, brand name, price, and rating for all exclusive products.
Link to the question: https://platform.stratascratch.com/coding/9608-exclusive-amazon-products
The problem uses three tables: innerwear_macys_com, innerwear_topshop_com, innerwear_amazon_com. Each of the three tables has the same fields.
product_name: | varchar |
mrp: | varchar |
price: | varchar |
pdp_url: | varchar |
brand_name: | varchar |
product_category: | varchar |
retailer: | varchar |
description: | varchar |
rating: | float |
review_count: | float |
style_attributes: | varchar |
total_sizes: | varchar |
available_size: | varchar |
color: | varchar |
The fields of interest are product_name, mrp, brand_name, price, and rating. The retailer field is constant across the table. Therefore, the innerwear_macys_com table will have all retailer column values as Macys US, the innerwear_topshop_com will have the field as US Topshop and so on. Here is an outline of the relevant fields available in the innerwear_topshop_com table.
product_name | mrp | price | pdp_url | brand_name | product_category | retailer | description | rating | review_count | style_attributes | total_sizes | available_size | color |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sheer Enough Sheer Lace Hipster 845253 | $48.00 | $33.60 | http://www1.macys.com/shop/product/wacoal-sheer-enough-sheer-lace-hipster-845253?ID=2502262&CategoryID=65739 | Wacoal | Women - Lingerie & Shapewear - Designer Lingerie | Macys US | Lovely, light and perfectly unique. This sheer lace and mesh hipster from Wacoal is whisper-soft beneath your favorite outfits. | ["Mesh and lace front and back panels", "Lined at gusset", "Nylon/spandex; gusset: cotton", "Hand wash", "Imported", "Web ID: 2502262"] | ["S", "M", "XL"] | ["S", "M", "XL"] | High Rise | ||
Retro Chic Contour Bra 853186 | $65.00 | $45.50 | http://www1.macys.com/shop/product/wacoal-retro-chic-contour-bra-853186?ID=4355583&CategoryID=65739 | Wacoal | Women - Lingerie & Shapewear - Designer Lingerie | Macys US | Superb support and lovely, feminine details come together in this seamless, full coverage, molded bra from Wacoal. | 5 | 3 | ["Adjustable, restricted, stretch straps with back release", "Kicked-in back straps", "Seamless, underwire, full-coverage cups with supportive spacers", "Two needle wire cover", "Triple back hook-and-eye closure", "Polyester/spandex; lace: nylon", "Hand wash", "Imported", "Web ID: 4355583"] | ["32D", "42DD", "44D", "44DD", "44DDD"] | ["32D", "42DD", "44D", "44DD", "44DDD"] | White |
b.splendid Seamless Hipster 978255 | $13.00 | $13.00 | http://www1.macys.com/shop/product/b.temptd-by-wacoal-b.splendid-seamless-hipster-978255?ID=4427348&CategoryID=55805 | b.tempt'd by Wacoal | Women - Lingerie & Shapewear - Panties | Macys US | Soft, sleek heather fabric is seamlessly designed for the smoothest look yet in the b.splendid hipster from b.tempt'd by Wacoal. | ["Elastic waistband", "Smooth stitching at the leg openings for no visible panty lines", "Lined at gusset", "Nylon/polyester/spandex; gusset: cotton/spandex", "Hand wash", "Imported", "Web ID: 4427348"] | ["S", "M", "L", "XL"] | ["S", "M", "L", "XL"] | Limoges/Heather | ||
Icon Modern T-Shirt Bra F3646 | $42.00 | $42.00 | http://www1.macys.com/shop/product/calvin-klein-icon-modern-t-shirt-bra-f3646?ID=926021&CategoryID=65739 | Calvin Klein | Women - Lingerie & Shapewear - Designer Lingerie | Macys US | Make sure you add this style to your collection. Calvin Klein's Icon Modern T-Shirt Bra features seamless contour demi cups with encased wire for extra support. Style #F3646 | 4.6 | 17 | ["Fully-adjustable straps", "Seamless, contour demi cups with encased wire for support", "Mesh wings with picot trim and mesh at center gore", "Padded closure for comfort", "Nylon/elastane", "Hand wash", "Imported", "Web ID: 926021", "Halter strap at back neck", "Unlined, wireless cups", "Wide lace band under bust", "Pull-on style", "Cotton/elastane", "Hand wash", "Imported"] | ["32A", "32B", "32C", "32D", "32DD", "34A", "34B", "34C", "34D", "34DD", "36A", "36B", "36C", "36D", "36DD", "38B", "38C", "38D", "38DD"] | ["32A", "32B", "32C", "32D", "34A", "34B", "34C", "34D"] | Smoke |
Jogger Pants QS5716 | $58.00 | $58.00 | http://www1.macys.com/shop/product/calvin-klein-modern-cotton-jogger-pants-qs5716?ID=3676035&CategoryID=65739 | Calvin Klein Modern Cotton | Women - Lingerie & Shapewear - Designer Lingerie | Macys US | A super soft cotton-blend makes these jogger pants from Calvin Klein the perfect choice when you're looking for total comfort. | ["Wide elastic waistband features logo", "Off-seam pockets", "Full-length with ribbed cuffs", "Cotton/polyester", "Machine washable", "Imported", "Web ID: 3676035", "Halter strap at back neck", "Unlined, wireless cups", "Wide lace band under bust", "Pull-on style", "Cotton/elastane", "Hand wash", "Imported"] | ["XS", "S", "M", "L", "XL"] | ["XS", "S", "M", "L", "XL"] | Black |
Solution
There are multiple ways of solving this problem. One way is to JOIN the tables on product_name and mrp fields and find only those products which are present exclusively on Amazon. Let’s see how this works by joining the Amazon and TopShop tables.
SELECT
a.product_name
, t.product_name AS t_prod
, m.product_name AS m_prod
, a.mrp
, t.mrp AS t_mrp
, m.mrp AS m_mrp
, a.brand_name
, a.price
, a.rating
FROM innerwear_amazon_com AS a
LEFT JOIN innerwear_topshop_com AS t
ON a.product_name = t.product_name
AND a.mrp = t.mrp
;
We get the following output.
Where the TopShop fields are NULL, then that product is not available on TopShop. We repeat the same process with the Macys table.
SELECT
a.product_name
, t.product_name AS t_prod
, m.product_name AS m_prod
, a.mrp
, t.mrp AS t_mrp
, m.mrp AS m_mrp
, a.brand_name
, a.price
, a.rating
FROM innerwear_amazon_com AS a
LEFT JOIN innerwear_topshop_com AS t
ON a.product_name = t.product_name
AND a.mrp = t.mrp
LEFT JOIN innerwear_macys_com AS m
ON a.product_name = m.product_name
AND a.mrp = m.mrp
;
We get the following output.
Now we can simply subset the products that are available exclusively on Amazon and keep only the relevant fields.
SELECT
a.product_name
, a.brand_name
, a.price
, a.rating
FROM innerwear_amazon_com AS a
LEFT JOIN innerwear_topshop_com AS t
ON a.product_name = t.product_name
AND a.mrp = t.mrp
LEFT JOIN innerwear_macys_com AS m
ON a.product_name = m.product_name
AND a.mrp = m.mrp
WHERE
t.mrp IS NULL
AND t.product_name IS NULL
AND m.mrp IS NULL
AND m.product_name IS NULL
;
Alternatively, we can solve this using UNION. UNION will provide us with non-repeating values of the combination of product_name and price.
SELECT DISTINCT product_name, mrp FROM innerwear_macys_com
UNION
SELECT DISTINCT product_name, mrp FROM innerwear_topshop_com
;
This gives the following output.
We can now remove this combination from the Amazon listings by using a simple subquery.
SELECT
product_name
, brand_name
, price
, rating
FROM innerwear_amazon_com AS a
WHERE (product_name, mrp) NOT IN (
SELECT product_name, mrp FROM
(
SELECT DISTINCT product_name, mrp FROM innerwear_macys_com
UNION
SELECT DISTINCT product_name, mrp FROM innerwear_topshop_com
) q
)
;
You can check the combination of two or more fields as a tuple against a subquery just as you would do with a single column. This can help reduce multiple joins and gives a simple and neat-looking result. You can know more about different types of SQL JOINS and UNION here.
Text and Datetime Manipulation
Text and Datetime fields are ubiquitous. We have detailed articles on these two concepts - “String and Array Functions in SQL” for Text functions and “SQL Scenario Based Interview Questions” for Datetime functions. Let us start with text manipulation.
Amazon Data Engineer Interview Question #7: Most expensive product in each category
Find the most expensive product in each category. Report the category, product name, and the price (as a number).
Link to the question: https://platform.stratascratch.com/coding/9607-the-most-expensive-products-per-category
The problem uses the innerwear_amazon_com dataset that we saw in the previous problem with the following fields.
product_name: | varchar |
mrp: | varchar |
price: | varchar |
pdp_url: | varchar |
brand_name: | varchar |
product_category: | varchar |
retailer: | varchar |
description: | varchar |
rating: | float |
review_count: | int |
style_attributes: | varchar |
total_sizes: | varchar |
available_size: | varchar |
color: | varchar |
If you observe carefully, the price field is represented as a text field. Let us see how the relevant fields look.
product_name | mrp | price | pdp_url | brand_name | product_category | retailer | description | rating | review_count | style_attributes | total_sizes | available_size | color |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Wacoal Women's Full Figure Basic Beauty Underwire Bra | $50.00 | $50.00 | https://www.amazon.com/-/dp/B005FR9XVK?th=1&psc=1 | Wacoal | Bras | Amazon US | Seamless molded two-ply cups with inner sling for smooth support | 4.2 | 960 | [ 85% Nylon/15% Spandex , Imported , Hook and Eye closure , Hand Wash , Full-coverage bra with built-in camisole strap with stretch back release , Cups with hidden inner sling for shape and support , Band and sides smooth and minimize bulge , Hook-and-eye closure ] | 32D , 32DD , 32DDD , 32G , 34C , 34D , 34DD , 34DDD , 34G , 34H , 36C , 36D , 36DD , 36DDD , 36G , 36H , 38C , 38D , 38DD , 38DDD , 38G , 38H , 40C , 40D , 40DD , 40DDD , 40G , 40H , 42C , 42D , 42DD , 42DDD , 42G , 42H , 44C , 44D , 44DD , 44DDD , 44G , 44H | 32D , 32DD , 32DDD , 32G , 34C , 34D , 34DD , 34DDD , 34G , 34H , 36C , 36D , 36DD , 36DDD , 36G , 36H , 38C , 38D , 38DD , 38DDD , 38G , 38H , 40C , 40D , 40DD , 40DDD , 40G , 40H , 42C , 42D , 42DD , 42DDD , 42G , 44C , 44D , 44DD , 44DDD , 44G , 44H | Naturally Nude |
Calvin Klein Women's Bottoms Up Hipster Panty | $12.00 | $11.00 | https://www.amazon.com/-/dp/B007F8RVDO?th=1&psc=1 | Calvin-Klein | Panties | Amazon US | The bottoms up hipster features color prints, a refined lace trim, and a thin elasticized waistband for comfort and shape retention. | 4.5 | 99 | [ 82%-84% Nylon 16%-18% Elastane , Imported , Machine Wash , Super soft microfiber , Lace trim ] | Small , Medium , Large | Small , Medium | Buff |
Wacoal Women's Retro Chic Underwire Bra | $60.00 | $60.00 | https://www.amazon.com/-/dp/B007JTYQQY?th=1&psc=1 | Wacoal | Bras | Amazon US | Beautiful low plunge chantilly lace bra with superior support. |Beautiful low plunge chantilly lace bra | 4.4 | 753 | [ 82% Nylon/ 18% Spandex/Elastane , Hand Wash , Full-coverage bra featuring lace cups with mesh yoke , Band and sides smooth and minimize bulge , Seamed cups for superior lift, shape, and support , Leotard back ] | 30B , 30D , 32B , 32C , 32D , 32DD , 32DDD , 32G , 34C , 34D , 34DD , 34DDD , 34G , 34H , 34I , 36C , 36D , 36DD , 36DDD , 36G , 36H , 36I , 38C , 38D , 38DD , 38DDD , 38G , 38H , 38I , 40C , 40D , 40DD , 40DDD , 40G , 40H , 40I , 42D , 42DD , 42DDD , 42G , 42H , 44D , 44DD , 44DDD , 44G , 44H , 46D , 46DD , 46DDD , 46G , 46H , 48H | 32D , 32DD , 32DDD , 34C , 34D , 34DD , 34DDD , 34G , 36C , 36D , 36DD , 36DDD , 36G , 36H , 38C , 38DD , 38DDD , 38G , 38H , 40C , 40D , 40DD , 40DDD , 40G , 40H , 42D , 42DD , 42DDD | Ivory |
Calvin Klein Women's Carousel 3 Pack Thong | $33.00 | $19.99 | https://www.amazon.com/-/dp/B01MZ8D589?th=1&psc=1 | Calvin-Klein | Panties | Amazon US | This carousel thong 3-pack features classic cotton blend fabrication and an iconic Calvin Klein repeating logo waistband. | 4 | 2 | [ Cotton , Imported , Contrasting elasticized Calvin Klein logo waistband , Cotton gusset , Three low-rise thong panties each featuring logoed waistband and cotton gusset ] | Women's Large / 12-14 , Small , Medium , Large | Medium , Large | Salvia/Grey Heather/Sultry |
b.tempt'd by Wacoal Women's Lace Kiss Bralette | $18.00 | $11.65 | https://www.amazon.com/-/dp/B00SHYSSGE?th=1&psc=1 | b-temptd | Bras | Amazon US | Lace kiss bralette has soft allover lace that make a beautiful underpinning | 4 | 512 | [ 100% Nylon , Imported , Hand Wash , Lace bralette featuring semi-sheer cups, scalloped trim, and adjustable straps ] | Small , Medium , Large , X-Large | Medium | Night/Animal Accent |
Solution
The biggest challenge in this problem is to convert the price to a number. We cannot typecast the field directly to a number since the field has non-numeric characters. Therefore, we proceed in the following manner.
- Extract the numeric value from the price field.
- Find the highest-priced product from each category and output the relevant fields.
Let us try cleaning up the price field. There are numerous ways of accomplishing this. Let us look at a few.
One way is to use the REPLACE() function to replace the $ sign and then typecase the output to a number.
SELECT
product_name
, product_category
, REPLACE(price, '$', '')::NUMERIC AS cleaned_price
FROM innerwear_amazon_com
;
This gives us the following output.
We can also try a more complex regular expression match using REGEX_REPLACE(). We can do this by trying to find all instances of characters that are not digits or the decimal symbol (.)
SELECT
product_name
, product_category
, REGEXP_REPLACE(price, '[^0-9.]+', '', 'g' ) AS cleaned_price
FROM innerwear_amazon_com
;
This gives the same result as earlier. Now we need to find the costliest item in each category. To do this we use the window function RANK(), partitioning over the product category.
SELECT
product_name
, product_category
, REGEXP_REPLACE(price, '[^0-9.]+', '', 'g' )::NUMERIC AS price
, RANK() OVER (PARTITION BY product_category
ORDER BY REGEXP_REPLACE(price, '[^0-9.]+', '', 'g' )::NUMERIC DESC) AS rnk
FROM innerwear_amazon_com
;
Resulting in the following output.
We now take the first-ranked products along with their prices in each category.
WITH ranked_products AS (
SELECT
product_name
, product_category
, REGEXP_REPLACE(price, '[^0-9.]+', '', 'g' )::NUMERIC AS price
, RANK() OVER (PARTITION BY product_category
ORDER BY REGEXP_REPLACE(price, '[^0-9.]+', '', 'g' )::NUMERIC DESC) AS rnk
FROM innerwear_amazon_com
)
SELECT
product_category
, product_name
, price
FROM ranked_products
WHERE rnk = 1
;
Let us solve a problem involving Datetime manipulation.
Amazon Data Engineer Interview Question #8: First Day Retention
A first day retention happens when a player logs in on the next day of his or her first ever log-in. Calculate the first day retention rate for the given set of video-game players and output the proportion of players who meet this definition as a ratio of the total number of players.
Link to the question: https://platform.stratascratch.com/coding/2090-first-day-retention-rate
The problem uses the players_logins table with the following fields.
player_id: | int |
login_date: | datetime |
The data is presented in the following manner
player_id | login_date |
---|---|
101 | 2021-12-14 |
101 | 2021-12-18 |
101 | 2021-12-15 |
101 | 2021-12-19 |
102 | 2021-12-31 |
Solution
There are numerous ways of solving this Amazon data engineer interview question. Let us look at one way of solving this.
- Find the earliest login date for each player
- Check if the player logged in one day after the earliest login date.
- Calculate the proportion of players who satisfy the first-day retention criteria as a fraction of all the players
We start by adding a column with the earliest login date by applying a window function.
SELECT
*
, MIN(login_date) OVER (PARTITION BY player_id) AS min
FROM players_logins
;
We get the following output.
We used the window function instead of a GROUP BY and JOIN to avoid multiple passes. Now we can tag the logins that happened the very next day of the first login. To do this, we simply add one day to the earliest login date and tag the logins.
SELECT
*
, MIN(login_date) OVER (PARTITION BY player_id) as min
, CASE WHEN DATE(login_date) = MIN(login_date) OVER (PARTITION BY player_id) + INTERVAL '1 DAY' THEN 1 END AS retained
FROM players_logins
;
This gives us the following output.
We can now finish the problem by aggregating those players who were retained and the overall number of players.
WITH ret_tags AS (
SELECT
*
, MIN(login_date) OVER (PARTITION BY player_id) as min
, CASE WHEN DATE(login_date) = MIN(login_date) OVER (PARTITION BY player_id) + INTERVAL '1 DAY' THEN 1 END AS retained
FROM players_logins
)
SELECT
count(DISTINCT CASE WHEN retained = 1 THEN player_id END) * 1.0
/ count(DISTINCT player_id)
FROM ret_tags
;
Case Study
Let us put all these concepts together in the form of a case study. This is rated as a hard question by the users.
Amazon Data Engineer Interview Question #9: Product Market Share
The market share of a product is defined as the number of products sold as a percentage of the total number of products sold in the market. List the Market Share of each product brand for each territory. Output the Territory ID, name of the Product Brand, and the corresponding Market Share in percentages.
Link to the question: https://platform.stratascratch.com/coding/2112-product-market-share
The product uses three tables
fct_customer_sales that has the following fields.
cust_id: | varchar |
prod_sku_id: | varchar |
order_date: | datetime |
order_value: | int |
order_id: | varchar |
The data in fct_customer_sales is presented thus.
cust_id | prod_sku_id | order_date | order_value | order_id |
---|---|---|---|---|
C274 | P474 | 2021-06-28 | 1500 | O110 |
C285 | P472 | 2021-06-28 | 899 | O118 |
C282 | P487 | 2021-06-30 | 500 | O125 |
C282 | P476 | 2021-07-02 | 999 | O146 |
C284 | P487 | 2021-07-07 | 500 | O149 |
The map_customer_territory table contains the following fields
cust_id: | varchar |
territory_id: | varchar |
The data therein looks thus
cust_id | territory_id |
---|---|
C273 | T3 |
C274 | T3 |
C275 | T1 |
C276 | T1 |
C277 | T1 |
And the dim_product table with the following fields.
prod_sku_id: | varchar |
prod_sku_name: | varchar |
prod_brand: | varchar |
market_name: | varchar |
The data in the dim_product table is organized in the following manner.
prod_sku_id | prod_sku_name | prod_brand | market_name |
---|---|---|---|
P472 | iphone-13 | Apple | Apple IPhone 13 |
P473 | iphone-13-promax | Apple | Apply IPhone 13 Pro Max |
P474 | macbook-pro-13 | Apple | Apple Macbook Pro 13'' |
P475 | macbook-air-13 | Apple | Apple Makbook Air 13'' |
P476 | ipad | Apple | Apple IPad |
Solution
This Amazon data engineer interview question uses most of the concepts that we have covered in this article. Let us plot our path to the solution.
- We get the relevant fields from the three datasets by merging them.
- Then, we aggregate the sales for each brand in each territory for the time period.
- We finally find the market share of each brand in each territory.
Let us start off by selecting the relevant fields from each of the tables. From the fct_customer_sales table, we need the order_date field. Onto this, we overlay the territory_id from the map_customer_territory table. And from the dim_product table, we add the prod_brand fields. This can be done using a single query.
SELECT
f.order_date
, mc.territory_id
, dc.prod_brand
FROM fct_customer_sales AS f
LEFT JOIN map_customer_territory AS mc
ON f.cust_id = mc.cust_id
LEFT JOIN dim_product AS dc
ON f.prod_sku_id = dc.prod_sku_id
;
We get the following output.
Now we subset the relevant dates. Since we need only the transactions from the fourth quarter of 2021, we use the EXTRACT function to get the relevant year and quarter.
SELECT
f.order_date
, mc.territory_id
, dc.prod_brand
FROM fct_customer_sales AS f
LEFT JOIN map_customer_territory AS mc
ON f.cust_id = mc.cust_id
LEFT JOIN dim_product AS dc
ON f.prod_sku_id = dc.prod_sku_id
WHERE EXTRACT(QUARTER FROM f.order_date) = 4
AND EXTRACT(YEAR FROM f.order_date) = 2021
;
Alternatively, we can use the BETWEEN operator to find subset of the relevant dates like this.
SELECT
f.order_date
, mc.territory_id
, dc.prod_brand
FROM fct_customer_sales AS f
LEFT JOIN map_customer_territory AS mc
ON f.cust_id = mc.cust_id
LEFT JOIN dim_product AS dc
ON f.prod_sku_id = dc.prod_sku_id
WHERE f.order_date::DATE BETWEEN '2021-10-01' AND '2021-12-31'
;
In either case, we get the following output.
Now we aggregate the orders by territory. We can drop the date field since we do not need it anymore.
SELECT
mc.territory_id
, dc.prod_brand
, COUNT(*)
FROM fct_customer_sales AS f
LEFT JOIN map_customer_territory AS mc
ON f.cust_id = mc.cust_id
LEFT JOIN dim_product AS dc
ON f.prod_sku_id = dc.prod_sku_id
WHERE f.order_date::DATE BETWEEN '2021-10-01' AND '2021-12-31'
GROUP BY 1,2
;
We get the following output.
To calculate the market share, we use a window function to aggregate the orders by territory.
SELECT
mc.territory_id
, dc.prod_brand
, COUNT(*)
, SUM(COUNT(*)) OVER (PARTITION BY mc.territory_id)
FROM fct_customer_sales AS f
LEFT JOIN map_customer_territory AS mc
ON f.cust_id = mc.cust_id
LEFT JOIN dim_product AS dc
ON f.prod_sku_id = dc.prod_sku_id
WHERE f.order_date::DATE BETWEEN '2021-10-01' AND '2021-12-31'
GROUP BY 1,2
;
We get the following output.
We finish off the problem by calculating the orders of the brand as a percentage of all the orders in the territory. We multiply by a decimal to ensure that it does not result in a truncated division.
SELECT
mc.territory_id
, dc.prod_brand
, COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY mc.territory_id) * 100 AS market_share
FROM fct_customer_sales AS f
LEFT JOIN map_customer_territory AS mc
ON f.cust_id = mc.cust_id
LEFT JOIN dim_product AS dc
ON f.prod_sku_id = dc.prod_sku_id
WHERE f.order_date::DATE BETWEEN '2021-10-01' AND '2021-12-31'
GROUP BY 1,2
;
Conclusion
In this article, we looked at the various roles that a data engineer is required to perform. We looked at specific SQL problems from the past Amazon interviews that tested key Data Engineering skills like validation, aggregation, data transformation, text and datetime manipulation and joining tables. We finally brought it all together in a case study that required us to apply these skills. While some of these problems might appear hard at first, always remember that the first step is always the hardest. As with every other skill in life, all it takes is practice, patience, and persistence to master. Sign up today for a free account on StrataScratch and join over 20,000 other like-minded professionals aspiring to crack their next Data Engineer Interview at top companies like Amazon, Google, Apple, Meta, et al.