Data Engineer SQL Interview Questions From Top Employers
Categories
In this article, we will list important SQL concepts and walk you through 10 data engineer SQL interview questions asked by employers like Google and Amazon.
Companies collect and maintain large volumes of data to better understand their customers and improve various facets of their business. However, raw data can be a bit messy and disorganized. Companies need data engineers to refine initially collected data so that it’s easier to work with.
Data can be used for various tasks - analysis, visualization, or fed into machine learning models, just to name a few. Data engineers prepare data according to the specifics of the task so that data scientists can utilize the data instead of fixing its inconsistencies.
To do this, data engineers need to have a specific set of skills. In this article, we will discuss SQL - the main programming language for working with relational databases. We’ll go over the most important SQL concepts tested in SQL interview questions and walk you through data engineer SQL interview questions asked at top organizations today.
Concepts tested in data engineer SQL interview questions
Data engineers perform a wide variety of tasks. Few examples: validating and ordering data, handling edge cases, implementing consistent formatting, fixing bad data, and changing data types.
Data engineers will most likely use a combination of programming languages and tools. SQL and Python are two of the most important skills. StrataScratch blog has published an article about Python Data Engineer Interview Questions as well.
We compiled a list of SQL concepts most often used in data engineers’ day-to-day jobs. If you’re preparing for an interview, make sure to master these concepts. Data engineer SQL Interview questions often require a strong knowledge of these features.
We will start with easier concepts and cover difficult ones as well.
Filtering
As a data engineer, you must know how to use the WHERE clause to filter tables. It is usually paired with a condition and returns only rows that satisfy the condition.
It’s important to know how to set up a condition for the WHERE clause and how to couple it with SELECT, UPDATE and DELETE statements.
HAVING is another important filtering feature of SQL. Unlike WHERE, it is used to filter aggregated group data. It’s important to know the differences between WHERE and HAVING clauses and when to use them.
DDL and DML
Data Definition Language (DDL) and Data Manipulation Language (DML) allow data engineers to define and modify database structure - create new tables and objects, specify their name, define columns and specify types of values in them. This is a very short list of what data engineers can do using DDL and DML.
Data Type Conversion
Sometimes, data engineers have to convert values to another type. This may be necessary to perform certain SQL operations like comparison.
To be an effective data engineer, you need to know the difference between implicit and explicit type conversions. It’s important to know how to use CAST(), and CONVERT() functions for explicit conversion, and even better if you can explain the differences between these two functions. This will set you apart from other candidates interviewing for the same job.
You should also be prepared to explain implicit conversions and when they occur. This is important because one unexpected data conversion could lead to errors that are hard to identify and resolve.
ORDER BY
Arranging data using the ORDER BY statement is a basic but important SQL skill. Sometimes complex data engineering tasks can be solved using a simple ORDER BY statement.
The basic syntax of the ORDER BY statement is simple - you need to specify column values and the order (ascending or descending) to arrange rows. Advanced data engineer SQL interview questions might ask you to order rows based on values in multiple columns.
Subqueries
Any candidate interviewing for a data engineer job should know what subqueries are and how to write them.
Sometimes subqueries are described as a query in another query. They allow us to perform complex SQL operations.
Data engineers should also know how to use subqueries with various SQL features: WHERE, HAVING, FROM, SELECT.
Aggregate functions
An essential SQL feature to summarize large volumes of data. Aggregate functions are often used with GROUP BY statements to separate rows into groups and aggregate values for each group. Other times, you’ll need to aggregate data for the entire table.
There are five main aggregate functions in SQL:
- SUM() finds the total sum of all numeric values in a column.
- COUNT() finds the number of rows in the entire table or a group
- MIN and MAX() are used to find the highest and lowest values in a table or a group.
- AVG() calculates the average of all values in the column or a group
You can use these aggregate functions within a subquery to set conditions for WHERE and HAVING clauses.
You can find many examples of SQL Interview Questions involving aggregate functions on our blog.
In this article, we’ll solve interview questions in PostgreSQL, which has a wide variety of aggregate functions. Go to PostgreSQL documentation for a full list.
NULL values
Data engineers sometimes have to do what other data scientists don’t want to do - deal with NULL values.
Removing or replacing NULL values can improve the quality of data, which can make data more accessible and easy to work with.
Before dealing with them, you must understand what a NULL value is. It is not 0 or an empty text (‘ ‘) but an absence of any value. The absence of a value is typically a challenge to performing accurate data analysis.
You can’t use normal comparison operators like > < = to work with NULL values.
You can use IS NULL and IS NOT NULL operators to find values that are (and are not) NULL. Or use functions like ISNULL() to return a specific value instead of NULL.
Finally, it’s important to know how aggregate functions and NULL values work together. For example, how you can use the COUNT() aggregate function to find the number of NULL values.
COUNT(*) returns a number of values including NULL ones, whereas COUNT(column) returns the number of non-null values. The difference between them will be the number of NULL values in the table.
Text functions
In SQL, text values belong to data types such as VARCHAR, CHAR, or TEXT, and they are very common. It’s important for data engineers to be able to use built-in text functions in SQL.
CHAR_LENGTH() returns a number of symbols in the text. LOWER() and UPPER() convert text values to lowercase or uppercase. They help you ensure consistent capitalization of letters, which is absolutely necessary if the task involves case-sensitive functions.
Other important text functions are TRIM() and SUBSTRING(). The former allows you to remove unnecessary spaces before and after text values. The latter allows you to extract a specific portion of the string.
Finally, the REPLACE() function finds substring instances and replaces them with another text.
This was a short list of the most important text functions in PostgreSQL. Refer to the documentation for a full list. The more text functions you know, the better.
JOINs
Data engineers use JOINs to combine data from multiple tables into one. There are four main types of JOINs in SQL.
- INNER JOINs return rows with overlapping values in the shared dimension.
- LEFT JOINs return all rows from the first table but only overlapping values from the second table.
- RIGHT JOINs are the opposite and return all rows from the second table and only matching values from the first table.
- FULL JOINs return combined columns of both tables. If rows do not overlap, it fills absent columns with NULL.
- CROSS JOINs return all possible combinations of columns from both tables.
Understanding various types of JOINs will help you succeed as a data engineer. You should understand their syntax and be able to recognize which type of JOIN is needed for a given task. Knowing how to use the ON statement to define a shared dimension is also important.
Window functions
You can use window functions to create new values. For example, they can be used to assign each record a numerical rank. Or create new aggregate values of the entire table or certain group of rows.
Unlike the GROUP BY statement, aggregate values are stored in a separate column, not collapsed into one row.
Before going into an interview, practice important window functions like ROW_NUMBER(), RANK(), and DENSE(). Learn how to use PARTITION BY to create groups and apply window functions to each group, not the entire table.
In later parts of this article, we’ll see practical examples of window functions and how the aforementioned functions can solve real business challenges.
Array functions
As a data engineer, you’ll have to work with complex data structures like arrays. So it’s important to have a good grasp of array functions.
You can start with learning the ARRAY() command, which takes values and adds them to an array.
ARRAY_AGG() function aggregates all items of the array. ARRAY_APPEND() function allows you to add specific value to an array. Or join two arrays together using the ARRAY_CONCAT() function.
CASE expression
Setting up conditions for filtering and joining tables is an integral part of a data engineer’s job. CASE expression can be used to set up complex conditions for WHERE and HAVING clauses.
You should know how to use CASE expressions to chain multiple conditions. That includes WHEN, THEN, and ELSE clauses, which make up the condition.
It’s important to understand how SQL determines the outcome of a CASE expression. A chain of CASE statements will return the value of whichever condition is met first. If none of the conditions are met, it will return the value of the ELSE clause. If there is no ELSE clause, CASE will return NULL.
DISTINCT
Data engineers use the DISTINCT clause for many purposes. When used with a SELECT statement, it returns only unique values. When used with aggregate functions, it ensures that they are applied to unique values only.
DISTINCT provides a very simple yet effective way to remove (or ignore) duplicate values. For that reason, it is a useful tool at any data engineer’s disposal.
Basic Data Engineer SQL Interview Questions
Now, let’s see these SQL concepts used in practice.
Question #1: Cast Values to Integer
This is a typical data engineer SQL interview question, where you are asked to cast values, remove inconsistencies, and clean up the table.
Cast stars column values to integer and return with all other column values. Be aware that certain rows contain non integer values. You need to remove such rows. You are allowed to examine and explore the dataset before making a solution.
Link to the question: https://platform.stratascratch.com/coding/10056-cast-stars-column-values-to-integer-and-return-with-all-other-column-values
Understand the question
This data engineer SQL interview question is fairly straightforward as long as you take the time to understand what’s being asked.
There are two clear objectives: cast stars values to integer and clear the table of rows that don’t have numeric star values. Otherwise, leave the table as it is and return all other columns as well.
Analyze data
To answer this question, we have to work with a single yelp_reviews table. Let’s look at the types of values in all columns:
business_name | review_id | user_id | stars | review_date | review_text | funny | useful | cool |
---|---|---|---|---|---|---|---|---|
AutohausAZ | C4TSIEcazRay0qIRPeMAFg | jlbPUcCRiXlMtarzi9sW5w | 5 | 2011-06-27 | Autohaus is my main source for parts for an old Mercedes that has been in the family since new that I am restoring. The old beater is truly a labor of | 1 | 2 | 1 |
Citizen Public House | ZZ0paqUsSX-VJbfodTp1cQ | EeCWSGwMAPzwe_c1Aumd1w | 4 | 2013-03-18 | First time in PHX. Friend recommended. Friendly waitstaff. They were understaffed and extremely busy, but we never knew. The short ribs were tende | 0 | 0 | 0 |
Otto Pizza & Pastry | pF6W5JOPBK6kOXTB58cYrw | JG1Gd2mN2Qk7UpCqAUI-BQ | 5 | 2013-03-14 | LOVE THIS PIZZA! This is now one of my favorite pizza places in phoenix area. My husband i walked into this cute family owned business and were greet | 0 | 0 | 0 |
Giant Hamburgers | QBddRcflAcXwE2qhsLVv7w | T90ybanuLhAr0_s99GDeeg | 3 | 2009-03-27 | ok, so I tried this place out based on other reviews. I had the cheeseburger with everything, some fries, and a chocolate shake. The burger was okay, | 0 | 1 | 1 |
Tammie Coe Cakes | Y8UMm_Ng9oEpJbIygoGbZQ | MWt24-6bfv_OHLKhwMQ0Tw | 3 | 2008-08-25 | Overrated. The treats are tasty but certainly not the best I have ever had. I would have rated this a two star but the cakes and cookies are REALLY pr | 1 | 3 | 2 |
As you can see, the stars column stores text values of the ‘varchar’ type.
Business_name, review_id, user_id, review_date, and review_text similarly contain text values.
Funny, useful, and cool columns are integers.
All columns will stay the same except for the stars column. Values in this column will be converted to an integer. We must also remove rows where the stars column contains values other than numbers.
We can see that the stars column contains whole numbers to represent the yelp rating of each business. However, there are exceptions when the value is a question sign (‘?’).
It might look like values in the stars column are already numbers and don’t need to be converted. However, it’s important to understand that the ‘4’ you see in the column is actually text. Not a number 4, but a piece of text, like ‘4 apples’.
Plan your approach
To solve this data engineer SQL interview question, we need to perform two tasks - keep rows where the value of the stars column is a number (not the ‘?’), and convert values of the stars column to the integer type.
Finally, we need to return all columns. Only values in the stars column need to be converted to integers, others are returned as they were before.
Write the code
- Convert stars values
Let’s start by converting stars values from ‘varchar’ to ‘integer’ type.
Let’s select all other columns as well since our query needs to return them all.
In PostgreSQL, we can use the double-semicolon syntax to easily cast the value of the stars column.
SELECT business_name,
review_id,
user_id,
stars :: INTEGER,
review_date,
review_text,
funny,
useful,
cool
FROM yelp_reviews
In MySQL, you can use either CAST() or CONVERT() functions instead.
- Remove rows with non-numeric star values
As we’ve seen in the previous section, some records have the stars value of ‘?’.
We can simply set a WHERE clause to only keep rows where the value of the stars column is not ‘?’.
SELECT
business_name,
review_id,
user_id,
stars :: INTEGER,
review_date,
review_text,
funny,
useful,
cool
FROM yelp_reviews
WHERE
stars <> '?'
That’s it. This query will convert the stars column to integer, and filter out rows where the value of stars column is ‘?’.
Output
Output table should look exactly like before, but the value of the stars column will be of integer type. Also, records with the ‘?’ value in the stars column will be filtered out.
business_name | review_id | user_id | stars | review_date | review_text | funny | useful | cool |
---|---|---|---|---|---|---|---|---|
AutohausAZ | C4TSIEcazRay0qIRPeMAFg | jlbPUcCRiXlMtarzi9sW5w | 5 | 2011-06-27 | Autohaus is my main source for parts for an old Mercedes that has been in the family since new that I am restoring. The old beater is truly a labor of | 1 | 2 | 1 |
Citizen Public House | ZZ0paqUsSX-VJbfodTp1cQ | EeCWSGwMAPzwe_c1Aumd1w | 4 | 2013-03-18 | First time in PHX. Friend recommended. Friendly waitstaff. They were understaffed and extremely busy, but we never knew. The short ribs were tende | 0 | 0 | 0 |
Otto Pizza & Pastry | pF6W5JOPBK6kOXTB58cYrw | JG1Gd2mN2Qk7UpCqAUI-BQ | 5 | 2013-03-14 | LOVE THIS PIZZA! This is now one of my favorite pizza places in phoenix area. My husband i walked into this cute family owned business and were greet | 0 | 0 | 0 |
Giant Hamburgers | QBddRcflAcXwE2qhsLVv7w | T90ybanuLhAr0_s99GDeeg | 3 | 2009-03-27 | ok, so I tried this place out based on other reviews. I had the cheeseburger with everything, some fries, and a chocolate shake. The burger was okay, | 0 | 1 | 1 |
Tammie Coe Cakes | Y8UMm_Ng9oEpJbIygoGbZQ | MWt24-6bfv_OHLKhwMQ0Tw | 3 | 2008-08-25 | Overrated. The treats are tasty but certainly not the best I have ever had. I would have rated this a two star but the cakes and cookies are REALLY pr | 1 | 3 | 2 |
Question 2: Arrange worker records based on values in two columns
In this question, aspiring data engineers are asked to arrange names in an alphabetic order.
Sort workers in ascending order by the first name and then in descending order by department name.
Link to the question: https://platform.stratascratch.com/coding/9836-sort-workers-in-ascending-order-by-the-first-name-and-in-descending-order-by-department-name
Understand the question
The premise of this data engineer SQL interview question is fairly simple. We need to arrange records based on names (text values) in two columns.
The question only tells us to sort records in ascending and descending orders. However, since values are text, it means we need to order rows alphabetically.
There’s a twist to this question - for one column (first names), we must order in ascending order (from A to Z). For the second column (department names), we must order them in descending order (from Z to A).
Analyze data
All the necessary information is contained in a single workers table. Let’s start by looking at the types of values in each column.
worker_id | first_name | last_name | salary | joining_date | department |
---|---|---|---|---|---|
1 | Monika | Arora | 100000 | 2014-02-20 | HR |
2 | Niharika | Verma | 80000 | 2014-06-11 | Admin |
3 | Vishal | Singhal | 300000 | 2014-02-20 | HR |
4 | Amitah | Singh | 500000 | 2014-02-20 | Admin |
5 | Vivek | Bhati | 500000 | 2014-06-11 | Admin |
We have six columns.
- worker_id contains integer values to identify workers. Most likely, we won’t have to work with this column.
- The first_name column contains first name values. We will order records by looking at values in this column.
- We don’t need to work with values in the last_name, salary, or joining_date columns.
- Department names are stored in the department column. We will order records by looking at values in this column.
Looking at the table, all values look standard. It’s worth noting that first and last names are capitalized.
Plan your approach
This question asks us to order rows based on values in two columns - first_name and department. We don’t have to change rows or values in them.
We need to select all rows and use the ORDER BY statement to arrange them in a specific order.
It’s important that you know how to order rows according to values in two columns.
Write the code
Try to write the code yourself.
Output
The correct query should return a sorted table.
worker_id | first_name | last_name | salary | joining_date | department |
---|---|---|---|---|---|
9 | Agepi | Argon | 90000 | 2015-04-10 | Admin |
4 | Amitah | Singh | 500000 | 2014-02-20 | Admin |
8 | Geetika | Chauhan | 90000 | 2014-04-11 | Admin |
12 | Jai | Patel | 85000 | 2014-03-21 | HR |
10 | Moe | Acharya | 65000 | 2015-04-11 | HR |
Question 3: Find the last five records of the dataset
In this data engineer SQL interview question, data engineers are asked to select a specific part of a large dataset.
Link to the question: https://platform.stratascratch.com/coding/9864-find-the-last-five-records-of-the-dataset
Understand the question
Description of this question is fairly simple.
Before attempting to answer the question, it’s a good idea to look at available data and examine values, their types, and the ordering of records.
Analyze data
There is one available table named worker. We already worked with this table in the previous question.
In this case, we are not concerned with any of the six columns or values contained in them. We simply have to find and return the last five records.
However, it’s important to look at the actual table (with data) to see if we can notice anything that will help us find the last five records.
Without further ado, let’s look at the worker table with actual data:
worker_id | first_name | last_name | salary | joining_date | department |
---|---|---|---|---|---|
1 | Monika | Arora | 100000 | 2014-02-20 | HR |
2 | Niharika | Verma | 80000 | 2014-06-11 | Admin |
3 | Vishal | Singhal | 300000 | 2014-02-20 | HR |
4 | Amitah | Singh | 500000 | 2014-02-20 | Admin |
5 | Vivek | Bhati | 500000 | 2014-06-11 | Admin |
As you can see, every record has its own unique worker_id value.
It’s safe to assume that records are arranged in ascending order according to worker_id values. Double-check with the interviewer to confirm this pattern. Once you do, solving this data engineer SQL interview question becomes much easier.
Plan your approach
Before we can find and extract the last five records, we need to define the criteria for identifying them.
We’ve seen that each record has its own unique identifier of integer type. As long as these numbers are unique and increasing, counting the number of all rows will give us the id of the last record.
We can use that information to get the worker_id value of the last five records and set up the condition only to keep rows with those ids.
Or we can find the worker_id of the fifth row before the last and keep records where worker_id is higher than the threshold.
Write the code
Practice your SQL skills by answering the question yourself.
Output
Your final query should return the last five records.
worker_id | first_name | last_name | salary | joining_date | department |
---|---|---|---|---|---|
8 | Geetika | Chauhan | 90000 | 2014-04-11 | Admin |
9 | Agepi | Argon | 90000 | 2015-04-10 | Admin |
10 | Moe | Acharya | 65000 | 2015-04-11 | HR |
11 | Nayah | Laghari | 75000 | 2014-03-20 | Account |
12 | Jai | Patel | 85000 | 2014-03-21 | HR |
Question 4: Primary Key Violation
In this data engineer SQL interview question, you have to find duplicate cust_id values and return the number of times they are repeated. You can find other Amazon Data Engineer Interview Questions on the StrataScratch blog.
Interview Question Date: May 2022
Write a query to return all Customers (cust_id) who are violating primary key constraints in the Customer Dimension (dim_customer) i.e. those Customers who are present more than once in the Customer Dimension. For example if cust_id 'C123' is present thrice then the query should return two columns, value in first should be 'C123', while value in second should be 3
Link to the question: https://platform.stratascratch.com/coding/2107-primary-key-violation
Understand the question
The question has detailed instructions, but it can be confusing. Make sure to read it multiple times to get to the essence of the question.
To solve this challenge, you must know that primary keys are unique identifiers for each record. They can not be repeated. The question refers to this as the ‘primary key constraint’.
There are also instructions for how to format the output. We need to return rows with the duplicate cust_id values and the number of times they appear.
Analyze data
Looking at the available data could help you better understand the question.
Let’s start by looking at the types of values in every column of the only available table named dim_customer:
cust_id | cust_name | cust_city | cust_dob | cust_pin_code |
---|---|---|---|---|
C273 | Stephen V. Cooke | New York | 1996-11-28 | 8235 |
C274 | Peter P. Mankin | Mount Upton | 1984-06-25 | 6050 |
C274 | Juan C. Parker | Mertzon | 1989-07-07 | 6867 |
C274 | Eve E. McClure | Southfield | 1995-05-18 | 7791 |
C275 | Charles J. Stevens | Oakland | 1975-12-02 | 5930 |
To solve this data engineer SQL interview question, we’ll have to find duplicate cust_id values and the number of times they are repeated. This column contains text values, specifically of the ‘varchar’ type.
cust_name, cust_city, cust_dob and cust_pin_code values are not important.
Plan your approach
In simple words, we need to find duplicate cust_id values and the number of times they are repeated.
The simplest way is to create a group of rows for each unique cust_id value, find the number of rows in each group, and return those with more than one row in the group.
Then return the number of records in each group. This will return the number of rows with the same cust_id value.
Write the code
Now that you have a logical outline of what to do, try to match it by writing the SQL query yourself.
Output
Query should return two columns - the key and the number of times it was repeated.
cust_id | n_occurences |
---|---|
C276 | 2 |
C281 | 2 |
C274 | 3 |
Advanced data engineer SQL interview questions
Let’s also look at some of the more difficult SQL questions asked during data engineer interviews.
Question 5: Verify the first 4 numbers of all phone numbers
In this question, we have to validate data by checking phone number values and confirm that they begin with specific 4 numbers.
Verify that the first 4 digits are equal to 1415 for all phone numbers. Output the number of businesses with a phone number that does not start with 1415.
Link to the question: https://platform.stratascratch.com/coding/9737-verify-that-the-first-4-digits-are-equal-to-1415-for-all-phone-numbers
Understand the question
This data engineer SQL interview question asks us to verify that all phone numbers begin with 1415. It’s safe to assume that one of the columns contains phone number values. We will have to make sure that the first four symbols of values in this column are 1415.
It’s unclear what kind of answer we need to return. It could be binary - to show whether or not all phone numbers begin with 1415. Or we might have to return the number of phone numbers that begin with 1415 or do not begin with 1415.
In situations like this, you should ask the interviewer for specific instructions.
Analyze data
The sf_restaurant_health_violations table contains information about businesses, including their phone numbers.
Let’s take a closer look at the types of values contained in each column.
business_id | business_name | business_address | business_city | business_state | business_postal_code | business_latitude | business_longitude | business_location | business_phone_number | inspection_id | inspection_date | inspection_score | inspection_type | violation_id | violation_description | risk_category |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5800 | John Chin Elementary School | 350 Broadway St | San Francisco | CA | 94133 | 37.8 | -122.4 | {'longitude': '-122.403154', 'needs_recoding': False, 'latitude': '37.798358', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 5800_20171017 | 2017-10-17 | 98 | Routine - Unscheduled | 5800_20171017_103149 | Wiping cloths not clean or properly stored or inadequate sanitizer | Low Risk | |
64236 | Sutter Pub and Restaurant | 700 Sutter St | San Francisco | CA | 94102 | 37.79 | -122.41 | {'longitude': '-122.41188', 'needs_recoding': False, 'latitude': '37.78881', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 64236_20170725 | 2017-07-25 | 88 | Routine - Unscheduled | 64236_20170725_103133 | Foods not protected from contamination | Moderate Risk | |
1991 | SRI THAI CUISINE | 4621 LINCOLN Way | San Francisco | CA | 94122 | 37.76 | -122.51 | {'longitude': '-122.507779', 'needs_recoding': False, 'latitude': '37.764073', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 1991_20171129 | 2017-11-29 | 86 | Routine - Unscheduled | 1991_20171129_103139 | Improper food storage | Low Risk | |
3816 | Washington Bakery & Restaurant | 733 Washington St | San Francisco | CA | 94108 | 37.8 | -122.41 | {'longitude': '-122.405845', 'needs_recoding': False, 'latitude': '37.795174', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 3816_20160728 | 2016-07-28 | 67 | Routine - Unscheduled | 3816_20160728_103108 | Contaminated or adulterated food | High Risk | |
39119 | Brothers Restaurant | 4128 GEARY Blvd | San Francisco | CA | 94118 | 37.78 | -122.46 | {'longitude': '-122.463762', 'needs_recoding': False, 'latitude': '37.781148', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 39119_20160718 | 2016-07-18 | 79 | Routine - Unscheduled | 39119_20160718_103133 | Foods not protected from contamination | Moderate Risk |
The table has a lot of columns.
The only column that contains a phone number is business_phone_number, so it’s safe to assume that we need to check values in this column.
We can confirm our assumption by looking at the actual table above.
It's important to note that not all rows have values in this column. Some are empty.
Plan your approach
To solve this question, we have to filter rows by two criteria - the value in the business_phone_number column is not empty. If it has a number, it must start with 1415.
Let’s assume that the question asks us to return the number of rows where the phone number does not start with 1415.
Finally, we should return the number of rows that satisfy both criteria.
Write the code
1. Get the number of all rows
We can use the COUNT() aggregate function to get the number of all records in the table.
SELECT COUNT (*)
FROM sf_restaurant_health_violations
COUNT() will return the total number of records in the table. Output will be collapsed into one row.
2. Set up two conditions
To get the answer, we need to find the number of records that satisfy the two aforementioned criteria.
We can add a WHERE clause to remove records that do not meet the criteria.
SELECT COUNT (*)
FROM sf_restaurant_health_violations
WHERE business_phone_number IS NOT NULL
AND LEFT(business_phone_number :: TEXT, 4) <> '1415'
The COUNT(*) aggregate function will apply to filtered records.
We use the LEFT() function to take four symbols from the left side (beginning) of the value, and compare it with 1415.
Output
Our query will return the number of phone number values that exist but do not start with 1415.
count |
---|
1 |
Question 6: SMS Confirmations From Users
In this data engineer SQL interview question, data engineers must identify and get the number of invalid records. It is one of many Facebook Data Engineer Interview Questions.
Interview Question Date: November 2020
Meta/Facebook sends SMS texts when users attempt to 2FA (2-factor authenticate) into the platform to log in. In order to successfully 2FA they must confirm they received the SMS text message. Confirmation texts are only valid on the date they were sent.
Unfortunately, there was an ETL problem with the database where friend requests and invalid confirmation records were inserted into the logs, which are stored in the 'fb_sms_sends' table. These message types should not be in the table.
Fortunately, the 'fb_confirmers' table contains valid confirmation records so you can use this table to identify SMS text messages that were confirmed by the user.
Calculate the percentage of confirmed SMS texts for August 4, 2020. Be aware that there are multiple message types, the ones you're interested in are messages with type equal to 'message'.
Link to the question: https://platform.stratascratch.com/coding/10291-sms-confirmations-from-users
Understand the question
Read the description of this question multiple times because it’s a lot to take in at once.
In short, a table contains records that should not be there. Specifically, confirmation and friend request records. It’s our job to identify misplaced records.
Finally, we are asked to find what percentage of confirmation texts were valid on August 4.
Analyze data
You’ll have to work with two tables to find the answer.
One is the fb_sms_sends table. The question tells us that some of the records in this table are invalid.
The other fb_confirmers table contains correct confirmation records.
We need to compare these two tables to calculate the percentage of confirmed texts for a specific date.
Let’s look at columns in each of these tables:
ds | country | carrier | phone_number | type |
---|---|---|---|---|
2020-08-07 | ES | at&t | 9812768911 | confirmation |
2020-08-02 | AD | sprint | 9812768912 | confirmation |
2020-08-04 | SA | at&t | 9812768913 | message |
2020-08-02 | AU | sprint | 9812768914 | message |
2020-08-07 | GW | rogers | 9812768915 | message |
the ds column contains datetime values, most likely the date of sending a message. We will need to check this column to find sms confirmations for the 4th of August.
The question does not ask us anything about the location or network of the user, so we can ignore the country and carrier columns.
Values in the type column might help us identify whether the record represents a valid message, or false records.
The phone_number contains a phone number that receives the confirmation message.
Let’s move on to the next table, which contains correct records of confirmation messages.
date | phone_number |
---|---|
2020-08-06 | 9812768960 |
2020-08-03 | 9812768961 |
2020-08-05 | 9812768962 |
2020-08-02 | 9812768963 |
2020-08-06 | 9812768964 |
Here we have just two columns, values in the date column indicate when the message was sent. Phone numbers are stored in the column of the same name.
Some of the confirmation records in fb_sms_sends table represent an unsuccessful attempt. We need to check them against information in the fb_confirmers table, which contains records of successful confirmations.
Plan your approach
We need some way to identify ‘confirmation’ and ‘friend request’ records that should not be in the fb_sms_sends table. We can do that by merging two tables on shared dimensions of the phone number and the date. If records from the ‘unverified’ table are not in the valid fb_confirmers table, it means they are invalid.
We will also have to set up a condition to only keep rows for messages that were sent on the 4th of August 2020 and remove messages sent on all other days.
Finally, we need to calculate what percentage of all messages were confirmed. Once we merge the table, we can get the number of records from the fb_confirmers table that are also present in the fb_sms_sends table. Then find the number of all records of the ‘message’ type and calculate the percentage.
Write the code
1. Perform a LEFT JOIN to filter out invalid records
To compare valid and invalid confirmation records, we need to combine data from two tables. The safest bet is to perform a LEFT JOIN, which will keep all records from the fb_sms_sends (including invalid records) and only match values from the fb_confirmers table.
SELECT *
FROM fb_sms_sends a
LEFT JOIN fb_confirmers b ON a.ds = b.date
AND a.phone_number = b.phone_number
We will use the ON clause to define two shared dimensions - the date when the message was sent and the phone number.
In the combined table, records from the fb_sms_sends table that don’t share the date and phone_number values will have empty columns. This will help us identify messages that were not confirmed.
ds | country | carrier | phone_number | type | date | phone_number |
---|---|---|---|---|---|---|
2020-08-07 | ES | at&t | 9812768911 | confirmation | ||
2020-08-02 | AD | sprint | 9812768912 | confirmation | ||
2020-08-04 | SA | at&t | 9812768913 | message | ||
2020-08-02 | AU | sprint | 9812768914 | message | ||
2020-08-07 | GW | rogers | 9812768915 | message |
2. Filter records by send date and their type
The question specifically tells us that we need to work with messages sent on the 4th of August in 2020. Looking at the table, we can see that dates are written in the ‘MM-DD-YYYY’ format.
We need to chain another condition to specifically select records of the ‘message’ type.
SELECT *
FROM fb_sms_sends a
LEFT JOIN fb_confirmers b ON a.ds = b.date
AND a.phone_number = b.phone_number
WHERE a.ds = '08-04-2020'
AND a.type = 'message'
Running this query will only keep message records where the value of ds column is ‘08-04-2020’ and has a type of ‘message’.
ds | country | carrier | phone_number | type | date | phone_number |
---|---|---|---|---|---|---|
2020-08-04 | IT | at&t | 9812768991 | message | 2020-08-04 | 9812768991 |
2020-08-04 | SA | at&t | 9812768913 | message | ||
2020-08-04 | TJ | rogers | 9812768934 | message | ||
2020-08-04 | NR | rogers | 9812768997 | message | ||
2020-08-04 | HN | at&t | 9812768921 | message |
3. Calculate the percentage of valid confirmations
We need to find the percentage of valid confirmation messages vs. all messages from the fb_sms_sends table.
Looking at the output of the previous step, we know that valid confirmation records from the fb_confirmers table have date and phone_number values.
We can use the COUNT() aggregate function to find the number of valid confirmation records and the number of all messages.
Provide phone_number or date columns as an argument to the COUNT() function to find the number of records with values that are not NULL.
Repeat the same to find the number of all message records. Only this time, the argument to the COUNT() function should be one of the five columns where all records have values.
Then divide two aggregates, and you’ll have a decimal ratio. Finally, multiply the ratio by 100 to get the percentage value.
SELECT COUNT(b.phone_number)::float / COUNT(a.phone_number) * 100 AS perc
FROM fb_sms_sends a
LEFT JOIN fb_confirmers b ON a.ds = b.date
AND a.phone_number = b.phone_number
WHERE a.ds = '08-04-2020'
AND a.type = 'message'
The result of the COUNT() function is always a whole number. Don’t forget to convert it to a float so it can be divided by another whole number.
Finally, use the AS keyword to give the final column a descriptive name.
Output
Query should return one row with a single percentage value.
perc |
---|
20 |
Question 7: Find the first 50% of the dataset
In this question, candidates have to return the first half of all rows in the table.
Link to the question: https://platform.stratascratch.com/coding/9859-find-the-first-50-records-of-the-dataset
Understand the question
The premise is fairly simple - we need to return the first half of all rows in the dataset.
Analyze data
We are given one table named worker.
Let’s look at the types of values in every column:
worker_id | first_name | last_name | salary | joining_date | department |
---|---|---|---|---|---|
1 | Monika | Arora | 100000 | 2014-02-20 | HR |
2 | Niharika | Verma | 80000 | 2014-06-11 | Admin |
3 | Vishal | Singhal | 300000 | 2014-02-20 | HR |
4 | Amitah | Singh | 500000 | 2014-02-20 | Admin |
5 | Vivek | Bhati | 500000 | 2014-06-11 | Admin |
To answer this data engineer interview question, we don’t need to work with any of these columns. We just have to return the first half of all rows.
While looking at data, you might notice that every record seems to have an incremental worker_id value.
This is a limited example with only 8 rows. In reality, data engineers work with much larger datasets.
Plan your approach
To get the first 50% of all rows, first, we must find the total number of rows in the table.
Once we do, we can divide the total by 2. This will give us half of all rows, let’s call this number X. We can tell SQL to return the first X number of rows, but not more.
Rows already have worker_id values, which appear unique and incremental. We can set a condition to return records where the worker_id value is less than X (a number that represents half of all records). However, there are no guarantees that these values are unique and/or incremental.
Instead, we can use window functions to assign each row a unique incremental number and tell SQL to return records with numerical values of less than X (half of all records).
Write the code
If you like a challenge, try to write a query yourself.
Output
The query should output four records because the example data showed 8 records.
worker_id | first_name | last_name | salary | joining_date | department |
---|---|---|---|---|---|
1 | Monika | Arora | 100000 | 2014-02-20 | HR |
2 | Niharika | Verma | 80000 | 2014-06-11 | Admin |
3 | Vishal | Singhal | 300000 | 2014-02-20 | HR |
4 | Amitah | Singh | 500000 | 2014-02-20 | Admin |
5 | Vivek | Bhati | 500000 | 2014-06-11 | Admin |
Question 8: Formatting Names
This question involves cleaning up data and ensuring its consistency.
Print the first name after removing white spaces from the left side.
Link to the question: https://platform.stratascratch.com/coding/9831-formatting-names
Understand the question
The task is fairly simple - remove unnecessary white spaces from the left side of first name values.
The question asks us to print (return) trimmed values.
Analyze data
Let’s start by looking at the table and dataset available for this question:
To solve this data engineer SQL interview question, we are only concerned with values in the first_name column.
Plan your approach
We need to take first name values and trim them.
The question tells us to remove white spaces only from the left side of first name values. So the standard TRIM() function is not going to work.
Write the code
Solving this challenge would be a great practice before going into a data engineering interview.
Try to write the query yourself:
Output
The query should return only trimmed first name values and nothing else.
first_name_trim |
---|
Monika |
Niharika |
Vishal |
Amitah |
Vivek |
Question 9: Rows With Missing Values
In this data engineer SQL interview question, we have to find and return records with multiple empty columns.
Interview Question Date: April 2022
The data engineering team at YouTube want to clean the dataset user_flags. In particular, they want to examine rows that have missing values in more than one column. List these rows.
Link to the question: https://platform.stratascratch.com/coding/2106-rows-with-missing-values
Understand the question
The objective is clear - identify rows with more than one missing value and return them.
Analyze data
Let’s look at columns in the user_flags table:
user_firstname | user_lastname | video_id | flag_id |
---|---|---|---|
Richard | Hasson | y6120QOlsfU | 0cazx3 |
Mark | May | Ct6BUPvE2sM | 1cn76u |
Gina | Korman | dQw4w9WgXcQ | 1i43zk |
Mark | May | Ct6BUPvE2sM | 1n0vef |
Mark | May | jNQXAC9IVRw | 1sv6ib |
As you can see, all columns contain text values.
We’ll have to check all columns to find out whether they are empty or have values in them.
The last record has two empty columns. It is a prime example of the kind of records we’re looking for.
Plan your approach
The question asks us to return rows with missing columns, so we need to:
- Use the WHERE clause to remove rows that don’t have two or more missing values.
- Set up a complex condition (using CASE statement) to identify and keep count of empty columns for each row.
Write the code
Try to write the query yourself.
Output
The final query should return all the records with more than one missing value.
user_firstname | user_lastname | video_id | flag_id |
---|---|---|---|
Courtney | dQw4w9WgXcQ | ||
Gina | Korman | ||
Greg | 5qap5aO4i9A | ||
Ct6BUPvE2sM |
Question 10: Find the combinations
As a data engineer, sometimes you’ll have to prepare data for analysis. For example, extract data that meets certain criteria.
Find all combinations of 3 numbers that sum up to 8. Output 3 numbers in the combination but avoid summing up a number with itself.
Link to the question: https://platform.stratascratch.com/coding/10010-find-the-combinations
Understand the question
You might have to read this data engineer SQL interview question multiple times to understand it fully.
We are given a list of numbers and need to find and output various combinations of three numbers that add up to 8.
There is an extra condition - we can not add a number to itself. Every number in the equation must be unique.
Analyze data
The transportation_numbers table is essentially a list of numbers. It has two columns, index and number, and both of them contain integer values.
We need to go through the list and pick three that add up to 8.
index | number |
---|---|
1 | 5 |
2 | 3 |
3 | 7 |
4 | 1 |
5 | 0 |
It looks like all values in the number column are below 8.
Plan your approach
The initial table has one column with numbers.
We need to combine the table with itself three times, so that every record contains three number values.
If you use JOIN to combine tables, you can set the ON condition to ensure that all three numbers are different from each other.
Finally, you need to set up a filter that checks if three number columns add up to 8.
Write the code
Try to solve this interesting challenge yourself.
Output
num_1 | num_2 | num_3 |
---|---|---|
5 | 0 | 3 |
3 | 4 | 1 |
3 | 1 | 4 |
3 | 0 | 5 |
7 | 0 | 1 |
Summary
Hopefully, by now, you better understand the day-to-day responsibilities of a data engineer and what a data engineer does and the importance of their work.
To do the job successfully, data engineers need to have a very specific set of skills, one of them is SQL. In this article, we listed the most important SQL concepts and walked you through data engineer SQL interview questions asked by reputable employers.
Practice on these questions to maximize your chances of landing a data engineer job. Go to the Data Engineer Interview Questions post on the StrataScratch blog to find more questions to solve. The platform has hundreds of SQL interview questions of various levels of difficulty. You can write queries, check their correctness, get hints and guidance all within one platform. Soon enough, you’ll master SQL and be prepared to solve any interview questions with confidence.