Data Science Coding Interview Questions with 5 Technical Concepts
Categories
Understanding 5 coding concepts of data science coding interview questions that companies will test you on and preparing for them effectively
A data science interview involves multiple rounds and one of such rounds involves coding interview questions. The purpose of these data science coding interview questions is to check if a candidate can program and knows the required coding languages such as SQL and Python. We’ll cover the coding questions you may receive during your data science technical interview round.
In this article, we'll cover the 5 coding concepts that companies will test you on in your data science interviews. As you guys know, we’re always tracking and collecting data science coding interview questions from various companies to better understand what technical concepts you need to know to succeed in interviews and what data science skills you need to have to succeed on the job. So these are coding questions that I’ve collected and are as real-time as can be.
The companies that we’re sourcing these data science coding interview questions from today are Facebook, Postmates, Credit Karma, Ring Central, and Twitch. We’ll introduce the coding concept first to understand and then show you a real interview question we collected from the company.
Technical Concepts of Data Science Coding Interview Questions
This guide of technical concepts contains almost all types of data science coding interview questions you should expect when interviewing for a data scientist position.
From this list of data science coding interview questions, an interviewee should be able to prepare for:
- The complex and tough questions
- What answers will positively resonate with an employer
- The different concepts that companies will test you on
- Developing the confidence to ace the interview
We’ve broken these data science coding interview questions into five different concepts:
- Concept #1: Categorizations, aggregations, ratios
- Concept #2: LEFT JOINs and Subqueries and CTEs
- Concept #3: Subqueries in the WHERE clause
- Concept #4: Window functions
- Concept #5: Date manipulations
1. Categorizations, Aggregations, Ratios
Let’s start simple -- categorizations, aggregations, ratios. Aren’t these 3 different concepts? Yes, but they’re almost always tested together in one coding question. And it’s probably the most commonly tested concept on data science coding interviews, especially in the beginning rounds.
By categorizations, we mean CASE statements in SQL. We’ll stick with SQL technical concepts because most people use SQL for data science coding interviews.
Aggregations like taking the sum or average are always tested and usually are applied after the categorization takes place.
But they can also take the form of a ratio.
Here is an example coding interview question by Facebook
Interview Question Date: February 2021
Output share of US users that are active. Active users are the ones with an "open" status in the table.
Link to the question: https://platform.stratascratch.com/coding/2005-share-of-active-users
Here’s the solution to this coding interview question:
SELECT active_users /total_users::float AS active_users_share
FROM
(SELECT count(user_id) total_users,
count(CASE
WHEN status = 'open' THEN 1
ELSE NULL
END) AS active_users
FROM fb_active_users
WHERE country = 'USA') subq
If you read the above solution and it makes sense to you then congrats, you’re on the right track.
To solve this question, you need to understand how to properly write a case statement, properly aggregate the results of the case statement, and then take a ratio from the result of the aggregation itself. All in one query.
The tricky part isn’t even those 3 technical concepts. It’s organizing your code into logical steps -- as you can see there’s a subquery to split out the logic.
And there’s also a data type conversion from an integer to a float so that your output can be between 0 and 1.
These are the little nuances you need to know in addition to these 3 concepts. Lots of people forget these small things, especially if you’re out of practice or nervous.
2. LEFT JOINs and Subqueries and CTEs
The second concept of data science coding interview questions includes JOINs, subqueries, and CTEs. Again 3 different concepts? But they’re always tested together to solve one problem.
Let's have an example. This question is from Credit Karma Interview on Submissions.
Interview Question Date: February 2021
Write a query that joins this submissions table to the loans table and returns the total loan balance on each user’s most recent ‘Refinance’ submission. Return all users and the balance for each of them.
Link to the question: https://platform.stratascratch.com/coding/2003-recent-refinance-submissions
And here’s the SQL code solution for this coding interview question:
SELECT l.user_id, balance
FROM
(SELECT DISTINCT id, user_id, created_at,
max(created_at) OVER (PARTITION BY user_id,
TYPE) most_recent
FROM loans
WHERE TYPE = 'Refinance') l
INNER JOIN submissions s ON l.id = s.loan_id
WHERE most_recent = created_at
It might look pretty complicated. It even includes other advanced concepts we’re not even covering. But you can see there are a JOIN statement and a subquery (which could be re-written as a CTE, if you wanted to do that).
What’s being tested is if you can blend two sources of data together and aggregate the data from one table, “balance” with a variable from another table “user_id”. Remember, a successful interviewee will talk through the entire logic and why they’re doing what they’re doing.
3. Subqueries in the WHERE Clause
While we’re talking about subqueries, they can be found in multiple places – not just at the JOIN statement like in the last question. Subqueries in the WHERE clause are also common concepts tested and actually, it’s a question that most people get wrong.
Here is a question from Ring Central Interview on Subqueries
Interview Question Date: February 2021
How many paid users had any calls in Apr 2020?
Link to the question: https://platform.stratascratch.com/coding/2017-paid-users-in-april-2020
And here’s the solution:
SELECT COUNT(DISTINCT user_id)
FROM rc_calls
WHERE date BETWEEN '2020-04-01' AND '2020-04-30'
AND user_id IN
(SELECT user_id
FROM rc_users
WHERE status = 'paid');
Subqueries in the WHERE clause allow you to quickly grab information from one table and use it as a filtering condition in the main query like we are seeing here. So in this example, we’re grabbing the user_id of paid users from the user table and using that as a condition to find how many calls they placed in the calls table.
4. Window Functions
The 4th coding concept tested on data science coding interviews so far in 2021 is window functions. Window functions are heavily utilized in data science work. We’ve talked about them extensively in the past articles - SQL Window Functions. One example of this is a question from Twitch interview using Window functions.
Interview Question Date: January 2021
From users who had their first session as a viewer, how many streamer sessions have they had? Return the user id and number of sessions in descending order. In case there are users with the same number of sessions, order them by ascending user id.
Link to the question: https://platform.stratascratch.com/coding/2012-viewers-turned-streamers
Here’s the solution to this coding question:
SELECT user_id, count(*) n_sessions
FROM twitch_sessions
WHERE session_type = 'streamer'
AND user_id in
(SELECT user_id
FROM
(SELECT user_id,
session_type,
rank() OVER (PARTITION BY user_id
ORDER BY session_start) streams_order
FROM twitch_sessions) s1
WHERE streams_order =1
AND session_type = 'viewer')
GROUP BY user_id
ORDER BY n_sessions DESC,
user_id ASC
This one’s insane! And yes it did get asked in a data science interview. But don’t worry, in the upcoming weeks, StrataScratch will be stepping through the solution for each of the data science interview questions we just covered.
These questions take a lot of time to solve and they’re typically found in the later rounds of your data science interviews because there is so much code and it just takes time to talk through it.
5. Date Manipulations
The last technical concept of data science coding interview questions is date manipulations. This one is always asked on data science coding interviews because everyone in analytics from an analyst to a data scientist works with dates. It would be weird not to. Because you’re usually given a complete day, year-month-day-timestamp, but asked to aggregate by weeks, months, or years, you’ll need to know how to manipulate dates.
Here is one example from Postmates Interview with Date manipulations
Interview Question Date: January 2021
Which hour has the highest average order volume per day? Your output should have the hour which satisfies that condition, and average order volume.
Link to the question: https://platform.stratascratch.com/coding/2014-hour-with-the-highest-order-volume
And here’s the solution to this question:
WITH SUMMARY AS
(SELECT HOUR,
avg(n_orders) avg_orders
FROM
(SELECT date_part('hour', order_timestamp_utc) AS HOUR,
order_timestamp_utc::date,
count(id) n_orders
FROM postmates_orders
GROUP BY date_part('hour', order_timestamp_utc),
order_timestamp_utc::date) sq
GROUP BY HOUR)
SELECT *
FROM SUMMARY
WHERE avg_orders =
(SELECT max(avg_orders)
FROM SUMMARY)
ORDER BY avg_orders DESC
This one requires you to extract the hour from the DateTime in order to find the highest order volume. So it’s just a matter of knowing these DateTime functions, like date_part().
Conclusion
So these are the 5 technical concepts that data science companies will most likely to test you on in your interviews in 2021. Did you think these data science coding interview questions were difficult to answer? If so, we, at StrataScratch, will step through all the solutions in the upcoming weeks.
Remember, these are coding interview questions that were asked in the first few weeks/months of 2021, so they’re definitely requiring you to master these 5 technical concepts.