Product Analytics Data Scientist Interview Questions

Product analytics interview questions are some of the most frequently asked by leading companies at technical interviews for data scientists and data analysts.

Large tech companies frequently assign data scientists and data analysts to teams responsible for specific products. That’s why it’s a common practice to ask product analytics questions during technical interviews. This article focuses on data scientist interview questions that involve calculating relevant product metrics from a dataset using programming languages such as Python or SQL.

The product metrics that candidates are asked to retrieve or calculate vary, ranging from simple volumes of events or interactions to more complex growth or retention rates. Product interview questions frequently require manipulating dates, writing conditions and aggregating data by users, user groups or timeframes.

Product Analytics Interview Question #1: Users by Average Session Time


DataFrame: facebook_web_log
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10352-users-by-avg-session-time

This is a really interesting question that is frequently asked at Meta data science interviews. The title of this product analytics interview question is "Users By Average Session Time" and we are asked to calculate each user's average session time. The question defines a session as the time difference between a page_load and a page_exit and, for simplicity, tells us to assume that a user has only 1 session per day; if there are multiple of the same events on that day, we should consider only the latest page_load and the earliest page_exit.

This is a medium-level question because while it’s rather straightforward to get the durations of individual sessions from the events, it is harder to identify the valid sessions and take an average of the time differences.

Understanding the Data

The first step is always to understand the data. At a typical interview, you won’t see any actual data, only a schema with column names and datatypes. In this case, the situation is rather simple. We are only given one table, or DataFrame because we’re in Python, called facebook_web_log. It has three columns: user_id, timestamp and action.

Table: facebook_web_log
user_id | timestamp           | action
0       | 2019-04-25 13:30:15 | page_load
0       | 2019-04-25 13:30:18 | page_load
0       | 2019-04-25 13:30:40 | scroll_down
0       | 2019-04-25 13:30:45 | scroll_up
0       | 2019-04-25 13:31:10 | scroll_down

Based on the explanation given in the question, we can guess that two examples of the ‘action’ property are ‘page_load’ and ‘page_exit’. But we can’t rule out that this field takes other values as well. Maybe we also track whether a user scrolls up or down during their session, in which case the ‘action’ column could also contain values such as ‘scroll_down’ or ‘scroll_up’ - the kind of values we don’t need to solve the question but should still anticipate.

We can also see that the timestamp property is a datetime. This is important because it determines what functions we can use to manipulate these values. All in all, we can deduce that each row in this dataset represents one action that a certain user performed at a certain point in time.

Formulating the Approach

Next, we can formulate the approach or a few general steps that we’ll follow when writing the code:

1. The first step could be to extract only those rows from the dataset that we’ll actually need. In particular, we’re interested in the rows where the action is either ‘page_load’ or ‘page_exit’. We’re doing this because we anticipate that other actions may be stored in the dataset.

2. In the next step, it would be wise to create a table with all the individual sessions. This table would have 3 columns: one for user_id, then one for the timestamp of a page_load action and one for the timestamp of a page_exit action. We can do it by merging all page_load timestamps of one user with all page_exit timestamps of the same user.

The problem with merging the timestamps is that we’ll end up with some sessions that are not real. If a user has multiple sessions and we consider all possible combinations of their page_load and page_exit actions when merging, we’ll inevitably produce some rows where the page_load comes from one session and the page_exit comes from another session.

3. To detect and remove these cases, we can start by adding a condition that a page_exit should always happen after page_load.

This covers one case, but what about situations where page_exit happens after page_load, but a few days later? Or where a user has several page_load actions in a single day and only one page_exit? The question actually gives us a hint here: if there are multiple of the same events on a day, we should only consider the latest page_load and the earliest page_exit. So aggregating the data while selecting the latest page_load and earliest page_exit timestamps covers both of these cases.

But what should we aggregate the data by? If we only aggregate by user_id, then we’ll end up with one session per user, not necessarily a valid one. But there is one more hint in the question that could help us here. It says that a user has only 1 session per day. This means that each valid session can be uniquely identified by a user_id-date tuple and so we can aggregate the results by user_id and date.

4. Therefore, as the fourth step, let’s add a column with the date of each session, which we can extract from the page_load timestamp, and then aggregate the data, selecting the latest page_load and the earliest page_exit.

5. At this point, we’ll be left with the valid sessions of each user, together with their respective page_load and page_exit timestamps. From here, it’s rather straightforward to get to the final solution. We can start by calculating the difference between the two timestamps - this is the duration of the session.

6. But let’s remember that we’re interested in the average session duration of each user and each user can still have multiple sessions that happen on different days. To get to the solution, we can aggregate the data again, this time only by user_id and get the average of durations belonging to each user.

Writing the Code

Now, we can follow these steps to write the actual solution to this product analytics interview question using Python. Let’s start by importing some libraries that will come in useful. Pandas is the popular Python library for manipulating datasets, and NumPy offers, among other things, mathematical functions; we’ll specifically need it to take the average of session durations in the last step.

import pandas as pd
import numpy as np

The first step is to extract page_load and page_exit actions from the dataset. We can create two copies of the original dataset, one per action. To get a list of timestamps of page_load actions, let’s use the Pandas loc property. We can start with the original dataset facebook_web_log and apply the loc property to it. Then, let’s specify the condition, i.e. which rows should be returned: we want those rows where the value in the ‘action’ column of the original dataset is equal to ‘page_load’. Once we define which rows to output, we can also decide which columns to return. We only need user_id and timestamp, because ‘action’ will be the same for all the rows in this copy of the dataset.

import pandas as pd
import numpy as np

# Extract page_load and page_exit actions from the dataset
loads = facebook_web_log.loc[facebook_web_log['action'] == 'page_load', ['user_id', 'timestamp']]

When you run this code, you can see that there are only 2 columns left. These are actually all the rows where the action column was equal to page_load. Next, we can get a similar table for the page_exit actions by using the same loc property, only changing the value in the condition.

exits = facebook_web_log.loc[facebook_web_log['action'] == 'page_exit', ['user_id', 'timestamp']]

You can add this line of code to the code editor above and once you run it, you will see a set of rows from the original dataset where the action was equal to ‘page_exit’.

The next step is to identify the possible sessions of each user and, as mentioned already, we can achieve this by merging the two tables containing only page_loads and only page_exits. We can call this new dataset sessions and use the Pandas merge() function. Inside this function, we first specify which two DataFrames should be merged, in this case loads and exits. Next, we need to decide what kind of merge to perform. You may remember that in SQL we have different kinds of JOINs: left join, right join, inner join and so on. The same applies here; we can, for example, use a left merge. Finally, we also need to specify on which column or columns to join. This is the equivalent of the ON keyword in SQL, and the parameter in Pandas is also called on.

import pandas as pd
import numpy as np

# Extract page_load and page_exit actions from the dataset
loads = facebook_web_log.loc[facebook_web_log['action'] == 'page_load', ['user_id', 'timestamp']]
exits = facebook_web_log.loc[facebook_web_log['action'] == 'page_exit', ['user_id', 'timestamp']]

# Identify possible sessions of each user
sessions = pd.merge(loads, exits, how='left', on='user_id')

When running this code, you can see that we have a user_id, then a timestamp of the page_load action and a timestamp of a page_exit action. However, there are two issues here. First, we can see that user 2 has a page_load but no page_exit to match it with. This could simply be a mistake in the dataset, we don’t know, but we still need to deal with it. The interview question doesn’t specify what to do in such a case, but since we have no page_exit, it’s impossible to calculate the duration of such a session, so I propose we skip it. To avoid situations like this one, we can switch from a left merge to an inner merge; the latter only returns rows where both timestamps exist. In the code above, change how='left' to how='inner' to see how the result changes.

And the second issue is that the column names timestamp_x and timestamp_y are quite confusing. To make it clearer which corresponds to page_load and which to page_exit actions, we can add one more parameter to our merge function. It’s called suffixes and allows us to replace the default _x and _y with other names. Importantly, the order of these names needs to correspond to the order in which we’re merging the tables. In the code above, add a new parameter suffixes=['_load', '_exit'] and see how the column names change.

Now it’s clear we’re dealing with user sessions. But as you can see, not all of these sessions are real or possible, because we merged all possible page_load timestamps with all possible page_exit timestamps. That’s why the next few steps are about filtering for valid sessions.

The first thing we can do here is to ensure that the page_exit happens after the page_load. To add this condition in Python, we select the portion of the ‘sessions’ table where values in the ‘timestamp_load’ column are smaller than, i.e. before, the values in the ‘timestamp_exit’ column.

import pandas as pd
import numpy as np

# Extract page_load and page_exit actions from the dataset
loads = facebook_web_log.loc[facebook_web_log['action'] == 'page_load', ['user_id', 'timestamp']]
exits = facebook_web_log.loc[facebook_web_log['action'] == 'page_exit', ['user_id', 'timestamp']]

# Identify possible sessions of each user
sessions = pd.merge(loads, exits, how='inner', on='user_id', suffixes=['_load', '_exit'])

# Filter valid sessions:
## page_load before page_exit
sessions = sessions[sessions['timestamp_load'] < sessions['timestamp_exit']]

Now all the sessions left in the table are possible to have occurred, but we can still see some sessions that span several days, or several sessions of the same user in a single day - these are clearly incorrect. To remove them, let’s follow the steps we defined and begin by adding a column with the date of the page_load timestamp. We can define a new column, let’s call it ‘date_load’, which will be almost equal to the value in the 'timestamp_load' column, except that we only want to extract the date and skip the time. We can use the dt.date accessor; thanks to the ‘dt’, we can apply it specifically to the datetime values inside the 'timestamp_load' column.

sessions['date_load'] = sessions['timestamp_load'].dt.date

You can add this line of code to the code editor above to see how the new column is added to the result. Having done this, we can now use each user_id-date_load pair to uniquely identify each session. So if there are several sessions of the same user on the same date, we know that only one of them is valid, and the question tells us that it will be the shortest one. To select these sessions, we can aggregate the data using the Pandas groupby() function and, in particular, group by the unique identifier of each session, the user_id-date_load pair - this way we’ll only be left with 1 row per session. Then we should define the aggregation; since we want to apply two functions at the same time, we’ll use the Pandas agg() function and say that we want, in each row, the latest, or maximum, timestamp_load and the earliest, or minimum, timestamp_exit. We can also add the Pandas reset_index() function - it’s good practice to add it whenever aggregating, because without it Pandas would treat the user_id and date_load columns as indices and wouldn’t display them - you can try removing it from the code editor below to see the difference.

import pandas as pd
import numpy as np

# Extract page_load and page_exit actions from the dataset
loads = facebook_web_log.loc[facebook_web_log['action'] == 'page_load', ['user_id', 'timestamp']]
exits = facebook_web_log.loc[facebook_web_log['action'] == 'page_exit', ['user_id', 'timestamp']]

# Identify possible sessions of each user
sessions = pd.merge(loads, exits, how='inner', on='user_id', suffixes=['_load', '_exit'])

# Filter valid sessions:
## page_load before page_exit
sessions = sessions[sessions['timestamp_load'] < sessions['timestamp_exit']]

## Add a column with the date of a page_load timestamp
sessions['date_load'] = sessions['timestamp_load'].dt.date

## aggregate data and select latest page_load and earliest page_exit
sessions = sessions.groupby(['user_id', 'date_load']).agg({'timestamp_load': 'max', 'timestamp_exit': 'min'}).reset_index()

Now, we only have the valid sessions of each user left and we can move on to calculating the duration of each session. This is rather straightforward because it’s enough to subtract the timestamp_load from the timestamp_exit column. And let’s store the result in a new column called duration.

sessions['duration'] = sessions['timestamp_exit'] - sessions['timestamp_load']

You can add this line of code to the code editor above to see how the new column is added to the result. The final step is to calculate the average session duration for each user. So we simply want to take all the durations of each user and return their average. At first glance, we should be able to aggregate the data by the user_id column, because we want to have one row per user, and take the average, or mean, of the duration column.

result = sessions.groupby('user_id').mean()['duration']

But unfortunately, this won’t work. That’s because we can only use the mean() function with numbers, and even though the duration column may seem numeric to us, for Python it’s of a datatype called timedelta - a special type for storing a difference between two timestamps. This datatype enables us to use some time-specific functions, such as converting the duration to hours or even days, but for us it’s not that useful, because it’s not that simple to take an average of it.

The way to solve this issue is to use the mean() function from the NumPy library - it’s more advanced and supports taking the average of time deltas. To be able to use a custom aggregate function in Pandas we can use the agg() function again.

result = sessions.groupby('user_id')['duration'].agg(np.mean())

However, this also won’t work just yet. That’s because, by definition, the mean() function from the NumPy library must take the values it should average as a parameter. So how can we pass all the durations of each user separately as a parameter? We can use Python’s lambda keyword, which lets us define a small anonymous function that is normally applied to each row separately.

In this case, it’s a bit more tricky because we’re already past the grouping step but before applying the aggregation function. This means that the lambda function will receive all durations of one user at once, then all durations of the next user, and so on. That’s why we can use it to pass the correct parameter to the mean() function, and the mean() function will take the average of all durations of one user at once. Don’t forget the reset_index() function like last time, and this should produce the expected result!

import pandas as pd
import numpy as np

# Extract page_load and page_exit actions from the dataset
loads = facebook_web_log.loc[facebook_web_log['action'] == 'page_load', ['user_id', 'timestamp']]
exits = facebook_web_log.loc[facebook_web_log['action'] == 'page_exit', ['user_id', 'timestamp']]

# Identify possible sessions of each user
sessions = pd.merge(loads, exits, how='inner', on='user_id', suffixes=['_load', '_exit'])

# Filter valid sessions:
## page_load before page_exit
sessions = sessions[sessions['timestamp_load'] < sessions['timestamp_exit']]

## Add a column with the date of a page_load timestamp
sessions['date_load'] = sessions['timestamp_load'].dt.date

## aggregate data and select latest page_load and earliest page_exit
sessions = sessions.groupby(['user_id', 'date_load']).agg({'timestamp_load': 'max', 'timestamp_exit': 'min'}).reset_index()

# Calculate the duration of the session
sessions['duration'] = sessions['timestamp_exit'] - sessions['timestamp_load']

# Aggregate to get average duration by user
result = sessions.groupby('user_id')['duration'].agg(lambda x: np.mean(x)).reset_index()

When you run this code, there are only 2 rows, one per user, and the second column contains the average session duration, taken from all the sessions of that user. This is the solution to this product analytics interview question.
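As a side note, a hedged alternative to the NumPy approach is to convert the timedelta durations to plain seconds with the dt.total_seconds() accessor, after which the ordinary mean() aggregation works directly. This is only a sketch, not the platform’s official solution, and note that it returns the duration as a number of seconds rather than a timedelta:

# Alternative sketch: convert timedeltas to float seconds, then average
sessions['duration_sec'] = sessions['duration'].dt.total_seconds()
result = sessions.groupby('user_id')['duration_sec'].mean().reset_index()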

Product Analytics Interview Question #2: Gender With Generous Reviews

This is an easy question that was asked in data science interviews at Airbnb. It’s similar to the first question because we’re also being asked to find an average value per person. This time, there are no dates involved so the process is much easier.


DataFrames: airbnb_reviews, airbnb_guests
Expected Output Type: pandas.Series

Link to the question: https://platform.stratascratch.com/coding/10149-gender-with-generous-reviews

We are being asked to write a query to find which gender gives a higher average review score when writing reviews as guests. The hint here is to use the `from_type` column to identify guest reviews. The question also instructs us to output the gender and their average review score.

Table: airbnb_reviews
from_user | to_user | from_type | to_type | review_score
4         | 2       | guest     | host    | 3
1         | 2       | host      | guest   | 5
8         | 8       | guest     | host    | 9
4         | 0       | guest     | host    | 6
7         | 1       | host      | guest   | 2
Table: airbnb_guests
guest_id | nationality | gender | age
0        | Mali        | M      | 21
1        | China       | F      | 23
2        | Mali        | F      | 27
3        | Australia   | F      | 24
4        | Luxembourg  | M      | 19

This product analytics interview question can be solved by following these general steps (a minimal code sketch follows the list):

  1. Perform an inner join using pd.merge(dataframe1, dataframe2, on=common_key);
  2. Filter the relevant column/s from the DataFrame using [column_name], then select the rows where from_type equals 'guest' using ==;
  3. Use .groupby(column_name) on gender to group the DataFrame by the specified column and use mean() to get the average per group; convert the resulting object to a DataFrame using to_frame('column_name');
  4. Select the gender with the highest average score using max() and ['column_name'] to return only the gender column.
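Following these steps, a minimal sketch might look like the code below. The join keys (from_user matching guest_id) and the column name avg_review_score are assumptions based on the schema above, not the platform’s official solution:

import pandas as pd

# Join reviews to guest profiles; from_user is assumed to match guest_id
merged = pd.merge(airbnb_reviews, airbnb_guests, left_on='from_user', right_on='guest_id')

# Keep only the reviews written by guests
guest_reviews = merged[merged['from_type'] == 'guest']

# Average review score per gender
avg_scores = guest_reviews.groupby('gender')['review_score'].mean().to_frame('avg_review_score').reset_index()

# Return the gender with the highest average score
result = avg_scores[avg_scores['avg_review_score'] == avg_scores['avg_review_score'].max()]['gender']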

Product Analytics Interview Question #3: Lowest Priced Orders

This next question comes from Amazon and also involves aggregating the dataset by users, this time using another aggregation function because we’re no longer interested in the average value but rather the lowest amount.


DataFrames: customers, orders
Expected Output Type: pandas.Series

Link to the question: https://platform.stratascratch.com/coding/9912-lowest-priced-orders

We are being asked to find the lowest order cost of each customer and output the customer id along with the first name and the lowest order price.

Table: customers
id | first_name | last_name | city          | address | phone_number
8  | John       | Joseph    | San Francisco |         | 928-386-8164
7  | Jill       | Michael   | Austin        |         | 813-297-0692
4  | William    | Daniel    | Denver        |         | 813-368-1200
5  | Henry      | Jackson   | Miami         |         | 808-601-7513
13 | Emma       | Isaac     | Miami         |         | 808-690-5201
Table: orders
id | cust_id | order_date | order_details | total_order_cost
1  | 3       | 2019-03-04 | Coat          | 100
2  | 3       | 2019-03-01 | Shoes         | 80
3  | 3       | 2019-03-07 | Skirt         | 30
4  | 7       | 2019-02-01 | Coat          | 25
5  | 7       | 2019-03-10 | Shoes         | 80

In this case, there are only 2 main steps that need to be completed (a minimal sketch follows the list):

  1. Perform an inner join on orders and customers using pd.merge(dataframe1, dataframe2, on=common_table_keys);
  2. Use .groupby(column_name) to group the DataFrame by the specified columns, then use min() to get the lowest value per group.
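A minimal sketch of these two steps, assuming customers.id matches orders.cust_id as the join key; the exact output shape may still need adjusting to the expected format:

import pandas as pd

# Join customers to their orders on the customer id
merged = pd.merge(customers, orders, left_on='id', right_on='cust_id')

# Lowest order cost per customer, keeping the id and the first name
result = merged.groupby(['cust_id', 'first_name'])['total_order_cost'].min().reset_index()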

Product Analytics Interview Question #4: Most Active Users On Messenger

This is a more difficult question that was also asked by Meta at their interviews for data scientists and data analysts. In this case, aggregating the data and counting events for each user is not enough - there is an extra step of finding the top 10 results.


DataFrame: fb_messages
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10295-most-active-users-on-messenger

We are being told that Messenger stores the number of messages between users in a table named 'fb_messages', in which 'user1' is the sender, 'user2' is the receiver, and 'msg_count' is the number of messages exchanged between them. We are then asked to find the top 10 most active users on Facebook Messenger by counting their total number of messages sent and received. The solution should output usernames and the count of the total messages they sent or received.

Table: fb_messages
id | date       | user1        | user2       | msg_count
1  | 2020-08-02 | kpena        | scottmartin | 2
2  | 2020-08-02 | misty19      | srogers     | 2
3  | 2020-08-02 | jerome75     | craig23     | 3
4  | 2020-08-02 | taylorhoward | johnmccann  | 8
5  | 2020-08-02 | wangdenise   | sgoodman    | 2

The steps to follow for this product analytics interview question are as follows (a minimal sketch follows the list):

  1. Concatenate two slices of the DataFrame;
  2. Aggregate msg_count by user using the groupby function;
  3. Sort users by their total number of messages in descending order.
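A minimal sketch of this approach, assuming the column names from the schema above; the name username is introduced here purely for illustration:

import pandas as pd

# Slice of messages sent: user1 with the message count
sent = fb_messages[['user1', 'msg_count']].rename(columns={'user1': 'username'})

# Slice of messages received: user2 with the same count
received = fb_messages[['user2', 'msg_count']].rename(columns={'user2': 'username'})

# Stack both slices, total the messages per user and keep the top 10
totals = pd.concat([sent, received]).groupby('username')['msg_count'].sum().reset_index()
result = totals.sort_values('msg_count', ascending=False).head(10)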

Product Analytics Interview Question #5: User Growth Rate

This product analytics question comes from Salesforce and concerns the user growth rate. Solving this question is more difficult because the data need to be aggregated separately for each month and then merged together to calculate the rate.


DataFrame: sf_events
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/2052-user-growth-rate

We are asked to find the growth rate of active users from Dec 2020 to Jan 2021 for each account. The growth rate is defined as the number of users in January 2021 divided by the number of users in December 2020. We should output the account_id and the growth rate.

Table: sf_events
date       | account_id | user_id
2021-01-01 | A1         | U1
2021-01-01 | A1         | U2
2021-01-06 | A1         | U3
2021-01-02 | A1         | U1
2020-12-24 | A1         | U2

As mentioned, this question can be solved by following two main aggregation steps and then adjusting the output to the format required by the question (a minimal sketch follows the list):

  1. Calculate the total distinct number of users for both Jan 2021 and Dec 2020, grouping the data by account_id;
  2. Divide the two totals to get the growth rate;
  3. Output account_id and the calculated growth rate.
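A minimal sketch of these steps; converting the date column with pd.to_datetime() is an assumption in case it’s stored as text, and the names dec_users and jan_users are introduced for illustration:

import pandas as pd

# Make sure dates are datetimes, then bucket each event into a month
sf_events['date'] = pd.to_datetime(sf_events['date'])
sf_events['month'] = sf_events['date'].dt.to_period('M')

# Distinct active users per account in each of the two months
dec_users = sf_events[sf_events['month'] == pd.Period('2020-12', freq='M')].groupby('account_id')['user_id'].nunique()
jan_users = sf_events[sf_events['month'] == pd.Period('2021-01', freq='M')].groupby('account_id')['user_id'].nunique()

# Dividing the two Series aligns them on account_id
result = (jan_users / dec_users).rename('growth_rate').reset_index()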

Conclusion

In this article, we have explained in detail the solution to a product analytics interview question from Meta, followed by 4 examples of similar questions of varying difficulty levels. This should have given you an idea of the types of product analytics interview questions asked at data science interviews and an understanding of how to approach and solve them.

Feel free to browse other product-related interview questions on the StrataScratch platform, such as the interview questions for the product analyst position. Use the ‘Approach Hints’ functionality to display the general steps for solving the problems, and discuss your code with other users to get more insights and feedback. After practicing on real questions, you should be properly equipped to face the product analytics problems at your interview!
