Product Analytics Data Scientist Interview Questions

Product analytics interview questions are some of the most frequently asked by leading companies at technical interviews for data scientists and data analysts.

Large tech companies frequently assign data scientists and data analysts to teams responsible for specific products. That’s why it’s a common practice to ask product analytics questions during technical interviews. This article focuses on data scientist interview questions that involve calculating relevant product metrics from a dataset using programming languages such as Python or SQL.

The product metrics that candidates are asked to retrieve or calculate vary, ranging from simple volumes of events or interactions to more complex growth or retention rates. Product interview questions frequently require manipulating dates, writing conditions and aggregating data by users, user groups or timeframes.

Product Analytics Interview Question #1: Users by Average Session Time


DataFrame: facebook_web_log
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10352-users-by-avg-session-time

This is a really interesting question that is frequently asked at Meta data science interviews. The title of this product analytics interview question is "Users By Average Session Time" and we are asked to calculate each user's average session time. The question defines a session as the time difference between a page_load and a page_exit and, for simplicity, tells us to assume that a user has only 1 session per day; if there are multiple of the same events on that day, we should consider only the latest page_load and the earliest page_exit.

This is a medium-level question because while it’s rather straightforward to get the durations of individual sessions from the events, it is harder to identify the valid sessions and take an average of the time differences.

Understanding the Data

The first step is always to understand the data. At a typical interview, you won’t see any actual data, only a schema with column names and datatypes. In this case, the situation is rather simple. We are only given one table, or DataFrame because we’re in Python, called facebook_web_log. It has three columns: user_id, timestamp and action.

Table: facebook_web_log
user_id | timestamp           | action
0       | 2019-04-25 13:30:15 | page_load
0       | 2019-04-25 13:30:18 | page_load
0       | 2019-04-25 13:30:40 | scroll_down
0       | 2019-04-25 13:30:45 | scroll_up
0       | 2019-04-25 13:31:10 | scroll_down

Based on the explanation given in the question, we can guess that two examples of the ‘action’ property are ‘page_load’ and ‘page_exit’. But we can’t rule out that this field takes other values as well. Maybe we also track whether a user scrolls up or down during their session, in which case the ‘action’ column could also contain values such as ‘scroll_down’ or ‘scroll_up’ - the kind of values we don’t need to solve the question but should still anticipate.

We can also see that the timestamp property is a datetime. This is important because it determines what functions we can use to manipulate these values. All in all, we can deduce that each row in this dataset represents one action that a certain user performed at a certain point in time.

Formulating the Approach

Next, we can formulate the approach or a few general steps that we’ll follow when writing the code:

1. The first step could be to extract only those rows from the dataset that we’ll actually need. In particular, we’re interested in the rows where the action is either ‘page_load’ or ‘page_exit’. We’re doing this because we anticipate that other actions may be stored in the dataset.

2. In the next step, it would be wise to create a table with all the individual sessions. This table would have 3 columns: one for user_id, then one for the timestamp of a page_load action and one for the timestamp of a page_exit action. We can do it by merging all page_load timestamps of one user with all page_exit timestamps of the same user.

The problem with merging the timestamps is that we’ll end up with some sessions that are not real. If a user has multiple sessions and we consider all possible combinations of their page_load and page_exit actions when merging, we’ll inevitably produce some rows where the page_load comes from one session and the page_exit comes from another session.

3. To detect and remove these cases, we can start by adding a condition that a page_exit should always happen after page_load.

This covers one case, but what about situations where page_exit happens after page_load, but a few days later? Or where a user has several page_load actions in a single day and only one page_exit? The question actually gives us a hint here: if there are multiple of the same events on a day, we should only consider the latest page_load and the earliest page_exit. So aggregating the data while selecting the latest page_load and earliest page_exit timestamps covers both of these cases.

But what should we aggregate the data by? If we only aggregate by user_id, then we’ll end up with one session per user, not necessarily a valid one. But there is one more hint in the question that could help us here. It says that a user has only 1 session per day. This means that each valid session can be uniquely identified by a user_id-date tuple and so we can aggregate the results by user_id and date.

4. Therefore, as the fourth step, let’s add a column with the date of each session, which we can extract from the page_load timestamp, and then aggregate the data, selecting the latest page_load and the earliest page_exit.

5. At this point, we’ll be left with the valid sessions of each user, together with their respective page_load and page_exit timestamps. From here, it’s rather straightforward to get to the final solution. We can start by calculating the difference between the two timestamps - this is the duration of the session.

6. But let’s remember that we’re interested in the average session duration of each user and each user can still have multiple sessions that happen on different days. To get to the solution, we can aggregate the data again, this time only by user_id and get the average of durations belonging to each user.

Writing the Code

Now, we can follow these steps to write the actual solution to this product analytics interview question using Python. Let’s start by importing some libraries that will come in useful. Pandas is the popular Python library for manipulating datasets, and NumPy offers, among other things, mathematical functions; we’ll specifically need it to take the average of session durations in the last step.

import pandas as pd
import numpy as np

The first step is to extract page_load and page_exit actions from the dataset. We can create two copies of the original dataset, one per action. To get a list of timestamps of page_load actions, let’s use the Pandas loc property. We can start with the original dataset facebook_web_log and apply the loc property to it. Then, let’s specify the condition, i.e. which rows should be returned: we want those rows where the value in the ‘action’ column of the original dataset is equal to ‘page_load’. Once we define which rows to output, we can also decide which columns to return. We only need user_id and timestamp, because ‘action’ will be the same for all the rows in this copy of the dataset.

import pandas as pd
import numpy as np

# Extract page_load and page_exit actions from the dataset
loads = facebook_web_log.loc[facebook_web_log['action'] == 'page_load', ['user_id', 'timestamp']]

When you run this code, you can see that there are only 2 columns left. These are actually all the rows where the action column was equal to page_load. Next, we can get a similar table for the page_exit actions by using the same loc property, only changing the value in the condition.

exits = facebook_web_log.loc[facebook_web_log['action'] == 'page_exit', ['user_id', 'timestamp']]

You can add this line of code to the code editor above and once you run it, you will see a set of rows from the original dataset where the action was equal to ‘page_exit’.

The next step is to identify the possible sessions of each user and, as mentioned already, we can achieve this by merging the two tables containing only page_loads and only page_exits. We can call this new dataset sessions and use the Pandas merge() function. Inside this function, we first specify which two DataFrames should be merged, in this case loads and exits. Next, we need to decide what kind of merge to perform. You may remember that in SQL we have different kinds of JOINs: left join, right join, inner join and so on. The same applies here; we can, for example, use a left merge. Finally, we also need to specify on which column or columns to join. This is the equivalent of the ON keyword in SQL, and the parameter in Pandas is also called on.

import pandas as pd
import numpy as np

# Extract page_load and page_exit actions from the dataset
loads = facebook_web_log.loc[facebook_web_log['action'] == 'page_load', ['user_id', 'timestamp']]
exits = facebook_web_log.loc[facebook_web_log['action'] == 'page_exit', ['user_id', 'timestamp']]

# Identify possible sessions of each user
sessions = pd.merge(loads, exits, how='left', on='user_id')

When running this code, you can see that we have a user_id, then a timestamp of the page_load action and a timestamp of a page_exit action. However, there are two issues here. First, we can see that user 2 has a page_load but no page_exit to match it with. This could simply be a mistake in the dataset, we don’t know, but we still need to deal with it. The interview question doesn’t specify what to do in such a case, but since we have no page_exit, it’s impossible to calculate the duration of such a session, so I propose we skip it. To avoid situations like this one, we can switch from a left merge to an inner merge; the latter only returns rows where both timestamps exist. In the code above, change how='left' to how='inner' to see how the result changes.

And the second issue is that the column names timestamp_x and timestamp_y are quite confusing. To make it clearer which corresponds to page_load and which to page_exit actions, we can add one more parameter to our merge function. It’s called suffixes and allows us to replace the default _x and _y with other names. Importantly, the order of these names needs to correspond to the order in which we’re merging the tables. In the code above, add a new parameter suffixes=['_load', '_exit'] and see how the column names change.

Now it’s clear we’re dealing with user sessions. But as you can see, not all of these sessions are real or possible, because we merged all possible page_load timestamps with all possible page_exit timestamps. That’s why the next few steps are about filtering for valid sessions.

The first thing we can do here is to ensure that the page_exit happens after the page_load. To add this condition in Python, we select the portion of the ‘sessions’ table where values in the ‘timestamp_load’ column are smaller than, i.e. before, the values in the ‘timestamp_exit’ column.

import pandas as pd
import numpy as np

# Extract page_load and page_exit actions from the dataset
loads = facebook_web_log.loc[facebook_web_log['action'] == 'page_load', ['user_id', 'timestamp']]
exits = facebook_web_log.loc[facebook_web_log['action'] == 'page_exit', ['user_id', 'timestamp']]

# Identify possible sessions of each user
sessions = pd.merge(loads, exits, how='inner', on='user_id', suffixes=['_load', '_exit'])

# Filter valid sessions:
## page_load before page_exit
sessions = sessions[sessions['timestamp_load'] < sessions['timestamp_exit']]

Now all the sessions left in the table are possible to have occurred, but we can still see some sessions that span several days, or several sessions of the same user in a single day - these are clearly incorrect. To remove them, let’s follow the steps we defined and begin by adding a column with the date of the page_load timestamp. We can define a new column, let’s call it ‘date_load’, which will be almost equal to the value in the 'timestamp_load' column, except that we only want to extract the date and skip the time. We can use the dt.date accessor; thanks to the ‘dt’, we can apply it specifically to the datetime values inside the 'timestamp_load' column.

sessions['date_load'] = sessions['timestamp_load'].dt.date

You can add this line of code to the code editor above to see how the new column is added to the result. Having done this, we can now use each user_id-date_load pair to uniquely identify each session. So if there are several sessions of the same user on the same date, we know that only one of them is valid, and the question tells us that it will be the shortest one. To select these sessions, we can aggregate the data using the Pandas groupby() function and, in particular, group by the unique identifier of each session, the user_id-date_load pair - this way we’ll only be left with 1 row per session. Then we should define the aggregation; since we want to apply two functions at the same time, we’ll use the Pandas agg() function and say that we want, in each row, the latest, or maximum, timestamp_load and the earliest, or minimum, timestamp_exit. We can also add the Pandas reset_index() function - it’s good practice to add it whenever aggregating, because without it Pandas would treat the user_id and date_load columns as indices and wouldn’t display them - you can try removing it from the code editor below to see the difference.

import pandas as pd
import numpy as np

# Extract page_load and page_exit actions from the dataset
loads = facebook_web_log.loc[facebook_web_log['action'] == 'page_load', ['user_id', 'timestamp']]
exits = facebook_web_log.loc[facebook_web_log['action'] == 'page_exit', ['user_id', 'timestamp']]

# Identify possible sessions of each user
sessions = pd.merge(loads, exits, how='inner', on='user_id', suffixes=['_load', '_exit'])

# Filter valid sessions:
## page_load before page_exit
sessions = sessions[sessions['timestamp_load'] < sessions['timestamp_exit']]

## Add a column with the date of a page_load timestamp
sessions['date_load'] = sessions['timestamp_load'].dt.date

## aggregate data and select latest page_load and earliest page_exit
sessions = sessions.groupby(['user_id', 'date_load']).agg({'timestamp_load': 'max', 'timestamp_exit': 'min'}).reset_index()

Now, we only have the valid sessions of each user left and we can move on to calculating the duration of each session. This is rather straightforward because it’s enough to subtract the timestamp_load from the timestamp_exit column. And let’s store the result in a new column called duration.

sessions['duration'] = sessions['timestamp_exit'] - sessions['timestamp_load']

You can add this line of code to the code editor above to see how the new column is added to the result. The final step is to calculate the average session duration for each user. So we simply want to take all the durations of each user and return their average. At first glance, we should be able to aggregate the data by the user_id column, because we want to have one row per user, and take the average, or mean, of the duration column.

result = sessions.groupby('user_id').mean()['duration']

But unfortunately, this won’t work. That’s because we can only use the mean() function with numbers, and even though the duration column may seem numeric to us, for Python it’s of a datatype called timedelta - a special type for storing a difference between two timestamps. This datatype enables us to use some time-specific functions, such as converting the duration to hours or even days, but for us it’s not that useful, because it’s not that simple to take an average of it.

The way to solve this issue is to use the mean() function from the NumPy library - it’s more advanced and supports taking the average of time deltas. To be able to use a custom aggregate function in Pandas we can use the agg() function again.

result = sessions.groupby('user_id')['duration'].agg(np.mean())

However, this also won’t work just yet. That’s because, by definition, the mean() function from the NumPy library must take the values it should average as a parameter. So how can we pass all the durations of each user separately as a parameter? We can use Python’s lambda keyword, which lets us define a small anonymous function that is normally applied to each row separately.

In this case, it’s a bit more tricky because we’re already past the grouping step but before applying the aggregation function. This means that the lambda function will receive all durations of one user at once, then all durations of the next user, and so on. That’s why we can use it to pass the correct parameter to the mean() function, and the mean() function will take the average of all durations of one user at once. Don’t forget the reset_index() function like last time, and this should produce the expected result!

import pandas as pd
import numpy as np

# Extract page_load and page_exit actions from the dataset
loads = facebook_web_log.loc[facebook_web_log['action'] == 'page_load', ['user_id', 'timestamp']]
exits = facebook_web_log.loc[facebook_web_log['action'] == 'page_exit', ['user_id', 'timestamp']]

# Identify possible sessions of each user
sessions = pd.merge(loads, exits, how='inner', on='user_id', suffixes=['_load', '_exit'])

# Filter valid sessions:
## page_load before page_exit
sessions = sessions[sessions['timestamp_load'] < sessions['timestamp_exit']]

## Add a column with the date of a page_load timestamp
sessions['date_load'] = sessions['timestamp_load'].dt.date

## aggregate data and select latest page_load and earliest page_exit
sessions = sessions.groupby(['user_id', 'date_load']).agg({'timestamp_load': 'max', 'timestamp_exit': 'min'}).reset_index()

# Calculate the duration of the session
sessions['duration'] = sessions['timestamp_exit'] - sessions['timestamp_load']

# Aggregate to get average duration by user
result = sessions.groupby('user_id')['duration'].agg(lambda x: np.mean(x)).reset_index()

When you run this code, there are only 2 rows, one per user, and the second column contains the average session duration, taken from all the sessions of that user. This is the solution to this product analytics interview question.
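As a side note, a hedged alternative to the NumPy approach is to convert the timedelta durations to plain seconds with the dt.total_seconds() accessor, after which the ordinary mean() aggregation works directly. This is only a sketch, not the platform’s official solution, and note that it returns the duration as a number of seconds rather than a timedelta:

# Alternative sketch: convert timedeltas to float seconds, then average
sessions['duration_sec'] = sessions['duration'].dt.total_seconds()
result = sessions.groupby('user_id')['duration_sec'].mean().reset_index()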

Product Analytics Interview Question #2: Gender With Generous Reviews

This is an easy question that was asked in data science interviews at Airbnb. It’s similar to the first question because we’re also being asked to find an average value per person. This time, there are no dates involved so the process is much easier.


DataFrames: airbnb_reviews, airbnb_guests
Expected Output Type: pandas.Series

Link to the question: https://platform.stratascratch.com/coding/10149-gender-with-generous-reviews

We are being asked to write a query to find which gender gives a higher average review score when writing reviews as guests. The hint here is to use the `from_type` column to identify guest reviews. The question also instructs us to output the gender and their average review score.

Table: airbnb_reviews
from_user | to_user | from_type | to_type | review_score
4         | 2       | guest     | host    | 3
1         | 2       | host      | guest   | 5
8         | 8       | guest     | host    | 9
4         | 0       | guest     | host    | 6
7         | 1       | host      | guest   | 2
Table: airbnb_guests
guest_id | nationality | gender | age
0        | Mali        | M      | 21
1        | China       | F      | 23
2        | Mali        | F      | 27
3        | Australia   | F      | 24
4        | Luxembourg  | M      | 19

This product analytics interview question can be solved by following these general steps (a minimal code sketch follows the list):

  1. Perform an inner join using pd.merge(dataframe1, dataframe2, on=common_key);
  2. Filter the relevant column/s from the DataFrame using [column_name], then select the rows where from_type equals 'guest' using ==;
  3. Use .groupby(column_name) on gender to group the DataFrame by the specified column and use mean() to get the average per group; convert the resulting object to a DataFrame using to_frame('column_name');
  4. Select the gender with the highest average score using max() and ['column_name'] to return only the gender column.
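Following these steps, a minimal sketch might look like the code below. The join keys (from_user matching guest_id) and the column name avg_review_score are assumptions based on the schema above, not the platform’s official solution:

import pandas as pd

# Join reviews to guest profiles; from_user is assumed to match guest_id
merged = pd.merge(airbnb_reviews, airbnb_guests, left_on='from_user', right_on='guest_id')

# Keep only the reviews written by guests
guest_reviews = merged[merged['from_type'] == 'guest']

# Average review score per gender
avg_scores = guest_reviews.groupby('gender')['review_score'].mean().to_frame('avg_review_score').reset_index()

# Return the gender with the highest average score
result = avg_scores[avg_scores['avg_review_score'] == avg_scores['avg_review_score'].max()]['gender']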

Product Analytics Interview Question #3: Lowest Priced Orders

This next question comes from Amazon and also involves aggregating the dataset by users, this time using another aggregation function because we’re no longer interested in the average value but rather the lowest amount.


DataFrames: customers, orders
Expected Output Type: pandas.Series

Link to the question: https://platform.stratascratch.com/coding/9912-lowest-priced-orders

We are being asked to find the lowest order cost of each customer and output the customer id along with the first name and the lowest order price.

Table: customers
id | first_name | last_name | city          | address | phone_number
8  | John       | Joseph    | San Francisco |         | 928-386-8164
7  | Jill       | Michael   | Austin        |         | 813-297-0692
4  | William    | Daniel    | Denver        |         | 813-368-1200
5  | Henry      | Jackson   | Miami         |         | 808-601-7513
13 | Emma       | Isaac     | Miami         |         | 808-690-5201
Table: orders
id | cust_id | order_date | order_details | total_order_cost
1  | 3       | 2019-03-04 | Coat          | 100
2  | 3       | 2019-03-01 | Shoes         | 80
3  | 3       | 2019-03-07 | Skirt         | 30
4  | 7       | 2019-02-01 | Coat          | 25
5  | 7       | 2019-03-10 | Shoes         | 80

In this case, there are only 2 main steps that need to be completed (a minimal sketch follows the list):

  1. Perform an inner join on orders and customers using pd.merge(dataframe1, dataframe2, on=common_table_keys);
  2. Use .groupby(column_name) to group the DataFrame by the specified columns, then use min() to get the lowest value per group.
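A minimal sketch of these two steps, assuming customers.id matches orders.cust_id as the join key; the exact output shape may still need adjusting to the expected format:

import pandas as pd

# Join customers to their orders on the customer id
merged = pd.merge(customers, orders, left_on='id', right_on='cust_id')

# Lowest order cost per customer, keeping the id and the first name
result = merged.groupby(['cust_id', 'first_name'])['total_order_cost'].min().reset_index()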

Product Analytics Interview Question #4: Most Active Users On Messenger

This is a more difficult question that was also asked by Meta at their interviews for data scientists and data analysts. In this case, aggregating the data and counting events for each user is not enough - there is an extra step of finding the top 10 results.


DataFrame: fb_messages
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10295-most-active-users-on-messenger

We are being told that Messenger stores the number of messages between users in a table named 'fb_messages', in which 'user1' is the sender, 'user2' is the receiver, and 'msg_count' is the number of messages exchanged between them. We are then asked to find the top 10 most active users on Facebook Messenger by counting their total number of messages sent and received. The solution should output usernames and the count of the total messages they sent or received.

Table: fb_messages
id | date       | user1        | user2       | msg_count
1  | 2020-08-02 | kpena        | scottmartin | 2
2  | 2020-08-02 | misty19      | srogers     | 2
3  | 2020-08-02 | jerome75     | craig23     | 3
4  | 2020-08-02 | taylorhoward | johnmccann  | 8
5  | 2020-08-02 | wangdenise   | sgoodman    | 2

The steps to follow for this product analytics interview question are as follows (a minimal sketch follows the list):

  1. Concatenate two slices of the DataFrame;
  2. Aggregate msg_count by user using the groupby function;
  3. Sort users by their total number of messages in descending order.
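A minimal sketch of this approach, assuming the column names from the schema above; the name username is introduced here purely for illustration:

import pandas as pd

# Slice of messages sent: user1 with the message count
sent = fb_messages[['user1', 'msg_count']].rename(columns={'user1': 'username'})

# Slice of messages received: user2 with the same count
received = fb_messages[['user2', 'msg_count']].rename(columns={'user2': 'username'})

# Stack both slices, total the messages per user and keep the top 10
totals = pd.concat([sent, received]).groupby('username')['msg_count'].sum().reset_index()
result = totals.sort_values('msg_count', ascending=False).head(10)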

Product Analytics Interview Question #5: User Growth Rate

This product analytics question comes from Salesforce and concerns the user growth rate. Solving this question is more difficult because the data need to be aggregated separately for each month and then merged together to calculate the rate.


DataFrame: sf_events
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/2052-user-growth-rate

We are asked to find the growth rate of active users from Dec 2020 to Jan 2021 for each account. The growth rate is defined as the number of users in January 2021 divided by the number of users in December 2020. We should output the account_id and the growth rate.

Table: sf_events
date       | account_id | user_id
2021-01-01 | A1         | U1
2021-01-01 | A1         | U2
2021-01-06 | A1         | U3
2021-01-02 | A1         | U1
2020-12-24 | A1         | U2

As mentioned, this question can be solved by following two main aggregation steps and then adjusting the output to the format required by the question (a minimal sketch follows the list):

  1. Calculate the total distinct number of users for both Jan 2021 and Dec 2020, grouping the data by account_id;
  2. Divide the two totals to get the growth rate;
  3. Output account_id and the calculated growth rate.
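A minimal sketch of these steps; converting the date column with pd.to_datetime() is an assumption in case it’s stored as text, and the names dec_users and jan_users are introduced for illustration:

import pandas as pd

# Make sure dates are datetimes, then bucket each event into a month
sf_events['date'] = pd.to_datetime(sf_events['date'])
sf_events['month'] = sf_events['date'].dt.to_period('M')

# Distinct active users per account in each of the two months
dec_users = sf_events[sf_events['month'] == pd.Period('2020-12', freq='M')].groupby('account_id')['user_id'].nunique()
jan_users = sf_events[sf_events['month'] == pd.Period('2021-01', freq='M')].groupby('account_id')['user_id'].nunique()

# Dividing the two Series aligns them on account_id
result = (jan_users / dec_users).rename('growth_rate').reset_index()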

Conclusion

In this article, we have explained in detail the solution to a product analytics interview question from Meta, followed by 4 examples of similar questions of varying difficulty levels. This should have given you an idea of the types of product analytics interview questions asked at data science interviews and an understanding of how to approach and solve them.

Feel free to browse other product-related interview questions on the StrataScratch platform, such as the interview questions for the product analyst position. Use the ‘Approach Hints’ functionality to display the general steps for solving the problems, and discuss your code with other users to get more insights and feedback. After practicing on real questions, you should be properly equipped to face the product analytics problems at your interview!
