Facebook Data Science Interview Questions
Recent Facebook data science interview questions solved using Python
Facebook controls some of the top social networks across the world. Besides the eponymous app, it also offers Messenger, Instagram, WhatsApp, Oculus, and Giphy, among others. Along with Google, Apple, Microsoft, and Amazon, it is considered one of the Big Five companies in U.S. information technology and has a market cap of over US$1 trillion.
Data Science Roles at Facebook
Data Scientists work on a variety of large-scale quantitative research projects at Facebook. They conduct research to gain deep insights into how people interact with each other and with the world. Data Scientists at Facebook work with a variety of methods, including machine learning, field experiments, surveys, and information visualization, to accomplish their goals. The roles at Facebook will therefore vary based on the business unit and the function that you are interviewing for.
Concepts Tested in Facebook Data Science Interview Questions
The main areas and concepts tested in the Facebook Data Science Interview Questions include:
- Pandas
- groupby
- Indexing and Slicing DataFrames
- Boolean Masking
- apply() method
You can practice these and more such Facebook data science interview questions on the StrataScratch platform and become interview ready.
Check out our previous article on the Facebook Interview Process for an insight into the whole process.
Facebook Data Science Interview Questions
Interview Question Date: July 2021
Meta/Facebook is developing a search algorithm that will allow users to search through their post history. You have been assigned to evaluate the performance of this algorithm.
We have a table with the user's search term, search result positions, and whether or not the user clicked on the search result.
Write a query that assigns ratings to the searches in the following way:
- If the search was not clicked for any term, assign the search rating=1
- If the search was clicked but the top position of the clicked terms was outside the top 3 positions, assign the search rating=2
- If the search was clicked and the top position of a clicked term was within the top 3 positions, assign the search rating=3
As a search ID can contain more than one search term, select the highest rating for that search ID. Output the search ID and its highest rating.
Example: The search_id 1 was clicked (clicked = 1) and its position is outside of the top 3 positions (search_results_position = 5), therefore its rating is 2.
Link to the question: https://platform.stratascratch.com/coding/10350-algorithm-performance
Dataset
search_id | search_term | clicked | search_results_position |
---|---|---|---|
1 | rabbit | 1 | 5 |
2 | airline | 1 | 4 |
2 | quality | 1 | 5 |
3 | hotel | 1 | 1 |
3 | scandal | 1 | 4 |
Assumptions
Since you typically will not have access to the underlying data in the data science interview, you will have to use a mixture of business logic and your understanding of data storage to make reasonable assumptions about the variables. You must ensure that your solution boundaries are reasonably well defined.
So let us try to figure out what the data might look like. Make sure you confirm the validity of your assumptions with the interviewer so that you do not veer away from the solution path. This will also give you a chance to showcase your ability to visualize tables and data structures. Most interviewers will be more than happy to help you at this stage.
Assumptions on the Data and the Table:
search_id: This appears to be the identifying field for the search. However, this may not be a unique key, since the problem also mentions: "As a search ID can contain more than one search term, select the highest rating for that search ID."
search_term: This is the search text entered by the user. For this problem, we can safely ignore this field.
clicked: This field appears to be an indicator of whether the user clicked on the search result. It is required for the final analysis. Further, since the data type of this field is int64, we might need to check with the interviewer regarding the values it can take.
search_results_position: This, too, is required for the final analysis and appears to be a field that denotes the rank of the query in search results.
Before we proceed to drafting a solution for this problem, it is highly recommended that you confirm your assumptions with the interviewer so that they can be refined and any edge cases handled in the solution.
Logic
The biggest challenge in this problem is to create the rating column. Once that is done, the query is relatively straightforward since the query parameters are already provided. Let us visualize this.
- We need to work with two columns: clicked and search_results_position.
- If clicked is not 1, then set the rating to 1
- Else, if the position is 3 or lower, set the rating to 3
- Else, set the rating to 2
- Once we have the rating, we can aggregate it on search_id, taking the highest rating.
Now that we have our logic, let us begin coding this in Python.
Solution
1. We start by creating the rating column as described above by applying a conditional statement on two columns. There are many ways to accomplish this. We look at two of the most efficient methods.
a) Boolean Mask: A Boolean mask applies a condition across the entire data frame and returns a Boolean value for each row. So we can create three Boolean masks, one for each of the three ratings.
# Import your libraries
import pandas as pd
# Start writing code
# Mask 01
fb_search_events['rating1'] = fb_search_events['clicked'] != 1
# Mask 02
fb_search_events['rating2'] = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] > 3)
# Mask 03
fb_search_events['rating3'] = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] <= 3)
Let’s see what the data looks like.
fb_search_events[['search_id', 'clicked', 'search_results_position','rating1', 'rating2', 'rating3']]
search_id | clicked | search_results_position | rating1 | rating2 | rating3 |
---|---|---|---|---|---|
1 | 1 | 5 | FALSE | TRUE | FALSE |
2 | 1 | 4 | FALSE | TRUE | FALSE |
2 | 1 | 5 | FALSE | TRUE | FALSE |
3 | 1 | 1 | FALSE | FALSE | TRUE |
3 | 1 | 4 | FALSE | TRUE | FALSE |
Let us verify that our masks are working fine. We will be checking if there are any overlaps in the ratings (there should not be any).
fb_search_events[['rating1', 'rating2', 'rating3', 'search_id']].groupby(
    by=['rating1', 'rating2', 'rating3'], as_index=False).count()
rating1 | rating2 | rating3 | search_id |
---|---|---|---|
FALSE | FALSE | TRUE | 25 |
FALSE | TRUE | FALSE | 20 |
TRUE | FALSE | FALSE | 30 |
b) Now that the masks are working fine, we can create the rating field. For this, we use the .loc indexer.
# Import your libraries
import pandas as pd
# Start writing code
# Mask 01
rating1 = fb_search_events['clicked'] != 1
# Mask 02
rating2 = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] > 3)
# Mask 03
rating3 = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] <= 3)
# Calculate Ratings
fb_search_events.loc[rating1, 'rating'] = 1
fb_search_events.loc[rating2, 'rating'] = 2
fb_search_events.loc[rating3, 'rating'] = 3
# Verify
fb_search_events[['search_id', 'clicked', 'search_results_position','rating']]
search_id | clicked | search_results_position | rating |
---|---|---|---|
1 | 1 | 5 | 2 |
2 | 1 | 4 | 2 |
2 | 1 | 5 | 2 |
3 | 1 | 1 | 3 |
3 | 1 | 4 | 2 |
c) The last step is to return the highest rating for each search ID. We group by search_id and take the max() of the rating column, and the solution to this interview question looks like this.
search_id | rating |
---|---|
1 | 2 |
2 | 2 |
3 | 3 |
5 | 3 |
6 | 3 |
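Under the hood, this final step is simply a groupby on search_id followed by max(). A minimal sketch, using a small sample frame constructed from the rows shown above with the rating column already computed, looks like this:

```python
import pandas as pd

# Small sample with the rating column already computed, as in the table above
fb_search_events = pd.DataFrame({
    'search_id': [1, 2, 2, 3, 3],
    'rating':    [2, 2, 2, 3, 2],
})

# Highest rating per search_id
result = fb_search_events.groupby('search_id', as_index=False)['rating'].max()
print(result)
```

Note that search_id 3 keeps rating 3 even though one of its rows has rating 2, because max() picks the highest rating within each group.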
2. Alternatively, we can use the apply() method in Pandas with a lambda function to do all this in one step.
a) The apply() method can be used to apply a function along an axis of a DataFrame. The lambda function is used to create a user-defined function on the fly.
# Import your libraries
import pandas as pd
# Start writing code
fb_search_events['rating'] = fb_search_events.apply(
    lambda x: 1 if x['clicked'] != 1
    else 3 if x['search_results_position'] <= 3
    else 2, axis=1)
# Verify
fb_search_events[['search_id', 'clicked', 'search_results_position','rating']]
search_id | clicked | search_results_position | rating |
---|---|---|---|
1 | 1 | 5 | 2 |
2 | 1 | 4 | 2 |
2 | 1 | 5 | 2 |
3 | 1 | 1 | 3 |
3 | 1 | 4 | 2 |
b) Once we have the rating field, we can easily summarize the data frame using the groupby() and max() methods.
The final code is given below.
# Import your libraries
import pandas as pd
# Start writing code
fb_search_events['rating'] = fb_search_events.apply(
    lambda x: 1 if x['clicked'] != 1
    else 3 if x['search_results_position'] <= 3
    else 2, axis=1)
result = fb_search_events.groupby('search_id')['rating'].max().reset_index()
search_id | rating |
---|---|
1 | 2 |
2 | 2 |
3 | 3 |
5 | 3 |
6 | 3 |
Optimization
NumPy forms the basis of the Pandas library. Both libraries are specifically designed to perform vectorized operations very quickly. In simple terms, instead of iterating item by item using a for loop, NumPy, and by extension Pandas, can perform the same operation over an entire column in one go. Think of it as creating a formula in a spreadsheet and then copying it down the entire column.
For our solution:
- We used Boolean Masking in order to speed up filtering rows.
- Alternatively, we can use the apply() method with a lambda function to apply a conditional statement over the entire data frame, though note that a row-wise apply() loops in Python under the hood and is generally slower than Boolean masking on large datasets.
# Import your libraries
import pandas as pd
# Start writing code
# Mask 01
rating1 = fb_search_events['clicked'] != 1
# Mask 02
rating2 = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] > 3)
# Mask 03
rating3 = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] <= 3)
# Calculate Ratings
fb_search_events.loc[rating1, 'rating'] = 1
fb_search_events.loc[rating2, 'rating'] = 2
fb_search_events.loc[rating3, 'rating'] = 3
result = fb_search_events.groupby('search_id')['rating'].max().reset_index()
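The three masks can also be collapsed into a single vectorized call with numpy.select. This is not part of the original solution; it is a sketch on a small sample table constructed from the question's example rows:

```python
import numpy as np
import pandas as pd

# Small sample of the fb_search_events table from the question
fb_search_events = pd.DataFrame({
    'search_id': [1, 2, 2, 3, 3],
    'clicked': [1, 1, 1, 1, 1],
    'search_results_position': [5, 4, 5, 1, 4],
})

# Conditions are evaluated in order; the first match wins,
# so the default covers clicked results in the top 3 positions
conditions = [
    fb_search_events['clicked'] != 1,
    fb_search_events['search_results_position'] > 3,
]
fb_search_events['rating'] = np.select(conditions, [1, 2], default=3)

result = fb_search_events.groupby('search_id', as_index=False)['rating'].max()
print(result)
```

Because np.select operates on whole columns at once, it keeps the entire rating computation vectorized, just like the Boolean-mask approach.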
Additional Facebook Data Science Interview Questions
Facebook Data Science Interview Question #1: Find whether the number of seniors working at Facebook is higher than its number of USA-based employees
Interview Question Date: April 2020
Find whether the number of senior workers (i.e., more experienced) at Meta/Facebook is higher than the number of USA-based employees at Facebook/Meta. If the number of seniors is higher, then output 'More seniors'. Otherwise, output 'More USA-based'.
Link to the question: https://platform.stratascratch.com/coding/10065-find-whether-the-number-of-seniors-works-at-facebook-is-higher-than-its-number-of-usa-based-employees
Dataset
id | location | age | gender | is_senior |
---|---|---|---|---|
0 | USA | 24 | M | FALSE |
1 | USA | 31 | F | TRUE |
2 | USA | 29 | F | FALSE |
3 | USA | 33 | M | FALSE |
4 | USA | 36 | F | TRUE |
This is one of the easy-level Facebook Data Science Interview questions. We can solve this in multiple ways using the built-in len() function. We will then have to output the data in the form of a dataframe. For that, we need to create a new dataframe with the output. That can be done using the DataFrame() constructor in pandas.
Approach
- Find the count of seniors
- Find the count of employees based in the USA
- Compare the counts and output the result as a DataFrame.
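The steps above can be sketched as follows. The sample rows and the 'winner' output column name are assumptions for illustration; the actual data lives in the facebook_employees table:

```python
import pandas as pd

# Hypothetical slice of the facebook_employees table
facebook_employees = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'location': ['USA', 'USA', 'USA', 'UK', 'UK'],
    'is_senior': [False, True, False, False, True],
})

# Count the seniors and the USA-based employees
n_seniors = len(facebook_employees[facebook_employees['is_senior']])
n_usa = len(facebook_employees[facebook_employees['location'] == 'USA'])

# 'winner' is a hypothetical output column name
result = pd.DataFrame({
    'winner': ['More seniors' if n_seniors > n_usa else 'More USA-based']
})
print(result)
```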
Facebook Data Science Interview Question #2: Clicked Vs Non-Clicked Search Results
Clicked Vs Non-Clicked Search Results
The 'position' column represents the position of the search result, and the 'has_clicked' column represents whether the user clicked on this result. Calculate the percentage of clicked search results, compared to those not clicked, that were in the top 3 positions (with respect to the total number of records).
Link to the question: https://platform.stratascratch.com/coding/10288-clicked-vs-non-clicked-search-results
Dataset
This Facebook Data Science Interview question uses the same fb_search_events dataset we saw earlier. We can solve this problem using the built-in len() function. We will then have to output the data in the form of a dataframe. For that, we need to create a new dataframe with the output. That can be done using the DataFrame() constructor in pandas. One approach to solving this is presented below.
Approach
- Calculate the number of results that were clicked (filtering by the ‘has_clicked’ field) and in the top three results (filtering on the ‘position’ field) as a percentage of the total number of query results. This will be the clicked percentage.
- Calculate the not_clicked results in a similar manner by changing the filter on the has_clicked field.
- Output the clicked and not_clicked values in a Pandas data frame.
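A sketch of this approach, using hypothetical sample rows with the 'position' and 'has_clicked' columns described in the question:

```python
import pandas as pd

# Hypothetical sample of search results
fb_search_events = pd.DataFrame({
    'position': [1, 2, 5, 3, 8],
    'has_clicked': [True, False, True, True, False],
})

total = len(fb_search_events)
top3 = fb_search_events[fb_search_events['position'] <= 3]

# Percentages are taken with respect to the total number of records
clicked = 100 * len(top3[top3['has_clicked']]) / total
not_clicked = 100 * len(top3[~top3['has_clicked']]) / total

result = pd.DataFrame({'clicked': [clicked], 'not_clicked': [not_clicked]})
print(result)
```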
Facebook Data Science Interview Question #3: Popularity of Hack
Interview Question Date: March 2020
Meta/Facebook has developed a new programming language called Hack. To measure the popularity of Hack, they ran a survey with their employees. The survey included data on previous programming familiarity as well as the number of years of experience, age, gender, and, most importantly, satisfaction with Hack. Due to an error, location data was not collected, but your supervisor demands a report showing the average popularity of Hack by office location. Luckily, the user IDs of employees completing the surveys were stored. Based on the above, find the average popularity of Hack per office location. Output the location along with the average popularity.
Link to the question: https://platform.stratascratch.com/coding/10061-popularity-of-hack
Dataset
This problem uses the facebook_employees dataset that we used earlier, along with an additional facebook_hack_survey dataset.
id | location | age | gender | is_senior |
---|---|---|---|---|
0 | USA | 24 | M | FALSE |
1 | USA | 31 | F | TRUE |
2 | USA | 29 | F | FALSE |
3 | USA | 33 | M | FALSE |
4 | USA | 36 | F | TRUE |
employee_id | age | gender | popularity |
---|---|---|---|
0 | 24 | M | 6 |
1 | 31 | F | 4 |
2 | 29 | F | 0 |
3 | 33 | M | 7 |
4 | 36 | F | 6 |
This Facebook Data Science Interview question can be solved by merging the two data sets using the merge() and groupby() methods. One such approach is presented here.
Approach
- Merge the two datasets. The join keys are the id column in the facebook_employees dataset and the employee_id column in the facebook_hack_survey dataset.
- Calculate the average popularity, aggregating on the location column.
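The merge-and-aggregate approach can be sketched like this, using small hypothetical slices of the two tables:

```python
import pandas as pd

# Hypothetical slices of the two tables
facebook_employees = pd.DataFrame({
    'id': [0, 1, 2],
    'location': ['USA', 'USA', 'UK'],
})
facebook_hack_survey = pd.DataFrame({
    'employee_id': [0, 1, 2],
    'popularity': [6, 4, 8],
})

# Join key: id (employees) <-> employee_id (survey)
merged = facebook_employees.merge(
    facebook_hack_survey, left_on='id', right_on='employee_id')

# Average popularity per office location
result = merged.groupby('location', as_index=False)['popularity'].mean()
print(result)
```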
Check out our article Facebook Data Scientist Interview Questions to find more questions from Facebook interviews.
Conclusion
In this article, we have discussed in detail an approach to solving one of the real-life Facebook Data Science interview questions using Python. The question was not too tough. Besides getting the right answer, the final evaluation would also take into account your optimization skills and knowledge of specific Pandas features such as Boolean masking and the apply() method. Expertise in Python in general, and the Pandas library for Data Science in particular, comes only with practice solving a variety of problems. Join the StrataScratch platform and practice more such data science interview questions from Facebook and other top companies like Amazon, Apple, Microsoft, Netflix, and more.