PySpark Interview Questions for Data Science Excellence
Navigating Through Essential PySpark Interview Questions from Basics to Advanced for Aspiring Data Scientists
What sets apart a good data scientist from a great one? It's not just about knowing different tools and techniques; it's about understanding how and when to use them.
PySpark, integrating Python and Apache Spark, stands as a crucial tool in modern data science. Its importance in processing large datasets and enabling distributed computing is undeniable.
In this article, I will introduce you to a range of PySpark interview questions, from basic to advanced, so you can see how and when to use PySpark in real life by testing yourself against them. Buckle up and let’s get started!
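A quick practical note before we begin: on the StrataScratch platform, every dataset in these questions is already loaded for you as a Spark DataFrame (for example, airbnb_search_details). If you want to follow along locally, a minimal setup could look like the sketch below; the file path is a placeholder, not part of any question.
from pyspark.sql import SparkSession
# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("pyspark_interview_practice").getOrCreate()
# Load a dataset into a DataFrame; the CSV path here is hypothetical.
airbnb_search_details = spark.read.csv("data/airbnb_search_details.csv", header=True, inferSchema=True)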
Basic PySpark Interview Questions
Starting with the basics is important for building a strong foundation. This section of basic PySpark interview questions is designed for beginners and those new to PySpark.
PySpark Interview Questions #1: Find out search details for apartments designed for a sole-person stay
This question focuses on extracting details of searches made by people looking for apartments suitable for a single person.
Find the search details made by people who searched for apartments designed for a single-person stay.
Link to this question : https://platform.stratascratch.com/coding/9615-find-out-search-details-for-apartments-designed-for-a-sole-person-stay?code_type=6
First, we need to find out which searches were for single-person apartments. Let’s break down the solution.
- Start with the dataset 'airbnb_search_details', which contains information about what people are searching for.
- Filter this data to find searches that meet two criteria: the accommodation is for one person ('accommodates' equals 1) and the property type is 'Apartment'.
- After applying these filters, convert the filtered data into a pandas dataframe. This is done for easier analysis and handling of the data.
In simple terms, we're identifying and listing the search details for apartments that are meant for only one person. Let’s see the code.
import pyspark.sql.functions as F
result = airbnb_search_details.filter((F.col('accommodates') == 1) & (F.col('property_type') == 'Apartment')).toPandas()
Here is the output.
id | price | property_type | room_type | amenities | accommodates | bathrooms | bed_type | cancellation_policy | cleaning_fee | city | host_identity_verified | host_response_rate | host_since | neighbourhood | number_of_reviews | review_scores_rating | zipcode | bedrooms | beds |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5059214 | 431.75 | Apartment | Private room | {TV,"Wireless Internet","Air conditioning",Kitchen,"Free parking on premises",Breakfast,Heating,"Smoke detector","Carbon monoxide detector","First aid kit","Fire extinguisher",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Laptop friendly workspace","Private living room"} | 1 | 3 | Real Bed | strict | FALSE | NYC | f | | 2014-03-14 00:00:00 | Harlem | 0 | | 10030 | 2 | 1 |
10923708 | 340.12 | Apartment | Private room | {TV,Internet,"Wireless Internet","Air conditioning",Kitchen,"Pets live on this property",Cat(s),"Buzzer/wireless intercom",Heating,"Family/kid friendly",Washer,"Smoke detector","Carbon monoxide detector","First aid kit","Fire extinguisher",Essentials} | 1 | 1 | Real Bed | strict | FALSE | NYC | t | 100% | 2014-06-30 00:00:00 | Harlem | 166 | 91 | 10031 | 1 | 1 |
1077375 | 400.73 | Apartment | Private room | {"Wireless Internet",Heating,"Family/kid friendly","Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Shampoo,Hangers,Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_50"} | 1 | 1 | Real Bed | moderate | TRUE | NYC | t | | 2015-04-04 00:00:00 | Sunset Park | 1 | 100 | 11220 | 1 | 1 |
13121821 | 501.06 | Apartment | Private room | {TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,Heating,"Smoke detector","First aid kit",Essentials,Hangers,"Hair dryer",Iron,"Laptop friendly workspace"} | 1 | 1 | Real Bed | flexible | FALSE | NYC | f | | 2014-09-20 00:00:00 | Park Slope | 0 | | 11215 | 1 | 1 |
19245819 | 424.85 | Apartment | Private room | {Internet,"Wireless Internet",Kitchen,"Pets live on this property",Dog(s),Washer,Dryer,"Smoke detector","Fire extinguisher"} | 1 | 1 | Real Bed | moderate | FALSE | SF | t | | 2010-03-16 00:00:00 | Mission District | 12 | 90 | 94110 | 1 | 1 |
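As a side note, the same filter can also be written as a SQL-style expression string, which is an equivalent variation you may prefer in an interview setting:
# Equivalent filter using a SQL expression string instead of column objects.
result = airbnb_search_details.filter("accommodates = 1 AND property_type = 'Apartment'").toPandas()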
PySpark Interview Questions #2: Users Activity Per Month Day
This question is about finding out how active users are on Facebook on different days of the month, measured by the number of posts made each day.
Interview Question Date: January 2021
Return a distribution of users activity per day of the month. By distribution we mean the number of posts per day of the month.
Link to this question : https://platform.stratascratch.com/coding/2006-users-activity-per-month-day?code_type=6
We are analyzing the number of Facebook posts made on each day of the month. Let’s break down the solution.
- Start with the 'facebook_posts' dataset, which records when posts are made.
- First, convert the 'post_date' to a standard date format.
- Then, group the posts by the day of the month they were made on.
- For each day, count the number of posts ('post_id'), which we label as 'user_activity'.
- Order the results by the day of the month.
- Finally, convert the PySpark dataframe into a pandas dataframe for easier analysis.
In short, we are figuring out how many posts are made on each day of the month on Facebook and presenting this data in a clear and organized way. Let’s see the code.
import pyspark.sql.functions as F
result = facebook_posts.withColumn('post_date', F.to_date(F.col('post_date'))) \
.groupBy(F.dayofmonth('post_date').alias('day')) \
.agg(F.count('post_id').alias('user_activity')) \
.orderBy('day') \
.toPandas()
Here is the output.
day | user_activity |
---|---|
1 | 3 |
2 | 3 |
PySpark Interview Questions #3: Customers Who Purchased the Same Product
This question asks us to identify customers who have bought the same furniture items and to provide details like the product ID, brand name, and the count of unique customers for each furniture item, arranged in order of popularity.
Interview Question Date: February 2023
In order to improve customer segmentation efforts for users interested in purchasing furniture, you have been asked to find customers who have purchased the same items of furniture.
Output the product_id, brand_name, unique customer IDs who purchased that product, and the count of unique customer IDs who purchased that product. Arrange the output in descending order with the highest count at the top.
Link to this question : https://platform.stratascratch.com/coding/2150-customers-who-purchased-the-same-product?code_type=6
We are looking to improve customer segmentation by finding out which furniture items are bought by the same customers. Let’s break down the solution.
- Start by merging two datasets: 'online_orders' (which contains customer orders) and 'online_products' (which contains product details). We link them using 'product_id'.
- Then, filter this merged data to include only furniture items.
- Create a list showing product ID, brand name, and customer ID, making sure to remove any duplicates.
- Next, group the data by product ID and count the distinct customer IDs for each product, labeling this count as 'unique_cust_no'.
- Then order this grouped data so that the products with the highest number of unique customers are at the top.
- Merge this sorted data back with our original furniture list.
- Finally, convert this PySpark dataframe into a pandas dataframe for easier handling and presentation.
In short, we are finding and listing furniture items based on the number of unique customers who bought them, showing the most popular items first. Let’s see the code.
import pyspark.sql.functions as F
merged = online_orders.join(online_products, on="product_id", how="inner")
merged = merged.filter(merged["product_class"] == "FURNITURE")
merged = merged.select("product_id", "brand_name", "customer_id").dropDuplicates()
unique_cust = (
merged.groupBy("product_id")
.agg(F.countDistinct("customer_id").alias("unique_cust_no"))
.orderBy(F.desc("unique_cust_no"))
)
result = merged.join(unique_cust, on="product_id", how="inner").orderBy(F.desc("unique_cust_no"))
result.toPandas()
Here is the output.
product_id | brand_name | customer_id | unique_cust_no |
---|---|---|---|
10 | American Home | 2 | 3 |
10 | American Home | 1 | 3 |
10 | American Home | 3 | 3 |
8 | Lucky Joe | 3 | 1 |
11 | American Home | 1 | 1 |
PySpark Interview Questions #4: Sorting Movies By Duration Time
This question requires organizing a list of movies based on their duration, with the longest movies shown first.
Interview Question Date: May 2023
You have been asked to sort movies according to their duration in descending order.
Your output should contain all columns sorted by the movie duration in the given dataset.
Link to this question : https://platform.stratascratch.com/coding/2163-sorting-movies-by-duration-time?code_type=6
We need to arrange movies by their length, starting with the longest. Let’s break down the solution.
- Begin with the 'movie_catalogue', which includes details about various movies.
- Extract the duration in minutes from the 'duration' column. This involves finding the number in the text and converting it to a float (a number that can have decimals).
- Next, sort the entire movie catalogue based on these duration numbers, putting the longest movies at the top.
- After sorting, remove the 'movie_minutes' column, as it's no longer needed.
- Finally, convert the sorted data into a pandas dataframe.
In simple terms, we are putting the movies in order from longest to shortest based on their duration. Let’s see the code.
import pyspark.sql.functions as F
movie_catalogue = movie_catalogue.withColumn(
"movie_minutes",
F.regexp_extract(F.col("duration"), r"(\d+)", 1).cast("float")
)
result = movie_catalogue.orderBy(F.desc("movie_minutes")).drop("movie_minutes")
result.toPandas()
Here is the output.
show_id | title | release_year | rating | duration |
---|---|---|---|---|
s8083 | Star Wars: Episode VIII: The Last Jedi | 2017 | PG-13 | 152 min |
s6201 | Avengers: Infinity War | 2018 | PG-13 | 150 min |
s6326 | Black Panther | 2018 | PG-13 | 135 min |
s8052 | Solo: A Star Wars Story | 2018 | PG-13 | 135 min |
s8053 | Solo: A Star Wars Story (Spanish Version) | 2018 | PG-13 | 135 min |
PySpark Interview Questions #5: Find the date with the highest opening stock price
This question is about identifying the date when Apple's stock had its highest opening price.
Find the date when Apple's opening stock price reached its maximum
Link to this question : https://platform.stratascratch.com/coding/9613-find-the-date-with-the-highest-opening-stock-price?code_type=6
We are tasked with finding out the day when Apple's stock opened at its maximum value. Let’s break down the solution.
- Start with the 'aapl_historical_stock_price' dataset, which has Apple's stock price history.
- Modify the 'date' column to ensure it's in a string format showing only the year, month, and day.
- Next, find the maximum value in the 'open' column, which represents the opening stock price.
- Then filter the dataset to find the date(s) when this maximum opening price occurred.
- Finally, select only the 'date' column and convert the data to a pandas dataframe for easy viewing.
In summary, we are pinpointing the date when Apple's stock had its highest opening price and presenting this information in a straightforward manner. Let’s see the code.
df = aapl_historical_stock_price
# Keep only the YYYY-MM-DD portion of the date column (substr is 1-based).
df = df.withColumn('date', df['date'].cast('string').substr(1, 10))
# Keep the row(s) whose opening price equals the overall maximum, then select the date.
result = df.filter(df['open'] == df.selectExpr('max(open)').collect()[0][0]).select('date').toPandas()
Here is the output.
date |
---|
2012-09-21 |
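As an aside, the collect() call above pulls the maximum opening price onto the driver before filtering. A rough equivalent that avoids that round trip sorts by the opening price and keeps only the top row; note that, unlike the filter above, this variation returns a single date even if several days tie for the maximum.
import pyspark.sql.functions as F
# Sort by opening price in descending order and keep the date of the top row.
result = (aapl_historical_stock_price
          .withColumn('date', aapl_historical_stock_price['date'].cast('string').substr(1, 10))
          .orderBy(F.desc('open'))
          .limit(1)
          .select('date')
          .toPandas())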
Intermediate PySpark Interview Questions
Once the fundamentals are mastered, the next step is to tackle more complex scenarios. The Intermediate PySpark Interview Questions section is tailored to those who already have a basic understanding of PySpark.
PySpark Interview Questions #6: Find the first and last times the maximum score was awarded
This question asks us to find out the earliest and latest dates on which the highest score was given in Los Angeles restaurant health inspections.
Interview Question Date: April 2018
Find the first and last times the maximum score was awarded
Link to this question : https://platform.stratascratch.com/coding/9712-find-the-first-and-last-times-the-maximum-score-was-awarded?code_type=6
We are looking for the first and last instances when the maximum health inspection score was awarded to restaurants in Los Angeles. Let’s break down the solution.
- Start by identifying the highest score given in the 'los_angeles_restaurant_health_inspections' dataset.
- Next, ensure the 'activity_date' is in a date format for accurate comparison.
- Then find the earliest date ('first_time') when this maximum score was awarded by filtering for this score and selecting the minimum date.
- Similarly, find the latest date ('last_time') by selecting the maximum date for the same score.
- Combine these two dates to get a result showing both the first and last times this score was given.
- Finally, convert this combined data into a pandas dataframe for easy viewing.
In summary, we are identifying the first and last occurrences of the highest health inspection score awarded to restaurants in Los Angeles. Let’s see the code.
import pyspark.sql.functions as F
max_score = los_angeles_restaurant_health_inspections.select(F.max("score")).first()[0]
los_angeles_restaurant_health_inspections = los_angeles_restaurant_health_inspections.withColumn("activity_date", F.to_date("activity_date"))
first_time = los_angeles_restaurant_health_inspections.filter(los_angeles_restaurant_health_inspections["score"] == max_score).select(F.min("activity_date").alias("first_time"))
last_time = los_angeles_restaurant_health_inspections.filter(los_angeles_restaurant_health_inspections["score"] == max_score).select(F.max("activity_date").alias("last_time"))
result = first_time.crossJoin(last_time)
result.toPandas()
Here is the output.
first_time | last_time |
---|---|
2015-09-11 | 2018-03-16 |
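Since both dates come from the same filtered data, a slightly simpler variation computes them in a single aggregation and skips the cross join. This sketch reuses the max_score value and the date conversion from the solution above:
import pyspark.sql.functions as F
# Compute the earliest and latest dates of the maximum score in one aggregation.
result = (los_angeles_restaurant_health_inspections
          .filter(F.col('score') == max_score)
          .agg(F.min('activity_date').alias('first_time'),
               F.max('activity_date').alias('last_time')))
result.toPandas()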
PySpark Interview Questions #7: Account Registrations
This question requires us to calculate the number of account signups per month, showing the year and month along with the corresponding number of registrations.
Interview Question Date: August 2022
Find the number of account registrations according to the signup date. Output the year months (YYYY-MM) and their corresponding number of registrations.
Link to this question : https://platform.stratascratch.com/coding/2126-account-registrations?code_type=6
We are tasked with finding out how many accounts were registered each month. Let’s break down the solution.
- Begin with the 'noom_signups' dataset, which has data on when accounts were registered.
- Create a new column 'started_at_month' that formats the 'started_at' date to show only the year and month (in 'YYYY-MM' format).
- Next, group the data by this new 'started_at_month' column.
- Count the number of registrations for each month and rename this count as 'n_registrations'.
- Then sort the data by the month and year.
- Finally, convert this sorted data into a pandas dataframe for easy reading.
In simple terms, we are tallying the number of account signups for each month and displaying them in an organized and chronological manner. Let’s see the code.
import pyspark.sql.functions as F
noom_signups = noom_signups.withColumn('started_at_month', F.date_format('started_at', 'yyyy-MM'))
result = noom_signups.groupby('started_at_month').count().withColumnRenamed('count', 'n_registrations').orderBy('started_at_month')
result.toPandas()
Here is the output.
started_at_month | n_registrations |
---|---|
2018-10 | 1 |
2018-11 | 4 |
2018-12 | 1 |
2019-01 | 3 |
2019-02 | 3 |
PySpark Interview Questions #8: Process a Refund
This question asks for the calculation of the minimum, average, and maximum number of days it takes to process a refund for accounts opened since January 1, 2019, and to group these calculations by billing cycle in months.
Interview Question Date: August 2022
Calculate and display the minimum, average and the maximum number of days it takes to process a refund for accounts opened from January 1, 2019. Group by billing cycle in months.
Note: The time frame for a refund to be fully processed is from settled_at until refunded_at.
Link to this question : https://platform.stratascratch.com/coding/2125-process-a-refund?code_type=6
We need to analyze the duration of refund processing for accounts opened from 2019 onwards, grouped by their billing cycle duration. Let’s break down the solution.
- Begin by joining three datasets: 'noom_transactions', 'noom_signups', and 'noom_plans', linking them via 'signup_id' and 'plan_id'.
- Filter these combined datasets to include only transactions from accounts started on or after January 1, 2019.
- Calculate 'time_to_settle', the number of days between 'settled_at' and 'refunded_at'.
- Next, group the data by 'billing_cycle_in_months'.
- For each billing cycle, calculate the minimum, average, and maximum refund processing time.
- Sort these groups by the billing cycle length.
- Finally, convert this grouped and calculated data into a pandas dataframe for easier interpretation.
In summary, we're measuring how long refunds take for different billing cycles, starting from 2019, and presenting this data in an organized manner. Let’s see the code.
import pyspark.sql.functions as F
# Join transactions with signups and plans so each transaction carries its billing cycle.
transactions_signups = noom_transactions.join(noom_signups, on='signup_id')
transactions_signups_plans = transactions_signups.join(noom_plans, on='plan_id')
# Keep only accounts opened on or after January 1, 2019.
new_signups_transactions = transactions_signups_plans.filter(transactions_signups_plans['started_at'] >= '2019-01-01')
# Number of days between settlement and refund.
new_signups_transactions = new_signups_transactions.withColumn('time_to_settle', F.datediff(new_signups_transactions['refunded_at'], new_signups_transactions['settled_at']))
result = new_signups_transactions.groupby('billing_cycle_in_months').agg(
    F.min('time_to_settle').alias('_min'),
    F.mean('time_to_settle').alias('_mean'),
    F.max('time_to_settle').alias('_max')
).sort('billing_cycle_in_months')
result.toPandas()
Here is the output.
billing_cycle_in_months | _min | _mean | _max |
---|---|---|---|
1 | 4 | 10.44 | 21 |
6 | 6 | 9 | 14 |
12 | 6 | 9.5 | 13 |
PySpark Interview Questions #9: Highest Salary
This question requires us to identify the employee (or employees) who has the highest salary, and to display their first name and the amount of their salary.
You have been asked to find the employee with the highest salary. Output the worker's or workers' first name, as well as the salary.
Link to this question : https://platform.stratascratch.com/coding/9870-highest-salary?code_type=6
We need to find the employee with the highest salary in the dataset. Let’s break down the solution.
- Start with the 'worker' dataset, which includes details about employees and their salaries.
- First determine the highest salary in the dataset using the 'max' function.
- Next, filter the dataset to find the employee(s) who have this highest salary.
- Then select the 'first_name' and 'salary' columns to display.
- Finally, convert this information into a pandas dataframe to make it more readable.
In summary, we are identifying the employee(s) with the top salary and presenting their first name along with the salary amount. Let’s see the code.
import pyspark.sql.functions as F
result = worker.filter(F.col('salary') == worker.select(F.max('salary')).collect()[0][0]) \
.select('first_name', 'salary') \
.toPandas()
Here is the output.
first_name | salary |
---|---|
Amitah | 500000 |
Vivek | 500000 |
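For reference, the same result can be produced without collecting the maximum salary onto the driver, for example with a window function. This is only an alternative sketch over the same 'worker' DataFrame:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# Rank salaries from highest to lowest; every worker tied for the top salary gets rank 1.
# Note: a window with no partitionBy moves all rows to a single partition, which is fine for small data.
w = Window.orderBy(F.desc('salary'))
result = (worker
          .withColumn('salary_rank', F.dense_rank().over(w))
          .filter(F.col('salary_rank') == 1)
          .select('first_name', 'salary')
          .toPandas())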
PySpark Interview Questions #10: Find the average of inspection scores between 91 and 100
This question asks us to calculate the average (mean) of health inspection scores for Los Angeles restaurants that fall between 91 and 100, assuming that these scores are normally distributed.
Find the mean of inspection scores between 91 and 100, assuming that the scores are normally distributed.
Link to this question : https://platform.stratascratch.com/coding/9707-find-the-average-of-inspections-scores-between-91-and-100?code_type=6
We are finding the average score of restaurant health inspections that are in the range of 91 to 100. Let’s break down the solution.
- Start with the 'los_angeles_restaurant_health_inspections' dataset.
- Filter this data to include only the scores that are between 91 and 100.
- Next, calculate the average (mean) of these scores.
- Finally, convert this calculated average into a pandas dataframe for easy viewing and interpretation.
In simple terms, we are determining the average score of health inspections for Los Angeles restaurants, focusing on scores between 91 and 100. Let’s see the code.
import pyspark.sql.functions as F
score_between = los_angeles_restaurant_health_inspections.filter(F.col('score').between(91, 100))
result = score_between.select(F.mean('score')).toPandas()
result
Running this code returns a single row containing the mean of all inspection scores in that range.
Advanced PySpark Interview Questions
For those who have confidently navigated through the basics and intermediate levels, the Advanced PySpark Interview Questions section awaits. This is where complex, real-world data problems are addressed. It’s designed for individuals who are comfortable with PySpark and are looking to deepen their expertise.
PySpark Interview Questions #11: Find how the survivors are distributed by the gender and passenger classes
This question asks us to determine the distribution of Titanic survivors based on their gender and the class they traveled in. The classes are categorized into first, second, and third class based on the 'pclass' value.
Find how the survivors are distributed by gender and passenger class. Classes are categorized based on the pclass value as follows: pclass = 1: first_class, pclass = 2: second_class, pclass = 3: third_class. Output the sex along with the corresponding number of survivors for each class. HINT: each sex should be on a separate line, with one column holding the value of that sex and the other 3 columns holding the number of survivors for each of the 3 classes.
Link to this question : https://platform.stratascratch.com/coding/9882-find-how-the-survivors-are-distributed-by-the-gender-and-passenger-classes/official-solution?code_type=6
We need to find out how many survivors there were in each passenger class, broken down by gender. Let’s break down the solution.
- We start with the 'titanic' dataset and filter it to include only the records of survivors ('survived' == 1).
- We then group these survivor records by 'sex' and 'pclass' and count the number of survivors in each group.
- Next, we reorganize (pivot) this data so that each row represents a gender, and each column represents a class, showing the count of survivors in each category.
- We rename the class columns to 'first_class', 'second_class', and 'third_class' for clarity.
- Finally, we convert this pivoted data into a pandas dataframe, which will display each gender with the corresponding number of survivors in each of the three classes.
In summary, we are showcasing the number of Titanic survivors based on their gender and the class in which they were traveling. Let’s see the code.
import pyspark.sql.functions as F
survived = titanic.filter(titanic['survived'] == 1)
count = survived.groupby(['sex','pclass']).agg(F.count('*').alias('count'))
pivot = count.groupBy('sex').pivot('pclass').agg(F.sum('count'))
pivot = pivot.withColumnRenamed('1', 'first_class').withColumnRenamed('2', 'second_class').withColumnRenamed('3', 'third_class')
result = pivot.toPandas()
result
Here is the output.
sex | first_class | second_class | third_class |
---|---|---|---|
female | 7 | 9 | 15 |
male | 3 | 3 | 4 |
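One small detail worth mentioning in an interview: when the pivot values are known in advance, passing them to pivot() explicitly spares Spark the extra pass it otherwise needs to discover them. A sketch of that variation on the pivot line above:
# Passing the known pclass values avoids an extra scan to infer the pivot columns.
pivot = count.groupBy('sex').pivot('pclass', [1, 2, 3]).agg(F.sum('count'))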
PySpark Interview Questions #12: Consecutive Days
This question is about identifying users who were active for three or more consecutive days.
Interview Question Date: July 2021
Find all the users who were active for 3 consecutive days or more.
Link to this question : https://platform.stratascratch.com/coding/2054-consecutive-days?code_type=6
We need to find users who have been active for at least three days in a row. Let’s break down the solution.
- Start with the 'sf_events' dataset and remove any duplicate records.
- Then ensure the 'date' column is in a standard date format (YYYY-MM-DD).
- Next, assign a rank to each user's activity date. This rank is calculated within each user's set of dates, ordered chronologically.
- Create a new column 'consecutive_days' by subtracting the rank from the date. Because both the date and the rank increase by one on each consecutive day, every date in an unbroken streak maps to the same anchor value, which is what lets us spot consecutive runs.
- Then group the data by 'user_id' and 'consecutive_days', counting the number of records in each group.
- Filter these groups to keep only those where the count is three or more, indicating three or more consecutive days of activity.
- Finally, select the 'user_id' of these active users and convert the data into a pandas dataframe.
In summary, we are pinpointing users who were active for three consecutive days or more and listing their IDs. Let’s see the code.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df = sf_events.dropDuplicates()
# Make sure 'date' is a proper date type.
df = df.withColumn('date', F.to_date(df['date'], 'yyyy-MM-dd'))
# Rank each user's activity dates chronologically.
df = df.withColumn('rank', F.row_number().over(Window.partitionBy('user_id').orderBy('date')))
# Dates in an unbroken streak all map to the same anchor date after subtracting the rank.
df = df.withColumn('consecutive_days', F.date_sub(df['date'], df['rank'] - 1))
result = df.groupBy('user_id', 'consecutive_days').agg(F.count('*').alias('counter')).filter(F.col('counter') >= 3).select('user_id')
result.toPandas()
Here is the output.
user_id |
---|
U4 |
PySpark Interview Questions #13: Find all records with words that start with the letter 'g'
This question asks us to identify records from a dataset where either of two fields, 'words1' or 'words2', contains words starting with the letter 'g'.
Find all records with words that start with the letter 'g'.
Output words1 and words2 if any of them satisfies the condition.
Link to this question : https://platform.stratascratch.com/coding/9806-find-all-records-with-words-that-start-with-the-letter-g?code_type=6
We need to find records with words beginning with 'g' in either of two columns ('words1' or 'words2'). Let’s break down the solution.
- Begin with the 'google_word_lists' dataset.
- Apply filters to both 'words1' and 'words2' columns to check if any word starts with the letter 'g'. We use the regular expression ('rlike') function for this purpose. The expression checks for words starting with 'g' either at the beginning of the string or preceded by a comma, space, or other delimiters.
- After applying these filters, select the records that meet our criteria.
- Finally, convert these filtered records into a pandas dataframe for easy viewing and analysis.
In simple terms, we are finding and listing records where either the 'words1' or 'words2' field contains a word that starts with the letter 'g'. Let’s see the code.
import pyspark.sql.functions as F
# A word starts with 'g' if 'g' appears at the start of the string
# or immediately after a delimiter such as a comma or a space.
pattern = r'(^|[,\s])g'
result = google_word_lists.filter(
    F.col('words1').rlike(pattern) | F.col('words2').rlike(pattern)
)
result.toPandas()
Here is the output.
words1 | words2 |
---|---|
google,facebook,microsoft | flower,nature,sun |
sun,nature | google,apple |
beach,photo | facebook,green,orange |
PySpark Interview Questions #14: Top Teams In The Rio De Janeiro 2016 Olympics
This question asks us to identify the top three medal-winning teams for each event at the Rio De Janeiro 2016 Olympics, and display them as 'gold team', 'silver team', and 'bronze team' along with the number of medals they won. In case of a tie, the teams should be ordered alphabetically. If there is no team for a position, it should be labeled as 'No Team'.
Find the top 3 medal-winning teams by counting the total number of medals for each event in the Rio De Janeiro 2016 olympics. In case there is a tie, order the countries by name in ascending order. Output the event name along with the top 3 teams as the 'gold team', 'silver team', and 'bronze team', with the team name and the total medals under each column in format "{team} with {number of medals} medals". Replace NULLs with "No Team" string.
Link to this question : https://platform.stratascratch.com/coding/9960-top-teams-in-the-rio-de-janeiro-2016-olympics?code_type=6
We are analyzing the 2016 Rio De Janeiro Olympics data to determine the top three teams in terms of medal counts for each event. Let’s break down the solution.
- Start with the 'olympics_athletes_events' dataset and convert it to a pandas dataframe.
- Filter this data to include only the 2016 Olympics and events where medals were awarded.
- Then group the data by 'event' and 'team', counting the number of medals for each team in each event.
- For each event, rank the teams based on their medal counts and, in case of a tie, alphabetically by team name.
- Identify the top three teams for each event and label them as 'gold team', 'silver team', and 'bronze team', including the count of medals they won in a formatted string.
- Group this data by 'event' and aggregate the top teams in their respective positions.
- Replace any missing team positions with 'No Team'.
- The final output includes the event name and the top three teams with their medal counts.
In summary, we are showcasing the leading medal-winning teams in each event from the 2016 Olympics, labeled according to their rank and presented in an easy-to-read format. Let’s see the code.
import pandas as pd
olympics_athletes_events = olympics_athletes_events.toPandas()
# Keep only 2016 rows where a medal was actually awarded.
y_2016 = olympics_athletes_events[(olympics_athletes_events['year'] == 2016) & (olympics_athletes_events['medal'].notnull())]
# Count medals per (event, team) pair.
n_medal = y_2016.groupby(['event', 'team']).size().to_frame('medals_count').reset_index()
# Rank teams within each event by medal count, breaking ties alphabetically by team name.
n_medal['team_position'] = n_medal.sort_values(['medals_count', 'team'], ascending=[False, True]).groupby(['event']).cumcount() + 1
less_3_medals = n_medal[n_medal['team_position'] <= 3].copy()
less_3_medals.loc[less_3_medals['team_position'] == 1, 'gold_team'] = less_3_medals['team'] + " with " + less_3_medals['medals_count'].astype(str) + " medals"
less_3_medals.loc[less_3_medals['team_position'] == 2, 'silver_team'] = less_3_medals['team'] + " with " + less_3_medals['medals_count'].astype(str) + " medals"
less_3_medals.loc[less_3_medals['team_position'] == 3, 'bronze_team'] = less_3_medals['team'] + " with " + less_3_medals['medals_count'].astype(str) + " medals"
result = less_3_medals.groupby('event').agg({'gold_team': 'first', 'silver_team': 'first', 'bronze_team': 'first'}).reset_index().fillna('No Team')
result
Here is the output.
event | gold_team | silver_team | bronze_team |
---|---|---|---|
Archery Men's Individual | France with 1 medals | No Team | No Team |
Athletics Men's Marathon | Kenya with 1 medals | No Team | No Team |
Athletics Women's 100 metres | United States with 1 medals | No Team | No Team |
Basketball Women's Basketball | Spain with 1 medals | United States with 1 medals | No Team |
Boxing Women's Lightweight | Russia with 1 medals | No Team | No Team |
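Because this solution drops into pandas right away, it is also worth knowing how the same ranking logic could be sketched in PySpark itself, for example with a window function and a pivot. The version below is only an illustration and assumes 'olympics_athletes_events' is still a Spark DataFrame with the same column names:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# Keep 2016 rows where a medal was awarded, then count medals per (event, team).
counts = (olympics_athletes_events
          .filter((F.col('year') == 2016) & F.col('medal').isNotNull())
          .groupBy('event', 'team')
          .agg(F.count('*').alias('medals_count')))
# Rank teams within each event by medal count, breaking ties alphabetically by team name.
w = Window.partitionBy('event').orderBy(F.desc('medals_count'), F.asc('team'))
ranked = (counts
          .withColumn('team_position', F.row_number().over(w))
          .filter(F.col('team_position') <= 3)
          .withColumn('label', F.concat(F.col('team'), F.lit(' with '),
                                        F.col('medals_count').cast('string'), F.lit(' medals'))))
# Pivot the top three positions into gold/silver/bronze columns and fill missing teams.
result = (ranked.groupBy('event')
          .pivot('team_position', [1, 2, 3])
          .agg(F.first('label'))
          .withColumnRenamed('1', 'gold_team')
          .withColumnRenamed('2', 'silver_team')
          .withColumnRenamed('3', 'bronze_team')
          .fillna('No Team'))
result.toPandas()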
PySpark Interview Questions #15: Exclusive Amazon Products
This question asks us to identify products that are only sold on Amazon and not available at Top Shop and Macy's. We need to list these exclusive products along with their name, brand, price, and rating.
Find products which are exclusive to only Amazon and therefore not sold at Top Shop and Macy's. Your output should include the product name, brand name, price, and rating.
Two products are considered equal if they have the same product name and same maximum retail price (mrp column).
Link to this question : https://platform.stratascratch.com/coding/9608-exclusive-amazon-products?code_type=6
We are tasked with finding products that are exclusive to Amazon, meaning they aren't sold at Top Shop and Macy's. Let’s break down the solution.
- Start with the 'innerwear_amazon_com' dataset, which includes details about products sold on Amazon.
- Check these Amazon products against the 'innerwear_macys_com' and 'innerwear_topshop_com' datasets to ensure they are not available at Macy's and Top Shop. We do this using 'leftanti' joins, which find records in the first dataset that do not have matching records in the second dataset. We use 'product_name' and 'mrp' (maximum retail price) as key fields to compare.
- After performing these joins, select the 'product_name', 'brand_name', 'price', and 'rating' columns for the final output.
- Finally, convert this filtered data into a pandas dataframe for easier analysis.
In short, we are identifying and listing products that are unique to Amazon by ensuring they are not sold at Top Shop and Macy's, based on their name and price. Let’s see the code.
result = (innerwear_amazon_com
    # Keep only Amazon products with no (product_name, mrp) match at Macy's or Top Shop.
    .join(innerwear_macys_com, ['product_name', 'mrp'], 'leftanti')
    .join(innerwear_topshop_com, ['product_name', 'mrp'], 'leftanti')
    .select('product_name', 'brand_name', 'price', 'rating'))
result.toPandas()
Here is the output.
product_name | brand_name | price | rating |
---|---|---|---|
Calvin Klein Women's Bottoms Up Hipster Panty | Calvin-Klein | $11.00 | 4.5 |
Wacoal Women's Retro Chic Underwire Bra | Wacoal | $60.00 | 4.4 |
Calvin Klein Women's Carousel 3 Pack Thong | Calvin-Klein | $19.99 | 4 |
b.tempt'd by Wacoal Women's Lace Kiss Bralette | b-temptd | $11.65 | 4 |
Wacoal Women's Front Close T-Back Bra | Wacoal | $46.00 | 4.2 |
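If the interviewer asks what 'leftanti' actually does, it helps to be able to show the longer equivalent: a left join that keeps only the rows with no match on the right side. Here is a sketch of that for the Macy's step alone, under the same column assumptions:
import pyspark.sql.functions as F
# Build the set of (product_name, mrp) keys sold at Macy's, tagged with a marker column.
macys_keys = (innerwear_macys_com
              .select('product_name', 'mrp')
              .dropDuplicates()
              .withColumn('in_macys', F.lit(1)))
# A left join keeps every Amazon row; rows whose marker is null had no Macy's match,
# which is exactly what the single 'leftanti' join returns in one step.
not_at_macys = (innerwear_amazon_com
                .join(macys_keys, on=['product_name', 'mrp'], how='left')
                .filter(F.col('in_macys').isNull())
                .drop('in_macys'))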
Conclusion
In this article, we covered key aspects of PySpark, presenting questions from basic to advanced levels. These are important for understanding real-world data science challenges and using PySpark effectively in various scenarios.
Continuous practice and engagement with practical problems are crucial for budding data scientists. Regularly tackling data projects and data science interview questions sharpens your PySpark skills, leading to greater proficiency.
Visit StrataScratch to go deeper into data science and PySpark. This is your pathway to advancing in your data science career.