PySpark Interview Questions for Data Science Excellence
Navigating Through Essential PySpark Interview Questions from Basics to Advanced for Aspiring Data Scientists
What sets apart a good data scientist from a great one? It's not just about knowing different tools and techniques; it's about understanding how and when to use them.
PySpark, integrating Python and Apache Spark, stands as a crucial tool in modern data science. Its importance in processing large datasets and enabling distributed computing is undeniable.
In this article, I will introduce you to a range of PySpark interview questions, from basic to advanced, so you can see how and when to use PySpark in real life by testing yourself against them. Buckle up and let’s get started!
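A quick practical note before we begin: on the StrataScratch platform, every dataset in these questions is already loaded for you as a Spark DataFrame (for example, airbnb_search_details). If you want to follow along locally, a minimal setup could look like the sketch below; the file path is a placeholder, not part of any question.
from pyspark.sql import SparkSession
# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("pyspark_interview_practice").getOrCreate()
# Load a dataset into a DataFrame; the CSV path here is hypothetical.
airbnb_search_details = spark.read.csv("data/airbnb_search_details.csv", header=True, inferSchema=True)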
Basic PySpark Interview Questions
Starting with the basics is important for building a strong foundation. This section of basic PySpark interview questions is designed for beginners and those new to PySpark.
PySpark Interview Questions #1: Find out search details for apartments designed for a sole-person stay
This question focuses on extracting details of searches made by people looking for apartments suitable for a single person.
Find the search details made by people who searched for apartments designed for a single-person stay.
Link to this question : https://platform.stratascratch.com/coding/9615-find-out-search-details-for-apartments-designed-for-a-sole-person-stay?code_type=6
First, we need to find out which searches were for single-person apartments. Let’s break down the solution.
- Start with the dataset 'airbnb_search_details', which contains information about what people are searching for.
- Filter this data to find searches that meet two criteria: the accommodation is for one person ('accommodates' equals 1) and the property type is 'Apartment'.
- After applying these filters, convert the filtered data into a pandas dataframe. This is done for easier analysis and handling of the data.
In simple terms, we're identifying and listing the search details for apartments that are meant for only one person. Let’s see the code.
import pyspark.sql.functions as F
result = airbnb_search_details.filter((F.col('accommodates') == 1) & (F.col('property_type') == 'Apartment')).toPandas()
Here is the output.
id | price | property_type | room_type | amenities | accommodates | bathrooms | bed_type | cancellation_policy | cleaning_fee | city | host_identity_verified | host_response_rate | host_since | neighbourhood | number_of_reviews | review_scores_rating | zipcode | bedrooms | beds |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5059214 | 431.75 | Apartment | Private room | {TV,"Wireless Internet","Air conditioning",Kitchen,"Free parking on premises",Breakfast,Heating,"Smoke detector","Carbon monoxide detector","First aid kit","Fire extinguisher",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Laptop friendly workspace","Private living room"} | 1 | 3 | Real Bed | strict | FALSE | NYC | f | | 2014-03-14 00:00:00 | Harlem | 0 | | 10030 | 2 | 1 |
10923708 | 340.12 | Apartment | Private room | {TV,Internet,"Wireless Internet","Air conditioning",Kitchen,"Pets live on this property",Cat(s),"Buzzer/wireless intercom",Heating,"Family/kid friendly",Washer,"Smoke detector","Carbon monoxide detector","First aid kit","Fire extinguisher",Essentials} | 1 | 1 | Real Bed | strict | FALSE | NYC | t | 100% | 2014-06-30 00:00:00 | Harlem | 166 | 91 | 10031 | 1 | 1 |
1077375 | 400.73 | Apartment | Private room | {"Wireless Internet",Heating,"Family/kid friendly","Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Shampoo,Hangers,Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_50"} | 1 | 1 | Real Bed | moderate | TRUE | NYC | t | | 2015-04-04 00:00:00 | Sunset Park | 1 | 100 | 11220 | 1 | 1 |
13121821 | 501.06 | Apartment | Private room | {TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,Heating,"Smoke detector","First aid kit",Essentials,Hangers,"Hair dryer",Iron,"Laptop friendly workspace"} | 1 | 1 | Real Bed | flexible | FALSE | NYC | f | | 2014-09-20 00:00:00 | Park Slope | 0 | | 11215 | 1 | 1 |
19245819 | 424.85 | Apartment | Private room | {Internet,"Wireless Internet",Kitchen,"Pets live on this property",Dog(s),Washer,Dryer,"Smoke detector","Fire extinguisher"} | 1 | 1 | Real Bed | moderate | FALSE | SF | t | | 2010-03-16 00:00:00 | Mission District | 12 | 90 | 94110 | 1 | 1 |
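As a side note, the same filter can also be written as a SQL-style expression string, which is an equivalent variation you may prefer in an interview setting:
# Equivalent filter using a SQL expression string instead of column objects.
result = airbnb_search_details.filter("accommodates = 1 AND property_type = 'Apartment'").toPandas()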
PySpark Interview Questions #2: Users Activity Per Month Day
This question is about finding out how active users are on Facebook on different days of the month, measured by the number of posts made each day.
Interview Question Date: January 2021
Return a distribution of users activity per day of the month. By distribution we mean the number of posts per day of the month.
Link to this question : https://platform.stratascratch.com/coding/2006-users-activity-per-month-day?code_type=6
We are analyzing the number of Facebook posts made on each day of the month. Let’s break down the solution.
- Start with the 'facebook_posts' dataset, which records when posts are made.
- First, convert the 'post_date' to a standard date format.
- Then, group the posts by the day of the month they were made on.
- For each day, count the number of posts ('post_id'), which we label as 'user_activity'.
- Order the results by the day of the month.
- Finally, convert the PySpark dataframe into a pandas dataframe for easier analysis.
In short, we are figuring out how many posts are made on each day of the month on Facebook and presenting this data in a clear and organized way. Let’s see the code.
import pyspark.sql.functions as F
result = facebook_posts.withColumn('post_date', F.to_date(F.col('post_date'))) \
.groupBy(F.dayofmonth('post_date').alias('day')) \
.agg(F.count('post_id').alias('user_activity')) \
.orderBy('day') \
.toPandas()
Here is the output.
day | user_activity |
---|---|
1 | 3 |
2 | 3 |
PySpark Interview Questions #3: Customers Who Purchased the Same Product
This question asks us to identify customers who have bought the same furniture items and to provide details like the product ID, brand name, and the count of unique customers for each furniture item, arranged in order of popularity.
Interview Question Date: February 2023
In order to improve customer segmentation efforts for users interested in purchasing furniture, you have been asked to find customers who have purchased the same items of furniture.
Output the product_id, brand_name, unique customer IDs who purchased that product, and the count of unique customer IDs who purchased that product. Arrange the output in descending order with the highest count at the top.
Link to this question : https://platform.stratascratch.com/coding/2150-customers-who-purchased-the-same-product?code_type=6
We are looking to improve customer segmentation by finding out which furniture items are bought by the same customers. Let’s break down the solution.
- Start by merging two datasets: 'online_orders' (which contains customer orders) and 'online_products' (which contains product details). We link them using 'product_id'.
- Then, filter this merged data to include only furniture items.
- Create a list showing product ID, brand name, and customer ID, making sure to remove any duplicates.
- Next, group the data by product ID and count the distinct customer IDs for each product, labeling this count as 'unique_cust_no'.
- Then order this grouped data so that the products with the highest number of unique customers are at the top.
- Merge this sorted data back with our original furniture list.
- Finally, convert this PySpark dataframe into a pandas dataframe for easier handling and presentation.
In short, we are finding and listing furniture items based on the number of unique customers who bought them, showing the most popular items first. Let’s see the code.
import pyspark.sql.functions as F
merged = online_orders.join(online_products, on="product_id", how="inner")
merged = merged.filter(merged["product_class"] == "FURNITURE")
merged = merged.select("product_id", "brand_name", "customer_id").dropDuplicates()
unique_cust = (
merged.groupBy("product_id")
.agg(F.countDistinct("customer_id").alias("unique_cust_no"))
.orderBy(F.desc("unique_cust_no"))
)
result = merged.join(unique_cust, on="product_id", how="inner").orderBy(F.desc("unique_cust_no"))
result.toPandas()
Here is the output.
product_id | brand_name | customer_id | unique_cust_no |
---|---|---|---|
10 | American Home | 2 | 3 |
10 | American Home | 1 | 3 |
10 | American Home | 3 | 3 |
8 | Lucky Joe | 3 | 1 |
11 | American Home | 1 | 1 |
PySpark Interview Questions #4: Sorting Movies By Duration Time
This question requires organizing a list of movies based on their duration, with the longest movies shown first.
Interview Question Date: May 2023
You have been asked to sort movies according to their duration in descending order.
Your output should contain all columns sorted by the movie duration in the given dataset.
Link to this question : https://platform.stratascratch.com/coding/2163-sorting-movies-by-duration-time?code_type=6
We need to arrange movies by their length, starting with the longest. Let’s break down the solution.
- Begin with the 'movie_catalogue', which includes details about various movies.
- Extract the duration in minutes from the 'duration' column. This involves finding the number in the text and converting it to a float (a number that can have decimals).
- Next, sort the entire movie catalogue based on these duration numbers, putting the longest movies at the top.
- After sorting, remove the 'movie_minutes' column, as it's no longer needed.
- Finally, convert the sorted data into a pandas dataframe.
In simple terms, we are putting the movies in order from longest to shortest based on their duration. Let’s see the code.
import pyspark.sql.functions as F
movie_catalogue = movie_catalogue.withColumn(
"movie_minutes",
F.regexp_extract(F.col("duration"), r"(\d+)", 1).cast("float")
)
result = movie_catalogue.orderBy(F.desc("movie_minutes")).drop("movie_minutes")
result.toPandas()
Here is the output.
show_id | title | release_year | rating | duration |
---|---|---|---|---|
s8083 | Star Wars: Episode VIII: The Last Jedi | 2017 | PG-13 | 152 min |
s6201 | Avengers: Infinity War | 2018 | PG-13 | 150 min |
s6326 | Black Panther | 2018 | PG-13 | 135 min |
s8052 | Solo: A Star Wars Story | 2018 | PG-13 | 135 min |
s8053 | Solo: A Star Wars Story (Spanish Version) | 2018 | PG-13 | 135 min |
PySpark Interview Questions #5: Find the date with the highest opening stock price
This question is about identifying the date when Apple's stock had its highest opening price.
Find the date when Apple's opening stock price reached its maximum
Link to this question : https://platform.stratascratch.com/coding/9613-find-the-date-with-the-highest-opening-stock-price?code_type=6
We are tasked with finding out the day when Apple's stock opened at its maximum value. Let’s break down the solution.
- Start with the 'aapl_historical_stock_price' dataset, which has Apple's stock price history.
- Modify the 'date' column to ensure it's in a string format showing only the year, month, and day.
- Next, find the maximum value in the 'open' column, which represents the opening stock price.
- Then filter the dataset to find the date(s) when this maximum opening price occurred.
- Finally, select only the 'date' column and convert the data to a pandas dataframe for easy viewing.
In summary, we are pinpointing the date when Apple's stock had its highest opening price and presenting this information in a straightforward manner. Let’s see the code.
df = aapl_historical_stock_price
# Keep only the YYYY-MM-DD portion of the date column (substr is 1-based).
df = df.withColumn('date', df['date'].cast('string').substr(1, 10))
# Keep the row(s) whose opening price equals the overall maximum, then select the date.
result = df.filter(df['open'] == df.selectExpr('max(open)').collect()[0][0]).select('date').toPandas()
Here is the output.
date |
---|
2012-09-21 |
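As an aside, the collect() call above pulls the maximum opening price onto the driver before filtering. A rough equivalent that avoids that round trip sorts by the opening price and keeps only the top row; note that, unlike the filter above, this variation returns a single date even if several days tie for the maximum.
import pyspark.sql.functions as F
# Sort by opening price in descending order and keep the date of the top row.
result = (aapl_historical_stock_price
          .withColumn('date', aapl_historical_stock_price['date'].cast('string').substr(1, 10))
          .orderBy(F.desc('open'))
          .limit(1)
          .select('date')
          .toPandas())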
Intermediate PySpark Interview Questions
Once the fundamentals are mastered, the next step is to tackle more complex scenarios. The Intermediate PySpark Interview Questions section is tailored to those who already have a basic understanding of PySpark.
PySpark Interview Questions #6: Find the first and last times the maximum score was awarded
This question asks us to find out the earliest and latest dates on which the highest score was given in Los Angeles restaurant health inspections.
Interview Question Date: April 2018
Find the first and last times the maximum score was awarded
Link to this question : https://platform.stratascratch.com/coding/9712-find-the-first-and-last-times-the-maximum-score-was-awarded?code_type=6
We are looking for the first and last instances when the maximum health inspection score was awarded to restaurants in Los Angeles. Let’s break down the solution.
- Start by identifying the highest score given in the 'los_angeles_restaurant_health_inspections' dataset.
- Next, ensure the 'activity_date' is in a date format for accurate comparison.
- Then find the earliest date ('first_time') when this maximum score was awarded by filtering for this score and selecting the minimum date.
- Similarly, find the latest date ('last_time') by selecting the maximum date for the same score.
- Combine these two dates to get a result showing both the first and last times this score was given.
- Finally, convert this combined data into a pandas dataframe for easy viewing.
In summary, we are identifying the first and last occurrences of the highest health inspection score awarded to restaurants in Los Angeles. Let’s see the code.
import pyspark.sql.functions as F
max_score = los_angeles_restaurant_health_inspections.select(F.max("score")).first()[0]
los_angeles_restaurant_health_inspections = los_angeles_restaurant_health_inspections.withColumn("activity_date", F.to_date("activity_date"))
first_time = los_angeles_restaurant_health_inspections.filter(los_angeles_restaurant_health_inspections["score"] == max_score).select(F.min("activity_date").alias("first_time"))
last_time = los_angeles_restaurant_health_inspections.filter(los_angeles_restaurant_health_inspections["score"] == max_score).select(F.max("activity_date").alias("last_time"))
result = first_time.crossJoin(last_time)
result.toPandas()
Here is the output.
first_time | last_time |
---|---|
2015-09-11 | 2018-03-16 |
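Since both dates come from the same filtered data, a slightly simpler variation computes them in a single aggregation and skips the cross join. This sketch reuses the max_score value and the date conversion from the solution above:
import pyspark.sql.functions as F
# Compute the earliest and latest dates of the maximum score in one aggregation.
result = (los_angeles_restaurant_health_inspections
          .filter(F.col('score') == max_score)
          .agg(F.min('activity_date').alias('first_time'),
               F.max('activity_date').alias('last_time')))
result.toPandas()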
PySpark Interview Questions #7: Account Registrations
This question requires us to calculate the number of account signups per month, showing the year and month along with the corresponding number of registrations.
Interview Question Date: August 2022
Find the number of account registrations according to the signup date. Output the year months (YYYY-MM) and their corresponding number of registrations.
Link to this question : https://platform.stratascratch.com/coding/2126-account-registrations?code_type=6
We are tasked with finding out how many accounts were registered each month. Let’s break down the solution.
- Begin with the 'noom_signups' dataset, which has data on when accounts were registered.
- Create a new column 'started_at_month' that formats the 'started_at' date to show only the year and month (in 'YYYY-MM' format).
- Next, group the data by this new 'started_at_month' column.
- Count the number of registrations for each month and rename this count as 'n_registrations'.
- Then sort the data by the month and year.
- Finally, convert this sorted data into a pandas dataframe for easy reading.
In simple terms, we are tallying the number of account signups for each month and displaying them in an organized and chronological manner. Let’s see the code.
import pyspark.sql.functions as F
noom_signups = noom_signups.withColumn('started_at_month', F.date_format('started_at', 'yyyy-MM'))
result = noom_signups.groupby('started_at_month').count().withColumnRenamed('count', 'n_registrations').orderBy('started_at_month')
result.toPandas()
Here is the output.
started_at_month | n_registrations |
---|---|
2018-10 | 1 |
2018-11 | 4 |
2018-12 | 1 |
2019-01 | 3 |
2019-02 | 3 |
PySpark Interview Questions #8: Process a Refund
This question asks for the calculation of the minimum, average, and maximum number of days it takes to process a refund for accounts opened since January 1, 2019, and to group these calculations by billing cycle in months.
Interview Question Date: August 2022
Calculate and display the minimum, average and the maximum number of days it takes to process a refund for accounts opened from January 1, 2019. Group by billing cycle in months.
Note: The time frame for a refund to be fully processed is from settled_at until refunded_at.
Link to this question : https://platform.stratascratch.com/coding/2125-process-a-refund?code_type=6
We need to analyze the duration of refund processing for accounts opened from 2019 onwards, grouped by their billing cycle duration. Let’s break down the solution.
- Begin by joining three datasets: 'noom_transactions', 'noom_signups', and 'noom_plans', linking them via 'signup_id' and 'plan_id'.
- Filter these combined datasets to include only transactions from accounts started on or after January 1, 2019.
- Calculate 'time_to_settle', the number of days between 'settled_at' and 'refunded_at'.
- Next, group the data by 'billing_cycle_in_months'.
- For each billing cycle, calculate the minimum, average, and maximum refund processing time.
- Sort these groups by the billing cycle length.
- Finally, convert this grouped and calculated data into a pandas dataframe for easier interpretation.
In summary, we're measuring how long refunds take for different billing cycles, starting from 2019, and presenting this data in an organized manner. Let’s see the code.
import pyspark.sql.functions as F
# Join transactions with signups and plans so each transaction carries its billing cycle.
transactions_signups = noom_transactions.join(noom_signups, on='signup_id')
transactions_signups_plans = transactions_signups.join(noom_plans, on='plan_id')
# Keep only accounts opened on or after January 1, 2019.
new_signups_transactions = transactions_signups_plans.filter(transactions_signups_plans['started_at'] >= '2019-01-01')
# Number of days between settlement and refund.
new_signups_transactions = new_signups_transactions.withColumn('time_to_settle', F.datediff(new_signups_transactions['refunded_at'], new_signups_transactions['settled_at']))
result = new_signups_transactions.groupby('billing_cycle_in_months').agg(
    F.min('time_to_settle').alias('_min'),
    F.mean('time_to_settle').alias('_mean'),
    F.max('time_to_settle').alias('_max')
).sort('billing_cycle_in_months')
result.toPandas()
Here is the output.
billing_cycle_in_months | _min | _mean | _max |
---|---|---|---|
1 | 4 | 10.44 | 21 |
6 | 6 | 9 | 14 |
12 | 6 | 9.5 | 13 |
PySpark Interview Questions #9: Highest Salary
This question requires us to identify the employee (or employees) who has the highest salary, and to display their first name and the amount of their salary.
You have been asked to find the employee with the highest salary. Output the worker's or workers' first name, as well as the salary.
Link to this question : https://platform.stratascratch.com/coding/9870-highest-salary?code_type=6
We need to find the employee with the highest salary in the dataset. Let’s break down the solution.
- Start with the 'worker' dataset, which includes details about employees and their salaries.
- First determine the highest salary in the dataset using the 'max' function.
- Next, filter the dataset to find the employee(s) who have this highest salary.
- Then select the 'first_name' and 'salary' columns to display.
- Finally, convert this information into a pandas dataframe to make it more readable.
In summary, we are identifying the employee(s) with the top salary and presenting their first name along with the salary amount. Let’s see the code.
import pyspark.sql.functions as F
result = worker.filter(F.col('salary') == worker.select(F.max('salary')).collect()[0][0]) \
.select('first_name', 'salary') \
.toPandas()
Here is the output.
first_name | salary |
---|---|
Amitah | 500000 |
Vivek | 500000 |
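For reference, the same result can be produced without collecting the maximum salary onto the driver, for example with a window function. This is only an alternative sketch over the same 'worker' DataFrame:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# Rank salaries from highest to lowest; every worker tied for the top salary gets rank 1.
# Note: a window with no partitionBy moves all rows to a single partition, which is fine for small data.
w = Window.orderBy(F.desc('salary'))
result = (worker
          .withColumn('salary_rank', F.dense_rank().over(w))
          .filter(F.col('salary_rank') == 1)
          .select('first_name', 'salary')
          .toPandas())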
PySpark Interview Questions #10: Find the average of inspection scores between 91 and 100
This question asks us to calculate the average (mean) of health inspection scores for Los Angeles restaurants that fall between 91 and 100, assuming that these scores are normally distributed.
Find the mean of inspection scores between 91 and 100, assuming that the scores are normally distributed.
Link to this question : https://platform.stratascratch.com/coding/9707-find-the-average-of-inspections-scores-between-91-and-100?code_type=6
We are finding the average score of restaurant health inspections that are in the range of 91 to 100. Let’s break down the solution.
- Start with the 'los_angeles_restaurant_health_inspections' dataset.
- Filter this data to include only the scores that are between 91 and 100.
- Next, calculate the average (mean) of these scores.
- Finally, convert this calculated average into a pandas dataframe for easy viewing and interpretation.
In simple terms, we are determining the average score of health inspections for Los Angeles restaurants, focusing on scores between 91 and 100. Let’s see the code.
import pyspark.sql.functions as F
score_between = los_angeles_restaurant_health_inspections.filter(F.col('score').between(91, 100))
result = score_between.select(F.mean('score')).toPandas()
result
Running this code returns a single row containing the mean of all inspection scores in that range.
Advanced PySpark Interview Questions
For those who have confidently navigated through the basics and intermediate levels, the Advanced PySpark Interview Questions section awaits. This is where complex, real-world data problems are addressed. It’s designed for individuals who are comfortable with PySpark and are looking to deepen their expertise.
PySpark Interview Questions #11: Find how the survivors are distributed by the gender and passenger classes
This question asks us to determine the distribution of Titanic survivors based on their gender and the class they traveled in. The classes are categorized into first, second, and third class based on the 'pclass' value.
Find how the survivors are distributed by gender and passenger class. Classes are categorized based on the pclass value as follows: pclass = 1: first_class, pclass = 2: second_class, pclass = 3: third_class. Output the sex along with the corresponding number of survivors for each class. HINT: each sex should be on a separate line, with one column holding the value of that sex and the other 3 columns holding the number of survivors for each of the 3 classes.
Link to this question : https://platform.stratascratch.com/coding/9882-find-how-the-survivors-are-distributed-by-the-gender-and-passenger-classes/official-solution?code_type=6
We need to find out how many survivors there were in each passenger class, broken down by gender. Let’s break down the solution.
- We start with the 'titanic' dataset and filter it to include only the records of survivors ('survived' == 1).
- We then group these survivor records by 'sex' and 'pclass' and count the number of survivors in each group.
- Next, we reorganize (pivot) this data so that each row represents a gender, and each column represents a class, showing the count of survivors in each category.
- We rename the class columns to 'first_class', 'second_class', and 'third_class' for clarity.
- Finally, we convert this pivoted data into a pandas dataframe, which will display each gender with the corresponding number of survivors in each of the three classes.
In summary, we are showcasing the number of Titanic survivors based on their gender and the class in which they were traveling. Let’s see the code.
import pyspark.sql.functions as F
survived = titanic.filter(titanic['survived'] == 1)
count = survived.groupby(['sex','pclass']).agg(F.count('*').alias('count'))
pivot = count.groupBy('sex').pivot('pclass').agg(F.sum('count'))
pivot = pivot.withColumnRenamed('1', 'first_class').withColumnRenamed('2', 'second_class').withColumnRenamed('3', 'third_class')
result = pivot.toPandas()
result
Here is the output.
sex | first_class | second_class | third_class |
---|---|---|---|
female | 7 | 9 | 15 |
male | 3 | 3 | 4 |
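One small detail worth mentioning in an interview: when the pivot values are known in advance, passing them to pivot() explicitly spares Spark the extra pass it otherwise needs to discover them. A sketch of that variation on the pivot line above:
# Passing the known pclass values avoids an extra scan to infer the pivot columns.
pivot = count.groupBy('sex').pivot('pclass', [1, 2, 3]).agg(F.sum('count'))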
PySpark Interview Questions #12: Consecutive Days
This question is about identifying users who were active for three or more consecutive days.
Interview Question Date: July 2021
Find all the users who were active for 3 consecutive days or more.
Link to this question : https://platform.stratascratch.com/coding/2054-consecutive-days?code_type=6
We need to find users who have been active for at least three days in a row. Let’s break down the solution.
- Start with the 'sf_events' dataset and remove any duplicate records.
- Then ensure the 'date' column is in a standard date format (YYYY-MM-DD).
- Next, assign a rank to each user's activity date. This rank is calculated within each user's set of dates, ordered chronologically.
- Create a new column 'consecutive_days' by subtracting the rank from the date. Because both the date and the rank increase by one on each consecutive day, every date in an unbroken streak maps to the same anchor value, which is what lets us spot consecutive runs.
- Then group the data by 'user_id' and 'consecutive_days', counting the number of records in each group.
- Filter these groups to keep only those where the count is three or more, indicating three or more consecutive days of activity.
- Finally, select the 'user_id' of these active users and convert the data into a pandas dataframe.
In summary, we are pinpointing users who were active for three consecutive days or more and listing their IDs. Let’s see the code.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df = sf_events.dropDuplicates()
# Make sure 'date' is a proper date type.
df = df.withColumn('date', F.to_date(df['date'], 'yyyy-MM-dd'))
# Rank each user's activity dates chronologically.
df = df.withColumn('rank', F.row_number().over(Window.partitionBy('user_id').orderBy('date')))
# Dates in an unbroken streak all map to the same anchor date after subtracting the rank.
df = df.withColumn('consecutive_days', F.date_sub(df['date'], df['rank'] - 1))
result = df.groupBy('user_id', 'consecutive_days').agg(F.count('*').alias('counter')).filter(F.col('counter') >= 3).select('user_id')
result.toPandas()
Here is the output.
user_id |
---|
U4 |
PySpark Interview Questions #13: Find all records with words that start with the letter 'g'
This question asks us to identify records from a dataset where either of two fields, 'words1' or 'words2', contains words starting with the letter 'g'.
Find all records with words that start with the letter 'g'.
Output words1 and words2 if any of them satisfies the condition.
Link to this question : https://platform.stratascratch.com/coding/9806-find-all-records-with-words-that-start-with-the-letter-g?code_type=6
We need to find records with words beginning with 'g' in either of two columns ('words1' or 'words2'). Let’s break down the solution.
- Begin with the 'google_word_lists' dataset.
- Apply filters to both 'words1' and 'words2' columns to check if any word starts with the letter 'g'. We use the regular expression ('rlike') function for this purpose. The expression checks for words starting with 'g' either at the beginning of the string or preceded by a comma, space, or other delimiters.
- After applying these filters, select the records that meet our criteria.
- Finally, convert these filtered records into a pandas dataframe for easy viewing and analysis.
In simple terms, we are finding and listing records where either the 'words1' or 'words2' field contains a word that starts with the letter 'g'. Let’s see the code.
import pyspark.sql.functions as F
# A word starts with 'g' if 'g' appears at the start of the string
# or immediately after a delimiter such as a comma or a space.
pattern = r'(^|[,\s])g'
result = google_word_lists.filter(
    F.col('words1').rlike(pattern) | F.col('words2').rlike(pattern)
)
result.toPandas()
Here is the output.
words1 | words2 |
---|---|
google,facebook,microsoft | flower,nature,sun |
sun,nature | google,apple |
beach,photo | facebook,green,orange |
PySpark Interview Questions #14: Top Teams In The Rio De Janeiro 2016 Olympics
This question asks us to identify the top three medal-winning teams for each event at the Rio De Janeiro 2016 Olympics, and display them as 'gold team', 'silver team', and 'bronze team' along with the number of medals they won. In case of a tie, the teams should be ordered alphabetically. If there is no team for a position, it should be labeled as 'No Team'.
Find the top 3 medal-winning teams by counting the total number of medals for each event in the Rio De Janeiro 2016 olympics. In case there is a tie, order the countries by name in ascending order. Output the event name along with the top 3 teams as the 'gold team', 'silver team', and 'bronze team', with the team name and the total medals under each column in format "{team} with {number of medals} medals". Replace NULLs with "No Team" string.
Link to this question : https://platform.stratascratch.com/coding/9960-top-teams-in-the-rio-de-janeiro-2016-olympics?code_type=6
We are analyzing the 2016 Rio De Janeiro Olympics data to determine the top three teams in terms of medal counts for each event. Let’s break down the solution.
- Start with the 'olympics_athletes_events' dataset and convert it to a pandas dataframe.
- Filter this data to include only the 2016 Olympics and events where medals were awarded.
- Then group the data by 'event' and 'team', counting the number of medals for each team in each event.
- For each event, rank the teams based on their medal counts and, in case of a tie, alphabetically by team name.
- Identify the top three teams for each event and label them as 'gold team', 'silver team', and 'bronze team', including the count of medals they won in a formatted string.
- Group this data by 'event' and aggregate the top teams in their respective positions.
- Replace any missing team positions with 'No Team'.
- The final output includes the event name and the top three teams with their medal counts.
In summary, we are showcasing the leading medal-winning teams in each event from the 2016 Olympics, labeled according to their rank and presented in an easy-to-read format. Let’s see the code.
import pandas as pd
olympics_athletes_events = olympics_athletes_events.toPandas()
# Keep only 2016 rows where a medal was actually awarded.
y_2016 = olympics_athletes_events[(olympics_athletes_events['year'] == 2016) & (olympics_athletes_events['medal'].notnull())]
# Count medals per (event, team) pair.
n_medal = y_2016.groupby(['event', 'team']).size().to_frame('medals_count').reset_index()
# Rank teams within each event by medal count, breaking ties alphabetically by team name.
n_medal['team_position'] = n_medal.sort_values(['medals_count', 'team'], ascending=[False, True]).groupby(['event']).cumcount() + 1
less_3_medals = n_medal[n_medal['team_position'] <= 3].copy()
less_3_medals.loc[less_3_medals['team_position'] == 1, 'gold_team'] = less_3_medals['team'] + " with " + less_3_medals['medals_count'].astype(str) + " medals"
less_3_medals.loc[less_3_medals['team_position'] == 2, 'silver_team'] = less_3_medals['team'] + " with " + less_3_medals['medals_count'].astype(str) + " medals"
less_3_medals.loc[less_3_medals['team_position'] == 3, 'bronze_team'] = less_3_medals['team'] + " with " + less_3_medals['medals_count'].astype(str) + " medals"
result = less_3_medals.groupby('event').agg({'gold_team': 'first', 'silver_team': 'first', 'bronze_team': 'first'}).reset_index().fillna('No Team')
result
Here is the output.
event | gold_team | silver_team | bronze_team |
---|---|---|---|
Archery Men's Individual | France with 1 medals | No Team | No Team |
Athletics Men's Marathon | Kenya with 1 medals | No Team | No Team |
Athletics Women's 100 metres | United States with 1 medals | No Team | No Team |
Basketball Women's Basketball | Spain with 1 medals | United States with 1 medals | No Team |
Boxing Women's Lightweight | Russia with 1 medals | No Team | No Team |
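Because this solution drops into pandas right away, it is also worth knowing how the same ranking logic could be sketched in PySpark itself, for example with a window function and a pivot. The version below is only an illustration and assumes 'olympics_athletes_events' is still a Spark DataFrame with the same column names:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# Keep 2016 rows where a medal was awarded, then count medals per (event, team).
counts = (olympics_athletes_events
          .filter((F.col('year') == 2016) & F.col('medal').isNotNull())
          .groupBy('event', 'team')
          .agg(F.count('*').alias('medals_count')))
# Rank teams within each event by medal count, breaking ties alphabetically by team name.
w = Window.partitionBy('event').orderBy(F.desc('medals_count'), F.asc('team'))
ranked = (counts
          .withColumn('team_position', F.row_number().over(w))
          .filter(F.col('team_position') <= 3)
          .withColumn('label', F.concat(F.col('team'), F.lit(' with '),
                                        F.col('medals_count').cast('string'), F.lit(' medals'))))
# Pivot the top three positions into gold/silver/bronze columns and fill missing teams.
result = (ranked.groupBy('event')
          .pivot('team_position', [1, 2, 3])
          .agg(F.first('label'))
          .withColumnRenamed('1', 'gold_team')
          .withColumnRenamed('2', 'silver_team')
          .withColumnRenamed('3', 'bronze_team')
          .fillna('No Team'))
result.toPandas()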
PySpark Interview Questions #15: Exclusive Amazon Products
This question asks us to identify products that are only sold on Amazon and not available at Top Shop and Macy's. We need to list these exclusive products along with their name, brand, price, and rating.
Find products which are exclusive to only Amazon and therefore not sold at Top Shop and Macy's. Your output should include the product name, brand name, price, and rating.
Two products are considered equal if they have the same product name and same maximum retail price (mrp column).
Link to this question : https://platform.stratascratch.com/coding/9608-exclusive-amazon-products?code_type=6
We are tasked with finding products that are exclusive to Amazon, meaning they aren't sold at Top Shop and Macy's. Let’s break down the solution.
- Start with the 'innerwear_amazon_com' dataset, which includes details about products sold on Amazon.
- Check these Amazon products against the 'innerwear_macys_com' and 'innerwear_topshop_com' datasets to ensure they are not available at Macy's and Top Shop. We do this using 'leftanti' joins, which find records in the first dataset that do not have matching records in the second dataset. We use 'product_name' and 'mrp' (maximum retail price) as key fields to compare.
- After performing these joins, select the 'product_name', 'brand_name', 'price', and 'rating' columns for the final output.
- Finally, convert this filtered data into a pandas dataframe for easier analysis.
In short, we are identifying and listing products that are unique to Amazon by ensuring they are not sold at Top Shop and Macy's, based on their name and price. Let’s see the code.
result = (innerwear_amazon_com
    # Keep only Amazon products with no (product_name, mrp) match at Macy's or Top Shop.
    .join(innerwear_macys_com, ['product_name', 'mrp'], 'leftanti')
    .join(innerwear_topshop_com, ['product_name', 'mrp'], 'leftanti')
    .select('product_name', 'brand_name', 'price', 'rating'))
result.toPandas()
Here is the output.
product_name | brand_name | price | rating |
---|---|---|---|
Calvin Klein Women's Bottoms Up Hipster Panty | Calvin-Klein | $11.00 | 4.5 |
Wacoal Women's Retro Chic Underwire Bra | Wacoal | $60.00 | 4.4 |
Calvin Klein Women's Carousel 3 Pack Thong | Calvin-Klein | $19.99 | 4 |
b.tempt'd by Wacoal Women's Lace Kiss Bralette | b-temptd | $11.65 | 4 |
Wacoal Women's Front Close T-Back Bra | Wacoal | $46.00 | 4.2 |
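If the interviewer asks what 'leftanti' actually does, it helps to be able to show the longer equivalent: a left join that keeps only the rows with no match on the right side. Here is a sketch of that for the Macy's step alone, under the same column assumptions:
import pyspark.sql.functions as F
# Build the set of (product_name, mrp) keys sold at Macy's, tagged with a marker column.
macys_keys = (innerwear_macys_com
              .select('product_name', 'mrp')
              .dropDuplicates()
              .withColumn('in_macys', F.lit(1)))
# A left join keeps every Amazon row; rows whose marker is null had no Macy's match,
# which is exactly what the single 'leftanti' join returns in one step.
not_at_macys = (innerwear_amazon_com
                .join(macys_keys, on=['product_name', 'mrp'], how='left')
                .filter(F.col('in_macys').isNull())
                .drop('in_macys'))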
Conclusion
In this article, we covered key aspects of PySpark, presenting questions from basic to advanced levels. These are important for understanding real-world data science challenges and using PySpark effectively in various scenarios.
Continuous practice and engagement with practical problems are crucial for budding data scientists. Regularly tackling data projects and data science interview questions sharpens your PySpark skills, leading to greater proficiency.
Visit StrataScratch to go deeper into data science and PySpark. This is your pathway to advancing in your data science career.