SQL Cheat Sheet – Technical Concepts for the Job Interview

Published: October 12, 2023

Categories:

Written by:
Nathan Rosidi

This SQL cheat sheet gives you a concise and practical overview of the SQL concepts you’ll need for the job interview.

We intended for this SQL cheat sheet to help you whenever you need to skim through some SQL concepts and remember how they work. To have a cheat sheet to refer to is especially useful if you’re still new to some concepts or don’t use them very often.

In this SQL cheat sheet, we’ll keep the explanations short. Along with them, every concept will be explained by the graphical representation and the interview question from the StrataScratch platform.

And the concepts we will cover in this SQL cheat sheet are:

1. JOINs

LEFT JOIN
INNER JOIN
Multiple Joins
Self Joins

2. Time and Date Functions

DATE_PART()
EXTRACT()
INTERVAL

3. Aggregate Functions

COUNT()
SUM()
AVG()
MIN() & MAX()

4. Window Functions

PARTITION BY
ORDER BY
RANGE or ROW
Ranking Window Functions
1. RANK()
2. DENSE_RANK()
3. ROW_NUMBER()

5. Set Operators

UNION
UNION ALL

SQL Cheat Sheet: JOINs

The JOIN clause in SQL allows you to use data from two or more tables by joining them via the same column found in both tables.

The SQL JOINs most often appearing in the job interview are:

LEFT JOIN
INNER JOIN
Multiple Joins
Self Joins

LEFT JOIN

Explanation

The LEFT JOIN or a LEFT OUTER JOIN returns all data from the first or left table, and only the matching rows from the second or right table. If there are non-matching rows, it will return NULL values.

Graphical Representation

Interview Question

SMS Confirmations From Users

Last Updated: November 2020

INNER JOIN

Explanation

The JOIN or INNER JOIN outputs only the matching rows or the rows found in both tables.

Graphical Representation

Interview Question

Find the total number of available beds per hosts' nationality

Airbnb

MediumID 10187

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the total number of available beds per hosts' nationality. Output the nationality along with the corresponding total number of available beds. Sort records by the total available beds in descending order.

Tables: airbnb_apartments, airbnb_hosts

Link to the question: https://platform.stratascratch.com/coding/10187-find-the-total-number-of-available-beds-per-hosts-nationality

Solution:

SELECT nationality,
       SUM(n_beds) AS total_beds_available
FROM airbnb_hosts h
INNER JOIN airbnb_apartments a ON h.host_id = a.host_id
GROUP BY nationality
ORDER BY total_beds_available DESC;

The INNER JOIN will join table airbnb_hosts with the table airbnb_apartments. The column on which this is done is host_id. This join will return only hosts that appear in both tables.

Multiple Joins

Explanation

The multiple joins are not a special type of JOIN. It simply means joining more than two tables while using any type of SQL JOIN. It works the same way as joining two tables, but you add a three, four, five, or any number of tables into the chain of JOINs.

Graphical Representation

Interview Question

Premium vs Freemium

Last Updated: November 2020

Microsoft

MediumID 10300

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the total number of downloads for paying and non-paying users by date. Include only records where non-paying customers have more downloads than paying customers. The output should be sorted by earliest date first and contain 3 columns date, non-paying downloads, paying downloads. Hint: In Oracle you should use "date" when referring to date column (reserved keyword).

Tables: ms_user_dimension, ms_acc_dimension, ms_download_facts

Link to the question: https://platform.stratascratch.com/coding/10300-premium-vs-freemium

Solution:

SELECT date, 
 non_paying,
       paying
FROM
  (SELECT date, SUM(CASE
                        WHEN paying_customer = 'yes' THEN downloads
                    END) AS paying,
                SUM(CASE
                        WHEN paying_customer = 'no' THEN downloads
                    END) AS non_paying
   FROM ms_user_dimension a
   INNER JOIN ms_acc_dimension b ON a.acc_id = b.acc_id
   INNER JOIN ms_download_facts c ON a.user_id=c.user_id
   GROUP BY date
   ORDER BY date) t
WHERE (non_paying - paying) > 0
ORDER BY t.date ASC;

In this solution, three tables are joined in the subquery using the INNER JOIN. First, the tables ms_user_dimension and ms_acc_dimension are joined on the column acc_id. Then the third table, ms_download_facts, is added using the second INNER JOIN. This table is joined with the first table on the column user_id.

Self-JOIN

Explanation

Again, the self-join isn’t a specific JOIN clause in SQL. Any JOIN can be used to self-join. It means you’re joining the table with itself.

Graphical Representation

Interview Question

Find all number pairs whose first number is smaller than the second one and the product of two numbers is larger than 11

Delta Airlines

Uber

MediumID 10011

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find all number pairs whose first number is smaller than the second one and the product of two numbers is larger than 11. Output both numbers in the combination.

Table: transportation_numbers

Link to the question: https://platform.stratascratch.com/coding/10011-find-all-number-pairs-whose-first-number-is-smaller-than-the-second-one-and-the-product-of-two-numbers-is-larger-than-11

Solution:

SELECT DISTINCT n1.number AS num1,
                n2.number AS num2
FROM transportation_numbers n1
INNER JOIN transportation_numbers n2 ON n1.index <> n2.index
WHERE n1.number < n2.number
  AND n1.number * n2.number > 11;

This solution uses the INNER JOIN to join the table transportation_numbers with itself. It’s done by having the table appear twice in the INNER JOIN: once as a left table and the second time as a right table. Both tables are given aliases so you can make a distinction between them. The table is self-joined on the column index.

SQL Cheat Sheet: Time and Date Functions

The time and date functions are the SQL functions used to calculate and manipulate the time and date data types. They can differ between the SQL dialects, but the three most common in PostgreSQL are:

DATE_PART()
EXTRACT()
INTERVAL

DATE_PART()

Explanation

This function is used to get a part of date or time from data. For example, it can return a year or month from the date or minutes from the time data.

Graphical Representation

Interview Question

Find the day of the week that most people check-in

Airbnb

MediumID 9762

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Software Engineer

Find the day of the week that most people want to check-in. Output the day of the week alongside the corresponding check-incount.

Table: airbnb_contacts

Link to the question: https://platform.stratascratch.com/coding/9762-find-the-day-of-the-week-that-most-people-check-in

Solution:

SELECT DATE_PART('dow', ds_checkin :: DATE) AS day_of_week,
       COUNT(*) AS checkin_count
FROM airbnb_contacts
GROUP BY day_of_week
ORDER BY checkin_count DESC
LIMIT 1;

The DATE_PART function is applied to the column ds_checkin from the table airbnb_contacts. To get the day of the week, ‘dow’ has to be written in the function.

EXTRACT()

Explanation

The EXTRACT() function does the same thing as the DATE_PART() function: it returns the part of the date or time. It is a function in compliance with the SQL standard, while the DATE_PART() is the PostgreSQL function.

Graphical Representation

EXTRACT Function Representation in SQL Cheat Sheet

Interview Question

Products Report Summary

Last Updated: June 2021

Whole Foods Market

EasyID 2039

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the number of unique transactions and total sales for each of the product categories in 2017. Output the product categories, number of transactions, and total sales in descending order. The sales column represents the total cost the customer paid for the product so no additional calculations need to be done on the column. Only include product categories that have products sold.

Tables: wfm_transactions, wfm_products

Link to the question: https://platform.stratascratch.com/coding/2039-products-report-summary

Solution:

SELECT product_category,
       COUNT(DISTINCT transaction_id) AS transactions,
       SUM(sales) AS sales
FROM wfm_transactions AS tr
INNER JOIN wfm_products AS pr ON pr.product_id = tr.product_id
WHERE EXTRACT(YEAR
              FROM transaction_date) = 2017
GROUP BY product_category
ORDER BY sum(sales) DESC;

The EXTRACT() function is used on the column transaction_date to extract year from it. Used in the WHERE clause like above means it will show only transactions from 2017.

INTERVAL

Explanation

The INTERVAL function is used to add or subtract a time interval to the already existing date or time.

Graphical Representation

INTERVAL Function Representation in SQL Cheat Sheet

Interview Question

Number of Comments Per User in 30 days before 2020-02-10

Last Updated: January 2021

SQL Cheat Sheet: Aggregate Functions

The SQL aggregate functions are the functions that perform a calculation on the set of rows and return a single value.

They can be represented in the following manner.

Aggregate Functions Representation in SQL Cheat Sheet

The most often used functions are:

COUNT()
SUM()
AVG()
MIN()
MAX()

The aggregate functions can also be used as window functions. What they are will be explained in the window functions section.

COUNT()

Explanation

The COUNT() function counts rows. There are three ways of using this function.COUNT(*) returns all the rows from the table, including duplicates and NULL values.

The second usage is COUNT(expression). This one will return all the rows from the table, including duplicates, but will ignore the NULL values.

The COUNT (DISTINCT expression) result is the number of distinct (unique) rows without the NULL values.

Graphical Representation

Interview Question

Businesses Open On Sunday

Yelp

MediumID 10178

Data Engineer

Data Scientist

BI Analyst

Data Analyst

Find the number of businesses that are open on Sundays. Output the slot of operating hours along with the corresponding number of businesses open during those time slots. Order records by total number of businesses opened during those hours in descending order.

Tables: yelp_business_hours, yelp_business

Link to the question: https://platform.stratascratch.com/coding/10178-businesses-open-on-sunday

Solution:

SELECT sunday,
       COUNT(*) AS total_business
FROM yelp_business_hours business_hours
LEFT JOIN yelp_business business ON business_hours.business_id = business.business_id
WHERE sunday IS NOT NULL
  AND is_open = 1
GROUP BY sunday
ORDER BY total_business DESC;

Once the tables are joined and data filtered, using the COUNT(*) function will count the number of business per slot of operating hours. In this case, using the COUNT(business.business_id) or COUNT(DISTINCT business.business_id) would get you the same result.

SUM()

Explanation

The SUM() function is an aggregate function used for getting the sum of the rows’ values.

Graphical Representation

Interview Question

Calculate Samantha's and Lisa's total sales revenue

Amazon

Groupon

Salesforce

EasyID 10127

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

What is the total sales revenue of Samantha and Lisa?

Table: sales_performance

Link to the question: https://platform.stratascratch.com/coding/10127-calculate-samanthas-and-lisas-total-sales-revenue

Solution:

SELECT SUM(sales_revenue) AS total_revenue
FROM sales_performance
WHERE salesperson = 'Samantha'
  OR salesperson = 'Lisa';

The solution uses the SUM() function on the sales_revenue column to calculate Samantha’s and Lisa’s sales.

AVG()

Explanation

The AVG() function calculates the average value of the rows. In other words, it sums the values of the rows and divides the result by the number of rows. If there are NULL values, the AVG() will ignore it, i.e. it divides the sum only by the number of rows with the non-NULL values.

Graphical Representation

Interview Question

Income By Title and Gender

City of San Francisco

MediumID 10077

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the average total compensation based on employee titles and gender. Total compensation is calculated by adding both the salary and bonus of each employee. However, not every employee receives a bonus so disregard employees without bonuses in your calculation. Employee can receive more than one bonus. Output the employee title, gender (i.e., sex), along with the average total compensation.

Tables: sf_employee, sf_bonus

Link to the question: https://platform.stratascratch.com/coding/10077-income-by-title-and-gender

Solution:

SELECT e.employee_title,
       e.sex,
       AVG(e.salary + b.ttl_bonus) AS avg_compensation
FROM sf_employee e
INNER JOIN
  (SELECT worker_ref_id,
          SUM(bonus) AS ttl_bonus
   FROM sf_bonus
   GROUP BY worker_ref_id) b ON e.id = b.worker_ref_id
GROUP BY employee_title,
         sex;

The AVG() function is applied on the sum of the columns salary and ttl_bonus, because these two columns constitute the total compensation. Therefore, the query will show the average salary by the employee title and gender.

MIN() & MAX()

Explanation

The MIN() and MAX() aggregate functions are two sides of the same coin. The MIN() returns the minimum or the lowest value, while the MAX() will show the maximum or the highest value.

Graphical Representation

Interview Question

Class Performance

Last Updated: December 2020

Box

MediumID 10310

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

You are given a table containing assignment scores of students in a class. Write a query that identifies the largest difference in total score of all assignments. Output just the difference in total score (sum of all 3 assignments) between a student with the highest score and a student with the lowest score.

Table: box_scores

Link to the question: https://platform.stratascratch.com/coding/10310-class-performance

Solution:

SELECT MAX(score)-MIN(score) AS difference_in_scores
FROM
  (SELECT student,
          SUM(assignment1+assignment2+assignment3) AS score
   FROM box_scores
   GROUP BY student) a;

To get the largest difference in total score, the maximum and minimum score has to be calculated. The MAX() and MIN() functions will do that when applied to the column score. Once these values are calculated, they have to be subtracted to get the difference between the highest and lowest score.

SQL Cheat Sheet: Window Functions

The SQL window functions are functions executing the calculation over the rows that are related to the current row.

They are similar to the aggregate functions used with the SQL GROUP BY clause. The main difference is the window functions do not collapse the individual rows while showing the aggregated values.

This is how they work:

They can broadly be divided into three categories:

Aggregate Window Functions
Ranking Window Functions
Value Window Functions

The interviewers are usually most interested in the Ranking Window Functions.

The window functions’ name comes from the window frame. It is the way of dividing the set of data into blocks. The window frame can be all the rows with the same data in the specified column(s). Or it can be a certain number of rows preceding and following the current row.

Partition or Window Frame for SQL Cheat Sheet

The window functions syntax is:

window_function ( expression ) OVER ([PARTITION BY list_of_columns] [ORDER BY list_of_columns] [ROW or RANGE clause])

PARTITION BY

The PARTITION BY clause is used to define the window frame. In other words, the column according to which the data will be partitioned is defined using the PARTITION BY. The window function will be executed on each partition separately when this is done.

ORDER BY

The ORDER BY is another sub-clause used in the OVER() clause. Unlike the ‘regular’ ORDER BY, it is not used to order data in the query output. In the window functions, it is used to specify the order in which the window function will be executed. This can be in ascending or descending order.

RANGE or ROW

The third important part of the OVER() clause is the RANGE or ROW clause. They are also used to define a window frame. They don’t do it according to the table column, like PARTITION BY. The RANGE and ROW clauses define the start and end rows of the window frame.

The RANGE clause does that logically by specifying the rows in relation to the current row’s value.

The ROW clause doesn’t look at the current row’s value. It simply defines a window frame by the number of rows before and/or after the current row.

The options for using these clauses are:

UNBOUNDED PRECEDING – all the rows before the current row and including the current row
UNBOUNDED FOLLOWING – all the rows after the current row and including the current row
N PRECEDING – defined number of rows before the current row and including the current row
M FOLLOWING – defined number of rows after the current row and including the current row
CURRENT ROW – only the current row

These options can be combined freely according to the needs.

RANK()

Explanation

The RANK() is the function that ranks data in a table and skips the ties. This means it will allocate the same rank where the values are the same, but the next non-tie value will be non-sequentially ranked. It will skip the number of ranks in line with the number of ties.

Graphical Representation

Interview Question

Update Call Duration

Last Updated: January 2021

Redfin

MediumID 2022

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Software Engineer

Redfin helps clients to find agents. Each client will have a unique request_id and each request_id has several calls. For each request_id, the first call is an “initial call” and all the following calls are “update calls”. What's the average call duration for all update calls?

Table: redfin_call_tracking

Link to the question: https://platform.stratascratch.com/coding/2022-update-call-duration

Solution:

SELECT AVG(call_duration)
FROM redfin_call_tracking
WHERE id in
    (SELECT id
     FROM
       (SELECT *,
               RANK() OVER (PARTITION BY request_id
                            ORDER BY created_on) AS rk
        FROM redfin_call_tracking) sq
     WHERE rk > 1);

The above solution uses the RANK() window function in the subquery. After the RANK() function, the OVER() clause is written, which signals the window function

The data is partitioned by the request ID, which means every request ID will be ranked separately. The ORDER BY means the order in which the ranking will be performed. In this case, the ranking will be executed according to the creation date in the ascending order.

In practice, this means the RANK() function will rank the request ID from the oldest to the newest date. Once all dates within one request ID are ranked, the window functions will go to the next request ID and start ranking from 1 again.

DENSE_RANK()

Explanation

The SQL DENSE_RANK() is also the Ranking Window Function. Unlike RANK(), it doesn’t skip ties. This means all the non-tied rows will be ranked sequentially, while the tied values will have the same rank.

Graphical Representation

Interview Question

Ranking Hosts By Beds

Last Updated: July 2020

Airbnb

MediumID 10161

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Rank each host based on the number of beds they have listed. The host with the most beds should be ranked 1 and the host with the least number of beds should be ranked last. Hosts that have the same number of beds should have the same rank but there should be no gaps between ranking values. A host can also own multiple properties. Output the host ID, number of beds, and rank from highest rank to lowest.

Table: airbnb_apartments

Link to the question: https://platform.stratascratch.com/coding/10161-ranking-hosts-by-beds

Solution:

SELECT host_id,
       SUM(n_beds) AS number_of_beds,
       DENSE_RANK() OVER(
                         ORDER BY SUM(n_beds) DESC) AS rank
FROM airbnb_apartments
GROUP BY host_id
ORDER BY number_of_beds DESC;

The DENSE_RANK() function is introduced here, also using the OVER() clause. There’s no PARTITION BY clause, which means the ranking will be executed across the whole table. The data will be ranked by the sum of the number of beds from the highest to the lowest.

ROW_NUMBER

Explanation

The third ranking function allocates the row number to the rows. The row number is allocated sequentially, which also includes the tied values. It’s also possible that the ROW_NUMBER doesn’t return the same row number each time. This happens when the values or the combination of values in the columns used in PARTITION BY and ORDER BY are not unique.

Graphical Representation

Interview Question

Activity Rank

Last Updated: July 2021

Google

MediumID 10351

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the email activity rank for each user. Email activity rank is defined by the total number of emails sent. The user with the highest number of emails sent will have a rank of 1, and so on. Output the user, total emails, and their activity rank.

• Order records first by the total emails in descending order. • Then, sort users with the same number of emails in alphabetical order by their username. • In your rankings, return a unique value (i.e., a unique rank) even if multiple users have the same number of emails.

Table: google_gmail_emails

Link to the question: https://platform.stratascratch.com/coding/10351-activity-rank

Solution:

SELECT from_user,
       COUNT(*) AS total_emails,
       ROW_NUMBER() OVER (
                          ORDER BY count(*) DESC, from_user ASC)
FROM google_gmail_emails
GROUP BY from_user
ORDER BY 2 DESC,
         1;

Due to the question’s request to return a unique ranking value even when the users have the same number of emails, the ROW_NUMBER() is used for ranking. The function name is followed by the OVER() clause as in the previous examples. The ranking is here performed by the number of emails in descending order and alphabetically if there are ties.

SQL Cheat Sheet: Set Operators

These operators allow for combining the outputs of multiple queries into a single data set.

The two most used operators are:

UNION
UNION ALL

UNION

Explanation

The UNION operator returns only the unique values from the outputs of two or more queries. There must be an equal number of columns in all queries for this operator to work. The columns have to be of the same data type between them, too.

Graphical Representation

Interview Question

Find all possible varieties which occur in either of the winemag datasets

Wine Magazine

MediumID 10025

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find all possible varieties which occur in either of the winemag datasets. Output unique variety values only. Sort records based on the variety in ascending order.

Tables: winemag_p1, winemag_p2

Link to the question: https://platform.stratascratch.com/coding/10025-find-all-possible-varieties-which-occur-in-either-of-the-winemag-datasets

Solution:

SELECT DISTINCT variety
FROM winemag_p1
UNION
SELECT DISTINCT variety
FROM winemag_p2
WHERE variety IS NOT NULL
ORDER BY variety;

The distinct wine varieties have to be found separately using two SELECT statements. The first one will query the table winemag_p1, the second one will query the table winemag_p2. Since the question asks to output only the unique varieties, the UNION operator needs to be used: it is simply written between the two SELECT statements.

UNION ALL

Explanation

The UNION ALL operator does the same as the UNION operator, with one important difference: UNION ALL returns all data from all queries, including the duplicate values.

Graphical Representation

Interview Question

Sum Of Numbers

Tesla

Uber

MediumID 10008

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the sum of numbers whose index is less than 5 and the sum of numbers whose index is greater than 5. Output each result on a separate row.

Table: transportation_numbers

Link to the question: https://platform.stratascratch.com/coding/10008-sum-of-numbers

Solution:

SELECT SUM(number)
FROM transportation_numbers
WHERE index < 5
UNION ALL
SELECT SUM(number)
FROM transportation_numbers
WHERE index > 5;

The first query will return the sum of numbers where index is less than 5. The second query will do the same, but for the index above 5. To merge the outputs of both queries, the UNION ALL is written between the SELECT statements. Why UNION ALL instead of UNION? Because the results of both queries can be the same. In fact, they are the same! If UNION was used, the result would have been shown only in one row. This would mean not getting the correct result because the question asks to output results in separate rows.