How to Import Pandas as pd in Python
Importing pandas as pd: an essential Python library for data scientists. Once you import it, you can take your data analysis to a whole new level.
As a general purpose programming language, Python has all the features necessary to analyze and gain insights from data. For this reason, interviewers often expect prospective data scientists to be proficient in Python.
While it has its own advantages, like readability and flexibility, the standard Python language alone is insufficient for serious data analysis. Thankfully, Python supports external libraries that extend its basic functionality. Pandas is one such library: it extends the language beyond its built-in features and provides a toolkit for performing common data science tasks.
Pandas in Data Science
Pandas is an essential tool for doing data science in Python. It allows you to perform many tasks that would be next to impossible with the standard library alone: work with data in various formats (tabular or labeled), sort and format it, combine data from multiple sources, find and clean up messy data, and even visualize it.
Pandas API is easy to follow, especially if you already know SQL. Functions are well documented and named descriptively. Many of the features have the same name as their counterparts in SQL.
If the job you’re applying for requires some knowledge of Python, you will have to use functions from the pandas library a lot. Interviewers might ask you questions to measure your ability to make use of all the functions from the pandas library.
What Can Pandas Do?
Short answer is that you need Pandas to gain access to many functions for performing data analysis in Python. It is especially useful for doing data modeling, data analysis and data manipulation.
Pandas provides two data structures, Series and DataFrame, which give you the flexibility to work with data sets in different formats, like tabular, time-series, and matrix data. The DataFrame provides the foundation for essential operations like data alignment, calculating statistics, slicing, grouping, merging, and adding or splitting sets of data. It is designed for working with two-dimensional, tabular data.
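As a minimal illustration of both structures (all values here are made up):

```python
import pandas as pd

# A Series is a one-dimensional labeled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is a two-dimensional table of labeled columns
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "revenue": [123, 421, 176],
})

print(s["b"])                # label-based access -> 20
print(df["revenue"].mean())  # column statistics -> 240.0
```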
For example, the sort_values() and sort_index() functions are commonly used for sorting and organizing data. merge() allows you to combine and work with data from two tables, similar to JOINs in SQL; you can even specify the shared key and the method for combining the data. The concat() function stacks Series or DataFrames along an axis.
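Here is a quick sketch of merge() with two made-up tables (the names and values are invented for illustration):

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3],
                       "amount": [50, 70, 20]})

# merge() works like an SQL JOIN: `on` names the shared key,
# `how` picks the join type (inner, left, right, outer)
joined = pd.merge(customers, orders, on="cust_id", how="inner")
print(joined)  # Ben has no orders, so he is dropped by the inner join
```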
The read_csv() and read_excel() functions let you load data from CSV files and Excel spreadsheets. Calling df.head() returns the first five rows by default; you can provide an argument to specify the number of rows you want it to return.
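For example, a minimal sketch of read_csv() and head(), reading from an in-memory string instead of a real file (the CSV contents are made up):

```python
import pandas as pd
from io import StringIO

# Hypothetical CSV data; in practice you would pass a file path
csv_data = StringIO("id,item,revenue\n1,milk,123\n2,biscuit,421\n3,banana,862\n")
df = pd.read_csv(csv_data)

# head() returns the first five rows by default; pass n to change that
print(df.head(2))
```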
Functions and attributes from the pandas library can be used to get information about the available data. df.shape returns the number of rows and columns, and df.size returns the number of rows multiplied by the number of columns (note that both are attributes, not functions, so they are accessed without parentheses). df.info() prints the value types for each column.
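For example, with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df.shape)  # (rows, columns) -> (3, 2)
print(df.size)   # rows * columns -> 6
df.info()        # column names, dtypes, non-null counts
```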
Features like pd.to_datetime(), the dt.to_period() method, and the dt.year and dt.month attributes are very useful for working with datetime values.
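A small sketch of these accessors, using made-up dates (note that dt.year and dt.month are attributes, not methods):

```python
import pandas as pd

dates = pd.Series(["2020-03-03", "2020-07-18"])
parsed = pd.to_datetime(dates)

# The .dt accessor exposes datetime components
print(parsed.dt.year.tolist())   # [2020, 2020]
print(parsed.dt.month.tolist())  # [3, 7]
```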
The library also contains features for cleaning up messy data. You can set up conditions, limits, and formatting rules for detecting outlier cases. You can also define how to fix bad data. Doing this will help you remove unrealistic values, which can improve the accuracy of your data analysis.
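As one possible illustration of outlier filtering (the data and the plausible-age thresholds are made-up assumptions):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 34, -3, 310, 41]})

# Keep only rows whose age falls in a plausible range;
# -3 and 310 are treated as bad data and dropped
valid = df[(df["age"] >= 0) & (df["age"] <= 120)]
print(valid["age"].tolist())  # [25, 34, 41]
```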
How to Import Pandas as pd in Python
Installing Python alone is not enough to import the Pandas library; you need to install the library itself. Recent releases of pandas require Python 3 (only older releases supported Python 2.7).
Newer versions of Python (3.4 and higher) include pip, a tool for installing packages like the pandas library. If you're running an older version of Python, you can install pip manually or download the latest version of Python from https://www.python.org.
Once you have pip set up, open the command prompt and type the following command:
pip install pandas
Once the installation is complete, the command prompt will display a success message, such as: Successfully installed pandas-0.24.2
You can also install Pandas without using the command prompt: download and install the Anaconda distribution, which includes Pandas automatically.
Note: During the installation process, it’s essential to tick the checkbox for ‘Add Anaconda to my PATH environment variable’ option.
Once Pandas is installed, you can import pandas as pd, along with numpy, the numerical library that pandas is built on:
import pandas as pd
import numpy as np
We could simply write ‘import pandas’ to make the module available. However, it is good practice to use the as keyword to give it the shorthand name ‘pd’; the alias is a convention, not a requirement.
In this example, we also import numpy, another data science library, which is often used with pandas.
Example Problems Where You Need to Import Pandas as pd
Pandas library provides essential features for doing data analysis in Python. For this reason, solutions to Python interview questions often rely on features from the Pandas library.
Question 1 - Finding User Purchases
Solving this question comes down to working with purchase dates to find instances when a single user ordered two times within a 7-day period. Functions from the pandas library will be essential for finding an answer.
Interview Question Date: December 2020
Identify returning active users by finding users who made a second purchase within 7 days of any previous purchase. Output a list of these user_ids.
Link to the question: https://platform.stratascratch.com/coding/10322-finding-user-purchases
Available Data
The first step towards finding an answer is to look at the available data. In this case, we have only one amazon_transactions table with five columns.
Let’s preview actual data in the table:
| id | user_id | item | created_at | revenue |
|---|---|---|---|---|
| 1 | 109 | milk | 2020-03-03 | 123 |
| 2 | 139 | biscuit | 2020-03-18 | 421 |
| 3 | 120 | milk | 2020-03-18 | 176 |
| 4 | 108 | banana | 2020-03-18 | 862 |
| 5 | 130 | milk | 2020-03-28 | 333 |
Step-by-Step Code Solution
Step 1: Import Pandas as pd in Python
Pandas and Numpy libraries provide an essential toolkit for performing complex data science tasks.
Almost all functions we use to answer this question come from the pandas library.
import pandas as pd
import numpy as np
from datetime import datetime
The datetime library is useful for working with date values.
Step 2: Convert date values
First, we need to convert values in the created_at column to datetime type, and format them.
For that, we use the to_datetime() and dt.strftime() functions from pandas library.
pd.to_datetime() takes one argument: the values (here, a DataFrame column) to convert to datetime objects. dt.strftime() is then used to specify the date format.
amazon_transactions["created_at"] = pd.to_datetime(amazon_transactions["created_at"]).dt.strftime('%m-%d-%Y')
Step 3: Arrange dates in an ascending order
In the next step, we need to put date values (from the created_at column) in an ascending order. When it comes to dates, ascending order means from earlier to later.
We use another function from the Pandas library, sort_values(). We arrange rows in ascending order based on the values in the user_id and created_at columns.
df = amazon_transactions.sort_values(by=['user_id', 'created_at'], ascending=[True, True])
Step 4: Find the date of previous purchase
In this step, we create a new column to store the closest previous date when the same user placed an order.
We generate values for the new column by shifting the created_at values one row down within each user's group. This stores the previous time the same user placed an order. To accomplish this, we use the shift() function from the Pandas library together with groupby(), which shifts the values of a specific column within each group.
We give the new column a descriptive name, prev_value. For every row, the value in the prev_value column equals the created_at value from the same user's previous row; for a user's second purchase, prev_value holds the created_at value of their first purchase.
df['prev_value'] = df.groupby('user_id')['created_at'].shift()
Step 5: Find repeat orders within 7 days
Finally, we find the difference between consecutive orders placed by the same user by subtracting the prev_value dates from the created_at dates. For that, we once again use the pd.to_datetime() function, together with the dt.days accessor to express the difference in whole days.
dt.days is an attribute (not a function) that returns the day component of a timedelta value.
df['days'] = (pd.to_datetime(df['created_at']) - pd.to_datetime(df['prev_value'])).dt.days
result = df[df['days'] <= 7]['user_id'].unique()
We find every instance where the same user placed a second order within 7 days and store the matching user ids in the result variable. We use the unique() function so that each user id appears only once, no matter how many repeat orders that user placed.
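A tiny illustration of unique() with made-up values (it returns each value once, in order of first appearance):

```python
import pandas as pd

s = pd.Series([109, 109, 120, 139, 120])
print(s.unique().tolist())  # [109, 120, 139]
```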
Here is the final solution:
import pandas as pd
import numpy as np
from datetime import datetime
amazon_transactions["created_at"] = pd.to_datetime(amazon_transactions["created_at"]).dt.strftime('%m-%d-%Y')
df = amazon_transactions.sort_values(by=['user_id', 'created_at'], ascending=[True, True])
df['prev_value'] = df.groupby('user_id')['created_at'].shift()
df['days'] = (pd.to_datetime(df['created_at']) - pd.to_datetime(df['prev_value'])).dt.days
result = df[df['days'] <= 7]['user_id'].unique()
Output
Running this Python code will return the unique ids of users who placed a second order within 7 days of a previous one.
| user_id |
|---|
| 100 |
| 103 |
| 105 |
| 109 |
| 110 |
As you can see, in the process of finding an answer, we make extensive use of functions from the Pandas library. The Python code looks a little different from an SQL query, but both follow the same logical pattern.
Question 2 - Second Highest Salary
In order to find the second highest salary, we will need to rank values in the salary column. The pandas library provides all the necessary functions to solve this question in Python.
Interview Question Date: April 2019
Find the second highest salary of employees.
Link to the question: https://platform.stratascratch.com/coding/9892-second-highest-salary
Available Data
It’s a good practice to study the available data before answering the question. For this question, we are dealing with one employee table with fourteen columns.
The salary column contains integer (numeric) values to describe the salary of each employee.
Now, let’s look at actual data in the table:
| id | first_name | last_name | age | sex | employee_title | department | salary | target | bonus | email | city | address | manager_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Max | George | 26 | M | Sales | Sales | 1300 | 200 | 150 | Max@company.com | California | 2638 Richards Avenue | 1 |
| 13 | Katty | Bond | 56 | F | Manager | Management | 150000 | 0 | 300 | Katty@company.com | Arizona | | 1 |
| 11 | Richerd | Gear | 57 | M | Manager | Management | 250000 | 0 | 300 | Richerd@company.com | Alabama | | 1 |
| 10 | Jennifer | Dion | 34 | F | Sales | Sales | 1000 | 200 | 150 | Jennifer@company.com | Alabama | | 13 |
| 19 | George | Joe | 50 | M | Manager | Management | 100000 | 0 | 300 | George@company.com | Florida | 1003 Wyatt Street | 1 |
Step-by-Step Code Solution
Step 1: Import Pandas as pd in Python
We need to use functions from the pandas library to work with tabular data. In the following steps, it will become clear why it’s so important to import pandas and numpy.
import pandas as pd
import numpy as np
Step 2: Get a unique list of salaries
To start off, we use the drop_duplicates() function from the Pandas library, which removes rows with duplicate values in a specific column. In this case, the salary column. This leaves us with the list of unique salary values.
distinct_salary = employee.drop_duplicates(subset = 'salary')
Step 3: Rank salaries
The question asks us to find the second highest salary. Therefore we will have to use the rank() function, which is a Pandas alternative to ranking window functions in SQL. It assigns a numeric value to each row, corresponding to its rank.
We create a ‘rnk’ column to rank values in the distinct_salary data set. When calling the rank() function, we provide two arguments, one to specify the type of ranking it should perform (dense), and another for order of values.
distinct_salary['rnk'] = distinct_salary['salary'].rank(method='dense', ascending=False)
In SQL, we have several window ranking functions. The pandas library provides a single rank() function, but we can specify the type of ranking via the method argument (method='dense').
Dense ranking handles ties as follows: if two salary values are the same, both receive the same rank, and the next distinct value receives the next rank, so no ranks are skipped.
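A quick sketch of dense ranking with a tie (the salary values are made up):

```python
import pandas as pd

salaries = pd.Series([250000, 150000, 150000, 100000])

# method='dense': tied values share a rank, and the next
# distinct value gets the next rank (no gaps)
dense = salaries.rank(method="dense", ascending=False)
print(dense.tolist())  # [1.0, 2.0, 2.0, 3.0]
```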
Step 4: Find the second highest salary
Once we rank all salaries, all that is left to do is return the salary that has a rank value of 2.
result = distinct_salary[distinct_salary.rnk == 2][['salary']]
The final solution will look like this:
import pandas as pd
import numpy as np
distinct_salary = employee.drop_duplicates(subset = 'salary')
distinct_salary['rnk'] = distinct_salary['salary'].rank(method='dense', ascending=False)
result = distinct_salary[distinct_salary.rnk == 2][['salary']]
Output
Running the code above will return just one column with the second highest salary:
| salary |
|---|
| 200000 |
Question 3 - Revenue Over Time
This is a difficult question with a lot of moving parts. We will need to import pandas as pd to use functions for working with date values. We also need this library to use other essential utilities, like groupby() function, aggregate functions and rolling() function, which is very useful for this particular question.
Interview Question Date: December 2020
Find the 3-month rolling average of total revenue from purchases given a table with users, their purchase amount, and date purchased. Do not include returns which are represented by negative purchase values. Output the year-month (YYYY-MM) and 3-month rolling average of revenue, sorted from earliest month to latest month.
A 3-month rolling average is defined by calculating the average total revenue from all user purchases for the current month and previous two months. The first two months will not be a true 3-month rolling average since we are not given data from last year. Assume each month has at least one purchase.
Link to the question: https://platform.stratascratch.com/coding/10314-revenue-over-time
Available Data
Looking at data can help you better understand the question. In this case, we are dealing with data contained in the amazon_purchases table with three columns.
Previewing the table can help you better understand the data:
| user_id | created_at | purchase_amt |
|---|---|---|
| 10 | 2020-01-01 | 3742 |
| 11 | 2020-01-04 | 1290 |
| 12 | 2020-01-07 | 4249 |
| 13 | 2020-01-10 | 4899 |
| 14 | 2020-01-13 | -4656 |
Step-by-Step Code Solution
Step 1: Import Pandas as pd in Python
As a data scientist proficient in Python, you are expected to know what tools you’ll need to answer the interview question using the fewest lines of code.
pandas is one of the essential libraries, providing functions like to_datetime(), to_period(), groupby(), and sum(), which you'll need to find the answer.
import pandas as pd
import numpy as np
from datetime import datetime
Step 2: Set the format for float values
First, we modify pd.options to specify how numbers are displayed when outputting a DataFrame.
pd.options.display.float_format = "{:,.2f}".format
In this case, we tell pandas to output float values with 2 decimal places and use a comma as the thousands separator for readability.
Step 3: Filter out negative purchase amount values
The question tells us not to account for negative purchase_amt values, which represent returns. We set a condition to keep only rows where the value of the purchase_amt column is greater than 0, and store them in the df variable.
df=amazon_purchases[amazon_purchases['purchase_amt']>0]
Step 4: Convert datetime values to months
In the next step, we take the created_at values in the available data and use the pd.to_datetime() function together with the dt.to_period() method to transform them into month values.
df['month_year'] = pd.to_datetime(df['created_at']).dt.to_period('M')
dt.to_period() is another pandas feature: it converts datetime values into period values at a given frequency, such as 'M' for month, 'Y' for year, or 'D' for day.
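A small illustration with made-up dates:

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(["2020-01-04", "2020-01-28", "2020-02-03"]))

# 'M' collapses each timestamp to its calendar month
months = ts.dt.to_period("M").astype(str)
print(months.tolist())  # ['2020-01', '2020-01', '2020-02']
```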
Step 5: Get purchase totals for each month
We use the groupby() function to aggregate purchases by the month in which they happened, and use the sum() function to calculate total purchase_amt values for each group. Both of these functions are from the pandas library.
groupby() works similarly to the GROUP BY statement in SQL. It creates several groups based on multiple rows that have the same value in a specific column, in this case the ‘month_year’ column.
The sum() function works similarly to the aggregate function of the same name in SQL. It adds up values in the ‘purchase_amt’ column of each group.
df1 = df.groupby('month_year')['purchase_amt'].sum().reset_index(name='monthly_revenue').sort_values('month_year')
df1.set_index('month_year', inplace=True)
We use the set_index() method to make month_year the index of the df1 data frame, so the month labels identify each row. This keeps the month labels attached to the rolling averages we calculate next.
Step 6: Calculate rolling 3-month average
Finally, we use the rolling() function from the Pandas library to calculate a three-month rolling revenue. This function performs rolling window calculations. Here it takes two arguments: the window size (the number of periods) and min_periods, the minimum number of observations required to produce a value.
We create a new variable rolling_windows to store three month revenues. We need to use the mean() function to calculate three month rolling averages. Finally, we use the to_records() function to convert DataFrame structure to a NumPy record array.
rolling_windows = df1.rolling(3, min_periods=1)
rolling_mean = rolling_windows.mean()
result=rolling_mean.to_records()
Here is the final answer to this question:
import pandas as pd
import numpy as np
from datetime import datetime
pd.options.display.float_format = "{:,.2f}".format
df=amazon_purchases[amazon_purchases['purchase_amt']>0]
df['month_year'] = pd.to_datetime(df['created_at']).dt.to_period('M')
df1 = df.groupby('month_year')['purchase_amt'].sum().reset_index(name='monthly_revenue').sort_values('month_year')
df1.set_index('month_year', inplace=True)
rolling_windows = df1.rolling(3, min_periods=1)
rolling_mean = rolling_windows.mean()
result=rolling_mean.to_records()
Output
Running this code will output the 3-month rolling average for each month:
| month_year | monthly_revenue |
|---|---|
| 2020-01 | 26292 |
| 2020-02 | 23493.5 |
| 2020-03 | 25535.67 |
| 2020-04 | 24082.67 |
| 2020-05 | 25417.67 |
Question 4 - Top Percentile Fraud
In this question, we have to find the top 5 percentile claims for each state. For that, we will need to use the groupby() and rank() functions, similar to window ranking functions in SQL. The pandas library provides these and all other functions necessary to get the answer.
Interview Question Date: November 2020
ABC Corp is a mid-sized insurer in the US and in the recent past their fraudulent claims have increased significantly for their personal auto insurance portfolio. They have developed a ML based predictive model to identify propensity of fraudulent claims. Now, they assign highly experienced claim adjusters for top 5 percentile of claims identified by the model. Your objective is to identify the top 5 percentile of claims from each state. Your output should be policy number, state, claim cost, and fraud score.
Link to the question: https://platform.stratascratch.com/coding/10303-top-percentile-fraud
Solution
Another challenge which can be answered using the pandas library in Python. First, we need to use certain functions from the pandas library to generate percentile values for each state and store them in a separate column.
Once we have percentile values in a separate column, we will need to filter them. The question tells us to find records with percentile values of 0.95 or higher. Perhaps it would be a good idea to store the filtered records in a variable.
As a last step, we will need to output four columns specified in the question description.
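The steps above could be sketched like this. Note that the data below is a made-up stand-in for the platform's table, and the table and column names (fraud_score, policy_num, state, claim_cost) are assumptions that may differ from the actual data set:

```python
import pandas as pd

# Hypothetical data standing in for the platform's fraud table
fraud_score = pd.DataFrame({
    "policy_num": [f"P{i}" for i in range(1, 41)],
    "state": ["CA"] * 20 + ["TX"] * 20,
    "claim_cost": range(1000, 1040),
    "fraud_score": [i / 40 for i in range(1, 41)],
})

# Percentile of each claim's fraud score within its own state
fraud_score["percentile"] = (
    fraud_score.groupby("state")["fraud_score"].rank(pct=True)
)

# Keep the top 5 percentile and output the requested columns
result = fraud_score[fraud_score["percentile"] >= 0.95][
    ["policy_num", "state", "claim_cost", "fraud_score"]
]
print(result)
```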
Try to answer the question yourself below:
Question 5 - Total Cost Of Orders
The premise of the question is fairly simple. We have to aggregate the total cost of orders placed by customers. We will need functions from the pandas library to do the aggregation and other tasks to get the answer.
Interview Question Date: July 2020
Find the total cost of each customer's orders. Output customer's id, first name, and the total order cost. Order records by customer's first name alphabetically.
Link to the question: https://platform.stratascratch.com/coding/10183-total-cost-of-orders
Solution
For this question, we have two DataFrames (tables) - customers and orders. Neither of these tables alone contain enough information to find the answer.
The pandas library comes with a very useful merge() function, which has a functionality similar to JOINs in SQL. For arguments, we provide names of each table, and the columns that contain overlapping values.
The question asks us to calculate the total cost of orders for each user. Most probably, there are multiple orders placed by one user. In this case, you will need to use groupby() and sum() to aggregate total cost of orders for each user.
Finally, you will need a certain function from the pandas library to organize rows in an ascending alphabetical order of first names. And we must not forget to output the columns specified in the description.
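Here is one possible sketch of this approach. The tables below are made-up stand-ins, and the column names (id, first_name, cust_id, total_order_cost) are assumptions that may not match the platform's actual schema:

```python
import pandas as pd

# Hypothetical stand-ins for the customers and orders tables
customers = pd.DataFrame({"id": [1, 2, 3],
                          "first_name": ["Jill", "Eva", "Malcolm"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2],
                       "total_order_cost": [100, 50, 80]})

# Join the tables on the shared customer key, aggregate the
# order cost per customer, and sort alphabetically by first name
merged = pd.merge(customers, orders, left_on="id", right_on="cust_id")
result = (merged.groupby(["id", "first_name"])["total_order_cost"]
                .sum()
                .reset_index()
                .sort_values("first_name"))
print(result)
```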
Try to get the answer yourself:
Check out our post “Python Pandas Interview Questions” to practice more such questions.
Conclusion
In this article, we explained how to import pandas as pd in Python and why the library is so important for doing data science. Mastering its most essential features can bring you one step closer to landing a job and fast-track your career once you get it.
If you haven’t yet mastered pandas and Python, StrataScratch platform is a great place to start.
It has dozens of practical interview questions that require you to import and use functions from the pandas library. The platform allows you to look at available data, see approach hints if you are stuck, check the output of your code as you write it, and other features for interactive learning.
In short, StrataScratch has everything you need to hone your skills in pandas as well as Python. And as you might’ve heard, practice makes perfect, especially when it comes to technical skills like doing data science in Python.