How to Import Pandas as pd in Python
Importing pandas as pd: an essential Python library for data scientists. Once you import it, you can take your data analysis to a whole new level.
As a general purpose programming language, Python has all the features necessary to analyze and gain insights from data. For this reason, interviewers often expect prospective data scientists to be proficient in Python.
While it has its own advantages, like readability and flexibility, the standard Python language alone is insufficient for serious data analysis. Thankfully, Python supports external libraries that extend its basic functionality. Pandas is one such library: it extends the language beyond its built-in features and provides a toolkit for performing common data science tasks.
Pandas in Data Science
Pandas is an essential tool for doing data science in Python. It allows you to perform many tasks that would be next to impossible with the standard library alone: work with data in various formats (tabular or labeled), sort and format it, combine data from multiple sources, find and clean up messy data, and even visualize it.
Pandas API is easy to follow, especially if you already know SQL. Functions are well documented and named descriptively. Many of the features have the same name as their counterparts in SQL.
If the job you’re applying for requires some knowledge of Python, you will have to use functions from the pandas library a lot. Interviewers might ask you questions to measure your ability to make use of all the functions from the pandas library.
What Can Pandas Do?
Short answer is that you need Pandas to gain access to many functions for performing data analysis in Python. It is especially useful for doing data modeling, data analysis and data manipulation.
Pandas provides two data structures, Series and DataFrame, which give you the flexibility to work with data sets in different formats, like tabular, time-series, and matrix data. The DataFrame provides the foundation for essential operations like data alignment, calculating statistics, slicing, grouping, merging, and adding or splitting sets of data. It is designed for working with two-dimensional, tabular data.
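As a minimal illustration of both structures (all values here are made up):

```python
import pandas as pd

# A Series is a one-dimensional labeled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is a two-dimensional table of labeled columns
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "revenue": [123, 421, 176],
})

print(s["b"])                # label-based access -> 20
print(df["revenue"].mean())  # column statistics -> 240.0
```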
For example, the sort_values() and sort_index() functions are commonly used for sorting and organizing data. merge() allows you to combine and work with data from two tables, similar to JOINs in SQL; you can even specify the shared key and the method for combining the data. The concat() function stacks Series or DataFrames along an axis.
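Here is a quick sketch of merge() with two made-up tables (the names and values are invented for illustration):

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3],
                       "amount": [50, 70, 20]})

# merge() works like an SQL JOIN: `on` names the shared key,
# `how` picks the join type (inner, left, right, outer)
joined = pd.merge(customers, orders, on="cust_id", how="inner")
print(joined)  # Ben has no orders, so he is dropped by the inner join
```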
The read_csv() and read_excel() functions let you load data from CSV files and Excel spreadsheets. Calling df.head() returns the first five rows by default; you can provide an argument to specify the number of rows you want it to return.
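For example, a minimal sketch of read_csv() and head(), reading from an in-memory string instead of a real file (the CSV contents are made up):

```python
import pandas as pd
from io import StringIO

# Hypothetical CSV data; in practice you would pass a file path
csv_data = StringIO("id,item,revenue\n1,milk,123\n2,biscuit,421\n3,banana,862\n")
df = pd.read_csv(csv_data)

# head() returns the first five rows by default; pass n to change that
print(df.head(2))
```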
Functions and attributes from the pandas library can be used to get information about the available data. df.shape returns the number of rows and columns, and df.size returns the number of rows multiplied by the number of columns (note that both are attributes, not functions, so they are accessed without parentheses). df.info() prints the value types for each column.
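For example, with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df.shape)  # (rows, columns) -> (3, 2)
print(df.size)   # rows * columns -> 6
df.info()        # column names, dtypes, non-null counts
```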
Features like pd.to_datetime(), the dt.to_period() method, and the dt.year and dt.month attributes are very useful for working with datetime values.
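A small sketch of these accessors, using made-up dates (note that dt.year and dt.month are attributes, not methods):

```python
import pandas as pd

dates = pd.Series(["2020-03-03", "2020-07-18"])
parsed = pd.to_datetime(dates)

# The .dt accessor exposes datetime components
print(parsed.dt.year.tolist())   # [2020, 2020]
print(parsed.dt.month.tolist())  # [3, 7]
```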
The library also contains features for cleaning up messy data. You can set up conditions, limits, and formatting rules for detecting outlier cases. You can also define how to fix bad data. Doing this will help you remove unrealistic values, which can improve the accuracy of your data analysis.
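As one possible illustration of outlier filtering (the data and the plausible-age thresholds are made-up assumptions):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 34, -3, 310, 41]})

# Keep only rows whose age falls in a plausible range;
# -3 and 310 are treated as bad data and dropped
valid = df[(df["age"] >= 0) & (df["age"] <= 120)]
print(valid["age"].tolist())  # [25, 34, 41]
```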
How to Import Pandas as pd in Python
Installing Python alone is not enough to import the Pandas library; you need to install the library itself. Recent releases of pandas require Python 3 (only older releases supported Python 2.7).
Newer versions of Python (3.4 and higher) include pip, a tool for installing packages like the pandas library. If you're running an older version of Python, you can install pip manually or download the latest version of Python from https://www.python.org.
Once you have pip set up, open the command prompt and type the following command:
pip install pandas
Once the installation is complete, the command prompt will display a success message, such as: Successfully installed pandas-0.24.2
You can also install Pandas without using the command prompt: download and install the Anaconda distribution, which includes Pandas automatically.
Note: During the installation process, it’s essential to tick the checkbox for ‘Add Anaconda to my PATH environment variable’ option.
Once Pandas is installed, you can import pandas as pd, along with numpy, the numerical library that pandas is built on:
import pandas as pd
import numpy as np
We could simply write ‘import pandas’ to make the module available. However, it is good practice to use the as keyword to give it the shorthand name ‘pd’; the alias is a convention, not a requirement.
In this example, we also import numpy, another data science library, which is often used with pandas.
Example Problems Where You Need to Import Pandas as pd
Pandas library provides essential features for doing data analysis in Python. For this reason, solutions to Python interview questions often rely on features from the Pandas library.
Question 1 - Finding User Purchases
Solving this question comes down to working with purchase dates to find instances when a single user ordered two times within a 7-day period. Functions from the pandas library will be essential for finding an answer.
Interview Question Date: December 2020
Identify returning active users by finding users who made a second purchase within 7 days of any previous purchase. Output a list of these user_ids.
Link to the question: https://platform.stratascratch.com/coding/10322-finding-user-purchases
Available Data
The first step towards finding an answer is to look at the available data. In this case, we have only one amazon_transactions table with five columns.
Let’s preview actual data in the table:
| id | user_id | item | created_at | revenue |
|---|---|---|---|---|
| 1 | 109 | milk | 2020-03-03 | 123 |
| 2 | 139 | biscuit | 2020-03-18 | 421 |
| 3 | 120 | milk | 2020-03-18 | 176 |
| 4 | 108 | banana | 2020-03-18 | 862 |
| 5 | 130 | milk | 2020-03-28 | 333 |
Step-by-Step Code Solution
Step 1: Import Pandas as pd in Python
Pandas and Numpy libraries provide an essential toolkit for performing complex data science tasks.
Almost all functions we use to answer this question come from the pandas library.
import pandas as pd
import numpy as np
from datetime import datetime
The datetime library is useful for working with date values.
Step 2: Convert date values
First, we need to convert values in the created_at column to datetime type, and format them.
For that, we use the to_datetime() and dt.strftime() functions from pandas library.
pd.to_datetime() takes one argument: the values (here, a DataFrame column) to convert to datetime objects. dt.strftime() is then used to specify the date format.
amazon_transactions["created_at"] = pd.to_datetime(amazon_transactions["created_at"]).dt.strftime('%m-%d-%Y')
Step 3: Arrange dates in an ascending order
In the next step, we need to put date values (from the created_at column) in an ascending order. When it comes to dates, ascending order means from earlier to later.
We use another function from the Pandas library, sort_values(). We arrange rows in ascending order based on the values in the user_id and created_at columns.
df = amazon_transactions.sort_values(by=['user_id', 'created_at'], ascending=[True, True])
Step 4: Find the date of previous purchase
In this step, we create a new column to store the closest previous date when the same user placed an order.
We generate values for the new column by shifting the created_at values one row down within each user's group. This stores the previous time the same user placed an order. To accomplish this, we use the shift() function from the Pandas library together with groupby(), which shifts the values of a specific column within each group.
We give the new column a descriptive name, prev_value. For every row, the value in the prev_value column equals the created_at value from the same user's previous row; for a user's second purchase, prev_value holds the created_at value of their first purchase.
df['prev_value'] = df.groupby('user_id')['created_at'].shift()
Step 5: Find repeat orders within 7 days
Finally, we find the difference between consecutive orders placed by the same user by subtracting the prev_value dates from the created_at dates. For that, we once again use the pd.to_datetime() function, together with the dt.days accessor to express the difference in whole days.
dt.days is an attribute (not a function) that returns the day component of a timedelta value.
df['days'] = (pd.to_datetime(df['created_at']) - pd.to_datetime(df['prev_value'])).dt.days
result = df[df['days'] <= 7]['user_id'].unique()
We find every instance where the same user placed a second order within 7 days and store the matching user ids in the result variable. We use the unique() function so that each user id appears only once, no matter how many repeat orders that user placed.
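A tiny illustration of unique() with made-up values (it returns each value once, in order of first appearance):

```python
import pandas as pd

s = pd.Series([109, 109, 120, 139, 120])
print(s.unique().tolist())  # [109, 120, 139]
```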
Here is the final solution:
import pandas as pd
import numpy as np
from datetime import datetime
amazon_transactions["created_at"] = pd.to_datetime(amazon_transactions["created_at"]).dt.strftime('%m-%d-%Y')
df = amazon_transactions.sort_values(by=['user_id', 'created_at'], ascending=[True, True])
df['prev_value'] = df.groupby('user_id')['created_at'].shift()
df['days'] = (pd.to_datetime(df['created_at']) - pd.to_datetime(df['prev_value'])).dt.days
result = df[df['days'] <= 7]['user_id'].unique()
Output
Running this Python code will return the unique ids of users who placed a second order within 7 days of a previous one.
| user_id |
|---|
| 100 |
| 103 |
| 105 |
| 109 |
| 110 |
As you can see, in the process of finding an answer, we make extensive use of functions from the Pandas library. The Python code looks a little different from an SQL query, but both follow the same logical pattern.
Question 2 - Second Highest Salary
In order to find the second highest salary, we will need to rank values in the salary column. The pandas library provides all the necessary functions to solve this question in Python.
Interview Question Date: April 2019
Find the second highest salary of employees.
Link to the question: https://platform.stratascratch.com/coding/9892-second-highest-salary
Available Data
It’s a good practice to study the available data before answering the question. For this question, we are dealing with one employee table with fourteen columns.
The salary column contains integer (numeric) values to describe the salary of each employee.
Now, let’s look at actual data in the table:
| id | first_name | last_name | age | sex | employee_title | department | salary | target | bonus | email | city | address | manager_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Max | George | 26 | M | Sales | Sales | 1300 | 200 | 150 | Max@company.com | California | 2638 Richards Avenue | 1 |
| 13 | Katty | Bond | 56 | F | Manager | Management | 150000 | 0 | 300 | Katty@company.com | Arizona | | 1 |
| 11 | Richerd | Gear | 57 | M | Manager | Management | 250000 | 0 | 300 | Richerd@company.com | Alabama | | 1 |
| 10 | Jennifer | Dion | 34 | F | Sales | Sales | 1000 | 200 | 150 | Jennifer@company.com | Alabama | | 13 |
| 19 | George | Joe | 50 | M | Manager | Management | 100000 | 0 | 300 | George@company.com | Florida | 1003 Wyatt Street | 1 |
Step-by-Step Code Solution
Step 1: Import Pandas as pd in Python
We need to use functions from the pandas library to work with tabular data. In the following steps, it will become clear why it’s so important to import pandas and numpy.
import pandas as pd
import numpy as np
Step 2: Get a unique list of salaries
To start off, we use the drop_duplicates() function from the Pandas library, which removes rows with duplicate values in a specific column. In this case, the salary column. This leaves us with the list of unique salary values.
distinct_salary = employee.drop_duplicates(subset = 'salary')
Step 3: Rank salaries
The question asks us to find the second highest salary. Therefore we will have to use the rank() function, which is a Pandas alternative to ranking window functions in SQL. It assigns a numeric value to each row, corresponding to its rank.
We create a ‘rnk’ column to rank values in the distinct_salary data set. When calling the rank() function, we provide two arguments, one to specify the type of ranking it should perform (dense), and another for order of values.
distinct_salary['rnk'] = distinct_salary['salary'].rank(method='dense', ascending=False)
In SQL, we have several window ranking functions. The pandas library provides a single rank() function, but we can specify the type of ranking via the method argument (method='dense').
Dense ranking handles ties as follows: if two salary values are the same, both receive the same rank, and the next distinct value receives the next rank, so no ranks are skipped.
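A quick sketch of dense ranking with a tie (the salary values are made up):

```python
import pandas as pd

salaries = pd.Series([250000, 150000, 150000, 100000])

# method='dense': tied values share a rank, and the next
# distinct value gets the next rank (no gaps)
dense = salaries.rank(method="dense", ascending=False)
print(dense.tolist())  # [1.0, 2.0, 2.0, 3.0]
```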
Step 4: Find the second highest salary
Once we rank all salaries, all that is left to do is return the salary that has a rank value of 2.
result = distinct_salary[distinct_salary.rnk == 2][['salary']]
The final solution will look like this:
import pandas as pd
import numpy as np
distinct_salary = employee.drop_duplicates(subset = 'salary')
distinct_salary['rnk'] = distinct_salary['salary'].rank(method='dense', ascending=False)
result = distinct_salary[distinct_salary.rnk == 2][['salary']]
Output
Running the code above will return just one column with the second highest salary:
| salary |
|---|
| 200000 |
Question 3 - Revenue Over Time
This is a difficult question with a lot of moving parts. We will need to import pandas as pd to use functions for working with date values. We also need this library to use other essential utilities, like groupby() function, aggregate functions and rolling() function, which is very useful for this particular question.
Interview Question Date: December 2020
Find the 3-month rolling average of total revenue from purchases given a table with users, their purchase amount, and date purchased. Do not include returns which are represented by negative purchase values. Output the year-month (YYYY-MM) and 3-month rolling average of revenue, sorted from earliest month to latest month.
A 3-month rolling average is defined by calculating the average total revenue from all user purchases for the current month and previous two months. The first two months will not be a true 3-month rolling average since we are not given data from last year. Assume each month has at least one purchase.
Link to the question: https://platform.stratascratch.com/coding/10314-revenue-over-time
Available Data
Looking at data can help you better understand the question. In this case, we are dealing with data contained in the amazon_purchases table with three columns.
Previewing the table can help you better understand the data:
| user_id | created_at | purchase_amt |
|---|---|---|
| 10 | 2020-01-01 | 3742 |
| 11 | 2020-01-04 | 1290 |
| 12 | 2020-01-07 | 4249 |
| 13 | 2020-01-10 | 4899 |
| 14 | 2020-01-13 | -4656 |
Step-by-Step Code Solution
Step 1: Import Pandas as pd in Python
As a data scientist proficient in Python, you are expected to know what tools you’ll need to answer the interview question using the fewest lines of code.
pandas is one of the essential libraries, providing functions like to_datetime(), to_period(), groupby(), and sum(), which you'll need to find the answer.
import pandas as pd
import numpy as np
from datetime import datetime
Step 2: Set the format for float values
First, we modify pd.options to specify how numbers are displayed when outputting a DataFrame.
pd.options.display.float_format = "{:,.2f}".format
In this case, we tell pandas to output float values with 2 decimal places and use a comma as the thousands separator for readability.
Step 3: Filter out negative purchase amount values
The question tells us not to account for negative purchase_amt values, which represent returns. We set a condition to keep only rows where the value of the purchase_amt column is greater than 0, and store them in the df variable.
df=amazon_purchases[amazon_purchases['purchase_amt']>0]
Step 4: Convert datetime values to months
In the next step, we take the created_at values in the available data and use the pd.to_datetime() function together with the dt.to_period() method to transform them into month values.
df['month_year'] = pd.to_datetime(df['created_at']).dt.to_period('M')
dt.to_period() is another pandas feature: it converts datetime values into period values at a given frequency, such as 'M' for month, 'Y' for year, or 'D' for day.
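A small illustration with made-up dates:

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(["2020-01-04", "2020-01-28", "2020-02-03"]))

# 'M' collapses each timestamp to its calendar month
months = ts.dt.to_period("M").astype(str)
print(months.tolist())  # ['2020-01', '2020-01', '2020-02']
```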
Step 5: Get purchase totals for each month
We use the groupby() function to aggregate purchases by the month in which they happened, and use the sum() function to calculate total purchase_amt values for each group. Both of these functions are from the pandas library.
groupby() works similarly to the GROUP BY statement in SQL. It creates several groups based on multiple rows that have the same value in a specific column, in this case the ‘month_year’ column.
The sum() function works similarly to the aggregate function of the same name in SQL. It adds up values in the ‘purchase_amt’ column of each group.
df1 = df.groupby('month_year')['purchase_amt'].sum().reset_index(name='monthly_revenue').sort_values('month_year')
df1.set_index('month_year', inplace=True)
We use the set_index() method to make month_year the index of the df1 data frame, so the month labels identify each row. This keeps the month labels attached to the rolling averages we calculate next.
Step 6: Calculate rolling 3-month average
Finally, we use the rolling() function from the Pandas library to calculate a three-month rolling revenue. This function performs rolling window calculations. Here it takes two arguments: the window size (the number of periods) and min_periods, the minimum number of observations required to produce a value.
We create a new variable rolling_windows to store three month revenues. We need to use the mean() function to calculate three month rolling averages. Finally, we use the to_records() function to convert DataFrame structure to a NumPy record array.
rolling_windows = df1.rolling(3, min_periods=1)
rolling_mean = rolling_windows.mean()
result=rolling_mean.to_records()
Here is the final answer to this question:
import pandas as pd
import numpy as np
from datetime import datetime
pd.options.display.float_format = "{:,.2f}".format
df=amazon_purchases[amazon_purchases['purchase_amt']>0]
df['month_year'] = pd.to_datetime(df['created_at']).dt.to_period('M')
df1 = df.groupby('month_year')['purchase_amt'].sum().reset_index(name='monthly_revenue').sort_values('month_year')
df1.set_index('month_year', inplace=True)
rolling_windows = df1.rolling(3, min_periods=1)
rolling_mean = rolling_windows.mean()
result=rolling_mean.to_records()
Output
Running this code will output the 3-month rolling average for each month:
| month_year | monthly_revenue |
|---|---|
| 2020-01 | 26292 |
| 2020-02 | 23493.5 |
| 2020-03 | 25535.67 |
| 2020-04 | 24082.67 |
| 2020-05 | 25417.67 |
Question 4 - Top Percentile Fraud
In this question, we have to find the top 5 percentile claims for each state. For that, we will need to use the groupby() and rank() functions, similar to window ranking functions in SQL. The pandas library provides these and all other functions necessary to get the answer.
Interview Question Date: November 2020
ABC Corp is a mid-sized insurer in the US and in the recent past their fraudulent claims have increased significantly for their personal auto insurance portfolio. They have developed a ML based predictive model to identify propensity of fraudulent claims. Now, they assign highly experienced claim adjusters for top 5 percentile of claims identified by the model. Your objective is to identify the top 5 percentile of claims from each state. Your output should be policy number, state, claim cost, and fraud score.
Link to the question: https://platform.stratascratch.com/coding/10303-top-percentile-fraud
Solution
Another challenge which can be answered using the pandas library in Python. First, we need to use certain functions from the pandas library to generate percentile values for each state and store them in a separate column.
Once we have percentile values in a separate column, we will need to filter them. The question tells us to find records with percentile values of 0.95 or higher. Perhaps it would be a good idea to store the filtered records in a variable.
As a last step, we will need to output four columns specified in the question description.
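The steps above could be sketched like this. Note that the data below is a made-up stand-in for the platform's table, and the table and column names (fraud_score, policy_num, state, claim_cost) are assumptions that may differ from the actual data set:

```python
import pandas as pd

# Hypothetical data standing in for the platform's fraud table
fraud_score = pd.DataFrame({
    "policy_num": [f"P{i}" for i in range(1, 41)],
    "state": ["CA"] * 20 + ["TX"] * 20,
    "claim_cost": range(1000, 1040),
    "fraud_score": [i / 40 for i in range(1, 41)],
})

# Percentile of each claim's fraud score within its own state
fraud_score["percentile"] = (
    fraud_score.groupby("state")["fraud_score"].rank(pct=True)
)

# Keep the top 5 percentile and output the requested columns
result = fraud_score[fraud_score["percentile"] >= 0.95][
    ["policy_num", "state", "claim_cost", "fraud_score"]
]
print(result)
```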
Try to answer the question yourself below:
Question 5 - Total Cost Of Orders
The premise of the question is fairly simple. We have to aggregate the total cost of orders placed by customers. We will need functions from the pandas library to do the aggregation and other tasks to get the answer.
Interview Question Date: July 2020
Find the total cost of each customer's orders. Output customer's id, first name, and the total order cost. Order records by customer's first name alphabetically.
Link to the question: https://platform.stratascratch.com/coding/10183-total-cost-of-orders
Solution
For this question, we have two DataFrames (tables) - customers and orders. Neither of these tables alone contain enough information to find the answer.
The pandas library comes with a very useful merge() function, which has a functionality similar to JOINs in SQL. For arguments, we provide names of each table, and the columns that contain overlapping values.
The question asks us to calculate the total cost of orders for each user. Most probably, there are multiple orders placed by one user. In this case, you will need to use groupby() and sum() to aggregate total cost of orders for each user.
Finally, you will need a certain function from the pandas library to organize rows in an ascending alphabetical order of first names. And we must not forget to output the columns specified in the description.
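Here is one possible sketch of this approach. The tables below are made-up stand-ins, and the column names (id, first_name, cust_id, total_order_cost) are assumptions that may not match the platform's actual schema:

```python
import pandas as pd

# Hypothetical stand-ins for the customers and orders tables
customers = pd.DataFrame({"id": [1, 2, 3],
                          "first_name": ["Jill", "Eva", "Malcolm"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2],
                       "total_order_cost": [100, 50, 80]})

# Join the tables on the shared customer key, aggregate the
# order cost per customer, and sort alphabetically by first name
merged = pd.merge(customers, orders, left_on="id", right_on="cust_id")
result = (merged.groupby(["id", "first_name"])["total_order_cost"]
                .sum()
                .reset_index()
                .sort_values("first_name"))
print(result)
```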
Try to get the answer yourself:
Check out our post “Python Pandas Interview Questions” to practice more such questions.
Conclusion
In this article, we explained how to import pandas as pd in Python and why the library is so important for doing data science. Mastering its most essential features can bring you one step closer to landing a job and fast-track your career once you get it.
If you haven’t yet mastered pandas and Python, StrataScratch platform is a great place to start.
It has dozens of practical interview questions that require you to import and use functions from the pandas library. The platform allows you to look at available data, see approach hints if you are stuck, check the output of your code as you write it, and other features for interactive learning.
In short, StrataScratch has everything you need to hone your skills in pandas as well as Python. And as you might’ve heard, practice makes perfect, especially when it comes to technical skills like doing data science in Python.