How to Create a Matplotlib Histogram?

 How to Create a Matplotlib Histogram

A Matplotlib histogram is a common data visualization in data science. Here, you will learn all the fundamentals of creating and customizing histograms.

Histograms commonly feature in different contexts of data science. You will use them in exploratory data analysis (EDA), feature engineering in machine learning, probability and statistics, anomaly detection, image processing, and computer vision.

The list doesn’t end here, and you’ll have to strive hard to avoid them in your work. The same is true for Matplotlib. Show me a data scientist who never used Matplotlib, and you’ll get a soup.

Nothing? No soup for you! Come back – one year! Or stay and learn how you can use these two – Matplotlib and histogram – to your benefit as a data scientist.

What is a Histogram in Matplotlib?

A histogram is a type of bar chart that displays the numerical data distribution. It doesn’t display individual data points but groups data into bins (intervals) and shows the frequency of values within each bin.

A bin’s height represents the number of data points, making it easy to analyze data distributions.

In Matplotlib, you can easily create a histogram using the hist() function.

Why Choose Matplotlib for Creating Histograms?

Matplotlib is one of the most widely used Python libraries for data visualization. Somewhat of a synonym for creating plots in Python.

It’s popular for a reason, and its benefits also become obvious when creating histograms.

Why Choose Matplotlib for Creating Histograms

How to Set Up Matplotlib for Histograms

Before creating histograms, you need to set up Matplotlib in your Python environment.

1. Install Matplotlib

If you haven’t installed Matplotlib yet, the easiest way to do so is by using the pip installer and running this in the command prompt:

python -m pip install -U pip
python -m pip install -U matplotlib


It ensures that pip runs with the correct Python interpreters, avoiding conflicts in environments where multiple Python versions are installed. It also ensures that both pip and matplotlib are updated to the latest versions.

Or you can ignore Matplotlib recommendations and run this in the command prompt.

pip install matplotlib

There are other installation methods; all details are in the Matplotlib documentation.

2. Import Matplotlib

After installing it, you must import Matplotlib in your Python script.

import matplotlib.pyplot as plt


This imports pyplot, the Matplotlib module containing all the necessary plotting functions. You are now ready to create histograms in Matplotlib.

Creating Your First Matplotlib Histogram

In this showcase example, we create data manually.

Then, we pass the data variable as a parameter in the hist() function to create a histogram.

Finally, the show() function shows the plot.

import matplotlib.pyplot as plt

data = (
    [-2.5, -2.3, -2.1] * 5 + 
    [-1.9, -1.8, -1.6, -1.5, -1.4] * 10 +
    [-1.3, -1.2, -1.1, -1.0, -0.9] * 15 + 
    [-0.8, -0.7, -0.6, -0.5, -0.4] * 20 +
    [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2] * 25 + 
    [0.3, 0.4, 0.5, 0.6, 0.7] * 20 +
    [0.8, 0.9, 1.0, 1.1, 1.2] * 15 + 
    [1.3, 1.4, 1.5, 1.6, 1.8] * 10 +
    [1.9, 2.0, 2.1, 2.3, 2.5] * 5
)

plt.hist(data)

plt.show()


Here’s the output.

Creating Your First Matplotlib Histogram

Practical Example

We will use the Age Distribution Analysis interview question to apply what we learned above.



Link to the question: https://platform.stratascratch.com/visualizations/10485-age-distribution-analysis

We will not do everything – at least for now – what the question requires. Instead, we’ll create the most basic histogram that shows the distribution of the ‘Under 25’ age group without any customization.

We’ll use this dataset.

Table: Numerical Data
Loading...Loading...Loading...Loading...Loading...Loading...


Here’s the code. We import Matplotlib’s pyplot module.

In the next step, we utilize the hist() function to filter the dataset by the age_group column and show the age distribution only for the ‘Under 25’ data subset.

Finally, we show the histogram using the show() function.


Customizing a Matplotlib Histogram

The basic histogram we created earlier is relatively sound, but customization can improve its readability and aesthetics.

In this section, you will learn the following histogram customizations in Matplotlib.

Customizing a Matplotlib Histogram


1. Customizing Figure Size

The figure is explicitly created using the figure() function. The plot’s height and width are defined using the figsize argument in the function.

For example, this code snapshot defines a figure that is 8 inches wide and 4 inches tall.

import matplotlib.pyplot as plt

data = (
    [-2.5, -2.3, -2.1] * 5 + 
    [-1.9, -1.8, -1.6, -1.5, -1.4] * 10 +
    [-1.3, -1.2, -1.1, -1.0, -0.9] * 15 + 
    [-0.8, -0.7, -0.6, -0.5, -0.4] * 20 +
    [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2] * 25 + 
    [0.3, 0.4, 0.5, 0.6, 0.7] * 20 +
    [0.8, 0.9, 1.0, 1.1, 1.2] * 15 + 
    [1.3, 1.4, 1.5, 1.6, 1.8] * 10 +
    [1.9, 2.0, 2.1, 2.3, 2.5] * 5
)

plt.figure(figsize=(8, 4))
plt.hist(data)

plt.show()


Here’s the output.

Create Matplotlib Histogram by Customizing Figure Size

2. Changing Bin Number

Depending on the bin number, your histogram can look less or more detailed: a lower bin number will show less detail (variations are smoothed out), while more bins will result in more detail.

You can change the number of bins in the bins parameter within the hist() function.

As an example, the below code creates a histogram with 15 bins.

import matplotlib.pyplot as plt

data = (
    [-2.5, -2.3, -2.1] * 5 + 
    [-1.9, -1.8, -1.6, -1.5, -1.4] * 10 +
    [-1.3, -1.2, -1.1, -1.0, -0.9] * 15 + 
    [-0.8, -0.7, -0.6, -0.5, -0.4] * 20 +
    [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2] * 25 + 
    [0.3, 0.4, 0.5, 0.6, 0.7] * 20 +
    [0.8, 0.9, 1.0, 1.1, 1.2] * 15 + 
    [1.3, 1.4, 1.5, 1.6, 1.8] * 10 +
    [1.9, 2.0, 2.1, 2.3, 2.5] * 5
)

plt.figure(figsize=(8, 4))
plt.hist(data, bins=15)

plt.show()


Here’s the output.

Create Matplotlib Histogram by Changing Bin Number

3. Changing Bar Color

If you don’t specify a bar color, Matplotlib assigns the default color (blue), as you saw in the previous sections.

The bar color is customized using the color parameter in the hist() function. You can find a list of all the named Matplolib colors here. And here’s a detailed guide on color customization in Matplotlib.

With this code, you create a histogram with bins of the 'lightsalmon' color.

import matplotlib.pyplot as plt

data = (
    [-2.5, -2.3, -2.1] * 5 + 
    [-1.9, -1.8, -1.6, -1.5, -1.4] * 10 +
    [-1.3, -1.2, -1.1, -1.0, -0.9] * 15 + 
    [-0.8, -0.7, -0.6, -0.5, -0.4] * 20 +
    [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2] * 25 + 
    [0.3, 0.4, 0.5, 0.6, 0.7] * 20 +
    [0.8, 0.9, 1.0, 1.1, 1.2] * 15 + 
    [1.3, 1.4, 1.5, 1.6, 1.8] * 10 +
    [1.9, 2.0, 2.1, 2.3, 2.5] * 5
)

plt.figure(figsize=(8, 4))
plt.hist(data, bins=15, color='lightsalmon')

plt.show()


Here’s the output.

Creating Matplotlib Histogram by Changing Bar Color

4. Changing the Bar Edge Color

To make a histogram even clearer, you can define the bar edge color in the edgecolor parameter of hist().

In the instance below, the edge color is black.

import matplotlib.pyplot as plt

data = (
    [-2.5, -2.3, -2.1] * 5 + 
    [-1.9, -1.8, -1.6, -1.5, -1.4] * 10 +
    [-1.3, -1.2, -1.1, -1.0, -0.9] * 15 + 
    [-0.8, -0.7, -0.6, -0.5, -0.4] * 20 +
    [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2] * 25 + 
    [0.3, 0.4, 0.5, 0.6, 0.7] * 20 +
    [0.8, 0.9, 1.0, 1.1, 1.2] * 15 + 
    [1.3, 1.4, 1.5, 1.6, 1.8] * 10 +
    [1.9, 2.0, 2.1, 2.3, 2.5] * 5
)

plt.figure(figsize=(8, 4))
plt.hist(data, bins=15, edgecolor='black', color='lightsalmon')

plt.show()


Here’s the output.

Create Matplotlib Histogram by Changing Bar Edge Color

5. Adding Grid Lines

If you add grid lines, it will be even easier to compare bar heights in your histogram.

To do so, set the visible argument in the grid() function to True.

import matplotlib.pyplot as plt

data = (
    [-2.5, -2.3, -2.1] * 5 + 
    [-1.9, -1.8, -1.6, -1.5, -1.4] * 10 +
    [-1.3, -1.2, -1.1, -1.0, -0.9] * 15 + 
    [-0.8, -0.7, -0.6, -0.5, -0.4] * 20 +
    [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2] * 25 + 
    [0.3, 0.4, 0.5, 0.6, 0.7] * 20 +
    [0.8, 0.9, 1.0, 1.1, 1.2] * 15 + 
    [1.3, 1.4, 1.5, 1.6, 1.8] * 10 +
    [1.9, 2.0, 2.1, 2.3, 2.5] * 5
)

plt.figure(figsize=(8, 4))
plt.hist(data, bins=15, edgecolor='black', color='lightsalmon')

plt.grid(True)

plt.show()


Here’s the output.

Create Matplotlib Histogram by Adding Grid Lines


6. Adding Transparency

Adjusting transparency can be useful when overlaying multiple histograms, which we’ll talk about in more detail later. For now, you should remember that transparency is changed by passing values between 0 (the highest transparency) and 1 (the lowest transparency) as an argument in the alpha parameter of hist().

In this code snapshot, the transparency is set to 0.4.

import matplotlib.pyplot as plt

data = (
    [-2.5, -2.3, -2.1] * 5 + 
    [-1.9, -1.8, -1.6, -1.5, -1.4] * 10 +
    [-1.3, -1.2, -1.1, -1.0, -0.9] * 15 + 
    [-0.8, -0.7, -0.6, -0.5, -0.4] * 20 +
    [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2] * 25 + 
    [0.3, 0.4, 0.5, 0.6, 0.7] * 20 +
    [0.8, 0.9, 1.0, 1.1, 1.2] * 15 + 
    [1.3, 1.4, 1.5, 1.6, 1.8] * 10 +
    [1.9, 2.0, 2.1, 2.3, 2.5] * 5
)

plt.figure(figsize=(8, 4))
plt.hist(data, bins=15, alpha=0.4, edgecolor='black', color='lightsalmon')

plt.grid(True)

plt.show()


Here’s the output.

Create Matplotlib Histogram by Adding Transparency


7. Customizing Titles and Labels

A histogram’s readability can be even better if you add axes labels and a histogram title.

You can do that using the xlabel(), ylabel(), and title() functions.

The following snapshot demonstrates how to do that.

import matplotlib.pyplot as plt

data = (
    [-2.5, -2.3, -2.1] * 5 + 
    [-1.9, -1.8, -1.6, -1.5, -1.4] * 10 +
    [-1.3, -1.2, -1.1, -1.0, -0.9] * 15 + 
    [-0.8, -0.7, -0.6, -0.5, -0.4] * 20 +
    [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2] * 25 + 
    [0.3, 0.4, 0.5, 0.6, 0.7] * 20 +
    [0.8, 0.9, 1.0, 1.1, 1.2] * 15 + 
    [1.3, 1.4, 1.5, 1.6, 1.8] * 10 +
    [1.9, 2.0, 2.1, 2.3, 2.5] * 5
)

plt.figure(figsize=(8, 4))
plt.hist(data, bins=15, alpha=0.4, edgecolor='black', color='lightsalmon')

plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Matplotlib Histogram')
plt.grid(True)

plt.show()


Here’s the output.

Create Matplotlib Histogram by Customizing Titles and Labels

8. Adding Legend

A legend is such a simple yet often forgotten when creating histograms. It explains at a glance what the histogram shows; useful when showing only one variable and absolutely necessary when there are multiple variables shown.

A legend is created by first specifying the label argument inside hist(), which gives the data in the histogram a label. Then, use the legend() function to create the legend box and give it a name.

The code below shows how that works. It assigns the ‘Data 1’ label, and the legend is named, well, ‘Legend’.

import matplotlib.pyplot as plt

data = (
    [-2.5, -2.3, -2.1] * 5 + 
    [-1.9, -1.8, -1.6, -1.5, -1.4] * 10 +
    [-1.3, -1.2, -1.1, -1.0, -0.9] * 15 + 
    [-0.8, -0.7, -0.6, -0.5, -0.4] * 20 +
    [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2] * 25 + 
    [0.3, 0.4, 0.5, 0.6, 0.7] * 20 +
    [0.8, 0.9, 1.0, 1.1, 1.2] * 15 + 
    [1.3, 1.4, 1.5, 1.6, 1.8] * 10 +
    [1.9, 2.0, 2.1, 2.3, 2.5] * 5
)

plt.figure(figsize=(8, 4))
plt.hist(data, bins=15, alpha=0.4, label='Data 1', edgecolor='black', color='lightsalmon')

plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Matplotlib Histogram')
plt.legend(title='Legend')
plt.grid(True)

plt.show()


Here’s the output.

Create Matplotlib Histogram by Adding Legend


Practical Example

Let’s now incorporate all these customizations in the practical example. We’ll go back to the histogram we already created for the Age Distribution Analysis interview question and customize it.

We will make it a histogram with 15 bins whose transparency is 0.7 and the color is 'skyblue'. We’ll label this data as ‘Under 25’, as this is what it shows, and show it in a legend named ‘Age Group’.

We’ll also name the histogram ‘Age Distribution of Population’, label the axes, and show the grid.


Using NumPy or Pandas to Build a Matplotlib Histogram

NumPy and pandas can be very helpful when creating histograms in Matplotlib.

Using NumPy

NumPy has a helpful function – randn() – which we can use to generate random numerical data instead of creating it manually as we did earlier.

For example, the code below imports NumPy and creates 1,000 data points using the randn() function. The rest of the code has all the same customizations as the showcase examples.

import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)

plt.figure(figsize=(8, 4))
plt.hist(data, bins=15, alpha=0.4, label='Data 1', edgecolor='black', color='lightsalmon')

plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Matplotlib Histogram')
plt.legend(title='Legend')
plt.grid(True)

plt.show()


Here’s the output.

Using NumPy or Pandas to Build a Matplotlib Histogram


Using Pandas

With pandas – a powerful library for data analysis – you can create histograms directly from a pandas DataFrame.

Using df.hist()

One of the ways to create a histogram with pandas involves using the DataFrame.hist() function.

For example, the code above can be rewritten like this.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'values': np.random.randn(1000)})

df['values'].hist(bins=15, alpha=0.4, edgecolor='black', color='lightsalmon', figsize=(8, 4))

plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Matplotlib Histogram')
plt.grid(True)

plt.show()


It creates a DataFrame by utilizing the randn() function.

Then we reference the DataFrame and its column 'values', calling the hist() function. The rest of the code is the same as in the previous example.

Here’s the output.

Using Pandas to Build a Matplotlib Histogram


You’re probably asking yourself how is using pandas’ hist() different from using Matplotlib’s hist(); they look almost the same.

However, there are some differences which I summed up in the table two sections below.

Using df.plot()

Another way to create histograms with pandas is with its DataFrame.plot() function, a wrapper around Matplotlib.

Creating a histogram involves passing 'hist' as an argument in the function’s kind parameter.

When the above is applied, the previous code can be rewritten like this.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'values': np.random.randn(1000)})

df['values'].plot(kind='hist', bins=15, alpha=0.4, edgecolor='black', color='lightsalmon', figsize=(8, 4))

plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram Using Pandas .plot()')
plt.grid(True)

plt.show()


Here’s the output.

Using df Plot To Build a Matplotlib Histogram


Comparison of Methods for Creating Histograms

So far, we’ve covered three distinct methods of creating histograms. One is Matplotlib’s function plt.hist(), and the other two are pandas functions df.hist() and df.plot().

Here’s an overview of their similarities and differences.

Comparison of Methods for Creating Matplotlib Histograms


All this can be summarized in these general principles for choosing the correct method when creating histograms.

  • For quick DataFrame visualizations -> use df.hist()
  • For highly customizable histograms -> use plt.hist()
  • For an in-between solution with multiple histograms in the same plot -> use df.plot()

How to Overlay Multiple Histograms in Matplotlib

Overlaying multiple histograms in one Matplotlib figure is very useful when you want to compare different datasets or different subsets of data.

There are two main ways to overlay multiple histograms in Matplotlib:

  • Multiple calls to plt.hist() -> explicitly adds each dataset
  • Using a for loop -> iterates over datasets and plots them dynamically

Multiple Calls to plt.hist()

This is the most straightforward approach when overlaying histograms shown in the code below.

Everything is the same as when you plot only one dataset. The only difference is that now you write plt.hist() multiple times, once for each dataset.

import numpy as np
import matplotlib.pyplot as plt

data1 = np.random.randn(1000)
data2 = np.random.randn(1000) + 2 

plt.hist(data1, bins=30, color='blue', edgecolor='black', alpha=0.5, label='Dataset 1')
plt.hist(data2, bins=30, color='red', edgecolor='black', alpha=0.5, label='Dataset 2')

plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Overlaying Histograms')
plt.legend()

plt.show()


Here’s the output.

How to Overlay Multiple Histograms in Matplotlib

Using a for Loop

Overlaying multiple histograms by utilizing a python for loop is a little more complicated, but sophisticated method.

This method avoids repeating plt.hist() multiple times and scales easily, which is perfect for working with multiple datasets or pandas DataFrames, where column names can be dynamically retrieved.

Practical Example

To show an example, we’ll go back to our interview question.



Link to the question: https://platform.stratascratch.com/visualizations/10485-age-distribution-analysis

This time, we won’t write a partial code but solve the complete question. In other words, we will overlay three histograms (for categories ‘Under 25’, ‘25-50’, and ‘Over 50’) and use the required colors.

As a reminder, here’s the dataset we’ll use again.

Table: Numerical Data
Loading...Loading...Loading...Loading...Loading...Loading...


The solution first creates the colors dictionary that defines colors for each age group.

Then, we create a figure and customize its width and height.

A for loop iterates through the colors dictionary and extracts the age group and corresponding color.

The plt.hist() function then plots a histogram for each age group. The function filters the dataset for only the current group, creates bins from 0 to 80 with an interval of 5, adds transparency so overlapping bars remain visible, adds a label, and assigns the color from the color dictionary.

The looping is continued the same way for all age groups.

Then, we add the plot title, x- and y-axis labels, legend title, and grids, and, finally, we show the plot.

import matplotlib.pyplot as plt

colors = {'Under 25': 'skyblue', '25-50': 'mediumslateblue', 'Over 50': 'slategray'}

# Plotting
plt.figure(figsize=(12, 6))
for group, color in colors.items():
    plt.hist(df[df['age_group'] == group]['age'], bins=range(0, 81, 5), alpha=0.7, label=group, color=color)

plt.title('Age Distribution of Population')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.legend(title='Age Group')
plt.grid(True)
plt.show()

Here’s the output.

Expected Visual Output
Expected Output


Advanced Features of Matplotlib Histogram

There are numerous advanced features you can use when plotting a Matplotlib histogram. In this section, we’ll talk about these three.

Advanced Features of Matplotlib Histogram

Normalized Histograms

A normalized histogram (or a probability density histogram) doesn’t display frequency counts but scales the y-axis so that the total area under the histogram sums to 1.

Normalizing a histogram is useful in several scenarios:

  1. Comparing datasets of different sizes: Normalization allows a fair comparison if one dataset has a different number of observations than another.
  2. Estimating a probability density function (PDF).
  3. When you want to visualize relative proportions, not absolute counts.

Normalized histograms are created easily in Matplotlib: set the density parameter to True in hist().

Here’s an example. I took the first two datasets from this StackOverflow question and created a .csv file with the same data. I’ll use that data to demonstrate the difference between non-normalized and normalized histograms.

So, in the first code, I load the data_histograms.csv file and create the non-normalized histogram as we’ve done before.

import pandas as pd
import matplotlib.pyplot as plt

file_path = "F:/Your_file_path/data_histograms.csv"

df = pd.read_csv(file_path)

plt.figure(figsize=(12, 6))
plt.hist(df["Data1"], bins=15, color='blue', edgecolor='black', alpha=0.6, label='Data 1')
plt.hist(df["Data2"], bins=15, color='green', edgecolor='black', alpha=0.6, label='Data 2')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Non-Normalized Histograms')
plt.legend()
plt.grid(True)

plt.show()


In another code, we skip the data loading, set density in hist() to True, and that will create a normalized histogram.

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.hist(data1, bins=15, density=True, color='red', edgecolor='black', alpha=0.6, label='Data 1 (Normalized)')
plt.hist(data2, bins=15, density=True, color='purple', edgecolor='black', alpha=0.6, label='Data 2 (Normalized)')
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('Normalized Histograms')
plt.legend()
plt.grid(True)

plt.show()


Here are the histograms shown one after the other so you can easily see the difference (no, it’s not only colors!).

The first histogram shows absolute occurrences of values. Dark green is where the data from the two datasets overlaps. This histogram doesn’t account for dataset size, so if one dataset is larger, the bars will also be, naturally, higher.

Normalized Histograms Created in Matplotlib


In the normalized histogram, the y-axis represents the probability density. The area under each histogram sums to 1, allowing for the distribution comparison regardless of dataset size. That way, the height of the bars reflects relative proportions within each dataset.

Normalized Histograms Created in Matplotlib


Be aware that the peak in Data 2 being, say, 0.065 doesn’t mean that 6.5% of the values are in this bin. It means that the probability of a value falling within that bin’s range is 6.5%, considering bin width.

To find the actual probability of a value falling in that bin, you must multiply the probability density by the bin width or:

P=fwP = f*w


And how do you calculate a bin width? Here’s the formula.

Bin Width=Range of DataNumber of BinsBin\ Width = \frac{Range\ of\ Data}{ Number\ of\ Bins}


For Data 2, we calculate the bin width like this.

Bin Width=max(Data 2)  min(Data 2)Number of Bins=41.0714285714285  015=2.73809523809523Bin\ Width = \frac{max(Data\ 2)\ -\ min(Data\ 2)}{ Number\ of\ Bins} = \frac{41.0714285714285\ -\ 0}{15} = 2.73809523809523


We can now go back to our probability formula.

P=fw=0.0652.73809523809523=0.1780=17.80%P = f*w = 0.065*2.73809523809523 = 0.1780 = 17.80\%


This means that 17.80% of data will fall within the highest Data 2 bin.

Logarithmic Scales

A logarithmic scale on histograms is very useful in situations when your data has few extremely large values, while most are small.

In a linear scale, tightly packed together small values could be hard to differentiate or completely invisible because they have to be shown on the same linear scale as the highest value.

Using the log scale helps with that, as it increases exponentially.

To convert the y-axis to a logarithmic scale, simply add this to your code:

plt.yscale('log')

Here’s an example. This code plots our standard histogram with a linear scale.

import matplotlib.pyplot as plt

data = (
    [1] * 5000 +   # Large peak at small values
    [10] * 2000 +
    [50] * 1000 +
    [100] * 500 +
    [500] * 200 +
    [1000] * 100 +
    [5000] * 50 +
    [10000] * 20 +
    [50000] * 10 +
    [100000] * 5 +
    [500000] * 2 +
    [1000000] * 1   # Single extreme value
)

plt.figure(figsize=(8, 4))
plt.hist(data, bins=20, color='blue', edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram with Linear Scale')
plt.grid(True)

plt.show()


Its output shows only one bar because all other values have too low a frequency to be visible in that scale. (Matplotlib automatically switches to scientific notation where there are very large values in data. But you can read the x-axis values easily like this: 0.0 = 0; 0.2 = 200,000; 0.4 = 400,00, and so on.)

Logarithmic Scales Created in Matplotlib Histogram


Now, the log scale histogram-creating code. It’s the same as the above, except for the addition of plt.yscale('log').

plt.figure(figsize=(8, 4))
plt.hist(data, bins=20, color='blue', edgecolor='black')
plt.yscale('log')
plt.xlabel('Value')
plt.ylabel('Frequency (Log Scale)')
plt.title('Histogram with Logarithmic Scale')
plt.grid(True)
           
plt.show()


However, the histogram is significantly different. In this one, you can see other values, too.

Matplotlib Histogram with Logarithmic Scale


Stacked Histograms

Stacked histograms are relatively similar to overlaying histograms in the sense that they both show multiple datasets on the same figure to allow comparison. However, instead of overlapping, stacked histograms are represented on top of each other.

This allows you to compare the total distributions while maintaining the relative contributions of each dataset.

In Matplotlib, stacked histograms are created by setting the stacked parameter in hist() to True.

Practical Example

Here’s an example. We’ll use the Car Price Relationship question to draw stacked histograms that visualize the distribution of selling prices by dealership.



Link to the question: https://platform.stratascratch.com/visualizations/10438-car-price-relationship

Here’s the dataset we’ll work with.

Table: Numerical Data
Loading...Loading...Loading...Loading...Loading...Loading...


Our solution first defines two lists: one for unique colors and the other for dealership names.

Next, we define the data variable, which loops through each dealership and selects only data from the selling_price column.

Then, we create a figure, plot histograms, and stack them with stacked=True.

Finally, we customize the plot as usual and display it.

How do you read it? For example, Dealership A contributes to the first bin with two values, Dealership 2 with two, and Dealership C with one value.

Saving and Sharing Your Matplotlib Histogram

The histograms you create are not only visible within a coding environment. You can also save them as images for reports and presentations.

Saving Visualization as an Image

In Matplotlib, you can easily export plots using the savefig() function before plt.show().

The default export format is PNG, so you don’t have to specify it explicitly. You can, of course, along with other supported formats:

  • PNG (default) – plt.savefig('histogram.png')
  • JPEG – plt.savefig('histogram.jpg')
  • SVG (for scalable graphics) – plt.savefig('histogram.svg')
  • PDF (for documents) – plt.savefig('histogram.pdf')

Adjusting Image Quality

The default DPI (dots per inch) in Matplotlib is 100. You are not wedded to it, as you can customize it. This is done by setting the dpi parameter in hist() to a desired DPI value.

Here are the general guidelines for choosing a suitable DPI value.

Choosing a Suitable DPI Value in Matplotlib Histogram


Additional customization is done by applying bbox_inches='tight' in savefig(), which removes extra whitespaces around the plot.

In the code below, we can see how we export the figure with DPI set to 300.

import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(1000)

plt.figure(figsize=(8, 4))
plt.hist(data, bins=30, color='blue', edgecolor='black', alpha=0.7)

plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram with Extra Whitespace (Default)')

plt.grid(True)

plt.savefig('histogram_default.png', dpi=300) 


Here’s what the exported visualization looks like.

Choosing a Suitable DPI Value in Matplotlib Histogram

To trim the white space around the plot, write this code. The main difference to the previous code is that we added the bbox_inches parameter to savefig() and set it to 'tight'.

import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(1000)

plt.figure(figsize=(8, 4))
plt.hist(data, bins=30, color='blue', edgecolor='black', alpha=0.7)

plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram without Extra Whitespace')

plt.grid(True)

plt.savefig('histogram_clean.png', dpi=300, bbox_inches='tight')


Here’s the exported histogram. You see how much less white space there is around it?

Choosing a Suitable DPI Value in Matplotlib Histogram


Common Errors When Creating a Matplotlib Histogram

Creating histograms in Matplotlib is pretty straightforward. However, sometimes you might encounter some errors, especially while you’re still new to Matplotlib.

Here’s a list of some common errors.

Common Errors When Creating a Matplotlib Histogram


1. Naming a Variable plt.hist

You might have some variable named plt.hist, which will override the plt.hist() function for creating a histogram.

For example, you might have done something like this.

plt.hist = plt.hist(data, bins=30, color='blue', edgecolor='black', alpha=0.7)


It will result in the TypeError: 'list' object is not callable' error. Make sure you never assign plt.hist to some variable.

2. Issues When Defining Bins

Does your dataset, perchance, have the same minimum and maximum values (i.e., all the values are the same)? If yes, it might confuse Matplotlib. It needs a range of values to create bins for the histogram.

So, if all the data values are identical, the range becomes zero. This causes the following error: ValueError: max must be larger than min in range parameter.

To avoid this, manually define the bin range like this:

plt.hist(data, bins=5, range=(4, 6))


3. A String or a Float in the Bin Parameter

The bins parameter accepts only integer values, a sequence of bin edges, or 'auto'; it does not accept strings or floats.

All these are accepted.

plt.hist(data, bins=10)
plt.hist(data, bins='auto')
plt.hist(data, bins=[-3, -2, -1, 0, 1, 2, 3])


However, this one is a float argument, so it will result in a TypeError: Invalid type for bins argument error.

plt.hist(data, bins=10.5)

4. Histogram Not Displaying

If your histogram is not showing, check if you forgot to use plt.show() after plotting. If you did, add it – problem solved!

5. Empty Histogram

If your dataset contains only NaN values or an empty list, the histogram will be empty.

Check for missing values and remove them with dropna() before plotting.

Conclusion

If you’ve been careful and followed the article from start to finish, you've learned quite a lot about histograms: what they are, how to create them and customize them in Matplotlib, call NumPy and pandas for help, overlay multiple histograms, use a logarithmic scale, normalize, stack, and save them.

It’s understandable if you still don’t feel comfortable yet with all that knowledge. What will help you digest all this is practicing on your own, and StrataScratch visualization questions are ideal for that.

FAQs

1. What is a histogram in Matplotlib?

A histogram is a plot that represents the distribution of numerical values. In a histogram, values are grouped into bins (range of values), and the height of each bin represents the frequency of data points in that range.

Histograms in Matplotlib are created using the plt.hist() function.

2. How can I change the bin size in a histogram?

You can do it by specifying the bins parameter in plt.hist().

If you set it to an integer, you specify the number of bins. Matplotlib will determine their width automatically by dividing the data range by the number of bins.

This code will create a histogram with 10 bins.

plt.hist(data, bins=10)


You can also provide a sequence of bin edges, specifying each bin size. For example, this will create a histogram with seven bins. The first bin covers data from -3 to -2, the second covers data from -2 to -1, and so on.

plt.hist(data, bins=[-3, -2, -1, 0, 1, 2, 3])


You can also use np.arrange() to specify the number of bins and the range they cover. For example, this code creates a histogram with bins of width 5 that cover the data range from 0 to 100.

plt.hist(data, bins=np.arange(0, 100, 5))


3. How do I save my histogram as an image file?

Use plt.savefig() before plt.show() to save the plot.

Share

Follow

 How to Create a Matplotlib Histogram


Become a data expert. Subscribe to our newsletter.