How to Reset a pandas DataFrame Index – Real Python
In this tutorial, you’ll learn how to reset a pandas DataFrame index, the reasons why you might want to do this, and the problems that could occur if you don’t.
Before you start your learning journey, you should familiarize yourself with how to create a pandas DataFrame. Knowing the difference between a DataFrame and a pandas Series will also prove useful to you.
In addition, you may want to use the data analysis tool Jupyter Notebook as you work through the examples in this tutorial. Alternatively, JupyterLab will give you an enhanced notebook experience, but feel free to use any Python environment you wish.
As a starting point, you’ll need some data. To begin with, you’ll use the band_members.csv
file included in the downloadable materials that you can access by clicking the link below:
The table below describes the data from band_members.csv
that you’ll begin with:
Column Name | PyArrow Data Type | Description |
---|---|---|
first_name |
string |
First name of member |
last_name |
string |
Last name of member |
instrument |
string |
Main instrument played |
date_of_birth |
string |
Member’s date of birth |
As you’ll see, the data has details of the members of the rock band The Beach Boys. Each row contains information about its various members both past and present.
Note: In case you’ve never heard of The Beach Boys, they’re an American rock band formed in the early 1960s.
Throughout this tutorial, you’ll be using the pandas library to allow you to work with DataFrames, as well as the newer PyArrow library. The PyArrow library provides pandas with its own optimized data types, which are faster and less memory-intensive than the traditional NumPy types that pandas uses by default.
If you’re working at the command line, you can install both pandas
and pyarrow
using the single command python -m pip install pandas pyarrow
. If you’re working in a Jupyter Notebook, you should use !python -m pip install pandas pyarrow
. Regardless, you should do this within a virtual environment to avoid clashes with the libraries you use in your global environment.
Once you have the libraries in place, it’s time to read your data into a DataFrame:
First, you used import pandas
to make the library available within your code. To construct the DataFrame and read it into the beach_boys
variable, you used pandas’ read_csv()
function, passing band_members.csv
as the file to read. Finally, by passing dtype_backend="pyarrow"
to .convert_dtypes()
you convert all columns to pyarrow
types.
If you want to verify that pyarrow
data types are indeed being used, then beach_boys.dtypes
will satisfy your curiosity:
As you can see, each data type contains [pyarrow]
in its name.
If you wanted to analyze the date information thoroughly, then you would parse the date_of_birth
column to make sure dates are read as a suitable pyarrow
date type. This would allow you to analyze by specific days, months or years, and so on, as commonly found in pivot tables.
The date_of_birth
column is not analyzed in this tutorial, so the string
data type it’s being read as will do. Later on, you’ll get the chance to hone your skills with some exercises. The solutions include the date parsing code if you want to see how it’s done.
Now that the file has been loaded into a DataFrame, you’ll probably want to take a look at it:
DataFrames are two-dimensional data structures similar to spreadsheets or database tables. A pandas DataFrame can be considered a set of columns, with each column being a pandas Series. Each column also has a heading, which is the name
property of the Series, and each row has a label, which is referred to as an element of its associated index object.
The DataFrame’s index is shown to the left of the DataFrame. It’s not part of the original band_members.csv
source file, but is added as part of the DataFrame creation process. It’s this index object you’re learning to reset.
The index of a DataFrame is an additional column of labels that helps you identify rows. When used in combination with column headings, it allows you to access specific data within your DataFrame. The default index labels are a sequence of integers, but you can use strings to make them more meaningful. You can actually use any hashable type for your index, but integers, strings, and timestamps are the most common.
Note: Although indexes are certainly useful in pandas, an alternative to pandas is the new high-performance Polars library, which eliminates them in favor of row numbers. This may come as a surprise, but aside from being used for selecting rows or columns, indexes aren’t often used when analyzing DataFrames. Also, row numbers always remain sequential when rows are added or removed in a Polars DataFrame. This isn’t the case with indexes in pandas.
With these preliminaries out of the way, it’s time to take your Little Deuce Coupe to the beach, do a pop-up, and surf through some learning. You’ll now investigate the main ways of re-indexing a DataFrame. You could apply an Index
object directly to the DataFrame’s .index
property, or use the DataFrame’s .set_axis()
method. To begin with, you’ll use the DataFrame’s .reset_index()
method.
Take the Quiz: Test your knowledge with our interactive “How to Reset a pandas DataFrame Index” quiz. You’ll receive a score upon completion to help you track your learning progress:
Interactive Quiz
How to Reset a pandas DataFrame Index
This quiz will challenge your knowledge of resetting indexes in pandas DataFrames. You won’t find all the answers in the tutorial, so you’ll need to do some extra investigating. By finding all the answers, you’re sure to learn some interesting things along the way.
How to Reset a pandas DataFrame Index With .reset_index()
The DataFrame.reset_index()
method, as its name suggests, is used to reset the index of a pandas DataFrame back to its default. You can also use this method to reset the MultiIndex
on your DataFrame if it has one. You’ll learn more about how to reset multi-indexes later.
Suppose you’ve been performing some data cleansing on your DataFrame and have decided to sort it alphabetically by first_name
. You can do this using .sort_values()
:
While you’re happy that the sort has worked perfectly, you’re not happy with the current state of its index. The index no longer bears any resemblance to the updated order of the rows.
To fix this, your first thought might be to use .reset_index()
with its default parameters, but the result might not be what you want:
You now have a new default index, which is a sequential series of numbers starting at zero that clearly defines the row order. However, the original index has remained and has been moved to a new column that’s confusingly named index
. This may be acceptable if you need to return the DataFrame to its original sort order in the future, but more often than not, its inclusion just wastes memory. You’ll usually want to get rid of it.
Fortunately, this is easy to do. By default, .reset_index()
makes a copy of the original DataFrame, applies the default index to this second copy, and returns that second copy back to you. This means that the index on the original DataFrame remains unchanged, so no harm has been done to it.
You can instruct .reset_index()
to drop the original index completely and replace it with a fresh default one by setting its drop
parameter to True
. You can also re-assign the original reference to the new DataFrame. The updated version will then be the only one that exists from this point forward:
You’ve now tidied up your DataFrame the way you want it. You’ve also re-assigned the original reference to the new DataFrame, meaning that irritating index
column has been sunk without a trace.
Note: The original DataFrame remains unchanged because .reset_index()
, by default, returns a copy of the original DataFrame. You can set the inplace
parameter to True
to change the original DataFrame.
Over time, the developers of pandas have designed most methods to return DataFrame copies by default so that any incorrect analysis could easily be redone correctly. If each method updated the DataFrame in place, any errors would need a rerun of all earlier processing.
One common misconception is that by setting inplace=True
you’ll save processing time. This generally isn’t accurate. Under the Python hood, a DataFrame copy is often created and worked on, but the re-indexed copy is assigned to the original DataFrame’s variable for you.
It’s recommended not to use this parameter as it’s scheduled for deprecation in a future pandas release. The reason for the deprecation is that setting inplace
to True
changes the underlying data in your original DataFrame. Any code that references your original DataFrame from elsewhere in your program will reference the changed data, which may introduce difficult-to-find bugs.
Setting inplace=True
also makes method chaining impossible. Method chaining is where you apply successive methods that change an original DataFrame, instead of creating lots of intermediate mutating DataFrames. But chaining only works if the methods used in the chain return a DataFrame.
If you apply a DataFrame method to the result of reset_index()
when inplace
has been set to True
, then there would be no DataFrame returned to apply it to, and your code would break.
Now it’s time for you to have some fun fun fun and try the following challenge:
You’ve already seen how it’s possible to reset an index while retaining the old one. When you do this, pandas adds it as a new column with the title index
. Take a look at the documentation for .reset_index()
and see if you can figure out how to customize this new column’s name.
Now re-read the data from band_members.csv
into a fresh DataFrame, and use beach_boys.index = range(1, 10)
to update its index. Finally, use .reset_index()
to see if you can customize your DataFrame and make it look like this:
You’re aiming to make sure the existing index isn’t only retained, but renamed as well.
You’ll find a solution in the Solutions.ipynb
notebook provided in the downloadable materials.
Although the .reset_index()
method you’ve just learned about is the most customizable way to reset a pandas DataFrame index, allowing you to handle multi-indexes, it’s certainly not the only way. It’s possible to do it the gremmie way by using first principles.
Reset an Index Directly With .index
When working with DataFrames, you can identify rows using .loc[]
or .iloc[]
. Each has its use case, although either can be used on any DataFrame.
A DataFrame’s index is referenced using its .index
property. To replace the current index with one of your own, you assign .index
an iterable containing your new index labels. This allows you lots of scope to customize your DataFrame indexes beyond the default incrementing numbers.
Before you explore .loc[]
and .iloc[]
, you’ll add a custom index to your DataFrame that contains the initials of the Beach Boys band members. First, you’ll read the band members’ data from a CSV file and convert it to pyarrow
data types:
You’ll take a look at an alternative way of resetting an index in a short while, but first, you’ll dive into row selection.
Select Rows Using .loc[]
and .iloc[]
The .loc[]
attribute allows you to index rows by their index label. This is particularly useful when the underlying index labels have intrinsic meaning within the DataFrame, such as usernames or timestamps.
If you slice with .loc[]
, it uses a closed interval, meaning that both the start and end index values you specify will appear in the output. So by selecting a slice of rows by username, your output will show the first one, the last one, and everything in between.
The .iloc[]
attribute allows you to find rows based on their index position regardless of the index label’s value. When you use .iloc[]
, you use zero-based indexing. In other words, the first row is at location 0
, not 1
as you might expect. .iloc[]
uses an open interval, which means that while the start index value you specify will appear in the output, the end one won’t.
While these concepts will be very familiar to you if you’ve ever sliced a string or list, it’s important to note that the final item .iloc[]
appears to select won’t actually be selected.
Of course, it doesn’t matter whether you use .iloc[]
or .loc[]
if you’re dealing with default indexes because these also use zero-based sequences, so their labels and positions match. However, if your index contains strings or a non-default numeric sequence, the parameters passed to .iloc[]
will not resemble the actual index value of the row you’re accessing, making your code more difficult to read.
Note: While you can use both .iloc[]
and .loc[]
to select rows, it’s generally preferable to use .loc[]
if your index has explicit meaning. Otherwise, you might be more comfortable using .iloc[]
because of its similarity to other areas of Python. If you’re using default indexes, it won’t matter which you choose.
A good way to remember the difference between them is to mentally associate the initial letter l of .loc[]
with the word label. This will remind you that .loc[]
uses index label values.
Now that your DataFrame has been updated with a fresh index, you can use it to select some rows using both .loc[]
and .iloc[]
:
In line 8, you accessed the first row whose index value is BW
by passing this as a string into .loc[]
. By passing it as a list to .loc[["BW"]]
, your output becomes a DataFrame. You could have passed it in directly, but this would have produced a pandas Series which behaves differently. The main point to note is that by passing an index value into .loc[]
, you returned the row whose index label is that value.
Compare this to the output after line 12. The code accesses row number 1
, this time with .iloc[[1]]
. Notice how the output is different. Using .iloc[[1]]
means that you’ve selected the second row of the DataFrame. Remember, .iloc[]
treats the DataFrame’s first row as 0
, regardless of the actual index label.
In line 16, you selected the first four rows by passing the labels of the first and fourth rows into .loc[]
.
Note: This slicing returns the BW
and BJ
records, plus everything in between, regardless of alphabetical order. It’s sometimes wise to sort an index using the DataFrame’s .sort_index()
method before slicing if an alphabetical selection of rows is required.
Finally, when you passed the slice 1:4
into .iloc[]
in line 23, you selected the rows starting at index position one—in other words, the second row but ending at index position three. Again, this is due to the zero-based numbering effect of rows and because the last slice parameter of position 4
is excluded. Remember .iloc[]
assumes an open interval.
Now, it’s time for you to try another challenge. If you get it wrong, you can always Do It Again:
Once again, re-read the data from band_members.csv
into a fresh DataFrame. Now, see if you can create the DataFrame shown below by assigning an appropriate list to beach_boys.index
, then select the bottom two rows using both .loc[]
and .iloc[]
:
This time, your index contains even numbers starting at 2
.
You’ll find a solution in the Solutions.ipynb
notebook provided in the downloadable materials.
So far, you’ve used .reset_index()
and .index
to reset a pandas DataFrame index. Now it’s time to take a carve and look at yet another alternative.
Reset an Index Directly With .set_axis()
A third way to reset an index is to use the DataFrame’s .set_axis()
method. This method allows you to assign a new RangeIndex
object to your DataFrame, and also allows you to change the column labels.
To alter the existing index of a DataFrame, you can pass .set_axis()
a range
object using the built-in range()
constructor. This will assign an interval of ascending integers to the index that start at zero:
Here, you used .set_axis()
to reset the index back to its default value. To do this, you passed .set_axis()
a range
whose length was equal to that of the original beach_boys
DataFrame. Using len()
ensures there are the correct amount of numbers for each row in the DataFrame. The numbers generated will be zero-based and be just sufficient to cover each row. As you can see, the index has now been reset to its default value.
Once again it’s time to check your understanding. Feel free to ask a friend for help, but only if they’re called Rhonda:
Re-read the data from band_members.csv
Now, use .set_axis()
to see if you can make it look like this:
This time you want each value of the default index to be squared.
As usual, you’ll find a solution in the Solutions.ipynb
notebook provided in the downloadable materials.
Note: At this point, you’ve seen three different ways of resetting a DataFrame index. You may be confused as to which one of them you should use. If all you want to do is reset your index to its default, then you can use the timeit module to time small pieces of code in either IPython or a Jupyter Notebook:
As you can see from these three timings, if all you need is to reset an index to its default, assigning a RangeIndex
object directly to your DataFrame’s .index
property is the fastest of the three. It’s also the only way to reset the index on the original DataFrame. The other two ways create a copy that you can assign back to the beach_boys
variable. However, .reset_index()
, while slow, does offer more configuration options.
Note that your exact timing figures will be different from those shown here, but they should show the same trend.
Now that you know how to reset a DataFrame’s index, you’ll move on and learn why you’d want to do this. It’s not quite time to hang loose just yet.
Restore a Sequential Index
When you cleanse data prior to analysis, you often need to remove certain rows from a DataFrame. For example, you may need to remove duplicate or other unwanted rows. When you remove these, the index values of the rows are removed as well. This could lead to gaps appearing in the index. By resetting the index, you can fix this.
Later, in the section about aligning indexes of several DataFrames, you’ll learn how missing values in a DataFrame’s index can wreak havoc when you need to analyze its data. For now, you’ll mess up the index of your lovely beach_boys
DataFrame:
As you can see, the DataFrame contains the original nine rows of band members with a default index. To remove some rows, you used the DataFrame’s .drop()
method. The rows to be dropped were defined in the Python list provided to its labels
parameter. In this case, the rows whose index label is 3
or 5
were both removed. Look carefully at the index in the output produced below line 17 and you’ll see that the index is no longer sequential.
To fix this, you can use any of the techniques you’ve learned so far. Here, you’ll use .reset_index()
which, because it doesn’t change the underlying DataFrame, means you can reuse the original cleanly-indexed DataFrame in later examples:
As you saw earlier, .reset_index()
has added a new default index to your DataFrame. However, you’re not happy just yet because your original index still remains, albeit in a new column. Once again, you need to adjust the drop
parameter of .reset_index()
to complete the job:
As you can see, by setting drop=True
, the original index is now nowhere to be seen, but your shiny new one looks, well, new and shiny.
Once again, it’s time to check your understanding. Oh, and if you can’t figure it out, Don’t Worry Baby:
To start, re-read the data from band_members.csv
, then use beach_boys.drop(labels=[3, 5])
to remove some of the rows. You’ve just learned how to reset the index to its default value using .reset_index(drop=True)
. See if you can do this again using either of the other two techniques you’ve learned. Your answer should look like this:
As you can see, the default index has been restored, and the original is nowhere to be found.
As with the other exercises, you’ll find a solution in the Solutions.ipynb
notebook provided in the downloadable materials.
All your efforts so far may mean you’re now seeing a twin fin. Next, you’ll get back on your board and get rid of those duplicates.
Remove Duplicate Index Values
You may be surprised to learn that indexes can sometimes have duplicate values. They don’t necessarily need to be unique identifiers. However, duplicates are usually something you want to avoid because they can cause issues. Fortunately, .reset_index()
can handle this for you.
Duplicate index values often arise when two DataFrames are merged. The duplication can cause problems with incorrect row selection, slicing, and filtering. Before you can see these problems, you first need to apply some duplicate values to your index:
Here, you’ve split the original DataFrame into two new ones. The guitar_players
DataFrame, shown below line 9, contains the records of those group members who play guitar. The others
DataFrame, shown below line 20, contains the rest of the members.
To select the guitar players, you passed the "instrument == 'Guitar'"
query string into .query()
in lines 5 through 7, which extracts all rows where the instrument
column values match "Guitar"
.
Lines 16 through 18 use similar code that creates a second DataFrame containing the other rows. In this case, musicians who aren’t marked as guitar players.
In both cases, .reset_index()
was used to make sure the index in both new DataFrames was sequential. This ensured that some identical index values appear across both DataFrames. When you merge both of these together, you might think you’ll get back to your original DataFrame, but you won’t:
You’ve created a single new DataFrame using concat()
. By passing it both guitar_players
and others
, your new all_beach_boys
DataFrame shows the original nine band members once more, but the index contains duplicates. Now that you’ve a DataFrame with duplicate indexes, you’ll investigate the problems this can cause.
Suppose you want to select the fourth row—the row whose index position is 3
. You can’t use .loc[]
to do this because the duplicate index causes issues. Run the code below and you’ll see the problem:
Take a look at the highlighted lines. As you can see, because .loc[]
selected rows whose index label is 3
, two records were returned. To fix this, you’d need to use .iloc[]
to select the required single row.
The duplicate index values also cause havoc when you try to select contiguous rows using slicing. Suppose you wanted to see the rows at index positions three and four. Your first attempt might be to try .loc[]
:
As you can see, when you try to pass the required index positions into .loc[]
, it throws a KeyError
exception because you have a non-unique label. To fix this, you’d need to resort to using .iloc[]
instead.
Note: It’s possible to use slicing with .loc[]
when duplicate indexes are involved. Before doing this, make sure you sort the index first using .sort_index()
. Otherwise, you’ll raise a KeyError
exception due to the duplicate values:
As you can see, the rows with index labels 3
and 4
have been returned.
Suppose you now want to see those elements with index labels 1
and 3
. This time, you use the DataFrame’s .filter()
method:
You’ve tried to filter the DataFrame with index labels 1
and 3
by passing these in as a list to the items
parameter of .filter()
. You’ve also set the axis
parameter to "index"
to apply the filter to the index. Despite your code being technically correct, the result is a ValueError
exception because of the duplicate labels.
At the moment, you’re probably feeling a bit rag dolled with all of these setbacks. It’s time to get back on your board again and solve the problems:
See if you can fix this code so that both .loc[]
and .iloc[]
produce the same results, and so .filter()
works as expected when applied to all_beach_boys
. Also, make sure the problematic old index has been removed:
Well done. Everything works!
Now suppose you want to promote an existing column to an index. Is this possible? It’s time to drop in, catch a wave, and see.
Use an Existing Column as an Index
While the default sequential numeric index provides a unique accessor to the rows within your DataFrame, it’s unlikely that it’ll have any inherent meaning. For example, the numbers assigned to each row of the DataFrames you’ve used up to this point have no meaning relative to the data they’re indexing.
In the DataFrame read in from band_members.csv
, band member Brian Wilson
has an index value of 0
simply because he appears first in the file. This has no inherent meaning, even though it may offend you if you’re a fan and think he should be number 1
.
While the concept of using an unrelated sequential index will be familiar to you if you’ve worked with keys in a relational database, you may want something more meaningful in your DataFrames.
If you want more user-friendly index labels, you could use an existing column and promote it to the index. To do this, you use the .set_index()
method. Although you can promote any existing column to become the index, keep in mind that unless your intended column contains unique values, you’ll still run into the same problems with duplicates that you saw earlier.
Suppose you want to reset your index to contain the first_name
labels. You could do this as shown here:
When you call .set_index()
on beach_boys
and pass in "first_name"
, this column gets promoted to the index. You can then use .loc[]
to select one or more rows using the first names of the musicians you’re interested in.
In some cases, you may wish to reset your index so that its existing values become more meaningful to the data they’re indexing. While you can’t use .reset_index()
for this, you could apply something more suitable to the DataFrame’s .index
attribute.
A common example is using employee identifiers instead of plain sequential integers. The code below updates the existing sequential index with more contextual values:
In this code, you used a list comprehension to create a list of strings containing ["Employee_1", "Employee_2", ...]
. This then gets assigned to the .index
attribute of beach_boys
.
Now that you have a more meaningful index, you can use it in the same way you used the default numeric index. For example, you can select rows by their new Employee_
values:
In the code above, you used .loc[]
to select the record of Employee_4
.
Now for a bit of a mind-bender. Wouldn’t it be nice if you could do this:
Using the original beach_boys
DataFrame read in from band_members.csv
, create an index consisting of the usernames of staff in the format <last_name><initial>
. Your final result should look like this:
As you can see, the index now contains a common username format, which will make it straightforward to select users by their usernames.
Next, you’ll gain some experience tandem surfing. It’s time to use not one, but two DataFrames together.
Align Indexes of Several DataFrames
One of the great features of working with pandas DataFrames is that you can use the basic arithmetic operators to add their data together. Unfortunately, you’ll only be allowed this convenience if their indexes align. Otherwise, you’ll run into problems.
Suppose you’re analyzing the weekly sales of records from a record store. Two weeks of sales data are stored inside two CSV files named week1_record_sales.csv
and week2_record_sales.csv
. For demonstration purposes, both files contain identical sales data but their indexes are different:
Each file is read into a DataFrame and contains daily sales information. Every row is identified by its index
column, which has been set as the DataFrame’s index with .set_index()
.
Suppose you now want to find the total sales of both weeks. This should be achievable with little more than a simple arithmetic operation:
As you can see, something’s gone wrong. Because both DataFrames contain the same data, you’d expect the answers to be twice their original values. Instead, the first and last answers are NaN
, meaning an arithmetic calculation couldn’t be performed due to missing values. Also, the remaining results are incorrect.
Both of these issues were caused by mismatched indexes. The NaN
values have appeared because neither index 0
nor index 5
appear in both of your DataFrames. The calculations are wrong because, for example, the Wednesday sales figure of 200
is indexed as 2
in your first DataFrame, while index 2
refers to the Tuesday sales of 150
in your second DataFrame. When you add these, the result is meaningless.
You can also merge two DataFrames in the same way that relational database tables can be merged. This allows you to see matching data from both DataFrames in the same place. Again, if you join on index, each index value must refer to the related data in both DataFrames.
For example, say you wanted to see all data for both sales weeks. To do this, you can use the DataFrame’s .merge()
method:
Right away you can see some problems. Neither the records with index 0
nor index 5
are anywhere to be seen. The daily figures don’t align either.
Here, you’ve performed an inner join of both DataFrames, meaning only records whose index values appear in both DataFrames will be merged. Because index values 0
and 5
don’t appear in both, they’re not included in the merge. The days are mismatched because the same index refers to different days in each DataFrame.
To fix both of these problems, you need to make sure both DataFrames are using the same index. One way would be to reset the index on week2_sales
to its default. This will then match that used by week1_sales
, but only because the daily data for both DataFrames are already in the same order:
As before, to reset the index back to its default, you use .reset_index()
and pass True
to its drop
parameter to remove the problematic original index. Now when you run both of the previous pieces of code, the results are far more palatable:
As you can see, everything now matches and nothing is missing. Aligning the indexes has solved both issues in one fell swoop. However, this example only worked because the rows within the DataFrames were in the correct order to begin with. This won’t always be the case.
This time you’re going to think laterally. So give your brain a Wipeout and see if you can solve this exercise:
Suppose you’re happy with the index of week2_sales
, but not with that of the week1_sales
DataFrame. See if you can use one of the techniques you learned about earlier to apply the index from week2_sales
to that of week1_sales
. Don’t forget to make sure both the addition and .merge()
still produce the correct output.
So far, you’ve merged the DataFrames on numerical indexes. Can you think of a better alternative that would still work even if both original DataFrames had their rows in a different order? Again, make sure that both the addition and .merge()
produce the correct result.
Remember, you can always take a look at the solution in your downloaded materials.
To round off your learning experience, you’ll finish up by learning how to reset multi-level indexes on DataFrames. Not something surfers normally do, unless they’re Pythonistas like you.
Reset Multi-Indexes
Each of the DataFrames you’ve been working with so far has consisted of single-column Index
objects. DataFrames also support MultiIndex
objects, which provide hierarchical, or multi-level indexes for your DataFrames.
In this section, you start to wake up from your California Dreaming of the Beach Boys and decide to have breakfast. You’ll use the cereals.csv
file to help you decide which cereal to eat. This file contains data about various popular breakfast cereals from a range of manufacturers. The original data comes from Kaggle and is freely available under the Creative Commons License. Here, you’re using a cut-down version of it.
The first thing that you’ll need to do is read the cereals data into a DataFrame:
As you can see, the file contains details of different breakfast cereals. When you call the DataFrame’s .head()
method, you see the first five records that reveal the cereal’s name
and manufacturer
. You can also see its type
, which tells you whether the cereal is to be eaten hot or cold, as well as its fiber
content.
Not surprisingly, this DataFrame has a simple index. One quick way to create a MultiIndex
is to create a pivot table from it:
This pivot table, which is actually another DataFrame, analyzes the raw data by working out the average fiber content for each type of cereal for each manufacturer. The index of this DataFrame is a bit different from what you’re used to seeing:
The MultiIndex
in this pivot_table consists of both manufacturer
and type
columns. Instead of the simple single columns you’ve seen so far, you now have a more complex multi-level structure.
In this example, you’ve created a MultiIndex
consisting of two levels. These are defined by passing "manufacturer"
and "type"
to the index
parameter of .pivot_table()
.
Within the MultiIndex
object, manufacturer
is known as level 0, while type
is level 1. These level numbers are important if you need to reset the index because they allow you to do so on a specific level, or even reset it completely.
Suppose you wanted to reset only level 1, corresponding to type
, which relegates it to a separate column before removing it completely:
By passing both level=1
and drop=True
to .reset_index()
, you drop the type
details, retaining only the manufacturer
information as a simple Index
.
Take care when doing this, because you’ve now created duplicate index values with all of the problems you saw earlier. In addition, your data has now lost some of its meaning. For example, there’s confusion over what both the Nabisco
and Quaker Oats
labels are showing you. You no longer know which one refers to the hot cereals and which to the cold.
Be aware that resetting part of a MultiIndex
can have undesirable side effects that aren’t always obvious.
Time to Catch a Wave and go on a Surfin’ Safari through your penultimate challenge:
Using the previous code for guidance, see if you can produce this exact output to provide clarification to the Nabisco
and Quaker Oats
confusion:
The type
information forms the index while the manufacturer
information is back in a column.
Sometimes, it’s better to reset all levels of a MultiIndex
but retain all data. When you apply .reset_index()
to a MultiIndex
using its default parameters, you’ll replace the complete index with a simple default version and create additional columns from the original index inside your DataFrame:
This time your DataFrame has a default index applied to it, but it also has new manufacturer
and type
columns, as well as the aggregated fiber
column’s data. Most importantly, the data hasn’t lost any of its meaning.
This final challenge will test those Good Vibrations you now have about your ability to work with DataFrame indexes.
See if you can solve the undesirable side effects of your original pivot table by flattening the MultiIndex
. Perhaps it’ll look something like this:
The index now consists of one level instead of two, but each level is a tuple.
That’s it! You should be stoked that you now have a comprehensive understanding of how to reset the index on your DataFrames using several techniques. You’ve also had a bonus lesson on how to incorporate Beach Boys song titles and corny surfing references into a Python tutorial. A truly rad original from your friends at Real Python.
Conclusion
In this tutorial, you learned that while the .reset_index()
method is the most customizable way to reset an index, it isn’t the fastest option. However, it is useful when you’re working with MultiIndex
resets.
You also learned that directly applying the index to the DataFrame’s .index
property is the quickest way to reset an index, and that it changes the original DataFrame in the process. Additionally, you discovered how the .set_axis()
method allows you to reset and re-label your index should you want to.
Congratulations on completing this tutorial, and enjoy applying these newfound skills to better prepare your DataFrames for analysis. Happy surfing!
Take the Quiz: Test your knowledge with our interactive “How to Reset a pandas DataFrame Index” quiz. You’ll receive a score upon completion to help you track your learning progress:
Interactive Quiz
How to Reset a pandas DataFrame Index
This quiz will challenge your knowledge of resetting indexes in pandas DataFrames. You won’t find all the answers in the tutorial, so you’ll need to do some extra investigating. By finding all the answers, you’re sure to learn some interesting things along the way.