How to Manage Categorical Data Effectively with Pandas

Let’s learn how categorical data works in Pandas.

Preparation

Before we start, we need the Pandas and NumPy packages installed. You can install them with pip, for example by running <code>pip install pandas numpy</code>.

With the packages installed, let’s jump into the main part of the article.

Manage Categorical Data in Pandas

Categorical is a Pandas data type for columns that take a fixed, usually small, number of distinct values (categories). It differs from the string or object data type mainly in how Pandas stores the data.

Categorical data is more memory-efficient because each distinct value is stored only once, and every row holds just a small integer code that points to it. In contrast, the object data type stores each value as a separate Python string, which requires much more memory when values repeat often.
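As a rough illustration of that storage model, the sketch below prints the two pieces a categorical column is built from. The variable name fruits is only illustrative.

```python
import pandas as pd

# Illustrative data: six rows, but only three distinct labels
fruits = pd.Series(
    ['apple', 'kiwi', 'watermelon', 'kiwi', 'apple', 'kiwi'],
    dtype='category',
)

# Each distinct label is stored once, in the categories index
print(list(fruits.cat.categories))  # ['apple', 'kiwi', 'watermelon']

# Each row stores only an integer code pointing into the categories
print(fruits.cat.codes.tolist())    # [0, 1, 2, 1, 0, 1]
```

When the categories are inferred from the data like this, Pandas sorts them, which is why apple gets code 0 and watermelon gets code 2.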

Let’s try out categorical data with an example. Below is how we can create categorical columns with Pandas.

<code>import pandas as pd

df = pd.DataFrame({
    'fruits': pd.Categorical(['apple', 'kiwi', 'watermelon', 'kiwi', 'apple', 'kiwi']),
    'size': pd.Categorical(['small', 'large', 'large', 'small', 'large', 'small'])
})
df.info()</code>

Output:

<code>RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   fruits  6 non-null      category
 1   size    6 non-null      category
dtypes: category(2)
memory usage: 396.0 bytes</code>

You can see that the data type of both the fruits and size columns is category rather than object, which is what we would usually get for string data.
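In practice, columns often arrive as plain strings and are converted afterwards. A minimal sketch of that conversion with astype('category') follows; the df_raw name and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose column starts out as ordinary strings
df_raw = pd.DataFrame({'fruits': ['apple', 'kiwi', 'watermelon', 'kiwi']})
before = df_raw['fruits'].dtype

# Convert the existing column to the categorical data type
df_raw['fruits'] = df_raw['fruits'].astype('category')
after = df_raw['fruits'].dtype

print(before, '->', after)
```

The values themselves are unchanged by the conversion; only the underlying storage differs.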

We can compare the memory usage of the categorical and object data types with the following code:

<code>import numpy as np

n = 100000

df_object = pd.DataFrame({
    'fruit': np.random.choice(['apple', 'banana', 'orange'], size=n)
})

print('Memory usage with object type:')
print(df_object['fruit'].memory_usage(deep=True))


df_category = pd.DataFrame({
    'fruit': pd.Categorical(np.random.choice(['apple', 'banana', 'orange'], size=n))
})

print('Memory usage with categorical type:')
print(df_category['fruit'].memory_usage(deep=True))</code>

Output:

<code>Memory usage with object type:
6267209
Memory usage with categorical type:
100424</code>

You can see that the object type consumes far more memory than the categorical type, roughly 60 times more in this run. The absolute gap grows with the number of rows.

Next, we will look at the special methods that categorical data exposes through the .cat accessor. For example, you can get the categories:

<code>df['fruits'].cat.categories</code>

Output:

<code>Index(['apple', 'kiwi', 'watermelon'], dtype="object")</code>

Also, we can rename the categories. When a list is passed, the new names are matched to the existing categories by position:

<code>df['fruits'] = df['fruits'].cat.rename_categories(['fruit_apple', 'fruit_kiwi', 'fruit_watermelon'])
print(df['fruits'].cat.categories)</code>

Output:

<code>Index(['fruit_apple', 'fruit_kiwi', 'fruit_watermelon'], dtype="object")</code>
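Because a list of new names is applied to the existing categories by position, it is easy to attach a label to the wrong category. One more explicit alternative is to pass a dict that maps old names to new ones. The sketch below assumes a small stand-alone series rather than the df used above, and the new names are illustrative.

```python
import pandas as pd

# Stand-alone illustrative series
fruits = pd.Series(pd.Categorical(['apple', 'kiwi', 'watermelon', 'kiwi']))

# A dict maps each old category name to its new name explicitly
renamed = fruits.cat.rename_categories({
    'apple': 'fruit_apple',
    'kiwi': 'fruit_kiwi',
    'watermelon': 'fruit_watermelon',
})

print(list(renamed.cat.categories))
print(renamed.tolist())
```

Every row follows its category, so the values that were kiwi now read fruit_kiwi.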

A categorical column can also be ordered, which lets us compare its values according to the order of the categories.

<code>df['size'] = pd.Categorical(df['size'], categories=['small', 'medium', 'large'], ordered=True)
df['size'] < 'large' </code>

Output:

<code>0     True
1    False
2    False
3     True
4    False
5     True
Name: size, dtype: bool</code>
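The same ordering also drives sorting and min/max. Here is a small sketch, assuming a stand-alone series with made-up values that include the medium category:

```python
import pandas as pd

# Illustrative ordered categorical: small < medium < large
sizes = pd.Series(pd.Categorical(
    ['small', 'large', 'medium', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True,
))

# Sorting follows the category order, not alphabetical order
print(sizes.sort_values().tolist())  # ['small', 'small', 'medium', 'large']

# min and max are defined because the categories are ordered
print(sizes.min(), sizes.max())      # small large
```

With plain strings, sorting would place large first because it comes first alphabetically.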

Mastering the categorical data type will give you an edge in your data analysis.

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.

Posted by biskit. Last updated: September 3, 2024
