
How to Manage Categorical Data Effectively with Pandas



 

Let’s try to learn about categorical data in Pandas.
 

Preparation


 
Before we start, we need the Pandas and NumPy packages installed. You can install them with pip by running pip install pandas numpy.

 

With the packages installed, let’s jump into the main part of the article.

 

Manage Categorical Data in Pandas

 

Categorical is a Pandas data type for variables that take a limited, usually fixed, number of distinct values (the categories). It differs from the string or object data type in Pandas, especially in the way Pandas stores the data.

Categorical data is more memory-efficient because each distinct category is stored only once, and every row holds just a small integer code pointing to its category. In contrast, the object data type stores each value as a separate Python string, which requires much more memory when values repeat.
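To make the storage model concrete, here is a minimal sketch showing the two pieces a categorical column holds. The Series name fruits and its values are only illustrative.

```python
import pandas as pd

# Illustrative data, not taken from a real dataset
fruits = pd.Series(['apple', 'kiwi', 'apple', 'kiwi', 'watermelon'], dtype='category')

# Each distinct value is stored once as a category
print(list(fruits.cat.categories))

# Each row only holds a small integer code pointing into the categories
print(fruits.cat.codes.tolist())
print(fruits.cat.codes.dtype)
```

With only three categories, Pandas can use a one-byte integer per row for the codes, which is where most of the memory saving comes from.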

Let’s try out the categorical data type with an example. Below is how we can create categorical data with Pandas.

<code>import pandas as pd

df = pd.DataFrame({
    'fruits': pd.Categorical(['apple', 'kiwi', 'watermelon', 'kiwi', 'apple', 'kiwi']),
    'size': pd.Categorical(['small', 'large', 'large', 'small', 'large', 'small'])
})
df.info()</code>

 

Output:

<code>RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   fruits  6 non-null      category
 1   size    6 non-null      category
dtypes: category(2)
memory usage: 396.0 bytes</code>

 

You can see that the data type of both the fruits and size columns is category rather than object, which is what we usually get for string data.

We can try to compare the memory usage for the categorical and object data types with the following code:

<code>import numpy as np

n = 100000

df_object = pd.DataFrame({
    'fruit': np.random.choice(['apple', 'banana', 'orange'], size=n)
})

print('Memory usage with object type:')
print(df_object['fruit'].memory_usage(deep=True))


df_category = pd.DataFrame({
    'fruit': pd.Categorical(np.random.choice(['apple', 'banana', 'orange'], size=n))
})

print('Memory usage with categorical type:')
print(df_category['fruit'].memory_usage(deep=True))</code>

 

Output:

<code>Memory usage with object type:
6267209
Memory usage with categorical type:
100424</code>

 

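You can see that the object type consumes far more memory than the categorical data type: about 6.3 MB versus about 0.1 MB in this run, roughly 60 times more. The absolute saving grows with the number of rows, as long as the number of distinct values stays small.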
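In practice the data often already exists as an object column. A minimal sketch of converting it with astype('category') follows; the variable names and the seeded random generator are assumptions made only to keep the example repeatable.

```python
import numpy as np
import pandas as pd

# Seeded generator, used only so the sketch is repeatable
rng = np.random.default_rng(0)

# An existing column of plain Python strings (object dtype)
s_object = pd.Series(rng.choice(['apple', 'banana', 'orange'], size=10_000)).astype(object)

# Convert the existing column to the categorical data type
s_category = s_object.astype('category')

object_bytes = s_object.memory_usage(deep=True)
category_bytes = s_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# The values are unchanged; only the storage differs
print((s_category.astype(object) == s_object).all())
```

The exact byte counts depend on the Pandas and Python versions, so treat the printed numbers as indicative only.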

Next, we will examine the methods specific to the categorical data type, which are available through the .cat accessor. For example, you can get the categories:

<code>df['fruits'].cat.categories</code>

 

Output:

<code>Index(['apple', 'kiwi', 'watermelon'], dtype='object')</code>

 

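Also, we can rename the categories. When a list is passed, the new names are matched to the existing categories by position:

<code># New names map positionally onto ['apple', 'kiwi', 'watermelon']
df['fruits'] = df['fruits'].cat.rename_categories(['fruit_apple', 'fruit_kiwi', 'fruit_watermelon'])
print(df['fruits'].cat.categories)</code>

 

Output:

<code>Index(['fruit_apple', 'fruit_kiwi', 'fruit_watermelon'], dtype='object')</code>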
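Because a list is matched by position, it is easy to attach the wrong name to a category. As a hedged alternative, rename_categories also accepts a dict that maps old names to new ones. The sketch below uses its own small Series, named s here purely for illustration.

```python
import pandas as pd

# Illustrative data, independent of the df used above
s = pd.Series(['apple', 'kiwi', 'watermelon', 'kiwi'], dtype='category')

# Map old names to new ones explicitly; categories missing from the dict are left unchanged
renamed = s.cat.rename_categories({'apple': 'fruit_apple', 'kiwi': 'fruit_kiwi'})

print(list(renamed.cat.categories))
print(renamed.tolist())
```

The dict form makes the intent explicit and lets you rename only some of the categories.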

 

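A categorical can also be ordered, which gives the categories a rank and lets us compare values against a category.

<code># Declare the order small < medium < large; 'medium' is allowed even though no row uses it
df['size'] = pd.Categorical(df['size'], categories=['small', 'medium', 'large'], ordered=True)

# Element-wise comparison based on category order, not alphabetical order
df['size'] < 'large'</code>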

 

Output:

<code>0     True
1    False
2    False
3     True
4    False
5     True
Name: size, dtype: bool</code>
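The same ordering also drives sorting and the min and max methods. The following is a minimal sketch under the same small < medium < large assumption; the Series name sizes is hypothetical and separate from the df above.

```python
import pandas as pd

# Illustrative ordered categorical with an explicit category order
sizes = pd.Series(pd.Categorical(
    ['small', 'large', 'medium', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True,
))

# Sorting follows the declared category order rather than alphabetical order
print(sizes.sort_values().tolist())

# min and max are defined because the categorical is ordered
print(sizes.min(), sizes.max())
```

Sorted alphabetically, large would come before medium and small, so declaring the order is what makes these results meaningful.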

 

Mastering the categorical data type will give you an edge in data analysis.

 


 

 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
