NumPy with Pandas for More Efficient Data Analysis

For many data practitioners, Pandas is the go-to package for data manipulation because it’s intuitive and easy to use. That’s why many data science curricula include Pandas.

Pandas is built on the NumPy package, especially the NumPy array. Many NumPy functions and techniques still work well with Pandas objects, so we can use NumPy to improve our data analysis with Pandas.

This article explores several examples of how NumPy can support data analysis with Pandas.

Let’s get into it.

Pandas Data Analysis Improvement with NumPy

Before proceeding with the tutorial, we should have all the required packages installed. If you haven’t done so, you can install Pandas and NumPy with pip by running <code>pip install pandas numpy</code>.

We can start by explaining how Pandas and NumPy are connected. As mentioned above, Pandas is built on the NumPy package. Let’s see how they could complement each other to improve our data analysis.

First, let’s try to create a NumPy array and Pandas DataFrame with the respective packages.

<code>import numpy as np
import pandas as pd

np_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pandas_df = pd.DataFrame(np_array, columns=['A', 'B', 'C'])

print(np_array)
print(pandas_df)</code>

<code>Output>>
[[1 2 3]
 [4 5 6]
 [7 8 9]]
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9</code>

As the output shows, we can create a Pandas DataFrame from a NumPy array, and the DataFrame keeps the array’s two-dimensional shape.
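The conversion also works in the other direction. As a minimal sketch (the variable name round_trip is my own, not from the original code), DataFrame.to_numpy() returns the DataFrame's values as a plain NumPy array:

```python
import numpy as np
import pandas as pd

np_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pandas_df = pd.DataFrame(np_array, columns=['A', 'B', 'C'])

# Convert the DataFrame back into a plain NumPy array
# (round_trip is a hypothetical name used only for this illustration)
round_trip = pandas_df.to_numpy()

print(type(round_trip))
print(np.array_equal(round_trip, np_array))
```

This is handy when a downstream library expects NumPy arrays rather than DataFrames. Note that the column labels are not carried over into the array.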

Next, we can use NumPy in the Pandas data processing and cleaning steps. For example, we can use the NumPy NaN object as the missing data placeholder.

<code>df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 3, 2],
    'C': [1, 2, 3, np.nan, 5]
})
print(df)</code>

<code>Output>>
    A    B    C
0  1.0  5.0  1.0
1  2.0  NaN  2.0
2  NaN  NaN  3.0
3  4.0  3.0  NaN
4  5.0  2.0  5.0</code>

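As the result above shows, Pandas treats the NumPy NaN object as its standard marker for missing numeric data. The columns are displayed as floats (1.0, 5.0, and so on) because NaN is a floating-point value.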

We can count the NaN values in each DataFrame column by calling <code>df.isna().sum()</code>.

<code>Output>>
A    1
B    2
C    1
dtype: int64</code>
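The per-column count above can be reproduced with the sketch below. It rebuilds the same DataFrame and counts the missing values twice: once with the Pandas isna() method and once with NumPy's isnan() on the underlying array. The variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 3, 2],
    'C': [1, 2, 3, np.nan, 5]
})

# Pandas: count missing values per column
nan_counts = df.isna().sum()

# NumPy: the same count on the underlying float array (axis=0 sums down each column)
nan_counts_np = np.isnan(df.to_numpy()).sum(axis=0)

print(nan_counts)
print(nan_counts_np)
```

The NumPy version only works here because every column is numeric; np.isnan raises an error on object (string) data, so the Pandas method is the safer default.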

The data collector may represent the missing data values in the DataFrame column as strings. If that happens, we can try to replace that string value with a NumPy NaN object.

<code>df['A'] = df['A'].replace('missing data', np.nan)</code>
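For a self-contained illustration of that replacement, the sketch below builds a small hypothetical column in which missing entries were recorded as the string 'missing data'. The sample values and the extra pd.to_numeric step are my assumptions, added so that the column ends up with a numeric dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical column where missing entries were recorded as text
df = pd.DataFrame({'A': [1, 'missing data', 3, 'missing data', 5]})

# Swap the placeholder string for NaN
df['A'] = df['A'].replace('missing data', np.nan)

# Make sure the column is numeric so that statistics work as expected
df['A'] = pd.to_numeric(df['A'])

print(df)
print(df['A'].dtype)
```

Some pandas versions may print a warning about downcasting during the replace call; the resulting values are the same either way.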

NumPy can also be used for outlier detection. Let’s see how we can do that.

<code>df = pd.DataFrame({
    'A': np.random.normal(0, 1, 1000),
    'B': np.random.normal(0, 1, 1000)
})

df.loc[10, 'A'] = 100
df.loc[25, 'B'] = -100

def detect_outliers(data, threshold=3):
    z_scores = np.abs((data - data.mean()) / data.std())
    return z_scores > threshold

outliers = detect_outliers(df)
print(df[outliers.any(axis=1)])</code>

<code>Output>>
            A           B
10  100.000000    0.355967
25    0.239933 -100.000000</code>

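In the code above, we generate random numbers with NumPy and then define a function that flags outliers using Z-scores and the three-sigma rule: any value more than three standard deviations from its column mean is marked. The result is the set of DataFrame rows that contain at least one outlier.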
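Because the random numbers are not seeded, the non-outlier values change on every run. As a sketch of a reproducible variant, the code below draws the data from np.random.default_rng with a fixed seed (the seed 42 and the name outlier_rows are arbitrary choices of mine) and reuses the same detect_outliers function.

```python
import numpy as np
import pandas as pd

# Seeded generator so the data is identical on every run (seed chosen arbitrarily)
rng = np.random.default_rng(42)

df = pd.DataFrame({
    'A': rng.normal(0, 1, 1000),
    'B': rng.normal(0, 1, 1000)
})

# Inject one extreme value into each column
df.loc[10, 'A'] = 100
df.loc[25, 'B'] = -100

def detect_outliers(data, threshold=3):
    # Z-score: distance from the column mean in units of standard deviation
    z_scores = np.abs((data - data.mean()) / data.std())
    return z_scores > threshold

outliers = detect_outliers(df)
outlier_rows = df[outliers.any(axis=1)]

print(outlier_rows.index.tolist())
```

One thing to keep in mind: the injected extremes inflate the standard deviation of each column, so only the injected points cross the threshold here. With real data, a few very large values can hide milder outliers for the same reason.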

We can also perform statistical analysis with Pandas, and NumPy functions can be used during the aggregation step. For example, here is a statistical aggregation with Pandas and NumPy.

<code>df = pd.DataFrame({
    'Category': [np.random.choice(['A', 'B']) for i in range(100)],
    'Values': np.random.rand(100)
})

print(df.groupby('Category')['Values'].agg([np.mean, np.std, np.min, np.max]))</code>

<code>Output>>
             mean       std      amin      amax
Category                                        
A         0.524568  0.288471  0.025635  0.999284
B         0.525937  0.300526  0.019443  0.999090</code>

Using NumPy, we can apply statistical functions to a Pandas DataFrame and obtain aggregate statistics like the output above.
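Depending on your pandas version, passing NumPy callables such as np.mean to agg may produce a FutureWarning, and the minimum and maximum columns may be labeled differently (amin and amax in the output above). As a hedged alternative, the sketch below uses string aliases on a small, fixed dataset of my own so the numbers are easy to check by hand. It also shows that the std column is the sample standard deviation (ddof=1), while np.std defaults to the population version (ddof=0).

```python
import numpy as np
import pandas as pd

# Small fixed dataset (illustrative values, not from the article)
df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B'],
    'Values': [1.0, 3.0, 2.0, 6.0]
})

# String aliases dispatch to pandas' own implementations
summary = df.groupby('Category')['Values'].agg(['mean', 'std', 'min', 'max'])
print(summary)

# Compare with NumPy on the raw values of category A
values_a = df.loc[df['Category'] == 'A', 'Values'].to_numpy()
print(np.std(values_a, ddof=1))  # sample std, matches the 'std' column for A
print(np.std(values_a))          # population std, NumPy's default
```

If you compute a standard deviation with NumPy directly and compare it against a Pandas result, setting ddof explicitly avoids a silent mismatch.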

Lastly, we will talk about vectorized operations using Pandas and NumPy. A vectorized operation applies an operation to a whole array or column at once rather than looping over the elements individually. This is usually much faster than an equivalent Python loop.

For example, we can perform element-wise addition between DataFrame columns using NumPy.

<code>data = {'A': [15,20,25,30,35], 'B': [10, 20, 30, 40, 50]}

df = pd.DataFrame(data)
df['C'] = np.add(df['A'], df['B'])  

print(df)</code>

<code>Output>>
   A   B   C
0  15  10  25
1  20  20  40
2  25  30  55
3  30  40  70
4  35  50  85</code>
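To make the contrast with looping concrete, here is a small sketch that computes the same column both ways. The names looped and vectorized are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [15, 20, 25, 30, 35], 'B': [10, 20, 30, 40, 50]})

# Loop version: one Python-level addition per row
looped = [a + b for a, b in zip(df['A'], df['B'])]

# Vectorized version: a single call that operates on the whole columns
vectorized = np.add(df['A'], df['B'])

# Both approaches give the same values
print(looped == vectorized.tolist())
```

On five rows the difference is not noticeable, but on large columns the vectorized call avoids the per-element overhead of the Python loop.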

We can also transform a DataFrame column with a NumPy mathematical function.

<code>df['B_exp'] = np.exp(df['B'])
print(df)</code>

<code>Output>>
   A   B   C         B_exp
0  15  10  25  2.202647e+04
1  20  20  40  4.851652e+08
2  25  30  55  1.068647e+13
3  30  40  70  2.353853e+17
4  35  50  85  5.184706e+21</code>

NumPy also supports conditional replacement on a Pandas DataFrame. Here, np.where returns B * 2 for rows where A is greater than 20 and B / 2 for all other rows.

<code>df['A_replaced'] = np.where(df['A'] > 20, df['B'] * 2, df['B'] / 2)
print(df)</code>

<code>Output>>
   A   B   C         B_exp  A_replaced
0  15  10  25  2.202647e+04         5.0
1  20  20  40  4.851652e+08        10.0
2  25  30  55  1.068647e+13        60.0
3  30  40  70  2.353853e+17        80.0
4  35  50  85  5.184706e+21       100.0</code>
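np.where handles a single condition. When there are several conditions, np.select is one option. The sketch below is an assumption-laden illustration: the column name A_band, the thresholds, and the labels are all hypothetical and not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [15, 20, 25, 30, 35], 'B': [10, 20, 30, 40, 50]})

# Conditions are checked in order; the first one that is True wins
conditions = [df['A'] < 20, df['A'] < 30]
choices = ['low', 'medium']

# Rows matching no condition receive the default label
df['A_band'] = np.select(conditions, choices, default='high')

print(df)
```

Because the conditions are evaluated in order, the second condition only applies to rows that did not already match the first one.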

Those are all the examples we have explored. These NumPy functions can help improve your data analysis process.

Conclusion

This article discussed how NumPy can make data analysis with Pandas more efficient. We performed data preprocessing, data cleaning, statistical analysis, and vectorized operations with Pandas and NumPy.

I hope it helps!

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.

Last Updated: August 20, 2024
