reCAPTCHA WAF Session Token
Programming Languages

Exploring the Keys to Data Preparation — SitePoint

In this article, we’ll explore what data preprocessing is, why it’s important, and how to clean, transform, integrate and reduce our data.

Table of Contents
  1. Why Is Data Preprocessing Needed?
  2. Data Cleaning
  3. Data Transformation
  4. Data Integration
  5. Data Reduction
  6. Conclusion

Why Is Data Preprocessing Needed?

Data preprocessing is a fundamental step in data analysis and machine learning. It’s an intricate process that sets the stage for the success of any data-driven endeavor.

At its core, data preprocessing encompasses an array of techniques to transform raw, unrefined data into a structured and coherent format ripe for insightful analysis and modeling.

This vital preparatory phase is the backbone for extracting valuable knowledge and wisdom from data, empowering decision-making and predictive modeling across diverse domains.

The need for data preprocessing arises from real-world data’s inherent imperfections and complexities. Often acquired from different sources, raw data tends to be riddled with missing values, outliers, inconsistencies, and noise. These flaws can obstruct the analytical process, endangering the reliability and accuracy of the conclusions drawn. Moreover, data collected from various channels may vary in scales, units, and formats, making direct comparisons arduous and potentially misleading.

Data preprocessing typically involves several steps, including data cleaning, data transformation, data integration, and data reduction. We’ll explore each of these in turn below.

Data Cleaning

Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in the data. Some standard techniques used in data cleaning include:

  • handling missing values
  • handling duplicates
  • handling outliers

Let’s discuss each of these data-cleaning techniques in turn.

Handling missing values

Handling missing values is an essential part of data preprocessing. Observations with missing data are dealt with under this technique. We’ll discuss three standard methods for handling missing values: removing observations (rows) with missing values, imputing missing values with the statistics tools, and imputing missing values with machine learning algorithms.

We will demonstrate each technique with a custom dataset and explain the output of each method, discussing all of these techniques of handling missing values individually.

Dropping observations with missing values

The simplest way to deal with missing values is to drop rows with missing ones. This method usually isn’t recommended, as it can affect our dataset by removing rows containing essential data.

Let’s understand this method with the help of an example. We create a custom dataset with age, income, and education data. We introduce missing values by setting some values to NaN (not a number). NaN is a special floating-point value that indicates an invalid or undefined result. The observations with NaN will be dropped with the help of the dropna() function from the Pandas library:

import pandas as pd
import numpy as np

data = pd.DataFrame({'age': [20, 25, np.nan, 35, 40, np.nan],
  'income': [50000, np.nan, 70000, np.nan, 90000, 100000],
  'education': ['Bachelor', np.nan, 'PhD', 'Bachelor', 'Master', np.nan]})

data_cleaned = data.dropna(axis=0)

print("Original dataset:")

print("\nCleaned dataset:")

The output of the above code is given below. Note that the output won’t be produced in a bordered table format. We’re providing it in this format to make the output more interpretable, as shown below.

Original dataset

age income education
20 50000 Bachelor
25 NaN NaN
NaN 70000 PhD
35 NaN Bachelor
40 90000 Master
NaN 100000 NaN

Cleaned dataset

age income education
20 50000 Bachelor
40 90000 Master

The observations with missing values are removed in the cleaned dataset, so only the observations without missing values are kept. You’ll find that only row 0 and 4 are in the cleaned dataset.

Dropping rows or columns with missing values can significantly reduce the number of observations in our dataset. This may affect the accuracy and generalization of our machine-learning model. Therefore, we should use this approach cautiously and only when we have a large enough dataset or when the missing values aren’t essential for analysis.

Imputing missing values with statistics tools

This is a more sophisticated way to deal with missing data compared with the previous one. It replaces the missing values with some statistics, such as the mean, median, mode, or constant value.

This time, we create a custom dataset with age, income, gender, and marital_status data with some missing (NaN) values. We then impute the missing values with the median using the fillna() function from the Pandas library:

import pandas as pd
import numpy as np

data = pd.DataFrame({'age': [20, 25, 30, 35, np.nan, 45],
  'income': [50000, np.nan, 70000, np.nan, 90000, 100000],
  'gender': ['M', 'F', 'F', 'M', 'M', np.nan],
  'marital_status': ['Single', 'Married', np.nan, 'Married', 'Single', 'Single']})

data_imputed = data.fillna(data.median())

print("Original dataset:")

print("\nImputed dataset:")

The output of the above code in table form is shown below.

Original dataset

age income gender marital_status
20 50000 M Single
25 NaN F Married
30 70000 F NaN
35 NaN M Married
NaN 90000 M Single
45 100000 NaN Single

Imputed dataset

age income gender marital_status
20 50000 M Single
30 90000 F Married
30 70000 F Single
35 90000 M Married
30 90000 M Single
45 100000 M Single

In the imputed dataset, the missing values in the age, income, gender, and marital_status columns are replaced with their respective column medians.

Imputing missing values with machine learning algorithms

Machine-learning algorithms provide a sophisticated way to deal with missing values based on features of our data. For example, the KNNImputer class from the Scikit-learn library is a powerful way to impute missing values. Let’s understand this with the help of a code example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
  'age': [25, 30, np.nan, 40, 45],
  'gender': ['F', 'M', 'M', np.nan, 'F'],
  'salary': [5000, 6000, 7000, 8000, np.nan]})

print('Original Dataset')

from sklearn.impute import KNNImputer

imputer = KNNImputer()

df['gender'] = df['gender'].map({'F': 0, 'M': 1})

df_imputed = imputer.fit_transform(df[['age', 'gender', 'salary']])

df_imputed = pd.DataFrame(df_imputed, columns=['age', 'gender', 'salary'])

df_imputed['name'] = df['name']

print('Dataset after imputing with KNNImputer')

The output of this code is shown below.

Original Dataset

name age gender salary
Alice 25.0 F 5000.0
Bob 30.0 M 6000.0
Charlie NaN M 7000.0
David 40.0 NaN 8000.0
Eve 45.0 F NaN

Dataset after imputing with KNNImputer

age gender salary name
25.0 0.0 5000.000000 Alice
30.0 1.0 6000.000000 Bob
37.5 1.0 7000.000000 Charlie
40.0 1.0 8000.000000 David
45.0 0.0 6666.666667 Eve

The above example demonstrates that imputing missing values with machine learning can produce more realistic and accurate values than imputing with statistics, as it considers the relationship between the features and the missing values. However, this approach can also be more computationally expensive and complex than imputing with statistics, as it requires choosing and tuning a suitable machine learning algorithm and its parameters. Therefore, we should use this approach when we have sufficient data, and the missing values are not random or trivial for your analysis.

It’s important to note that many machine-learning algorithms can handle missing values internally. XGBoost, LightGBM, and CatBoost are brilliant examples of machine-learning algorithms supporting missing values. These algorithms take missing values internally by ignoring missing ones, splitting missing values, and so on. But this approach doesn’t work well on all types of data. It can result in bias and noise in our model.

Handling duplicates

There are many times we have to deal with data with duplicate rows — such as rows with the same data in all columns. This process involves the identification and removal of duplicated rows in the dataset.

Here, the duplicated() and drop_duplicates() functions can us. The duplicated() function is used to find the duplicated rows in the data, while the drop_duplicates() function removes these rows. This technique can also lead to the removal of important data. So it’s important to analyze the data before applying this method:

import pandas as pd

data = pd.DataFrame({'name': ['John', 'Emily', 'Peter', 'John', 'Emily'],
  'age': [20, 25, 30, 20, 25],
  'income': [50000, 60000, 70000, 50000, 60000]})

duplicates = data[data.duplicated()]

data_deduplicated = data.drop_duplicates()

print("Original dataset:")

print("\nDuplicate rows:")

print("\nDeduplicated dataset:")

The output of the above code is shown below.

Original dataset

name age income
John 20 50000
Emily 25 60000
Peter 30 70000
John 20 50000
Emily 25 60000

Duplicate rows

name age income
John 20 50000
Emily 25 60000

Deduplicated dataset

name age income
John 20 50000
Emily 25 60000
Peter 30 70000

The duplicate rows are removed from the original dataset based on the deduplicated dataset’s name, age, and income columns.

Handing outliers

In real-world data analysis, we often come across data with outliers. Outliers are very small or huge values that deviate significantly from other observations in a dataset. Such outliers are first identified, then removed, and the dataset is transformed at a specific scale. Let’s understand with the following detail.

Identifying outliers

As we’ve already seen, the first step is to identify the outliers in our dataset. Various statistical techniques can be used for this, such as the interquartile range (IQR), z-score, or Tukey methods.

We’ll mainly look at z-score. It’s a common technique for the identification of outliers in the dataset.

The z-score measures how many standard deviations an observation is from the mean of the dataset. The formula for calculating the z-score of an observation is this:

z = (observation - mean) / standard deviation

The threshold for the z-score method is typically chosen based on the level of significance or the desired level of confidence in identifying outliers. A commonly used threshold is a z-score of 3, meaning any observation with a z-score more significant than 3 or less than -3 is considered an outlier.

Removing outliers

Once the outliers are identified, they can be removed from the dataset using various techniques such as trimming, or removing the observations with extreme values. However, it’s important to carefully analyze the dataset and determine the appropriate technique for handling outliers.

Transforming the data

Alternatively, the data can be transformed using mathematical functions such as logarithmic, square root, or inverse functions to reduce the impact of outliers on the analysis:

import pandas as pd
import numpy as np

data = pd.DataFrame({'age': [20, 25, 30, 35, 40, 200],
  'income': [50000, 60000, 70000, 80000, 90000, 100000]})

mean = data.mean()
std_dev = data.std()

threshold = 3
z_scores = ((data - mean) / std_dev).abs()
outliers = data[z_scores > threshold]

data_without_outliers = data[z_scores <= threshold]

print("Original dataset:")


print("\nDataset without outliers:")

In this example, we’ve created a custom dataset with outliers in the age column. We then apply the outlier handling technique to identify and remove outliers from the dataset. We first calculate the mean and standard deviation of the data, and then identify the outliers using the z-score method. The z-score is calculated for each observation in the dataset, and any observation that has a z-score greater than the threshold value (in this case, 3) is considered an outlier. Finally, we remove the outliers from the dataset.

The output of the above code in table form is shown below.

Original dataset

age income
20 50000
25 60000
30 70000
35 80000
40 90000
200 100000


Dataset without outliers

age income
20 50000
25 60000
30 70000
35 80000
40 90000

The outlier (200) in the age column in the dataset without outliers is removed from the original dataset.

Data Transformation

Data transformation is another method in data processing to improve data quality by modifying it. This transformation process involves converting the raw data into a more suitable format for analysis by adjusting the data’s scale, distribution, or format.

  • Log transformation is used to reduce outliers’ impact and transform skewed (a situation where the distribution of the target variable or class labels is highly imbalanced) data into a normal distribution. It’s a widely used transformation technique that involves taking the natural logarithm of the data.
  • Square root transformation is another technique to transform skewed data into a normal distribution. It involves taking the square root of the data, which can help reduce the impact of outliers and improve the data distribution.

Let’s look at an example:

import pandas as pd
import numpy as np

data = pd.DataFrame({'age': [20, 25, 30, 35, 40, 45],
  'income': [50000, 60000, 70000, 80000, 90000, 100000],
  'spending': [1, 4, 9, 16, 25, 36]})

data['sqrt_spending'] = np.sqrt(data['spending'])

print("Original dataset:")

print("\nTransformed dataset:")
print(data[['age', 'income', 'sqrt_spending']])

In this example, our custom dataset has a variable called spending. A significant outlier in this variable is causing skewness in the data. We’re controlling this skewness in the spending variable. The square root transformation has transformed the skewed spending variable into a more normal distribution. Transformed values are stored in a new variable called sqrt_spending. The normal distribution of sqrt_spending is between 1.00000 to 6.00000, making it more suitable for data analysis.

The output of the above code in table form is shown below.

Original dataset

age income spending
20 50000 1
25 60000 4
30 70000 9
35 80000 16
40 90000 25
45 100000 36

Transformed dataset

age income sqrt_spending
20 50000 1.00000
25 60000 2.00000
30 70000 3.00000
35 80000 4.00000
40 90000 5.00000
45 100000 6.00000

Data Integration

The data integration technique combines data from various sources into a single, unified view. This helps to increase the completeness and diversity of the data, as well as resolve any inconsistencies or conflicts that may exist between the different sources. Data integration is helpful for data mining, enabling data analysis spread across multiple systems or platforms.

Let’s suppose we have two datasets. One contains customer IDs and their purchases, while the other dataset contains information on customer IDs and demographics, as given below. We intend to combine these two datasets for a more comprehensive customer behavior analysis.

Customer Purchase Dataset

Customer ID Purchase Amount
1 $50
2 $100
3 $75
4 $200

Customer Demographics Dataset

Customer ID Age Gender
1 25 Male
2 35 Female
3 30 Male
4 40 Female

To integrate these datasets, we need to map the common variable, the customer ID, and combine the data. We can use the Pandas library in Python to accomplish this:

import pandas as pd

purchase_data = pd.DataFrame({'Customer ID': [1, 2, 3, 4],
  'Purchase Amount': [50, 100, 75, 200]})

demographics_data = pd.DataFrame({'Customer ID': [1, 2, 3, 4],
  'Age': [25, 35, 30, 40],
  'Gender': ['Male', 'Female', 'Male', 'Female']})

merged_data = pd.merge(purchase_data, demographics_data, on='Customer ID')


The output of the above code in table form is shown below.

Customer ID Purchase Amount Age Gender
1 $50 25 Male
2 $100 35 Female
3 $75 30 Male
4 $200 40 Female

We’ve used the merge() function from the Pandas library. It merges the two datasets based on the common customer ID variable. It results in a unified dataset containing purchase information and customer demographics. This integrated dataset can now be used for more comprehensive analysis, such as analyzing purchasing patterns by age or gender.

Data Reduction

Data reduction is one of the commonly used techniques in the data processing. It’s used when we have a lot of data with plenty of irrelevant information. This method reduces data without losing the most critical information.

There are different methods of data reduction, such as those listed below.

  • Data cube aggregation involves summarizing or aggregating the data along multiple dimensions, such as time, location, product, and so on. This can help reduce the complexity and size of the data, as well as reveal higher-level patterns and trends.
  • Dimensionality reduction involves reducing the number of attributes or features in the data by selecting a subset of relevant features or transforming the original features into a lower-dimensional space. This can help remove noise and redundancy and improve the efficiency and accuracy of data mining algorithms.
  • Data compression involves encoding the data in a more minor form, by using techniques such as sampling, clustering, histogram analysis, wavelet analysis, and so on. This can help reduce the data’s storage space and transmission cost and speed up data processing.
  • Numerosity reduction replaces the original data with a more miniature representation, such as a parametric model (for example, regression, log-linear models, and so on) or a non-parametric model (such as histograms, clusters, and so on). This can help simplify the data structure and analysis and reduce the amount of data to be mined.

Data preprocessing is essential, because the quality of the data directly affects the accuracy and reliability of the analysis or model. By properly preprocessing the data, we can improve the performance of the machine learning models and obtain more accurate insights from the data.


Preparing data for machine learning is like getting ready for a big party. Like cleaning and tidying up a room, data preprocessing involves fixing inconsistencies, filling in missing information, and ensuring that all data points are compatible. Using techniques such as data cleaning, data transformation, data integration, and data reduction, we create a well-prepared dataset that allows computers to identify patterns and learn effectively.

It’s recommended that we explore data in depth, understand data patterns and find the reasons for missingness in data before choosing an approach. Validation and test set are also important ways to evaluate the performance of different techniques.

Source link

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button
WP Twitter Auto Publish Powered By :
SiteLock Consent Preferences

Adblock Detected

Please consider supporting us by disabling your ad blocker