Automating Data Cleaning Processes with Pandas
Few data science projects are exempt from the need to clean data. Data cleaning covers the initial steps of preparing data. Its purpose is to retain only the relevant and useful information in the data, whether for later analysis, as input to an AI or machine learning model, and so on. Unifying or converting data types, dealing with missing values, eliminating noisy values stemming from erroneous measurements, and removing duplicates are typical examples of data cleaning steps.
As you might think, the more complex the data, the more intricate, tedious, and time-consuming the data cleaning can become, especially when implementing it manually.
This article delves into the functionalities offered by the Pandas library to automate the process of cleaning data. Off we go!
Cleaning Data with Pandas: Common Functions
Automating data cleaning with pandas boils down to applying several data cleaning functions in sequence and encapsulating that sequence in a single data cleaning pipeline. Before doing this, let's introduce some commonly used pandas functions for different data cleaning steps. In what follows, we assume an example Python variable df that contains a dataset stored in a pandas DataFrame object.
- Filling missing values: pandas provides methods for automatically dealing with missing values in a dataset, either by replacing them with a “default” value using the df.fillna() method, or by removing any rows or columns that contain missing values through the df.dropna() method.
- Removing duplicated instances: automatically removing duplicate entries (rows) in a dataset could not be easier thanks to the df.drop_duplicates() method, which removes extra instances when either the values of specific attributes (selected with the subset argument) or the values of the entire row duplicate another entry.
- Manipulating strings: some pandas functions help make the format of string attributes uniform. For instance, if a 'column' attribute contains a mix of lowercase, sentence case, and uppercase values and we want them all to be lowercase, df['column'].str.lower() does the job. To remove accidentally introduced leading and trailing whitespace, try df['column'].str.strip().
- Manipulating date and time: pd.to_datetime(df['column']) converts string columns containing date-time information into pandas datetime values, which makes them easier to manipulate later. For day-first formats such as dd/mm/yyyy, pass dayfirst=True or an explicit format such as format='%d/%m/%Y' so that day and month are not swapped.
- Column renaming: automating the renaming of columns can be particularly useful when there are multiple datasets segregated by city, region, project, etc., and we want to add prefixes or suffixes to all or some of their columns to make them easier to identify. The df.rename(columns={old_name: new_name}) method makes this possible.
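As a rough illustration of the functions listed above, the following sketch applies each of them once to a tiny DataFrame. The column names (city, visited) and all values are invented for this example and are not part of the article's dataset.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({
    "city": ["  Madrid", "madrid ", "PARIS", None],
    "visited": ["01/02/2023", "01/02/2023", "15/03/2023", "20/04/2023"],
})

# Replace missing city names with a default value.
df["city"] = df["city"].fillna("unknown")

# Make the string format uniform: no surrounding whitespace, all lowercase.
df["city"] = df["city"].str.strip().str.lower()

# Parse dd/mm/yyyy strings with an explicit format so day and month are not swapped.
df["visited"] = pd.to_datetime(df["visited"], format="%d/%m/%Y")

# The two Madrid rows are identical after normalisation, so one of them is dropped.
df = df.drop_duplicates()

# Give the column a more descriptive name.
df = df.rename(columns={"city": "city_name"})

print(df)
```

Note that the duplicate rows only become detectable after the strings have been normalised, which is one reason the order of the steps deserves some thought.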
Putting it all Together: Automated Data Cleaning Pipeline
Time to put the example methods above together into a reusable pipeline that helps automate the data cleaning process over time. Consider a small dataset of personal transactions with three columns: name of the person (name), date of purchase (date), and amount spent (value). This dataset has been stored in a pandas DataFrame, df.
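The original values of this dataset are not reproduced here, so the sketch below builds a stand-in with the same three columns. Every value in it is hypothetical and was chosen only so that the data contains missing entries, a duplicated row, and inconsistent string formatting.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the transactions dataset: the values are invented,
# only the column names (name, date, value) come from the description above.
df = pd.DataFrame({
    "name": [" Alice ", "Bob", "Bob", None, "CAROL"],
    "date": ["2023-01-15", "2023-02-03", "2023-02-03", "2023-02-20", None],
    "value": [25.5, 40.0, 40.0, np.nan, 12.75],
})

print("Original DataFrame:")
print(df)

# Quick look at what needs cleaning.
print("\nMissing values per column:")
print(df.isna().sum())
print("\nFully duplicated rows:", df.duplicated().sum())
```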
To create a simple yet encapsulated data cleaning pipeline, we create a custom class called DataCleaner, with a custom method for each of the data cleaning steps outlined above, as follows:
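import pandas as pd

class DataCleaner:
    def __init__(self):
        pass

    def fill_missing_values(self, df):
        # Forward fill first, then backward fill whatever is still missing
        return df.ffill().bfill()

Note: ffill and bfill are two examples of strategies for dealing with missing values. Older code often writes them as df.fillna(method='ffill') and df.fillna(method='bfill'), but the method argument of fillna has been deprecated since pandas 2.1 in favor of the dedicated df.ffill() and df.bfill() methods used here. In particular, ffill applies a “forward fill” that imputes each missing value from the previous row’s value. A “backward fill” is then applied with bfill to fill any remaining missing values using the next row’s value. After both passes, missing values can remain only in a column that contains no values at all.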
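To see the two fill strategies in isolation, here is a minimal sketch on hand-made Series. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at the start, in the middle, and at the end.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Forward fill copies the previous value downward; the leading gap has no
# previous value, so it stays missing.
forward = s.ffill()

# Backward fill then copies the next value upward, closing the leading gap.
both = forward.bfill()

# A series with no values at all cannot be filled by either strategy.
empty = pd.Series([np.nan, np.nan]).ffill().bfill()

print(forward.tolist())
print(both.tolist())
print(empty.isna().all())
```

This is why the pipeline still includes a separate step for dropping missing values: it catches whatever the two fills could not resolve.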
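    def drop_missing_values(self, df):
        # Drop any row that still contains a missing value
        return df.dropna()

    def remove_duplicates(self, df):
        # Keep only the first occurrence of each fully duplicated row
        return df.drop_duplicates()

    def clean_strings(self, df, column):
        # Strip surrounding whitespace and convert to lowercase
        df[column] = df[column].str.strip().str.lower()
        return df

    def convert_to_datetime(self, df, column):
        # Parse date strings into pandas datetime values
        df[column] = pd.to_datetime(df[column])
        return df

    def rename_columns(self, df, columns_dict):
        # columns_dict maps old column names to new ones
        return df.rename(columns=columns_dict)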
Then comes the “central” method of this class, which ties all the cleaning steps together into a single pipeline. Remember that, just like in any data manipulation process, the order matters: it is up to you to determine the most logical order in which to apply the different steps, depending on the specific problem addressed.
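A small sketch of why the order matters: removing duplicates before normalizing strings can miss rows that only differ in case or whitespace. The data and the helper function normalise_names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical data: the same person written in three different ways.
df = pd.DataFrame({"name": ["Ann", " ann ", "ANN"], "value": [10, 10, 10]})

def normalise_names(frame):
    # Illustrative helper: strip whitespace and lowercase the name column.
    out = frame.copy()
    out["name"] = out["name"].str.strip().str.lower()
    return out

# Order A: deduplicate first, then normalise. The raw rows all differ,
# so nothing is removed.
dedupe_first = normalise_names(df.drop_duplicates())

# Order B: normalise first, then deduplicate. The rows are now identical,
# so only one survives.
normalise_first = normalise_names(df).drop_duplicates()

print(len(dedupe_first))     # 3
print(len(normalise_first))  # 1
```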
    def clean_data(self, df):
        # Apply every cleaning step in a fixed order
        df = self.fill_missing_values(df)
        df = self.drop_missing_values(df)
        df = self.remove_duplicates(df)
        df = self.clean_strings(df, 'name')
        df = self.convert_to_datetime(df, 'date')
        df = self.rename_columns(df, {'name': 'full_name'})
        return df
Finally, we use the newly created class to apply the entire cleaning process in one shot and display the result.
cleaner = DataCleaner()
cleaned_df = cleaner.clean_data(df)
print("\nCleaned DataFrame:")
print(cleaned_df)
And that’s it! We have a much nicer and more uniform version of our original data after applying some touches to it.
This encapsulated pipeline is designed to facilitate and greatly simplify the overall data cleaning process on any new batches of data you get from now on.
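As a sketch of how such a pipeline might be reused on several batches, the example below applies one cleaning function to each batch and concatenates the results. It uses a compact function-based variant rather than the DataCleaner class so that it runs on its own. The names clean_batch and strip_and_lower, as well as the batch contents, are assumptions made for this illustration.

```python
import pandas as pd

def strip_and_lower(frame, column):
    # Illustrative helper: normalise one string column without mutating the input.
    out = frame.copy()
    out[column] = out[column].str.strip().str.lower()
    return out

def clean_batch(frame):
    # A condensed version of the cleaning steps, chained with DataFrame.pipe.
    return (
        frame.ffill().bfill()
        .drop_duplicates()
        .pipe(strip_and_lower, "name")
        .rename(columns={"name": "full_name"})
    )

# Hypothetical batches arriving at different times.
batches = [
    pd.DataFrame({"name": [" Ann", None], "value": [1.0, 2.0]}),
    pd.DataFrame({"name": ["BOB ", "BOB "], "value": [3.0, 3.0]}),
]

# Clean each batch the same way, then combine them into one DataFrame.
cleaned = pd.concat([clean_batch(b) for b in batches], ignore_index=True)
print(cleaned)
```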