10 Python One-Liners That Will Boost Your Data Science Workflow

0 0 6 minutes read

10 Python One-Liners That Will Boost Your Data Science Workflow
Image by Author | Ideogram

Python is the most popular data science programming language, as it’s versatile and has a lot of support from the community. With so much usage, there are many ways to improve our data science workflow that you might not know.

In this article, we will explore ten different Python one-liners that would boost your data science work.

What are they? Let’s have a look.

1. Efficient Missing Data Handling

Missing data is a constant occurrence in datasets. It could happen because of a variety of reasons, from data mismanagement to natural conditions and beyond. Nevertheless, we need to decide how to handle the missing data.

Some would make it into the missing data category or drop them all. However, there are times we opt to fill in the missing data.

If we want to fill in the missing data, we can use the Pandas fillnamethod. It’s easy to use as we only need to pass the value we want to fill as the replacement for the missing value, but we can make it more efficient.

Let’s see the code below.

df.fillna(col: df[col].median() for col in df.select_dtypes(include=”number”).columns | col: df[col].mode()[0] for col in df.select_dtypes(include=”object”).columns, inplace=True)

df.fillna(col: df[col].median() for col in df.select_dtypes(include=‘number’).columns |

col: df[col].mode()[0] for col in df.select_dtypes(include=‘object’).columns, inplace=True)

By combining the fillna with the condition, we can fill the numerical missing data with the median and the categorical missing data with the mode.

With one line, you can quickly fill in all the different missing data in other columns.

2. Highly Correlated Features Removal

Multicollinearity occurs when our dataset contains many independent variables that are highly correlated with each other instead of with the target. This negatively impacts the model performance, so we want to keep less correlated features.

We can combine the Pandas correlation feature with the conditional selection to quickly select the less correlated features. For example, here is how we can choose the features that have the maximum Pearson correlation with the others below 0.95.

df = df.loc[:, df.corr().abs().max() < 0.95]

df = df.loc[:, df.corr().abs().max() < 0.95]

Trying out the correlation features and the threshold to see if the prediction model is good or not.

3. Conditional Column Apply

Creating a new column with multiple conditions can sometimes be complicated, and the line to perform them can be long. However, we can use the apply method from the Pandas to use specific conditions when developing the new feature while still using multiple column values.

For example, here are examples of creating a new column where the values are based on the condition of the other column values.

df[‘new_col’] = df.apply(lambda x: x[‘A’] * x[‘B’] if x[‘C’] > 0 else x[‘A’] + x[‘B’], axis=1)

df[‘new_col’] = df.apply(lambda x: x[‘A’] * x[‘B’] if x[‘C’] > 0 else x[‘A’] + x[‘B’], axis=1)

You can try out another condition that follows your requirements.

4. Finding Common and Different Element

Python provides many built-in data types, including Set. The Set data type is unique data that represents an unordered list of data but only with unique elements. It’s often used for many data operations, which include finding the common elements.

For example, we have the following set:

set1 = “apple”, “banana”, “cherry”, “date”, “fig” set2 = “cherry”, “date”, “elderberry”, “fig”, “grape”

set1 = “apple”, “banana”, “cherry”, “date”, “fig”

set2 = “cherry”, “date”, “elderberry”, “fig”, “grape”

Then, we want to find the common element between both sets. We can use the following method.

Output:

‘cherry’, ‘date’, ‘fig’

‘cherry’, ‘date’, ‘fig’

It’s a simple but useful way to find the common element. In reverse, we can also find the elements that are different within both sets.

Output:

Try using them in your data workflow when you are required to find the common and different elements.

5. Boolean Masks for Filtering

When working with the NumPy array and its derivate object, we sometimes want to filter the data according to our requirements. In this case, we can create a boolean mask to filter the data based on the boolean condition we set.

Let’s say we have the following list of data.

import numpy as np data = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50])

import numpy as np

data = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50])

Then, we can use the boolean mask to filter the data we want. For example, we want only even numbers.

Output:

array([10, 20, 30, 40, 50])

array([10, 20, 30, 40, 50])

This is also the basis of the Pandas filtering; however, a Boolean mask can be more versatile as it works in the NumPy array as well.

6. List Count Occurrence

When working with a list or any other data with multiple values, there are times when we want to know the frequency for each value. In this case, we can use the counter function to count them automatically.

For example, consider having the following list.

data = [10, 10, 20, 20, 30, 35, 40, 40, 40, 50]

data = [10, 10, 20, 20, 30, 35, 40, 40, 40, 50]

Then, we can use the counter function to calculate the frequency.

from collections import Counter Counter(data)

from collections import Counter

Counter(data)

Output:

Counter(10: 2, 20: 2, 30: 1, 35: 1, 40: 3, 50: 1)

Counter(10: 2, 20: 2, 30: 1, 35: 1, 40: 3, 50: 1)

The result is a dictionary for the count occurrence. Use them when you need quick frequency calculation.

7. Numerical Extraction from Text

Regular expressions (Regex) are defined character lists that match a pattern in text. They are usually used when we want to perform specific text manipulation, and that’s precisely what we can do with this one-liner.

In the example below, we can use a combination of Regex and map to extract numbers from the text.

import re list(map(int, re.findall(r’\d+’, “Sample123Text456”)))

import re

list(map(int, re.findall(r‘\d+’, “Sample123Text456”)))

Output:

The example above only works for integer data, but learning more about regular expressions can give you the power and flexibility to adapt this one-liner for multiple use cases.

8. Flatten Nested List

When we prepare our data for analysis, we can encounter list data that contains a list within the list, which we can call nested. If we find something like that, we might want to flatten it for further data analysis or visualization.

For example, let’s say we have the following nested list.

nested_list = [ [1, 2, 3], [4, 5], [6, 7, 8, 9] ]

nested_list = [

[1, 2, 3],

[4, 5],

[6, 7, 8, 9]

]

We can then flatten the list with the following code.

Output:

[1, 2, 3, 4, 5, 6, 7, 8, 9]

[1, 2, 3, 4, 5, 6, 7, 8, 9]

With this one-dimensional data list, you can analyze further and in a more straightforward manner if needed.

9. List to Dictionary

Have you ever got into a situation where you have several lists and want to combine the information in the dictionary form? For example, the use case may be related to mapping purposes or feature encoding.

In this case, we can convert the list we have into a dictionary using the zip function.

For example, we have the following list.

fruit = [‘apple’, ‘banana’, ‘cherry’] values = [100, 200, 300]

fruit = [‘apple’, ‘banana’, ‘cherry’]

values = [100, 200, 300]

With the combination of zip and dict, we can combine both of the lists above into one.

Output:

‘apple’: 100, ‘banana’: 200, ‘cherry’: 300

‘apple’: 100, ‘banana’: 200, ‘cherry’: 300

This is a quick way to combine both pieces of data into one structure, which can then be used for further data preprocessing.

10. Dictionary Merging

When we have a dictionary that contains the information we require for data preprocessing, we should combine them. For example, we have performed the list to dictionary action like above and ended up with the following dictionaries:

fruit_mapping = ‘apple’: 100, ‘banana’: 200, ‘cherry’: 300 furniture_mapping = ‘table’: 100, ‘chair’: 200, ‘sofa’: 300

fruit_mapping = ‘apple’: 100, ‘banana’: 200, ‘cherry’: 300