Column Transformers Using Scikit Learn

Column Transformers help you transform columns of different data types in a single code snippet.

Recently we saw how to create a Pipeline using sklearn. We learned how simple it is, and how much it helps with code readability, maintenance, and scalability.

But there's a catch.

When the dataset mixes column types, a single Pipeline will break, because every step is applied to every column. 😱
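To make the problem concrete, here is a minimal sketch of that failure. The tiny DataFrame and the name `scale_everything` are made up for illustration; the point is only that a scaler cannot digest a text column.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny illustrative frame: one numerical and one categorical column
df = pd.DataFrame({
    'age': [25, 40, 33],
    'gender': ['Male', 'Female', 'Female'],
})

# A plain Pipeline sends every column through every step
scale_everything = Pipeline([('scaler', StandardScaler())])

try:
    scale_everything.fit_transform(df)
    failed = False
except (ValueError, TypeError) as err:
    # The scaler cannot convert the strings in 'gender' to numbers
    failed = True
    print(f"Pipeline failed: {err}")
```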

You will either need to create two different Pipelines to treat numerical and categorical features separately, or you can use a ColumnTransformer.

Yes, the sklearn team thought it through! Using a ColumnTransformer, we can define a set of steps that transforms each group of columns accordingly and makes the data ready for a modeling pipeline, or for whatever else we need.

Let's see how to use that method.

First, we import some modules and create some fake data.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Sample dataset (seeded so the random values are reproducible)
np.random.seed(42)
data = pd.DataFrame({
    'age': np.random.randint(18, 65, size=200),
    'income': np.random.randint(20000, 120000, size=200),
    'gender': np.random.choice(['Male', 'Female'], size=200),
    'target': np.random.choice([0, 1], size=200)
})

# Split into features (X) and target (y)
X = data[['age', 'income', 'gender']]
y = data['target']


Figure: Head(5) of the dataset. Image by the author.

Next, let's split the data into training and test sets.

# Split data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

And then we can use the ColumnTransformer class to (1) standardize the numerical features and (2) one-hot encode the categorical feature.

The usage is simple: ColumnTransformer receives a list of tuples. Each tuple has the form ('name of the step', TransformerInstance(), [list, of, columns, to, transform]).

# Define preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['gender'])
    ]
)        
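One detail about those tuples that is easy to miss: by default, any column not listed in a tuple is dropped from the output. The sketch below shows this with a small made-up frame; the `years_at_company` column is hypothetical and not part of the dataset above. Passing `remainder='passthrough'` keeps the unlisted columns untouched.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    'age': [25, 40, 33, 51],
    'income': [30000, 85000, 52000, 99000],
    'gender': ['Male', 'Female', 'Female', 'Male'],
    'years_at_company': [1, 12, 5, 20],  # hypothetical column, not listed in any tuple
})

transformers = [
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(), ['gender']),
]

# Default behavior (remainder='drop'): unlisted columns are discarded
drop_rest = ColumnTransformer(transformers=transformers)

# remainder='passthrough': unlisted columns are appended unchanged
keep_rest = ColumnTransformer(transformers=transformers, remainder='passthrough')

dropped_shape = drop_rest.fit_transform(df).shape
kept_shape = keep_rest.fit_transform(df).shape
print(dropped_shape)  # 2 scaled columns + 2 one-hot columns
print(kept_shape)     # the same 4 columns plus years_at_company
```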

If we fit this transformer on X_train and apply it, we get the array below as a result. The numerical columns are scaled, and gender is encoded in 2 binary columns. It worked as expected.

# Fit and apply transformation to X_train
preprocessor.fit_transform(X_train)

Figure: Transformed X_train dataset. Image by the author.

But in fact, we don't need to apply this transformation by hand. We can plug the transformer directly into a Pipeline, so the same preprocessing is repeated automatically every time we fit or test the model. And that is the beauty of sklearn.

# Build pipeline and plug the ColumnTransformer
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Train and evaluate
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"Model Accuracy: {accuracy:.2f}")
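Once fitted, the pipeline accepts raw rows directly, with no manual scaling or encoding. The sketch below rebuilds a comparable pipeline on made-up data so that it runs on its own. The names `train` and `new_rows` and their values are illustrative assumptions, not the article's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

rng = np.random.default_rng(0)

# Made-up training data with the same three feature columns
train = pd.DataFrame({
    'age': rng.integers(18, 65, size=100),
    'income': rng.integers(20000, 120000, size=100),
    'gender': np.tile(['Male', 'Female'], 50),
})
target = np.tile([0, 1, 1, 0], 25)

pipeline = Pipeline([
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['gender']),
    ])),
    ('classifier', RandomForestClassifier(random_state=42)),
])
pipeline.fit(train, target)

# New, unseen rows in their raw form
new_rows = pd.DataFrame({
    'age': [29, 58],
    'income': [41000, 97000],
    'gender': ['Female', 'Male'],
})

# The pipeline scales and encodes the rows before the classifier sees them
predictions = pipeline.predict(new_rows)
print(predictions)
```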

Why it’s useful

ColumnTransformer makes it easy to handle mixed data types and ensures consistent preprocessing for all features.

Here are 5 benefits of using it:

1. Handle Mixed Data Types: Simplifies preprocessing for datasets with both numerical and categorical features by allowing separate processing pipelines for each type.

2. Maintain Clean Code: Combines all preprocessing steps into one object, improving code readability and making workflows easier to manage.

3. Avoid Repetition: Automates feature-specific transformations, ensuring consistent preprocessing across the training and test datasets.

4. Enable Advanced Workflows: Easily integrates with scikit-learn pipelines, enabling complex workflows like scaling numerical features while encoding categorical ones.

5. Streamline Deployment: Encapsulates preprocessing logic, making it seamless to apply the same transformations to new, unseen data in production.
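As a sketch of the first benefit, each entry of a ColumnTransformer can itself be a Pipeline, so every column type gets its own chain of steps. The example below assumes we also want to fill missing values, which the article's dataset does not have. The data, the step names, and the choice of `SimpleImputer` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Made-up data with a few missing values
df = pd.DataFrame({
    'age': [25, np.nan, 33, 51],
    'income': [30000, 85000, np.nan, 99000],
    'gender': ['Male', 'Female', np.nan, 'Female'],
})

# Numerical columns: fill gaps with the median, then scale
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then encode
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_steps, ['age', 'income']),
    ('cat', categorical_steps, ['gender']),
])

result = preprocessor.fit_transform(df)
if hasattr(result, 'toarray'):
    result = result.toarray()
print(result.shape)
```

Setting `handle_unknown='ignore'` is a design choice for the fifth benefit: a category that appears only in production data is encoded as all zeros instead of raising an error.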

Read More

Here is an article with an introduction to Pipelines in sklearn.

More articles by Gustavo R Santos