Categorical Data

Learn to handle non-numeric data in machine learning

🏷️ Understanding Categorical Data

Categorical data represents categories or groups rather than measured quantities. Most ML algorithms operate on numbers, so text categories must be converted to a numeric format before training.


# Example categorical data
colors = ['red', 'blue', 'green', 'red']
sizes = ['small', 'medium', 'large']
                                    

Types of Categorical Data

Understanding different types helps choose the right encoding method:

Nominal: no natural order. Examples: colors, brands, countries.

Ordinal: has a natural order. Examples: ratings, sizes, education level.

Label Encoding: convert categories to integers (0, 1, 2, ...). Simple and memory efficient.

One-Hot Encoding: create one binary column per category. No order bias; the most common choice.

🔹 Step 1: Identify Categorical Data

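First, identify which columns hold categories. In pandas, text columns typically show up with the object (or string) dtype, while numeric columns such as price show up with an integer or float dtype.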

import pandas as pd

# Sample data with categories
data = {
    'color': ['red', 'blue', 'green', 'red'],
    'size': ['small', 'large', 'medium', 'small'],
    'price': [10, 20, 15, 12]
}
df = pd.DataFrame(data)
print(df)
print("\nData types:")
print(df.dtypes)
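Once the dtypes are visible, the categorical columns can be picked out programmatically. The sketch below assumes that every non-numeric column is categorical, which holds for this toy data but not for every dataset (free text or dates would need separate handling). The name `categorical_cols` is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red'],
    'size': ['small', 'large', 'medium', 'small'],
    'price': [10, 20, 15, 12],
})

# Assumption: any column that is not numeric is treated as categorical
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()
print("Categorical columns:", categorical_cols)

# Check how many distinct categories each column has before choosing an encoding
for col in categorical_cols:
    print(col, "->", df[col].nunique(), "unique values:", sorted(df[col].unique()))
```

Columns with only a few unique values are easy to one-hot encode; columns with very many may need a different strategy.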

🔹 Step 2: Label Encoding

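Convert categories to integer codes. Note that LabelEncoder assigns codes in alphabetical order, not in the natural order of the categories, and scikit-learn intends it for target labels rather than input features. For ordinal features, the order has to be supplied explicitly, for example with OrdinalEncoder(categories=...) or a manual mapping.

from sklearn.preprocessing import LabelEncoder

# Create encoder
encoder = LabelEncoder()

# Encode size. Codes follow sorted (alphabetical) order: large=0, medium=1, small=2,
# which does NOT match the natural order small < medium < large
df['size_encoded'] = encoder.fit_transform(df['size'])
print("Original sizes:", df['size'].tolist())
print("Encoded sizes:", df['size_encoded'].tolist())

# See the mapping (the code of each class is its index in encoder.classes_)
print("Mapping:", {str(label): int(code) for code, label in enumerate(encoder.classes_)})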
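When the order of the categories matters, one option is scikit-learn's OrdinalEncoder with an explicit category list. This is a minimal sketch; the order in `size_order` is an assumption about what the sizes mean, and the variable names are introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# Assumed natural order of the categories, from smallest to largest
size_order = ['small', 'medium', 'large']

# categories takes one list per encoded column
ordinal = OrdinalEncoder(categories=[size_order])

# fit_transform returns a 2D float array; flatten it and cast to int
sizes['size_encoded'] = ordinal.fit_transform(sizes[['size']]).ravel().astype(int)

print(sizes)
```

With this encoder the codes respect small < medium < large, so a model that compares the numbers sees the intended ranking.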

🔹 Step 3: One-Hot Encoding

Create separate columns for each category (best for nominal data)

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# One-hot encode colors
# (sparse_output requires scikit-learn 1.2+; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)
color_encoded = encoder.fit_transform(df[['color']])

# Create new dataframe with encoded columns
color_df = pd.DataFrame(color_encoded, columns=encoder.get_feature_names_out(['color']))
print("One-hot encoded colors:")
print(color_df)
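To see what "one binary column per category" means, the same matrix can be built by hand. This is a minimal NumPy sketch for illustration only, not how scikit-learn implements it; the variable names are made up for this example.

```python
import numpy as np

colors = np.array(['red', 'blue', 'green', 'red'])

# np.unique returns the distinct categories in sorted order
categories = np.unique(colors)

# Compare every value against every category: each row gets a single 1
# in the column of its own category and 0 everywhere else
one_hot = (colors[:, None] == categories[None, :]).astype(int)

print("Columns:", categories.tolist())
print(one_hot)
```

Each row contains exactly one 1, so no category is numerically "larger" than another.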

🔹 Step 4: Using Pandas get_dummies

Pandas provides an easy way to one-hot encode

# Easy one-hot encoding with pandas
encoded_df = pd.get_dummies(df, columns=['color', 'size'])
print("Full encoded dataset:")
print(encoded_df)

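# Drop the first category of each encoded feature: it is implied when all the
# other columns are 0, and dropping it avoids perfectly collinear columns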
encoded_df = pd.get_dummies(df, columns=['color'], drop_first=True)
print("\nWith drop_first=True:")
print(encoded_df)
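The reason drop_first=True loses no information can be checked directly: in a full one-hot encoding every row sums to 1, so any one column can be recovered from the others. A small sketch, with variable names chosen for this example:

```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})

# Full encoding: one column per category (dtype=int gives 0/1 instead of booleans)
full = pd.get_dummies(colors, columns=['color'], dtype=int)

# Reduced encoding: the first category (alphabetically, 'blue') is dropped
reduced = pd.get_dummies(colors, columns=['color'], drop_first=True, dtype=int)

# Every row of the full encoding sums to 1, so the columns are linearly dependent
print("Row sums:", full.sum(axis=1).tolist())

# The dropped column can be rebuilt from the remaining ones
recovered_blue = 1 - reduced.sum(axis=1)
print("Recovered color_blue:", recovered_blue.tolist())
```

This redundancy mainly matters for linear models; tree-based models are generally not affected by it.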

🔹 Step 5: Handling New Categories

What to do when new categories appear in test data

# Train data
train_colors = ['red', 'blue', 'green']
encoder = LabelEncoder()
encoder.fit(train_colors)

# Test data with new category
test_colors = ['red', 'yellow', 'blue']  # 'yellow' is new!

# Safe encoding function
def safe_encode(encoder, data):
    result = []
    for item in data:
        if item in encoder.classes_:
            result.append(int(encoder.transform([item])[0]))  # cast NumPy integer to plain int
        else:
            result.append(-1)  # Unknown category
    return result

encoded_test = safe_encode(encoder, test_colors)
print("Test colors:", test_colors)
print("Encoded (with -1 for unknown):", encoded_test)

🔹 Step 6: Complete Example

A full workflow on a toy dataset: encode the categorical features, split the data, train a model, and evaluate it

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Sample dataset
data = {
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['small', 'large', 'medium', 'small', 'large'],
    'material': ['wood', 'metal', 'wood', 'plastic', 'metal'],
    'sold': [1, 0, 1, 1, 0]  # Target variable
}
df = pd.DataFrame(data)

# Encode categorical features
X = pd.get_dummies(df[['color', 'size', 'material']], drop_first=True)
y = df['sold']

# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
# (with only 5 rows the test set holds a single sample, so accuracy is either 0.00 or 1.00)
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
print("Feature columns:", X.columns.tolist())
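The example above encodes the whole dataset before splitting it, which is fine for a demo. On real data, a common alternative is to put the encoder inside a scikit-learn Pipeline so it is fitted on the training split only and can cope with categories that first appear at prediction time. The sketch below is one possible setup, not the only one; the step names 'onehot', 'preprocess' and 'model' are arbitrary labels chosen for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['small', 'large', 'medium', 'small', 'large'],
    'material': ['wood', 'metal', 'wood', 'plastic', 'metal'],
    'sold': [1, 0, 1, 1, 0],  # Target variable
})

features = ['color', 'size', 'material']
X = df[features]
y = df['sold']

# Split the raw data first, so the encoder never sees the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One-hot encode the categorical columns; unseen categories become all zeros
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), features)]
)

pipeline = Pipeline([
    ('preprocess', preprocess),
    ('model', RandomForestClassifier(random_state=42)),
])

# fit() learns the categories and trains the model on the training split only
pipeline.fit(X_train, y_train)

# predict() applies the same encoding to the test rows before predicting
predictions = pipeline.predict(X_test)
print("Predictions:", predictions.tolist())
```

Because encoding and model live in one object, the same pipeline can later be applied to new raw rows without repeating the preprocessing by hand.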

🧠 Test Your Knowledge

Which encoding is best for colors (red, blue, green)?

What does drop_first=True prevent?