Categorical Data
Learn to handle non-numeric data in machine learning
🏷️ Understanding Categorical Data
Categorical data represents categories or groups rather than measured quantities. Most ML algorithms expect numeric input, so text categories must be converted to numbers before training.
# Example categorical data
colors = ['red', 'blue', 'green', 'red']
sizes = ['small', 'medium', 'large']
Types of Categorical Data
Understanding different types helps choose the right encoding method:
Nominal: no natural order (colors, brands)
Ordinal: has a natural order (ratings, sizes)
Label encoding: convert each category to an integer (0, 1, 2, ...)
One-hot encoding: create one binary column per category
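To make the nominal/ordinal distinction concrete, here is a minimal sketch using pandas `Categorical`. The variable names are illustrative only, and the order small < medium < large is the one assumed elsewhere in this tutorial.

```python
import pandas as pd

# Nominal: categories have no order, so only equality comparisons make sense.
colors = pd.Categorical(['red', 'blue', 'green', 'red'])

# Ordinal: declare the order explicitly so sorting and min/max respect it.
sizes = pd.Categorical(
    ['small', 'large', 'medium', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True,
)

print(colors.ordered)            # False
print(sizes.ordered)             # True
print(sizes.codes.tolist())      # [0, 2, 1, 0], codes follow the declared order
print(sizes.min(), sizes.max())  # small large
```

Declaring the order up front is what lets later encoding steps keep it; without it, most tools fall back to alphabetical order.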
🔹 Step 1: Identify Categorical Data
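First, look at what categorical data looks like in a DataFrame. Text columns usually appear with a non-numeric dtype (object or string), while numeric columns such as price get an integer or float dtype.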
import pandas as pd
# Sample data with categories
data = {
    'color': ['red', 'blue', 'green', 'red'],
    'size': ['small', 'large', 'medium', 'small'],
    'price': [10, 20, 15, 12]
}
df = pd.DataFrame(data)
print(df)
print("\nData types:")
print(df.dtypes)
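Reading the dtypes by eye works for three columns, but a larger table is easier to check programmatically. The sketch below is one possible approach, assuming that every non-numeric column is categorical; `df_example` and `categorical_cols` are names introduced here for illustration.

```python
import pandas as pd

df_example = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red'],
    'size': ['small', 'large', 'medium', 'small'],
    'price': [10, 20, 15, 12],
})

# Columns that are not numeric are candidates for encoding.
categorical_cols = df_example.select_dtypes(exclude='number').columns.tolist()
print(categorical_cols)  # ['color', 'size']

# Number of distinct values per categorical column; high counts
# mean one-hot encoding would create many new columns.
print(df_example[categorical_cols].nunique())
```

Note that numeric columns can also be categorical in meaning (for example, ZIP codes), so this check is a starting point rather than a full answer.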
🔹 Step 2: Label Encoding
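Convert each category to an integer. Note that LabelEncoder assigns codes in alphabetical order, so it does not preserve a meaningful order such as small < medium < large; scikit-learn documents it as a tool for encoding target labels. For ordinal features, supply the order explicitly.
from sklearn.preprocessing import LabelEncoder
# Create encoder
encoder = LabelEncoder()
# Encode size; codes follow alphabetical order: large=0, medium=1, small=2
df['size_encoded'] = encoder.fit_transform(df['size'])
print("Original sizes:", df['size'].tolist())
print("Encoded sizes:", df['size_encoded'].tolist())
# See the mapping from category to integer code
print("Mapping:", {label: code for code, label in enumerate(encoder.classes_)})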
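When the order of the categories matters, one option is scikit-learn's `OrdinalEncoder` with an explicit `categories` list. This is a minimal sketch, assuming the order small < medium < large; `sizes_df`, `ordinal_encoder`, and `size_codes` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes_df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# Pass the categories in the intended order: small < medium < large.
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

# OrdinalEncoder expects 2D input and returns a 2D float array,
# so flatten it and cast to int for a single column of codes.
size_codes = ordinal_encoder.fit_transform(sizes_df[['size']]).ravel().astype(int)
print(size_codes.tolist())  # [0, 2, 1, 0]
```

Here small maps to 0, medium to 1, and large to 2, so numeric comparisons on the codes agree with the real-world order.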
🔹 Step 3: One-Hot Encoding
Create separate columns for each category (best for nominal data)
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# One-hot encode colors
# (sparse_output requires scikit-learn 1.2+; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)
color_encoded = encoder.fit_transform(df[['color']])
# Create new dataframe with encoded columns
color_df = pd.DataFrame(color_encoded, columns=encoder.get_feature_names_out(['color']))
print("One-hot encoded colors:")
print(color_df)
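It can help to see which column corresponds to which category and that the encoding is reversible. The following sketch uses the encoder's default sparse output and converts it with `toarray()`, so it does not depend on the `sparse_output` parameter; `colors_df`, `onehot`, and `decoded` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors_df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})

onehot = OneHotEncoder()  # default output is a SciPy sparse matrix
encoded = onehot.fit_transform(colors_df[['color']]).toarray()

# Columns are ordered by sorted category: blue, green, red.
print(onehot.categories_[0].tolist())
print(encoded)

# Map the binary rows back to the original labels.
decoded = onehot.inverse_transform(encoded).ravel().tolist()
print(decoded)  # ['red', 'blue', 'green', 'red']
```

Each row contains exactly one 1, in the column of its category.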
🔹 Step 4: Using Pandas get_dummies
Pandas provides an easy way to one-hot encode
# Easy one-hot encoding with pandas
encoded_df = pd.get_dummies(df, columns=['color', 'size'])
print("Full encoded dataset:")
print(encoded_df)
# Drop the first category of each encoded feature to avoid multicollinearity
# (mainly relevant for linear models)
encoded_df = pd.get_dummies(df, columns=['color'], drop_first=True)
print("\nWith drop_first=True:")
print(encoded_df)
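One caveat with `get_dummies` is that it only creates columns for the categories present in the data it is given, so separately encoded train and test sets can end up with different columns. A possible fix, sketched here with hypothetical `train` and `test` frames, is to reindex the test columns to match the training columns.

```python
import pandas as pd

train = pd.DataFrame({'color': ['red', 'blue', 'green']})
test = pd.DataFrame({'color': ['red', 'yellow']})

train_encoded = pd.get_dummies(train, columns=['color'], dtype=int)
test_encoded = pd.get_dummies(test, columns=['color'], dtype=int)

# Keep exactly the training columns: unseen categories (yellow) are dropped,
# and categories missing from the test set (blue, green) are filled with 0.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_aligned)
```

After alignment, the test frame has the same columns in the same order as the training frame, which is what a fitted model expects.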
🔹 Step 5: Handling New Categories
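An encoder fitted on training data raises an error when it meets a category it has not seen before. One workaround is to map unknown categories to a sentinel value such as -1.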
# Train data
train_colors = ['red', 'blue', 'green']
encoder = LabelEncoder()
encoder.fit(train_colors)
# Test data with new category
test_colors = ['red', 'yellow', 'blue']  # 'yellow' is new!
# Safe encoding function
def safe_encode(encoder, data):
    result = []
    for item in data:
        if item in encoder.classes_:
            # Known category: use the fitted code as a plain Python int
            result.append(int(encoder.transform([item])[0]))
        else:
            result.append(-1)  # Unknown category
    return result
encoded_test = safe_encode(encoder, test_colors)
print("Test colors:", test_colors)
print("Encoded (with -1 for unknown):", encoded_test)
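For one-hot encoded features, scikit-learn offers a built-in alternative to a hand-written helper: `OneHotEncoder(handle_unknown='ignore')`. The sketch below assumes the same train and test colors as above; `train_colors_df`, `test_colors_df`, `ohe`, and `encoded_unknown` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_colors_df = pd.DataFrame({'color': ['red', 'blue', 'green']})
test_colors_df = pd.DataFrame({'color': ['red', 'yellow', 'blue']})

# handle_unknown='ignore' encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train_colors_df)

encoded_unknown = ohe.transform(test_colors_df).toarray()
print(ohe.categories_[0].tolist())  # ['blue', 'green', 'red']
print(encoded_unknown)              # the 'yellow' row is all zeros
```

An all-zero row tells the model "none of the known colors", which is usually safer than a sentinel integer that a model might interpret as a magnitude.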
🔹 Step 6: Complete Example
Full workflow for handling categorical data in ML
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Sample dataset
data = {
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['small', 'large', 'medium', 'small', 'large'],
    'material': ['wood', 'metal', 'wood', 'plastic', 'metal'],
    'sold': [1, 0, 1, 1, 0]  # Target variable
}
df = pd.DataFrame(data)
# Encode categorical features
X = pd.get_dummies(df[['color', 'size', 'material']], drop_first=True)
y = df['sold']
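# Split and train (with only 5 rows the test set holds a single sample,
# so the accuracy below is illustrative only)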
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
# Predict and evaluate
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
print("Feature columns:", X.columns.tolist())
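As a variation on the workflow above, the encoding can be bundled with the model in a scikit-learn `Pipeline`, so the same fitted encoder is applied at prediction time. This is a sketch under stated assumptions rather than the tutorial's own method: it swaps `get_dummies` for `OneHotEncoder(handle_unknown='ignore')`, and names such as `train_df`, `features`, and `new_rows` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train_df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['small', 'large', 'medium', 'small', 'large'],
    'material': ['wood', 'metal', 'wood', 'plastic', 'metal'],
    'sold': [1, 0, 1, 1, 0],
})
features = ['color', 'size', 'material']

# The encoder is fitted only on the data passed to pipeline.fit().
pipeline = Pipeline([
    ('encode', ColumnTransformer([
        ('onehot', OneHotEncoder(handle_unknown='ignore'), features),
    ])),
    ('model', RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipeline.fit(train_df[features], train_df['sold'])

# A new row with an unseen color ('yellow') still encodes without error.
new_rows = pd.DataFrame({'color': ['yellow'], 'size': ['small'], 'material': ['wood']})
pipeline_predictions = pipeline.predict(new_rows)
print(pipeline_predictions)
```

Keeping the encoder inside the pipeline avoids mismatched columns between training and prediction and means unknown categories are handled in one place.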