Scale (Normalization)
Make all your data play nicely together by scaling values
What is Scaling?
Scaling (often called normalization) means adjusting your data so all features have similar ranges. Imagine comparing apples (weight in grams) to distances (in kilometers): scaling puts them on a comparable footing! Many machine learning algorithms, especially distance-based and gradient-based ones, depend on it to work well.
from sklearn.preprocessing import StandardScaler
# Before scaling: ages 25 and 30, salaries 50000 and 60000
# After standard scaling: each column becomes -1.0 and 1.0
data = [[25, 50000], [30, 60000]]
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
Scaling Concepts
Min-Max Scaling
Scale to 0-1 range
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
Standard Scaling
Mean=0, Standard Deviation=1
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(data)
Why Scale?
Fair feature comparison
# Age: 20-80, Salary: 20000-100000
# Without scaling: salary dominates
# With scaling: both features matter equally
Better ML
Many algorithms train faster or predict better on scaled features
# ML algorithms that are sensitive to feature scale include:
# - Neural Networks
# - SVM
# - K-Means Clustering
Why Do We Need Scaling?
Let's see what happens when features have very different ranges:
The Problem: Different Scales
import pandas as pd
# Example: Person data with different scales
data = {
'Age': [25, 30, 35, 40],
'Salary': [30000, 50000, 70000, 90000],
'Years_Experience': [2, 5, 8, 12]
}
df = pd.DataFrame(data)
print(df)
print("\nRanges:")
print(f"Age: {df['Age'].min()} to {df['Age'].max()}")
print(f"Salary: {df['Salary'].min()} to {df['Salary'].max()}")
print(f"Experience: {df['Years_Experience'].min()} to {df['Years_Experience'].max()}")
# Problem: Salary numbers are MUCH bigger than age!
What Happens Without Scaling
# Distance calculation example
import numpy as np
person1 = [25, 30000, 2] # Age, Salary, Experience
person2 = [30, 50000, 5]
# Calculate distance (how similar they are)
distance = np.sqrt(sum((a - b)**2 for a, b in zip(person1, person2)))
print(f"Distance: {distance:.1f}")
# The salary difference (20000) dominates everything!
# Age difference (5) and experience difference (3) barely matter
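To put a number on how much salary dominates, the following sketch splits the squared distance into per-feature shares. The variable names `squared_diffs` and `share` are illustrative and not part of the original example.

```python
import numpy as np

person1 = np.array([25, 30000, 2])  # Age, Salary, Experience
person2 = np.array([30, 50000, 5])

# Squared difference per feature: these are the terms summed inside the square root
squared_diffs = (person1 - person2) ** 2

# Fraction of the total squared distance contributed by each feature
share = squared_diffs / squared_diffs.sum()

for name, fraction in zip(["Age", "Salary", "Experience"], share):
    print(f"{name}: {fraction:.6%} of the squared distance")
```

Salary accounts for well over 99.99% of the squared distance, so age and experience have almost no influence on which people look "similar".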
After Scaling - Fair Comparison
from sklearn.preprocessing import StandardScaler
# Scale the data
data_array = [[25, 30000, 2], [30, 50000, 5], [35, 70000, 8], [40, 90000, 12]]
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_array)
print("Scaled data:")
print(scaled_data)
# Now every column has mean 0 and std 1, so no feature dominates by size alone
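As a rough check, the distance between the first two people can be recomputed on the scaled data. This is a sketch using the same four rows as above; `share` is an illustrative name for each feature's fraction of the squared distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data_array = np.array([[25, 30000, 2], [30, 50000, 5], [35, 70000, 8], [40, 90000, 12]])
scaled = StandardScaler().fit_transform(data_array)

# Compare the same two people as before, now in scaled units
diffs = scaled[0] - scaled[1]
squared = diffs ** 2
share = squared / squared.sum()

print(f"Scaled distance: {np.sqrt(squared.sum()):.2f}")
for name, fraction in zip(["Age", "Salary", "Experience"], share):
    print(f"{name}: {fraction:.1%} of the squared distance")
```

Each feature now contributes roughly a third of the squared distance instead of salary taking nearly all of it.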
Min-Max Scaling (0 to 1)
The simplest scaling method squeezes everything between 0 and 1:
Min-Max Formula:
scaled_value = (value - min) / (max - min)
- Smallest value becomes 0
- Largest value becomes 1
- Everything else is between 0 and 1
Manual Min-Max Scaling
# Do it yourself to understand
ages = [20, 25, 30, 35, 40]
# Find min and max
min_age = min(ages) # 20
max_age = max(ages) # 40
# Scale each value
scaled_ages = []
for age in ages:
    scaled = (age - min_age) / (max_age - min_age)
    scaled_ages.append(scaled)
print(f"Original: {ages}")
print(f"Scaled: {scaled_ages}")
# Output: [0.0, 0.25, 0.5, 0.75, 1.0]
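The loop above divides by (max - min), which is zero when every value in a column is identical. Below is a vectorized NumPy sketch that scales each column and guards against that case. The helper name `min_max_scale` is hypothetical, and mapping constant columns to 0 is a design choice made for this sketch.

```python
import numpy as np

def min_max_scale(values):
    """Scale each column to the 0-1 range; constant columns become all zeros."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    # Replace a zero range with 1 so the division is safe (assumption of this sketch)
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (values - col_min) / safe_range

# Second column is constant, so its range is zero
data = np.array([[20, 7], [30, 7], [40, 7]])
print(min_max_scale(data))
```

The first column comes out as 0, 0.5, 1 and the constant column as zeros, without a division-by-zero warning.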
Using Sklearn MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
import numpy as np
# Your data
ages = np.array([20, 25, 30, 35, 40]).reshape(-1, 1) # Reshape for sklearn
# Create and fit scaler
scaler = MinMaxScaler()
scaled_ages = scaler.fit_transform(ages)
print("Original ages:", ages.flatten())
print("Scaled ages:", scaled_ages.flatten())
print(f"Range: {scaled_ages.min():.1f} to {scaled_ages.max():.1f}")
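A fitted scaler remembers the minimum and maximum it saw, which matters in two situations: converting scaled values back to the original units, and scaling new values later. The sketch below shows both; the new ages 18 and 45 are made-up inputs for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([20, 25, 30, 35, 40]).reshape(-1, 1)
scaler = MinMaxScaler()
scaled_ages = scaler.fit_transform(ages)

# The min and max learned during fit
print("Learned min:", scaler.data_min_, "max:", scaler.data_max_)

# Undo the scaling to get back the original units
restored = scaler.inverse_transform(scaled_ages)
print("Restored:", restored.flatten())

# New values outside the fitted range fall outside 0-1
new_ages = np.array([[18], [45]])
new_scaled = scaler.transform(new_ages)
print("New ages scaled:", new_scaled.flatten())
```

With the default settings, values below the fitted minimum come out negative and values above the fitted maximum come out greater than 1, so "0 to 1" is only guaranteed for the data the scaler was fitted on.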
Multiple Features
# Scale multiple columns at once
data = np.array([
[25, 50000], # Age, Salary
[30, 60000],
[35, 70000],
[40, 80000]
])
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
print("Original data:")
print(data)
print("\nScaled data (0-1 range):")
print(scaled_data)
Standard Scaling (Z-Score)
Rescale the data so each feature has mean 0 and standard deviation 1:
Standard Scaling Formula:
scaled_value = (value - mean) / standard_deviation
- Mean becomes 0
- Standard deviation becomes 1
- Values can be negative
Manual Standard Scaling
import numpy as np
# Test scores
scores = [60, 70, 80, 90, 100]
# Calculate mean and standard deviation
mean_score = np.mean(scores) # 80
std_score = np.std(scores) # ~14.14
# Scale each value
scaled_scores = []
for score in scores:
    scaled = (score - mean_score) / std_score
    scaled_scores.append(scaled)
print(f"Original: {scores}")
print(f"Scaled: {[round(x, 2) for x in scaled_scores]}")
print(f"New mean: {np.mean(scaled_scores):.1f}") # Should be ~0
print(f"New std: {np.std(scaled_scores):.1f}") # Should be ~1
Using Sklearn StandardScaler
from sklearn.preprocessing import StandardScaler
# Your data
scores = np.array([60, 70, 80, 90, 100]).reshape(-1, 1)
# Create and fit scaler
scaler = StandardScaler()
scaled_scores = scaler.fit_transform(scores)
print("Original scores:", scores.flatten())
print("Scaled scores:", scaled_scores.flatten().round(2))
print(f"Mean: {scaled_scores.mean():.1f}") # ~0
print(f"Std: {scaled_scores.std():.1f}") # ~1
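One detail worth knowing when comparing results across libraries: `np.std` and `StandardScaler` both use the population standard deviation (dividing by n), while pandas' `Series.std()` defaults to the sample standard deviation (dividing by n - 1). The sketch below compares them on the same test scores.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

scores = [60, 70, 80, 90, 100]

population_std = np.std(scores)       # ddof=0, divides by n
sample_std = np.std(scores, ddof=1)   # ddof=1, divides by n - 1
pandas_std = pd.Series(scores).std()  # pandas defaults to ddof=1

# The fitted scaler exposes the mean and std it learned
scaler = StandardScaler().fit(np.array(scores).reshape(-1, 1))

print(f"Population std (NumPy default): {population_std:.2f}")
print(f"Sample std (ddof=1):            {sample_std:.2f}")
print(f"pandas Series.std():            {pandas_std:.2f}")
print(f"StandardScaler mean_, scale_:   {scaler.mean_[0]:.2f}, {scaler.scale_[0]:.2f}")
```

If you scale by hand with pandas and compare against `StandardScaler`, small differences usually trace back to this `ddof` setting.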
When to Use Standard Scaling
# Use Standard Scaling when:
# 1. Data is roughly normally distributed
# 2. Using scale-sensitive algorithms like SVM or neural networks
# 3. Features have different units (age, salary, etc.)
# Example: features measured in different units
data = np.array([
[25, 50000, 5.8], # Age, Salary, Height(ft)
[30, 60000, 6.0],
[35, 70000, 5.9],
[40, 80000, 6.1]
])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("All features now have mean≈0, std≈1")
print(scaled_data)
Practical Scaling Examples
Real-world scenarios where scaling makes a difference:
House Price Prediction
# House features with different scales
house_data = np.array([
[1200, 3, 2010], # Size(sqft), Bedrooms, Year
[1500, 4, 2015],
[1800, 3, 2020],
[2000, 5, 2018]
])
print("Before scaling:")
print("Size ranges from", house_data[:, 0].min(), "to", house_data[:, 0].max())
print("Bedrooms range from", house_data[:, 1].min(), "to", house_data[:, 1].max())
print("Year ranges from", house_data[:, 2].min(), "to", house_data[:, 2].max())
# Scale the data
scaler = StandardScaler()
scaled_houses = scaler.fit_transform(house_data)
print("\nAfter scaling, all features are on a comparable scale:")
print(scaled_houses)
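A quick way to confirm the scaling worked is to check the mean and standard deviation of each column. The sketch below repeats the house data so it runs on its own; the tolerance passed to `np.allclose` is an arbitrary choice.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

house_data = np.array([
    [1200, 3, 2010],  # Size(sqft), Bedrooms, Year
    [1500, 4, 2015],
    [1800, 3, 2020],
    [2000, 5, 2018]
])
scaled_houses = StandardScaler().fit_transform(house_data)

# Per-column statistics after scaling
column_means = scaled_houses.mean(axis=0)
column_stds = scaled_houses.std(axis=0)

print("Means close to 0:", np.allclose(column_means, 0, atol=1e-8))
print("Stds close to 1:", np.allclose(column_stds, 1, atol=1e-8))
```

The means are only approximately zero because of floating-point rounding, which is why the check uses a tolerance instead of exact equality.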
Customer Segmentation
# Customer data for clustering
customer_data = np.array([
[25, 30000, 5], # Age, Income, Years_Customer
[45, 80000, 10],
[35, 50000, 3],
[55, 90000, 15]
])
# Without scaling: income dominates
print("Without scaling, income values are much larger:")
print(customer_data)
# With scaling: fair comparison
scaler = MinMaxScaler()
scaled_customers = scaler.fit_transform(customer_data)
print("\nWith scaling, all features are 0-1:")
print(scaled_customers.round(2))
Image Processing
# Image pixels are usually 0-255, scale to 0-1
import numpy as np
# Simulate image pixel values
pixels = np.array([0, 50, 100, 150, 200, 255])
# Scale to 0-1 for neural networks
scaled_pixels = pixels / 255.0
print(f"Original pixels: {pixels}")
print(f"Scaled pixels: {scaled_pixels}")
print("Now ready for machine learning!")
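Real images are usually stored as 2-D or 3-D arrays of unsigned 8-bit integers. The sketch below applies the same division to a tiny made-up 2x2 "image" and converts to `float32` first, a common choice for neural network inputs (the dtype is an assumption here, not something the example above requires).

```python
import numpy as np

# A tiny hypothetical grayscale image stored as 8-bit integers
image = np.array([[0, 128], [64, 255]], dtype=np.uint8)

# Convert to float before dividing so the result keeps fractional values
scaled_image = image.astype(np.float32) / 255.0
print(scaled_image.dtype, scaled_image.min(), scaled_image.max())

# Convert back to 8-bit pixels, for example to save or display the image
restored_image = (scaled_image * 255).round().astype(np.uint8)
print(np.array_equal(image, restored_image))
```

Since the pixel range is known in advance (0 to 255), there is nothing to fit: dividing by 255 is min-max scaling with a fixed min and max.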
When to Use Which Scaling?
Choose the right scaling method for your situation:
Quick Decision Guide:
- Min-Max (0-1): When you need bounded values, for example neural network inputs or image data
- Standard (Z-score): When data is roughly normally distributed, or for SVM and clustering
- No scaling: Tree-based algorithms (Decision Trees, Random Forest), which split on one feature at a time
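The "no scaling" entry can be checked with a small experiment: train the same decision tree on raw and on scaled features and compare the predictions. This is a minimal sketch on made-up data; the labels (1 when salary is at least 70000) are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: Age, Salary
X = np.array([[25, 30000], [30, 50000], [35, 70000],
              [40, 90000], [45, 40000], [50, 85000]])
y = np.array([0, 0, 1, 1, 0, 1])  # hypothetical label: 1 if salary >= 70000

X_scaled = StandardScaler().fit_transform(X)

# Same model settings, one trained on raw features and one on scaled features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

pred_raw = tree_raw.predict(X)
pred_scaled = tree_scaled.predict(X_scaled)
print("Same predictions:", np.array_equal(pred_raw, pred_scaled))
```

A tree only asks whether a feature is above or below a threshold, and scaling does not change the order of the values, so the splits separate the same rows either way.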
Testing Both Methods
# Compare both scaling methods
data = np.array([10, 20, 30, 40, 50]).reshape(-1, 1)
# Min-Max scaling
minmax_scaler = MinMaxScaler()
minmax_scaled = minmax_scaler.fit_transform(data)
# Standard scaling
standard_scaler = StandardScaler()
standard_scaled = standard_scaler.fit_transform(data)
print("Original:", data.flatten())
print("Min-Max (0-1):", minmax_scaled.flatten())
print("Standard (mean=0):", standard_scaled.flatten().round(2))
# Choose based on your algorithm and data!
Scaling in ML Pipeline
# Proper way to scale in machine learning
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
# Your data
X = np.array([[25, 50000], [30, 60000], [35, 70000], [40, 80000]])
y = np.array([200000, 250000, 300000, 350000]) # House prices
# Split data first so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit scaler on training data only (avoids leaking test-set statistics)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Reuse the mean and std learned from training data
# Now train your model
model = LinearRegression()
model.fit(X_train_scaled, y_train)
print("Model trained on properly scaled data!")
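The fit-on-train, transform-on-test steps can also be bundled with scikit-learn's `make_pipeline`, which fits the scaler and the model together and applies the same scaling automatically at prediction time. This is one possible sketch using the same toy data; the `test_size` and `random_state` values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 50000], [30, 60000], [35, 70000], [40, 80000]])
y = np.array([200000, 250000, 300000, 350000])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline fits the scaler on X_train only, then trains the model on the scaled result
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# predict() scales X_test with the training statistics before calling the model
predictions = model.predict(X_test)
print("Predicted:", predictions.round(0), "Actual:", y_test)
```

Keeping the scaler inside the pipeline makes it harder to accidentally fit it on test data, and the whole object can be passed to tools such as cross-validation as a single estimator.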