Scale (Normalization)

Make all your data play nicely together by scaling values

What is Scaling?

Scaling (often called normalization) means adjusting your data so all features have similar ranges. Imagine comparing apples (weight in grams) to distances (in kilometers): scaling makes them comparable! Many machine learning algorithms, especially those based on distances or gradients, work much better when features are scaled.


from sklearn.preprocessing import StandardScaler

# Before scaling (first row): age=25, salary=50000
# After scaling (first row): age=-1.0, salary=-1.0 (each column gets mean 0, std 1)
data = [[25, 50000], [30, 60000]]
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
                                    
0-1 Range · Fair Comparison · Better ML Results

Scaling Concepts

Min-Max Scaling

Scale to 0-1 range

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)

Standard Scaling

Mean=0, Standard Deviation=1

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(data)
🎯

Why Scale?

Fair feature comparison

# Age: 20-80, Salary: 20000-100000
# Without scaling: salary dominates
# With scaling: both features matter equally

Better ML

Algorithms work better

# Many ML algorithms need scaled data:
# - Neural Networks
# - SVM
# - K-Means Clustering

Why Do We Need Scaling?

Let's see what happens when features have very different ranges:

The Problem: Different Scales

import pandas as pd

# Example: Person data with different scales
data = {
    'Age': [25, 30, 35, 40],
    'Salary': [30000, 50000, 70000, 90000],
    'Years_Experience': [2, 5, 8, 12]
}

df = pd.DataFrame(data)
print(df)
print("\nRanges:")
print(f"Age: {df['Age'].min()} to {df['Age'].max()}")
print(f"Salary: {df['Salary'].min()} to {df['Salary'].max()}")
print(f"Experience: {df['Years_Experience'].min()} to {df['Years_Experience'].max()}")

# Problem: Salary numbers are MUCH bigger than age!

What Happens Without Scaling

# Distance calculation example
import numpy as np

person1 = [25, 30000, 2]    # Age, Salary, Experience
person2 = [30, 50000, 5]

# Calculate distance (how similar they are)
distance = np.sqrt(sum((a - b)**2 for a, b in zip(person1, person2)))
print(f"Distance: {distance:.1f}")

# The salary difference (20000) dominates everything!
# Age difference (5) and experience difference (3) barely matter

After Scaling - Fair Comparison

from sklearn.preprocessing import StandardScaler

# Scale the data
data_array = [[25, 30000, 2], [30, 50000, 5], [35, 70000, 8], [40, 90000, 12]]
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_array)

print("Scaled data:")
print(scaled_data)

# Now all features are on a comparable scale, so none dominates just by being bigger!
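
To check this against the distance calculation from earlier, here is a minimal sketch that measures how much each feature contributes to the squared distance between the first two people, before and after scaling. The variable names (`raw_share`, `scaled_share`) are illustrative and not part of the original example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same four people as above: Age, Salary, Experience
data_array = np.array([[25, 30000, 2], [30, 50000, 5], [35, 70000, 8], [40, 90000, 12]])
scaled = StandardScaler().fit_transform(data_array)

# Squared difference per feature between the first two people
raw_sq = (data_array[0] - data_array[1]) ** 2
scaled_sq = (scaled[0] - scaled[1]) ** 2

# Share of the squared distance that each feature contributes
raw_share = raw_sq / raw_sq.sum()
scaled_share = scaled_sq / scaled_sq.sum()

print("Raw shares (Age, Salary, Experience):", raw_share.round(4))
print("Scaled shares (Age, Salary, Experience):", scaled_share.round(4))
```

Before scaling, salary accounts for almost the entire distance. After scaling, the three shares are of similar size.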

Min-Max Scaling (0 to 1)

The simplest scaling method - squeeze everything between 0 and 1:

Min-Max Formula:

scaled_value = (value - min) / (max - min)

  • Smallest value becomes 0
  • Largest value becomes 1
  • Everything else is between 0 and 1

Manual Min-Max Scaling

# Do it yourself to understand
ages = [20, 25, 30, 35, 40]

# Find min and max
min_age = min(ages)  # 20
max_age = max(ages)  # 40

# Scale each value
scaled_ages = []
for age in ages:
    scaled = (age - min_age) / (max_age - min_age)
    scaled_ages.append(scaled)

print(f"Original: {ages}")
print(f"Scaled: {scaled_ages}")
# Output: [0.0, 0.25, 0.5, 0.75, 1.0]
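
One edge case worth knowing: the formula divides by (max - min), so a list where every value is the same would divide by zero. The sketch below wraps the loop in a hypothetical helper, `min_max_scale`, that guards against this. Returning 0.0 for a constant list is an assumption of this sketch; other conventions are possible.

```python
def min_max_scale(values):
    """Scale a list of numbers to the 0-1 range (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: avoid dividing by zero.
        # Mapping everything to 0.0 is an assumption of this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([20, 25, 30, 35, 40]))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(min_max_scale([7, 7, 7]))             # [0.0, 0.0, 0.0]
```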

Using Sklearn MinMaxScaler

from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Your data
ages = np.array([20, 25, 30, 35, 40]).reshape(-1, 1)  # Reshape for sklearn

# Create and fit scaler
scaler = MinMaxScaler()
scaled_ages = scaler.fit_transform(ages)

print("Original ages:", ages.flatten())
print("Scaled ages:", scaled_ages.flatten())
print(f"Range: {scaled_ages.min():.1f} to {scaled_ages.max():.1f}")
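
A fitted scaler remembers the minimum and maximum it saw, so it can scale new values the same way and undo the scaling later. The following sketch shows this with the `data_min_` and `data_max_` attributes and the `transform` and `inverse_transform` methods. The new age of 28 is a made-up value for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([20, 25, 30, 35, 40]).reshape(-1, 1)
scaler = MinMaxScaler()
scaled_ages = scaler.fit_transform(ages)

# The fitted scaler stores the min and max of each column
print("Learned min:", scaler.data_min_)
print("Learned max:", scaler.data_max_)

# Reuse the same scaler on a new value (28 is a made-up example)
new_scaled = scaler.transform(np.array([[28]]))
print("Age 28 scaled:", new_scaled.flatten())

# Undo the scaling to recover the original ages
restored = scaler.inverse_transform(scaled_ages)
print("Restored ages:", restored.flatten())
```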

Multiple Features

# Scale multiple columns at once
data = np.array([
    [25, 50000],  # Age, Salary
    [30, 60000],
    [35, 70000],
    [40, 80000]
])

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)

print("Original data:")
print(data)
print("\nScaled data (0-1 range):")
print(scaled_data)

Standard Scaling (Z-Score)

Make data have mean=0 and standard deviation=1:

Standard Scaling Formula:

scaled_value = (value - mean) / standard_deviation

  • Mean becomes 0
  • Standard deviation becomes 1
  • Values can be negative

Manual Standard Scaling

import numpy as np

# Test scores
scores = [60, 70, 80, 90, 100]

# Calculate mean and standard deviation
mean_score = np.mean(scores)  # 80
std_score = np.std(scores)    # ~14.14

# Scale each value
scaled_scores = []
for score in scores:
    scaled = (score - mean_score) / std_score
    scaled_scores.append(scaled)

print(f"Original: {scores}")
print(f"Scaled: {[round(x, 2) for x in scaled_scores]}")
print(f"New mean: {np.mean(scaled_scores):.1f}")  # Should be ~0
print(f"New std: {np.std(scaled_scores):.1f}")    # Should be ~1
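
A detail that can cause confusion: `np.std` computes the population standard deviation by default (`ddof=0`), which is also what `StandardScaler` uses, while pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`). This short sketch compares the two on the same scores.

```python
import numpy as np
import pandas as pd

scores = [60, 70, 80, 90, 100]

pop_std = np.std(scores)               # ddof=0 (population), about 14.14
sample_std = pd.Series(scores).std()   # ddof=1 (sample), about 15.81

print(f"NumPy std (ddof=0): {pop_std:.2f}")
print(f"pandas std (ddof=1): {sample_std:.2f}")

# Passing ddof=0 to pandas reproduces the NumPy result
pandas_pop_std = pd.Series(scores).std(ddof=0)
print(f"pandas std (ddof=0): {pandas_pop_std:.2f}")
```

If manual results differ slightly from scikit-learn's, a mismatch in `ddof` is a likely cause.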

Using Sklearn StandardScaler

import numpy as np
from sklearn.preprocessing import StandardScaler

# Your data
scores = np.array([60, 70, 80, 90, 100]).reshape(-1, 1)

# Create and fit scaler
scaler = StandardScaler()
scaled_scores = scaler.fit_transform(scores)

print("Original scores:", scores.flatten())
print("Scaled scores:", scaled_scores.flatten().round(2))
print(f"Mean: {scaled_scores.mean():.1f}")  # ~0
print(f"Std: {scaled_scores.std():.1f}")    # ~1
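
To connect the scaler back to the formula, the sketch below reads the statistics a fitted `StandardScaler` stores in `mean_` and `scale_` and applies (value - mean) / standard_deviation by hand. The name `manual_scaled` is illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scores = np.array([60, 70, 80, 90, 100]).reshape(-1, 1)
scaler = StandardScaler()
scaled_scores = scaler.fit_transform(scores)

# Statistics learned during fit, one entry per column
print("Learned mean:", scaler.mean_)
print("Learned std:", scaler.scale_.round(2))

# Apply the formula by hand using the learned statistics
manual_scaled = (scores - scaler.mean_) / scaler.scale_

print("Matches sklearn:", np.allclose(manual_scaled, scaled_scores))
```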

When to Use Standard Scaling

# Use Standard Scaling when:
# 1. Data is roughly normally distributed
# 2. Using algorithms like SVM, Neural Networks
# 3. Features have different units (age, salary, etc.)

# Example: Features with mixed units
data = np.array([
    [25, 50000, 5.8],  # Age, Salary, Height(ft)
    [30, 60000, 6.0],
    [35, 70000, 5.9],
    [40, 80000, 6.1]
])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("All features now have mean≈0, std≈1")
print(scaled_data)

Practical Scaling Examples

Real-world scenarios where scaling makes a difference:

House Price Prediction

# House features with different scales
house_data = np.array([
    [1200, 3, 2010],  # Size(sqft), Bedrooms, Year
    [1500, 4, 2015],
    [1800, 3, 2020],
    [2000, 5, 2018]
])

print("Before scaling:")
print("Size ranges from", house_data[:, 0].min(), "to", house_data[:, 0].max())
print("Bedrooms range from", house_data[:, 1].min(), "to", house_data[:, 1].max())
print("Year ranges from", house_data[:, 2].min(), "to", house_data[:, 2].max())

# Scale the data
scaler = StandardScaler()
scaled_houses = scaler.fit_transform(house_data)

print("\nAfter scaling - all features are on a comparable scale:")
print(scaled_houses)
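
`fit_transform` returns a plain NumPy array, so column names are lost. If the data lives in a pandas DataFrame, one common pattern is to wrap the result back into a DataFrame, as sketched here. The column names (`Size_sqft`, `Bedrooms`, `Year`) are chosen for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same house features as above, with illustrative column names
houses = pd.DataFrame({
    "Size_sqft": [1200, 1500, 1800, 2000],
    "Bedrooms": [3, 4, 3, 5],
    "Year": [2010, 2015, 2020, 2018],
})

scaler = StandardScaler()
scaled_houses = pd.DataFrame(
    scaler.fit_transform(houses),
    columns=houses.columns,  # keep the original column names
    index=houses.index,      # keep the original row labels
)

print(scaled_houses.round(2))
```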

Customer Segmentation

# Customer data for clustering
customer_data = np.array([
    [25, 30000, 5],    # Age, Income, Years_Customer
    [45, 80000, 10],
    [35, 50000, 3],
    [55, 90000, 15]
])

# Without scaling: income dominates
print("Without scaling, income values are much larger:")
print(customer_data)

# With scaling: fair comparison
scaler = MinMaxScaler()
scaled_customers = scaler.fit_transform(customer_data)
print("\nWith scaling, all features are 0-1:")
print(scaled_customers.round(2))
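
To take the segmentation one step further, the sketch below chains the scaler and a K-Means model with `make_pipeline`, so the customers are scaled before clustering. The choice of two clusters and the `random_state` value are assumptions for this illustration, not recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

customer_data = np.array([
    [25, 30000, 5],    # Age, Income, Years_Customer
    [45, 80000, 10],
    [35, 50000, 3],
    [55, 90000, 15]
])

# Scale first, then cluster; n_clusters=2 is an arbitrary choice here
pipeline = make_pipeline(
    MinMaxScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(customer_data)

print("Cluster label per customer:", labels)
```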

Image Processing

# Image pixels are usually 0-255, scale to 0-1
import numpy as np

# Simulate image pixel values
pixels = np.array([0, 50, 100, 150, 200, 255])

# Scale to 0-1 for neural networks
scaled_pixels = pixels / 255.0

print(f"Original pixels: {pixels}")
print(f"Scaled pixels: {scaled_pixels}")
print("Now ready for machine learning!")
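
Real images are usually stored as 2D or 3D arrays of 8-bit integers rather than a flat list. The sketch below applies the same division to a tiny made-up grayscale image and casts to `float32`, a common (but not required) input type for neural networks.

```python
import numpy as np

# A tiny made-up 2x3 grayscale "image" stored as 8-bit integers
image = np.array([[0, 50, 100], [150, 200, 255]], dtype=np.uint8)

# Cast to float32, then scale pixel values to the 0-1 range
scaled_image = image.astype(np.float32) / 255.0

print("dtype:", scaled_image.dtype)
print("min:", scaled_image.min(), "max:", scaled_image.max())

# Reverse the scaling to get 0-255 integers back
restored = np.round(scaled_image * 255).astype(np.uint8)
print("Restored matches original:", np.array_equal(restored, image))
```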

🎯 When to Use Which Scaling?

Choose the right scaling method for your situation:

πŸ” Quick Decision Guide:

  • Min-Max (0-1): When you want bounded values, neural networks, image data
  • Standard (Z-score): When data is normally distributed, SVM, clustering
  • No scaling: Tree-based algorithms (Random Forest, Decision Trees)
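
The "no scaling" case for tree-based algorithms can be checked directly. A decision tree splits on thresholds within one feature at a time, so rescaling a feature moves the threshold but not the resulting groups. The sketch below trains the same tree on raw and on scaled data and compares predictions. The dataset and labels are made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Made-up data: Age, Salary; label is 1 for the higher salaries
X = np.array([[25, 30000], [30, 50000], [35, 70000],
              [40, 90000], [45, 40000], [50, 85000]])
y = np.array([0, 0, 1, 1, 0, 1])
X_new = np.array([[33, 65000], [28, 45000]])

# Tree trained on raw features
raw_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
raw_pred = raw_tree.predict(X_new)

# Same tree trained on standardized features
scaler = StandardScaler().fit(X)
scaled_tree = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X), y)
scaled_pred = scaled_tree.predict(scaler.transform(X_new))

print("Raw predictions:   ", raw_pred)
print("Scaled predictions:", scaled_pred)
```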

Testing Both Methods

# Compare both scaling methods
data = np.array([10, 20, 30, 40, 50]).reshape(-1, 1)

# Min-Max scaling
minmax_scaler = MinMaxScaler()
minmax_scaled = minmax_scaler.fit_transform(data)

# Standard scaling
standard_scaler = StandardScaler()
standard_scaled = standard_scaler.fit_transform(data)

print("Original:", data.flatten())
print("Min-Max (0-1):", minmax_scaled.flatten())
print("Standard (mean=0):", standard_scaled.flatten().round(2))

# Choose based on your algorithm and data!

Scaling in ML Pipeline

# Proper way to scale in machine learning
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Your data
X = np.array([[25, 50000], [30, 60000], [35, 70000], [40, 80000]])
y = np.array([200000, 250000, 300000, 350000])  # House prices

# Split data first (random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same scaler

# Now train your model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

print("Model trained on properly scaled data!")
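
An alternative to calling the scaler by hand is scikit-learn's `make_pipeline`, which bundles the scaler and the model into one object. Calling `fit` on the pipeline fits the scaler on the training data only, and `predict` reuses those statistics on new data. This is a sketch of that approach, not the method used above. The `test_size` and `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 50000], [30, 60000], [35, 70000], [40, 80000]])
y = np.array([200000, 250000, 300000, 350000])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The pipeline scales, then fits the regression, in a single call
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# predict() applies the training-set scaling to X_test automatically
predictions = model.predict(X_test)
print("Predictions:", predictions.round(0))
```

Keeping both steps in one object makes it harder to accidentally fit the scaler on test data.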

🧠 Test Your Knowledge

What range does Min-Max scaling produce?

After Standard Scaling, what is the mean of the data?

Why do we need to scale data?