K-Means Clustering

Group similar data points together automatically

🎯 Understanding K-Means

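K-Means partitions data into K clusters. It repeatedly assigns each point to its nearest cluster center (centroid) and then moves each center to the mean of the points assigned to it, until the assignments stop changing. It is commonly used for customer segmentation and exploratory data analysis.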


from sklearn.cluster import KMeans
# Group data (an array of shape (n_samples, n_features)) into 3 clusters
kmeans = KMeans(n_clusters=3)
clusters = kmeans.fit_predict(data)
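
To make the "find centers, assign points to the nearest center" loop concrete, here is a minimal NumPy sketch of the standard iteration (often called Lloyd's algorithm). The function name `kmeans_sketch`, its parameters, and the toy array are hypothetical, chosen only for illustration. scikit-learn's `KMeans` uses a smarter initialization (k-means++ by default) and further optimizations, so treat this as a teaching aid rather than a replacement.

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=0):
    """Illustrative K-Means loop; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: two obvious groups, near (0, 0) and near (5, 5)
toy = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centers = kmeans_sketch(toy, k=2)
print(labels)
print(centers)
```

Which group gets label 0 and which gets label 1 depends on the random starting centers, so only the grouping itself is meaningful, not the label numbers.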
                                    
Learning Type: Unsupervised
Algorithm: Fast
Method: Popular

K-Means Applications

K-Means is widely used across different domains:

👥 Customer Segmentation: Group customers by behavior (Marketing, Targeting)

🖼️ Image Processing: Color quantization and compression (Colors, Compression)

📊 Market Research: Find patterns in survey data (Surveys, Patterns)

🔍 Data Exploration: Discover hidden groups (EDA, Insights)

🔹 Step 1: Create Sample Data

Let's start with simple synthetic 2D data (age and income) so the clusters are easy to plot.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Create sample data (customer ages and incomes)
np.random.seed(42)
ages = np.random.normal(35, 10, 100)
incomes = np.random.normal(50000, 15000, 100)
data = np.column_stack([ages, incomes])

print(f"Data shape: {data.shape}")
print("First 5 customers:")
print(data[:5])

🔹 Step 2: Apply K-Means

Group customers into 3 segments

# Create K-Means model
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit and predict clusters
clusters = kmeans.fit_predict(data)

# Get cluster centers
centers = kmeans.cluster_centers_

print("Cluster labels:", clusters[:10])
print("Cluster centers:")
print(centers)
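
One caveat about the step above: K-Means relies on Euclidean distance, and here income (spread of roughly 15,000) is on a far larger scale than age (spread of roughly 10), so income dominates the distance almost entirely. A common remedy is to standardize the features first. The sketch below assumes scikit-learn's `StandardScaler`; the variable names `scaled_data`, `kmeans_scaled`, and `centers_original_units` are illustrative, not part of the original tutorial.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same synthetic data as in Step 1
np.random.seed(42)
ages = np.random.normal(35, 10, 100)
incomes = np.random.normal(50000, 15000, 100)
data = np.column_stack([ages, incomes])

# Rescale each column to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# n_init is set explicitly because its default has changed
# across scikit-learn releases
kmeans_scaled = KMeans(n_clusters=3, n_init=10, random_state=42)
scaled_clusters = kmeans_scaled.fit_predict(scaled_data)

# Convert the centers back to years and dollars for easier reading
centers_original_units = scaler.inverse_transform(kmeans_scaled.cluster_centers_)
print(centers_original_units)
```

Whether to scale is a modeling choice: after standardizing, age and income contribute roughly equally to the distance, which may or may not match what the segments are meant to capture.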

🔹 Step 3: Visualize Results

Plot the clusters and their centers

# Create scatter plot
plt.figure(figsize=(10, 6))
colors = ['red', 'blue', 'green']

# Plot each cluster
for i in range(3):
    cluster_points = data[clusters == i]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], 
                c=colors[i], label=f'Cluster {i}', alpha=0.6)

# Plot centers
plt.scatter(centers[:, 0], centers[:, 1], 
            c='black', marker='x', s=200, label='Centers')

plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Customer Segmentation')
plt.legend()
plt.show()

🔹 Step 4: Find Optimal K

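Use the elbow method to choose the number of clusters: plot inertia against k and look for the "elbow", the point where adding more clusters stops reducing inertia sharply. Inertia always decreases as k grows, so the goal is the bend in the curve, not the minimum.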

# Test different numbers of clusters
k_range = range(1, 11)
inertias = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.figure(figsize=(8, 5))
plt.plot(k_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

print("Inertias:", inertias[:5])
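
For reference, inertia is the sum of squared distances from each point to the center of its assigned cluster. The sketch below recomputes it by hand and compares it with `inertia_`. The random `points` array and the names `model`, `assigned_centers`, and `manual_inertia` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Small synthetic dataset, just for the comparison
rng = np.random.default_rng(0)
points = rng.normal(size=(60, 2))

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Look up the center assigned to each point, then sum the squared distances
assigned_centers = model.cluster_centers_[model.labels_]
manual_inertia = np.sum((points - assigned_centers) ** 2)

print("Manual inertia:      ", manual_inertia)
print("scikit-learn inertia:", model.inertia_)
```

The two numbers should agree up to floating-point rounding.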

🔹 Step 5: Real-World Example

Customer segmentation with multiple features

import pandas as pd

# Sample customer data
customers = {
    'age': [25, 35, 45, 30, 50, 28, 40],
    'income': [30000, 50000, 70000, 45000, 80000, 35000, 60000],
    'spending': [20000, 30000, 50000, 25000, 60000, 22000, 40000]
}
df = pd.DataFrame(customers)

# Apply K-Means
X = df[['age', 'income', 'spending']]
kmeans = KMeans(n_clusters=2, random_state=42)
df['cluster'] = kmeans.fit_predict(X)

print("Customer segments:")
print(df.groupby('cluster').mean())
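
Once a model is fitted, new customers can be placed into the existing segments with `predict`, which assigns each row to the nearest learned center. The sketch below repeats the data from this step so that it runs on its own; the two rows in `new_customers` are invented for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Same sample customers as above
df = pd.DataFrame({
    'age': [25, 35, 45, 30, 50, 28, 40],
    'income': [30000, 50000, 70000, 45000, 80000, 35000, 60000],
    'spending': [20000, 30000, 50000, 25000, 60000, 22000, 40000],
})
features = ['age', 'income', 'spending']

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(df[features])

# Hypothetical new customers (made-up values)
new_customers = pd.DataFrame({
    'age': [27, 48],
    'income': [32000, 78000],
    'spending': [21000, 58000],
})

# Assign each new customer to the nearest existing cluster center
new_customers['cluster'] = kmeans.predict(new_customers[features])
print(new_customers)
```

The cluster numbers themselves are arbitrary; what matters is which existing customers share a label with each new one.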

🔹 Step 6: Interpret Results

Understanding what each cluster represents

# Analyze cluster characteristics (loop over the labels actually present)
for cluster in sorted(df['cluster'].unique()):
    cluster_data = df[df['cluster'] == cluster]
    print(f"\nCluster {cluster} ({len(cluster_data)} customers):")
    print(f"Average age: {cluster_data['age'].mean():.1f}")
    print(f"Average income: ${cluster_data['income'].mean():,.0f}")
    print(f"Average spending: ${cluster_data['spending'].mean():,.0f}")
    
    # Business interpretation
    if cluster_data['income'].mean() > 50000:
        print("→ High-value customers")
    else:
        print("→ Budget-conscious customers")

🧠 Test Your Knowledge

What does K in K-Means represent?

What is the elbow method used for?