K-Means Clustering
Group similar data points together automatically
🎯 Understanding K-Means
K-Means partitions data into K clusters by repeatedly assigning each point to its nearest cluster center and then moving each center to the mean of the points assigned to it. It is commonly used for customer segmentation and data exploration.
from sklearn.cluster import KMeans
# Group data into 3 clusters
kmeans = KMeans(n_clusters=3)
clusters = kmeans.fit_predict(data)
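To make the assign-and-update idea concrete, here is a minimal NumPy sketch of the loop. It is an illustration only, not how scikit-learn implements KMeans; the function name kmeans_sketch, its parameters, and the toy data are assumptions made up for this example.

```python
import numpy as np

def assign_to_nearest(points, centers):
    """Return, for each point, the index of its nearest center."""
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans_sketch(points, k, n_iters=100, seed=0):
    """Hypothetical helper: a bare-bones K-Means loop returning (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest center
        labels = assign_to_nearest(points, centers)
        # Update step: each center moves to the mean of its points
        # (a center with no points stays where it is)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the loop has converged
        centers = new_centers
    return assign_to_nearest(points, centers), centers

# Toy input: two well-separated blobs of 20 points each
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
toy_labels, toy_centers = kmeans_sketch(toy, k=2)
print(toy_centers)
```

The result depends on the random starting centers, which is one reason the scikit-learn examples in this tutorial pass random_state.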
K-Means Applications
K-Means is widely used across different domains:
Customer Segmentation: Group customers by behavior
Image Processing: Color quantization and compression
Market Research: Find patterns in survey data
Data Exploration: Discover hidden groups
🔹 Step 1: Create Sample Data
Let's start with simple 2D data to visualize clusters
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Create sample data (customer ages and incomes)
np.random.seed(42)
ages = np.random.normal(35, 10, 100)
incomes = np.random.normal(50000, 15000, 100)
data = np.column_stack([ages, incomes])
print(f"Data shape: {data.shape}")
print("First 5 customers:")
print(data[:5])
🔹 Step 2: Apply K-Means
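Group customers into 3 segments. Note that the features are used unscaled here, so income, with its much larger numeric range, dominates the distance calculation.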
# Create K-Means model
kmeans = KMeans(n_clusters=3, random_state=42)
# Fit and predict clusters
clusters = kmeans.fit_predict(data)
# Get cluster centers
centers = kmeans.cluster_centers_
print("Cluster labels:", clusters[:10])
print("Cluster centers:")
print(centers)
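One common way to keep a large-valued feature from dominating the clustering is to standardize each column before fitting. The sketch below assumes scikit-learn's StandardScaler is an acceptable choice for this data; the names data_scaled, kmeans_scaled, clusters_scaled, and centers_original_units are introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same synthetic ages and incomes as in Step 1
np.random.seed(42)
ages = np.random.normal(35, 10, 100)
incomes = np.random.normal(50000, 15000, 100)
data = np.column_stack([ages, incomes])

# Rescale each column to mean 0 and standard deviation 1
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# n_init is set explicitly because its default differs across scikit-learn versions
kmeans_scaled = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters_scaled = kmeans_scaled.fit_predict(data_scaled)

# Convert the centers back to years and dollars so they are easier to read
centers_original_units = scaler.inverse_transform(kmeans_scaled.cluster_centers_)
print(centers_original_units)
```

With scaling, age and income contribute comparably to the distance, so the segments are no longer simple income bands. Whether that is desirable depends on what the segments are meant to capture.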
🔹 Step 3: Visualize Results
Plot the clusters and their centers
# Create scatter plot
plt.figure(figsize=(10, 6))
colors = ['red', 'blue', 'green']
# Plot each cluster
for i in range(3):
    cluster_points = data[clusters == i]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1],
                c=colors[i], label=f'Cluster {i}', alpha=0.6)
# Plot centers
plt.scatter(centers[:, 0], centers[:, 1],
            c='black', marker='x', s=200, label='Centers')
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Customer Segmentation')
plt.legend()
plt.show()
🔹 Step 4: Find Optimal K
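Use the elbow method to choose a reasonable number of clusters: plot inertia (the sum of squared distances from each point to its nearest center) against k and look for the point where the curve stops dropping sharply. Inertia always decreases as k grows, so the goal is the bend in the curve, not the minimum.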
# Test different numbers of clusters
k_range = range(1, 11)
inertias = []
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)
# Plot elbow curve
plt.figure(figsize=(8, 5))
plt.plot(k_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
print("Inertias:", inertias[:5])
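The elbow is not always easy to spot by eye. A complementary check, sketched here as one option and not as part of the original walkthrough, is the silhouette score from sklearn.metrics, where higher values indicate tighter, better-separated clusters. The blob data below is made up for illustration so that a clear answer exists.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Illustrative data: three well-separated blobs (not the tutorial's customer data)
rng = np.random.default_rng(42)
blobs = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# The silhouette score needs at least 2 clusters, so start at k=2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(blobs)
    scores[k] = silhouette_score(blobs, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Highest silhouette score at k =", best_k)
```

On data without real group structure, such as the random ages and incomes from Step 1, neither the elbow curve nor the silhouette score is likely to point to a convincing k.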
🔹 Step 5: Real-World Example
Customer segmentation with multiple features
import pandas as pd
# Sample customer data
customers = {
    'age': [25, 35, 45, 30, 50, 28, 40],
    'income': [30000, 50000, 70000, 45000, 80000, 35000, 60000],
    'spending': [20000, 30000, 50000, 25000, 60000, 22000, 40000]
}
df = pd.DataFrame(customers)
# Apply K-Means
X = df[['age', 'income', 'spending']]
kmeans = KMeans(n_clusters=2, random_state=42)
df['cluster'] = kmeans.fit_predict(X)
print("Customer segments:")
print(df.groupby('cluster').mean())
🔹 Step 6: Interpret Results
Understanding what each cluster represents
# Analyze cluster characteristics
for cluster in range(2):
    cluster_data = df[df['cluster'] == cluster]
    print(f"\nCluster {cluster} ({len(cluster_data)} customers):")
    print(f"Average age: {cluster_data['age'].mean():.1f}")
    print(f"Average income: ${cluster_data['income'].mean():,.0f}")
    print(f"Average spending: ${cluster_data['spending'].mean():,.0f}")
    # Business interpretation
    if cluster_data['income'].mean() > 50000:
        print("→ High-value customers")
    else:
        print("→ Budget-conscious customers")
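Once the segments have been interpreted, a fitted model can also place a new customer into one of them with predict, which returns the index of the nearest cluster center. The sketch below refits the Step 5 model so that it runs on its own; the new customer's values are hypothetical and invented for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Same sample customers as in Step 5
df = pd.DataFrame({
    'age': [25, 35, 45, 30, 50, 28, 40],
    'income': [30000, 50000, 70000, 45000, 80000, 35000, 60000],
    'spending': [20000, 30000, 50000, 25000, 60000, 22000, 40000],
})
features = ['age', 'income', 'spending']

# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(df[features])

# Hypothetical new customer, invented for illustration
new_customer = pd.DataFrame({'age': [48], 'income': [75000], 'spending': [55000]})
new_label = kmeans.predict(new_customer[features])[0]

print("New customer assigned to cluster", new_label)
print(df[df['cluster'] == new_label])
```

Cluster numbers are arbitrary labels, so compare the new customer to the members of the assigned cluster and do not rely on the number itself.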