Hierarchical Clustering
Build tree-like clusters to group similar data points
🌳 Understanding Hierarchical Clustering
Hierarchical clustering creates a tree of clusters, showing relationships between data points at different levels of similarity.
# Simple hierarchical clustering
from sklearn.cluster import AgglomerativeClustering
import numpy as np
# Sample data
X = [[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]]
# Create clusters
clustering = AgglomerativeClustering(n_clusters=2)
labels = clustering.fit_predict(X)
print("Cluster labels:", labels)
Tree Structure
No K Required
Visual Dendrogram
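To illustrate the "No K Required" point, here is a minimal sketch that lets a distance cutoff decide the number of clusters instead of fixing it in advance. The sample points and the cutoff value of 5.0 are assumptions chosen for illustration; `n_clusters=None` together with `distance_threshold` is the scikit-learn way to request this behavior.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Illustrative data: two well-separated groups of three points each
points = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])

# No cluster count is given; merging stops once the linkage
# distance between the closest clusters reaches the threshold.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0)
labels = model.fit_predict(points)

print("Clusters found:", model.n_clusters_)
print("Cluster labels:", labels)
```

A larger threshold allows more merges and so yields fewer clusters; a smaller one yields more.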
Types of Hierarchical Clustering
Agglomerative (most common, merges up)
Bottom-up: start with every point as its own cluster, then repeatedly merge the two closest clusters.
Divisive (less common, splits down)
Top-down: start with all points in one cluster, then repeatedly split it into smaller groups.
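The bottom-up idea can be sketched in a few lines of NumPy. This is a deliberately naive illustration using single linkage (closest pair of points), not how scikit-learn implements it; the function name `naive_agglomerative` and the sample points are hypothetical.

```python
import numpy as np

def naive_agglomerative(points, n_clusters):
    """Bottom-up clustering with single linkage (illustrative, slow)."""
    points = np.asarray(points, dtype=float)
    # Start with every point in its own cluster (lists of point indices).
    clusters = [[i] for i in range(len(points))]

    while len(clusters) > n_clusters:
        best_a, best_b, best_dist = None, None, np.inf
        # Find the pair of clusters whose closest points are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                diffs = points[clusters[a]][:, None, :] - points[clusters[b]][None, :, :]
                dist = np.sqrt((diffs ** 2).sum(axis=-1)).min()
                if dist < best_dist:
                    best_a, best_b, best_dist = a, b, dist
        # Merge cluster b into cluster a, then remove b (b > a, so a stays valid).
        clusters[best_a].extend(clusters[best_b])
        del clusters[best_b]
    return clusters

groups = naive_agglomerative([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]], 2)
print(groups)
```

Each pass of the `while` loop performs exactly one merge, so reaching `n_clusters` groups from `n` points takes `n - n_clusters` merges.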
🔹 Step 1: Basic Clustering
Start with simple agglomerative clustering
from sklearn.cluster import AgglomerativeClustering
import numpy as np
# Create sample data
data = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
# Apply clustering
cluster = AgglomerativeClustering(n_clusters=2)
labels = cluster.fit_predict(data)
print("Data points:", data)
print("Cluster labels:", labels)
# Expected: two clear groups, e.g. [0 0 0 1 1 1] (the label numbers themselves are arbitrary)
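The fitted model also stores the merge tree itself, which may help connect the labels to the hierarchy behind them. The sketch below reads scikit-learn's `children_` attribute; `compute_full_tree=True` is set explicitly so that every merge is recorded, and the printed wording is only illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

data = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])

model = AgglomerativeClustering(n_clusters=2, compute_full_tree=True).fit(data)

# children_ has one row per merge (n_samples - 1 rows in total).
# Indices below n_samples are original points; an index i >= n_samples
# refers to the cluster created at merge step i - n_samples.
n_samples = len(data)
for step, (left, right) in enumerate(model.children_):
    print(f"merge {step}: node {left} + node {right} -> node {n_samples + step}")
```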
🔹 Step 2: Linkage Methods
Different ways to measure distance between clusters
# Different linkage methods
linkages = ['ward', 'complete', 'average', 'single']
for method in linkages:
    cluster = AgglomerativeClustering(
        n_clusters=2,
        linkage=method
    )
    labels = cluster.fit_predict(data)
    print(f"{method} linkage:", labels)
# Ward: merges the pair of clusters that least increases within-cluster variance
# Complete: distance between the two farthest points of the two clusters
# Average: mean distance over all pairs of points across the two clusters
# Single: distance between the two closest points of the two clusters
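To make these definitions concrete, the following sketch computes each linkage distance by hand for two small clusters. The arrays `cluster_a` and `cluster_b` are illustrative, and the Ward quantity shown is the increase in within-cluster sum of squares that a merge would cause, which is the cost Ward linkage minimizes.

```python
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[1, 1], [2, 1], [1, 2]])
cluster_b = np.array([[8, 8], [9, 8], [8, 9]])

# All pairwise Euclidean distances between points of A and points of B
pairwise = cdist(cluster_a, cluster_b)

single = pairwise.min()      # closest pair
complete = pairwise.max()    # farthest pair
average = pairwise.mean()    # mean over all pairs

# Ward: increase in within-cluster sum of squares if A and B were merged
n_a, n_b = len(cluster_a), len(cluster_b)
centroid_gap = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))
ward_increase = (n_a * n_b) / (n_a + n_b) * centroid_gap ** 2

print(f"single={single:.2f}, average={average:.2f}, complete={complete:.2f}")
print(f"ward merge cost={ward_increase:.1f}")
```

For any pair of clusters, the single distance is never larger than the average, and the average is never larger than the complete distance.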
🔹 Step 3: Creating Dendrograms
Visualize the clustering hierarchy
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
# Create linkage matrix
linkage_matrix = linkage(data, method='ward')
# Plot dendrogram
plt.figure(figsize=(8, 5))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.show()
# The dendrogram shows how clusters merge at different distances
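The dendrogram is drawn from the linkage matrix, so reading that matrix directly can clarify what the plot shows. In SciPy, each row holds the two merged cluster indices, the merge distance, and the size of the new cluster. The sketch below prints those rows; the variable name `Z` and the message format are just conventions used here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

data = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
Z = linkage(data, method='ward')

# Row i describes merge step i: [index_a, index_b, distance, new_cluster_size].
# Indices >= len(data) refer to clusters created by earlier rows.
for step, (a, b, dist, size) in enumerate(Z):
    print(f"step {step}: merge {int(a)} and {int(b)} "
          f"at distance {dist:.2f} -> {int(size)} points")
```

The distance in the last row is the height of the top join in the dendrogram; a value much larger than the rows before it indicates well-separated groups.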
🔹 Step 4: Choosing Number of Clusters
Use the dendrogram to choose a reasonable number of clusters
# Method 1: Visual inspection of dendrogram
# Look for large jumps in distance
# Method 2: Set distance threshold
from scipy.cluster.hierarchy import fcluster
# Cut dendrogram at specific distance
clusters = fcluster(linkage_matrix, t=3, criterion='distance')
print("Clusters at distance 3:", clusters)
# Method 3: Try different numbers
for n in range(2, 5):
    cluster = AgglomerativeClustering(n_clusters=n)
    labels = cluster.fit_predict(data)
    print(f"{n} clusters:", labels)
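The "look for large jumps" advice from Method 1 can be turned into a rough heuristic: find the biggest gap between consecutive merge distances and cut the tree inside it. This is a sketch of one common rule of thumb, not a guaranteed way to find the best count; the names `gaps`, `suggested_k`, and `threshold` are introduced here for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
Z = linkage(data, method='ward')

merge_distances = Z[:, 2]          # one distance per merge, in ascending order
gaps = np.diff(merge_distances)    # jump between consecutive merges

# Keep every merge up to the largest jump and undo the merges after it.
i = int(np.argmax(gaps))
suggested_k = len(data) - (i + 1)

# Cut halfway inside the largest gap.
threshold = (merge_distances[i] + merge_distances[i + 1]) / 2
labels = fcluster(Z, t=threshold, criterion='distance')

print("Suggested number of clusters:", suggested_k)
print("Labels:", labels)
```

On noisy data the largest gap may be ambiguous, so it is worth checking the result against the dendrogram rather than trusting the number blindly.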
🔹 Step 5: Real-World Example
Customer segmentation using hierarchical clustering
# Customer data: [age, income]
customers = np.array([
    [25, 30000], [30, 35000], [28, 32000],  # Young, lower income
    [45, 80000], [50, 85000], [48, 82000],  # Middle-aged, high income
    [65, 40000], [70, 45000], [68, 42000]   # Older, moderate income
])
# Cluster customers
cluster = AgglomerativeClustering(n_clusters=3)
segments = cluster.fit_predict(customers)
# Analyze segments
for i in range(3):
    segment_customers = customers[segments == i]
    avg_age = segment_customers[:, 0].mean()
    avg_income = segment_customers[:, 1].mean()
    print(f"Segment {i}: Age={avg_age:.1f}, Income=${avg_income:.0f}")
# This helps identify customer groups for targeted marketing
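One caveat with this example: age is in tens while income is in tens of thousands, so Euclidean distances are dominated by income. A possible refinement, shown here as a sketch rather than a required step, is to standardize both features first with scikit-learn's `StandardScaler` so that each contributes on a comparable scale.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

customers = np.array([
    [25, 30000], [30, 35000], [28, 32000],  # Young, lower income
    [45, 80000], [50, 85000], [48, 82000],  # Middle-aged, high income
    [65, 40000], [70, 45000], [68, 42000]   # Older, moderate income
])

# Rescale each column to zero mean and unit variance before clustering
scaled = StandardScaler().fit_transform(customers)
segments = AgglomerativeClustering(n_clusters=3).fit_predict(scaled)

# Report averages in the original units, not the scaled ones
for i in range(3):
    members = customers[segments == i]
    print(f"Segment {i}: Age={members[:, 0].mean():.1f}, "
          f"Income=${members[:, 1].mean():.0f}")
```

With this particular data the segments come out the same either way, but on less cleanly separated data the unscaled version could effectively ignore age.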