Normal Distribution
Understanding the bell curve that appears everywhere in data
📊 What is Normal Distribution?
A normal distribution is a symmetric, bell-shaped curve that describes how data spreads around its average. Most values cluster near the center (the mean), and values become rarer the farther they lie from it. Many natural and measured quantities are approximately normal: heights, test scores, measurement errors!
import numpy as np
import matplotlib.pyplot as plt
# Generate normal distribution data
data = np.random.normal(100, 15, 1000) # mean=100, std=15
plt.hist(data, bins=30)
plt.title("Normal Distribution - Bell Curve")
plt.show()
Normal Distribution Concepts
Bell Shape
Symmetric curve around the mean
# Create bell curve: the standard normal density (mean=0, std=1)
x = np.linspace(-3, 3, 100)
y = (1/np.sqrt(2*np.pi)) * np.exp(-0.5*x**2)
plt.plot(x, y)
plt.show()
68-95-99.7 Rule
Standard deviation boundaries
# 68% within 1 standard deviation
mean = 100
std = 15
print(f"68% between {mean-std} and {mean+std}")
Mean = Median
Perfect symmetry
data = np.random.normal(50, 10, 1000)
print(f"Mean: {np.mean(data):.1f}")
print(f"Median: {np.median(data):.1f}")
Everywhere!
Heights, scores, errors
# Heights example
heights = np.random.normal(170, 10, 100)
print(f"Average height: {np.mean(heights):.1f}cm")
🔔 Understanding the Bell Curve
Samples drawn from a normal distribution form a bell shape when plotted as a histogram. Let's see this in practice:
🔹 Creating Your First Bell Curve
import numpy as np
import matplotlib.pyplot as plt
# Generate 1000 random numbers from normal distribution
data = np.random.normal(0, 1, 1000) # mean=0, std=1
# Plot histogram
plt.hist(data, bins=30, alpha=0.7)
plt.title("My First Bell Curve")
plt.xlabel("Values")
plt.ylabel("Frequency")
plt.show()
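The bell shape in the histogram comes from the normal probability density function. For reference, here is its standard textbook form, where $\mu$ is the mean and $\sigma$ is the standard deviation (these symbols are introduced here; the code above calls them mean and std):

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```

Because $x$ enters only through $(x-\mu)^2$, the curve is symmetric about the mean, and the exponential makes values far from the mean increasingly rare. Setting $\mu = 0$ and $\sigma = 1$ gives the standard normal density $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.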
🔹 Different Means, Same Shape
# Three bell curves with different centers
curve1 = np.random.normal(0, 1, 1000) # centered at 0
curve2 = np.random.normal(5, 1, 1000) # centered at 5
curve3 = np.random.normal(-3, 1, 1000) # centered at -3
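# Use the same number of bins for each histogram so the shapes are comparable
plt.hist(curve1, bins=30, alpha=0.5, label="Mean=0")
plt.hist(curve2, bins=30, alpha=0.5, label="Mean=5")
plt.hist(curve3, bins=30, alpha=0.5, label="Mean=-3")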
plt.legend()
plt.show()
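Histograms of random samples only look roughly alike. One way to confirm that changing the mean shifts the curve without changing its shape is to compare the exact densities. The sketch below assumes SciPy is installed and uses `scipy.stats.norm.pdf`; the names `offsets` and `shapes` are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Evaluate each density on a grid centered at its own mean
offsets = np.linspace(-4, 4, 81)
means = [0, 5, -3]
shapes = [norm.pdf(m + offsets, loc=m, scale=1) for m in means]

# All three curves are identical once they are re-centered
same_shape = all(np.allclose(shapes[0], s) for s in shapes[1:])
print(f"Same shape after re-centering: {same_shape}")

# Each curve peaks at its own mean, with height 1/sqrt(2*pi) when std=1
peak_height = norm.pdf(0, loc=0, scale=1)
print(f"Peak height: {peak_height:.4f}")

# Output:
# Same shape after re-centering: True
# Peak height: 0.3989
```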
📏 The 68-95-99.7 Rule
For normally distributed data, this rule tells us how much of the data lies close to the mean:
📊 The Magic Numbers:
- About 68% of data falls within 1 standard deviation of the mean
- About 95% of data falls within 2 standard deviations
- About 99.7% of data falls within 3 standard deviations
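The percentages in the rule are rounded. As a quick check, the more precise values can be computed from the standard normal cumulative distribution function. This sketch assumes SciPy is available; `coverage` is just an illustrative name.

```python
from scipy.stats import norm

# Probability that a normal value lies within k standard deviations of the mean
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"Within {k} std: {p * 100:.2f}%")

# Output:
# Within 1 std: 68.27%
# Within 2 std: 95.45%
# Within 3 std: 99.73%
```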
🔹 Test Scores Example
# Test scores: mean=75, std=10
mean_score = 75
std_score = 10
print("Test Score Ranges:")
print(f"68% of students score between {mean_score-std_score} and {mean_score+std_score}")
print(f"95% of students score between {mean_score-2*std_score} and {mean_score+2*std_score}")
print(f"99.7% of students score between {mean_score-3*std_score} and {mean_score+3*std_score}")
# Output:
# Test Score Ranges:
# 68% of students score between 65 and 85
# 95% of students score between 55 and 95
# 99.7% of students score between 45 and 105
🔹 Checking Real Data
# Generate test scores
scores = np.random.normal(75, 10, 1000)
# Count how many fall within 1 standard deviation
within_1_std = np.sum((scores >= 65) & (scores <= 85))
percentage = (within_1_std / len(scores)) * 100
print(f"Actually {percentage:.1f}% are within 1 std deviation")
# Should be close to 68%
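The same check can be extended to all three bands of the rule. The sketch below is one possible version: it uses a larger sample and a fixed seed (both arbitrary choices, not from the example above) so the percentages are stable from run to run, and it measures distances from the sample's own mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed (arbitrary) for repeatable output
scores = rng.normal(75, 10, 100_000)

mean, std = scores.mean(), scores.std()
fractions = {}
for k in (1, 2, 3):
    # True where a score lies within k standard deviations of the mean
    inside = np.abs(scores - mean) <= k * std
    fractions[k] = inside.mean() * 100
    print(f"Within {k} std: {fractions[k]:.1f}%")
```

With this many points the results should land close to 68%, 95%, and 99.7%; smaller samples will wander further from those values.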
🌍 Real World Examples
Many real-world quantities are approximately normally distributed:
🔹 Human Heights
# Simulated adult male heights (cm), assuming mean=175 and std=7
heights = np.random.normal(175, 7, 1000)
print(f"Average height: {np.mean(heights):.1f}cm")
print(f"About 68% of men are between {175-7}cm and {175+7}cm")
plt.hist(heights, bins=30)
plt.title("Distribution of Male Heights")
plt.xlabel("Height (cm)")
plt.show()
🔹 IQ Scores
# IQ scores: mean=100, std=15
iq_scores = np.random.normal(100, 15, 1000)
print(f"Average IQ: {np.mean(iq_scores):.1f}")
print("68% of people have IQ between 85 and 115")
print("95% of people have IQ between 70 and 130")
plt.hist(iq_scores, bins=30)
plt.title("IQ Score Distribution")
plt.show()
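The ranges printed above come from the rounded 68-95-99.7 rule. For exact figures under the same model (mean=100, std=15), the distribution can be queried directly. This is a minimal sketch assuming SciPy is installed; `iq`, `low`, `high`, and `above_130` are illustrative names.

```python
from scipy.stats import norm

iq = norm(loc=100, scale=15)  # frozen normal distribution for the IQ model

# Exact bounds of the central 95% (the rule of thumb rounds 1.96 std up to 2 std)
low, high = iq.ppf(0.025), iq.ppf(0.975)
print(f"Central 95%: {low:.1f} to {high:.1f}")

# Share of people expected above 130 under this model
above_130 = iq.sf(130)
print(f"Above 130: {above_130 * 100:.1f}%")

# Output:
# Central 95%: 70.6 to 129.4
# Above 130: 2.3%
```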
🔹 Measurement Errors
# Measuring a 10cm object with small errors
true_length = 10.0
measurements = np.random.normal(true_length, 0.1, 100)
print(f"True length: {true_length}cm")
print(f"Average measurement: {np.mean(measurements):.2f}cm")
print(f"Standard deviation: {np.std(measurements):.3f}cm")
plt.hist(measurements, bins=20)
plt.title("Measurement Errors")
plt.show()
🛠️ Working with Normal Distribution
Practical tools for normal distribution analysis:
🔹 Generating Normal Data
# Different ways to create normal data
data1 = np.random.normal(50, 10, 100) # mean=50, std=10, 100 points
data2 = np.random.randn(100) * 10 + 50 # Same distribution, different way (the individual values differ)
print(f"Method 1 mean: {np.mean(data1):.1f}")
print(f"Method 2 mean: {np.mean(data2):.1f}")
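Both methods above draw new random values on every run. When repeatable results are needed, one option is NumPy's `Generator` interface, created with `np.random.default_rng` and a seed. The sketch below uses an arbitrary seed of 0 and illustrative variable names.

```python
import numpy as np

# Two generators created with the same seed produce identical samples
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)

sample_a = rng_a.normal(loc=50, scale=10, size=100)
sample_b = rng_b.normal(loc=50, scale=10, size=100)
print(np.array_equal(sample_a, sample_b))

# standard_normal is the Generator counterpart of randn
sample_c = rng_a.standard_normal(100) * 10 + 50
print(sample_c.shape)

# Output:
# True
# (100,)
```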
🔹 Testing if Data is Normal
from scipy import stats
# Create normal and non-normal data
normal_data = np.random.normal(0, 1, 1000)
uniform_data = np.random.uniform(0, 1, 1000)
# Simple visual test
plt.subplot(1, 2, 1)
plt.hist(normal_data, bins=30)
plt.title("Normal Data")
plt.subplot(1, 2, 2)
plt.hist(uniform_data, bins=30)
plt.title("Not Normal Data")
plt.show()
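A histogram is only a visual check. As a complement, a statistical test such as Shapiro-Wilk (`scipy.stats.shapiro`) gives a p-value: a small p-value suggests the data is not normal. The sketch below is one way to apply it; the seed, the 0.05 threshold, and the `results` dictionary are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed (arbitrary) for repeatable output
normal_data = rng.normal(0, 1, 1000)
uniform_data = rng.uniform(0, 1, 1000)

results = {}
for name, sample in [("normal", normal_data), ("uniform", uniform_data)]:
    statistic, p_value = stats.shapiro(sample)
    results[name] = (statistic, p_value)
    # 0.05 is a conventional cutoff, not a hard rule
    verdict = "looks non-normal" if p_value < 0.05 else "no evidence against normality"
    print(f"{name}: W={statistic:.3f}, p={p_value:.3g} -> {verdict}")
```

A large p-value does not prove the data is normal. It only means the test found no strong evidence against it, so it is best read together with the histogram.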
🔹 Z-Scores (Standardizing)
# Convert any normal distribution to standard normal (mean=0, std=1)
test_scores = np.random.normal(75, 10, 100)
# Calculate z-scores
z_scores = (test_scores - np.mean(test_scores)) / np.std(test_scores)
print(f"Original mean: {np.mean(test_scores):.1f}")
print(f"Z-score mean: {np.mean(z_scores):.1f}") # Should be ~0
print(f"Z-score std: {np.std(z_scores):.1f}") # Should be ~1
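A z-score becomes more useful once it is turned into a percentile. As a small sketch (assuming SciPy is installed), the example below takes a hypothetical score of 90 under the same test-score model (mean=75, std=10) and uses the standard normal CDF to find the share of scores below it.

```python
from scipy.stats import norm

mean_score, std_score = 75, 10  # same test-score model as above
score = 90                      # hypothetical student score

# Number of standard deviations above the mean
z = (score - mean_score) / std_score

# Share of scores expected to fall below this one
percentile = norm.cdf(z) * 100

print(f"z = {z:.1f}")
print(f"Percentile: {percentile:.1f}")

# Output:
# z = 1.5
# Percentile: 93.3
```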