Probability¶
Course: MA2221 · Mahindra University
Reference: Mathematics for Machine Learning, Deisenroth, Faisal & Ong — Ch 6
Two questions drive this lab:
How does probability give us the language to describe uncertainty in data and models?
Why is a single train/test split unreliable, and how does K-fold fix that?
Every machine learning algorithm assumes data is drawn from some probability distribution. Understanding that distribution — and honestly measuring how well we generalise to it — is the foundation of principled machine learning.
Structure¶
| Section | Topic |
|---|---|
| 1 | Sample spaces and events — the building blocks |
| 2 | Probability axioms and rules (sum, product, Bayes) |
| 3 | Random variables — discrete and continuous |
| 4 | Key distributions: Bernoulli, Binomial, Uniform, Gaussian |
| 5 | Expectation, variance, and the Law of Large Numbers |
| 6 | The Normal distribution and the Central Limit Theorem |
| 7 | Why one train/test split gives a noisy estimate |
| 8 | K-fold cross-validation — theory and implementation |
| 9 | Stratified K-fold for imbalanced data |
| 10 | Full CV pipeline — choosing K, reporting results |
Legend¶
- 🧱 Worked — run and read
- ✏️ Your Turn — fill in
- 🔬 Experiment — change numbers and observe
- 💬 Discuss — no single right answer
0 · Setup¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import datasets
from sklearn.model_selection import (
train_test_split, KFold, StratifiedKFold, cross_val_score
)
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
np.random.seed(42)
plt.rcParams.update({
'figure.dpi' : 120,
'axes.spines.top' : False,
'axes.spines.right': False,
'font.size' : 12,
})
print('All imports OK ✓')
Section 1 · Sample Spaces & Events¶
What is a probability space?¶
Every probabilistic experiment is described by three objects — the probability triple:
- Sample space Ω — the set of all possible outcomes.
- Event space 𝒜 — the collection of subsets of Ω we assign probabilities to.
- Probability measure P — a function that maps each event to a number in [0, 1].
$$( \Omega,\, \mathcal{A},\, P )$$
Example. Rolling a fair die: $\Omega = \{1, 2, 3, 4, 5, 6\}$. The event "rolling an even number" is $A = \{2, 4, 6\} \subseteq \Omega$, and $P(A) = 3/6 = 0.5$.
1.1 · Worked — Simulating a Sample Space¶
🧱 Run and read.
# Simulate rolling a fair six-sided die 100 000 times
n_rolls = 100_000
outcomes = np.random.randint(1, 7, size=n_rolls) # Ω = {1, 2, 3, 4, 5, 6}
# Empirical probability of each face
faces, counts = np.unique(outcomes, return_counts=True)
empirical_p = counts / n_rolls
fig, ax = plt.subplots(figsize=(7, 3.5))
ax.bar(faces, empirical_p, color='steelblue', edgecolor='white')
ax.axhline(1/6, color='crimson', lw=1.5, linestyle='--', label='True P = 1/6')
ax.set_xlabel('Face'); ax.set_ylabel('Empirical probability')
ax.set_title(f'Die roll simulation ({n_rolls:,} rolls)')
ax.legend(); plt.tight_layout(); plt.show()
# Event A = {2, 4, 6}: rolling an even number
A = outcomes[outcomes % 2 == 0]
print(f'P(even) ≈ {len(A)/n_rolls:.4f} (true = 0.5000)')
✏️ 1.2 · Your Turn — Complement and Union¶
Use the same outcomes array to estimate:
- $P(A^c)$ — the probability of rolling an odd number (complement of A).
- $P(B)$ where $B = \{5, 6\}$ — rolling a 5 or 6.
- $P(A \cup B)$ — rolling an even number or a 5 or 6.
- Verify: does $P(A \cup B) = P(A) + P(B) - P(A \cap B)$?
# 1. Complement: odd rolls
p_A = (outcomes % 2 == 0).mean()
p_Ac = ___ # fill in: 1 - p_A or (outcomes % 2 != 0).mean()
print(f'P(A) = {p_A:.4f}')
print(f'P(Ac) = {p_Ac:.4f} (should ≈ 0.5000)')
# 2. Event B = {5, 6}
p_B = np.isin(outcomes, ___).mean() # fill in: [5, 6]
print(f'P(B) = {p_B:.4f} (should ≈ {2/6:.4f})')
# 3. A ∪ B (A = even, B = {5,6})
mask_A = outcomes % 2 == 0
mask_B = np.isin(outcomes, [5, 6])
p_A_or_B = (mask_A ___ mask_B).mean() # fill in: | (bitwise OR)
print(f'P(A∪B) = {p_A_or_B:.4f}')
# 4. Inclusion–exclusion check
p_A_and_B = (mask_A ___ mask_B).mean() # fill in: &
p_union_formula = p_A + p_B - p_A_and_B
print(f'Formula P(A)+P(B)-P(A∩B) = {p_union_formula:.4f} — matches? {np.isclose(p_A_or_B, p_union_formula, atol=0.005)}')
Section 2 · Probability Axioms & Rules¶
The three axioms (Kolmogorov, 1933)¶
All of probability theory follows from three axioms:
- Nonnegativity: $P(A) \ge 0$ for every event $A$.
- Normalisation: $P(\Omega) = 1$.
- Countable additivity: For disjoint events $A_1, A_2, \dots$, $P\!\left(\bigcup_i A_i\right) = \sum_i P(A_i)$.
From these three axioms, everything else derives:
- Complement rule: $P(A^c) = 1 - P(A)$
- Inclusion–exclusion: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- Product rule: $P(A \cap B) = P(A \mid B)\,P(B)$
- Bayes' theorem: $\displaystyle P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
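The product rule above is easy to check empirically. Below is a minimal sketch that reuses the `outcomes` array from the die simulation in Section 1 (assumed to still be in memory), with A = even number and B = {5, 6}; the conditional probability P(A | B) is just the relative frequency of A inside the subset where B occurred.
# Empirical check of the product rule P(A ∩ B) = P(A | B) · P(B)
# (assumes the `outcomes` array from Section 1 is still defined)
mask_A = outcomes % 2 == 0           # A: even number
mask_B = np.isin(outcomes, [5, 6])   # B: a 5 or a 6
P_B = mask_B.mean()
P_A_given_B = mask_A[mask_B].mean()  # relative frequency of A within B
P_A_and_B = (mask_A & mask_B).mean()
print(f'P(A ∩ B)     = {P_A_and_B:.4f}')
print(f'P(A|B)·P(B)  = {P_A_given_B * P_B:.4f} (true value = 1/6 ≈ {1/6:.4f})')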
2.1 · Worked — Conditional Probability & Bayes¶
🧱 A medical test for a disease has sensitivity $P(\text{pos} \mid \text{disease}) = 0.99$ and specificity $P(\text{neg} \mid \text{no disease}) = 0.95$. The disease prevalence is $P(\text{disease}) = 0.001$. What is $P(\text{disease} \mid \text{positive test})$?
# Bayes' theorem — medical test example
P_disease = 0.001 # prior: prevalence
P_no_disease = 1 - P_disease
P_pos_given_disease = 0.99 # sensitivity
P_pos_given_no_disease = 0.05 # 1 - specificity
# Law of total probability: P(positive test)
P_pos = (P_pos_given_disease * P_disease
+ P_pos_given_no_disease * P_no_disease)
# Bayes' theorem: posterior
P_disease_given_pos = (P_pos_given_disease * P_disease) / P_pos
print(f'P(positive test) = {P_pos:.5f}')
print(f'P(disease | positive test) = {P_disease_given_pos:.4f} ({P_disease_given_pos*100:.2f}%)')
print()
print('Interpretation: even with a 99% sensitive test, a positive result')
print('only means ~2% chance of having the disease (because prevalence is very low).')
print('This is the base-rate fallacy — always account for the prior!')
✏️ 2.2 · Your Turn — Simulate Conditional Probability¶
A dataset has two features: age group (young / old) and label (positive / negative). Simulate 10 000 records and estimate $P(\text{positive} \mid \text{old})$ empirically.
- 60% young, 40% old.
- Positive rate: 20% if young, 50% if old.
n = 10_000
rng = np.random.default_rng(7)
age = rng.choice(['young', 'old'], size=n, p=[0.6, 0.4])
p_pos = np.where(age == 'old', 0.5, 0.2)
label = rng.binomial(1, p_pos) # 1 = positive
# Estimate P(positive | old)
mask_old = age == 'old'
p_pos_given_old = ___ # fill in: label[mask_old].mean()
# Estimate P(positive | young)
mask_young = ___ # fill in: age == 'young'
p_pos_given_young = ___ # fill in: label[mask_young].mean()
print(f'P(positive | old) ≈ {p_pos_given_old:.3f} (true = 0.50)')
print(f'P(positive | young) ≈ {p_pos_given_young:.3f} (true = 0.20)')
# Marginal: P(positive)
p_pos_marginal = ___ # fill in: label.mean()
print(f'P(positive) ≈ {p_pos_marginal:.3f} (true = 0.6*0.2 + 0.4*0.5 = {0.6*0.2+0.4*0.5:.2f})')
Section 3 · Random Variables¶
Discrete vs continuous¶
A random variable $X$ is a function $X : \Omega \to \mathbb{R}$ that maps each outcome to a number. We distinguish two types:
- Discrete RV — takes countable values. Described by a probability mass function (PMF): $P(X = k)$.
- Continuous RV — takes values in an interval. Described by a probability density function (PDF): $f(x)$, where $P(a \le X \le b) = \int_a^b f(x)\,dx$.
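For the continuous case, the statement $P(a \le X \le b) = \int_a^b f(x)\,dx$ can be checked numerically. The following is a minimal sketch (not part of the lab tasks) for a standard Normal, comparing numerical integration of the PDF, the CDF difference, and a simulation; all three should agree to roughly three decimals.
from scipy import integrate
# P(-1 ≤ X ≤ 1) for X ~ Normal(0, 1), computed three ways
a, b = -1, 1
area, _ = integrate.quad(stats.norm.pdf, a, b)      # ∫ f(x) dx from a to b
via_cdf = stats.norm.cdf(b) - stats.norm.cdf(a)     # F(b) - F(a)
draws = np.random.normal(size=100_000)
via_sim = ((draws >= a) & (draws <= b)).mean()      # relative frequency
print(f'∫ pdf = {area:.4f}, CDF difference = {via_cdf:.4f}, simulation ≈ {via_sim:.4f} (true ≈ 0.6827)')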
3.1 · Worked — Empirical PMF from Data¶
🧱 Discrete RV: number of heads in 10 fair coin flips.
n_experiments = 50_000
X = np.random.binomial(n=10, p=0.5, size=n_experiments)
k_vals = np.arange(0, 11)
empirical_pmf = np.array([(X == k).mean() for k in k_vals])
true_pmf = stats.binom.pmf(k_vals, n=10, p=0.5)
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(k_vals - 0.2, empirical_pmf, width=0.35, color='steelblue', label='Empirical')
ax.bar(k_vals + 0.2, true_pmf, width=0.35, color='crimson', alpha=0.7, label='True Binomial(10, 0.5)')
ax.set_xlabel('k (number of heads)')
ax.set_ylabel('P(X = k)')
ax.set_title(f'Empirical vs true PMF ({n_experiments:,} experiments)')
ax.legend(); plt.tight_layout(); plt.show()
✏️ 3.2 · Your Turn — Empirical CDF¶
The cumulative distribution function (CDF) is $F(x) = P(X \le x)$.
- Compute the empirical CDF of `X` at each value $k = 0, \dots, 10$.
- Compare with the true CDF from `stats.binom.cdf`.
- What does $F(5)$ tell you?
empirical_cdf = np.array([(X <= k).mean() for k in k_vals])
true_cdf = stats.binom.cdf(___, n=10, p=0.5) # fill in: k_vals
fig, ax = plt.subplots(figsize=(8, 4))
ax.step(k_vals, empirical_cdf, where='post', color='steelblue', lw=2, label='Empirical CDF')
ax.step(k_vals, true_cdf, where='post', color='crimson', lw=2, linestyle='--', label='True CDF')
ax.set_xlabel('k'); ax.set_ylabel('F(k) = P(X ≤ k)')
ax.set_title('CDF of Binomial(10, 0.5)'); ax.legend()
plt.tight_layout(); plt.show()
print(f'P(X ≤ 5) empirical = {___:.4f}') # fill in: empirical_cdf[5]
print(f'P(X ≤ 5) true = {true_cdf[5]:.4f}')
# 💬 What does F(5) = 0.623 mean in plain language?
Section 4 · Key Distributions¶
| Distribution | Type | Parameters | Where it appears |
|---|---|---|---|
| Bernoulli($p$) | Discrete | $p \in [0,1]$ | Binary labels, logistic regression outputs |
| Binomial($n, p$) | Discrete | $n \in \mathbb{N},\; p \in [0,1]$ | Count of successes; bootstrap |
| Uniform($a, b$) | Continuous | $a < b$ | Random initialisation, random splits |
| Normal($\mu, \sigma^2$) | Continuous | $\mu \in \mathbb{R},\; \sigma > 0$ | Weight init, noise models, CLT |
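Each row of this table maps to a scipy.stats distribution object. As a minimal sketch, the frozen-distribution interface below prints the theoretical mean and variance of each (note that scipy parameterises Uniform(a, b) as loc=a, scale=b-a, and Normal by its standard deviation rather than its variance):
# Theoretical mean and variance via scipy's frozen distributions
dists = {
    'Bernoulli(0.3)':    stats.bernoulli(p=0.3),
    'Binomial(20, 0.4)': stats.binom(n=20, p=0.4),
    'Uniform(0, 1)':     stats.uniform(loc=0, scale=1),
    'Normal(0, 1)':      stats.norm(loc=0, scale=1),
}
for name, d in dists.items():
    print(f'{name:<18} mean = {d.mean():7.3f}   var = {d.var():7.3f}')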
4.1 · Worked — Visualising the Four Distributions¶
🧱
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
# 1. Bernoulli(p=0.3)
axes[0].bar([0, 1], [0.7, 0.3], color=['steelblue', 'crimson'])
axes[0].set_xticks([0, 1]); axes[0].set_xticklabels(['0 (fail)', '1 (success)'])
axes[0].set_title('Bernoulli(p=0.3)')
# 2. Binomial(20, 0.4)
k = np.arange(0, 21)
axes[1].bar(k, stats.binom.pmf(k, 20, 0.4), color='steelblue')
axes[1].set_title('Binomial(20, 0.4)')
# 3. Uniform(0, 1)
x = np.linspace(-0.2, 1.2, 400)
axes[2].plot(x, stats.uniform.pdf(x), color='steelblue', lw=2)
axes[2].fill_between(x, stats.uniform.pdf(x), alpha=0.2, color='steelblue')
axes[2].set_title('Uniform(0, 1)')
# 4. Normal(0, 1)
x = np.linspace(-4, 4, 400)
axes[3].plot(x, stats.norm.pdf(x), color='steelblue', lw=2)
axes[3].fill_between(x, stats.norm.pdf(x), alpha=0.2, color='steelblue')
axes[3].set_title('Normal(μ=0, σ²=1)')
for ax in axes:
ax.set_ylabel('P(X=k) or f(x)', fontsize=9)
plt.suptitle('Four fundamental distributions', fontsize=13)
plt.tight_layout(); plt.show()
✏️ 4.2 · Your Turn — Changing Distribution Parameters¶
How does the shape of Binomial($n, p$) change as $p$ varies?
- Plot Binomial($20, p$) for $p \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$ on the same axes.
- For which value of $p$ is the distribution most symmetric?
- What happens to the mean as $p$ increases? (Recall: $\mathbb{E}[X] = np$.)
p_values = [0.1, 0.3, 0.5, 0.7, 0.9]
n_trials = 20
k = np.arange(0, n_trials + 1)
fig, ax = plt.subplots(figsize=(9, 4))
colors = plt.cm.coolwarm(np.linspace(0, 1, len(p_values)))
for p, c in zip(p_values, colors):
pmf = stats.binom.pmf(___, n=n_trials, p=___) # fill in: k, p
ax.plot(k, pmf, 'o-', color=c, lw=1.5, ms=4, label=f'p={p}')
ax.axvline(n_trials * p, color=c, lw=0.7, linestyle=':') # mark the mean
ax.set_xlabel('k'); ax.set_ylabel('P(X = k)')
ax.set_title(f'Binomial({n_trials}, p) for varying p')
ax.legend(fontsize=10)
plt.tight_layout(); plt.show()
# 💬 Which p gives a symmetric distribution? Why?
Section 5 · Expectation & Variance¶
$$ \mathbb{E}[X] = \sum_k k\,P(X=k) \quad \text{(discrete)} \qquad \mathbb{E}[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx \quad \text{(continuous)} $$
$$ \operatorname{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 $$
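These definitions can be evaluated directly from a PMF. A minimal sketch for the fair die (values 1 to 6, each with probability 1/6), whose expectation reappears in the next cell:
# E[X] and Var(X) computed directly from the PMF of a fair die
k = np.arange(1, 7)
p = np.full(6, 1/6)
E_X = np.sum(k * p)              # Σ k · P(X = k)
E_X2 = np.sum(k**2 * p)          # Σ k² · P(X = k)
Var_X = E_X2 - E_X**2            # E[X²] - (E[X])²
print(f'E[X]   = {E_X:.4f} (expected 3.5)')
print(f'Var(X) = {Var_X:.4f} (expected 35/12 ≈ {35/12:.4f})')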
5.1 · Worked — Law of Large Numbers¶
🧱 As $n \to \infty$, the sample mean $\bar{X}_n$ converges to $\mathbb{E}[X]$. Let's watch it happen.
# Roll a fair die. E[X] = (1+2+3+4+5+6)/6 = 3.5
true_mean = 3.5
n_max = 10_000
rolls = np.random.randint(1, 7, size=n_max)
cumul_mean = np.cumsum(rolls) / np.arange(1, n_max + 1)
fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(cumul_mean, color='steelblue', lw=1.2, label='Running mean')
ax.axhline(true_mean, color='crimson', lw=1.5, linestyle='--', label=f'E[X] = {true_mean}')
ax.set_xscale('log')
ax.set_xlabel('Number of rolls (log scale)'); ax.set_ylabel('Sample mean')
ax.set_title('Law of Large Numbers — die roll')
ax.legend(); plt.tight_layout(); plt.show()
✏️ 5.2 · Your Turn — Estimating Variance¶
For Binomial($n=10,\; p=0.3$): true mean $= np = 3$, true variance $= np(1-p) = 2.1$.
- Draw 50 000 samples.
- Compute the sample mean $\bar{x}$ and sample variance $s^2$ using `np.mean` and `np.var(ddof=1)`.
- Compare to the true values.
- Why do we use `ddof=1` (Bessel's correction)?
samples = np.random.binomial(n=10, p=0.3, size=50_000)
sample_mean = samples.___() # fill in: mean
sample_var = samples.___(ddof=1) # fill in: var (Bessel-corrected)
print(f'Sample mean = {sample_mean:.4f} (true = {10*0.3:.4f})')
print(f'Sample variance = {sample_var:.4f} (true = {10*0.3*0.7:.4f})')
# 💬 What changes if you use ddof=0? How large is the correction for n=50000?
Section 6 · The Normal Distribution & Central Limit Theorem¶
The Central Limit Theorem (CLT) says: the sum (or mean) of many independent random variables converges to a Normal distribution — regardless of the original distribution.
$$ \sqrt{n}\,\left(\bar{X}_n - \mu\right) \;\xrightarrow{d}\; \mathcal{N}(0,\, \sigma^2) \qquad \text{as } n \to \infty $$
6.1 · Worked — Watching the CLT¶
🧱 We sample from a highly skewed exponential distribution and watch the sample means become Normal as sample size grows.
# Exponential distribution with λ=1 — very skewed (mean=1, std=1)
n_experiments = 20_000
sample_sizes = [1, 4, 16, 64]
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, n in zip(axes, sample_sizes):
X = np.random.exponential(scale=1, size=(n_experiments, n))
means = X.mean(axis=1) # shape: (n_experiments,)
ax.hist(means, bins=60, density=True, color='steelblue', alpha=0.8)
# Overlay the CLT prediction: N(μ=1, σ²=1/n)
xr = np.linspace(means.min(), means.max(), 300)
ax.plot(xr, stats.norm.pdf(xr, loc=1, scale=1/np.sqrt(n)),
color='crimson', lw=2, label=f'N(1, 1/{n})')
ax.set_title(f'n = {n}'); ax.legend(fontsize=9)
plt.suptitle('Central Limit Theorem — means of Exponential(1) samples', fontsize=13)
plt.tight_layout(); plt.show()
💡 Connection to ML: The CLT explains why validation metrics (MSE, accuracy) computed on a test set are approximately normally distributed. This is the theoretical backbone of confidence intervals for model performance — and why averaging over K folds gives a tighter, better-calibrated estimate.
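As a minimal sketch of that idea: test accuracy is a mean of 0/1 losses, so by the CLT it is approximately Normal with standard error $\sqrt{\hat{p}(1-\hat{p})/n}$. The numbers below are purely hypothetical and only illustrate the calculation.
# Approximate 95% confidence interval for a test-set accuracy (hypothetical numbers)
p_hat = 0.94        # measured test accuracy (hypothetical)
n_test = 114        # test-set size (hypothetical)
se = np.sqrt(p_hat * (1 - p_hat) / n_test)   # CLT standard error of a mean of 0/1 losses
print(f'Accuracy = {p_hat:.3f}, 95% CI ≈ [{p_hat - 1.96*se:.3f}, {p_hat + 1.96*se:.3f}]')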
🔬 6.2 · Experiment — How Many Samples Until Normal?¶
Use a Uniform distribution (less skewed than Exponential). Try n ∈ {1, 2, 5, 20}. At what n does the Normal approximation look convincing?
sample_sizes_exp = [1, 2, 5, 20] # ← try changing these
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, n in zip(axes, sample_sizes_exp):
X = np.random.uniform(0, 1, size=(20_000, n))
means = X.mean(axis=1) # E[Uniform(0,1)] = 0.5, Var = 1/12
ax.hist(means, bins=60, density=True, color='steelblue', alpha=0.8)
xr = np.linspace(means.min(), means.max(), 300)
ax.plot(xr, stats.norm.pdf(xr, loc=0.5, scale=np.sqrt(1/(12*n))),
color='crimson', lw=2)
ax.set_title(f'n = {n}')
plt.suptitle('CLT — means of Uniform(0,1) samples', fontsize=13)
plt.tight_layout(); plt.show()
# 💬 At what n does the Normal approximation become convincing here vs Exponential?
Section 7 · Why One Split Is Not Enough¶
The single-split problem¶
A single random split has a critical weakness: the test score is a random variable — it depends on which samples happened to land in the test set. With a small dataset, this variance can be enormous.
$$ \widehat{\text{Err}}_{\text{test}} = \frac{1}{|S_{\text{test}}|} \sum_{i \in S_{\text{test}}} \ell(f(\mathbf{x}_i),\, y_i) $$
This is an unbiased estimator of generalisation error — but it can have high variance.
7.1 · Worked — Variance of a Single Split¶
🧱
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
data = load_breast_cancer()
X, y = data.data, data.target
print(f'Dataset: {X.shape[0]} samples, {X.shape[1]} features, {np.unique(y).size} classes')
# Repeat a single 80/20 split 200 times with different random seeds
n_repeats = 200
test_scores = []
for seed in range(n_repeats):
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000, random_state=0)
clf.fit(scaler.transform(X_tr), y_tr)
test_scores.append(accuracy_score(y_te, clf.predict(scaler.transform(X_te))))
test_scores = np.array(test_scores)
print(f'\nSingle-split test accuracy over {n_repeats} random seeds:')
print(f' Mean : {test_scores.mean():.4f}')
print(f' Std : {test_scores.std():.4f}')
print(f' Range : [{test_scores.min():.4f}, {test_scores.max():.4f}]')
fig, ax = plt.subplots(figsize=(8, 3.5))
ax.hist(test_scores, bins=30, color='steelblue', edgecolor='white')
ax.axvline(test_scores.mean(), color='crimson', lw=2, linestyle='--',
label=f'Mean={test_scores.mean():.3f}')
ax.set_xlabel('Test accuracy'); ax.set_ylabel('Count')
ax.set_title('Distribution of single-split test accuracy across 200 seeds')
ax.legend(); plt.tight_layout(); plt.show()
⚠️ The problem with one split: A single test score can vary by several percentage points depending on the random seed. Reporting one number without a confidence interval is misleading. We need to average over multiple splits.
Section 8 · K-Fold Cross-Validation¶
The idea¶
Instead of one split, partition the data into $K$ equal folds. Train on $K-1$ folds and evaluate on the remaining one — repeat $K$ times, rotating which fold is the validation set.
Fold 1: [VAL] [---] [---] [---] [---]
Fold 2: [---] [VAL] [---] [---] [---]
Fold 3: [---] [---] [VAL] [---] [---]
Fold 4: [---] [---] [---] [VAL] [---]
Fold 5: [---] [---] [---] [---] [VAL]
$$ \widehat{\text{Err}}_{\text{CV}} = \frac{1}{K} \sum_{k=1}^{K} \text{Err}_k \qquad \widehat{\sigma}_{\text{CV}} = \sqrt{\frac{1}{K-1}\sum_{k=1}^{K}\left(\text{Err}_k - \widehat{\text{Err}}_{\text{CV}}\right)^2} $$
8.1 · Worked — K-Fold from Scratch¶
🧱 Manual 5-fold CV to see exactly what is happening inside.
K = 5
n = len(X)
idx = np.arange(n)
np.random.shuffle(idx)
fold_size = n // K
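# Note: with this simple scheme the last n % K samples never land in a validation fold;
# sklearn's KFold (used in 8.2) instead spreads the remainder across the first folds.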
fold_accs = []
for fold in range(K):
# Validation indices
val_idx = idx[fold * fold_size : (fold + 1) * fold_size]
# Training indices = everything else
train_idx = np.concatenate([idx[:fold * fold_size], idx[(fold + 1) * fold_size:]])
X_tr, y_tr = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000, random_state=0)
clf.fit(scaler.transform(X_tr), y_tr)
acc = accuracy_score(y_val, clf.predict(scaler.transform(X_val)))
fold_accs.append(acc)
print(f' Fold {fold+1}: val_size={len(val_idx)}, accuracy={acc:.4f}')
print(f'\nCV mean = {np.mean(fold_accs):.4f}')
print(f'CV std = {np.std(fold_accs, ddof=1):.4f}')
8.2 · Worked — Using sklearn's KFold¶
🧱 sklearn handles the fold indices for us.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_accs_sk = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
X_tr, y_tr = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000, random_state=0)
clf.fit(scaler.transform(X_tr), y_tr)
acc = accuracy_score(y_val, clf.predict(scaler.transform(X_val)))
fold_accs_sk.append(acc)
scores = np.array(fold_accs_sk)
print(f'5-fold CV accuracy: {scores.mean():.4f} ± {scores.std(ddof=1):.4f}')
fig, ax = plt.subplots(figsize=(6, 3.5))
ax.bar(np.arange(1, K+1), scores, color='steelblue', edgecolor='white')
ax.axhline(scores.mean(), color='crimson', lw=2, linestyle='--', label=f'Mean = {scores.mean():.3f}')
ax.set_xlabel('Fold'); ax.set_ylabel('Validation accuracy')
ax.set_title('5-fold CV — per-fold accuracy')
ax.set_ylim(0.9, 1.0); ax.legend(); plt.tight_layout(); plt.show()
✏️ 8.3 · Your Turn — cross_val_score (the one-liner)¶
sklearn.model_selection.cross_val_score wraps the loop above into a single call.
- What does the `scoring` parameter accept? Try `'accuracy'` and `'f1'`.
- What does the `cv` parameter accept — an integer or a `KFold` object?
- Why is passing a `Pipeline` (scaler + model) safer than scaling outside the loop? (A sketch follows the exercise code below.)
# Build a Pipeline — scaler + classifier in one object
pipe = Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression(max_iter=1000, random_state=0)),
])
# cross_val_score with 5-fold CV
cv_scores = cross_val_score(
___, # fill in: pipe
X, y,
cv = ___, # fill in: 5 (or a KFold object)
scoring = ___, # fill in: 'accuracy'
)
print(f'CV scores (per fold): {cv_scores.round(4)}')
print(f'Mean ± Std : {cv_scores.mean():.4f} ± {cv_scores.std(ddof=1):.4f}')
# Now try scoring='f1' — what changes?
cv_f1 = cross_val_score(pipe, X, y, cv=5, scoring=___)
print(f'\nCV F1 : {cv_f1.mean():.4f} ± {cv_f1.std(ddof=1):.4f}')
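On the last question above, here is a minimal sketch of why the Pipeline matters. Scaling the whole dataset before cross-validation lets every training fold see statistics (mean, std) computed partly from its validation fold, which is leakage; with the Pipeline, the scaler is refit on each fold's training portion only. On this dataset the numerical gap is usually tiny, but only the second version is methodologically sound.
# Leaky version: scaler fit on ALL the data before cross-validation
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000, random_state=0),
                               X_leaky, y, cv=5, scoring='accuracy')
# Honest version: scaler refit inside every fold via the Pipeline
honest_scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f'Leaky  CV accuracy : {leaky_scores.mean():.4f} ± {leaky_scores.std(ddof=1):.4f}')
print(f'Honest CV accuracy : {honest_scores.mean():.4f} ± {honest_scores.std(ddof=1):.4f}')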
🔬 8.4 · Experiment — How Does K Affect the Estimate?¶
Vary $K \in \{2, 3, 5, 10, \text{LOO}\}$ and observe how the mean and variance of the CV estimate change.
K_values = [2, 3, 5, 10, len(X)] # last entry = LOO
results = {}
for K_val in K_values:
kf = KFold(n_splits=K_val, shuffle=True, random_state=42)
sc = cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')
results[K_val] = sc
label = 'LOO' if K_val == len(X) else f'K={K_val}'
print(f'{label:6s}: mean={sc.mean():.4f} std={sc.std(ddof=1):.4f}')
# Plot mean ± std
fig, ax = plt.subplots(figsize=(8, 4))
labels = ['LOO' if k == len(X) else f'K={k}' for k in K_values]
means = [results[k].mean() for k in K_values]
stds = [results[k].std(ddof=1) for k in K_values]
ax.errorbar(labels, means, yerr=stds, fmt='o-', capsize=5, color='steelblue', lw=2)
ax.set_xlabel('Number of folds K'); ax.set_ylabel('CV accuracy')
ax.set_title('CV estimate mean ± std vs K')
plt.tight_layout(); plt.show()
# 💬 Why does variance decrease then (for LOO) potentially increase again?
# 💬 Why is K=5 or K=10 the standard choice in practice?
Section 9 · Stratified K-Fold¶
The class imbalance problem¶
Standard K-fold splits data randomly. If classes are imbalanced (e.g., 95% negative, 5% positive), some folds may have very few — or zero — positive examples. Stratified K-fold preserves the class ratio in each fold.
9.1 · Worked — Stratified vs Standard on Imbalanced Data¶
🧱
from sklearn.datasets import make_classification
X_imb, y_imb = make_classification(
n_samples=500, n_features=10,
weights=[0.95, 0.05],
random_state=42
)
print(f'Class counts: {np.bincount(y_imb)} (ratio 1:{np.bincount(y_imb)[1]/np.bincount(y_imb)[0]:.3f})')
kf_plain = KFold(n_splits=5, shuffle=True, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print('\n--- Standard KFold ---')
for i, (_, val_idx) in enumerate(kf_plain.split(X_imb)):
ratio = y_imb[val_idx].mean()
print(f' Fold {i+1}: positive rate = {ratio:.3f}')
print('\n--- Stratified KFold ---')
for i, (_, val_idx) in enumerate(skf.split(X_imb, y_imb)):
ratio = y_imb[val_idx].mean()
print(f' Fold {i+1}: positive rate = {ratio:.3f}')
✏️ 9.2 · Your Turn — Compare CV Scores¶
Use the imbalanced dataset. Compare KFold vs StratifiedKFold accuracy scores. Then switch to scoring='f1' — which gives a more honest picture of performance on the minority class?
pipe_imb = Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression(max_iter=500, random_state=0)),
])
# Standard KFold
acc_kf = cross_val_score(pipe_imb, X_imb, y_imb, cv=___, scoring='accuracy') # fill in: kf_plain
f1_kf = cross_val_score(pipe_imb, X_imb, y_imb, cv=kf_plain, scoring=___) # fill in: 'f1'
# Stratified KFold
acc_skf = cross_val_score(pipe_imb, X_imb, y_imb, cv=___, scoring='accuracy') # fill in: skf
f1_skf = cross_val_score(pipe_imb, X_imb, y_imb, cv=skf, scoring='f1')
print(f'Standard — Accuracy: {acc_kf.mean():.4f} ± {acc_kf.std():.4f} | F1: {f1_kf.mean():.4f} ± {f1_kf.std():.4f}')
print(f'Stratified — Accuracy: {acc_skf.mean():.4f} ± {acc_skf.std():.4f} | F1: {f1_skf.mean():.4f} ± {f1_skf.std():.4f}')
# 💬 Which metric better captures performance on the rare class?
Section 10 · Full CV Pipeline¶
Best practices for reporting cross-validation results¶
- Always shuffle before splitting (unless data is a time series).
- Put all preprocessing inside the pipeline — scaler, imputer, encoder — so they are refitted on each fold's training data only.
- Report mean ± std across folds, not just the mean.
- Use `StratifiedKFold` for classification; `KFold` for regression.
- Keep a final held-out test set separate — CV is for model selection, not final reporting.
🔑 The Golden Rule: Cross-validation is a tool for model selection (choosing K, comparing algorithms). Your held-out test set should be touched once, at the very end, to report final performance. Quoting the CV mean ± std as a performance estimate is fine; using the test set to tune is leakage.
10.1 · Worked — The Complete Recommended Workflow¶
🧱
from sklearn.metrics import classification_report
# ── 1. Load ────────────────────────────────────────────────────────────────
data = load_breast_cancer()
X, y = data.data, data.target
# ── 2. Hold out a final test set — touch this ONLY at the very end ─────────
X_dev, X_test, y_dev, y_test = train_test_split(
X, y, test_size=0.15, random_state=42, stratify=y
)
print(f'Dev set : {X_dev.shape[0]} samples')
print(f'Test set: {X_test.shape[0]} samples (locked away)\n')
# ── 3. Build a pipeline (scaler + model) ──────────────────────────────────
pipe = Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression(max_iter=1000, C=1.0, random_state=0)),
])
# ── 4. Cross-validate on the dev set only ─────────────────────────────────
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_acc = cross_val_score(pipe, X_dev, y_dev, cv=skf, scoring='accuracy')
cv_f1 = cross_val_score(pipe, X_dev, y_dev, cv=skf, scoring='f1')
print('10-fold CV on dev set:')
print(f' Accuracy : {cv_acc.mean():.4f} ± {cv_acc.std(ddof=1):.4f}')
print(f' F1 score : {cv_f1.mean():.4f} ± {cv_f1.std(ddof=1):.4f}')
# ── 5. Refit on ALL dev data and report test performance ──────────────────
# The pipeline already contains the scaler, so fit it directly on the raw dev data
pipe.fit(X_dev, y_dev)
y_pred = pipe.predict(X_test)
print('\n── Final test set performance ──')
print(classification_report(y_test, y_pred, target_names=data.target_names))
✏️ 10.2 · Your Turn — Compare Two Models with CV¶
Use the breast cancer dev set. Compare LogisticRegression(C=1.0) vs LogisticRegression(C=0.01) using 10-fold stratified CV. Which regularisation level generalises better?
models = {
'LogReg C=1.0' : Pipeline([('sc', StandardScaler()), ('clf', LogisticRegression(C=1.0, max_iter=1000))]),
'LogReg C=0.01': Pipeline([('sc', StandardScaler()), ('clf', LogisticRegression(C=___, max_iter=1000))]), # fill in
}
skf10 = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
print(f'{"Model":<20} {"Mean Acc":>10} {"Std":>8} {"Mean F1":>10} {"Std":>8}')
print('-' * 60)
for name, model in models.items():
acc = cross_val_score(___, X_dev, y_dev, cv=skf10, scoring='accuracy') # fill in: model
f1 = cross_val_score(model, X_dev, y_dev, cv=___, scoring='f1') # fill in: skf10
print(f'{name:<20} {acc.mean():>10.4f} {acc.std():>8.4f} {f1.mean():>10.4f} {f1.std():>8.4f}')
# 💬 Which model would you choose? Would you look at accuracy or F1?
🏁 Summary¶
| Section | What you learned | Key rule |
|---|---|---|
| 1–2 | Sample spaces, events, Kolmogorov axioms, Bayes' theorem | All probability derives from three axioms |
| 3 | Discrete (PMF/CDF) and continuous (PDF) random variables | Every dataset row is a draw from some distribution |
| 4 | Bernoulli, Binomial, Uniform, Normal distributions | Know the parameters and where each appears in ML |
| 5 | Expectation, variance, Law of Large Numbers | More data → sample mean converges to true mean |
| 6 | Normal distribution, Central Limit Theorem | Average many iid RVs → Normal, regardless of origin |
| 7 | Single-split variance problem | One test score has high variance — never trust just one |
| 8 | K-fold cross-validation from scratch and with sklearn | Report mean ± std across K folds |
| 9 | Stratified K-fold for class imbalance | Always stratify for classification tasks |
| 10 | Full recommended CV pipeline | CV on dev set; test set touched once, at the very end |
The central lesson:
Probability gives us the language to describe where data comes from. Cross-validation gives us an honest way to measure how well a model has learned from it. Together, they are the mathematical and empirical backbone of every trustworthy machine learning experiment.
MA2221 — Foundational Mathematics for Machine Learning · Mahindra University