Probability¶
Course: MA2221 · Mahindra University
Reference: Mathematics for Machine Learning, Deisenroth, Faisal & Ong — Ch 6
Two questions drive this lab:
How does probability give us the language to describe uncertainty in data and models?
Why is a single train/test split unreliable, and how does K-fold fix that?
Every machine learning algorithm assumes data is drawn from some probability distribution. Understanding that distribution — and honestly measuring how well we generalise to it — is the foundation of principled machine learning.
Structure¶
| Section | Topic |
|---|---|
| 1 | Sample spaces and events — the building blocks |
| 2 | Probability axioms and rules (sum, product, Bayes) |
| 3 | Random variables — discrete and continuous |
| 4 | Key distributions: Bernoulli, Binomial, Uniform, Gaussian |
| 5 | Expectation, variance, and the Law of Large Numbers |
| 6 | The Normal distribution and the Central Limit Theorem |
| 7 | Why one train/test split gives a noisy estimate |
| 8 | K-fold cross-validation — theory and implementation |
| 9 | Stratified K-fold for imbalanced data |
| 10 | Full CV pipeline — choosing K, reporting results |
Legend¶
- 🧱 Worked — run and read
- ✏️ Your Turn — fill in
- 🔬 Experiment — change numbers and observe
- 💬 Discuss — no single right answer
0 · Setup¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import datasets
from sklearn.model_selection import (
train_test_split, KFold, StratifiedKFold, cross_val_score
)
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
np.random.seed(42)
plt.rcParams.update({
'figure.dpi' : 120,
'axes.spines.top' : False,
'axes.spines.right': False,
'font.size' : 12,
})
print('All imports OK ✓')
Section 1 · Sample Spaces & Events¶
What is a probability space?¶
Every probabilistic experiment is described by three objects — the probability triple:
- Sample space Ω — the set of all possible outcomes.
- Event space 𝒜 — the collection of subsets of Ω we assign probabilities to.
- Probability measure P — a function that maps each event to a number in [0, 1].
$$( \Omega,\, \mathcal{A},\, P )$$
Example. Rolling a fair die: $\Omega = \{1, 2, 3, 4, 5, 6\}$. The event "rolling an even number" is $A = \{2, 4, 6\} \subseteq \Omega$, and $P(A) = 3/6 = 0.5$.
1.1 · Worked — Simulating a Sample Space¶
🧱 Run and read.
# Simulate rolling a fair six-sided die 100 000 times
n_rolls = 100_000
outcomes = np.random.randint(1, 7, size=n_rolls) # Ω = {1, 2, 3, 4, 5, 6}
# Empirical probability of each face
faces, counts = np.unique(outcomes, return_counts=True)
empirical_p = counts / n_rolls
fig, ax = plt.subplots(figsize=(7, 3.5))
ax.bar(faces, empirical_p, color='steelblue', edgecolor='white')
ax.axhline(1/6, color='crimson', lw=1.5, linestyle='--', label='True P = 1/6')
ax.set_xlabel('Face'); ax.set_ylabel('Empirical probability')
ax.set_title(f'Die roll simulation ({n_rolls:,} rolls)')
ax.legend(); plt.tight_layout(); plt.show()
# Event A = {2, 4, 6}: rolling an even number
A = outcomes[outcomes % 2 == 0]
print(f'P(even) ≈ {len(A)/n_rolls:.4f} (true = 0.5000)')
✏️ 1.2 · Your Turn — Complement and Union¶
Use the same outcomes array to estimate:
- $P(A^c)$ — the probability of rolling an odd number (complement of A).
- $P(B)$ where $B = \{5, 6\}$ — rolling a 5 or 6.
- $P(A \cup B)$ — rolling an even number or a 5 or 6.
- Verify: does $P(A \cup B) = P(A) + P(B) - P(A \cap B)$?
# 1. Complement: odd rolls
p_A = (outcomes % 2 == 0).mean()
p_Ac = ___ # fill in: 1 - p_A or (outcomes % 2 != 0).mean()
print(f'P(A) = {p_A:.4f}')
print(f'P(Ac) = {p_Ac:.4f} (should ≈ 0.5000)')
# 2. Event B = {5, 6}
p_B = np.isin(outcomes, ___).mean() # fill in: [5, 6]
print(f'P(B) = {p_B:.4f} (should ≈ {2/6:.4f})')
# 3. A ∪ B (A = even, B = {5,6})
mask_A = outcomes % 2 == 0
mask_B = np.isin(outcomes, [5, 6])
p_A_or_B = (mask_A ___ mask_B).mean() # fill in: | (bitwise OR)
print(f'P(A∪B) = {p_A_or_B:.4f}')
# 4. Inclusion–exclusion check
p_A_and_B = (mask_A ___ mask_B).mean() # fill in: &
p_union_formula = p_A + p_B - p_A_and_B
print(f'Formula P(A)+P(B)-P(A∩B) = {p_union_formula:.4f} — matches? {np.isclose(p_A_or_B, p_union_formula, atol=0.005)}')
Section 2 · Probability Axioms & Rules¶
The three axioms (Kolmogorov, 1933)¶
All of probability theory follows from three axioms:
- Nonnegativity: $P(A) \ge 0$ for every event $A$.
- Normalisation: $P(\Omega) = 1$.
- Countable additivity: For disjoint events $A_1, A_2, \dots$, $P\!\left(\bigcup_i A_i\right) = \sum_i P(A_i)$.
From these three axioms, everything else derives:
- Complement rule: $P(A^c) = 1 - P(A)$
- Inclusion–exclusion: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- Product rule: $P(A \cap B) = P(A \mid B)\,P(B)$
- Bayes' theorem: $\displaystyle P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
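The product rule above is easy to check empirically. Below is a minimal sketch that reuses the `outcomes` array from the die simulation in Section 1 (assumed to still be in memory), with A = even number and B = {5, 6}; the conditional probability P(A | B) is just the relative frequency of A inside the subset where B occurred.
# Empirical check of the product rule P(A ∩ B) = P(A | B) · P(B)
# (assumes the `outcomes` array from Section 1 is still defined)
mask_A = outcomes % 2 == 0           # A: even number
mask_B = np.isin(outcomes, [5, 6])   # B: a 5 or a 6
P_B = mask_B.mean()
P_A_given_B = mask_A[mask_B].mean()  # relative frequency of A within B
P_A_and_B = (mask_A & mask_B).mean()
print(f'P(A ∩ B)     = {P_A_and_B:.4f}')
print(f'P(A|B)·P(B)  = {P_A_given_B * P_B:.4f} (true value = 1/6 ≈ {1/6:.4f})')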
2.1 · Worked — Conditional Probability & Bayes¶
🧱 A medical test for a disease has sensitivity $P(\text{pos} \mid \text{disease}) = 0.99$ and specificity $P(\text{neg} \mid \text{no disease}) = 0.95$. The disease prevalence is $P(\text{disease}) = 0.001$. What is $P(\text{disease} \mid \text{positive test})$?
# Bayes' theorem — medical test example
P_disease = 0.001 # prior: prevalence
P_no_disease = 1 - P_disease
P_pos_given_disease = 0.99 # sensitivity
P_pos_given_no_disease = 0.05 # 1 - specificity
# Law of total probability: P(positive test)
P_pos = (P_pos_given_disease * P_disease
+ P_pos_given_no_disease * P_no_disease)
# Bayes' theorem: posterior
P_disease_given_pos = (P_pos_given_disease * P_disease) / P_pos
print(f'P(positive test) = {P_pos:.5f}')
print(f'P(disease | positive test) = {P_disease_given_pos:.4f} ({P_disease_given_pos*100:.2f}%)')
print()
print('Interpretation: even with a 99% sensitive test, a positive result')
print('only means ~2% chance of having the disease (because prevalence is very low).')
print('This is the base-rate fallacy — always account for the prior!')
✏️ 2.2 · Your Turn — Simulate Conditional Probability¶
A dataset has two features: age group (young / old) and label (positive / negative). Simulate 10 000 records and estimate $P(\text{positive} \mid \text{old})$ empirically.
- 60% young, 40% old.
- Positive rate: 20% if young, 50% if old.
n = 10_000
rng = np.random.default_rng(7)
age = rng.choice(['young', 'old'], size=n, p=[0.6, 0.4])
p_pos = np.where(age == 'old', 0.5, 0.2)
label = rng.binomial(1, p_pos) # 1 = positive
# Estimate P(positive | old)
mask_old = age == 'old'
p_pos_given_old = ___ # fill in: label[mask_old].mean()
# Estimate P(positive | young)
mask_young = ___ # fill in: age == 'young'
p_pos_given_young = ___ # fill in: label[mask_young].mean()
print(f'P(positive | old) ≈ {p_pos_given_old:.3f} (true = 0.50)')
print(f'P(positive | young) ≈ {p_pos_given_young:.3f} (true = 0.20)')
# Marginal: P(positive)
p_pos_marginal = ___ # fill in: label.mean()
print(f'P(positive) ≈ {p_pos_marginal:.3f} (true = 0.6*0.2 + 0.4*0.5 = {0.6*0.2+0.4*0.5:.2f})')
Section 3 · Random Variables¶
Discrete vs continuous¶
A random variable $X$ is a function $X : \Omega \to \mathbb{R}$ that maps each outcome to a number. We distinguish two types:
- Discrete RV — takes countable values. Described by a probability mass function (PMF): $P(X = k)$.
- Continuous RV — takes values in an interval. Described by a probability density function (PDF): $f(x)$, where $P(a \le X \le b) = \int_a^b f(x)\,dx$.
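For the continuous case, the statement $P(a \le X \le b) = \int_a^b f(x)\,dx$ can be checked numerically. The following is a minimal sketch (not part of the lab tasks) for a standard Normal, comparing numerical integration of the PDF, the CDF difference, and a simulation; all three should agree to roughly three decimals.
from scipy import integrate
# P(-1 ≤ X ≤ 1) for X ~ Normal(0, 1), computed three ways
a, b = -1, 1
area, _ = integrate.quad(stats.norm.pdf, a, b)      # ∫ f(x) dx from a to b
via_cdf = stats.norm.cdf(b) - stats.norm.cdf(a)     # F(b) - F(a)
draws = np.random.normal(size=100_000)
via_sim = ((draws >= a) & (draws <= b)).mean()      # relative frequency
print(f'∫ pdf = {area:.4f}, CDF difference = {via_cdf:.4f}, simulation ≈ {via_sim:.4f} (true ≈ 0.6827)')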
3.1 · Worked — Empirical PMF from Data¶
🧱 Discrete RV: number of heads in 10 fair coin flips.
n_experiments = 50_000
X = np.random.binomial(n=10, p=0.5, size=n_experiments)
k_vals = np.arange(0, 11)
empirical_pmf = np.array([(X == k).mean() for k in k_vals])
true_pmf = stats.binom.pmf(k_vals, n=10, p=0.5)
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(k_vals - 0.2, empirical_pmf, width=0.35, color='steelblue', label='Empirical')
ax.bar(k_vals + 0.2, true_pmf, width=0.35, color='crimson', alpha=0.7, label='True Binomial(10, 0.5)')
ax.set_xlabel('k (number of heads)')
ax.set_ylabel('P(X = k)')
ax.set_title(f'Empirical vs true PMF ({n_experiments:,} experiments)')
ax.legend(); plt.tight_layout(); plt.show()
✏️ 3.2 · Your Turn — Empirical CDF¶
The cumulative distribution function (CDF) is $F(x) = P(X \le x)$.
- Compute the empirical CDF of `X` at each value $k = 0, \dots, 10$.
- Compare with the true CDF from `stats.binom.cdf`.
- What does $F(5)$ tell you?
empirical_cdf = np.array([(X <= k).mean() for k in k_vals])
true_cdf = stats.binom.cdf(___, n=10, p=0.5) # fill in: k_vals
fig, ax = plt.subplots(figsize=(8, 4))
ax.step(k_vals, empirical_cdf, where='post', color='steelblue', lw=2, label='Empirical CDF')
ax.step(k_vals, true_cdf, where='post', color='crimson', lw=2, linestyle='--', label='True CDF')
ax.set_xlabel('k'); ax.set_ylabel('F(k) = P(X ≤ k)')
ax.set_title('CDF of Binomial(10, 0.5)'); ax.legend()
plt.tight_layout(); plt.show()
print(f'P(X ≤ 5) empirical = {___:.4f}') # fill in: empirical_cdf[5]
print(f'P(X ≤ 5) true = {true_cdf[5]:.4f}')
# 💬 What does F(5) = 0.623 mean in plain language?
Section 4 · Key Distributions¶
| Distribution | Type | Parameters | Where it appears |
|---|---|---|---|
| Bernoulli($p$) | Discrete | $p \in [0,1]$ | Binary labels, logistic regression outputs |
| Binomial($n, p$) | Discrete | $n \in \mathbb{N},\; p \in [0,1]$ | Count of successes; bootstrap |
| Uniform($a, b$) | Continuous | $a < b$ | Random initialisation, random splits |
| Normal($\mu, \sigma^2$) | Continuous | $\mu \in \mathbb{R},\; \sigma > 0$ | Weight init, noise models, CLT |
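Each row of this table maps to a scipy.stats distribution object. As a minimal sketch, the frozen-distribution interface below prints the theoretical mean and variance of each (note that scipy parameterises Uniform(a, b) as loc=a, scale=b-a, and Normal by its standard deviation rather than its variance):
# Theoretical mean and variance via scipy's frozen distributions
dists = {
    'Bernoulli(0.3)':    stats.bernoulli(p=0.3),
    'Binomial(20, 0.4)': stats.binom(n=20, p=0.4),
    'Uniform(0, 1)':     stats.uniform(loc=0, scale=1),
    'Normal(0, 1)':      stats.norm(loc=0, scale=1),
}
for name, d in dists.items():
    print(f'{name:<18} mean = {d.mean():7.3f}   var = {d.var():7.3f}')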
4.1 · Worked — Visualising the Four Distributions¶
🧱
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
# 1. Bernoulli(p=0.3)
axes[0].bar([0, 1], [0.7, 0.3], color=['steelblue', 'crimson'])
axes[0].set_xticks([0, 1]); axes[0].set_xticklabels(['0 (fail)', '1 (success)'])
axes[0].set_title('Bernoulli(p=0.3)')
# 2. Binomial(20, 0.4)
k = np.arange(0, 21)
axes[1].bar(k, stats.binom.pmf(k, 20, 0.4), color='steelblue')
axes[1].set_title('Binomial(20, 0.4)')
# 3. Uniform(0, 1)
x = np.linspace(-0.2, 1.2, 400)
axes[2].plot(x, stats.uniform.pdf(x), color='steelblue', lw=2)
axes[2].fill_between(x, stats.uniform.pdf(x), alpha=0.2, color='steelblue')
axes[2].set_title('Uniform(0, 1)')
# 4. Normal(0, 1)
x = np.linspace(-4, 4, 400)
axes[3].plot(x, stats.norm.pdf(x), color='steelblue', lw=2)
axes[3].fill_between(x, stats.norm.pdf(x), alpha=0.2, color='steelblue')
axes[3].set_title('Normal(μ=0, σ²=1)')
for ax in axes:
ax.set_ylabel('P(X=k) or f(x)', fontsize=9)
plt.suptitle('Four fundamental distributions', fontsize=13)
plt.tight_layout(); plt.show()
✏️ 4.2 · Your Turn — Changing Distribution Parameters¶
How does the shape of Binomial($n, p$) change as $p$ varies?
- Plot Binomial($20, p$) for $p \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$ on the same axes.
- For which value of $p$ is the distribution most symmetric?
- What happens to the mean as $p$ increases? (Recall: $\mathbb{E}[X] = np$.)
p_values = [0.1, 0.3, 0.5, 0.7, 0.9]
n_trials = 20
k = np.arange(0, n_trials + 1)
fig, ax = plt.subplots(figsize=(9, 4))
colors = plt.cm.coolwarm(np.linspace(0, 1, len(p_values)))
for p, c in zip(p_values, colors):
pmf = stats.binom.pmf(___, n=n_trials, p=___) # fill in: k, p
ax.plot(k, pmf, 'o-', color=c, lw=1.5, ms=4, label=f'p={p}')
ax.axvline(n_trials * p, color=c, lw=0.7, linestyle=':') # mark the mean
ax.set_xlabel('k'); ax.set_ylabel('P(X = k)')
ax.set_title(f'Binomial({n_trials}, p) for varying p')
ax.legend(fontsize=10)
plt.tight_layout(); plt.show()
# 💬 Which p gives a symmetric distribution? Why?
Section 5 · Expectation & Variance¶
$$ \mathbb{E}[X] = \sum_k k\,P(X=k) \quad \text{(discrete)} \qquad \mathbb{E}[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx \quad \text{(continuous)} $$
$$ \operatorname{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 $$
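These definitions can be evaluated directly from a PMF. A minimal sketch for the fair die (values 1 to 6, each with probability 1/6), whose expectation reappears in the next cell:
# E[X] and Var(X) computed directly from the PMF of a fair die
k = np.arange(1, 7)
p = np.full(6, 1/6)
E_X = np.sum(k * p)              # Σ k · P(X = k)
E_X2 = np.sum(k**2 * p)          # Σ k² · P(X = k)
Var_X = E_X2 - E_X**2            # E[X²] - (E[X])²
print(f'E[X]   = {E_X:.4f} (expected 3.5)')
print(f'Var(X) = {Var_X:.4f} (expected 35/12 ≈ {35/12:.4f})')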
5.1 · Worked — Law of Large Numbers¶
🧱 As $n \to \infty$, the sample mean $\bar{X}_n$ converges to $\mathbb{E}[X]$. Let's watch it happen.
# Roll a fair die. E[X] = (1+2+3+4+5+6)/6 = 3.5
true_mean = 3.5
n_max = 10_000
rolls = np.random.randint(1, 7, size=n_max)
cumul_mean = np.cumsum(rolls) / np.arange(1, n_max + 1)
fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(cumul_mean, color='steelblue', lw=1.2, label='Running mean')
ax.axhline(true_mean, color='crimson', lw=1.5, linestyle='--', label=f'E[X] = {true_mean}')
ax.set_xscale('log')
ax.set_xlabel('Number of rolls (log scale)'); ax.set_ylabel('Sample mean')
ax.set_title('Law of Large Numbers — die roll')
ax.legend(); plt.tight_layout(); plt.show()
✏️ 5.2 · Your Turn — Estimating Variance¶
For Binomial($n=10,\; p=0.3$): true mean $= np = 3$, true variance $= np(1-p) = 2.1$.
- Draw 50 000 samples.
- Compute the sample mean $\bar{x}$ and sample variance $s^2$ using `np.mean` and `np.var(ddof=1)`.
- Compare to the true values.
- Why do we use `ddof=1` (Bessel's correction)?
samples = np.random.binomial(n=10, p=0.3, size=50_000)
sample_mean = samples.___() # fill in: mean
sample_var = samples.___(ddof=1) # fill in: var (Bessel-corrected)
print(f'Sample mean = {sample_mean:.4f} (true = {10*0.3:.4f})')
print(f'Sample variance = {sample_var:.4f} (true = {10*0.3*0.7:.4f})')
# 💬 What changes if you use ddof=0? How large is the correction for n=50000?
Section 6 · The Normal Distribution & Central Limit Theorem¶
The Central Limit Theorem (CLT) says: the sum (or mean) of many independent random variables converges to a Normal distribution — regardless of the original distribution.
$$ \sqrt{n}\,\left(\bar{X}_n - \mu\right) \;\xrightarrow{d}\; \mathcal{N}(0,\, \sigma^2) \qquad \text{as } n \to \infty $$
6.1 · Worked — Watching the CLT¶
🧱 We sample from a highly skewed exponential distribution and watch the sample means become Normal as sample size grows.
# Exponential distribution with λ=1 — very skewed (mean=1, std=1)
n_experiments = 20_000
sample_sizes = [1, 4, 16, 64]
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, n in zip(axes, sample_sizes):
X = np.random.exponential(scale=1, size=(n_experiments, n))
means = X.mean(axis=1) # shape: (n_experiments,)
ax.hist(means, bins=60, density=True, color='steelblue', alpha=0.8)
# Overlay the CLT prediction: N(μ=1, σ²=1/n)
xr = np.linspace(means.min(), means.max(), 300)
ax.plot(xr, stats.norm.pdf(xr, loc=1, scale=1/np.sqrt(n)),
color='crimson', lw=2, label=f'N(1, 1/{n})')
ax.set_title(f'n = {n}'); ax.legend(fontsize=9)
plt.suptitle('Central Limit Theorem — means of Exponential(1) samples', fontsize=13)
plt.tight_layout(); plt.show()
💡 Connection to ML: The CLT explains why validation metrics (MSE, accuracy) computed on a test set are approximately normally distributed. This is the theoretical backbone of confidence intervals for model performance — and why averaging over K folds gives a tighter, better-calibrated estimate.
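As a minimal sketch of that idea: test accuracy is a mean of 0/1 losses, so by the CLT it is approximately Normal with standard error $\sqrt{\hat{p}(1-\hat{p})/n}$. The numbers below are purely hypothetical and only illustrate the calculation.
# Approximate 95% confidence interval for a test-set accuracy (hypothetical numbers)
p_hat = 0.94        # measured test accuracy (hypothetical)
n_test = 114        # test-set size (hypothetical)
se = np.sqrt(p_hat * (1 - p_hat) / n_test)   # CLT standard error of a mean of 0/1 losses
print(f'Accuracy = {p_hat:.3f}, 95% CI ≈ [{p_hat - 1.96*se:.3f}, {p_hat + 1.96*se:.3f}]')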
🔬 6.2 · Experiment — How Many Samples Until Normal?¶
Use a Uniform distribution (less skewed than Exponential). Try n ∈ {1, 2, 5, 20}. At what n does the Normal approximation look convincing?
sample_sizes_exp = [1, 2, 5, 20] # ← try changing these
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, n in zip(axes, sample_sizes_exp):
X = np.random.uniform(0, 1, size=(20_000, n))
means = X.mean(axis=1) # E[Uniform(0,1)] = 0.5, Var = 1/12
ax.hist(means, bins=60, density=True, color='steelblue', alpha=0.8)
xr = np.linspace(means.min(), means.max(), 300)
ax.plot(xr, stats.norm.pdf(xr, loc=0.5, scale=np.sqrt(1/(12*n))),
color='crimson', lw=2)
ax.set_title(f'n = {n}')
plt.suptitle('CLT — means of Uniform(0,1) samples', fontsize=13)
plt.tight_layout(); plt.show()
# 💬 At what n does the Normal approximation become convincing here vs Exponential?
Section 7 · Why One Split Is Not Enough¶
The single-split problem¶
A single random split has a critical weakness: the test score is a random variable — it depends on which samples happened to land in the test set. With a small dataset, this variance can be enormous.
$$ \widehat{\text{Err}}_{\text{test}} = \frac{1}{|S_{\text{test}}|} \sum_{i \in S_{\text{test}}} \ell(f(\mathbf{x}_i),\, y_i) $$
This is an unbiased estimator of generalisation error — but it can have high variance.
7.1 · Worked — Variance of a Single Split¶
🧱
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
data = load_breast_cancer()
X, y = data.data, data.target
print(f'Dataset: {X.shape[0]} samples, {X.shape[1]} features, {np.unique(y).size} classes')
# Repeat a single 80/20 split 200 times with different random seeds
n_repeats = 200
test_scores = []
for seed in range(n_repeats):
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000, random_state=0)
clf.fit(scaler.transform(X_tr), y_tr)
test_scores.append(accuracy_score(y_te, clf.predict(scaler.transform(X_te))))
test_scores = np.array(test_scores)
print(f'\nSingle-split test accuracy over {n_repeats} random seeds:')
print(f' Mean : {test_scores.mean():.4f}')
print(f' Std : {test_scores.std():.4f}')
print(f' Range : [{test_scores.min():.4f}, {test_scores.max():.4f}]')
fig, ax = plt.subplots(figsize=(8, 3.5))
ax.hist(test_scores, bins=30, color='steelblue', edgecolor='white')
ax.axvline(test_scores.mean(), color='crimson', lw=2, linestyle='--',
label=f'Mean={test_scores.mean():.3f}')
ax.set_xlabel('Test accuracy'); ax.set_ylabel('Count')
ax.set_title('Distribution of single-split test accuracy across 200 seeds')
ax.legend(); plt.tight_layout(); plt.show()
⚠️ The problem with one split: A single test score can vary by several percentage points depending on the random seed. Reporting one number without a confidence interval is misleading. We need to average over multiple splits.
Section 8 · K-Fold Cross-Validation¶
The idea¶
Instead of one split, partition the data into $K$ equal folds. Train on $K-1$ folds and evaluate on the remaining one — repeat $K$ times, rotating which fold is the validation set.
Fold 1: [VAL] [---] [---] [---] [---]
Fold 2: [---] [VAL] [---] [---] [---]
Fold 3: [---] [---] [VAL] [---] [---]
Fold 4: [---] [---] [---] [VAL] [---]
Fold 5: [---] [---] [---] [---] [VAL]
$$ \widehat{\text{Err}}_{\text{CV}} = \frac{1}{K} \sum_{k=1}^{K} \text{Err}_k \qquad \widehat{\sigma}_{\text{CV}} = \sqrt{\frac{1}{K-1}\sum_{k=1}^{K}\left(\text{Err}_k - \widehat{\text{Err}}_{\text{CV}}\right)^2} $$
8.1 · Worked — K-Fold from Scratch¶
🧱 Manual 5-fold CV to see exactly what is happening inside.
K = 5
n = len(X)
idx = np.arange(n)
np.random.shuffle(idx)
fold_size = n // K
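# Note: with this simple scheme the last n % K samples never land in a validation fold;
# sklearn's KFold (used in 8.2) instead spreads the remainder across the first folds.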
fold_accs = []
for fold in range(K):
# Validation indices
val_idx = idx[fold * fold_size : (fold + 1) * fold_size]
# Training indices = everything else
train_idx = np.concatenate([idx[:fold * fold_size], idx[(fold + 1) * fold_size:]])
X_tr, y_tr = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000, random_state=0)
clf.fit(scaler.transform(X_tr), y_tr)
acc = accuracy_score(y_val, clf.predict(scaler.transform(X_val)))
fold_accs.append(acc)
print(f' Fold {fold+1}: val_size={len(val_idx)}, accuracy={acc:.4f}')
print(f'\nCV mean = {np.mean(fold_accs):.4f}')
print(f'CV std = {np.std(fold_accs, ddof=1):.4f}')
8.2 · Worked — Using sklearn's KFold¶
🧱 sklearn handles the fold indices for us.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_accs_sk = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
X_tr, y_tr = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000, random_state=0)
clf.fit(scaler.transform(X_tr), y_tr)
acc = accuracy_score(y_val, clf.predict(scaler.transform(X_val)))
fold_accs_sk.append(acc)
scores = np.array(fold_accs_sk)
print(f'5-fold CV accuracy: {scores.mean():.4f} ± {scores.std(ddof=1):.4f}')
fig, ax = plt.subplots(figsize=(6, 3.5))
ax.bar(np.arange(1, K+1), scores, color='steelblue', edgecolor='white')
ax.axhline(scores.mean(), color='crimson', lw=2, linestyle='--', label=f'Mean = {scores.mean():.3f}')
ax.set_xlabel('Fold'); ax.set_ylabel('Validation accuracy')
ax.set_title('5-fold CV — per-fold accuracy')
ax.set_ylim(0.9, 1.0); ax.legend(); plt.tight_layout(); plt.show()
✏️ 8.3 · Your Turn — cross_val_score (the one-liner)¶
sklearn.model_selection.cross_val_score wraps the loop above into a single call.
- What does the `scoring` parameter accept? Try `'accuracy'` and `'f1'`.
- What does the `cv` parameter accept — an integer or a `KFold` object?
- Why is passing a `Pipeline` (scaler + model) safer than scaling outside the loop? (A sketch follows the exercise code below.)
# Build a Pipeline — scaler + classifier in one object
pipe = Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression(max_iter=1000, random_state=0)),
])
# cross_val_score with 5-fold CV
cv_scores = cross_val_score(
___, # fill in: pipe
X, y,
cv = ___, # fill in: 5 (or a KFold object)
scoring = ___, # fill in: 'accuracy'
)
print(f'CV scores (per fold): {cv_scores.round(4)}')
print(f'Mean ± Std : {cv_scores.mean():.4f} ± {cv_scores.std(ddof=1):.4f}')
# Now try scoring='f1' — what changes?
cv_f1 = cross_val_score(pipe, X, y, cv=5, scoring=___)
print(f'\nCV F1 : {cv_f1.mean():.4f} ± {cv_f1.std(ddof=1):.4f}')
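On the last question above, here is a minimal sketch of why the Pipeline matters. Scaling the whole dataset before cross-validation lets every training fold see statistics (mean, std) computed partly from its validation fold, which is leakage; with the Pipeline, the scaler is refit on each fold's training portion only. On this dataset the numerical gap is usually tiny, but only the second version is methodologically sound.
# Leaky version: scaler fit on ALL the data before cross-validation
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000, random_state=0),
                               X_leaky, y, cv=5, scoring='accuracy')
# Honest version: scaler refit inside every fold via the Pipeline
honest_scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f'Leaky  CV accuracy : {leaky_scores.mean():.4f} ± {leaky_scores.std(ddof=1):.4f}')
print(f'Honest CV accuracy : {honest_scores.mean():.4f} ± {honest_scores.std(ddof=1):.4f}')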
🔬 8.4 · Experiment — How Does K Affect the Estimate?¶
Vary $K \in \{2, 3, 5, 10, \text{LOO}\}$ and observe how the mean and variance of the CV estimate change.
K_values = [2, 3, 5, 10, len(X)] # last entry = LOO
results = {}
for K_val in K_values:
kf = KFold(n_splits=K_val, shuffle=True, random_state=42)
sc = cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')
results[K_val] = sc
label = 'LOO' if K_val == len(X) else f'K={K_val}'
print(f'{label:6s}: mean={sc.mean():.4f} std={sc.std(ddof=1):.4f}')
# Plot mean ± std
fig, ax = plt.subplots(figsize=(8, 4))
labels = ['LOO' if k == len(X) else f'K={k}' for k in K_values]
means = [results[k].mean() for k in K_values]
stds = [results[k].std(ddof=1) for k in K_values]
ax.errorbar(labels, means, yerr=stds, fmt='o-', capsize=5, color='steelblue', lw=2)
ax.set_xlabel('Number of folds K'); ax.set_ylabel('CV accuracy')
ax.set_title('CV estimate mean ± std vs K')
plt.tight_layout(); plt.show()
# 💬 Why does variance decrease then (for LOO) potentially increase again?
# 💬 Why is K=5 or K=10 the standard choice in practice?
Section 9 · Stratified K-Fold¶
The class imbalance problem¶
Standard K-fold splits data randomly. If classes are imbalanced (e.g., 95% negative, 5% positive), some folds may have very few — or zero — positive examples. Stratified K-fold preserves the class ratio in each fold.
9.1 · Worked — Stratified vs Standard on Imbalanced Data¶
🧱
from sklearn.datasets import make_classification
X_imb, y_imb = make_classification(
n_samples=500, n_features=10,
weights=[0.95, 0.05],
random_state=42
)
print(f'Class counts: {np.bincount(y_imb)} (ratio 1:{np.bincount(y_imb)[1]/np.bincount(y_imb)[0]:.3f})')
kf_plain = KFold(n_splits=5, shuffle=True, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print('\n--- Standard KFold ---')
for i, (_, val_idx) in enumerate(kf_plain.split(X_imb)):
ratio = y_imb[val_idx].mean()
print(f' Fold {i+1}: positive rate = {ratio:.3f}')
print('\n--- Stratified KFold ---')
for i, (_, val_idx) in enumerate(skf.split(X_imb, y_imb)):
ratio = y_imb[val_idx].mean()
print(f' Fold {i+1}: positive rate = {ratio:.3f}')
✏️ 9.2 · Your Turn — Compare CV Scores¶
Use the imbalanced dataset. Compare KFold vs StratifiedKFold accuracy scores. Then switch to scoring='f1' — which gives a more honest picture of performance on the minority class?
pipe_imb = Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression(max_iter=500, random_state=0)),
])
# Standard KFold
acc_kf = cross_val_score(pipe_imb, X_imb, y_imb, cv=___, scoring='accuracy') # fill in: kf_plain
f1_kf = cross_val_score(pipe_imb, X_imb, y_imb, cv=kf_plain, scoring=___) # fill in: 'f1'
# Stratified KFold
acc_skf = cross_val_score(pipe_imb, X_imb, y_imb, cv=___, scoring='accuracy') # fill in: skf
f1_skf = cross_val_score(pipe_imb, X_imb, y_imb, cv=skf, scoring='f1')
print(f'Standard — Accuracy: {acc_kf.mean():.4f} ± {acc_kf.std():.4f} | F1: {f1_kf.mean():.4f} ± {f1_kf.std():.4f}')
print(f'Stratified — Accuracy: {acc_skf.mean():.4f} ± {acc_skf.std():.4f} | F1: {f1_skf.mean():.4f} ± {f1_skf.std():.4f}')
# 💬 Which metric better captures performance on the rare class?
Section 10 · Full CV Pipeline¶
Best practices for reporting cross-validation results¶
- Always shuffle before splitting (unless data is a time series).
- Put all preprocessing inside the pipeline — scaler, imputer, encoder — so they are refitted on each fold's training data only.
- Report mean ± std across folds, not just the mean.
- Use `StratifiedKFold` for classification; `KFold` for regression.
- Keep a final held-out test set separate — CV is for model selection, not final reporting.
🔑 The Golden Rule: Cross-validation is a tool for model selection (choosing K, comparing algorithms). Your held-out test set should be touched once, at the very end, to report final performance. Quoting the CV mean ± std as a performance estimate is fine; using the test set to tune is leakage.
10.1 · Worked — The Complete Recommended Workflow¶
🧱
from sklearn.metrics import classification_report
# ── 1. Load ────────────────────────────────────────────────────────────────
data = load_breast_cancer()
X, y = data.data, data.target
# ── 2. Hold out a final test set — touch this ONLY at the very end ─────────
X_dev, X_test, y_dev, y_test = train_test_split(
X, y, test_size=0.15, random_state=42, stratify=y
)
print(f'Dev set : {X_dev.shape[0]} samples')
print(f'Test set: {X_test.shape[0]} samples (locked away)\n')
# ── 3. Build a pipeline (scaler + model) ──────────────────────────────────
pipe = Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression(max_iter=1000, C=1.0, random_state=0)),
])
# ── 4. Cross-validate on the dev set only ─────────────────────────────────
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_acc = cross_val_score(pipe, X_dev, y_dev, cv=skf, scoring='accuracy')
cv_f1 = cross_val_score(pipe, X_dev, y_dev, cv=skf, scoring='f1')
print('10-fold CV on dev set:')
print(f' Accuracy : {cv_acc.mean():.4f} ± {cv_acc.std(ddof=1):.4f}')
print(f' F1 score : {cv_f1.mean():.4f} ± {cv_f1.std(ddof=1):.4f}')
# ── 5. Refit on ALL dev data and report test performance ──────────────────
# The pipeline already contains the scaler, so fit it directly on the raw dev data
pipe.fit(X_dev, y_dev)
y_pred = pipe.predict(X_test)
print('\n── Final test set performance ──')
print(classification_report(y_test, y_pred, target_names=data.target_names))
✏️ 10.2 · Your Turn — Compare Two Models with CV¶
Use the breast cancer dev set. Compare LogisticRegression(C=1.0) vs LogisticRegression(C=0.01) using 10-fold stratified CV. Which regularisation level generalises better?
models = {
'LogReg C=1.0' : Pipeline([('sc', StandardScaler()), ('clf', LogisticRegression(C=1.0, max_iter=1000))]),
'LogReg C=0.01': Pipeline([('sc', StandardScaler()), ('clf', LogisticRegression(C=___, max_iter=1000))]), # fill in
}
skf10 = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
print(f'{"Model":<20} {"Mean Acc":>10} {"Std":>8} {"Mean F1":>10} {"Std":>8}')
print('-' * 60)
for name, model in models.items():
acc = cross_val_score(___, X_dev, y_dev, cv=skf10, scoring='accuracy') # fill in: model
f1 = cross_val_score(model, X_dev, y_dev, cv=___, scoring='f1') # fill in: skf10
print(f'{name:<20} {acc.mean():>10.4f} {acc.std():>8.4f} {f1.mean():>10.4f} {f1.std():>8.4f}')
# 💬 Which model would you choose? Would you look at accuracy or F1?
🏁 Summary¶
| Section | What you learned | Key rule |
|---|---|---|
| 1–2 | Sample spaces, events, Kolmogorov axioms, Bayes' theorem | All probability derives from three axioms |
| 3 | Discrete (PMF/CDF) and continuous (PDF) random variables | Every dataset row is a draw from some distribution |
| 4 | Bernoulli, Binomial, Uniform, Normal distributions | Know the parameters and where each appears in ML |
| 5 | Expectation, variance, Law of Large Numbers | More data → sample mean converges to true mean |
| 6 | Normal distribution, Central Limit Theorem | Average many iid RVs → Normal, regardless of origin |
| 7 | Single-split variance problem | One test score has high variance — never trust just one |
| 8 | K-fold cross-validation from scratch and with sklearn | Report mean ± std across K folds |
| 9 | Stratified K-fold for class imbalance | Always stratify for classification tasks |
| 10 | Full recommended CV pipeline | CV on dev set; test set touched once, at the very end |
The central lesson:
Probability gives us the language to describe where data comes from. Cross-validation gives us an honest way to measure how well a model has learned from it. Together, they are the mathematical and empirical backbone of every trustworthy machine learning experiment.
MA2221 — Foundational Mathematics for Machine Learning · Mahindra University