Data Handling for Machine Learning¶
Course: MA2221 · Mahindra University
Reference: Mathematics for Machine Learning, Deisenroth, Faisal & Ong — Ch 9
Two questions drive this entire lab:
How do we load, inspect, and clean real data so it is ready for a model?
Why do we split data into train and test sets — and how do we do it correctly?
Every machine learning pipeline starts here — before any gradients, before any model.
A model is only as good as the data it trains on.
Structure¶
| Section | Topic |
|---|---|
| 1 | Loading data — CSV, NumPy, and sklearn datasets |
| 2 | Inspecting data — shapes, dtypes, missing values, summary statistics |
| 3 | Indexing and slicing — rows, columns, boolean masks |
| 4 | Feature matrix $X$ and target vector $\mathbf{y}$ |
| 5 | Train / test split — why, how, and what goes wrong without it |
| 6 | Feature scaling — standardisation and normalisation |
| 7 | Putting it all together — a clean ML-ready pipeline |
Legend¶
- 🧱 Worked — run and read
- ✏️ Your turn — fill in
- 🔬 Experiment — change numbers and observe
- 💬 Discuss — no single right answer
0 · Setup¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
plt.rcParams.update({
'figure.dpi' : 120,
'axes.spines.top' : False,
'axes.spines.right': False,
'font.size' : 12,
})
print('All imports OK ✓')
All imports OK ✓
Section 1 · Loading Data¶
Where data comes from¶
In practice, data arrives in many forms — CSV files, databases, or bundled datasets from libraries.
In this notebook we use three routes:
- pandas `read_csv` — the most common real-world path
- NumPy `loadtxt` / `genfromtxt` — for plain numerical files (a short sketch follows Section 1.3)
- sklearn built-in datasets — clean, well-documented, great for learning
1.1 · Worked — Loading the California Housing Dataset¶
# sklearn bundles several classic ML datasets — no download needed
from sklearn.datasets import fetch_california_housing
raw = fetch_california_housing(as_frame=True) # as_frame=True gives us a pandas DataFrame
# raw is a Bunch object — a dict-like container
print('Keys in the Bunch object:', raw.keys())
print()
print(raw.DESCR[:600]) # read the dataset description
# Extract the full dataframe (features + target together for inspection)
df = raw.frame
print(f'Shape: {df.shape} ({df.shape[0]} rows × {df.shape[1]} columns)')
df.head()
✏️ 1.2 · Your Turn — Load the Diabetes Dataset¶
sklearn also ships the diabetes dataset (sklearn.datasets.load_diabetes).
It has 442 patients, 10 features (age, sex, bmi, blood pressure, …), and a numerical disease-progression target.
- Load it with `as_frame=True`.
- Extract the `.frame` attribute into `df_diabetes`.
- Print its shape and display the first 5 rows.
from sklearn.datasets import load_diabetes
raw_diab = load_diabetes(as_frame=___)
df_diabetes = raw_diab.___
print(f'Shape: {df_diabetes.___}')
df_diabetes.___
1.3 · Worked — Loading from a CSV with pandas¶
In the real world you'll most often receive a .csv file.
We save the California data to CSV and reload it, so you see the full round-trip.
# Save to CSV (simulates receiving a file)
df.to_csv('california.csv', index=False)
# Reload from CSV
df_from_csv = pd.read_csv('california.csv')
print('Loaded from CSV — shape:', df_from_csv.shape)
print('Column names:', df_from_csv.columns.tolist())
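The NumPy route from the loading list above is not otherwise demonstrated, so here is a minimal sketch, assuming the `california.csv` file written in the previous cell. Note that `genfromtxt` returns a plain array and discards the column names.
# NumPy route — plain numerical array; header row skipped, column names lost
arr = np.genfromtxt('california.csv', delimiter=',', skip_header=1)
print('genfromtxt — shape:', arr.shape)  # (20640, 9)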
Section 2 · Inspecting Data¶
Always look before you model¶
Before building any model you must understand:
- Shape — how many samples $n$ and features $d$?
- Types — are columns numerical or categorical?
- Missing values — `NaN` will silently break most algorithms.
- Scale — are some features 1000× larger than others?
- Distribution — are there obvious outliers?
2.1 · Worked — Quick Inspection Toolkit¶
print('=== Shape ===')
print(df.shape)
print('\n=== Data types ===')
print(df.dtypes)
print('\n=== Missing values per column ===')
print(df.isnull().sum())
print('\n=== Summary statistics ===')
df.describe().round(3)
# Visualise distributions
fig, axes = plt.subplots(3, 3, figsize=(13, 9))
axes = axes.flatten()
for i, col in enumerate(df.columns):
axes[i].hist(df[col], bins=40, color='steelblue', edgecolor='white', linewidth=0.3)
axes[i].set_title(col, fontsize=10)
axes[i].set_xlabel('')
plt.suptitle('Feature distributions — California Housing', fontsize=13, y=1.01)
plt.tight_layout()
plt.show()
✏️ 2.2 · Your Turn — Inject and Detect Missing Values¶
Real datasets almost always contain missing values.
Here we deliberately corrupt a copy, then find and handle the missing entries.
- Set 50 random entries in the `'MedInc'` column to `NaN`.
- Count how many `NaN`s the column now has.
- Fill the missing values with the column mean using `.fillna()`.
- Confirm no `NaN`s remain.
df_dirty = df.copy()
# Step 1 — inject 50 NaNs at random row positions
rng = np.random.default_rng(0)
bad_rows = rng.choice(len(df_dirty), size=50, replace=False)
df_dirty.loc[bad_rows, 'MedInc'] = ___ # fill in: np.nan
# Step 2 — count missing values in MedInc
n_missing = df_dirty['MedInc'].___().___()
print(f'Missing values in MedInc: {n_missing}') # should be 50
# Step 3 — fill with column mean
col_mean = df_dirty['MedInc'].___() # fill in: .mean()
df_dirty['MedInc'] = df_dirty['MedInc'].fillna(___)
# Step 4 — confirm
print(f'Missing after imputation: {df_dirty["MedInc"].isnull().sum()}') # should be 0
Section 3 · Indexing and Slicing¶
NumPy and pandas slicing recap¶
Selecting the right rows and columns is the most frequent operation in any data pipeline.
We cover three complementary tools:
| Tool | What it selects |
|---|---|
| `df[col]` / `df[[cols]]` | columns by name |
| `df.iloc[rows, cols]` | rows and columns by integer position |
| `df.loc[rows, cols]` | rows and columns by label / boolean mask |
3.1 · Worked — Column and Row Selection¶
# --- Select a single column (returns a Series) ---
med_inc = df['MedInc']
print('Single column — type:', type(med_inc).__name__, '| shape:', med_inc.shape)
# --- Select multiple columns (returns a DataFrame) ---
subset = df[['MedInc', 'AveRooms', 'MedHouseVal']]
print('Multi-column subset shape:', subset.shape)
# --- Select rows by position (first 3) ---
print('\nFirst 3 rows (iloc):')
print(df.iloc[:3])
# --- Select a sub-block by position ---
print('\nRows 10-12, first 3 columns (iloc):')
print(df.iloc[10:13, :3])
# --- Boolean mask: select rows where median income > 5 ---
mask_high_income = df['MedInc'] > 5.0
df_rich = df.loc[mask_high_income]
print(f'Rows with MedInc > 5: {len(df_rich)} / {len(df)}')
print(f'Mean house value (high income): {df_rich["MedHouseVal"].mean():.3f}')
print(f'Mean house value (all): {df["MedHouseVal"].mean():.3f}')
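One subtlety worth keeping in mind: `iloc` indexes by position while `loc` indexes by label, and the two diverge as soon as the index is not the default 0…n−1. A minimal sketch (the letter index here is purely illustrative):
# loc vs iloc: the same cell reached by label vs by position
small = df.iloc[:5].copy()
small.index = ['a', 'b', 'c', 'd', 'e']   # replace the default integer index
print(small.loc['b', 'MedInc'])           # row by label, column by name
print(small.iloc[1, 0])                   # row by position, column by position — same value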
✏️ 3.2 · Your Turn — Boolean Slicing Practice¶
Using the California DataFrame df:
- Select all rows where `'AveOccup'` (average occupancy) is greater than 6 (overcrowded). How many such blocks are there?
- Select all rows where `'HouseAge'` is exactly 52 (the dataset cap).
- Combine both conditions with `&`: blocks that are both overcrowded and aged 52. Print the count.
# 1. Overcrowded blocks
mask_crowded = df['AveOccup'] > ___
print(f'Overcrowded blocks: {mask_crowded.sum()}')
# 2. Maximum-age houses
mask_old = df['HouseAge'] == ___
print(f'Blocks with HouseAge == 52: {mask_old.sum()}')
# 3. Combined
df_combined = df.loc[___ & ___]
print(f'Overcrowded AND aged 52: {len(df_combined)}')
✏️ 3.3 · Your Turn — NumPy Slicing Review¶
Convert the DataFrame to a NumPy array and practice matrix slicing —
this is exactly how the feature matrix $X$ will look inside a model.
- Convert `df` to a NumPy array `A` using `.to_numpy()`.
- Extract the last column (the target `MedHouseVal`) as a 1-D array `y_all`.
- Extract all columns except the last as `X_all`.
- Print both shapes.
A = df.to_numpy() # shape: (20640, 9)
y_all = A[:, ___] # fill in: last column
X_all = A[:, ___] # fill in: all but last column
print('X_all shape:', X_all.shape) # should be (20640, 8)
print('y_all shape:', y_all.shape) # should be (20640,)
Section 4 · Feature Matrix $X$ and Target Vector $\mathbf{y}$¶
The standard ML data layout¶
Every supervised learning algorithm expects data in this form:
$$X \in \mathbb{R}^{n \times d}, \qquad \mathbf{y} \in \mathbb{R}^n$$
where $n$ is the number of samples and $d$ is the number of features.
Row $i$ of $X$ is the feature vector $\mathbf{x}_i$ for the $i$-th sample.
Element $y_i$ is the corresponding label or target value.
4.1 · Worked — Extracting $X$ and $\mathbf{y}$ from the Dataset¶
# sklearn Bunch objects expose .data and .target directly
X = raw.data.to_numpy() # shape (n, d)
y = raw.target.to_numpy() # shape (n,)
feature_names = raw.feature_names
print(f'X shape : {X.shape} ({X.shape[0]} samples, {X.shape[1]} features)')
print(f'y shape : {y.shape}')
print(f'Feature names: {feature_names}')
print(f'Target (MedHouseVal) — min: {y.min():.2f}, max: {y.max():.2f}, mean: {y.mean():.2f}')
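To tie the notation to the arrays: row $i$ of `X` is the feature vector $\mathbf{x}_i$, and `y[i]` is its target. A quick check:
# One sample = one row of X plus the matching entry of y
i = 0
print('x_0 =', X[i].round(3))   # shape (8,) — one feature vector
print('y_0 =', y[i])            # its target value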
✏️ 4.2 · Your Turn — Pairwise Scatter Plots¶
A scatter plot of each feature against the target $y$ is the fastest way to spot linear relationships.
- Plot each of the 8 features on the x-axis against `y` on the y-axis.
- Use `alpha=0.05` (semi-transparent dots — there are 20 000 points!).
- Label each subplot with the feature name.
Which features look most linearly related to `MedHouseVal`?
fig, axes = plt.subplots(2, 4, figsize=(15, 7))
axes = axes.flatten()
for i, name in enumerate(feature_names):
axes[i].scatter(X[:, i], y, s=1, alpha=___, color='steelblue') # fill in alpha
axes[i].set_xlabel(___)
axes[i].set_ylabel('MedHouseVal')
plt.suptitle('Feature vs target scatter plots', fontsize=13)
plt.tight_layout()
plt.show()
# 💬 Which feature shows the clearest linear trend with the house value?
Section 5 · Train / Test Split¶
Why we split¶
A model trained and evaluated on the same data will appear to perform well —
but it may have simply memorised the training examples (overfitting).
We hold out a test set that the model never sees during training.
Performance on the test set is our honest estimate of generalisation.
$$\underbrace{X,\ \mathbf{y}}_{\text{all data}} \;\longrightarrow\; \underbrace{X_{\text{train}},\ \mathbf{y}_{\text{train}}}_{\sim 80\%} \;+\; \underbrace{X_{\text{test}},\ \mathbf{y}_{\text{test}}}_{\sim 20\%}$$
Critical rule: the test set must remain untouched until the very end.
Any decision made by looking at the test labels — including choosing a scaler or picking features — leaks information and inflates the reported performance.
5.1 · Worked — Manual Split by Index¶
# Manual 80/20 split — good for understanding what is happening under the hood
n = len(X)
n_train = int(0.8 * n)
# Shuffle indices first — order in the dataset may not be random!
idx = np.arange(n)
rng = np.random.default_rng(42)
rng.shuffle(idx)
train_idx = idx[:n_train]
test_idx = idx[n_train:]
X_train_manual = X[train_idx]
y_train_manual = y[train_idx]
X_test_manual = X[test_idx]
y_test_manual = y[test_idx]
print(f'Total samples : {n}')
print(f'Train samples : {len(X_train_manual)} ({100*len(X_train_manual)/n:.1f}%)')
print(f'Test samples : {len(X_test_manual)} ({100*len(X_test_manual)/n:.1f}%)')
print(f'\nTrain mean y : {y_train_manual.mean():.4f}')
print(f'Test mean y : {y_test_manual.mean():.4f} ← should be similar to train')
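A sanity check worth running after any manual split — the two index sets must be disjoint and together cover every sample. A minimal sketch using the index arrays above:
# Train and test indices must be disjoint and exhaustive
assert len(np.intersect1d(train_idx, test_idx)) == 0, 'train/test overlap!'
assert len(train_idx) + len(test_idx) == n, 'samples lost in the split!'
print('Split is disjoint and exhaustive ✓')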
5.2 · Worked — sklearn train_test_split (the standard way)¶
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size = 0.20, # 20% held out
random_state = 42, # reproducible shuffle
)
print(f'X_train : {X_train.shape} y_train : {y_train.shape}')
print(f'X_test : {X_test.shape} y_test : {y_test.shape}')
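To see the memorisation problem from the section intro concretely, here is a minimal sketch using a 1-nearest-neighbour regressor (an import not used elsewhere in this lab): it stores the training set verbatim, so its training error is essentially zero while the test error tells the honest story.
from sklearn.neighbors import KNeighborsRegressor

# 1-NN memorises the training set — train MSE is (near) zero
knn = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)
print('Train MSE:', np.mean((knn.predict(X_train) - y_train)**2).round(4))  # ≈ 0
print('Test  MSE:', np.mean((knn.predict(X_test)  - y_test )**2).round(4))  # much larger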
✏️ 5.3 · Your Turn — What If We Don't Shuffle?¶
The California dataset is ordered geographically.
This cell shows what happens when we split without shuffling.
- Do a naïve 80/20 split: first 80% as train, last 20% as test — no shuffling.
- Compare the mean of `y_train` and `y_test`.
- Plot the target histograms side by side for the shuffled vs unshuffled splits.
💬 What do you notice? Why would training on a biased train set be a problem?
# Naïve (no shuffle) split
n_tr = int(0.8 * n)
y_train_noshuf = y[___] # fill in: first n_tr elements
y_test_noshuf = y[___] # fill in: remaining elements
print('--- No shuffle ---')
print(f'Train mean y : {y_train_noshuf.mean():.4f}')
print(f'Test mean y : {y_test_noshuf.mean():.4f}')
print('\n--- With shuffle ---')
print(f'Train mean y : {y_train.mean():.4f}')
print(f'Test mean y : {y_test.mean():.4f}')
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
for ax, (tr, te, title) in zip(axes, [
(y_train_noshuf, y_test_noshuf, 'No shuffle — biased split'),
(y_train, y_test, 'Shuffled — representative split')
]):
ax.hist(tr, bins=40, alpha=0.6, color='steelblue', label='train')
ax.hist(te, bins=40, alpha=0.6, color='crimson', label='test')
ax.set_title(title)
ax.set_xlabel('MedHouseVal')
ax.legend()
plt.tight_layout()
plt.show()
🔬 5.4 · Experiment — How Does Split Ratio Affect Variance?¶
A small test set gives a noisy estimate of performance.
Fit a one-feature linear regression (via closed-form least squares, in place of Notebook 7's gradient descent) on splits of different sizes
and observe how the test-set MSE varies.
(We use a simple one-feature regression here; you will build the full model in the next notebook.)
# Use only MedInc (feature 0) to predict house value — simplest possible model
X1 = X[:, 0:1] # shape (n, 1) — keep 2D for consistency
def mse(y_pred, y_true):
return np.mean((y_pred - y_true)**2)
def fit_and_evaluate(test_size, seed):
Xtr, Xte, ytr, yte = train_test_split(X1, y, test_size=test_size, random_state=seed)
# Closed-form least squares: w* = (X^T X)^{-1} X^T y
Xtr_b = np.hstack([np.ones((len(Xtr), 1)), Xtr]) # add bias column
Xte_b = np.hstack([np.ones((len(Xte), 1)), Xte])
w = np.linalg.lstsq(Xtr_b, ytr, rcond=None)[0]
return mse(Xte_b @ w, yte)
test_sizes = [0.05, 0.10, 0.20, 0.30, 0.50]
n_repeats = 20
results = {}
for ts in test_sizes:
results[ts] = [fit_and_evaluate(ts, seed=s) for s in range(n_repeats)]
fig, ax = plt.subplots(figsize=(9, 4))
means = [np.mean(results[ts]) for ts in test_sizes]
stds = [np.std(results[ts]) for ts in test_sizes]
ax.errorbar([str(int(ts*100))+'%' for ts in test_sizes], means, yerr=stds,
fmt='o-', color='steelblue', capsize=5, lw=2)
ax.set_xlabel('Test set size')
ax.set_ylabel('Test MSE')
ax.set_title('Test MSE mean ± std across 20 random splits')
plt.tight_layout()
plt.show()
# 💬 Why does variance increase for very small test sets?
# 💬 Why does MSE change slightly for very large test sets?
Section 6 · Feature Scaling¶
Why scale?¶
Look at the summary statistics again: MedInc ranges from ~0 to ~15,
while Population ranges from 3 to 35 000.
When gradient descent updates $\mathbf{w}$, features with large magnitudes dominate the gradient —
exactly the ill-conditioning problem from Notebook 7 Section 5.
Two common remedies:
Standardisation (Z-score normalisation) $$\tilde{x}_j = \frac{x_j - \mu_j}{\sigma_j}$$ Result: each feature has mean 0 and std 1.
Min-max normalisation $$\tilde{x}_j = \frac{x_j - \min_j}{\max_j - \min_j}$$ Result: each feature lives in $[0, 1]$.
The golden rule: fit the scaler on the training set only, then apply to both train and test.
Using test statistics to scale leaks information from the test set.
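Min-max normalisation does not appear in the worked cells below, so here is a minimal NumPy sketch obeying the same fit-on-train rule. Note that test features scaled with the training min/max can legitimately fall outside $[0, 1]$ — that is expected, not a bug.
# Min-max normalisation — statistics from the training split only
x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)
X_train_mm = (X_train - x_min) / (x_max - x_min)
X_test_mm  = (X_test  - x_min) / (x_max - x_min)   # same training statistics
print('train range:', X_train_mm.min().round(3), '→', X_train_mm.max().round(3))  # 0.0 → 1.0
print('test  range:', X_test_mm.min().round(3),  '→', X_test_mm.max().round(3))   # may exceed [0, 1]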
6.1 · Worked — Standardisation with sklearn¶
scaler = StandardScaler()
# Fit ONLY on training data — never on test!
scaler.fit(X_train)
# Transform both sets using the training statistics
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print('Before scaling — MedInc column stats:')
print(f' train mean={X_train[:,0].mean():.3f}, std={X_train[:,0].std():.3f}')
print('\nAfter scaling — MedInc column stats:')
print(f' train mean={X_train_scaled[:,0].mean():.6f} (≈ 0)')
print(f' train std ={X_train_scaled[:,0].std():.6f} (≈ 1)')
print(f' test mean={X_test_scaled[:,0].mean():.4f} (close but not exactly 0 — that is OK)')
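`StandardScaler` also provides `inverse_transform`, which undoes the scaling — useful when you need features (or predictions built from them) back in original units. A quick round-trip check:
# inverse_transform recovers the original (unscaled) features
X_back = scaler.inverse_transform(X_train_scaled)
print('Round-trip matches X_train:', np.allclose(X_back, X_train))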
# Visualise: before vs after scaling for all 8 features
fig, axes = plt.subplots(2, 8, figsize=(18, 5))
for i in range(8):
axes[0, i].hist(X_train[:, i], bins=30, color='steelblue', edgecolor='w', lw=0.2)
axes[1, i].hist(X_train_scaled[:, i], bins=30, color='crimson', edgecolor='w', lw=0.2)
axes[0, i].set_title(feature_names[i], fontsize=8)
axes[1, i].set_title('scaled', fontsize=8)
axes[0, 0].set_ylabel('Before scaling')
axes[1, 0].set_ylabel('After scaling')
plt.suptitle('Feature distributions before and after standardisation', fontsize=12)
plt.tight_layout()
plt.show()
✏️ 6.2 · Your Turn — Standardisation from Scratch¶
Implement standardisation without sklearn to see exactly what is happening.
- Compute `mu` (column means) and `sigma` (column stds) from `X_train` only.
- Apply the formula $\tilde{x}_j = (x_j - \mu_j) / \sigma_j$ to both `X_train` and `X_test`.
- Verify that your result matches `X_train_scaled` from above.
# ✏️ Step 1 — compute statistics from training data only
mu = X_train.___(___) # fill in: mean, axis=0
sigma = X_train.___(___) # fill in: std, axis=0
print('mu shape:', mu.shape) # should be (8,)
print('sigma shape:', sigma.shape)
# ✏️ Step 2 — apply to train and test
X_train_scaled_manual = (X_train - ___) / ___
X_test_scaled_manual = (X_test - ___) / ___ # use the SAME mu and sigma!
# ✏️ Step 3 — verify
print('\nMatches sklearn StandardScaler:')
print(' train:', np.allclose(X_train_scaled_manual, X_train_scaled))
print(' test :', np.allclose(X_test_scaled_manual, X_test_scaled))
🔬 6.3 · Experiment — Does Scaling Help Gradient Descent?¶
We saw in Notebook 7 that ill-conditioned problems slow GD dramatically.
Here we train the same linear model on the raw vs scaled features and compare convergence.
def gd_linear_regression(X_b, y, alpha, n_steps):
"""Gradient descent for least squares: minimise ||X_b w - y||^2."""
n, d = X_b.shape
w = np.zeros(d)
losses = []
for _ in range(n_steps):
residual = X_b @ w - y
grad = (2 / n) * X_b.T @ residual
w = w - alpha * grad
losses.append(np.mean(residual**2))
return w, losses
# Add bias column (column of ones)
def add_bias(X):
return np.hstack([np.ones((len(X), 1)), X])
# --- Unscaled ---
_, losses_raw = gd_linear_regression(add_bias(X_train), y_train, alpha=1e-6, n_steps=300)
# --- Scaled ---
_, losses_scaled = gd_linear_regression(add_bias(X_train_scaled), y_train, alpha=0.1, n_steps=300)
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
axes[0].plot(losses_raw, color='crimson', lw=2, label='raw features (α=1e-6)')
axes[0].plot(losses_scaled, color='steelblue', lw=2, label='scaled features (α=0.1)')
axes[0].set_xlabel('Iteration'); axes[0].set_ylabel('Train MSE')
axes[0].set_title('GD convergence — raw vs scaled')
axes[0].legend()
axes[1].plot(losses_scaled, color='steelblue', lw=2)
axes[1].set_xlabel('Iteration'); axes[1].set_ylabel('Train MSE')
axes[1].set_title('Scaled features — zoomed in')
plt.tight_layout()
plt.show()
print(f'Final MSE (raw, alpha=1e-6): {losses_raw[-1]:.4f}')
print(f'Final MSE (scaled, alpha=0.1): {losses_scaled[-1]:.4f}')
Section 7 · Putting It All Together — A Clean ML Pipeline¶
The canonical pipeline¶
Every ML project should follow this order — no exceptions:
1. Load data
2. Inspect & clean (handle NaN, wrong types)
3. Extract X, y
4. Split → X_train, X_test, y_train, y_test
5. Fit scaler on X_train only
6. Transform X_train and X_test
7. Train model on (X_train_scaled, y_train)
8. Evaluate on (X_test_scaled, y_test)
7.1 · Worked — The Full Pipeline in 15 Lines¶
# ── 1. Load ────────────────────────────────────────────────────────────────
raw = fetch_california_housing(as_frame=True)
X, y = raw.data.to_numpy(), raw.target.to_numpy()
# ── 2. Clean (no NaNs here, but we check) ─────────────────────────────────
assert not np.any(np.isnan(X)), 'NaNs found in X!'
# ── 3. Split ───────────────────────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
# ── 4. Scale (fit on train only) ───────────────────────────────────────────
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
# ── 5. Train (closed-form least squares) ──────────────────────────────────
X_tr_b = add_bias(X_train_s)
X_te_b = add_bias(X_test_s)
w_star = np.linalg.lstsq(X_tr_b, y_train, rcond=None)[0]
# ── 6. Evaluate ────────────────────────────────────────────────────────────
train_mse = mse(X_tr_b @ w_star, y_train)
test_mse = mse(X_te_b @ w_star, y_test)
print('Pipeline complete ✓')
print(f'Train MSE : {train_mse:.4f}')
print(f'Test MSE : {test_mse:.4f}')
print(f'\nLearned weights: {w_star.round(4)}')
print(f'Feature names : [bias] + {raw.feature_names}')
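As a cross-check, the normal-equations solution $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$ should agree with `lstsq` — a sketch assuming $X^\top X$ is well-conditioned, which standardisation makes likely here:
# Normal equations — same least-squares solution as lstsq
w_ne = np.linalg.solve(X_tr_b.T @ X_tr_b, X_tr_b.T @ y_train)
print('Normal equations match lstsq:', np.allclose(w_ne, w_star))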
# Predicted vs actual plot
y_pred_test = X_te_b @ w_star
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test, y_pred_test, s=3, alpha=0.3, color='steelblue')
lims = [y_test.min(), y_test.max()]
ax.plot(lims, lims, 'r--', lw=1.5, label='perfect prediction')
ax.set_xlabel('Actual MedHouseVal')
ax.set_ylabel('Predicted MedHouseVal')
ax.set_title(f'Test set predictions (MSE = {test_mse:.4f})')
ax.legend()
plt.tight_layout()
plt.show()
✏️ 7.2 · Your Turn — Run the Full Pipeline on the Diabetes Dataset¶
Repeat the full pipeline on the diabetes dataset you loaded in Section 1.2.
- Extract `X_d` and `y_d` from `raw_diab`.
- Split 80/20, scale, and fit the closed-form least squares solution.
- Report train and test MSE.
- Plot predicted vs actual.
💬 Is the model fitting better or worse than on the California data?
💬 What does a large gap between train MSE and test MSE mean?
# ── Fill in the pipeline for the diabetes dataset ─────────────────────────
X_d = raw_diab.___.to_numpy()
y_d = raw_diab.___.to_numpy()
X_d_train, X_d_test, y_d_train, y_d_test = train_test_split(
___, ___, test_size=___, random_state=42)
scaler_d = StandardScaler().fit(___)
X_d_train_s = scaler_d.transform(___)
X_d_test_s = scaler_d.transform(___)
X_d_tr_b = add_bias(X_d_train_s)
X_d_te_b = add_bias(X_d_test_s)
w_d = np.linalg.lstsq(X_d_tr_b, y_d_train, rcond=None)[0]
print(f'Diabetes — Train MSE : {mse(X_d_tr_b @ w_d, y_d_train):.2f}')
print(f'Diabetes — Test MSE : {mse(X_d_te_b @ w_d, y_d_test):.2f}')
🏁 Summary¶
| Section | What you practised | Key rule |
|---|---|---|
| 1 | Loading data — sklearn, CSV | Always check what you loaded |
| 2 | Inspection — shape, dtypes, NaNs, stats | Never model dirty data |
| 3 | Indexing — column select, iloc, boolean masks | Understand your slices |
| 4 | Extracting $X \in \mathbb{R}^{n \times d}$ and $\mathbf{y} \in \mathbb{R}^n$ | Row = sample, column = feature |
| 5 | Train / test split — shuffle, ratio, leakage | Test set is sacred |
| 6 | Standardisation — from scratch and sklearn | Fit scaler on train only |
| 7 | Full pipeline — load → clean → split → scale → train → evaluate | Always in this order |
The central lesson:
Any information that flows from the test set into training — even indirectly, through scaling or feature selection — is data leakage.
Leakage makes models look better than they are, and they fail silently in production.
MA2221 — Foundational Mathematics for Machine Learning · Mahindra University
Lab Notebook 7 · © Biswarup Biswas