Data Handling for Machine Learning¶
Course: MA2221 · Mahindra University
Reference: Mathematics for Machine Learning, Deisenroth, Faisal & Ong — Ch 9
Two questions drive this entire lab:
How do we load, inspect, and clean real data so it is ready for a model?
Why do we split data into train and test sets — and how do we do it correctly?
Every machine learning pipeline starts here — before any gradients, before any model.
A model is only as good as the data it trains on.
Structure¶
| Section | Topic |
|---|---|
| 1 | Loading data — CSV, NumPy, and sklearn datasets |
| 2 | Inspecting data — shapes, dtypes, missing values, summary statistics |
| 3 | Indexing and slicing — rows, columns, boolean masks |
| 4 | Feature matrix $X$ and target vector $\mathbf{y}$ |
| 5 | Train / test split — why, how, and what goes wrong without it |
| 6 | Feature scaling — standardisation and normalisation |
| 7 | Putting it all together — a clean ML-ready pipeline |
Legend¶
- 🧱 Worked — run and read
- ✏️ Your turn — fill in
- 🔬 Experiment — change numbers and observe
- 💬 Discuss — no single right answer
0 · Setup¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
plt.rcParams.update({
'figure.dpi' : 120,
'axes.spines.top' : False,
'axes.spines.right': False,
'font.size' : 12,
})
print('All imports OK ✓')
All imports OK ✓
Section 1 · Loading Data¶
Where data comes from¶
In practice, data arrives in many forms — CSV files, databases, or bundled datasets from libraries.
In this notebook we use three routes:
- pandas `read_csv` — the most common real-world path
- NumPy `loadtxt` / `genfromtxt` — for plain numerical files (a short sketch follows Section 1.3)
- sklearn built-in datasets — clean, well-documented, great for learning
1.1 · Worked — Loading the California Housing Dataset¶
# sklearn bundles several classic ML datasets — no download needed
from sklearn.datasets import fetch_california_housing
raw = fetch_california_housing(as_frame=True) # as_frame=True gives us a pandas DataFrame
# raw is a Bunch object — a dict-like container
print('Keys in the Bunch object:', raw.keys())
print()
print(raw.DESCR[:600]) # read the dataset description
# Extract the full dataframe (features + target together for inspection)
df = raw.frame
print(f'Shape: {df.shape} ({df.shape[0]} rows × {df.shape[1]} columns)')
df.head()
✏️ 1.2 · Your Turn — Load the Diabetes Dataset¶
sklearn also ships the diabetes dataset (sklearn.datasets.load_diabetes).
It has 442 patients, 10 features (age, sex, bmi, blood pressure, …), and a numerical disease-progression target.
- Load it with `as_frame=True`.
- Extract the `.frame` attribute into `df_diabetes`.
- Print its shape and display the first 5 rows.
from sklearn.datasets import load_diabetes
raw_diab = load_diabetes(as_frame=___)
df_diabetes = raw_diab.___
print(f'Shape: {df_diabetes.___}')
df_diabetes.___
1.3 · Worked — Loading from a CSV with pandas¶
In the real world you'll most often receive a .csv file.
We save the California data to CSV and reload it, so you see the full round-trip.
# Save to CSV (simulates receiving a file)
df.to_csv('california.csv', index=False)
# Reload from CSV
df_from_csv = pd.read_csv('california.csv')
print('Loaded from CSV — shape:', df_from_csv.shape)
print('Column names:', df_from_csv.columns.tolist())
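The NumPy route from the loading list above is not otherwise demonstrated, so here is a minimal sketch, assuming the `california.csv` file written in the previous cell. Note that `genfromtxt` returns a plain array and discards the column names.
# NumPy route — plain numerical array; header row skipped, column names lost
arr = np.genfromtxt('california.csv', delimiter=',', skip_header=1)
print('genfromtxt — shape:', arr.shape)  # (20640, 9)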
Section 2 · Inspecting Data¶
Always look before you model¶
Before building any model you must understand:
- Shape — how many samples $n$ and features $d$?
- Types — are columns numerical or categorical?
- Missing values — `NaN` will silently break most algorithms.
- Scale — are some features 1000× larger than others?
- Distribution — are there obvious outliers?
2.1 · Worked — Quick Inspection Toolkit¶
print('=== Shape ===')
print(df.shape)
print('\n=== Data types ===')
print(df.dtypes)
print('\n=== Missing values per column ===')
print(df.isnull().sum())
print('\n=== Summary statistics ===')
df.describe().round(3)
# Visualise distributions
fig, axes = plt.subplots(3, 3, figsize=(13, 9))
axes = axes.flatten()
for i, col in enumerate(df.columns):
axes[i].hist(df[col], bins=40, color='steelblue', edgecolor='white', linewidth=0.3)
axes[i].set_title(col, fontsize=10)
axes[i].set_xlabel('')
plt.suptitle('Feature distributions — California Housing', fontsize=13, y=1.01)
plt.tight_layout()
plt.show()
✏️ 2.2 · Your Turn — Inject and Detect Missing Values¶
Real datasets almost always contain missing values.
Here we deliberately corrupt a copy, then find and handle the missing entries.
- Set 50 random entries in the `'MedInc'` column to `NaN`.
- Count how many `NaN`s the column now has.
- Fill the missing values with the column mean using `.fillna()`.
- Confirm no `NaN`s remain.
df_dirty = df.copy()
# Step 1 — inject 50 NaNs at random row positions
rng = np.random.default_rng(0)
bad_rows = rng.choice(len(df_dirty), size=50, replace=False)
df_dirty.loc[bad_rows, 'MedInc'] = ___ # fill in: np.nan
# Step 2 — count missing values in MedInc
n_missing = df_dirty['MedInc'].___().___()
print(f'Missing values in MedInc: {n_missing}') # should be 50
# Step 3 — fill with column mean
col_mean = df_dirty['MedInc'].___() # fill in: .mean()
df_dirty['MedInc'] = df_dirty['MedInc'].fillna(___)
# Step 4 — confirm
print(f'Missing after imputation: {df_dirty["MedInc"].isnull().sum()}') # should be 0
Section 3 · Indexing and Slicing¶
NumPy and pandas slicing recap¶
Selecting the right rows and columns is the most frequent operation in any data pipeline.
We cover three complementary tools:
| Tool | What it selects |
|---|---|
| `df[col]` / `df[[cols]]` | columns by name |
| `df.iloc[rows, cols]` | rows and columns by integer position |
| `df.loc[rows, cols]` | rows and columns by label / boolean mask |
3.1 · Worked — Column and Row Selection¶
# --- Select a single column (returns a Series) ---
med_inc = df['MedInc']
print('Single column — type:', type(med_inc).__name__, '| shape:', med_inc.shape)
# --- Select multiple columns (returns a DataFrame) ---
subset = df[['MedInc', 'AveRooms', 'MedHouseVal']]
print('Multi-column subset shape:', subset.shape)
# --- Select rows by position (first 3) ---
print('\nFirst 3 rows (iloc):')
print(df.iloc[:3])
# --- Select a sub-block by position ---
print('\nRows 10-12, first 3 columns (iloc):')
print(df.iloc[10:13, :3])
# --- Boolean mask: select rows where median income > 5 ---
mask_high_income = df['MedInc'] > 5.0
df_rich = df.loc[mask_high_income]
print(f'Rows with MedInc > 5: {len(df_rich)} / {len(df)}')
print(f'Mean house value (high income): {df_rich["MedHouseVal"].mean():.3f}')
print(f'Mean house value (all): {df["MedHouseVal"].mean():.3f}')
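One subtlety worth keeping in mind: `iloc` indexes by position while `loc` indexes by label, and the two diverge as soon as the index is not the default 0…n−1. A minimal sketch (the letter index here is purely illustrative):
# loc vs iloc: the same cell reached by label vs by position
small = df.iloc[:5].copy()
small.index = ['a', 'b', 'c', 'd', 'e']   # replace the default integer index
print(small.loc['b', 'MedInc'])           # row by label, column by name
print(small.iloc[1, 0])                   # row by position, column by position — same value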
✏️ 3.2 · Your Turn — Boolean Slicing Practice¶
Using the California DataFrame df:
- Select all rows where `'AveOccup'` (average occupancy) is greater than 6 (overcrowded). How many such blocks are there?
- Select all rows where `'HouseAge'` is exactly 52 (the dataset cap).
- Combine both conditions with `&`: blocks that are both overcrowded and aged 52. Print the count.
# 1. Overcrowded blocks
mask_crowded = df['AveOccup'] > ___
print(f'Overcrowded blocks: {mask_crowded.sum()}')
# 2. Maximum-age houses
mask_old = df['HouseAge'] == ___
print(f'Blocks with HouseAge == 52: {mask_old.sum()}')
# 3. Combined
df_combined = df.loc[___ & ___]
print(f'Overcrowded AND aged 52: {len(df_combined)}')
✏️ 3.3 · Your Turn — NumPy Slicing Review¶
Convert the DataFrame to a NumPy array and practice matrix slicing —
this is exactly how the feature matrix $X$ will look inside a model.
- Convert `df` to a NumPy array `A` using `.to_numpy()`.
- Extract the last column (the target `MedHouseVal`) as a 1-D array `y_all`.
- Extract all columns except the last as `X_all`.
- Print both shapes.
A = df.to_numpy() # shape: (20640, 9)
y_all = A[:, ___] # fill in: last column
X_all = A[:, ___] # fill in: all but last column
print('X_all shape:', X_all.shape) # should be (20640, 8)
print('y_all shape:', y_all.shape) # should be (20640,)
Section 4 · Feature Matrix $X$ and Target Vector $\mathbf{y}$¶
The standard ML data layout¶
Every supervised learning algorithm expects data in this form:
$$X \in \mathbb{R}^{n \times d}, \qquad \mathbf{y} \in \mathbb{R}^n$$
where $n$ is the number of samples and $d$ is the number of features.
Row $i$ of $X$ is the feature vector $\mathbf{x}_i$ for the $i$-th sample.
Element $y_i$ is the corresponding label or target value.
4.1 · Worked — Extracting $X$ and $\mathbf{y}$ from the Dataset¶
# sklearn Bunch objects expose .data and .target directly
X = raw.data.to_numpy() # shape (n, d)
y = raw.target.to_numpy() # shape (n,)
feature_names = raw.feature_names
print(f'X shape : {X.shape} ({X.shape[0]} samples, {X.shape[1]} features)')
print(f'y shape : {y.shape}')
print(f'Feature names: {feature_names}')
print(f'Target (MedHouseVal) — min: {y.min():.2f}, max: {y.max():.2f}, mean: {y.mean():.2f}')
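To tie the notation to the arrays: row $i$ of `X` is the feature vector $\mathbf{x}_i$, and `y[i]` is its target. A quick check:
# One sample = one row of X plus the matching entry of y
i = 0
print('x_0 =', X[i].round(3))   # shape (8,) — one feature vector
print('y_0 =', y[i])            # its target value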
✏️ 4.2 · Your Turn — Pairwise Scatter Plots¶
A scatter plot of each feature against the target $y$ is the fastest way to spot linear relationships.
- Plot each of the 8 features on the x-axis against `y` on the y-axis.
- Use `alpha=0.05` (semi-transparent dots — there are 20 000 points!).
- Label each subplot with the feature name.
Which features look most linearly related to `MedHouseVal`?
fig, axes = plt.subplots(2, 4, figsize=(15, 7))
axes = axes.flatten()
for i, name in enumerate(feature_names):
axes[i].scatter(X[:, i], y, s=1, alpha=___, color='steelblue') # fill in alpha
axes[i].set_xlabel(___)
axes[i].set_ylabel('MedHouseVal')
plt.suptitle('Feature vs target scatter plots', fontsize=13)
plt.tight_layout()
plt.show()
# 💬 Which feature shows the clearest linear trend with the house value?
Section 5 · Train / Test Split¶
Why we split¶
A model trained and evaluated on the same data will appear to perform well —
but it may have simply memorised the training examples (overfitting).
We hold out a test set that the model never sees during training.
Performance on the test set is our honest estimate of generalisation.
$$\underbrace{X,\ \mathbf{y}}_{\text{all data}} \;\longrightarrow\; \underbrace{X_{\text{train}},\ \mathbf{y}_{\text{train}}}_{\sim 80\%} \;+\; \underbrace{X_{\text{test}},\ \mathbf{y}_{\text{test}}}_{\sim 20\%}$$
Critical rule: the test set must remain untouched until the very end.
Any decision made by looking at the test labels — including choosing a scaler or picking features — leaks information and inflates the reported performance.
5.1 · Worked — Manual Split by Index¶
# Manual 80/20 split — good for understanding what is happening under the hood
n = len(X)
n_train = int(0.8 * n)
# Shuffle indices first — order in the dataset may not be random!
idx = np.arange(n)
rng = np.random.default_rng(42)
rng.shuffle(idx)
train_idx = idx[:n_train]
test_idx = idx[n_train:]
X_train_manual = X[train_idx]
y_train_manual = y[train_idx]
X_test_manual = X[test_idx]
y_test_manual = y[test_idx]
print(f'Total samples : {n}')
print(f'Train samples : {len(X_train_manual)} ({100*len(X_train_manual)/n:.1f}%)')
print(f'Test samples : {len(X_test_manual)} ({100*len(X_test_manual)/n:.1f}%)')
print(f'\nTrain mean y : {y_train_manual.mean():.4f}')
print(f'Test mean y : {y_test_manual.mean():.4f} ← should be similar to train')
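A sanity check worth running after any manual split — the two index sets must be disjoint and together cover every sample. A minimal sketch using the index arrays above:
# Train and test indices must be disjoint and exhaustive
assert len(np.intersect1d(train_idx, test_idx)) == 0, 'train/test overlap!'
assert len(train_idx) + len(test_idx) == n, 'samples lost in the split!'
print('Split is disjoint and exhaustive ✓')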
5.2 · Worked — sklearn train_test_split (the standard way)¶
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size = 0.20, # 20% held out
random_state = 42, # reproducible shuffle
)
print(f'X_train : {X_train.shape} y_train : {y_train.shape}')
print(f'X_test : {X_test.shape} y_test : {y_test.shape}')
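To see the memorisation problem from the section intro concretely, here is a minimal sketch using a 1-nearest-neighbour regressor (an import not used elsewhere in this lab): it stores the training set verbatim, so its training error is essentially zero while the test error tells the honest story.
from sklearn.neighbors import KNeighborsRegressor

# 1-NN memorises the training set — train MSE is (near) zero
knn = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)
print('Train MSE:', np.mean((knn.predict(X_train) - y_train)**2).round(4))  # ≈ 0
print('Test  MSE:', np.mean((knn.predict(X_test)  - y_test )**2).round(4))  # much larger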
✏️ 5.3 · Your Turn — What If We Don't Shuffle?¶
The California dataset is ordered geographically.
This cell shows what happens when we split without shuffling.
- Do a naïve 80/20 split: first 80% as train, last 20% as test — no shuffling.
- Compare the mean of `y_train` and `y_test`.
- Plot the target histograms side by side for the shuffled vs unshuffled splits.
💬 What do you notice? Why would training on a biased train set be a problem?
# Naïve (no shuffle) split
n_tr = int(0.8 * n)
y_train_noshuf = y[___] # fill in: first n_tr elements
y_test_noshuf = y[___] # fill in: remaining elements
print('--- No shuffle ---')
print(f'Train mean y : {y_train_noshuf.mean():.4f}')
print(f'Test mean y : {y_test_noshuf.mean():.4f}')
print('\n--- With shuffle ---')
print(f'Train mean y : {y_train.mean():.4f}')
print(f'Test mean y : {y_test.mean():.4f}')
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
for ax, (tr, te, title) in zip(axes, [
(y_train_noshuf, y_test_noshuf, 'No shuffle — biased split'),
(y_train, y_test, 'Shuffled — representative split')
]):
ax.hist(tr, bins=40, alpha=0.6, color='steelblue', label='train')
ax.hist(te, bins=40, alpha=0.6, color='crimson', label='test')
ax.set_title(title)
ax.set_xlabel('MedHouseVal')
ax.legend()
plt.tight_layout()
plt.show()
🔬 5.4 · Experiment — How Does Split Ratio Affect Variance?¶
A small test set gives a noisy estimate of performance.
Fit a one-feature linear regression (via closed-form least squares, in place of Notebook 7's gradient descent) on splits of different sizes
and observe how the test-set MSE varies.
(We use a simple one-feature regression here; you will build the full model in the next notebook.)
# Use only MedInc (feature 0) to predict house value — simplest possible model
X1 = X[:, 0:1] # shape (n, 1) — keep 2D for consistency
def mse(y_pred, y_true):
return np.mean((y_pred - y_true)**2)
def fit_and_evaluate(test_size, seed):
Xtr, Xte, ytr, yte = train_test_split(X1, y, test_size=test_size, random_state=seed)
# Closed-form least squares: w* = (X^T X)^{-1} X^T y
Xtr_b = np.hstack([np.ones((len(Xtr), 1)), Xtr]) # add bias column
Xte_b = np.hstack([np.ones((len(Xte), 1)), Xte])
w = np.linalg.lstsq(Xtr_b, ytr, rcond=None)[0]
return mse(Xte_b @ w, yte)
test_sizes = [0.05, 0.10, 0.20, 0.30, 0.50]
n_repeats = 20
results = {}
for ts in test_sizes:
results[ts] = [fit_and_evaluate(ts, seed=s) for s in range(n_repeats)]
fig, ax = plt.subplots(figsize=(9, 4))
means = [np.mean(results[ts]) for ts in test_sizes]
stds = [np.std(results[ts]) for ts in test_sizes]
ax.errorbar([str(int(ts*100))+'%' for ts in test_sizes], means, yerr=stds,
fmt='o-', color='steelblue', capsize=5, lw=2)
ax.set_xlabel('Test set size')
ax.set_ylabel('Test MSE')
ax.set_title('Test MSE mean ± std across 20 random splits')
plt.tight_layout()
plt.show()
# 💬 Why does variance increase for very small test sets?
# 💬 Why does MSE change slightly for very large test sets?
Section 6 · Feature Scaling¶
Why scale?¶
Look at the summary statistics again: MedInc ranges from ~0 to ~15,
while Population ranges from 3 to 35 000.
When gradient descent updates $\mathbf{w}$, features with large magnitudes dominate the gradient —
exactly the ill-conditioning problem from Notebook 7 Section 5.
Two common remedies:
Standardisation (Z-score normalisation) $$\tilde{x}_j = \frac{x_j - \mu_j}{\sigma_j}$$ Result: each feature has mean 0 and std 1.
Min-max normalisation $$\tilde{x}_j = \frac{x_j - \min_j}{\max_j - \min_j}$$ Result: each feature lives in $[0, 1]$.
The golden rule: fit the scaler on the training set only, then apply to both train and test.
Using test statistics to scale leaks information from the test set.
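Min-max normalisation does not appear in the worked cells below, so here is a minimal NumPy sketch obeying the same fit-on-train rule. Note that test features scaled with the training min/max can legitimately fall outside $[0, 1]$ — that is expected, not a bug.
# Min-max normalisation — statistics from the training split only
x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)
X_train_mm = (X_train - x_min) / (x_max - x_min)
X_test_mm  = (X_test  - x_min) / (x_max - x_min)   # same training statistics
print('train range:', X_train_mm.min().round(3), '→', X_train_mm.max().round(3))  # 0.0 → 1.0
print('test  range:', X_test_mm.min().round(3),  '→', X_test_mm.max().round(3))   # may exceed [0, 1]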
6.1 · Worked — Standardisation with sklearn¶
scaler = StandardScaler()
# Fit ONLY on training data — never on test!
scaler.fit(X_train)
# Transform both sets using the training statistics
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print('Before scaling — MedInc column stats:')
print(f' train mean={X_train[:,0].mean():.3f}, std={X_train[:,0].std():.3f}')
print('\nAfter scaling — MedInc column stats:')
print(f' train mean={X_train_scaled[:,0].mean():.6f} (≈ 0)')
print(f' train std ={X_train_scaled[:,0].std():.6f} (≈ 1)')
print(f' test mean={X_test_scaled[:,0].mean():.4f} (close but not exactly 0 — that is OK)')
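`StandardScaler` also provides `inverse_transform`, which undoes the scaling — useful when you need features (or predictions built from them) back in original units. A quick round-trip check:
# inverse_transform recovers the original (unscaled) features
X_back = scaler.inverse_transform(X_train_scaled)
print('Round-trip matches X_train:', np.allclose(X_back, X_train))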
# Visualise: before vs after scaling for all 8 features
fig, axes = plt.subplots(2, 8, figsize=(18, 5))
for i in range(8):
axes[0, i].hist(X_train[:, i], bins=30, color='steelblue', edgecolor='w', lw=0.2)
axes[1, i].hist(X_train_scaled[:, i], bins=30, color='crimson', edgecolor='w', lw=0.2)
axes[0, i].set_title(feature_names[i], fontsize=8)
axes[1, i].set_title('scaled', fontsize=8)
axes[0, 0].set_ylabel('Before scaling')
axes[1, 0].set_ylabel('After scaling')
plt.suptitle('Feature distributions before and after standardisation', fontsize=12)
plt.tight_layout()
plt.show()
✏️ 6.2 · Your Turn — Standardisation from Scratch¶
Implement standardisation without sklearn to see exactly what is happening.
- Compute `mu` (column means) and `sigma` (column stds) from `X_train` only.
- Apply the formula $\tilde{x}_j = (x_j - \mu_j) / \sigma_j$ to both `X_train` and `X_test`.
- Verify that your result matches `X_train_scaled` from above.
# ✏️ Step 1 — compute statistics from training data only
mu = X_train.___(___) # fill in: mean, axis=0
sigma = X_train.___(___) # fill in: std, axis=0
print('mu shape:', mu.shape) # should be (8,)
print('sigma shape:', sigma.shape)
# ✏️ Step 2 — apply to train and test
X_train_scaled_manual = (X_train - ___) / ___
X_test_scaled_manual = (X_test - ___) / ___ # use the SAME mu and sigma!
# ✏️ Step 3 — verify
print('\nMatches sklearn StandardScaler:')
print(' train:', np.allclose(X_train_scaled_manual, X_train_scaled))
print(' test :', np.allclose(X_test_scaled_manual, X_test_scaled))
🔬 6.3 · Experiment — Does Scaling Help Gradient Descent?¶
We saw in Notebook 7 that ill-conditioned problems slow GD dramatically.
Here we train the same linear model on the raw vs scaled features and compare convergence.
def gd_linear_regression(X_b, y, alpha, n_steps):
"""Gradient descent for least squares: minimise ||X_b w - y||^2."""
n, d = X_b.shape
w = np.zeros(d)
losses = []
for _ in range(n_steps):
residual = X_b @ w - y
grad = (2 / n) * X_b.T @ residual
w = w - alpha * grad
losses.append(np.mean(residual**2))
return w, losses
# Add bias column (column of ones)
def add_bias(X):
return np.hstack([np.ones((len(X), 1)), X])
# --- Unscaled ---
_, losses_raw = gd_linear_regression(add_bias(X_train), y_train, alpha=1e-6, n_steps=300)
# --- Scaled ---
_, losses_scaled = gd_linear_regression(add_bias(X_train_scaled), y_train, alpha=0.1, n_steps=300)
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
axes[0].plot(losses_raw, color='crimson', lw=2, label='raw features (α=1e-6)')
axes[0].plot(losses_scaled, color='steelblue', lw=2, label='scaled features (α=0.1)')
axes[0].set_xlabel('Iteration'); axes[0].set_ylabel('Train MSE')
axes[0].set_title('GD convergence — raw vs scaled')
axes[0].legend()
axes[1].plot(losses_scaled, color='steelblue', lw=2)
axes[1].set_xlabel('Iteration'); axes[1].set_ylabel('Train MSE')
axes[1].set_title('Scaled features — zoomed in')
plt.tight_layout()
plt.show()
print(f'Final MSE (raw, alpha=1e-6): {losses_raw[-1]:.4f}')
print(f'Final MSE (scaled, alpha=0.1): {losses_scaled[-1]:.4f}')
Section 7 · Putting It All Together — A Clean ML Pipeline¶
The canonical pipeline¶
Every ML project should follow this order — no exceptions:
1. Load data
2. Inspect & clean (handle NaN, wrong types)
3. Extract X, y
4. Split → X_train, X_test, y_train, y_test
5. Fit scaler on X_train only
6. Transform X_train and X_test
7. Train model on (X_train_scaled, y_train)
8. Evaluate on (X_test_scaled, y_test)
7.1 · Worked — The Full Pipeline in 15 Lines¶
# ── 1. Load ────────────────────────────────────────────────────────────────
raw = fetch_california_housing(as_frame=True)
X, y = raw.data.to_numpy(), raw.target.to_numpy()
# ── 2. Clean (no NaNs here, but we check) ─────────────────────────────────
assert not np.any(np.isnan(X)), 'NaNs found in X!'
# ── 3. Split ───────────────────────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
# ── 4. Scale (fit on train only) ───────────────────────────────────────────
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
# ── 5. Train (closed-form least squares) ──────────────────────────────────
X_tr_b = add_bias(X_train_s)
X_te_b = add_bias(X_test_s)
w_star = np.linalg.lstsq(X_tr_b, y_train, rcond=None)[0]
# ── 6. Evaluate ────────────────────────────────────────────────────────────
train_mse = mse(X_tr_b @ w_star, y_train)
test_mse = mse(X_te_b @ w_star, y_test)
print('Pipeline complete ✓')
print(f'Train MSE : {train_mse:.4f}')
print(f'Test MSE : {test_mse:.4f}')
print(f'\nLearned weights: {w_star.round(4)}')
print(f'Feature names : [bias] + {raw.feature_names}')
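As a cross-check, the normal-equations solution $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$ should agree with `lstsq` — a sketch assuming $X^\top X$ is well-conditioned, which standardisation makes likely here:
# Normal equations — same least-squares solution as lstsq
w_ne = np.linalg.solve(X_tr_b.T @ X_tr_b, X_tr_b.T @ y_train)
print('Normal equations match lstsq:', np.allclose(w_ne, w_star))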
# Predicted vs actual plot
y_pred_test = X_te_b @ w_star
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test, y_pred_test, s=3, alpha=0.3, color='steelblue')
lims = [y_test.min(), y_test.max()]
ax.plot(lims, lims, 'r--', lw=1.5, label='perfect prediction')
ax.set_xlabel('Actual MedHouseVal')
ax.set_ylabel('Predicted MedHouseVal')
ax.set_title(f'Test set predictions (MSE = {test_mse:.4f})')
ax.legend()
plt.tight_layout()
plt.show()
✏️ 7.2 · Your Turn — Run the Full Pipeline on the Diabetes Dataset¶
Repeat the full pipeline on the diabetes dataset you loaded in Section 1.2.
- Extract `X_d` and `y_d` from `raw_diab`.
- Split 80/20, scale, and fit the closed-form least squares solution.
- Report train and test MSE.
- Plot predicted vs actual.
💬 Is the model fitting better or worse than on the California data?
💬 What does a large gap between train MSE and test MSE mean?
# ── Fill in the pipeline for the diabetes dataset ─────────────────────────
X_d = raw_diab.___.to_numpy()
y_d = raw_diab.___.to_numpy()
X_d_train, X_d_test, y_d_train, y_d_test = train_test_split(
___, ___, test_size=___, random_state=42)
scaler_d = StandardScaler().fit(___)
X_d_train_s = scaler_d.transform(___)
X_d_test_s = scaler_d.transform(___)
X_d_tr_b = add_bias(X_d_train_s)
X_d_te_b = add_bias(X_d_test_s)
w_d = np.linalg.lstsq(X_d_tr_b, y_d_train, rcond=None)[0]
print(f'Diabetes — Train MSE : {mse(X_d_tr_b @ w_d, y_d_train):.2f}')
print(f'Diabetes — Test MSE : {mse(X_d_te_b @ w_d, y_d_test):.2f}')
🏁 Summary¶
| Section | What you practised | Key rule |
|---|---|---|
| 1 | Loading data — sklearn, CSV | Always check what you loaded |
| 2 | Inspection — shape, dtypes, NaNs, stats | Never model dirty data |
| 3 | Indexing — column select, iloc, boolean masks | Understand your slices |
| 4 | Extracting $X \in \mathbb{R}^{n \times d}$ and $\mathbf{y} \in \mathbb{R}^n$ | Row = sample, column = feature |
| 5 | Train / test split — shuffle, ratio, leakage | Test set is sacred |
| 6 | Standardisation — from scratch and sklearn | Fit scaler on train only |
| 7 | Full pipeline — load → clean → split → scale → train → evaluate | Always in this order |
The central lesson:
Any information that flows from the test set into training — even indirectly, through scaling or feature selection — is data leakage.
Leakage makes models look better than they are, and they fail silently in production.
MA2221 — Foundational Mathematics for Machine Learning · Mahindra University
Lab Notebook 7 · © Biswarup Biswas