Complete Machine Learning Pipeline¶
Course: MA2221 · Mahindra University
Integrating: Data Handling + Probability + Optimization
The Big Question:
How do we go from raw data to a trained model that makes predictions?
This lab walks through the entire machine learning workflow:
Load data → Explore → Clean → Split → Scale → Build Model → Train with Gradient Descent → Evaluate → Predict
You'll build two models:
- Linear Regression from scratch using gradient descent (PyTorch)
- Logistic Regression for classification
By the end, you'll understand what happens inside model.fit() in scikit-learn.
Structure¶
| Section | Topic |
|---|---|
| 0 | Setup and imports |
| 1 | Load and explore the California housing dataset |
| 2 | Train/test split and why it matters |
| 3 | Feature scaling (standardization) |
| 4 | Linear regression from scratch with gradient descent |
| 5 | Visualizing the training process |
| 6 | Evaluation metrics (MSE, RMSE, R²) |
| 7 | Comparing with scikit-learn |
| 8 | Classification with Logistic Regression |
| 9 | K-fold cross-validation for robust evaluation |
| 10 | Putting it all together — The complete pipeline |
| 11 | Summary and practice exercises |
Legend¶
- 🧱 Worked — run and read
- ✏️ Your Turn — fill in the code
- 🔬 Experiment — change parameters and observe
- 💬 Discuss — think about the implications
Section 0 · Setup¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
from sklearn import datasets
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix
# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
# Plot styling
plt.rcParams.update({
'figure.dpi': 120,
'axes.spines.top': False,
'axes.spines.right': False,
'font.size': 11,
})
print('✓ All imports successful')
print(f'✓ NumPy version: {np.__version__}')
print(f'✓ PyTorch version: {torch.__version__}')
Section 1 · Load and Explore the Data¶
1.1 🧱 Worked — Loading the Dataset¶
# Load the dataset
california = datasets.fetch_california_housing()
print("Dataset description:")
print(california.DESCR[:500]) # First 500 characters
print("\n" + "="*60)
# Create a DataFrame for easier exploration
df = pd.DataFrame(california.data, columns=california.feature_names)
df['MedHouseVal'] = california.target
print(f"Shape: {df.shape}")
print(f"Number of samples: {df.shape[0]}")
print(f"Number of features: {df.shape[1] - 1}") # -1 for target
print("\nFirst 5 rows:")
df.head()
1.2 🧱 Worked — Understanding the Features¶
print("Feature names:")
for i, name in enumerate(california.feature_names, 1):
print(f" {i}. {name}")
print(f"\nTarget variable: MedHouseVal (Median house value in $100,000s)")
1.3 🧱 Worked — Summary Statistics¶
df.describe()
💬 Discuss: Notice the different scales of features:
- MedInc (median income): range ~0.5 to 15
- Population: range ~3 to 35,000
- Latitude/Longitude: ~32 to 42 and -124 to -114
Why might this be a problem for gradient descent?
1.4 ✏️ Your Turn — Visualize the Target Distribution¶
# Plot histogram of house values
plt.figure(figsize=(8, 4))
plt.hist(___, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Median House Value ($100k)')
plt.ylabel('Frequency')
plt.title('Distribution of Median House Values')
plt.axvline(___.mean(), color='red', linestyle='--', label=f'Mean = {df["MedHouseVal"].mean():.2f}')
plt.legend()
plt.show()
1.5 🔬 Experiment — Check for Missing Values¶
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")
Section 2 · Train/Test Split¶
Why Split?¶
The fundamental problem: A model can memorize the training data but fail on new data.
Solution: Hold out a test set that the model never sees during training.
Test performance tells us how well the model generalizes.
2.1 🧱 Worked — Creating Train and Test Sets¶
# Separate features (X) and target (y)
X = california.data
y = california.target
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
# Split into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42 # For reproducibility
)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
print(f"\nRatio: {X_test.shape[0] / X_train.shape[0]:.2f} (test/train)")
2.2 💬 Discuss — What Happens Without Splitting?¶
If we train and test on the same data:
- Training error will be artificially low
- We have no idea how the model performs on new data
- The model might overfit — memorizing noise instead of learning patterns
Rule: Never touch the test set until final evaluation.
Section 3 · Feature Scaling¶
Why Scale?¶
Features have different units and ranges:
- MedInc: 0.5 to 15
- Population: 3 to 35,000
Gradient descent struggles when features have vastly different scales:
- Small changes in large-scale features dominate the gradient
- Learning becomes slow and unstable
Solution: Standardization (z-score normalization)
$$z = \frac{x - \mu}{\sigma}$$
After scaling, each feature has mean 0 and standard deviation 1.
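The formula can be checked by hand. This is a minimal NumPy sketch on a made-up feature column (the values are illustrative); note that NumPy's default std, like StandardScaler's, divides by n rather than n−1:

```python
import numpy as np

# Toy feature column with a large, uneven scale
x = np.array([3.0, 120.0, 47.0, 890.0, 15.0])

# z-score: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

print(z.mean())  # ≈ 0
print(z.std())   # ≈ 1
```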
3.1 🧱 Worked — Standardizing Features¶
# Create scaler and fit on training data ONLY
scaler = StandardScaler()
scaler.fit(X_train) # Learn mean and std from training data
# Transform both train and test using the SAME scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Before scaling (first sample):")
print(X_train[0])
print("\nAfter scaling (first sample):")
print(X_train_scaled[0])
3.2 ✏️ Your Turn — Verify Standardization¶
# Check that mean ≈ 0 and std ≈ 1 for each feature
print("Mean of each feature after scaling:")
print(np.mean(___, axis=0))
print("\nStd of each feature after scaling:")
print(np.std(___, axis=0))
⚠️ Critical Rule:
- Fit the scaler on training data only
- Apply the same transformation to test data
- Why? To prevent information leakage from test → train
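The leakage can be made concrete with a NumPy-only sketch on synthetic data (the distributions here are invented for illustration): if the test samples come from a shifted distribution, statistics fitted on the full dataset absorb that shift, and the "unseen" test set is no longer unseen.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(0.0, 1.0, size=100)  # training feature, centered at 0
X_te = rng.normal(5.0, 1.0, size=20)   # test feature from a shifted distribution

mu_correct = X_tr.mean()                        # mean learned from train only
mu_leaky = np.concatenate([X_tr, X_te]).mean()  # WRONG: test values leak into the statistic

print(mu_correct)  # near 0
print(mu_leaky)    # pulled noticeably toward 5
```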
Section 4 · Linear Regression from Scratch¶
The Model¶
Linear regression predicts $y$ as a weighted sum of features:
$$\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$
In matrix form:
$$\hat{y} = X w + b$$
Where:
- $X$ is the feature matrix (samples × features)
- $w$ is the weight vector
- $b$ is the bias (intercept)
The Loss Function¶
Mean Squared Error (MSE):
$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
The Algorithm: Gradient Descent¶
- Initialize $w$ and $b$ randomly
- Compute predictions: $\hat{y} = Xw + b$
- Compute loss: $L = \text{MSE}(y, \hat{y})$
- Compute gradients: $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$
- Update parameters:
- $w \leftarrow w - \alpha \frac{\partial L}{\partial w}$
- $b \leftarrow b - \alpha \frac{\partial L}{\partial b}$
- Repeat steps 2-5 for many iterations
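Before switching to PyTorch autograd below, the steps above can be sketched in plain NumPy with hand-derived gradients on toy data (all names and values here are illustrative). For MSE, the gradients are $\frac{\partial L}{\partial w} = \frac{2}{n} X^T(\hat{y} - y)$ and $\frac{\partial L}{\partial b} = \frac{2}{n} \sum (\hat{y}_i - y_i)$:

```python
import numpy as np

# Toy regression problem with known true parameters
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 3.0 + rng.normal(scale=0.1, size=200)

w = np.zeros(3)   # step 1: initialize parameters
b = 0.0
lr = 0.1

for _ in range(500):                     # step 6: repeat
    y_hat = X @ w + b                    # step 2: predictions
    err = y_hat - y                      # used by both loss and gradients
    grad_w = (2 / len(y)) * (X.T @ err)  # step 4: dL/dw
    grad_b = (2 / len(y)) * err.sum()    #         dL/db
    w -= lr * grad_w                     # step 5: gradient descent update
    b -= lr * grad_b

print(np.round(w, 2))  # ≈ [2.0, -1.0, 0.5]
print(round(b, 2))     # ≈ 3.0
```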
4.1 🧱 Worked — Implementing Linear Regression in PyTorch¶
# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train_scaled)
y_train_tensor = torch.FloatTensor(y_train)
X_test_tensor = torch.FloatTensor(X_test_scaled)
y_test_tensor = torch.FloatTensor(y_test)
print(f"X_train shape: {X_train_tensor.shape}")
print(f"y_train shape: {y_train_tensor.shape}")
# Initialize parameters
n_features = X_train_tensor.shape[1]
# Weights and bias (requires_grad=True for autograd)
w = torch.randn(n_features, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
print(f"Initial weights shape: {w.shape}")
print(f"Initial bias shape: {b.shape}")
# Hyperparameters
learning_rate = 0.01
num_epochs = 1000
# Track loss over time
losses = []
# Training loop
for epoch in range(num_epochs):
# Forward pass: compute predictions
y_pred = X_train_tensor @ w + b
# Compute loss (MSE)
loss = torch.mean((y_pred.squeeze() - y_train_tensor) ** 2)
losses.append(loss.item())
# Backward pass: compute gradients
loss.backward()
# Update parameters (gradient descent step)
with torch.no_grad():
w -= learning_rate * w.grad
b -= learning_rate * b.grad
# Zero gradients for next iteration
w.grad.zero_()
b.grad.zero_()
# Print progress
if (epoch + 1) % 100 == 0:
print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")
print("\n✓ Training complete!")
4.2 ✏️ Your Turn — Understanding the Code¶
Questions:
- What does @ do in X_train_tensor @ w?
- Why do we use with torch.no_grad() when updating parameters?
- What happens if we forget to zero the gradients?
Click for answers
- @ is matrix multiplication (same as torch.matmul or np.dot)
- We don't want PyTorch to track gradients during parameter updates (they are not part of the computational graph)
- Gradients accumulate! Old gradients add to new ones, corrupting the update.
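The accumulation is easy to see on a one-variable sketch:

```python
import torch

# Gradients accumulate across backward() calls unless explicitly zeroed
x = torch.tensor(2.0, requires_grad=True)

(x ** 2).backward()        # dy/dx = 2x = 4
first = x.grad.item()
print(first)               # 4.0

(x ** 2).backward()        # without zeroing, the new gradient ADDS to the old one
second = x.grad.item()
print(second)              # 8.0, not 4.0

x.grad.zero_()             # reset before the next update
```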
Section 5 · Visualizing the Training Process¶
5.1 🧱 Worked — Plotting the Loss Curve¶
plt.figure(figsize=(10, 4))
# Plot 1: Full training curve
plt.subplot(1, 2, 1)
plt.plot(losses, linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('Training Loss Over Time')
plt.grid(True, alpha=0.3)
# Plot 2: Log scale to see convergence
plt.subplot(1, 2, 2)
plt.plot(losses, linewidth=2, color='orange')
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('Training Loss (Log Scale)')
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Initial loss: {losses[0]:.4f}")
print(f"Final loss: {losses[-1]:.4f}")
print(f"Reduction: {(1 - losses[-1]/losses[0]) * 100:.1f}%")
5.2 🔬 Experiment — Effect of Learning Rate¶
Try different learning rates and observe:
- Too small (e.g., 0.0001): slow convergence
- Too large (e.g., 0.1): oscillation or divergence
- Just right (e.g., 0.01): smooth descent
def train_with_lr(lr, epochs=500):
"""Train model with given learning rate and return loss history"""
w_temp = torch.randn(n_features, 1, requires_grad=True)
b_temp = torch.zeros(1, requires_grad=True)
losses_temp = []
for epoch in range(epochs):
y_pred = X_train_tensor @ w_temp + b_temp
loss = torch.mean((y_pred.squeeze() - y_train_tensor) ** 2)
losses_temp.append(loss.item())
loss.backward()
with torch.no_grad():
w_temp -= lr * w_temp.grad
b_temp -= lr * b_temp.grad
w_temp.grad.zero_()
b_temp.grad.zero_()
return losses_temp
# Compare different learning rates
learning_rates = [0.001, 0.01, 0.05, 0.1]
plt.figure(figsize=(10, 5))
for lr in learning_rates:
losses_lr = train_with_lr(lr)
plt.plot(losses_lr, label=f'LR = {lr}', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('Effect of Learning Rate on Training')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()
Section 6 · Evaluation Metrics¶
6.1 🧱 Worked — Predictions on the Test Set¶
# Make predictions on test set
with torch.no_grad():
y_test_pred = (X_test_tensor @ w + b).squeeze()
# Convert back to numpy for evaluation
y_test_pred_np = y_test_pred.numpy()
print("First 5 predictions vs actual:")
for i in range(5):
print(f"Predicted: {y_test_pred_np[i]:.2f}, Actual: {y_test[i]:.2f}")
6.2 🧱 Worked — Computing Metrics¶
Mean Squared Error (MSE): $$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Root Mean Squared Error (RMSE): $$\text{RMSE} = \sqrt{\text{MSE}}$$
R² Score (Coefficient of Determination): $$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$
- $R^2 = 1$: perfect predictions
- $R^2 = 0$: model is no better than predicting the mean
- $R^2 < 0$: model is worse than predicting the mean
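All three formulas can be computed directly in NumPy; this sketch uses small made-up values so the arithmetic is easy to follow:

```python
import numpy as np

y_true = np.array([3.0, 1.0, 4.0, 2.0])
y_pred = np.array([2.5, 1.5, 3.5, 2.0])

mse = np.mean((y_true - y_pred) ** 2)            # average squared error
rmse = np.sqrt(mse)                              # back in the units of y

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mse)   # 0.1875
print(r2)    # 0.85
```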
# Compute metrics
mse = mean_squared_error(y_test, y_test_pred_np)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_test_pred_np)
print("="*50)
print("TEST SET PERFORMANCE")
print("="*50)
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
print("="*50)
print(f"\nInterpretation: Our model explains {r2*100:.1f}% of variance in house prices")
6.3 🧱 Worked — Predictions vs Actual Plot¶
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_test_pred_np, alpha=0.5, s=20)
plt.plot([y_test.min(), y_test.max()],
[y_test.min(), y_test.max()],
'r--', linewidth=2, label='Perfect predictions')
plt.xlabel('Actual House Value ($100k)')
plt.ylabel('Predicted House Value ($100k)')
plt.title(f'Predictions vs Actual (R² = {r2:.3f})')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.show()
6.4 ✏️ Your Turn — Residual Analysis¶
# Compute residuals (errors)
residuals = y_test - ___
plt.figure(figsize=(10, 4))
# Plot 1: Residual histogram
plt.subplot(1, 2, 1)
plt.hist(___, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Residual (Actual - Predicted)')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.axvline(0, color='red', linestyle='--')
# Plot 2: Residuals vs predictions
plt.subplot(1, 2, 2)
plt.scatter(___, residuals, alpha=0.5, s=20)
plt.xlabel('Predicted Value')
plt.ylabel('Residual')
plt.title('Residual Plot')
plt.axhline(0, color='red', linestyle='--')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
💬 Discuss:
- What does a good residual plot look like?
- What patterns in residuals indicate problems?
Section 7 · Comparing with scikit-learn¶
7.1 🧱 Worked — sklearn's LinearRegression¶
# Train sklearn's LinearRegression
sklearn_model = LinearRegression()
sklearn_model.fit(X_train_scaled, y_train)
# Make predictions
y_test_pred_sklearn = sklearn_model.predict(X_test_scaled)
# Compute metrics
mse_sklearn = mean_squared_error(y_test, y_test_pred_sklearn)
r2_sklearn = r2_score(y_test, y_test_pred_sklearn)
print("="*50)
print("COMPARISON: Our Model vs sklearn")
print("="*50)
print(f"{'Metric':<25} {'Our Model':<15} {'sklearn':<15}")
print("-"*50)
print(f"{'MSE':<25} {mse:<15.4f} {mse_sklearn:<15.4f}")
print(f"{'R² Score':<25} {r2:<15.4f} {r2_sklearn:<15.4f}")
print("="*50)
Note: sklearn uses the closed-form solution (normal equation) instead of gradient descent:
$$w = (X^T X)^{-1} X^T y$$
This is exact but computationally expensive for large datasets. Gradient descent scales better.
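The normal equation can be sketched directly in NumPy on toy data (the true parameters below are invented for illustration). Appending a column of ones to $X$ absorbs the bias $b$ into $w$, and solving the linear system is numerically safer than forming the explicit inverse:

```python
import numpy as np

# Toy data with known parameters: y = 1.5·x1 - 2.0·x2 + 0.7 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + 0.7 + rng.normal(scale=0.01, size=100)

Xb = np.hstack([X, np.ones((100, 1))])    # [X | 1]: last weight plays the role of b
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)  # solves (XᵀX) w = Xᵀy, i.e. w = (XᵀX)⁻¹Xᵀy

print(np.round(w, 2))  # ≈ [1.5, -2.0, 0.7]
```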
Section 8 · Classification with Logistic Regression¶
8.1 🧱 Worked — Loading the Iris Dataset¶
# Load iris dataset (3-class classification)
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target
print(f"Number of samples: {X_iris.shape[0]}")
print(f"Number of features: {X_iris.shape[1]}")
print(f"Number of classes: {len(np.unique(y_iris))}")
print(f"Class names: {iris.target_names}")
8.2 ✏️ Your Turn — Complete Classification Pipeline¶
# Step 1: Split data
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
X_iris, y_iris,
test_size=___, # Fill in
random_state=42,
stratify=y_iris # Ensures balanced classes in train/test
)
# Step 2: Scale features
scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(___)
X_test_iris_scaled = scaler_iris.transform(___)
# Step 3: Train classifier
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(___, ___)
# Step 4: Make predictions
y_pred_iris = clf.predict(___)
# Step 5: Evaluate
accuracy = accuracy_score(___, ___)
print(f"\nTest Accuracy: {accuracy:.3f} ({accuracy*100:.1f}%)")
8.3 🧱 Worked — Confusion Matrix¶
# Compute confusion matrix
cm = confusion_matrix(y_test_iris, y_pred_iris)
# Visualize
plt.figure(figsize=(8, 6))
plt.imshow(cm, interpolation='nearest', cmap='Blues')
plt.title('Confusion Matrix - Iris Classification')
plt.colorbar()
tick_marks = np.arange(len(iris.target_names))
plt.xticks(tick_marks, iris.target_names, rotation=45)
plt.yticks(tick_marks, iris.target_names)
# Add text annotations
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
for j in range(cm.shape[1]):
plt.text(j, i, format(cm[i, j], 'd'),
ha="center", va="center",
color="white" if cm[i, j] > thresh else "black",
fontsize=16)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()
Section 9 · K-Fold Cross-Validation¶
The Problem with Single Train/Test Split¶
One split gives us one number — which is a noisy estimate of true performance.
Different splits give different results.
The Solution: K-Fold Cross-Validation¶
- Split data into K folds (e.g., K=5)
- Train K times, each time using a different fold as test set
- Average the K test scores
This gives a more reliable estimate of generalization performance.
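The fold mechanics can be seen on a toy array of 10 samples: each index lands in the validation fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)  # shuffle=False so the folds are easy to read
val_indices = []
for train_idx, val_idx in kf.split(np.arange(10)):
    print("train:", train_idx, "val:", val_idx)
    val_indices.extend(val_idx.tolist())

print(sorted(val_indices))  # every sample validated exactly once: [0, 1, ..., 9]
```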
9.1 🧱 Worked — Implementing K-Fold CV¶
# Create K-fold splitter
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# Manually implement K-fold
fold_scores = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_iris), 1):
# Split data
X_train_fold = X_iris[train_idx]
X_val_fold = X_iris[val_idx]
y_train_fold = y_iris[train_idx]
y_val_fold = y_iris[val_idx]
# Scale
scaler_fold = StandardScaler()
X_train_fold_scaled = scaler_fold.fit_transform(X_train_fold)
X_val_fold_scaled = scaler_fold.transform(X_val_fold)
# Train
clf_fold = LogisticRegression(max_iter=1000, random_state=42)
clf_fold.fit(X_train_fold_scaled, y_train_fold)
# Evaluate
score = clf_fold.score(X_val_fold_scaled, y_val_fold)
fold_scores.append(score)
print(f"Fold {fold}: Accuracy = {score:.3f}")
print("\n" + "="*40)
print(f"Mean Accuracy: {np.mean(fold_scores):.3f} ± {np.std(fold_scores):.3f}")
print("="*40)
9.2 🧱 Worked — Using sklearn's cross_val_score¶
# sklearn provides a convenient function
clf_cv = LogisticRegression(max_iter=1000, random_state=42)
# This handles scaling + training + evaluation for each fold
# Note: For proper scaling in CV, use Pipeline (see next section)
scores = cross_val_score(clf_cv, X_iris, y_iris, cv=5)
print("Cross-validation scores:", scores)
print(f"\nMean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")
print(f"Approx. 95% interval (mean ± 2·std): [{scores.mean() - 2*scores.std():.3f}, {scores.mean() + 2*scores.std():.3f}]")
9.3 ✏️ Your Turn — Visualize CV Results¶
plt.figure(figsize=(8, 5))
plt.bar(range(1, 6), fold_scores, alpha=0.7, edgecolor='black')
plt.axhline(np.mean(fold_scores), color='red', linestyle='--',
label=f'Mean = {np.mean(fold_scores):.3f}')
plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.title('5-Fold Cross-Validation Results')
plt.ylim([0.8, 1.0])
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()
Section 10 · The Complete Pipeline¶
10.1 🧱 Worked — Building a Pipeline¶
from sklearn.pipeline import Pipeline
# Create pipeline: scaler → model
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression(max_iter=1000, random_state=42))
])
# Now cross-validation is done correctly
# Scaler is fit on training folds only, not validation fold
cv_scores = cross_val_score(pipeline, X_iris, y_iris, cv=5)
print("Pipeline CV Scores:", cv_scores)
print(f"Mean Accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
10.2 🧱 Worked — Final Model Training¶
After CV tells us the model is good, train on all available data for deployment.
# Train on full dataset
pipeline.fit(X_iris, y_iris)
# Now we can make predictions on new data
# Example: predict for a new flower
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]]) # Example features
prediction = pipeline.predict(new_flower)
probability = pipeline.predict_proba(new_flower)
print(f"Predicted class: {iris.target_names[prediction[0]]}")
print(f"\nClass probabilities:")
for i, prob in enumerate(probability[0]):
print(f" {iris.target_names[i]}: {prob:.3f}")
Section 11 · Summary¶
What You've Learned¶
✅ Data Pipeline:
- Load data from various sources
- Explore and visualize
- Handle missing values
✅ Train/Test Split:
- Why: prevent overfitting, measure generalization
- How: 80/20 or 70/30 split
- Critical rule: never touch test data during training
✅ Feature Scaling:
- Why: gradient descent needs similar scales
- How: standardization (z-score)
- Critical rule: fit on train, transform on test
✅ Model Training:
- Implemented linear regression from scratch
- Understood gradient descent algorithm
- Visualized training dynamics
✅ Evaluation:
- Regression: MSE, RMSE, R²
- Classification: Accuracy, Confusion Matrix
- Residual analysis
✅ Cross-Validation:
- Why: single split is unreliable
- K-fold gives robust estimate
- Pipelines prevent data leakage
The Standard ML Workflow¶
# 1. Load data
X, y = load_data()
# 2. Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# 3. Build pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', SomeModel())
])
# 4. Cross-validate
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Score: {scores.mean():.3f} ± {scores.std():.3f}")
# 5. Train on all training data
pipeline.fit(X_train, y_train)
# 6. Final evaluation on test set
test_score = pipeline.score(X_test, y_test)
print(f"Test Score: {test_score:.3f}")
Next Steps¶
- Try this pipeline on the California housing data
- Experiment with different models (Ridge, Lasso, etc.)
- Add hyperparameter tuning (GridSearchCV)
- Work with your own datasets
🎯 Practice Exercises¶
Exercise 1: Complete Pipeline for California Housing¶
Build a complete pipeline for the California housing dataset:
- Use only 2 features: MedInc and HouseAge
- Implement 10-fold cross-validation
- Compare performance with using all features
Exercise 2: Learning Rate Exploration¶
For the gradient descent implementation:
- Find the optimal learning rate by trying values from 0.0001 to 0.1
- Plot final loss vs learning rate
- What happens at the extremes?
Exercise 3: Feature Importance¶
After training linear regression:
- Extract the learned weights
- Visualize which features have the largest (absolute) weights
- What does this tell you about feature importance?
Exercise 4: Train vs Test Performance¶
Modify the training loop to:
- Track both train AND test loss at each epoch
- Plot both curves on the same graph
- Do they diverge? Why or why not?
Congratulations! 🎉 You've built your first complete ML pipeline from scratch!