Complete Machine Learning Pipeline¶
Course: MA2221 · Mahindra University
Integrating: Data Handling + Probability + Optimization
The Big Question:
How do we go from raw data to a trained model that makes predictions?
This lab walks through the entire machine learning workflow:
Load data → Explore → Clean → Split → Scale → Build Model → Train with Gradient Descent → Evaluate → Predict
You'll build two models:
- Linear Regression from scratch using gradient descent (PyTorch)
- Logistic Regression for classification
By the end, you'll understand what happens inside model.fit() in scikit-learn.
Structure¶
| Section | Topic |
|---|---|
| 0 | Setup and imports |
| 1 | Load and explore the California housing dataset |
| 2 | Train/test split and why it matters |
| 3 | Feature scaling (standardization) |
| 4 | Linear regression from scratch with gradient descent |
| 5 | Visualizing the training process |
| 6 | Evaluation metrics (MSE, RMSE, R²) |
| 7 | Comparing with scikit-learn |
| 8 | Classification with Logistic Regression |
| 9 | K-fold cross-validation for robust evaluation |
| 10 | Putting it all together — The complete pipeline |
| 11 | Summary and practice exercises |
Legend¶
- 🧱 Worked — run and read
- ✏️ Your Turn — fill in the code
- 🔬 Experiment — change parameters and observe
- 💬 Discuss — think about the implications
Section 0 · Setup¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
from sklearn import datasets
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix
# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
# Plot styling
plt.rcParams.update({
'figure.dpi': 120,
'axes.spines.top': False,
'axes.spines.right': False,
'font.size': 11,
})
print('✓ All imports successful')
print(f'✓ NumPy version: {np.__version__}')
print(f'✓ PyTorch version: {torch.__version__}')
Section 1 · Load and Explore the Data¶
1.1 🧱 Worked — Loading the Dataset¶
# Load the dataset
california = datasets.fetch_california_housing()
print("Dataset description:")
print(california.DESCR[:500]) # First 500 characters
print("\n" + "="*60)
# Create a DataFrame for easier exploration
df = pd.DataFrame(california.data, columns=california.feature_names)
df['MedHouseVal'] = california.target
print(f"Shape: {df.shape}")
print(f"Number of samples: {df.shape[0]}")
print(f"Number of features: {df.shape[1] - 1}") # -1 for target
print("\nFirst 5 rows:")
df.head()
1.2 🧱 Worked — Understanding the Features¶
print("Feature names:")
for i, name in enumerate(california.feature_names, 1):
print(f" {i}. {name}")
print(f"\nTarget variable: MedHouseVal (Median house value in $100,000s)")
1.3 🧱 Worked — Summary Statistics¶
df.describe()
💬 Discuss: Notice the different scales of features:
- MedInc (median income): range ~0.5 to 15
- Population: range ~3 to 35,000
- Latitude/Longitude: ~32 to 42 and -124 to -114
Why might this be a problem for gradient descent?
1.4 ✏️ Your Turn — Visualize the Target Distribution¶
# Plot histogram of house values
plt.figure(figsize=(8, 4))
plt.hist(___, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Median House Value ($100k)')
plt.ylabel('Frequency')
plt.title('Distribution of Median House Values')
plt.axvline(___.mean(), color='red', linestyle='--', label=f'Mean = {df["MedHouseVal"].mean():.2f}')
plt.legend()
plt.show()
1.5 🔬 Experiment — Check for Missing Values¶
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")
Section 2 · Train/Test Split¶
Why Split?¶
The fundamental problem: A model can memorize the training data but fail on new data.
Solution: Hold out a test set that the model never sees during training.
Test performance tells us how well the model generalizes.
2.1 🧱 Worked — Creating Train and Test Sets¶
# Separate features (X) and target (y)
X = california.data
y = california.target
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
# Split into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42 # For reproducibility
)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
print(f"\nRatio: {X_test.shape[0] / X_train.shape[0]:.2f} (test/train)")
2.2 💬 Discuss — What Happens Without Splitting?¶
If we train and test on the same data:
- Training error will be artificially low
- We have no idea how the model performs on new data
- The model might overfit — memorizing noise instead of learning patterns
Rule: Never touch the test set until final evaluation.
Section 3 · Feature Scaling¶
Why Scale?¶
Features have different units and ranges:
- MedInc: 0.5 to 15
- Population: 3 to 35,000
Gradient descent struggles when features have vastly different scales:
- Small changes in large-scale features dominate the gradient
- Learning becomes slow and unstable
Solution: Standardization (z-score normalization)
$$z = \frac{x - \mu}{\sigma}$$
After scaling, each feature has mean 0 and standard deviation 1.
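The formula can be checked by hand. This is a minimal NumPy sketch on a made-up feature column (the values are illustrative); note that NumPy's default std, like StandardScaler's, divides by n rather than n−1:

```python
import numpy as np

# Toy feature column with a large, uneven scale
x = np.array([3.0, 120.0, 47.0, 890.0, 15.0])

# z-score: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

print(z.mean())  # ≈ 0
print(z.std())   # ≈ 1
```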
3.1 🧱 Worked — Standardizing Features¶
# Create scaler and fit on training data ONLY
scaler = StandardScaler()
scaler.fit(X_train) # Learn mean and std from training data
# Transform both train and test using the SAME scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Before scaling (first sample):")
print(X_train[0])
print("\nAfter scaling (first sample):")
print(X_train_scaled[0])
3.2 ✏️ Your Turn — Verify Standardization¶
# Check that mean ≈ 0 and std ≈ 1 for each feature
print("Mean of each feature after scaling:")
print(np.mean(___, axis=0))
print("\nStd of each feature after scaling:")
print(np.std(___, axis=0))
⚠️ Critical Rule:
- Fit the scaler on training data only
- Apply the same transformation to test data
- Why? To prevent information leakage from test → train
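The leakage can be made concrete with a NumPy-only sketch on synthetic data (the distributions here are invented for illustration): if the test samples come from a shifted distribution, statistics fitted on the full dataset absorb that shift, and the "unseen" test set is no longer unseen.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(0.0, 1.0, size=100)  # training feature, centered at 0
X_te = rng.normal(5.0, 1.0, size=20)   # test feature from a shifted distribution

mu_correct = X_tr.mean()                        # mean learned from train only
mu_leaky = np.concatenate([X_tr, X_te]).mean()  # WRONG: test values leak into the statistic

print(mu_correct)  # near 0
print(mu_leaky)    # pulled noticeably toward 5
```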
Section 4 · Linear Regression from Scratch¶
The Model¶
Linear regression predicts $y$ as a weighted sum of features:
$$\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$
In matrix form:
$$\hat{y} = X w + b$$
Where:
- $X$ is the feature matrix (samples × features)
- $w$ is the weight vector
- $b$ is the bias (intercept)
The Loss Function¶
Mean Squared Error (MSE):
$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
The Algorithm: Gradient Descent¶
- Initialize $w$ and $b$ randomly
- Compute predictions: $\hat{y} = Xw + b$
- Compute loss: $L = \text{MSE}(y, \hat{y})$
- Compute gradients: $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$
- Update parameters:
- $w \leftarrow w - \alpha \frac{\partial L}{\partial w}$
- $b \leftarrow b - \alpha \frac{\partial L}{\partial b}$
- Repeat steps 2-5 for many iterations
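Before switching to PyTorch autograd below, the steps above can be sketched in plain NumPy with hand-derived gradients on toy data (all names and values here are illustrative). For MSE, the gradients are $\frac{\partial L}{\partial w} = \frac{2}{n} X^T(\hat{y} - y)$ and $\frac{\partial L}{\partial b} = \frac{2}{n} \sum (\hat{y}_i - y_i)$:

```python
import numpy as np

# Toy regression problem with known true parameters
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 3.0 + rng.normal(scale=0.1, size=200)

w = np.zeros(3)   # step 1: initialize parameters
b = 0.0
lr = 0.1

for _ in range(500):                     # step 6: repeat
    y_hat = X @ w + b                    # step 2: predictions
    err = y_hat - y                      # used by both loss and gradients
    grad_w = (2 / len(y)) * (X.T @ err)  # step 4: dL/dw
    grad_b = (2 / len(y)) * err.sum()    #         dL/db
    w -= lr * grad_w                     # step 5: gradient descent update
    b -= lr * grad_b

print(np.round(w, 2))  # ≈ [2.0, -1.0, 0.5]
print(round(b, 2))     # ≈ 3.0
```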
4.1 🧱 Worked — Implementing Linear Regression in PyTorch¶
# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train_scaled)
y_train_tensor = torch.FloatTensor(y_train)
X_test_tensor = torch.FloatTensor(X_test_scaled)
y_test_tensor = torch.FloatTensor(y_test)
print(f"X_train shape: {X_train_tensor.shape}")
print(f"y_train shape: {y_train_tensor.shape}")
# Initialize parameters
n_features = X_train_tensor.shape[1]
# Weights and bias (requires_grad=True for autograd)
w = torch.randn(n_features, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
print(f"Initial weights shape: {w.shape}")
print(f"Initial bias shape: {b.shape}")
# Hyperparameters
learning_rate = 0.01
num_epochs = 1000
# Track loss over time
losses = []
# Training loop
for epoch in range(num_epochs):
# Forward pass: compute predictions
y_pred = X_train_tensor @ w + b
# Compute loss (MSE)
loss = torch.mean((y_pred.squeeze() - y_train_tensor) ** 2)
losses.append(loss.item())
# Backward pass: compute gradients
loss.backward()
# Update parameters (gradient descent step)
with torch.no_grad():
w -= learning_rate * w.grad
b -= learning_rate * b.grad
# Zero gradients for next iteration
w.grad.zero_()
b.grad.zero_()
# Print progress
if (epoch + 1) % 100 == 0:
print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")
print("\n✓ Training complete!")
4.2 ✏️ Your Turn — Understanding the Code¶
Questions:
- What does @ do in X_train_tensor @ w?
- Why do we use with torch.no_grad() when updating parameters?
- What happens if we forget to zero the gradients?
Click for answers
- @ is matrix multiplication (same as torch.matmul or np.dot)
- We don't want PyTorch to track gradients during parameter updates (they are not part of the computational graph)
- Gradients accumulate! Old gradients add to new ones, corrupting the update.
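The accumulation is easy to see on a one-variable sketch:

```python
import torch

# Gradients accumulate across backward() calls unless explicitly zeroed
x = torch.tensor(2.0, requires_grad=True)

(x ** 2).backward()        # dy/dx = 2x = 4
first = x.grad.item()
print(first)               # 4.0

(x ** 2).backward()        # without zeroing, the new gradient ADDS to the old one
second = x.grad.item()
print(second)              # 8.0, not 4.0

x.grad.zero_()             # reset before the next update
```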
Section 5 · Visualizing the Training Process¶
5.1 🧱 Worked — Plotting the Loss Curve¶
plt.figure(figsize=(10, 4))
# Plot 1: Full training curve
plt.subplot(1, 2, 1)
plt.plot(losses, linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('Training Loss Over Time')
plt.grid(True, alpha=0.3)
# Plot 2: Log scale to see convergence
plt.subplot(1, 2, 2)
plt.plot(losses, linewidth=2, color='orange')
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('Training Loss (Log Scale)')
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Initial loss: {losses[0]:.4f}")
print(f"Final loss: {losses[-1]:.4f}")
print(f"Reduction: {(1 - losses[-1]/losses[0]) * 100:.1f}%")
5.2 🔬 Experiment — Effect of Learning Rate¶
Try different learning rates and observe:
- Too small (e.g., 0.0001): slow convergence
- Too large (e.g., 0.1): oscillation or divergence
- Just right (e.g., 0.01): smooth descent
def train_with_lr(lr, epochs=500):
"""Train model with given learning rate and return loss history"""
w_temp = torch.randn(n_features, 1, requires_grad=True)
b_temp = torch.zeros(1, requires_grad=True)
losses_temp = []
for epoch in range(epochs):
y_pred = X_train_tensor @ w_temp + b_temp
loss = torch.mean((y_pred.squeeze() - y_train_tensor) ** 2)
losses_temp.append(loss.item())
loss.backward()
with torch.no_grad():
w_temp -= lr * w_temp.grad
b_temp -= lr * b_temp.grad
w_temp.grad.zero_()
b_temp.grad.zero_()
return losses_temp
# Compare different learning rates
learning_rates = [0.001, 0.01, 0.05, 0.1]
plt.figure(figsize=(10, 5))
for lr in learning_rates:
losses_lr = train_with_lr(lr)
plt.plot(losses_lr, label=f'LR = {lr}', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('Effect of Learning Rate on Training')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()
Section 6 · Evaluation Metrics¶
6.1 🧱 Worked — Predictions on the Test Set¶
# Make predictions on test set
with torch.no_grad():
y_test_pred = (X_test_tensor @ w + b).squeeze()
# Convert back to numpy for evaluation
y_test_pred_np = y_test_pred.numpy()
print("First 5 predictions vs actual:")
for i in range(5):
print(f"Predicted: {y_test_pred_np[i]:.2f}, Actual: {y_test[i]:.2f}")
6.2 🧱 Worked — Computing Metrics¶
Mean Squared Error (MSE): $$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Root Mean Squared Error (RMSE): $$\text{RMSE} = \sqrt{\text{MSE}}$$
R² Score (Coefficient of Determination): $$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$
- $R^2 = 1$: perfect predictions
- $R^2 = 0$: model is no better than predicting the mean
- $R^2 < 0$: model is worse than predicting the mean
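All three formulas can be computed directly in NumPy; this sketch uses small made-up values so the arithmetic is easy to follow:

```python
import numpy as np

y_true = np.array([3.0, 1.0, 4.0, 2.0])
y_pred = np.array([2.5, 1.5, 3.5, 2.0])

mse = np.mean((y_true - y_pred) ** 2)            # average squared error
rmse = np.sqrt(mse)                              # back in the units of y

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mse)   # 0.1875
print(r2)    # 0.85
```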
# Compute metrics
mse = mean_squared_error(y_test, y_test_pred_np)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_test_pred_np)
print("="*50)
print("TEST SET PERFORMANCE")
print("="*50)
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
print("="*50)
print(f"\nInterpretation: Our model explains {r2*100:.1f}% of variance in house prices")
6.3 🧱 Worked — Predictions vs Actual Plot¶
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_test_pred_np, alpha=0.5, s=20)
plt.plot([y_test.min(), y_test.max()],
[y_test.min(), y_test.max()],
'r--', linewidth=2, label='Perfect predictions')
plt.xlabel('Actual House Value ($100k)')
plt.ylabel('Predicted House Value ($100k)')
plt.title(f'Predictions vs Actual (R² = {r2:.3f})')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.show()
6.4 ✏️ Your Turn — Residual Analysis¶
# Compute residuals (errors)
residuals = y_test - ___
plt.figure(figsize=(10, 4))
# Plot 1: Residual histogram
plt.subplot(1, 2, 1)
plt.hist(___, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Residual (Actual - Predicted)')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.axvline(0, color='red', linestyle='--')
# Plot 2: Residuals vs predictions
plt.subplot(1, 2, 2)
plt.scatter(___, residuals, alpha=0.5, s=20)
plt.xlabel('Predicted Value')
plt.ylabel('Residual')
plt.title('Residual Plot')
plt.axhline(0, color='red', linestyle='--')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
💬 Discuss:
- What does a good residual plot look like?
- What patterns in residuals indicate problems?
Section 7 · Comparing with scikit-learn¶
7.1 🧱 Worked — sklearn's LinearRegression¶
# Train sklearn's LinearRegression
sklearn_model = LinearRegression()
sklearn_model.fit(X_train_scaled, y_train)
# Make predictions
y_test_pred_sklearn = sklearn_model.predict(X_test_scaled)
# Compute metrics
mse_sklearn = mean_squared_error(y_test, y_test_pred_sklearn)
r2_sklearn = r2_score(y_test, y_test_pred_sklearn)
print("="*50)
print("COMPARISON: Our Model vs sklearn")
print("="*50)
print(f"{'Metric':<25} {'Our Model':<15} {'sklearn':<15}")
print("-"*50)
print(f"{'MSE':<25} {mse:<15.4f} {mse_sklearn:<15.4f}")
print(f"{'R² Score':<25} {r2:<15.4f} {r2_sklearn:<15.4f}")
print("="*50)
Note: sklearn uses the closed-form solution (normal equation) instead of gradient descent:
$$w = (X^T X)^{-1} X^T y$$
This is exact but computationally expensive for large datasets. Gradient descent scales better.
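The normal equation can be sketched directly in NumPy on toy data (the true parameters below are invented for illustration). Appending a column of ones to $X$ absorbs the bias $b$ into $w$, and solving the linear system is numerically safer than forming the explicit inverse:

```python
import numpy as np

# Toy data with known parameters: y = 1.5·x1 - 2.0·x2 + 0.7 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + 0.7 + rng.normal(scale=0.01, size=100)

Xb = np.hstack([X, np.ones((100, 1))])    # [X | 1]: last weight plays the role of b
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)  # solves (XᵀX) w = Xᵀy, i.e. w = (XᵀX)⁻¹Xᵀy

print(np.round(w, 2))  # ≈ [1.5, -2.0, 0.7]
```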
Section 8 · Classification with Logistic Regression¶
8.1 🧱 Worked — Loading the Iris Dataset¶
# Load iris dataset (3-class classification)
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target
print(f"Number of samples: {X_iris.shape[0]}")
print(f"Number of features: {X_iris.shape[1]}")
print(f"Number of classes: {len(np.unique(y_iris))}")
print(f"Class names: {iris.target_names}")
8.2 ✏️ Your Turn — Complete Classification Pipeline¶
# Step 1: Split data
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
X_iris, y_iris,
test_size=___, # Fill in
random_state=42,
stratify=y_iris # Ensures balanced classes in train/test
)
# Step 2: Scale features
scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(___)
X_test_iris_scaled = scaler_iris.transform(___)
# Step 3: Train classifier
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(___, ___)
# Step 4: Make predictions
y_pred_iris = clf.predict(___)
# Step 5: Evaluate
accuracy = accuracy_score(___, ___)
print(f"\nTest Accuracy: {accuracy:.3f} ({accuracy*100:.1f}%)")
8.3 🧱 Worked — Confusion Matrix¶
# Compute confusion matrix
cm = confusion_matrix(y_test_iris, y_pred_iris)
# Visualize
plt.figure(figsize=(8, 6))
plt.imshow(cm, interpolation='nearest', cmap='Blues')
plt.title('Confusion Matrix - Iris Classification')
plt.colorbar()
tick_marks = np.arange(len(iris.target_names))
plt.xticks(tick_marks, iris.target_names, rotation=45)
plt.yticks(tick_marks, iris.target_names)
# Add text annotations
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
for j in range(cm.shape[1]):
plt.text(j, i, format(cm[i, j], 'd'),
ha="center", va="center",
color="white" if cm[i, j] > thresh else "black",
fontsize=16)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()
Section 9 · K-Fold Cross-Validation¶
The Problem with Single Train/Test Split¶
One split gives us one number — which is a noisy estimate of true performance.
Different splits give different results.
The Solution: K-Fold Cross-Validation¶
- Split data into K folds (e.g., K=5)
- Train K times, each time using a different fold as test set
- Average the K test scores
This gives a more reliable estimate of generalization performance.
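The fold mechanics can be seen on a toy array of 10 samples: each index lands in the validation fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)  # shuffle=False so the folds are easy to read
val_indices = []
for train_idx, val_idx in kf.split(np.arange(10)):
    print("train:", train_idx, "val:", val_idx)
    val_indices.extend(val_idx.tolist())

print(sorted(val_indices))  # every sample validated exactly once: [0, 1, ..., 9]
```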
9.1 🧱 Worked — Implementing K-Fold CV¶
# Create K-fold splitter
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# Manually implement K-fold
fold_scores = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_iris), 1):
# Split data
X_train_fold = X_iris[train_idx]
X_val_fold = X_iris[val_idx]
y_train_fold = y_iris[train_idx]
y_val_fold = y_iris[val_idx]
# Scale
scaler_fold = StandardScaler()
X_train_fold_scaled = scaler_fold.fit_transform(X_train_fold)
X_val_fold_scaled = scaler_fold.transform(X_val_fold)
# Train
clf_fold = LogisticRegression(max_iter=1000, random_state=42)
clf_fold.fit(X_train_fold_scaled, y_train_fold)
# Evaluate
score = clf_fold.score(X_val_fold_scaled, y_val_fold)
fold_scores.append(score)
print(f"Fold {fold}: Accuracy = {score:.3f}")
print("\n" + "="*40)
print(f"Mean Accuracy: {np.mean(fold_scores):.3f} ± {np.std(fold_scores):.3f}")
print("="*40)
9.2 🧱 Worked — Using sklearn's cross_val_score¶
# sklearn provides a convenient function
clf_cv = LogisticRegression(max_iter=1000, random_state=42)
# This handles scaling + training + evaluation for each fold
# Note: For proper scaling in CV, use Pipeline (see next section)
scores = cross_val_score(clf_cv, X_iris, y_iris, cv=5)
print("Cross-validation scores:", scores)
print(f"\nMean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")
print(f"Approx. 95% interval (mean ± 2·std): [{scores.mean() - 2*scores.std():.3f}, {scores.mean() + 2*scores.std():.3f}]")
9.3 ✏️ Your Turn — Visualize CV Results¶
plt.figure(figsize=(8, 5))
plt.bar(range(1, 6), fold_scores, alpha=0.7, edgecolor='black')
plt.axhline(np.mean(fold_scores), color='red', linestyle='--',
label=f'Mean = {np.mean(fold_scores):.3f}')
plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.title('5-Fold Cross-Validation Results')
plt.ylim([0.8, 1.0])
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()
Section 10 · The Complete Pipeline¶
10.1 🧱 Worked — Building a Pipeline¶
from sklearn.pipeline import Pipeline
# Create pipeline: scaler → model
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression(max_iter=1000, random_state=42))
])
# Now cross-validation is done correctly
# Scaler is fit on training folds only, not validation fold
cv_scores = cross_val_score(pipeline, X_iris, y_iris, cv=5)
print("Pipeline CV Scores:", cv_scores)
print(f"Mean Accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
10.2 🧱 Worked — Final Model Training¶
After CV tells us the model is good, train on all available data for deployment.
# Train on full dataset
pipeline.fit(X_iris, y_iris)
# Now we can make predictions on new data
# Example: predict for a new flower
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]]) # Example features
prediction = pipeline.predict(new_flower)
probability = pipeline.predict_proba(new_flower)
print(f"Predicted class: {iris.target_names[prediction[0]]}")
print(f"\nClass probabilities:")
for i, prob in enumerate(probability[0]):
print(f" {iris.target_names[i]}: {prob:.3f}")
Section 11 · Summary¶
What You've Learned¶
✅ Data Pipeline:
- Load data from various sources
- Explore and visualize
- Handle missing values
✅ Train/Test Split:
- Why: prevent overfitting, measure generalization
- How: 80/20 or 70/30 split
- Critical rule: never touch test data during training
✅ Feature Scaling:
- Why: gradient descent needs similar scales
- How: standardization (z-score)
- Critical rule: fit on train, transform on test
✅ Model Training:
- Implemented linear regression from scratch
- Understood gradient descent algorithm
- Visualized training dynamics
✅ Evaluation:
- Regression: MSE, RMSE, R²
- Classification: Accuracy, Confusion Matrix
- Residual analysis
✅ Cross-Validation:
- Why: single split is unreliable
- K-fold gives robust estimate
- Pipelines prevent data leakage
The Standard ML Workflow¶
# 1. Load data
X, y = load_data()
# 2. Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# 3. Build pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', SomeModel())
])
# 4. Cross-validate
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Score: {scores.mean():.3f} ± {scores.std():.3f}")
# 5. Train on all training data
pipeline.fit(X_train, y_train)
# 6. Final evaluation on test set
test_score = pipeline.score(X_test, y_test)
print(f"Test Score: {test_score:.3f}")
Next Steps¶
- Try this pipeline on the California housing data
- Experiment with different models (Ridge, Lasso, etc.)
- Add hyperparameter tuning (GridSearchCV)
- Work with your own datasets
🎯 Practice Exercises¶
Exercise 1: Complete Pipeline for California Housing¶
Build a complete pipeline for the California housing dataset:
- Use only 2 features: MedInc and HouseAge
- Implement 10-fold cross-validation
- Compare performance with using all features
Exercise 2: Learning Rate Exploration¶
For the gradient descent implementation:
- Find the optimal learning rate by trying values from 0.0001 to 0.1
- Plot final loss vs learning rate
- What happens at the extremes?
Exercise 3: Feature Importance¶
After training linear regression:
- Extract the learned weights
- Visualize which features have the largest (absolute) weights
- What does this tell you about feature importance?
Exercise 4: Train vs Test Performance¶
Modify the training loop to:
- Track both train AND test loss at each epoch
- Plot both curves on the same graph
- Do they diverge? Why or why not?
Congratulations! 🎉 You've built your first complete ML pipeline from scratch!