Gradients and Optimisation in PyTorch¶
Course: MA2221 · Mahindra University
Reference: Mathematics for Machine Learning, Deisenroth, Faisal & Ong — Ch 5 & 7
Two questions drive this entire lab:
How does a computer compute gradients?
How do we use gradients to find the minimum of a function?
You will answer both — first by doing it by hand in NumPy, then by letting PyTorch's autograd do the work, and finally by watching gradient descent succeed, struggle, and fail on different landscapes.
Structure¶
| Section | Topic |
|---|---|
| 1 | Derivatives and Partial Derivatives — from scratch |
| 2 | Gradients and the Jacobian |
| 3 | PyTorch Autograd — automatic differentiation |
| 4 | Gradient Descent — the algorithm |
| 5 | Learning Rate — the most important hyperparameter |
| 6 | Beyond Classical GD — Momentum and Adam |
Legend¶
- 🧱 Worked — run and read
- ✏️ Your turn — fill in
- 🔬 Experiment — change numbers and observe
- 💬 Discuss — no single right answer
0 · Setup¶
import numpy as np
import matplotlib.pyplot as plt
import torch
torch.manual_seed(0)
np.random.seed(0)
plt.rcParams.update({
'figure.dpi' : 120,
'axes.spines.top' : False,
'axes.spines.right': False,
'font.size' : 12,
})
print(f'PyTorch version: {torch.__version__} ✓')
Section 1 · Derivatives — From the Definition¶
Differentiation of Univariate Functions¶
The derivative of $f$ at $x$ is the slope of the tangent line:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
We approximate this with a finite $h$ — called the finite difference.
The central difference is more accurate — its error shrinks like $O(h^2)$, versus $O(h)$ for the forward difference:
$$f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}$$
1.1 · Worked — Finite Difference vs Exact Derivative¶
# Let's work with f(x) = x^3 - 2x + 1, whose exact derivative is f'(x) = 3x^2 - 2
def f(x): return x**3 - 2*x + 1
def f_prime(x): return 3*x**2 - 2
x0 = 1.5
# Try progressively smaller h values
h_values = [1.0, 0.1, 0.01, 1e-4, 1e-7, 1e-12]
print(f'Exact f\u2019({x0}) = {f_prime(x0):.8f}\n')
print(f'{"h":>12} {"forward diff":>16} {"central diff":>16} {"central error":>14}')
for h in h_values:
fwd = (f(x0 + h) - f(x0)) / h
cen = (f(x0 + h) - f(x0 - h)) / (2*h)
err = abs(cen - f_prime(x0))
print(f'{h:>12.2e} {fwd:>16.8f} {cen:>16.8f} {err:>14.2e}')
# Visualise the tangent line
xs = np.linspace(0, 3, 200)
tangent = f(x0) + f_prime(x0) * (xs - x0)
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(xs, f(xs), 'steelblue', lw=2.5, label='f(x) = x³ - 2x + 1')
ax.plot(xs, tangent, 'crimson', lw=1.8, linestyle='--', label=f"Tangent at x={x0}")
ax.plot(x0, f(x0), 'ko', ms=7)
ax.set_xlabel('x'); ax.set_title('Derivative as slope of tangent line')
ax.legend(); plt.tight_layout(); plt.show()
✏️ 1.2 · Your Turn — Chain Rule¶
The chain rule: if $y = f(g(x))$ then $\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$.
Consider $h(x) = \sin(x^2)$.
- Derive $h'(x)$ analytically using the chain rule.
- Implement both `h` and `h_prime` below.
- Verify with the central difference check.
def h(x): return np.sin(x**2)
def h_prime(x):
# ✏️ Chain rule: d/dx sin(x^2) = cos(x^2) * 2x
return ___ * ___ # fill in: outer derivative * inner derivative
# Verify at multiple points
test_pts = [0.5, 1.0, 1.5, 2.0]
hval = 1e-5
print(f'{"x":>6} {"analytical":>14} {"numerical":>14} {"error":>12}')
for x in test_pts:
anal = h_prime(x)
num = (h(x + hval) - h(x - hval)) / (2 * hval)
print(f'{x:>6.2f} {anal:>14.8f} {num:>14.8f} {abs(anal-num):>12.2e}')
Section 2 · Partial Derivatives and the Gradient¶
Partial Differentiation and Gradients¶
For $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is the vector of all partial derivatives:
$$\nabla_{\mathbf{x}} f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{R}^n$$
The gradient always points in the direction of steepest ascent.
To minimise $f$, we move in the negative gradient direction.
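A quick numerical sanity check of this claim (a sketch, using $f(x_1, x_2) = x_1^2 + 2x_2^2$ as the example): a small step along $-\nabla f$ decreases $f$, and a step along $+\nabla f$ increases it.

```python
import numpy as np

# f(x1, x2) = x1^2 + 2*x2^2 and its gradient
f = lambda x: x[0]**2 + 2*x[1]**2
grad = lambda x: np.array([2*x[0], 4*x[1]])

x = np.array([1.0, 1.0])
u = grad(x) / np.linalg.norm(grad(x))   # unit vector along the gradient
eps = 1e-3

f_down = f(x - eps*u)                   # small step against the gradient
f_up   = f(x + eps*u)                   # small step along the gradient
print(f_down < f(x) < f_up)             # True: -∇f descends, +∇f ascends
```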
2.1 · Worked — Gradient as a Slope Field¶
# f(x1, x2) = x1^2 + 2*x2^2 (elliptic paraboloid)
# Partial derivatives: df/dx1 = 2*x1, df/dx2 = 4*x2
def f2d(x1, x2): return x1**2 + 2*x2**2
def grad_f2d(x): return np.array([2*x[0], 4*x[1]])
# Build grid
g = np.linspace(-3, 3, 200)
X1, X2 = np.meshgrid(g, g)
Z = f2d(X1, X2)
# Coarse grid for arrows
gc = np.linspace(-2.5, 2.5, 11)
G1c, G2c = np.meshgrid(gc, gc)
U, V = 2*G1c, 4*G2c # gradient components
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
ax = axes[0]
cs = ax.contourf(X1, X2, Z, levels=20, cmap='Blues', alpha=0.7)
ax.contour(X1, X2, Z, levels=20, colors='white', linewidths=0.4, alpha=0.5)
plt.colorbar(cs, ax=ax)
ax.quiver(G1c, G2c, U, V, color='crimson', alpha=0.7, scale=120)
ax.set_title('Contours + gradient field ∇f points uphill')
ax.set_xlabel('x₁'); ax.set_ylabel('x₂')
ax2 = axes[1]
ax2.contourf(X1, X2, Z, levels=20, cmap='Blues', alpha=0.7)
ax2.contour(X1, X2, Z, levels=20, colors='white', linewidths=0.4, alpha=0.5)
ax2.quiver(G1c, G2c, -U, -V, color='green', alpha=0.7, scale=120)
ax2.set_title('–∇f points DOWNHILL ← direction of descent')
ax2.set_xlabel('x₁'); ax2.set_ylabel('x₂')
plt.tight_layout(); plt.show()
# Key observation: gradient arrows are PERPENDICULAR to contour lines
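The perpendicularity observation can be verified numerically as well (a sketch): parametrise one contour of $f$ as an ellipse and take the dot product of its tangent vector with the gradient.

```python
import numpy as np

# The level set x1^2 + 2*x2^2 = c is an ellipse:
#   x1 = sqrt(c) cos(t),  x2 = sqrt(c/2) sin(t)
c, t = 4.0, 0.7
x = np.array([np.sqrt(c)*np.cos(t), np.sqrt(c/2)*np.sin(t)])
tangent  = np.array([-np.sqrt(c)*np.sin(t), np.sqrt(c/2)*np.cos(t)])  # dx/dt
gradient = np.array([2*x[0], 4*x[1]])

print(np.dot(tangent, gradient))   # 0 up to rounding: ∇f ⊥ contour line
```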
✏️ 2.2 · Your Turn — Compute Gradients Analytically¶
For each function, derive $\nabla f$ analytically, implement it, then verify with finite differences.
Use the gradient identities:
- $\nabla_{\mathbf{x}}(\mathbf{a}^\top \mathbf{x}) = \mathbf{a}$
- $\nabla_{\mathbf{x}}(\mathbf{x}^\top \mathbf{x}) = 2\mathbf{x}$
- $\nabla_{\mathbf{x}}(\mathbf{x}^\top A \mathbf{x}) = (A + A^\top)\mathbf{x}$
def numerical_gradient(func, x, h=1e-5):
grad = np.zeros_like(x, dtype=float)
for i in range(len(x)):
e = np.zeros_like(x, dtype=float); e[i] = 1.
grad[i] = (func(x + h*e) - func(x - h*e)) / (2*h)
return grad
x_test = np.array([1., -2., 0.5])
a = np.array([3., 1., -2.])
A_sym = np.array([[2., 1., 0.],
[1., 3., 1.],
[0., 1., 2.]], dtype=float) # symmetric
# ── f1(x) = a^T x ────────────────────────────────────────────────────────
f1 = lambda x: a @ x
grad_f1 = lambda x: ___ # fill in
# ── f2(x) = ||x||^2 = x^T x ─────────────────────────────────────────────
f2 = lambda x: x @ x
grad_f2 = lambda x: ___ # fill in
# ── f3(x) = x^T A x (quadratic form, A symmetric) ─────────────────────
f3 = lambda x: x @ A_sym @ x
grad_f3 = lambda x: ___ # fill in (A symmetric -> simplifies)
# ── Verify all ────────────────────────────────────────────────────────────
print(f'{"function":>10} {"analytical":>30} {"error":>10}')
for name, f, g in [('a^Tx', f1, grad_f1), ('||x||^2', f2, grad_f2), ('x^TAx', f3, grad_f3)]:
anal = g(x_test)
num = numerical_gradient(f, x_test)
err = np.abs(anal - num).max()
status = '✓' if err < 1e-6 else '✗'
print(f'{status} {name:>10} anal={np.round(anal,4)} err={err:.2e}')
✏️ 2.3 · Your Turn — Jacobian of a Vector Function¶
When $f: \mathbb{R}^n \to \mathbb{R}^m$, the derivative is the Jacobian $J \in \mathbb{R}^{m \times n}$:
$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$
Each column $j$ of $J$ is: "how does the output change when we nudge $x_j$?"
Compute $J$ numerically for $f(\mathbf{x}) = [x_1^2,\ x_1 x_2,\ \sin(x_2)]$.
def f_vec(x):
return np.array([x[0]**2,
x[0] * x[1],
np.sin(x[1])])
x0 = np.array([2., 1.])
h = 1e-5
m, n = 3, 2
# ✏️ Compute the Jacobian numerically
J_num = np.zeros((m, n))
for j in range(n):
e = np.zeros(n); e[j] = 1.
J_num[:, j] = (f_vec(x0 + h*e) - f_vec(x0 - h*e)) / ___ # fill in denominator
print('Jacobian (numerical):')
print(J_num.round(6))
# ✏️ Now compute analytically:
# J = [ [df1/dx1, df1/dx2],
# [df2/dx1, df2/dx2],
# [df3/dx1, df3/dx2] ]
# = [ [2x1, 0 ],
# [x2, x1 ],
# [0, cos(x2) ] ]
J_anal = np.array([[___, ___ ],
[___, ___ ],
[___, ___ ]]) # fill in at x0=[2,1]
print('\nJacobian (analytical):')
print(J_anal)
print('\nMatch:', np.allclose(J_num, J_anal, atol=1e-5))
Section 3 · Automatic Differentiation with PyTorch¶
Backpropagation and Automatic Differentiation¶
Computing gradients by hand is error-prone and slow for large functions.
Automatic differentiation (autograd) builds a computation graph and applies the chain rule automatically — this is exactly what torch.autograd does.
Key rule: wrap any tensor you want to differentiate with requires_grad=True,
call .backward() on the scalar output, then read the gradient from .grad.
3.1 · Worked — Your First Autograd Computation¶
# Scalar function: f(x) = x^3 - 2x + 1
x = torch.tensor(1.5, requires_grad=True)
# Forward pass — build the computation graph
y = x**3 - 2*x + 1
# Backward pass — compute df/dx via chain rule
y.backward()
print(f'x = {x.item()}')
print(f'f(x) = {y.item():.6f}')
print(f'f\'(x) = {x.grad.item():.6f} <- autograd')
print(f'exact = {3*1.5**2 - 2:.6f} <- 3x^2 - 2 at x=1.5')
3.2 · Worked — Autograd on a Multivariate Function¶
# f(x1, x2) = x1^2 + 2*x2^2
x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(-1.0, requires_grad=True)
f = x1**2 + 2*x2**2
f.backward()
print(f'f(2, -1) = {f.item():.4f}')
print(f'df/dx1 = {x1.grad.item():.4f} (exact: 2*x1 = {2*2.0})')
print(f'df/dx2 = {x2.grad.item():.4f} (exact: 4*x2 = {4*(-1.0)})')
# Important: zero gradients before re-using tensors!
# PyTorch ACCUMULATES gradients by default.
x1.grad.zero_()
x2.grad.zero_()
print('\nGradients zeroed — always do this before a new backward pass in a loop.')
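The accumulation behaviour is easy to see directly (a small sketch): calling `.backward()` a second time without zeroing *adds* the new gradient to the stored one.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

(x**2).backward()
print(x.grad.item())   # 6.0   (d/dx x^2 = 2x at x=3)

(x**2).backward()      # no zero_grad(): the new gradient is ADDED
print(x.grad.item())   # 12.0

x.grad.zero_()
(x**2).backward()
print(x.grad.item())   # 6.0 again
```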
✏️ 3.3 · Your Turn — Autograd on Custom Functions¶
Use PyTorch autograd to compute the gradient of each function below.
Then verify your result against the analytical formula.
# ✏️ Function 1: f(x1, x2) = (x1 - 3)^2 + (x2 + 1)^2
# Minimum is at (3, -1) — verify the gradient is zero there!
x1 = torch.tensor(___, requires_grad=True, dtype=torch.float64) # fill in 3.0
x2 = torch.tensor(___, requires_grad=True, dtype=torch.float64) # fill in -1.0
f1 = (x1 - 3)**2 + (x2 + 1)**2
f1.backward()
print('Function 1 at minimum (3, -1):')
print(f' f = {f1.item():.4f} (should be 0)')
print(f' df/dx1 = {x1.grad.item():.4f} (should be 0)')
print(f' df/dx2 = {x2.grad.item():.4f} (should be 0)')
# ✏️ Function 2: f(x) = ||Ax - b||^2 for vector x
A = torch.tensor([[1., 2.], [3., 4.], [5., 6.]], dtype=torch.float64)
b = torch.tensor([1., 2., 3.], dtype=torch.float64)
x = torch.tensor([0.5, -0.5], requires_grad=True, dtype=torch.float64)
residual = A @ x - b
f2 = residual @ residual # = ||Ax - b||^2
f2.___() # fill in: backward pass
print('\nFunction 2: f = ||Ax - b||^2')
print(f' autograd gradient : {x.grad.detach().numpy().round(4)}')
# Analytical: grad = 2 * A^T (Ax - b)
with torch.no_grad():
grad_anal = 2 * A.T @ (A @ x - b)
print(f' analytical gradient: {grad_anal.numpy().round(4)}')
print(f' match: {torch.allclose(x.grad, grad_anal, atol=1e-8)}')
🔬 3.4 · Experiment — Computation Graph¶
PyTorch builds a dynamic computation graph during the forward pass and traverses it backwards.
The cell below shows how the graph is constructed for a simple expression.
Change the expression and observe how the graph (and gradients) change.
# PyTorch tracks operations on requires_grad tensors
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
# Forward pass (try changing this expression)
c = a * b # c = ab
d = c + a**2 # d = ab + a^2
e = torch.log(d) # e = log(ab + a^2)
e.backward()
print(f'a={a.item()}, b={b.item()}')
print(f'c = a*b = {c.item():.4f}')
print(f'd = c + a^2 = {d.item():.4f}')
print(f'e = log(d) = {e.item():.4f}')
print(f'de/da (autograd) = {a.grad.item():.6f}')
print(f'de/db (autograd) = {b.grad.item():.6f}')
# Manual chain rule:
# e = log(ab + a^2)
# de/da = (b + 2a) / (ab + a^2)
# de/db = a / (ab + a^2)
av, bv = 2.0, 3.0
print(f'\nde/da (manual) = {(bv + 2*av) / (av*bv + av**2):.6f}')
print(f'de/db (manual) = {av / (av*bv + av**2):.6f}')
# 🔬 Try: change the expression for e and recompute
# e.g. e = a**3 * torch.exp(-b) or e = torch.sin(a) + b**2
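One of the suggested variations, worked out as a sketch: for $e = \sin(a) + b^2$ the chain rule gives $\partial e/\partial a = \cos(a)$ and $\partial e/\partial b = 2b$, and autograd agrees.

```python
import math
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
e = torch.sin(a) + b**2
e.backward()

print(a.grad.item(), math.cos(2.0))   # both ≈ -0.4161
print(b.grad.item())                  # 6.0 = 2b
```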
Section 4 · Gradient Descent — The Algorithm¶
Optimization Using Gradient Descent¶
Gradient descent starts from a guess $\mathbf{x}^{(0)}$ and repeatedly steps in the direction of steepest descent:
$$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \alpha \, \nabla f(\mathbf{x}^{(t)})$$
where $\alpha > 0$ is the learning rate.
4.1 · Worked — Manual GD in NumPy¶
# f(x1, x2) = (x1-2)^2 + (x2+1)^2 minimum at (2, -1)
def f_bowl(x): return (x[0]-2)**2 + (x[1]+1)**2
def grad_bowl(x): return np.array([2*(x[0]-2), 2*(x[1]+1)])
def run_gd_numpy(grad_fn, x_init, alpha, n_steps):
x = np.array(x_init, dtype=float)
path = [x.copy()]
for _ in range(n_steps):
x = x - alpha * grad_fn(x)
path.append(x.copy())
return np.array(path)
path_np = run_gd_numpy(grad_bowl, x_init=[-2., 3.], alpha=0.3, n_steps=30)
print('NumPy GD — first 5 steps:')
for i, pt in enumerate(path_np[:6]):
print(f' step {i:2d}: x = {pt.round(4)}, f = {f_bowl(pt):.6f}')
print(f' ...\n final : x = {path_np[-1].round(6)}, f = {f_bowl(path_np[-1]):.2e}')
4.2 · Worked — The Same GD in PyTorch (autograd computes the gradient)¶
def f_bowl_torch(x): return (x[0]-2)**2 + (x[1]+1)**2
def run_gd_torch(f, x_init, alpha, n_steps):
x = torch.tensor(x_init, dtype=torch.float64, requires_grad=True)
path = [x.detach().numpy().copy()]
for _ in range(n_steps):
if x.grad is not None:
x.grad.zero_() # clear accumulated gradient
loss = f(x)
loss.backward() # autograd computes gradient
with torch.no_grad():
x -= alpha * x.grad # update step
path.append(x.detach().numpy().copy())
return np.array(path)
path_pt = run_gd_torch(f_bowl_torch, x_init=[-2., 3.], alpha=0.3, n_steps=30)
print('PyTorch GD — same starting point, same alpha:')
print(f' final: x = {path_pt[-1].round(6)}')
print(f'\nPaths identical: {np.allclose(path_np, path_pt, atol=1e-10)}')
# Visualise the descent path
g = np.linspace(-3, 4, 200)
X1, X2 = np.meshgrid(g, g)
Z = (X1-2)**2 + (X2+1)**2
fig, ax = plt.subplots(figsize=(7, 6))
ax.contourf(X1, X2, Z, levels=25, cmap='Blues', alpha=0.7)
ax.contour( X1, X2, Z, levels=25, colors='white', linewidths=0.4, alpha=0.5)
ax.plot(path_np[:,0], path_np[:,1], 'o-', color='crimson', ms=4, lw=1.5, label='GD path')
ax.plot(path_np[0,0], path_np[0,1], 's', color='crimson', ms=10, label='start')
ax.plot(2, -1, '*', color='gold', ms=15, markeredgecolor='k', label='minimum (2,−1)')
ax.set_title('Gradient Descent on (x₁−2)² + (x₂+1)²')
ax.legend(); plt.tight_layout(); plt.show()
✏️ 4.3 · Your Turn — GD with PyTorch Autograd¶
Use run_gd_torch to minimise the Rosenbrock function (a classic non-trivial landscape):
$$f(x_1, x_2) = (1 - x_1)^2 + 100(x_2 - x_1^2)^2$$
The global minimum is at $(1, 1)$ where $f = 0$.
The valley is very narrow — gradient descent has to work hard.
- Implement `f_rosenbrock` in PyTorch.
- Run `run_gd_torch` from starting point `[-0.5, 0.5]` with `alpha=0.001`, `n_steps=2000`.
- Plot the path on the contour.
def f_rosenbrock(x):
# ✏️ fill in: (1 - x[0])^2 + 100*(x[1] - x[0]^2)^2
return ___ + 100 * ___
path_rb = run_gd_torch(f_rosenbrock, x_init=[-0.5, 0.5], alpha=___, n_steps=___)
print(f'Start : x = {path_rb[0].round(4)}')
print(f'Final : x = {path_rb[-1].round(4)}')
print(f'f(final) = {f_rosenbrock(torch.tensor(path_rb[-1])).item():.6f} (should be close to 0)')
# Contour plot
g = np.linspace(-1.5, 1.5, 300)
X1r, X2r = np.meshgrid(g, g)
Zr = (1 - X1r)**2 + 100*(X2r - X1r**2)**2
fig, ax = plt.subplots(figsize=(7, 6))
ax.contourf(X1r, X2r, np.log1p(Zr), levels=40, cmap='Blues', alpha=0.7)
ax.plot(path_rb[:,0], path_rb[:,1], '-', color='crimson', lw=0.8, alpha=0.8, label='GD path')
ax.plot(path_rb[0,0], path_rb[0,1], 's', color='crimson', ms=9, label='start')
ax.plot(1, 1, '*', color='gold', ms=15, markeredgecolor='k', label='minimum (1,1)')
ax.set_title('Gradient Descent on Rosenbrock function')
ax.legend(); plt.tight_layout(); plt.show()
Section 5 · The Learning Rate — Most Important Hyperparameter¶
Step Size¶
The learning rate $\alpha$ controls how big a step we take along $-\nabla f$.
Too small → converges, but very slowly.
Too large → overshoots, may diverge.
For a convex function with Lipschitz-continuous gradient, the safe range is:
$$\alpha < \frac{2}{L} \quad \text{where } L = \text{Lipschitz constant of } \nabla f$$
For $f(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x}$ with $A$ symmetric, $\nabla f = 2A\mathbf{x}$, so $L = 2\lambda_{\max}(A)$ and the bound becomes $\alpha < 1/\lambda_{\max}(A)$ — equivalently, $\alpha < 2/\lambda_{\max}(H)$ in terms of the Hessian $H = 2A$.
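A one-dimensional sketch of this bound: for $f(x) = 10x^2$ (so $f'(x) = 20x$ and the critical step size is $2/20 = 0.1$), GD contracts for $\alpha$ just below the threshold and blows up just above it.

```python
def run_gd_1d(alpha, n_steps=50):
    # f(x) = 10 x^2, f'(x) = 20 x  ->  update map x <- (1 - 20*alpha) * x
    x = 1.0
    for _ in range(n_steps):
        x -= alpha * 20 * x
    return abs(x)

print(run_gd_1d(0.09))   # tiny: converges  (contraction factor |1 - 1.8| = 0.8)
print(run_gd_1d(0.11))   # huge: diverges   (growth factor   |1 - 2.2| = 1.2)
```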
5.1 · Worked — Racing Different Learning Rates¶
# Ill-conditioned function: one axis is 20× steeper than the other
# f(x) = x1^2 + 10*x2^2 — Hessian eigenvalues are 2 and 20
def f_ill(x): return (x[0])**2 + 10*(x[1])**2
def grad_ill(x): return np.array([2*x[0], 20*x[1]])
# Critical step size: GD on a quadratic converges iff alpha < 2 / lambda_max(Hessian)
lambda_max = 20                     # largest Hessian eigenvalue
alpha_crit = 2 / lambda_max         # = 0.1
print(f'Hessian eigenvalues: [2, 20]')
print(f'GD converges for alpha < {alpha_crit:.3f}')
alphas = [0.01, 0.04, 0.08, 0.12]   # 0.12 exceeds the critical value -> diverges
x0 = np.array([3., 2.])
n_steps = 80
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
g = np.linspace(-3.5, 3.5, 200)
X1p, X2p = np.meshgrid(g, g)
Zp = X1p**2 + 10*X2p**2
colors = ['steelblue', 'green', 'darkorange', 'crimson']
for alpha, col in zip(alphas, colors):
path = [x0.copy()]
x = x0.copy()
for _ in range(n_steps):
x = x - alpha * grad_ill(x)
path.append(x.copy())
path = np.array(path)
losses = [f_ill(p) for p in path]
axes[0].contourf(X1p, X2p, Zp, levels=20, cmap='Blues', alpha=0.15)
axes[0].plot(path[:,0], path[:,1], '-o', color=col, ms=2, lw=1.2, label=f'α={alpha}')
axes[1].plot(losses, color=col, lw=2, label=f'α={alpha}')
axes[0].set_xlim([-3.5,3.5]); axes[0].set_ylim([-2.5,2.5])
axes[0].set_title('Descent paths'); axes[0].legend(fontsize=9)
axes[1].set_xlabel('Iteration'); axes[1].set_ylabel('f(x)')
axes[1].set_title('Loss curves'); axes[1].legend(fontsize=9)
axes[1].set_ylim([0, 150])
plt.tight_layout(); plt.show()
✏️ 5.2 · Your Turn — Find the Critical Learning Rate¶
For the function $f(x_1, x_2) = 3x_1^2 + 0.5x_2^2$:
- What are the Hessian eigenvalues?
- What is the theoretical maximum safe learning rate?
- Verify experimentally — find the smallest $\alpha$ that causes divergence.
def f_new(x): return 3*x[0]**2 + 0.5*x[1]**2
def grad_new(x): return np.array([6*x[0], x[1]])
# ✏️ Step 1: Hessian eigenvalues
# H = diag(6, 1) -> eigenvalues = ___, ___
lambda_max_new = ___ # fill in
alpha_theory = ___ # fill in: 2 / lambda_max_new
print(f'Hessian eigenvalues: [6, 1]')
print(f'Max safe alpha (theory): {alpha_theory:.4f}')
# ✏️ Step 2: Experiment — try alphas around the critical value
alphas_test = [0.05, 0.10, 0.15, 0.20, 0.35] # adjust based on your theory answer
x0 = np.array([2., 3.])
fig, ax = plt.subplots(figsize=(8, 4))
for alpha_i in alphas_test:
x = x0.copy()
losses = []
for _ in range(60):
x = x - alpha_i * grad_new(x)
losses.append(f_new(x))
losses = np.clip(losses, 0, 500) # clip for visibility
ax.plot(losses, label=f'α={alpha_i}')
ax.axhline(0, color='k', lw=0.8, linestyle='--')
ax.set_xlabel('Iteration'); ax.set_ylabel('f(x)')
ax.set_title('Finding the critical learning rate')
ax.legend(); plt.tight_layout(); plt.show()
# 💬 Discussion: Where exactly does it start to diverge? Does it match the theory?
Section 6 · Beyond Classical Gradient Descent¶
Gradient Descent with Momentum; Adam¶
Vanilla gradient descent has two big weaknesses:
- Slow in ravines — it zigzags on ill-conditioned landscapes
- Same step size for every parameter — some dimensions need bigger steps
Two widely-used fixes:
Momentum adds a velocity term to smooth out the zigzag: $$\mathbf{v}^{(t+1)} = \beta \mathbf{v}^{(t)} - \alpha \nabla f(\mathbf{x}^{(t)}), \qquad \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \mathbf{v}^{(t+1)}$$
Adam (Adaptive Moment Estimation) keeps per-parameter running estimates of the gradient and its square, and adapts the step size for each coordinate individually.
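To make that description concrete, here is a minimal from-scratch sketch of the Adam update (the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ used below are the standard choices from the original paper), run on the ill-conditioned quadratic from Section 5:

```python
import numpy as np

def adam(grad_fn, x_init, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, n_steps=300):
    """Minimal Adam: running moment estimates with bias correction."""
    x = np.array(x_init, dtype=float)
    m = np.zeros_like(x)                 # running mean of gradients
    v = np.zeros_like(x)                 # running mean of squared gradients
    for t in range(1, n_steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat = m / (1 - b1**t)          # bias correction (moments start at 0)
        v_hat = v / (1 - b2**t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)   # per-coordinate step size
    return x

grad_ill = lambda x: np.array([2*x[0], 20*x[1]])   # grad of x1^2 + 10*x2^2
x_final = adam(grad_ill, [3.0, 2.0])
print(x_final)   # approaches the minimum at (0, 0)
```

Note how the division by $\sqrt{\hat{v}}$ gives each coordinate its own effective step size, which is exactly what the ill-conditioned landscape needs.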
6.1 · Worked — Implementing Momentum from Scratch¶
def run_momentum(grad_fn, x_init, alpha, beta, n_steps):
"""Gradient descent with momentum. beta=0 recovers vanilla GD."""
x = np.array(x_init, dtype=float)
v = np.zeros_like(x) # velocity initialised at 0
path = [x.copy()]
for _ in range(n_steps):
v = beta * v - alpha * grad_fn(x) # update velocity
x = x + v # update position
path.append(x.copy())
return np.array(path)
# Compare on the ill-conditioned landscape
x0 = np.array([3., 2.])
n_steps = 80
path_gd = run_gd_numpy(grad_ill, x0, alpha=0.04, n_steps=n_steps)
path_mom = run_momentum(grad_ill, x0, alpha=0.04, beta=0.85, n_steps=n_steps)
g = np.linspace(-3.5, 3.5, 200)
X1p, X2p = np.meshgrid(g, g)
Zp = X1p**2 + 10*X2p**2
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax in axes:
ax.contourf(X1p, X2p, Zp, levels=20, cmap='Blues', alpha=0.2)
axes[0].plot(path_gd[:,0], path_gd[:,1], '-o', color='crimson', ms=2, lw=1.2, label='Vanilla GD')
axes[0].plot(path_mom[:,0], path_mom[:,1], '-o', color='steelblue', ms=2, lw=1.2, label='Momentum')
axes[0].set_xlim([-3.5,3.5]); axes[0].set_ylim([-2.5,2.5])
axes[0].set_title('Descent paths (same α=0.04)'); axes[0].legend()
losses_gd = [f_ill(p) for p in path_gd]
losses_mom = [f_ill(p) for p in path_mom]
axes[1].plot(losses_gd, color='crimson', lw=2, label='Vanilla GD')
axes[1].plot(losses_mom, color='steelblue', lw=2, label='Momentum β=0.85')
axes[1].set_xlabel('Iteration'); axes[1].set_ylabel('f(x)')
axes[1].set_title('Loss curves'); axes[1].legend()
plt.tight_layout(); plt.show()
✏️ 6.2 · Your Turn — Adam Optimizer in PyTorch¶
PyTorch provides Adam out of the box via torch.optim.Adam.
The workflow is:
optimizer = torch.optim.Adam([x], lr=alpha)
optimizer.zero_grad() # clear old gradients
loss = f(x)
loss.backward() # compute new gradients
optimizer.step() # update x
Use this to minimise the Rosenbrock function from Section 4.3.
Compare Adam vs vanilla GD (Section 4.3 result) — how many steps does each need?
# ✏️ Minimise Rosenbrock with Adam
x_adam = torch.tensor([-0.5, 0.5], dtype=torch.float64, requires_grad=True)
optimizer = torch.optim.Adam([x_adam], lr=___) # fill in a learning rate (try 0.01)
path_adam = [x_adam.detach().numpy().copy()]
loss_adam = []
for step in range(___):
optimizer.___() # fill in: zero_grad
loss = f_rosenbrock(x_adam)
loss.___() # fill in: backward
optimizer.___() # fill in: step
path_adam.append(x_adam.detach().numpy().copy())
loss_adam.append(loss.item())
path_adam = np.array(path_adam)
print(f'Adam final x = {path_adam[-1].round(5)}')
print(f'Adam final f = {loss_adam[-1]:.6f}')
# Plot loss curve
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(loss_adam, color='steelblue', lw=2, label='Adam')
ax.set_xlabel('Iteration'); ax.set_ylabel('f(x)')
ax.set_yscale('log')
ax.set_title('Adam on Rosenbrock (log scale)')
ax.legend(); plt.tight_layout(); plt.show()
# 💬 Compare with the vanilla GD path from Section 4.3:
# How many steps did GD need vs Adam to reach f < 0.01?
🔬 6.3 · Experiment — Momentum Beta¶
The momentum parameter $\beta \in [0, 1)$ controls how much of the previous velocity is kept.
- $\beta = 0$: vanilla gradient descent
- $\beta \to 1$: very persistent velocity (can overshoot!)
Run the cell below. Try different values of beta and observe the trade-off.
# 🔬 Change beta and re-run
betas = [0.0, 0.5, 0.85, 0.95]
x0 = np.array([3., 2.])
n_steps = 100
alpha = 0.04
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
colors = ['crimson', 'darkorange', 'steelblue', 'green']
for beta, col in zip(betas, colors):
path = run_momentum(grad_ill, x0, alpha=alpha, beta=beta, n_steps=n_steps)
losses = [f_ill(p) for p in path]
axes[0].plot(path[:,0], path[:,1], '-', color=col, lw=1.2, alpha=0.8,
label=f'β={beta}')
axes[1].plot(losses, color=col, lw=2, label=f'β={beta}')
for ax in axes:
ax.legend(fontsize=9)
axes[0].contour(X1p, X2p, Zp, levels=15, colors='grey', linewidths=0.4, alpha=0.5)
axes[0].set_xlim([-4,4]); axes[0].set_ylim([-3,3])
axes[0].set_title('Paths for different β values')
axes[1].set_xlabel('Iteration'); axes[1].set_ylabel('f(x)')
axes[1].set_title('Loss curves')
axes[1].set_ylim([0, 150])
plt.tight_layout(); plt.show()
# 💬 Discuss:
# What is the sweet spot for beta on this landscape?
# What happens when beta is too high (e.g. 0.99)?
🏁 Summary¶
| Section | What you built | mml-book reference |
|---|---|---|
| 1 | Finite differences, chain rule by hand | §5.1 |
| 2 | Gradient field, analytical vs numerical $\nabla f$, Jacobian | §5.2–5.3, §5.5 |
| 3 | PyTorch autograd — computation graph, .backward() | §5.6 |
| 4 | Gradient descent in NumPy and PyTorch — same result | §7.1 |
| 5 | Learning rate — safe bound, ill-conditioning, zigzag | §7.1 |
| 6 | Momentum from scratch; Adam via torch.optim | §7.1 |
The key insight connecting all six sections:
The gradient tells us the direction.
The learning rate tells us how far to step.
The optimizer decides how to use that information — vanilla, momentum, or adaptive.
MA2221 — Foundational Mathematics for Machine Learning · Mahindra University
Lab Notebook 7 · © Biswarup Biswas