Gradients and Optimisation in PyTorch¶
Course: MA2221 · Mahindra University
Reference: Mathematics for Machine Learning, Deisenroth, Faisal & Ong — Ch 5 & 7
Two questions drive this entire lab:
How does a computer compute gradients?
How do we use gradients to find the minimum of a function?
You will answer both — first by doing it by hand in NumPy, then by letting PyTorch's autograd do the work, and finally by watching gradient descent succeed, struggle, and fail on different landscapes.
Structure¶
| Section | Topic |
|---|---|
| 1 | Derivatives and Partial Derivatives — from scratch |
| 2 | Gradients and the Jacobian |
| 3 | PyTorch Autograd — automatic differentiation |
| 4 | Gradient Descent — the algorithm |
| 5 | Learning Rate — the most important hyperparameter |
| 6 | Beyond Classical GD — Momentum and Adam |
Legend¶
- 🧱 Worked — run and read
- ✏️ Your turn — fill in
- 🔬 Experiment — change numbers and observe
- 💬 Discuss — no single right answer
0 · Setup¶
import numpy as np
import matplotlib.pyplot as plt
import torch
torch.manual_seed(0)
np.random.seed(0)
plt.rcParams.update({
'figure.dpi' : 120,
'axes.spines.top' : False,
'axes.spines.right': False,
'font.size' : 12,
})
print(f'PyTorch version: {torch.__version__} ✓')
Section 1 · Derivatives — From the Definition¶
Differentiation of Univariate Functions¶
The derivative of $f$ at $x$ is the slope of the tangent line:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
We approximate this with a finite $h$ — called the finite difference.
The central difference is more accurate — its error shrinks like $O(h^2)$, versus $O(h)$ for the forward difference:
$$f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}$$
1.1 · Worked — Finite Difference vs Exact Derivative¶
# Let's work with f(x) = x^3 - 2x + 1, whose exact derivative is f'(x) = 3x^2 - 2
def f(x): return x**3 - 2*x + 1
def f_prime(x): return 3*x**2 - 2
x0 = 1.5
# Try progressively smaller h values
h_values = [1.0, 0.1, 0.01, 1e-4, 1e-7, 1e-12]
print(f'Exact f\u2019({x0}) = {f_prime(x0):.8f}\n')
print(f'{"h":>12} {"forward diff":>16} {"central diff":>16} {"central error":>14}')
for h in h_values:
fwd = (f(x0 + h) - f(x0)) / h
cen = (f(x0 + h) - f(x0 - h)) / (2*h)
err = abs(cen - f_prime(x0))
print(f'{h:>12.2e} {fwd:>16.8f} {cen:>16.8f} {err:>14.2e}')
# Visualise the tangent line
xs = np.linspace(0, 3, 200)
tangent = f(x0) + f_prime(x0) * (xs - x0)
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(xs, f(xs), 'steelblue', lw=2.5, label='f(x) = x³ - 2x + 1')
ax.plot(xs, tangent, 'crimson', lw=1.8, linestyle='--', label=f"Tangent at x={x0}")
ax.plot(x0, f(x0), 'ko', ms=7)
ax.set_xlabel('x'); ax.set_title('Derivative as slope of tangent line')
ax.legend(); plt.tight_layout(); plt.show()
✏️ 1.2 · Your Turn — Chain Rule¶
The chain rule: if $y = f(g(x))$ then $\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$.
Consider $h(x) = \sin(x^2)$.
- Derive $h'(x)$ analytically using the chain rule.
- Implement both `h` and `h_prime` below.
- Verify with the central difference check.
def h(x): return np.sin(x**2)
def h_prime(x):
# ✏️ Chain rule: d/dx sin(x^2) = cos(x^2) * 2x
return ___ * ___ # fill in: outer derivative * inner derivative
# Verify at multiple points
test_pts = [0.5, 1.0, 1.5, 2.0]
hval = 1e-5
print(f'{"x":>6} {"analytical":>14} {"numerical":>14} {"error":>12}')
for x in test_pts:
anal = h_prime(x)
num = (h(x + hval) - h(x - hval)) / (2 * hval)
print(f'{x:>6.2f} {anal:>14.8f} {num:>14.8f} {abs(anal-num):>12.2e}')
Section 2 · Partial Derivatives and the Gradient¶
Partial Differentiation and Gradients¶
For $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is the vector of all partial derivatives:
$$\nabla_{\mathbf{x}} f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{R}^n$$
The gradient always points in the direction of steepest ascent.
To minimise $f$, we move in the negative gradient direction.
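A quick numerical sanity check of this claim (a sketch, using $f(x_1, x_2) = x_1^2 + 2x_2^2$ as the example): a small step along $-\nabla f$ decreases $f$, and a step along $+\nabla f$ increases it.

```python
import numpy as np

# f(x1, x2) = x1^2 + 2*x2^2 and its gradient
f = lambda x: x[0]**2 + 2*x[1]**2
grad = lambda x: np.array([2*x[0], 4*x[1]])

x = np.array([1.0, 1.0])
u = grad(x) / np.linalg.norm(grad(x))   # unit vector along the gradient
eps = 1e-3

f_down = f(x - eps*u)                   # small step against the gradient
f_up   = f(x + eps*u)                   # small step along the gradient
print(f_down < f(x) < f_up)             # True: -∇f descends, +∇f ascends
```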
2.1 · Worked — Gradient as a Slope Field¶
# f(x1, x2) = x1^2 + 2*x2^2 (elliptic paraboloid)
# Partial derivatives: df/dx1 = 2*x1, df/dx2 = 4*x2
def f2d(x1, x2): return x1**2 + 2*x2**2
def grad_f2d(x): return np.array([2*x[0], 4*x[1]])
# Build grid
g = np.linspace(-3, 3, 200)
X1, X2 = np.meshgrid(g, g)
Z = f2d(X1, X2)
# Coarse grid for arrows
gc = np.linspace(-2.5, 2.5, 11)
G1c, G2c = np.meshgrid(gc, gc)
U, V = 2*G1c, 4*G2c # gradient components
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
ax = axes[0]
cs = ax.contourf(X1, X2, Z, levels=20, cmap='Blues', alpha=0.7)
ax.contour(X1, X2, Z, levels=20, colors='white', linewidths=0.4, alpha=0.5)
plt.colorbar(cs, ax=ax)
ax.quiver(G1c, G2c, U, V, color='crimson', alpha=0.7, scale=120)
ax.set_title('Contours + gradient field ∇f points uphill')
ax.set_xlabel('x₁'); ax.set_ylabel('x₂')
ax2 = axes[1]
ax2.contourf(X1, X2, Z, levels=20, cmap='Blues', alpha=0.7)
ax2.contour(X1, X2, Z, levels=20, colors='white', linewidths=0.4, alpha=0.5)
ax2.quiver(G1c, G2c, -U, -V, color='green', alpha=0.7, scale=120)
ax2.set_title('–∇f points DOWNHILL ← direction of descent')
ax2.set_xlabel('x₁'); ax2.set_ylabel('x₂')
plt.tight_layout(); plt.show()
# Key observation: gradient arrows are PERPENDICULAR to contour lines
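The perpendicularity observation can be verified numerically as well (a sketch): parametrise one contour of $f$ as an ellipse and take the dot product of its tangent vector with the gradient.

```python
import numpy as np

# The level set x1^2 + 2*x2^2 = c is an ellipse:
#   x1 = sqrt(c) cos(t),  x2 = sqrt(c/2) sin(t)
c, t = 4.0, 0.7
x = np.array([np.sqrt(c)*np.cos(t), np.sqrt(c/2)*np.sin(t)])
tangent  = np.array([-np.sqrt(c)*np.sin(t), np.sqrt(c/2)*np.cos(t)])  # dx/dt
gradient = np.array([2*x[0], 4*x[1]])

print(np.dot(tangent, gradient))   # 0 up to rounding: ∇f ⊥ contour line
```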
✏️ 2.2 · Your Turn — Compute Gradients Analytically¶
For each function, derive $\nabla f$ analytically, implement it, then verify with finite differences.
Use the gradient identities:
- $\nabla_{\mathbf{x}}(\mathbf{a}^\top \mathbf{x}) = \mathbf{a}$
- $\nabla_{\mathbf{x}}(\mathbf{x}^\top \mathbf{x}) = 2\mathbf{x}$
- $\nabla_{\mathbf{x}}(\mathbf{x}^\top A \mathbf{x}) = (A + A^\top)\mathbf{x}$
def numerical_gradient(func, x, h=1e-5):
grad = np.zeros_like(x, dtype=float)
for i in range(len(x)):
e = np.zeros_like(x, dtype=float); e[i] = 1.
grad[i] = (func(x + h*e) - func(x - h*e)) / (2*h)
return grad
x_test = np.array([1., -2., 0.5])
a = np.array([3., 1., -2.])
A_sym = np.array([[2., 1., 0.],
[1., 3., 1.],
[0., 1., 2.]], dtype=float) # symmetric
# ── f1(x) = a^T x ────────────────────────────────────────────────────────
f1 = lambda x: a @ x
grad_f1 = lambda x: ___ # fill in
# ── f2(x) = ||x||^2 = x^T x ─────────────────────────────────────────────
f2 = lambda x: x @ x
grad_f2 = lambda x: ___ # fill in
# ── f3(x) = x^T A x (quadratic form, A symmetric) ─────────────────────
f3 = lambda x: x @ A_sym @ x
grad_f3 = lambda x: ___ # fill in (A symmetric -> simplifies)
# ── Verify all ────────────────────────────────────────────────────────────
print(f'{"function":>10} {"analytical":>30} {"error":>10}')
for name, f, g in [('a^Tx', f1, grad_f1), ('||x||^2', f2, grad_f2), ('x^TAx', f3, grad_f3)]:
anal = g(x_test)
num = numerical_gradient(f, x_test)
err = np.abs(anal - num).max()
status = '✓' if err < 1e-6 else '✗'
print(f'{status} {name:>10} anal={np.round(anal,4)} err={err:.2e}')
✏️ 2.3 · Your Turn — Jacobian of a Vector Function¶
When $f: \mathbb{R}^n \to \mathbb{R}^m$, the derivative is the Jacobian $J \in \mathbb{R}^{m \times n}$:
$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$
Each column $j$ of $J$ is: "how does the output change when we nudge $x_j$?"
Compute $J$ numerically for $f(\mathbf{x}) = [x_1^2,\ x_1 x_2,\ \sin(x_2)]$.
def f_vec(x):
return np.array([x[0]**2,
x[0] * x[1],
np.sin(x[1])])
x0 = np.array([2., 1.])
h = 1e-5
m, n = 3, 2
# ✏️ Compute the Jacobian numerically
J_num = np.zeros((m, n))
for j in range(n):
e = np.zeros(n); e[j] = 1.
J_num[:, j] = (f_vec(x0 + h*e) - f_vec(x0 - h*e)) / ___ # fill in denominator
print('Jacobian (numerical):')
print(J_num.round(6))
# ✏️ Now compute analytically:
# J = [ [df1/dx1, df1/dx2],
# [df2/dx1, df2/dx2],
# [df3/dx1, df3/dx2] ]
# = [ [2x1, 0 ],
# [x2, x1 ],
# [0, cos(x2) ] ]
J_anal = np.array([[___, ___ ],
[___, ___ ],
[___, ___ ]]) # fill in at x0=[2,1]
print('\nJacobian (analytical):')
print(J_anal)
print('\nMatch:', np.allclose(J_num, J_anal, atol=1e-5))
Section 3 · Automatic Differentiation with PyTorch¶
Backpropagation and Automatic Differentiation¶
Computing gradients by hand is error-prone and slow for large functions.
Automatic differentiation (autograd) builds a computation graph and applies the chain rule automatically — this is exactly what torch.autograd does.
Key rule: wrap any tensor you want to differentiate with requires_grad=True,
call .backward() on the scalar output, then read the gradient from .grad.
3.1 · Worked — Your First Autograd Computation¶
# Scalar function: f(x) = x^3 - 2x + 1
x = torch.tensor(1.5, requires_grad=True)
# Forward pass — build the computation graph
y = x**3 - 2*x + 1
# Backward pass — compute df/dx via chain rule
y.backward()
print(f'x = {x.item()}')
print(f'f(x) = {y.item():.6f}')
print(f'f\'(x) = {x.grad.item():.6f} <- autograd')
print(f'exact = {3*1.5**2 - 2:.6f} <- 3x^2 - 2 at x=1.5')
3.2 · Worked — Autograd on a Multivariate Function¶
# f(x1, x2) = x1^2 + 2*x2^2
x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(-1.0, requires_grad=True)
f = x1**2 + 2*x2**2
f.backward()
print(f'f(2, -1) = {f.item():.4f}')
print(f'df/dx1 = {x1.grad.item():.4f} (exact: 2*x1 = {2*2.0})')
print(f'df/dx2 = {x2.grad.item():.4f} (exact: 4*x2 = {4*(-1.0)})')
# Important: zero gradients before re-using tensors!
# PyTorch ACCUMULATES gradients by default.
x1.grad.zero_()
x2.grad.zero_()
print('\nGradients zeroed — always do this before a new backward pass in a loop.')
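The accumulation behaviour is easy to see directly (a small sketch): calling `.backward()` a second time without zeroing *adds* the new gradient to the stored one.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

(x**2).backward()
print(x.grad.item())   # 6.0   (d/dx x^2 = 2x at x=3)

(x**2).backward()      # no zero_grad(): the new gradient is ADDED
print(x.grad.item())   # 12.0

x.grad.zero_()
(x**2).backward()
print(x.grad.item())   # 6.0 again
```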
✏️ 3.3 · Your Turn — Autograd on Custom Functions¶
Use PyTorch autograd to compute the gradient of each function below.
Then verify your result against the analytical formula.
# ✏️ Function 1: f(x1, x2) = (x1 - 3)^2 + (x2 + 1)^2
# Minimum is at (3, -1) — verify the gradient is zero there!
x1 = torch.tensor(___, requires_grad=True, dtype=torch.float64) # fill in 3.0
x2 = torch.tensor(___, requires_grad=True, dtype=torch.float64) # fill in -1.0
f1 = (x1 - 3)**2 + (x2 + 1)**2
f1.backward()
print('Function 1 at minimum (3, -1):')
print(f' f = {f1.item():.4f} (should be 0)')
print(f' df/dx1 = {x1.grad.item():.4f} (should be 0)')
print(f' df/dx2 = {x2.grad.item():.4f} (should be 0)')
# ✏️ Function 2: f(x) = ||Ax - b||^2 for vector x
A = torch.tensor([[1., 2.], [3., 4.], [5., 6.]], dtype=torch.float64)
b = torch.tensor([1., 2., 3.], dtype=torch.float64)
x = torch.tensor([0.5, -0.5], requires_grad=True, dtype=torch.float64)
residual = A @ x - b
f2 = residual @ residual # = ||Ax - b||^2
f2.___() # fill in: backward pass
print('\nFunction 2: f = ||Ax - b||^2')
print(f' autograd gradient : {x.grad.detach().numpy().round(4)}')
# Analytical: grad = 2 * A^T (Ax - b)
with torch.no_grad():
grad_anal = 2 * A.T @ (A @ x - b)
print(f' analytical gradient: {grad_anal.numpy().round(4)}')
print(f' match: {torch.allclose(x.grad, grad_anal, atol=1e-8)}')
🔬 3.4 · Experiment — Computation Graph¶
PyTorch builds a dynamic computation graph during the forward pass and traverses it backwards.
The cell below shows how the graph is constructed for a simple expression.
Change the expression and observe how the graph (and gradients) change.
# PyTorch tracks operations on requires_grad tensors
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
# Forward pass (try changing this expression)
c = a * b # c = ab
d = c + a**2 # d = ab + a^2
e = torch.log(d) # e = log(ab + a^2)
e.backward()
print(f'a={a.item()}, b={b.item()}')
print(f'c = a*b = {c.item():.4f}')
print(f'd = c + a^2 = {d.item():.4f}')
print(f'e = log(d) = {e.item():.4f}')
print(f'de/da (autograd) = {a.grad.item():.6f}')
print(f'de/db (autograd) = {b.grad.item():.6f}')
# Manual chain rule:
# e = log(ab + a^2)
# de/da = (b + 2a) / (ab + a^2)
# de/db = a / (ab + a^2)
av, bv = 2.0, 3.0
print(f'\nde/da (manual) = {(bv + 2*av) / (av*bv + av**2):.6f}')
print(f'de/db (manual) = {av / (av*bv + av**2):.6f}')
# 🔬 Try: change the expression for e and recompute
# e.g. e = a**3 * torch.exp(-b) or e = torch.sin(a) + b**2
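One of the suggested variations, worked out as a sketch: for $e = \sin(a) + b^2$ the chain rule gives $\partial e/\partial a = \cos(a)$ and $\partial e/\partial b = 2b$, and autograd agrees.

```python
import math
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
e = torch.sin(a) + b**2
e.backward()

print(a.grad.item(), math.cos(2.0))   # both ≈ -0.4161
print(b.grad.item())                  # 6.0 = 2b
```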
Section 4 · Gradient Descent — The Algorithm¶
Optimization Using Gradient Descent¶
Gradient descent starts from a guess $\mathbf{x}^{(0)}$ and repeatedly steps in the direction of steepest descent:
$$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \alpha \, \nabla f(\mathbf{x}^{(t)})$$
where $\alpha > 0$ is the learning rate.
4.1 · Worked — Manual GD in NumPy¶
# f(x1, x2) = (x1-2)^2 + (x2+1)^2 minimum at (2, -1)
def f_bowl(x): return (x[0]-2)**2 + (x[1]+1)**2
def grad_bowl(x): return np.array([2*(x[0]-2), 2*(x[1]+1)])
def run_gd_numpy(grad_fn, x_init, alpha, n_steps):
x = np.array(x_init, dtype=float)
path = [x.copy()]
for _ in range(n_steps):
x = x - alpha * grad_fn(x)
path.append(x.copy())
return np.array(path)
path_np = run_gd_numpy(grad_bowl, x_init=[-2., 3.], alpha=0.3, n_steps=30)
print('NumPy GD — first 5 steps:')
for i, pt in enumerate(path_np[:6]):
print(f' step {i:2d}: x = {pt.round(4)}, f = {f_bowl(pt):.6f}')
print(f' ...\n final : x = {path_np[-1].round(6)}, f = {f_bowl(path_np[-1]):.2e}')
4.2 · Worked — The Same GD in PyTorch (autograd computes the gradient)¶
def f_bowl_torch(x): return (x[0]-2)**2 + (x[1]+1)**2
def run_gd_torch(f, x_init, alpha, n_steps):
x = torch.tensor(x_init, dtype=torch.float64, requires_grad=True)
path = [x.detach().numpy().copy()]
for _ in range(n_steps):
if x.grad is not None:
x.grad.zero_() # clear accumulated gradient
loss = f(x)
loss.backward() # autograd computes gradient
with torch.no_grad():
x -= alpha * x.grad # update step
path.append(x.detach().numpy().copy())
return np.array(path)
path_pt = run_gd_torch(f_bowl_torch, x_init=[-2., 3.], alpha=0.3, n_steps=30)
print('PyTorch GD — same starting point, same alpha:')
print(f' final: x = {path_pt[-1].round(6)}')
print(f'\nPaths identical: {np.allclose(path_np, path_pt, atol=1e-10)}')
# Visualise the descent path
g = np.linspace(-3, 4, 200)
X1, X2 = np.meshgrid(g, g)
Z = (X1-2)**2 + (X2+1)**2
fig, ax = plt.subplots(figsize=(7, 6))
ax.contourf(X1, X2, Z, levels=25, cmap='Blues', alpha=0.7)
ax.contour( X1, X2, Z, levels=25, colors='white', linewidths=0.4, alpha=0.5)
ax.plot(path_np[:,0], path_np[:,1], 'o-', color='crimson', ms=4, lw=1.5, label='GD path')
ax.plot(path_np[0,0], path_np[0,1], 's', color='crimson', ms=10, label='start')
ax.plot(2, -1, '*', color='gold', ms=15, markeredgecolor='k', label='minimum (2,−1)')
ax.set_title('Gradient Descent on (x₁−2)² + (x₂+1)²')
ax.legend(); plt.tight_layout(); plt.show()
✏️ 4.3 · Your Turn — GD with PyTorch Autograd¶
Use run_gd_torch to minimise the Rosenbrock function (a classic non-trivial landscape):
$$f(x_1, x_2) = (1 - x_1)^2 + 100(x_2 - x_1^2)^2$$
The global minimum is at $(1, 1)$ where $f = 0$.
The valley is very narrow — gradient descent has to work hard.
- Implement `f_rosenbrock` in PyTorch.
- Run `run_gd_torch` from starting point `[-0.5, 0.5]` with `alpha=0.001`, `n_steps=2000`.
- Plot the path on the contour.
def f_rosenbrock(x):
# ✏️ fill in: (1 - x[0])^2 + 100*(x[1] - x[0]^2)^2
return ___ + 100 * ___
path_rb = run_gd_torch(f_rosenbrock, x_init=[-0.5, 0.5], alpha=___, n_steps=___)
print(f'Start : x = {path_rb[0].round(4)}')
print(f'Final : x = {path_rb[-1].round(4)}')
print(f'f(final) = {f_rosenbrock(torch.tensor(path_rb[-1])).item():.6f} (should be close to 0)')
# Contour plot
g = np.linspace(-1.5, 1.5, 300)
X1r, X2r = np.meshgrid(g, g)
Zr = (1 - X1r)**2 + 100*(X2r - X1r**2)**2
fig, ax = plt.subplots(figsize=(7, 6))
ax.contourf(X1r, X2r, np.log1p(Zr), levels=40, cmap='Blues', alpha=0.7)
ax.plot(path_rb[:,0], path_rb[:,1], '-', color='crimson', lw=0.8, alpha=0.8, label='GD path')
ax.plot(path_rb[0,0], path_rb[0,1], 's', color='crimson', ms=9, label='start')
ax.plot(1, 1, '*', color='gold', ms=15, markeredgecolor='k', label='minimum (1,1)')
ax.set_title('Gradient Descent on Rosenbrock function')
ax.legend(); plt.tight_layout(); plt.show()
Section 5 · The Learning Rate — Most Important Hyperparameter¶
Step Size¶
The learning rate $\alpha$ controls how big a step we take along $-\nabla f$.
Too small → converges, but very slowly.
Too large → overshoots, may diverge.
For a convex function with Lipschitz-continuous gradient, the safe range is:
$$\alpha < \frac{2}{L} \quad \text{where } L = \text{Lipschitz constant of } \nabla f$$
For $f(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x}$ with $A$ symmetric, $\nabla f = 2A\mathbf{x}$, so $L = 2\lambda_{\max}(A)$ and the bound becomes $\alpha < 1/\lambda_{\max}(A)$ — equivalently, $\alpha < 2/\lambda_{\max}(H)$ in terms of the Hessian $H = 2A$.
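A one-dimensional sketch of this bound: for $f(x) = 10x^2$ (so $f'(x) = 20x$ and the critical step size is $2/20 = 0.1$), GD contracts for $\alpha$ just below the threshold and blows up just above it.

```python
def run_gd_1d(alpha, n_steps=50):
    # f(x) = 10 x^2, f'(x) = 20 x  ->  update map x <- (1 - 20*alpha) * x
    x = 1.0
    for _ in range(n_steps):
        x -= alpha * 20 * x
    return abs(x)

print(run_gd_1d(0.09))   # tiny: converges  (contraction factor |1 - 1.8| = 0.8)
print(run_gd_1d(0.11))   # huge: diverges   (growth factor   |1 - 2.2| = 1.2)
```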
5.1 · Worked — Racing Different Learning Rates¶
# Ill-conditioned function: one axis is 20× steeper than the other
# f(x) = x1^2 + 10*x2^2 — Hessian eigenvalues are 2 and 20
def f_ill(x): return (x[0])**2 + 10*(x[1])**2
def grad_ill(x): return np.array([2*x[0], 20*x[1]])
# Critical step size: GD on a quadratic converges iff alpha < 2 / lambda_max(Hessian)
lambda_max = 20                     # largest Hessian eigenvalue
alpha_crit = 2 / lambda_max         # = 0.1
print(f'Hessian eigenvalues: [2, 20]')
print(f'GD converges for alpha < {alpha_crit:.3f}')
alphas = [0.01, 0.04, 0.08, 0.12]   # 0.12 exceeds the critical value -> diverges
x0 = np.array([3., 2.])
n_steps = 80
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
g = np.linspace(-3.5, 3.5, 200)
X1p, X2p = np.meshgrid(g, g)
Zp = X1p**2 + 10*X2p**2
colors = ['steelblue', 'green', 'darkorange', 'crimson']
for alpha, col in zip(alphas, colors):
path = [x0.copy()]
x = x0.copy()
for _ in range(n_steps):
x = x - alpha * grad_ill(x)
path.append(x.copy())
path = np.array(path)
losses = [f_ill(p) for p in path]
axes[0].contourf(X1p, X2p, Zp, levels=20, cmap='Blues', alpha=0.15)
axes[0].plot(path[:,0], path[:,1], '-o', color=col, ms=2, lw=1.2, label=f'α={alpha}')
axes[1].plot(losses, color=col, lw=2, label=f'α={alpha}')
axes[0].set_xlim([-3.5,3.5]); axes[0].set_ylim([-2.5,2.5])
axes[0].set_title('Descent paths'); axes[0].legend(fontsize=9)
axes[1].set_xlabel('Iteration'); axes[1].set_ylabel('f(x)')
axes[1].set_title('Loss curves'); axes[1].legend(fontsize=9)
axes[1].set_ylim([0, 150])
plt.tight_layout(); plt.show()
✏️ 5.2 · Your Turn — Find the Critical Learning Rate¶
For the function $f(x_1, x_2) = 3x_1^2 + 0.5x_2^2$:
- What are the Hessian eigenvalues?
- What is the theoretical maximum safe learning rate?
- Verify experimentally — find the smallest $\alpha$ that causes divergence.
def f_new(x): return 3*x[0]**2 + 0.5*x[1]**2
def grad_new(x): return np.array([6*x[0], x[1]])
# ✏️ Step 1: Hessian eigenvalues
# H = diag(6, 1) -> eigenvalues = ___, ___
lambda_max_new = ___ # fill in
alpha_theory = ___ # fill in: 2 / lambda_max_new
print(f'Hessian eigenvalues: [6, 1]')
print(f'Max safe alpha (theory): {alpha_theory:.4f}')
# ✏️ Step 2: Experiment — try alphas around the critical value
alphas_test = [0.05, 0.10, 0.15, 0.20, 0.35] # adjust based on your theory answer
x0 = np.array([2., 3.])
fig, ax = plt.subplots(figsize=(8, 4))
for alpha_i in alphas_test:
x = x0.copy()
losses = []
for _ in range(60):
x = x - alpha_i * grad_new(x)
losses.append(f_new(x))
losses = np.clip(losses, 0, 500) # clip for visibility
ax.plot(losses, label=f'α={alpha_i}')
ax.axhline(0, color='k', lw=0.8, linestyle='--')
ax.set_xlabel('Iteration'); ax.set_ylabel('f(x)')
ax.set_title('Finding the critical learning rate')
ax.legend(); plt.tight_layout(); plt.show()
# 💬 Discussion: Where exactly does it start to diverge? Does it match the theory?
Section 6 · Beyond Classical Gradient Descent¶
Gradient Descent with Momentum; Adam¶
Vanilla gradient descent has two big weaknesses:
- Slow in ravines — it zigzags on ill-conditioned landscapes
- Same step size for every parameter — some dimensions need bigger steps
Two widely-used fixes:
Momentum adds a velocity term to smooth out the zigzag: $$\mathbf{v}^{(t+1)} = \beta \mathbf{v}^{(t)} - \alpha \nabla f(\mathbf{x}^{(t)}), \qquad \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \mathbf{v}^{(t+1)}$$
Adam (Adaptive Moment Estimation) keeps per-parameter running estimates of the gradient and its square, and adapts the step size for each coordinate individually.
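To make that description concrete, here is a minimal from-scratch sketch of the Adam update (the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ used below are the standard choices from the original paper), run on the ill-conditioned quadratic from Section 5:

```python
import numpy as np

def adam(grad_fn, x_init, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, n_steps=300):
    """Minimal Adam: running moment estimates with bias correction."""
    x = np.array(x_init, dtype=float)
    m = np.zeros_like(x)                 # running mean of gradients
    v = np.zeros_like(x)                 # running mean of squared gradients
    for t in range(1, n_steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat = m / (1 - b1**t)          # bias correction (moments start at 0)
        v_hat = v / (1 - b2**t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)   # per-coordinate step size
    return x

grad_ill = lambda x: np.array([2*x[0], 20*x[1]])   # grad of x1^2 + 10*x2^2
x_final = adam(grad_ill, [3.0, 2.0])
print(x_final)   # approaches the minimum at (0, 0)
```

Note how the division by $\sqrt{\hat{v}}$ gives each coordinate its own effective step size, which is exactly what the ill-conditioned landscape needs.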
6.1 · Worked — Implementing Momentum from Scratch¶
def run_momentum(grad_fn, x_init, alpha, beta, n_steps):
"""Gradient descent with momentum. beta=0 recovers vanilla GD."""
x = np.array(x_init, dtype=float)
v = np.zeros_like(x) # velocity initialised at 0
path = [x.copy()]
for _ in range(n_steps):
v = beta * v - alpha * grad_fn(x) # update velocity
x = x + v # update position
path.append(x.copy())
return np.array(path)
# Compare on the ill-conditioned landscape
x0 = np.array([3., 2.])
n_steps = 80
path_gd = run_gd_numpy(grad_ill, x0, alpha=0.04, n_steps=n_steps)
path_mom = run_momentum(grad_ill, x0, alpha=0.04, beta=0.85, n_steps=n_steps)
g = np.linspace(-3.5, 3.5, 200)
X1p, X2p = np.meshgrid(g, g)
Zp = X1p**2 + 10*X2p**2
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax in axes:
ax.contourf(X1p, X2p, Zp, levels=20, cmap='Blues', alpha=0.2)
axes[0].plot(path_gd[:,0], path_gd[:,1], '-o', color='crimson', ms=2, lw=1.2, label='Vanilla GD')
axes[0].plot(path_mom[:,0], path_mom[:,1], '-o', color='steelblue', ms=2, lw=1.2, label='Momentum')
axes[0].set_xlim([-3.5,3.5]); axes[0].set_ylim([-2.5,2.5])
axes[0].set_title('Descent paths (same α=0.04)'); axes[0].legend()
losses_gd = [f_ill(p) for p in path_gd]
losses_mom = [f_ill(p) for p in path_mom]
axes[1].plot(losses_gd, color='crimson', lw=2, label='Vanilla GD')
axes[1].plot(losses_mom, color='steelblue', lw=2, label='Momentum β=0.85')
axes[1].set_xlabel('Iteration'); axes[1].set_ylabel('f(x)')
axes[1].set_title('Loss curves'); axes[1].legend()
plt.tight_layout(); plt.show()
✏️ 6.2 · Your Turn — Adam Optimizer in PyTorch¶
PyTorch provides Adam out of the box via torch.optim.Adam.
The workflow is:
optimizer = torch.optim.Adam([x], lr=alpha)
optimizer.zero_grad() # clear old gradients
loss = f(x)
loss.backward() # compute new gradients
optimizer.step() # update x
Use this to minimise the Rosenbrock function from Section 4.3.
Compare Adam vs vanilla GD (Section 4.3 result) — how many steps does each need?
# ✏️ Minimise Rosenbrock with Adam
x_adam = torch.tensor([-0.5, 0.5], dtype=torch.float64, requires_grad=True)
optimizer = torch.optim.Adam([x_adam], lr=___) # fill in a learning rate (try 0.01)
path_adam = [x_adam.detach().numpy().copy()]
loss_adam = []
for step in range(___):
optimizer.___() # fill in: zero_grad
loss = f_rosenbrock(x_adam)
loss.___() # fill in: backward
optimizer.___() # fill in: step
path_adam.append(x_adam.detach().numpy().copy())
loss_adam.append(loss.item())
path_adam = np.array(path_adam)
print(f'Adam final x = {path_adam[-1].round(5)}')
print(f'Adam final f = {loss_adam[-1]:.6f}')
# Plot loss curve
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(loss_adam, color='steelblue', lw=2, label='Adam')
ax.set_xlabel('Iteration'); ax.set_ylabel('f(x)')
ax.set_yscale('log')
ax.set_title('Adam on Rosenbrock (log scale)')
ax.legend(); plt.tight_layout(); plt.show()
# 💬 Compare with the vanilla GD path from Section 4.3:
# How many steps did GD need vs Adam to reach f < 0.01?
🔬 6.3 · Experiment — Momentum Beta¶
The momentum parameter $\beta \in [0, 1)$ controls how much of the previous velocity is kept.
- $\beta = 0$: vanilla gradient descent
- $\beta \to 1$: very persistent velocity (can overshoot!)
Run the cell below. Try different values of beta and observe the trade-off.
# 🔬 Change beta and re-run
betas = [0.0, 0.5, 0.85, 0.95]
x0 = np.array([3., 2.])
n_steps = 100
alpha = 0.04
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
colors = ['crimson', 'darkorange', 'steelblue', 'green']
for beta, col in zip(betas, colors):
path = run_momentum(grad_ill, x0, alpha=alpha, beta=beta, n_steps=n_steps)
losses = [f_ill(p) for p in path]
axes[0].plot(path[:,0], path[:,1], '-', color=col, lw=1.2, alpha=0.8,
label=f'β={beta}')
axes[1].plot(losses, color=col, lw=2, label=f'β={beta}')
for ax in axes:
ax.legend(fontsize=9)
axes[0].contour(X1p, X2p, Zp, levels=15, colors='grey', linewidths=0.4, alpha=0.5)
axes[0].set_xlim([-4,4]); axes[0].set_ylim([-3,3])
axes[0].set_title('Paths for different β values')
axes[1].set_xlabel('Iteration'); axes[1].set_ylabel('f(x)')
axes[1].set_title('Loss curves')
axes[1].set_ylim([0, 150])
plt.tight_layout(); plt.show()
# 💬 Discuss:
# What is the sweet spot for beta on this landscape?
# What happens when beta is too high (e.g. 0.99)?
🏁 Summary¶
| Section | What you built | mml-book reference |
|---|---|---|
| 1 | Finite differences, chain rule by hand | §5.1 |
| 2 | Gradient field, analytical vs numerical $\nabla f$, Jacobian | §5.2–5.3, §5.5 |
| 3 | PyTorch autograd — computation graph, .backward() | §5.6 |
| 4 | Gradient descent in NumPy and PyTorch — same result | §7.1 |
| 5 | Learning rate — safe bound, ill-conditioning, zigzag | §7.1 |
| 6 | Momentum from scratch; Adam via torch.optim | §7.1 |
The key insight connecting all six sections:
The gradient tells us the direction.
The learning rate tells us how far to step.
The optimizer decides how to use that information — vanilla, momentum, or adaptive.
MA2221 — Foundational Mathematics for Machine Learning · Mahindra University
Lab Notebook 7 · © Biswarup Biswas