Training a neural network means finding the weights that minimise a loss function. The function takes millions of parameters as input and outputs a single number — how wrong the network's predictions are. Finding the minimum of that function is an optimisation problem, and the algorithm that solves it — gradient descent — runs calculus every single iteration.
This is not metaphorical. Backpropagation is the chain rule applied recursively through every layer of the network. Every weight update is a calculus computation. You cannot understand what is actually happening inside a neural network without knowing what a derivative is.
The Loss Function and its Gradient
Training a neural network minimises a loss function L(θ) where θ is the vector of all weights. The gradient ∇_θ L points in the direction of steepest increase of L. Gradient descent updates: θ ← θ − η·∇_θ L, where η is the learning rate. Each step moves downhill on the loss surface.
Backpropagation is the Chain Rule
Backprop computes ∇_θ L by applying the Chain Rule layer by layer, from output back to input. For a network whose layers produce activations a₁, a₂, ..., aₙ: ∂L/∂w₁ = (∂L/∂aₙ)·(∂aₙ/∂aₙ₋₁)·...·(∂a₂/∂a₁)·(∂a₁/∂w₁). This is the multivariable chain rule applied recursively.
Second-Order Methods — Hessian
The Hessian matrix H = [∂²L/∂θᵢ∂θⱼ] is the matrix of second partial derivatives. Newton's method uses H⁻¹ to take larger, curvature-aware steps. Adam, AdaGrad, and other adaptive optimisers never form the Hessian; they adapt per-parameter step sizes from running gradient statistics, a cheap proxy for curvature information.
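To make the Newton step concrete, here is a minimal sketch on a quadratic loss (the matrix A and vector b are illustrative choices, not taken from any particular model). For a quadratic, the Hessian is the constant matrix A, so a single Newton step θ − H⁻¹∇L lands exactly on the minimiser:

```python
import numpy as np

# Quadratic loss L(theta) = 0.5 * theta^T A theta - b^T theta.
# Its gradient is A @ theta - b and its Hessian is the constant matrix A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite Hessian
b = np.array([1.0, -1.0])

def grad(theta):
    return A @ theta - b

theta = np.zeros(2)
# One Newton step: solve H d = grad instead of forming H^{-1} explicitly.
theta = theta - np.linalg.solve(A, grad(theta))

print(np.allclose(A @ theta, b))  # True: theta is the exact minimiser
```

On a non-quadratic loss the Hessian changes at every point, which is exactly why computing it for millions of parameters is impractical and approximations are used instead.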
Integrals in Probabilistic ML
In Bayesian machine learning, the posterior P(θ|data) ∝ P(data|θ)·P(θ) requires integration to normalise. Computing expected values E[f(θ)] = ∫f(θ)P(θ|data)dθ is often intractable, requiring Monte Carlo integration or variational inference — approximation methods grounded in calculus.
Every neural network is trained by computing derivatives (backpropagation = Chain Rule) and minimising a function (gradient descent). Understanding calculus is the difference between using ML tools and understanding why they work.
The Loss Function — What We're Minimising
Training a machine learning model means finding parameters θ (weights and biases) that minimise a loss function L(θ). For regression, L(θ) = (1/n)Σ(yᵢ − ŷᵢ)² (mean squared error). For classification, L(θ) = −(1/n)Σ[yᵢ log(ŷᵢ) + (1−yᵢ)log(1−ŷᵢ)] (cross-entropy). The loss L is a function of potentially millions of parameters — a scalar function of a high-dimensional vector. Finding its minimum requires multivariable calculus.
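As a sketch, both losses above are a few lines of NumPy (the function names and test values here are our own illustration; the clipping constant guards against log(0)):

```python
import numpy as np

# Mean squared error: L = (1/n) * sum((y_i - yhat_i)^2)
def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

# Binary cross-entropy: L = -(1/n) * sum(y*log(yhat) + (1-y)*log(1-yhat))
def cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep log() finite
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.1, 0.8])
print(mse(y, y_hat))            # ~0.02: predictions are close to the targets
print(cross_entropy(y, y_hat))  # small positive number for confident, correct predictions
```

Both functions map a prediction vector to a single scalar, which is what makes the gradient ∇_θ L well defined.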
Gradient Descent — The Optimisation Algorithm
The gradient ∇_θ L is the vector of all partial derivatives ∂L/∂θᵢ — it points in the direction of steepest increase of L. Gradient descent moves in the opposite direction: θ ← θ − η·∇_θ L, where η is the learning rate (step size). Each iteration moves the parameters slightly downhill on the loss surface. With a suitable learning rate and enough iterations, this converges to a (local) minimum. Stochastic gradient descent (SGD) uses a random mini-batch of training examples to approximate the full gradient — computationally cheaper and often faster to converge in practice.
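A minimal gradient-descent loop on a toy one-parameter least-squares problem (the data, learning rate, and iteration count are illustrative) shows the update rule θ ← θ − η·∇L in action:

```python
import numpy as np

# Fit y ≈ w * x by gradient descent.
# L(w) = mean((y - w*x)^2), so dL/dw = -2 * mean(x * (y - w*x)).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.5 * x                       # true weight is 2.5

w, eta = 0.0, 0.01                # initial weight and learning rate
for _ in range(500):
    grad = -2 * np.mean(x * (y - w * x))
    w -= eta * grad               # step downhill on the loss surface

print(round(w, 3))  # 2.5: converged to the true weight
```

SGD would replace the full-data mean in `grad` with a mean over a random mini-batch; the update rule itself is unchanged.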
Backpropagation — The Chain Rule at Scale
Computing ∇_θ L for a deep neural network requires the Chain Rule applied through every layer. For a network with composition y = f_L(f_{L-1}(···f_1(x)···)), where a_k denotes the output of layer k (so a_L = y), the gradient with respect to weights in layer k is: ∂L/∂w_k = (∂L/∂a_L)·(∂a_L/∂a_{L-1})···(∂a_{k+1}/∂a_k)·(∂a_k/∂w_k). This is the Chain Rule applied L−k times. Backpropagation computes this efficiently by caching intermediate activations during the "forward pass", then propagating gradient information backwards (the "backward pass"). The entire algorithm is the multivariable Chain Rule implemented cleverly.
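A hand-rolled sketch for a tiny two-weight network (the values and function names are illustrative) makes the cached-forward/backward structure explicit, and checks the analytic gradient against a numerical one:

```python
import numpy as np

# Tiny 2-layer network y = w2 * sigmoid(w1 * x), loss L = 0.5 * (y - t)^2.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w1, w2, x):
    a = sigmoid(w1 * x)      # cache the intermediate activation
    return a, w2 * a

x, t, w1, w2 = 1.5, 1.0, 0.4, 0.7
a, y = forward(w1, w2, x)

# Backward pass: each factor is one link of the chain rule.
dL_dy = y - t                          # dL/dy
dL_dw2 = dL_dy * a                     # dL/dy * dy/dw2
dL_dw1 = dL_dy * w2 * a * (1 - a) * x  # dL/dy * dy/da * da/dw1

# Sanity check against a finite-difference derivative.
eps = 1e-6
_, y_plus = forward(w1 + eps, w2, x)
num = (0.5 * (y_plus - t) ** 2 - 0.5 * (y - t) ** 2) / eps
print(abs(dL_dw1 - num) < 1e-6)  # True: analytic and numeric gradients agree
```

Autodiff frameworks do exactly this bookkeeping (cache activations forward, multiply Jacobians backward) for arbitrary computation graphs.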
Activation Functions and Their Derivatives
The choice of activation function directly affects the gradients available for training. ReLU: f(x) = max(0,x), f'(x) = 0 (x<0) or 1 (x>0). Sigmoid: f(x) = 1/(1+e^(−x)), f'(x) = f(x)(1−f(x)). Tanh: f'(x) = 1 − tanh²(x). The "vanishing gradient problem" — where gradients become exponentially small in deep networks — arises because the sigmoid derivative is at most 0.25 and the tanh derivative at most 1, with both falling off rapidly away from zero; multiplying many such small numbers together drives the product to zero. ReLU (derivative exactly 1 in the positive region) largely mitigates this problem. The mathematical analysis is entirely about the Chain Rule and the magnitude of derivative products.
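A quick numerical check (the depth of 20 is an illustrative choice) of how fast even the best-case sigmoid derivative product decays, compared with ReLU's derivative of 1:

```python
import numpy as np

# sigmoid'(z) = s(z) * (1 - s(z)) peaks at 0.25 (at z = 0), so the best-case
# gradient product through `depth` sigmoid layers decays like 0.25 ** depth.
def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

depth = 20
best_case = sigmoid_deriv(0.0) ** depth   # 0.25 ** 20
relu_case = 1.0 ** depth                  # ReLU derivative in the positive region

print(best_case)   # ~9.1e-13: the gradient signal has effectively vanished
print(relu_case)   # 1.0: ReLU preserves gradient magnitude at any depth
```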
Regularisation — Calculus Meets Statistics
L2 regularisation adds λ||θ||² to the loss: L_reg = L + λΣθᵢ². The gradient becomes ∂L_reg/∂θᵢ = ∂L/∂θᵢ + 2λθᵢ. The extra term 2λθᵢ pulls weights toward zero — this is "weight decay", and it prevents overfitting by penalising large weights. L1 regularisation adds λΣ|θᵢ|. Its gradient ∂|θᵢ|/∂θᵢ = sign(θᵢ) is +1 or −1 (undefined at zero, where a subgradient is used), so the pull toward zero has constant magnitude and can push weights exactly to zero, producing sparse models. The different sparsity behaviour of L1 vs L2 regularisation is a direct consequence of the different shapes of |x| vs x² and their different derivatives.
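A small sketch of the contrast (λ, the step size, and the clamp-at-zero rule are our own illustrative choices; real L1 solvers use soft-thresholding or proximal steps): applying each penalty gradient on its own, with no data term, shows L2 shrinking a weight exponentially while L1 drives it exactly to zero.

```python
import numpy as np

def decay(w, penalty, lam=0.1, eta=0.1, steps=200):
    for _ in range(steps):
        if penalty == "l2":
            w -= eta * 2 * lam * w          # gradient of lam * w^2
        else:
            w -= eta * lam * np.sign(w)     # (sub)gradient of lam * |w|
            if abs(w) < eta * lam:          # crude clamp to stop oscillation at zero
                w = 0.0
    return w

print(decay(1.0, "l2"))   # small but nonzero: proportional shrinkage never reaches 0
print(decay(1.0, "l1"))   # exactly 0.0: constant-magnitude pull hits zero in finite steps
```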
Probability and Integration in Bayesian ML
Bayesian machine learning represents uncertainty as probability distributions. The posterior P(θ|data) ∝ P(data|θ)·P(θ) requires integration to normalise: Z = ∫P(data|θ)P(θ)dθ. For all but the simplest models, this integral is analytically intractable. Variational inference approximates it by optimising a lower bound (the ELBO) using gradient descent. Markov Chain Monte Carlo methods approximate the integral numerically using samples. Both methods — used by modern Bayesian neural networks — are fundamentally about computing or approximating high-dimensional integrals using calculus.
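As a minimal illustration of Monte Carlo integration (the target expectation, prior, and sample count are our own choice): estimating E[θ²] under a standard normal, where the true value of the integral is 1 (the variance).

```python
import numpy as np

# Monte Carlo estimate of E[f(theta)] = ∫ f(theta) p(theta) dtheta
# with p = standard normal and f(theta) = theta^2.
rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)
estimate = np.mean(samples ** 2)   # sample mean approximates the integral

print(estimate)  # close to the true value 1.0
```

The error of such an estimate shrinks like 1/√N regardless of dimension, which is why sampling methods remain usable where grid-based numerical integration becomes hopeless.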