Training a neural network means finding the weights that minimise a loss function. The function takes millions of parameters as input and outputs a single number — how wrong the network's predictions are. Finding the minimum of that function is an optimisation problem, and the algorithm that solves it — gradient descent — runs calculus every single iteration.
This is not metaphorical. Backpropagation is the chain rule applied recursively through every layer of the network. Every weight update is a calculus computation. You cannot understand what is actually happening inside a neural network without knowing what a derivative is.
The Loss Function and its Gradient
Training a neural network minimises a loss function L(θ) where θ is the vector of all weights. The gradient ∇_θ L points in the direction of steepest increase of L. Gradient descent updates: θ ← θ − η·∇_θ L, where η is the learning rate. Each step moves downhill on the loss surface.
Backpropagation is the Chain Rule
Backprop computes ∇_θ L by applying the Chain Rule layer by layer, from output back to input. For a network whose layers produce activations a₁, a₂, ..., aₙ: ∂L/∂w₁ = (∂L/∂aₙ)·(∂aₙ/∂aₙ₋₁)·...·(∂a₂/∂a₁)·(∂a₁/∂w₁). This is the multivariable chain rule applied recursively.
Second-Order Methods — Hessian
The Hessian matrix H = [∂²L/∂θᵢ∂θⱼ] is the matrix of second partial derivatives. Newton's method uses H⁻¹ to take larger, curvature-aware steps. Adam, AdaGrad, and other adaptive optimisers never form the Hessian; they adapt per-parameter step sizes from running gradient statistics, a cheap proxy for curvature information.
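To make the Newton step concrete, here is a minimal sketch on a quadratic loss (the matrix A and vector b are illustrative choices, not taken from any particular model). For a quadratic, the Hessian is the constant matrix A, so a single Newton step θ − H⁻¹∇L lands exactly on the minimiser:

```python
import numpy as np

# Quadratic loss L(theta) = 0.5 * theta^T A theta - b^T theta.
# Its gradient is A @ theta - b and its Hessian is the constant matrix A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite Hessian
b = np.array([1.0, -1.0])

def grad(theta):
    return A @ theta - b

theta = np.zeros(2)
# One Newton step: solve H d = grad instead of forming H^{-1} explicitly.
theta = theta - np.linalg.solve(A, grad(theta))

print(np.allclose(A @ theta, b))  # True: theta is the exact minimiser
```

On a non-quadratic loss the Hessian changes at every point, which is exactly why computing it for millions of parameters is impractical and approximations are used instead.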
Integrals in Probabilistic ML
In Bayesian machine learning, the posterior P(θ|data) ∝ P(data|θ)·P(θ) requires integration to normalise. Computing expected values E[f(θ)] = ∫f(θ)P(θ|data)dθ is often intractable, requiring Monte Carlo integration or variational inference — approximation methods grounded in calculus.
Every neural network is trained by computing derivatives (backpropagation = Chain Rule) and minimising a function (gradient descent). Understanding calculus is the difference between using ML tools and understanding why they work.
The Loss Function — What We're Minimising
Training a machine learning model means finding parameters θ (weights and biases) that minimise a loss function L(θ). For regression, L(θ) = (1/n)Σ(yᵢ − ŷᵢ)² (mean squared error). For classification, L(θ) = −(1/n)Σ[yᵢ log(ŷᵢ) + (1−yᵢ)log(1−ŷᵢ)] (cross-entropy). The loss L is a function of potentially millions of parameters — a scalar function of a high-dimensional vector. Finding its minimum requires multivariable calculus.
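As a sketch, both losses above are a few lines of NumPy (the function names and test values here are our own illustration; the clipping constant guards against log(0)):

```python
import numpy as np

# Mean squared error: L = (1/n) * sum((y_i - yhat_i)^2)
def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

# Binary cross-entropy: L = -(1/n) * sum(y*log(yhat) + (1-y)*log(1-yhat))
def cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep log() finite
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.1, 0.8])
print(mse(y, y_hat))            # ~0.02: predictions are close to the targets
print(cross_entropy(y, y_hat))  # small positive number for confident, correct predictions
```

Both functions map a prediction vector to a single scalar, which is what makes the gradient ∇_θ L well defined.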
Gradient Descent — The Optimisation Algorithm
The gradient ∇_θ L is the vector of all partial derivatives ∂L/∂θᵢ — it points in the direction of steepest increase of L. Gradient descent moves in the opposite direction: θ ← θ − η·∇_θ L, where η is the learning rate (step size). Each iteration moves the parameters slightly downhill on the loss surface. With a suitable learning rate and enough iterations, this converges to a (local) minimum. Stochastic gradient descent (SGD) uses a random mini-batch of training examples to approximate the full gradient — computationally cheaper and often faster to converge in practice.
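A minimal gradient-descent loop on a toy one-parameter least-squares problem (the data, learning rate, and iteration count are illustrative) shows the update rule θ ← θ − η·∇L in action:

```python
import numpy as np

# Fit y ≈ w * x by gradient descent.
# L(w) = mean((y - w*x)^2), so dL/dw = -2 * mean(x * (y - w*x)).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.5 * x                       # true weight is 2.5

w, eta = 0.0, 0.01                # initial weight and learning rate
for _ in range(500):
    grad = -2 * np.mean(x * (y - w * x))
    w -= eta * grad               # step downhill on the loss surface

print(round(w, 3))  # 2.5: converged to the true weight
```

SGD would replace the full-data mean in `grad` with a mean over a random mini-batch; the update rule itself is unchanged.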
Backpropagation — The Chain Rule at Scale
Computing ∇_θ L for a deep neural network requires the Chain Rule applied through every layer. For a network with composition y = f_L(f_{L-1}(···f_1(x)···)), where a_k denotes the output of layer k (so a_L = y), the gradient with respect to weights in layer k is: ∂L/∂w_k = (∂L/∂a_L)·(∂a_L/∂a_{L-1})···(∂a_{k+1}/∂a_k)·(∂a_k/∂w_k). This is the Chain Rule applied L−k times. Backpropagation computes this efficiently by caching intermediate activations during the "forward pass", then propagating gradient information backwards (the "backward pass"). The entire algorithm is the multivariable Chain Rule implemented cleverly.
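A hand-rolled sketch for a tiny two-weight network (the values and function names are illustrative) makes the cached-forward/backward structure explicit, and checks the analytic gradient against a numerical one:

```python
import numpy as np

# Tiny 2-layer network y = w2 * sigmoid(w1 * x), loss L = 0.5 * (y - t)^2.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w1, w2, x):
    a = sigmoid(w1 * x)      # cache the intermediate activation
    return a, w2 * a

x, t, w1, w2 = 1.5, 1.0, 0.4, 0.7
a, y = forward(w1, w2, x)

# Backward pass: each factor is one link of the chain rule.
dL_dy = y - t                          # dL/dy
dL_dw2 = dL_dy * a                     # dL/dy * dy/dw2
dL_dw1 = dL_dy * w2 * a * (1 - a) * x  # dL/dy * dy/da * da/dw1

# Sanity check against a finite-difference derivative.
eps = 1e-6
_, y_plus = forward(w1 + eps, w2, x)
num = (0.5 * (y_plus - t) ** 2 - 0.5 * (y - t) ** 2) / eps
print(abs(dL_dw1 - num) < 1e-6)  # True: analytic and numeric gradients agree
```

Autodiff frameworks do exactly this bookkeeping (cache activations forward, multiply Jacobians backward) for arbitrary computation graphs.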
Activation Functions and Their Derivatives
The choice of activation function directly affects the gradients available for training. ReLU: f(x) = max(0,x), f'(x) = 0 (x<0) or 1 (x>0). Sigmoid: f(x) = 1/(1+e^(−x)), f'(x) = f(x)(1−f(x)). Tanh: f'(x) = 1 − tanh²(x). The "vanishing gradient problem" — where gradients become exponentially small in deep networks — arises because the sigmoid derivative is at most 0.25 and the tanh derivative at most 1, with both falling off rapidly away from zero; multiplying many such small numbers together drives the product to zero. ReLU (derivative exactly 1 in the positive region) largely mitigates this problem. The mathematical analysis is entirely about the Chain Rule and the magnitude of derivative products.
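A quick numerical check (the depth of 20 is an illustrative choice) of how fast even the best-case sigmoid derivative product decays, compared with ReLU's derivative of 1:

```python
import numpy as np

# sigmoid'(z) = s(z) * (1 - s(z)) peaks at 0.25 (at z = 0), so the best-case
# gradient product through `depth` sigmoid layers decays like 0.25 ** depth.
def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

depth = 20
best_case = sigmoid_deriv(0.0) ** depth   # 0.25 ** 20
relu_case = 1.0 ** depth                  # ReLU derivative in the positive region

print(best_case)   # ~9.1e-13: the gradient signal has effectively vanished
print(relu_case)   # 1.0: ReLU preserves gradient magnitude at any depth
```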
Regularisation — Calculus Meets Statistics
L2 regularisation adds λ||θ||² to the loss: L_reg = L + λΣθᵢ². The gradient becomes ∂L_reg/∂θᵢ = ∂L/∂θᵢ + 2λθᵢ. The extra term 2λθᵢ pulls weights toward zero — this is "weight decay", and it prevents overfitting by penalising large weights. L1 regularisation adds λΣ|θᵢ|. Its gradient ∂|θᵢ|/∂θᵢ = sign(θᵢ) is +1 or −1 (undefined at zero, where a subgradient is used), so the pull toward zero has constant magnitude and can push weights exactly to zero, producing sparse models. The different sparsity behaviour of L1 vs L2 regularisation is a direct consequence of the different shapes of |x| vs x² and their different derivatives.
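A small sketch of the contrast (λ, the step size, and the clamp-at-zero rule are our own illustrative choices; real L1 solvers use soft-thresholding or proximal steps): applying each penalty gradient on its own, with no data term, shows L2 shrinking a weight exponentially while L1 drives it exactly to zero.

```python
import numpy as np

def decay(w, penalty, lam=0.1, eta=0.1, steps=200):
    for _ in range(steps):
        if penalty == "l2":
            w -= eta * 2 * lam * w          # gradient of lam * w^2
        else:
            w -= eta * lam * np.sign(w)     # (sub)gradient of lam * |w|
            if abs(w) < eta * lam:          # crude clamp to stop oscillation at zero
                w = 0.0
    return w

print(decay(1.0, "l2"))   # small but nonzero: proportional shrinkage never reaches 0
print(decay(1.0, "l1"))   # exactly 0.0: constant-magnitude pull hits zero in finite steps
```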
Probability and Integration in Bayesian ML
Bayesian machine learning represents uncertainty as probability distributions. The posterior P(θ|data) ∝ P(data|θ)·P(θ) requires integration to normalise: Z = ∫P(data|θ)P(θ)dθ. For all but the simplest models, this integral is analytically intractable. Variational inference approximates it by optimising a lower bound (the ELBO) using gradient descent. Markov Chain Monte Carlo methods approximate the integral numerically using samples. Both methods — used by modern Bayesian neural networks — are fundamentally about computing or approximating high-dimensional integrals using calculus.
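As a minimal illustration of Monte Carlo integration (the target expectation, prior, and sample count are our own choice): estimating E[θ²] under a standard normal, where the true value of the integral is 1 (the variance).

```python
import numpy as np

# Monte Carlo estimate of E[f(theta)] = ∫ f(theta) p(theta) dtheta
# with p = standard normal and f(theta) = theta^2.
rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)
estimate = np.mean(samples ** 2)   # sample mean approximates the integral

print(estimate)  # close to the true value 1.0
```

The error of such an estimate shrinks like 1/√N regardless of dimension, which is why sampling methods remain usable where grid-based numerical integration becomes hopeless.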