Perceptron → MLP, Forward Pass & Activation Functions
You hire a new junior engineer and ask them to debug a ResNet that trains "but gives terrible accuracy." The first thing to check: are activations dead? Are gradients flowing? Understanding how a network actually computes — neuron by neuron, layer by layer — is the prerequisite for every debugging conversation you will ever have.
A single neuron (perceptron) computes a weighted sum of its inputs plus a bias, then applies a nonlinearity:

  z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b = wᵀx + b
  output = σ(z), where σ is an activation function

A Multi-Layer Perceptron (MLP) stacks layers of neurons:

  Layer 1: a⁽¹⁾ = σ(W⁽¹⁾x + b⁽¹⁾)
  Layer 2: a⁽²⁾ = σ(W⁽²⁾a⁽¹⁾ + b⁽²⁾)
  Output:  ŷ = W⁽³⁾a⁽²⁾ + b⁽³⁾

The forward pass is purely left-to-right: compute each layer's pre-activation z, apply σ, pass to the next layer. No feedback, no cycles (in a basic MLP).

Universal Approximation Theorem: A single hidden layer with enough neurons can approximate any continuous function on a compact subset of ℝⁿ to arbitrary precision. This gives MLPs their theoretical power, but it says nothing about how to train them or how many neurons are "enough." In practice, depth is more efficient than width.
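The neuron equation can be checked by hand with a few tensors. This is a minimal sketch: the weights, bias, and input values are made up for illustration, and sigmoid stands in for σ.

```python
import torch

# Hypothetical weights, bias, and input, chosen only for illustration.
w = torch.tensor([0.5, -1.0, 2.0])
b = torch.tensor(0.1)
x = torch.tensor([1.0, 2.0, 3.0])

z = w @ x + b            # weighted sum plus bias: 0.5 - 2.0 + 6.0 + 0.1
a = torch.sigmoid(z)     # activation applied to the pre-activation

# A layer does this for many neurons at once: one row of W per neuron.
W = torch.stack([w, -w])             # 2 neurons, 3 inputs each
layer_b = torch.tensor([0.1, 0.0])   # one bias per neuron
layer_out = torch.sigmoid(W @ x + layer_b)

print(f"z = {z.item():.2f}, sigma(z) = {a.item():.4f}")
print(f"layer output: {layer_out.tolist()}")
```

The first entry of layer_out matches the single-neuron result, since the first row of W and the first bias are the same w and b.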
Input         Hidden Layer 1         Hidden Layer 2        Output
─────         ──────────────         ──────────────        ──────
x₁ ──────►   ┌────────────┐ ─────►  ┌────────────┐ ────►
             │ σ(W⁽¹⁾x+b) │         │ σ(W⁽²⁾a+b) │
x₂ ──────►   │  h₁    h₂  │ ─────►  │  h₁    h₂  │ ────►  ŷ
             │  h₃    h₄  │         │  h₃    h₄  │
x₃ ──────►   └────────────┘ ─────►  └────────────┘ ────►

z = Wx + b  →  a = σ(z)  →  z = Wa + b  →  a = σ(z)  →  ŷ

import torch
import torch.nn as nn
import torch.nn.functional as F
# ── Define a 3-layer MLP ──────────────────────────────────
class MLP(nn.Module):
    def __init__(self, in_features, hidden, out_features):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, out_features)

    def forward(self, x):
        x = F.relu(self.fc1(x))   # hidden layer 1
        x = F.relu(self.fc2(x))   # hidden layer 2
        return self.fc3(x)        # output logits (no activation)

model = MLP(in_features=10, hidden=64, out_features=1)
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params:,}")  # (10*64 + 64) + (64*64 + 64) + (64*1 + 1) = 4,929
# ── Run a forward pass ────────────────────────────────────
batch = torch.randn(32, 10) # 32 samples, 10 features
output = model(batch)
print(f"Input: {batch.shape} Output: {output.shape}") # [32, 1]
# ── Inspect intermediate activations ─────────────────────
x = torch.randn(4, 10)
z1 = model.fc1(x) # pre-activation
a1 = F.relu(z1) # post-ReLU
z2 = model.fc2(a1)
a2 = F.relu(z2)
out = model.fc3(a2)
print(f"\nLayer-by-layer shapes and stats:")
print(f" z1 pre-act: {z1.shape} range=[{z1.min():.2f}, {z1.max():.2f}]")
print(f" a1 post-ReLU:{a1.shape} zeros={(a1 == 0).sum().item()}/{a1.numel()}")
print(f" output: {out.shape}")
# ── Verify determinism in eval mode ──────────────────────
model.eval()
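with torch.no_grad():
    out1 = model(x)
    out2 = model(x)
print(f"\nDeterministic in eval mode: {torch.allclose(out1, out2)}")

In training mode, dropout randomly zeros activations and BatchNorm normalises with the current batch's statistics rather than its running statistics. A model with dropout therefore returns a different output on every training-mode call for the same input, and a model with BatchNorm returns outputs that depend on what else is in the batch. (The MLP above has neither layer, so it is deterministic in both modes; the check matters once such layers are present.) Engineers often skip model.eval() during quick debugging or A/B comparisons, then wonder why outputs vary across runs.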
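The train/eval difference is easiest to see on a model that actually contains dropout. The sketch below uses a small hypothetical nn.Sequential (not the MLP class above) and compares two calls on the same input in each mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model for illustration: one hidden layer followed by dropout.
net = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 1),
)
x = torch.randn(4, 10)

net.train()  # dropout active: each call samples a new mask
with torch.no_grad():
    train_a, train_b = net(x), net(x)

net.eval()   # dropout disabled: output depends only on the input
with torch.no_grad():
    eval_a, eval_b = net(x), net(x)

print("train-mode outputs equal:", torch.equal(train_a, train_b))
print("eval-mode outputs equal: ", torch.equal(eval_a, eval_b))
```

Note that torch.no_grad() only disables gradient tracking; it does not switch dropout off. Only model.eval() does that.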
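nn.Linear outputs raw logits: unbounded real numbers that are often negative or greater than 1. Thresholding them as if they were probabilities (e.g., output > 0.5 for binary classification) or treating them as a probability distribution is incorrect. Logits do not sum to 1 in multi-class settings. Apply a sigmoid (binary) or softmax (multi-class) first, or threshold the logit at 0, which corresponds to a probability of 0.5.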
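A short sketch of the difference, using made-up logit values rather than real model outputs: thresholding the raw logit at 0.5 and thresholding the sigmoid probability at 0.5 disagree for small positive logits.

```python
import torch

# Made-up logits for illustration.
binary_logits = torch.tensor([-2.0, 0.3, 4.0])
probs = torch.sigmoid(binary_logits)

wrong = binary_logits > 0.5   # thresholds the raw logit
right = probs > 0.5           # thresholds the probability (equivalent to logit > 0)
print("wrong:", wrong.tolist())
print("right:", right.tolist())

# Multi-class: logits are not a distribution until softmax is applied.
multi_logits = torch.tensor([[2.0, -1.0, 0.5]])
multi_probs = torch.softmax(multi_logits, dim=-1)
print("logit sum:", multi_logits.sum().item())
print("prob sum: ", multi_probs.sum().item())
```

The logit 0.3 maps to a probability above 0.5, so the two thresholds classify it differently.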
The UAT states that an MLP with a single hidden layer and enough neurons can approximate any continuous function on a compact subset of ℝⁿ to arbitrary precision. Practical limitations: (1) It is non-constructive: it does not say how many neurons are needed or how to find the right weights via training. (2) It says nothing about generalisation: a network can fit the training data perfectly and still fail on unseen inputs. (3) Depth tends to beat width in practice: for a comparable parameter budget, a few moderately wide layers (e.g., 3 layers of width 256) typically train more easily and generalise better than one enormous hidden layer. The theorem motivates using MLPs but does not tell you how to design one.
Composing linear functions produces a linear function: W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂) = W'x + b'. No matter how many linear layers you stack, the result is always expressible as a single matrix multiply plus bias, which is equivalent to one linear layer. The nonlinear activation breaks this: σ(W₂·σ(W₁x+b₁)+b₂) cannot be collapsed to a single linear operation. Without activations, a 100-layer MLP has exactly the same representational power as a single linear layer, i.e. linear regression (or logistic regression once a sigmoid is applied to the output). This is why accidentally removing activations in a custom model causes catastrophically poor performance while training appears to proceed normally (loss still decreases, just more slowly and toward a worse solution).
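The collapse can be verified numerically. In this sketch the layer sizes are arbitrary; it builds W' = W₂W₁ and b' = W₂b₁ + b₂ from two nn.Linear layers and compares the result against the stacked layers, with and without a ReLU in between.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between (sizes are arbitrary).
fc1 = nn.Linear(10, 64)
fc2 = nn.Linear(64, 3)

with torch.no_grad():
    # Collapse them: W' = W2 W1, b' = W2 b1 + b2
    W_prime = fc2.weight @ fc1.weight            # shape (3, 10)
    b_prime = fc2.weight @ fc1.bias + fc2.bias   # shape (3,)

    x = torch.randn(5, 10)
    stacked = fc2(fc1(x))                  # two linear layers
    collapsed = x @ W_prime.T + b_prime    # one equivalent linear layer
    with_relu = fc2(torch.relu(fc1(x)))    # nonlinearity inserted

print("stacked == collapsed:", torch.allclose(stacked, collapsed, atol=1e-4))
print("relu    == collapsed:", torch.allclose(with_relu, collapsed, atol=1e-4))
```

The first comparison holds up to floating-point error; the second fails because the ReLU prevents the two matrices from being multiplied together.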
Activation functions introduce nonlinearity. The choice affects gradient flow, output range, sparsity, and training speed.

ReLU (Rectified Linear Unit): f(x) = max(0, x)
  Gradient: 1 if x > 0, 0 if x < 0 (non-differentiable at x = 0)
  Fast, sparse activations, no saturation for x > 0
  Risk: dead ReLU, where a neuron permanently outputs 0 for all inputs

GELU (Gaussian Error Linear Unit): f(x) = x · Φ(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
  Smooth soft gate: attenuates rather than hard-zeros negative inputs
  Default in many Transformers, e.g. BERT, GPT, ViT (LLaMA uses SwiGLU instead)

Sigmoid: f(x) = 1/(1 + e^{-x})
  Range: (0, 1)
  Gradient: f(x)(1 − f(x)) ≤ 0.25, so it saturates and causes vanishing gradients in hidden layers
  Typical use: binary output layer; avoid it as a hidden-layer activation in deep networks

Tanh: f(x) = (e^x − e^{-x})/(e^x + e^{-x})
  Range: (−1, 1), zero-centred
  Gradient: 1 − f(x)² ≤ 1; still saturates, but better than sigmoid for hidden layers

Dead ReLU problem:
  A neuron is dead if z < 0 for ALL inputs, so its gradient is 0 and its weights never update
  Causes: a large negative bias; a large learning rate pushing weights very negative
  Fixes: Leaky ReLU, f(x) = max(0.01x, x); ELU; careful He initialisation plus a smaller learning rate
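The GELU formula and its tanh approximation can be compared directly. This sketch assumes a PyTorch version in which F.gelu accepts the approximate="tanh" argument (1.12 or later); the input grid is arbitrary.

```python
import math
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 13)

# Exact form: x * Phi(x), with Phi written via the error function.
exact = F.gelu(x)
manual_exact = x * 0.5 * (1 + torch.erf(x / math.sqrt(2)))

# Tanh approximation from the formula above.
manual_tanh = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))
builtin_tanh = F.gelu(x, approximate="tanh")

print("max |exact - tanh approx|:", (exact - manual_tanh).abs().max().item())
print("most negative GELU output:", exact.min().item())
```

The two forms agree closely over this range, and the minimum confirms that GELU dips slightly below zero instead of clamping negative inputs to exactly 0.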
Activation functions compared (f(x) plotted against x):
  sigmoid: passes through 0.5 at x = 0 and saturates at 1.0
  tanh: zero-centred, saturates at ±1
  GELU: smooth, dips slightly below 0 for negative x
  ReLU: flat at 0 for x < 0 (gradient = 0 there, hence the dead ReLU risk), then a straight line with gradient = 1 for x > 0

import torch
import torch.nn as nn
import torch.nn.functional as F
# ── Compare activation outputs ────────────────────────────
x = torch.linspace(-3, 3, 7)
print(f"x: {x.tolist()}")
print(f"ReLU: {F.relu(x).tolist()}")
print(f"GELU: {[round(v,3) for v in F.gelu(x).tolist()]}")
print(f"Sigmoid: {[round(v,3) for v in torch.sigmoid(x).tolist()]}")
print(f"Tanh: {[round(v,3) for v in torch.tanh(x).tolist()]}")
# ── Gradient comparison — saturation ─────────────────────
x_sat = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0], requires_grad=True)
torch.sigmoid(x_sat).sum().backward()
print(f"\nSigmoid grads at [-3,-1,0,1,3]: {[round(g,4) for g in x_sat.grad.tolist()]}")
# Near 0 at extremes → vanishing gradient in deep nets
x_relu = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0], requires_grad=True)
F.relu(x_relu).sum().backward()
print(f"ReLU grads at [-3,-1,0,1,3]: {x_relu.grad.tolist()}")
# Binary: 0 for x<0, 1 for x>0 — no saturation above 0
# ── Dead ReLU detection ───────────────────────────────────
class DeepReLU(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, x):
        self.h1 = F.relu(self.fc1(x))   # keep hidden activations for inspection
        return self.fc2(self.h1)

model = DeepReLU()
data = torch.randn(200, 10)
_ = model(data)
dead = (model.h1 == 0).all(dim=0).sum().item()
print(f"\nDead neurons (always 0 across 200 samples): {dead}/64")
# ── Transformer FFN block with GELU ─────────────────────
class TransformerFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))   # GELU standard in Transformers

ffn = TransformerFFN()
x = torch.randn(8, 64, 512) # (batch, seq_len, d_model)
out = ffn(x)
print(f"\nTransformer FFN: {x.shape} → {out.shape}")

Sigmoid's gradient peaks at 0.25 (at x = 0). In a 10-layer sigmoid network, the activation derivatives alone contribute a factor of at most 0.25^10 ≈ 10^{-6} to the chain-rule product that reaches the first layer. Early layers receive near-zero gradients and effectively stop learning. The loss may still decrease (the final layers learn) while the bulk of the network's capacity goes unused. The bug is invisible without gradient-norm logging.
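Gradient-norm logging of this kind takes only a few lines. The sketch below builds a hypothetical stack of 10 sigmoid layers (width, data, and loss are arbitrary choices for illustration), runs one backward pass, and prints the weight-gradient norm of each Linear layer from input to output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical 10-layer sigmoid MLP, sized only for illustration.
width = 32
layers = []
for i in range(10):
    layers += [nn.Linear(10 if i == 0 else width, width), nn.Sigmoid()]
layers.append(nn.Linear(width, 1))   # output layer, no activation
net = nn.Sequential(*layers)

x = torch.randn(64, 10)
y = torch.randn(64, 1)
loss = nn.functional.mse_loss(net(x), y)
loss.backward()

# Gradient norm of every Linear layer's weight, input side first.
grad_norms = [
    m.weight.grad.norm().item() for m in net if isinstance(m, nn.Linear)
]
for i, g in enumerate(grad_norms, start=1):
    print(f"linear {i:2d}: grad norm = {g:.3e}")
```

With default initialisation, the norms should shrink sharply toward the input side. In a real training loop the same list can be logged every few hundred steps.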
Dead neurons contribute nothing, waste capacity, and accumulate silently. A model with 40% dead neurons in layer 1 trains as if its first layer had 60% of its width. Symptoms: loss plateauing early, activation statistics going to zero, gradient norms collapsing for early layers — all subtle and easy to miss without explicit logging.
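One way to make this logging explicit is a forward hook on every ReLU. The sketch below is illustrative only: the network is hypothetical, and five first-layer neurons are deliberately killed with a very negative bias so that there is something to detect.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

net = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

# Simulate dead units: a very negative bias keeps z < 0 for every input.
with torch.no_grad():
    net[0].bias[:5] = -100.0

dead_fraction = {}

def make_hook(name):
    def hook(module, inputs, output):
        # A unit counts as dead on this batch if it outputs 0 for every sample.
        dead_fraction[name] = (output == 0).all(dim=0).float().mean().item()
    return hook

handles = [
    m.register_forward_hook(make_hook(name))
    for name, m in net.named_modules()
    if isinstance(m, nn.ReLU)
]

net(torch.randn(256, 10))
for h in handles:
    h.remove()   # detach the hooks once the statistics are collected

for name, frac in dead_fraction.items():
    print(f"ReLU at index {name}: {frac:.1%} of units dead on this batch")
```

A unit that is zero on one batch is not necessarily dead on the whole training set, so in practice the statistic should be accumulated over many batches.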
GELU is smooth and differentiable everywhere. ReLU is exactly zero for all x < 0 (the dead zone) and has a kink at x = 0. GELU computes x·Φ(x), where Φ is the standard normal CDF; this gives a soft gate that attenuates negative inputs by a smooth, input-dependent factor rather than zeroing them hard. Practical benefits: (1) Dead neurons are far less of a problem: GELU still passes a small gradient for most negative inputs, so a unit rarely gets locked at zero. (2) Smooth gradients tend to make optimisation better behaved, including with Adam. (3) Empirically strong in Transformers: BERT, GPT-2, GPT-3, and ViT all default to GELU. ReLU remains a common choice for CNNs, where its sparsity is a useful inductive bias and its lower computation cost matters at scale.
A ReLU neuron is dead when its pre-activation z < 0 for every input in the training set. The ReLU gradient is exactly 0 for z < 0, so weights connected to this neuron receive zero gradient and never update. The neuron is stuck permanently. Causes: (1) A large initial learning rate pushing weights into a large negative region in the first few steps; (2) Very negative initial biases. Fix options without architecture change: (a) Switch to Leaky ReLU, f(x) = max(αx, x) with α = 0.01, which gives a small gradient even for x < 0 and so prevents permanent death; (b) Use He initialisation (kaiming_normal_ with nonlinearity='relu'), which scales the initial weights so that activation variance stays roughly constant across layers with mode='fan_in' (the default), or gradient variance stays roughly constant with mode='fan_out'; (c) Reduce the initial learning rate or add warmup. Architectural fix: skip connections (ResNet style) provide a gradient path around dead ReLU blocks.
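The zero-gradient mechanism and the Leaky ReLU fix can be shown side by side. In this sketch the layer size and the bias of -100 are arbitrary choices that force every unit into the z < 0 region; weight_grad_norm is a hypothetical helper, not a PyTorch function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def weight_grad_norm(activation):
    """Weight-gradient norm of a layer whose units are all forced to z < 0."""
    fc = nn.Linear(10, 8)
    nn.init.kaiming_normal_(fc.weight, nonlinearity="relu")  # He init, default mode="fan_in"
    with torch.no_grad():
        fc.bias.fill_(-100.0)   # simulate dead units: z < 0 for every input
    out = activation(fc(torch.randn(32, 10)))
    out.sum().backward()
    return fc.weight.grad.norm().item()

relu_grad = weight_grad_norm(F.relu)
leaky_grad = weight_grad_norm(lambda z: F.leaky_relu(z, negative_slope=0.01))

print(f"ReLU weight-grad norm:       {relu_grad}")
print(f"Leaky ReLU weight-grad norm: {leaky_grad}")
```

With ReLU the weights receive no gradient at all, so the units can never recover. With Leaky ReLU the gradient is small but non-zero, which lets the optimiser pull the units back into the active region.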
An MLP without nonlinear activation is just a matrix multiply — no matter how many layers you stack, it collapses to a single linear transformation. The activation function is what gives deep networks their power.