Neural networks: the basic building block
In this series (25 parts)
- Neural networks: the basic building block
- Forward pass and backpropagation
- Training neural networks: a practical guide
- Convolutional neural networks
- Recurrent neural networks and LSTMs
- Attention mechanism and transformers
- Word embeddings: from one-hot to dense representations
- Transfer learning and fine-tuning
- Optimization techniques for deep networks
- Regularization for deep networks
- Encoder-decoder architectures
- Generative models: an overview
- Restricted Boltzmann Machines
- Deep Belief Networks
- Variational Autoencoders
- Generative Adversarial Networks: training and theory
- DCGAN, conditional GANs, and GAN variants
- Representation learning and self-supervised learning
- Domain adaptation and fine-tuning strategies
- Distributed representations and latent spaces
- AutoML and hyperparameter optimization
- Neural architecture search
- Network compression and efficient inference
- Graph neural networks
- Practical deep learning: debugging and tuning
Prerequisites
Before diving in, make sure you are comfortable with:
- Logistic regression: a single neuron is essentially logistic regression with a nonlinear twist
- Matrix multiplication: the core operation in every neural network layer
What is a neuron?
A neural network starts with one simple idea: take some inputs, multiply each by a weight, add a bias, and pass the result through a function. That is a neuron.
A single neuron computes two things. First, the weighted sum:

$$z = \mathbf{w} \cdot \mathbf{x} + b$$

Then, the activation:

$$a = \sigma(z)$$

Here $\mathbf{x}$ is the input vector, $\mathbf{w}$ is the weight vector, $b$ is the bias, and $\sigma$ is the activation function.
The dot product $\mathbf{w} \cdot \mathbf{x}$ measures how much the input aligns with what the neuron is looking for. The bias shifts the decision boundary. The activation function introduces nonlinearity.
graph LR
x1((x₁)) -->|w₁| S["∑ + b"]
x2((x₂)) -->|w₂| S
x3((x₃)) -->|w₃| S
S -->|z| A["σ(z)"]
A -->|a| O((output))
If you have seen logistic regression, you already know what a neuron does. Logistic regression computes $\hat{y} = \sigma(\mathbf{w} \cdot \mathbf{x} + b)$, where $\sigma$ is the sigmoid function. A neuron is the same computation, except you can swap sigmoid for other activation functions.
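The parallel is easy to see in code. A minimal sketch with NumPy (the `neuron` helper and its input values are mine, for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b, activation=sigmoid):
    """One neuron: weighted sum plus bias, then an activation function."""
    z = np.dot(w, x) + b
    return activation(z)

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.1

# With the default sigmoid, this is exactly logistic regression:
# z = 0.5 - 0.5 + 0.1 = 0.1, so p = sigmoid(0.1) ≈ 0.525.
p = neuron(x, w, b)
# Swap the activation and the same machinery becomes a ReLU unit.
a = neuron(x, w, b, activation=relu)
```

The only thing that changes between "logistic regression" and "a neuron" is which function you pass in.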
Why activation functions matter
Here is the key insight: without activation functions, a deep network is just a single linear transformation. It does not matter how many layers you stack.
Suppose you have two linear layers:

$$\mathbf{h} = W_1 \mathbf{x}, \qquad \mathbf{y} = W_2 \mathbf{h}$$

Substitute the first equation into the second:

$$\mathbf{y} = W_2 (W_1 \mathbf{x}) = (W_2 W_1)\mathbf{x}$$

This is just $\mathbf{y} = W\mathbf{x}$ where $W = W_2 W_1$. No matter how many linear layers you stack, the result collapses to a single linear layer. We prove this concretely with numbers in Example 3 below.
Activation functions break this linearity. They let the network learn curved decision boundaries, complex patterns, and subtle relationships in data. Every hidden layer needs a nonlinear activation, or you are wasting depth.
Common activation functions
Activation functions compared: ReLU, sigmoid, and tanh over the range x = -5 to 5.
Sigmoid
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Squashes any real number into the range $(0, 1)$. Historically popular because the output resembles a probability. The problem: for large $|x|$, the gradient approaches zero, which makes learning very slow in deep networks. This phenomenon is called the vanishing gradient problem.
Tanh
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

Output range is $(-1, 1)$. Zero-centered outputs help gradient descent converge faster than sigmoid. Still suffers from vanishing gradients at the extremes, though.
ReLU (Rectified Linear Unit)
$$\text{ReLU}(x) = \max(0, x)$$

Dead simple. If $x$ is positive, pass it through. If negative, output zero. ReLU solved a huge practical problem: gradients do not vanish for positive inputs because the gradient is exactly 1. Training became much faster. The downside: neurons can “die” if they consistently receive negative inputs, because the gradient is exactly 0 there.
Leaky ReLU
$$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases}$$

where $\alpha$ is a small constant like 0.01. This fixes the dying ReLU problem by allowing a small gradient $\alpha$ when $x < 0$.
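All four activations fit in a line or two of NumPy. A sketch for side-by-side comparison (function names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes into (0, 1)

def relu(x):
    return np.maximum(0.0, x)             # zero for negatives, identity for positives

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small slope alpha below zero

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))     # ≈ [0.0067, 0.5, 0.9933] -- nearly saturated at the extremes
print(np.tanh(x))     # ≈ [-0.9999, 0.0, 0.9999] -- also saturates, but zero-centered
print(relu(x))        # [0., 0., 5.]
print(leaky_relu(x))  # [-0.05, 0., 5.]
```

Note how sigmoid and tanh are already nearly flat at $\pm 5$; that flatness is exactly the vanishing-gradient region, while ReLU stays linear for all positive inputs.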
Activation function comparison
| Function | Formula | Output range | Gradient at saturation | Typical use | Known problem |
|---|---|---|---|---|---|
| Sigmoid | $\frac{1}{1+e^{-x}}$ | $(0, 1)$ | $\to 0$ | Output layer (binary classification) | Vanishing gradient |
| Tanh | $\tanh(x)$ | $(-1, 1)$ | $\to 0$ | Hidden layers (older architectures) | Vanishing gradient |
| ReLU | $\max(0, x)$ | $[0, \infty)$ | $0$ for $x < 0$, $1$ for $x > 0$ | Hidden layers (default choice) | Dying neurons |
| Leaky ReLU | $\max(\alpha x, x)$ | $(-\infty, \infty)$ | $\alpha$ for $x < 0$ | Hidden layers | Adds hyperparameter $\alpha$ |
For most problems, start with ReLU in hidden layers. Use sigmoid or softmax in the output layer for classification. That is a solid default.
How layers compose
A single neuron can only learn a linear boundary (with a nonlinear squash). To learn complex patterns, we stack neurons into layers, and layers into networks.
A feedforward neural network has three types of layers:
- Input layer: holds your raw features. No computation happens here.
- Hidden layers: each neuron takes all outputs from the previous layer, computes a weighted sum plus bias, and applies an activation function.
- Output layer: produces the final prediction. For regression, typically no activation or a linear one. For binary classification, sigmoid. For multi-class, softmax.
graph LR
subgraph Input
i1((x₁))
i2((x₂))
end
subgraph Hidden["Hidden layer (3 units)"]
h1((h₁))
h2((h₂))
h3((h₃))
end
subgraph Output
o1((ŷ))
end
i1 --> h1
i1 --> h2
i1 --> h3
i2 --> h1
i2 --> h2
i2 --> h3
h1 --> o1
h2 --> o1
h3 --> o1
In matrix notation, a hidden layer computes:

$$\mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})$$

where $W$ is the weight matrix, $\mathbf{b}$ is the bias vector, and $\sigma$ is applied element-wise. Each row of $W$ contains the weights for one neuron.
The full network is function composition:

$$\hat{y} = f_L(f_{L-1}(\cdots f_1(\mathbf{x})))$$

Each $f_\ell$ represents one layer’s operation: linear transform followed by an activation.
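In code, that composition is just a loop over layers. A minimal sketch with NumPy (the shapes, seed, and the `forward` helper are my own choices):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """Each layer is a (W, b, activation) triple; one step computes act(W @ a + b)."""
    a = x
    for W, b, act in layers:
        a = act(W @ a + b)
    return a

rng = np.random.default_rng(seed=0)
layers = [
    (rng.standard_normal((3, 2)), np.zeros(3), relu),         # hidden layer: 2 -> 3, ReLU
    (rng.standard_normal((1, 3)), np.zeros(1), lambda z: z),  # output layer: 3 -> 1, linear
]
y = forward(np.array([1.0, -0.5]), layers)  # final prediction, shape (1,)
```

Stacking more layers is just appending more `(W, b, activation)` triples to the list; the loop is the composition $f_L \circ \cdots \circ f_1$.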
Universal approximation theorem
Here is a remarkable result: a feedforward network with a single hidden layer containing enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy.
What this means in plain terms: neural networks are theoretically powerful enough to learn any reasonable input-output mapping. Given enough hidden neurons, the network can get as close as you want to the true function.
What it does not mean:
- It does not tell you how many neurons you need. The number could be astronomically large.
- It does not guarantee that gradient descent will find the right weights.
- It says nothing about generalization to unseen data.
In practice, we use deeper networks (more layers) rather than extremely wide ones (many neurons in a single layer). Depth lets networks build hierarchical representations. Early layers learn simple patterns. Later layers combine them into complex features.
Example 1: Single neuron forward pass
Given: $\mathbf{x} = [1.0, 2.0, 3.0]$, $\mathbf{w} = [0.5, -0.6, 0.1]$, $b = -0.2$. Compute the output using ReLU.

Step 1: Weighted sum plus bias.

$$z = (0.5)(1.0) + (-0.6)(2.0) + (0.1)(3.0) - 0.2 = 0.5 - 1.2 + 0.3 - 0.2 = -0.6$$

Step 2: Apply ReLU.

$$a = \max(0, -0.6) = 0$$

The neuron outputs 0. Because $z$ is negative, ReLU kills the signal completely. This neuron, with these particular weights, does not “fire” for this input.
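The same single-neuron computation checks out in NumPy, with illustrative values chosen so the pre-activation comes out negative:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])    # illustrative input
w = np.array([0.5, -0.6, 0.1])   # illustrative weights
b = -0.2                         # illustrative bias

z = np.dot(w, x) + b             # 0.5 - 1.2 + 0.3 - 0.2, so z ≈ -0.6
a = np.maximum(0.0, z)           # ReLU zeroes the negative pre-activation: a = 0.0
```

Flip the sign of a single weight and the neuron would fire; a "dead" output is a property of the weights and the input together, not of the neuron alone.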
Example 2: Forward pass through a 2-layer network
Network: 2 inputs, 3 hidden units with ReLU, 1 output with linear activation.
Given:

Input: $\mathbf{x} = \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix}$

Hidden layer weights and biases:

$$W_1 = \begin{bmatrix} 0.2 & 0.4 \\ 0.6 & -0.1 \\ 0.1 & 0.3 \end{bmatrix}, \qquad \mathbf{b}_1 = \begin{bmatrix} 0.1 \\ 0.0 \\ 0.2 \end{bmatrix}$$

Output layer weights and bias:

$$W_2 = \begin{bmatrix} 0.3 & 0.2 & -0.4 \end{bmatrix}, \qquad b_2 = 0.2$$

Step 1: Compute hidden layer pre-activations.

$$\mathbf{z}_1 = W_1 \mathbf{x} + \mathbf{b}_1 = \begin{bmatrix} 0.2 + 0.2 + 0.1 \\ 0.6 - 0.05 + 0.0 \\ 0.1 + 0.15 + 0.2 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.55 \\ 0.45 \end{bmatrix}$$

Step 2: Apply ReLU.

$$\mathbf{h} = \text{ReLU}(\mathbf{z}_1) = \begin{bmatrix} 0.5 \\ 0.55 \\ 0.45 \end{bmatrix}$$

All values are positive, so all three neurons are active.

Step 3: Compute output.

$$\hat{y} = W_2 \mathbf{h} + b_2 = (0.3)(0.5) + (0.2)(0.55) + (-0.4)(0.45) + 0.2 = 0.28$$

The network outputs 0.28. Every hidden neuron contributed to the final answer, weighted by the output layer.
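The whole two-layer forward pass fits in a few lines of NumPy (the weights and input here are illustrative, chosen so every hidden unit stays active):

```python
import numpy as np

x  = np.array([1.0, 0.5])                              # input, 2 features
W1 = np.array([[0.2, 0.4], [0.6, -0.1], [0.1, 0.3]])   # hidden weights, 3 x 2
b1 = np.array([0.1, 0.0, 0.2])                         # hidden biases
W2 = np.array([[0.3, 0.2, -0.4]])                      # output weights, 1 x 3
b2 = np.array([0.2])                                   # output bias

z1 = W1 @ x + b1           # hidden pre-activations: [0.5, 0.55, 0.45]
h  = np.maximum(0.0, z1)   # ReLU: all positive, so unchanged
y  = W2 @ h + b2           # linear output: ≈ 0.28
```

Note that the per-neuron arithmetic from Example 1 has disappeared into two matrix products; that is the whole point of the matrix notation.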
Example 3: Why stacking linear layers collapses
Let us prove that stacking linear layers without activations gives you nothing extra. Take two layers with weight matrices $W_1$ and $W_2$ and no biases for clarity.

Given:

$$W_1 = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}, \qquad W_2 = \begin{bmatrix} 2 & 0 \\ 1 & 1 \end{bmatrix}$$

Input: $\mathbf{x} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$

Two separate layers:

$$W_1 \mathbf{x} = \begin{bmatrix} 3 \\ 1 \end{bmatrix}, \qquad W_2 (W_1 \mathbf{x}) = \begin{bmatrix} 6 \\ 4 \end{bmatrix}$$

Single combined layer:

$$W = W_2 W_1 = \begin{bmatrix} 2 & 4 \\ 1 & 3 \end{bmatrix}, \qquad W \mathbf{x} = \begin{bmatrix} 6 \\ 4 \end{bmatrix}$$

Same result. Two linear layers without activation are mathematically identical to one linear layer with $W = W_2 W_1$. Add 100 layers, still the same story. This is exactly why activation functions are non-negotiable.
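The collapse is a one-line check in NumPy, and it holds for any weight matrices, not just these illustrative ones:

```python
import numpy as np

W1 = np.array([[1.0, 2.0], [0.0, 1.0]])  # first linear layer
W2 = np.array([[2.0, 0.0], [1.0, 1.0]])  # second linear layer
x  = np.array([1.0, 1.0])                # input

two_layers = W2 @ (W1 @ x)   # apply layer 1, then layer 2
one_layer  = (W2 @ W1) @ x   # pre-multiply the weights into one matrix W
print(two_layers, one_layer) # [6. 4.] [6. 4.]
```

Insert a ReLU between the two products and the identity breaks, because `np.maximum` does not commute with matrix multiplication; that single nonlinearity is what makes the second layer worth having.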
What comes next
You now know what a neuron computes and how layers stack to form networks. But how does a network actually learn? How do we adjust the weights so the output matches what we want?
That is backpropagation: the algorithm that computes gradients through the network so gradient descent can update every weight. It is the engine behind all of deep learning.