
Neural networks: the basic building block

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

Before diving in, make sure you are comfortable with:


What is a neuron?

A neural network starts with one simple idea: take some inputs, multiply each by a weight, add a bias, and pass the result through a function. That is a neuron.

A single neuron computes two things. First, the weighted sum:

z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w}^\top \mathbf{x} + b

Then, the activation:

a = \sigma(z)

Here \mathbf{x} is the input vector, \mathbf{w} is the weight vector, b is the bias, and \sigma is the activation function.

The dot product \mathbf{w}^\top \mathbf{x} measures how much the input aligns with what the neuron is looking for. The bias shifts the decision boundary. The activation function introduces nonlinearity.

graph LR
  x1((x₁)) -->|w₁| S["∑ + b"]
  x2((x₂)) -->|w₂| S
  x3((x₃)) -->|w₃| S
  S -->|z| A["σ(z)"]
  A -->|a| O((output))

If you have seen logistic regression, you already know what a neuron does. Logistic regression computes \sigma(\mathbf{w}^\top \mathbf{x} + b) where \sigma is the sigmoid function. A neuron is the same computation, except you can swap sigmoid for other activation functions.
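
To make this concrete, here is a minimal sketch of a single neuron in NumPy. NumPy and the specific numbers are illustrative assumptions, not something the article prescribes:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    # weighted sum plus bias: z = w . x + b
    z = np.dot(w, x) + b
    # nonlinearity: a = sigma(z)
    return activation(z)

x = np.array([2.0, 3.0, -1.0])   # example input
w = np.array([0.5, -0.2, 0.8])   # example weights
b = 0.1
print(neuron(x, w, b))           # sigmoid(-0.3), roughly 0.43

Swap sigmoid for any other activation function and the rest of the computation stays the same.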


Why activation functions matter

Here is the key insight: without activation functions, a deep network is just a single linear transformation. It does not matter how many layers you stack.

Suppose you have two linear layers:

\mathbf{h} = W_1 \mathbf{x} + \mathbf{b}_1

\mathbf{y} = W_2 \mathbf{h} + \mathbf{b}_2

Substitute the first equation into the second:

\mathbf{y} = W_2 (W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (W_2 W_1) \mathbf{x} + (W_2 \mathbf{b}_1 + \mathbf{b}_2)

This is just \mathbf{y} = W' \mathbf{x} + \mathbf{b}' where W' = W_2 W_1. No matter how many linear layers you stack, the result collapses to a single linear layer. We prove this concretely with numbers in Example 3 below.

Activation functions break this linearity. They let the network learn curved decision boundaries, complex patterns, and subtle relationships in data. Every hidden layer needs a nonlinear activation, or you are wasting depth.


Common activation functions

Figure: activation functions compared (ReLU, sigmoid, and tanh over the range x = -5 to 5).

Sigmoid

\sigma(z) = \frac{1}{1 + e^{-z}}

Squashes any real number into the range (0, 1). Historically popular because it resembles a probability. The problem: for large |z|, the gradient approaches zero. This makes learning very slow in deep networks. The phenomenon is called the vanishing gradient problem.

Tanh

\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}

Output range is (-1, 1). Zero-centered outputs help gradient descent converge faster than sigmoid. Still suffers from vanishing gradients at the extremes, though.

ReLU (Rectified Linear Unit)

\text{ReLU}(z) = \max(0, z)

Dead simple. If z is positive, pass it through. If negative, output zero. ReLU solved a huge practical problem: gradients do not vanish for positive inputs because the gradient there is exactly 1. Training became much faster. The downside: neurons can “die” if they consistently receive negative inputs, because the gradient is exactly 0 there.

Leaky ReLU

\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}

where \alpha is a small constant like 0.01. This fixes the dying ReLU problem by allowing a small gradient when z < 0.

Activation function comparison

| Function | Formula | Output range | Gradient at saturation | Typical use | Known problem |
| --- | --- | --- | --- | --- | --- |
| Sigmoid | \frac{1}{1+e^{-z}} | (0, 1) | \approx 0 | Output layer (binary classification) | Vanishing gradient |
| Tanh | \frac{e^z - e^{-z}}{e^z + e^{-z}} | (-1, 1) | \approx 0 | Hidden layers (older architectures) | Vanishing gradient |
| ReLU | \max(0, z) | [0, \infty) | 0 for z < 0 | Hidden layers (default choice) | Dying neurons |
| Leaky ReLU | \max(\alpha z, z) | (-\infty, \infty) | \alpha for z < 0 | Hidden layers | Adds hyperparameter \alpha |

For most problems, start with ReLU in hidden layers. Use sigmoid or softmax in the output layer for classification. That is a solid default.
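
If you want to experiment with the values in the table, all four activations are one-liners. A rough NumPy sketch (NumPy and the fixed \alpha = 0.01 are assumptions for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))     # squashed into (0, 1)
print(tanh(z))        # squashed into (-1, 1)
print(relu(z))        # negatives clipped to 0
print(leaky_relu(z))  # negatives scaled by alpha = 0.01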


How layers compose

A single neuron can only learn a linear decision boundary (the activation squashes the output but does not bend the boundary). To learn complex patterns, we stack neurons into layers, and layers into networks.

A feedforward neural network has three types of layers:

  1. Input layer: holds your raw features. No computation happens here.
  2. Hidden layers: each neuron takes all outputs from the previous layer, computes a weighted sum plus bias, and applies an activation function.
  3. Output layer: produces the final prediction. For regression, typically no activation or a linear one. For binary classification, sigmoid. For multi-class, softmax.

graph LR
  subgraph Input
      i1((x₁))
      i2((x₂))
  end
  subgraph Hidden["Hidden layer (3 units)"]
      h1((h₁))
      h2((h₂))
      h3((h₃))
  end
  subgraph Output
      o1((ŷ))
  end
  i1 --> h1
  i1 --> h2
  i1 --> h3
  i2 --> h1
  i2 --> h2
  i2 --> h3
  h1 --> o1
  h2 --> o1
  h3 --> o1

In matrix notation, a hidden layer computes:

\mathbf{h} = \sigma(W \mathbf{x} + \mathbf{b})

where W is the weight matrix, \mathbf{b} is the bias vector, and \sigma is applied element-wise. Each row of W contains the weights for one neuron.

The full network is function composition:

\hat{y} = f_L(f_{L-1}(\cdots f_2(f_1(\mathbf{x}))))

Each f_l represents one layer’s operation: a linear transform followed by an activation.
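
As a sketch of the matrix form, here is one way to write a layer and compose a small 2-3-1 network like the one in the diagram above. NumPy, the random weights, and the shapes are illustrative assumptions:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def layer(x, W, b, activation=relu):
    # one layer: linear transform, then element-wise activation
    return activation(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=2)                           # 2 input features

W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)    # hidden layer, 3 units
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)    # output layer, 1 unit

h = layer(x, W1, b1)          # hidden activations
y_hat = W2 @ h + b2           # linear output, no activation
print(y_hat)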


Universal approximation theorem

Here is a remarkable result: a feedforward network with a single hidden layer containing enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy.

What this means in plain terms: neural networks are theoretically powerful enough to learn any reasonable input-output mapping. Given enough hidden neurons, the network can get as close as you want to the true function.

What it does not mean:

  • It does not tell you how many neurons you need. The number could be astronomically large.
  • It does not guarantee that gradient descent will find the right weights.
  • It says nothing about generalization to unseen data.

In practice, we use deeper networks (more layers) rather than extremely wide ones (many neurons in a single layer). Depth lets networks build hierarchical representations. Early layers learn simple patterns. Later layers combine them into complex features.


Example 1: Single neuron forward pass

Given: \mathbf{x} = [2, 3, -1], \mathbf{w} = [0.5, -0.2, 0.8], b = 0.1. Compute the output using ReLU.

Step 1: Weighted sum plus bias.

z = w_1 x_1 + w_2 x_2 + w_3 x_3 + b

z = (0.5)(2) + (-0.2)(3) + (0.8)(-1) + 0.1

z = 1.0 - 0.6 - 0.8 + 0.1 = -0.3

Step 2: Apply ReLU.

a = \text{ReLU}(-0.3) = \max(0, -0.3) = 0

The neuron outputs 0. Because z is negative, ReLU kills the signal completely. This neuron, with these particular weights, does not “fire” for this input.
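
You can check the arithmetic in a couple of lines (NumPy is an assumption here; any array library would do):

import numpy as np

x = np.array([2.0, 3.0, -1.0])
w = np.array([0.5, -0.2, 0.8])
b = 0.1

z = np.dot(w, x) + b      # -0.3 (up to floating-point rounding)
a = np.maximum(0.0, z)    # ReLU: 0.0
print(z, a)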


Example 2: Forward pass through a 2-layer network

Network: 2 inputs, 3 hidden units with ReLU, 1 output with linear activation.

Given:

Input: \mathbf{x} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}

Hidden layer weights and biases:

W^{(1)} = \begin{bmatrix} 0.2 & -0.1 \\ 0.4 & 0.3 \\ -0.5 & 0.6 \end{bmatrix}, \quad \mathbf{b}^{(1)} = \begin{bmatrix} 0.1 \\ -0.2 \\ 0.0 \end{bmatrix}

Output layer weights and bias:

\mathbf{w}^{(2)} = \begin{bmatrix} 0.7 \\ -0.3 \\ 0.5 \end{bmatrix}, \quad b^{(2)} = 0.1

Step 1: Compute hidden layer pre-activations.

\mathbf{z}^{(1)} = W^{(1)} \mathbf{x} + \mathbf{b}^{(1)}

z_1^{(1)} = (0.2)(1) + (-0.1)(2) + 0.1 = 0.2 - 0.2 + 0.1 = 0.1

z_2^{(1)} = (0.4)(1) + (0.3)(2) + (-0.2) = 0.4 + 0.6 - 0.2 = 0.8

z_3^{(1)} = (-0.5)(1) + (0.6)(2) + 0.0 = -0.5 + 1.2 = 0.7

Step 2: Apply ReLU.

\mathbf{h} = \text{ReLU}(\mathbf{z}^{(1)}) = \begin{bmatrix} \max(0, 0.1) \\ \max(0, 0.8) \\ \max(0, 0.7) \end{bmatrix} = \begin{bmatrix} 0.1 \\ 0.8 \\ 0.7 \end{bmatrix}

All values are positive, so all three neurons are active.

Step 3: Compute output.

\hat{y} = (\mathbf{w}^{(2)})^\top \mathbf{h} + b^{(2)}

\hat{y} = (0.7)(0.1) + (-0.3)(0.8) + (0.5)(0.7) + 0.1

\hat{y} = 0.07 - 0.24 + 0.35 + 0.1 = 0.28

The network outputs 0.28. Every hidden neuron contributed to the final answer, weighted by the output layer.
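
The same forward pass in NumPy, as a check on the arithmetic (NumPy is assumed for illustration; the weights are the ones given above):

import numpy as np

x = np.array([1.0, 2.0])
W1 = np.array([[ 0.2, -0.1],
               [ 0.4,  0.3],
               [-0.5,  0.6]])
b1 = np.array([0.1, -0.2, 0.0])
w2 = np.array([0.7, -0.3, 0.5])
b2 = 0.1

z1 = W1 @ x + b1             # [0.1, 0.8, 0.7]
h = np.maximum(0.0, z1)      # ReLU leaves all three values unchanged
y_hat = w2 @ h + b2          # 0.28
print(z1, h, y_hat)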


Example 3: Why stacking linear layers collapses

Let us prove that stacking linear layers without activations gives you nothing extra. Take two layers with 2 \times 2 weight matrices and no biases for clarity.

Given:

W_1 = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad W_2 = \begin{bmatrix} 0.5 & -1 \\ 1.5 & 0.5 \end{bmatrix}

Input: \mathbf{x} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}

Two separate layers:

\mathbf{h} = W_1 \mathbf{x} = \begin{bmatrix} 1 + 2 \\ 3 + 4 \end{bmatrix} = \begin{bmatrix} 3 \\ 7 \end{bmatrix}

\mathbf{y} = W_2 \mathbf{h} = \begin{bmatrix} (0.5)(3) + (-1)(7) \\ (1.5)(3) + (0.5)(7) \end{bmatrix} = \begin{bmatrix} 1.5 - 7 \\ 4.5 + 3.5 \end{bmatrix} = \begin{bmatrix} -5.5 \\ 8.0 \end{bmatrix}

Single combined layer:

W' = W_2 W_1 = \begin{bmatrix} (0.5)(1)+(-1)(3) & (0.5)(2)+(-1)(4) \\ (1.5)(1)+(0.5)(3) & (1.5)(2)+(0.5)(4) \end{bmatrix} = \begin{bmatrix} -2.5 & -3.0 \\ 3.0 & 5.0 \end{bmatrix}

\mathbf{y} = W' \mathbf{x} = \begin{bmatrix} -2.5 - 3.0 \\ 3.0 + 5.0 \end{bmatrix} = \begin{bmatrix} -5.5 \\ 8.0 \end{bmatrix}

Same result. Two linear layers without activation are mathematically identical to one linear layer with W' = W_2 W_1. Add 100 layers, still the same story. This is exactly why activation functions are non-negotiable.
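
A quick NumPy check confirms the two versions agree (NumPy is assumed; the matrices are the ones above):

import numpy as np

W1 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
W2 = np.array([[0.5, -1.0],
               [1.5,  0.5]])
x = np.array([1.0, 1.0])

two_layers = W2 @ (W1 @ x)       # [-5.5, 8.0]
collapsed = (W2 @ W1) @ x        # same result from the merged matrix
print(two_layers, collapsed)
print(np.allclose(two_layers, collapsed))   # True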


What comes next

You now know what a neuron computes and how layers stack to form networks. But how does a network actually learn? How do we adjust the weights so the output matches what we want?

That is backpropagation: the algorithm that computes gradients through the network so gradient descent can update every weight. It is the engine behind all of deep learning.
