
Neural networks: the basic building block

In this series (25 parts)
  1. Neural networks: the basic building block
  2. Forward pass and backpropagation
  3. Training neural networks: a practical guide
  4. Convolutional neural networks
  5. Recurrent neural networks and LSTMs
  6. Attention mechanism and transformers
  7. Word embeddings: from one-hot to dense representations
  8. Transfer learning and fine-tuning
  9. Optimization techniques for deep networks
  10. Regularization for deep networks
  11. Encoder-decoder architectures
  12. Generative models: an overview
  13. Restricted Boltzmann Machines
  14. Deep Belief Networks
  15. Variational Autoencoders
  16. Generative Adversarial Networks: training and theory
  17. DCGAN, conditional GANs, and GAN variants
  18. Representation learning and self-supervised learning
  19. Domain adaptation and fine-tuning strategies
  20. Distributed representations and latent spaces
  21. AutoML and hyperparameter optimization
  22. Neural architecture search
  23. Network compression and efficient inference
  24. Graph neural networks
  25. Practical deep learning: debugging and tuning

Prerequisites

Before diving in, make sure you are comfortable with:


What is a neuron?

A neural network starts with one simple idea: take some inputs, multiply each by a weight, add a bias, and pass the result through a function. That is a neuron.

A single neuron computes two things. First, the weighted sum:

z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w}^\top \mathbf{x} + b

Then, the activation:

a = \sigma(z)

Here \mathbf{x} is the input vector, \mathbf{w} is the weight vector, b is the bias, and \sigma is the activation function.

The dot product \mathbf{w}^\top \mathbf{x} measures how much the input aligns with what the neuron is looking for. The bias shifts the decision boundary. The activation function introduces nonlinearity.

graph LR
  x1((x₁)) -->|w₁| S["∑ + b"]
  x2((x₂)) -->|w₂| S
  x3((x₃)) -->|w₃| S
  S -->|z| A["σ(z)"]
  A -->|a| O((output))

If you have seen logistic regression, you already know what a neuron does. Logistic regression computes \sigma(\mathbf{w}^\top \mathbf{x} + b) where \sigma is the sigmoid function. A neuron is the same computation, except you can swap sigmoid for other activation functions.
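
To make this concrete, here is a minimal sketch of a single neuron in NumPy. NumPy and the specific numbers are illustrative assumptions, not something the article prescribes:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    # weighted sum plus bias: z = w . x + b
    z = np.dot(w, x) + b
    # nonlinearity: a = sigma(z)
    return activation(z)

x = np.array([2.0, 3.0, -1.0])   # example input
w = np.array([0.5, -0.2, 0.8])   # example weights
b = 0.1
print(neuron(x, w, b))           # sigmoid(-0.3), roughly 0.43

Swap sigmoid for any other activation function and the rest of the computation stays the same.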


Why activation functions matter

Here is the key insight: without activation functions, a deep network is just a single linear transformation. It does not matter how many layers you stack.

Suppose you have two linear layers:

\mathbf{h} = W_1 \mathbf{x} + \mathbf{b}_1

\mathbf{y} = W_2 \mathbf{h} + \mathbf{b}_2

Substitute the first equation into the second:

\mathbf{y} = W_2 (W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (W_2 W_1) \mathbf{x} + (W_2 \mathbf{b}_1 + \mathbf{b}_2)

This is just \mathbf{y} = W' \mathbf{x} + \mathbf{b}' where W' = W_2 W_1. No matter how many linear layers you stack, the result collapses to a single linear layer. We prove this concretely with numbers in Example 3 below.

Activation functions break this linearity. They let the network learn curved decision boundaries, complex patterns, and subtle relationships in data. Every hidden layer needs a nonlinear activation, or you are wasting depth.


Common activation functions

Figure: activation functions compared (ReLU, sigmoid, and tanh over the range x = -5 to 5).

Sigmoid

\sigma(z) = \frac{1}{1 + e^{-z}}

Squashes any real number into the range (0, 1). Historically popular because it resembles a probability. The problem: for large |z|, the gradient approaches zero. This makes learning very slow in deep networks. The phenomenon is called the vanishing gradient problem.

Tanh

\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}

Output range is (-1, 1). Zero-centered outputs help gradient descent converge faster than sigmoid. Still suffers from vanishing gradients at the extremes, though.

ReLU (Rectified Linear Unit)

\text{ReLU}(z) = \max(0, z)

Dead simple. If z is positive, pass it through. If negative, output zero. ReLU solved a huge practical problem: gradients do not vanish for positive inputs because the gradient there is exactly 1. Training became much faster. The downside: neurons can “die” if they consistently receive negative inputs, because the gradient is exactly 0 there.

Leaky ReLU

\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}

where \alpha is a small constant like 0.01. This fixes the dying ReLU problem by allowing a small gradient when z < 0.

Activation function comparison

| Function | Formula | Output range | Gradient at saturation | Typical use | Known problem |
| --- | --- | --- | --- | --- | --- |
| Sigmoid | \frac{1}{1+e^{-z}} | (0, 1) | \approx 0 | Output layer (binary classification) | Vanishing gradient |
| Tanh | \frac{e^z - e^{-z}}{e^z + e^{-z}} | (-1, 1) | \approx 0 | Hidden layers (older architectures) | Vanishing gradient |
| ReLU | \max(0, z) | [0, \infty) | 0 for z < 0 | Hidden layers (default choice) | Dying neurons |
| Leaky ReLU | \max(\alpha z, z) | (-\infty, \infty) | \alpha for z < 0 | Hidden layers | Adds hyperparameter \alpha |

For most problems, start with ReLU in hidden layers. Use sigmoid or softmax in the output layer for classification. That is a solid default.
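
If you want to experiment with the values in the table, all four activations are one-liners. A rough NumPy sketch (NumPy and the fixed \alpha = 0.01 are assumptions for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))     # squashed into (0, 1)
print(tanh(z))        # squashed into (-1, 1)
print(relu(z))        # negatives clipped to 0
print(leaky_relu(z))  # negatives scaled by alpha = 0.01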


How layers compose

A single neuron can only learn a linear decision boundary (the activation squashes the output but does not bend the boundary). To learn complex patterns, we stack neurons into layers, and layers into networks.

A feedforward neural network has three types of layers:

  1. Input layer: holds your raw features. No computation happens here.
  2. Hidden layers: each neuron takes all outputs from the previous layer, computes a weighted sum plus bias, and applies an activation function.
  3. Output layer: produces the final prediction. For regression, typically no activation or a linear one. For binary classification, sigmoid. For multi-class, softmax.

graph LR
  subgraph Input
      i1((x₁))
      i2((x₂))
  end
  subgraph Hidden["Hidden layer (3 units)"]
      h1((h₁))
      h2((h₂))
      h3((h₃))
  end
  subgraph Output
      o1((ŷ))
  end
  i1 --> h1
  i1 --> h2
  i1 --> h3
  i2 --> h1
  i2 --> h2
  i2 --> h3
  h1 --> o1
  h2 --> o1
  h3 --> o1

In matrix notation, a hidden layer computes:

\mathbf{h} = \sigma(W \mathbf{x} + \mathbf{b})

where W is the weight matrix, \mathbf{b} is the bias vector, and \sigma is applied element-wise. Each row of W contains the weights for one neuron.

The full network is function composition:

\hat{y} = f_L(f_{L-1}(\cdots f_2(f_1(\mathbf{x}))))

Each f_l represents one layer’s operation: a linear transform followed by an activation.
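
As a sketch of the matrix form, here is one way to write a layer and compose a small 2-3-1 network like the one in the diagram above. NumPy, the random weights, and the shapes are illustrative assumptions:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def layer(x, W, b, activation=relu):
    # one layer: linear transform, then element-wise activation
    return activation(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=2)                           # 2 input features

W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)    # hidden layer, 3 units
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)    # output layer, 1 unit

h = layer(x, W1, b1)          # hidden activations
y_hat = W2 @ h + b2           # linear output, no activation
print(y_hat)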


Universal approximation theorem

Here is a remarkable result: a feedforward network with a single hidden layer containing enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy.

What this means in plain terms: neural networks are theoretically powerful enough to learn any reasonable input-output mapping. Given enough hidden neurons, the network can get as close as you want to the true function.

What it does not mean:

  • It does not tell you how many neurons you need. The number could be astronomically large.
  • It does not guarantee that gradient descent will find the right weights.
  • It says nothing about generalization to unseen data.

In practice, we use deeper networks (more layers) rather than extremely wide ones (many neurons in a single layer). Depth lets networks build hierarchical representations. Early layers learn simple patterns. Later layers combine them into complex features.


Example 1: Single neuron forward pass

Given: \mathbf{x} = [2, 3, -1], \mathbf{w} = [0.5, -0.2, 0.8], b = 0.1. Compute the output using ReLU.

Step 1: Weighted sum plus bias.

z = w_1 x_1 + w_2 x_2 + w_3 x_3 + b

z = (0.5)(2) + (-0.2)(3) + (0.8)(-1) + 0.1

z = 1.0 - 0.6 - 0.8 + 0.1 = -0.3

Step 2: Apply ReLU.

a = \text{ReLU}(-0.3) = \max(0, -0.3) = 0

The neuron outputs 0. Because z is negative, ReLU kills the signal completely. This neuron, with these particular weights, does not “fire” for this input.
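
You can check the arithmetic in a couple of lines (NumPy is an assumption here; any array library would do):

import numpy as np

x = np.array([2.0, 3.0, -1.0])
w = np.array([0.5, -0.2, 0.8])
b = 0.1

z = np.dot(w, x) + b      # -0.3 (up to floating-point rounding)
a = np.maximum(0.0, z)    # ReLU: 0.0
print(z, a)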


Example 2: Forward pass through a 2-layer network

Network: 2 inputs, 3 hidden units with ReLU, 1 output with linear activation.

Given:

Input: \mathbf{x} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}

Hidden layer weights and biases:

W^{(1)} = \begin{bmatrix} 0.2 & -0.1 \\ 0.4 & 0.3 \\ -0.5 & 0.6 \end{bmatrix}, \quad \mathbf{b}^{(1)} = \begin{bmatrix} 0.1 \\ -0.2 \\ 0.0 \end{bmatrix}

Output layer weights and bias:

\mathbf{w}^{(2)} = \begin{bmatrix} 0.7 \\ -0.3 \\ 0.5 \end{bmatrix}, \quad b^{(2)} = 0.1

Step 1: Compute hidden layer pre-activations.

\mathbf{z}^{(1)} = W^{(1)} \mathbf{x} + \mathbf{b}^{(1)}

z_1^{(1)} = (0.2)(1) + (-0.1)(2) + 0.1 = 0.2 - 0.2 + 0.1 = 0.1

z_2^{(1)} = (0.4)(1) + (0.3)(2) + (-0.2) = 0.4 + 0.6 - 0.2 = 0.8

z_3^{(1)} = (-0.5)(1) + (0.6)(2) + 0.0 = -0.5 + 1.2 = 0.7

Step 2: Apply ReLU.

\mathbf{h} = \text{ReLU}(\mathbf{z}^{(1)}) = \begin{bmatrix} \max(0, 0.1) \\ \max(0, 0.8) \\ \max(0, 0.7) \end{bmatrix} = \begin{bmatrix} 0.1 \\ 0.8 \\ 0.7 \end{bmatrix}

All values are positive, so all three neurons are active.

Step 3: Compute output.

\hat{y} = (\mathbf{w}^{(2)})^\top \mathbf{h} + b^{(2)}

\hat{y} = (0.7)(0.1) + (-0.3)(0.8) + (0.5)(0.7) + 0.1

\hat{y} = 0.07 - 0.24 + 0.35 + 0.1 = 0.28

The network outputs 0.28. Every hidden neuron contributed to the final answer, weighted by the output layer.
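
The same forward pass in NumPy, as a check on the arithmetic (NumPy is assumed for illustration; the weights are the ones given above):

import numpy as np

x = np.array([1.0, 2.0])
W1 = np.array([[ 0.2, -0.1],
               [ 0.4,  0.3],
               [-0.5,  0.6]])
b1 = np.array([0.1, -0.2, 0.0])
w2 = np.array([0.7, -0.3, 0.5])
b2 = 0.1

z1 = W1 @ x + b1             # [0.1, 0.8, 0.7]
h = np.maximum(0.0, z1)      # ReLU leaves all three values unchanged
y_hat = w2 @ h + b2          # 0.28
print(z1, h, y_hat)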


Example 3: Why stacking linear layers collapses

Let us prove that stacking linear layers without activations gives you nothing extra. Take two layers with 2 \times 2 weight matrices and no biases for clarity.

Given:

W_1 = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad W_2 = \begin{bmatrix} 0.5 & -1 \\ 1.5 & 0.5 \end{bmatrix}

Input: \mathbf{x} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}

Two separate layers:

\mathbf{h} = W_1 \mathbf{x} = \begin{bmatrix} 1 + 2 \\ 3 + 4 \end{bmatrix} = \begin{bmatrix} 3 \\ 7 \end{bmatrix}

\mathbf{y} = W_2 \mathbf{h} = \begin{bmatrix} (0.5)(3) + (-1)(7) \\ (1.5)(3) + (0.5)(7) \end{bmatrix} = \begin{bmatrix} 1.5 - 7 \\ 4.5 + 3.5 \end{bmatrix} = \begin{bmatrix} -5.5 \\ 8.0 \end{bmatrix}

Single combined layer:

W' = W_2 W_1 = \begin{bmatrix} (0.5)(1)+(-1)(3) & (0.5)(2)+(-1)(4) \\ (1.5)(1)+(0.5)(3) & (1.5)(2)+(0.5)(4) \end{bmatrix} = \begin{bmatrix} -2.5 & -3.0 \\ 3.0 & 5.0 \end{bmatrix}

\mathbf{y} = W' \mathbf{x} = \begin{bmatrix} -2.5 - 3.0 \\ 3.0 + 5.0 \end{bmatrix} = \begin{bmatrix} -5.5 \\ 8.0 \end{bmatrix}

Same result. Two linear layers without activation are mathematically identical to one linear layer with W' = W_2 W_1. Add 100 layers, still the same story. This is exactly why activation functions are non-negotiable.
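
A quick NumPy check confirms the two versions agree (NumPy is assumed; the matrices are the ones above):

import numpy as np

W1 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
W2 = np.array([[0.5, -1.0],
               [1.5,  0.5]])
x = np.array([1.0, 1.0])

two_layers = W2 @ (W1 @ x)       # [-5.5, 8.0]
collapsed = (W2 @ W1) @ x        # same result from the merged matrix
print(two_layers, collapsed)
print(np.allclose(two_layers, collapsed))   # True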


What comes next

You now know what a neuron computes and how layers stack to form networks. But how does a network actually learn? How do we adjust the weights so the output matches what we want?

That is backpropagation: the algorithm that computes gradients through the network so gradient descent can update every weight. It is the engine behind all of deep learning.
