Parallel Thinking · Part 3

Amdahl's law and scalability limits

In this series (5 parts)
  1. What parallelism actually means and why it is hard
  2. Flynn's taxonomy and types of parallel hardware
  3. Amdahl's law and scalability limits
  4. Parallel decomposition: how to split work across processors
  5. CPU vs GPU: why they are built differently and when to use each

Prerequisites

Before reading this article, make sure you are familiar with the basics of parallel computation:

  • Parallel thinking intro: covers why parallelism matters, how work gets divided, and the fundamental vocabulary we build on here.

The promise and the ceiling

Adding more processors should make programs faster. Double the cores, halve the time. That intuition is correct for the parallel portion of your code. But every real program has a serial portion that cannot be split across processors. Amdahl’s law quantifies exactly how that serial fraction limits your speedup, no matter how much hardware you throw at the problem.

Amdahl’s law: the formula

Gene Amdahl stated the relationship in 1967. The formula is simple:

S(N) = \frac{1}{s + \frac{1 - s}{N}}

Where:

  • S(N) is the speedup with N processors
  • s is the serial fraction (the portion of the program that cannot be parallelized)
  • N is the number of processors
  • (1 - s) is the parallel fraction

The serial fraction runs on one processor regardless. The parallel fraction gets divided across all N processors. Total time is the sum of those two parts.
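The formula translates directly into code. Here is a minimal sketch in Python (the function name is our own, not from a library):

```python
def amdahl_speedup(s, n):
    """Amdahl's law: speedup on n processors with serial fraction s."""
    return 1.0 / (s + (1.0 - s) / n)

# A program that is 90% parallel (s = 0.10) on 4 processors:
print(round(amdahl_speedup(0.10, 4), 2))  # → 3.08
```

Even at 4 processors, a 10% serial fraction already costs you nearly a full processor's worth of speedup.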

Visualizing the split

Think of a program as a timeline. Part of it must run sequentially. The rest can be spread across multiple processors.

graph TD
  subgraph "Single Processor"
      A["Serial: s"] --> B["Parallel: 1-s"]
  end
  subgraph "2 Processors"
      C["Serial: s"] --> D["P1: (1-s)/2"]
      C --> E["P2: (1-s)/2"]
  end
  subgraph "4 Processors"
      F["Serial: s"] --> G["P1: (1-s)/4"]
      F --> H["P2: (1-s)/4"]
      F --> I["P3: (1-s)/4"]
      F --> J["P4: (1-s)/4"]
  end

As you add processors, the parallel portion shrinks. The serial portion stays the same size. Eventually, the serial portion dominates total execution time.

The ceiling effect

Here is the critical insight: with infinite processors, the parallel portion takes zero time. All that remains is the serial fraction. So the theoretical maximum speedup is:

S(\infty) = \frac{1}{s}

If 10% of your program is serial (s = 0.10), the maximum speedup is 10x. Not 100x. Not 1000x. Ten. No amount of hardware changes that number. The only way to push past this ceiling is to reduce the serial fraction itself.
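You can watch the ceiling emerge numerically. A short sketch, reusing the same formula (the function name is our own):

```python
def amdahl_speedup(s, n):
    """Amdahl's law: speedup on n processors with serial fraction s."""
    return 1.0 / (s + (1.0 - s) / n)

s = 0.10  # 10% serial
for n in (10, 100, 1_000, 1_000_000):
    print(f"{n:>9} processors: {amdahl_speedup(s, n):.2f}x")
# Speedup creeps toward 1/s = 10x but never crosses it.
```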

Speedup vs processors

The chart below shows speedup as a function of processor count for four different serial fractions. Watch how the curves flatten.

Even with 1024 processors, a 50% serial fraction caps you at 2x speedup. A 5% serial fraction gets you to about 19x with 256 processors and barely climbs beyond that. The diminishing returns are severe.

Serial fraction vs maximum speedup

The table below makes the ceiling concrete. For each serial fraction, it shows the speedup you get with 10, 100, 1000, and infinite processors.

Serial % | Parallel % | 10 procs | 100 procs | 1000 procs | ∞ procs
1%       | 99%        | 9.17     | 50.25     | 90.99      | 100.00
5%       | 95%        | 6.90     | 16.81     | 19.63      | 20.00
10%      | 90%        | 5.26     | 9.17      | 9.91       | 10.00
25%      | 75%        | 3.08     | 3.88      | 3.99       | 4.00
50%      | 50%        | 1.82     | 1.98      | 2.00       | 2.00

The pattern is stark. Once the serial fraction hits 25%, you are paying for 1000 processors but getting roughly the same speedup as 100 processors. The money is wasted. The engineering effort to distribute the work is wasted.

Worked example 1: Amdahl’s calculation

A program is 90% parallelizable (s = 0.10). Let’s compute the speedup for different GPU counts.

With 8 GPUs:

S(8) = \frac{1}{0.10 + \frac{0.90}{8}} = \frac{1}{0.10 + 0.1125} = \frac{1}{0.2125} = 4.71

You have 8 GPUs but only get 4.71x speedup. That is 59% efficiency (4.71 / 8).

With 128 GPUs:

S(128) = \frac{1}{0.10 + \frac{0.90}{128}} = \frac{1}{0.10 + 0.00703} = \frac{1}{0.10703} = 9.34

128 GPUs, but only 9.34x speedup. Efficiency drops to 7.3% (9.34 / 128). You are paying for 128 GPUs and 119 of them are idle most of the time, waiting for the serial section to finish.

With infinite GPUs:

S(\infty) = \frac{1}{0.10} = 10

The maximum possible speedup is 10x. Period. If this program currently takes 100 minutes, the absolute floor is 10 minutes. Those 10 minutes are the serial portion, and no amount of parallelism can touch them.

Key takeaway: If you need more than 10x speedup on this program, you must reduce the serial fraction. Hardware alone will not get you there.
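The worked numbers above are easy to reproduce. A sketch, with the efficiency computed as speedup divided by GPU count:

```python
def amdahl_speedup(s, n):
    """Amdahl's law: speedup on n processors with serial fraction s."""
    return 1.0 / (s + (1.0 - s) / n)

s = 0.10  # 10% serial
for n_gpus in (8, 128):
    sp = amdahl_speedup(s, n_gpus)
    print(f"{n_gpus:>3} GPUs: {sp:.2f}x speedup, {sp / n_gpus:.1%} efficiency")
# →   8 GPUs: 4.71x speedup, 58.8% efficiency
# → 128 GPUs: 9.34x speedup, 7.3% efficiency
```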

Worked example 2: finding the serial bottleneck

You measure a program running on 16 processors and observe 8x speedup. What is the serial fraction?

Start with Amdahl’s law and solve for s:

S(N) = \frac{1}{s + \frac{1 - s}{N}}

Plug in S = 8 and N = 16:

8 = \frac{1}{s + \frac{1 - s}{16}}

Invert both sides:

\frac{1}{8} = s + \frac{1 - s}{16}
0.125 = s + 0.0625 - 0.0625s
0.125 = 0.9375s + 0.0625
0.0625 = 0.9375s
s = 0.0667

The serial fraction is about 6.67%. That means the theoretical maximum speedup is 1 / 0.0667 = 15x. Even with thousands of processors, you cannot beat 15x on this workload.

Now suppose you optimize and cut that serial fraction in half (s = 0.0333):

S(16) = \frac{1}{0.0333 + \frac{0.9667}{16}} = \frac{1}{0.0333 + 0.0604} = \frac{1}{0.0937} = 10.67

Halving the serial fraction on the same 16 processors jumps speedup from 8x to 10.67x. And the new ceiling rises to 1 / 0.0333 = 30x. Optimizing serial code is often more valuable than adding processors.
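The same algebra can be packaged as a helper: rearranging 1/S = s + (1 - s)/N gives s = (1/S - 1/N) / (1 - 1/N). A sketch (the function name is our own):

```python
def serial_fraction(speedup, n):
    """Invert Amdahl's law: recover s from a measured speedup on n processors."""
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)

s = serial_fraction(speedup=8, n=16)
print(round(s, 4))      # → 0.0667
print(round(1 / s, 1))  # → 15.0 (the theoretical ceiling)
```

Measuring speedup at one processor count and inverting like this is the quickest way to estimate your ceiling before buying more hardware.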

Gustafson’s law: a different perspective

Amdahl’s law assumes a fixed problem size. You have a workload, and you want to finish it faster. But in practice, when people get more processors, they often solve bigger problems instead. This is Gustafson’s insight from 1988.

Gustafson’s law (also called Gustafson-Barsis’ law) says:

S(N) = N - s \cdot (N - 1)

Where s is the serial fraction measured on the parallel system. The key difference: instead of fixing the total work and asking “how fast can we go?”, Gustafson fixes the total time and asks “how much work can we do?”

Example: A weather simulation with s = 0.05 on 64 processors gives:

S(64) = 64 - 0.05 \times 63 = 64 - 3.15 = 60.85

That is 60.85x the work compared to a single processor, which sounds far more optimistic than Amdahl’s prediction for the same parameters. The difference is not a contradiction. It is a different question. Amdahl asks about time. Gustafson asks about throughput.
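Gustafson's formula is even simpler to encode. A sketch (the function name is our own):

```python
def gustafson_speedup(s, n):
    """Gustafson's law: scaled speedup when the problem grows with n."""
    return n - s * (n - 1)

# Weather simulation: 5% serial, 64 processors
print(round(gustafson_speedup(0.05, 64), 2))  # → 60.85
```

Note that it is linear in n: doubling the processors roughly doubles the work done, as long as the problem keeps growing to fill them.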

Strong scaling vs weak scaling

These two laws map directly to two standard ways of measuring parallel performance:

Strong scaling holds the problem size constant and increases processor count. This is Amdahl’s domain. You want the same job done faster. Success means linear speedup: double the processors, half the time. In practice, strong scaling almost always hits a wall because of the serial fraction.

Weak scaling holds the problem size per processor constant and increases both problem size and processor count together. This is Gustafson’s domain. You want to solve proportionally bigger problems in the same time. Success means constant execution time as you add processors and proportionally more work.

Property       | Strong Scaling            | Weak Scaling
Problem size   | Fixed                     | Grows with processor count
Goal           | Reduce time               | Increase throughput
Governing law  | Amdahl's law              | Gustafson's law
Metric         | Speedup = T(1) / T(N)     | Scaled efficiency = T(1) / T(N), with problem size scaled by N
Typical result | Diminishing returns       | Near-linear if communication is low
When to use    | Latency-sensitive tasks   | Throughput-oriented workloads

In GPU programming, both matter. Training a single neural network faster is a strong scaling problem. Processing more training data in the same time is a weak scaling problem. Most real systems blend both.

Why communication overhead changes the picture

Amdahl’s law assumes the parallel portion divides perfectly. Real systems have overhead:

  • Synchronization: threads or processes must wait for each other at barriers, locks, or reduction operations.
  • Communication: data must move between processors. On a GPU, this means transfers between global memory and shared memory, or between GPUs over PCIe or NVLink.
  • Load imbalance: if work is not evenly divided, some processors finish early and sit idle.
  • Memory contention: multiple processors accessing the same memory locations cause serialization at the hardware level.

These overheads effectively increase the serial fraction. A program that is theoretically 95% parallelizable might behave like it is only 80% parallelizable once you account for communication and synchronization costs. The modified formula looks like:

S(N) = \frac{1}{s + \frac{1 - s}{N} + O(N)}

Where O(N) represents overhead that grows with processor count. In many systems, O(N) grows logarithmically or linearly with N. This means speedup does not just plateau. It can actually decrease if you add too many processors. You pay more in coordination than you gain in parallelism.
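To see the turnover, add an overhead term to the formula. This sketch assumes an illustrative logarithmic coordination cost; the coefficient 0.002 is an arbitrary assumption for demonstration, not a measured value:

```python
import math

def speedup_with_overhead(s, n, c=0.002):
    """Amdahl's law plus a coordination cost that grows as c * log2(n)."""
    return 1.0 / (s + (1.0 - s) / n + c * math.log2(n))

for n in (64, 256, 1024, 4096):
    print(f"{n:>4} processors: {speedup_with_overhead(0.05, n):.2f}x")
# →   64 processors: 13.01x
# →  256 processors: 14.34x
# → 1024 processors: 14.10x
# → 4096 processors: 13.47x
```

Speedup peaks in the hundreds of processors and then declines: exactly the "pay more in coordination than you gain" effect described above.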

On GPUs specifically, common sources of overhead include:

  • Kernel launch latency: each GPU kernel launch has fixed overhead. Many small kernels serialize more than one large kernel.
  • Memory transfer: moving data between CPU and GPU memory is slow. PCIe bandwidth is orders of magnitude lower than GPU memory bandwidth.
  • Warp divergence: threads within a warp that take different branches execute serially, not in parallel.
  • Atomic operations: when threads must update shared counters or accumulators, those operations serialize.

In practice

Amdahl’s law is a thinking tool, not just a formula. Here is how to apply it:

Profile before parallelizing. Measure the serial fraction of your actual workload. If it is 25%, no amount of GPU work gets you past 4x. Know your ceiling before you invest.

Optimize the serial path first. Reducing the serial fraction from 10% to 5% doubles your theoretical ceiling from 10x to 20x. That single optimization is worth more than going from 64 to 1024 processors.

Match processor count to the problem. If your serial fraction is 10%, the speedup difference between 64 and 1024 processors is only about 1x (roughly 8.8x vs 9.9x). Do not overprovision.

Think about which scaling mode applies. Latency-bound problems (interactive applications, real-time inference) need strong scaling. Throughput-bound problems (batch processing, training) can leverage weak scaling.

Account for overhead. The theoretical speedup from Amdahl’s law is an upper bound. Real speedup will be lower because of communication, synchronization, and memory access patterns.

Do not assume linear speedup. “We have 8 GPUs, so we will be 8x faster” is almost never true. Set realistic expectations with your team and stakeholders.

Do not ignore serial sections because they are “small.” A 5% serial fraction sounds tiny, but it caps your speedup at 20x no matter what. Small serial fractions have disproportionate impact.

Do not parallelize everything blindly. Some code runs fast enough serially. The overhead of parallelizing it can make it slower. Parallelism has a cost. Apply it where the payoff is clear.

Summary

Amdahl’s law and Gustafson’s law are two sides of the same coin. Amdahl tells you the ceiling for making a fixed workload faster. Gustafson tells you how much more work you can tackle in the same time. Strong scaling hits walls. Weak scaling sidesteps them by growing the problem.

The practical message is this: the serial fraction of your code is the single most important number in parallel performance. Before you request more GPUs, measure it. Before you redesign your pipeline for more parallelism, check whether the bottleneck is in the serial path. The formula is simple. The discipline to use it before spending money on hardware is what separates productive engineering from wasted compute.

What comes next

Now that you understand the theoretical limits of parallelism, the next article covers how to actually decompose problems into parallel work units: Parallel decomposition: how to split work across processors.
