
Background jobs and task queues

In this series (15 parts)
  1. Backend system design scope
  2. Designing RESTful APIs
  3. Authentication and session management
  4. Database design for backend systems
  5. Caching in backend systems
  6. Background jobs and task queues
  7. File upload and storage
  8. Search integration
  9. Email and notification delivery
  10. Webhooks: design and security
  11. Payments integration
  12. Multi-tenancy patterns
  13. Backend for Frontend (BFF) pattern
  14. GraphQL server design
  15. gRPC and internal service APIs

Not everything belongs in the request/response cycle. Sending an email, generating a PDF, resizing an image, syncing data to a third-party API: these operations are slow, unreliable, or both. If you make the user wait for them, your API feels sluggish. If the operation fails, the entire request fails. Background jobs solve this by moving work to a separate process that runs asynchronously.

Why background jobs exist

Three reasons to push work to a background job:

  1. Latency: the operation takes too long. A user creating an account should not wait 3 seconds while you send a welcome email, create a Stripe customer, and provision storage.
  2. Reliability: the operation depends on an external service that might be down. A background job can retry; a synchronous request cannot (the client already timed out).
  3. Decoupling: the producer of work does not need to know the details of how work is done. The API endpoint enqueues a job. A separate worker picks it up.

Job queue architecture

A job queue has three components: producers (your API), the queue itself (message queue like Redis, RabbitMQ, or SQS), and consumers (worker processes).

graph LR
  API["API Server<br/>(Producer)"] -->|Enqueue job| Q["Job Queue<br/>(Redis / RabbitMQ / SQS)"]
  Q -->|Dequeue job| W1["Worker 1<br/>(Consumer)"]
  Q -->|Dequeue job| W2["Worker 2<br/>(Consumer)"]
  Q -->|Dequeue job| W3["Worker 3<br/>(Consumer)"]
  W1 -->|On failure| DLQ["Dead Letter Queue"]
  W2 -->|On failure| DLQ
  W3 -->|On failure| DLQ

  style Q fill:#f39c12,color:#fff
  style DLQ fill:#e74c3c,color:#fff

Job queue architecture. Producers enqueue jobs, workers dequeue and process them, and failed jobs land in the dead letter queue.
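The moving parts above can be sketched in a few lines. This is an illustrative in-memory model, not a real broker: `queue` and `deadLetterQueue` are plain arrays standing in for Redis, RabbitMQ, or SQS, and `maxAttempts` is an assumed retry budget.

```javascript
// In-memory sketch of producer -> queue -> worker -> dead letter queue.
const queue = [];
const deadLetterQueue = [];

// Producer side: the API enqueues a job description, nothing more.
function enqueue(type, payload) {
  queue.push({ type, payload, attempts: 0 });
}

// Consumer side: a worker takes one job and runs the matching handler.
function processNext(handlers, maxAttempts = 3) {
  const job = queue.shift();
  if (!job) return;
  try {
    handlers[job.type](job.payload);
  } catch (err) {
    job.attempts += 1;
    if (job.attempts >= maxAttempts) {
      // Retries exhausted: park the job for human inspection.
      deadLetterQueue.push({ ...job, error: err.message });
    } else {
      queue.push(job); // requeue for another attempt
    }
  }
}

enqueue('send-welcome-email', { userId: 42 });
processNext({ 'send-welcome-email': (p) => console.log('emailing user', p.userId) });
```

The producer never calls the handler directly; it only knows the job type and payload, which is the decoupling the section above describes.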

Queue selection

Queue                      Best for                          Trade-offs
Redis (Bull, Sidekiq)      Low latency, simple setup         Volatile; needs persistence config
RabbitMQ                   Complex routing, priority queues  Operational complexity
SQS                        Managed, high durability          Higher latency, at-least-once
PostgreSQL (SKIP LOCKED)   Small scale, no new infra         Not designed for high throughput

For most applications starting out, Redis-backed queues (Bull for Node.js, Sidekiq for Ruby, Celery with Redis for Python) are the pragmatic choice. They are fast, well-documented, and simple to operate.

Job lifecycle

A job moves through a defined set of states from creation to completion or failure.

stateDiagram-v2
  [*] --> Enqueued: Job created
  Enqueued --> Active: Worker picks up
  Active --> Completed: Success
  Active --> Failed: Error thrown
  Failed --> Enqueued: Retry (if attempts remain)
  Failed --> DeadLetter: Max retries exceeded
  Enqueued --> Delayed: Scheduled for later
  Delayed --> Enqueued: Delay elapsed
  Active --> Stalled: Worker crashed
  Stalled --> Enqueued: Recovered by monitor

Job state machine. Jobs move from enqueued to active to completed, with retry and dead letter paths for failures.

Designing idempotent jobs

Jobs must be idempotent. The queue guarantees at-least-once delivery, which means a job may run more than once. If your job sends an email, a duplicate run sends two emails. To make it idempotent:

  • Use a unique job ID derived from the business operation (e.g., send-welcome-email:user:42).
  • Check if the operation was already completed before executing it.
  • Use database constraints (unique indexes) to prevent duplicate side effects.
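A minimal sketch of these three rules together. The `completedJobs` set here stands in for a database table with a unique index on the job ID; the side effect passed to `runOnce` is hypothetical.

```javascript
// Records of finished work -- in production, a DB table with a unique
// index on job_id, so a duplicate insert fails atomically.
const completedJobs = new Set();

// Deterministic job ID derived from the business operation.
function jobId(operation, entity, id) {
  return `${operation}:${entity}:${id}`; // e.g. send-welcome-email:user:42
}

// Run the side effect only if this job ID has never completed before.
function runOnce(id, sideEffect) {
  if (completedJobs.has(id)) return false; // duplicate delivery: skip
  sideEffect();
  completedJobs.add(id);
  return true;
}

let emailsSent = 0;
const id = jobId('send-welcome-email', 'user', 42);
runOnce(id, () => { emailsSent += 1; });
runOnce(id, () => { emailsSent += 1; }); // at-least-once redelivery: no-op
```

Note the check-then-record here is not atomic; the database unique index is what actually closes the race when two workers pick up the same job.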

Retry with exponential backoff

When a job fails, retrying immediately usually fails again (the external service is still down). Exponential backoff spaces retries out over increasing intervals.

attempt 1: immediate
attempt 2: 30 seconds
attempt 3: 2 minutes
attempt 4: 8 minutes
attempt 5: 32 minutes

Add jitter (random variation) to prevent thundering herds when many jobs fail simultaneously and all retry at the same intervals.

function getRetryDelay(attempt) {
  if (attempt <= 1) return 0; // first attempt runs immediately
  const base = 30_000; // 30 seconds before attempt 2
  const delay = base * Math.pow(4, attempt - 2); // 30s, 2m, 8m, 32m, ...
  const jitter = Math.random() * delay * 0.2; // up to 20% random spread
  return delay + jitter;
}

Exponential backoff gives transient failures time to resolve while fixed intervals can overwhelm a recovering service with constant retries.

Dead letter queues

After exhausting all retry attempts, a job moves to the dead letter queue (DLQ). The DLQ holds failed jobs for inspection. It serves three purposes:

  1. Visibility: you can see which jobs are failing and why.
  2. Manual retry: after fixing the underlying issue, you can replay jobs from the DLQ.
  3. Alerting: a growing DLQ triggers an alert. If DLQ depth exceeds a threshold, something is systematically broken.

DLQ handling process

1. Alert fires: DLQ depth > 10
2. Engineer inspects failed jobs (error messages, stack traces)
3. Root cause identified (e.g., downstream API changed response format)
4. Fix deployed
5. Jobs replayed from DLQ
6. DLQ depth returns to zero

Do not automatically retry from the DLQ. Jobs end up there because automatic retries already failed. Human inspection is needed.
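Step 5, replaying from the DLQ after the fix is deployed, can be as simple as draining one queue into the other with a fresh retry budget. A sketch, with plain arrays standing in for real broker queues:

```javascript
// Move every dead-lettered job back onto the main queue, resetting its
// attempt counter so it gets a full retry budget against the fixed code.
function replayFromDLQ(deadLetterQueue, queue) {
  let replayed = 0;
  while (deadLetterQueue.length > 0) {
    const job = deadLetterQueue.shift();
    queue.push({ ...job, attempts: 0, error: undefined });
    replayed += 1;
  }
  return replayed; // report how many jobs went back for reprocessing
}
```

The function is invoked deliberately by an operator, never on a timer, which keeps the "no automatic DLQ retry" rule intact.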

Cron vs event-triggered jobs

Event-triggered jobs

Triggered by something that happened: a user signed up, an order was placed, a file was uploaded. These are reactive.

// After creating a user, enqueue the welcome email job
const user = await createUser(userData);
await jobQueue.add('send-welcome-email', { userId: user.id });

Cron jobs (scheduled jobs)

Triggered by time: every hour, every night at 2 AM, every Monday morning. These are proactive.

Generate daily reports       - every day at 1:00 AM
Clean up expired sessions    - every hour
Send digest emails           - every Monday at 8:00 AM
Retry failed webhook calls   - every 5 minutes
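In standard five-field crontab syntax (an assumption; your scheduler may use a different format), these schedules look like:

```
0 1 * * *      # generate daily reports (every day at 1:00 AM)
0 * * * *      # clean up expired sessions (every hour)
0 8 * * 1      # send digest emails (every Monday at 8:00 AM)
*/5 * * * *    # retry failed webhook calls (every 5 minutes)
```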

Common pitfall: cron in multiple instances

If you run 10 application instances and each registers the same cron schedule, the job runs 10 times. Solutions:

  • Leader election: only one instance runs cron jobs. If it fails, another takes over.
  • Distributed lock: use a Redis lock so only the first instance to acquire the lock runs the job.
  • Dedicated scheduler: a single process (or managed service like CloudWatch Events) triggers jobs. Workers only consume.
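The distributed-lock option mirrors Redis's SET-with-NX-and-expiry pattern. Below is a sketch using an in-memory Map so it runs standalone; in production the lock store must be shared across instances (Redis itself, or a library built on it), since a per-process Map cannot coordinate anything.

```javascript
// Lock store: key -> { expiresAt }. Stands in for Redis SET key val NX PX ttl.
const locks = new Map();

// Acquire succeeds only if no unexpired lock exists for this key.
function acquireLock(key, ttlMs, now = Date.now()) {
  const existing = locks.get(key);
  if (existing && existing.expiresAt > now) return false; // another holder
  locks.set(key, { expiresAt: now + ttlMs });
  return true;
}

// Every instance calls this on schedule; only the lock winner does the work.
function runCronJob(name, ttlMs, job) {
  if (!acquireLock(`cron:${name}`, ttlMs)) return false;
  job();
  return true;
}
```

The TTL matters: if the winning instance crashes mid-job, the lock expires and the next scheduled run can proceed instead of being blocked forever.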

Job observability

Without observability, your job system is a black box. Track these signals:

Metrics

  • Enqueue rate: jobs added per second, by queue.
  • Processing rate: jobs completed per second, by queue.
  • Failure rate: jobs failed per second, by queue.
  • Queue depth: number of waiting jobs. Growing depth means workers cannot keep up.
  • Processing time: p50, p95, p99 of job execution duration.
  • DLQ depth: number of dead-lettered jobs. Should be near zero.

Structured logging

Every job should log:

  • Job ID, type, and relevant business IDs.
  • Start and completion timestamps.
  • On failure: the error message, stack trace, and attempt number.

A failed attempt might log:

{
  "job_id": "j-abc123",
  "job_type": "send-welcome-email",
  "user_id": "u-42",
  "attempt": 3,
  "status": "failed",
  "error": "SMTP connection timeout",
  "duration_ms": 5023,
  "timestamp": "2026-04-20T10:15:30Z"
}

Alerting rules

Alert                          Threshold               Severity
Queue depth growing            > 1000 for 5 minutes    Warning
DLQ depth > 0                  Any jobs in DLQ         Warning
DLQ depth growing              > 50 in 1 hour          Critical
Processing time spike          p99 > 2x baseline       Warning
Worker process count dropped   < expected count        Critical

Job queue anti-patterns

Putting too much data in the job payload. Store the data in the database and pass an ID. Large payloads slow down the queue and make debugging harder.

No timeout on job execution. A hung job blocks a worker forever. Set a maximum execution time and kill jobs that exceed it.
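A common way to enforce such a timeout in Node.js is racing the job against a timer. A sketch; note that JavaScript cannot forcibly kill the racing promise, so real workers pair this with an AbortSignal or by killing the worker process.

```javascript
// Reject if the job promise does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`job exceeded ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Usage: `await withTimeout(handler(job), 30_000)` inside the worker loop, treating the rejection like any other job failure so the retry and DLQ paths apply.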

Coupling job processing order. If job B depends on job A completing, do not rely on queue ordering. Use a workflow engine or chain jobs explicitly.

Ignoring backpressure. If producers add jobs faster than consumers process them, the queue grows without bound. Monitor queue depth and scale workers or throttle producers.

What comes next

The next article covers file upload and storage: direct uploads vs presigned URLs, chunked uploads for large files, virus scanning pipelines, and CDN integration. File uploads are often handled as background jobs, making this a natural next step.
