Background jobs and task queues
In this series (15 parts)
- Backend system design scope
- Designing RESTful APIs
- Authentication and session management
- Database design for backend systems
- Caching in backend systems
- Background jobs and task queues
- File upload and storage
- Search integration
- Email and notification delivery
- Webhooks: design and security
- Payments integration
- Multi-tenancy patterns
- Backend for Frontend (BFF) pattern
- GraphQL server design
- gRPC and internal service APIs
Not everything belongs in the request/response cycle. Sending an email, generating a PDF, resizing an image, syncing data to a third-party API: these operations are slow, unreliable, or both. If you make the user wait for them, your API feels sluggish. If the operation fails, the entire request fails. Background jobs solve this by moving work to a separate process that runs asynchronously.
Why background jobs exist
Three reasons to push work to a background job:
- Latency: the operation takes too long. A user creating an account should not wait 3 seconds while you send a welcome email, create a Stripe customer, and provision storage.
- Reliability: the operation depends on an external service that might be down. A background job can retry; a synchronous request cannot (the client already timed out).
- Decoupling: the producer of work does not need to know the details of how work is done. The API endpoint enqueues a job. A separate worker picks it up.
Job queue architecture
A job queue has three components: producers (your API), the queue itself (a message broker such as Redis, RabbitMQ, or SQS), and consumers (worker processes).
```mermaid
graph LR
  API["API Server<br/>(Producer)"] -->|Enqueue job| Q["Job Queue<br/>(Redis / RabbitMQ / SQS)"]
  Q -->|Dequeue job| W1["Worker 1<br/>(Consumer)"]
  Q -->|Dequeue job| W2["Worker 2<br/>(Consumer)"]
  Q -->|Dequeue job| W3["Worker 3<br/>(Consumer)"]
  W1 -->|On failure| DLQ["Dead Letter Queue"]
  W2 -->|On failure| DLQ
  W3 -->|On failure| DLQ
  style Q fill:#f39c12,color:#fff
  style DLQ fill:#e74c3c,color:#fff
```
Job queue architecture. Producers enqueue jobs, workers dequeue and process them, and failed jobs land in the dead letter queue.
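To make the flow concrete, here is a toy in-memory sketch of this architecture. The names (`makeQueue`, `enqueue`, `work`) are invented for illustration; a real system would back the queue with Redis, RabbitMQ, or SQS rather than a JavaScript array:

```javascript
// Toy in-memory job queue: producers enqueue, workers process,
// and jobs that fail too many times land in the DLQ. Not production code.
function makeQueue(maxAttempts = 3) {
  const pending = [];
  const dlq = [];
  return {
    enqueue(job) {
      pending.push({ ...job, attempt: 0 });
    },
    // One worker "tick": take a job, run the handler, retry or dead-letter on failure.
    async work(handler) {
      const job = pending.shift();
      if (!job) return;
      try {
        await handler(job);
      } catch (err) {
        job.attempt += 1;
        if (job.attempt >= maxAttempts) dlq.push(job);
        else pending.push(job);
      }
    },
    depth: () => pending.length,
    dlqDepth: () => dlq.length,
  };
}
```

A real backend adds durability, visibility timeouts, and concurrent workers, but the produce/consume/dead-letter loop is exactly this shape.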
Queue selection
| Queue | Best For | Trade-offs |
|---|---|---|
| Redis (Bull, Sidekiq) | Low latency, simple setup | Volatile; needs persistence config |
| RabbitMQ | Complex routing, priority queues | Operational complexity |
| SQS | Managed, high durability | Higher latency, at-least-once |
| PostgreSQL (SKIP LOCKED) | Small scale, no new infra | Not designed for high throughput |
For most applications starting out, Redis-backed queues (Bull for Node.js, Sidekiq for Ruby, Celery with Redis for Python) are the pragmatic choice. They are fast, well-documented, and simple to operate.
Job lifecycle
A job moves through a defined set of states from creation to completion or failure.
```mermaid
stateDiagram-v2
  [*] --> Enqueued: Job created
  Enqueued --> Active: Worker picks up
  Active --> Completed: Success
  Active --> Failed: Error thrown
  Failed --> Enqueued: Retry (if attempts remain)
  Failed --> DeadLetter: Max retries exceeded
  Enqueued --> Delayed: Scheduled for later
  Delayed --> Enqueued: Delay elapsed
  Active --> Stalled: Worker crashed
  Stalled --> Enqueued: Recovered by monitor
```
Job state machine. Jobs move from enqueued to active to completed, with retry and dead letter paths for failures.
Designing idempotent jobs
Jobs must be idempotent. The queue guarantees at-least-once delivery, which means a job may run more than once. If your job sends an email, a duplicate run sends two emails. To make it idempotent:
- Use a unique job ID derived from the business operation (e.g., `send-welcome-email:user:42`).
- Check whether the operation has already completed before executing it.
- Use database constraints (unique indexes) to prevent duplicate side effects.
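A minimal sketch of the check-before-execute pattern. The in-memory set stands in for a database table with a unique index on the idempotency key, and the function names are illustrative:

```javascript
// Tracks completed operations; in production this would be a DB table
// with a unique index on the idempotency key.
const completed = new Set();
let emailsSent = 0;

async function sendWelcomeEmail(userId) {
  emailsSent += 1; // stand-in for the real side effect
}

async function handleWelcomeEmail(job) {
  const key = `send-welcome-email:user:${job.userId}`;
  if (completed.has(key)) return 'skipped'; // duplicate delivery: do nothing
  await sendWelcomeEmail(job.userId);
  completed.add(key);
  return 'sent';
}
```

If the queue delivers the same job twice, the second run finds the key already recorded and exits without sending a second email.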
Retry with exponential backoff
When a job fails, retrying immediately usually fails again (the external service is still down). Exponential backoff spaces retries out over increasing intervals.
```
attempt 1: immediate
attempt 2: 30 seconds
attempt 3: 2 minutes
attempt 4: 8 minutes
attempt 5: 30 minutes
```
Add jitter (random variation) to prevent thundering herds when many jobs fail simultaneously and all retry at the same intervals.
```javascript
function getRetryDelay(attempt) {
  if (attempt === 1) return 0; // first attempt runs immediately
  const base = 30_000; // 30 seconds
  // Quadrupling matches the schedule above: 30s, 2 min, 8 min, ~30 min
  const exponential = base * Math.pow(4, attempt - 2);
  const jitter = Math.random() * exponential * 0.2; // up to 20% random jitter
  return exponential + jitter;
}
```
Exponential backoff gives transient failures time to resolve while fixed intervals can overwhelm a recovering service with constant retries.
Dead letter queues
After exhausting all retry attempts, a job moves to the dead letter queue (DLQ). The DLQ holds failed jobs for inspection. It serves three purposes:
- Visibility: you can see which jobs are failing and why.
- Manual retry: after fixing the underlying issue, you can replay jobs from the DLQ.
- Alerting: a growing DLQ triggers an alert. If DLQ depth exceeds a threshold, something is systematically broken.
DLQ handling process
1. Alert fires: DLQ depth > 10
2. Engineer inspects failed jobs (error messages, stack traces)
3. Root cause identified (e.g., downstream API changed response format)
4. Fix deployed
5. Jobs replayed from DLQ
6. DLQ depth returns to zero
Do not automatically retry from the DLQ. Jobs end up there because automatic retries already failed. Human inspection is needed.
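After the root cause is fixed, replay can be as simple as moving the inspected jobs back onto the main queue. This sketch uses plain arrays as stand-ins for real queue backends, with `replayFromDLQ` and `shouldReplay` as invented names:

```javascript
// Replay selected jobs from a DLQ back onto the main queue after a fix.
// `dlq` and `queue` are plain arrays standing in for real queue backends.
function replayFromDLQ(dlq, queue, shouldReplay) {
  const kept = [];
  for (const job of dlq) {
    if (shouldReplay(job)) {
      queue.push({ ...job, attempt: 1 }); // reset the attempt counter on replay
    } else {
      kept.push(job); // leave jobs we haven't diagnosed yet in the DLQ
    }
  }
  dlq.length = 0;
  dlq.push(...kept);
}
```

The `shouldReplay` predicate is the point: a human decides which failure class is safe to replay, rather than flushing the whole DLQ blindly.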
Cron vs event-triggered jobs
Event-triggered jobs
Triggered by something that happened: a user signed up, an order was placed, a file was uploaded. These are reactive.
```javascript
// After creating a user, enqueue the welcome email job
const user = await createUser(userData);
await jobQueue.add('send-welcome-email', { userId: user.id });
```
Cron jobs (scheduled jobs)
Triggered by time: every hour, every night at 2 AM, every Monday morning. These are proactive.
Generate daily reports - every day at 1:00 AM
Clean up expired sessions - every hour
Send digest emails - every Monday at 8:00 AM
Retry failed webhook calls - every 5 minutes
Common pitfall: cron in multiple instances
If you run 10 application instances and each registers the same cron schedule, the job runs 10 times. Solutions:
- Leader election: only one instance runs cron jobs. If it fails, another takes over.
- Distributed lock: use a Redis lock so only the first instance to acquire the lock runs the job.
- Dedicated scheduler: a single process (or managed service like CloudWatch Events) triggers jobs. Workers only consume.
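The distributed-lock approach maps to a single Redis `SET key value NX PX ttl` call: whichever instance sets the key first holds the lock until the TTL expires. This sketch simulates that behavior with an in-memory map so the logic is visible (the job name and instance IDs are illustrative):

```javascript
// In-memory stand-in for Redis SET ... NX PX: the first caller to acquire
// the lock within the TTL window runs the cron job; all others skip it.
const locks = new Map(); // key -> expiry timestamp (ms)

function acquireLock(key, ttlMs, now = Date.now()) {
  const expiry = locks.get(key);
  if (expiry !== undefined && expiry > now) return false; // held by another instance
  locks.set(key, now + ttlMs);
  return true;
}

function runCronJob(instanceId, now) {
  if (!acquireLock('cron:daily-report', 60_000, now)) return `${instanceId}: skipped`;
  return `${instanceId}: ran`;
}
```

With a real Redis, the map lookup becomes one atomic command, so there is no race between checking and setting the lock.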
Job observability
Without observability, your job system is a black box. Track these signals:
Metrics
- Enqueue rate: jobs added per second, by queue.
- Processing rate: jobs completed per second, by queue.
- Failure rate: jobs failed per second, by queue.
- Queue depth: number of waiting jobs. Growing depth means workers cannot keep up.
- Processing time: p50, p95, p99 of job execution duration.
- DLQ depth: number of dead-lettered jobs. Should be near zero.
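The processing-time percentiles above can be computed from recorded durations with a nearest-rank calculation; a small sketch (monitoring systems typically do this for you from histogram data):

```javascript
// Nearest-rank percentile over recorded job durations (ms).
function percentile(durations, p) {
  if (durations.length === 0) return 0;
  const sorted = [...durations].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}
```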
Structured logging
Every job should log:
- Job ID, type, and relevant business IDs.
- Start and completion timestamps.
- On failure: the error message, stack trace, and attempt number.
```json
{
  "job_id": "j-abc123",
  "job_type": "send-welcome-email",
  "user_id": "u-42",
  "attempt": 3,
  "status": "failed",
  "error": "SMTP connection timeout",
  "duration_ms": 5023,
  "timestamp": "2026-04-20T10:15:30Z"
}
```
Alerting rules
| Alert | Threshold | Severity |
|---|---|---|
| Queue depth growing | > 1000 for 5 minutes | Warning |
| DLQ depth > 0 | Any jobs in DLQ | Warning |
| DLQ depth growing | > 50 in 1 hour | Critical |
| Processing time spike | p99 > 2x baseline | Warning |
| Worker process count dropped | < expected count | Critical |
Job queue anti-patterns
Putting too much data in the job payload. Store the data in the database and pass an ID. Large payloads slow down the queue and make debugging harder.
No timeout on job execution. A hung job blocks a worker forever. Set a maximum execution time and kill jobs that exceed it.
Coupling job processing order. If job B depends on job A completing, do not rely on queue ordering. Use a workflow engine or chain jobs explicitly.
Ignoring backpressure. If producers add jobs faster than consumers process them, the queue grows without bound. Monitor queue depth and scale workers or throttle producers.
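For the timeout anti-pattern above, a handler can be wrapped with `Promise.race` so that a hung job rejects instead of blocking the worker forever. This is a sketch; most queue libraries expose a timeout option that also kills the underlying work:

```javascript
// Wraps a job handler so it rejects if it does not settle within `ms`.
// Note: the original promise keeps running; this only frees the worker slot.
function withTimeout(fn, ms) {
  return (job) =>
    Promise.race([
      fn(job),
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error(`job timed out after ${ms}ms`)), ms)
      ),
    ]);
}
```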
What comes next
The next article covers file upload and storage: direct uploads vs presigned URLs, chunked uploads for large files, virus scanning pipelines, and CDN integration. File uploads are often handled as background jobs, making this a natural next step.