DevSecOps · Part 10

Incident response for DevSecOps

In this series (10 parts)
  1. What DevSecOps means
  2. Shift-left security
  3. SAST and DAST
  4. Software supply chain security
  5. Container security
  6. Kubernetes security in depth
  7. Secrets management in practice
  8. Cloud security posture management
  9. Compliance as code
  10. Incident response for DevSecOps

Breaches happen. The difference between a minor incident and a catastrophic breach often depends on how quickly and effectively the team responds. DevSecOps teams have an advantage: infrastructure as code, automated tooling, and immutable infrastructure enable response patterns that traditional operations cannot match.

Security vs operational incidents

Operational incidents (outages, performance degradation) and security incidents (unauthorized access, data exfiltration) share incident management structures but differ in critical ways.

| Aspect | Operational Incident | Security Incident |
| --- | --- | --- |
| Goal | Restore service | Contain threat, preserve evidence |
| First action | Roll back or scale | Isolate, do not destroy |
| Evidence | Logs sufficient | Forensic images required |
| Communication | Status page, internal chat | Legal, compliance, possibly regulators |
| Timeline pressure | Minutes to hours | Hours to days (containment fast, investigation slow) |
| Post-incident | Blameless retrospective | Root cause + legal review |

The most dangerous mistake is treating a security incident like an operational one. Restarting a compromised pod destroys forensic evidence. Redeploying from the same compromised image reintroduces the vulnerability.

Incident response phases

graph LR
  A[Detection] --> B[Triage]
  B --> C[Containment]
  C --> D[Evidence Collection]
  D --> E[Eradication]
  E --> F[Recovery]
  F --> G[Post-Incident Review]

  A -.- A1[SIEM alerts<br/>Falco<br/>GuardDuty]
  B -.- B1[Severity assessment<br/>Scope determination]
  C -.- C1[Network isolation<br/>Credential rotation]
  D -.- D1[Forensic snapshots<br/>Log preservation]
  E -.- E1[Remove threat actor<br/>Patch vulnerabilities]
  F -.- F1[Restore from IaC<br/>Validate clean state]
  G -.- G1[Timeline<br/>Lessons learned<br/>Control improvements]

Incident response lifecycle. Each phase has specific actions appropriate for cloud-native environments.

Detection

Security incidents are detected through multiple channels:

  • SIEM alerts: Aggregated logs trigger correlation rules
  • Runtime security: Falco detects anomalous system calls
  • Cloud-native alerts: GuardDuty, Security Hub findings
  • External reports: Bug bounties, customer reports, threat intelligence
  • Automated scanning: CSPM tools detecting configuration changes

IBM's Cost of a Data Breach Report puts the mean time to identify a breach at 204 days. Every automated detection mechanism shrinks that window.
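Alerts from these channels usually arrive as structured events that need routing before triage. As a rough sketch, a filter can fan out events by priority; the field name assumes Falco's JSON output, and `route_alert` is a hypothetical helper, not part of any of these tools:

```shell
#!/bin/sh
# Hypothetical alert router: extract the "priority" field from a
# Falco-style JSON event and decide where it goes. A real pipeline
# would use jq or a log shipper; this illustrates the routing logic.
route_alert() {
  event="$1"
  priority=$(printf '%s' "$event" | sed -n 's/.*"priority": *"\([^"]*\)".*/\1/p')
  case "$priority" in
    Emergency|Alert|Critical) echo "page-oncall" ;;
    Error|Warning)            echo "incident-channel" ;;
    *)                        echo "log-only" ;;
  esac
}

route_alert '{"rule":"Terminal shell in container","priority":"Warning"}'
```

The same routing decision could live in a SIEM correlation rule; the point is that severity-based fan-out is decided before a human looks at the event.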

Triage

Assign severity based on data sensitivity and blast radius:

| Severity | Criteria | Response Time |
| --- | --- | --- |
| Critical | Active data exfiltration, production database access | Immediate, all hands |
| High | Compromised credentials, lateral movement detected | Within 1 hour |
| Medium | Malware in non-production, suspicious access patterns | Within 4 hours |
| Low | Policy violations, failed attack attempts | Next business day |
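The table above can be encoded directly in paging automation so that every alert carries its acknowledgement deadline. A minimal sketch; `response_sla` is a hypothetical helper and the buckets mirror the table:

```shell
#!/bin/sh
# Hypothetical mapping from triage severity to the response-time target
# in the table above, e.g. for annotating a paging alert.
response_sla() {
  case "$1" in
    critical) echo "immediate" ;;
    high)     echo "1 hour" ;;
    medium)   echo "4 hours" ;;
    low)      echo "next business day" ;;
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_sla high
```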

Containment

Contain first. Investigate second. Containment strategies for cloud environments:

Network isolation of a Kubernetes pod:

# Apply a deny-all network policy to the compromised namespace
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-isolate
  namespace: compromised-ns
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF

AWS instance isolation:

# Create isolation security group (no ingress, no egress)
SG_ID=$(aws ec2 create-security-group \
  --group-name forensic-isolation \
  --description "Incident response isolation" \
  --vpc-id vpc-12345 \
  --query 'GroupId' --output text)

# Replace all security groups on compromised instance
aws ec2 modify-instance-attribute \
  --instance-id i-compromised \
  --groups "$SG_ID"

# Disable the compromised IAM role
aws iam put-role-policy \
  --role-name compromised-role \
  --policy-name deny-all \
  --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Deny","Action":"*","Resource":"*"}]}'

Credential rotation:

# Rotate all secrets in the affected scope
vault lease revoke -prefix secret/production/
# Revoke tokens issued through the affected Kubernetes auth mount
vault lease revoke -prefix auth/kubernetes/

# Invalidate active sessions
aws iam delete-access-key --user-name compromised-user --access-key-id AKIA...
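Deleting one known key is rarely enough; a compromised user may hold several. As a sketch, a dry-run wrapper lets you review the revocation commands before running them against a production account (`revoke_user_keys` is a hypothetical helper; key IDs are illustrative):

```shell
#!/bin/sh
# Hypothetical helper: deactivate every access key passed in for a user.
# With DRY_RUN=1 it prints the aws CLI commands instead of executing
# them, so the exact actions can be reviewed and logged first.
revoke_user_keys() {
  user="$1"; shift
  for key in "$@"; do
    cmd="aws iam update-access-key --user-name $user --access-key-id $key --status Inactive"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "$cmd"
    else
      $cmd
    fi
  done
}

DRY_RUN=1 revoke_user_keys compromised-user AKIAEXAMPLE1 AKIAEXAMPLE2
```

Deactivating (rather than deleting) keys preserves their metadata for the investigation while still cutting off access.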

Forensic evidence in cloud

Cloud environments are ephemeral. Containers restart. Instances get replaced. Logs rotate. Preserving evidence requires immediate action.

EBS volume snapshots

# Snapshot the root and data volumes
VOLUMES=$(aws ec2 describe-instances \
  --instance-ids i-compromised \
  --query 'Reservations[].Instances[].BlockDeviceMappings[].Ebs.VolumeId' \
  --output text)

for vol in $VOLUMES; do
  aws ec2 create-snapshot \
    --volume-id "$vol" \
    --description "Forensic snapshot - Incident INC-2026-042" \
    --tag-specifications "ResourceType=snapshot,Tags=[{Key=incident,Value=INC-2026-042}]"
done

Container forensics

# Export the filesystem of a running container (kubectl cp requires tar in the container)
kubectl cp compromised-ns/compromised-pod:/ ./forensic-export/ -c app

# Capture the container image for analysis
docker save $(docker inspect --format='{{.Image}}' compromised-container) > forensic-image.tar

Log preservation

# Export CloudTrail logs for the incident period
aws s3 sync s3://cloudtrail-logs/AWSLogs/123456789012/CloudTrail/ \
  ./forensic-logs/ \
  --exclude "*" \
  --include "*/2026/04/20/*" \
  --include "*/2026/04/21/*"

# Export API server logs; Kubernetes audit logs themselves live wherever
# your audit policy writes them (a control-plane file or webhook backend)
kubectl logs -n kube-system kube-apiserver-master \
  --since-time="2026-04-20T00:00:00Z" > k8s-audit.log

Store all forensic artifacts in a dedicated, write-once storage bucket with legal hold:

aws s3api put-object-lock-configuration \
  --bucket forensic-evidence \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {
      "DefaultRetention": {
        "Mode": "COMPLIANCE",
        "Days": 365
      }
    }
  }'
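Before uploading, record a hash manifest so the integrity of every artifact can be proven later. A minimal chain-of-custody sketch; the helper names are hypothetical:

```shell
#!/bin/sh
# Hypothetical chain-of-custody helpers: write a SHA-256 manifest for
# every file in an evidence directory, then verify it. Re-running the
# verify step after download proves nothing was altered in transit.
seal_evidence() {
  dir="$1"
  ( cd "$dir" && find . -type f ! -name MANIFEST.sha256 \
      -exec sha256sum {} + > MANIFEST.sha256 )
}

verify_evidence() {
  dir="$1"
  ( cd "$dir" && sha256sum -c --quiet MANIFEST.sha256 )
}
```

Seal the directory before `aws s3 cp`, store the manifest alongside the artifacts in the locked bucket, and verify again whenever evidence is pulled for analysis.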

Isolating compromised workloads

The decision to terminate vs isolate depends on evidence needs.

Terminate when:

  • The workload is actively exfiltrating data
  • You have sufficient forensic snapshots already
  • Immutable infrastructure means you can rebuild identically

Isolate when:

  • Memory analysis is needed (process dumps, network connections)
  • The attack vector is unknown and live observation helps
  • Legal counsel requires preserving the running state

For Kubernetes, isolate by relabeling the pod so it no longer matches any service selector, keeping it running but unreachable:

# Remove the pod from the service by changing its labels
kubectl label pod compromised-pod app- --overwrite
kubectl label pod compromised-pod quarantine=true

Post-incident IaC remediation

After eradication, rebuild from infrastructure as code. Do not patch the compromised environment in place.

# Destroy and recreate the affected infrastructure
cd terraform/environments/production
# terraform taint is deprecated; -replace forces recreation in one step
terraform apply -replace="aws_instance.api_server"

# Redeploy Kubernetes workloads from clean manifests
kubectl delete namespace compromised-ns
kubectl apply -f manifests/production/

IaC remediation guarantees a known-good state. Manual patching leaves uncertainty about whether all attacker modifications were found and reverted.

Update the IaC to include the fix for the vulnerability that was exploited:

# Add the security control that was missing
resource "aws_security_group" "api" {
  # ... existing rules ...

  # Added after INC-2026-042: restrict egress to known endpoints only
  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = var.allowed_egress_cidrs
  }
}

Communication

Security incidents require structured communication across multiple audiences:

Internal engineering: Real-time updates in a dedicated incident channel. Use a bot to log all actions with timestamps.
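If no bot is wired up yet, even a shell helper beats reconstructing actions from memory. A sketch, assuming a shared append-only log file (the incident ID and entries are illustrative):

```shell
#!/bin/sh
# Hypothetical stand-in for an incident bot: append UTC-timestamped
# entries to an append-only action log. The file doubles as raw
# material for timeline reconstruction in the post-incident review.
INCIDENT_LOG=${INCIDENT_LOG:-./INC-2026-042.log}

log_action() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$INCIDENT_LOG"
}

log_action "Applied emergency-isolate NetworkPolicy to compromised-ns"
log_action "Rotated production database credentials"
```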

Leadership: Regular summaries (every 2-4 hours during active incidents) covering scope, impact, containment status, and estimated resolution time.

Legal and compliance: Involve immediately for incidents involving personal data. Regulatory notification deadlines vary by jurisdiction: GDPR requires 72-hour notification, HIPAA requires 60 days, state breach notification laws vary.
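One detail worth automating: compute the regulatory deadline the moment the incident is declared, so nobody does clock arithmetic under pressure. A sketch using GNU date's relative-item syntax (the detection timestamp is illustrative):

```shell
#!/bin/sh
# Compute the GDPR 72-hour notification deadline from the detection
# time. Relies on GNU date; the timestamp here is illustrative.
DETECTED="2026-04-20T14:32:00Z"
DEADLINE=$(date -u -d "$DETECTED + 72 hours" +%Y-%m-%dT%H:%MZ)
echo "GDPR notification deadline: $DEADLINE"
```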

Customers: Communicate what happened, what data was affected, and what you are doing about it. Transparency builds trust. Vague statements erode it.

Incident communication template

## Incident Update: INC-2026-042
**Status:** Contained
**Severity:** High
**Time detected:** 2026-04-20 14:32 UTC
**Time contained:** 2026-04-20 15:08 UTC

### Summary
Unauthorized access to the production API database was detected
through anomalous query patterns flagged by GuardDuty.

### Scope
- Production database read access for approximately 36 minutes
- No evidence of data modification or exfiltration

### Actions Taken
1. Database credentials rotated
2. Compromised IAM role permissions revoked
3. Network access restricted to forensic isolation
4. Forensic snapshots captured

### Next Steps
- Complete log analysis for full scope determination
- Review and strengthen database access controls
- Update incident timeline for legal review

Post-incident review

Schedule the review 3-5 days after resolution. Use a structured format:

  1. Timeline reconstruction. Minute-by-minute account of what happened.
  2. Detection analysis. How was it found? Could it have been found sooner?
  3. Response evaluation. What went well? What was slow or confused?
  4. Root cause. What vulnerability or misconfiguration enabled the incident?
  5. Action items. Specific, assigned, time-bound improvements.

Track action items to completion. The review is worthless if recommendations are never implemented.
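Timeline reconstruction is largely mechanical when every source logs ISO 8601 UTC timestamps, because such timestamps sort correctly as plain strings. A sketch with illustrative file names and entries:

```shell
#!/bin/sh
# Merge timestamped entries from several sources into one ordered
# timeline. Works because ISO 8601 UTC timestamps sort lexically.
cat > guardduty.log <<'EOF'
2026-04-20T14:32:10Z GuardDuty: anomalous query pattern on prod DB
EOF

cat > actions.log <<'EOF'
2026-04-20T15:08:02Z Applied emergency-isolate NetworkPolicy
2026-04-20T14:45:30Z Rotated database credentials
EOF

sort guardduty.log actions.log > timeline.log
cat timeline.log
```

Feeding the incident channel's action log and each system's exported logs through the same merge gives the minute-by-minute account the review needs.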

The response times in the triage table are targets for a prepared team; actual times vary significantly with incident complexity and team readiness. Regular tabletop exercises shorten them by building muscle memory.

What comes next

This concludes the DevSecOps series. You now have a comprehensive understanding of security across the entire software delivery lifecycle: from shift-left practices and code scanning through supply chain security, container and Kubernetes hardening, secrets management, cloud posture management, compliance automation, and incident response. The next step is implementation. Pick the area with the highest risk in your organization and start there.
