Incident response for DevSecOps
Breaches happen. The difference between a minor incident and a catastrophic breach often depends on how quickly and effectively the team responds. DevSecOps teams have an advantage: infrastructure as code, automated tooling, and immutable infrastructure enable response patterns that traditional operations cannot match.
Security vs operational incidents
Operational incidents (outages, performance degradation) and security incidents (unauthorized access, data exfiltration) share incident management structures but differ in critical ways.
| Aspect | Operational Incident | Security Incident |
|---|---|---|
| Goal | Restore service | Contain threat, preserve evidence |
| First action | Roll back or scale | Isolate, do not destroy |
| Evidence | Logs sufficient | Forensic images required |
| Communication | Status page, internal chat | Legal, compliance, possibly regulators |
| Timeline pressure | Minutes to hours | Hours to days (containment fast, investigation slow) |
| Post-incident | Blameless retrospective | Root cause + legal review |
The most dangerous mistake is treating a security incident like an operational one. Restarting a compromised pod destroys forensic evidence. Redeploying from the same compromised image reintroduces the vulnerability.
Incident response phases
graph LR
  A[Detection] --> B[Triage]
  B --> C[Containment]
  C --> D[Evidence Collection]
  D --> E[Eradication]
  E --> F[Recovery]
  F --> G[Post-Incident Review]
  A -.- A1[SIEM alerts<br/>Falco<br/>GuardDuty]
  B -.- B1[Severity assessment<br/>Scope determination]
  C -.- C1[Network isolation<br/>Credential rotation]
  D -.- D1[Forensic snapshots<br/>Log preservation]
  E -.- E1[Remove threat actor<br/>Patch vulnerabilities]
  F -.- F1[Restore from IaC<br/>Validate clean state]
  G -.- G1[Timeline<br/>Lessons learned<br/>Control improvements]
Incident response lifecycle. Each phase has specific actions appropriate for cloud-native environments.
Detection
Security incidents are detected through multiple channels:
- SIEM alerts: Aggregated logs trigger correlation rules
- Runtime security: Falco detects anomalous system calls
- Cloud-native alerts: GuardDuty, Security Hub findings
- External reports: Bug bounties, customer reports, threat intelligence
- Automated scanning: CSPM tools detecting configuration changes
The mean time to identify a breach is 204 days (IBM Cost of a Data Breach Report). Every automated detection mechanism shrinks this window.
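Runtime detections like the Falco channel above are driven by rules. A minimal custom-rule sketch, detecting an interactive shell spawned inside a container (the rule name and shell list are illustrative; `spawned_process` and `container` are standard Falco macros):

```yaml
- rule: Shell Spawned in Container
  desc: Detect an interactive shell starting inside a running container
  condition: spawned_process and container and proc.name in (bash, sh, zsh)
  output: Shell in container (user=%user.name container=%container.name command=%proc.cmdline)
  priority: WARNING
  tags: [container, shell, incident-response]
```

Alerts like this feed the triage queue directly; the earlier in the kill chain a rule fires, the smaller the containment scope.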
Triage
Assign severity based on data sensitivity and blast radius:
| Severity | Criteria | Response Time |
|---|---|---|
| Critical | Active data exfiltration, production database access | Immediate, all hands |
| High | Compromised credentials, lateral movement detected | Within 1 hour |
| Medium | Malware in non-production, suspicious access patterns | Within 4 hours |
| Low | Policy violations, failed attack attempts | Next business day |
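These targets are only useful if paging automation enforces them. A small sketch encoding the table in shell (the severity names and the 24-hour approximation of "next business day" are assumptions from the table above):

```shell
# Map a triage severity to its response-time SLA in minutes.
# Unknown severities fail loudly so automation cannot silently skip paging.
response_sla_minutes() {
  case "$1" in
    critical) echo 0 ;;      # immediate, all hands
    high)     echo 60 ;;     # within 1 hour
    medium)   echo 240 ;;    # within 4 hours
    low)      echo 1440 ;;   # next business day, approximated as 24h
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

A pager integration can then compute the deadline as detection time plus `response_sla_minutes` and escalate when it passes.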
Containment
Contain first. Investigate second. Containment strategies for cloud environments:
Network isolation of a Kubernetes pod:
# Apply a deny-all network policy to the compromised namespace
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-isolate
  namespace: compromised-ns
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF
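Responders should not hand-edit YAML under pressure. The same policy can be parameterized for any namespace with a small wrapper (the function name is illustrative; output is piped into `kubectl apply -f -`):

```shell
# Emit a deny-all NetworkPolicy for an arbitrary namespace so the
# isolation step can be scripted identically across incidents.
isolation_policy() {
  local ns="$1"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: emergency-isolate
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF
}
# Usage: isolation_policy compromised-ns | kubectl apply -f -
```

Note that NetworkPolicy is only enforced if the cluster's CNI plugin supports it; verify this before an incident, not during one.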
AWS instance isolation:
# Create isolation security group (no ingress, no egress)
SG_ID=$(aws ec2 create-security-group \
  --group-name forensic-isolation \
  --description "Incident response isolation" \
  --vpc-id vpc-12345 \
  --query 'GroupId' --output text)
# Replace all security groups on compromised instance
aws ec2 modify-instance-attribute \
  --instance-id i-compromised \
  --groups "$SG_ID"
# Disable the compromised IAM role with an explicit deny-all inline policy
aws iam put-role-policy \
  --role-name compromised-role \
  --policy-name deny-all \
  --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Deny","Action":"*","Resource":"*"}]}'
Credential rotation:
# Rotate all secrets in the affected scope
vault lease revoke -prefix secret/production/
# Revoke every token issued through the Kubernetes auth mount
vault token revoke -mode path auth/kubernetes/
# Invalidate active sessions
aws iam delete-access-key --user-name compromised-user --access-key-id AKIA...
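An alternative to outright deletion is to deactivate keys, which preserves the key ID for the investigation timeline. A sketch for enumerating all of a user's keys (`jq` assumed available; the user name is illustrative):

```shell
# Extract access key IDs from `aws iam list-access-keys` JSON output.
list_key_ids() {
  jq -r '.AccessKeyMetadata[].AccessKeyId'
}
# Usage:
#   aws iam list-access-keys --user-name compromised-user | list_key_ids \
#     | xargs -I{} aws iam update-access-key --user-name compromised-user \
#         --access-key-id {} --status Inactive
```

Deactivated keys fail authentication exactly like deleted ones but can still be correlated against CloudTrail events during the investigation.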
Forensic evidence in cloud
Cloud environments are ephemeral. Containers restart. Instances get replaced. Logs rotate. Preserving evidence requires immediate action.
EBS volume snapshots
# Snapshot the root and data volumes
VOLUMES=$(aws ec2 describe-instances \
  --instance-ids i-compromised \
  --query 'Reservations[].Instances[].BlockDeviceMappings[].Ebs.VolumeId' \
  --output text)
for vol in $VOLUMES; do
  aws ec2 create-snapshot \
    --volume-id "$vol" \
    --description "Forensic snapshot - Incident INC-2026-042" \
    --tag-specifications "ResourceType=snapshot,Tags=[{Key=incident,Value=INC-2026-042}]"
done
Container forensics
# Export the filesystem of a running container
# (kubectl cp requires a tar binary inside the container)
kubectl cp compromised-ns/compromised-pod:/ ./forensic-export/ -c app
# Capture the container image for analysis
docker save $(docker inspect --format='{{.Image}}' compromised-container) > forensic-image.tar
Log preservation
# Export CloudTrail logs for the incident period
aws s3 sync s3://cloudtrail-logs/AWSLogs/123456789012/CloudTrail/ \
  ./forensic-logs/ \
  --exclude "*" \
  --include "*/2026/04/20/*" \
  --include "*/2026/04/21/*"
# Preserve API server logs (true audit logs go wherever your audit policy
# writes them; the apiserver pod logs are a fallback)
kubectl logs -n kube-system kube-apiserver-master \
  --since-time="2026-04-20T00:00:00Z" > k8s-audit.log
Store all forensic artifacts in a dedicated, write-once storage bucket with legal hold:
# Note: Object Lock can only be enabled at bucket creation
# (aws s3api create-bucket --object-lock-enabled-for-bucket ...)
aws s3api put-object-lock-configuration \
  --bucket forensic-evidence \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {
      "DefaultRetention": {
        "Mode": "COMPLIANCE",
        "Days": 365
      }
    }
  }'
Isolating compromised workloads
The decision to terminate vs isolate depends on evidence needs.
Terminate when:
- The workload is actively exfiltrating data
- You have sufficient forensic snapshots already
- Immutable infrastructure means you can rebuild identically
Isolate when:
- Memory analysis is needed (process dumps, network connections)
- The attack vector is unknown and live observation helps
- Legal counsel requires preserving the running state
For Kubernetes, isolate by relabeling the pod so it no longer matches any service selector, keeping it running but unreachable. Note that if a ReplicaSet manages the pod, removing the selector label orphans it and a clean replacement is scheduled alongside it:
# Remove the pod from the service by deleting its selector label
kubectl label pod compromised-pod app-
kubectl label pod compromised-pod quarantine=true
Post-incident IaC remediation
After eradication, rebuild from infrastructure as code. Do not patch the compromised environment in place.
# Destroy and recreate the affected infrastructure
cd terraform/environments/production
terraform apply -replace=aws_instance.api_server
# (On older Terraform versions, `terraform taint` + `terraform apply`
# achieves the same result; taint is deprecated in favor of -replace.)
# Redeploy Kubernetes workloads from clean manifests
kubectl delete namespace compromised-ns
kubectl apply -f manifests/production/
IaC remediation guarantees a known-good state. Manual patching leaves uncertainty about whether all attacker modifications were found and reverted.
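That guarantee can be verified mechanically: `terraform plan -detailed-exitcode` exits 0 when live infrastructure matches the code, 2 when drift remains, and 1 on error. A small interpreter sketch (function name is illustrative):

```shell
# Interpret terraform's -detailed-exitcode after a rebuild:
# 0 = no changes (clean state), 2 = drift remains, anything else = error.
drift_status() {
  case "$1" in
    0) echo clean ;;
    2) echo drift ;;
    *) echo error ;;
  esac
}
# Usage:
#   terraform plan -detailed-exitcode -input=false >/dev/null
#   echo "post-rebuild: $(drift_status $?)"
```

A "drift" result after recovery means the environment still differs from the code: either the fix was never committed, or an attacker modification survived.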
Update the IaC to include the fix for the vulnerability that was exploited:
# Add the security control that was missing
resource "aws_security_group" "api" {
  # ... existing rules ...

  # Added after INC-2026-042: restrict egress to known endpoints only
  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = var.allowed_egress_cidrs
  }
}
Communication
Security incidents require structured communication across multiple audiences:
Internal engineering: Real-time updates in a dedicated incident channel. Use a bot to log all actions with timestamps.
Leadership: Regular summaries (every 2-4 hours during active incidents) covering scope, impact, containment status, and estimated resolution time.
Legal and compliance: Involve immediately for incidents involving personal data. Regulatory notification deadlines vary by jurisdiction: GDPR requires 72-hour notification, HIPAA requires 60 days, state breach notification laws vary.
Customers: Communicate what happened, what data was affected, and what you are doing about it. Transparency builds trust. Vague statements erode it.
Incident communication template
## Incident Update: INC-2026-042
**Status:** Contained
**Severity:** High
**Time detected:** 2026-04-20 14:32 UTC
**Time contained:** 2026-04-20 15:08 UTC
### Summary
Unauthorized access to the production API database was detected
through anomalous query patterns flagged by GuardDuty.
### Scope
- Production database read access for approximately 36 minutes
- No evidence of data modification or exfiltration
### Actions Taken
1. Database credentials rotated
2. Compromised IAM role permissions revoked
3. Network access restricted to forensic isolation
4. Forensic snapshots captured
### Next Steps
- Complete log analysis for full scope determination
- Review and strengthen database access controls
- Update incident timeline for legal review
Post-incident review
Schedule the review 3-5 days after resolution. Use a structured format:
- Timeline reconstruction. Minute-by-minute account of what happened.
- Detection analysis. How was it found? Could it have been found sooner?
- Response evaluation. What went well? What was slow or confused?
- Root cause. What vulnerability or misconfiguration enabled the incident?
- Action items. Specific, assigned, time-bound improvements.
Track action items to completion. The review is worthless if recommendations are never implemented.
The response and containment times quoted throughout this post are targets for a high-severity incident handled by a prepared team; actual times vary significantly with incident complexity and team readiness. Regular tabletop exercises shorten them by building muscle memory.
What comes next
This concludes the DevSecOps series. You now have a comprehensive understanding of security across the entire software delivery lifecycle: from shift-left practices and code scanning through supply chain security, container and Kubernetes hardening, secrets management, cloud posture management, compliance automation, and incident response. The next step is implementation. Pick the area with the highest risk in your organization and start there.