
Managing drift and compliance

In this series (10 parts)
  1. Introduction to Infrastructure as Code
  2. Terraform fundamentals
  3. Terraform state management
  4. Terraform modules
  5. Terraform in CI/CD
  6. Ansible fundamentals
  7. Ansible roles and best practices
  8. Packer for machine images
  9. CloudFormation and CDK
  10. Managing drift and compliance

You write Terraform. You review pull requests. You run terraform apply through CI. Everything is codified. Then someone logs into the AWS console and adds an ingress rule to a security group because “it was urgent.” Your code now describes a world that no longer exists. That gap between what your code says and what actually runs is drift, and it is one of the biggest threats to infrastructure as code.

What drift is

Drift is any difference between your declared infrastructure state and the actual state of running resources. It can happen for many reasons.

Manual changes through cloud consoles are the most common cause. Auto-scaling events create resources that Terraform does not manage. Third-party tools modify configurations. Even AWS itself sometimes changes service defaults, which can show up as differences nobody on your team made.

The consequences range from annoying to catastrophic. A security group with an unexpected open port is a compliance violation. A modified IAM policy could grant excessive permissions. A changed database parameter group could degrade performance without anyone understanding why.

flowchart TD
  A[Declared state in code] --> B{Matches actual state?}
  B -->|Yes| C[No drift - safe]
  B -->|No| D[Drift detected]
  D --> E[Security risk]
  D --> F[Failed deployments]
  D --> G[Compliance violations]
  D --> H[Debugging nightmares]

Drift creates a gap between what you think is running and what is actually running.

Detecting drift with Terraform

The simplest way to detect drift is to run terraform plan against code nobody has changed. Any changes it reports were made outside Terraform.

terraform plan -detailed-exitcode

Exit code 0 means no changes. Exit code 1 means an error. Exit code 2 means there are changes to apply. Schedule this in CI to catch drift early:

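# .github/workflows/drift-check.yml
name: Drift Detection
on:
  schedule:
    - cron: '0 */6 * * *'  # every 6 hours

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    env:
      # terraform init also needs credentials when the backend is remote (e.g. S3)
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Initialize
        run: terraform init -input=false

      - name: Check for drift
        id: plan
        # Exit codes: 0 = no changes, 1 = error, 2 = changes (drift)
        run: terraform plan -detailed-exitcode -no-color -input=false
        continue-on-error: true

      - name: Notify on drift
        # The setup-terraform wrapper exposes the exit code as a step output.
        # Only exit code 2 means drift, so plan errors are not reported as drift.
        if: steps.plan.outputs.exitcode == '2'
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
        run: |
          curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d '{"text":"Drift detected in production infrastructure. Check the workflow run for details."}'

      - name: Fail on plan error
        # continue-on-error would otherwise hide a broken plan
        if: steps.plan.outputs.exitcode == '1'
        run: exit 1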

This workflow runs every six hours. When terraform plan finds unexpected changes, it sends a Slack notification. The team investigates and either codifies the change in Terraform (importing any resources that were created by hand) or runs terraform apply to revert the manual modification.
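The same exit-code logic applies outside GitHub Actions. As a minimal sketch, assuming terraform is on the PATH and the working directory is already initialized, a small Python wrapper can tell drift apart from errors. The function names here are hypothetical.

```python
import subprocess

# Meaning of `terraform plan -detailed-exitcode` exit codes
EXIT_CODE_MEANING = {0: "no_drift", 1: "error", 2: "drift"}


def classify_plan_exit_code(code: int) -> str:
    """Map a -detailed-exitcode value to a drift status."""
    # Anything other than 0, 1, or 2 is unexpected, so treat it as an error
    return EXIT_CODE_MEANING.get(code, "error")


def check_drift(workdir: str = ".") -> str:
    """Run terraform plan in workdir and return its drift status."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-no-color", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

Keeping "drift" and "error" as separate outcomes is the point: a plan that fails because of expired credentials should page someone differently from a plan that found a console change.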

Limitations of terraform plan

Terraform only knows about resources it manages. If someone creates a security group outside of Terraform, terraform plan will not report it. You need additional tools to catch unmanaged resources.
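To see exactly what Terraform does track, you can list the managed resources in its state. A minimal sketch, assuming the JSON produced by terraform show -json (which nests resources under values.root_module and its child_modules); the function names are hypothetical. Anything missing from this set is invisible to terraform plan.

```python
import json
from collections.abc import Iterator


def _walk(module: dict) -> Iterator[dict]:
    """Yield resources from a module and all of its child modules."""
    yield from module.get("resources", [])
    for child in module.get("child_modules", []):
        yield from _walk(child)


def managed_resources(state_json: str) -> set[tuple[str, str]]:
    """Return (type, id) pairs for managed resources in `terraform show -json` output."""
    doc = json.loads(state_json)
    # An empty state has no "values" key at all
    root = doc.get("values", {}).get("root_module", {})
    pairs = set()
    for res in _walk(root):
        if res.get("mode") != "managed":
            continue  # skip data sources, which Terraform reads but does not own
        res_id = res.get("values", {}).get("id")
        if res_id:
            pairs.add((res["type"], res_id))
    return pairs
```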

AWS Config

AWS Config continuously records resource configurations and evaluates them against rules. Unlike Terraform, it sees every resource type its configuration recorder is set to record in the account and region, regardless of how each resource was created.

# cloudformation/config-rules.yml
Resources:
  SecurityGroupOpenIngress:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: restricted-ssh
      Description: Checks that security groups do not allow unrestricted SSH
      Source:
        Owner: AWS
        SourceIdentifier: INCOMING_SSH_DISABLED
      Scope:
        ComplianceResourceTypes:
          - AWS::EC2::SecurityGroup

  EncryptedVolumes:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: encrypted-volumes
      Description: Checks that EBS volumes are encrypted
      Source:
        Owner: AWS
        SourceIdentifier: ENCRYPTED_VOLUMES
      Scope:
        ComplianceResourceTypes:
          - AWS::EC2::Volume

  RequiredTags:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: required-tags
      Description: Checks that resources have required tags
      InputParameters:
        tag1Key: Environment
        tag2Key: Owner
        tag3Key: CostCenter
      Source:
        Owner: AWS
        SourceIdentifier: REQUIRED_TAGS
      Scope:
        ComplianceResourceTypes:
          - AWS::EC2::Instance
          - AWS::RDS::DBInstance
          - AWS::S3::Bucket

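AWS Config evaluates these rules whenever an in-scope resource's configuration changes, and periodic rules also run on a schedule. Rules only produce results once a configuration recorder is enabled in the account and region. Non-compliant resources appear in the Config dashboard and can trigger automatic remediation through SSM Automation documents.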
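To pull the same results into a script or report, you can query Config's evaluation results. A minimal sketch, assuming boto3 and credentials allowed to call config:GetComplianceDetailsByConfigRule; the helper name is hypothetical, and the client is passed in so it can be swapped for a stub in tests.

```python
from typing import Any


def noncompliant_resources(config_client: Any, rule_name: str) -> list[tuple[str, str]]:
    """Return (resource type, resource id) pairs that fail an AWS Config rule.

    config_client is expected to behave like boto3.client("config").
    """
    results = []
    kwargs = {"ConfigRuleName": rule_name, "ComplianceTypes": ["NON_COMPLIANT"]}
    while True:
        page = config_client.get_compliance_details_by_config_rule(**kwargs)
        for item in page.get("EvaluationResults", []):
            qualifier = item["EvaluationResultIdentifier"]["EvaluationResultQualifier"]
            results.append((qualifier["ResourceType"], qualifier["ResourceId"]))
        # Results are paginated; keep following NextToken until it is absent
        token = page.get("NextToken")
        if not token:
            return results
        kwargs["NextToken"] = token
```

Usage would look like noncompliant_resources(boto3.client("config"), "restricted-ssh").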

Driftctl

Driftctl is an open-source tool purpose-built for drift detection. It compares your Terraform state against the actual cloud state and reports three categories: managed resources (in Terraform), unmanaged resources (exist but not in Terraform), and missing resources (in Terraform but deleted).
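Conceptually, the three categories are set operations on resource identities. The sketch below is a simplified illustration, not driftctl's implementation. It assumes each resource is identified by a (type, id) pair and that you already have both sets; enumerating what actually exists in the cloud account is the hard part that driftctl does for you. The function name is hypothetical.

```python
def classify_resources(
    state: set[tuple[str, str]], cloud: set[tuple[str, str]]
) -> dict[str, set[tuple[str, str]]]:
    """Split resources into managed, unmanaged, and missing.

    state: (type, id) pairs recorded in Terraform state
    cloud: (type, id) pairs that actually exist in the account
    """
    return {
        "managed": state & cloud,    # in Terraform and present in the cloud
        "unmanaged": cloud - state,  # present in the cloud, unknown to Terraform
        "missing": state - cloud,    # in Terraform, but deleted from the cloud
    }
```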

# Install driftctl
curl -L https://github.com/snyk/driftctl/releases/latest/download/driftctl_linux_amd64 -o driftctl
chmod +x driftctl

# Scan for drift
./driftctl scan

Sample output:

Found resources not covered by IaC:
  aws_security_group_rule:
    - sgr-0abc123def456 (manually added SSH rule)
  aws_iam_policy:
    - arn:aws:iam::123456789012:policy/emergency-access

Found 42 resource(s)
 - 38 covered by IaC
 - 2 not covered by IaC
 - 2 missing on cloud provider

Coverage: 90%

The coverage percentage is a powerful metric. Track it over time to measure how well your team maintains infrastructure discipline.

Measuring IaC coverage on every run makes regressions visible and gives the team a concrete number to be accountable for.
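The sample numbers are consistent with coverage being the share of found resources that are covered by IaC: 38 of 42 is about 90%. Assuming that formula (it matches the sample output but is not taken from driftctl's source), the arithmetic behind a trend chart is a one-liner. The function name is hypothetical.

```python
def iac_coverage(covered: int, not_covered: int, missing: int) -> int:
    """Percentage of found resources that are covered by IaC, rounded down."""
    total = covered + not_covered + missing
    if total == 0:
        return 100  # assumption: an empty scan counts as fully covered
    return covered * 100 // total
```

For the sample above, iac_coverage(38, 2, 2) gives 90.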

You can filter scans to specific resource types:

./driftctl scan --filter "Type=='aws_security_group'"

Integrate driftctl into CI alongside terraform plan for comprehensive drift detection.

Policy as code with OPA

Detecting drift is reactive. Policy as code is proactive. Open Policy Agent (OPA) evaluates Terraform plans against policies before changes reach your infrastructure.

OPA uses a language called Rego. Here is a policy that denies public S3 buckets:

# policy/s3.rego
package terraform.s3

import rego.v1

deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket_public_access_block"
    resource.change.after.block_public_acls == false
    msg := sprintf("S3 bucket %s must block public ACLs", [resource.address])
}

deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    not has_public_access_block(resource.address)
    msg := sprintf("S3 bucket %s must have a public access block", [resource.address])
}

# A public access block counts for a bucket only if its bucket argument
# references that bucket (for example bucket = aws_s3_bucket.data.id).
# The plan JSON records these references under configuration; this check
# covers root-module resources without count or for_each.
has_public_access_block(bucket_address) if {
    block := input.configuration.root_module.resources[_]
    block.type == "aws_s3_bucket_public_access_block"
    bucket_address in block.expressions.bucket.references
}

A policy that restricts instance types to approved sizes:

# policy/ec2.rego
package terraform.ec2

import rego.v1

allowed_instance_types := {
    "t3.micro", "t3.small", "t3.medium",
    "m5.large", "m5.xlarge",
    "r5.large", "r5.xlarge",
}

deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_instance"
    instance_type := resource.change.after.instance_type
    not instance_type in allowed_instance_types
    msg := sprintf(
        "Instance %s uses type %s which is not in the approved list",
        [resource.address, instance_type]
    )
}
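To make the input shape concrete, here is the same check written in Python against the plan JSON that terraform show -json produces. It only illustrates what the Rego rule walks over (resource_changes, with each entry's type, address, and change.after); it is not a replacement for OPA, and the function name is hypothetical.

```python
import json

# Mirrors allowed_instance_types in policy/ec2.rego
ALLOWED_INSTANCE_TYPES = {
    "t3.micro", "t3.small", "t3.medium",
    "m5.large", "m5.xlarge",
    "r5.large", "r5.xlarge",
}


def deny_instance_types(plan_json: str) -> list[str]:
    """Return violation messages for aws_instance changes in a plan JSON."""
    plan = json.loads(plan_json)
    messages = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_instance":
            continue
        after = rc.get("change", {}).get("after")
        if after is None:
            continue  # resource is being deleted; nothing to check
        if "instance_type" not in after:
            continue  # value unknown until apply; the Rego rule is undefined here too
        instance_type = after["instance_type"]
        if instance_type not in ALLOWED_INSTANCE_TYPES:
            messages.append(
                f"Instance {rc['address']} uses type {instance_type} "
                "which is not in the approved list"
            )
    return messages
```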

Evaluate policies against a Terraform plan:

# Generate plan JSON
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json

# Evaluate with OPA
opa eval --data policy/ --input plan.json "data.terraform.s3.deny" --format pretty
opa eval --data policy/ --input plan.json "data.terraform.ec2.deny" --format pretty
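The pretty format is meant for humans. For a CI gate, one option is --format json plus a small script that fails when any deny message comes back. A minimal sketch, assuming opa is on the PATH and that each query targets a deny set like the ones above; the helper names are hypothetical.

```python
import json
import subprocess


def opa_deny_messages(opa_output: str) -> list[str]:
    """Collect deny messages from `opa eval --format json` output."""
    doc = json.loads(opa_output)
    messages = []
    # opa omits "result" entirely when the query is undefined
    for result in doc.get("result", []):
        for expr in result.get("expressions", []):
            messages.extend(expr.get("value") or [])
    return messages


def evaluate(query: str, policy_dir: str = "policy/", plan: str = "plan.json") -> list[str]:
    """Run opa eval for one deny query and return its messages."""
    out = subprocess.run(
        ["opa", "eval", "--data", policy_dir, "--input", plan, query, "--format", "json"],
        capture_output=True,
        text=True,
        check=True,  # raise if opa itself fails (bad policy, missing file)
    ).stdout
    return opa_deny_messages(out)
```

A CI step could call evaluate("data.terraform.s3.deny") and exit non-zero if the returned list is not empty.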

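In CI, use Conftest, which wraps OPA with a friendlier interface. Conftest only evaluates the main package by default, so pass --all-namespaces for policies in packages such as terraform.s3:

conftest test plan.json --policy policy/ --all-namespaces

flowchart LR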
  A[Developer pushes code] --> B[CI: terraform plan]
  B --> C[Generate plan JSON]
  C --> D[OPA / Conftest evaluate]
  D -->|Pass| E[terraform apply]
  D -->|Fail| F[Block merge, notify dev]

OPA evaluates Terraform plans before apply. Non-compliant changes never reach production.

Checkov for IaC scanning

Checkov is a static analysis tool that scans Terraform, CloudFormation, Kubernetes, and Dockerfile configurations for security misconfigurations. It ships with over 1,000 built-in checks.

pip install checkov

# Scan Terraform directory
checkov -d ./terraform

# Scan a specific file
checkov -f ./terraform/main.tf

# Output as JSON for CI consumption
checkov -d ./terraform -o json > checkov-results.json

Sample output:

Passed checks: 14, Failed checks: 3, Skipped checks: 0

Check: CKV_AWS_18: "Ensure the S3 bucket has access logging enabled"
  FAILED for resource: aws_s3_bucket.data
  File: /main.tf:45-52

Check: CKV_AWS_145: "Ensure that S3 Bucket is encrypted with KMS"
  FAILED for resource: aws_s3_bucket.data
  File: /main.tf:45-52

Many checks map to CIS benchmarks or other published security guidance. Fix the issue and re-run Checkov to confirm compliance.
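With -o json, CI scripts can read the failures directly. A minimal sketch, assuming the report layout of recent Checkov versions: one object per framework (or a list of them when several frameworks are scanned), each with a results.failed_checks list whose entries carry check_id, resource, and file_path. Verify the field names against your Checkov version; the function name is hypothetical.

```python
import json


def failed_checks(report_json: str) -> list[tuple[str, str, str]]:
    """Return (check_id, resource, file_path) for each failed Checkov check."""
    data = json.loads(report_json)
    # A single framework produces one object; several produce a list
    reports = data if isinstance(data, list) else [data]
    failures = []
    for report in reports:
        for check in report.get("results", {}).get("failed_checks", []):
            failures.append((check["check_id"], check["resource"], check["file_path"]))
    return failures
```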

Suppress false positives with inline comments:

resource "aws_s3_bucket" "public_assets" {
  #checkov:skip=CKV_AWS_18:Access logging not needed for public static assets
  bucket = "my-public-assets"
}

Add Checkov to your CI pipeline:

- name: Run Checkov
  # Consider pinning to a release tag or commit SHA instead of a moving branch
  uses: bridgecrewio/checkov-action@master
  with:
    directory: ./terraform
    framework: terraform
    soft_fail: false

Compliance frameworks

Policies do not exist in a vacuum. They map to compliance frameworks that auditors care about.

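CIS Benchmarks provide prescriptive hardening guidelines. The CIS AWS Foundations Benchmark covers IAM, logging, monitoring, networking, and storage. Many of its recommendations have corresponding Checkov checks, though account-level items such as MFA on the root user cannot be verified from IaC alone.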

NIST 800-53 defines security controls for federal systems. Controls like AC-2 (Account Management) and SC-7 (Boundary Protection) translate into specific infrastructure policies: no public security groups, encrypted storage, MFA on root accounts.

SOC 2 focuses on availability, security, processing integrity, confidentiality, and privacy. Infrastructure policies support SOC 2 by ensuring encryption at rest and in transit, access logging, and change management through IaC.

Map your OPA policies and Checkov checks to framework controls:

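# policy/compliance.rego
package terraform.compliance

import rego.v1

# CIS AWS 2.1.1 - Ensure S3 Bucket Policy is set to deny HTTP requests
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    not has_ssl_enforcement(resource.address)
    msg := sprintf(
        "CIS 2.1.1: S3 bucket %s must enforce SSL-only access",
        [resource.address]
    )
}

# The bucket policy must reference this specific bucket, not merely exist
# somewhere in the plan. Only root-module resources are linked here.
has_ssl_enforcement(bucket_address) if {
    cfg := input.configuration.root_module.resources[_]
    cfg.type == "aws_s3_bucket_policy"
    bucket_address in cfg.expressions.bucket.references
    policy := input.resource_changes[_]
    policy.address == cfg.address
    # If the policy is only known after apply, this is undefined and the rule denies
    contains(policy.change.after.policy, "aws:SecureTransport")
}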
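A substring match on aws:SecureTransport also accepts a policy that merely mentions the key, for example in an Allow statement. A stricter check parses the policy document and looks for a Deny statement whose Bool condition sets aws:SecureTransport to false. A minimal Python sketch of that logic; the function name is hypothetical, and it only inspects the Bool operator, not every way a policy could express the same intent.

```python
import json


def enforces_tls(policy_json: str) -> bool:
    """True if a bucket policy denies requests where aws:SecureTransport is false."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]  # a single statement may be written as an object
    for stmt in statements:
        if stmt.get("Effect") != "Deny":
            continue
        bool_cond = stmt.get("Condition", {}).get("Bool", {})
        value = bool_cond.get("aws:SecureTransport")
        # Condition values may be a string, a boolean, or a list of either
        values = value if isinstance(value, list) else [value]
        if any(str(v).lower() == "false" for v in values):
            return True
    return False
```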

For a deeper look at integrating compliance into your deployment pipeline, see compliance as code.

Bringing it all together

A mature drift and compliance pipeline combines reactive detection with proactive prevention:

flowchart TD
  subgraph Prevention
      A[Checkov scan] --> B[OPA policy check]
      B --> C[terraform apply]
  end
  subgraph Detection
      D[Scheduled terraform plan] --> E[Drift alert]
      F[AWS Config rules] --> G[Non-compliance alert]
      H[Driftctl scan] --> I[Coverage report]
  end
  E --> J[Remediation]
  G --> J
  I --> J
  J --> K[Update Terraform code]
  K --> A

Prevention catches problems before deployment. Detection catches problems that slip through. Both feed into remediation.

The prevention side runs in CI on every pull request. Checkov scans for known misconfigurations. OPA evaluates custom organizational policies. Only code that passes both reaches terraform apply.

The detection side runs on a schedule. Terraform plan catches drift in managed resources. Driftctl catches unmanaged resources. AWS Config catches compliance violations across every resource it records, however that resource was created. Alerts route to the team responsible for remediation.

What comes next

You now have a complete toolkit for maintaining infrastructure discipline: drift detection with three complementary tools, policy enforcement with OPA, static analysis with Checkov, and mapping to compliance frameworks. Together with the Terraform, Ansible, Packer, and CloudFormation skills from earlier articles, you can build, configure, bake, deploy, and govern infrastructure entirely through code.
