claude-skills-reference/engineering/terraform-patterns/references/state-management.md
2026-03-15 23:29:01 +01:00

Terraform State Management Reference

Backend Configuration Patterns

AWS: S3 with DynamoDB Locking

terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "project/env/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
    # Optional: KMS key for encryption
    # kms_key_id   = "arn:aws:kms:us-east-1:ACCOUNT:key/KEY_ID"
  }
}

Prerequisites:

# Bootstrap these resources manually or with a separate Terraform config
resource "aws_s3_bucket" "state" {
  bucket = "mycompany-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

GCP: Google Cloud Storage

terraform {
  backend "gcs" {
    bucket = "mycompany-terraform-state"
    prefix = "project/env"
  }
}

Key features:

  • Native locking (no separate lock table needed)
  • Object versioning for state history
  • IAM-based access control
  • Encryption at rest by default

Azure: Blob Storage

terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "mycompanytfstate"
    container_name       = "tfstate"
    key                  = "project/env/terraform.tfstate"
  }
}

Key features:

  • Native blob locking
  • Encryption at rest with Microsoft-managed or customer-managed keys
  • RBAC-based access control

Terraform Cloud / Enterprise

terraform {
  cloud {
    organization = "mycompany"
    workspaces {
      name = "project-dev"
    }
  }
}

Key features:

  • Built-in state locking, encryption, and versioning
  • RBAC and team-based access control
  • Remote execution (plan/apply run in TF Cloud)
  • Sentinel policy-as-code integration
  • Cost estimation on plans

Environment Isolation Strategies

Strategy 1: Separate Directories

infrastructure/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── backend.tf      # key = "project/dev/terraform.tfstate"
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── main.tf
│   │   ├── backend.tf      # key = "project/staging/terraform.tfstate"
│   │   └── terraform.tfvars
│   └── prod/
│       ├── main.tf
│       ├── backend.tf      # key = "project/prod/terraform.tfstate"
│       └── terraform.tfvars
└── modules/
    └── ...

Pros:

  • Complete isolation — a mistake in dev can't affect prod
  • Different provider versions per environment
  • Different module versions per environment (pin prod, iterate in dev)
  • Clear audit trail — who changed what, where

Cons:

  • Some duplication across environment directories
  • Must update modules in each environment separately
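
The keys in the backend.tf comments above all follow one pattern, so the per-environment duplication can at least be kept mechanical. A minimal sketch, assuming a hypothetical helper named state_key (not part of Terraform) and the three environments from the directory tree:

```shell
#!/bin/sh
# Hypothetical helper (not part of Terraform): builds the state key used in
# each environment's backend.tf, following the project/<env>/terraform.tfstate
# layout shown in the directory tree above.
# $1 = project name, $2 = environment name
state_key() {
  case "$2" in
    dev|staging|prod) ;;   # environments from the tree above
    *) echo "unknown environment: $2" >&2; return 1 ;;
  esac
  printf '%s/%s/terraform.tfstate\n' "$1" "$2"
}
```

For example, state_key project prod prints project/prod/terraform.tfstate. If the key is left out of backend.tf, the result could be supplied at init time with terraform init -backend-config="key=...".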

Strategy 2: Terraform Workspaces

# Single directory, multiple workspaces
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "project/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

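# With the S3 backend's default workspace_key_prefix ("env:"),
# non-default workspaces store their state at:
# env:/dev/project/terraform.tfstate
# env:/staging/project/terraform.tfstate
# env:/prod/project/terraform.tfstate

# Shell: create the workspace once, then select it before each run
terraform workspace new dev
terraform workspace select dev
terraform plan -var-file="env/dev.tfvars"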

Pros:

  • Less duplication — single set of .tf files
  • Quick to switch between environments
  • Built-in workspace support in backends

Cons:

  • Shared code means a bug affects all environments simultaneously
  • Can't have different provider versions per workspace
  • Easy to accidentally apply to wrong workspace
  • Less isolation than separate directories
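
Applying to the wrong workspace can be made harder with a small guard in front of apply. This is a sketch only: check_workspace is a hypothetical helper, not a Terraform feature, and the env/prod.tfvars path simply extends the env/dev.tfvars naming used above.

```shell
#!/bin/sh
# Hypothetical guard (not a Terraform feature): refuse to continue unless the
# selected workspace matches the one the caller expects.
# $1 = expected workspace, $2 = current workspace
check_workspace() {
  if [ "$1" != "$2" ]; then
    echo "refusing to run: expected workspace '$1', current is '$2'" >&2
    return 1
  fi
}

# Usage (assumes terraform is on PATH and the directory is initialized):
#   check_workspace prod "$(terraform workspace show)" \
#     && terraform apply -var-file="env/prod.tfvars"
```

Passing the current workspace in as an argument keeps the check itself free of any Terraform call, which makes it easy to exercise in CI without credentials.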

Strategy 3: Terragrunt (DRY Configuration)

infrastructure/
├── terragrunt.hcl          # Root — defines remote state pattern
├── modules/
│   └── vpc/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── dev/
│   ├── terragrunt.hcl      # env = "dev"
│   └── vpc/
│       └── terragrunt.hcl  # inputs for dev VPC
├── staging/
│   └── ...
└── prod/
    └── ...

# Root terragrunt.hcl
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "mycompany-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

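# dev/vpc/terragrunt.hcl
include "root" {
  # Pull in the root configuration so its remote_state block applies here
  path = find_in_parent_folders()
}

terraform {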
  source = "../../modules/vpc"
}

inputs = {
  environment = "dev"
  vpc_cidr    = "10.0.0.0/16"
}

Pros:

  • Maximum DRY — define module once, parameterize per environment
  • Automatic state key generation from directory structure
  • Dependency management between modules (dependency blocks)
  • run-all for applying multiple modules at once

Cons:

  • Additional tool dependency (Terragrunt)
  • Learning curve
  • Debugging can be harder (generated files)
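
To make the automatic key generation concrete: path_relative_to_include() resolves to the child directory's path relative to the root configuration, so dev/vpc ends up with the key dev/vpc/terraform.tfstate. The shell function below is only a rough emulation for illustration; terragrunt_state_key is a hypothetical name, and it assumes the child directory sits underneath the root directory.

```shell
#!/bin/sh
# Rough emulation of the key the root remote_state block would generate.
# $1 = directory holding the root terragrunt.hcl
# $2 = directory of the child terragrunt.hcl (assumed to be under $1)
terragrunt_state_key() {
  root="${1%/}"             # drop a trailing slash, if any
  child="${2%/}"
  rel="${child#"$root"/}"   # strip the root prefix, e.g. dev/vpc
  printf '%s/terraform.tfstate\n' "$rel"
}
```

Calling terragrunt_state_key infrastructure infrastructure/dev/vpc prints dev/vpc/terraform.tfstate, which matches the layout in the tree above.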

State Migration Patterns

Local to Remote (S3)

# 1. Add backend configuration to backend.tf
# 2. Run init with migration flag
terraform init -migrate-state

# Terraform will prompt:
# "Do you want to copy existing state to the new backend?"
# Answer: yes

Between Remote Backends

# 1. Back up the current state (use a file name the local backend does not write itself)
terraform state pull > pre-migration.tfstate

# 2. Update backend configuration in backend.tf

# 3. Reinitialize with migration
terraform init -migrate-state

# 4. Verify
terraform plan  # Should show no changes
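
Besides an empty plan, the migrated state can be compared with the backup by its lineage, the identifier stored at the top level of the state JSON, which should normally survive a migration unchanged. A minimal sketch, assuming pretty-printed snapshots saved with terraform state pull; the helper names and file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two state snapshots saved with
# `terraform state pull`. They read the first "lineage" field with sed,
# so no JSON tooling is assumed; this relies on pretty-printed output.
state_lineage() {
  sed -n 's/.*"lineage"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Succeeds only if both files carry the same, non-empty lineage.
same_lineage() {
  a="$(state_lineage "$1")"
  b="$(state_lineage "$2")"
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (file names are illustrative):
#   terraform state pull > before.tfstate   # before changing the backend
#   terraform init -migrate-state
#   terraform state pull > after.tfstate
#   same_lineage before.tfstate after.tfstate && echo "same state lineage"
```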

State Import (Existing Resources)

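# Import a single resource
terraform import aws_instance.web i-1234567890abcdef0

# Import a resource instance created with for_each (quote the address for the shell)
terraform import 'aws_subnet.public["us-east-1a"]' subnet-0123456789abcdef0

# Declarative import (Terraform 1.5+): add an import block to the configuration
# (HCL, not a CLI command), then run terraform plan and terraform apply
import {
  to = aws_instance.web
  id = "i-1234567890abcdef0"
}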

State Move (Refactoring)

# Rename a resource (avoids destroy/recreate)
terraform state mv aws_instance.old_name aws_instance.new_name

# Move into a module
terraform state mv aws_instance.web module.compute.aws_instance.web

# Move between local state files (-state-out is a legacy option for local state only)
terraform state mv -state-out=other.tfstate aws_instance.web aws_instance.web

State Locking

Why Locking Matters

Without locking, two concurrent terraform apply runs can corrupt state. The second apply reads stale state and may create duplicate resources or lose track of existing ones.

Lock Behavior by Backend

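| Backend | Lock Mechanism | Auto-Lock | Force Unlock |
| --- | --- | --- | --- |
| S3 | DynamoDB table | Yes (if table configured) | terraform force-unlock LOCK_ID |
| GCS | Native (lock object in the bucket) | Yes | terraform force-unlock LOCK_ID |
| Azure Blob | Native blob lease | Yes | terraform force-unlock LOCK_ID |
| TF Cloud | Built-in | Always | Via UI or API |
| Consul | Key-value lock | Yes | terraform force-unlock LOCK_ID |
| Local | OS file lock with .terraform.tfstate.lock.info | Yes (single machine) | Stop the process holding the lock |

The local backend's lock file is .terraform.tfstate.lock.info. The similarly named .terraform.lock.hcl is the provider dependency lock file and plays no part in state locking.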

Force Unlock (Emergency Only)

# Only use when you're certain no other process is running
terraform force-unlock LOCK_ID

# The LOCK_ID is shown in the error message when lock fails:
# Error: Error locking state: Error acquiring the state lock
# Lock Info:
#   ID:        12345678-abcd-1234-abcd-1234567890ab
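
When the unlock has to be scripted, the ID can be pulled out of the captured error text rather than copied by hand. The sketch below assumes the message layout shown above, which may differ between Terraform versions; extract_lock_id is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: read Terraform's error text on stdin and print the
# first "ID:" value from the Lock Info section. Leading decoration
# (spaces, comment markers, box-drawing characters) is skipped.
extract_lock_id() {
  sed -n 's/^[^[:alnum:]]*ID:[[:space:]]*\([^[:space:]]*\).*/\1/p' | head -n 1
}

# Usage (only after confirming that no other run is active):
#   id="$(terraform plan 2>&1 | extract_lock_id)"
#   [ -n "$id" ] && terraform force-unlock "$id"
```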

State Security Best Practices

1. Encrypt at Rest

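# S3: server-side encryption with a customer-managed KMS key
terraform {
  backend "s3" {
    # bucket, key, and region as in the backend configuration above
    encrypt    = true
    kms_key_id = "arn:aws:kms:us-east-1:ACCOUNT:key/KEY_ID"
  }
}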

2. Restrict Access

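{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::mycompany-terraform-state"
    },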
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::mycompany-terraform-state/project/*",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalTag/Team": "platform"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:ACCOUNT:table/terraform-locks"
    }
  ]
}

3. Enable Versioning (State History)

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

Versioning lets you recover from state corruption by restoring a previous version.

4. Audit Access

  • Enable S3 access logging or CloudTrail data events
  • Monitor for unexpected state reads (potential secret extraction)
  • State files contain sensitive values — treat them like credentials

5. Sensitive Values in State

Terraform stores resource attributes in state as plain text, including passwords, private keys, and tokens. For most resources this cannot be avoided, so mitigate the exposure by:

  • Encrypting state at rest (KMS)
  • Restricting state file access (IAM)
  • Using sensitive = true on variables and outputs (redacts values in CLI output; the values are still written to state)
  • Rotating secrets regularly (state contains the value at apply time)
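
A quick way to see that sensitive = true does not keep values out of state is to scan a pulled snapshot for secret-looking attribute names. This is a sketch under loud assumptions: count_secret_like_keys is a hypothetical helper, the name list is illustrative rather than exhaustive, and a match only shows that the attribute exists.

```shell
#!/bin/sh
# Hypothetical check: count attribute names in a state snapshot that look
# like secrets. A match means the key is present, not that it is non-empty.
# $1 = path to a state file saved with `terraform state pull`
count_secret_like_keys() {
  grep -Eo '"(password|secret|private_key|token)[a-z_]*"[[:space:]]*:' "$1" \
    | wc -l | tr -d '[:space:]'
}

# Usage (file name is illustrative; delete the snapshot afterwards):
#   terraform state pull > snapshot.tfstate
#   count_secret_like_keys snapshot.tfstate
```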

Drift Detection and Reconciliation

Detect Drift

# Plan with detailed exit code
terraform plan -detailed-exitcode
# Exit 0 = plan succeeded, no changes
# Exit 1 = error
# Exit 2 = plan succeeded with changes (drift, or configuration edits not yet applied)
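
In a scheduled drift check the exit code has to be turned into a decision. A minimal sketch, assuming a hypothetical helper named plan_status and an already initialized working directory:

```shell
#!/bin/sh
# Maps the exit code of `terraform plan -detailed-exitcode` to a short label.
# $1 = exit code of the plan command
plan_status() {
  case "$1" in
    0) echo "clean" ;;
    1) echo "error" ;;
    2) echo "changes" ;;
    *) echo "unknown" ;;
  esac
}

# Usage in a scheduled job:
#   terraform plan -detailed-exitcode -input=false >/dev/null 2>&1
#   code=$?
#   [ "$(plan_status "$code")" = "changes" ] && echo "drift or pending changes"
```

Capturing the code in a variable right after the plan matters, because any later command overwrites $?.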

Common Drift Sources

| Source | Example | Prevention |
| --- | --- | --- |
| Console changes | Someone edits SG rules in AWS Console | SCPs to restrict console access, or accept and reconcile |
| Auto-scaling | ASG launches instances not in state | Don't manage individual instances; manage ASG |
| External tools | Ansible modifies EC2 tags | Agree on ownership boundaries |
| Dependent resource changes | AMI deregistered | Use data sources to detect, lifecycle ignore_changes |

Reconciliation Options

# Option 1: Apply to restore desired state
terraform apply

# Option 2: Accept the drift by updating state to match real infrastructure
terraform apply -refresh-only

# Option 3: Ignore drift on specific attributes (HCL, added to the resource block)
resource "aws_instance" "web" {
  # ... existing arguments ...

  lifecycle {
    ignore_changes = [tags["LastModifiedBy"], ami]
  }
}

# Option 4: Import the manually-created resource
terraform import aws_security_group_rule.new sg-12345_ingress_tcp_443_443_0.0.0.0/0

Troubleshooting Checklist

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| "Error acquiring state lock" | Concurrent run or crashed process | Wait for other run to finish, or force-unlock |
| "Backend configuration changed" | Backend config modified | Run terraform init -reconfigure or -migrate-state |
| "Resource already exists" | Resource created outside Terraform | terraform import the resource |
| "No matching resource found" | Resource deleted outside Terraform | terraform state rm the resource |
| State file growing very large | Too many resources in one state | Split into separate root configurations, each with its own state |
| Slow plan/apply | Large state file, many resources | Split state, use -target for urgent changes |
| "Provider produced inconsistent result" | Provider bug or API race condition | Retry, or pin provider version |
| Workspace confusion | Applied to wrong workspace | Always check terraform workspace show before apply |