claude-skills-reference/engineering/terraform-patterns/references/state-management.md
2026-03-15 23:29:01 +01:00

# Terraform State Management Reference
## Backend Configuration Patterns
### AWS: S3 + DynamoDB (Recommended)
```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "project/env/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
    # Terraform 1.10+ also supports S3-native locking via `use_lockfile = true`,
    # which can replace the DynamoDB table.

    # Optional: KMS key for encryption
    # kms_key_id = "arn:aws:kms:us-east-1:ACCOUNT:key/KEY_ID"
  }
}
```
**Prerequisites:**
```hcl
# Bootstrap these resources manually or with a separate Terraform config
resource "aws_s3_bucket" "state" {
  bucket = "mycompany-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```
---
### GCP: Google Cloud Storage
```hcl
terraform {
  backend "gcs" {
    bucket = "mycompany-terraform-state"
    prefix = "project/env"
  }
}
```
**Key features:**
- Native locking (no separate lock table needed)
- Object versioning for state history
- IAM-based access control
- Encryption at rest by default
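
Object versioning is off by default on a new bucket, so the state bucket usually needs a one-time bootstrap before the backend is used. A minimal sketch with `gsutil`, assuming the bucket name from the backend block above; the `bootstrap_gcs_state_bucket` helper and the `us-central1` default location are illustrative assumptions:

```bash
# Hypothetical helper: create the state bucket and turn on object versioning.
# Usage: bootstrap_gcs_state_bucket mycompany-terraform-state us-central1
bootstrap_gcs_state_bucket() {
  local bucket="$1" location="${2:-us-central1}"
  # Create the bucket in the chosen location
  gsutil mb -l "$location" "gs://${bucket}" || return 1
  # Keep prior state versions so a bad write can be rolled back
  gsutil versioning set on "gs://${bucket}"
}
```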
---
### Azure: Blob Storage
```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "mycompanytfstate"
    container_name       = "tfstate"
    key                  = "project/env/terraform.tfstate"
  }
}
```
**Key features:**
- Native blob locking
- Encryption at rest with Microsoft-managed or customer-managed keys
- RBAC-based access control
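
The backend block expects the resource group, storage account, and container to exist already. A rough bootstrap sketch with the Azure CLI, assuming the names from the block above; the `bootstrap_azure_state_storage` helper, the `eastus` default, and the `Standard_LRS` SKU are illustrative choices:

```bash
# Hypothetical helper: create the resource group, storage account, and container.
# Usage: bootstrap_azure_state_storage terraform-state-rg mycompanytfstate tfstate eastus
bootstrap_azure_state_storage() {
  local rg="$1" account="$2" container="$3" location="${4:-eastus}"
  az group create --name "$rg" --location "$location" || return 1
  az storage account create --name "$account" --resource-group "$rg" \
    --location "$location" --sku Standard_LRS || return 1
  az storage container create --name "$container" --account-name "$account"
}
```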
---
### Terraform Cloud / Enterprise
```hcl
terraform {
  cloud {
    organization = "mycompany"

    workspaces {
      name = "project-dev"
    }
  }
}
```
**Key features:**
- Built-in state locking, encryption, and versioning
- RBAC and team-based access control
- Remote execution (plan/apply run in TF Cloud)
- Sentinel policy-as-code integration
- Cost estimation on plans
---
## Environment Isolation Strategies
### Strategy 1: Separate Directories (Recommended)
```
infrastructure/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── backend.tf          # key = "project/dev/terraform.tfstate"
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── main.tf
│   │   ├── backend.tf          # key = "project/staging/terraform.tfstate"
│   │   └── terraform.tfvars
│   └── prod/
│       ├── main.tf
│       ├── backend.tf          # key = "project/prod/terraform.tfstate"
│       └── terraform.tfvars
└── modules/
    └── ...
```
**Pros:**
- Complete isolation — a mistake in dev can't affect prod
- Different provider versions per environment
- Different module versions per environment (pin prod, iterate in dev)
- Clear audit trail: who changed what, where

**Cons:**
- Some duplication across environment directories
- Must update modules in each environment separately
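
Because each environment is its own root directory, a change has to be planned once per directory. A small sketch of a loop that does this with Terraform's `-chdir` option, assuming the directory layout shown above; the `plan_all_envs` helper is hypothetical:

```bash
# Hypothetical helper: run init and plan in each environment directory in turn.
# Usage: plan_all_envs infrastructure/environments
plan_all_envs() {
  local root="${1:-infrastructure/environments}" env
  for env in dev staging prod; do
    echo "== ${env} =="
    # -chdir switches the working directory before running the subcommand
    terraform -chdir="${root}/${env}" init -input=false >/dev/null || return 1
    terraform -chdir="${root}/${env}" plan -input=false || return 1
  done
}
```

The loop stops at the first failing environment, so a broken dev plan is seen before staging or prod are touched.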
### Strategy 2: Terraform Workspaces
```hcl
# Single directory, multiple workspaces
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "project/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

# State files stored at:
#   env:/dev/project/terraform.tfstate
#   env:/staging/project/terraform.tfstate
#   env:/prod/project/terraform.tfstate
```
```bash
terraform workspace new dev
terraform workspace select dev
terraform plan -var-file="env/dev.tfvars"
```
**Pros:**
- Less duplication — single set of .tf files
- Quick to switch between environments
- Built-in workspace support in backends

**Cons:**
- Shared code means a bug affects all environments simultaneously
- Can't have different provider versions per workspace
- Easy to accidentally apply to wrong workspace
- Less isolation than separate directories
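
One way to reduce the risk of applying to the wrong workspace is a guard in whatever wrapper script runs Terraform. The sketch below assumes the `env/<name>.tfvars` convention from the commands above; `assert_workspace` and `plan_env` are hypothetical helper names:

```bash
# Hypothetical helper: fail unless the active workspace matches the expected one.
# Usage: assert_workspace "$(terraform workspace show)" prod
assert_workspace() {
  local current="$1" expected="$2"
  if [ "$current" != "$expected" ]; then
    echo "Refusing to continue: workspace is '$current', expected '$expected'" >&2
    return 1
  fi
}

# Example wrapper: plan only when the active workspace matches the requested env.
plan_env() {
  local env="$1"
  assert_workspace "$(terraform workspace show)" "$env" || return 1
  terraform plan -var-file="env/${env}.tfvars"
}
```

Passing the current workspace in as an argument keeps the check itself free of Terraform calls, so it can be reused in CI or tested in isolation.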
### Strategy 3: Terragrunt (DRY Configuration)
```
infrastructure/
├── terragrunt.hcl              # Root: defines remote state pattern
├── modules/
│   └── vpc/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── dev/
│   ├── terragrunt.hcl          # env = "dev"
│   └── vpc/
│       └── terragrunt.hcl      # inputs for dev VPC
├── staging/
│   └── ...
└── prod/
    └── ...
```
```hcl
# Root terragrunt.hcl
remote_state {
  backend = "s3"

  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket         = "mycompany-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

# dev/vpc/terragrunt.hcl
terraform {
  source = "../../modules/vpc"
}

inputs = {
  environment = "dev"
  vpc_cidr    = "10.0.0.0/16"
}
```
**Pros:**
- Maximum DRY — define module once, parameterize per environment
- Automatic state key generation from directory structure
- Dependency management between modules (`dependency` blocks)
- `run-all` for applying multiple modules at once

**Cons:**
- Additional tool dependency (Terragrunt)
- Learning curve
- Debugging can be harder (generated files)
---
## State Migration Patterns
### Local to Remote (S3)
```bash
# 1. Add backend configuration to backend.tf
# 2. Run init with migration flag
terraform init -migrate-state
# Terraform will prompt:
# "Do you want to copy existing state to the new backend?"
# Answer: yes
```
### Between Remote Backends
```bash
# 1. Pull current state
terraform state pull > terraform.tfstate.backup
# 2. Update backend configuration in backend.tf
# 3. Reinitialize with migration
terraform init -migrate-state
# 4. Verify
terraform plan # Should show no changes
```
### State Import (Existing Resources)
```bash
# Import a single resource
terraform import aws_instance.web i-1234567890abcdef0

# Import with for_each key
terraform import 'aws_subnet.public["us-east-1a"]' subnet-0123456789abcdef0

# Bulk import (Terraform 1.5+): declare import blocks in a .tf file,
# then run plan/apply to bring the resources into state
cat > imports.tf <<'EOF'
import {
  to = aws_instance.web
  id = "i-1234567890abcdef0"
}
EOF
terraform plan
```
### State Move (Refactoring)
```bash
# Rename a resource (avoids destroy/recreate)
terraform state mv aws_instance.old_name aws_instance.new_name
# Move into a module
terraform state mv aws_instance.web module.compute.aws_instance.web
# Move between state files
terraform state mv -state-out=other.tfstate aws_instance.web aws_instance.web
```
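State surgery is hard to undo, so it helps to take a backup and preview the move first. A minimal sketch using `terraform state pull` and the `-dry-run` option of `terraform state mv`; the `safe_state_mv` helper and the backup file naming are assumptions for illustration:

```bash
# Hypothetical helper: back up state, preview the move, then perform it.
# Usage: safe_state_mv aws_instance.old_name aws_instance.new_name
safe_state_mv() {
  local from="$1" to="$2" backup
  backup="state-backup-$(date +%Y%m%d%H%M%S).tfstate"
  # Save a copy of the current state before changing anything
  terraform state pull > "$backup" || return 1
  # Show what would be moved without modifying state
  terraform state mv -dry-run "$from" "$to" || return 1
  terraform state mv "$from" "$to"
}
```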
---
## State Locking
### Why Locking Matters
Without locking, two concurrent `terraform apply` runs can corrupt state. The second apply reads stale state and may create duplicate resources or lose track of existing ones.
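
By default Terraform fails immediately when the lock is already held. In pipelines where runs can overlap, the `-lock-timeout` option makes it retry for a while instead. A small sketch; the `apply_with_lock_wait` wrapper and the 5 minute default are illustrative:

```bash
# Hypothetical wrapper: wait for a held lock instead of failing right away.
# Usage: apply_with_lock_wait 10m
apply_with_lock_wait() {
  # Retry acquiring the state lock for the given duration (default 5m)
  terraform apply -lock-timeout="${1:-5m}"
}
```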
### Lock Behavior by Backend
| Backend | Lock Mechanism | Auto-Lock | Force Unlock |
|---------|---------------|-----------|--------------|
| S3 | DynamoDB table (or S3-native lock file via `use_lockfile`, Terraform 1.10+) | Yes (if configured) | `terraform force-unlock LOCK_ID` |
| GCS | Native blob locking | Yes | `terraform force-unlock LOCK_ID` |
| Azure Blob | Native blob lease | Yes | `terraform force-unlock LOCK_ID` |
| TF Cloud | Built-in | Always | Via UI or API |
| Consul | Key-value lock | Yes | `terraform force-unlock LOCK_ID` |
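| Local | OS file lock on the state file (`.terraform.tfstate.lock.info` holds lock metadata) | Yes (single machine) | Not supported from another process; the lock is released when the holding process exits |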
### Force Unlock (Emergency Only)
```bash
# Only use when you're certain no other process is running
terraform force-unlock LOCK_ID
# The LOCK_ID is shown in the error message when lock fails:
# Error: Error locking state: Error acquiring the state lock
# Lock Info:
# ID: 12345678-abcd-1234-abcd-1234567890ab
```
---
## State Security Best Practices
### 1. Encrypt at Rest
```hcl
# S3: server-side encryption with a customer-managed KMS key
# (bucket, key, and region omitted for brevity)
terraform {
  backend "s3" {
    encrypt    = true
    kms_key_id = "arn:aws:kms:us-east-1:ACCOUNT:key/KEY_ID"
  }
}
```
### 2. Restrict Access
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::mycompany-terraform-state",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalTag/Team": "platform"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::mycompany-terraform-state/project/*",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalTag/Team": "platform"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:ACCOUNT:table/terraform-locks"
    }
  ]
}
```
### 3. Enable Versioning (State History)
```hcl
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}
```
Versioning lets you recover from state corruption by restoring a previous version.
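
A rough sketch of how a previous version might be located and downloaded with the AWS CLI, assuming the bucket and key from the S3 backend examples; the `list_state_versions` and `fetch_state_version` helpers are hypothetical names:

```bash
# Hypothetical helper: list stored versions of the state object.
# Usage: list_state_versions mycompany-terraform-state project/env/terraform.tfstate
list_state_versions() {
  aws s3api list-object-versions --bucket "$1" --prefix "$2" \
    --query 'Versions[].[VersionId,LastModified,IsLatest]' --output text
}

# Hypothetical helper: download one specific version to a local file for inspection.
# Usage: fetch_state_version BUCKET KEY VERSION_ID restored.tfstate
fetch_state_version() {
  aws s3api get-object --bucket "$1" --key "$2" --version-id "$3" "$4"
}

# After inspecting the file, it can be pushed back through Terraform so the
# backend's own lock and checksum handling stay consistent, e.g.:
#   terraform state push restored.tfstate
# Terraform may reject a file whose serial is older than the current state.
```

Going through `terraform state push` rather than overwriting the S3 object directly is a design choice here: it keeps the restore inside the same locking path as normal runs.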
### 4. Audit Access
- Enable S3 access logging or CloudTrail data events
- Monitor for unexpected state reads (potential secret extraction)
- State files contain sensitive values — treat them like credentials
### 5. Sensitive Values in State
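Terraform stores resource attributes in state in plaintext, including passwords, private keys, and tokens. For most resources this is unavoidable, although newer Terraform releases add ephemeral values and write-only arguments that some providers support. Mitigate by: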
- Encrypting state at rest (KMS)
- Restricting state file access (IAM)
- Using `sensitive = true` on variables and outputs (prevents display, not storage)
- Rotating secrets regularly (state contains the value at apply time)
---
## Drift Detection and Reconciliation
### Detect Drift
```bash
# Plan with detailed exit code
terraform plan -detailed-exitcode
# Exit 0 = no changes
# Exit 1 = error
# Exit 2 = plan succeeded with changes (drift or unapplied config changes)
```
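In a scheduled CI job, the exit code can be turned into a simple status. The sketch below follows the exit code meanings listed above; `classify_plan_exit` and `check_drift` are hypothetical helper names, and the labels are arbitrary:

```bash
# Map a `terraform plan -detailed-exitcode` exit code to a label.
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "changes" ;;
    *) echo "error" ;;
  esac
}

# Example CI step: run a plan, print the label, and fail unless state is in sync.
check_drift() {
  local code=0
  terraform plan -detailed-exitcode -input=false >/dev/null || code=$?
  classify_plan_exit "$code"
  [ "$code" -eq 0 ]
}
```

A label of `changes` only says the plan is non-empty; telling drift apart from unapplied configuration changes still requires reading the plan itself.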
### Common Drift Sources
| Source | Example | Prevention |
|--------|---------|------------|
| Console changes | Someone edits SG rules in AWS Console | SCPs to restrict console access, or accept and reconcile |
| Auto-scaling | ASG launches instances not in state | Don't manage individual instances; manage ASG |
| External tools | Ansible modifies EC2 tags | Agree on ownership boundaries |
| Dependent resource changes | AMI deregistered | Use data sources to detect, lifecycle ignore_changes |
### Reconciliation Options
```hcl
# Option 1: Apply to restore the desired state
#   $ terraform apply

# Option 2: Accept reality by updating state only
#   $ terraform apply -refresh-only

# Option 3: Ignore drift on specific attributes
resource "aws_instance" "web" {
  # ... other arguments ...

  lifecycle {
    ignore_changes = [tags["LastModifiedBy"], ami]
  }
}

# Option 4: Import the manually created resource
#   $ terraform import aws_security_group_rule.new sg-12345_ingress_tcp_443_443_0.0.0.0/0
```
---
## Troubleshooting Checklist
| Symptom | Likely Cause | Fix |
|---------|-------------|-----|
| "Error acquiring state lock" | Concurrent run or crashed process | Wait for other run to finish, or `force-unlock` |
| "Backend configuration changed" | Backend config modified | Run `terraform init -reconfigure` or `-migrate-state` |
| "Resource already exists" | Resource created outside Terraform | `terraform import` the resource |
| "No matching resource found" | Resource deleted outside Terraform | `terraform state rm` the resource |
| State file growing very large | Too many resources in one state | Split into smaller states, one per root configuration (e.g. network, data, apps) |
| Slow plan/apply | Large state file, many resources | Split state, use `-target` for urgent changes |
| "Provider produced inconsistent result" | Provider bug or API race condition | Retry, or pin provider version |
| Workspace confusion | Applied to wrong workspace | Always check `terraform workspace show` before apply |