claude-skills-reference/engineering-team/aws-solution-architect/references/best_practices.md

# AWS Best Practices for Startups

Production-ready practices for serverless, cost optimization, security, and operational excellence.

---

## Table of Contents

- [Serverless Best Practices](#serverless-best-practices)
- [Cost Optimization](#cost-optimization)
- [Security Hardening](#security-hardening)
- [Scalability Patterns](#scalability-patterns)
- [DevOps and Reliability](#devops-and-reliability)
- [Common Pitfalls](#common-pitfalls)

---

## Serverless Best Practices

### Lambda Function Design

#### 1. Keep Functions Stateless

Store state externally in DynamoDB, S3, or ElastiCache.

```python
# BAD: Function-level state
cache = {}

def handler(event, context):
    if event['key'] in cache:
        return cache[event['key']]
    # ...

# GOOD: External state
import boto3
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('cache')

def handler(event, context):
    response = table.get_item(Key={'pk': event['key']})
    if 'Item' in response:
        return response['Item']['value']
    # ...
```

#### 2. Implement Idempotency

Handle retries gracefully with unique request IDs.

```python
import boto3
import hashlib

dynamodb = boto3.resource('dynamodb')
idempotency_table = dynamodb.Table('idempotency')

def handler(event, context):
    # Generate idempotency key
    idempotency_key = hashlib.sha256(
        f"{event['orderId']}-{event['action']}".encode()
    ).hexdigest()

    # Check if already processed
    try:
        response = idempotency_table.get_item(Key={'pk': idempotency_key})
        if 'Item' in response:
            return response['Item']['result']
    except Exception:
        pass

    # Process request
    result = process_order(event)

    # Store result for idempotency
    idempotency_table.put_item(
        Item={
            'pk': idempotency_key,
            'result': result,
            'ttl': int(time.time()) + 86400  # 24h TTL
        }
    )

    return result
```

#### 3. Optimize Cold Starts

```python
# Initialize outside handler (reused across invocations)
import boto3
from aws_xray_sdk.core import patch_all

# SDK initialization happens once
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')
patch_all()

def handler(event, context):
    # Handler code uses pre-initialized resources
    return table.get_item(Key={'pk': event['id']})
```

**Cold Start Reduction Techniques:**
- Use provisioned concurrency for critical paths
- Minimize package size (use layers for dependencies)
- Choose interpreted languages (Python, Node.js) over compiled
- Avoid VPC unless necessary (adds 6-10 sec cold start)

#### 4. Set Appropriate Timeouts

```yaml
# Lambda configuration
Functions:
  ApiHandler:
    Timeout: 10  # Shorter for synchronous APIs
    MemorySize: 512

  BackgroundProcessor:
    Timeout: 300  # Longer for async processing
    MemorySize: 1024
```

**Timeout Guidelines:**
- API handlers: 10-30 seconds
- Event processors: 60-300 seconds
- Use Step Functions for >15 minute workflows

---

## Cost Optimization

### 1. Right-Sizing Strategy

```bash
# Check EC2 utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -d '7 days ago' -u +"%Y-%m-%dT%H:%M:%SZ") \
  --end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
  --period 3600 \
  --statistics Average
```

**Right-Sizing Rules:**
- <10% CPU average: Downsize instance
- >80% CPU average: Consider upgrade or horizontal scaling
- Review every month for the first 6 months

### 2. Savings Plans and Reserved Instances

| Commitment | Savings | Best For |
|------------|---------|----------|
| No Upfront, 1-year | 20-30% | Unknown future |
| Partial Upfront, 1-year | 30-40% | Moderate confidence |
| All Upfront, 3-year | 50-60% | Stable workloads |

```bash
# Check Savings Plans recommendations
aws cost-explorer get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS
```

### 3. S3 Lifecycle Policies

```json
{
  "Rules": [
    {
      "ID": "Transition to cheaper storage",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "logs/"
      },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```

### 4. Lambda Memory Optimization

Test different memory settings to find optimal cost/performance.

```python
# Use AWS Lambda Power Tuning
# https://github.com/alexcasalboni/aws-lambda-power-tuning

# Example results:
# 128 MB: 2000ms, $0.000042
# 512 MB: 500ms, $0.000042
# 1024 MB: 300ms, $0.000050

# Optimal: 512 MB (same cost, 4x faster)
```

### 5. NAT Gateway Alternatives

```
NAT Gateway: $0.045/hour + $0.045/GB = ~$32/month + data

Alternatives:
1. VPC Endpoints: $0.01/hour = ~$7.30/month (for AWS services)
2. NAT Instance: t3.nano = ~$3.80/month (limited throughput)
3. No NAT: Use VPC endpoints + Lambda outside VPC
```

### 6. CloudWatch Log Retention

```yaml
# Set retention policies to avoid unbounded growth
LogGroup:
  Type: AWS::Logs::LogGroup
  Properties:
    LogGroupName: /aws/lambda/my-function
    RetentionInDays: 14  # 7, 14, 30, 60, 90, etc.
```

**Retention Guidelines:**
- Development: 7 days
- Production non-critical: 30 days
- Production critical: 90 days
- Compliance requirements: As specified

---

## Security Hardening

### 1. IAM Least Privilege

```json
// BAD: Overly permissive
{
  "Effect": "Allow",
  "Action": "dynamodb:*",
  "Resource": "*"
}

// GOOD: Specific actions and resources
{
  "Effect": "Allow",
  "Action": [
    "dynamodb:GetItem",
    "dynamodb:PutItem",
    "dynamodb:Query"
  ],
  "Resource": [
    "arn:aws:dynamodb:us-east-1:123456789:table/users",
    "arn:aws:dynamodb:us-east-1:123456789:table/users/index/*"
  ]
}
```

### 2. Encryption Configuration

```yaml
# Enable encryption everywhere
Resources:
  # DynamoDB
  Table:
    Type: AWS::DynamoDB::Table
    Properties:
      SSESpecification:
        SSEEnabled: true
        SSEType: KMS
        KMSMasterKeyId: !Ref EncryptionKey

  # S3
  Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: aws:kms
              KMSMasterKeyID: !Ref EncryptionKey

  # RDS
  Database:
    Type: AWS::RDS::DBInstance
    Properties:
      StorageEncrypted: true
      KmsKeyId: !Ref EncryptionKey
```

### 3. Network Isolation

```yaml
# Private subnets with VPC endpoints
Resources:
  PrivateSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      MapPublicIpOnLaunch: false

  # DynamoDB Gateway Endpoint (free)
  DynamoDBEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref VPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.dynamodb
      VpcEndpointType: Gateway
      RouteTableIds:
        - !Ref PrivateRouteTable

  # Secrets Manager Interface Endpoint
  SecretsEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref VPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.secretsmanager
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
```

### 4. Secrets Management

```python
# Never hardcode secrets
import boto3
import json

def get_secret(secret_name):
    client = boto3.client('secretsmanager')
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])

# Usage
db_creds = get_secret('prod/database/credentials')
connection = connect(
    host=db_creds['host'],
    user=db_creds['username'],
    password=db_creds['password']
)
```

### 5. API Protection

```yaml
# WAF + API Gateway
WebACL:
  Type: AWS::WAFv2::WebACL
  Properties:
    DefaultAction:
      Allow: {}
    Rules:
      - Name: RateLimit
        Priority: 1
        Action:
          Block: {}
        Statement:
          RateBasedStatement:
            Limit: 2000
            AggregateKeyType: IP
        VisibilityConfig:
          SampledRequestsEnabled: true
          CloudWatchMetricsEnabled: true
          MetricName: RateLimitRule

      - Name: AWSManagedRulesCommonRuleSet
        Priority: 2
        OverrideAction:
          None: {}
        Statement:
          ManagedRuleGroupStatement:
            VendorName: AWS
            Name: AWSManagedRulesCommonRuleSet
```

### 6. Audit Logging

```yaml
# Enable CloudTrail for all API calls
CloudTrail:
  Type: AWS::CloudTrail::Trail
  Properties:
    IsMultiRegionTrail: true
    IsLogging: true
    S3BucketName: !Ref AuditLogsBucket
    IncludeGlobalServiceEvents: true
    EnableLogFileValidation: true
    EventSelectors:
      - ReadWriteType: All
        IncludeManagementEvents: true
```

---

## Scalability Patterns

### 1. Horizontal vs Vertical Scaling

```
Horizontal (preferred):
- Add more Lambda concurrent executions
- Add more Fargate tasks
- Add more DynamoDB capacity

Vertical (when necessary):
- Increase Lambda memory
- Upgrade RDS instance
- Larger EC2 instances
```

### 2. Database Sharding

```python
# Partition by tenant ID
def get_table_for_tenant(tenant_id):
    shard = hash(tenant_id) % NUM_SHARDS
    return f"data-shard-{shard}"

# Or use DynamoDB single-table design with partition keys
def get_partition_key(tenant_id, entity_type, entity_id):
    return f"TENANT#{tenant_id}#{entity_type}#{entity_id}"
```

### 3. Caching Layers

```
Edge (CloudFront):     Global, static content, TTL: hours-days
Application (Redis):   Regional, session/query cache, TTL: minutes-hours
Database (DAX):        DynamoDB-specific, TTL: minutes
```

```python
# ElastiCache Redis caching pattern
import redis
import json

cache = redis.Redis(host='cache.abc123.cache.amazonaws.com', port=6379)

def get_user(user_id):
    # Check cache first
    cached = cache.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # Fetch from database
    user = db.get_user(user_id)

    # Cache for 5 minutes
    cache.setex(f"user:{user_id}", 300, json.dumps(user))

    return user
```

### 4. Auto-Scaling Configuration

```yaml
# ECS Service Auto-scaling
AutoScalingTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    MaxCapacity: 10
    MinCapacity: 2
    ResourceId: !Sub service/${Cluster}/${Service.Name}
    ScalableDimension: ecs:service:DesiredCount
    ServiceNamespace: ecs

ScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyType: TargetTrackingScaling
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 70
      ScaleInCooldown: 300
      ScaleOutCooldown: 60
```

---

## DevOps and Reliability

### 1. Infrastructure as Code

```bash
# Version control all infrastructure
git init
git add .
git commit -m "Initial infrastructure setup"

# Use separate stacks per environment
cdk deploy --context environment=dev
cdk deploy --context environment=staging
cdk deploy --context environment=production
```

### 2. Blue/Green Deployments

```yaml
# CodeDeploy Blue/Green for ECS
DeploymentGroup:
  Type: AWS::CodeDeploy::DeploymentGroup
  Properties:
    DeploymentConfigName: CodeDeployDefault.ECSAllAtOnce
    DeploymentStyle:
      DeploymentType: BLUE_GREEN
      DeploymentOption: WITH_TRAFFIC_CONTROL
    BlueGreenDeploymentConfiguration:
      DeploymentReadyOption:
        ActionOnTimeout: CONTINUE_DEPLOYMENT
        WaitTimeInMinutes: 0
      TerminateBlueInstancesOnDeploymentSuccess:
        Action: TERMINATE
        TerminationWaitTimeInMinutes: 5
```

### 3. Health Checks

```python
# Application health endpoint
from flask import Flask, jsonify
import boto3

app = Flask(__name__)

@app.route('/health')
def health():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'external_api': check_external_api()
    }

    status = 'healthy' if all(checks.values()) else 'unhealthy'
    code = 200 if status == 'healthy' else 503

    return jsonify({'status': status, 'checks': checks}), code

def check_database():
    try:
        # Quick connectivity test
        db.execute('SELECT 1')
        return True
    except Exception:
        return False
```

### 4. Monitoring Setup

```yaml
# CloudWatch Dashboard
Dashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: production-overview
    DashboardBody: |
      {
        "widgets": [
          {
            "type": "metric",
            "properties": {
              "metrics": [
                ["AWS/Lambda", "Invocations", "FunctionName", "api-handler"],
                [".", "Errors", ".", "."],
                [".", "Duration", ".", ".", {"stat": "p99"}]
              ],
              "period": 60,
              "title": "Lambda Metrics"
            }
          }
        ]
      }

# Critical Alarms
ErrorAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: high-error-rate
    MetricName: Errors
    Namespace: AWS/Lambda
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 3
    Threshold: 10
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref AlertTopic
```

---

## Common Pitfalls

### Technical Debt

| Pitfall | Solution |
|---------|----------|
| Over-engineering early | Start simple, scale when needed |
| Under-monitoring | Set up CloudWatch from day one |
| Ignoring costs | Enable Cost Explorer and billing alerts |
| Single region only | Plan for multi-region from start |

### Security Mistakes

| Mistake | Prevention |
|---------|------------|
| Public S3 buckets | Block public access, use bucket policies |
| Overly permissive IAM | Never use "*", specify resources |
| Hardcoded credentials | Use Secrets Manager, IAM roles |
| Unencrypted data | Enable encryption by default |

### Performance Issues

| Issue | Solution |
|-------|----------|
| No caching | Add CloudFront, ElastiCache early |
| Inefficient queries | Use indexes, avoid DynamoDB scans |
| Large Lambda packages | Use layers, minimize dependencies |
| N+1 queries | Implement DataLoader, batch operations |

### Cost Surprises

| Surprise | Prevention |
|----------|------------|
| Undeleted resources | Tag everything, review weekly |
| Data transfer costs | Keep traffic in same AZ/region |
| NAT Gateway charges | Use VPC endpoints for AWS services |
| Log accumulation | Set CloudWatch retention policies |