claude-skills-reference/engineering-team/aws-solution-architect/references/best_practices.md
Alireza Rezvani c7dc957823 fix(skill): restructure aws-solution-architect for better organization (#61) (#114)
2026-01-30 02:42:08 +01:00


AWS Best Practices for Startups

Production-ready practices for serverless, cost optimization, security, and operational excellence.



Serverless Best Practices

Lambda Function Design

1. Keep Functions Stateless

Store state externally in DynamoDB, S3, or ElastiCache.

# BAD: Function-level state
cache = {}

def handler(event, context):
    if event['key'] in cache:
        return cache[event['key']]
    # ...

# GOOD: External state
import boto3
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('cache')

def handler(event, context):
    response = table.get_item(Key={'pk': event['key']})
    if 'Item' in response:
        return response['Item']['value']
    # ...

2. Implement Idempotency

Handle retries gracefully with unique request IDs.

import boto3
import hashlib
import time

dynamodb = boto3.resource('dynamodb')
idempotency_table = dynamodb.Table('idempotency')

def handler(event, context):
    # Generate idempotency key
    idempotency_key = hashlib.sha256(
        f"{event['orderId']}-{event['action']}".encode()
    ).hexdigest()

    # Check if already processed
    try:
        response = idempotency_table.get_item(Key={'pk': idempotency_key})
        if 'Item' in response:
            return response['Item']['result']
    except Exception:
        pass

    # Process request
    result = process_order(event)

    # Store result for idempotency
    idempotency_table.put_item(
        Item={
            'pk': idempotency_key,
            'result': result,
            'ttl': int(time.time()) + 86400  # 24h TTL
        }
    )

    return result
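
The handler above checks for a stored result and writes it in two separate steps, so two concurrent retries can both pass the check. One way to close that gap is to claim the key atomically before processing. The sketch below uses an in-memory stand-in for the table so it runs without AWS; `InMemoryStore`, `AlreadyClaimed`, and `handle_once` are hypothetical names. With DynamoDB, the same claim could be made with `put_item` and `ConditionExpression='attribute_not_exists(pk)'`.

```python
import hashlib


class AlreadyClaimed(Exception):
    """Raised when an idempotency key has already been written."""


class InMemoryStore:
    """Hypothetical stand-in for the idempotency table (illustration only)."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, key, item):
        # Mirrors a conditional write: fail if the key already exists.
        if key in self._items:
            raise AlreadyClaimed(key)
        self._items[key] = item

    def update(self, key, item):
        self._items[key] = item

    def get(self, key):
        return self._items.get(key)


def idempotency_key(event):
    # Same derivation as the handler above: hash of order ID and action.
    raw = f"{event['orderId']}-{event['action']}"
    return hashlib.sha256(raw.encode()).hexdigest()


def handle_once(event, store, process):
    key = idempotency_key(event)
    try:
        store.put_if_absent(key, {"status": "IN_PROGRESS"})
    except AlreadyClaimed:
        # A previous attempt claimed the key; return what it recorded.
        return store.get(key)
    result = process(event)
    record = {"status": "DONE", "result": result}
    store.update(key, record)
    return record
```

If processing fails after the claim, the key stays `IN_PROGRESS`. A real implementation would need an expiry or a cleanup path for that case.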

3. Optimize Cold Starts

# Initialize outside handler (reused across invocations)
import boto3
from aws_xray_sdk.core import patch_all

# SDK initialization happens once
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')
patch_all()

def handler(event, context):
    # Handler code uses pre-initialized resources
    return table.get_item(Key={'pk': event['id']})

Cold Start Reduction Techniques:

  • Use provisioned concurrency for critical paths
  • Minimize package size (use layers for dependencies)
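  • Prefer runtimes with fast startup (Python, Node.js); JVM and .NET functions typically start slower
  • Attach functions to a VPC only when needed (the old multi-second ENI penalty is largely gone, but VPC functions still need NAT or VPC endpoints to reach AWS services)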

4. Set Appropriate Timeouts

# Lambda configuration
Functions:
  ApiHandler:
    Timeout: 10  # Shorter for synchronous APIs
    MemorySize: 512

  BackgroundProcessor:
    Timeout: 300  # Longer for async processing
    MemorySize: 1024

Timeout Guidelines:

  • API handlers: 10-30 seconds
  • Event processors: 60-300 seconds
  • Use Step Functions for >15 minute workflows
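
For event processors that work through a batch, it can help to stop before the timeout hits rather than be killed mid-record. The sketch below relies on `context.get_remaining_time_in_millis()`, which the Lambda Python runtime provides on the context object. `process_batch`, `process_record`, and the 5-second `SAFETY_MARGIN_MS` are assumptions for illustration.

```python
SAFETY_MARGIN_MS = 5_000  # Assumed buffer; tune to the slowest single record


def process_batch(records, context, process_record):
    """Process records until time runs short; return the unprocessed remainder."""
    for index, record in enumerate(records):
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            # Stop early so the caller can requeue the rest.
            return records[index:]
        process_record(record)
    return []
```

The caller decides what to do with the remainder, for example sending it back to a queue.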

Cost Optimization

1. Right-Sizing Strategy

# Check EC2 utilization (GNU date syntax; on macOS/BSD use: date -v-7d -u +"%Y-%m-%dT%H:%M:%SZ")
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -d '7 days ago' -u +"%Y-%m-%dT%H:%M:%SZ") \
  --end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
  --period 3600 \
  --statistics Average

Right-Sizing Rules:

  • <10% CPU average: Downsize instance
  • >80% CPU average: Consider upgrade or horizontal scaling
  • Review every month for the first 6 months
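
The thresholds above can be applied to the datapoints that `get-metric-statistics` returns. A minimal sketch, assuming each datapoint is a dict with an `Average` field as in the CLI's JSON output; `average_cpu` and `right_sizing_action` are hypothetical helper names.

```python
def average_cpu(datapoints):
    """Average the 'Average' field across CloudWatch datapoints."""
    if not datapoints:
        raise ValueError("no datapoints returned")
    return sum(point["Average"] for point in datapoints) / len(datapoints)


def right_sizing_action(avg_cpu_percent):
    """Map an average CPU percentage to a suggested action."""
    if not 0 <= avg_cpu_percent <= 100:
        raise ValueError("CPU utilization must be between 0 and 100")
    if avg_cpu_percent < 10:
        return "downsize"
    if avg_cpu_percent > 80:
        return "upsize-or-scale-out"
    return "keep"
```

CPU alone can mislead for memory-bound or I/O-bound workloads, so treat the result as a prompt for review rather than an automatic decision.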

2. Savings Plans and Reserved Instances

| Commitment | Savings | Best For |
|------------|---------|----------|
| No Upfront, 1-year | 20-30% | Unknown future |
| Partial Upfront, 1-year | 30-40% | Moderate confidence |
| All Upfront, 3-year | 50-60% | Stable workloads |

# Check Savings Plans recommendations
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS

3. S3 Lifecycle Policies

{
  "Rules": [
    {
      "ID": "Transition to cheaper storage",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "logs/"
      },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
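
To sanity-check a rule like this before applying it, the age thresholds can be mirrored in a few lines. A minimal sketch of the rule above only; `storage_class_for_age` is a hypothetical helper. S3 runs lifecycle actions asynchronously, so real objects can lag behind these boundaries.

```python
def storage_class_for_age(age_days):
    """Return the storage class the rule above implies for an object's age."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days >= 365:
        return "EXPIRED"
    if age_days >= 90:
        return "GLACIER"
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"
```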

4. Lambda Memory Optimization

Test different memory settings to find optimal cost/performance.

# Use AWS Lambda Power Tuning
# https://github.com/alexcasalboni/aws-lambda-power-tuning

# Example results:
# 128 MB: 2000ms, $0.0000042
# 512 MB: 500ms, $0.0000042
# 1024 MB: 300ms, $0.0000050

# Optimal: 512 MB (same cost, 4x faster)
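
The comparison behind results like these is GB-seconds multiplied by a per-GB-second price. A minimal sketch; `invocation_compute_cost` is a hypothetical helper. The price is a parameter because it varies by region and architecture, and the per-request charge is left out.

```python
def invocation_compute_cost(memory_mb, duration_ms, price_per_gb_second):
    """Compute-only cost of one invocation (excludes the per-request charge)."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price_per_gb_second
```

Because 128 MB for 2000 ms and 512 MB for 500 ms both come to 0.25 GB-seconds, they cost the same at any price, which is why the faster setting wins.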

5. NAT Gateway Alternatives

NAT Gateway: $0.045/hour + $0.045/GB = ~$32/month + data

Alternatives:
1. Interface VPC endpoints: $0.01/hour per AZ = ~$7.30/month each, plus per-GB processing (gateway endpoints for S3 and DynamoDB are free)
2. NAT Instance: t3.nano = ~$3.80/month (limited throughput)
3. No NAT: Use VPC endpoints + Lambda outside VPC

6. CloudWatch Log Retention

# Set retention policies to avoid unbounded growth
LogGroup:
  Type: AWS::Logs::LogGroup
  Properties:
    LogGroupName: /aws/lambda/my-function
    RetentionInDays: 14  # 7, 14, 30, 60, 90, etc.

Retention Guidelines:

  • Development: 7 days
  • Production non-critical: 30 days
  • Production critical: 90 days
  • Compliance requirements: As specified

Security Hardening

1. IAM Least Privilege

// BAD: Overly permissive
{
  "Effect": "Allow",
  "Action": "dynamodb:*",
  "Resource": "*"
}

// GOOD: Specific actions and resources
{
  "Effect": "Allow",
  "Action": [
    "dynamodb:GetItem",
    "dynamodb:PutItem",
    "dynamodb:Query"
  ],
  "Resource": [
    "arn:aws:dynamodb:us-east-1:123456789012:table/users",
    "arn:aws:dynamodb:us-east-1:123456789012:table/users/index/*"
  ]
}
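
A simple lint can catch the BAD pattern before a policy ships. A minimal sketch that inspects one statement dict; `find_wildcards` is a hypothetical helper and is no substitute for IAM Access Analyzer or a policy review.

```python
def find_wildcards(statement):
    """Return findings for wildcard actions or resources in one Allow statement."""
    findings = []
    if statement.get("Effect") != "Allow":
        return findings
    actions = statement.get("Action", [])
    resources = statement.get("Resource", [])
    # IAM accepts either a string or a list for both fields.
    if isinstance(actions, str):
        actions = [actions]
    if isinstance(resources, str):
        resources = [resources]
    for action in actions:
        if action == "*" or action.endswith(":*"):
            findings.append(f"wildcard action: {action}")
    for resource in resources:
        if resource == "*":
            findings.append("wildcard resource: *")
    return findings
```

Scoped wildcards such as `table/users/index/*` are deliberately not flagged, since they stay within a single table.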

2. Encryption Configuration

# Enable encryption everywhere
Resources:
  # DynamoDB
  Table:
    Type: AWS::DynamoDB::Table
    Properties:
      SSESpecification:
        SSEEnabled: true
        SSEType: KMS
        KMSMasterKeyId: !Ref EncryptionKey

  # S3
  Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: aws:kms
              KMSMasterKeyID: !Ref EncryptionKey

  # RDS
  Database:
    Type: AWS::RDS::DBInstance
    Properties:
      StorageEncrypted: true
      KmsKeyId: !Ref EncryptionKey

3. Network Isolation

# Private subnets with VPC endpoints
Resources:
  PrivateSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.1.0/24  # Example range; adjust to your VPC
      MapPublicIpOnLaunch: false

  # DynamoDB Gateway Endpoint (free)
  DynamoDBEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref VPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.dynamodb
      VpcEndpointType: Gateway
      RouteTableIds:
        - !Ref PrivateRouteTable

  # Secrets Manager Interface Endpoint
  SecretsEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref VPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.secretsmanager
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
      SubnetIds:  # Interface endpoints place an ENI in each listed subnet
        - !Ref PrivateSubnet

4. Secrets Management

# Never hardcode secrets
import boto3
import json

def get_secret(secret_name):
    client = boto3.client('secretsmanager')
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])

# Usage (connect() stands for your database driver's connect function)
db_creds = get_secret('prod/database/credentials')
connection = connect(
    host=db_creds['host'],
    user=db_creds['username'],
    password=db_creds['password']
)

5. API Protection

# WAF + API Gateway
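WebACL:
  Type: AWS::WAFv2::WebACL
  Properties:
    Scope: REGIONAL  # Required; use REGIONAL for API Gateway
    DefaultAction:
      Allow: {}
    VisibilityConfig:  # Required at the web ACL level
      SampledRequestsEnabled: true
      CloudWatchMetricsEnabled: true
      MetricName: ApiWebACL
    Rules:
      - Name: RateLimit
        Priority: 1
        Action:
          Block: {}
        Statement:
          RateBasedStatement:
            Limit: 2000  # Requests per 5-minute window per IP
            AggregateKeyType: IP
        VisibilityConfig:
          SampledRequestsEnabled: true
          CloudWatchMetricsEnabled: true
          MetricName: RateLimitRule

      - Name: AWSManagedRulesCommonRuleSet
        Priority: 2
        OverrideAction:
          None: {}
        Statement:
          ManagedRuleGroupStatement:
            VendorName: AWS
            Name: AWSManagedRulesCommonRuleSet
        VisibilityConfig:  # Required on every rule
          SampledRequestsEnabled: true
          CloudWatchMetricsEnabled: true
          MetricName: CommonRuleSet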

6. Audit Logging

# Enable CloudTrail for all API calls
CloudTrail:
  Type: AWS::CloudTrail::Trail
  Properties:
    IsMultiRegionTrail: true
    IsLogging: true
    S3BucketName: !Ref AuditLogsBucket
    IncludeGlobalServiceEvents: true
    EnableLogFileValidation: true
    EventSelectors:
      - ReadWriteType: All
        IncludeManagementEvents: true

Scalability Patterns

1. Horizontal vs Vertical Scaling

Horizontal (preferred):
- Add more Lambda concurrent executions
- Add more Fargate tasks
- Add more DynamoDB capacity

Vertical (when necessary):
- Increase Lambda memory
- Upgrade RDS instance
- Larger EC2 instances

2. Database Sharding

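import hashlib
NUM_SHARDS = 8  # Example value

# Partition by tenant ID (built-in hash() is salted per process, so use a stable digest)
def get_table_for_tenant(tenant_id):
    digest = hashlib.sha256(str(tenant_id).encode()).digest()
    shard = int.from_bytes(digest[:8], "big") % NUM_SHARDS
    return f"data-shard-{shard}"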

# Or use DynamoDB single-table design with partition keys
def get_partition_key(tenant_id, entity_type, entity_id):
    return f"TENANT#{tenant_id}#{entity_type}#{entity_id}"

3. Caching Layers

Edge (CloudFront):     Global, static content, TTL: hours-days
Application (Redis):   Regional, session/query cache, TTL: minutes-hours
Database (DAX):        DynamoDB-specific, TTL: minutes

# ElastiCache Redis caching pattern
import redis
import json

cache = redis.Redis(host='cache.abc123.cache.amazonaws.com', port=6379)

def get_user(user_id):
    # Check cache first
    cached = cache.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # Fetch from database
    user = db.get_user(user_id)

    # Cache for 5 minutes
    cache.setex(f"user:{user_id}", 300, json.dumps(user))

    return user

4. Auto-Scaling Configuration

# ECS Service Auto-scaling
AutoScalingTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    MaxCapacity: 10
    MinCapacity: 2
    ResourceId: !Sub service/${Cluster}/${Service.Name}
    ScalableDimension: ecs:service:DesiredCount
    ServiceNamespace: ecs

ScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: cpu-target-tracking  # Required; example name
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref AutoScalingTarget  # Links the policy to the target above
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 70
      ScaleInCooldown: 300
      ScaleOutCooldown: 60
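
As a rough mental model of target tracking, capacity moves in proportion to how far the metric sits from the target, within the min/max bounds. AWS does not publish the exact algorithm, so the sketch below is only an approximation for capacity planning, and `estimate_desired_count` is a hypothetical helper. It ignores cooldowns and the more conservative scale-in behavior.

```python
import math


def estimate_desired_count(current_count, metric_value, target_value,
                           min_capacity, max_capacity):
    """Proportional estimate, clamped to the scalable target's bounds."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_count * metric_value / target_value)
    return max(min_capacity, min(max_capacity, raw))
```

For example, 4 tasks at 90% CPU against a 70% target suggests about 6 tasks, which helps check whether MaxCapacity leaves enough headroom.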

DevOps and Reliability

1. Infrastructure as Code

# Version control all infrastructure
git init
git add .
git commit -m "Initial infrastructure setup"

# Use separate stacks per environment
cdk deploy --context environment=dev
cdk deploy --context environment=staging
cdk deploy --context environment=production

2. Blue/Green Deployments

# CodeDeploy Blue/Green for ECS
DeploymentGroup:
  Type: AWS::CodeDeploy::DeploymentGroup
  Properties:
    DeploymentConfigName: CodeDeployDefault.ECSAllAtOnce
    DeploymentStyle:
      DeploymentType: BLUE_GREEN
      DeploymentOption: WITH_TRAFFIC_CONTROL
    BlueGreenDeploymentConfiguration:
      DeploymentReadyOption:
        ActionOnTimeout: CONTINUE_DEPLOYMENT
        WaitTimeInMinutes: 0
      TerminateBlueInstancesOnDeploymentSuccess:
        Action: TERMINATE
        TerminationWaitTimeInMinutes: 5

3. Health Checks

# Application health endpoint
from flask import Flask, jsonify
import boto3

app = Flask(__name__)

@app.route('/health')
def health():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'external_api': check_external_api()
    }

    status = 'healthy' if all(checks.values()) else 'unhealthy'
    code = 200 if status == 'healthy' else 503

    return jsonify({'status': status, 'checks': checks}), code

def check_database():
    try:
        # Quick connectivity test
        db.execute('SELECT 1')
        return True
    except Exception:
        return False

4. Monitoring Setup

# CloudWatch Dashboard
Dashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: production-overview
    DashboardBody: |
      {
        "widgets": [
          {
            "type": "metric",
            "properties": {
              "metrics": [
                ["AWS/Lambda", "Invocations", "FunctionName", "api-handler"],
                [".", "Errors", ".", "."],
                [".", "Duration", ".", ".", {"stat": "p99"}]
              ],
              "period": 60,
              "title": "Lambda Metrics"
            }
          }
        ]
      }

# Critical Alarms
ErrorAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: high-error-rate
    MetricName: Errors
    Namespace: AWS/Lambda
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 3
    Threshold: 10
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref AlertTopic
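
With `EvaluationPeriods: 3` and no `DatapointsToAlarm`, the alarm above fires only when all three most recent 60-second sums exceed the threshold. A minimal sketch of that rule for testing thresholds against historical data; `alarm_state` is a hypothetical helper and it ignores CloudWatch's missing-data handling.

```python
def alarm_state(datapoints, threshold=10, evaluation_periods=3):
    """Return 'ALARM' if the most recent periods all exceed the threshold."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"
```

A single quiet minute resets the evaluation, so a bursty error pattern may never trigger this alarm. Replaying past error counts through the function shows whether the threshold is too lenient.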

Common Pitfalls

Technical Debt

| Pitfall | Solution |
|---------|----------|
| Over-engineering early | Start simple, scale when needed |
| Under-monitoring | Set up CloudWatch from day one |
| Ignoring costs | Enable Cost Explorer and billing alerts |
| Single region only | Plan for multi-region from start |

Security Mistakes

| Mistake | Prevention |
|---------|------------|
| Public S3 buckets | Block public access, use bucket policies |
| Overly permissive IAM | Never use "*", specify resources |
| Hardcoded credentials | Use Secrets Manager, IAM roles |
| Unencrypted data | Enable encryption by default |

Performance Issues

| Issue | Solution |
|-------|----------|
| No caching | Add CloudFront, ElastiCache early |
| Inefficient queries | Use indexes, avoid DynamoDB scans |
| Large Lambda packages | Use layers, minimize dependencies |
| N+1 queries | Implement DataLoader, batch operations |

Cost Surprises

| Surprise | Prevention |
|----------|------------|
| Undeleted resources | Tag everything, review weekly |
| Data transfer costs | Keep traffic in same AZ/region |
| NAT Gateway charges | Use VPC endpoints for AWS services |
| Log accumulation | Set CloudWatch retention policies |