Complete restructure based on AI Agent Skills Benchmark feedback (original score: 66/100):
## Directory Reorganization
- Moved Python scripts to scripts/ directory
- Moved sample files to assets/ directory
- Created references/ directory with extracted content
- Removed HOW_TO_USE.md (integrated into SKILL.md)
- Removed __pycache__
## New Reference Files (3 files)
- architecture_patterns.md: 6 AWS patterns (serverless, microservices, three-tier,
  data processing, GraphQL, multi-region) with diagrams, cost breakdowns, pros/cons
- service_selection.md: Decision matrices for compute, database, storage, messaging,
  networking, security services with code examples
- best_practices.md: Serverless design, cost optimization, security hardening,
  scalability patterns, common pitfalls
## SKILL.md Rewrite
- Reduced from 345 lines to 307 lines (moved patterns to references/)
- Added trigger phrases to description ("design serverless architecture",
  "create CloudFormation templates", "optimize AWS costs")
- Structured around 6-step workflow instead of encyclopedia format
- Added Quick Start examples (MVP, Scaling, Cost Optimization, IaC)
- Removed marketing language ("Expert", "comprehensive")
- Consistent imperative voice throughout
## Structure Changes
- scripts/: architecture_designer.py, cost_optimizer.py, serverless_stack.py
- references/: architecture_patterns.md, service_selection.md, best_practices.md
- assets/: sample_input.json, expected_output.json
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
# AWS Best Practices for Startups

Production-ready practices for serverless, cost optimization, security, and operational excellence.
## Table of Contents
- Serverless Best Practices
- Cost Optimization
- Security Hardening
- Scalability Patterns
- DevOps and Reliability
- Common Pitfalls
## Serverless Best Practices

### Lambda Function Design

#### 1. Keep Functions Stateless

Store state externally in DynamoDB, S3, or ElastiCache.
```python
# BAD: function-level state is lost whenever the execution environment is recycled
cache = {}

def handler(event, context):
    if event['key'] in cache:
        return cache[event['key']]
    # ...

# GOOD: external state in DynamoDB survives across invocations and environments
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('cache')

def handler(event, context):
    response = table.get_item(Key={'pk': event['key']})
    if 'Item' in response:
        return response['Item']['value']
    # ...
```
#### 2. Implement Idempotency

Handle retries gracefully with unique request IDs.
```python
import hashlib
import time

import boto3

dynamodb = boto3.resource('dynamodb')
idempotency_table = dynamodb.Table('idempotency')

def handler(event, context):
    # Derive a deterministic idempotency key from the request
    idempotency_key = hashlib.sha256(
        f"{event['orderId']}-{event['action']}".encode()
    ).hexdigest()

    # Return the stored result if this request was already processed
    try:
        response = idempotency_table.get_item(Key={'pk': idempotency_key})
        if 'Item' in response:
            return response['Item']['result']
    except Exception:
        pass

    # Process the request
    result = process_order(event)

    # Store the result so retries return the same response
    idempotency_table.put_item(
        Item={
            'pk': idempotency_key,
            'result': result,
            'ttl': int(time.time()) + 86400  # 24h TTL
        }
    )
    return result
```
#### 3. Optimize Cold Starts

```python
# Initialize outside the handler (reused across warm invocations)
import boto3
from aws_xray_sdk.core import patch_all

# SDK initialization happens once per execution environment
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')
patch_all()

def handler(event, context):
    # Handler code uses pre-initialized resources
    return table.get_item(Key={'pk': event['id']})
```
**Cold Start Reduction Techniques:**

- Use provisioned concurrency for latency-critical paths
- Minimize package size (use layers for shared dependencies)
- Prefer runtimes with fast startup (Python, Node.js) over JVM- or .NET-based ones
- Attach to a VPC only when necessary (ENI setup can add cold-start latency)
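The initialize-outside-the-handler rule can also be made observable: module scope runs once per execution environment, so a module-level flag tells cold invocations from warm ones. A minimal sketch (the handler name and return shape are illustrative, not part of any AWS API):

```python
import time

# Module scope runs once per execution environment -- the cold start.
COLD_START = True
INIT_TIME = time.time()

def handler(event, context=None):
    global COLD_START
    was_cold = COLD_START
    COLD_START = False  # later invocations in this environment are warm
    return {
        'cold_start': was_cold,
        'env_age_s': round(time.time() - INIT_TIME, 3),
    }
```

Logging the `cold_start` field per invocation gives a cheap baseline before reaching for provisioned concurrency.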
#### 4. Set Appropriate Timeouts

```yaml
# Lambda configuration
Functions:
  ApiHandler:
    Timeout: 10        # shorter for synchronous APIs
    MemorySize: 512
  BackgroundProcessor:
    Timeout: 300       # longer for async processing
    MemorySize: 1024
```
**Timeout Guidelines:**

- API handlers: 10-30 seconds
- Event processors: 60-300 seconds
- Use Step Functions for workflows longer than the 15-minute Lambda limit
## Cost Optimization

### 1. Right-Sizing Strategy
```bash
# Check EC2 utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -d '7 days ago' -u +"%Y-%m-%dT%H:%M:%SZ") \
  --end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
  --period 3600 \
  --statistics Average
```
**Right-Sizing Rules:**

- Below 10% average CPU: downsize the instance
- Above 80% average CPU: consider a larger instance or horizontal scaling
- Review monthly for the first 6 months
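The rules above are easy to automate against the datapoints the CLI call returns; a sketch, assuming each datapoint is a dict with an `Average` key as in `get-metric-statistics` output (the thresholds and labels are this document's, not an AWS API):

```python
def rightsizing_action(datapoints, low=10, high=80):
    """Classify an instance from CloudWatch CPUUtilization datapoints."""
    if not datapoints:
        return 'no-data'
    avg = sum(dp['Average'] for dp in datapoints) / len(datapoints)
    if avg < low:
        return 'downsize'
    if avg > high:
        return 'upsize-or-scale-out'
    return 'keep'
```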
### 2. Savings Plans and Reserved Instances
| Commitment | Savings | Best For |
|---|---|---|
| No Upfront, 1-year | 20-30% | Unknown future |
| Partial Upfront, 1-year | 30-40% | Moderate confidence |
| All Upfront, 3-year | 50-60% | Stable workloads |
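As a rough worked example of what the table's ranges mean in dollars (the midpoint percentages below are assumptions for illustration, not quoted AWS prices):

```python
def discounted_monthly_cost(on_demand_monthly, savings_pct):
    """Effective monthly spend after a committed-use discount."""
    return round(on_demand_monthly * (1 - savings_pct / 100), 2)

# On a $1,000/month on-demand bill, using rough midpoints of the table:
#   ~25% (no upfront, 1-year)      -> $750/month
#   ~35% (partial upfront, 1-year) -> $650/month
#   ~55% (all upfront, 3-year)     -> $450/month
```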
```bash
# Check Savings Plans recommendations (Cost Explorer uses the "ce" CLI namespace)
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS
```
### 3. S3 Lifecycle Policies
```json
{
  "Rules": [
    {
      "ID": "Transition to cheaper storage",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "logs/"
      },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```
### 4. Lambda Memory Optimization

Test different memory settings to find the optimal cost/performance point.
```text
# Use AWS Lambda Power Tuning:
# https://github.com/alexcasalboni/aws-lambda-power-tuning
#
# Example results (illustrative):
#   128 MB: 2000 ms, $0.000042
#   512 MB:  500 ms, $0.000042
#  1024 MB:  300 ms, $0.000050
#
# Optimal: 512 MB (same cost, 4x faster)
```
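The "same cost, 4x faster" result falls out of Lambda's GB-second billing: 128 MB for 2000 ms and 512 MB for 500 ms both bill 0.25 GB-seconds. A quick check of the duration component only (per-request charges and the regional per-GB-second rate are left out):

```python
def gb_seconds(memory_mb, duration_ms):
    """Billed GB-seconds for a single invocation (duration component only)."""
    return (memory_mb / 1024) * (duration_ms / 1000)

# 128 MB x 2000 ms and 512 MB x 500 ms bill identically (0.25 GB-s each),
# while 1024 MB x 300 ms bills more (0.3 GB-s) but finishes fastest.
```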
### 5. NAT Gateway Alternatives

NAT Gateway: $0.045/hour + $0.045/GB processed = ~$32/month plus data charges.

Alternatives:

1. VPC Endpoints: $0.01/hour = ~$7.30/month (for traffic to AWS services)
2. NAT Instance: t3.nano = ~$3.80/month (limited throughput, self-managed)
3. No NAT at all: use VPC endpoints plus Lambda functions outside the VPC
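The monthly figures above are straightforward hourly arithmetic (730 hours is the usual monthly average; the rates are the ones quoted in this section, and interface endpoints bill per AZ):

```python
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_hourly_cost(rate_per_hour):
    """Monthly cost of a resource billed by the hour."""
    return round(rate_per_hour * HOURS_PER_MONTH, 2)

# NAT Gateway:            monthly_hourly_cost(0.045) -> 32.85, plus $0.045/GB processed
# Interface VPC endpoint: monthly_hourly_cost(0.01)  -> 7.3 per AZ
```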
### 6. CloudWatch Log Retention

```yaml
# Set retention policies to avoid unbounded log growth
LogGroup:
  Type: AWS::Logs::LogGroup
  Properties:
    LogGroupName: /aws/lambda/my-function
    RetentionInDays: 14  # allowed values include 7, 14, 30, 60, 90, ...
```
**Retention Guidelines:**

- Development: 7 days
- Production, non-critical: 30 days
- Production, critical: 90 days
- Compliance workloads: as specified by the requirement
## Security Hardening

### 1. IAM Least Privilege
BAD, overly permissive:

```json
{
  "Effect": "Allow",
  "Action": "dynamodb:*",
  "Resource": "*"
}
```

GOOD, specific actions and resources:

```json
{
  "Effect": "Allow",
  "Action": [
    "dynamodb:GetItem",
    "dynamodb:PutItem",
    "dynamodb:Query"
  ],
  "Resource": [
    "arn:aws:dynamodb:us-east-1:123456789:table/users",
    "arn:aws:dynamodb:us-east-1:123456789:table/users/index/*"
  ]
}
```
### 2. Encryption Configuration

```yaml
# Enable encryption everywhere
Resources:
  # DynamoDB
  Table:
    Type: AWS::DynamoDB::Table
    Properties:
      SSESpecification:
        SSEEnabled: true
        SSEType: KMS
        KMSMasterKeyId: !Ref EncryptionKey

  # S3
  Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: aws:kms
              KMSMasterKeyID: !Ref EncryptionKey

  # RDS
  Database:
    Type: AWS::RDS::DBInstance
    Properties:
      StorageEncrypted: true
      KmsKeyId: !Ref EncryptionKey
```
### 3. Network Isolation

```yaml
# Private subnets with VPC endpoints
Resources:
  PrivateSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      MapPublicIpOnLaunch: false

  # DynamoDB Gateway Endpoint (free)
  DynamoDBEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref VPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.dynamodb
      VpcEndpointType: Gateway
      RouteTableIds:
        - !Ref PrivateRouteTable

  # Secrets Manager Interface Endpoint
  SecretsEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref VPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.secretsmanager
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
```
### 4. Secrets Management

```python
# Never hardcode secrets -- fetch them at runtime
import json

import boto3

def get_secret(secret_name):
    client = boto3.client('secretsmanager')
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])

# Usage
db_creds = get_secret('prod/database/credentials')
connection = connect(
    host=db_creds['host'],
    user=db_creds['username'],
    password=db_creds['password']
)
```
### 5. API Protection

```yaml
# WAF in front of API Gateway (Scope and the per-rule/top-level
# VisibilityConfig blocks are required by AWS::WAFv2::WebACL)
WebACL:
  Type: AWS::WAFv2::WebACL
  Properties:
    Scope: REGIONAL
    DefaultAction:
      Allow: {}
    VisibilityConfig:
      SampledRequestsEnabled: true
      CloudWatchMetricsEnabled: true
      MetricName: WebACL
    Rules:
      - Name: RateLimit
        Priority: 1
        Action:
          Block: {}
        Statement:
          RateBasedStatement:
            Limit: 2000
            AggregateKeyType: IP
        VisibilityConfig:
          SampledRequestsEnabled: true
          CloudWatchMetricsEnabled: true
          MetricName: RateLimitRule
      - Name: AWSManagedRulesCommonRuleSet
        Priority: 2
        OverrideAction:
          None: {}
        Statement:
          ManagedRuleGroupStatement:
            VendorName: AWS
            Name: AWSManagedRulesCommonRuleSet
        VisibilityConfig:
          SampledRequestsEnabled: true
          CloudWatchMetricsEnabled: true
          MetricName: CommonRuleSet
```
### 6. Audit Logging

```yaml
# Enable CloudTrail for all API calls
CloudTrail:
  Type: AWS::CloudTrail::Trail
  Properties:
    IsMultiRegionTrail: true
    IsLogging: true
    S3BucketName: !Ref AuditLogsBucket
    IncludeGlobalServiceEvents: true
    EnableLogFileValidation: true
    EventSelectors:
      - ReadWriteType: All
        IncludeManagementEvents: true
```
## Scalability Patterns

### 1. Horizontal vs Vertical Scaling

**Horizontal (preferred):**

- Add Lambda concurrent executions
- Add Fargate tasks
- Add DynamoDB capacity

**Vertical (when necessary):**

- Increase Lambda memory
- Upgrade the RDS instance class
- Use larger EC2 instances
### 2. Database Sharding

```python
import hashlib

NUM_SHARDS = 4  # fixed shard count; changing it means re-sharding existing data

# Partition by tenant ID. Use a stable hash: Python's built-in hash() is
# randomized per process and would route the same tenant to different shards.
def get_table_for_tenant(tenant_id):
    digest = hashlib.sha256(tenant_id.encode()).hexdigest()
    return f"data-shard-{int(digest, 16) % NUM_SHARDS}"

# Or use DynamoDB single-table design with composite partition keys
def get_partition_key(tenant_id, entity_type, entity_id):
    return f"TENANT#{tenant_id}#{entity_type}#{entity_id}"
```
### 3. Caching Layers

| Layer | Service | Scope and use | Typical TTL |
|---|---|---|---|
| Edge | CloudFront | Global, static content | Hours to days |
| Application | ElastiCache (Redis) | Regional, session/query cache | Minutes to hours |
| Database | DAX | DynamoDB-specific reads | Minutes |
```python
# ElastiCache Redis cache-aside pattern
import json

import redis

cache = redis.Redis(host='cache.abc123.cache.amazonaws.com', port=6379)

def get_user(user_id):
    # Check the cache first
    cached = cache.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # Fetch from the database (db is your data-access layer, not shown)
    user = db.get_user(user_id)

    # Cache for 5 minutes
    cache.setex(f"user:{user_id}", 300, json.dumps(user))
    return user
```
### 4. Auto-Scaling Configuration

```yaml
# ECS service auto-scaling
AutoScalingTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    MaxCapacity: 10
    MinCapacity: 2
    ResourceId: !Sub service/${Cluster}/${Service.Name}
    ScalableDimension: ecs:service:DesiredCount
    ServiceNamespace: ecs

ScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyType: TargetTrackingScaling
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 70
      ScaleInCooldown: 300
      ScaleOutCooldown: 60
```
## DevOps and Reliability

### 1. Infrastructure as Code

```bash
# Version-control all infrastructure
git init
git add .
git commit -m "Initial infrastructure setup"

# Use separate stacks per environment
cdk deploy --context environment=dev
cdk deploy --context environment=staging
cdk deploy --context environment=production
```
### 2. Blue/Green Deployments

```yaml
# CodeDeploy blue/green for ECS
DeploymentGroup:
  Type: AWS::CodeDeploy::DeploymentGroup
  Properties:
    DeploymentConfigName: CodeDeployDefault.ECSAllAtOnce
    DeploymentStyle:
      DeploymentType: BLUE_GREEN
      DeploymentOption: WITH_TRAFFIC_CONTROL
    BlueGreenDeploymentConfiguration:
      DeploymentReadyOption:
        ActionOnTimeout: CONTINUE_DEPLOYMENT
        WaitTimeInMinutes: 0
      TerminateBlueInstancesOnDeploymentSuccess:
        Action: TERMINATE
        TerminationWaitTimeInMinutes: 5
```
### 3. Health Checks

```python
# Application health endpoint
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'external_api': check_external_api()
    }
    status = 'healthy' if all(checks.values()) else 'unhealthy'
    code = 200 if status == 'healthy' else 503
    return jsonify({'status': status, 'checks': checks}), code

def check_database():
    try:
        # Quick connectivity test
        db.execute('SELECT 1')
        return True
    except Exception:
        return False
```
### 4. Monitoring Setup

```yaml
# CloudWatch dashboard
Dashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: production-overview
    DashboardBody: |
      {
        "widgets": [
          {
            "type": "metric",
            "properties": {
              "metrics": [
                ["AWS/Lambda", "Invocations", "FunctionName", "api-handler"],
                [".", "Errors", ".", "."],
                [".", "Duration", ".", ".", {"stat": "p99"}]
              ],
              "period": 60,
              "title": "Lambda Metrics"
            }
          }
        ]
      }

# Critical alarms
ErrorAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: high-error-rate
    MetricName: Errors
    Namespace: AWS/Lambda
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 3
    Threshold: 10
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref AlertTopic
```
## Common Pitfalls

### Technical Debt
| Pitfall | Solution |
|---|---|
| Over-engineering early | Start simple, scale when needed |
| Under-monitoring | Set up CloudWatch from day one |
| Ignoring costs | Enable Cost Explorer and billing alerts |
| Single region only | Plan for multi-region from start |
### Security Mistakes
| Mistake | Prevention |
|---|---|
| Public S3 buckets | Block public access, use bucket policies |
| Overly permissive IAM | Never use "*", specify resources |
| Hardcoded credentials | Use Secrets Manager, IAM roles |
| Unencrypted data | Enable encryption by default |
### Performance Issues
| Issue | Solution |
|---|---|
| No caching | Add CloudFront, ElastiCache early |
| Inefficient queries | Use indexes, avoid DynamoDB scans |
| Large Lambda packages | Use layers, minimize dependencies |
| N+1 queries | Implement DataLoader, batch operations |
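The N+1 fix in the last row deserves a sketch: collect the keys first, then resolve them with one batch read instead of one query per item. A minimal illustration; `fetch_many` is a hypothetical stand-in for a real batch API such as DynamoDB `BatchGetItem`:

```python
def load_batched(keys, fetch_many):
    """Resolve many keys with one batch call instead of N single reads.

    fetch_many takes a list of unique keys and returns a {key: value} dict.
    """
    unique = list(dict.fromkeys(keys))  # dedupe while preserving order
    found = fetch_many(unique)
    return [found.get(k) for k in keys]
```

DataLoader-style libraries add request-scoped caching and automatic batching on top of this same idea.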
### Cost Surprises
| Surprise | Prevention |
|---|---|
| Undeleted resources | Tag everything, review weekly |
| Data transfer costs | Keep traffic in same AZ/region |
| NAT Gateway charges | Use VPC endpoints for AWS services |
| Log accumulation | Set CloudWatch retention policies |