Complete restructure based on AI Agent Skills Benchmark feedback (original score: 66/100):
## Directory Reorganization
- Moved Python scripts to scripts/ directory
- Moved sample files to assets/ directory
- Created references/ directory with extracted content
- Removed HOW_TO_USE.md (integrated into SKILL.md)
- Removed __pycache__
## New Reference Files (3 files)
- architecture_patterns.md: 6 AWS patterns (serverless, microservices, three-tier,
data processing, GraphQL, multi-region) with diagrams, cost breakdowns, pros/cons
- service_selection.md: Decision matrices for compute, database, storage, messaging,
networking, security services with code examples
- best_practices.md: Serverless design, cost optimization, security hardening,
scalability patterns, common pitfalls
## SKILL.md Rewrite
- Reduced from 345 lines to 307 lines (moved patterns to references/)
- Added trigger phrases to description ("design serverless architecture",
"create CloudFormation templates", "optimize AWS costs")
- Structured around 6-step workflow instead of encyclopedia format
- Added Quick Start examples (MVP, Scaling, Cost Optimization, IaC)
- Removed marketing language ("Expert", "comprehensive")
- Consistent imperative voice throughout
## Structure Changes
- scripts/: architecture_designer.py, cost_optimizer.py, serverless_stack.py
- references/: architecture_patterns.md, service_selection.md, best_practices.md
- assets/: sample_input.json, expected_output.json
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
# AWS Best Practices for Startups

Production-ready practices for serverless, cost optimization, security, and operational excellence.

---

## Table of Contents

- [Serverless Best Practices](#serverless-best-practices)
- [Cost Optimization](#cost-optimization)
- [Security Hardening](#security-hardening)
- [Scalability Patterns](#scalability-patterns)
- [DevOps and Reliability](#devops-and-reliability)
- [Common Pitfalls](#common-pitfalls)

---

## Serverless Best Practices

### Lambda Function Design

#### 1. Keep Functions Stateless

Store state externally in DynamoDB, S3, or ElastiCache.

```python
# BAD: Function-level state
cache = {}

def handler(event, context):
    if event['key'] in cache:
        return cache[event['key']]
    # ...

# GOOD: External state
import boto3
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('cache')

def handler(event, context):
    response = table.get_item(Key={'pk': event['key']})
    if 'Item' in response:
        return response['Item']['value']
    # ...
```

#### 2. Implement Idempotency

Handle retries gracefully: derive a unique key per request and return the stored result when the same request arrives again.

```python
import boto3
import hashlib
import time

dynamodb = boto3.resource('dynamodb')
idempotency_table = dynamodb.Table('idempotency')

def handler(event, context):
    # Generate idempotency key
    idempotency_key = hashlib.sha256(
        f"{event['orderId']}-{event['action']}".encode()
    ).hexdigest()

    # Check if already processed
    try:
        response = idempotency_table.get_item(Key={'pk': idempotency_key})
        if 'Item' in response:
            return response['Item']['result']
    except Exception:
        # Lookup failed: fall through and process the request anyway
        pass

    # Process request (process_order is the application's own function)
    result = process_order(event)

    # Store result for idempotency
    idempotency_table.put_item(
        Item={
            'pk': idempotency_key,
            'result': result,
            'ttl': int(time.time()) + 86400  # 24h TTL
        }
    )

    return result
```
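
The check-then-store flow can be exercised without AWS. The following is a minimal sketch, assuming an in-memory dict stands in for the DynamoDB table and a hypothetical `process_order` that counts its own calls. It only illustrates that a retried event returns the stored result instead of being processed twice.

```python
import hashlib

# Hypothetical in-memory stand-in for the idempotency table (assumption).
_store = {}
calls = {"count": 0}

def make_key(order_id, action):
    """Derive a deterministic key from the fields that identify a request."""
    return hashlib.sha256(f"{order_id}-{action}".encode()).hexdigest()

def process_order(event):
    """Hypothetical side-effecting work; counts how often it really runs."""
    calls["count"] += 1
    return {"orderId": event["orderId"], "status": "processed"}

def handle(event):
    key = make_key(event["orderId"], event["action"])
    if key in _store:        # already processed: return the stored result
        return _store[key]
    result = process_order(event)
    _store[key] = result     # remember the result for later retries
    return result

event = {"orderId": "o-1", "action": "charge"}
first = handle(event)
second = handle(event)       # simulated retry of the same request
```

With a real table, two concurrent retries can both pass the lookup. A conditional write (for example a `ConditionExpression` such as `attribute_not_exists(pk)`) is one common way to narrow that gap.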

#### 3. Optimize Cold Starts

```python
# Initialize outside handler (reused across invocations)
import boto3
from aws_xray_sdk.core import patch_all

# SDK initialization happens once
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')
patch_all()

def handler(event, context):
    # Handler code uses pre-initialized resources
    return table.get_item(Key={'pk': event['id']})
```

**Cold Start Reduction Techniques:**
- Use provisioned concurrency for critical paths
- Minimize package size (use layers for dependencies)
- Prefer runtimes with fast startup (Python, Node.js) over JVM-based or .NET runtimes on latency-sensitive paths
- Avoid VPC unless necessary (VPC attachment historically added multi-second cold starts; the penalty is much smaller since AWS changed Lambda VPC networking in 2019)
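
A lazy variant of "initialize outside the handler" builds the expensive object on first use and caches it for the lifetime of the execution environment. This is a sketch only: `get_client` and the dict it returns are hypothetical stand-ins for something like a boto3 resource, and the counter exists purely to show that initialization runs once.

```python
import functools

init_count = {"n": 0}

@functools.lru_cache(maxsize=None)
def get_client():
    """Build the expensive client on first use, then reuse the cached instance."""
    init_count["n"] += 1
    return {"name": "hypothetical-client"}  # stand-in for a real SDK client

def handler(event, context):
    client = get_client()  # cheap after the first invocation
    return {"client": client["name"], "id": event["id"]}

# Two invocations in the same warm environment
handler({"id": 1}, None)
handler({"id": 2}, None)
```

Lazy initialization keeps code paths that never need the client from paying for it, at the cost of moving the setup time into the first request that does.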

#### 4. Set Appropriate Timeouts

```yaml
# Lambda configuration
Functions:
  ApiHandler:
    Timeout: 10  # Shorter for synchronous APIs
    MemorySize: 512

  BackgroundProcessor:
    Timeout: 300  # Longer for async processing
    MemorySize: 1024
```

**Timeout Guidelines:**
- API handlers: 10-30 seconds
- Event processors: 60-300 seconds
- Use Step Functions for workflows longer than 15 minutes (the Lambda maximum timeout)
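
Whatever timeout is configured, a handler can check how much time is left and stop cleanly before Lambda terminates it. The sketch below relies on the `get_remaining_time_in_millis()` method of the Lambda context object; `FakeContext`, the 400 ms step, and `SAFETY_MARGIN_MS` are assumptions made so the example runs locally. Unprocessed items are returned so the caller can requeue them.

```python
class FakeContext:
    """Hypothetical stand-in for the Lambda context object, for local runs only."""
    def __init__(self, remaining_ms):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        # Each call pretends that 400 ms of work has elapsed.
        self._remaining_ms -= 400
        return self._remaining_ms

SAFETY_MARGIN_MS = 1000  # assumed buffer; tune per workload

def handler(event, context):
    processed, pending = [], list(event["items"])
    while pending:
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            break  # stop early and hand the remainder back to the caller
        processed.append(pending.pop(0))
    return {"processed": processed, "unprocessed": pending}

result = handler({"items": [1, 2, 3, 4, 5]}, FakeContext(remaining_ms=2500))
```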

---

## Cost Optimization

### 1. Right-Sizing Strategy

```bash
# Check EC2 utilization over the last 7 days
# (GNU date syntax; on macOS/BSD use `date -v-7d` instead of `date -d '7 days ago'`)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time $(date -d '7 days ago' -u +"%Y-%m-%dT%H:%M:%SZ") \
  --end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
  --period 3600 \
  --statistics Average
```

**Right-Sizing Rules:**
- Below 10% average CPU: Downsize instance
- Above 80% average CPU: Consider upgrade or horizontal scaling
- Review every month for the first 6 months
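
The thresholds above can be applied to the `Average` datapoints returned by CloudWatch. A minimal sketch, assuming the datapoints have already been extracted into a list of percentages; the function name and the sample values are illustrative.

```python
def right_sizing_action(avg_cpu_percent):
    """Map an average CPU utilization to the rule of thumb above."""
    if not 0 <= avg_cpu_percent <= 100:
        raise ValueError("CPU utilization must be between 0 and 100")
    if avg_cpu_percent < 10:
        return "downsize"
    if avg_cpu_percent > 80:
        return "upgrade-or-scale-out"
    return "keep"

# Hypothetical hourly averages taken from the CloudWatch response
datapoints = [4.0, 6.5, 9.0, 7.5]
average = sum(datapoints) / len(datapoints)
action = right_sizing_action(average)
```

CPU is only one signal. Memory, network, and disk utilization may justify keeping an instance size that CPU alone would flag.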

### 2. Savings Plans and Reserved Instances

| Commitment | Savings | Best For |
|------------|---------|----------|
| No Upfront, 1-year | 20-30% | Uncertain future usage |
| Partial Upfront, 1-year | 30-40% | Moderate confidence |
| All Upfront, 3-year | 50-60% | Stable workloads |

```bash
# Check Savings Plans recommendations
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS
```
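
A commitment only pays off if enough of it is actually used. As a rough sketch under simplifying assumptions (a flat discount, a commitment sized to a fixed on-demand-equivalent spend, no partial-hour effects), the break-even utilization is one minus the discount. The function names and dollar figures are illustrative.

```python
def break_even_utilization(discount):
    """Fraction of the committed usage that must be consumed to match on-demand cost."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount

def monthly_cost(on_demand_equivalent, utilization, discount):
    """Return (on_demand_cost, committed_cost) for one month.

    on_demand_equivalent: what the committed capacity would cost on demand.
    utilization: fraction of that capacity actually used (0..1).
    """
    on_demand_cost = on_demand_equivalent * utilization
    committed_cost = on_demand_equivalent * (1 - discount)  # paid regardless of use
    return on_demand_cost, committed_cost

# Hypothetical numbers: $1,000/month of on-demand-equivalent capacity, 30% discount
threshold = break_even_utilization(0.30)
on_demand, committed = monthly_cost(1000, 0.60, 0.30)
```

With a 30% discount, usage below roughly 70% of the commitment makes on-demand cheaper, which is why the table pairs longer commitments with stable workloads.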

### 3. S3 Lifecycle Policies

```json
{
  "Rules": [
    {
      "ID": "Transition to cheaper storage",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "logs/"
      },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```
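
To reason about what this rule does to a given object, the transitions can be modeled as a lookup by object age. This is an approximation for illustration: the function name is hypothetical, and S3 applies lifecycle actions asynchronously, so real transitions can lag the configured day counts.

```python
# Mirrors the rule above: (minimum age in days, storage class)
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER")]
EXPIRATION_DAYS = 365

def storage_class_for_age(age_days):
    """Return the expected storage class for an object under logs/, or None once expired."""
    if age_days >= EXPIRATION_DAYS:
        return None
    current = "STANDARD"
    for min_age, storage_class in TRANSITIONS:
        if age_days >= min_age:
            current = storage_class
    return current
```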

### 4. Lambda Memory Optimization

Test different memory settings to find the best cost/performance trade-off. Lambda bills compute by memory × duration and allocates CPU in proportion to memory, so a larger setting can cost the same or less when it shortens the run.

```python
# Use AWS Lambda Power Tuning
# https://github.com/alexcasalboni/aws-lambda-power-tuning

# Example results (compute charge per invocation):
# 128 MB:  2000ms, $0.0000042
# 512 MB:  500ms,  $0.0000042
# 1024 MB: 300ms,  $0.0000050

# Optimal: 512 MB (same cost, 4x faster)
```
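
The per-invocation compute charge behind such results is memory in GB times duration in seconds times the price per GB-second. The sketch below assumes the x86 list price in us-east-1 at the time of writing (`PRICE_PER_GB_SECOND`) and ignores the per-request fee; check current pricing before relying on the numbers.

```python
# Assumption: x86 Lambda compute list price in us-east-1; verify against current pricing.
PRICE_PER_GB_SECOND = 0.0000166667

def invocation_cost(memory_mb, duration_ms, price=PRICE_PER_GB_SECOND):
    """Compute charge for one invocation (excludes the per-request fee)."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price

# Memory setting (MB) -> measured duration (ms), as in the example results
measurements = {128: 2000, 512: 500, 1024: 300}
costs = {mem: invocation_cost(mem, ms) for mem, ms in measurements.items()}

# Cheapest setting; ties go to the faster one
best = min(costs, key=lambda mem: (round(costs[mem], 12), measurements[mem]))
```

Here 128 MB and 512 MB both consume 0.25 GB-seconds, so they cost the same and the tie-break picks 512 MB for its lower latency.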

### 5. NAT Gateway Alternatives

```
NAT Gateway: $0.045/hour + $0.045/GB = ~$32/month + data

Alternatives:
1. VPC Endpoints: $0.01/hour = ~$7.30/month per interface endpoint per AZ (for AWS services)
2. NAT Instance: t3.nano = ~$3.80/month (limited throughput)
3. No NAT: Use VPC endpoints + Lambda outside VPC
```
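
The monthly figures follow from the hourly rates. A small sketch of the arithmetic, using the rates quoted above and assuming a 730-hour month; the traffic volume and endpoint count are made-up inputs, and endpoint data-processing fees are left out.

```python
HOURS_PER_MONTH = 730  # average month; an assumption for rough estimates

def monthly_nat_gateway_cost(gb_processed, hourly=0.045, per_gb=0.045):
    """Hourly charge plus data processing, using the rates quoted above."""
    return hourly * HOURS_PER_MONTH + per_gb * gb_processed

def monthly_interface_endpoint_cost(endpoints, hourly=0.01):
    """Hourly charge only; data processing fees are left out of this sketch."""
    return endpoints * hourly * HOURS_PER_MONTH

nat = monthly_nat_gateway_cost(gb_processed=500)
endpoints = monthly_interface_endpoint_cost(endpoints=2)
```

The comparison shifts as the number of endpoints and AZs grows, so it is worth rerunning the numbers for the services a workload actually calls.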

### 6. CloudWatch Log Retention

```yaml
# Set retention policies to avoid unbounded growth
LogGroup:
  Type: AWS::Logs::LogGroup
  Properties:
    LogGroupName: /aws/lambda/my-function
    RetentionInDays: 14  # 7, 14, 30, 60, 90, etc.
```

**Retention Guidelines:**
- Development: 7 days
- Production non-critical: 30 days
- Production critical: 90 days
- Compliance requirements: As specified

---

## Security Hardening

### 1. IAM Least Privilege

```json
// BAD: Overly permissive
{
  "Effect": "Allow",
  "Action": "dynamodb:*",
  "Resource": "*"
}

// GOOD: Specific actions and resources
{
  "Effect": "Allow",
  "Action": [
    "dynamodb:GetItem",
    "dynamodb:PutItem",
    "dynamodb:Query"
  ],
  "Resource": [
    "arn:aws:dynamodb:us-east-1:123456789012:table/users",
    "arn:aws:dynamodb:us-east-1:123456789012:table/users/index/*"
  ]
}
```
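
A simple check can catch the overly permissive pattern before a policy is deployed. The following is a minimal sketch, not a replacement for tools such as IAM Access Analyzer: `find_wildcards` is a hypothetical helper that inspects a single statement dict and flags service-wide actions and a bare `*` resource.

```python
def find_wildcards(statement):
    """Return a list of findings for one IAM policy statement (a dict)."""
    findings = []
    if statement.get("Effect") != "Allow":
        return findings

    actions = statement.get("Action", [])
    resources = statement.get("Resource", [])
    # IAM accepts either a single string or a list for both fields
    actions = [actions] if isinstance(actions, str) else actions
    resources = [resources] if isinstance(resources, str) else resources

    for action in actions:
        if action == "*" or action.endswith(":*"):
            findings.append(f"wildcard action: {action}")
    for resource in resources:
        if resource == "*":
            findings.append("wildcard resource: *")
    return findings

bad = {"Effect": "Allow", "Action": "dynamodb:*", "Resource": "*"}
good = {
    "Effect": "Allow",
    "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Query"],
    "Resource": ["arn:aws:dynamodb:us-east-1:123456789012:table/users"],
}
```

Running `find_wildcards` over every statement in a CI step is one way to keep broad grants from slipping in unnoticed.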

### 2. Encryption Configuration

```yaml
# Enable encryption everywhere
Resources:
  # DynamoDB
  Table:
    Type: AWS::DynamoDB::Table
    Properties:
      SSESpecification:
        SSEEnabled: true
        SSEType: KMS
        KMSMasterKeyId: !Ref EncryptionKey

  # S3
  Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: aws:kms
              KMSMasterKeyID: !Ref EncryptionKey

  # RDS
  Database:
    Type: AWS::RDS::DBInstance
    Properties:
      StorageEncrypted: true
      KmsKeyId: !Ref EncryptionKey
```

### 3. Network Isolation

```yaml
# Private subnets with VPC endpoints
Resources:
  PrivateSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      MapPublicIpOnLaunch: false

  # DynamoDB Gateway Endpoint (free)
  DynamoDBEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref VPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.dynamodb
      VpcEndpointType: Gateway
      RouteTableIds:
        - !Ref PrivateRouteTable

  # Secrets Manager Interface Endpoint
  SecretsEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref VPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.secretsmanager
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
```

### 4. Secrets Management

```python
# Never hardcode secrets
import boto3
import json

def get_secret(secret_name):
    client = boto3.client('secretsmanager')
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])

# Usage
db_creds = get_secret('prod/database/credentials')
# connect() stands for your database driver's connect function
connection = connect(
    host=db_creds['host'],
    user=db_creds['username'],
    password=db_creds['password']
)
```

### 5. API Protection

```yaml
# WAF + API Gateway
WebACL:
  Type: AWS::WAFv2::WebACL
  Properties:
    Scope: REGIONAL  # required; REGIONAL covers API Gateway
    DefaultAction:
      Allow: {}
    VisibilityConfig:  # required at the web ACL level
      SampledRequestsEnabled: true
      CloudWatchMetricsEnabled: true
      MetricName: ApiWebACL  # illustrative name
    Rules:
      - Name: RateLimit
        Priority: 1
        Action:
          Block: {}
        Statement:
          RateBasedStatement:
            Limit: 2000  # requests per 5-minute window per IP
            AggregateKeyType: IP
        VisibilityConfig:
          SampledRequestsEnabled: true
          CloudWatchMetricsEnabled: true
          MetricName: RateLimitRule

      - Name: AWSManagedRulesCommonRuleSet
        Priority: 2
        OverrideAction:
          None: {}
        Statement:
          ManagedRuleGroupStatement:
            VendorName: AWS
            Name: AWSManagedRulesCommonRuleSet
        VisibilityConfig:  # required on every rule
          SampledRequestsEnabled: true
          CloudWatchMetricsEnabled: true
          MetricName: CommonRuleSet  # illustrative name
```

### 6. Audit Logging

```yaml
# Enable CloudTrail for all API calls
CloudTrail:
  Type: AWS::CloudTrail::Trail
  Properties:
    IsMultiRegionTrail: true
    IsLogging: true
    S3BucketName: !Ref AuditLogsBucket
    IncludeGlobalServiceEvents: true
    EnableLogFileValidation: true
    EventSelectors:
      - ReadWriteType: All
        IncludeManagementEvents: true
```

---

## Scalability Patterns

### 1. Horizontal vs Vertical Scaling

```
Horizontal (preferred):
- Add more Lambda concurrent executions
- Add more Fargate tasks
- Add more DynamoDB capacity

Vertical (when necessary):
- Increase Lambda memory
- Upgrade RDS instance
- Larger EC2 instances
```

### 2. Database Sharding

```python
import hashlib

NUM_SHARDS = 8  # example value; set to your shard count

# Partition by tenant ID
def get_table_for_tenant(tenant_id):
    # Use a stable hash: built-in hash() of a str differs between processes
    digest = hashlib.sha256(str(tenant_id).encode()).hexdigest()
    shard = int(digest, 16) % NUM_SHARDS
    return f"data-shard-{shard}"

# Or use DynamoDB single-table design with partition keys
def get_partition_key(tenant_id, entity_type, entity_id):
    return f"TENANT#{tenant_id}#{entity_type}#{entity_id}"
```

### 3. Caching Layers

```
Edge (CloudFront): Global, static content, TTL: hours-days
Application (Redis): Regional, session/query cache, TTL: minutes-hours
Database (DAX): DynamoDB-specific, TTL: minutes
```

```python
# ElastiCache Redis caching pattern
import redis
import json

cache = redis.Redis(host='cache.abc123.cache.amazonaws.com', port=6379)

def get_user(user_id):
    # Check cache first
    cached = cache.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # Fetch from database (db is the application's data-access object)
    user = db.get_user(user_id)

    # Cache for 5 minutes
    cache.setex(f"user:{user_id}", 300, json.dumps(user))

    return user
```

### 4. Auto-Scaling Configuration

```yaml
# ECS Service Auto-scaling
AutoScalingTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    MaxCapacity: 10
    MinCapacity: 2
    ResourceId: !Sub service/${Cluster}/${Service.Name}
    ScalableDimension: ecs:service:DesiredCount
    ServiceNamespace: ecs

ScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: cpu-target-tracking  # illustrative name
    ScalingTargetId: !Ref AutoScalingTarget
    PolicyType: TargetTrackingScaling
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 70
      ScaleInCooldown: 300
      ScaleOutCooldown: 60
```

---

## DevOps and Reliability

### 1. Infrastructure as Code

```bash
# Version control all infrastructure
git init
git add .
git commit -m "Initial infrastructure setup"

# Use separate stacks per environment
cdk deploy --context environment=dev
cdk deploy --context environment=staging
cdk deploy --context environment=production
```

### 2. Blue/Green Deployments

```yaml
# CodeDeploy Blue/Green for ECS
DeploymentGroup:
  Type: AWS::CodeDeploy::DeploymentGroup
  Properties:
    DeploymentConfigName: CodeDeployDefault.ECSAllAtOnce
    DeploymentStyle:
      DeploymentType: BLUE_GREEN
      DeploymentOption: WITH_TRAFFIC_CONTROL
    BlueGreenDeploymentConfiguration:
      DeploymentReadyOption:
        ActionOnTimeout: CONTINUE_DEPLOYMENT
        WaitTimeInMinutes: 0
      TerminateBlueInstancesOnDeploymentSuccess:
        Action: TERMINATE
        TerminationWaitTimeInMinutes: 5
```

### 3. Health Checks

```python
# Application health endpoint
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health():
    # check_cache() and check_external_api() follow the same pattern as check_database()
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'external_api': check_external_api()
    }

    status = 'healthy' if all(checks.values()) else 'unhealthy'
    code = 200 if status == 'healthy' else 503

    return jsonify({'status': status, 'checks': checks}), code

def check_database():
    try:
        # Quick connectivity test (db is the application's database handle)
        db.execute('SELECT 1')
        return True
    except Exception:
        return False
```

### 4. Monitoring Setup

```yaml
# CloudWatch Dashboard
Dashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: production-overview
    DashboardBody: |
      {
        "widgets": [
          {
            "type": "metric",
            "properties": {
              "metrics": [
                ["AWS/Lambda", "Invocations", "FunctionName", "api-handler"],
                [".", "Errors", ".", "."],
                [".", "Duration", ".", ".", {"stat": "p99"}]
              ],
              "period": 60,
              "title": "Lambda Metrics"
            }
          }
        ]
      }

# Critical Alarms
ErrorAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: high-error-rate
    MetricName: Errors
    Namespace: AWS/Lambda
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 3
    Threshold: 10
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref AlertTopic
```
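
The alarm settings (`Period: 60`, `EvaluationPeriods: 3`, `Threshold: 10`, `GreaterThanThreshold`) mean a single spike does not page anyone; the error sum has to stay above 10 for three consecutive one-minute periods. The sketch below is a simplified model of that rule: `alarm_state` is a hypothetical helper, and real CloudWatch evaluation also handles missing data and the optional `DatapointsToAlarm` setting.

```python
def alarm_state(datapoints, threshold=10, evaluation_periods=3):
    """Return "ALARM" if the most recent `evaluation_periods` values all exceed the threshold."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(value > threshold for value in recent) else "OK"

# Hypothetical per-minute error sums, oldest first
spike = alarm_state([2, 50, 3, 1])        # one bad minute, then recovery
sustained = alarm_state([2, 12, 15, 11])  # three bad minutes in a row
```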

---

## Common Pitfalls

### Technical Debt

| Pitfall | Solution |
|---------|----------|
| Over-engineering early | Start simple, scale when needed |
| Under-monitoring | Set up CloudWatch from day one |
| Ignoring costs | Enable Cost Explorer and billing alerts |
| Single region only | Plan for multi-region from start |

### Security Mistakes

| Mistake | Prevention |
|---------|------------|
| Public S3 buckets | Block public access, use bucket policies |
| Overly permissive IAM | Never use "*", specify resources |
| Hardcoded credentials | Use Secrets Manager, IAM roles |
| Unencrypted data | Enable encryption by default |

### Performance Issues

| Issue | Solution |
|-------|----------|
| No caching | Add CloudFront, ElastiCache early |
| Inefficient queries | Use indexes, avoid DynamoDB scans |
| Large Lambda packages | Use layers, minimize dependencies |
| N+1 queries | Implement DataLoader, batch operations |

### Cost Surprises

| Surprise | Prevention |
|----------|------------|
| Undeleted resources | Tag everything, review weekly |
| Data transfer costs | Keep traffic in same AZ/region |
| NAT Gateway charges | Use VPC endpoints for AWS services |
| Log accumulation | Set CloudWatch retention policies |