claude-skills-reference/engineering-team/aws-solution-architect/references/best_practices.md
Alireza Rezvani c7dc957823 fix(skill): restructure aws-solution-architect for better organization (#61) (#114)
Complete restructure based on AI Agent Skills Benchmark feedback (original score: 66/100):

## Directory Reorganization
- Moved Python scripts to scripts/ directory
- Moved sample files to assets/ directory
- Created references/ directory with extracted content
- Removed HOW_TO_USE.md (integrated into SKILL.md)
- Removed __pycache__

## New Reference Files (3 files)
- architecture_patterns.md: 6 AWS patterns (serverless, microservices, three-tier,
  data processing, GraphQL, multi-region) with diagrams, cost breakdowns, pros/cons
- service_selection.md: Decision matrices for compute, database, storage, messaging,
  networking, security services with code examples
- best_practices.md: Serverless design, cost optimization, security hardening,
  scalability patterns, common pitfalls

## SKILL.md Rewrite
- Reduced from 345 lines to 307 lines (moved patterns to references/)
- Added trigger phrases to description ("design serverless architecture",
  "create CloudFormation templates", "optimize AWS costs")
- Structured around 6-step workflow instead of encyclopedia format
- Added Quick Start examples (MVP, Scaling, Cost Optimization, IaC)
- Removed marketing language ("Expert", "comprehensive")
- Consistent imperative voice throughout

## Structure Changes
- scripts/: architecture_designer.py, cost_optimizer.py, serverless_stack.py
- references/: architecture_patterns.md, service_selection.md, best_practices.md
- assets/: sample_input.json, expected_output.json

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 02:42:08 +01:00

# AWS Best Practices for Startups
Production-ready practices for serverless, cost optimization, security, and operational excellence.

---
## Table of Contents
- [Serverless Best Practices](#serverless-best-practices)
- [Cost Optimization](#cost-optimization)
- [Security Hardening](#security-hardening)
- [Scalability Patterns](#scalability-patterns)
- [DevOps and Reliability](#devops-and-reliability)
- [Common Pitfalls](#common-pitfalls)
---
## Serverless Best Practices
### Lambda Function Design
#### 1. Keep Functions Stateless
Store state externally in DynamoDB, S3, or ElastiCache.
```python
# BAD: Function-level state
cache = {}

def handler(event, context):
    if event['key'] in cache:
        return cache[event['key']]
    # ...

# GOOD: External state
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('cache')

def handler(event, context):
    response = table.get_item(Key={'pk': event['key']})
    if 'Item' in response:
        return response['Item']['value']
    # ...
```
#### 2. Implement Idempotency
Handle retries gracefully with unique request IDs.
```python
import hashlib
import time

import boto3

dynamodb = boto3.resource('dynamodb')
idempotency_table = dynamodb.Table('idempotency')

def handler(event, context):
    # Generate idempotency key
    idempotency_key = hashlib.sha256(
        f"{event['orderId']}-{event['action']}".encode()
    ).hexdigest()

    # Check if already processed
    try:
        response = idempotency_table.get_item(Key={'pk': idempotency_key})
        if 'Item' in response:
            return response['Item']['result']
    except Exception:
        # Lookup failed; fall through and process the request
        pass

    # Process request (process_order is application-specific)
    result = process_order(event)

    # Store result for idempotency
    idempotency_table.put_item(
        Item={
            'pk': idempotency_key,
            'result': result,
            'ttl': int(time.time()) + 86400  # 24h TTL
        }
    )
    return result
```
#### 3. Optimize Cold Starts
```python
# Initialize outside handler (reused across invocations)
import boto3
from aws_xray_sdk.core import patch_all

# SDK initialization happens once
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')
patch_all()

def handler(event, context):
    # Handler code uses pre-initialized resources
    return table.get_item(Key={'pk': event['id']})
```
**Cold Start Reduction Techniques:**
- Use provisioned concurrency for critical paths
- Minimize package size (use layers for dependencies)
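- Prefer runtimes with low startup overhead (Python, Node.js) over heavier ones such as Java or .NET
- Avoid VPC unless necessary (the multi-second cold start penalty of older Lambda VPC networking has largely been removed, but VPC attachment still adds networking setup and cost; measure for your workload)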
#### 4. Set Appropriate Timeouts
```yaml
# Lambda configuration
Functions:
  ApiHandler:
    Timeout: 10  # Shorter for synchronous APIs
    MemorySize: 512
  BackgroundProcessor:
    Timeout: 300  # Longer for async processing
    MemorySize: 1024
```
**Timeout Guidelines:**
- API handlers: 10-30 seconds
- Event processors: 60-300 seconds
- Use Step Functions for >15 minute workflows
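
One way to stay within these timeouts in an event processor is to check the remaining execution time and stop before Lambda ends the invocation. The sketch below is illustrative only: the `records` event shape, the `process_record` helper, and the 5-second safety margin are assumptions, while `context.get_remaining_time_in_millis()` is the method the Lambda Python runtime exposes on the context object.

```python
SAFETY_MARGIN_MS = 5_000  # assumed buffer; tune per workload


def process_record(record):
    """Hypothetical per-record work; replace with real logic."""
    return {"id": record["id"], "status": "done"}


def handler(event, context):
    processed, remaining = [], []
    for record in event["records"]:  # assumed event shape
        # Stop doing work once the invocation is close to its timeout
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            remaining.append(record)
            continue
        processed.append(process_record(record))
    # Return unprocessed records so the caller can re-queue them
    return {"processed": processed, "remaining": remaining}
```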
---
## Cost Optimization
### 1. Right-Sizing Strategy
```bash
# Check EC2 utilization (hourly averages over the last 7 days)
# Note: `date -d` is GNU date syntax; on macOS/BSD use `date -v-7d` instead
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time "$(date -d '7 days ago' -u +"%Y-%m-%dT%H:%M:%SZ")" \
  --end-time "$(date -u +"%Y-%m-%dT%H:%M:%SZ")" \
  --period 3600 \
  --statistics Average
```
**Right-Sizing Rules:**
- <10% CPU average: Downsize instance
- >80% CPU average: Consider upgrade or horizontal scaling
- Review every month for the first 6 months
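
The rules above can be applied mechanically to the `Average` datapoints returned by the CloudWatch query. This is a minimal sketch; the function name and the returned labels are hypothetical, and the 10% and 80% thresholds are simply the ones listed above.

```python
from statistics import mean


def rightsizing_action(cpu_samples, low=10.0, high=80.0):
    """Map average CPU utilization (percent) to a right-sizing action."""
    if not cpu_samples:
        return "no-data"
    avg = mean(cpu_samples)
    if avg < low:
        return "downsize"
    if avg > high:
        return "upgrade-or-scale-out"
    return "keep"
```

CPU alone can be misleading for memory-bound or I/O-bound workloads, so treat the result as a prompt for review rather than an automatic decision.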
### 2. Savings Plans and Reserved Instances
| Commitment | Savings | Best For |
|------------|---------|----------|
| No Upfront, 1-year | 20-30% | Unknown future |
| Partial Upfront, 1-year | 30-40% | Moderate confidence |
| All Upfront, 3-year | 50-60% | Stable workloads |
```bash
# Check Savings Plans recommendations (the Cost Explorer CLI namespace is `ce`)
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS
```
### 3. S3 Lifecycle Policies
```json
{
  "Rules": [
    {
      "ID": "Transition to cheaper storage",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "logs/"
      },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```
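To make the effect of this rule concrete, the sketch below maps an object's age to the storage class the policy would leave it in. It assumes objects under `logs/` are uploaded as `STANDARD`; the function name is hypothetical, and real transitions can lag behind these day counts because lifecycle actions run asynchronously.

```python
def storage_class_for_age(age_days):
    """Return the storage class the lifecycle rule implies, or None if expired."""
    if age_days >= 365:
        return None  # object has been expired (deleted)
    if age_days >= 90:
        return "GLACIER"
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"
```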
### 4. Lambda Memory Optimization
Test different memory settings to find optimal cost/performance.
```python
# Use AWS Lambda Power Tuning
# https://github.com/alexcasalboni/aws-lambda-power-tuning
# Example results (approximate compute cost per invocation):
# 128 MB: 2000ms, $0.0000042
# 512 MB: 500ms, $0.0000042
# 1024 MB: 300ms, $0.0000050
# Optimal: 512 MB (same cost, 4x faster)
```
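The comparison works because Lambda compute is billed by memory multiplied by duration (GB-seconds). The sketch below reproduces the reasoning for the example durations above. The per-GB-second rate is an assumption (a commonly published x86 rate; check current pricing for your region and architecture), and the per-request fee is ignored.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 rate; verify current pricing


def compute_cost(memory_mb, duration_ms, price=PRICE_PER_GB_SECOND):
    """Compute-only cost of one invocation (excludes the per-request fee)."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price


results = {128: 2000, 512: 500, 1024: 300}  # memory (MB) -> duration (ms)
costs = {mem: compute_cost(mem, ms) for mem, ms in results.items()}
cheapest = min(costs.values())

# Among the cheapest settings (within 1%), prefer the fastest one
best = min(
    (mem for mem, cost in costs.items() if cost <= cheapest * 1.01),
    key=lambda mem: results[mem],
)
print(best)
```

Here 128 MB and 512 MB consume the same GB-seconds, so the faster 512 MB setting wins.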
### 5. NAT Gateway Alternatives
```
NAT Gateway: $0.045/hour + $0.045/GB = ~$32/month + data
Alternatives:
  1. VPC Endpoints: $0.01/hour = ~$7.30/month per interface endpoint, per AZ (for AWS services)
  2. NAT Instance: t3.nano = ~$3.80/month (limited throughput)
  3. No NAT: Use VPC endpoints + Lambda outside VPC
```
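The monthly figures follow from multiplying the hourly rate by roughly 730 hours and adding any per-GB processing charge. A minimal sketch of that arithmetic, using the rates quoted above and an assumed 100 GB of monthly traffic (the function name and the traffic volume are illustrative):

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_cost(hourly_rate, gb_processed=0.0, per_gb_rate=0.0):
    """Fixed hourly charge plus data processing charge for one month."""
    return hourly_rate * HOURS_PER_MONTH + gb_processed * per_gb_rate


# NAT Gateway with an assumed 100 GB/month of processed traffic
nat_gateway = monthly_cost(0.045, gb_processed=100, per_gb_rate=0.045)

# Hourly charge only for a single interface endpoint in one AZ
vpc_endpoint = monthly_cost(0.01)
```

Rates vary by region, so substitute current pricing before drawing conclusions.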
### 6. CloudWatch Log Retention
```yaml
# Set retention policies to avoid unbounded growth
LogGroup:
  Type: AWS::Logs::LogGroup
  Properties:
    LogGroupName: /aws/lambda/my-function
    RetentionInDays: 14  # 7, 14, 30, 60, 90, etc.
```
**Retention Guidelines:**
- Development: 7 days
- Production non-critical: 30 days
- Production critical: 90 days
- Compliance requirements: As specified
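
These guidelines can be encoded in a small helper, for example to derive the `RetentionInDays` value per environment when generating templates. This is a sketch under stated assumptions: the function name, the environment labels, and the `compliance_days` override are hypothetical.

```python
def retention_days(environment, critical=False, compliance_days=None):
    """Pick a log retention period following the guidelines above."""
    if compliance_days is not None:
        return compliance_days  # a compliance requirement overrides defaults
    if environment == "development":
        return 7
    # Production: 90 days for critical workloads, 30 days otherwise
    return 90 if critical else 30
```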
---
## Security Hardening
### 1. IAM Least Privilege
```json
// BAD: Overly permissive
{
  "Effect": "Allow",
  "Action": "dynamodb:*",
  "Resource": "*"
}

// GOOD: Specific actions and resources
{
  "Effect": "Allow",
  "Action": [
    "dynamodb:GetItem",
    "dynamodb:PutItem",
    "dynamodb:Query"
  ],
  "Resource": [
    "arn:aws:dynamodb:us-east-1:123456789012:table/users",
    "arn:aws:dynamodb:us-east-1:123456789012:table/users/index/*"
  ]
}
```
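A simple review aid is to scan policy documents for the wildcard patterns shown in the BAD example. The sketch below is a heuristic, not a substitute for tools such as IAM Access Analyzer: the function name is hypothetical, and it only flags `*` and values ending in `:*` in `Allow` statements.

```python
def find_wildcards(policy):
    """Return (statement index, field) pairs where an Allow uses a wildcard."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        for field in ("Action", "Resource"):
            values = statement.get(field, [])
            if isinstance(values, str):
                values = [values]
            # Flag "*" and service-wide wildcards such as "dynamodb:*"
            if any(v == "*" or v.endswith(":*") for v in values):
                findings.append((index, field))
    return findings
```

Scoped patterns such as `table/users/index/*` are deliberately not flagged, since they remain limited to one table.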
### 2. Encryption Configuration
```yaml
# Enable encryption everywhere
Resources:
  # DynamoDB
  Table:
    Type: AWS::DynamoDB::Table
    Properties:
      SSESpecification:
        SSEEnabled: true
        SSEType: KMS
        KMSMasterKeyId: !Ref EncryptionKey
  # S3
  Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: aws:kms
              KMSMasterKeyID: !Ref EncryptionKey
  # RDS
  Database:
    Type: AWS::RDS::DBInstance
    Properties:
      StorageEncrypted: true
      KmsKeyId: !Ref EncryptionKey
```
### 3. Network Isolation
```yaml
# Private subnets with VPC endpoints
Resources:
  PrivateSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      MapPublicIpOnLaunch: false
  # DynamoDB Gateway Endpoint (free)
  DynamoDBEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref VPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.dynamodb
      VpcEndpointType: Gateway
      RouteTableIds:
        - !Ref PrivateRouteTable
  # Secrets Manager Interface Endpoint
  SecretsEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref VPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.secretsmanager
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
```
### 4. Secrets Management
```python
# Never hardcode secrets
import json

import boto3

def get_secret(secret_name):
    client = boto3.client('secretsmanager')
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])

# Usage (connect() stands in for your database driver's connect function)
db_creds = get_secret('prod/database/credentials')
connection = connect(
    host=db_creds['host'],
    user=db_creds['username'],
    password=db_creds['password']
)
```
### 5. API Protection
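```yaml
# WAF + API Gateway
WebACL:
  Type: AWS::WAFv2::WebACL
  Properties:
    Scope: REGIONAL  # Required; REGIONAL for API Gateway
    DefaultAction:
      Allow: {}
    VisibilityConfig:  # Required at the web ACL level
      SampledRequestsEnabled: true
      CloudWatchMetricsEnabled: true
      MetricName: WebACL
    Rules:
      - Name: RateLimit
        Priority: 1
        Action:
          Block: {}
        Statement:
          RateBasedStatement:
            Limit: 2000
            AggregateKeyType: IP
        VisibilityConfig:
          SampledRequestsEnabled: true
          CloudWatchMetricsEnabled: true
          MetricName: RateLimitRule
      - Name: AWSManagedRulesCommonRuleSet
        Priority: 2
        OverrideAction:
          None: {}
        Statement:
          ManagedRuleGroupStatement:
            VendorName: AWS
            Name: AWSManagedRulesCommonRuleSet
        VisibilityConfig:  # Required on every rule
          SampledRequestsEnabled: true
          CloudWatchMetricsEnabled: true
          MetricName: CommonRuleSet
```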
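The rate-based rule blocks a client IP once its request count within a trailing window exceeds `Limit`. The sketch below illustrates that semantics only; it is not how WAF is implemented (WAF's counting is approximate and not instantaneous). The class name and the 300-second window are assumptions for illustration.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Sliding-window request counter keyed by client IP."""

    def __init__(self, limit=2000, window_seconds=300):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now):
        timestamps = self.hits[ip]
        # Drop requests that have fallen out of the trailing window
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        timestamps.append(now)
        return len(timestamps) <= self.limit
```

For example, with `RateLimiter(limit=3)` the fourth request from the same IP inside the window is rejected, and requests are accepted again once earlier ones age out.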
### 6. Audit Logging
```yaml
# Enable CloudTrail for all API calls
CloudTrail:
  Type: AWS::CloudTrail::Trail
  Properties:
    IsMultiRegionTrail: true
    IsLogging: true
    S3BucketName: !Ref AuditLogsBucket
    IncludeGlobalServiceEvents: true
    EnableLogFileValidation: true
    EventSelectors:
      - ReadWriteType: All
        IncludeManagementEvents: true
```
---
## Scalability Patterns
### 1. Horizontal vs Vertical Scaling
```
Horizontal (preferred):
  - Add more Lambda concurrent executions
  - Add more Fargate tasks
  - Add more DynamoDB capacity

Vertical (when necessary):
  - Increase Lambda memory
  - Upgrade RDS instance
  - Larger EC2 instances
```
### 2. Database Sharding
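```python
import hashlib

NUM_SHARDS = 8  # example value; size to your workload

# Partition by tenant ID
def get_table_for_tenant(tenant_id):
    # Use a stable hash: built-in hash() of a str differs between processes
    digest = hashlib.sha256(str(tenant_id).encode()).hexdigest()
    shard = int(digest, 16) % NUM_SHARDS
    return f"data-shard-{shard}"

# Or use DynamoDB single-table design with partition keys
def get_partition_key(tenant_id, entity_type, entity_id):
    return f"TENANT#{tenant_id}#{entity_type}#{entity_id}"
```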
### 3. Caching Layers
```
Edge (CloudFront): Global, static content, TTL: hours-days
Application (Redis): Regional, session/query cache, TTL: minutes-hours
Database (DAX): DynamoDB-specific, TTL: minutes
```
```python
# ElastiCache Redis caching pattern (cache-aside)
import json

import redis

cache = redis.Redis(host='cache.abc123.cache.amazonaws.com', port=6379)

def get_user(user_id):
    # Check cache first
    cached = cache.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # Fetch from database (db is your application's data access layer)
    user = db.get_user(user_id)

    # Cache for 5 minutes
    cache.setex(f"user:{user_id}", 300, json.dumps(user))
    return user
```
### 4. Auto-Scaling Configuration
```yaml
# ECS Service Auto-scaling
AutoScalingTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    MaxCapacity: 10
    MinCapacity: 2
    ResourceId: !Sub service/${Cluster}/${Service.Name}
    ScalableDimension: ecs:service:DesiredCount
    ServiceNamespace: ecs
ScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: cpu-target-tracking  # Required; example name
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref AutoScalingTarget  # Links the policy to the target
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 70
      ScaleInCooldown: 300
      ScaleOutCooldown: 60
```
---
## DevOps and Reliability
### 1. Infrastructure as Code
```bash
# Version control all infrastructure
git init
git add .
git commit -m "Initial infrastructure setup"
# Use separate stacks per environment
cdk deploy --context environment=dev
cdk deploy --context environment=staging
cdk deploy --context environment=production
```
### 2. Blue/Green Deployments
```yaml
# CodeDeploy Blue/Green for ECS
DeploymentGroup:
  Type: AWS::CodeDeploy::DeploymentGroup
  Properties:
    DeploymentConfigName: CodeDeployDefault.ECSAllAtOnce
    DeploymentStyle:
      DeploymentType: BLUE_GREEN
      DeploymentOption: WITH_TRAFFIC_CONTROL
    BlueGreenDeploymentConfiguration:
      DeploymentReadyOption:
        ActionOnTimeout: CONTINUE_DEPLOYMENT
        WaitTimeInMinutes: 0
      TerminateBlueInstancesOnDeploymentSuccess:
        Action: TERMINATE
        TerminationWaitTimeInMinutes: 5
```
### 3. Health Checks
```python
# Application health endpoint
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'external_api': check_external_api()
    }
    status = 'healthy' if all(checks.values()) else 'unhealthy'
    code = 200 if status == 'healthy' else 503
    return jsonify({'status': status, 'checks': checks}), code

def check_database():
    try:
        # Quick connectivity test (db is your application's database handle)
        db.execute('SELECT 1')
        return True
    except Exception:
        return False

# check_cache() and check_external_api() follow the same pattern
```
### 4. Monitoring Setup
```yaml
# CloudWatch Dashboard
Dashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: production-overview
    DashboardBody: |
      {
        "widgets": [
          {
            "type": "metric",
            "properties": {
              "metrics": [
                ["AWS/Lambda", "Invocations", "FunctionName", "api-handler"],
                [".", "Errors", ".", "."],
                [".", "Duration", ".", ".", {"stat": "p99"}]
              ],
              "period": 60,
              "title": "Lambda Metrics"
            }
          }
        ]
      }
# Critical Alarms
ErrorAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: high-error-rate
    MetricName: Errors
    Namespace: AWS/Lambda
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 3
    Threshold: 10
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref AlertTopic
```
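With `EvaluationPeriods: 3` and no `DatapointsToAlarm` override, the alarm fires only when all three most recent 60-second periods breach the threshold. The sketch below mirrors that logic in simplified form; the function name is hypothetical, and it ignores CloudWatch's missing-data handling.

```python
def alarm_state(error_sums, threshold=10, evaluation_periods=3):
    """Return the alarm state for a list of per-period error sums."""
    if len(error_sums) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = error_sums[-evaluation_periods:]
    # GreaterThanThreshold is strict, so a sum of exactly 10 does not breach
    return "ALARM" if all(s > threshold for s in recent) else "OK"
```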
---
## Common Pitfalls
### Technical Debt
| Pitfall | Solution |
|---------|----------|
| Over-engineering early | Start simple, scale when needed |
| Under-monitoring | Set up CloudWatch from day one |
| Ignoring costs | Enable Cost Explorer and billing alerts |
| Single region only | Plan for multi-region from start |
### Security Mistakes
| Mistake | Prevention |
|---------|------------|
| Public S3 buckets | Block public access, use bucket policies |
| Overly permissive IAM | Never use "*", specify resources |
| Hardcoded credentials | Use Secrets Manager, IAM roles |
| Unencrypted data | Enable encryption by default |
### Performance Issues
| Issue | Solution |
|-------|----------|
| No caching | Add CloudFront, ElastiCache early |
| Inefficient queries | Use indexes, avoid DynamoDB scans |
| Large Lambda packages | Use layers, minimize dependencies |
| N+1 queries | Implement DataLoader, batch operations |
### Cost Surprises
| Surprise | Prevention |
|----------|------------|
| Undeleted resources | Tag everything, review weekly |
| Data transfer costs | Keep traffic in same AZ/region |
| NAT Gateway charges | Use VPC endpoints for AWS services |
| Log accumulation | Set CloudWatch retention policies |