# Zero-Downtime Migration Techniques

## Overview

Zero-downtime migrations are critical for maintaining business continuity and user experience during system changes. This guide provides comprehensive techniques, patterns, and implementation strategies for achieving true zero-downtime migrations across different system components.

## Core Principles

### 1. Backward Compatibility

Every change must be backward compatible until all clients have migrated to the new version.

### 2. Incremental Changes

Break large changes into smaller, independent increments that can be deployed and validated separately.

### 3. Feature Flags

Use feature toggles to control the rollout of new functionality without code deployments, as in the sketch below.
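
A minimal in-process sketch of the idea; the `FeatureFlags` class, `flags.json` path, and the two checkout functions are hypothetical illustrations, and real systems typically use a flag service or shared config store so flags can change without redeploying:

```python
import hashlib
import json


class FeatureFlags:
    """Minimal in-process flag store (illustrative only)."""

    def __init__(self, path="flags.json"):
        # e.g. {"new_checkout": {"enabled": true, "percent": 10}}
        with open(path) as f:
            self.flags = json.load(f)

    def is_enabled(self, name, user_id=None):
        flag = self.flags.get(name, {})
        if not flag.get("enabled", False):
            return False
        percent = flag.get("percent", 100)
        if user_id is None:
            return percent >= 100
        # Stable hash so each user consistently lands in or out of the rollout
        bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
        return bucket < percent


flags = FeatureFlags()

def handle_checkout(user_id, cart):
    if flags.is_enabled("new_checkout", user_id):
        return new_checkout_flow(cart)   # new code path behind the flag (hypothetical)
    return legacy_checkout_flow(cart)    # existing behavior (hypothetical)
```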

### 4. Graceful Degradation

Ensure systems continue to function even when some components are unavailable or degraded.
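
One common shape for this, sketched with a hypothetical recommendation service and a cached fallback (the `client` interface and cache are assumptions for illustration):

```python
import logging

logger = logging.getLogger(__name__)

# Last known-good response per user, used as a degraded fallback
_last_good = {}

def recommendations_with_fallback(user_id, client):
    """Try the live service; degrade to stale or default data
    instead of failing the whole page."""
    try:
        result = client.get_recommendations(user_id, timeout=0.5)
        _last_good[user_id] = result
        return result
    except Exception:
        logger.warning("recommendation service degraded; serving fallback")
        # Degraded mode: stale data or a safe default, not an error page
        return _last_good.get(user_id, [])
```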

## Database Zero-Downtime Techniques

### Schema Evolution Without Downtime

#### 1. Additive Changes Only

**Principle:** Only add new elements; never remove or modify existing ones directly.

```sql
-- ✅ Good: Additive change
ALTER TABLE users ADD COLUMN middle_name VARCHAR(50);

-- ❌ Bad: Breaking change
ALTER TABLE users DROP COLUMN email;
```

#### 2. Multi-Phase Schema Evolution

**Phase 1: Expand**

```sql
-- Add new column alongside existing one
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);

-- Add index concurrently (PostgreSQL) so writes are not blocked
CREATE INDEX CONCURRENTLY idx_users_email_address ON users(email_address);
```

**Phase 2: Dual Write (Application Code)**

```python
class UserService:
    """Assumes a Django-style ORM model `User` with both columns present."""

    def create_user(self, name, email):
        # Write to both old and new columns
        user = User(
            name=name,
            email=email,            # Old column (still read by old code)
            email_address=email     # New column
        )
        user.save()
        return user

    def update_email(self, user_id, new_email):
        # Update both columns so readers of either stay consistent
        user = User.objects.get(id=user_id)
        user.email = new_email
        user.email_address = new_email
        user.save()
        return user
```

**Phase 3: Backfill Data**

```sql
-- Backfill existing data in batches; the ? placeholders are bound
-- to each batch's id range by the driver running the backfill
UPDATE users
SET email_address = email
WHERE email_address IS NULL
  AND id BETWEEN ? AND ?;
```
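
A small driver loop for the batch UPDATE above could look like this sketch; the `db` handle is assumed to be a DB-API connection using `?` placeholders, and the pause length is arbitrary:

```python
import time

BATCH = 1000

def backfill_email_address(db, start_id=0):
    """Walk the id space in fixed-size ranges, pausing between batches
    so the backfill never monopolizes the database."""
    max_id = db.execute("SELECT MAX(id) FROM users").fetchone()[0] or 0
    low = start_id
    while low <= max_id:
        high = low + BATCH - 1
        db.execute(
            "UPDATE users SET email_address = email "
            "WHERE email_address IS NULL AND id BETWEEN ? AND ?",
            (low, high),
        )
        db.commit()
        low = high + 1
        time.sleep(0.1)  # brief pause to keep load and replication lag in check
```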

**Phase 4: Switch Reads**

```python
class UserService:
    def get_user_email(self, user_id):
        user = User.objects.get(id=user_id)
        # Read from the new column, falling back to the old one for
        # rows the backfill has not reached yet
        return user.email_address or user.email
```

**Phase 5: Contract**

```sql
-- After validation, remove the old column
ALTER TABLE users DROP COLUMN email;
-- Rename the new column if needed
ALTER TABLE users RENAME COLUMN email_address TO email;
```

### 3. Online Schema Changes

#### PostgreSQL Techniques

```sql
-- Safe column addition (metadata-only even with a DEFAULT since PostgreSQL 11)
ALTER TABLE orders ADD COLUMN status_new VARCHAR(20) DEFAULT 'pending';

-- Safe index creation
CREATE INDEX CONCURRENTLY idx_orders_status_new ON orders(status_new);

-- Safe constraint addition: add as NOT VALID first (brief lock, no
-- full-table scan), then validate separately without blocking writes
ALTER TABLE orders ADD CONSTRAINT check_status_new
    CHECK (status_new IN ('pending', 'processing', 'completed', 'cancelled'))
    NOT VALID;

ALTER TABLE orders VALIDATE CONSTRAINT check_status_new;
```

#### MySQL Techniques

```bash
# Use pt-online-schema-change (Percona Toolkit) for large tables
pt-online-schema-change \
  --alter "ADD COLUMN status VARCHAR(20) DEFAULT 'pending'" \
  --execute \
  D=mydb,t=orders
```

```sql
-- Online DDL (MySQL 5.6+)
ALTER TABLE orders
  ADD COLUMN priority INT DEFAULT 1,
  ALGORITHM=INPLACE,
  LOCK=NONE;
```

### 4. Data Migration Strategies

#### Chunked Data Migration

```python
import time


class DataMigrator:
    def __init__(self, source_table, target_table, chunk_size=1000):
        self.source_table = source_table
        self.target_table = target_table
        self.chunk_size = chunk_size

    def migrate_data(self):
        last_id = 0
        total_migrated = 0

        while True:
            # Get next chunk, keyed on id so progress is resumable
            chunk = self.get_chunk(last_id, self.chunk_size)

            if not chunk:
                break

            # Transform and migrate chunk
            for record in chunk:
                transformed = self.transform_record(record)
                self.insert_or_update(transformed)

            last_id = chunk[-1]['id']
            total_migrated += len(chunk)

            # Brief pause to avoid overwhelming the database
            time.sleep(0.1)

            self.log_progress(total_migrated)

        return total_migrated

    def get_chunk(self, last_id, limit):
        # The table name comes from trusted configuration; the values
        # are passed as bind parameters, never interpolated
        return db.execute(f"""
            SELECT * FROM {self.source_table}
            WHERE id > %s
            ORDER BY id
            LIMIT %s
        """, (last_id, limit))
```

#### Change Data Capture (CDC)

```python
import json

from kafka import KafkaConsumer  # kafka-python


class CDCProcessor:
    def __init__(self):
        self.kafka_consumer = KafkaConsumer('db_changes')
        self.target_db = TargetDatabase()  # application-specific wrapper (assumed)

    def process_changes(self):
        for message in self.kafka_consumer:
            change = json.loads(message.value)

            if change['operation'] == 'INSERT':
                self.handle_insert(change)
            elif change['operation'] == 'UPDATE':
                self.handle_update(change)
            elif change['operation'] == 'DELETE':
                self.handle_delete(change)

    def handle_insert(self, change):
        transformed_data = self.transform_data(change['after'])
        self.target_db.insert(change['table'], transformed_data)

    def handle_update(self, change):
        key = change['key']
        transformed_data = self.transform_data(change['after'])
        self.target_db.update(change['table'], key, transformed_data)

    def handle_delete(self, change):
        # Deletes carry no 'after' image; delete by key on the target
        self.target_db.delete(change['table'], change['key'])
```
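
For reference, the change events consumed above are assumed to carry a Debezium-style envelope; the exact field names are illustrative, not a guaranteed wire format:

```json
{
  "table": "users",
  "operation": "UPDATE",
  "key": {"id": 42},
  "before": {"id": 42, "email": "old@example.com"},
  "after": {"id": 42, "email": "new@example.com", "email_address": "new@example.com"}
}
```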

## Application Zero-Downtime Techniques

### 1. Blue-Green Deployments

#### Infrastructure Setup

```yaml
# Blue Environment (Current Production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
  labels:
    version: blue
    app: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: app
        image: myapp:1.0.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10

---
# Green Environment (New Version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
  labels:
    version: green
    app: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: app
        image: myapp:2.0.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
```

#### Service Switching

```yaml
# Service (switches between blue and green)
apiVersion: v1
kind: Service
metadata:
  name: app-service
spec:
  selector:
    app: myapp
    version: blue  # Switch to 'green' for deployment
  ports:
  - port: 80
    targetPort: 8080
  type: LoadBalancer
```
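
Cutting traffic over is then a one-line selector patch, the same step the script below automates:

```bash
# Point the Service at the green pods; rollback is the same patch with "blue"
kubectl patch service app-service \
  -p '{"spec":{"selector":{"version":"green"}}}'
```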

#### Automated Deployment Script

```bash
#!/bin/bash
set -euo pipefail

# Blue-Green Deployment Script
NAMESPACE="production"
APP_NAME="myapp"
NEW_IMAGE="myapp:2.0.0"

# Determine current and target environments
CURRENT_VERSION=$(kubectl -n "$NAMESPACE" get service "$APP_NAME-service" -o jsonpath='{.spec.selector.version}')

if [ "$CURRENT_VERSION" = "blue" ]; then
    TARGET_VERSION="green"
else
    TARGET_VERSION="blue"
fi

echo "Current version: $CURRENT_VERSION"
echo "Target version: $TARGET_VERSION"

# Update the idle environment with the new image
kubectl -n "$NAMESPACE" set image "deployment/$APP_NAME-$TARGET_VERSION" app="$NEW_IMAGE"

# Wait for rollout to complete
kubectl -n "$NAMESPACE" rollout status "deployment/$APP_NAME-$TARGET_VERSION" --timeout=300s

# Run health checks (assumes a per-color LoadBalancer service exposes the target pods)
echo "Running health checks..."
TARGET_IP=$(kubectl -n "$NAMESPACE" get service "$APP_NAME-$TARGET_VERSION" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

for i in {1..30}; do
    if curl -sf "http://$TARGET_IP/health"; then
        echo "Health check passed"
        break
    fi

    if [ "$i" -eq 30 ]; then
        echo "Health check failed after 30 attempts"
        exit 1
    fi

    sleep 2
done

# Switch traffic to the new version
kubectl -n "$NAMESPACE" patch service "$APP_NAME-service" -p '{"spec":{"selector":{"version":"'$TARGET_VERSION'"}}}'

echo "Traffic switched to $TARGET_VERSION"

# Monitor for 5 minutes
echo "Monitoring new version..."
sleep 300

# Check if rollback is needed
ERROR_RATE=$(curl -s "http://monitoring.company.com/api/error_rate?service=$APP_NAME" | jq '.error_rate')

if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
    echo "Error rate too high ($ERROR_RATE), rolling back..."
    kubectl -n "$NAMESPACE" patch service "$APP_NAME-service" -p '{"spec":{"selector":{"version":"'$CURRENT_VERSION'"}}}'
    exit 1
fi

echo "Deployment successful!"
```

### 2. Canary Deployments

#### Progressive Canary with Istio

```yaml
# Destination Rule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp-destination
spec:
  host: myapp
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2

---
# Virtual Service for Canary
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-canary
spec:
  hosts:
  - myapp
  http:
  - match:
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: myapp
        subset: v2
  - route:
    - destination:
        host: myapp
        subset: v1
      weight: 95
    - destination:
        host: myapp
        subset: v2
      weight: 5
```
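
Before shifting any weight, testers can exercise v2 directly via the header match above; from inside the mesh, a request might look like:

```bash
# The canary header matches the first route rule and always hits v2
curl -H "canary: true" http://myapp/
```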

#### Automated Canary Controller

```python
import asyncio


class CanaryController:
    def __init__(self, istio_client, prometheus_client):
        self.istio = istio_client
        self.prometheus = prometheus_client
        self.canary_weight = 5
        self.max_weight = 100
        self.weight_increment = 5
        self.validation_window = 300  # 5 minutes

    async def deploy_canary(self, app_name, new_version):
        """Deploy new version using canary strategy"""

        # Start with a small percentage
        await self.update_traffic_split(app_name, self.canary_weight)

        while self.canary_weight < self.max_weight:
            # Monitor metrics for the validation window
            await asyncio.sleep(self.validation_window)

            # Check canary health
            if not await self.is_canary_healthy(app_name, new_version):
                await self.rollback_canary(app_name)
                raise Exception("Canary deployment failed health checks")

            # Increase traffic to canary
            self.canary_weight = min(
                self.canary_weight + self.weight_increment,
                self.max_weight
            )

            await self.update_traffic_split(app_name, self.canary_weight)

            print(f"Canary traffic increased to {self.canary_weight}%")

        print("Canary deployment completed successfully")

    async def rollback_canary(self, app_name):
        # Send all traffic back to the stable subset
        await self.update_traffic_split(app_name, 0)

    async def is_canary_healthy(self, app_name, version):
        """Check if canary version is healthy"""

        # Check error rate
        error_rate = await self.prometheus.query(
            f'rate(http_requests_total{{app="{app_name}", version="{version}", status=~"5.."}}'
            f'[5m]) / rate(http_requests_total{{app="{app_name}", version="{version}"}}[5m])'
        )

        if error_rate > 0.05:  # 5% error rate threshold
            return False

        # Check response time
        p95_latency = await self.prometheus.query(
            f'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket'
            f'{{app="{app_name}", version="{version}"}}[5m]))'
        )

        if p95_latency > 2.0:  # 2 second p95 threshold
            return False

        return True

    async def update_traffic_split(self, app_name, canary_weight):
        """Update Istio virtual service with new traffic split"""

        stable_weight = 100 - canary_weight

        # Note: the 'stable'/'canary' subset names must match the
        # subsets defined in the DestinationRule (v1/v2 above)
        virtual_service = {
            "apiVersion": "networking.istio.io/v1beta1",
            "kind": "VirtualService",
            "metadata": {"name": f"{app_name}-canary"},
            "spec": {
                "hosts": [app_name],
                "http": [{
                    "route": [
                        {
                            "destination": {"host": app_name, "subset": "stable"},
                            "weight": stable_weight
                        },
                        {
                            "destination": {"host": app_name, "subset": "canary"},
                            "weight": canary_weight
                        }
                    ]
                }]
            }
        }

        await self.istio.apply_virtual_service(virtual_service)
```
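
A sketch of how the controller might be wired up; `istio_client` and `prometheus_client` are assumed adapters implementing the `apply_virtual_service()` and `query()` calls used above:

```python
import asyncio

controller = CanaryController(istio_client, prometheus_client)  # assumed adapters
asyncio.run(controller.deploy_canary("myapp", "v2"))
```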

### 3. Rolling Updates

#### Kubernetes Rolling Update Strategy

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rolling-update-app
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # Can have 2 extra pods during update
      maxUnavailable: 1  # At most 1 pod can be unavailable
  selector:
    matchLabels:
      app: rolling-update-app
  template:
    metadata:
      labels:
        app: rolling-update-app
    spec:
      containers:
      - name: app
        image: myapp:2.0.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 2
          timeoutSeconds: 1
          successThreshold: 1
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /live
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
```

#### Custom Rolling Update Controller

```python
import asyncio


class RollingUpdateController:
    """Helper methods (get_pod_by_index, create_updated_pod, wait_for_pod_ready,
    validate_batch_health, rollback_batch) are assumed to exist elsewhere."""

    def __init__(self, k8s_client):
        self.k8s = k8s_client
        self.max_surge = 2
        self.max_unavailable = 1

    async def rolling_update(self, deployment_name, new_image):
        """Perform rolling update with custom logic"""

        deployment = await self.k8s.get_deployment(deployment_name)
        total_replicas = deployment.spec.replicas

        # Update roughly 20% at a time, capped by max_surge and never zero
        batch_size = max(1, min(self.max_surge, total_replicas // 5))

        updated_pods = []

        for i in range(0, total_replicas, batch_size):
            batch_end = min(i + batch_size, total_replicas)

            # Update batch of pods
            for pod_index in range(i, batch_end):
                old_pod = await self.get_pod_by_index(deployment_name, pod_index)

                # Create new pod with new image
                new_pod = await self.create_updated_pod(old_pod, new_image)

                # Wait for new pod to be ready
                await self.wait_for_pod_ready(new_pod.metadata.name)

                # Remove old pod
                await self.k8s.delete_pod(old_pod.metadata.name)

                updated_pods.append(new_pod)

                # Brief pause between pod updates
                await asyncio.sleep(2)

            # Validate batch health before continuing
            if not await self.validate_batch_health(updated_pods[-batch_size:]):
                # Rollback batch
                await self.rollback_batch(updated_pods[-batch_size:])
                raise Exception("Rolling update failed validation")

            print(f"Updated {batch_end}/{total_replicas} pods")

        print("Rolling update completed successfully")
```

## Load Balancer and Traffic Management

### 1. Weighted Routing

#### NGINX Configuration

```nginx
upstream backend {
    # Old version - 80% traffic
    server old-app-1:8080 weight=4;
    server old-app-2:8080 weight=4;

    # New version - 20% traffic
    server new-app-1:8080 weight=1;
    server new-app-2:8080 weight=1;
}

server {
    listen 80;
    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Health check headers
        proxy_set_header X-Health-Check-Timeout 5s;
    }
}
```

#### HAProxy Configuration

```haproxy
backend app_servers
    balance roundrobin
    option httpchk GET /health

    # Old version servers
    server old-app-1 old-app-1:8080 check weight 80
    server old-app-2 old-app-2:8080 check weight 80

    # New version servers
    server new-app-1 new-app-1:8080 check weight 20
    server new-app-2 new-app-2:8080 check weight 20

frontend app_frontend
    bind *:80
    default_backend app_servers

    # Custom health check endpoint (http-request return requires HAProxy 2.2+)
    acl health_check path_beg /health
    http-request return status 200 content-type text/plain string "OK" if health_check
```

### 2. Circuit Breaker Implementation

```python
import time


class CircuitBreakerOpenException(Exception):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60, expected_exception=Exception):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN

    def __call__(self, func):
        """Allow use as a decorator: route calls through call()"""
        def wrapper(*args, **kwargs):
            return self.call(func, *args, **kwargs)
        return wrapper

    def call(self, func, *args, **kwargs):
        """Execute function with circuit breaker protection"""

        if self.state == 'OPEN':
            if self._should_attempt_reset():
                self.state = 'HALF_OPEN'
            else:
                raise CircuitBreakerOpenException("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except self.expected_exception:
            self._on_failure()
            raise

    def _should_attempt_reset(self):
        return (
            self.last_failure_time and
            time.time() - self.last_failure_time >= self.recovery_timeout
        )

    def _on_success(self):
        self.failure_count = 0
        self.state = 'CLOSED'

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()

        if self.failure_count >= self.failure_threshold:
            self.state = 'OPEN'


# Usage with service migration
@CircuitBreaker(failure_threshold=3, recovery_timeout=30)
def call_new_service(request):
    return new_service.process(request)


def handle_request(request):
    try:
        return call_new_service(request)
    except CircuitBreakerOpenException:
        # Fallback to old service
        return old_service.process(request)
```

## Monitoring and Validation

### 1. Health Check Implementation

```python
import asyncio
from datetime import datetime

import psutil


class HealthChecker:
    def __init__(self):
        self.checks = []

    def add_check(self, name, check_func, timeout=5):
        self.checks.append({
            'name': name,
            'func': check_func,
            'timeout': timeout
        })

    async def run_checks(self):
        """Run all health checks and return status"""
        results = {}
        overall_status = 'healthy'

        for check in self.checks:
            try:
                result = await asyncio.wait_for(
                    check['func'](),
                    timeout=check['timeout']
                )
                results[check['name']] = {
                    'status': 'healthy',
                    'result': result
                }
            except asyncio.TimeoutError:
                results[check['name']] = {
                    'status': 'unhealthy',
                    'error': 'timeout'
                }
                overall_status = 'unhealthy'
            except Exception as e:
                results[check['name']] = {
                    'status': 'unhealthy',
                    'error': str(e)
                }
                overall_status = 'unhealthy'

        return {
            'status': overall_status,
            'checks': results,
            'timestamp': datetime.utcnow().isoformat()
        }


# Example health checks (`db` and `http_client` are assumed async clients)
health_checker = HealthChecker()


async def database_check():
    """Check database connectivity"""
    result = await db.execute("SELECT 1")
    return result is not None


async def external_api_check():
    """Check external API availability"""
    response = await http_client.get("https://api.example.com/health")
    return response.status_code == 200


async def memory_check():
    """Check memory usage"""
    memory_usage = psutil.virtual_memory().percent
    if memory_usage > 90:
        raise Exception(f"Memory usage too high: {memory_usage}%")
    return f"Memory usage: {memory_usage}%"


health_checker.add_check("database", database_check)
health_checker.add_check("external_api", external_api_check)
health_checker.add_check("memory", memory_check)
```
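
To make these checks visible to a load balancer or Kubernetes probe, the checker can be exposed over HTTP. A minimal sketch using FastAPI; the framework choice and route name are assumptions, not part of the original design:

```python
from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/health")
async def health(response: Response):
    report = await health_checker.run_checks()
    # Probes key off the status code, not the body
    if report["status"] != "healthy":
        response.status_code = 503
    return report
```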

### 2. Readiness vs Liveness Probes

```yaml
# Kubernetes Pod with proper health checks
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  containers:
  - name: app
    image: myapp:2.0.0
    ports:
    - containerPort: 8080

    # Readiness probe - determines if pod should receive traffic
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 3
      timeoutSeconds: 2
      successThreshold: 1
      failureThreshold: 3

    # Liveness probe - determines if pod should be restarted
    livenessProbe:
      httpGet:
        path: /live
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 3

    # Startup probe - gives app time to start before other probes
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 3
      successThreshold: 1
      failureThreshold: 30  # Allow up to 150 seconds for startup
```
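
The two endpoints should answer different questions: `/ready` reflects dependencies (is it safe to route traffic here?), while `/live` reflects only the process itself (should it be restarted?). A sketch of that split, reusing the hypothetical FastAPI app and health checker from the previous section:

```python
@app.get("/ready")
async def ready(response: Response):
    # Readiness includes dependencies; failing here just removes
    # the pod from the load balancer until it recovers
    report = await health_checker.run_checks()
    if report["status"] != "healthy":
        response.status_code = 503
    return report

@app.get("/live")
async def live():
    # Liveness is deliberately cheap and dependency-free; failing here
    # triggers a restart, which won't fix a broken downstream database
    return {"status": "alive"}
```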

### 3. Metrics and Alerting

```python
import time

from prometheus_client import Counter, Gauge, Histogram


class MigrationMetrics:
    def __init__(self, prometheus_client):
        self.prometheus = prometheus_client

        # Define custom metrics
        self.migration_progress = Counter(
            'migration_progress_total',
            'Total migration operations completed',
            ['operation', 'status']
        )

        self.migration_duration = Histogram(
            'migration_operation_duration_seconds',
            'Time spent on migration operations',
            ['operation']
        )

        self.system_health = Gauge(
            'system_health_score',
            'Overall system health score (0-1)',
            ['component']
        )

        self.traffic_split = Gauge(
            'traffic_split_percentage',
            'Percentage of traffic going to each version',
            ['version']
        )

    def record_migration_step(self, operation, status, duration=None):
        """Record completion of a migration step"""
        self.migration_progress.labels(operation=operation, status=status).inc()

        if duration:
            self.migration_duration.labels(operation=operation).observe(duration)

    def update_health_score(self, component, score):
        """Update health score for a component"""
        self.system_health.labels(component=component).set(score)

    def update_traffic_split(self, version_weights):
        """Update traffic split metrics"""
        for version, weight in version_weights.items():
            self.traffic_split.labels(version=version).set(weight)


# Usage in migration
metrics = MigrationMetrics(prometheus_client)


def perform_migration_step(operation):
    start_time = time.time()

    try:
        # Perform migration operation
        result = execute_migration_operation(operation)

        # Record success
        duration = time.time() - start_time
        metrics.record_migration_step(operation, 'success', duration)

        return result

    except Exception:
        # Record failure
        duration = time.time() - start_time
        metrics.record_migration_step(operation, 'failure', duration)
        raise
```

## Rollback Strategies

### 1. Immediate Rollback Triggers

```python
import asyncio


class AutoRollbackSystem:
    def __init__(self, metrics_client, deployment_client):
        self.metrics = metrics_client
        self.deployment = deployment_client
        self.rollback_triggers = {
            'error_rate_spike': {
                'threshold': 0.05,      # 5% error rate
                'window': 300,          # 5 minutes
                'auto_rollback': True
            },
            'latency_increase': {
                'threshold': 2.0,       # 2x baseline latency
                'window': 600,          # 10 minutes
                'auto_rollback': False  # Manual confirmation required
            },
            'availability_drop': {
                'threshold': 0.95,      # Below 95% availability
                'window': 120,          # 2 minutes
                'auto_rollback': True
            }
        }

    async def monitor_and_rollback(self, deployment_name):
        """Monitor deployment and trigger rollback if needed"""

        while True:
            for trigger_name, config in self.rollback_triggers.items():
                if await self.check_trigger(trigger_name, config):
                    if config['auto_rollback']:
                        await self.execute_rollback(deployment_name, trigger_name)
                    else:
                        await self.alert_for_manual_rollback(deployment_name, trigger_name)

            await asyncio.sleep(30)  # Check every 30 seconds

    async def check_trigger(self, trigger_name, config):
        """Check if rollback trigger condition is met"""

        current_value = await self.metrics.get_current_value(trigger_name)
        baseline_value = await self.metrics.get_baseline_value(trigger_name)

        if trigger_name == 'error_rate_spike':
            return current_value > config['threshold']
        elif trigger_name == 'latency_increase':
            return current_value > baseline_value * config['threshold']
        elif trigger_name == 'availability_drop':
            return current_value < config['threshold']

        return False

    async def execute_rollback(self, deployment_name, reason):
        """Execute automatic rollback"""

        print(f"Executing automatic rollback for {deployment_name}. Reason: {reason}")

        # Get previous revision
        previous_revision = await self.deployment.get_previous_revision(deployment_name)

        # Perform rollback
        await self.deployment.rollback_to_revision(deployment_name, previous_revision)

        # Notify stakeholders
        await self.notify_rollback_executed(deployment_name, reason)
```

### 2. Data Rollback Strategies

```sql
-- Point-in-time recovery setup
-- Create a named restore point before the migration
SELECT pg_create_restore_point('pre_migration_' || to_char(now(), 'YYYYMMDD_HH24MISS'));

-- Rollback using point-in-time recovery
-- (executed on a separate recovery instance; in PostgreSQL 12+ these
-- settings live in postgresql.conf plus a recovery.signal file, in
-- older versions in recovery.conf)
-- recovery_target_name = 'pre_migration_20240101_120000'
-- recovery_target_action = 'promote'
```

```python
import time
from datetime import datetime


class DataRollbackManager:
    """Traffic-control and validation helpers are assumed to be
    implemented elsewhere in the application."""

    def __init__(self, database_client, backup_service):
        self.db = database_client
        self.backup = backup_service

    async def create_rollback_point(self, migration_id):
        """Create a rollback point before migration"""

        rollback_point = {
            'migration_id': migration_id,
            'timestamp': datetime.utcnow(),
            'backup_location': None,
            'schema_snapshot': None
        }

        # Create database backup
        backup_path = await self.backup.create_backup(
            f"pre_migration_{migration_id}_{int(time.time())}"
        )
        rollback_point['backup_location'] = backup_path

        # Capture schema snapshot
        schema_snapshot = await self.capture_schema_snapshot()
        rollback_point['schema_snapshot'] = schema_snapshot

        # Store rollback point metadata
        await self.store_rollback_metadata(rollback_point)

        return rollback_point

    async def execute_rollback(self, migration_id):
        """Execute data rollback to specified point"""

        rollback_point = await self.get_rollback_metadata(migration_id)

        if not rollback_point:
            raise Exception(f"No rollback point found for migration {migration_id}")

        # Stop application traffic
        await self.stop_application_traffic()

        try:
            # Restore from backup
            await self.backup.restore_from_backup(
                rollback_point['backup_location']
            )

            # Validate data integrity
            await self.validate_data_integrity(
                rollback_point['schema_snapshot']
            )

            # Update application configuration
            await self.update_application_config(rollback_point)

            # Resume application traffic
            await self.resume_application_traffic()

            print(f"Data rollback completed successfully for migration {migration_id}")

        except Exception as e:
            # If rollback fails, we have a serious problem
            await self.escalate_rollback_failure(migration_id, str(e))
            raise
```

## Best Practices Summary

### 1. Pre-Migration Checklist

- [ ] Comprehensive backup strategy in place
- [ ] Rollback procedures tested in staging
- [ ] Monitoring and alerting configured
- [ ] Health checks implemented
- [ ] Feature flags configured
- [ ] Team communication plan established
- [ ] Load balancer configuration prepared
- [ ] Database connection pooling optimized

### 2. During Migration

- [ ] Monitor key metrics continuously
- [ ] Validate each phase before proceeding
- [ ] Maintain detailed logs of all actions
- [ ] Keep stakeholders informed of progress
- [ ] Have rollback triggers ready
- [ ] Monitor user experience metrics
- [ ] Watch for performance degradation
- [ ] Validate data consistency

### 3. Post-Migration

- [ ] Continue monitoring for 24-48 hours
- [ ] Validate all business processes
- [ ] Update documentation
- [ ] Conduct a post-migration retrospective
- [ ] Archive migration artifacts
- [ ] Update disaster recovery procedures
- [ ] Plan for legacy system decommissioning

### 4. Common Pitfalls to Avoid

- Don't skip testing rollback procedures
- Don't ignore performance impact
- Don't rush through validation phases
- Don't forget to communicate with stakeholders
- Don't assume health checks are sufficient
- Don't neglect data consistency validation
- Don't underestimate time requirements
- Don't overlook dependency impacts

This comprehensive guide provides the foundation for implementing zero-downtime migrations across various system components while maintaining high availability and data integrity.