
GCP Best Practices

Production-ready practices for naming, labels, IAM, networking, monitoring, and disaster recovery.



Naming Conventions

Resource Naming Pattern

{environment}-{project}-{resource-type}-{purpose}

Examples:
  prod-myapp-gke-cluster
  dev-myapp-sql-primary
  staging-myapp-run-api
  prod-myapp-gcs-uploads
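
As a minimal sketch of the pattern above, the four parts can be joined by a small shell helper. The function name `resource_name` is hypothetical and only illustrates the convention; it is not part of any GCP tooling.

```shell
# Hypothetical helper: join the four parts of the naming pattern with hyphens.
resource_name() {
  # Usage: resource_name ENVIRONMENT PROJECT RESOURCE_TYPE PURPOSE
  if [ "$#" -ne 4 ]; then
    echo "usage: resource_name ENV PROJECT TYPE PURPOSE" >&2
    return 2
  fi
  # Lowercase the result so it matches the examples above.
  printf '%s-%s-%s-%s\n' "$1" "$2" "$3" "$4" | tr '[:upper:]' '[:lower:]'
}

resource_name prod myapp gcs uploads   # prints prod-myapp-gcs-uploads
```

Generating names from one helper keeps Terraform variables, scripts, and dashboards consistent with the same convention.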

Project Naming

{org}-{team}-{environment}

Examples:
  acme-platform-prod
  acme-platform-dev
  acme-data-prod

Naming Rules

| Resource | Format | Max Length | Example |
|----------|--------|------------|---------|
| Project ID | lowercase, hyphens | 30 chars | acme-platform-prod |
| GKE Cluster | lowercase, hyphens | 40 chars | prod-api-cluster |
| Cloud Run | lowercase, hyphens | 49 chars | prod-myapp-api |
| Cloud SQL | lowercase, hyphens | 84 chars | prod-myapp-sql-primary |
| GCS Bucket | lowercase, hyphens, dots | 63 chars | acme-prod-myapp-uploads |
| Service Account | lowercase, hyphens | 30 chars | myapp-run-sa |
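
The rules in this table could be checked before a resource is created. The sketch below is one possible validator: the function names and the short `KIND` keys (`project`, `gke`, `run`, `sql`, `gcs`, `service-account`) are hypothetical, the limits are copied from the table, and digits are assumed to be allowed alongside lowercase letters and hyphens.

```shell
export LC_ALL=C   # make [a-z] match ASCII lowercase only

# Hypothetical lookup: maximum length per resource kind, taken from the table above.
max_len_for() {
  case "$1" in
    project|service-account) echo 30 ;;
    gke) echo 40 ;;
    run) echo 49 ;;
    sql) echo 84 ;;
    gcs) echo 63 ;;
    *) return 1 ;;
  esac
}

# Usage: valid_name KIND NAME   (exit status 0 when NAME passes the checks)
valid_name() {
  kind=$1
  name=$2
  max=$(max_len_for "$kind") || return 2   # unknown kind
  [ -n "$name" ] || return 1               # empty names are rejected
  if [ "$kind" = gcs ]; then
    # Bucket names may also contain dots.
    case "$name" in *[!a-z0-9.-]*) return 1 ;; esac
  else
    case "$name" in *[!a-z0-9-]*) return 1 ;; esac
  fi
  [ "${#name}" -le "$max" ]
}

valid_name gke prod-api-cluster && echo "ok"
```

The check covers only character set and length; service-specific rules such as required leading letters are left out of this sketch.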

Labels and Organization

Required Labels

Apply these labels to all resources:

labels:
  environment: "prod"          # dev, staging, prod
  team: "platform"             # team owning the resource
  app: "myapp"                 # application name
  cost-center: "eng-001"       # billing allocation
  managed-by: "terraform"      # terraform, gcloud, console
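
A pre-deployment check can catch resources that lack one of these labels. The following is a minimal sketch, assuming labels arrive as a `key=value,key=value` string (the shape that gcloud `--labels` flags accept); the function name `missing_labels` is hypothetical.

```shell
# Hypothetical check: list the required label keys that are absent
# from a "key=value,key=value" string.
missing_labels() {
  labels=",$1,"          # pad with commas so every key is preceded by one
  missing=""
  for key in environment team app cost-center managed-by; do
    case "$labels" in
      *",$key="*) ;;                      # key present, nothing to do
      *) missing="$missing $key" ;;       # key absent, remember it
    esac
  done
  # Print the missing keys separated by spaces (empty line if none are missing).
  echo "${missing# }"
}

missing_labels "environment=dev,app=myapp"   # prints: team cost-center managed-by
```

An empty result means the label set is complete, so the function can gate a CI step with a simple `[ -z "$(missing_labels "$LABELS")" ]` test.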

Label-Based Cost Reporting

# Export billing data to BigQuery with labels
# Then query by label:
SELECT
  labels.value AS environment,
  SUM(cost) AS total_cost
FROM `billing_export.gcp_billing_export_v1_*`
CROSS JOIN UNNEST(labels) AS labels
WHERE labels.key = 'environment'
GROUP BY environment
ORDER BY total_cost DESC

Organization Hierarchy

Organization
├── Folder: Production
│   ├── Project: platform-prod
│   ├── Project: data-prod
│   └── Project: ml-prod
├── Folder: Non-Production
│   ├── Project: platform-dev
│   ├── Project: platform-staging
│   └── Project: data-dev
└── Folder: Shared Services
    ├── Project: shared-networking
    ├── Project: shared-security
    └── Project: shared-monitoring

IAM and Security

Principle of Least Privilege

# BAD: Basic roles are too broad
gcloud projects add-iam-policy-binding my-project \
  --member="user:dev@example.com" \
  --role="roles/editor"

# GOOD: Use predefined roles
gcloud projects add-iam-policy-binding my-project \
  --member="user:dev@example.com" \
  --role="roles/run.developer"

Service Account Best Practices

# 1. Create dedicated SA per workload
gcloud iam service-accounts create myapp-api-sa \
  --display-name="MyApp API Service Account"

# 2. Grant only required roles
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:myapp-api-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/datastore.user"

# 3. Use Workload Identity for GKE (no key files)
gcloud iam service-accounts add-iam-policy-binding \
  myapp-api-sa@my-project.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:my-project.svc.id.goog[default/myapp-api-ksa]"

# 4. NEVER download SA key files in production
# Instead, use attached service accounts or impersonation

VPC Service Controls

# Create a service perimeter to restrict data exfiltration
gcloud access-context-manager perimeters create my-perimeter \
  --title="Production Data Perimeter" \
  --resources="projects/123456" \
  --restricted-services="bigquery.googleapis.com,storage.googleapis.com" \
  --policy=$POLICY_ID

Organization Policies

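# Apply an organization policy at project scope (legacy policy file format)
gcloud resource-manager org-policies set-policy \
  --project=my-project policy.yaml

# policy.yaml: deny external IPs on all VMs
constraint: constraints/compute.vmExternalIpAccess
listPolicy:
  allValues: DENY

# Second policy file (apply it with its own set-policy call):
# enforce public access prevention on Cloud Storage
constraint: constraints/storage.publicAccessPrevention
booleanPolicy:
  enforced: true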

Encryption

| Layer | Service | Default |
|-------|---------|---------|
| At rest | Google-managed keys | Always enabled |
| At rest | CMEK (Cloud KMS) | Optional, recommended |
| In transit | TLS 1.3 | Always enabled |
| Application | Cloud KMS | Encrypt sensitive fields |

# Create CMEK key for Cloud SQL
gcloud kms keys create myapp-sql-key \
  --keyring=myapp-keyring \
  --location=us-central1 \
  --purpose=encryption

# Use CMEK with Cloud SQL (the key must be in the same region as the instance)
gcloud sql instances create myapp-db \
  --region=us-central1 \
  --disk-encryption-key=projects/my-project/locations/us-central1/keyRings/myapp-keyring/cryptoKeys/myapp-sql-key

Networking

VPC Design

# Create custom VPC (avoid default network)
gcloud compute networks create myapp-vpc \
  --subnet-mode=custom

# Create subnets with secondary ranges for GKE
gcloud compute networks subnets create myapp-subnet \
  --network=myapp-vpc \
  --region=us-central1 \
  --range=10.0.0.0/20 \
  --secondary-range pods=10.4.0.0/14,services=10.8.0.0/20 \
  --enable-private-google-access
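
To sanity-check the ranges chosen above, the number of addresses in each CIDR block can be computed from its prefix length. This is a small arithmetic sketch; the function name `cidr_size` is hypothetical.

```shell
# Number of IPv4 addresses in a CIDR block: 2^(32 - prefix length).
cidr_size() {
  prefix=${1##*/}                    # keep the part after the slash
  echo $(( 1 << (32 - prefix) ))
}

cidr_size 10.0.0.0/20   # primary (node) range
cidr_size 10.4.0.0/14   # secondary range for pods
cidr_size 10.8.0.0/20   # secondary range for services
```

The pod range is deliberately much larger than the node range, since each GKE node reserves its own slice of the pod range rather than a single address.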

Shared VPC

Use Shared VPC for multi-project environments:

Host Project (shared-networking)
├── VPC: shared-vpc
│   ├── Subnet: prod-us-central1 → Service Project: platform-prod
│   ├── Subnet: prod-europe-west1 → Service Project: platform-prod
│   └── Subnet: dev-us-central1 → Service Project: platform-dev

Firewall Rules

# Allow internal traffic
gcloud compute firewall-rules create allow-internal \
  --network=myapp-vpc \
  --allow=tcp,udp,icmp \
  --source-ranges=10.0.0.0/8

# Allow health checks from Google load balancers
gcloud compute firewall-rules create allow-health-checks \
  --network=myapp-vpc \
  --allow=tcp:8080 \
  --source-ranges=35.191.0.0/16,130.211.0.0/22 \
  --target-tags=allow-health-check

# Deny all other ingress (implicit, but be explicit)
gcloud compute firewall-rules create deny-all-ingress \
  --network=myapp-vpc \
  --action=DENY \
  --rules=all \
  --direction=INGRESS \
  --priority=65534

Private Google Access

Enable Private Google Access on every subnet so that VMs without external IPs can still reach Google APIs and services:

gcloud compute networks subnets update myapp-subnet \
  --region=us-central1 \
  --enable-private-google-access

Monitoring and Logging

Cloud Monitoring Setup

# Create uptime check (the display name is positional; --period is in minutes)
gcloud monitoring uptime create "API Health Check" \
  --resource-type=cloud-run-revision \
  --resource-labels="service_name=myapp-api,location=us-central1" \
  --path="/health" \
  --period=1

# Create alerting policy
# Note: this threshold applies to the 5xx request count; a true percentage
# alert needs a ratio condition defined in a policy file (--policy-from-file)
gcloud alpha monitoring policies create \
  --display-name="High Error Rate" \
  --condition-display-name="Cloud Run 5xx count > 1" \
  --condition-filter='resource.type="cloud_run_revision" AND metric.type="run.googleapis.com/request_count" AND metric.labels.response_code_class="5xx"' \
  --if="> 1" \
  --duration=60s \
  --notification-channels="projects/my-project/notificationChannels/12345"

Key Metrics to Monitor

| Service | Metric | Alert Threshold |
|---------|--------|-----------------|
| Cloud Run | request_latencies (p99) | >2s |
| Cloud Run | request_count (5xx) | >1% of total |
| Cloud SQL | cpu/utilization | >80% |
| Cloud SQL | disk/utilization | >85% |
| GKE | container/cpu/utilization | >80% |
| GKE | node/cpu/allocatable_utilization | >85% |
| Pub/Sub | subscription/oldest_unacked_message_age | >300s |
| BigQuery | query/execution_time | >60s |

Log-Based Metrics

# Create a metric for application errors
gcloud logging metrics create app-errors \
  --description="Application error count" \
  --log-filter='resource.type="cloud_run_revision" AND severity>=ERROR'

# Create log sink to BigQuery for analysis
gcloud logging sinks create audit-logs-bq \
  bigquery.googleapis.com/projects/my-project/datasets/audit_logs \
  --log-filter='logName="projects/my-project/logs/cloudaudit.googleapis.com%2Factivity"'

Log Exclusion (Cost Reduction)

# Exclude verbose debug logs to save on Cloud Logging costs.
# The _Default sink already exists, so update its filter instead of creating it.
gcloud logging sinks update _Default \
  --log-filter='NOT (severity="DEBUG" OR severity="DEFAULT")' \
  --description="Exclude debug-level logs"

# Or add an exclusion filter to the sink to reduce costs
gcloud logging sinks update _Default \
  --add-exclusion=name=exclude-debug,filter='severity="DEBUG"'

Cost Optimization

Committed Use Discounts

| Term | Compute Discount | Memory Discount |
|------|------------------|-----------------|
| 1 year | 37% | 37% |
| 3 years | 55% | 55% |

# Check recommendations
gcloud recommender recommendations list \
  --project=my-project \
  --location=us-central1 \
  --recommender=google.compute.commitment.UsageCommitmentRecommender

Sustained Use Discounts

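Automatic discounts for Compute Engine resources that run for more than 25% of the month. The percentages are incremental: each one applies only to the usage that falls inside its band, so a resource that runs the whole month nets about 30% off, not 60%. The schedule below is the N1 one; other machine families use smaller tiers.

| Usage (share of month) | Incremental discount on that band |
|------------------------|-----------------------------------|
| 25-50% | 20% |
| 50-75% | 40% |
| 75-100% | 60% |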
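
Because each tier applies only to the usage inside its band, the net discount is lower than the top tier suggests. The sketch below works through that arithmetic in integer shell math, assuming the four 25-point bands are billed at 100%, 80%, 60%, and 40% of the base rate; the function name `sud_net_discount` is hypothetical.

```shell
# Net sustained use discount (whole percent) for a given share of the month.
sud_net_discount() {
  usage=$1     # percent of the month the resource ran, 0-100
  billed=0     # running total of (percent of month) x (percent of base rate)
  lower=0
  # Each entry is UPPER_BOUND:RATE, where RATE is the percent of base price charged.
  for tier in 25:100 50:80 75:60 100:40; do
    upper=${tier%%:*}
    rate=${tier##*:}
    if [ "$usage" -gt "$lower" ]; then
      top=$usage
      if [ "$top" -gt "$upper" ]; then top=$upper; fi
      billed=$(( billed + (top - lower) * rate ))
    fi
    lower=$upper
  done
  if [ "$usage" -eq 0 ]; then echo 0; return; fi
  # Undiscounted cost would be usage * 100, so billed / usage is the effective rate.
  echo $(( 100 - billed / usage ))
}

sud_net_discount 50    # prints 10
sud_net_discount 100   # prints 30
```

A full month therefore nets 30% off under these assumptions, which is worth comparing against the committed use discounts before choosing between the two.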

BigQuery Cost Control

-- Use partitioning to limit data scanned
CREATE TABLE my_dataset.events
PARTITION BY DATE(timestamp)
CLUSTER BY event_type
AS SELECT * FROM raw_events;

# Estimate query cost before running (shell command, not SQL):
# --dry_run reports the bytes the query would process without executing it
bq query --dry_run --use_legacy_sql=false \
  'SELECT * FROM my_dataset.events WHERE DATE(timestamp) = "2026-01-01"'

Cloud Storage Optimization

# Enable Autoclass for automatic class management
gsutil mb -l us-central1 --autoclass gs://my-bucket/

# Set lifecycle policy
gsutil lifecycle set lifecycle.json gs://my-bucket/
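
The `lifecycle.json` file passed to `gsutil lifecycle set` is not shown in this document. The sketch below writes one possible version; the ages (365 and 7 days) are illustrative assumptions, not values from the original. It sticks to deletion rules because Autoclass-enabled buckets generally do not accept lifecycle rules that change the storage class.

```shell
# Write a sample lifecycle.json (illustrative values only).
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365}
    },
    {
      "action": {"type": "AbortIncompleteMultipartUpload"},
      "condition": {"age": 7}
    }
  ]
}
EOF
```

The first rule deletes objects older than a year; the second cleans up multipart uploads that were never completed, which otherwise keep accruing storage charges.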

Disaster Recovery

RPO/RTO Targets

| Tier | RPO | RTO | Strategy |
|------|-----|-----|----------|
| Tier 1 (Critical) | 0 | <1 hour | Multi-region active-active |
| Tier 2 (Important) | <1 hour | <4 hours | Regional HA + cross-region backup |
| Tier 3 (Standard) | <24 hours | <24 hours | Automated backups + restore |
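
One way to apply this table is to start from a workload's stated targets and pick the strictest tier that either target demands. The helper below is a hypothetical sketch of that reading, with both targets given in minutes; how boundary cases are handled is an assumption, not something the table specifies.

```shell
# Hypothetical helper: map RPO/RTO targets (in minutes) to a tier from the table above.
dr_tier() {
  rpo=$1
  rto=$2
  if [ "$rpo" -eq 0 ] || [ "$rto" -lt 60 ]; then
    echo "Tier 1: Multi-region active-active"
  elif [ "$rpo" -lt 60 ] || [ "$rto" -lt 240 ]; then
    echo "Tier 2: Regional HA + cross-region backup"
  else
    echo "Tier 3: Automated backups + restore"
  fi
}

dr_tier 0 30       # zero data loss tolerated
dr_tier 30 180     # up to 30 min of data loss, 3 h to recover
dr_tier 720 1200   # 12 h of data loss, 20 h to recover
```

Taking the stricter of the two targets avoids under-provisioning when a workload tolerates a long outage but no data loss, or the reverse.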

Backup Strategy

# Cloud SQL automated backups
gcloud sql instances patch myapp-db \
  --backup-start-time=02:00 \
  --enable-point-in-time-recovery

# Firestore scheduled exports
gcloud firestore export gs://myapp-backups/firestore/$(date +%Y%m%d)

# GKE cluster backup with Backup for GKE
gcloud beta container backup-restore backup-plans create myapp-plan \
  --project=my-project \
  --location=us-central1 \
  --cluster=projects/my-project/locations/us-central1/clusters/myapp-cluster \
  --all-namespaces \
  --cron-schedule="0 2 * * *"

Multi-Region Failover

# Cloud SQL cross-region replica for DR
gcloud sql instances create myapp-db-replica \
  --master-instance-name=myapp-db \
  --region=us-east1

# Promote replica during failover
gcloud sql instances promote-replica myapp-db-replica

Common Pitfalls

Technical Debt

| Pitfall | Solution |
|---------|----------|
| Using default VPC | Always create custom VPCs |
| Not enabling audit logs | Enable Cloud Audit Logs from day one |
| Single-region deployment | Plan for multi-zone at minimum |
| No IaC | Use Terraform from the start |

Security Mistakes

| Mistake | Prevention |
|---------|------------|
| SA key files in code | Use Workload Identity, attached SAs |
| Public GCS buckets | Enable org policy for public access prevention |
| Basic roles (Owner/Editor) | Use predefined or custom roles |
| No encryption key management | Use CMEK for sensitive data |
| Default service account | Create dedicated SAs per workload |

Performance Issues

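| Issue | Solution |
|-------|----------|
| Cold starts on Cloud Run | Set min-instances=1 for latency-critical services |
| Slow BigQuery queries | Partition tables, use clustering, avoid SELECT * |
| GKE pod scheduling delays | Pre-provision spare capacity (for example with low-priority placeholder pods) or use Autopilot; PodDisruptionBudgets protect availability during disruptions but do not speed up scheduling |
| Firestore hotspots | Distribute writes across document IDs evenly |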

Cost Surprises

| Surprise | Prevention |
|----------|------------|
| Undeleted resources | Label everything, review weekly |
| Egress costs | Keep traffic in same region, use Private Google Access |
| Cloud NAT charges | Use Private Google Access for GCP service traffic |
| Log ingestion costs | Set exclusion filters for debug/verbose logs |
| BigQuery full scans | Always use partitioning and clustering |
| Idle GKE clusters | Delete dev clusters nightly, use Autopilot |