GCP Best Practices
Production-ready practices for naming, labels, IAM, networking, monitoring, and disaster recovery.
Table of Contents
- Naming Conventions
- Labels and Organization
- IAM and Security
- Networking
- Monitoring and Logging
- Cost Optimization
- Disaster Recovery
- Common Pitfalls
Naming Conventions
Resource Naming Pattern
{environment}-{project}-{resource-type}-{purpose}
Examples:
prod-myapp-gke-cluster
dev-myapp-sql-primary
staging-myapp-run-api
prod-myapp-gcs-uploads
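The pattern above is easy to enforce in a wrapper script. A minimal sketch, assuming a hypothetical `make_name` helper (not a gcloud tool), that composes the name and rejects characters GCP resource names disallow:

```shell
# Compose a name from {environment}-{project}-{resource-type}-{purpose}
# and reject anything that is not lowercase letters, digits, and hyphens.
make_name() {
  name="$1-$2-$3-$4"
  case "$name" in
    [a-z]*) : ;;                                  # must start with a lowercase letter
    *) echo "invalid name: $name" >&2; return 1 ;;
  esac
  case "$name" in
    *[!a-z0-9-]*) echo "invalid name: $name" >&2; return 1 ;;
  esac
  echo "$name"
}

make_name prod myapp gke cluster   # prints: prod-myapp-gke-cluster
```

Calling it for every resource you create keeps ad-hoc names out of Terraform and CI scripts.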
Project Naming
{org}-{team}-{environment}
Examples:
acme-platform-prod
acme-platform-dev
acme-data-prod
Naming Rules
| Resource | Format | Max Length | Example |
|---|---|---|---|
| Project ID | lowercase, hyphens | 30 chars | acme-platform-prod |
| GKE Cluster | lowercase, hyphens | 40 chars | prod-api-cluster |
| Cloud Run | lowercase, hyphens | 49 chars | prod-myapp-api |
| Cloud SQL | lowercase, hyphens | 84 chars | prod-myapp-sql-primary |
| GCS Bucket | lowercase, hyphens, dots | 63 chars | acme-prod-myapp-uploads |
| Service Account | lowercase, hyphens | 30 chars | myapp-run-sa |
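The length limits in the table can be checked before a create call ever runs. A sketch using the table's values (the `check_name_length` helper and its resource-kind keywords are made up for illustration):

```shell
# Reject names that exceed the per-resource length limits from the table above.
check_name_length() {
  name="$1"; kind="$2"
  case "$kind" in
    project|service-account) max=30 ;;
    gke) max=40 ;;
    run) max=49 ;;
    gcs) max=63 ;;
    sql) max=84 ;;
    *) echo "unknown resource kind: $kind" >&2; return 2 ;;
  esac
  [ "${#name}" -le "$max" ]
}

check_name_length prod-myapp-gke-cluster gke && echo ok   # prints: ok
```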
Labels and Organization
Required Labels
Apply these labels to all resources:
labels:
  environment: "prod"       # dev, staging, prod
  team: "platform"          # team owning the resource
  app: "myapp"              # application name
  cost-center: "eng-001"    # billing allocation
  managed-by: "terraform"   # terraform, gcloud, console
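When applying these with gcloud, labels are passed as a single comma-separated `--labels=k=v,...` value. A small sketch that assembles the flag value (the `labels_flag` helper name is made up):

```shell
# Join KEY=VALUE pairs into the comma-separated form expected by
# flags such as `gcloud compute instances add-labels --labels=...`.
labels_flag() {
  out=""
  for kv in "$@"; do
    out="${out:+$out,}$kv"
  done
  echo "$out"
}

LABELS=$(labels_flag environment=prod team=platform app=myapp cost-center=eng-001 managed-by=gcloud)
echo "$LABELS"
# prints: environment=prod,team=platform,app=myapp,cost-center=eng-001,managed-by=gcloud
# gcloud compute instances add-labels my-vm --zone=us-central1-a --labels="$LABELS"   # not run here
```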
Label-Based Cost Reporting
# Export billing data to BigQuery with labels
# Then query by label:
SELECT
labels.value AS environment,
SUM(cost) AS total_cost
FROM `billing_export.gcp_billing_export_v1_*`
CROSS JOIN UNNEST(labels) AS labels
WHERE labels.key = 'environment'
GROUP BY environment
ORDER BY total_cost DESC
Organization Hierarchy
Organization
├── Folder: Production
│ ├── Project: platform-prod
│ ├── Project: data-prod
│ └── Project: ml-prod
├── Folder: Non-Production
│ ├── Project: platform-dev
│ ├── Project: platform-staging
│ └── Project: data-dev
└── Folder: Shared Services
├── Project: shared-networking
├── Project: shared-security
└── Project: shared-monitoring
IAM and Security
Principle of Least Privilege
# BAD: Basic roles are too broad
gcloud projects add-iam-policy-binding my-project \
--member="user:dev@example.com" \
--role="roles/editor"
# GOOD: Use predefined roles
gcloud projects add-iam-policy-binding my-project \
--member="user:dev@example.com" \
--role="roles/run.developer"
Service Account Best Practices
# 1. Create dedicated SA per workload
gcloud iam service-accounts create myapp-api-sa \
--display-name="MyApp API Service Account"
# 2. Grant only required roles
gcloud projects add-iam-policy-binding my-project \
--member="serviceAccount:myapp-api-sa@my-project.iam.gserviceaccount.com" \
--role="roles/datastore.user"
# 3. Use Workload Identity for GKE (no key files)
gcloud iam service-accounts add-iam-policy-binding \
myapp-api-sa@my-project.iam.gserviceaccount.com \
--role="roles/iam.workloadIdentityUser" \
--member="serviceAccount:my-project.svc.id.goog[default/myapp-api-ksa]"
# 4. NEVER download SA key files in production
# Instead, use attached service accounts or impersonation
VPC Service Controls
# Create a service perimeter to restrict data exfiltration
gcloud access-context-manager perimeters create my-perimeter \
--title="Production Data Perimeter" \
--resources="projects/123456" \
--restricted-services="bigquery.googleapis.com,storage.googleapis.com" \
--policy=$POLICY_ID
Organization Policies
# Restrict external IPs on VMs
gcloud resource-manager org-policies set-policy \
--project=my-project policy.yaml
# policy.yaml (one constraint per file / per set-policy call)
constraint: constraints/compute.vmExternalIpAccess
listPolicy:
  allValues: DENY
# Restrict public Cloud Storage (a separate policy file)
constraint: constraints/storage.publicAccessPrevention
booleanPolicy:
  enforced: true
Encryption
| Layer | Service | Default |
|---|---|---|
| At rest | Google-managed keys | Always enabled |
| At rest | CMEK (Cloud KMS) | Optional, recommended |
| In transit | TLS 1.3 | Always enabled |
| Application | Cloud KMS | Encrypt sensitive fields |
# Create CMEK key for Cloud SQL
gcloud kms keys create myapp-sql-key \
--keyring=myapp-keyring \
--location=us-central1 \
--purpose=encryption
# Use CMEK with Cloud SQL
gcloud sql instances create myapp-db \
--disk-encryption-key=projects/my-project/locations/us-central1/keyRings/myapp-keyring/cryptoKeys/myapp-sql-key
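CMEK flags take the fully qualified Cloud KMS resource path, which is tedious to type by hand. A tiny helper to assemble it (the `kms_key_path` function name is made up):

```shell
# Build the fully qualified Cloud KMS key path that CMEK flags expect.
kms_key_path() {
  # args: project location keyring key
  echo "projects/$1/locations/$2/keyRings/$3/cryptoKeys/$4"
}

kms_key_path my-project us-central1 myapp-keyring myapp-sql-key
# prints: projects/my-project/locations/us-central1/keyRings/myapp-keyring/cryptoKeys/myapp-sql-key
```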
Networking
VPC Design
# Create custom VPC (avoid default network)
gcloud compute networks create myapp-vpc \
--subnet-mode=custom
# Create subnets with secondary ranges for GKE
gcloud compute networks subnets create myapp-subnet \
--network=myapp-vpc \
--region=us-central1 \
--range=10.0.0.0/20 \
--secondary-range pods=10.4.0.0/14,services=10.8.0.0/20 \
--enable-private-google-access
Shared VPC
Use Shared VPC for multi-project environments:
Host Project (shared-networking)
├── VPC: shared-vpc
│ ├── Subnet: prod-us-central1 → Service Project: platform-prod
│ ├── Subnet: prod-europe-west1 → Service Project: platform-prod
│ └── Subnet: dev-us-central1 → Service Project: platform-dev
Firewall Rules
# Allow internal traffic
gcloud compute firewall-rules create allow-internal \
--network=myapp-vpc \
--allow=tcp,udp,icmp \
--source-ranges=10.0.0.0/8
# Allow health checks from Google load balancers
gcloud compute firewall-rules create allow-health-checks \
--network=myapp-vpc \
--allow=tcp:8080 \
--source-ranges=35.191.0.0/16,130.211.0.0/22 \
--target-tags=allow-health-check
# Deny all other ingress (implicit, but be explicit)
gcloud compute firewall-rules create deny-all-ingress \
--network=myapp-vpc \
--action=DENY \
--rules=all \
--direction=INGRESS \
--priority=65534
Private Google Access
Always enable Private Google Access to reach GCP APIs without public IPs:
gcloud compute networks subnets update myapp-subnet \
--region=us-central1 \
--enable-private-google-access
Monitoring and Logging
Cloud Monitoring Setup
# Create uptime check (flag names vary by gcloud version; confirm with
# `gcloud monitoring uptime create --help`)
gcloud monitoring uptime create "API Health Check" \
  --resource-type=cloud-run-revision \
  --resource-labels="service_name=myapp-api,location=us-central1" \
  --path="/health" \
  --period=1
# Create alerting policy
gcloud alpha monitoring policies create \
--display-name="High Error Rate" \
--condition-display-name="Cloud Run 5xx > 1%" \
--condition-filter='resource.type="cloud_run_revision" AND metric.type="run.googleapis.com/request_count" AND metric.labels.response_code_class="5xx"' \
--condition-threshold-value=1 \
--notification-channels="projects/my-project/notificationChannels/12345"
Key Metrics to Monitor
| Service | Metric | Alert Threshold |
|---|---|---|
| Cloud Run | request_latencies (p99) | >2s |
| Cloud Run | request_count (5xx) | >1% of total |
| Cloud SQL | cpu/utilization | >80% |
| Cloud SQL | disk/utilization | >85% |
| GKE | container/cpu/utilization | >80% |
| GKE | node/cpu/allocatable_utilization | >85% |
| Pub/Sub | subscription/oldest_unacked_message_age | >300s |
| BigQuery | query/execution_time | >60s |
Log-Based Metrics
# Create a metric for application errors
gcloud logging metrics create app-errors \
--description="Application error count" \
--log-filter='resource.type="cloud_run_revision" AND severity>=ERROR'
# Create log sink to BigQuery for analysis
gcloud logging sinks create audit-logs-bq \
bigquery.googleapis.com/projects/my-project/datasets/audit_logs \
--log-filter='logName="projects/my-project/logs/cloudaudit.googleapis.com%2Factivity"'
Log Exclusion (Cost Reduction)
# Narrow the _Default sink to drop verbose debug logs and save on Cloud Logging costs
# (the _Default sink already exists, so update it rather than create a new one)
gcloud logging sinks update _Default \
  --log-filter='NOT (severity="DEBUG" OR severity="DEFAULT")'
# Or attach a named exclusion filter to the sink
gcloud logging sinks update _Default \
  --add-exclusion='name=exclude-debug,filter=severity=DEBUG'
Cost Optimization
Committed Use Discounts
| Term | Compute Discount | Memory Discount |
|---|---|---|
| 1 year | 37% | 37% |
| 3 years | 55% | 55% |
# Check recommendations
gcloud recommender recommendations list \
--project=my-project \
--location=us-central1 \
--recommender=google.compute.commitment.UsageCommitmentRecommender
Sustained Use Discounts
Automatic discounts for eligible Compute Engine machine types (e.g. N1/N2) running more than 25% of the month; note that E2 VMs do not receive sustained use discounts:
| Usage | Discount |
|---|---|
| 25-50% | 20% |
| 50-75% | 40% |
| 75-100% | 60% |
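The tiers are incremental: each quarter of the month is billed at its own rate (100%, 80%, 60%, 40% of base), so a full month works out to a 30% effective discount. A sketch of that arithmetic with awk (illustrative only; actual billing is per-second, and the `sud_discount` helper is hypothetical):

```shell
# Effective sustained-use discount for a VM running a given fraction of the month.
# Incremental tier rates: first 25% at full base rate, then 80%, 60%, 40%.
sud_discount() {
  awk -v u="$1" 'BEGIN {
    split("1.0 0.8 0.6 0.4", rate, " ")
    billed = 0
    for (i = 1; i <= 4; i++) {
      in_tier = u - (i - 1) * 0.25        # usage falling inside this tier
      if (in_tier > 0.25) in_tier = 0.25
      if (in_tier < 0) in_tier = 0
      billed += in_tier * rate[i]
    }
    printf "%.0f%%\n", (1 - billed / u) * 100
  }'
}

sud_discount 1.0   # full month  -> prints: 30%
sud_discount 0.5   # half month  -> prints: 10%
```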
BigQuery Cost Control
-- Use partitioning to limit data scanned
CREATE TABLE my_dataset.events
PARTITION BY DATE(timestamp)
CLUSTER BY event_type
AS SELECT * FROM raw_events;
-- Estimate query cost before running
-- Use --dry_run flag
bq query --dry_run --use_legacy_sql=false \
'SELECT * FROM my_dataset.events WHERE DATE(timestamp) = "2026-01-01"'
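The dry run reports bytes processed; turning that into a dollar estimate is simple arithmetic. A sketch assuming on-demand pricing of roughly $6.25 per TiB scanned (an assumption; check current BigQuery pricing for your region, and note the `bq_cost_estimate` helper is made up):

```shell
# Estimate on-demand query cost from the bytes-processed figure a dry run reports.
# The $6.25/TiB price is an assumption; verify against current BigQuery pricing.
bq_cost_estimate() {
  awk -v bytes="$1" -v price=6.25 'BEGIN {
    tib = bytes / (1024 ^ 4)
    printf "%.2f TiB scanned, approx. $%.2f\n", tib, tib * price
  }'
}

bq_cost_estimate 1099511627776   # 1 TiB -> prints: 1.00 TiB scanned, approx. $6.25
```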
Cloud Storage Optimization
# Enable Autoclass for automatic class management
gsutil mb -l us-central1 --autoclass gs://my-bucket/
# Set lifecycle policy
gsutil lifecycle set lifecycle.json gs://my-bucket/
Disaster Recovery
RPO/RTO Targets
| Tier | RPO | RTO | Strategy |
|---|---|---|---|
| Tier 1 (Critical) | 0 | <1 hour | Multi-region active-active |
| Tier 2 (Important) | <1 hour | <4 hours | Regional HA + cross-region backup |
| Tier 3 (Standard) | <24 hours | <24 hours | Automated backups + restore |
Backup Strategy
# Cloud SQL automated backups
gcloud sql instances patch myapp-db \
--backup-start-time=02:00 \
--enable-point-in-time-recovery
# Firestore scheduled exports
gcloud firestore export gs://myapp-backups/firestore/$(date +%Y%m%d)
# GKE cluster backup with Backup for GKE
gcloud beta container backup-restore backup-plans create myapp-plan \
--project=my-project \
--location=us-central1 \
--cluster=projects/my-project/locations/us-central1/clusters/myapp-cluster \
--all-namespaces \
--cron-schedule="0 2 * * *"
Multi-Region Failover
# Cloud SQL cross-region replica for DR
gcloud sql instances create myapp-db-replica \
--master-instance-name=myapp-db \
--region=us-east1
# Promote replica during failover
gcloud sql instances promote-replica myapp-db-replica
Common Pitfalls
Technical Debt
| Pitfall | Solution |
|---|---|
| Using default VPC | Always create custom VPCs |
| Not enabling audit logs | Enable Cloud Audit Logs from day one |
| Single-region deployment | Plan for multi-zone at minimum |
| No IaC | Use Terraform from the start |
Security Mistakes
| Mistake | Prevention |
|---|---|
| SA key files in code | Use Workload Identity, attached SAs |
| Public GCS buckets | Enable org policy for public access prevention |
| Basic roles (Owner/Editor) | Use predefined or custom roles |
| No encryption key management | Use CMEK for sensitive data |
| Default service account | Create dedicated SAs per workload |
Performance Issues
| Issue | Solution |
|---|---|
| Cold starts on Cloud Run | Set min-instances=1 for latency-critical services |
| Slow BigQuery queries | Partition tables, use clustering, avoid SELECT * |
| GKE pod scheduling delays | Pre-provision headroom with low-priority placeholder pods; consider Autopilot |
| Firestore hotspots | Distribute writes across document IDs evenly |
Cost Surprises
| Surprise | Prevention |
|---|---|
| Undeleted resources | Label everything, review weekly |
| Egress costs | Keep traffic in same region, use Private Google Access |
| Cloud NAT charges | Use Private Google Access for GCP service traffic |
| Log ingestion costs | Set exclusion filters for debug/verbose logs |
| BigQuery full scans | Always use partitioning and clustering |
| Idle GKE clusters | Delete dev clusters nightly, use Autopilot |