firefrost-gaming/claude-skills-reference

Reza Rezvani 2056ba251f feat(engineering-team): add azure-cloud-architect, security-pen-testing; extend terraform-patterns

azure-cloud-architect (451-line SKILL.md, 3 scripts, 3 references):
- 6-step workflow mirroring aws-solution-architect for Azure
- Bicep/ARM templates, AKS, Functions, Cosmos DB, cost optimization
- architecture_designer.py, cost_optimizer.py, bicep_generator.py

security-pen-testing (850-line SKILL.md, 3 scripts, 3 references):
- OWASP Top 10 systematic audit, offensive security testing
- XSS/SQLi/SSRF/IDOR detection, secret scanning, API security
- vulnerability_scanner.py, dependency_auditor.py, pentest_report_generator.py
- Responsible disclosure workflow included

terraform-patterns extended (487 → 740 lines):
- Multi-cloud provider configuration
- OpenTofu compatibility notes
- Infracost integration for PR cost estimation
- Import existing infrastructure patterns
- Terragrunt DRY multi-environment patterns

Updated engineering-team plugin.json (26 → 28 skills).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-25 13:32:22 +01:00

Azure Best Practices

Production-ready practices for naming, tagging, security, networking, monitoring, and disaster recovery on Azure.

Naming Conventions
Tagging Strategy
RBAC and Least Privilege
Network Security
Monitoring and Alerting
Disaster Recovery
Common Pitfalls

Naming Conventions

Follow the Azure Cloud Adoption Framework (CAF) naming convention for consistency and automation. The examples below use the short form, without the region and instance segments.

Format

<resource-type>-<workload>-<environment>-<region>-<instance>

Examples

Resource	Naming Pattern	Example
Resource Group	rg-<workload>-<env>	rg-myapp-prod
App Service	app-<workload>-<env>	app-myapp-prod
App Service Plan	plan-<workload>-<env>	plan-myapp-prod
Azure SQL Server	sql-<workload>-<env>	sql-myapp-prod
Azure SQL Database	sqldb-<workload>-<env>	sqldb-myapp-prod
Storage Account	st<workload><env> (no hyphens)	stmyappprod
Key Vault	kv-<workload>-<env>	kv-myapp-prod
AKS Cluster	aks-<workload>-<env>	aks-myapp-prod
Container Registry	cr<workload><env> (no hyphens)	crmyappprod
Virtual Network	vnet-<workload>-<env>	vnet-myapp-prod
Subnet	snet-<purpose>	snet-app, snet-data
NSG	nsg-<subnet-name>	nsg-snet-app
Public IP	pip-<resource>-<env>	pip-agw-prod
Cosmos DB	cosmos-<workload>-<env>	cosmos-myapp-prod
Service Bus	sb-<workload>-<env>	sb-myapp-prod
Event Hubs	evh-<workload>-<env>	evh-myapp-prod
Log Analytics	log-<workload>-<env>	log-myapp-prod
Application Insights	appi-<workload>-<env>	appi-myapp-prod

Rules

Lowercase only (some resources require it — be consistent everywhere)
Hyphens as separators (except where disallowed: storage accounts, container registries)
Stay within each resource type's maximum length (e.g., storage accounts: 3-24 characters, lowercase letters and digits only)
Environment abbreviations: dev, stg, prod
Region abbreviations: eus (East US), weu (West Europe), sea (Southeast Asia)
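
The rules above can be checked mechanically before a deployment. The following is a minimal sketch, not part of the skill's scripts: `build_name`, `NO_HYPHEN_PREFIXES`, and `MAX_LENGTH` are hypothetical names, and only the storage account limit (24) comes from the rules above.

```python
import re

# Prefixes whose resource types disallow hyphens
# (storage accounts and container registries, per the rules above).
NO_HYPHEN_PREFIXES = {"st", "cr"}

# Only the storage account limit (24) is stated above; limits for other
# resource types are left out here and would need to be looked up.
MAX_LENGTH = {"st": 24}


def build_name(prefix, workload, env, region="", instance=""):
    """Join name segments in CAF order, skipping empty optional segments."""
    prefix = prefix.lower()
    segments = [s.lower() for s in (prefix, workload, env, region, instance) if s]
    if prefix in NO_HYPHEN_PREFIXES:
        name = "".join(segments)
        pattern = r"[a-z0-9]+"
    else:
        name = "-".join(segments)
        pattern = r"[a-z0-9-]+"
    if not re.fullmatch(pattern, name):
        raise ValueError(f"{name!r} has characters not allowed for prefix {prefix!r}")
    limit = MAX_LENGTH.get(prefix)
    if limit is not None and len(name) > limit:
        raise ValueError(f"{name!r} is {len(name)} characters; limit is {limit}")
    return name


print(build_name("rg", "myapp", "prod"))
print(build_name("st", "myapp", "prod"))
```

Raising an error on a hyphenated storage account workload, rather than silently stripping the hyphen, keeps the generated name predictable from its inputs.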

Tagging Strategy

Tags enable cost allocation, ownership tracking, and automation. Apply them to every resource.

Required Tags

Tag Key	Purpose	Example Values
environment	Cost splitting, policy targeting	dev, staging, production
app-name	Workload identification	myapp, data-pipeline
owner	Team or individual responsible	platform-team, jane.doe@company.com
cost-center	Finance allocation	CC-1234, engineering

Recommended Tags

Tag Key	Purpose	Example Values
created-by	IaC or manual tracking	bicep, terraform, portal
data-classification	Security posture	public, internal, confidential
compliance	Regulatory requirements	hipaa, gdpr, sox
auto-shutdown	Dev/test cost savings	true, false

Enforcement

Use Azure Policy to enforce tagging:

{
  "if": {
    "allOf": [
      { "field": "tags['environment']", "exists": "false" },
      { "field": "type", "notEquals": "Microsoft.Resources/subscriptions/resourceGroups" }
    ]
  },
  "then": { "effect": "deny" }
}
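
The rule above covers only the environment tag. One way to extend it to all four required tags is to generate one rule per tag. This is a sketch under assumptions: `require_tag_rule` and `would_deny` are hypothetical helpers, and `would_deny` is only a local approximation of how the rule evaluates, not the Azure Policy engine.

```python
import json

REQUIRED_TAGS = ["environment", "app-name", "owner", "cost-center"]
RESOURCE_GROUP_TYPE = "Microsoft.Resources/subscriptions/resourceGroups"


def require_tag_rule(tag):
    """Build a policy rule with the same shape as the example above."""
    return {
        "if": {
            "allOf": [
                {"field": f"tags['{tag}']", "exists": "false"},
                {"field": "type", "notEquals": RESOURCE_GROUP_TYPE},
            ]
        },
        "then": {"effect": "deny"},
    }


def would_deny(resource, tag):
    """Local approximation: deny when the tag is missing on a non-resource-group."""
    missing = tag not in resource.get("tags", {})
    return missing and resource["type"] != RESOURCE_GROUP_TYPE


rules = {tag: require_tag_rule(tag) for tag in REQUIRED_TAGS}
print(json.dumps(rules["owner"], indent=2))
```

Running `would_deny` over a resource inventory export before assigning the policy gives a rough preview of which deployments would start failing.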

RBAC and Least Privilege

Principles

Use built-in roles before creating custom roles
Assign roles to groups, not individual users
Scope to the narrowest level — resource group or resource, not subscription
Use Managed Identity for service-to-service — never store credentials
Enable Entra ID PIM (Privileged Identity Management) for just-in-time admin access

Common Role Assignments

Persona	Scope	Role
Developer	Resource Group (dev)	Contributor
Developer	Resource Group (prod)	Reader
CI/CD pipeline	Resource Group	Contributor (via workload identity)
App Service	Key Vault	Key Vault Secrets User
App Service	Azure SQL	Entra auth with a contained database user (db_datareader/db_datawriter); SQL DB Contributor grants management-plane access only
AKS pod	Cosmos DB	Cosmos DB Built-in Data Contributor
Security team	Subscription	Security Reader
Platform team	Subscription	Owner (with PIM)

Workload Identity Federation

For CI/CD pipelines (GitHub Actions, Azure DevOps), use workload identity federation instead of service principal secrets:

# Create federated credential (GitHub Actions example)
az ad app federated-credential create \
  --id <app-object-id> \
  --parameters '{
    "name": "github-main",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:org/repo:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"]
  }'
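
The JSON passed to `--parameters` can be generated rather than hand-edited, for example when a repository needs one credential per branch or per GitHub environment. The sketch below assumes hypothetical helper names (`github_subject`, `credential_parameters`); the issuer, audience, and branch subject format are taken from the command above, and the environment subject format follows GitHub's OIDC claim convention.

```python
import json

GITHUB_ISSUER = "https://token.actions.githubusercontent.com"
AUDIENCE = "api://AzureADTokenExchange"


def github_subject(org, repo, *, branch=None, environment=None):
    """Build the subject claim for either a branch or a GitHub environment."""
    if (branch is None) == (environment is None):
        raise ValueError("pass exactly one of branch or environment")
    if branch is not None:
        return f"repo:{org}/{repo}:ref:refs/heads/{branch}"
    return f"repo:{org}/{repo}:environment:{environment}"


def credential_parameters(name, subject):
    """Return the dict to serialize for the --parameters argument."""
    return {
        "name": name,
        "issuer": GITHUB_ISSUER,
        "subject": subject,
        "audiences": [AUDIENCE],
    }


params = credential_parameters("github-main", github_subject("org", "repo", branch="main"))
print(json.dumps(params, indent=2))
```

The subject must match the token claim exactly, so a credential created for `refs/heads/main` will not authorize a workflow running on another branch.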

Network Security

Defense in Depth

Layer	Control	Implementation
Edge	DDoS + WAF	Azure DDoS Protection + Front Door WAF
Perimeter	Firewall	Azure Firewall or NVA for hub VNet
Network	Segmentation	VNet + subnets + NSGs
Application	Access control	Private Endpoints + Managed Identity
Data	Encryption	TLS 1.2+ in transit, CMK at rest

Private Endpoints

Every PaaS service in production must use Private Endpoints:

Service	Private Endpoint Support	Private DNS Zone
Azure SQL	Yes	privatelink.database.windows.net
Cosmos DB	Yes	privatelink.documents.azure.com
Key Vault	Yes	privatelink.vaultcore.azure.net
Storage (Blob)	Yes	privatelink.blob.core.windows.net
Container Registry	Yes	privatelink.azurecr.io
Service Bus	Yes	privatelink.servicebus.windows.net
App Service	VNet Integration (outbound) + Private Endpoint (inbound)	privatelink.azurewebsites.net

NSG Rules Baseline

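Every subnet should have an NSG. Start with deny-all inbound, then open only what is needed. Custom rule priorities range from 100 to 4096, and lower numbers are evaluated first. The Front Door source below corresponds to the AzureFrontDoor.Backend service tag: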

Priority  Direction  Action  Source          Destination     Port
100       Inbound    Allow   Front Door      App Subnet      443
200       Inbound    Allow   App Subnet      Data Subnet     1433,5432
300       Inbound    Allow   VNet            VNet            Any (internal)
4096      Inbound    Deny    Any             Any             Any
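
NSG rules are processed in priority order and evaluation stops at the first match. The sketch below replays the baseline table locally to show which rule decides a given flow. It is a simplified model under assumptions: sources and destinations are plain labels rather than address prefixes, and `Rule`, `VNET_MEMBERS`, and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass

# Simplified membership: which labelled endpoints count as "VNet".
VNET_MEMBERS = {"App Subnet", "Data Subnet"}


@dataclass(frozen=True)
class Rule:
    priority: int
    action: str
    source: str
    destination: str
    ports: frozenset  # an empty set means any port


BASELINE = [
    Rule(100, "Allow", "Front Door", "App Subnet", frozenset({443})),
    Rule(200, "Allow", "App Subnet", "Data Subnet", frozenset({1433, 5432})),
    Rule(300, "Allow", "VNet", "VNet", frozenset()),
    Rule(4096, "Deny", "Any", "Any", frozenset()),
]


def _matches(label, endpoint):
    if label == "Any":
        return True
    if label == "VNet":
        return endpoint in VNET_MEMBERS
    return label == endpoint


def evaluate(rules, source, destination, port):
    """Return (action, priority) of the first match, lowest priority number first."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if (_matches(rule.source, source)
                and _matches(rule.destination, destination)
                and (not rule.ports or port in rule.ports)):
            return rule.action, rule.priority
    return "Deny", None  # nothing matched; treated as denied in this sketch


print(evaluate(BASELINE, "App Subnet", "Data Subnet", 22))
```

In this model, port 22 from the app subnet to the data subnet is allowed by rule 300, because the VNet-to-VNet rule matches any port. If only the database ports should be reachable, rule 300 would need to be narrowed or removed.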

Application Gateway + WAF

For single-region web apps without Front Door:

Application Gateway v2 with WAF enabled
OWASP Core Rule Set (CRS) 3.2 + custom rules
Rate limiting per client IP
Bot protection (managed rule set)
SSL termination with Key Vault certificate

Monitoring and Alerting

Monitoring Stack

Application Insights (APM + distributed tracing)
        │
        ▼
Log Analytics Workspace (central log store)
        │
        ▼
Azure Monitor Alerts (metric + log-based)
        │
        ▼
Action Groups (email, Teams, PagerDuty, webhook)

Essential Alerts

Alert	Condition	Severity
App Service HTTP 5xx	> 10 in 5 minutes	Critical (Sev 0)
App Service response time	P95 > 2 seconds	Warning (Sev 2)
Azure SQL DTU/CPU	> 80% for 10 minutes	Warning (Sev 2)
Azure SQL deadlocks	> 0	Warning (Sev 2)
Cosmos DB throttled requests	429 count > 10 in 5 min	Warning (Sev 2)
AKS node CPU	> 80% for 10 minutes	Warning (Sev 2)
AKS pod restart count	> 5 in 10 minutes	Critical (Sev 0)
Key Vault access denied	> 0	Critical (Sev 0)
Budget threshold	80% of monthly budget	Informational (Sev 3)
Budget threshold	100% of monthly budget	Critical (Sev 0)
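
Several conditions above have the form "more than N events in a time window". The following sketch illustrates that semantic with a sliding window, which can help when choosing thresholds against historical data. It is an illustration only: Azure Monitor evaluates alerts on its own aggregation schedule, and `WindowedCountAlert` is a hypothetical class.

```python
from collections import deque


class WindowedCountAlert:
    """Fire when more than `threshold` events fall in the last `window_seconds`."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.events = deque()

    def record(self, timestamp):
        """Record one event; return True if the alert condition now holds."""
        self.events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have aged out of the window (cutoff, timestamp].
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events) > self.threshold


# "HTTP 5xx > 10 in 5 minutes": eleven errors one second apart.
alert = WindowedCountAlert(threshold=10, window_seconds=300)
results = [alert.record(t) for t in range(11)]
print(results)
```

Because the condition is "greater than 10", the tenth error in the window does not fire the alert and the eleventh does.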

KQL Queries for Troubleshooting

App Service slow requests:

requests
| where duration > 2000
| summarize count(), avg(duration), percentile(duration, 95) by name
| order by count_ desc
| take 10

Failed dependencies (SQL, HTTP, etc.):

dependencies
| where success == false
| summarize count() by type, target, resultCode
| order by count_ desc

AKS pod errors:

KubePodInventory
| where PodStatus != "Running" and PodStatus != "Succeeded"
| summarize count() by PodStatus, Namespace, Name
| order by count_ desc

Application Insights Configuration

Enable distributed tracing with W3C trace context
Set sampling to 5-10% for high-volume production (100% for dev)
Enable profiler for .NET applications
Enable snapshot debugger for exception analysis
Configure availability tests (standard tests every 5 minutes from multiple regions; classic URL ping tests are being retired)

Disaster Recovery

RPO/RTO Mapping

Tier	RPO	RTO	Strategy	Cost
Tier 1 (critical)	< 5 minutes	< 1 hour	Active-active multi-region	2x
Tier 2 (important)	< 1 hour	< 4 hours	Warm standby	1.3x
Tier 3 (standard)	< 24 hours	< 24 hours	Backup and restore	1.1x
Tier 4 (non-critical)	< 72 hours	< 72 hours	Rebuild from IaC	1x
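
The table can be read as a lookup: given a workload's required RPO and RTO, pick the cheapest tier whose bounds meet both. A minimal sketch, assuming each tier's upper bounds from the table as its guaranteed values; `Tier` and `cheapest_tier` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rpo_minutes: int
    rto_minutes: int
    strategy: str
    cost_multiplier: float


# Values copied from the table above, ordered from cheapest to most expensive.
TIERS_CHEAPEST_FIRST = [
    Tier("Tier 4", 72 * 60, 72 * 60, "Rebuild from IaC", 1.0),
    Tier("Tier 3", 24 * 60, 24 * 60, "Backup and restore", 1.1),
    Tier("Tier 2", 60, 4 * 60, "Warm standby", 1.3),
    Tier("Tier 1", 5, 60, "Active-active multi-region", 2.0),
]


def cheapest_tier(required_rpo_minutes, required_rto_minutes):
    """Return the first (cheapest) tier whose RPO and RTO both meet the targets."""
    for tier in TIERS_CHEAPEST_FIRST:
        if (tier.rpo_minutes <= required_rpo_minutes
                and tier.rto_minutes <= required_rto_minutes):
            return tier
    raise ValueError("no tier in the table meets these targets")


choice = cheapest_tier(required_rpo_minutes=30, required_rto_minutes=240)
print(choice.name, choice.strategy, choice.cost_multiplier)
```

A 30-minute RPO rules out warm standby in this table even though its RTO would be acceptable, so the stricter of the two targets drives the cost.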

Backup Strategy

Service	Backup Method	Retention
Azure SQL	Automated backups	7 days (short-term), 10 years (long-term)
Cosmos DB	Continuous backup + point-in-time restore	7 or 30 days
Blob Storage	Soft delete + versioning + geo-redundant	30 days soft delete
AKS	Velero backup to Blob Storage	7 days
Key Vault	Soft delete + purge protection	90 days
App Service	Manual or automated (Backup and Restore feature)	Custom

Storage Redundancy

Redundancy	Regions	Durability	Use Case
LRS	1 (3 copies)	11 nines	Dev/test, easily recreatable data
ZRS	1 (3 AZs)	12 nines	Production, zone failure protection
GRS	2 (6 copies)	16 nines	Business-critical, regional failure protection
GZRS	2 (3 AZs + secondary)	16 nines	Most critical data, best protection

Default to ZRS for production. Use GRS/GZRS only when cross-region DR is required.
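
To put the "nines" in the table in perspective, durability of N nines corresponds to an annual per-object loss probability of roughly 10^-N. The sketch below turns that into an expected number of lost objects per year. It is a back-of-the-envelope model that assumes independent objects; `expected_annual_loss` is a hypothetical helper.

```python
# Durability "nines" copied from the table above.
REDUNDANCY_NINES = {"LRS": 11, "ZRS": 12, "GRS": 16, "GZRS": 16}


def expected_annual_loss(objects, redundancy):
    """Expected objects lost per year, treating each object independently."""
    per_object_probability = 10.0 ** -REDUNDANCY_NINES[redundancy]
    return objects * per_object_probability


for option in ("LRS", "ZRS", "GRS"):
    print(option, expected_annual_loss(1_000_000_000, option))
```

Durability figures describe data loss, not availability during a zone or region outage, which is why the guidance above chooses between ZRS and GRS/GZRS on failure scope rather than on nines alone.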

DR Testing Checklist

Verify automated backups are running and retention is correct
Test point-in-time restore for databases (monthly)
Test regional failover for SQL failover groups (quarterly)
Validate IaC can recreate full environment from scratch
Test Front Door failover by taking down primary region health endpoint
Document and test runbook for manual failover steps
Measure actual RTO vs target during DR drill

Common Pitfalls

Cost Pitfalls

Pitfall	Impact	Prevention
No budget alerts	Unexpected bills	Set alerts at 50%, 80%, 100% on day one
Premium tier in dev/test	3-5x overspend	Use Basic/Free tiers, auto-shutdown VMs
Orphaned resources	Silent monthly charges	Tag everything, review Cost Management weekly
Ignoring Reserved Instances	35-55% overpay on steady workloads	Review Azure Advisor quarterly
Over-provisioned Cosmos DB RU/s	Paying for unused throughput	Use autoscale or serverless
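
The 50%, 80%, and 100% budget alerts in the first row amount to a simple threshold check. A minimal sketch, useful for testing notification logic locally; `crossed_thresholds` is a hypothetical helper and does not call Azure Cost Management.

```python
def crossed_thresholds(budget, spend, thresholds=(50, 80, 100)):
    """Return the percentage thresholds that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    # Compare spend * 100 with threshold * budget instead of dividing first,
    # so whole-number inputs stay exact at the boundaries.
    return [t for t in thresholds if spend * 100 >= t * budget]


print(crossed_thresholds(budget=1000, spend=820))
```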

Security Pitfalls

Pitfall	Impact	Prevention
Secrets in App Settings	Leaked credentials	Use Key Vault references
Public PaaS endpoints	Exposed attack surface	Private Endpoints + VNet integration
Contributor role on subscription	Overprivileged access	Scope to resource group, use PIM
No diagnostic settings	Blind to attacks	Enable on every resource from day one
SQL password authentication	Weak identity model	Entra-only auth, Managed Identity

Operational Pitfalls

Pitfall	Impact	Prevention
Manual portal deployments	Drift, no audit trail	Bicep for everything; in prod, give people Reader and let only the pipeline identity write
No health checks configured	Silent failures	/health endpoint, Front Door probes, App Service checks
Single region deployment	Single point of failure	At minimum, use Availability Zones
No tagging strategy	Cannot track costs/ownership	Enforce via Azure Policy from day one
Ignoring Azure Advisor	Missed optimizations	Weekly review, enable email digest
