claude-skills-reference/engineering-team/azure-cloud-architect/references/best_practices.md
Reza Rezvani 2056ba251f feat(engineering-team): add azure-cloud-architect, security-pen-testing; extend terraform-patterns
azure-cloud-architect (451-line SKILL.md, 3 scripts, 3 references):
- 6-step workflow mirroring aws-solution-architect for Azure
- Bicep/ARM templates, AKS, Functions, Cosmos DB, cost optimization
- architecture_designer.py, cost_optimizer.py, bicep_generator.py

security-pen-testing (850-line SKILL.md, 3 scripts, 3 references):
- OWASP Top 10 systematic audit, offensive security testing
- XSS/SQLi/SSRF/IDOR detection, secret scanning, API security
- vulnerability_scanner.py, dependency_auditor.py, pentest_report_generator.py
- Responsible disclosure workflow included

terraform-patterns extended (487 → 740 lines):
- Multi-cloud provider configuration
- OpenTofu compatibility notes
- Infracost integration for PR cost estimation
- Import existing infrastructure patterns
- Terragrunt DRY multi-environment patterns

Updated engineering-team plugin.json (26 → 28 skills).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 13:32:22 +01:00

Azure Best Practices

Production-ready practices for naming, tagging, security, networking, monitoring, and disaster recovery on Azure.


Naming Conventions

Follow the Azure Cloud Adoption Framework (CAF) naming convention for consistency and automation.

Format

<resource-type>-<workload>-<environment>-<region>-<instance>

Examples

| Resource | Naming Pattern | Example |
| --- | --- | --- |
| Resource Group | rg-<workload>-<env> | rg-myapp-prod |
| App Service | app-<workload>-<env> | app-myapp-prod |
| App Service Plan | plan-<workload>-<env> | plan-myapp-prod |
| Azure SQL Server | sql-<workload>-<env> | sql-myapp-prod |
| Azure SQL Database | sqldb-<workload>-<env> | sqldb-myapp-prod |
| Storage Account | st<workload><env> (no hyphens) | stmyappprod |
| Key Vault | kv-<workload>-<env> | kv-myapp-prod |
| AKS Cluster | aks-<workload>-<env> | aks-myapp-prod |
| Container Registry | cr<workload><env> (no hyphens) | crmyappprod |
| Virtual Network | vnet-<workload>-<env> | vnet-myapp-prod |
| Subnet | snet-<purpose> | snet-app, snet-data |
| NSG | nsg-<subnet-name> | nsg-snet-app |
| Public IP | pip-<resource>-<env> | pip-agw-prod |
| Cosmos DB | cosmos-<workload>-<env> | cosmos-myapp-prod |
| Service Bus | sb-<workload>-<env> | sb-myapp-prod |
| Event Hubs | evh-<workload>-<env> | evh-myapp-prod |
| Log Analytics | log-<workload>-<env> | log-myapp-prod |
| Application Insights | appi-<workload>-<env> | appi-myapp-prod |

Rules

  • Lowercase only (some resources require it — be consistent everywhere)
  • Hyphens as separators (except where disallowed: storage accounts, container registries)
  • Stay within each resource type's maximum name length (e.g., storage accounts allow 3-24 characters)
  • Environment abbreviations: dev, stg, prod
  • Region abbreviations: eus (East US), weu (West Europe), sea (Southeast Asia)
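
The format and rules above can be sketched as a small name builder. This is a minimal illustration, not an official CAF tool: `build_name`, `NO_HYPHEN_TYPES`, and `DEFAULT_MAX_LENGTH` are hypothetical names, the region and instance segments are treated as optional (the examples in the table omit them), and only the 24-character storage account limit comes from the text.

```python
import re

# Resource-type prefixes whose names may not contain hyphens (from the rules above).
NO_HYPHEN_TYPES = {"st", "cr"}

# Only the storage account limit is taken from the text; the default is an
# assumption for illustration and should be replaced with real per-type limits.
MAX_LENGTH = {"st": 24}
DEFAULT_MAX_LENGTH = 63


def build_name(resource_type, workload, environment, region=None, instance=None):
    """Join the name segments in order, lowercased, and validate the result."""
    prefix = resource_type.lower()
    segments = (prefix, workload, environment, region, instance)
    parts = [p.lower() for p in segments if p]

    hyphens_allowed = prefix not in NO_HYPHEN_TYPES
    name = ("-" if hyphens_allowed else "").join(parts)

    allowed = r"[a-z0-9-]+" if hyphens_allowed else r"[a-z0-9]+"
    if not re.fullmatch(allowed, name):
        raise ValueError(f"{name!r} contains characters this resource type disallows")

    limit = MAX_LENGTH.get(prefix, DEFAULT_MAX_LENGTH)
    if len(name) > limit:
        raise ValueError(f"{name!r} is {len(name)} characters; limit is {limit}")
    return name


print(build_name("rg", "myapp", "prod"))                  # rg-myapp-prod
print(build_name("st", "MyApp", "prod"))                  # stmyappprod
print(build_name("vnet", "myapp", "prod", "weu", "001"))  # vnet-myapp-prod-weu-001
```

Raising on an invalid name, rather than silently truncating it, keeps generated names predictable across environments.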

Tagging Strategy

Tags enable cost allocation, ownership tracking, and automation. Apply them to every resource that supports tags.

Required Tags

| Tag Key | Purpose | Example Values |
| --- | --- | --- |
| environment | Cost splitting, policy targeting | dev, staging, production |
| app-name | Workload identification | myapp, data-pipeline |
| owner | Team or individual responsible | platform-team, jane.doe@company.com |
| cost-center | Finance allocation | CC-1234, engineering |

Optional Tags

| Tag Key | Purpose | Example Values |
| --- | --- | --- |
| created-by | IaC or manual tracking | bicep, terraform, portal |
| data-classification | Security posture | public, internal, confidential |
| compliance | Regulatory requirements | hipaa, gdpr, sox |
| auto-shutdown | Dev/test cost savings | true, false |

Enforcement

Use Azure Policy to enforce tagging:

{
  "if": {
    "allOf": [
      { "field": "tags['environment']", "exists": "false" },
      { "field": "type", "notEquals": "Microsoft.Resources/subscriptions/resourceGroups" }
    ]
  },
  "then": { "effect": "deny" }
}
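
The policy rule above checks only the `environment` tag. As a complement, a pre-deployment check can report every missing required tag before Azure Policy rejects the deployment. The sketch below is an assumption-laden illustration: `missing_tags` is a hypothetical helper, and the resource is modeled as a plain dict with `type` and `tags` keys rather than a real ARM object.

```python
# Required tag keys, taken from the Required Tags table.
REQUIRED_TAGS = ("environment", "app-name", "owner", "cost-center")

# Resource type exempted by the policy rule above, lowercased for comparison.
EXEMPT_TYPES = {"microsoft.resources/subscriptions/resourcegroups"}


def missing_tags(resource):
    """Return the required tag keys absent from a resource dict, sorted."""
    resource_type = str(resource.get("type", "")).lower()
    if resource_type in EXEMPT_TYPES:
        return []
    tags = resource.get("tags") or {}
    return sorted(key for key in REQUIRED_TAGS if key not in tags)


web_app = {
    "type": "Microsoft.Web/sites",
    "tags": {"environment": "production", "owner": "platform-team"},
}
print(missing_tags(web_app))  # ['app-name', 'cost-center']
```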

RBAC and Least Privilege

Principles

  1. Use built-in roles before creating custom roles
  2. Assign roles to groups, not individual users
  3. Scope to the narrowest level — resource group or resource, not subscription
  4. Use Managed Identity for service-to-service — never store credentials
  5. Enable Entra ID PIM (Privileged Identity Management) for just-in-time admin access

Common Role Assignments

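| Persona | Scope | Role |
| --- | --- | --- |
| Developer | Resource Group (dev) | Contributor |
| Developer | Resource Group (prod) | Reader |
| CI/CD pipeline | Resource Group | Contributor (via workload identity) |
| App Service | Key Vault | Key Vault Secrets User |
| App Service | Azure SQL | Entra authentication with a contained database user (db_datareader/db_datawriter); SQL DB Contributor grants management access only, not data access |
| AKS pod | Cosmos DB | Cosmos DB Built-in Data Contributor |
| Security team | Subscription | Security Reader |
| Platform team | Subscription | Owner (with PIM) |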
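
The principles above (group assignments, narrow scope, PIM for broad roles) lend themselves to a simple review script. The following is a minimal sketch under stated assumptions: assignments are modeled as plain dicts with hypothetical keys (`principal`, `principal_type`, `role`, `scope`, `pim`), not as the output of any specific Azure API.

```python
# Roles treated as "broad" for this illustration.
BROAD_ROLES = {"Owner", "Contributor", "User Access Administrator"}


def scope_level(scope):
    """Classify an ARM scope string by its path segments."""
    segments = [s for s in scope.strip("/").split("/") if s]
    if segments and segments[0].lower() == "providers":
        return "management-group"
    if len(segments) == 2 and segments[0].lower() == "subscriptions":
        return "subscription"
    if len(segments) == 4 and segments[2].lower() == "resourcegroups":
        return "resource-group"
    return "resource"


def flag_assignments(assignments):
    """Return (principal, finding) pairs that violate the principles above."""
    findings = []
    for a in assignments:
        broad = a["role"] in BROAD_ROLES
        wide = scope_level(a["scope"]) in ("subscription", "management-group")
        if broad and wide and not a.get("pim"):
            findings.append((a["principal"], "broad role at subscription scope without PIM"))
        if a.get("principal_type") == "User":
            findings.append((a["principal"], "role assigned to an individual user"))
    return findings


sample = [
    {"principal": "dev-group", "principal_type": "Group", "role": "Contributor",
     "scope": "/subscriptions/000/resourceGroups/rg-myapp-dev"},
    {"principal": "platform-team", "principal_type": "Group", "role": "Owner",
     "scope": "/subscriptions/000", "pim": True},
    {"principal": "jane", "principal_type": "User", "role": "Contributor",
     "scope": "/subscriptions/000"},
]
for principal, finding in flag_assignments(sample):
    print(principal, "->", finding)
```

Only the third assignment is flagged, once for scope and once for targeting an individual user.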

Workload Identity Federation

For CI/CD pipelines (GitHub Actions, Azure DevOps), use workload identity federation instead of service principal secrets:

# Create federated credential (GitHub Actions example)
az ad app federated-credential create \
  --id <app-object-id> \
  --parameters '{
    "name": "github-main",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:org/repo:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"]
  }'
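
When several branches or GitHub environments need their own credential, the `--parameters` JSON can be generated instead of hand-written. The sketch below assumes GitHub's documented OIDC subject formats for branches and environments; `github_federated_credential` is a hypothetical helper, and the `org`/`repo` values are the same placeholders used above.

```python
import json

GITHUB_ISSUER = "https://token.actions.githubusercontent.com"
AUDIENCE = "api://AzureADTokenExchange"


def github_federated_credential(name, org, repo, branch=None, environment=None):
    """Build the parameters body for one branch- or environment-scoped credential."""
    if (branch is None) == (environment is None):
        raise ValueError("pass exactly one of branch or environment")
    if branch is not None:
        subject = f"repo:{org}/{repo}:ref:refs/heads/{branch}"
    else:
        subject = f"repo:{org}/{repo}:environment:{environment}"
    return {
        "name": name,
        "issuer": GITHUB_ISSUER,
        "subject": subject,
        "audiences": [AUDIENCE],
    }


params = github_federated_credential("github-main", "org", "repo", branch="main")
print(json.dumps(params, indent=2))
```

The printed JSON can be passed to `az ad app federated-credential create --parameters`. Scoping the subject to a single branch or environment keeps tokens from other refs from being exchanged.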

Network Security

Defense in Depth

| Layer | Control | Implementation |
| --- | --- | --- |
| Edge | DDoS + WAF | Azure DDoS Protection + Front Door WAF |
| Perimeter | Firewall | Azure Firewall or NVA for hub VNet |
| Network | Segmentation | VNet + subnets + NSGs |
| Application | Access control | Private Endpoints + Managed Identity |
| Data | Encryption | TLS 1.2+ in transit, CMK at rest |

Private Endpoints

Every PaaS service in production must use Private Endpoints:

| Service | Private Endpoint Support | Private DNS Zone |
| --- | --- | --- |
| Azure SQL | Yes | privatelink.database.windows.net |
| Cosmos DB | Yes | privatelink.documents.azure.com |
| Key Vault | Yes | privatelink.vaultcore.azure.net |
| Storage (Blob) | Yes | privatelink.blob.core.windows.net |
| Container Registry | Yes | privatelink.azurecr.io |
| Service Bus | Yes | privatelink.servicebus.windows.net |
| App Service | VNet Integration (outbound) + Private Endpoint (inbound) | privatelink.azurewebsites.net |

NSG Rules Baseline

Every subnet should have an NSG. Start with deny-all inbound, then open only what is needed:

Priority  Direction  Action  Source          Destination     Port
100       Inbound    Allow   Front Door      App Subnet      443
200       Inbound    Allow   App Subnet      Data Subnet     1433,5432
300       Inbound    Allow   VNet            VNet            Any (internal)
4096      Inbound    Deny    Any             Any             Any
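
NSG rules are evaluated in priority order, lowest number first, and the first matching rule decides the outcome. The sketch below models the baseline table to show which rule wins for a given flow. It is a simplified illustration: endpoints are matched by label rather than by CIDR or service tag, and `Rule`, `MEMBERSHIP`, and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int
    action: str
    source: str
    destination: str
    ports: Optional[frozenset]  # None means any port


# The inbound baseline from the table above.
BASELINE = [
    Rule(100, "Allow", "Front Door", "App Subnet", frozenset({443})),
    Rule(200, "Allow", "App Subnet", "Data Subnet", frozenset({1433, 5432})),
    Rule(300, "Allow", "VNet", "VNet", None),
    Rule(4096, "Deny", "Any", "Any", None),
]

# Labels each endpoint matches (an assumption standing in for CIDR matching).
MEMBERSHIP = {
    "Front Door": {"Front Door", "Any"},
    "App Subnet": {"App Subnet", "VNet", "Any"},
    "Data Subnet": {"Data Subnet", "VNet", "Any"},
    "Internet": {"Any"},
}


def evaluate(rules, source, destination, port):
    """Return (priority, action) of the first rule that matches the flow."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if (rule.source in MEMBERSHIP[source]
                and rule.destination in MEMBERSHIP[destination]
                and (rule.ports is None or port in rule.ports)):
            return rule.priority, rule.action
    return None, "Deny"


print(evaluate(BASELINE, "Front Door", "App Subnet", 443))   # (100, 'Allow')
print(evaluate(BASELINE, "App Subnet", "Data Subnet", 22))   # (300, 'Allow')
print(evaluate(BASELINE, "Internet", "App Subnet", 443))     # (4096, 'Deny')
```

The second call shows a consequence of the baseline as written: rule 300 admits any port between subnets in the VNet, so rule 200 does not by itself restrict app-to-data traffic to the database ports. Narrowing or removing rule 300 is one way to tighten segmentation.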

Application Gateway + WAF

For single-region web apps without Front Door:

  • Application Gateway v2 with WAF enabled
  • OWASP Core Rule Set (CRS) 3.2 + custom rules
  • Rate limiting per client IP
  • Bot protection (managed rule set)
  • SSL termination with Key Vault certificate

Monitoring and Alerting

Monitoring Stack

Application Insights (APM + distributed tracing)
        │
        ▼
Log Analytics Workspace (central log store)
        │
        ▼
Azure Monitor Alerts (metric + log-based)
        │
        ▼
Action Groups (email, Teams, PagerDuty, webhook)

Essential Alerts

| Alert | Condition | Severity |
| --- | --- | --- |
| App Service HTTP 5xx | > 10 in 5 minutes | Critical (Sev 0) |
| App Service response time | P95 > 2 seconds | Warning (Sev 2) |
| Azure SQL DTU/CPU | > 80% for 10 minutes | Warning (Sev 2) |
| Azure SQL deadlocks | > 0 | Warning (Sev 2) |
| Cosmos DB throttled requests | 429 count > 10 in 5 min | Warning (Sev 2) |
| AKS node CPU | > 80% for 10 minutes | Warning (Sev 2) |
| AKS pod restart count | > 5 in 10 minutes | Critical (Sev 0) |
| Key Vault access denied | > 0 | Critical (Sev 0) |
| Budget threshold | 80% of monthly budget | Warning (Sev 2) |
| Budget threshold | 100% of monthly budget | Critical (Sev 0) |
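
Several conditions above have the form "more than N events in M minutes". The sketch below shows that logic as a sliding-window counter, which can help when reasoning about thresholds or testing them against sample data. It is only an illustration: Azure Monitor evaluates rules on its own schedule and aggregation windows, and `CountInWindow` is a hypothetical class.

```python
from collections import deque


class CountInWindow:
    """Fire when more than `threshold` events fall within a sliding time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = deque()

    def record(self, timestamp):
        """Add one event (seconds, non-decreasing); return True if the alert fires."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have aged out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.threshold


# "HTTP 5xx > 10 in 5 minutes": one error every 10 seconds.
alert = CountInWindow(threshold=10, window_seconds=300)
results = [alert.record(t) for t in range(0, 110, 10)]
print(results[-1])  # True: the 11th error inside the window crosses the threshold
```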

KQL Queries for Troubleshooting

App Service slow requests:

requests
| where duration > 2000
| summarize count(), avg(duration), percentile(duration, 95) by name
| order by count_ desc
| take 10

Failed dependencies (SQL, HTTP, etc.):

dependencies
| where success == false
| summarize count() by type, target, resultCode
| order by count_ desc

AKS pod errors:

KubePodInventory
| where PodStatus != "Running" and PodStatus != "Succeeded"
| summarize count() by PodStatus, Namespace, Name
| order by count_ desc

Application Insights Configuration

  • Enable distributed tracing with W3C trace context
  • Set sampling to 5-10% for high-volume production (100% for dev)
  • Enable profiler for .NET applications
  • Enable snapshot debugger for exception analysis
  • Configure availability tests (standard tests every 5 minutes from multiple regions; classic URL ping tests are being retired)

Disaster Recovery

RPO/RTO Mapping

| Tier | RPO | RTO | Strategy | Relative Cost |
| --- | --- | --- | --- | --- |
| Tier 1 (critical) | < 5 minutes | < 1 hour | Active-active multi-region | 2x |
| Tier 2 (important) | < 1 hour | < 4 hours | Warm standby | 1.3x |
| Tier 3 (standard) | < 24 hours | < 24 hours | Backup and restore | 1.1x |
| Tier 4 (non-critical) | < 72 hours | < 72 hours | Rebuild from IaC | 1x |
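
Because cost rises with each tier, a workload should generally land in the least expensive tier that still meets its recovery requirements. The sketch below encodes the table and picks that tier. `cheapest_tier` is a hypothetical helper, and treating each tier's bound as "meets any requirement at or above it" is an interpretation of the table, not something the source states.

```python
from datetime import timedelta

# (tier, RPO bound, RTO bound, strategy) from the table above, strictest first.
TIERS = [
    ("Tier 1", timedelta(minutes=5), timedelta(hours=1), "Active-active multi-region"),
    ("Tier 2", timedelta(hours=1), timedelta(hours=4), "Warm standby"),
    ("Tier 3", timedelta(hours=24), timedelta(hours=24), "Backup and restore"),
    ("Tier 4", timedelta(hours=72), timedelta(hours=72), "Rebuild from IaC"),
]


def cheapest_tier(required_rpo, required_rto):
    """Return (tier, strategy) for the cheapest tier that meets both requirements."""
    # Walk from the cheapest tier upward and stop at the first one that fits.
    for name, rpo, rto, strategy in reversed(TIERS):
        if rpo <= required_rpo and rto <= required_rto:
            return name, strategy
    raise ValueError("requirements are stricter than Tier 1 provides")


print(cheapest_tier(timedelta(hours=2), timedelta(hours=8)))  # ('Tier 2', 'Warm standby')
```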

Backup Strategy

| Service | Backup Method | Retention |
| --- | --- | --- |
| Azure SQL | Automated backups | 7 days by default, configurable up to 35 (short-term); up to 10 years (long-term) |
| Cosmos DB | Continuous backup + point-in-time restore | 7-30 days |
| Blob Storage | Soft delete + versioning + geo-redundant | 30 days soft delete |
| AKS | Velero backup to Blob Storage | 7 days |
| Key Vault | Soft delete + purge protection | 90 days |
| App Service | Manual or automated (Backup and Restore feature) | Custom |

Storage Redundancy

| Redundancy | Regions | Durability | Use Case |
| --- | --- | --- | --- |
| LRS | 1 (3 copies) | 11 nines | Dev/test, easily recreatable data |
| ZRS | 1 (3 AZs) | 12 nines | Production, zone failure protection |
| GRS | 2 (6 copies) | 16 nines | Business-critical, regional failure protection |
| GZRS | 2 (3 AZs + secondary) | 16 nines | Most critical data, best protection |

Default to ZRS for production. Use GRS/GZRS only when cross-region DR is required.
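
The guidance above reduces to a short decision rule, sketched here so it can be reused in IaC tooling. `choose_redundancy` and its parameters are hypothetical. The split between GRS and GZRS based on whether the primary region should also be zone-redundant is an assumption drawn from the table, not an explicit rule in the text.

```python
def choose_redundancy(production, cross_region_dr, zone_redundant_primary=True):
    """Pick a storage redundancy SKU from the guidance above.

    zone_redundant_primary only matters when cross_region_dr is True.
    """
    if cross_region_dr:
        return "GZRS" if zone_redundant_primary else "GRS"
    if production:
        return "ZRS"
    return "LRS"


print(choose_redundancy(production=True, cross_region_dr=False))  # ZRS
```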

DR Testing Checklist

  • Verify automated backups are running and retention is correct
  • Test point-in-time restore for databases (monthly)
  • Test regional failover for SQL failover groups (quarterly)
  • Validate IaC can recreate full environment from scratch
  • Test Front Door failover by taking down primary region health endpoint
  • Document and test runbook for manual failover steps
  • Measure actual RTO vs target during DR drill

Common Pitfalls

Cost Pitfalls

| Pitfall | Impact | Prevention |
| --- | --- | --- |
| No budget alerts | Unexpected bills | Set alerts at 50%, 80%, 100% on day one |
| Premium tier in dev/test | 3-5x overspend | Use Basic/Free tiers, auto-shutdown VMs |
| Orphaned resources | Silent monthly charges | Tag everything, review Cost Management weekly |
| Ignoring Reserved Instances | 35-55% overpay on steady workloads | Review Azure Advisor quarterly |
| Over-provisioned Cosmos DB RU/s | Paying for unused throughput | Use autoscale or serverless |
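
The 50%, 80%, and 100% budget alerts in the first row can be checked locally against exported cost data. This is a minimal sketch with a hypothetical `crossed_thresholds` helper; it uses integer-friendly comparison to avoid floating-point surprises at exact boundaries.

```python
# Alert thresholds as percentages of the monthly budget (from the table above).
THRESHOLDS = (50, 80, 100)


def crossed_thresholds(spend, budget):
    """Return the thresholds (in percent) that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    # Compare spend * 100 with threshold * budget instead of dividing.
    return [t for t in THRESHOLDS if spend * 100 >= t * budget]


print(crossed_thresholds(850, 1000))  # [50, 80]
```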

Security Pitfalls

| Pitfall | Impact | Prevention |
| --- | --- | --- |
| Secrets in App Settings | Leaked credentials | Use Key Vault references |
| Public PaaS endpoints | Exposed attack surface | Private Endpoints + VNet integration |
| Contributor role on subscription | Overprivileged access | Scope to resource group, use PIM |
| No diagnostic settings | Blind to attacks | Enable on every resource from day one |
| SQL password authentication | Weak identity model | Entra-only auth, Managed Identity |

Operational Pitfalls

| Pitfall | Impact | Prevention |
| --- | --- | --- |
| Manual portal deployments | Drift, no audit trail | Bicep for everything; restrict manual write access in production (Reader by default, PIM for elevation) |
| No health checks configured | Silent failures | /health endpoint, Front Door probes, App Service checks |
| Single region deployment | Single point of failure | At minimum, use Availability Zones |
| No tagging strategy | Cannot track costs/ownership | Enforce via Azure Policy from day one |
| Ignoring Azure Advisor | Missed optimizations | Weekly review, enable email digest |