Azure Best Practices
Production-ready practices for naming, tagging, security, networking, monitoring, and disaster recovery on Azure.
Naming Conventions
Follow the Azure Cloud Adoption Framework (CAF) naming convention for consistency and automation.
Format
Examples
| Resource | Naming Pattern | Example |
| --- | --- | --- |
| Resource Group | rg-<workload>-<env> | rg-myapp-prod |
| App Service | app-<workload>-<env> | app-myapp-prod |
| App Service Plan | plan-<workload>-<env> | plan-myapp-prod |
| Azure SQL Server | sql-<workload>-<env> | sql-myapp-prod |
| Azure SQL Database | sqldb-<workload>-<env> | sqldb-myapp-prod |
| Storage Account | st<workload><env> (no hyphens) | stmyappprod |
| Key Vault | kv-<workload>-<env> | kv-myapp-prod |
| AKS Cluster | aks-<workload>-<env> | aks-myapp-prod |
| Container Registry | cr<workload><env> (no hyphens) | crmyappprod |
| Virtual Network | vnet-<workload>-<env> | vnet-myapp-prod |
| Subnet | snet-<purpose> | snet-app, snet-data |
| NSG | nsg-<subnet-name> | nsg-snet-app |
| Public IP | pip-<resource>-<env> | pip-agw-prod |
| Cosmos DB | cosmos-<workload>-<env> | cosmos-myapp-prod |
| Service Bus | sb-<workload>-<env> | sb-myapp-prod |
| Event Hubs | evh-<workload>-<env> | evh-myapp-prod |
| Log Analytics | log-<workload>-<env> | log-myapp-prod |
| Application Insights | appi-<workload>-<env> | appi-myapp-prod |
Rules
- Lowercase only (some resources require it — be consistent everywhere)
- Hyphens as separators (except where disallowed: storage accounts, container registries)
- No longer than the resource type max length (e.g., storage accounts max 24 characters)
- Environment abbreviations: dev, stg, prod
- Region abbreviations: eus (East US), weu (West Europe), sea (Southeast Asia)
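
The rules above can be sketched as a small helper that builds and validates names before they reach a template. This is an illustration only, not an official CAF tool: the function name `build_name`, the `NO_HYPHEN_PREFIXES` set, and the `DEFAULT_LIMIT` fallback are assumptions made for this example. The three explicit length limits (storage account 3-24, container registry 5-50, Key Vault 3-24) reflect Azure's documented naming rules; check the current limits for any other resource type you add.

```python
import re

# Prefixes from the naming table whose resource types do not allow hyphens.
NO_HYPHEN_PREFIXES = {"st", "cr"}

# (min, max) name lengths for a few resource types; an illustrative subset.
LENGTH_LIMITS = {"st": (3, 24), "cr": (5, 50), "kv": (3, 24)}
DEFAULT_LIMIT = (1, 63)  # assumption: conservative fallback for other types


def build_name(prefix: str, workload: str, env: str) -> str:
    """Build a name such as 'rg-myapp-prod' or 'stmyappprod'."""
    parts = [p.lower() for p in (prefix, workload, env)]
    kind = parts[0]
    if kind in NO_HYPHEN_PREFIXES:
        # Remove hyphens inside parts too (e.g. workload 'data-pipeline').
        name = "".join(p.replace("-", "") for p in parts)
        pattern = r"[a-z0-9]+"
    else:
        name = "-".join(parts)
        pattern = r"[a-z0-9]+(-[a-z0-9]+)*"
    low, high = LENGTH_LIMITS.get(kind, DEFAULT_LIMIT)
    if not low <= len(name) <= high:
        raise ValueError(f"{name!r} must be {low}-{high} characters long")
    if not re.fullmatch(pattern, name):
        raise ValueError(f"{name!r} has characters outside the convention")
    return name
```

Calling `build_name("st", "myapp", "prod")` yields `stmyappprod`, while an over-long storage account name raises `ValueError` instead of failing later at deployment time.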
Tagging Strategy
Tags enable cost allocation, ownership tracking, and automation. Apply to every resource.
Required Tags
| Tag Key | Purpose | Example Values |
| --- | --- | --- |
| environment | Cost splitting, policy targeting | dev, staging, production |
| app-name | Workload identification | myapp, data-pipeline |
| owner | Team or individual responsible | platform-team, jane.doe@company.com |
| cost-center | Finance allocation | CC-1234, engineering |
Recommended Tags
| Tag Key | Purpose | Example Values |
| --- | --- | --- |
| created-by | IaC or manual tracking | bicep, terraform, portal |
| data-classification | Security posture | public, internal, confidential |
| compliance | Regulatory requirements | hipaa, gdpr, sox |
| auto-shutdown | Dev/test cost savings | true, false |
Enforcement
Use Azure Policy to enforce tagging, so that resources missing a required tag are rejected at deployment time.
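
One way to express this is a policy rule per required tag that denies any resource where the tag does not exist. The sketch below generates such rules as plain dictionaries that can be serialized to JSON; the helper name `require_tag_policy` and the `policies` mapping are assumptions for illustration, and the rule shape follows the common "require a tag on resources" pattern rather than any definition taken from this guide.

```python
import json


def require_tag_policy(tag_name: str, effect: str = "deny") -> dict:
    """Return a policy definition body that blocks resources missing tag_name."""
    return {
        "mode": "Indexed",  # evaluate only resource types that support tags
        "policyRule": {
            # Bracket syntax allows tag names containing hyphens.
            "if": {"field": f"tags['{tag_name}']", "exists": "false"},
            "then": {"effect": effect},
        },
    }


# Required tag keys from the table above.
REQUIRED_TAGS = ["environment", "app-name", "owner", "cost-center"]
policies = {tag: require_tag_policy(tag) for tag in REQUIRED_TAGS}

if __name__ == "__main__":
    print(json.dumps(policies["owner"], indent=2))
```

Starting with the `audit` effect and switching to `deny` once existing resources are tagged is a common rollout choice, since `deny` blocks new non-compliant deployments immediately.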
RBAC and Least Privilege
Principles
- Use built-in roles before creating custom roles
- Assign roles to groups, not individual users
- Scope to the narrowest level — resource group or resource, not subscription
- Use Managed Identity for service-to-service — never store credentials
- Enable Entra ID PIM (Privileged Identity Management) for just-in-time admin access
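
The scoping principle can be checked mechanically against a list of role assignments. The following is a rough sketch, not an Azure API: `scope_level`, `too_broad`, and the `BROAD_ROLES` set are hypothetical names, and the logic only inspects the shape of an Azure resource ID string.

```python
BROAD_ROLES = {"Owner", "Contributor", "User Access Administrator"}


def scope_level(scope: str) -> str:
    """Classify an Azure scope string by how wide it is."""
    lowered = [p.lower() for p in scope.strip("/").split("/") if p]
    if "resourcegroups" in lowered and "providers" in lowered:
        return "resource"
    if "resourcegroups" in lowered:
        return "resource-group"
    if lowered[:1] == ["subscriptions"]:
        return "subscription"
    return "other"  # e.g. management group or tenant root


def too_broad(role: str, scope: str, uses_pim: bool = False) -> bool:
    """Flag standing high-privilege roles at subscription scope or wider."""
    wide = scope_level(scope) in ("subscription", "other")
    return role in BROAD_ROLES and wide and not uses_pim
```

With this check, a standing `Contributor` assignment on a whole subscription is flagged, while the same role on a resource group, or `Owner` activated through PIM, is not.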
Common Role Assignments
| Persona | Scope | Role |
| --- | --- | --- |
| Developer | Resource Group (dev) | Contributor |
| Developer | Resource Group (prod) | Reader |
| CI/CD pipeline | Resource Group | Contributor (via workload identity) |
| App Service | Key Vault | Key Vault Secrets User |
| App Service | Azure SQL | Database roles (db_datareader, db_datawriter) via Entra auth; SQL DB Contributor is management-plane only and grants no data access |
| AKS pod | Cosmos DB | Cosmos DB Built-in Data Contributor |
| Security team | Subscription | Security Reader |
| Platform team | Subscription | Owner (with PIM) |
Workload Identity Federation
For CI/CD pipelines (GitHub Actions, Azure DevOps), use workload identity federation instead of service principal secrets, so that no long-lived credential is stored in the pipeline.
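
For GitHub Actions, federation works by registering a trust between the Entra application and a specific repository and branch. A minimal sketch of that trust record is shown below; the function name and the `my-org` / `myapp` values in the usage note are hypothetical, while the issuer and audience are the standard values for GitHub's OIDC provider and Entra token exchange. The resulting dictionary has the shape accepted as the JSON parameters of a federated credential on an app registration.

```python
def github_federated_credential(org: str, repo: str, branch: str = "main") -> dict:
    """Describe a federated credential that trusts one GitHub branch."""
    return {
        "name": f"github-{repo}-{branch}",
        # OIDC issuer used by GitHub Actions.
        "issuer": "https://token.actions.githubusercontent.com",
        # Only workflows running on this repo and branch match the subject.
        "subject": f"repo:{org}/{repo}:ref:refs/heads/{branch}",
        "audiences": ["api://AzureADTokenExchange"],
    }
```

For example, `github_federated_credential("my-org", "myapp")` produces the subject `repo:my-org/myapp:ref:refs/heads/main`. Keeping the subject as narrow as possible (a single branch or environment) limits which workflow runs can obtain a token.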
Network Security
Defense in Depth
| Layer | Control | Implementation |
| --- | --- | --- |
| Edge | DDoS + WAF | Azure DDoS Protection + Front Door WAF |
| Perimeter | Firewall | Azure Firewall or NVA for hub VNet |
| Network | Segmentation | VNet + subnets + NSGs |
| Application | Access control | Private Endpoints + Managed Identity |
| Data | Encryption | TLS 1.2+ in transit, CMK at rest |
Private Endpoints
Every PaaS service in production must use Private Endpoints:
| Service | Private Endpoint Support | Private DNS Zone |
| --- | --- | --- |
| Azure SQL | Yes | privatelink.database.windows.net |
| Cosmos DB | Yes | privatelink.documents.azure.com |
| Key Vault | Yes | privatelink.vaultcore.azure.net |
| Storage (Blob) | Yes | privatelink.blob.core.windows.net |
| Container Registry | Yes | privatelink.azurecr.io |
| Service Bus | Yes | privatelink.servicebus.windows.net |
| App Service | VNet Integration (outbound) + Private Endpoint (inbound) | privatelink.azurewebsites.net |
NSG Rules Baseline
Every subnet should have an NSG. Start with deny-all inbound, then open only what is needed.
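
NSG rules are processed in priority order (lower numbers first), and the first matching rule decides the outcome. The sketch below models that behaviour with a deliberately simplified rule shape; the `Rule` class, the `evaluate` function, and the `snet-agw` source label are assumptions for illustration, and real rules also match on protocol, destination, and address prefixes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int          # custom rules use 100-4096; lower runs first
    name: str
    source: str            # simplified: a source label, or "*" for any
    port: Optional[int]    # None means any port
    allow: bool


def evaluate(rules, source: str, port: int) -> bool:
    """Return True if inbound traffic is allowed; first match wins."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.source in ("*", source) and rule.port in (None, port):
            return rule.allow
    return False  # nothing matched: treat as denied


BASELINE = [
    Rule(100, "AllowHttpsFromAppGateway", "snet-agw", 443, True),
    Rule(4096, "DenyAllInbound", "*", None, False),
]
```

An explicit deny rule at priority 4096 matters because the platform's default rules (priorities 65000 and above) allow VNet-to-VNet and load balancer traffic before the default deny applies. Note that such a rule also blocks load balancer health probes unless they are explicitly allowed at a lower priority number.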
Application Gateway + WAF
For single-region web apps without Front Door:
- Application Gateway v2 with WAF enabled
- OWASP Core Rule Set (CRS) 3.2 + custom rules
- Rate limiting per client IP
- Bot protection (managed rule set)
- TLS termination with a Key Vault certificate
Monitoring and Alerting
Monitoring Stack
Essential Alerts
| Alert | Condition | Severity |
| --- | --- | --- |
| App Service HTTP 5xx | > 10 in 5 minutes | Critical (Sev 1) |
| App Service response time | P95 > 2 seconds | Warning (Sev 2) |
| Azure SQL DTU/CPU | > 80% for 10 minutes | Warning (Sev 2) |
| Azure SQL deadlocks | > 0 | Warning (Sev 2) |
| Cosmos DB throttled requests | 429 count > 10 in 5 min | Warning (Sev 2) |
| AKS node CPU | > 80% for 10 minutes | Warning (Sev 2) |
| AKS pod restart count | > 5 in 10 minutes | Critical (Sev 1) |
| Key Vault access denied | > 0 | Critical (Sev 1) |
| Budget threshold | 80% of monthly budget | Warning (Sev 3) |
| Budget threshold | 100% of monthly budget | Critical (Sev 1) |
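
Azure Monitor evaluates these conditions itself; the sketch below only illustrates what a count-in-window condition such as "> 10 in 5 minutes" means, which can help when choosing thresholds or replaying historical data. The function name `breaches` and its parameters are assumptions for this example.

```python
from datetime import datetime, timedelta


def breaches(event_times, now: datetime, threshold: int = 10,
             window: timedelta = timedelta(minutes=5)) -> bool:
    """True if more than `threshold` events fall inside the trailing window."""
    recent = [t for t in event_times if now - window < t <= now]
    return len(recent) > threshold
```

The comparison is strict, so exactly 10 errors in the window does not fire; only the 11th does. The same function covers the pod restart condition by passing `threshold=5` and `window=timedelta(minutes=10)`.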
KQL Queries for Troubleshooting
App Service slow requests:
Failed dependencies (SQL, HTTP, etc.):
AKS pod errors:
Application Insights Configuration
- Enable distributed tracing with W3C trace context
- Set sampling to 5-10% for high-volume production (100% for dev)
- Enable profiler for .NET applications
- Enable snapshot debugger for exception analysis
- Configure availability tests (Standard tests every 5 minutes from multiple regions; classic URL ping tests are scheduled for retirement)
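
To make the sampling setting concrete: fixed-rate sampling keeps a stable fraction of traces, and deciding by trace ID keeps or drops all telemetry from one request together. The sketch below shows the general idea with a hash of the trace ID; it is not the algorithm used by the Application Insights SDKs, and `keep_trace` is a hypothetical name.

```python
import hashlib


def keep_trace(trace_id: str, percent: float) -> bool:
    """Deterministically keep roughly `percent`% of traces, keyed by trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100
```

Because the decision depends only on the trace ID, every service that sees the same W3C trace ID makes the same choice, so sampled traces stay complete end to end.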
Disaster Recovery
RPO/RTO Mapping
| Tier | RPO | RTO | Strategy | Relative Cost |
| --- | --- | --- | --- | --- |
| Tier 1 (critical) | < 5 minutes | < 1 hour | Active-active multi-region | 2x |
| Tier 2 (important) | < 1 hour | < 4 hours | Warm standby | 1.3x |
| Tier 3 (standard) | < 24 hours | < 24 hours | Backup and restore | 1.1x |
| Tier 4 (non-critical) | < 72 hours | < 72 hours | Rebuild from IaC | 1x |
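
Choosing a tier amounts to finding the cheapest strategy whose RPO and RTO bounds fit within what the business can tolerate. A minimal sketch of that lookup, using the figures from the table above, follows; the `TIERS` list layout and the `cheapest_tier` function are assumptions for illustration.

```python
from datetime import timedelta

# (tier, RPO bound, RTO bound, strategy, relative cost), cheapest first.
TIERS = [
    ("Tier 4", timedelta(hours=72), timedelta(hours=72), "Rebuild from IaC", 1.0),
    ("Tier 3", timedelta(hours=24), timedelta(hours=24), "Backup and restore", 1.1),
    ("Tier 2", timedelta(hours=1), timedelta(hours=4), "Warm standby", 1.3),
    ("Tier 1", timedelta(minutes=5), timedelta(hours=1),
     "Active-active multi-region", 2.0),
]


def cheapest_tier(required_rpo: timedelta, required_rto: timedelta):
    """Return (tier, strategy, cost) for the cheapest tier meeting both targets."""
    for name, rpo, rto, strategy, cost in TIERS:
        if rpo <= required_rpo and rto <= required_rto:
            return name, strategy, cost
    raise ValueError("no tier in the table meets these targets")
```

A workload that can tolerate 2 hours of data loss and 8 hours of downtime lands on Tier 2 (warm standby); asking for less than 5 minutes of data loss raises an error, which signals that the requirement is outside what the table covers.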
Backup Strategy
| Service | Backup Method | Retention |
| --- | --- | --- |
| Azure SQL | Automated backups | 7 days (short-term), 10 years (long-term) |
| Cosmos DB | Continuous backup + point-in-time restore | 7-30 days |
| Blob Storage | Soft delete + versioning + geo-redundant | 30 days soft delete |
| AKS | Velero backup to Blob Storage | 7 days |
| Key Vault | Soft delete + purge protection | 90 days |
| App Service | Manual or automated (Backup and Restore feature) | Custom |
Storage Redundancy
| Redundancy | Regions | Durability | Use Case |
| --- | --- | --- | --- |
| LRS | 1 (3 copies) | 11 nines | Dev/test, easily recreatable data |
| ZRS | 1 (3 AZs) | 12 nines | Production, zone failure protection |
| GRS | 2 (6 copies) | 16 nines | Business-critical, regional failure protection |
| GZRS | 2 (3 AZs + secondary) | 16 nines | Most critical data, best protection |
Default to ZRS for production. Use GRS/GZRS only when cross-region DR is required.
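
The "nines" in the table are annual durability targets: 11 nines means roughly a 10^-11 chance of losing a given object in a year. A quick way to compare options is to turn that into an expected number of lost objects for your data set, as in the sketch below. The names `NINES` and `expected_annual_loss` are assumptions, and the calculation treats objects as independent, which is a simplification.

```python
# Durability targets from the table above, expressed as a count of nines.
NINES = {"LRS": 11, "ZRS": 12, "GRS": 16, "GZRS": 16}


def expected_annual_loss(objects: int, redundancy: str) -> float:
    """Expected number of objects lost per year under the stated durability."""
    return objects * 10.0 ** -NINES[redundancy]
```

For one billion objects this gives about 0.01 expected losses per year on LRS and about 0.001 on ZRS. Durability is rarely the deciding factor; the practical difference between the options is which failures (rack, zone, region) the data remains available through.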
DR Testing Checklist
Common Pitfalls
Cost Pitfalls
| Pitfall | Impact | Prevention |
| --- | --- | --- |
| No budget alerts | Unexpected bills | Set alerts at 50%, 80%, 100% on day one |
| Premium tier in dev/test | 3-5x overspend | Use Basic/Free tiers, auto-shutdown VMs |
| Orphaned resources | Silent monthly charges | Tag everything, review Cost Management weekly |
| Ignoring Reserved Instances | 35-55% overpay on steady workloads | Review Azure Advisor quarterly |
| Over-provisioned Cosmos DB RU/s | Paying for unused throughput | Use autoscale or serverless |
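
The budget alert thresholds in the first row can be reasoned about with a few lines of arithmetic. This sketch is illustrative only (Azure Cost Management evaluates budgets itself), and `crossed_thresholds` is a hypothetical helper.

```python
def crossed_thresholds(spend: float, budget: float, thresholds=(50, 80, 100)) -> list:
    """Return the percentage thresholds that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [t for t in thresholds if spend >= budget * t / 100]
```

With a monthly budget of 1000 and 850 spent so far, the 50% and 80% alerts have fired and the 100% alert has not.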
Security Pitfalls
| Pitfall | Impact | Prevention |
| --- | --- | --- |
| Secrets in App Settings | Leaked credentials | Use Key Vault references |
| Public PaaS endpoints | Exposed attack surface | Private Endpoints + VNet integration |
| Contributor role on subscription | Overprivileged access | Scope to resource group, use PIM |
| No diagnostic settings | Blind to attacks | Enable on every resource from day one |
| SQL password authentication | Weak identity model | Entra-only auth, Managed Identity |
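
A Key Vault reference replaces the secret value in an app setting with a pointer that App Service resolves at runtime using the app's managed identity. The sketch below builds that pointer string; the helper name is an assumption, and the vault and secret names in the usage note are placeholders.

```python
def key_vault_reference(vault: str, secret: str, version: str = "") -> str:
    """Build an App Service app-setting value that points at a Key Vault secret."""
    # An empty version leaves a trailing slash, which selects the latest version.
    uri = f"https://{vault}.vault.azure.net/secrets/{secret}/{version}"
    return f"@Microsoft.KeyVault(SecretUri={uri})"
```

For example, `key_vault_reference("kv-myapp-prod", "SqlConnectionString")` returns `@Microsoft.KeyVault(SecretUri=https://kv-myapp-prod.vault.azure.net/secrets/SqlConnectionString/)`. The app's identity still needs the Key Vault Secrets User role on the vault for the reference to resolve.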
Operational Pitfalls
| Pitfall | Impact | Prevention |
| --- | --- | --- |
| Manual portal deployments | Drift, no audit trail | Bicep for everything, block portal changes via Policy |
| No health checks configured | Silent failures | /health endpoint, Front Door probes, App Service checks |
| Single region deployment | Single point of failure | At minimum, use Availability Zones |
| No tagging strategy | Cannot track costs/ownership | Enforce via Azure Policy from day one |
| Ignoring Azure Advisor | Missed optimizations | Weekly review, enable email digest |