Syed Ibtesam Haider

Azure

There is a particular kind of pressure that comes with a greenfield Azure environment. No legacy to work around, no inherited decisions to live with, no existing team habits to fight against. Just a blank subscription and the knowledge that every decision you make in the next few weeks will be significantly harder to change in six months. That pressure is clarifying when you approach it correctly and paralyzing when you do not.

I have built greenfield Azure environments from scratch multiple times across different industries and different scales. Some of those projects went well from day one. Some of them taught me lessons the hard way, discovering halfway through provisioning that an architectural decision made in week two created a constraint in week eight that required either a painful refactor or accepting a compromise that would live in the platform forever. The patterns I am going to share here come from both categories of experience.

This post is specifically about Terraform as the mechanism for building a greenfield Azure environment. Not about whether to use Terraform versus Bicep versus Pulumi. That is a separate conversation. This is for the organization that has already decided on Terraform and wants to understand how to structure a greenfield build in a way that scales, stays maintainable and does not create technical debt from the first commit.

The target reader here is an engineer or architect who is responsible for making the foundational decisions on a new Azure platform. If that is you, the decisions you make before writing a single line of Terraform will determine whether this platform is something your organization builds on confidently for years or something it spends years trying to get out from underneath.

Why Greenfield Is Harder Than It Looks

Most engineers find greenfield projects appealing precisely because there is nothing to work around. The reality is that the absence of constraints is itself a challenge. When you inherit an existing environment you have a reference point. You can see what was done, understand why it was done that way even if you disagree with it and make incremental improvements. When you start from scratch you have to make dozens of decisions simultaneously without the benefit of seeing how those decisions interact until you have already made them.

The decisions that matter most in a greenfield Terraform build are not the ones that take the longest to write. They are the structural decisions that take the longest to think through. How do you separate networking from application infrastructure? How do you handle state across multiple environments? How do you structure modules so they can be reused without becoming so generic they are impossible to understand? How do you handle secrets? How do you design the address space so you never run out of room or create conflicts with future requirements? These are the questions worth spending days on before you open a code editor.

The Cost of Getting the Foundation Wrong

VNet address space designed too small requires a full network redesign to fix, which means destroying and recreating every resource that depends on it
State files structured incorrectly in early stages create blast radius problems that are painful to untangle without downtime
Module boundaries drawn in the wrong place produce coupling that makes isolated changes impossible without touching unrelated resources
Naming conventions established inconsistently in early resources are almost impossible to standardize later without resource recreation
RBAC assignments designed too broadly in early stages create security debt that is politically difficult to tighten once teams are accustomed to their permissions
Private endpoint architecture added retrospectively to a platform designed for public access requires rebuilding networking and reconfiguring every dependent service

Every one of those items is recoverable. None of them is fatal. But recovering from them costs significantly more time and creates significantly more risk than getting them right at the start. The engineering time you invest in design before writing Terraform is the highest-return activity in a greenfield project, bar none.

The Three Weeks Before You Write Terraform

On every greenfield Azure project I have led or contributed to that went well, we spent meaningful time in design before touching code. The exact duration varies with scope but the principle is consistent. On a medium sized platform covering four environments across two Azure regions I spent three weeks in design sessions before the first terraform init was run. That felt slow at the time. Looking back it was the decision that made everything else possible.

The design phase needs to produce specific outputs, not general principles. A list of values like "we will use private endpoints" and "we will follow least privilege" is not a design. A document that specifies the exact private DNS zone for each Azure service, the exact address range for each subnet in each environment, the exact RBAC role assignments for each identity and the exact module boundary decisions is a design. The specificity is what makes it actionable and what prevents the design from being reinterpreted differently by different people when they start writing code.

What the Design Document Must Cover

Complete address space plan for every VNet and subnet across every environment including room for growth
Resource naming convention with character limit compliance verified for every resource type
Tagging strategy with mandatory tags defined and ownership of each tag documented
Module boundary decisions with explicit statements about what logic belongs where
Remote state structure including storage account layout, container naming and locking strategy
RBAC assignment plan covering every identity and every scope it needs access to
Private endpoint and private DNS zone requirements for every PaaS service in scope
Environment promotion strategy covering how code moves from development through to production
Secret management approach covering where secrets live and how services access them at runtime
Dependency map showing which resources must exist before others can be provisioned

This document does not need to be beautiful. It needs to be specific and agreed upon before code is written. In my experience the most valuable thing this document does is force conversations about decisions that engineers might otherwise make independently and inconsistently. When two engineers are working in parallel on different parts of the infrastructure and both believe they have latitude to make the address space decision their own way, you end up with overlapping ranges and a conflict that is painful to resolve. The document prevents that class of problem before it starts.

The Naming Convention Problem Nobody Takes Seriously Enough

I want to spend more time on naming conventions than most posts do because in my experience this is consistently underestimated in the planning phase and consistently regretted in the operational phase.

Azure has wildly inconsistent naming rules across resource types. A virtual network allows up to 64 characters. A storage account allows a maximum of 24 characters, lowercase letters and numbers only, so no hyphens. A Key Vault is limited to 24 characters. A container registry name must be globally unique, alphanumeric only and between 5 and 50 characters. A naming convention that works for a virtual network may be impossible to apply to a storage account without truncation that destroys its descriptive value.

The consequence of not resolving this upfront is that you end up with two categories of resource names in your environment. Resources where the full convention could be applied and resources where it had to be abbreviated or altered because it hit a character limit. That inconsistency is visible in the portal, in monitoring tools, in cost management views and in audit logs forever.

A naming pattern that works across Azure resource types

Structure: resource-type-prefix + workload + environment-short-code + region-short-code + optional-numeric-suffix
Example for resources with hyphens allowed: vnet-platform-prd-uks, kv-platform-prd-uks, aks-platform-prd-uks
Example for storage accounts with no hyphens and 24 character limit: stplatformprduks001
Keep environment short codes to three characters maximum: dev, qa, uat, stg, prd
Keep region short codes to three characters maximum: uks for UK South, ukw for UK West, eus for East US
Validate every resource type in your scope against the naming pattern before finalising it
Document the complete list of resource type prefixes in the design document so every engineer uses the same ones
Never use full words where abbreviations exist and are unambiguous

The validation step is the one teams skip. Before writing any Terraform, take every resource type you plan to provision and test your naming convention against its character limit and allowed character set. Do this in a spreadsheet with example names generated for each resource type in each environment. If any example name exceeds the limit or contains a prohibited character, fix the convention before it gets encoded into Terraform variables and outputs that will be painful to change.
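Once the spreadsheet check has passed, the same rules can be encoded so that Terraform itself rejects a name that breaks them. The following is a minimal sketch, assuming the prefix + workload + environment + region structure described above; the variable names, defaults and the `001` suffix are illustrative assumptions, not part of any particular codebase.

```hcl
# Hypothetical inputs for the naming pattern.
variable "workload" {
  type    = string
  default = "platform"
}

variable "environment" {
  type    = string
  default = "prd"

  validation {
    condition     = contains(["dev", "qa", "uat", "stg", "prd"], var.environment)
    error_message = "Environment must be one of dev, qa, uat, stg or prd."
  }
}

variable "region_code" {
  type    = string
  default = "uks"

  validation {
    condition     = length(var.region_code) <= 3
    error_message = "Region code must be three characters or fewer."
  }
}

locals {
  # Hyphenated form for resource types that allow hyphens.
  base = "${var.workload}-${var.environment}-${var.region_code}"

  vnet_name      = "vnet-${local.base}"
  key_vault_name = "kv-${local.base}"

  # Storage accounts: lowercase letters and digits only, 3 to 24 characters.
  storage_account_name = "st${var.workload}${var.environment}${var.region_code}001"
}

# Output preconditions raise an error when a generated name breaks a limit.
output "storage_account_name" {
  value = local.storage_account_name

  precondition {
    condition     = can(regex("^[a-z0-9]{3,24}$", local.storage_account_name))
    error_message = "Storage account name must be 3 to 24 lowercase letters or digits."
  }
}

output "key_vault_name" {
  value = local.key_vault_name

  precondition {
    condition     = can(regex("^[a-zA-Z][a-zA-Z0-9-]{1,22}[a-zA-Z0-9]$", local.key_vault_name))
    error_message = "Key Vault name must be 3 to 24 characters: letters, digits and hyphens."
  }
}
```

With the defaults shown, a longer workload name is what would push the storage account name past 24 characters, and the precondition surfaces that as an error rather than as a failed apply against Azure.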

Address Space Planning That Survives the Future

Address space planning for a greenfield environment has to account for requirements that do not exist yet. This sounds paradoxical but it is the practical reality of infrastructure design. The organization that tells you they only need one VNet today will in two years need to peer that VNet to a new acquisition, connect it to an on-premises network through ExpressRoute and potentially peer to a partner organization's Azure environment. If the address space you designed overlaps with any of those future networks you have a conflict that is very expensive to resolve.

The safest approach is to allocate a dedicated private address range per environment that is non-overlapping not just internally but within the context of the broader organization's network addressing scheme. Work with the network team to understand what ranges are already in use on-premises and in any existing cloud environments before you design your ranges. Then design your ranges to be non-overlapping with current use and leave room for growth without requiring renumbering.

A practical address space structure for a four environment platform

Production primary: 10.0.0.0/16 giving 65,536 addresses across production workloads
Production secondary for DR: 10.1.0.0/16 keeping secondary non-overlapping with primary
UAT: 10.2.0.0/16 sized identically to production to enable accurate load testing
QA: 10.3.0.0/16 with identical structure for consistent environment parity
Development: 10.4.0.0/16 with flexibility for experimental subnets
Hub VNet per environment takes the first /24 of its range for shared services
Each spoke VNet takes a /24 from the second block onwards within the environment range
Subnets within spokes are /25 or /26 depending on the workload scale requirement
Leave at least 50 percent of each environment range unallocated for future spoke additions
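The structure above can be derived rather than typed by hand, which keeps the hub and spoke ranges consistent across environments. This is a rough sketch using Terraform's built-in `cidrsubnet` and `cidrsubnets` functions; the spoke names (`app`, `data`) and the local names are assumptions for illustration.

```hcl
locals {
  # Environment ranges taken from the plan above.
  environment_ranges = {
    prd    = "10.0.0.0/16"
    prd_dr = "10.1.0.0/16"
    uat    = "10.2.0.0/16"
    qa     = "10.3.0.0/16"
    dev    = "10.4.0.0/16"
  }

  # Hypothetical spokes mapped to a fixed /24 block number.
  # Fixed numbers mean adding a spoke never renumbers existing ones.
  spoke_blocks = {
    app  = 1
    data = 2
  }

  # The hub takes the first /24 of each environment range (block 0).
  hub_ranges = {
    for env, cidr in local.environment_ranges : env => cidrsubnet(cidr, 8, 0)
  }

  # Spokes take /24 blocks from the second block onwards.
  spoke_ranges = {
    for env, cidr in local.environment_ranges : env => {
      for name, block in local.spoke_blocks : name => cidrsubnet(cidr, 8, block)
    }
  }

  # Example: split the production app spoke into two /25 subnets.
  prd_app_subnets = cidrsubnets(local.spoke_ranges["prd"]["app"], 1, 1)
}

output "hub_ranges" {
  value = local.hub_ranges # prd => 10.0.0.0/24, uat => 10.2.0.0/24, ...
}

output "prd_app_subnets" {
  value = local.prd_app_subnets # ["10.0.1.0/25", "10.0.1.128/25"]
}
```

Because UAT and production share the same block numbers, the two environments end up with an identical layout that differs only in the second octet.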

The reason I size UAT identically to production is specific and worth explaining. UAT is the environment where the business validates that the system behaves correctly at production scale before a release. If UAT has different resource sizing, different network topology or different subnet capacity to production, the validation it provides is proportionally less reliable. Every meaningful difference between UAT and production is a category of problem that UAT cannot catch. Identical address space structure is a small cost that eliminates one category of environment difference.

Structuring Your Terraform Repository

The Terraform repository structure is the decision that has the longest lasting consequences and the one I have seen debated most intensely among senior engineers. There is no universally correct answer but there are answers that are clearly wrong for certain contexts and I have a strong opinion about what works at enterprise scale based on having operated in environments where different approaches were in use.

The three main approaches are Terraform Workspaces, Terragrunt with separate state per environment and separate root modules per environment. I have used all three in production.

Terraform Workspaces

Single module definition with workspace-specific variable overrides
Simplest to set up initially requiring minimal repository structure
Workspaces share a backend configuration making the separation feel artificial
Engineers must explicitly select the correct workspace before every operation
Applying in the wrong workspace is a persistent and serious operational risk
Works adequately for simple configurations with minimal environment differences
Becomes increasingly difficult to manage as environments diverge in their requirements
Not recommended for enterprise environments where production blast radius is a serious concern

Terragrunt

DRY approach eliminates duplication of backend and provider configuration
Separate state file per environment enforced by the tool rather than by convention
Dependency management between modules is explicit and machine enforced
Requires engineers to learn Terragrunt in addition to Terraform
Adds a tool dependency that must be versioned and maintained
Excellent choice for organizations with strong Terraform maturity and willingness to invest in the toolchain
The HCL configuration files at the root of each environment directory are clean and minimal
My preferred approach when the team has the maturity to use it well

Separate root modules per environment

Each environment has its own directory containing its own backend configuration and variable file
Running terraform apply from inside a directory can only ever affect that environment
The blast radius of a mistake is contained by the directory structure itself
No additional tooling required beyond Terraform itself
Engineers inheriting the codebase can understand the structure without learning new tools
Some duplication of backend and provider configuration across environment directories
My recommendation for organizations prioritizing clarity and safety over elegance
The approach I use on greenfield projects where the receiving team may not have deep Terraform maturity

The reason I default to separate root modules for greenfield enterprise projects is specific. When I build something I will hand over to another team I optimize for that team's ability to operate it safely rather than for the elegance of the implementation. Separate root modules make it structurally impossible to affect the wrong environment without physically navigating to the wrong directory and running apply there. That is a harder mistake to make than selecting the wrong workspace name. Over the lifetime of a platform the accidents that do not happen are worth more than the elegance that nobody notices.

Remote State: The Foundation of Everything Else

Remote state is not optional for any serious Terraform deployment. Local state files are appropriate for learning exercises and nothing else. In a team environment with multiple engineers and a CI/CD pipeline running Terraform, local state files lead to state corruption, lost resources and the kind of infrastructure incident that ends careers.

The Azure native remote state backend uses Azure Blob Storage for state file storage and Azure Blob lease-based locking to prevent concurrent operations on the same state file. The setup is straightforward but the organizational structure around it requires deliberate thought.

Remote state organizational structure

Store state in a dedicated management subscription or resource group isolated from workload subscriptions
Use a separate storage account container per environment to prevent cross-environment state access
Use a separate state key per module within each environment for granular blast radius control
Enable blob versioning on the storage account so accidental state corruption can be recovered from a previous version
Enable soft delete on the storage account with a meaningful retention period
Restrict access to the state storage account using Azure RBAC with minimum required permissions
Grant the pipeline service principal Storage Blob Data Contributor scoped to the specific containers it needs
Grant engineers Storage Blob Data Reader for inspection and debugging without write access
Never allow engineers to modify state files directly outside of Terraform operations
Enable diagnostic logging on the state storage account to audit all access attempts

The separation of state storage from workload subscriptions is the detail that matters most. If your state files live in the same subscription as the resources they describe, a catastrophic event in that subscription (an accidental deletion, a security incident, a subscription suspension) affects both your running infrastructure and your ability to manage it. Putting state in a dedicated management subscription that has extremely limited access and no workloads means it survives whatever happens to the workload subscriptions.
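As a sketch of how that layout looks in a single root module, the backend block below points one module's state at a container named after its environment, inside a storage account in the management subscription. Every value is a placeholder, and the file path in the comment is a hypothetical repository layout.

```hcl
# environments/prd/hub-network/backend.tf (hypothetical path)
terraform {
  backend "azurerm" {
    # Placeholder values; backend blocks cannot reference variables.
    subscription_id      = "00000000-0000-0000-0000-000000000000" # management subscription
    resource_group_name  = "rg-tfstate-mgmt-uks"
    storage_account_name = "sttfstatemgmtuks001"
    container_name       = "prd"                 # one container per environment
    key                  = "hub-network.tfstate" # one state key per module

    use_azuread_auth = true # data-plane access through Azure RBAC, not storage keys
    use_oidc         = true # pipeline signs in through workload identity federation
  }
}
```

With `use_azuread_auth` enabled, the Storage Blob Data Contributor assignment on the container is what actually authorizes the pipeline, which is why the RBAC scoping in the list above matters.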

Module Design: Where Most Teams Get It Wrong

Module design is the aspect of Terraform architecture that has the most impact on long term maintainability and the most variation in how teams approach it. I have seen two failure modes that are roughly equally common and equally damaging in different ways.

The first failure mode is no modules at all. Everything in a single root module with hundreds of resources defined inline. This works until it does not and then it fails catastrophically because a single plan touches everything and a single apply error can leave you halfway through a change with no clean path forward.

The second failure mode is over-modularization. A module for every individual resource, nested three levels deep, with so many input variables and output values that understanding what any given module does requires reading five files simultaneously. This creates the illusion of good structure while making the code harder to understand and maintain than a flat configuration would be.

The principle I use for module boundary decisions is that a module should represent a cohesive piece of infrastructure that is always deployed together, always has the same lifecycle and makes sense as a named concept to the people operating the platform.

Modules that make sense as cohesive concepts

hub-network: VNet, subnets, NSGs, Azure Firewall, Bastion, route tables and VNet peering for the hub
spoke-network: Spoke VNet, subnets, NSGs, route table associations and peering to the hub
private-dns: All private DNS zones and their VNet links grouped by service category
aks-cluster: AKS cluster, node pools, managed identity, RBAC assignments and diagnostic settings
container-registry: ACR, private endpoint, DNS group and RBAC assignments together
key-vault: Key Vault, private endpoint, DNS group and RBAC assignments (or access policies if the vault uses the legacy permission model) together
monitoring: Log Analytics Workspace, Application Insights, diagnostic settings and alert rules
build-agents: VMSS, network interface, managed identity, custom script extension and NSG rules

Notice that the PaaS modules in that list, container-registry and key-vault, include the private endpoint and DNS zone group for the service they describe. I group these together deliberately because a PaaS service and its private endpoint are not independently useful. You would never deploy Azure Key Vault without its private endpoint in a private-first architecture and you would never deploy a private endpoint without the service it connects to. Grouping them in the same module reflects their actual deployment lifecycle and prevents the situation where the module for Key Vault is applied successfully but the private endpoint is in a different module that fails, leaving the Key Vault provisioned but unreachable.
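To make the grouping concrete, here is a minimal sketch of what the inside of a key-vault module could look like, with the vault and its private endpoint declared side by side. The variable names are hypothetical, and RBAC assignments and diagnostic settings are left out to keep it short.

```hcl
# Hypothetical module inputs.
variable "name" { type = string }
variable "location" { type = string }
variable "resource_group_name" { type = string }
variable "tenant_id" { type = string }
variable "private_endpoint_subnet_id" { type = string }
variable "key_vault_dns_zone_id" { type = string } # ID of privatelink.vaultcore.azure.net

resource "azurerm_key_vault" "this" {
  name                          = var.name
  location                      = var.location
  resource_group_name           = var.resource_group_name
  tenant_id                     = var.tenant_id
  sku_name                      = "standard"
  public_network_access_enabled = false # private from the first apply
  purge_protection_enabled      = true
}

resource "azurerm_private_endpoint" "key_vault" {
  name                = "pe-${var.name}"
  location            = var.location
  resource_group_name = var.resource_group_name
  subnet_id           = var.private_endpoint_subnet_id

  private_service_connection {
    name                           = "psc-${var.name}"
    private_connection_resource_id = azurerm_key_vault.this.id
    subresource_names              = ["vault"]
    is_manual_connection           = false
  }

  # Registers the endpoint's A record in the shared private DNS zone.
  private_dns_zone_group {
    name                 = "default"
    private_dns_zone_ids = [var.key_vault_dns_zone_id]
  }
}
```

Because both resources live in one module and one state file, a failed private endpoint shows up in the same plan as the vault instead of in a separate apply.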

The Pipeline That Runs Your Terraform

The Terraform itself is only half of the infrastructure as code story. The pipeline that runs it is equally important and equally deserving of deliberate design. A Terraform codebase with well structured modules and clean state management that is executed manually by engineers with local credentials is still an infrastructure as code implementation with serious gaps in auditability, consistency and safety.

Every Terraform operation in a production environment should run through a pipeline. The pipeline enforces consistency in how Terraform is initialized, which version is used, how authentication is handled and what approvals are required before apply. Consistency in these things is what gives the team confidence that a plan output accurately predicts what apply will do, which is the foundational assumption that makes Terraform safe to operate.

What the Terraform pipeline must enforce

Pinned Terraform version specified in a required_version constraint and installed consistently by the pipeline
Pinned provider versions specified in required_providers with pessimistic constraint operators
OIDC federation for authentication to Azure with no client secrets stored in the pipeline
Running terraform validate before plan to catch syntax errors before they produce a confusing plan output
Plan output published as a pipeline artifact and presented in a readable format for human review
Manual approval gate between plan and apply for UAT and production environments
Apply using the saved plan file rather than re-running plan to prevent plan-apply discrepancy
Post-apply validation step confirming key resources exist and are in the expected state
Drift detection run on a schedule comparing actual Azure state to the Terraform state file
Notification on apply failure with enough context to start investigating without opening four dashboards
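The first three items in that list live in the Terraform configuration itself rather than in the pipeline definition. A minimal sketch follows; the version numbers are placeholders to be replaced with whatever the team has actually tested.

```hcl
terraform {
  # Pessimistic constraint: allows 1.9.x patch releases only. Placeholder version.
  required_version = "~> 1.9.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0" # allows 4.x minor releases, blocks 5.0. Placeholder version.
    }
  }
}

provider "azurerm" {
  features {}

  # Sign in with the pipeline's federated token; no client secret is stored.
  # Tenant, client and subscription IDs are assumed to come from the
  # ARM_TENANT_ID, ARM_CLIENT_ID and ARM_SUBSCRIPTION_ID environment variables.
  use_oidc = true
}
```

Committing the `.terraform.lock.hcl` file alongside this pins the exact provider build that the constraint resolved to, so the pipeline and every engineer use the same one.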

The plan-apply separation with a saved plan file is worth emphasizing because it is frequently skipped in the interest of pipeline simplicity. When you run terraform plan and save the output to a plan file, then run terraform apply against that specific plan file, you guarantee that apply executes exactly the operations that were reviewed and approved. If you run terraform plan in one pipeline stage and terraform apply fresh in another stage without using the saved plan file, there is a window between the two where the state of Azure could have changed, causing apply to produce a different set of operations than the plan showed. This is rare in practice but it is a class of risk that the saved plan approach eliminates entirely at negligible cost.

Hub and Spoke: Why Every Enterprise Azure Platform Should Use It

The Hub and Spoke network topology is the standard enterprise Azure network architecture for good reason. It is not the only valid topology but for organizations with multiple workloads, multiple environments and a need for centralized security controls it is the right default and the one I implement on every greenfield enterprise platform.

The core principle is simple. Shared security services and connectivity resources sit in a central hub VNet. Workload environments sit in spoke VNets. All traffic between spokes flows through the hub rather than directly between spokes, which means the hub is the chokepoint where inspection, logging and policy enforcement happens. No traffic enters or leaves a spoke without the hub knowing about it.

What belongs in the hub

Azure Firewall for centralized outbound traffic inspection and control
Azure Bastion for secure administrative access to VMs without public IP addresses
VPN Gateway or ExpressRoute Gateway for on-premises connectivity
Azure Firewall Policy with rule collections covering permitted outbound destinations
Hub VNet with dedicated subnets for each service including the mandatory AzureFirewallSubnet and AzureBastionSubnet
Shared private DNS zones linked to all spoke VNets for consistent PaaS resolution
Network monitoring through NSG flow logs and Azure Firewall diagnostic logs to Log Analytics

What belongs in spokes

Workload specific subnets sized for the resources they will contain
NSGs per subnet with rules specific to that subnet's traffic profile
Route tables with a default route pointing all traffic to the hub Azure Firewall
Private endpoints for PaaS services used by that spoke's workloads
AKS node pool subnets, application subnets, private endpoint subnets separated by function
No direct internet access from spoke subnets, all outbound through the hub firewall

The User Defined Route forcing all traffic through the firewall is the detail that makes the hub and spoke architecture actually work rather than just looking correct on a diagram. I have seen hub and spoke implementations where the route tables were not applied to the spoke subnets, meaning traffic followed Azure's default system routes, going straight out to the internet or across any direct peering between VNets, and bypassed the firewall entirely. The architecture looked right in every diagram and every documentation artifact. It provided essentially no centralized inspection or control because the traffic never actually went through the hub. The route table is the mechanism. Without it the topology is decoration.
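The route table and, just as importantly, its association with every spoke subnet can be sketched as follows. The variable names and the route table name are assumptions; the firewall's private IP would normally come from the hub module's outputs.

```hcl
# Hypothetical inputs.
variable "location" { type = string }
variable "resource_group_name" { type = string }
variable "firewall_private_ip" { type = string } # private IP of the hub Azure Firewall
variable "spoke_subnet_ids" {
  type = map(string) # subnet name => subnet resource ID
}

resource "azurerm_route_table" "spoke" {
  name                = "rt-spoke-default"
  location            = var.location
  resource_group_name = var.resource_group_name

  # Default route: send everything to the hub firewall.
  route {
    name                   = "default-via-hub-firewall"
    address_prefix         = "0.0.0.0/0"
    next_hop_type          = "VirtualAppliance"
    next_hop_in_ip_address = var.firewall_private_ip
  }
}

# Without this association the route table exists but does nothing.
resource "azurerm_subnet_route_table_association" "spoke" {
  for_each = var.spoke_subnet_ids

  subnet_id      = each.value
  route_table_id = azurerm_route_table.spoke.id
}
```

Driving the association from the same map that defines the subnets means a new subnet cannot be added to the spoke without also picking up the default route.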

Private Endpoints: Design Them In From the Start

Private endpoints are the mechanism for connecting Azure PaaS services such as Key Vault, Storage, SQL, Container Registry and the AKS API server to your private VNet so that, with public network access disabled, they are reachable only from within the network rather than from the public internet. Designing private endpoints in from the start is fundamentally different from adding them retrospectively and the difference in effort is significant enough that I want to make this point explicitly.

Adding private endpoints to a service that was originally deployed with public access requires updating the service configuration to disable public access, creating the private endpoint in the right subnet, creating the private DNS zone if it does not already exist, linking the zone to the VNet, and verifying that all existing clients that were connecting through the public endpoint now resolve to the private endpoint correctly. In a production environment that existing clients depend on this is a change with real downtime risk if anything in that sequence goes wrong.

In a greenfield Terraform deployment you create the service with public access disabled from the first apply, create the private endpoint at the same time, group them in the same module so they are always deployed together, and the platform is private from the first moment it exists. That is the only moment when private-first costs nothing.

Private DNS zones required per Azure service

Azure Container Registry: privatelink.azurecr.io
Azure Key Vault: privatelink.vaultcore.azure.net
Azure Storage blob: privatelink.blob.core.windows.net
Azure Storage DFS for ADLS Gen2: privatelink.dfs.core.windows.net
Azure SQL Database: privatelink.database.windows.net
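Azure Kubernetes Service API server (private cluster): privatelink.<region>.azmk8s.io, created automatically in the node resource group by default but requires a VNet link for any other VNet that must resolve it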
Log Analytics OMS: privatelink.oms.opinsights.azure.com
Log Analytics ODS: privatelink.ods.opinsights.azure.com
Azure Monitor: privatelink.monitor.azure.com
Azure Automation: privatelink.azure-automation.net
Azure App Service and Function Apps: privatelink.azurewebsites.net

Every zone in that list needs to be provisioned and linked to every VNet that needs to resolve the corresponding service privately. If you have four spoke VNets and a hub VNet all needing to reach Key Vault through its private endpoint, all five VNets need a link to the privatelink.vaultcore.azure.net zone. Missing a VNet link is one of the most common private endpoint connectivity issues I have diagnosed and it is always invisible in the resource configuration because the zone and the private endpoint both look correct. The symptom is that DNS resolution for the service FQDN returns the public IP from the affected VNet instead of the private endpoint IP, which manifests as a connectivity failure that looks like a network problem rather than a DNS problem.
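One way to make a missing link structurally unlikely is to generate the links from the full zone and VNet combination instead of declaring them one at a time. This is a sketch under that assumption; the zone list is shortened and the variable names are hypothetical.

```hcl
# Hypothetical inputs.
variable "resource_group_name" { type = string }
variable "vnet_ids" {
  type = map(string) # e.g. hub, spoke-app, spoke-data => VNet resource ID
}

locals {
  # Shortened list; extend with every zone the platform needs.
  zones = toset([
    "privatelink.vaultcore.azure.net",
    "privatelink.azurecr.io",
    "privatelink.blob.core.windows.net",
  ])

  # One entry for every zone and VNet pair.
  zone_links = {
    for pair in setproduct(local.zones, keys(var.vnet_ids)) :
    "${pair[0]}/${pair[1]}" => { zone = pair[0], vnet = pair[1] }
  }
}

resource "azurerm_private_dns_zone" "this" {
  for_each = local.zones

  name                = each.value
  resource_group_name = var.resource_group_name
}

resource "azurerm_private_dns_zone_virtual_network_link" "this" {
  for_each = local.zone_links

  name                  = "link-${each.value.vnet}"
  resource_group_name   = var.resource_group_name
  private_dns_zone_name = azurerm_private_dns_zone.this[each.value.zone].name
  virtual_network_id    = var.vnet_ids[each.value.vnet]
  registration_enabled  = false
}
```

Adding a fifth VNet to `vnet_ids` then produces a link in every zone in a single plan, rather than relying on someone remembering each zone.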

Identity and Secrets: The Decisions That Affect Everything

The identity and secrets architecture decisions you make in a greenfield build affect every service you provision and every team that operates the platform. Getting these right from the start is significantly easier than retrofitting a proper identity model onto a platform where services have been using shared credentials and broad permissions for months.

The principle I apply consistently is that every service identity should be a Managed Identity rather than a service principal with a client secret wherever Azure supports it, which in a modern Azure environment is almost everywhere. Managed Identities have no credential to store, no expiry to manage and no rotation schedule to maintain. For a system-assigned identity the lifecycle is tied to the lifecycle of the resource it is assigned to: when the resource is deleted the identity is deleted. A user-assigned identity is a standalone resource with its own lifecycle, so it should be created and destroyed in the same Terraform module as the resource that uses it. Either way there is no orphaned service principal sitting in Entra ID with credentials that were created by someone who left the organization two years ago.

RBAC assignment principles for a greenfield platform

Assign roles at the narrowest possible scope, resource level preferred over resource group level, resource group over subscription
Never assign Owner at subscription scope to any non-human identity
Never assign Contributor at subscription scope when a narrower role achieves the same result
Use built-in roles wherever they exist rather than custom roles which require ongoing maintenance
Grant AKS kubelet identity AcrPull on the specific container registry it pulls from, not on the subscription
Grant pipeline service principal Storage Blob Data Contributor on the specific state containers, not the storage account
Grant Key Vault Secrets User to service identities that read secrets, not Key Vault Contributor which also allows management operations
Document every RBAC assignment in the Terraform code with a comment explaining the business reason
Review RBAC assignments as part of the quarterly access review process from day one, not starting from the first incident

The comment explaining the business reason for each RBAC assignment is the detail that gets skipped most consistently and costs the most time during security audits. An auditor who sees azurerm_role_assignment with no context needs to investigate what that assignment does, whether it is appropriate and whether it is still needed. An auditor who sees the same assignment with a comment that says "this grants the AKS kubelet identity read access to the platform container registry so nodes can pull application images during pod scheduling" can move on in thirty seconds. Multiply that by fifty RBAC assignments across a platform and the documentation value is enormous.
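In code, the documented assignment described above might look like the following sketch. The variable names are hypothetical; the values would typically come from the container-registry and aks-cluster module outputs.

```hcl
# Hypothetical inputs.
variable "container_registry_id" { type = string }
variable "kubelet_identity_object_id" { type = string }

# Grants the AKS kubelet identity read access to the platform container
# registry so nodes can pull application images during pod scheduling.
# Scoped to the single registry, not the resource group or subscription.
resource "azurerm_role_assignment" "aks_acr_pull" {
  scope                = var.container_registry_id
  role_definition_name = "AcrPull"
  principal_id         = var.kubelet_identity_object_id
}
```

The comment costs three lines and travels with the assignment through every code review and audit.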

The Handover Problem: Building for the Team That Comes After You

Every greenfield project ends with a handover. Sometimes that handover is to a permanent internal team. Sometimes it is to a different team within the same organization. Sometimes it is to a managed service provider. Regardless of who receives the platform, the quality of that handover is determined almost entirely by how the Terraform codebase was written during the build.

I have received platforms built by other engineers and I have handed over platforms I built myself. The difference between a good handover and a difficult one is not the complexity of the infrastructure. It is whether the code communicates its own intent clearly enough that the receiving team can operate it without needing the original engineers in the room.

What makes a Terraform codebase handover-ready

Every module has a README explaining what it does, what it requires as inputs and what it produces as outputs
Non-obvious decisions are documented inline with comments explaining the why not just the what
Variable files per environment contain only values, not logic, making them readable without Terraform knowledge
Outputs are named descriptively so their purpose is clear without reading the resource that produced them
The repository README includes a getting started section covering prerequisites, authentication and first apply sequence
A bootstrap runbook documents the exact sequence for standing up a brand new environment from scratch
Terraform version and provider versions are pinned explicitly with a comment explaining when they were last reviewed
The pipeline configuration includes comments explaining why approval gates exist where they do
Sensitive decisions like why a particular subnet was sized the way it was are documented rather than left to guesswork
A known issues section documents current limitations and the reasoning behind accepted compromises

The bootstrap runbook deserves specific mention because it is the document that gets used most urgently and under the most pressure. When a new environment needs to be provisioned, when a disaster recovery scenario requires standing up infrastructure from scratch or when a new engineer joins the team and needs to understand the apply sequence, the bootstrap runbook is what they reach for. Writing it during the project when the knowledge is fresh and the sequence is being validated is the right time. Trying to reconstruct it six months later from memory is how you get a document that is mostly correct and occasionally catastrophically wrong in the parts that were forgotten.

What Good Looks Like After Six Months

The measure of a well built greenfield Terraform platform is not how it looks on day one. It is how it behaves six months after handover when the team operating it has changed, when new requirements have arrived that the original design did not anticipate and when the first significant incident has tested whether the platform is as robust as the design intended it to be.

A well built platform at six months looks like this. New environments can be provisioned by running the documented apply sequence without needing the original architect to be present. New services can be added to existing environments by writing a new module or extending an existing one without touching unrelated infrastructure. Terraform plan runs show no unexpected drift because nothing has been changed outside of Terraform. Security reviews produce no critical findings because the private endpoint and RBAC foundation held. The team can articulate why every major decision was made because it is documented in the codebase.

A platform built without the design investment at the start looks very different at six months. Drift between environments because some configuration was changed manually. RBAC assignments that nobody can explain. Address space that is almost exhausted. Module boundaries that make adding a new service require touching six files. Documentation that describes what was built rather than why it was built that way.

The difference between those two outcomes is almost entirely determined by what happened in the first three weeks of the project before any Terraform was written. That is the investment worth making.

Terraform Greenfield Azure
