VPC design is one of those decisions that seems straightforward until you’ve made a mistake and spent two days tracing why your containers can’t reach your database, or received a bill with $4,000 in unexpected NAT gateway charges.
Most teams either use the default VPC (which was designed for experimentation, not production) or build something custom that grows organically without a coherent design. Both approaches accumulate problems.
Here’s an opinionated, practical design that works at scale.
The three-tier subnet model
The foundation of a well-designed VPC is three tiers of subnets with different exposure levels:
Public subnets have a route to the internet via an Internet Gateway. Resources here get public IP addresses and are directly reachable from the internet. Only put things here that need to be: load balancers, NAT gateways, bastion hosts (though Session Manager can replace bastions entirely).
Private subnets have outbound internet access via a NAT Gateway but no inbound internet routes. Application servers, containers, VPC-attached Lambda functions, and most other compute belong here. They can reach the internet (for package downloads, API calls) but can’t be reached from it directly.
Isolated subnets have no internet route at all — no NAT Gateway, no Internet Gateway. Databases, caches, and data stores belong here. They should only be reachable from resources within the VPC, never from the internet. If something in an isolated subnet needs to call an AWS service, use a VPC endpoint.
The discipline is keeping each tier strict. An RDS instance in a private subnet that’s “accidentally” reachable from the internet because a security group was misconfigured is the threat model you’re designing against.
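The three tiers reduce to a difference in route tables. A minimal sketch, using Python dicts with hypothetical gateway IDs (this is illustrative data, not an AWS API):

```python
# Illustrative route tables for the three subnet tiers.
# The VPC CIDR and gateway IDs are hypothetical placeholders.
ROUTE_TABLES = {
    "public": [
        {"destination": "10.0.0.0/16", "target": "local"},
        {"destination": "0.0.0.0/0", "target": "igw-EXAMPLE"},  # Internet Gateway
    ],
    "private": [
        {"destination": "10.0.0.0/16", "target": "local"},
        {"destination": "0.0.0.0/0", "target": "nat-EXAMPLE"},  # NAT Gateway
    ],
    "isolated": [
        {"destination": "10.0.0.0/16", "target": "local"},
        # No default route: no internet path in either direction.
    ],
}

def has_internet_route(tier: str) -> bool:
    """A tier has an internet path iff its route table has a 0.0.0.0/0 route."""
    return any(r["destination"] == "0.0.0.0/0" for r in ROUTE_TABLES[tier])

assert has_internet_route("public") and has_internet_route("private")
assert not has_internet_route("isolated")
```

The only structural difference between public and private is where the default route points; isolated simply has none.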
Multi-AZ placement
Deploy each subnet tier across at least two Availability Zones. For production, three. This means you have:
- 2-3 public subnets (one per AZ)
- 2-3 private subnets (one per AZ)
- 2-3 isolated subnets (one per AZ)
Every resource that supports Multi-AZ (RDS, ElastiCache, ALB) should span at least two AZs. This is the difference between “the database survives an AZ failure” and “we’re down until the AZ comes back.”
CIDR sizing: plan for more than you think you’ll need. Subnets can’t be resized after creation, and expanding a VPC with secondary CIDR blocks is painful. A /16 gives you 65,536 addresses. Use a /20 per subnet (4,096 addresses) to leave headroom for large ECS or EKS clusters, which consume an IP per task or pod.
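The carving works out cleanly with Python’s standard `ipaddress` module. A sketch, assuming a 10.0.0.0/16 VPC (the CIDR is an example, not a recommendation):

```python
import ipaddress

# Carve a /16 VPC into /20 subnets: 3 tiers x 3 AZs = 9 subnets used,
# leaving 7 spare /20s for future growth.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=20))

print(len(subnets))              # 16 /20s available in a /16
print(subnets[0])                # 10.0.0.0/20
print(subnets[0].num_addresses)  # 4096 (AWS reserves 5 per subnet)

tiers = ["public", "private", "isolated"]
azs = ["a", "b", "c"]
plan = {f"{tier}-{az}": str(subnets[i * 3 + j])
        for i, tier in enumerate(tiers)
        for j, az in enumerate(azs)}
```

Running this yields a 9-subnet plan (`public-a` through `isolated-c`) with seven /20 blocks still unallocated, which is the headroom that saves you from a repartitioning project later.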
NAT Gateway: placement and cost
NAT Gateways provide outbound internet access for private subnet resources. Common mistake: one NAT Gateway in one AZ, shared by all private subnets.
The problem: if that AZ fails, all your private subnet resources lose internet access. And cross-AZ traffic through the NAT Gateway incurs data transfer charges.
Best practice: One NAT Gateway per AZ, in the public subnet of that AZ. Each private subnet routes to the NAT Gateway in its own AZ. This eliminates cross-AZ transfer charges for NAT traffic and provides AZ-level fault isolation.
The cost: NAT Gateways are $0.045/hour ($32/month) plus data processing charges. Three NAT Gateways is ~$96/month. For most production workloads, this is the right tradeoff. For development environments where AZ fault tolerance isn’t required, one NAT Gateway is fine.
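The arithmetic above can be sketched as a small cost model. Rates are the us-east-1 prices cited in the text; verify against current AWS pricing before relying on them:

```python
# Rough monthly NAT Gateway cost model. Rates are assumptions based on
# published us-east-1 pricing; check current pricing for your region.
HOURLY_RATE = 0.045      # $/hour per NAT Gateway
PROCESSING_RATE = 0.045  # $/GB of data processed
HOURS_PER_MONTH = 730

def nat_monthly_cost(gateways: int, gb_processed: float) -> float:
    hourly = gateways * HOURLY_RATE * HOURS_PER_MONTH
    processing = gb_processed * PROCESSING_RATE
    return round(hourly + processing, 2)

print(nat_monthly_cost(1, 0))    # 32.85  -- one idle gateway
print(nat_monthly_cost(3, 0))    # 98.55  -- one per AZ, idle
print(nat_monthly_cost(3, 500))  # 121.05 -- plus 500 GB processed
```

Note that data processing, not the hourly rate, dominates at volume, which is why the VPC endpoints below matter.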
If your data transfer through NAT Gateway is significant, VPC endpoints (below) can reduce it substantially.
VPC endpoints: reduce costs and improve security
VPC endpoints let resources in private or isolated subnets communicate with AWS services (S3, DynamoDB, SSM, Secrets Manager, ECR) without traversing the internet through NAT Gateway.
Gateway endpoints (S3 and DynamoDB) are free and should always be configured. Add them to every VPC. A Lambda function that reads from S3 without a gateway endpoint sends that traffic through NAT Gateway — you pay NAT data processing charges for what could be free.
Interface endpoints (everything else — SSM, Secrets Manager, ECR, CloudWatch, etc.) cost ~$0.01/hour per AZ plus data processing. Worth it for high-volume services (ECR image pulls, SSM parameter reads at scale). Evaluate based on your actual traffic.
VPC endpoints also improve security: traffic to S3 and DynamoDB never leaves the AWS network, even through a gateway endpoint. Endpoint policies can restrict which resources the endpoint can access.
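To make the gateway-endpoint savings concrete, here’s a back-of-envelope estimate of the NAT data-processing charge you avoid by routing S3 traffic through a (free) gateway endpoint. The rate is an assumption; verify against current pricing:

```python
# Estimate the avoidable NAT charge for S3 traffic that could use a
# free gateway endpoint instead. Rate is an assumed us-east-1 price.
NAT_PROCESSING_RATE = 0.045  # $/GB through a NAT Gateway

def monthly_nat_s3_charge(gb_to_s3_per_month: float) -> float:
    """NAT data-processing charge avoidable with an S3 gateway endpoint."""
    return round(gb_to_s3_per_month * NAT_PROCESSING_RATE, 2)

# A workload moving 2 TB/month to S3 through NAT:
print(monthly_nat_s3_charge(2048))  # 92.16 -- drops to $0 with the endpoint
```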
Security groups vs NACLs
Security groups are stateful, instance-level firewalls. They evaluate rules on both inbound and outbound traffic, automatically allowing return traffic for established connections. Security groups are the primary mechanism for controlling traffic between resources.
NACLs (Network Access Control Lists) are stateless subnet-level firewalls. You have to explicitly allow both inbound and outbound traffic (including ephemeral ports for return traffic). NACLs are applied at the subnet boundary.
Practical guidance:
- Use security groups for all traffic control between application tiers. They’re easier to reason about and manage than NACLs.
- Keep NACLs at default (allow all) unless you have a specific requirement — like blocking traffic from known-bad CIDR ranges at the network level.
- Never use 0.0.0.0/0 on port 22 or 3389 in a security group. Use Session Manager instead of bastion hosts.
- Reference security groups by ID in rules rather than CIDR ranges where possible. A rule that says “allow from the application security group” is more maintainable than “allow from 10.0.1.0/24.”
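The world-open-port checks above are mechanical enough to automate. A sketch of the audit logic on rule dicts shaped loosely like the EC2 API’s `IpPermissions` structure (the field names and port sets here are simplifications, not a complete policy):

```python
# Sketch of a security-group rule audit: flag world-open admin ports
# and world-open rules on non-web ports. Rule shape loosely mirrors
# the EC2 IpPermissions structure, simplified for illustration.
RISKY_PORTS = {22, 3389}  # SSH, RDP
WEB_PORTS = {80, 443}

def audit_rule(rule: dict) -> list:
    findings = []
    world_open = any(r.get("CidrIp") == "0.0.0.0/0"
                     for r in rule.get("IpRanges", []))
    if not world_open:
        return findings
    port = rule.get("FromPort")
    if port in RISKY_PORTS:
        findings.append(f"world-open admin port {port}: use Session Manager")
    elif port not in WEB_PORTS:
        findings.append(f"world-open non-web port {port}: restrict the source")
    return findings

assert audit_rule({"FromPort": 22, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]})
assert not audit_rule({"FromPort": 443, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]})
assert not audit_rule({"FromPort": 22, "IpRanges": [{"CidrIp": "10.0.0.0/16"}]})
```

A real audit would pull rules via `describe_security_groups` and also allow 0.0.0.0/0 on web ports only for public-facing load balancer groups; the point is that the policy is a few lines of logic, so there’s no excuse not to run it regularly.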
The “default VPC” problem
The AWS default VPC exists for convenience, not production use. It has all subnets configured as public, which means every EC2 instance you launch gets a public IP by default. It’s fine for experimentation; it’s a liability for production.
Delete the default VPC in all regions you use once you have your production VPC in place. This prevents resources from being accidentally launched there and removes a source of audit findings.
In regions you don’t use, delete the default VPC too — it’s unnecessary attack surface if credentials are ever compromised.
Common mistakes
CIDR ranges too small. Running out of IP addresses in a subnet requires recreating it, which means migrating resources. Start larger than you think you need.
All resources in public subnets. “It’s easier to debug” is the justification. It’s also a security audit finding and a real exposure. Put compute in private subnets.
Single NAT Gateway. AZ failure takes down all outbound traffic. One per AZ costs ~$64/month more. Worth it for production.
No VPC endpoints for S3/DynamoDB. You’re paying NAT data processing charges for traffic that could be free. Takes 5 minutes to add and often saves hundreds per month.
Overly permissive security groups. 0.0.0.0/0 on any port other than 80/443 on a public-facing load balancer is a finding. Audit security groups quarterly.
When to bring in help
VPC design is foundational — getting it right takes 2-3 days and getting it wrong requires painful rework later. If you’re starting a new AWS environment or inheriting one that grew organically, a VPC design review is a high-leverage engagement.
I design VPC architectures as part of infrastructure foundation engagements, and I can help you build something that holds up as you scale.
Contact me or email nick@coldsmokeconsulting.com.
Nick Allevato is an AWS Certified Solutions Architect Professional with 20 years of infrastructure experience. He runs Cold Smoke Consulting, an independent AWS consulting practice.