Route 53 health checks monitor your endpoints and can automatically reroute DNS traffic when an endpoint becomes unhealthy. Combined with failover routing policies, this is how you build DNS-level disaster recovery that triggers without human intervention.
Most teams know this exists. Fewer have it configured correctly. Here’s what a working setup looks like.
Health check types
Endpoint health checks: Route 53 health checkers (distributed globally) send HTTP, HTTPS, or TCP requests to a specified endpoint at a configurable interval. If the endpoint fails to respond within the timeout or returns a non-2xx HTTP status, Route 53 marks it unhealthy.
Calculated health checks: A composite health check that is healthy only when a specified number of child health checks are healthy. Use one to aggregate multiple endpoint checks into a single health signal.
CloudWatch alarm health checks: A health check that mirrors the state of a CloudWatch alarm. If the alarm is in ALARM state, the health check is unhealthy. This lets you base DNS failover on any CloudWatch metric — not just HTTP endpoint reachability.
Endpoint health check configuration
resource "aws_route53_health_check" "primary" {
  fqdn              = "api.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3  # Unhealthy after 3 consecutive failures
  request_interval  = 30 # Check every 30 seconds (or 10 for faster failover)

  regions = [
    "us-east-1",
    "us-west-2",
    "eu-west-1",
  ] # Which Route 53 health checker regions to use

  tags = {
    Name = "primary-api-health-check"
  }
}
Critical settings:
request_interval: 30 seconds (standard) or 10 seconds (fast). With a failure threshold of 3, fast checks detect a failure in ~30 seconds; standard checks take ~90 seconds. The fast interval is billed as an optional feature: roughly an extra $1.00/month per health check on top of the $0.50/month base.
failure_threshold: How many consecutive failures before the endpoint is declared unhealthy. At 30-second intervals with threshold 3, failover triggers after ~90 seconds. At 10-second intervals with threshold 3, failover triggers after ~30 seconds.
resource_path: The HTTP path Route 53 will request. This should be a dedicated health endpoint (/health, /ping) that verifies actual application functionality — not just that the web server is responding. A good health endpoint checks database connectivity, cache availability, and any other critical dependencies.
regions: Route 53 health checkers are distributed globally. Using checkers from multiple regions (three is the minimum) prevents false positives from a single-region network issue. Route 53 considers the endpoint healthy as long as more than 18% of its health checkers report it healthy; at or below that share, it is marked unhealthy.
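To make the resource_path guidance concrete, here is a minimal sketch of a dedicated health endpoint in Python, where check_database() and check_cache() are hypothetical stand-ins for your real dependency probes:

```python
# Minimal /health handler sketch. Route 53 treats any non-2xx response
# (or a timeout) as a failure, so return 200 only when every critical
# dependency is reachable. The probe functions below are placeholders.

def check_database() -> bool:
    return True  # e.g. run SELECT 1 against your primary datastore

def check_cache() -> bool:
    return True  # e.g. PING against Redis/Memcached

def health() -> tuple[int, dict]:
    """Return (HTTP status, body): 200 if all checks pass, else 503."""
    checks = {"database": check_database(), "cache": check_cache()}
    status = 200 if all(checks.values()) else 503
    return status, checks
```

Wire this into whatever web framework you use; the point is that the handler exercises real dependencies rather than just proving the web server answers.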
Failover routing policy
Failover routing lets you configure a primary and secondary record for the same DNS name. Route 53 routes traffic to the primary when it’s healthy and to the secondary when it’s not.
# Primary record — us-east-1 ALB
resource "aws_route53_record" "primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  failover_routing_policy {
    type = "PRIMARY"
  }

  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
}
# Secondary record — us-west-2 ALB (failover target)
resource "aws_route53_record" "secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "secondary"
  # No health_check_id needed on secondary — it's the fallback

  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }
}
evaluate_target_health = true on the alias record: Route 53 checks the health of the ALB target groups, not just the ALB itself. If all EC2 instances behind the ALB are unhealthy, the alias record is treated as unhealthy even if the ALB is responding. This is what you want — DNS failover that triggers on actual application health, not just load balancer availability.
TTL considerations
DNS TTL controls how long resolvers cache your DNS records. For failover to work quickly, your DNS TTL must be low enough that clients pick up the change within your acceptable failover time.
Standard failover math:
- Health check detection: 90 seconds (30s interval × 3 failures)
- DNS propagation: depends on TTL + resolver cache
With TTL = 300 seconds (5 minutes), your total failover time is up to 90s (detection) + 300s (DNS cache) = ~6.5 minutes maximum.
With TTL = 60 seconds: ~2.5 minutes maximum failover time.
Trade-off: Lower TTL means more DNS queries (very small cost) and faster failover. For health-check-backed records where quick failover matters, use 60 seconds or lower.
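The failover math above is worth encoding once so you can sanity-check any interval/threshold/TTL combination; a small sketch:

```python
def worst_case_failover_seconds(interval_s: int, failure_threshold: int, ttl_s: int) -> int:
    """Worst-case time from first failure until clients see the new answer:
    detection (check interval x consecutive failures) plus the DNS cache (TTL).
    Ignores resolvers that don't honor TTL."""
    return interval_s * failure_threshold + ttl_s

# Standard checks, TTL 300: 30*3 + 300 = 390s (~6.5 min)
# Standard checks, TTL 60:  30*3 + 60  = 150s (~2.5 min)
# Fast checks, TTL 60:      10*3 + 60  = 90s
```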
Alias records note: AWS alias records (pointing to ALBs, CloudFront, etc.) don't let you set a TTL at all — Route 53 answers with the TTL of the underlying AWS resource (60 seconds for ALB aliases). For non-alias A/CNAME records with failover, TTL is your primary lever.
Calculated health checks for complex scenarios
A single endpoint health check is sometimes insufficient. Example: your application is healthy if the web tier is up AND the database is reachable. If either fails, you want failover to trigger.
# Individual checks
resource "aws_route53_health_check" "web_tier" {
  fqdn          = "api.example.com"
  port          = 443
  type          = "HTTPS"
  resource_path = "/health/web"
  # ...
}

resource "aws_route53_health_check" "db_connectivity" {
  fqdn          = "api.example.com"
  port          = 443
  type          = "HTTPS"
  resource_path = "/health/db"
  # ...
}

# Calculated check: healthy only if both children are healthy
resource "aws_route53_health_check" "composite" {
  type                   = "CALCULATED"
  child_health_threshold = 2 # Require both children healthy
  child_healthchecks = [
    aws_route53_health_check.web_tier.id,
    aws_route53_health_check.db_connectivity.id,
  ]
}
Use the composite check as the health_check_id on your failover primary record. Failover now triggers if either the web tier or the database connectivity check fails.
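The evaluation rule behind a calculated check is simple enough to sketch: the composite is healthy only while the number of healthy children meets child_health_threshold.

```python
def calculated_check_healthy(child_statuses: list[bool], threshold: int) -> bool:
    """Composite is healthy iff at least `threshold` children are healthy."""
    return sum(child_statuses) >= threshold

# With threshold 2 and two children, any single child failing
# flips the composite to unhealthy and triggers failover.
```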
CloudWatch alarm-based health checks
For failures that aren’t detectable via HTTP (resource exhaustion, Lambda throttling, SQS queue backup), use CloudWatch alarm health checks:
# CloudWatch alarm: sustained ALB target 5xx responses
resource "aws_cloudwatch_metric_alarm" "api_error_rate" {
  alarm_name          = "api-error-rate-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Sum"
  threshold           = 10 # More than 10 target 5xx responses per minute
  # ... (dimensions identifying the load balancer, etc.)
}
# Route 53 health check tied to this alarm
resource "aws_route53_health_check" "alarm_based" {
  type                            = "CLOUDWATCH_METRIC"
  cloudwatch_alarm_name           = aws_cloudwatch_metric_alarm.api_error_rate.alarm_name
  cloudwatch_alarm_region         = "us-east-1"
  insufficient_data_health_status = "Healthy" # Treat missing data as healthy
}
This triggers DNS failover whenever the alarm fires — a failure mode an HTTP endpoint check can miss when the server is up and /health responds, but real requests are returning errors.
Latency-based routing with health checks
Failover routing (primary/secondary) is a binary on/off. For active-active multi-region setups, latency-based routing with health checks routes each user to their nearest healthy region:
resource "aws_route53_record" "us_east" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  latency_routing_policy {
    region = "us-east-1"
  }

  set_identifier  = "us-east-1"
  health_check_id = aws_route53_health_check.us_east.id

  alias {
    name                   = aws_lb.us_east.dns_name
    zone_id                = aws_lb.us_east.zone_id
    evaluate_target_health = true
  }
}
resource "aws_route53_record" "us_west" {
  # Same structure, region = "us-west-2"
  # ...
}

resource "aws_route53_record" "eu_west" {
  # Same structure, region = "eu-west-1"
  # ...
}
With health checks on latency-based records: Route 53 routes each user to the lowest-latency healthy region. If us-east-1 goes down, East Coast users automatically route to the next closest healthy region.
What Route 53 failover doesn’t solve
Database consistency: DNS failover routes traffic to your secondary region, but your secondary database must be caught up. With RDS read replicas there is replication lag: data written to the primary shortly before failure may not have reached the replica. For truly active-active setups, use Aurora Global Database (typically sub-second replication) or DynamoDB Global Tables.
Application session state: If sessions are stored in memory or a regional Redis, sessions from primary-region users will be lost when they’re rerouted to secondary. Design for stateless applications or use cross-region session storage.
Failback: Route 53 automatically fails back to the primary when health checks recover. This is usually correct, but can cause traffic oscillation if the primary is intermittently unhealthy. Consider manual control over failback for production traffic.
Propagation isn’t instant: Even with TTL = 60, some DNS resolvers cache aggressively or don’t respect TTL. Large enterprise networks with caching resolvers may take longer to see DNS changes. Route 53 failover is not a substitute for application-level retry logic.
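On that last point, a rough sketch of application-level retry with exponential backoff (do_request is a hypothetical stand-in for your HTTP call). During a failover window, early attempts may still hit the failing endpoint while later ones resolve to the healthy one:

```python
import time

def with_retries(do_request, attempts: int = 4, base_delay_s: float = 0.5):
    """Call do_request(), retrying on connection failures with exponential
    backoff (base, 2x base, 4x base, ...). Re-raises after the last attempt."""
    for attempt in range(attempts):
        try:
            return do_request()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay_s * 2 ** attempt)
```

Each retry re-resolves DNS (assuming your HTTP client doesn't pin connections), so a client that backs off long enough rides out both the detection window and the TTL.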
Cost
Route 53 health checks:
- Standard (30-second interval): $0.50/month for AWS endpoints ($0.75/month for non-AWS endpoints)
- Fast (10-second interval): billed as an optional feature, roughly an extra $1.00/month per check
- Calculated and CloudWatch alarm checks: $0.50/month
For a full failover setup with calculated checks: roughly $3-5/month. The cost of having no failover during an outage vastly exceeds this.
If you’re building multi-region architecture or want DNS-level failover for a production workload, I’m available to help design it.
Nick Allevato is an AWS Certified Solutions Architect Professional with 20 years of infrastructure experience. He runs Cold Smoke Consulting, an independent AWS consulting practice.