AWS CloudWatch: What to Actually Monitor

April 10, 2026 — Nick Allevato

CloudWatch is one of those AWS services where the defaults are nearly useless and the right configuration isn’t obvious. Left alone, it collects data that nobody looks at and sends alerts that train your team to ignore the alert channel.

The goal is different: a small set of metrics that reliably signal real problems, dashboards that give you operational situational awareness at a glance, and alarms that fire rarely and always mean something.

Here’s what actually works.


The three-tier monitoring model

Think about observability in three tiers:

Infrastructure: EC2, ECS tasks, Lambda, RDS — is the platform running?

Application: Errors, latency, request rates, queue depths — is the service working correctly?

Business: Order rates, login volume, job completion rates — is the product delivering value?

CloudWatch can cover all three. Most teams only configure the infrastructure tier, miss application signals, and never get to business metrics. Start with infrastructure, add application immediately, and treat business metrics as the long-term goal.


EC2: the metrics that matter

If you’re running EC2 instances, the defaults CloudWatch collects (CPU, network, disk I/O) tell you some things but miss others.

What to alarm on:

  • CPUUtilization > 85% sustained for 5 minutes — instance is CPU-constrained, not a momentary spike
  • StatusCheckFailed = 1 — instance or host is unhealthy; this needs immediate attention
  • NetworkIn/NetworkOut — set alarms relative to your baseline; sudden spikes or drops signal problems
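As a sketch, here is what that sustained-CPU alarm looks like as arguments for boto3's put_metric_alarm. The instance ID and SNS topic ARN are placeholders; a single 300-second period is used so the alarm works with basic monitoring's 5-minute granularity.

```python
# Sketch: sustained-CPU alarm for an EC2 instance, expressed as the kwargs
# dict you would pass to boto3's cloudwatch.put_metric_alarm.

def cpu_alarm_kwargs(instance_id, topic_arn):
    """Alarm when the 5-minute average CPU is above 85%."""
    return {
        "AlarmName": f"{instance_id}-cpu-sustained",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # one 5-minute average, not a momentary spike
        "EvaluationPeriods": 1,
        "Threshold": 85.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }

# Hypothetical usage (instance ID and topic are placeholders):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **cpu_alarm_kwargs("i-0123456789abcdef0", topic_arn))
```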

What the defaults miss:

Memory utilization isn’t collected by default. Neither is disk space. You need the CloudWatch Unified Agent to collect these from inside the instance. Without it, you’ll discover you’re out of disk when the application starts failing.

Install the CloudWatch Unified Agent on all EC2 instances and configure it to collect:

  • mem_used_percent — alarm at 85%
  • disk_used_percent for your application and log volumes — alarm at 80%
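A minimal agent configuration covering those two metrics looks like this (the disk paths are examples; adjust them to your application and log volumes):

```json
{
  "metrics": {
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"]
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "resources": ["/", "/var/log"]
      }
    }
  }
}
```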

ECS and containers: watching task health

For ECS services, the critical metrics:

  • RunningTaskCount — if this drops below your desired count, tasks are crashing. Alarm on this.
  • CPUUtilization and MemoryUtilization at the service level — sustained high memory is often a leak
  • TargetResponseTime from the ALB target group — this is where latency problems show up first
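Here is the RunningTaskCount alarm sketched as put_metric_alarm kwargs. Note the assumption that Container Insights is enabled, since RunningTaskCount lives in the ECS/ContainerInsights namespace; cluster, service, and topic names are placeholders.

```python
# Sketch: alarm when running tasks fall below the desired count for 3 minutes.
# Assumes Container Insights is enabled on the cluster.

def running_tasks_alarm_kwargs(cluster, service, desired_count, topic_arn):
    return {
        "AlarmName": f"{service}-running-tasks-low",
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "RunningTaskCount",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Minimum",      # catch the dip, not the average
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": float(desired_count),
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [topic_arn],
    }
```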

The most common ECS monitoring gap: teams alarm on the EC2 host metrics but not on the ECS service metrics. A task can crash and restart repeatedly while the host looks healthy.

For ECS on Fargate: you don’t have host-level visibility, so service-level metrics are even more important.


Lambda: the four essential alarms

Lambda functions need four CloudWatch alarms as a minimum:

1. Errors — alarm if the Errors metric is > 0 for 5 minutes. Lambda errors are unexpected; you want to know immediately.

2. Throttles — alarm if Throttles > 0. Throttles mean you’re hitting concurrency limits. They cause silent request failures that don’t appear in your error rate.

3. Duration — alarm when Duration approaches your timeout limit. If your function is configured with a 30-second timeout and P99 duration reaches 28 seconds, you’re about to have timeouts.

4. ConcurrentExecutions — unexpected spikes in concurrent Lambda invocations are often the first signal of a runaway retry loop. Alarm when this exceeds 2–3x your normal peak.
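One way to sketch all four as boto3 alarm definitions (function name, timeout, peak concurrency, and topic ARN are placeholders; the Duration threshold is set at 90% of the timeout as a reasonable default):

```python
# Sketch: the four minimum Lambda alarms as put_metric_alarm kwargs dicts.

def lambda_alarms(function_name, timeout_s, peak_concurrency, topic_arn):
    dims = [{"Name": "FunctionName", "Value": function_name}]
    specs = [
        ("Errors", "Sum", 0, "GreaterThanThreshold"),
        ("Throttles", "Sum", 0, "GreaterThanThreshold"),
        # Duration is reported in milliseconds; alarm at 90% of the timeout
        ("Duration", "p99", timeout_s * 1000 * 0.9, "GreaterThanThreshold"),
        # 2x normal peak as the runaway-retry tripwire
        ("ConcurrentExecutions", "Maximum", peak_concurrency * 2, "GreaterThanThreshold"),
    ]
    alarms = []
    for metric, stat, threshold, op in specs:
        alarm = {
            "AlarmName": f"{function_name}-{metric.lower()}",
            "Namespace": "AWS/Lambda",
            "MetricName": metric,
            "Dimensions": dims,
            "Period": 300,
            "EvaluationPeriods": 1,
            "Threshold": float(threshold),
            "ComparisonOperator": op,
            "AlarmActions": [topic_arn],
        }
        # Percentile stats go in ExtendedStatistic; plain stats in Statistic
        if stat.startswith("p"):
            alarm["ExtendedStatistic"] = stat
        else:
            alarm["Statistic"] = stat
        alarms.append(alarm)
    return alarms
```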

The killer combination: a Lambda with a bug that causes every message to fail, retrying against an SQS queue with a long visibility timeout, and a 5-minute Lambda timeout. Concurrent executions spike, duration maxes out, costs compound. An alarm on ConcurrentExecutions catches this before the AWS bill does.


RDS: database signals

For RDS instances:

  • DatabaseConnections — sustained near your max_connections limit causes new connections to fail. Alarm at 80% of your connection limit.
  • FreeableMemory — low memory causes query performance degradation. Alarm when < 256MB.
  • FreeStorageSpace — alarm at 20% remaining. RDS storage auto-scaling can help, but don’t rely on it without an alarm.
  • CPUUtilization > 80% sustained — indicates query performance problems or under-provisioned instance
  • ReadLatency and WriteLatency — establish your baseline and alarm on deviations. A 2x latency spike usually means something changed.
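The connections alarm from that list, sketched as put_metric_alarm kwargs (instance identifier, connection limit, and topic ARN are placeholders):

```python
# Sketch: alarm when DatabaseConnections reaches 80% of max_connections.

def db_connections_alarm_kwargs(db_instance_id, max_connections, topic_arn):
    return {
        "AlarmName": f"{db_instance_id}-connections-high",
        "Namespace": "AWS/RDS",
        "MetricName": "DatabaseConnections",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": float(int(max_connections * 0.8)),  # 80% of the limit
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }
```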

For Aurora specifically: add AuroraReplicaLag if you’re using reader endpoints. High replication lag means your read replicas are serving stale data.


ALB: watching the front door

Application Load Balancer metrics give you the application-level view:

  • HTTPCode_Target_5XX_Count > 0 for 5 minutes — backend errors reaching clients
  • HTTPCode_ELB_5XX_Count > 0 — ALB itself is failing (not the targets)
  • TargetResponseTime P99 — user-visible latency. Alarm when P99 exceeds your SLA.
  • UnHealthyHostCount > 0 — targets failing health checks; immediate alarm
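A sketch of the target-5XX alarm follows. The one non-obvious detail is TreatMissingData: an ALB emits no 5XX datapoints at all when there are no errors, so missing data must count as healthy. The LoadBalancer dimension value is the name suffix from the ALB ARN (the example value is a placeholder).

```python
# Sketch: backend 5XX alarm for an Application Load Balancer.

def alb_5xx_alarm_kwargs(lb_dimension, topic_arn):
    return {
        "AlarmName": "alb-target-5xx",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": lb_dimension}],
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # No errors means no datapoints, so missing data is healthy
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }

# Hypothetical usage (the dimension value is a placeholder):
# alb_5xx_alarm_kwargs("app/my-alb/50dc6c495c0c9188", topic_arn)
```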

The HTTPCode_Target_5XX_Count vs HTTPCode_ELB_5XX_Count distinction matters for diagnosis: 5XX from the target means your application is erroring; 5XX from the ALB means the connection is failing before it reaches your application.


SQS: queue depth is a proxy for problems

SQS queues tell you more than they appear to:

  • ApproximateNumberOfMessagesVisible — messages waiting to be processed. If this grows, your consumers are slow or down.
  • ApproximateAgeOfOldestMessage — how long messages have been waiting. Alarm when this exceeds your acceptable processing SLA.
  • Dead-letter queue depth > 0 — messages that failed processing and exhausted retries. This is a hard alarm; DLQ messages mean something is broken.
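The DLQ alarm as a sketch (queue name and topic ARN are placeholders):

```python
# Sketch: hard alarm on any visible message in the dead-letter queue.

def dlq_alarm_kwargs(dlq_name, topic_arn):
    return {
        "AlarmName": f"{dlq_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0.0,                # any message at all means broken
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }
```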

A queue depth that trends upward over hours indicates a capacity problem. A queue depth spike followed by a drop indicates a burst that was absorbed. A queue depth that stays at zero while your DLQ grows means your consumers are crashing.


CloudWatch dashboards: the operational view

Alarms tell you when something is wrong. Dashboards tell you what’s happening.

Build one dashboard per environment (production, staging) with:

  • A row for each service tier (ALB → ECS/EC2 → RDS)
  • Error rates, latency, and request volume for the application layer
  • Resource utilization (CPU, memory, disk) for compute
  • Database connections, latency, and free storage
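A minimal version of that dashboard, built as a put_dashboard body. The names, region, and widget layout are placeholders to adapt; only two tiers are shown to keep the sketch short.

```python
import json

# Sketch: a "one row per tier" dashboard body for cloudwatch.put_dashboard.

def dashboard_body(alb_dimension, db_instance_id, region="us-east-1"):
    widgets = [
        {"type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
         "properties": {"title": "ALB: errors and P99 latency", "region": region,
                        "metrics": [
                            ["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count",
                             "LoadBalancer", alb_dimension, {"stat": "Sum"}],
                            ["AWS/ApplicationELB", "TargetResponseTime",
                             "LoadBalancer", alb_dimension, {"stat": "p99"}],
                        ]}},
        {"type": "metric", "x": 0, "y": 6, "width": 12, "height": 6,
         "properties": {"title": "RDS: connections and free storage", "region": region,
                        "metrics": [
                            ["AWS/RDS", "DatabaseConnections",
                             "DBInstanceIdentifier", db_instance_id],
                            ["AWS/RDS", "FreeStorageSpace",
                             "DBInstanceIdentifier", db_instance_id],
                        ]}},
    ]
    return json.dumps({"widgets": widgets})

# Hypothetical usage:
# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="production",
#     DashboardBody=dashboard_body("app/my-alb/50dc6c495c0c9188", "prod-db"))
```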

The dashboard should answer “is production healthy right now?” in 30 seconds without digging into individual metrics.

Practical tip: build percentile views (P50/P95/P99) for latency metrics using percentile statistics on the metric itself; note that Metrics Insights queries only support AVG, MIN, MAX, SUM, and COUNT, so percentiles come from the metric’s extended statistics. Average latency hides the tail: a healthy P50 with a bad P99 means 1% of your users are having a terrible experience.


What not to alarm on

CPU spikes. Short CPU spikes are normal. Alarm on sustained high CPU, not momentary peaks. A 5-minute average above 85% means something; a 1-minute spike to 95% is usually noise.

Individual Lambda invocation errors. If you have Lambda functions triggered by user actions, a single error is not worth waking someone up. Use anomaly detection or alarm on error rate rather than error count.
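One way to alarm on error rate is a metric-math alarm dividing Errors by Invocations. This is a sketch; the function name and the 5% threshold are placeholders to tune.

```python
# Sketch: metric-math alarm on Lambda error *rate* rather than raw count.

def error_rate_alarm_kwargs(function_name, topic_arn, threshold=0.05):
    dims = [{"Name": "FunctionName", "Value": function_name}]

    def stat(metric_id, metric_name):
        return {"Id": metric_id, "ReturnData": False,
                "MetricStat": {"Metric": {"Namespace": "AWS/Lambda",
                                          "MetricName": metric_name,
                                          "Dimensions": dims},
                               "Period": 300, "Stat": "Sum"}}

    return {
        "AlarmName": f"{function_name}-error-rate",
        "Metrics": [
            stat("errors", "Errors"),
            stat("invocations", "Invocations"),
            # IF() guards against dividing by zero when there are no invocations
            {"Id": "rate", "Label": "error rate", "ReturnData": True,
             "Expression": "IF(invocations > 0, errors / invocations, 0)"},
        ],
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }
```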

Low-severity CloudWatch agent collection failures. The agent will occasionally fail to report a metric. Don’t alarm on every metric collection gap.

The goal is alarms that fire rarely, always mean something, and always require action. Teams that over-alarm train themselves to ignore the alert channel entirely.


Log Insights: the investigation layer

CloudWatch Logs Insights is underused. It lets you query your log groups with a SQL-like syntax:

fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m)
| sort errorCount desc

Set up saved queries for your most common investigation patterns: error rate by service, slow request identification, specific error message frequency. During an incident, you want to run these queries in seconds rather than writing them from scratch.
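Saved queries can be created programmatically with the Logs API’s put_query_definition. A sketch using the query from above (the query name and log group are placeholders):

```python
# Sketch: kwargs for boto3's logs.put_query_definition, saving the
# error-count query so it's one click away during an incident.

def saved_query_kwargs(name, log_group, pattern):
    query = (
        "fields @timestamp, @message\n"
        f"| filter @message like /{pattern}/\n"
        "| stats count(*) as errorCount by bin(5m)\n"
        "| sort errorCount desc"
    )
    return {"name": name, "queryString": query, "logGroupNames": [log_group]}

# Hypothetical usage (names are placeholders):
# boto3.client("logs").put_query_definition(
#     **saved_query_kwargs("errors-by-5m", "/ecs/api", "ERROR"))
```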


When to bring in help

The difference between CloudWatch configured well and CloudWatch producing noise is about 2-3 days of focused work. If your team is either blind to production problems or drowning in alerts, an observability review is high-leverage.

I configure CloudWatch as part of infrastructure engagements and can assess your current monitoring posture and close the gaps.

Contact me or email nick@coldsmokeconsulting.com.


Nick Allevato is an AWS Certified Solutions Architect Professional with 20 years of infrastructure experience. He runs Cold Smoke Consulting, an independent AWS consulting practice.