EKS costs more than teams expect, and the waste is usually concentrated in a few predictable places. Node groups sized for peak load running at 20% utilization, no Spot instances anywhere in the cluster, data transfer between availability zones accumulating quietly, and NAT Gateway charges for traffic that could go through VPC endpoints.
Here’s a systematic approach to finding and eliminating EKS waste.
Node group sizing: the biggest lever
The most common EKS cost problem is static node groups sized for peak load. The cluster runs 24/7 at whatever node count was set during initial provisioning, regardless of actual workload demand.
Cluster Autoscaler and Karpenter solve this, but you have to configure them correctly.
Cluster Autoscaler scales existing node groups up and down based on pending pods and underutilized nodes. Common misconfiguration: --scale-down-utilization-threshold is set too low, so nodes never become scale-down candidates even when only 40% of their capacity is requested. The default is 0.5 (50%), and the threshold compares pod resource requests (not actual usage) against node allocatable. For non-critical workloads, raising it to 0.6-0.7 and confirming --scale-down-enabled hasn't been set to false removes significant idle capacity.
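As a sketch, the relevant container args on the cluster-autoscaler deployment (values here are illustrative, not defaults):

```yaml
# cluster-autoscaler container args (illustrative values, not defaults)
- --scale-down-enabled=true                # default is true; verify it hasn't been disabled
- --scale-down-utilization-threshold=0.6   # nodes with <60% of capacity requested become candidates
- --scale-down-unneeded-time=10m           # how long a node must stay unneeded before removal
```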
Karpenter is the more aggressive option — it provisions individual EC2 instances directly (bypassing managed node groups) and selects instance types dynamically based on pending pod requirements. For clusters with variable workloads, Karpenter typically achieves 30-50% better utilization than Cluster Autoscaler because it right-sizes at the instance level rather than scaling pre-defined node groups.
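A minimal Karpenter NodePool illustrating this, assuming Karpenter v1 CRDs and an existing EC2NodeClass named default (names and limits are illustrative):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stateless          # illustrative name
spec:
  template:
    spec:
      requirements:
        # Let Karpenter pick Spot when available, fall back to On-Demand
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Diversify across general-purpose, compute, and memory families
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    # Consolidate onto fewer/cheaper nodes, not just when empty
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: "200"             # cap total provisioned capacity
```

The disruption block is what delivers the utilization gains: Karpenter actively replaces underutilized nodes with cheaper, better-fitting ones rather than waiting for them to empty out.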
Node utilization baseline
Check actual utilization before assuming the nodes are needed:
```shell
kubectl top nodes
```
If you see nodes consistently below 30-40% CPU and memory utilization, you’re over-provisioned. The fix depends on your workload: reduce node count for static workloads, add autoscaling for dynamic workloads, or right-size instance types.
Spot instances: the largest single savings lever
Spot instances on EKS typically reduce EC2 costs by 60-80% for interruptible workloads. Most EKS clusters run everything on On-Demand.
The pattern that works:
- On-Demand node group: Minimum capacity for stateful workloads (databases, caches, anything that can’t tolerate interruption). Use smaller instance types in this group — m5.large or m5.xlarge.
- Spot node group (or Karpenter Spot provisioner): The majority of stateless application capacity. Use multiple instance families to improve Spot availability (m5, m4, m5a, m5n all have different interruption rates).
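In eksctl terms, the two-group pattern looks roughly like this (group names and sizes are illustrative):

```yaml
# eksctl config fragment: small On-Demand floor plus diversified Spot capacity
managedNodeGroups:
  - name: on-demand-stateful      # illustrative name
    instanceTypes: ["m5.large"]
    minSize: 2
    maxSize: 4
  - name: spot-stateless          # illustrative name
    spot: true
    # Multiple families so a single Spot pool drying up doesn't starve the group
    instanceTypes: ["m5.large", "m5a.large", "m5n.large", "m4.large"]
    minSize: 0
    maxSize: 20
```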
To configure pods for Spot tolerance:
```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: eks.amazonaws.com/capacityType
              operator: In
              values:
                - SPOT
tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```
The concern about Spot interruptions is real but manageable. AWS gives a 2-minute warning before reclaiming a Spot instance. With proper node drain handling (the AWS Node Termination Handler), pods are gracefully rescheduled before the instance is terminated. Applications that are stateless and have more than one replica handle Spot interruptions transparently.
The risk is high for: databases, stateful services with long warm-up times, jobs that can’t checkpoint. Everything else is a good Spot candidate.
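Alongside the termination handler, a PodDisruptionBudget keeps a minimum replica count up while a draining Spot node evicts pods. A minimal sketch (the web name and label are hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb        # hypothetical name
spec:
  minAvailable: 2      # never voluntarily drain below two ready replicas
  selector:
    matchLabels:
      app: web         # hypothetical label
```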
Data transfer: the hidden cost
Data transfer costs in EKS fall into three categories:
Cross-AZ traffic
EKS spreads pods across availability zones for resilience. When a pod in us-east-1a talks to a pod in us-east-1b, AWS charges $0.01/GB each direction. For a service mesh with high request volume, this adds up quickly and is invisible until you look at your VPC data transfer line item.
The fix depends on your traffic patterns:
Traffic distribution (Kubernetes 1.31+): Keeps traffic within the same AZ when possible. Add to your service:
```yaml
spec:
  trafficDistribution: PreferClose
```
This tells kube-proxy to prefer endpoints in the same zone. For most applications, this eliminates 80%+ of cross-AZ traffic with zero application changes.
For older clusters: the service.kubernetes.io/topology-mode: Auto annotation (topology-aware hints, Kubernetes 1.27+) achieves similar routing, though it has been superseded by trafficDistribution.
Note: topology-aware routing trades some load balancing evenness for reduced data transfer costs. Verify your pod distribution across AZs is roughly even before enabling it.
NAT Gateway charges
Pods in private subnets that call AWS services (S3, DynamoDB, SSM, ECR, STS) route through the NAT Gateway by default. At $0.045/GB, this is the same trap as with Lambda and ECS.
The fix: create VPC Gateway Endpoints (free) for S3 and DynamoDB, and VPC Interface Endpoints (hourly charge plus $0.01/GB processed) for ECR, SSM, STS, and other services. The ECR endpoint is particularly valuable — without it, every node pulling container images traverses the NAT Gateway.
Calculate: if your nodes pull 1GB of images per day and you have 20 nodes, that's 20GB/day through the NAT Gateway — $0.90/day, or about $27/month, just for image pulls. An interface endpoint runs roughly $7/month per AZ (note that ECR needs both the ecr.api and ecr.dkr endpoints, plus the free S3 gateway endpoint for layer downloads).
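A CloudFormation sketch of the free S3 gateway endpoint (VpcId and PrivateRouteTable are assumed parameters/resources in the surrounding template; interface endpoints use the same resource type with VpcEndpointType: Interface):

```yaml
# CloudFormation fragment: free gateway endpoint for S3
# VpcId and PrivateRouteTable are assumed to exist elsewhere in the template
S3GatewayEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcId: !Ref VpcId
    ServiceName: !Sub com.amazonaws.${AWS::Region}.s3
    VpcEndpointType: Gateway
    RouteTableIds:
      - !Ref PrivateRouteTable
```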
Fargate data transfer
EKS Fargate charges the same data transfer rates as regular EC2. Cross-AZ traffic and NAT Gateway costs apply. Fargate clusters often have higher data transfer costs because Fargate spins up a new ENI per pod (rather than per node), which can increase cross-AZ traffic frequency for short-lived pods.
Compute right-sizing: instance family selection
The default EKS managed node group instance type is often whatever was chosen at cluster creation — frequently m5.large or m5.xlarge regardless of actual workload requirements. Two common mismatches:
Memory-optimized workloads on general-purpose instances: If your pods consistently hit memory limits while CPU sits at 30%, you’re on the wrong instance family. r5 or r6g instances have double the memory-to-CPU ratio of m5 at roughly the same price per resource unit. Moving memory-bound workloads to r-family instances eliminates OOM kills and reduces replica count.
Compute-optimized workloads on general-purpose instances: If you’re running CPU-intensive workloads (ML inference, data processing), c5 or c6g instances have better CPU performance per dollar than m5.
Graviton (ARM) instances: m7g, c7g, r7g instances are typically 10-20% cheaper than equivalent x86 instances and often have better performance. Most containerized workloads run on ARM without modification — multi-arch container images are now the standard. The EKS-optimized ARM AMI is maintained by AWS.
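Steering a workload onto Graviton nodes needs nothing more than a node selector on the standard architecture label, assuming the image is multi-arch:

```yaml
# Pod spec fragment: pin to ARM nodes via the well-known architecture label
nodeSelector:
  kubernetes.io/arch: arm64
```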
Persistent volume costs
EBS volumes attached to persistent workloads accumulate costs and are often over-provisioned.
PVC sizing: Most teams provision PVCs with generous headroom. A database that uses 20GB of actual data often runs on a 100GB volume. Use the kubectl get pvc output combined with actual pod disk usage to identify over-provisioned volumes.
EBS volume type: The default EKS storage class uses gp2. gp3 is roughly 20% cheaper per GB and lets you provision throughput and IOPS independently of volume size. Existing volumes can be converted in place with aws ec2 modify-volume --volume-type gp3 — no snapshot and restore required — but also update the cluster's StorageClass so new PVCs default to gp3.
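A gp3 StorageClass for the EBS CSI driver might look like this (making it the cluster default via the annotation is optional):

```yaml
# gp3 StorageClass for the EBS CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"  # optional: make gp3 the default
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```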
Unattached volumes: When pods are deleted without proper cleanup, PVCs and their backing EBS volumes can persist. These unattached volumes appear in your EC2 console as “available” state and continue billing at $0.08/GB/month. Audit for these regularly:
```shell
kubectl get pvc --all-namespaces | grep -v Bound
```
Any PVC not in “Bound” state should be investigated.
Control plane and logging costs
The EKS control plane is $0.10/hour per cluster ($72/month). For teams with multiple clusters (dev/staging/prod), this adds up. Consider whether separate dev/staging clusters are necessary or if namespaces within a single cluster achieve the same isolation at lower cost.
EKS control plane logging: Disabled by default. Enabling all log types (API server, audit, authenticator, controller manager, scheduler) can add $30-100+/month in CloudWatch Logs storage depending on cluster activity. Enable logging selectively — audit logs are required for HIPAA/SOC2, but you may not need controller manager logs in production. Route logs through CloudWatch with a short retention period (7 days) rather than indefinite storage.
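In eksctl, selective logging with short retention is a two-line config (the 7-day figure mirrors the recommendation above):

```yaml
# eksctl config fragment: audit logs only, 7-day CloudWatch retention
cloudWatch:
  clusterLogging:
    enableTypes: ["audit"]
    logRetentionInDays: 7
```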
Container image size and pull frequency
Oversized container images affect costs indirectly: larger images mean more ECR storage, more data transfer during pulls, and slower node startup times (which means more over-provisioning “just in case”).
ECR storage is $0.10/GB/month. An organization with dozens of services and dozens of image versions can accumulate GB of stored images. ECR lifecycle policies expire old images automatically:
```json
{
  "rules": [
    {
      "rulePriority": 1,
      "selection": {
        "tagStatus": "untagged",
        "countType": "imageCountMoreThan",
        "countNumber": 3
      },
      "action": {"type": "expire"}
    }
  ]
}
```
This keeps only the three most recent untagged images. Combine with rules that expire tagged images older than 30 days for non-production repositories.
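A second rule along those lines might look like this, using ECR's tagPatternList matching — the "*" pattern catches all tagged images, so narrow it (or keep rulePriority ordering in mind) for repositories whose tags must survive:

```json
{
  "rulePriority": 2,
  "selection": {
    "tagStatus": "tagged",
    "tagPatternList": ["*"],
    "countType": "sinceImagePushed",
    "countUnit": "days",
    "countNumber": 30
  },
  "action": {"type": "expire"}
}
```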
The audit approach
For a structured EKS cost review:
- Pull EC2 costs in Cost Explorer, filter by tag or resource type, last 3 months
- Check node utilization with kubectl top nodes — anything below 40% is a target
- Check Spot adoption — what percentage of your nodes are Spot?
- Review data transfer in Cost Explorer — cross-AZ and NAT Gateway line items
- Audit VPC endpoints — are S3, DynamoDB, and ECR endpoints in place?
- Check PVC utilization — are volumes provisioned for actual usage?
- Review instance families — does the workload match the instance type?
Teams that haven’t done this review typically find 30-50% savings available.
When to bring in help
EKS cost optimization involves both Kubernetes internals and AWS billing mechanics — the combination means it often falls through the cracks between the application team and the infrastructure team. A focused audit takes 2-3 days and delivers specific, actionable recommendations with projected savings.
I work with engineering teams to audit and optimize EKS cost posture, including autoscaling configuration, Spot migration, and data transfer elimination. Let’s talk.
Nick Allevato is an AWS Certified Solutions Architect Professional with 20 years of infrastructure experience. He runs Cold Smoke Consulting, an independent AWS consulting practice.