SQS (Simple Queue Service) is one of AWS’s most reliable services and one of the most commonly misconfigured. The interface is simple — send a message, receive a message, delete a message — which creates false confidence. The failure modes are subtle and show up at scale or under incident conditions.
Here’s what to get right from the start.
Queue types
Standard queues:
- At-least-once delivery — messages may be delivered more than once
- Best-effort ordering — messages may be delivered out of order
- Nearly unlimited throughput
- Default for most use cases
FIFO queues:
- Exactly-once processing — each message delivered exactly once
- Strict first-in-first-out ordering within a message group
- 3,000 messages/second with batching, 300 without
- Required for order-dependent workflows, financial transactions, sequential processing
Choose FIFO only when ordering or exactly-once semantics are genuinely required. FIFO queues have lower throughput limits and higher per-message costs. Most applications that think they need FIFO can be designed to work with idempotent Standard queue processing.
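If you take the Standard-queue route, idempotency is the load-bearing piece. A minimal sketch of content-keyed idempotent processing, assuming an in-memory seen-set (a real consumer would use DynamoDB or Redis with a TTL at least as long as your retry window, and the `process_once` helper is illustrative, not an SQS API):

```python
import hashlib

# Hypothetical in-memory store; production would use DynamoDB or Redis
# with a TTL so the set does not grow without bound.
_processed = set()

def idempotency_key(message_body: str) -> str:
    """Derive a stable key from the message content."""
    return hashlib.sha256(message_body.encode("utf-8")).hexdigest()

def process_once(message_body: str, handler) -> bool:
    """Run handler only if this message has not been seen before.

    Returns True if the handler ran, False if it was a duplicate.
    The key is recorded only after the handler succeeds, so a
    failed attempt is retried rather than silently skipped.
    """
    key = idempotency_key(message_body)
    if key in _processed:
        return False  # duplicate delivery; safe to delete and move on
    handler(message_body)
    _processed.add(key)
    return True
```

With this in place, an at-least-once redelivery from a Standard queue becomes a cheap no-op instead of a double charge or double email.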
Visibility timeout: the most misunderstood setting
When a consumer receives a message from SQS, the message becomes invisible to other consumers for the visibility timeout period. If the consumer successfully processes the message and deletes it before the timeout expires, the message is gone. If the consumer crashes, times out, or fails to delete the message, the visibility timeout expires and the message becomes visible again for another consumer to pick up.
Default visibility timeout: 30 seconds.
The problem: If your processing takes longer than 30 seconds — or spikes to longer during heavy load — the message reappears in the queue and gets processed again by another consumer while the first consumer is still working on it. You get duplicate processing with no indication anything went wrong.
Sizing the visibility timeout:
- Set to at least 6x your expected processing time (to account for retries and slow processing)
- If processing can take up to 30 seconds at p99, set visibility timeout to at least 3-5 minutes
- For Lambda consumers: Lambda’s maximum execution time is 15 minutes; AWS recommends setting the queue’s visibility timeout to at least six times your function timeout
Heartbeating for long-running jobs: If processing time is genuinely variable and can exceed the timeout, call ChangeMessageVisibility periodically during processing to extend the timeout. This is the SQS equivalent of a heartbeat — telling SQS “I’m still working, don’t make this visible again.”
```python
import boto3
import threading

sqs = boto3.client('sqs')

def extend_visibility(queue_url, receipt_handle, stop_event,
                      extension_seconds=30, interval=20):
    """Extend message visibility every `interval` seconds until stopped."""
    while not stop_event.wait(interval):
        try:
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=extension_seconds
            )
        except Exception:
            break  # Message was deleted or the receipt handle expired

# Start the heartbeat in a background thread during processing
stop_event = threading.Event()
heartbeat = threading.Thread(
    target=extend_visibility,
    args=(queue_url, receipt_handle, stop_event),
    daemon=True
)
heartbeat.start()
try:
    process_message(body)
finally:
    stop_event.set()  # Stop extending once processing finishes (or fails)
```
Dead letter queues
A dead letter queue (DLQ) is a separate SQS queue that receives messages that couldn’t be processed after a configured number of attempts. Configure it in the source queue’s settings via maxReceiveCount.
```hcl
resource "aws_sqs_queue" "main" {
  name                       = "main-queue"
  visibility_timeout_seconds = 300
  message_retention_seconds  = 86400 # 1 day

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 5 # Move to DLQ after 5 failed processing attempts
  })
}

resource "aws_sqs_queue" "dlq" {
  name                      = "main-queue-dlq"
  message_retention_seconds = 1209600 # 14 days — keep failed messages longer
}
```
maxReceiveCount: When a message’s ApproximateReceiveCount attribute exceeds this number, SQS moves it to the DLQ. The right value depends on your processing pattern:
- If failures are transient (network timeouts, temporary downstream outages): set to 5-10
- If failures indicate bad data: set lower (3-5) to fail fast
- Never set to 1 — any message not deleted after a single receive goes straight to the DLQ with no retry, and a visibility-timeout expiry (a crashed or merely slow consumer) counts as a receive
DLQ retention: Set higher than your source queue retention. You want failed messages to persist long enough for investigation. 14 days is a common choice.
Alert on DLQ depth: A DLQ with messages is a signal that something is wrong. Create a CloudWatch alarm on ApproximateNumberOfMessagesVisible for the DLQ:
```hcl
resource "aws_cloudwatch_metric_alarm" "dlq_messages" {
  alarm_name          = "main-queue-dlq-not-empty"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 60
  statistic           = "Sum"
  threshold           = 0
  alarm_description   = "Messages in DLQ — processing failures need investigation"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    QueueName = aws_sqs_queue.dlq.name
  }
}
```
DLQ redrive: After fixing the root cause of processing failures, use SQS DLQ redrive to move messages back to the source queue for reprocessing. Available in the console (Queues → DLQ → Start DLQ redrive) or via API.
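A sketch of the API path via boto3’s `start_message_move_task` (the `redrive_dlq` helper is illustrative; omitting `DestinationArn` sends each message back to the queue it originally came from):

```python
def redrive_dlq(sqs, dlq_arn: str, rate_per_second: int = 10) -> str:
    """Move messages from the DLQ back to their source queue.

    `sqs` is a boto3 SQS client. Returns the move-task handle, which
    can be passed to cancel_message_move_task to stop an in-progress
    redrive if the downstream starts failing again.
    """
    response = sqs.start_message_move_task(
        SourceArn=dlq_arn,
        MaxNumberOfMessagesPerSecond=rate_per_second,  # throttle the redrive
    )
    return response["TaskHandle"]
```

Throttling the redrive matters: dumping 10,000 recovered messages back into the source queue all at once can re-trigger the very overload that dead-lettered them.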
The retry storm problem
The visibility timeout and retry behavior combine to create a dangerous failure pattern: the retry storm.
Scenario: A downstream service (database, external API) goes down. Your SQS consumers start failing to process messages. Each failed message reappears after the visibility timeout and gets retried. With 10,000 messages in the queue and a 30-second visibility timeout, you’re generating a constant flood of processing attempts against a service that’s already down. When the downstream recovers, the retry storm continues overwhelming it.
Mitigations:
1. Exponential backoff at the application level: When a consumer detects a downstream failure, extend the message’s visibility timeout exponentially before releasing it. First failure → 30s delay; second → 60s; third → 120s.
2. maxReceiveCount + DLQ: After N failures, messages move to DLQ and stop generating retries against the recovering downstream service.
3. SQS delay queues: Set a delivery delay (up to 15 minutes) on the queue. New messages don’t appear to consumers until the delay expires. Useful for rate limiting initial processing, less useful for retry storms in progress.
4. Circuit breaker in the consumer: The consumer tracks recent failure rate; when it exceeds a threshold, the consumer stops receiving from SQS entirely (or pauses the Lambda event source mapping). Prevents futile retries from piling up.
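The first mitigation can be sketched as follows, assuming messages were received with `AttributeNames=['ApproximateReceiveCount']` so the receive count is available (`release_with_backoff` is an illustrative helper, not an SQS API):

```python
def backoff_seconds(receive_count: int, base: int = 30, cap: int = 43200) -> int:
    """Exponential backoff: 30s, 60s, 120s, ... capped at SQS's
    maximum visibility timeout of 12 hours (43,200 seconds)."""
    return min(base * 2 ** max(receive_count - 1, 0), cap)

def release_with_backoff(sqs, queue_url: str, message: dict) -> None:
    """On downstream failure, push the message's visibility out before
    releasing it, so retries spread out instead of storming."""
    receive_count = int(message["Attributes"]["ApproximateReceiveCount"])
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
        VisibilityTimeout=backoff_seconds(receive_count),
    )
```

Combined with `maxReceiveCount`, this gives the downstream service progressively longer breathing room on each attempt before the message finally lands in the DLQ.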
Lambda as SQS consumer: specific considerations
Lambda + SQS is a common pattern with non-obvious failure behavior.
Batch size: Lambda can receive 1-10,000 messages per invocation (configured via event source mapping). Larger batches are more efficient but mean all messages in the batch are treated as a group for retry purposes — if one message causes a function failure, all messages in the batch retry.
Partial batch failure: Lambda supports partial batch failure response. Return a batchItemFailures list to tell SQS which messages failed and should be retried, while successfully processed messages are deleted:
```python
def handler(event, context):
    failures = []
    for record in event['Records']:
        try:
            process_message(record['body'])
        except Exception:
            # Report only this message as failed; successfully
            # processed messages in the batch are deleted
            failures.append({'itemIdentifier': record['messageId']})
    return {'batchItemFailures': failures}
```
Without partial batch failure response, Lambda reports success or failure for the entire batch. A single bad message causes all good messages to retry unnecessarily. Note that the response is only honored if `ReportBatchItemFailures` is enabled in the event source mapping’s `FunctionResponseTypes`; otherwise the return value is ignored.
Lambda timeout vs visibility timeout: Lambda’s function timeout must be less than the SQS visibility timeout. If the visibility timeout is shorter, the message becomes visible again and is handed to a second invocation while the first is still running it, and a timed-out function never gets the chance to delete the message or report its failure. Set visibility timeout = Lambda timeout × 6.
Concurrent Lambda invocations: Each SQS event source mapping scales Lambda concurrency based on queue depth. In a busy queue, Lambda can scale to hundreds of concurrent invocations. If your downstream can’t handle that throughput, use the event source mapping’s ScalingConfig.MaximumConcurrency to cap it.
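A sketch of setting that cap with boto3’s `update_event_source_mapping` (the `cap_consumer_concurrency` helper and the mapping UUID are illustrative; AWS requires `MaximumConcurrency` to be at least 2):

```python
def cap_consumer_concurrency(lambda_client, mapping_uuid: str,
                             max_concurrency: int) -> None:
    """Cap how many concurrent Lambda invocations the SQS event source
    mapping will drive. `lambda_client` is a boto3 Lambda client."""
    if max_concurrency < 2:
        raise ValueError("MaximumConcurrency must be at least 2")
    lambda_client.update_event_source_mapping(
        UUID=mapping_uuid,
        ScalingConfig={"MaximumConcurrency": max_concurrency},
    )
```

This caps concurrency at the event source, which is gentler than reserved concurrency on the function itself: throttled SQS polls are retried rather than counted as invocation failures.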
FIFO queue patterns
FIFO queues deduplicate messages within a 5-minute window using either:
- MessageDeduplicationId — explicitly set per message
- Content-based deduplication — SHA-256 hash of the message body (enable at queue level)
Message groups: FIFO queues use MessageGroupId to define ordering boundaries. Messages within the same group are processed in order; messages in different groups can be processed in parallel. Design group IDs to maximize parallelism:
- Use customer ID or order ID as group ID to serialize that customer’s events
- Avoid using a single group ID for all messages — this eliminates parallelism and creates a bottleneck
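A sketch of building FIFO send parameters along these lines (the `fifo_message_params` helper and the per-order grouping are illustrative assumptions):

```python
import hashlib

def fifo_message_params(queue_url: str, order_id: str, body: str) -> dict:
    """Build send_message kwargs for a FIFO queue: one message group per
    order serializes that order's events while different orders proceed
    in parallel; the deduplication ID makes a resend of the same body
    within the 5-minute window a no-op."""
    dedup_id = hashlib.sha256(f"{order_id}:{body}".encode("utf-8")).hexdigest()
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": order_id,
        "MessageDeduplicationId": dedup_id,
    }

# Usage, assuming sqs = boto3.client("sqs"):
# sqs.send_message(**fifo_message_params(queue_url, "order-1234", payload))
```

If content-based deduplication is enabled on the queue, the explicit `MessageDeduplicationId` can be dropped.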
Cost
SQS charges:
- Standard: $0.40 per million requests
- FIFO: $0.50 per million requests
- First 1 million requests/month: free
At scale, SQS cost is almost always negligible compared to the compute that processes the messages. The exception: high-frequency polling without messages. If consumers poll every second with empty returns, those polls still count as requests. Use long polling (20-second wait) to reduce empty polls:
```python
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20  # Long polling — waits up to 20s for messages
)
```
Long polling reduces empty responses, cuts request count, and lowers cost at low-volume queues.
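For a single always-on consumer, the arithmetic (ignoring the free tier) works out roughly like this:

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30

# Short polling once per second, mostly empty responses:
short_polls = SECONDS_PER_MONTH // 1   # 2,592,000 requests
# Long polling with WaitTimeSeconds=20 on an idle queue:
long_polls = SECONDS_PER_MONTH // 20   # 129,600 requests

cost_per_million = 0.40  # Standard queue pricing from above
print(f"short: ${short_polls / 1e6 * cost_per_million:.2f}/month")  # short: $1.04/month
print(f"long:  ${long_polls / 1e6 * cost_per_million:.2f}/month")   # long:  $0.05/month
```

The absolute dollars are small either way; the difference compounds when a fleet of consumers polls dozens of queues.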
Monitoring the right metrics
| Metric | What it signals |
|---|---|
| ApproximateNumberOfMessagesVisible | Queue backlog — if growing, consumers can’t keep up |
| ApproximateAgeOfOldestMessage | Processing lag — old messages indicate stuck consumers |
| NumberOfMessagesSent | Producer throughput |
| NumberOfMessagesDeleted | Successful processing rate |
| DLQ ApproximateNumberOfMessagesVisible | Processing failures — alert on > 0 |
The most important alarm: ApproximateAgeOfOldestMessage exceeding your SLA. If messages are waiting more than 5 minutes in a queue that should process in seconds, something is wrong with your consumers.
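One way to sketch that alarm with boto3 (the helper name, the `Maximum` statistic, and the three-period evaluation are illustrative choices):

```python
def alarm_on_message_age(cloudwatch, queue_name: str, sla_seconds: int,
                         sns_topic_arn: str) -> None:
    """Alarm when the oldest message in the queue exceeds the processing
    SLA. `cloudwatch` is a boto3 CloudWatch client."""
    cloudwatch.put_metric_alarm(
        AlarmName=f"{queue_name}-oldest-message-age",
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        Statistic="Maximum",
        Period=60,
        EvaluationPeriods=3,  # require a sustained breach, not one noisy datapoint
        Threshold=float(sla_seconds),
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn],
    )
```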
If you’re designing event-driven architecture on AWS or troubleshooting SQS reliability problems, I’m available to help.
Nick Allevato is an AWS Certified Solutions Architect Professional with 20 years of infrastructure experience. He runs Cold Smoke Consulting, an independent AWS consulting practice.