AWS Step Functions vs SQS vs EventBridge: Choosing the Right Orchestration Tool

SQS, Step Functions, and EventBridge all route work between systems, and teams frequently reach for the wrong one: SQS for a workflow that needs visibility into individual step state, Step Functions for a simple queue that just needs decoupled processing, EventBridge for point-to-point service calls that should be direct API calls instead.

The choice matters: it affects cost at scale, operational visibility, failure handling, and how hard the system is to debug when something goes wrong.

The one-line version

SQS: One queue, one job type, one consumer. Best for decoupled async processing where visibility into individual message state doesn’t matter.
Step Functions: Multi-step workflows with explicit state, branching, and error handling. Best when you need to know exactly where a workflow is and what failed.
EventBridge: Event routing and fan-out. Best when one event should trigger multiple independent consumers, or when you’re integrating AWS services and SaaS applications.

SQS: the workhorse

SQS is a message queue. A producer puts messages in; a consumer takes them out and processes them. Simple, reliable, cheap.

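Pricing: $0.40 per million requests (standard queue). SQS FIFO queues (guaranteed ordering, exactly-once processing) are $0.50 per million. Every API call (send, receive, delete) counts as a request, so a fully processed message usually costs more than one request, though batching up to 10 messages per call offsets this. Still very low cost at any volume.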

Where it excels:

Decoupling producer and consumer. An API endpoint that needs to return quickly can drop work into SQS and return 202. A worker picks it up asynchronously. The API doesn’t wait for processing to complete.
Rate limiting and backpressure. If your downstream system (a database, an external API) can only handle 100 requests per second and your producer generates 1,000/second, SQS absorbs the burst and the consumer processes at its own pace.
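Retry and dead-letter queues. Messages that fail processing become visible again after a configurable visibility timeout. After maxReceiveCount failures, they go to a dead-letter queue (DLQ) for inspection. SQS does not back off on its own: the retry interval is the visibility timeout, so a consumer that wants exponential backoff has to extend a failed message’s visibility timeout itself (ChangeMessageVisibility).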
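Lambda integration. Lambda’s SQS trigger is batch-aware: it delivers up to 10,000 messages per batch for standard queues (configurable; the default is 10, and FIFO queues are capped at 10). With ReportBatchItemFailures enabled, the function reports per-message success/failure, so only the failed messages return to the queue for retry.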
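The per-message failure reporting described above can be sketched as a Lambda handler. This is a minimal sketch, assuming ReportBatchItemFailures is enabled on the event source mapping; process() and its order_id check are hypothetical stand-ins for real business logic.

```python
import json


def process(body: dict) -> None:
    """Hypothetical business logic; raises to signal a failed message."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event: dict, context: object) -> dict:
    """Process an SQS batch and report only the messages that failed."""
    failures = []
    for record in event.get("Records", []):
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message becomes visible again after its visibility
            # timeout; the rest of the batch is deleted from the queue.
            failures.append({"itemIdentifier": record["messageId"]})
    # An empty list tells Lambda the whole batch succeeded.
    return {"batchItemFailures": failures}
```

Without this response shape, one bad message causes the whole batch to be retried, which is why the setting matters more as batch sizes grow.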

Where it doesn’t work:

Multi-step workflows. SQS has no concept of workflow state. If you need “step 1 → step 2 → step 3, with different error handling per step,” you’re building that coordination layer yourself. Every step needs its own queue, and you’re manually tracking where work is.
Visibility into in-flight work. SQS tells you the approximate number of messages in the queue. It doesn’t tell you which specific message is being processed or what step it’s on.
Long-running work with intermediate state. SQS retains messages for at most 14 days, and the visibility timeout tops out at 12 hours. For work that takes longer than 12 hours with intermediate state, you need something else.

Common mistake: Using SQS when you need a workflow. Teams build “workflow” systems on SQS by chaining queues — Queue A → Lambda → Queue B → Lambda → Queue C — and quickly discover that debugging failures (which step failed? where is this particular job?) requires reading through CloudWatch Logs and reconstructing state manually. Step Functions exists for this.

Step Functions: explicit workflow orchestration

Step Functions is a state machine service. You define states (Task, Choice, Wait, Parallel, Map) and transitions between them. AWS manages the state of every workflow execution.

Pricing: Two modes:

Standard Workflows: $0.025 per 1,000 state transitions. An 8-step workflow costs $0.0002 per execution. At 1 million executions/month with 8 steps each: $200/month.
Express Workflows: $1.00 per million requests ($0.000001 per execution) + $0.00001667 per GB-second of duration. Designed for high-volume, short-duration workflows (5 minutes maximum). Much cheaper per execution for simple workflows.

For high-volume simple workflows (millions of executions/day), Express is substantially cheaper. For complex, long-running, or audit-critical workflows, Standard provides full execution history.

Where it excels:

Visibility and debuggability. Every execution has a full timeline in the AWS console: which step it’s on, how long each step took, what the input/output was at each state. When something fails, you see exactly where and why without log archaeology.
Complex branching and error handling. Step Functions natively supports: Choice (if/else), Parallel (run branches simultaneously), Map (process an array of items in parallel), Retry (with exponential backoff, per state), Catch (route to a failure handler when retries exhausted). Building this in application code is significantly more complex.
Long-running workflows. Standard Workflows can run for up to a year. A workflow that starts a batch job, waits for it to complete (using a WaitForTaskToken callback), and then processes the results — all over 48 hours — is straightforward in Step Functions.
Human approval steps. The WaitForTaskToken pattern pauses a workflow until a task token is returned, for as long as the one-year Standard limit (or a shorter timeout you set) allows. An approval flow that emails a manager, waits for approval, and then continues the workflow is a natural fit.
AWS service integrations. Step Functions has optimized integrations for Lambda, ECS, DynamoDB, SNS, SQS, Bedrock, Glue, SageMaker, and more — often without writing Lambda to wrap the API call. Run an ECS task, wait for it to complete, and store the result in DynamoDB — all in the state machine definition.
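To make the retry, catch, callback, and branching features above concrete, here is a small state machine definition built as a Python dict and serializable to Amazon States Language JSON. It is a sketch, not a production workflow: the ARNs, state names, the two-day timeout, and the $.approved field are all assumptions made for illustration.

```python
import json

# Hypothetical ARNs, used only for illustration.
CHARGE_FN = "arn:aws:lambda:us-east-1:123456789012:function:charge-card"
APPROVAL_FN = "arn:aws:lambda:us-east-1:123456789012:function:request-approval"

definition = {
    "Comment": "Sketch: per-state retry/catch, a callback pause, and a branch",
    "StartAt": "ChargeCard",
    "States": {
        "ChargeCard": {
            "Type": "Task",
            "Resource": CHARGE_FN,
            # Retry up to 3 times, waiting 2s, 4s, 8s between attempts.
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 2,
                "BackoffRate": 2.0,
                "MaxAttempts": 3,
            }],
            # Once retries are exhausted, route to the failure handler.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
            "Next": "WaitForApproval",
        },
        "WaitForApproval": {
            "Type": "Task",
            # Pauses until something calls SendTaskSuccess/SendTaskFailure
            # with the token passed in the payload.
            "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
            "Parameters": {
                "FunctionName": APPROVAL_FN,
                "Payload": {"token.$": "$$.Task.Token", "order.$": "$"},
            },
            "TimeoutSeconds": 172800,  # give up after 48 hours (assumed)
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
            "Next": "IsApproved",
        },
        "IsApproved": {
            "Type": "Choice",
            # Assumes the approver's callback output contains {"approved": true}.
            "Choices": [{"Variable": "$.approved", "BooleanEquals": True, "Next": "Done"}],
            "Default": "HandleFailure",
        },
        "HandleFailure": {"Type": "Fail", "Error": "OrderFailed", "Cause": "See execution history"},
        "Done": {"Type": "Succeed"},
    },
}


def referenced_states(defn: dict) -> set:
    """Collect every state name that some transition points at."""
    names = {defn["StartAt"]}
    for state in defn["States"].values():
        for key in ("Next", "Default"):
            if key in state:
                names.add(state[key])
        for rule in state.get("Catch", []) + state.get("Choices", []):
            names.add(rule["Next"])
    return names


asl_json = json.dumps(definition, indent=2)  # paste-able state machine JSON
```

The referenced_states helper is a cheap sanity check that no transition points at a state that does not exist, which catches typos before deployment.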

Where it doesn’t work:

High-throughput simple queuing. If you’re just moving messages through a single processing step at millions per hour, SQS is far cheaper (roughly $0.40 per million requests versus $25 per million Standard state transitions) and simpler.
Real-time, sub-second processing. Step Functions Express Workflows have some execution overhead. Not appropriate for paths where sub-100ms end-to-end latency matters.
Event fan-out. If one event should trigger 10 different consumers, Step Functions is the wrong tool. EventBridge is.

Common mistake: Using Step Functions as an expensive queue. A Step Functions Standard Workflow wrapping a single Lambda function — where you could just use SQS + Lambda — costs orders of magnitude more per invocation.

EventBridge: event routing and integration

EventBridge is an event bus. Producers put events on the bus; rules match events to targets; targets receive the events. One event can trigger multiple independent targets simultaneously.

Pricing: $1.00 per million events published to custom event buses. EventBridge Pipes (point-to-point integrations) are priced per request processed. EventBridge Scheduler is $1.00 per million scheduled invocations.

Where it excels:

Fan-out to multiple consumers. “Order placed” event → trigger: inventory service, notification service, analytics service, fraud check service. EventBridge routes one event to all four targets simultaneously. No coupling between producers and consumers.
AWS service integration. EventBridge natively integrates with 200+ AWS services and SaaS applications as sources and targets. EC2 instance state changes, CodePipeline events, GuardDuty findings, Stripe webhooks — all can trigger EventBridge rules without custom polling code.
Decoupled service architecture. Services publish events to EventBridge without knowing who consumes them. New consumers can be added without changing the producer. This is the event-driven architecture pattern.
Scheduled tasks. EventBridge Scheduler is the modern replacement for CloudWatch Events cron expressions. Schedule Lambda functions, ECS tasks, or API calls on any cron or rate expression.
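The fan-out behavior above comes from each rule carrying its own event pattern. The sketch below models that idea with a deliberately simplified matcher: it handles only exact-value lists and nested objects, not the full EventBridge pattern language (prefix, numeric, anything-but, and so on). The rule names, the com.example.orders source, and the event fields are hypothetical.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified model of rule matching: every pattern key must be present,
    and the event's value must be one of the listed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern (e.g. under "detail"): recurse into the object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# One rule per consumer; the producer knows about none of them.
RULES = {
    "inventory": {"source": ["com.example.orders"], "detail-type": ["OrderPlaced"]},
    "notifications": {"source": ["com.example.orders"], "detail-type": ["OrderPlaced"]},
    "analytics": {"source": ["com.example.orders"]},
    "fraud-check": {
        "source": ["com.example.orders"],
        "detail-type": ["OrderPlaced"],
        "detail": {"payment": {"method": ["card"]}},
    },
}


def targets_for(event: dict) -> list:
    """Return the names of all rules whose pattern matches the event."""
    return [name for name, pattern in RULES.items() if matches(pattern, event)]


order_placed = {
    "source": "com.example.orders",
    "detail-type": "OrderPlaced",
    "detail": {"order_id": "o-1", "payment": {"method": "card"}},
}
```

Adding a fifth consumer means adding a fifth rule; the producer and the existing four rules stay untouched.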

Where it doesn’t work:

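Durable buffering and consumer-paced delivery. EventBridge retries delivery to a target for up to 24 hours by default and can send events it ultimately fails to deliver to an SQS dead-letter queue configured on the target, but it’s not a durable queue: it pushes events and gives a slow consumer no way to pull at its own pace. When a consumer needs buffering or backpressure, make an SQS queue the rule’s target.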
Stateful workflows. EventBridge has no concept of workflow state. It fires events at targets; what happens after that is the target’s responsibility.
High-throughput point-to-point. If Service A always sends messages to Service B with no fan-out needed, a direct API call or SQS queue is simpler and cheaper than EventBridge.

Common mistake: Using EventBridge for point-to-point service calls. If your event rule has exactly one target and no conditional routing, you’ve added EventBridge overhead (latency, cost, another service to monitor) for no benefit. Direct API call or SQS.

Decision matrix

Requirement	Use
Simple async job queue, one consumer	SQS
Rate limiting and backpressure	SQS
Multi-step workflow with visible state	Step Functions
Branching, parallel steps, long-running	Step Functions
Human approval or external callback	Step Functions Standard
High-volume simple workflow (<5 min)	Step Functions Express
One event → multiple consumers	EventBridge
AWS service integration / SaaS events	EventBridge
Scheduled tasks / cron	EventBridge Scheduler
Point-to-point at very high volume	Direct call or SQS

Common patterns

SQS + Lambda (the workhorse):

API → SQS → Lambda (consumer) → downstream

Simple, cheap, auto-scaling. Lambda scales to SQS queue depth.
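The producer side of this pattern might look like the sketch below: the API handler enqueues the job and returns 202 without waiting for processing. The queue URL and the job_id field are hypothetical, and the sqs parameter exists only so the function can be exercised without AWS credentials; by default it falls back to a boto3 client.

```python
import json
import uuid

# Hypothetical queue URL, used only for illustration.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"


def enqueue_job(payload: dict, sqs=None) -> dict:
    """Drop the job on the queue and acknowledge immediately with 202."""
    if sqs is None:
        import boto3  # imported lazily so the sketch runs without boto3 in tests
        sqs = boto3.client("sqs")
    job_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"job_id": job_id, **payload}),
    )
    # The caller gets an ID to poll or correlate on; the work happens later.
    return {"statusCode": 202, "body": json.dumps({"job_id": job_id})}
```

Returning the generated job_id gives the client something to correlate on later, since SQS itself will not tell it where the job is.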

Step Functions + Lambda (orchestration):

Trigger → Step Functions → Lambda (step 1) → Lambda (step 2) → notify

Full visibility, explicit error handling per step, execution history for audit.

EventBridge + Step Functions (event-driven orchestration):

Service A publishes event → EventBridge → Step Functions (starts workflow)
                                       → Service B
                                       → Service C

Event triggers fan-out to multiple targets, one of which is a multi-step workflow.

SQS → Step Functions (batch processing with orchestration):

SQS → Lambda → Step Functions (per-item workflow)

SQS absorbs burst from producer; Lambda reads batches and starts one Step Functions execution per item. Each item has full workflow visibility and error handling.
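A minimal sketch of the Lambda in the middle of this pattern follows. It assumes ReportBatchItemFailures is enabled on the SQS trigger; the state machine ARN is hypothetical, and the sfn parameter is there only so the function can be tested with a stand-in client. Standard queues deliver at least once, so the per-item workflow should tolerate an occasional duplicate start.

```python
import json

# Hypothetical state machine ARN, used only for illustration.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:per-item"


def start_workflows(event: dict, context: object, sfn=None) -> dict:
    """Start one Step Functions execution per SQS message in the batch."""
    if sfn is None:
        import boto3  # imported lazily so the sketch runs without boto3 in tests
        sfn = boto3.client("stepfunctions")
    failures = []
    for record in event.get("Records", []):
        try:
            # Execution input must be valid JSON; reject bad messages early.
            json.loads(record["body"])
            sfn.start_execution(
                stateMachineArn=STATE_MACHINE_ARN,
                input=record["body"],
            )
        except Exception:
            # Leave this message on the queue for retry (and eventually the DLQ).
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

The Lambda does no real work itself; it only hands each item to a workflow, so throttling or a malformed message affects that one message rather than the whole batch.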

Cost comparison example

Processing 1 million jobs/month, each going through 3 steps:

Approach	Est. Cost/Month
3x SQS queues + Lambda	~$1.20 SQS counting sends only (up to ~$3.60 with unbatched receives and deletes) + Lambda compute
Step Functions Standard (3 states × 1M)	$75 + Lambda compute
Step Functions Express	~$1 in requests + duration charges (GB-seconds) + Lambda compute

Standard Step Functions is expensive for high-volume simple workflows. Express is not. Know which mode you need before committing.
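The arithmetic behind a comparison like this can be sketched in a few lines. The rates below are assumptions taken from published list prices at the time of writing ($0.40 per million SQS requests, $0.025 per 1,000 Standard transitions, $1.00 per million Express requests plus $0.00001667 per GB-second); they ignore free tiers, tiered discounts, and Express billing granularity, and the 1-second, 64 MB Express profile is an illustrative guess. Lambda compute is excluded because it is the same in every approach.

```python
# Assumed list prices (USD); check current pricing before relying on them.
SQS_PER_REQUEST = 0.40 / 1_000_000
STANDARD_PER_TRANSITION = 0.025 / 1_000
EXPRESS_PER_REQUEST = 1.00 / 1_000_000
EXPRESS_PER_GB_SECOND = 0.00001667


def sqs_cost(jobs: int, queues: int, requests_per_message: int = 1) -> float:
    """Chained queues: each job passes through every queue once.
    requests_per_message=3 models unbatched send + receive + delete."""
    return jobs * queues * requests_per_message * SQS_PER_REQUEST


def standard_cost(jobs: int, transitions_per_job: int) -> float:
    """Standard Workflows bill per state transition."""
    return jobs * transitions_per_job * STANDARD_PER_TRANSITION


def express_cost(jobs: int, avg_seconds: float, memory_mb: int = 64) -> float:
    """Express Workflows bill per request plus duration in GB-seconds."""
    gb_seconds = jobs * avg_seconds * (memory_mb / 1024)
    return jobs * EXPRESS_PER_REQUEST + gb_seconds * EXPRESS_PER_GB_SECOND


if __name__ == "__main__":
    jobs = 1_000_000
    print(f"SQS (sends only):   ${sqs_cost(jobs, 3):.2f}")
    print(f"SQS (unbatched):    ${sqs_cost(jobs, 3, 3):.2f}")
    print(f"Standard (3 steps): ${standard_cost(jobs, 3):.2f}")
    print(f"Express (1s, 64MB): ${express_cost(jobs, 1.0):.2f}")
```

Under these assumptions the Standard figure is dominated by the step count, while the Express figure depends mostly on how long each execution runs, so the two modes respond to different optimizations.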

Getting the architecture right

Choosing between these services is one of those decisions that’s easy to get wrong early and expensive to undo later. If you’re designing a new workflow system or evaluating your current architecture, I’m available to help.

Nick Allevato is an AWS Certified Solutions Architect Professional with 20 years of infrastructure experience. He runs Cold Smoke Consulting, an independent AWS consulting practice.