Most AWS Bedrock proof-of-concepts don’t make it to production.
The PoC works. The demo impresses stakeholders. Then the engineering team starts asking the hard questions — latency, cost, compliance, observability — and the timeline stretches from “next sprint” to “next quarter” to “maybe next year.”
I’ve built production Bedrock systems and helped teams rescue stalled PoCs. The failure pattern is almost always the same. Here’s what it looks like and how to avoid it.
What goes wrong between PoC and production
The PoC calls the model directly with no caching or throttle control. Works fine in a demo. In production, you hit API rate limits during load spikes, costs become unpredictable, and you have no visibility into what’s driving usage.
There’s no prompt versioning. The PoC has one prompt hardcoded in the application. In production, you need to iterate on prompts, A/B test them, and roll back when a change degrades output quality. Without version control for prompts, every change is a deployment.
Retrieval is bolted on, not designed. Most PoCs add a vector search at the last minute. The chunk sizes are wrong for the use case, the embeddings aren’t optimized, and the retrieval quality is inconsistent. When users notice the model “making things up,” this is usually why.
There’s no evaluation framework. Nobody defined what “good” looks like. In production, you need automated evals that run on every model or prompt change — otherwise you’re shipping blind.
HIPAA, SOC 2, or other compliance requirements weren’t considered. If you’re in healthcare or finance, the model you used for the PoC may not be HIPAA-eligible. The data you’re sending may not be allowed to leave your VPC. These aren’t minor details — they can require re-architecting the entire system.
The production architecture
Here’s what a production Bedrock system actually needs.
1. Model invocation layer
Don’t call Bedrock directly from your application. Use an abstraction layer that handles:
- Retry logic with exponential backoff — the Bedrock API enforces per-model throttling limits; you need graceful degradation, not hard failures
- Model fallback — if Claude 3.5 Sonnet is throttled, fall back to Claude 3 Haiku for non-critical requests
- Request/response logging — every invocation should be logged with model, prompt hash, token counts, latency, and output hash (not the full content — that’s a compliance issue)
- Cost attribution — tag invocations by feature, user segment, or tenant so you know what’s expensive
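A minimal sketch of the retry-and-fallback part of this layer, with the actual Bedrock call left as an injected callable (the model names and the `Throttled` exception are illustrative; in practice you would map boto3's `ThrottlingException` onto it):

```python
import random
import time

class Throttled(Exception):
    """Raised by the invoke callable when the model is rate-limited."""

def invoke_with_fallback(invoke, models, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Try each model in order; retry throttled calls with exponential backoff.

    `invoke` is any callable taking a model ID and returning a response,
    e.g. a thin wrapper around boto3's bedrock-runtime `converse` call.
    `sleep` is injectable so the backoff is testable.
    """
    for model_id in models:
        for attempt in range(max_retries):
            try:
                return model_id, invoke(model_id)
            except Throttled:
                # Exponential backoff with a little jitter before retrying.
                sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
        # Retries exhausted on this model: fall through to the next one.
    raise RuntimeError("all models throttled: " + ", ".join(models))
```

Note the fallback order is per-request: critical requests might pass a single-model list, while non-critical ones list a cheaper fallback like Haiku.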
AWS Bedrock now supports Inference Profiles for cross-region routing, which helps with availability and latency — worth using if you’re at scale.
2. Prompt management
Treat prompts like code. Options:
- AWS Systems Manager Parameter Store — simple, cheap, supports versioning
- DynamoDB with version tracking — more flexible if you’re A/B testing
- Dedicated prompt management service (Langfuse, PromptLayer) — if you need evaluation built in
The key: prompts should be deployable independently of application code. Prompt changes are frequent; full deployments are expensive.
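With the Parameter Store option, version pinning comes for free: `GetParameter` accepts a `name:version` selector, so rolling back a bad prompt is a config change, not a deployment. A sketch, with an illustrative parameter name and an injectable client:

```python
def get_prompt(name, version=None, ssm=None):
    """Fetch a prompt template from SSM Parameter Store.

    Parameter Store keeps a version history automatically; appending
    ":<version>" to the name pins a specific version, so a bad prompt
    change rolls back without redeploying application code.
    """
    if ssm is None:
        import boto3  # real AWS client only when none is injected
        ssm = boto3.client("ssm")
    selector = f"{name}:{version}" if version is not None else name
    return ssm.get_parameter(Name=selector)["Parameter"]["Value"]

# template = get_prompt("/myapp/prompts/summarize", version=3)
```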
3. RAG pipeline (if you’re doing retrieval)
The retrieval-augmented generation architecture has several tunable decisions that the PoC usually gets wrong:
Chunk size: Too small and you lose context; too large and you dilute relevance. For most document types, 512–1024 tokens with 10–20% overlap works well. Experiment with your actual content.
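The chunking logic itself is simple; the tuning is what matters. A sketch of fixed-size chunking with overlap (a real pipeline would tokenize with the embedding model's tokenizer rather than assume a pre-tokenized list):

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks with overlap.

    Overlap keeps sentences that straddle a chunk boundary retrievable
    from either side; 64 of 512 tokens is the ~10-20% range above.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```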
Embedding model: Amazon Bedrock offers Amazon Titan Embeddings (cheap, good baseline) and Cohere Embed (better for multilingual or specialized domains). Match the model to your content type.
Knowledge base options:
- Amazon Bedrock Knowledge Bases — fully managed, integrates with S3, OpenSearch, or Aurora pgvector. Fastest path to production.
- DIY with pgvector — more control, cheaper at scale, requires more engineering
- OpenSearch Serverless — good for large knowledge bases needing hybrid search
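With the managed Knowledge Bases option, retrieval is a single API call against the bedrock-agent-runtime `retrieve` operation. A sketch that builds the request (the Knowledge Base ID is a placeholder):

```python
def kb_retrieve_request(kb_id, query, top_k=5):
    """Build kwargs for bedrock-agent-runtime's `retrieve` call.

    The managed Knowledge Base handles chunking, embedding, and the
    vector store; your application only supplies the query and how
    many results it wants back.
    """
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    }

# client = boto3.client("bedrock-agent-runtime")
# results = client.retrieve(**kb_retrieve_request("KB1234ABCD", "refund policy"))["retrievalResults"]
```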
Reranking: Add a reranking step (Cohere Rerank via Bedrock) if retrieval quality is inconsistent. It significantly improves results without changing the rest of the architecture.
4. Guardrails
Bedrock Guardrails lets you define content policies — topics to block, PII handling, grounding checks — that apply to every model invocation without changing your application code.
For production systems, you want at minimum:
- PII detection and redaction (especially if users can submit arbitrary input)
- Grounding check (did the model’s response actually come from your retrieved context?)
- Topic filtering (prevent the model from answering questions outside its intended scope)
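Wiring a guardrail into the Converse API is a matter of attaching a `guardrailConfig` to each request. A sketch of the request shape (the guardrail ID and model ID are placeholders; in production you would pin a published guardrail version rather than `DRAFT`):

```python
def converse_request(model_id, user_text, guardrail_id, guardrail_version="DRAFT"):
    """Build kwargs for bedrock-runtime's `converse` call with a guardrail.

    Attaching the guardrail here means PII redaction, topic filtering,
    and grounding checks run on every invocation without changes to
    application logic.
    """
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "guardrailConfig": {
            "guardrailIdentifier": guardrail_id,
            "guardrailVersion": guardrail_version,
            "trace": "enabled",  # surfaces which policy fired in the response
        },
    }

# client = boto3.client("bedrock-runtime")
# response = client.converse(**converse_request("anthropic.claude-3-haiku-20240307-v1:0", q, "gr-abc123"))
```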
5. Observability
You can’t debug a production AI system without good observability. Minimum viable setup:
- CloudWatch Logs for invocation logs (token counts, latency, model ID)
- CloudWatch Metrics for latency percentiles (p50, p95, p99) and error rates
- X-Ray tracing across the full request path (API → Lambda → Bedrock → response)
- Custom metrics for business-level quality signals (user feedback, downstream conversion)
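The last item is the one teams most often skip. A sketch of emitting a business-level quality signal as a custom CloudWatch metric (the namespace and dimension names are illustrative; the client is injectable for testing):

```python
def record_quality_signal(feature, score, cloudwatch=None):
    """Emit a business-level quality metric to CloudWatch.

    Tagging by feature lets dashboards and alarms slice quality and
    cost signals the same way the cost-attribution tags do.
    """
    metric = {
        "MetricName": "ResponseQuality",
        "Dimensions": [{"Name": "Feature", "Value": feature}],
        "Value": score,
        "Unit": "None",
    }
    if cloudwatch is None:
        import boto3
        cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(Namespace="GenAI/Quality", MetricData=[metric])
    return metric

# record_quality_signal("doc-qa", 0.92)  # e.g. thumbs-up rate for the doc-QA feature
```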
If you’re using LangChain or LlamaIndex, add LangSmith or Langfuse for chain-level tracing — standard CloudWatch won’t give you the prompt/response visibility you need.
6. Compliance
If you’re in a regulated industry:
HIPAA: Bedrock is HIPAA-eligible with a BAA from AWS. But you need to use HIPAA-eligible services throughout: VPC endpoints to avoid internet transit, encrypted S3 for document storage, CloudTrail data events enabled on relevant buckets, and no PHI in CloudWatch logs (log prompt hashes, not prompt content).
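The "hashes, not content" rule is easy to implement and easy to verify. A sketch of a log record that never carries PHI (field names are illustrative):

```python
import hashlib

def loggable_invocation(model_id, prompt, output, latency_ms):
    """Build a log record that carries no PHI: hashes instead of content.

    The hash still lets you correlate repeated prompts and join log
    lines to an out-of-band prompt registry, without putting protected
    content in CloudWatch.
    """
    return {
        "model_id": model_id,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
        "latency_ms": latency_ms,
    }
```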
SOC 2: Focus on access control (who can invoke which models), audit logging (CloudTrail on all Bedrock API calls), and data retention (how long are inputs/outputs stored?).
What a Bedrock production engagement looks like
When I work with a team to take a Bedrock PoC to production, the engagement covers:
- Architecture review — what needs to change before this can ship
- Infrastructure build — VPC setup, IAM roles, Knowledge Base or RAG pipeline, guardrails
- Evaluation framework — define “good” outputs, build automated evals
- Observability — CloudWatch + X-Ray + custom metrics
- Load testing — validate behavior under realistic traffic, identify throttling thresholds
- Compliance review — if applicable
Typical timeline: 4–8 weeks for a focused engagement. Greenfield is faster than rescue.
The shortcut
If you want to move fast without building all of this yourself, Amazon Bedrock Agents handles the orchestration, tool use, and session management. It’s worth using if your use case fits the managed model. The tradeoff is less control over the execution graph and higher per-invocation cost.
For simple question-answering over a document corpus: use Bedrock Knowledge Bases. For complex multi-step reasoning or tool use: Bedrock Agents. For custom orchestration with specific control requirements: build it yourself with the Converse API and Lambda.
Get started
If you have a Bedrock PoC that needs to get to production, or you’re starting fresh and want to build it right the first time, I’m available for scoped engagements.
Contact me or email nick@coldsmokeconsulting.com.
Nick Allevato is an AWS Certified Solutions Architect Professional with 20 years of infrastructure experience. He runs Cold Smoke Consulting, an independent AWS consulting practice specializing in cloud architecture, cost optimization, and production AI/GenAI systems.