Apr 16, 2026 // aws

AWS Bedrock Knowledge Bases: Building RAG Pipelines on AWS

Bedrock Knowledge Bases is AWS's managed RAG infrastructure. Here's how it works, what it costs, and when to use it versus building your own retrieval pipeline.

Retrieval-Augmented Generation (RAG) is the pattern that makes large language models useful for domain-specific questions. Instead of relying on the model’s training data, you retrieve relevant documents at query time and include them in the prompt. The model reasons over your content, not just what it learned during training.

AWS Bedrock Knowledge Bases is a managed service that handles the retrieval infrastructure — ingestion, embedding, storage, and retrieval — so you don’t have to build it yourself. Here’s how it works and when it’s the right choice.


How Bedrock Knowledge Bases works

The RAG pipeline has four stages:

  1. Ingestion — Documents are split into chunks. Bedrock supports S3 as the data source; documents can be PDF, Word, HTML, Markdown, CSV, or plain text.

  2. Embedding — Each chunk is converted to a vector embedding using an embedding model (Amazon Titan Embeddings, Cohere Embed, or other available models). Embeddings are numerical representations of semantic meaning.

  3. Storage — Embeddings are stored in a vector database. Bedrock supports several vector store backends: Amazon OpenSearch Serverless, Amazon Aurora PostgreSQL (with pgvector), MongoDB Atlas, Pinecone, and Redis Enterprise Cloud.

  4. Retrieval — At query time, the user’s question is embedded using the same model. The vector store finds the N most semantically similar chunks (nearest-neighbor search). Those chunks are passed to the LLM as context.

Bedrock Knowledge Bases manages all four stages as a service. You provide the S3 bucket and connect a vector store; Bedrock handles ingestion jobs, embedding, and the retrieval API.


Cost model

Embedding (during ingestion):

  • Amazon Titan Embeddings G1: $0.0001/1,000 tokens
  • Amazon Titan Embeddings V2: $0.00002/1,000 tokens
  • Cohere Embed: $0.0001/1,000 tokens

For a 10,000-page knowledge base averaging 500 tokens per chunked page: 10,000 × 500 tokens = 5M tokens. At Titan V2 pricing, that is $0.10 to embed the entire corpus. Ingestion embedding is cheap.
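The arithmetic, as a quick sanity check:

```python
# Back-of-envelope ingestion cost for the corpus described above.
PAGES = 10_000
TOKENS_PER_PAGE = 500
TITAN_V2_PER_1K_TOKENS = 0.00002  # USD per 1,000 tokens

total_tokens = PAGES * TOKENS_PER_PAGE              # 5,000,000 tokens
cost = total_tokens / 1_000 * TITAN_V2_PER_1K_TOKENS
print(f"${cost:.2f}")  # $0.10
```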

Vector store:

  • OpenSearch Serverless: $0.24/OCU-hour (minimum 2 OCUs = ~$345/month)
  • Aurora PostgreSQL (pgvector): Standard Aurora pricing, starting ~$50-100/month for small instances
  • External (Pinecone, MongoDB Atlas, Redis): their own pricing, billed separately

Model inference (at query time):

  • Standard Bedrock model pricing applies (Claude Sonnet: $3/million input tokens, $15/million output tokens)
  • Each RAG query adds the retrieved chunks to the prompt — context size affects cost

The OpenSearch Serverless floor is the biggest cost driver. Even with zero queries, you pay $345/month minimum for the vector store. For small knowledge bases or low-query applications, this can make Bedrock Knowledge Bases more expensive than alternatives.
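The floor in numbers, assuming a 30-day month:

```python
# The 2-OCU minimum accrues around the clock regardless of query volume,
# which is why it dominates the bill for small or low-traffic deployments.
OCU_RATE = 0.24        # USD per OCU-hour
MIN_OCUS = 2
HOURS_PER_MONTH = 720  # 30-day month

floor = OCU_RATE * MIN_OCUS * HOURS_PER_MONTH
print(f"${floor:.2f}/month")  # $345.60/month
```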


Vector store selection

Store | Min cost | Best for
--- | --- | ---
OpenSearch Serverless | ~$345/month | Large corpora, high query volume, AWS-native
Aurora PostgreSQL (pgvector) | ~$50-100/month | Smaller corpora, teams already on Aurora, lower cost floor
Pinecone | $70/month (starter) | Teams already using Pinecone
MongoDB Atlas | $0 (free tier) for small corpora | Mixed workloads with MongoDB
Redis Enterprise Cloud | Custom pricing | Lowest-latency retrieval

For most new Bedrock RAG projects: Aurora PostgreSQL with pgvector is the most cost-effective vector store at typical startup/enterprise scales. It avoids the OpenSearch Serverless $345/month floor while still being fully managed and AWS-native.

When to use OpenSearch Serverless: very large corpora (millions of chunks), query volumes high enough that the OCU floor is amortized across actual usage, or teams already familiar with OpenSearch.


Chunking strategies

Chunking strategy significantly affects retrieval quality. Bedrock Knowledge Bases offers three modes:

Fixed-size chunking: Splits documents at a fixed token count (default 300 tokens) with configurable overlap (default 20%). Simple and predictable. Works well for uniform content (product catalogs, FAQ pages). Performs poorly for long-form documents where context spans multiple pages.

Hierarchical chunking: Creates parent and child chunks. Child chunks (small, ~100 tokens) are used for retrieval; parent chunks (larger, ~500 tokens) are returned as context. The child provides precise matching; the parent provides surrounding context. Better retrieval quality for narrative documents.

Semantic chunking: Uses an embedding model to identify natural semantic boundaries in text. Chunks end where the topic changes, not at a token count. Highest quality, highest cost: the embedding model runs during boundary detection, in addition to embedding the final chunks.

Recommendation: Start with fixed-size chunking for prototyping. Move to hierarchical or semantic chunking if retrieval quality is insufficient for your use case. Don’t optimize chunking strategy before validating that retrieval quality is actually the bottleneck.
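For reference, the chunking strategy is set on the data source at creation time. A sketch of the fixed-size configuration shape as I understand the bedrock-agent create_data_source API (the defaults from above made explicit); verify field names against the current SDK docs:

```python
# Chunking configuration for a Knowledge Base data source. This dict is
# passed under vectorIngestionConfiguration={"chunkingConfiguration": ...}
# when calling bedrock-agent create_data_source.
chunking_config = {
    "chunkingStrategy": "FIXED_SIZE",
    "fixedSizeChunkingConfiguration": {
        "maxTokens": 300,        # chunk size (the Bedrock default)
        "overlapPercentage": 20,  # overlap between adjacent chunks
    },
}
```

Hierarchical and semantic chunking use their own configuration sub-objects under the same chunkingConfiguration key.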


The retrieval API

Once a Knowledge Base is created and synced, retrieval is a simple API call:

import boto3

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name='us-east-1')

# Retrieve relevant chunks without generating a response
response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId='YOUR-KB-ID',
    retrievalQuery={'text': 'What is our refund policy for enterprise customers?'},
    retrievalConfiguration={
        'vectorSearchConfiguration': {
            'numberOfResults': 5,  # Return top 5 chunks
            'overrideSearchType': 'HYBRID'  # Semantic + keyword search
        }
    }
)

for result in response['retrievalResults']:
    print(f"Score: {result['score']:.3f}")
    print(f"Content: {result['content']['text'][:200]}")
    print(f"Source: {result['location']['s3Location']['uri']}")
    print()

Hybrid search (semantic + keyword) is available when using OpenSearch Serverless as the vector store. It combines vector similarity search with BM25 keyword search and typically outperforms pure semantic search for queries containing proper nouns, product names, or specific identifiers.

Retrieve-and-Generate API: Bedrock also provides a combined retrieve_and_generate API that retrieves chunks and generates a response in one call, with automatic citation of sources:

response = bedrock_agent_runtime.retrieve_and_generate(
    input={'text': 'What is our enterprise refund policy?'},
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': 'YOUR-KB-ID',
            'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0',
            'retrievalConfiguration': {
                'vectorSearchConfiguration': {'numberOfResults': 5}
            }
        }
    }
)

print(response['output']['text'])
# Response includes citations with source document references
for citation in response['citations']:
    print(citation['retrievedReferences'])

When to use Bedrock Knowledge Bases vs. building your own

Use Bedrock Knowledge Bases when:

  • You want to minimize infrastructure management — no embedding pipeline to build, no vector store to operate
  • Your data source is S3 and you want managed sync (Bedrock can automatically re-ingest when S3 objects change)
  • You need AWS-native IAM integration and audit trails via CloudTrail
  • You’re already in the Bedrock ecosystem and want consistent tooling

Build your own retrieval pipeline when:

  • You need chunking strategies beyond what Bedrock supports (e.g., custom document parsers, table extraction from PDFs)
  • You need retrieval from non-S3 sources (databases, APIs, real-time data)
  • You need sub-$50/month cost with a small knowledge base (Aurora pgvector directly, without the Bedrock managed layer)
  • You need retrieval latency below what the managed API provides
  • You want to use embedding models not available in Bedrock

The hybrid approach: Use Bedrock for the LLM inference (Claude via Bedrock API), build your own retrieval pipeline with pgvector or Pinecone, and call the LLM with the retrieved context manually. You get Bedrock’s model access without the Knowledge Bases managed infrastructure.


IAM configuration

Knowledge Bases require an execution role with access to the S3 data source, the embedding model, and the vector store:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-documents-bucket",
        "arn:aws:s3:::your-documents-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "aoss:APIAccessAll"
      ],
      "Resource": "arn:aws:aoss:us-east-1:ACCOUNT-ID:collection/YOUR-COLLECTION-ID"
    }
  ]
}

Scope the S3 permissions to the specific bucket and prefix containing your knowledge base documents. Wildcard S3 access on a Knowledge Base role is a common misconfiguration.


Production considerations

Sync frequency: Bedrock Knowledge Bases syncs data from S3 when you manually trigger a sync or when you configure an event-driven sync (S3 event → Lambda → start ingestion job). For frequently updated content, build an automated sync pipeline. For static reference content, manual sync is fine.

Metadata filtering: Bedrock Knowledge Bases supports attaching metadata to chunks during ingestion (via .metadata.json files alongside documents in S3). At query time, you can filter results by metadata — e.g., only retrieve chunks from documents tagged department: legal or product: enterprise. Useful for multi-tenant knowledge bases or role-based access to different document sets.
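The filter shape, as passed under the filter key of vectorSearchConfiguration in a retrieve or retrieve_and_generate call; single conditions use operators like equals, and andAll/orAll compose them:

```python
# Only retrieve chunks from documents tagged department: legal.
legal_only = {
    "equals": {"key": "department", "value": "legal"}
}

# Composite filter: legal documents for the enterprise product.
legal_enterprise = {
    "andAll": [
        {"equals": {"key": "department", "value": "legal"}},
        {"equals": {"key": "product", "value": "enterprise"}},
    ]
}
```

At query time this becomes, e.g., retrievalConfiguration={'vectorSearchConfiguration': {'numberOfResults': 5, 'filter': legal_only}}.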

Evaluation: Retrieval quality is hard to evaluate without a ground truth dataset. Build a set of question-answer pairs where you know the correct source document, then measure whether retrieval returns that document in the top-N results. Poor retrieval quality is almost always a chunking or embedding model problem, not a model reasoning problem.
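A minimal recall@N harness over the source URIs the retrieve API returns; retrieve_fn is whatever wraps your retrieval call (a stub stands in for it here):

```python
def recall_at_n(eval_set, retrieve_fn, n=5):
    """Fraction of questions whose known source document appears
    in the top-n retrieved results.

    eval_set: list of (question, expected_source_uri) pairs
    retrieve_fn: question -> ordered list of retrieved source URIs
    """
    hits = 0
    for question, expected in eval_set:
        if expected in retrieve_fn(question)[:n]:
            hits += 1
    return hits / len(eval_set)

# Toy usage with a stubbed retriever in place of a real retrieve call:
stub = lambda q: ["s3://kb/refunds.md", "s3://kb/pricing.md"]
print(recall_at_n([("refund policy?", "s3://kb/refunds.md")], stub))  # 1.0
```

Run it once per chunking strategy or embedding model candidate and compare; the score differences tell you where the bottleneck is.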


If you’re building a RAG pipeline on AWS and want help with architecture, vector store selection, or retrieval quality, I’m available for an engagement.


Nick Allevato is an AWS Certified Solutions Architect Professional with 20 years of infrastructure experience. He runs Cold Smoke Consulting, an independent AWS consulting practice.

