Article 1: Understanding Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an AI architecture pattern that combines the reasoning capabilities of Large Language Models (LLMs) with external knowledge sources. Instead of relying solely on information learned during model training, RAG retrieves relevant information from documents, databases, APIs, or knowledge repositories and provides that information as context to the model before generating a response.

Traditional LLMs only know what they learned during training unless new context is provided at runtime. They also cannot automatically access private PDFs, internal documents, databases, or company knowledge. RAG addresses both limitations by adding a retrieval layer that first finds relevant information and then passes it to the language model before it answers.

In simple terms, RAG means: search first, answer second.

Why RAG Exists

Reduce hallucinations by grounding responses in trusted data.
Allow models to answer questions about private documents.
Avoid expensive model retraining.
Enable near real-time access to changing business information.
Provide citations and source attribution.
Support enterprise knowledge management systems.

Core Components of a RAG System

Document Repository
Document Ingestion Pipeline
Chunking Engine
Embedding Model
Vector Database
Retriever
Reranker (optional, but useful for large document sets)
Large Language Model
Chat Application

High-Level Architecture

Ingestion Flow

PDF / Documents
  |
  v
Text Extraction / OCR
  |
  v
Chunking
  |
  v
Embeddings
  |
  v
Vector Database


Question Answering Flow

User Question
  |
  v
Question Embedding
  |
  v
Retriever
  |
  v
Relevant Chunks
  |
  v
LLM Prompt
  |
  v
Answer with Citations

Document Ingestion Pipeline

The ingestion pipeline prepares documents for retrieval. This process is sometimes called indexing or knowledge base creation.

PDF Documents
     |
     v
Text Extraction
     |
     v
Chunking
     |
     v
Embedding Generation
     |
     v
Vector Database Storage

Text Extraction

The first step is extracting text from PDFs, Word documents, web pages, or structured data sources. For scanned PDFs, OCR technology may be required.

Chunking

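A large document should usually not be sent to the model as a single prompt: it may exceed the model's context window, it raises cost and latency, and it buries the relevant passage in unrelated text. Instead, documents are split into smaller chunks so the system can retrieve only the most relevant parts for each user question.

Common chunking strategies: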

Fixed-size chunking
Paragraph-based chunking
Semantic chunking
Section-aware chunking
Hierarchical chunking

Example chunk sizes:

500 tokens
750 tokens
1000 tokens

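Adjacent chunks often overlap so that a sentence or idea falling on a chunk boundary keeps its surrounding context in at least one chunk. Example overlap sizes: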

50 token overlap
100 token overlap
150 token overlap
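
Fixed-size chunking with overlap can be sketched in a few lines. The example below is a minimal illustration, not a production chunker: the function name `chunk_tokens` is hypothetical, and the list of placeholder strings stands in for the output of a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Assumption: placeholder tokens stand in for a real tokenizer's output.
tokens = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=500, overlap=50)

for i, chunk in enumerate(chunks):
    print(i, len(chunk), chunk[0], chunk[-1])
```

With 1200 tokens, a chunk size of 500, and an overlap of 50, the window advances 450 tokens at a time, so the last 50 tokens of each chunk reappear at the start of the next one.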

Embeddings

Embeddings convert text into high-dimensional numerical vectors that represent semantic meaning.

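For example, an embedding model maps a question such as the one below to a vector of floating-point numbers (the values shown are illustrative):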

"What is AWS Bedrock?"

[0.123, -0.551, 0.782, ...]

Similar content produces similar vector representations, enabling semantic search.
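
One common way to measure how similar two embeddings are is cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and the numbers here are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Assumption: toy vectors invented for this example, not real model output.
question = [0.9, 0.1, 0.0]
related_chunk = [0.8, 0.2, 0.1]
unrelated_chunk = [0.0, 0.1, 0.9]

print(cosine_similarity(question, related_chunk))    # close to 1
print(cosine_similarity(question, unrelated_chunk))  # close to 0
```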

Vector Databases

A vector database stores embeddings along with metadata such as document name, page number, section title, and chunk ID. When a user asks a question, the system searches this database for chunks that are semantically similar to the question.

Commonly used vector stores and vector search libraries include:

OpenSearch
Pinecone
Qdrant
Weaviate
Milvus
FAISS
pgvector
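
To make the idea concrete, here is a minimal in-memory stand-in for a vector database. It is a sketch under stated assumptions: the class and field names (`InMemoryVectorStore`, `Record`, `where`) are hypothetical and do not correspond to the API of any product listed above, and it scans every record instead of using the approximate nearest-neighbor indexes that real systems rely on.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    chunk_id: str
    vector: list
    text: str
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Brute-force store: keeps every record in a list and scans it on search."""

    def __init__(self):
        self._records = []

    def add(self, record):
        self._records.append(record)

    def search(self, query_vector, top_k=3, where=None):
        """Return (score, record) pairs, best first.

        `where` is an optional dict of metadata values that must match exactly.
        """
        where = where or {}
        candidates = [
            r for r in self._records
            if all(r.metadata.get(k) == v for k, v in where.items())
        ]
        scored = [(cosine(query_vector, r.vector), r) for r in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


# Assumption: two-dimensional toy vectors and invented file names.
store = InMemoryVectorStore()
store.add(Record("c1", [1.0, 0.0], "chunk one", {"document": "handbook.pdf", "page": 4}))
store.add(Record("c2", [0.7, 0.7], "chunk two", {"document": "handbook.pdf", "page": 9}))
store.add(Record("c3", [0.0, 1.0], "chunk three", {"document": "policy.pdf", "page": 2}))

best = store.search([0.0, 1.0], top_k=1)
filtered = store.search([0.0, 1.0], top_k=1, where={"document": "handbook.pdf"})
print(best[0][1].chunk_id, filtered[0][1].chunk_id)
```

Storing metadata next to each vector is what later makes citations and metadata filtering possible: the second search only considers chunks from one source file.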

Retrieval Process

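When a user submits a question, the same embedding model that was used during ingestion converts the question into a vector, so that the question and the document chunks are represented in the same vector space. The retriever then runs a similarity search and returns the top matching chunks.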

User Question
      |
      v
Question Embedding
      |
      v
Similarity Search
      |
      v
Top Matching Chunks
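
The flow above can be sketched end to end with a toy embedder. In this example a simple word-count vector stands in for a real embedding model, which is an assumption made only to keep the code self-contained; the names `make_embedder` and `retrieve` are hypothetical. The key point it illustrates is that the question and the chunks pass through the same `embed` function.

```python
import math
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def make_embedder(corpus):
    """Build a toy embedder: one dimension per word seen in the corpus."""
    vocab = sorted({token for text in corpus for token in tokenize(text)})
    index = {token: i for i, token in enumerate(vocab)}

    def embed(text):
        vector = [0.0] * len(vocab)
        for token in tokenize(text):
            if token in index:  # words outside the vocabulary are ignored
                vector[index[token]] += 1.0
        return vector

    return embed


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, embed, top_k=2):
    """Embed the question with the same function as the chunks, then rank."""
    question_vector = embed(question)
    # A real system would reuse chunk vectors computed during ingestion.
    scored = [(cosine(question_vector, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


# Assumption: invented sample chunks.
chunks = [
    "Employees accrue vacation days every month.",
    "Expense reports must be submitted within 30 days.",
    "The office is closed on public holidays.",
]
embed = make_embedder(chunks)
top = retrieve("How many vacation days do employees accrue?", chunks, embed)
print(top)
```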

Reranking

A retriever may return many possible matching chunks. A reranker reviews those chunks and reorders them based on which ones are most useful for answering the question. This improves answer quality, especially when documents are large or when many chunks look similar.

Vector Search
      |
      v
Top 20 Chunks
      |
      v
Reranker
      |
      v
Best 5 Chunks
      |
      v
LLM
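
A two-stage pipeline of this kind might look like the sketch below. The scoring function here counts shared words between the question and each chunk, which is only a stand-in for a real reranking model; that substitution, and the names `overlap_score` and `rerank`, are assumptions made for illustration.

```python
import re


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(question, chunk):
    """Fraction of question words that also appear in the chunk."""
    question_words = tokenize(question)
    if not question_words:
        return 0.0
    return len(question_words & tokenize(chunk)) / len(question_words)


def rerank(question, candidates, score_fn=overlap_score, keep=5):
    """Reorder first-stage candidates by score_fn and keep the best few."""
    ordered = sorted(candidates, key=lambda c: score_fn(question, c), reverse=True)
    return ordered[:keep]


# Assumption: invented candidates, listed in the order a vector search returned them.
question = "refund policy for annual plans"
candidates = [
    "Our plans include monthly and annual billing.",
    "Refund policy: annual plans are refundable within 30 days.",
    "Contact support for billing questions.",
]
best = rerank(question, candidates, keep=2)
print(best)
```

Because `score_fn` is a parameter, the same structure works when the toy scorer is swapped for a model that scores each question and chunk pair jointly.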

Common RAG Retrieval Patterns

Most introductory examples describe vector-based RAG, where questions and document chunks are converted into embeddings and matched by semantic similarity. In production systems, teams may use additional retrieval patterns depending on the type of data and the accuracy requirements.

Vector RAG: Uses embeddings and semantic similarity search to retrieve relevant chunks.
Hybrid RAG: Combines vector search with keyword search such as BM25. This helps when exact terms, product names, IDs, or legal phrases matter.
Keyword RAG: Uses sparse retrieval, inverted indexes, or BM25-style search instead of dense embeddings.
SQL RAG: Retrieves exact answers from structured databases and tabular data using generated or predefined SQL queries.
Knowledge Graph RAG: Uses entities, relationships, and graph traversal to answer questions that depend on connected information.
Reasoning-Based RAG: Uses document hierarchy, table of contents, metadata, or multi-step retrieval when a question requires more than one lookup.

For many PDF-based applications, vector RAG is a good starting point. As the system matures, hybrid retrieval, metadata filtering, reranking, and structured retrieval can improve accuracy.
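
Hybrid retrieval needs a way to merge the ranked lists produced by vector search and keyword search. One commonly used option, chosen here as an example and not prescribed by this article, is reciprocal rank fusion (RRF): each chunk earns 1 / (k + rank) from every list it appears in. The chunk IDs below are invented, and k = 60 is a conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Assumption: made-up result lists from two separate retrievers.
vector_ranking = ["c3", "c1", "c7"]   # semantic similarity order
keyword_ranking = ["c1", "c9", "c3"]  # BM25-style keyword order

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that rank well in both lists (`c1` and `c3`) rise to the top, while chunks found by only one retriever are kept but ranked lower.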

Prompt Augmentation

The retrieved chunks are inserted into the model prompt as context. The prompt should instruct the model to answer using only the provided context when possible and to include citations or page references.

System Prompt:
You are a helpful assistant. Answer using only the provided context, and cite the page number of each chunk you use.
If the answer is not found in the context, say that the document does not contain enough information.

Context:
Chunk 1 - Page 4
Chunk 2 - Page 9
Chunk 3 - Page 12

Question:
What is AWS Bedrock?
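
Assembling such a prompt from retrieved chunks is mostly string formatting. The sketch below follows the layout shown above; the function name `build_prompt`, the chunk dictionary keys, and the placeholder chunk texts are assumptions for illustration, and a real application would typically pass the system prompt and user content through its model provider's message format.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer using only the provided context, "
    "and cite the page number of each chunk you use.\n"
    "If the answer is not found in the context, say that the document does "
    "not contain enough information."
)


def build_prompt(question, chunks):
    """Format retrieved chunks and the user question into one prompt string."""
    context = "\n".join(
        f"Chunk {i} - Page {chunk['page']}: {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        f"System Prompt:\n{SYSTEM_PROMPT}\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}"
    )


# Assumption: placeholder chunk texts; real chunks come from the retriever.
retrieved = [
    {"page": 4, "text": "(text of chunk 1)"},
    {"page": 9, "text": "(text of chunk 2)"},
    {"page": 12, "text": "(text of chunk 3)"},
]
prompt = build_prompt("What is AWS Bedrock?", retrieved)
print(prompt)
```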

Benefits of RAG

Access to proprietary data
Reduced hallucinations
Lower cost than model fine-tuning
Near real-time knowledge updates
Enterprise scalability
Improved answer quality

Practical Recommendation for PDF-Based RAG

For a PDF question-answering application, do not send the full PDF to the model every time. Store the PDF, extract the text, split it into chunks, create embeddings, and store those embeddings in a vector database. At question time, retrieve only the most relevant chunks and send those chunks to the model.

Use OCR when the PDF contains scanned pages.
Store page numbers and source file names as metadata.
Use citations so users can verify answers.
Use reranking when the PDF is large or retrieval quality is weak.
Re-index the document when the PDF content changes.
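
One simple way to decide when a PDF needs re-indexing is to compare a hash of its current bytes with the hash recorded at the last ingestion. The sketch below assumes a plain dictionary as the record of indexed files; the helper names are hypothetical, and a real system would persist this state and also delete the stale chunks before re-ingesting.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 fingerprint of the file contents."""
    return hashlib.sha256(data).hexdigest()


def needs_reindex(file_name, data, index_state):
    """True if the file is new or its contents changed since last indexing."""
    return index_state.get(file_name) != content_hash(data)


def mark_indexed(file_name, data, index_state):
    """Record the fingerprint after a successful ingestion run."""
    index_state[file_name] = content_hash(data)


# Assumption: short byte strings stand in for real PDF file contents.
index_state = {}
first_version = b"policy v1"
second_version = b"policy v2"

print(needs_reindex("policy.pdf", first_version, index_state))   # True: never indexed
mark_indexed("policy.pdf", first_version, index_state)
print(needs_reindex("policy.pdf", first_version, index_state))   # False: unchanged
print(needs_reindex("policy.pdf", second_version, index_state))  # True: content changed
```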

When Should You Use RAG?

Use RAG when the model needs to answer questions using private, internal, or frequently changing information.
Use RAG when users need answers from PDFs, policies, contracts, technical documents, tickets, or knowledge bases.
Use RAG when citations and source traceability are important.
Use RAG when retraining or fine-tuning a model would be too expensive, slow, or unnecessary.
Use RAG when the application needs to reduce hallucinations by grounding answers in trusted sources.

Common Enterprise Use Cases

Chat with PDFs
Legal document analysis
Contract review
Customer support assistants
Knowledge management systems
Healthcare documentation
Financial research assistants
Technical documentation search
