Understanding Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is an AI architecture pattern that combines the reasoning capabilities of Large Language Models (LLMs) with external knowledge sources. Instead of relying solely on information learned during model training, RAG retrieves relevant information from documents, databases, APIs, or knowledge repositories and provides that information as context to the model before generating a response.
On their own, LLMs know only what they learned during training, unless new context is supplied at runtime. They also have no built-in access to private PDFs, internal documents, databases, or company knowledge. RAG addresses both limits by adding a retrieval layer that finds relevant information first and passes it to the language model before it answers.
In simple terms, RAG means: search first, answer second.
Why RAG Exists
Core Components of a RAG System
High-Level Architecture
Ingestion Flow
PDF / Documents
|
v
Text Extraction / OCR
|
v
Chunking
|
v
Embeddings
|
v
Vector Database
Question Answering Flow
User Question
|
v
Question Embedding
|
v
Retriever
|
v
Relevant Chunks
|
v
LLM Prompt
|
v
Answer with Citations
Document Ingestion Pipeline
The ingestion pipeline prepares documents for retrieval. This process is sometimes called indexing or knowledge base creation.
PDF Documents
|
v
Text Extraction
|
v
Chunking
|
v
Embedding Generation
|
v
Vector Database Storage
Text Extraction
The first step is extracting text from PDFs, Word documents, web pages, or structured data sources. Scanned PDFs store pages as images rather than text, so they may need OCR before any text can be extracted.
Chunking
Large documents usually should not be sent to the model as one large prompt. Instead, documents are split into smaller chunks so the system can retrieve only the most relevant parts for each user question.
Example chunk sizes:
Adjacent chunks often overlap slightly, so that a sentence or idea spanning a chunk boundary still appears whole in at least one chunk.
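A minimal sketch of overlapping chunking is shown here. It splits by characters for simplicity, whereas real systems often split by tokens, sentences, or sections. The function name `chunk_text` and the default sizes (500 characters with a 50-character overlap) are illustrative assumptions, not values from this article.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters.

    Each window starts (chunk_size - overlap) characters after the previous
    one, so consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny sizes so the overlap is easy to see.
example_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(example_chunks)
```

With these tiny sizes, the last character of each chunk is repeated as the first character of the next one, which is the overlap at work.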
Embeddings
Embeddings convert text into high-dimensional numerical vectors that represent semantic meaning.
For example:
"What is AWS Bedrock?"
[0.123, -0.551, 0.782, ...]
Similar content produces similar vector representations, enabling semantic search.
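One common way to measure how similar two vectors are is cosine similarity. The sketch below assumes that metric and uses made-up three-dimensional vectors. Real embeddings come from an embedding model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


# Hypothetical vectors, invented for illustration only.
question_vec = [0.9, 0.1, 0.0]
related_vec = [0.8, 0.2, 0.1]    # points in nearly the same direction
unrelated_vec = [0.0, 0.1, 0.9]  # points in a very different direction

print(cosine_similarity(question_vec, related_vec))
print(cosine_similarity(question_vec, unrelated_vec))
```

The first score comes out much higher than the second, and that gap is what semantic search relies on.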
Vector Databases
A vector database stores embeddings along with metadata such as document name, page number, section title, and chunk ID. When a user asks a question, the system searches this database for chunks that are semantically similar to the question.
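To make the idea concrete, here is a toy in-memory stand-in for a vector database. The class names `ChunkRecord` and `InMemoryVectorStore`, the file name `guide.pdf`, and the vectors are all hypothetical. A production vector database would add persistence and approximate nearest-neighbor indexing rather than scanning every record.

```python
import math
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    """One stored chunk: its embedding plus the metadata used for citations."""
    chunk_id: str
    text: str
    embedding: list[float]
    document_name: str
    page_number: int
    section_title: str = ""


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class InMemoryVectorStore:
    def __init__(self) -> None:
        self._records: list[ChunkRecord] = []

    def add(self, record: ChunkRecord) -> None:
        self._records.append(record)

    def search(self, query_embedding: list[float], top_k: int = 3):
        """Return up to top_k (score, record) pairs, most similar first."""
        scored = [(_cosine(query_embedding, r.embedding), r) for r in self._records]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore()
store.add(ChunkRecord("c1", "Overview of the service", [0.9, 0.1, 0.0], "guide.pdf", 4, "Overview"))
store.add(ChunkRecord("c2", "Pricing details", [0.1, 0.9, 0.1], "guide.pdf", 9, "Pricing"))
store.add(ChunkRecord("c3", "Security notes", [0.0, 0.2, 0.9], "guide.pdf", 12, "Security"))

results = store.search([0.8, 0.2, 0.0], top_k=2)
for score, record in results:
    print(f"{record.chunk_id} ({record.document_name}, page {record.page_number}): {score:.3f}")
```

Keeping the page number and document name next to each vector is what later lets the answer cite its sources.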
Retrieval Process
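When a user submits a question, the same embedding model used during ingestion converts the question into a vector. Using the same model matters because question and chunk vectors can only be compared meaningfully when they come from the same embedding space.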
User Question
|
v
Question Embedding
|
v
Similarity Search
|
v
Top Matching Chunks
Reranking
A retriever may return many possible matching chunks. A reranker scores each of those chunks against the question and reorders them so that the most useful ones come first. This improves answer quality, especially when documents are large or when many chunks look similar.
Vector Search
|
v
Top 20 Chunks
|
v
Reranker
|
v
Best 5 Chunks
|
v
LLM
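The two-stage flow above can be sketched as follows. Real rerankers are usually dedicated models that score each question and chunk pair. The `keyword_overlap_score` function here is only a toy stand-in for such a model, and every name in this block is an assumption made for illustration.

```python
import re
from typing import Callable


def keyword_overlap_score(question: str, chunk: str) -> float:
    """Toy relevance score: the fraction of question words found in the chunk."""
    question_terms = set(re.findall(r"\w+", question.lower()))
    chunk_terms = set(re.findall(r"\w+", chunk.lower()))
    if not question_terms:
        return 0.0
    return len(question_terms & chunk_terms) / len(question_terms)


def rerank(
    question: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float] = keyword_overlap_score,
    top_n: int = 5,
) -> list[str]:
    """Reorder retrieved chunks by score_fn and keep the best top_n."""
    ranked = sorted(candidates, key=lambda chunk: score_fn(question, chunk), reverse=True)
    return ranked[:top_n]


# Pretend these came back from vector search (normally around 20 chunks).
candidates = [
    "Pricing details for storage",
    "Bedrock is a managed service",
    "What Bedrock is and how it works",
]
best_chunks = rerank("What is Bedrock?", candidates, top_n=2)
print(best_chunks)
```

Because `score_fn` is a parameter, the toy scorer can be replaced by a call to a real reranking model without changing the rest of the flow.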
Common RAG Retrieval Patterns
Most introductory examples describe vector-based RAG, where questions and document chunks are converted into embeddings and matched by semantic similarity. In production systems, teams may use additional retrieval patterns depending on the type of data and the accuracy requirements.
For many PDF-based applications, vector RAG is a good starting point. As the system matures, hybrid retrieval, metadata filtering, reranking, and structured retrieval can improve accuracy.
Prompt Augmentation
The retrieved chunks are inserted into the model prompt as context. The prompt should instruct the model to answer using only the provided context when possible and to include citations or page references.
System Prompt:
You are a helpful assistant. Answer using the provided context.
If the answer is not found in the context, say that the document does not contain enough information.
Context:
Chunk 1 - Page 4
Chunk 2 - Page 9
Chunk 3 - Page 12
Question:
What is AWS Bedrock?
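A prompt like the one above can be assembled with a small helper. The function name `build_prompt`, the `[Page N]` label format, and the chunk texts are assumptions for illustration. The instruction wording is copied from the example prompt.

```python
def build_prompt(question: str, chunks: list[tuple[int, str]]) -> str:
    """Build a prompt from (page_number, chunk_text) pairs and a user question."""
    # Label every chunk with its page so the model can cite it.
    context = "\n".join(f"[Page {page}] {text}" for page, text in chunks)
    return (
        "You are a helpful assistant. Answer using the provided context.\n"
        "If the answer is not found in the context, say that the document "
        "does not contain enough information.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}"
    )


# Placeholder chunk texts, not taken from any real document.
retrieved = [
    (4, "Placeholder text of the first retrieved chunk."),
    (9, "Placeholder text of the second retrieved chunk."),
    (12, "Placeholder text of the third retrieved chunk."),
]
prompt = build_prompt("What is AWS Bedrock?", retrieved)
print(prompt)
```

The resulting string is what gets sent to the LLM. How it is sent depends on the model provider's API, which is outside the scope of this sketch.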
Benefits of RAG
Practical Recommendation for PDF-Based RAG
For a PDF question-answering application, do not send the full PDF to the model every time. Store the PDF, extract the text, split it into chunks, create embeddings, and store those embeddings in a vector database. At question time, retrieve only the most relevant chunks and send those chunks to the model.
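The recommendation above separates work done once per document from work done per question. The sketch below shows that split end to end. It assumes the PDF text has already been extracted into a page-number-to-text mapping, and it uses a bag-of-words count vector (`toy_embed`) as a stand-in for a real embedding model. All names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norms = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norms if norms else 0.0


def ingest(pages: dict[int, str], chunk_size: int = 200) -> list[dict]:
    """Run once per document: chunk each page and store text, page, and vector."""
    index = []
    for page, text in pages.items():
        for start in range(0, len(text), chunk_size):
            chunk = text[start:start + chunk_size]
            index.append({"page": page, "text": chunk, "vector": toy_embed(chunk)})
    return index


def retrieve(index: list[dict], question: str, top_k: int = 2) -> list[dict]:
    """Run per question: embed the question and return the closest chunks."""
    question_vector = toy_embed(question)
    ranked = sorted(index, key=lambda entry: cosine(question_vector, entry["vector"]), reverse=True)
    return ranked[:top_k]


# Invented page texts standing in for extracted PDF content.
pages = {
    1: "Invoices are due within 30 days of the billing date.",
    2: "Support tickets are answered within one business day.",
    3: "Refunds are processed within 14 days of approval.",
}

index = ingest(pages)                                      # once, at upload time
top = retrieve(index, "When are invoices due?", top_k=1)   # on every question
print(top[0]["page"], top[0]["text"])
```

Only the chunks returned by `retrieve` would be placed in the prompt, so the cost per question stays roughly constant no matter how long the PDF is.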
Common Enterprise Use Cases
Posted on June 08, 2026 by Amit Pandya in AI, RAG