Desk2Mob

Understanding Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an AI architecture pattern that combines the reasoning capabilities of Large Language Models (LLMs) with external knowledge sources. Instead of relying solely on information learned during model training, RAG retrieves relevant information from documents, databases, APIs, or knowledge repositories and provides that information as context to the model before generating a response.

Traditional LLMs only know what they learned during training unless new context is provided at runtime. They also cannot access private PDFs, internal documents, databases, or company knowledge on their own. RAG addresses both limits by adding a retrieval layer that finds relevant information first, then passes that information to the language model before it answers.

In simple terms, RAG means: search first, answer second.

Why RAG Exists

  • Reduce hallucinations by grounding responses in trusted data.
  • Allow models to answer questions about private documents.
  • Avoid expensive model retraining.
  • Enable real-time access to changing business information.
  • Provide citations and source attribution.
  • Support enterprise knowledge management systems.

Core Components of a RAG System

  • Document Repository
  • Document Ingestion Pipeline
  • Chunking Engine
  • Embedding Model
  • Vector Database
  • Retriever
  • Reranker, optional but useful for large document sets
  • Large Language Model
  • Chat Application

High-Level Architecture

Ingestion Flow

PDF / Documents
  |
  v
Text Extraction / OCR
  |
  v
Chunking
  |
  v
Embeddings
  |
  v
Vector Database


Question Answering Flow

User Question
  |
  v
Question Embedding
  |
  v
Retriever
  |
  v
Relevant Chunks
  |
  v
LLM Prompt
  |
  v
Answer with Citations

Document Ingestion Pipeline

The ingestion pipeline prepares documents for retrieval. This process is sometimes called indexing or knowledge base creation.

PDF Documents
     |
     v
Text Extraction
     |
     v
Chunking
     |
     v
Embedding Generation
     |
     v
Vector Database Storage

Text Extraction

The first step is extracting text from PDFs, Word documents, web pages, or structured data sources. For scanned PDFs, OCR technology may be required.

Chunking

Large documents usually should not be sent to the model as one large prompt. Instead, documents are split into smaller chunks so the system can retrieve only the most relevant parts for each user question.

Common chunking strategies:

  • Fixed-size chunking
  • Paragraph-based chunking
  • Semantic chunking
  • Section-aware chunking
  • Hierarchical chunking

Example chunk sizes:

  • 500 tokens
  • 750 tokens
  • 1000 tokens

Adjacent chunks often overlap so that text cut at a chunk boundary keeps some of its surrounding context. Example overlaps:

  • 50 token overlap
  • 100 token overlap
  • 150 token overlap
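
Fixed-size chunking with overlap can be sketched in a few lines of Python. This is a minimal illustration, not a production chunker: the function name chunk_tokens is made up for this example, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk starts with the last token of the previous one, which is the overlap described above. Larger overlaps preserve more context but store more duplicated text.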

Embeddings

Embeddings convert text into high-dimensional numerical vectors that represent semantic meaning.

For example:

"What is AWS Bedrock?"

[0.123, -0.551, 0.782, ...]

Similar content produces similar vector representations, enabling semantic search.
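
One common way to measure how "similar" two embeddings are is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and the numbers here do not come from any actual model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not output from a real embedding model.
question = [0.9, 0.1, 0.0]
related_chunk = [0.8, 0.2, 0.1]
unrelated_chunk = [0.0, 0.1, 0.9]

print(cosine_similarity(question, related_chunk))
print(cosine_similarity(question, unrelated_chunk))
```

The related chunk scores much closer to 1.0 than the unrelated one, which is the property semantic search relies on.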

Vector Databases

A vector database stores embeddings along with metadata such as document name, page number, section title, and chunk ID. When a user asks a question, the system searches this database for chunks that are semantically similar to the question.

Commonly used options:

  • OpenSearch
  • Pinecone
  • Qdrant
  • Weaviate
  • Milvus
  • FAISS (a similarity-search library rather than a full database)
  • pgvector (a PostgreSQL extension)
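
To make the "embedding plus metadata" idea concrete, here is a toy in-memory stand-in for a vector database. The names ChunkRecord and InMemoryIndex are invented for this sketch and do not correspond to the API of any product listed above; real systems add persistence, approximate nearest-neighbor indexes, and much more.

```python
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    """One stored chunk: its vector plus the metadata used for citations."""
    chunk_id: str
    document: str
    page: int
    section: str
    text: str
    embedding: list


class InMemoryIndex:
    """Hypothetical toy store keyed by chunk ID (not a real database)."""

    def __init__(self):
        self._records = {}

    def upsert(self, record):
        # Writing the same chunk_id again replaces the old entry.
        self._records[record.chunk_id] = record

    def filter(self, **metadata):
        """Return records whose metadata fields equal the given values."""
        return [
            record for record in self._records.values()
            if all(getattr(record, key) == value for key, value in metadata.items())
        ]


index = InMemoryIndex()
index.upsert(ChunkRecord("guide-p4-0", "guide.pdf", 4, "Overview", "Example text.", [0.1, 0.2]))
index.upsert(ChunkRecord("guide-p9-0", "guide.pdf", 9, "Setup", "More text.", [0.3, 0.4]))
print([record.chunk_id for record in index.filter(document="guide.pdf", page=9)])
```

Keeping page and document fields next to each vector is what later makes citations and metadata filtering possible.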

Retrieval Process

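When a user submits a question, the same embedding model used during ingestion converts the question into a vector. The retriever then compares that vector with the stored chunk vectors and returns the closest matches.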

User Question
      |
      v
Question Embedding
      |
      v
Similarity Search
      |
      v
Top Matching Chunks
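
The similarity search step can be sketched as a brute-force top-k lookup. This is only an illustration under simple assumptions: the function names are hypothetical, the vectors are toy values, and a real vector database would typically use an approximate index instead of scanning every chunk.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)


def top_k(question_vector, chunk_vectors, k=3):
    """Return the k best (score, chunk_id) pairs, highest score first."""
    q = normalize(question_vector)
    scored = []
    for chunk_id, vector in chunk_vectors.items():
        v = normalize(vector)
        score = sum(a * b for a, b in zip(q, v))
        scored.append((score, chunk_id))
    return heapq.nlargest(k, scored)


# Hypothetical toy vectors keyed by chunk ID.
chunk_vectors = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [0.7, 0.7]}
for score, chunk_id in top_k([1.0, 0.1], chunk_vectors, k=2):
    print(chunk_id, round(score, 3))
```

The chunk IDs returned here would then be used to look up the chunk text and metadata that go into the prompt.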

Reranking

A retriever may return many possible matching chunks. A reranker reviews those chunks and reorders them based on which ones are most useful for answering the question. This improves answer quality, especially when documents are large or when many chunks look similar.

Vector Search
      |
      v
Top 20 Chunks
      |
      v
Reranker
      |
      v
Best 5 Chunks
      |
      v
LLM
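
The two-stage flow above can be sketched as follows. Production rerankers are usually dedicated models that score each question and chunk pair; the term_overlap_score function below is only a placeholder invented for this example so that the code runs without any external dependency.

```python
def term_overlap_score(question, chunk_text):
    """Placeholder relevance score: share of question words found in the chunk."""
    question_words = set(question.lower().split())
    chunk_words = set(chunk_text.lower().split())
    if not question_words:
        return 0.0
    return len(question_words & chunk_words) / len(question_words)


def rerank(question, candidates, score_fn=term_overlap_score, keep=5):
    """Reorder retrieved chunks by score_fn and keep only the best few."""
    ranked = sorted(candidates, key=lambda text: score_fn(question, text), reverse=True)
    return ranked[:keep]


# Assumption: these candidates came back from an earlier vector search.
candidates = [
    "bedrock pricing overview",
    "what is aws bedrock and how it works",
    "office holiday schedule",
]
print(rerank("what is aws bedrock", candidates, keep=2))
```

Because score_fn is a parameter, the placeholder can be swapped for a real reranking model without changing the surrounding flow.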

Common RAG Retrieval Patterns

Most introductory examples describe vector-based RAG, where questions and document chunks are converted into embeddings and matched by semantic similarity. In production systems, teams may use additional retrieval patterns depending on the type of data and the accuracy requirements.

  • Vector RAG: Uses embeddings and semantic similarity search to retrieve relevant chunks.
  • Hybrid RAG: Combines vector search with keyword search such as BM25. This helps when exact terms, product names, IDs, or legal phrases matter.
  • Keyword RAG: Uses sparse retrieval, inverted indexes, or BM25-style search to match exact terms instead of embedding similarity.
  • SQL RAG: Retrieves exact answers from structured databases and tabular data using generated or predefined SQL queries.
  • Knowledge Graph RAG: Uses entities, relationships, and graph traversal to answer questions that depend on connected information.
  • Reasoning-Based RAG: Uses document hierarchy, table of contents, metadata, or multi-step retrieval when a question requires more than one lookup.

For many PDF-based applications, vector RAG is a good starting point. As the system matures, hybrid retrieval, metadata filtering, reranking, and structured retrieval can improve accuracy.
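
For hybrid retrieval, the vector and keyword result lists have to be merged somehow. The article does not prescribe a method; one widely used option is reciprocal rank fusion, sketched below. The function name and chunk IDs are illustrative, and k=60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of chunk IDs; each list adds 1 / (k + rank) per item."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical results from two separate retrievers.
vector_results = ["c1", "c2", "c3"]
keyword_results = ["c3", "c1", "c4"]
print(reciprocal_rank_fusion([vector_results, keyword_results]))
```

Chunks that appear near the top of both lists rise to the front, while chunks found by only one retriever are still kept as lower-ranked candidates.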

Prompt Augmentation

The retrieved chunks are inserted into the model prompt as context. The prompt should instruct the model to answer using only the provided context when possible and to include citations or page references.

System Prompt:
You are a helpful assistant. Answer using only the provided context and cite the page number for each source you use.
If the answer is not found in the context, say that the document does not contain enough information.

Context:
Chunk 1 - Page 4
Chunk 2 - Page 9
Chunk 3 - Page 12

Question:
What is AWS Bedrock?
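
Assembling a prompt in this layout is mostly string formatting. The sketch below assumes each retrieved chunk is a dictionary with page and text keys; that shape, the function name, and the exact wording of the instructions are choices made for this example.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer using only the provided context "
    "and cite the page number of each chunk you rely on.\n"
    "If the answer is not found in the context, say that the document "
    "does not contain enough information."
)


def build_prompt(question, chunks):
    """Combine instructions, retrieved chunks, and the question into one prompt."""
    context = "\n".join(
        f"Chunk {i} - Page {chunk['page']}: {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        f"System Prompt:\n{SYSTEM_PROMPT}\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}"
    )


# Placeholder chunk text; real text would come from the retriever.
retrieved = [
    {"page": 4, "text": "(text of the first retrieved chunk)"},
    {"page": 9, "text": "(text of the second retrieved chunk)"},
]
print(build_prompt("What is AWS Bedrock?", retrieved))
```

Labeling every chunk with its page number inside the prompt gives the model something concrete to cite in its answer.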

Benefits of RAG

  • Access to proprietary data
  • Reduced hallucinations
  • Lower cost than model fine-tuning
  • Near real-time knowledge updates
  • Enterprise scalability
  • Improved answer quality

Practical Recommendation for PDF-Based RAG

For a PDF question-answering application, do not send the full PDF to the model every time. Store the PDF, extract the text, split it into chunks, create embeddings, and store those embeddings in a vector database. At question time, retrieve only the most relevant chunks and send those chunks to the model.

  • Use OCR when the PDF contains scanned pages.
  • Store page numbers and source file names as metadata.
  • Use citations so users can verify answers.
  • Use reranking when the PDF is large or retrieval quality is weak.
  • Re-index the document when the PDF content changes.
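
One simple way to notice that a PDF has changed and needs re-indexing is to compare a content hash against the one recorded at the last ingestion. This is a sketch of that idea only; the function names and the dictionary of known fingerprints are assumptions for the example, not part of any particular RAG framework.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of the file contents."""
    return hashlib.sha256(data).hexdigest()


def needs_reindex(file_name, data, known_fingerprints):
    """True when the file is new or its bytes differ from the indexed version."""
    return known_fingerprints.get(file_name) != fingerprint(data)


# Hypothetical record of what was indexed last time.
known = {"guide.pdf": fingerprint(b"original pdf bytes")}

print(needs_reindex("guide.pdf", b"original pdf bytes", known))
print(needs_reindex("guide.pdf", b"edited pdf bytes", known))
print(needs_reindex("new.pdf", b"anything", known))
```

When a change is detected, the old chunks for that file would be removed or overwritten before the new chunks and embeddings are stored.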

Common Enterprise Use Cases

  • Chat with PDFs
  • Legal document analysis
  • Contract review
  • Customer support assistants
  • Knowledge management systems
  • Healthcare documentation
  • Financial research assistants
  • Technical documentation search

Posted on June 08, 2026 by Amit Pandya in AI, RAG

