🔍 Introduction to RAG (Retrieval-Augmented Generation)

📌 What Is RAG?

RAG is a technique that improves the output of Large Language Models (LLMs) by letting them retrieve information from external sources before generating a response. This is especially useful for dynamic or large datasets that can't fit into an LLM's limited context window.

🤔 Why Not Just Fine-Tune LLMs?

LLMs are pre-trained on vast amounts of general-purpose data. But businesses often need them to respond based on specific internal data, such as sales reports, documents, or customer queries.

A common solution is:

  • Fine-tune the LLM on internal data (e.g., documentation, business files).

But this has some serious drawbacks:

  1. ❌ Frequent retraining is not feasible: business data updates in real time.

  2. ❌ Costs skyrocket: training requires GPUs and other heavy compute resources.

  3. ❌ Complex pipeline: it requires preprocessing, data cleaning, and setup.

  4. ❌ No real-time context: even if you retrain daily, the model always lags behind the data.

➡️ Conclusion: Fine-tuning works, but isn't optimal for ever-changing data.

💡 Alternative: Prompt Engineering with System Prompts

A quick workaround:

  • Load static business data into the system prompt of a powerful model like GPT-4.1 or Claude Sonnet.

  • This allows the chatbot to respond in the context of your business.

But here's the issue:

  • Most models have token limits:

    • GPT-4 Turbo and GPT-4o: ~128k tokens

    • Claude 3: ~200k tokens

    • GPT-4.1: up to ~1M tokens

  • Business documents often exceed even the largest of these limits, so you can't load everything.
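
Whether a corpus fits in a prompt can be estimated before sending anything. The sketch below is a rough check, assuming about 4 characters per token (a common rule of thumb for English text, not an exact tokenizer count). The names `estimate_tokens` and `fits_in_context`, the `reserved` budget, and the sample corpus are all hypothetical, chosen only for illustration.

```python
from __future__ import annotations

# Rough check: does a set of documents fit in a model's context window?
# Assumption: ~4 characters per token, a rule of thumb for English text.
# Real counts depend on the model's tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length (rounded up)."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(docs: list[str], context_limit: int, reserved: int = 2_000) -> bool:
    """True if all docs fit while leaving `reserved` tokens for the question and answer."""
    total = sum(estimate_tokens(d) for d in docs)
    return total + reserved <= context_limit


if __name__ == "__main__":
    # Hypothetical corpus: 300 reports of about 20,000 characters each.
    docs = ["x" * 20_000 for _ in range(300)]
    print(sum(estimate_tokens(d) for d in docs))          # 1500000
    print(fits_in_context(docs, context_limit=128_000))   # False
```

Under this estimate the hypothetical corpus needs about 1.5M tokens, so it overflows a 128k window by more than ten times. That gap is what pushes you toward retrieving only the relevant parts.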

✅ Enter: RAG to the Rescue

RAG solves this by combining two ideas:

  • Data Indexing: Index your data once.

  • Data Retrieval: Dynamically fetch only the relevant parts at query time.

Figure: Basic RAG pipeline

⚙️ How RAG Works

1. Ingestion & Indexing

Goal: Preprocess and store data for efficient retrieval.

  • 📄 Data Source: e.g., PDF, CSV, or Notion.

  • ✂️ Chunking: Split text into small sections (e.g., paragraphs).

  • 🔢 Embeddings: Convert each chunk into a semantic vector using an embedding model.

  • 🧠 Store: Save these vectors (along with metadata like page number) in a vector store or database (e.g., FAISS, Chroma, Weaviate).

Example:

Chunk | Text                            | Metadata
C1    | Node.js is a JS runtime...      | Page 2
C2    | fs is the file system module... | Page 3
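
The ingestion steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline. It assumes paragraph-based chunking, and it uses a toy hashed bag-of-words vector (`toy_embed`) in place of a real embedding model and a plain Python list in place of a vector database. `Record`, `build_index`, and `DIM` are hypothetical names introduced here.

```python
from __future__ import annotations

import hashlib
import math
import re
from dataclasses import dataclass

DIM = 64  # size of the toy vector (assumption; real models use hundreds of dimensions or more)


def chunk_by_paragraph(text: str) -> list[str]:
    """Split text on blank lines; each paragraph becomes one chunk."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def toy_embed(text: str) -> list[float]:
    """Hashed bag-of-words vector, L2-normalized. A stand-in for a real embedding model."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class Record:
    chunk_id: str
    text: str
    vector: list[float]
    metadata: dict


def build_index(pages: dict[int, str]) -> list[Record]:
    """Chunk each page, embed each chunk, and keep the page number as metadata."""
    index: list[Record] = []
    for page_no, page_text in pages.items():
        for chunk in chunk_by_paragraph(page_text):
            chunk_id = f"C{len(index) + 1}"
            index.append(Record(chunk_id, chunk, toy_embed(chunk), {"page": page_no}))
    return index


if __name__ == "__main__":
    pages = {
        2: "Node.js is a JS runtime...",
        3: "fs is the file system module...",
    }
    for rec in build_index(pages):
        print(rec.chunk_id, rec.metadata, rec.text)
```

In a real setup, `toy_embed` would be replaced by a call to an embedding model and the list by a vector store. The shape of each record (text, vector, metadata) stays the same.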

2. Retrieval (Query Time)

Goal: Answer user questions by retrieving only the most relevant chunks.

  • 🔍 User sends query: "What is a module in Node.js?"

  • 🔗 Embed the query → Find similar vectors in the DB

  • 📦 Return the top relevant chunks + metadata

  • 🧠 Send: Query + Chunks → LLM → Final answer

Result: Efficient, accurate, and context-aware responses with minimal token usage.
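
Here is a minimal sketch of the query-time steps, under loud assumptions. The "embedding" is a plain word-count vector (`toy_embed`), similarity is cosine similarity, and the index is an in-memory list. The third chunk and the prompt wording are hypothetical, and the final LLM call is left out because it depends on the provider you use.

```python
from __future__ import annotations

import math
import re
from collections import Counter

# C1 and C2 mirror the example table; C3 is a hypothetical extra chunk.
CHUNKS = [
    {"id": "C1", "text": "Node.js is a JS runtime...", "page": 2},
    {"id": "C2", "text": "fs is the file system module...", "page": 3},
    {"id": "C3", "text": "npm installs third-party packages...", "page": 5},
]


def toy_embed(text: str) -> Counter:
    """Word-count vector. A stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def make_index(chunks: list[dict]) -> list[dict]:
    """Attach a vector to every chunk (the indexing step, done once)."""
    return [{**c, "vector": toy_embed(c["text"])} for c in chunks]


def retrieve(query: str, index: list[dict], k: int = 2) -> list[dict]:
    """Embed the query and return the k most similar chunks."""
    q = toy_embed(query)
    return sorted(index, key=lambda c: cosine(q, c["vector"]), reverse=True)[:k]


def build_prompt(query: str, chunks: list[dict]) -> str:
    """Combine the retrieved chunks and the query into one prompt for the LLM."""
    context = "\n".join(f"[page {c['page']}] {c['text']}" for c in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    index = make_index(CHUNKS)
    query = "What is a module in Node.js?"
    top = retrieve(query, index, k=2)
    print([c["id"] for c in top])  # ['C1', 'C2']
    print(build_prompt(query, top))  # this string is what would be sent to the LLM
```

With word counts, the Node.js chunk ranks first simply because it shares the most words with the query. A real embedding model ranks by meaning rather than word overlap, which is why the choice of embedding matters for retrieval quality.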

🎯 Why Do We Need Optimal RAG?

  • ❌ Naive approach: Pass the entire document to the LLM, which is wasteful and expensive.

  • ✅ Optimal RAG:

    • Reduces token usage by retrieving only relevant data.

    • Improves latency and scalability.

    • Gives better context-aware answers for dynamic datasets.

🧠 Key Takeaways

  • Fine-tuning is expensive and rigid.

  • Prompt-based context injection hits token limits.

  • RAG provides a scalable, cost-efficient, and real-time alternative.

  • Optimal RAG minimizes waste and maximizes relevance.