Session 3: Retrieval-Augmented Generation
Learning Objectives:
- Understand when and why to use RAG
- Learn the RAG pipeline architecture
- Implement semantic search with embeddings
- Apply chunking strategies for scientific documents
- Build a simple RAG system for bioinformatics
Recap: The Context Window Problem
From Session 2, we know:
- Context windows are limited (e.g., 4K, 128K, 200K tokens)
- Everything must fit in N tokens
- Attention mechanism processes the entire context
Question: What if your knowledge base is millions of tokens?
Real-World Scenario
You want an LLM to answer questions about:
- Your lab’s 500 protocols (∼2M tokens)
- 10,000 research papers (∼50M tokens)
- Genomic annotation databases (∼100M+ tokens)
- Your experiment notes from 5 years (∼500K tokens)
Problem: This won’t fit in any context window
Solution: Retrieval-Augmented Generation (RAG)
What is RAG?
Retrieval-Augmented Generation = dynamically retrieve relevant information and add it to the context
Key insight: Don’t put everything in context, just what’s relevant to the current query
Analogy: Like having an index in a textbook - you don’t read the whole book for every question
RAG Pipeline Overview
1. INDEXING (done once)
Documents → Chunks → Embeddings → Vector Database
2. RETRIEVAL (per query)
Query → Embedding → Search DB → Top-K chunks
3. AUGMENTATION (per query)
Context = System + Retrieved chunks + Query
4. GENERATION (per query)
LLM(Context) → Response
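A minimal end-to-end sketch of these four phases, using sentence-transformers for embeddings and brute-force cosine search in place of a real vector database (the document contents, model name, and top_k below are all illustrative):

# Minimal RAG pipeline sketch: index, retrieve, augment (generation elided)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1. INDEXING: embed every chunk once
chunks = [
    "RNA was extracted using TRIzol reagent.",
    "Libraries were sequenced on an Illumina NovaSeq 6000.",
    "BRCA1 variants were called with GATK HaplotypeCaller.",
]
chunk_vecs = model.encode(chunks, convert_to_tensor=True)

# 2. RETRIEVAL: embed the query, take the top-K most similar chunks
query = "What sequencing platform was used?"
hits = util.semantic_search(model.encode(query, convert_to_tensor=True),
                            chunk_vecs, top_k=2)[0]
retrieved = [chunks[h["corpus_id"]] for h in hits]

# 3. AUGMENTATION: build the prompt the LLM will actually see
context = "Answer from these documents:\n" + "\n".join(retrieved)
context += f"\n\nQuestion: {query}"

# 4. GENERATION: pass `context` to any LLM (see the Generation slide)
print(context)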
Why RAG Works
Advantages:
- Access to knowledge beyond training data
- Up-to-date information (update the index, not the model)
- Domain-specific knowledge (your protocols, data, papers)
- Traceable sources (know where answers come from)
- No fine-tuning required
vs Fine-Tuning:
- RAG: Dynamic, updatable, traceable
- Fine-tuning: Static, expensive, requires expertise
Embeddings Revisited
From the LLM Primer:
- Embeddings = vector representations of text
- Each token gets mapped to a high-dimensional vector
- Similar meanings → similar vectors
New concept: We can create embeddings for entire chunks of text, not just tokens
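For example, with the sentence-transformers library (the model name is just one common choice), an entire chunk maps to a single fixed-size vector:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
chunk = ("We performed whole-exome sequencing on 50 tumor samples "
         "and identified recurrent BRCA1 loss-of-function variants.")
vec = model.encode(chunk)  # one vector for the whole chunk
print(vec.shape)           # (384,) for this particular model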
Semantic Similarity
Text embeddings capture semantic meaning
Example:
"BRCA1 mutation"
"breast cancer susceptibility gene defect"
These have different words but similar embeddings because they mean similar things
This enables semantic search - find by meaning, not just keywords
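A quick way to see this, again with sentence-transformers (exact scores depend on the embedding model):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
a = model.encode("BRCA1 mutation", convert_to_tensor=True)
b = model.encode("breast cancer susceptibility gene defect", convert_to_tensor=True)
c = model.encode("PCR thermocycler settings", convert_to_tensor=True)

print(util.cos_sim(a, b))  # relatively high: same meaning, different words
print(util.cos_sim(a, c))  # lower: unrelated topic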
The Indexing Phase (Step 1)
Goal: Prepare documents for efficient retrieval
Steps:
- Load documents (PDFs, text files, databases, etc.)
- Split into chunks (more on this next)
- Generate embeddings for each chunk
- Store in vector database (specialized DB for similarity search)
Done once (or when documents change)
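A minimal indexing sketch with ChromaDB, relying on its default embedding function (the ids, documents, and metadata are illustrative):

import chromadb

client = chromadb.Client()  # in-memory; PersistentClient(path=...) keeps the index on disk
collection = client.create_collection("protocols")

# ChromaDB embeds `documents` with its default embedding function
# unless you pass precomputed embeddings
collection.add(
    ids=["protocol12-chunk1", "protocol12-chunk2"],
    documents=[
        "RNA extraction: lyse tissue in TRIzol, then ...",
        "Library prep: use a stranded mRNA kit, then ...",
    ],
    metadatas=[{"source": "protocol_12"}, {"source": "protocol_12"}],
)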
Document Chunking
Why chunk?
- Embeddings work best on coherent units of meaning
- Retrieval precision (return specific relevant parts, not whole docs)
- Context window limits (even retrieved text must fit)
Chunk size tradeoff:
- Too small: Loss of context, fragmented information
- Too large: Less precise retrieval, more noise
Typical sizes: 200-1000 tokens per chunk with 10-20% overlap
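A naive fixed-size chunker with overlap, as a starting point (it counts whitespace-separated words rather than true subword tokens, which is usually close enough for a first prototype):

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 75) -> list[str]:
    """Fixed-size chunks of ~chunk_size words with ~15% overlap."""
    words = text.split()
    step = chunk_size - overlap  # assumes chunk_size > overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]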
Chunking Strategies
1. Fixed-size chunks
- Split every N characters/tokens
- Simple but may break mid-sentence/concept
2. Sentence-based
- Split on sentence boundaries
- More coherent but variable size
3. Semantic chunks
- Split on topic changes
- Best coherence but computationally expensive
4. Structure-based (for scientific papers)
- Split by section (Abstract, Methods, Results, etc.)
- Preserves logical organization
Chunking for Scientific Papers
Recommended approach:
chunks = [
    {"section": "Abstract", "text": "...", "metadata": {...}},
    {"section": "Introduction", "text": "...", "metadata": {...}},
    {"section": "Methods", "text": "...", "metadata": {...}},
    {"section": "Results", "text": "...", "metadata": {...}},
    {"section": "Discussion", "text": "...", "metadata": {...}}
]
Benefit: Can prioritize sections by query type
- “How was this done?” → Methods
- “What did they find?” → Results
Vector Databases
Regular database: Exact match queries
SELECT * FROM papers WHERE title = "BRCA1 mutations"
Vector database: Similarity queries
results = db.similarity_search(
    query_embedding,
    top_k=5
)
Popular options: ChromaDB, Pinecone, Weaviate, FAISS, Qdrant
The Retrieval Phase (Step 2)
For each query:
- Convert query to embedding (same model as indexing)
- Search vector DB for most similar chunks
- Rank by similarity score (cosine similarity, dot product)
- Return top-K chunks (typically K=3-10)
Fast: Optimized for high-dimensional similarity search
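Continuing the ChromaDB sketch from the indexing slide, retrieval is a single call; the query text is embedded with the same embedding function used at indexing time:

results = collection.query(
    query_texts=["How do we extract RNA from tissue?"],
    n_results=5,
)
# Results come back per query: documents, ids, metadatas, and distances
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"{dist:.3f}  {doc[:60]}")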
Similarity Metrics
How to measure “closeness” of vectors?
Cosine similarity: Angle between vectors
- Range: -1 to 1 (1 = identical)
- Most common for text
Euclidean distance: Geometric distance
- Range: 0 to ∞ (0 = identical)
- Sensitive to magnitude
Dot product: Direct vector multiplication
- Combines angle and magnitude
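The three metrics in plain NumPy; the vectors are toy values chosen so that a and b point the same way but differ in magnitude:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
dot = np.dot(a, b)

print(cosine)     # 1.0   -> identical direction; magnitude ignored
print(euclidean)  # ~3.74 -> nonzero, because magnitudes differ
print(dot)        # 28.0  -> grows with both alignment and magnitude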
The Augmentation Phase (Step 3)
Build the context for the LLM:
context = f"""
System: You are an expert bioinformatician. Answer based on
the provided documents. Cite sources.
Retrieved Information:
{retrieved_chunk_1}
{retrieved_chunk_2}
{retrieved_chunk_3}
User Query: {user_question}
"""
The Generation Phase (Step 4)
Pass augmented context to LLM:
from litellm import completion

response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": context}]
)
The LLM generates an answer based on:
- Its training knowledge
- + Retrieved specific information
RAG vs Context Stuffing
Context Stuffing: Put all documents in context
- Limited by context window
- Expensive (all tokens processed)
- Attention spread across irrelevant info
RAG: Retrieve only relevant parts
- Scalable (millions of documents)
- Cheaper (only process relevant chunks)
- Focused attention on pertinent information
Retrieval Quality Matters
The RAG Bottleneck:
If retrieval fails to find relevant chunks, the LLM can’t help
Common issues:
- Query-document vocabulary mismatch
- Chunk boundaries split important info
- Top-K too small (missed relevant docs)
- Top-K too large (noise dilutes signal)
Solution: Hybrid retrieval strategies
Hybrid Retrieval
Semantic search alone may miss exact keyword matches
Keyword search alone may miss semantic matches
Hybrid approach:
- Semantic search (embedding similarity)
- + Keyword search (BM25, TF-IDF)
- Combine and rerank results
Best of both worlds
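One way to sketch this, assuming the rank_bm25 package for the keyword side and sentence-transformers for the semantic side (the 50/50 weighting is a starting point, not a rule):

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["BRCA1 loss-of-function variants ...", "RNA extraction protocol ..."]

bm25 = BM25Okapi([d.lower().split() for d in docs])
doc_vecs = model.encode(docs, convert_to_tensor=True)

def hybrid_scores(query: str, alpha: float = 0.5):
    kw = bm25.get_scores(query.lower().split())  # keyword (BM25) scores
    if kw.max() > 0:
        kw = kw / kw.max()  # normalize keyword scores to [0, 1]
    sem = util.cos_sim(model.encode(query, convert_to_tensor=True), doc_vecs)[0]
    return alpha * kw + (1 - alpha) * sem.cpu().numpy()  # weighted blend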
Reranking
Problem: Initial retrieval may have noise in top-K
Solution: Reranking step
- Retrieve top-20 candidates (broad net)
- Rerank using more sophisticated model
- Take top-5 for context
Reranker: Specialized model trained to score query-document relevance
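A reranking sketch with a public cross-encoder from sentence-transformers (the model name is one widely used example; `candidates` stands in for the top-20 chunks from the first pass):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What sequencing methods were used?"
candidates = ["chunk text 1", "chunk text 2"]  # placeholder for top-20 retrieved chunks

# Score every (query, chunk) pair jointly, then keep the best 5
scores = reranker.predict([(query, doc) for doc in candidates])
top5 = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)[:5]]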
Metadata Filtering
Enhance retrieval with structured filters:
results = db.similarity_search(
    query_embedding,
    filter={
        "organism": "Homo sapiens",
        "year": {"$gte": 2020},
        "journal": "Nature Genetics"
    },
    top_k=5
)
Combines semantic search with traditional filters
RAG Use Cases in Bioinformatics
1. Protocol search
- “How do we extract RNA from tissue samples?”
- Retrieve from lab protocol database
2. Literature review
- “What’s known about APOE variants and Alzheimer’s?”
- Search thousands of papers
3. Annotation lookup
- “What pathways is EGFR involved in?”
- Query curated databases
4. Experiment notes
- “When did we last run sample XYZ?”
- Search lab notebooks
RAG Limitations
1. Retrieval accuracy
- Garbage in, garbage out
- Missed relevant docs = wrong answers
2. Chunk boundary issues
- Important info split across chunks
- May need larger chunks or overlap
3. Conflicting information
- Retrieved chunks may contradict
- LLM must reconcile (not always successful)
4. Computational cost
- Embedding generation
- Vector search overhead
When NOT to Use RAG
Skip RAG when:
- Information fits comfortably in context window
- General knowledge questions (LLM already knows)
- Real-time data needs (use tools/APIs instead - Session 4)
- Highly structured queries (use databases directly)
Use RAG when:
- Large knowledge base (beyond context limits)
- Domain-specific information
- Frequently updated content
- Need source attribution
RAG vs Fine-Tuning vs Prompting
Prompting:
- Knowledge in context
- Best for: Task formatting, examples
RAG:
- Knowledge retrieved externally
- Best for: Large/updating knowledge bases
Fine-Tuning:
- Knowledge in model weights
- Best for: Behavior patterns, style, domain expertise
Often combined for best results
Practical Demo Overview
We’ll build two RAG systems:
- Paper Q&A: Query a collection of genomics papers
- Protocol Assistant: Search lab protocols and methods
We’ll see:
- Document loading and chunking
- Embedding generation
- Vector database setup
- Query and retrieval
- Answer generation with sources
Demo 1: Paper Q&A System
System components:
- ChromaDB for vector storage
- Sentence-transformers for embeddings
- LiteLLM for generation
- Sample genomics papers
Flow:
- Load papers → chunk by section
- Generate embeddings → store in ChromaDB
- Query: “What sequencing methods were used?”
- Retrieve relevant chunks
- Generate answer with citations
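A condensed sketch of this flow, assuming `collection` is the populated ChromaDB collection from the indexing slide and reusing the model string from the generation slide:

from litellm import completion

# Steps 1-2: retrieve the chunks most relevant to the question
question = "What sequencing methods were used?"
results = collection.query(query_texts=[question], n_results=3)
retrieved = "\n\n".join(results["documents"][0])

# Step 3: augment - sources go into the prompt so the answer can cite them
prompt = (
    "Answer using only the documents below and cite them.\n\n"
    f"Documents:\n{retrieved}\n\nQuestion: {question}"
)

# Step 4: generate
response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)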
Demo 2: Protocol Search
System components:
- Structured protocol documents
- Metadata (author, date, tags)
- Hybrid search (semantic + metadata filters)
Flow:
- Index protocols with metadata
- Query: “RNA extraction from blood samples”
- Filter: protocols from last 2 years
- Retrieve and generate step-by-step answer
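The filtered query could look like this in ChromaDB, whose `where` clause supports comparison operators (the cutoff year is illustrative):

results = collection.query(
    query_texts=["RNA extraction from blood samples"],
    n_results=5,
    where={"year": {"$gte": 2023}},  # metadata filter: protocols from recent years
)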
Key Takeaways
- RAG extends LLM knowledge beyond training and context limits
- Pipeline: Index → Retrieve → Augment → Generate
- Embeddings enable semantic search (meaning, not just keywords)
- Chunking strategy matters for retrieval quality
- Vector databases make similarity search fast
- Hybrid approaches (semantic + keyword) often work best
- RAG ≠ perfect - retrieval quality is critical
Theory ↔ Practice Connections
Theory: Embeddings capture semantic meaning in vector space
Practice: Similar concepts cluster together, enabling semantic search
Theory: Attention mechanism processes all context tokens
Practice: Only include relevant retrieved chunks to focus attention
Theory: Context windows have hard token limits
Practice: RAG circumvents limits by selective retrieval
Looking Ahead: Session 4
Next topic: Tool Use & Function Calling
The problem we’ll solve:
RAG retrieves static documents. What about dynamic information?
- Database queries
- API calls
- Running computations
- Accessing real-time data
Solution: Give LLMs the ability to use tools
Resources
Libraries:
- LlamaIndex - https://www.llamaindex.ai/
- LangChain - https://www.langchain.com/
- ChromaDB - https://www.trychroma.com/
- Sentence-Transformers - https://www.sbert.net/
Papers:
- “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al., 2020)
- “REALM: Retrieval-Augmented Language Model Pre-Training” (Guu et al., 2020)
Demo code: lectures/demos/session_3/
Questions?
Next session: Tool Use & Function Calling