Session 3: Retrieval-Augmented Generation

Learning Objectives:


Recap: The Context Window Problem

From Session 2, we know:

Question: What if your knowledge base is millions of tokens?


Real-World Scenario

You want an LLM to answer questions about:

Problem: This won’t fit in any context window

Solution: Retrieval-Augmented Generation (RAG)


What is RAG?

Retrieval-Augmented Generation = dynamically retrieve relevant information and add it to the context

Key insight: Don’t put everything in context, just what’s relevant to the current query

Analogy: Like having an index in a textbook - you don’t read the whole book for every question


RAG Pipeline Overview

1. INDEXING (done once)
   Documents → Chunks → Embeddings → Vector Database

2. RETRIEVAL (per query)
   Query → Embedding → Search DB → Top-K chunks

3. AUGMENTATION (per query)
   Context = System + Retrieved chunks + Query

4. GENERATION (per query)
   LLM(Context) → Response

Why RAG Works

Advantages:

vs Fine-Tuning:


Embeddings Revisited

From the LLM Primer:

New concept: We can create embeddings for entire chunks of text, not just tokens


Semantic Similarity

Text embeddings capture semantic meaning

Example:

"BRCA1 mutation" 
"breast cancer susceptibility gene defect"

These have different words but similar embeddings because they mean similar things

This enables semantic search - find by meaning, not just keywords
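
A minimal sketch of this, assuming the sentence-transformers package (all-MiniLM-L6-v2 is just a small, commonly used default model):

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "BRCA1 mutation",
    "breast cancer susceptibility gene defect",
    "RNA extraction protocol",
]
embeddings = model.encode(texts)  # one vector per text

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings[0], embeddings[1]))  # high: same meaning, different words
print(cosine(embeddings[0], embeddings[2]))  # lower: different topic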


The Indexing Phase (Step 1)

Goal: Prepare documents for efficient retrieval

Steps:

  1. Load documents (PDFs, text files, databases, etc.)
  2. Split into chunks (more on this next)
  3. Generate embeddings for each chunk
  4. Store in vector database (specialized DB for similarity search)

Done once (or when documents change)
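
A minimal indexing sketch, assuming ChromaDB as the vector database and sentence-transformers for embeddings; the chunks here are placeholders for the output of the chunking step described next:

import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./index")        # persists to disk
collection = client.get_or_create_collection("papers")

chunks = [
    "BRCA1 is a tumour suppressor gene involved in DNA repair...",
    "We performed whole-exome sequencing on 50 patient samples...",
]

collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
    metadatas=[{"source": "paper_1"} for _ in chunks],
)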


Document Chunking

Why chunk?

Chunk size tradeoff:

Typical sizes: 200-1000 tokens per chunk with 10-20% overlap
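
A minimal fixed-size chunker with overlap, using whitespace-separated words as a rough stand-in for tokens (a real implementation would count tokens with the embedding model's tokenizer):

def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    step = max(1, chunk_size - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks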


Chunking Strategies

1. Fixed-size chunks

2. Sentence-based

3. Semantic chunks

4. Structure-based (for scientific papers)


Chunking for Scientific Papers

Recommended approach:

chunks = [
    {"section": "Abstract", "text": "...", "metadata": {...}},
    {"section": "Introduction", "text": "...", "metadata": {...}},
    {"section": "Methods", "text": "...", "metadata": {...}},
    {"section": "Results", "text": "...", "metadata": {...}},
    {"section": "Discussion", "text": "...", "metadata": {...}}
]

Benefit: Can prioritize sections by query type
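
One way to produce such chunks is to split on section headings; a rough sketch for plain-text papers (PDFs usually need a dedicated parser first):

import re

SECTION_PATTERN = re.compile(
    r"^(Abstract|Introduction|Methods|Results|Discussion)\s*$",
    re.MULTILINE | re.IGNORECASE,
)

def chunk_by_section(paper_text, paper_id):
    """Split a plain-text paper into one chunk per recognised section."""
    matches = list(SECTION_PATTERN.finditer(paper_text))
    chunks = []
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(paper_text)
        chunks.append({
            "section": match.group(1).title(),
            "text": paper_text[start:end].strip(),
            "metadata": {"paper_id": paper_id},
        })
    return chunks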


Vector Databases

Regular database: Exact match queries

SELECT * FROM papers WHERE title = 'BRCA1 mutations'

Vector database: Similarity queries

results = db.similarity_search(
    query_embedding,
    top_k=5
)

Popular options: ChromaDB, Pinecone, Weaviate, FAISS, Qdrant


The Retrieval Phase (Step 2)

For each query:

  1. Convert query to embedding (same model as indexing)
  2. Search vector DB for most similar chunks
  3. Rank by similarity score (cosine similarity, dot product)
  4. Return top-K chunks (typically K=3-10)

Fast: vector databases use approximate nearest-neighbor (ANN) indexes optimized for high-dimensional similarity search
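
Continuing the ChromaDB-based sketch from the indexing phase (the same embedding model must be used for the query):

import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")        # same model as indexing
collection = chromadb.PersistentClient(path="./index").get_or_create_collection("papers")

query = "Which sequencing platform was used?"
results = collection.query(
    query_embeddings=embedder.encode([query]).tolist(),
    n_results=5,                                          # top-K
)

for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"{dist:.3f}  {doc[:80]}")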


Similarity Metrics

How to measure “closeness” of vectors?

Cosine similarity: Angle between vectors (ignores magnitude; the most common choice for text embeddings)

Euclidean distance: Straight-line distance between vector endpoints (smaller = more similar)

Dot product: Sum of element-wise products (like cosine similarity, but sensitive to vector magnitude)
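
The three metrics in a few lines of NumPy (toy vectors, not real embeddings):

import numpy as np

a = np.array([0.2, 0.9, 0.1])
b = np.array([0.3, 0.8, 0.0])

dot = np.dot(a, b)                                       # magnitude-sensitive
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # angle only
euclidean = np.linalg.norm(a - b)                        # straight-line distance

print(cosine, euclidean, dot)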


The Augmentation Phase (Step 3)

Build the context for the LLM:

context = f"""
System: You are an expert bioinformatician. Answer based on 
the provided documents. Cite sources.

Retrieved Information:
{retrieved_chunk_1}

{retrieved_chunk_2}

{retrieved_chunk_3}

User Query: {user_question}
"""

The Generation Phase (Step 4)

Pass augmented context to LLM:

from litellm import completion  # a LiteLLM-style unified client is assumed here

response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": context}]
)

The LLM generates answer based on:


RAG vs Context Stuffing

Context Stuffing: Put all documents in context

RAG: Retrieve only relevant parts


Retrieval Quality Matters

The RAG Bottleneck:

If retrieval fails to find relevant chunks, the LLM can’t help

Common issues:

Solution: Hybrid retrieval strategies


Hybrid Retrieval

Semantic search alone may miss exact keyword matches

Keyword search alone may miss semantic matches

Hybrid approach:

  1. Semantic search (embedding similarity)
  2. + Keyword search (BM25, TF-IDF)
  3. Combine and rerank results

Best of both worlds
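
One simple way to merge the two result lists is reciprocal rank fusion (RRF); a sketch assuming each retriever returns an ordered list of chunk IDs:

from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of chunk IDs; k dampens the weight of top ranks."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["c12", "c03", "c44", "c07"]   # from embedding search
keyword_hits  = ["c03", "c19", "c12", "c88"]   # from BM25 / TF-IDF

print(reciprocal_rank_fusion([semantic_hits, keyword_hits])[:5])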


Reranking

Problem: Initial retrieval may have noise in top-K

Solution: Reranking step

  1. Retrieve top-20 candidates (broad net)
  2. Rerank using more sophisticated model
  3. Take top-5 for context

Reranker: Specialized model trained to score query-document relevance
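
A reranking sketch using a cross-encoder from sentence-transformers (the ms-marco model named here is one common off-the-shelf reranker, not the only option):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "RNA extraction from blood samples"
candidates = [                                  # e.g. top-20 from first-pass retrieval
    "TRIzol-based RNA extraction from whole blood...",
    "Library preparation for bulk RNA-seq...",
    "DNA extraction from FFPE tissue sections...",
]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
top_k = reranked[:5]                            # goes into the LLM context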


Metadata Filtering

Enhance retrieval with structured filters:

results = db.similarity_search(
    query_embedding,
    filter={
        "organism": "Homo sapiens",
        "year": {"$gte": 2020},
        "journal": "Nature Genetics"
    },
    top_k=5
)

Combines semantic search with traditional filters


RAG in Bioinformatics: Use Cases

1. Protocol search

2. Literature review

3. Annotation lookup

4. Experiment notes


RAG Limitations

1. Retrieval accuracy

2. Chunk boundary issues

3. Conflicting information

4. Computational cost


When NOT to Use RAG

Skip RAG when:

Use RAG when:


RAG vs Fine-Tuning vs Prompting

Prompting:

RAG:

Fine-Tuning:

Often combined for best results


Practical Demo Overview

We’ll build two RAG systems:

  1. Paper Q&A: Query a collection of genomics papers
  2. Protocol Assistant: Search lab protocols and methods

We’ll see:


Demo 1: Paper Q&A System

System components:

Flow:

  1. Load papers → chunk by section
  2. Generate embeddings → store in ChromaDB
  3. Query: “What sequencing methods were used?”
  4. Retrieve relevant chunks
  5. Generate answer with citations
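
A condensed sketch of how these pieces might fit together; the actual demo code lives in lectures/demos/session_3/, and ChromaDB plus a LiteLLM-style completion call are assumed here (the documents are placeholders):

import chromadb
from litellm import completion

client = chromadb.PersistentClient(path="./paper_index")
papers = client.get_or_create_collection("genomics_papers")

# Steps 1-2: index section-level chunks (embeddings via the collection's default model)
papers.add(
    ids=["paper1_methods", "paper1_results"],
    documents=[
        "Libraries were sequenced on an Illumina NovaSeq 6000 platform...",
        "We identified 12 candidate variants in BRCA1 and BRCA2...",
    ],
    metadatas=[
        {"paper": "Example et al. 2023", "section": "Methods"},
        {"paper": "Example et al. 2023", "section": "Results"},
    ],
)

# Steps 3-4: retrieve the chunks most relevant to the question
question = "What sequencing methods were used?"
hits = papers.query(query_texts=[question], n_results=2)

# Step 5: generate an answer grounded in (and citing) the retrieved chunks
context = "\n\n".join(
    f"[{m['paper']}, {m['section']}] {doc}"
    for doc, m in zip(hits["documents"][0], hits["metadatas"][0])
)
response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{
        "role": "user",
        "content": f"Answer using only these excerpts and cite them:\n\n{context}\n\nQuestion: {question}",
    }],
)
print(response.choices[0].message.content)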

Demo 2: Protocol Search

System components:

Flow:

  1. Index protocols with metadata
  2. Query: “RNA extraction from blood samples”
  3. Filter: protocols from last 2 years
  4. Retrieve and generate step-by-step answer
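
The date filter in step 3 might look like this, assuming a ChromaDB collection named protocols whose chunks carry a numeric year in their metadata:

from datetime import datetime

cutoff = datetime.now().year - 2                  # "protocols from last 2 years"

hits = protocols.query(
    query_texts=["RNA extraction from blood samples"],
    n_results=5,
    where={"year": {"$gte": cutoff}},             # metadata filter + semantic search
)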

Key Takeaways

  1. RAG extends LLM knowledge beyond training and context limits
  2. Pipeline: Index → Retrieve → Augment → Generate
  3. Embeddings enable semantic search (meaning, not just keywords)
  4. Chunking strategy matters for retrieval quality
  5. Vector databases make similarity search fast
  6. Hybrid approaches (semantic + keyword) often work best
  7. RAG ≠ perfect - retrieval quality is critical

Theory ↔ Practice Connections

Theory: Embeddings capture semantic meaning in vector space

Practice: Similar concepts cluster together, enabling semantic search


Theory: Attention mechanism processes all context tokens

Practice: Only include relevant retrieved chunks to focus attention


Theory: Context windows have hard token limits

Practice: RAG circumvents limits by selective retrieval


Looking Ahead: Session 4

Next topic: Tool Use & Function Calling

The problem we’ll solve:

RAG retrieves static documents. What about dynamic information?

Solution: Give LLMs the ability to use tools


Resources

Libraries:

Papers:

Demo code: lectures/demos/session_3/


Questions?

Next session: Tool Use & Function Calling