Building RAG Systems: Lessons Learned from Production

Retrieval-Augmented Generation (RAG) has become the go-to pattern for building LLM applications that need to work with custom data. Here are the lessons I learned the hard way while running RAG systems in production.

What is RAG, Really?

At its core, RAG is elegantly simple: instead of fine-tuning an LLM on your data (expensive and complex), you retrieve relevant context at query time and inject it into the prompt.

User Query → Retrieve Relevant Docs → Augment Prompt → Generate Response
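To make the flow concrete, here is a minimal sketch of the four steps. It is a toy: retrieval is plain word overlap standing in for a vector search, and `generate` is a placeholder callable where an LLM call would go. The function names `retrieve`, `augment`, and `rag_answer` are my own, not from any library.

```python
from typing import Callable, List


def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Rank docs by how many query words they share (a stand-in for vector search)."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in docs]
    # Keep only docs that share at least one word, best matches first
    matches = [pair for pair in scored if pair[0] > 0]
    ranked = sorted(matches, key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:k]]


def augment(query: str, context: List[str]) -> str:
    """Inject the retrieved context into the prompt."""
    context_block = "\n".join(f"- {doc}" for doc in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


def rag_answer(query: str, docs: List[str], generate: Callable[[str], str]) -> str:
    """User query -> retrieve relevant docs -> augment prompt -> generate response."""
    return generate(augment(query, retrieve(query, docs)))


docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Refunds require the original receipt.",
]
# `generate` would normally call an LLM; here it simply echoes the prompt
prompt = rag_answer("how are refunds processed", docs, generate=lambda p: p)
print(prompt)
```

Every lesson below is about replacing one of these toy pieces with something that holds up on real data.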

But as with most things in software, the devil is in the details.

Lesson 1: Chunking Strategy Matters More Than You Think

Your chunking strategy can make or break retrieval quality. Here’s what I’ve learned:

Don’t Just Split by Token Count

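# Assumes LangChain's text splitters; `document` is a plain string
from langchain_text_splitters import RecursiveCharacterTextSplitter

# ❌ Naive approach: cut every 500 characters, even mid-sentence
chunks = [document[i:i + 500] for i in range(0, len(document), 500)]

# ✅ Better: split on semantic boundaries (paragraph, line, sentence, word)
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " "],
    chunk_size=500,  # counted in characters by default, not tokens
    chunk_overlap=50,
)
chunks = recursive_splitter.split_text(document)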
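If you are curious what "use semantic boundaries" means mechanically, the idea can be sketched without any library: try the coarsest separator first and only fall back to a finer one for pieces that are still too long. This is a simplified illustration, not the LangChain implementation. It leaves out overlap, drops the separator at chunk boundaries, and `recursive_split` is a name I made up.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the coarsest separator first; recurse only into pieces that are too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No boundaries left to try: fall back to a hard cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry it with a finer separator
            chunks.extend(recursive_split(piece, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "Intro paragraph.\n\nFirst sentence here. Second sentence here.\n\nEnd."
for chunk in recursive_split(text, ["\n\n", ". ", " "], chunk_size=25):
    print(repr(chunk))
```

Paragraphs that fit stay whole, and only the oversized middle paragraph gets broken at a sentence boundary.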

Consider Document Structure

For structured documents (PDFs, markdown), preserve hierarchy:

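MAX_CHUNK_SIZE = 500  # characters, matching the chunk size used above

def smart_chunk(document):
    # extract_sections and split_preserving_context are placeholders for your
    # own parsing logic (for example, splitting on markdown headers)
    # Keep each header together with its content
    sections = extract_sections(document)
    chunks = []
    for section in sections:
        if len(section) > MAX_CHUNK_SIZE:
            # Oversized section: split it, but keep its header as context
            chunks.extend(split_preserving_context(section))
        else:
            chunks.append(section)
    return chunks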
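The two helpers in `smart_chunk` are left undefined above, so here is one possible way to fill them in for markdown input. Treat it as a sketch under assumptions: sections are delimited by `#` headers, oversized sections are split by paragraph, and the header is repeated on every resulting chunk. A single paragraph longer than the limit is kept whole, and `#` lines inside code blocks would be misread as headers. `smart_chunk` is repeated so the block runs on its own.

```python
import re
from typing import List

MAX_CHUNK_SIZE = 500  # characters; an illustrative value


def extract_sections(document: str) -> List[str]:
    """Split a markdown document so each header stays with the text below it."""
    # Split right before every line that starts with 1-6 '#' characters and a space
    parts = re.split(r"(?m)^(?=#{1,6} )", document)
    return [part.strip() for part in parts if part.strip()]


def split_preserving_context(section: str, max_size: int = MAX_CHUNK_SIZE) -> List[str]:
    """Split an oversized section by paragraph, repeating its header on every chunk."""
    header, _, body = section.partition("\n")
    chunks, current = [], header
    for paragraph in body.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Flush the current chunk if adding this paragraph would exceed the limit
        if current != header and len(current) + len(paragraph) + 2 > max_size:
            chunks.append(current)
            current = header
        current = f"{current}\n\n{paragraph}"
    if current != header:
        chunks.append(current)
    return chunks or [section]  # nothing to split on: keep the section as is


def smart_chunk(document: str) -> List[str]:
    chunks: List[str] = []
    for section in extract_sections(document):
        if len(section) > MAX_CHUNK_SIZE:
            chunks.extend(split_preserving_context(section))
        else:
            chunks.append(section)
    return chunks


doc = "# Title\n\nIntro text.\n\n## Setup\n\n" + "Step one. " * 30 + "\n\n" + "Step two. " * 30
for chunk in smart_chunk(doc):
    print(len(chunk), chunk[:20].replace("\n", " "))
```

Repeating the header costs a few characters per chunk, but it means a chunk retrieved in isolation still says which section it came from.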

Lesson 2: Embedding Selection is Critical

Not all embeddings are created equal. Here’s my decision framework:

Use Case	Recommended Model	Why
General text	`text-embedding-3-small`	Good balance of cost/quality
Technical docs	`text-embedding-3-large`	Better semantic understanding
Privacy-sensitive	`all-MiniLM-L6-v2`	Runs locally, no API calls

The Privacy Trade-off

For sensitive data, local embeddings are non-negotiable:

from sentence_transformers import SentenceTransformer

# Runs entirely on your infrastructure
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)
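Once chunks are embedded, retrieval comes down to comparing the query vector against the chunk vectors, most often by cosine similarity. In practice a vector store does this for you; the sketch below only shows the idea in plain Python with made-up 2-dimensional vectors. The helper names `cosine_similarity` and `top_k` are my own.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors for illustration; real embeddings have hundreds of dimensions
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.0]
print(top_k(query_vec, chunk_vecs, k=2))
```

With real data you would embed the query with the same model as the chunks, since vectors from different models are not comparable.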

Lesson 3: Hybrid Search Outperforms Pure Vector Search

Vector similarity isn’t always enough. Combining it with keyword search (BM25) often improves results, because BM25 rewards exact term matches that embeddings can blur:

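from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # needs the rank_bm25 package

# `vectorstore` and `docs` are assumed to exist already
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 5  # return the top 5 keyword matches

# Combine with weighted fusion
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]
)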

This is especially important for:

Technical documentation with specific terminology
Code snippets and error messages
Queries with exact phrases
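A note on what "weighted fusion" can look like under the hood. Vector scores and BM25 scores live on different scales, so a common approach is to fuse ranks instead of raw scores, for example with reciprocal rank fusion (RRF). The sketch below is my own minimal version, not LangChain's code; `weighted_rrf` is a made-up name and `c=60` is just a commonly used default.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(
    rankings: Sequence[List[str]],
    weights: Sequence[float],
    c: int = 60,
) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["a", "b", "c"]  # ids ranked by vector similarity
bm25_hits = ["c", "a", "d"]    # ids ranked by BM25
print(weighted_rrf([vector_hits, bm25_hits], weights=[0.6, 0.4]))
```

Documents that appear in both lists rise to the top, which is the behavior you want when an exact keyword hit and a semantic hit agree.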

Lesson 4: Context Window Management

Just because you can stuff 100k tokens into a prompt doesn’t mean you should: every extra token adds cost and latency, and can bury the passages that matter.
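One simple way to enforce this is a token budget for retrieved context. The sketch below keeps the highest-scoring documents that fit and skips the ones that don't. The 4-characters-per-token estimate is a rough assumption for English text; with a real model you would count tokens with its tokenizer. `estimate_tokens` and `fit_to_budget` are hypothetical helper names.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def fit_to_budget(scored_docs: List[Tuple[float, str]], max_tokens: int) -> List[str]:
    """Keep the highest-scoring docs that fit in the budget, skipping ones that don't."""
    selected: List[str] = []
    used = 0
    for _, text in sorted(scored_docs, key=lambda pair: pair[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > max_tokens:
            continue  # too big for the remaining budget; try the next doc
        selected.append(text)
        used += cost
    return selected


scored_docs = [(0.9, "a" * 400), (0.8, "b" * 800), (0.5, "c" * 200)]
kept = fit_to_budget(scored_docs, max_tokens=160)
print([len(text) for text in kept])
```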

The “Lost in the Middle” Problem

LLMs tend to use information in the middle of a long context less reliably than information near the start or end. Order your retrieved content so the strongest matches sit at the edges:

def format_context(retrieved_docs):
    # Rank from most to least relevant (assumes a higher score means more relevant)
    ranked = sorted(retrieved_docs, key=lambda x: x.score, reverse=True)

    # Alternate documents between the front and the back of the context,
    # so the strongest matches sit at the edges and the weakest in the middle
    front = ranked[0::2]
    back = ranked[1::2]
    return front + back[::-1]

Lesson 5: Evaluation is Hard But Essential

You can’t improve what you can’t measure. Here’s a practical evaluation approach:

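def evaluate_rag_response(query, response, retrieved_context, ground_truth, latency_ms):
    # compute_relevance, check_hallucination, and check_coverage are placeholders
    # for whichever scoring functions you use
    metrics = {
        "relevance": compute_relevance(response, ground_truth),
        "faithfulness": check_hallucination(response, retrieved_context),
        "completeness": check_coverage(response, ground_truth),
        "latency_ms": latency_ms,  # measured by the caller around the RAG call
    }
    return metrics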

Key Metrics to Track

Retrieval Precision: Are we fetching relevant documents?
Answer Faithfulness: Is the answer grounded in retrieved context?
End-to-end Latency: User experience matters
Cost per Query: Especially with paid embeddings/LLMs
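Retrieval precision is the easiest of these to compute once you have a small labeled set of queries with known relevant documents. Here is a minimal sketch of three standard retrieval metrics: precision@k, recall@k, and reciprocal rank. The document ids are made up for illustration.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    return sum(doc in relevant for doc in top) / len(top) if top else 0.0


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that show up in the top k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


retrieved = ["d3", "d1", "d7", "d2"]  # ids in the order the retriever returned them
relevant = {"d1", "d2", "d9"}         # ids a human marked as relevant
print(precision_at_k(retrieved, relevant, k=3))
print(recall_at_k(retrieved, relevant, k=4))
print(reciprocal_rank(retrieved, relevant))
```

Averaging these over a few dozen queries gives you a baseline to compare against every time you change the chunking, embeddings, or fusion weights.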

Common Pitfalls to Avoid

1. Ignoring Metadata

Always preserve and leverage document metadata:

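from langchain_core.documents import Document

# text, filename, page_num, section_title, and doc_date come from your parser
chunk = Document(
    page_content=text,
    metadata={
        "source": filename,
        "page": page_num,
        "section": section_title,
        "date": doc_date
    }
)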
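Preserving metadata only pays off if you use it. Two typical uses are filtering candidates (by source or recency) and citing sources in the answer. The sketch below uses a small stand-in `Chunk` dataclass instead of the LangChain `Document` so it runs without dependencies; `filter_chunks` and `cite` are hypothetical helpers.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any, Dict, List, Optional


@dataclass
class Chunk:
    """Stand-in for a document chunk with the same two fields as above."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_chunks(
    chunks: List[Chunk],
    source: Optional[str] = None,
    newer_than: Optional[date] = None,
) -> List[Chunk]:
    """Keep chunks that match the given source and are newer than the given date."""
    kept = []
    for chunk in chunks:
        meta = chunk.metadata
        if source is not None and meta.get("source") != source:
            continue
        # Chunks without a date are treated as oldest and filtered out
        if newer_than is not None and meta.get("date", date.min) <= newer_than:
            continue
        kept.append(chunk)
    return kept


def cite(chunk: Chunk) -> str:
    """Format a short citation from the chunk's metadata."""
    meta = chunk.metadata
    return f'{meta.get("source", "unknown")}, p. {meta.get("page", "?")}'


chunks = [
    Chunk("Old policy", {"source": "policy.pdf", "page": 2, "date": date(2022, 1, 1)}),
    Chunk("New policy", {"source": "policy.pdf", "page": 5, "date": date(2024, 6, 1)}),
    Chunk("FAQ entry", {"source": "faq.md", "page": 1, "date": date(2024, 7, 1)}),
]
recent = filter_chunks(chunks, source="policy.pdf", newer_than=date(2023, 1, 1))
print([cite(chunk) for chunk in recent])
```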

2. Not Handling Edge Cases

What happens when retrieval returns nothing relevant?

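# Inside your answer function; the `not retrieved` check covers an empty result list
if not retrieved or max(doc.score for doc in retrieved) < RELEVANCE_THRESHOLD:
    return "I don't have enough information to answer this question accurately."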
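Here is a slightly fuller sketch of the same guard as a standalone function. It is a variation on the check above: instead of only looking at the best score, it also drops individual weak matches before generation. The threshold value, the `ScoredDoc` type, and the `generate` callable are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

RELEVANCE_THRESHOLD = 0.5  # illustrative; tune it against your own score distribution
FALLBACK = "I don't have enough information to answer this question accurately."


@dataclass
class ScoredDoc:
    text: str
    score: float  # assumed: higher means more relevant


def answer_or_fallback(
    retrieved: List[ScoredDoc],
    generate: Callable[[List[ScoredDoc]], str],
) -> str:
    """Generate only when at least one document clears the relevance threshold."""
    relevant = [doc for doc in retrieved if doc.score >= RELEVANCE_THRESHOLD]
    if not relevant:
        # Covers both an empty result list and results that are all weak matches
        return FALLBACK
    return generate(relevant)


def summarize(docs: List[ScoredDoc]) -> str:
    # Placeholder for the LLM call
    return "Answer based on: " + "; ".join(doc.text for doc in docs)


print(answer_or_fallback([], summarize))
print(answer_or_fallback([ScoredDoc("Refund policy", 0.8), ScoredDoc("Noise", 0.1)], summarize))
```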

3. Over-engineering the First Version

Start simple, measure, iterate. Don’t build a complex multi-stage retrieval pipeline before proving the basic approach works.

Wrapping Up

Building RAG systems is an iterative process. Start with the basics, measure everything, and optimize based on real user feedback. The perfect architecture is the one that solves your specific problem—not the most sophisticated one.

In future posts, I’ll dive deeper into specific topics like multi-modal RAG, agent-based retrieval, and production deployment strategies.

Have questions or want to share your RAG experiences? Reach out on LinkedIn or GitHub!