Hestur AI

    Building Custom RAG Systems: A Complete Guide

    A step-by-step guide to building RAG systems that actually work in production: chunking strategies, embedding model selection, reranking, hybrid search, and accuracy benchmarks.

    1. Retrieval quality is the main determinant of RAG accuracy.
      • Ingestion and query pipelines are standard; what matters most is how well retrieved chunks match user intent.
    2. Parsing quality is foundational.
      • Simple PDF text extractors work only for clean, text-based PDFs.
      • For complex, scanned, or layout-heavy docs, use robust parsers (e.g., Unstructured.io, LlamaParse); otherwise parsing errors carry through and retrieval fails downstream.
    3. Chunking is the highest-impact design choice.
      • Too large → blurry, mixed-topic embeddings; imprecise retrieval.
      • Too small → not enough context per chunk; LLM can’t answer well.
      • Baseline: ~512-token fixed-size chunks with ~10% overlap.
      • Upgrade paths:
        • Semantic / structure-aware chunking for well-structured docs.
        • Recursive splitting (e.g., LangChain’s RecursiveCharacterTextSplitter).
        • Parent-document (hierarchical) retrieval: small chunks for retrieval, larger parents for context.
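The fixed-size baseline above can be sketched as follows. This is a minimal illustration, not a production splitter: it assumes the text is already tokenized into a list, and the function name `chunk_tokens` and the default overlap of 51 tokens (roughly 10% of 512) are choices made here for the example.

```python
def chunk_tokens(tokens: list, chunk_size: int = 512, overlap: int = 51) -> list[list]:
    """Split a token list into fixed-size chunks that share `overlap` tokens.

    Hypothetical helper: `tokens` can be any list (tokenizer IDs, words, ...).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once this window has reached the end of the document,
        # so no trailing chunk made purely of overlap is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With real text, the token list would come from the tokenizer that matches the embedding model, so that chunk sizes line up with the model's own token counts.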
    4. Metadata is the second-biggest lever after chunking.
      • Always store rich metadata: source, type, date/version, author/department, topic tags, page/section.
      • Use metadata filters at query time (e.g., restrict to a specific contract, time range, or document type) before similarity search.
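A rough sketch of metadata pre-filtering, in plain Python rather than any particular vector database's filter syntax. The `ChunkMeta` fields and the `prefilter` helper are hypothetical names chosen for illustration; a real store would apply an equivalent filter inside the index.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ChunkMeta:
    source: str       # e.g. file name of the originating document
    doc_type: str     # e.g. "contract", "policy"
    published: date   # document date or version date
    section: str      # page or section label


def matches(meta: ChunkMeta,
            doc_type: Optional[str] = None,
            source: Optional[str] = None,
            start: Optional[date] = None,
            end: Optional[date] = None) -> bool:
    """Return True if the chunk satisfies every filter that was supplied."""
    if doc_type is not None and meta.doc_type != doc_type:
        return False
    if source is not None and meta.source != source:
        return False
    if start is not None and meta.published < start:
        return False
    if end is not None and meta.published > end:
        return False
    return True


def prefilter(metas: list[ChunkMeta], **criteria) -> list[int]:
    """Indices of chunks that pass the filters; only these go on to similarity search."""
    return [i for i, meta in enumerate(metas) if matches(meta, **criteria)]
```

Filtering first shrinks the candidate set, so the similarity search cannot surface a close-looking chunk from the wrong contract or an outdated version.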
    5. Use a single embedding model consistently.
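      • The same model must embed documents at ingestion and queries at search time; vectors from different models live in different spaces, so similarity scores between them are meaningless.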
      • Defaults:
        • text-embedding-3-small for most use cases (cheap, fast, strong general performance).
        • text-embedding-3-large when precision is critical (legal/medical) and higher cost is acceptable.
    6. Vector storage choices depend on scale and infra.
      • pgvector: easiest if you already use Postgres; good up to a few hundred thousand vectors.
      • Weaviate / Qdrant: strong open-source, production-ready vector DBs with hybrid search options.
      • Pinecone: managed, minimal infra overhead.
    7. Query pipeline should be multi-step and retrieval-centric.
      • Optional query rewriting to make questions more retrieval-friendly.
      • Embed query with same model as documents.
      • Retrieve top-K (typically 4–10) via cosine similarity.
      • Optional but recommended re-ranking with a cross-encoder (e.g., Cohere Rerank, FlashRank, Jina Reranker) to improve relevance.
      • Construct context with citations and prompt the LLM to stay grounded in that context.
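The retrieval and context-construction steps above might look like this in outline. It is a brute-force sketch that assumes embeddings are already computed and held in memory; the tuple layout, the function names, and the prompt wording are all illustrative assumptions, and a vector database would replace the linear scan in practice.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed_chunks: list[tuple], k: int = 5) -> list[tuple]:
    """Score every chunk against the query and keep the k most similar.

    Each entry of `indexed_chunks` is (chunk_id, source, text, vector).
    Returns (score, chunk_id, source, text) tuples, best first.
    """
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id, source, text)
        for chunk_id, source, text, vec in indexed_chunks
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def build_prompt(question: str, hits: list[tuple]) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    context = "\n".join(
        f"[{n}] ({source}) {text}"
        for n, (_score, _chunk_id, source, text) in enumerate(hits, start=1)
    )
    return (
        "Answer using only the context below and cite sources as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

A reranking step, where used, would sit between `top_k` and `build_prompt`: retrieve a larger candidate set, reorder it with the cross-encoder, then pass only the best few chunks into the prompt.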
    8. Hybrid search usually beats pure vector search in enterprise settings.
      • Combine vector search with BM25 keyword search (e.g., via Elasticsearch/OpenSearch or native hybrid in Weaviate).
      • Merge the two ranked result lists with a rank-fusion method such as Reciprocal Rank Fusion (RRF).
      • Especially important for identifiers, codes, acronyms, and domain-specific terms.
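As an illustration of the merge step, here is a small sketch of Reciprocal Rank Fusion. Each document scores the sum of 1 / (k + rank) over every list it appears in, with ranks starting at 1. The constant k = 60 is a commonly used default, not something the article specifies, and the function name is chosen for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    `rankings` holds one best-first list per retriever (e.g. vector, BM25).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID to keep the output stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]  # hypothetical IDs from the vector search
bm25_hits = ["c", "a", "d"]    # hypothetical IDs from the keyword search
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

RRF uses only rank positions, so the cosine and BM25 scores never need to be put on a common scale, which is the main reason it is a convenient default for hybrid search.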
    9. You need an evaluation dataset before production.
      • Build 50–200 question–answer pairs from real documents, with known correct answers and source docs.
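One way to put such a dataset to use is a retrieval hit rate: the share of questions for which the known source document appears in the top-k results. The sketch below assumes a `retrieve` callable that returns source document names best-first; that callable, the `EvalCase` fields, and the choice of metric are assumptions for illustration, not the article's prescribed method.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected_answer: str  # kept for answer-level checks, unused in this metric
    source_doc: str       # document that contains the correct answer


def hit_rate_at_k(cases: list[EvalCase],
                  retrieve: Callable[[str], list[str]],
                  k: int = 5) -> float:
    """Fraction of cases whose source document is among the top-k retrieved."""
    if not cases:
        return 0.0
    hits = sum(1 for case in cases if case.source_doc in retrieve(case.question)[:k])
    return hits / len(cases)
```

Re-running this after each change to chunking, embeddings, or reranking shows whether the change actually improved retrieval on the evaluation set.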