- Retrieval quality is the main determinant of RAG accuracy.
- Ingestion and query pipelines are standard; what matters most is how well retrieved chunks match user intent.
- Parsing quality is foundational.
- Simple PDF text extractors work only for clean, text-based PDFs.
- For complex/scanned/layout-heavy docs, use robust parsers (e.g., Unstructured.io, LlamaParse) or retrieval will fail downstream.
- Chunking is the highest-impact design choice.
- Too large → blurry, mixed-topic embeddings; imprecise retrieval.
- Too small → not enough context per chunk; LLM can’t answer well.
- Baseline: ~512-token fixed-size chunks with ~10% overlap.
- Upgrade paths:
- Semantic / structure-aware chunking for well-structured docs.
- Recursive splitting (e.g., LangChain’s RecursiveCharacterTextSplitter).
- Parent-document (hierarchical) retrieval: small chunks for retrieval, larger parents for context.
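
The fixed-size baseline above can be sketched as follows. This is a minimal illustration, not a production splitter: the function name `chunk_tokens` is hypothetical, and whitespace-separated words stand in for real tokenizer tokens, so actual token counts will differ.

```python
def chunk_tokens(tokens, chunk_size=512, overlap_ratio=0.10):
    """Split a token list into fixed-size windows that overlap by overlap_ratio."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)   # ~51 tokens for 512 at 10%
    step = max(1, chunk_size - overlap)         # how far each window advances

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):   # last window reached the end
            break
    return chunks


# Assumption: whitespace words approximate tokens for this demo.
words = ("lorem ipsum " * 500).split()          # 1000 pseudo-tokens
chunks = chunk_tokens(words)
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.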
- Metadata is the second-biggest lever after chunking.
- Always store rich metadata: source, type, date/version, author/department, topic tags, page/section.
- Use metadata filters at query time (e.g., restrict to a specific contract, time range, or document type) before similarity search.
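
As a rough sketch of filtering before similarity search, the example below narrows candidates by exact-match metadata and only then ranks by cosine similarity. The record layout (`id`, `embedding`, `metadata`) and the function `filtered_search` are assumptions made for illustration. In practice a vector database would apply the filter inside its index rather than in a Python loop.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filtered_search(query_vec, chunks, filters, top_k=4):
    """Keep chunks whose metadata matches every filter, then rank by similarity."""
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy two-dimensional embeddings; real ones have hundreds of dimensions.
chunks = [
    {"id": "a", "embedding": [1.0, 0.0], "metadata": {"doc_type": "contract", "year": 2024}},
    {"id": "b", "embedding": [0.9, 0.1], "metadata": {"doc_type": "policy", "year": 2024}},
    {"id": "c", "embedding": [0.0, 1.0], "metadata": {"doc_type": "contract", "year": 2023}},
]
hits = filtered_search([1.0, 0.0], chunks, {"doc_type": "contract"})
```

Chunk `b` is close to the query but never reaches the ranking step, because it is a policy document rather than a contract.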
- Use a single embedding model consistently.
- Same model for ingestion and queries is mandatory: vectors produced by different models are not comparable.
- Defaults:
- text-embedding-3-small for most use cases (cheap, fast, strong general performance).
- text-embedding-3-large when precision is critical (legal/medical) and higher cost is acceptable.
- Vector storage choices depend on scale and infra.
- pgvector: easiest if you already use Postgres; good up to a few hundred thousand vectors.
- Weaviate / Qdrant: strong open-source, production-ready vector DBs with hybrid search options.
- Pinecone: managed, minimal infra overhead.
- Query pipeline should be multi-step and retrieval-centric.
- Optional query rewriting to make questions more retrieval-friendly.
- Embed query with same model as documents.
- Retrieve top-K (typically 4–10) via cosine similarity.
- Optional but recommended re-ranking with a cross-encoder (e.g., Cohere Rerank, FlashRank, Jina Reranker) to improve relevance.
- Construct context with citations and prompt the LLM to stay grounded in that context.
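
The final step of the pipeline above, building a cited context and a grounding instruction, might look like the sketch below. The chunk fields (`source`, `page`, `text`), the function `build_prompt`, and the prompt wording are all illustrative assumptions. Retrieval and the LLM call are left out.

```python
def build_prompt(question, retrieved):
    """Number each retrieved chunk so the model can cite it as [n]."""
    blocks = []
    for i, chunk in enumerate(retrieved, start=1):
        header = f"[{i}] (source: {chunk['source']}, page {chunk['page']})"
        blocks.append(f"{header}\n{chunk['text']}")
    context = "\n\n".join(blocks)

    return (
        "Answer using only the context below. Cite sources as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    {"source": "handbook.pdf", "page": 3, "text": "Employees accrue 20 vacation days per year."},
    {"source": "handbook.pdf", "page": 7, "text": "Unused days expire on 31 March."},
]
prompt = build_prompt("How many vacation days do employees get?", retrieved)
```

Numbering the chunks gives the model a compact way to cite, and lets the application map each `[n]` back to a source and page.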
- Hybrid search usually beats pure vector search in enterprises.
- Combine vector search with BM25 keyword search (e.g., via Elasticsearch/OpenSearch or native hybrid in Weaviate).
- Merge results with something like Reciprocal Rank Fusion (RRF).
- Especially important for identifiers, codes, acronyms, and domain-specific terms.
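
A minimal sketch of Reciprocal Rank Fusion over a vector ranking and a BM25 ranking follows. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant `k=60` is a commonly used default, not a tuned value, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by doc ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["d1", "d2", "d3"]   # from similarity search
bm25_hits = ["d3", "d1", "d4"]     # from keyword search
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# fused == ["d1", "d3", "d2", "d4"]
```

RRF uses only rank positions, so cosine and BM25 scores never need to be normalized onto a common scale.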
- You need an evaluation dataset before production.
- Build 50–200 question–answer pairs from real documents, with known correct answers and source docs.
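
One simple way to score retrieval against such a dataset is hit rate at K: the share of questions whose known source document appears in the top K results. The sketch below assumes each example has `question` and `source_doc` fields and that `retrieve` returns a ranked list of document IDs. Both are hypothetical interfaces, and the stub retriever exists only to make the example runnable.

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose expected source doc is in the top-k results."""
    if not eval_set:
        return 0.0
    hits = 0
    for example in eval_set:
        top_docs = retrieve(example["question"])[:k]
        if example["source_doc"] in top_docs:
            hits += 1
    return hits / len(eval_set)


eval_set = [
    {"question": "What is the notice period?", "source_doc": "contract_a.pdf"},
    {"question": "Who approves travel expenses?", "source_doc": "policy_travel.pdf"},
]


def stub_retrieve(question):
    # Stand-in for a real retriever: always returns the same ranking.
    return ["contract_a.pdf", "handbook.pdf"]


score = hit_rate_at_k(eval_set, stub_retrieve, k=2)   # 0.5: one of two questions hit
```

This measures retrieval only. Answer correctness needs a separate check against the known answers.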
Building Custom RAG Systems: A Complete Guide
A step-by-step guide to building RAG systems that actually work in production: chunking strategies, embedding model selection, reranking, hybrid search, and accuracy benchmarks.
Hestur AI Team