Hestur AI
    Retrieval-Augmented Generation

    RAG That Retrieves the Right Answer, Every Time

    Most RAG projects fail because retrieval is treated as a solved problem. We engineer precision retrieval pipelines: hybrid search, reranking, and citation tracking. Your AI finds the right chunk and never invents the rest.

    95% retrieval accuracy across 50+ file formats, including video and audio


    The Problem We Solve

    Before Hestur AI

    • Your AI answers with confidence — from the wrong paragraph
    • Engineers spend weeks prompt-tuning around broken retrieval
    • No way to trace which source an answer came from
    • File types locked to PDF and plain text — audio, video, spreadsheets excluded
    • Every employee can query everything — no permission boundaries in retrieval

    After Hestur AI

    • Retrieval returns the right chunk 95%+ of the time, measured on eval sets
    • Answers are grounded in verified sources with full citation trails
    • Every response links to the exact document, page, and paragraph
    • 50+ file formats ingested — MP4, MP3, XLSX, scanned PDFs, HTML, and more
    • Permission-aware retrieval — users only surface what their role allows

    Key Results

    95%

    Retrieval Accuracy

    Measured on held-out evaluation sets before go-live

    50+

    File Formats

    Including video, audio, spreadsheets, and scanned documents

    <200ms

    Query Latency

    At production scale with full reranking pipeline

    40%

    Hallucination Reduction

    Versus standard single-stage dense retrieval

    Technical Capabilities

    • Hybrid semantic + keyword (BM25) search
    • Cross-encoder reranking
    • Chunk-level citation tracking
    • Multi-modal ingestion (PDF, DOCX, MP4, MP3, XLSX, HTML, OCR)
    • Permission-aware retrieval with row-level and document-level ACL
    • Query decomposition and intent routing
    • Metadata filtering and faceted search
    • Streaming responses with source attribution
    • LangChain, LangGraph, LlamaIndex, Haystack orchestration
    • Pinecone, Weaviate, pgvector vector store options
    • On-premises deployment for regulated industries
    • Evaluation harness with continuous accuracy monitoring

    Why Your Last RAG Project Hallucinated — and How to Fix It

    The engineering team built a chatbot on your knowledge base. It looked impressive in the demo. In production, it started answering questions with information that didn't exist anywhere in your documents. An executive asked it about a client contract, got a confident, specific, completely fabricated answer, and sent it to the client before anyone caught it.

    This is the story we hear most often before a new engagement starts.

    The failure almost never lives in the language model. Modern LLMs — GPT-4o, Claude 3.5, Gemini 1.5 — are extraordinarily good at generating coherent text from context. The failure lives in what you handed the model: the wrong context, retrieved from the wrong document, for the wrong reason.

    Fix the retrieval. The hallucinations stop.

    We have rebuilt failing RAG systems at companies in financial services, healthcare, and legal tech. In every case, the root cause was the same: single-stage dense retrieval, basic text splitting, no reranking, and no measurement of whether retrieval was working at all. Nobody had defined what 'correct retrieval' looked like. Nobody had built an eval set. Nobody had measured accuracy before shipping.

    We fix all of that.

    How We Engineer 95% Retrieval Accuracy

    95% retrieval accuracy sounds like a claim. Here is the specific engineering that produces it.

    Step 1 — Hybrid Search: Semantic Plus Keyword

    Standard RAG uses dense vector search: your query is embedded, and the system finds document chunks whose embeddings are closest in vector space. This works well for semantic similarity — 'what are the payment terms?' matches 'invoices are due within 30 days.'

    It fails on exact terms. A query for 'Regulation 1.4(b)(iii)' or 'product SKU AX-2297' has no semantic neighbour. Dense search returns vaguely relevant paragraphs. The right paragraph, which contains the exact string, ranks lower or not at all.

    Hybrid search runs two retrieval systems in parallel and fuses their results: dense retrieval using text-embedding-3-large or equivalent, capturing semantic meaning; and BM25 sparse retrieval, capturing exact keyword and token matches. Reciprocal Rank Fusion (RRF) combines both result sets. Conceptual queries get strong dense results. Lexical queries get strong keyword results. This alone moves retrieval accuracy from roughly 52% to approximately 78% on the benchmark sets we maintain for internal quality tracking.
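    As an illustration of the fusion step, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs. The chunk IDs and the constant k=60 (a commonly used default) are assumptions for the example, not details of any particular deployment.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranked list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

dense_hits = ["c7", "c2", "c9"]  # hypothetical dense (embedding) results
bm25_hits = ["c2", "c4", "c7"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

    A chunk that both retrievers rank highly (here "c2" and "c7") rises above chunks that only one retriever found, which is the behaviour hybrid search relies on.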

    Step 2 — Cross-Encoder Reranking

    The first retrieval pass is designed for speed, not precision. It returns the top-k candidates quickly from a corpus of millions of chunks. A second pass re-scores those candidates with a cross-encoder model — a transformer that reads the query and each candidate chunk together, attending to the relationship between them rather than their independent embeddings.

    Cross-encoders are slower because their scores cannot be pre-computed: each score depends on the query, so it has to be calculated at query time. That is why they run only on the top-k shortlist, not the full corpus. In exchange, they are typically far more accurate than bi-encoder similarity scores. We use Cohere Rerank or a domain-fine-tuned equivalent. This second pass typically adds another 8–12 percentage points of retrieval accuracy on top of hybrid search.
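    The two-pass shape can be sketched without committing to a specific model. In the example below, `score_fn` is a hypothetical stand-in for a cross-encoder call, and `toy_overlap_score` is a deliberately crude scorer used only so the sketch runs on its own; a real system would score each (query, chunk) pair with a reranking model.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-score first-pass candidates pairwise and keep the best top_n.

    candidates: list of (chunk_id, text) from the fast first pass.
    score_fn(query, text) -> float stands in for a cross-encoder model.
    """
    scored = [(score_fn(query, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:top_n]]

def toy_overlap_score(query, text):
    """Toy scorer (assumption): fraction of query tokens found in the chunk."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0

candidates = [
    ("c1", "general overview of invoicing"),
    ("c2", "payment terms are net 30 days"),
    ("c3", "payment portal login help"),
]
reranked = rerank("payment terms", candidates, toy_overlap_score, top_n=2)
print(reranked)
```

    Because the scorer only ever sees the shortlist, its cost grows with top-k rather than with corpus size.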

    Step 3 — Precision Chunking by File Type

    Most RAG implementations split documents by character count or by sentence. This produces chunks that cut across logical boundaries — a clause from one contract paragraph and a clause from another end up in the same chunk, producing a blended embedding that accurately represents neither.

    We write format-specific parsers:

    • PDFs get layout-aware extraction that respects headers, tables, columns, and footnotes, with OCR for scanned documents.
    • Spreadsheets get table-aware chunking: rows are serialized with their headers preserved, so context is never stripped.
    • Audio files (MP3, M4A, WAV) get transcription via OpenAI Whisper plus a separate speaker-diarization step. Timestamps are embedded as metadata so citations link to the exact moment in a recording.
    • Video files get audio extraction, transcription, and keyframe analysis for slide-based content.

    The chunk boundary is the single highest-leverage decision in a RAG system. We get it right before indexing, not after.
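    To make the spreadsheet case concrete, here is a small sketch of header-preserving row serialization using only the standard library. The sample data, the `serialize_rows` name, and the metadata fields are hypothetical; a real XLSX parser would also need to handle merged cells, multiple sheets, and typed values.

```python
import csv
import io

def serialize_rows(csv_text, sheet_name):
    """Turn each data row into one chunk that repeats the column headers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    chunks = []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        text = "; ".join(f"{column}: {value}" for column, value in row.items())
        chunks.append({
            "text": text,
            "metadata": {"sheet": sheet_name, "row": row_number},
        })
    return chunks

sample = "SKU,Region,Price\nAX-2297,EMEA,149.00\nAX-2298,APAC,159.00\n"
chunks = serialize_rows(sample, sheet_name="pricing")
print(chunks[0]["text"])
```

    Each chunk stays interpretable on its own, because the value "149.00" always travels with the label "Price".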

    50+ File Formats, Natively Ingested

    Enterprise knowledge lives everywhere: SharePoint, Confluence, Google Drive, email archives, recorded meetings, scanned contracts, and proprietary databases. Any RAG system that can only read PDF and plain text is ignoring most of your organizational knowledge.

    We build ingestion pipelines that handle the full range of formats your organization actually uses.

    Format support is not a checkbox. Each format requires purpose-built extraction logic. A PDF parser that doesn't understand multi-column layouts will join text from two columns into a single garbled line. Ours detects layout structure. An audio transcription pipeline that doesn't embed timestamps into chunk metadata can't tell you where in a three-hour board recording a specific statement was made. Ours does. A spreadsheet parser that strips headers from rows to reduce token count makes the data uninterpretable in retrieval. Ours preserves them.
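    For the audio case, the sketch below groups transcript segments into chunks while keeping the start and end time of each chunk as metadata. The segment shape (dicts with `start`, `end`, `text`), the file name, and the `max_chars` limit are assumptions for illustration.

```python
def chunk_transcript(segments, source, max_chars=200):
    """Group consecutive transcript segments into timestamped chunks."""
    chunks, current = [], []

    def flush():
        # Emit the buffered segments as one chunk with its time range.
        if current:
            chunks.append({
                "text": " ".join(seg["text"].strip() for seg in current),
                "metadata": {
                    "source": source,
                    "start_s": current[0]["start"],
                    "end_s": current[-1]["end"],
                },
            })
            current.clear()

    for seg in segments:
        size = sum(len(s["text"]) for s in current) + len(seg["text"])
        if current and size > max_chars:
            flush()
        current.append(seg)
    flush()
    return chunks

segments = [  # hypothetical transcription output
    {"start": 0.0, "end": 4.2, "text": "Welcome to the board meeting."},
    {"start": 4.2, "end": 9.8, "text": "First item is the budget."},
    {"start": 9.8, "end": 15.0, "text": "We approve the revised figures."},
]
audio_chunks = chunk_transcript(segments, source="board-2024-q1.mp3", max_chars=60)
print(audio_chunks[1]["metadata"])
```

    A citation built from the second chunk can then point a listener to roughly 9.8 seconds into the recording instead of to the file as a whole.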

    The engineering effort for reliable multi-format ingestion is substantial. It is also a one-time build that becomes permanent infrastructure your organization owns.

    The Full RAG Stack We Deploy

    We do not have a favourite framework or a product to sell you. We choose the stack that fits your constraints: existing infrastructure, team expertise, deployment environment, and latency requirements.

    Orchestration

    • LangChain for linear retrieval pipelines with tool use and memory.
    • LangGraph for multi-step agentic RAG — query decomposition, parallel retrieval, synthesis across multiple indices.
    • LlamaIndex for document-centric workloads with complex index types, including knowledge graphs and hierarchical node parsers.
    • Haystack for teams that need a production-grade, modular pipeline framework with strong observability tooling.

    Each has tradeoffs. LangGraph is the right choice for a complex query router that decomposes a question into sub-queries and synthesizes results. LlamaIndex is the right choice for a knowledge graph over a dense technical corpus. We advise based on your actual use case, not on framework preference.

    Vector Stores

    • Pinecone is managed, serverless, and fast to provision: a good default for teams that do not want to run their own infrastructure.
    • Weaviate is open-source and self-hostable, with rich metadata filtering. It is the right choice for hybrid workloads with complex schemas and for regulated industries that require on-premises deployment.
    • pgvector is the right choice if you're already on Postgres and don't want a separate infrastructure dependency. It is viable for corpora under approximately ten million chunks at moderate query volume.

    Permission-Aware Retrieval for Enterprise

    Standard RAG systems return results based only on semantic relevance. They do not know that a junior analyst should not query documents marked for executive review, or that a support agent at a regional office should only retrieve knowledge relevant to their geography.

    Permission-aware retrieval enforces access control at query time — not as a post-processing filter on the LLM response, which can be bypassed, but at the vector store layer, before results reach the model.

    We implement two access control patterns. Document-level ACL: each chunk is indexed with a metadata field containing its permitted audience — user IDs, role IDs, group IDs, or a combination. At query time, the permission filter is applied as a metadata constraint, and the vector store only returns chunks the querying identity is allowed to see. No retrieval happens against restricted content.

    Row-level security for structured knowledge: for organizations with complex permission hierarchies — a financial services firm where different fund teams cannot see each other's portfolio data — we build permission predicates that mirror your existing access control system and apply them at retrieval time.

    Both patterns integrate with standard identity providers: Auth0, Okta, Azure AD, or your own JWT issuer. We do not maintain a separate permission store; we read from the authoritative source you already have.
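    The document-level pattern can be sketched in memory as follows. This is an illustration only: in a real deployment the role check would be expressed as a metadata filter in the vector store query itself, and the `Chunk` fields, role names, and precomputed scores here are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_roles: frozenset  # audience stored as chunk metadata at index time
    score: float              # stand-in for a similarity score

def permitted_search(chunks, caller_roles, top_k=3):
    """Filter by permission first, then rank only what the caller may see."""
    visible = [c for c in chunks if c.allowed_roles & caller_roles]
    visible.sort(key=lambda c: c.score, reverse=True)
    return visible[:top_k]

index = [
    Chunk("c1", "Board memo on acquisition", frozenset({"executive"}), 0.95),
    Chunk("c2", "Quarterly analyst handbook", frozenset({"analyst", "executive"}), 0.80),
    Chunk("c3", "Office opening hours", frozenset({"analyst", "support"}), 0.30),
]
analyst_results = permitted_search(index, caller_roles=frozenset({"analyst"}))
print([c.chunk_id for c in analyst_results])
```

    The highest-scoring chunk never reaches the analyst's results, and therefore never reaches the model, because it is excluded before ranking rather than redacted afterwards.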

    Citation Support — Every Answer Traceable to Its Source

    Hallucinations are not just embarrassing. In legal, healthcare, and financial services contexts, an AI answer that cannot be traced to a source document is a compliance liability.

    We build citation support into every RAG system we deliver. Every response includes the source document name, the page number or section heading, the exact paragraph or timestamp from which the answer was drawn, and a confidence indicator based on the reranker's relevance score.

    Citations are not appended as an afterthought. They are structured data, extracted during retrieval, passed to the model as explicit context, and formatted in the response according to your requirements — inline footnotes, a sidebar panel, a JSON object for downstream processing, or a rendered expandable source viewer.

    For teams that need audit trails — regulated industries, legal teams, compliance functions — every query and its associated sources are logged to a structured store. You can reconstruct exactly what your AI said and exactly where it got each piece of information. This data model must be built into the chunking and indexing pipeline from the start. We build it in from day one.
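    As a rough sketch of citations as structured data, the example below carries the fields listed above through to a JSON response. The field names, the sample document, and the `build_response` helper are hypothetical; the actual schema would follow your downstream requirements.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Citation:
    document: str              # source document name
    locator: str               # page number, section heading, or timestamp
    paragraph: Optional[int]   # paragraph index, when the format has one
    relevance: float           # reranker relevance score used as confidence

def build_response(answer, citations):
    """Serialize an answer together with its citations for downstream use."""
    payload = {"answer": answer, "citations": [asdict(c) for c in citations]}
    return json.dumps(payload, indent=2)

response_json = build_response(
    "Invoices are due within 30 days.",
    [Citation("master-services-agreement.pdf", "page 12", 3, 0.93)],
)
print(response_json)
```

    The same payload can be written to an audit log as-is, so a later review can reconstruct which sources backed a given answer.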

    Hallucination Reduction in Practice

    A hallucination in a RAG system is almost always a retrieval failure. The model was asked a question. The retrieval system returned context that did not contain the answer. The model, trained to produce helpful responses, generated a plausible-sounding answer from its parametric knowledge — knowledge that may be outdated, domain-mismatched, or simply wrong for your specific documents.

    The fix is not prompt engineering. The fix is retrieval accuracy.

    Retrieval precision: We run evals before go-live. A held-out evaluation set of query/expected-source pairs lets us measure what percentage of queries retrieve the correct chunk in the top-3 results. We do not ship until this number meets the agreed target.

    Confidence-based abstention: When the best-retrieved chunk scores below a threshold on the cross-encoder, the system flags low confidence or abstains and asks for clarification rather than generating from weak context. Users learn to trust high-confidence answers because the system earns credibility by knowing what it doesn't know.

    Context-grounded generation: The system prompt explicitly instructs the model to answer only from the provided context and to say 'I don't have information on that' when the context is insufficient. This alone reduces hallucination frequency significantly compared to systems that allow the model to draw on general knowledge.

    Continuous eval in production: We deploy an evaluation harness that samples production queries and runs them through a judge model to detect potential hallucinations. When accuracy degrades, for example because newly added documents have shifted the distribution, you are alerted before your users notice.
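    The confidence-based abstention described above reduces to a simple gate on the top reranker score. The sketch below assumes reranked results arrive as (chunk_id, score) pairs, best first; the 0.5 threshold is a placeholder that would be tuned on an evaluation set, not a recommended value.

```python
def answer_or_abstain(reranked, threshold=0.5):
    """Decide whether the top reranked chunk is strong enough to answer from.

    reranked: list of (chunk_id, relevance_score), best first.
    threshold: hypothetical cut-off, to be tuned on an evaluation set.
    """
    if not reranked or reranked[0][1] < threshold:
        return {
            "action": "abstain",
            "message": "I don't have information on that. Could you add more detail?",
        }
    chunk_id, score = reranked[0]
    return {"action": "answer", "chunk_id": chunk_id, "confidence": score}

strong = answer_or_abstain([("c2", 0.91), ("c3", 0.40)])
weak = answer_or_abstain([("c8", 0.22)])
print(strong["action"], weak["action"])
```

    Only the "answer" branch proceeds to generation; the "abstain" branch returns its message directly, so the model is never asked to write from weak context.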

    On-Premises Deployment for Regulated Industries

    Financial services, healthcare, government, and defence organizations often cannot send document data to external API endpoints. A cloud-hosted vector store or an OpenAI API call is not compatible with their data residency, security, or compliance requirements.

    We deploy fully on-premises RAG systems for these organizations. The full stack — embedding model, vector store, reranker, and LLM — runs inside your network. Nothing leaves. The embedding model is a locally-hosted open-weight model such as nomic-embed-text or an E5 variant fine-tuned on your domain. The vector store is Weaviate or pgvector on your own infrastructure. The LLM is Llama 3.1 70B or equivalent, served via vLLM. The reranker is a cross-encoder model hosted locally.

    Performance does not materially degrade. Modern open-weight models match or approach GPT-4o on domain-specific tasks when combined with precision retrieval. The accuracy numbers cited on this page are achievable entirely within a closed network. We have deployed this architecture for organizations that hold regulated personal data, classified information, and attorney-client privileged documents.

    What Gets Built — The Complete Engagement

    A RAG engagement with Hestur AI delivers a complete, production-ready system:

    • An ingestion pipeline with format-specific parsers for every file type in your corpus, scheduled re-ingestion on your cadence, and delta updates to avoid full re-indexing on each run.
    • A vector index configured with your metadata schema, permission filters, and index strategy.
    • A hybrid retrieval service with dense + BM25 RRF fusion, cross-encoder reranking, and citation extraction.
    • A REST or gRPC API layer for your application to query.
    • An evaluation harness with offline eval set, production sampling, and drift alerting.
    • Handoff documentation with architecture decision records, a runbook, and an onboarding session for your engineering team.

    Timeline is typically four to eight weeks from scope to production, depending on corpus size and format diversity. We start with a scoped two-week Proof of Concept on a representative subset of your documents, validate accuracy against your defined targets, and only then build out to full corpus scale. If the PoC doesn't hit your accuracy targets, we don't proceed: you've spent two weeks instead of eight finding out the approach needs adjustment.

    If your organization also runs automated business processes, the retrieval service integrates cleanly with your AI workflow automation infrastructure — knowledge retrieval and agentic process execution are designed to connect from the start.

    How It Works

    From discovery to production in weeks, not quarters

    01

    Audit your corpus and define success

    We map your document types, query patterns, and accuracy requirements. We define what "correct retrieval" means before writing a line of code.

    02

    Build format-specific ingestion

    PDFs get layout-aware extraction, audio gets Whisper transcription with timestamps, spreadsheets get table-aware chunking. Every file type gets a purpose-built parser.

    03

    Engineer hybrid retrieval

    Dense embeddings capture semantic meaning. BM25 captures exact terms. A cross-encoder reranker re-scores the top candidates. You get the most relevant chunk, not just the most similar one.

    04

    Add citation and permission layers

    Every answer traces to source document, page, and paragraph. Permission filters run at query time — users only surface content their role allows.

    05

    Evaluate, tune, and monitor

    We run your system against a held-out evaluation set, tune until accuracy targets are hit, and deploy alerting that catches retrieval drift before your users do.

    Industry Applications

    Legal

    Contract review and precedent research with citation-level traceability

    60% reduction in research time

    Healthcare

    Clinical protocol and formulary lookups with role-based and facility-level access controls

    Zero cross-department data leakage

    Financial Services

    Regulatory document search and audit-ready citation trails, on-prem deployment available

    Full compliance with data residency requirements

    Technology

    Internal knowledge base search across Confluence, Notion, Slack archives, and code repositories

    80% reduction in time-to-answer on internal queries

    Government & Defence

    Classified document retrieval with air-gap deployment support

    Deployable in fully disconnected environments

    Frequently Asked Questions

    What's the difference between standard RAG and what you build?

    Standard RAG uses single-stage dense vector search to retrieve document chunks and passes them to an LLM. It works for simple use cases but fails on exact terminology, produces irrelevant context under query distribution shifts, and has no measurement layer. We build hybrid retrieval (semantic + keyword combined), add a cross-encoder reranker for second-pass re-scoring, write format-specific document parsers so chunking respects document structure, and instrument an evaluation harness before go-live. The result is a measured retrieval accuracy figure — typically 95%+ on held-out eval sets — rather than an educated guess.

    Why did our previous RAG system hallucinate so much?

    In almost every case we've investigated, the root cause is retrieval failure, not the language model. When retrieval returns irrelevant or marginally relevant context, the LLM — trained to always produce a helpful response — fills the gap from its parametric knowledge, which may not match your specific documents. The fix is not prompt engineering; it's retrieval accuracy. We measure retrieval precision before go-live so you know what percentage of queries surface the right chunk before the model ever sees it.

    Can you index our internal knowledge bases like Confluence, Notion, and SharePoint?

    Yes. We write connectors for the source systems you already use — SharePoint, Confluence, Notion, Google Drive, Slack export archives, Jira, and others. Documents are fetched, parsed with format-specific extractors, chunked, and indexed on a schedule. Delta ingestion means only changed or new documents are re-processed on subsequent runs, keeping the index current without a full rebuild each time.
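    One common way to detect changed documents is to compare content hashes between runs. The sketch below illustrates that approach; the page does not say how changes are detected in practice, and the `plan_delta` name and document IDs are hypothetical.

```python
import hashlib

def digest(body):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def plan_delta(previous_hashes, documents):
    """Compare the current fetch against the last run.

    Returns (to_index, to_delete, new_hashes): IDs that are new or changed,
    IDs that disappeared from the source, and the hashes to store for next run.
    """
    new_hashes = {doc_id: digest(body) for doc_id, body in documents.items()}
    to_index = sorted(
        doc_id for doc_id, h in new_hashes.items() if previous_hashes.get(doc_id) != h
    )
    to_delete = sorted(doc_id for doc_id in previous_hashes if doc_id not in new_hashes)
    return to_index, to_delete, new_hashes

previous = {"a": digest("v1"), "b": digest("same"), "gone": digest("old")}
current = {"a": "v2", "b": "same", "c": "brand new"}
to_index, to_delete, stored = plan_delta(previous, current)
print(to_index, to_delete)
```

    Unchanged documents (here "b") are skipped entirely, which is what keeps scheduled runs cheap on a large corpus.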

    How does permission-aware retrieval work technically?

    Each document chunk is indexed with a metadata field containing its permitted audience — user IDs, group IDs, or role names from your identity provider. At query time, we append a metadata filter to every vector store query that restricts results to chunks the authenticated user is allowed to see. This runs at the vector store layer, before results reach the LLM — it is not a post-generation content filter. We integrate with your existing identity provider (Auth0, Okta, Azure AD, or a custom JWT issuer) to read the caller's role claims.

    Do you support on-premises deployment?

    Yes. For organizations with data residency or compliance requirements that prohibit sending document data to external APIs, we deploy the full stack on your infrastructure: open-weight embedding models, Weaviate or pgvector as the vector store, a locally-hosted cross-encoder reranker, and an open-weight LLM served via vLLM. Nothing leaves your network. The accuracy numbers we cite on this page are achievable entirely on-prem. We have deployed this architecture for financial services firms, healthcare organizations, and government clients.

    What file formats can you ingest?

    50+, including PDF (with OCR for scanned documents), DOCX, PPTX, XLSX, CSV, HTML, Markdown, JSON, XML, plain text, audio files (MP3, WAV, M4A — transcribed via Whisper with timestamps), video files (MP4, MOV — audio extracted and transcribed), and scanned images. Each format has a purpose-built parser that preserves structural context: table headers stay with rows, section headings stay with their content, and timestamps are embedded as chunk metadata for audio and video.

    What is the typical engagement timeline?

    For a production RAG system on a mid-size corpus, the typical timeline is four to eight weeks from scope to production deployment. We start with a two-week Proof of Concept on a representative subset of your corpus, validate retrieval accuracy against agreed targets, and then build to full scale. If the PoC doesn't hit your accuracy targets, we don't proceed — you've spent two weeks instead of eight finding out the approach needs adjustment.

    Can you integrate RAG into an existing application?

    Yes. We deliver a REST or gRPC API layer that your existing application queries, or we build the RAG system as an embedded service within your application stack. If you have an existing chat interface, internal tool, or workflow automation that needs to query a knowledge base, we wire the retrieval service into it. The API response format — including citations, confidence scores, and source references — is designed around your downstream integration requirements.

    How is retrieval accuracy measured and what does 95% mean specifically?

    We construct a held-out evaluation set before go-live: a minimum of 100 query/expected-source pairs that cover the range of query types your users submit. We measure retrieval recall@3 — what percentage of test queries have the correct source chunk in the top 3 retrieved results. 95% means that for 95 out of 100 representative queries, the correct source is in the top 3 results handed to the LLM. We report this number before launch and deploy ongoing sampling in production to monitor for drift.
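    The recall@3 metric itself is short enough to show in full. In this sketch, `retrieve` is any function that returns ranked chunk IDs for a query; the toy index and the four-pair evaluation set are made up for illustration and are far smaller than a real eval set.

```python
def recall_at_k(eval_set, retrieve, k=3):
    """Fraction of queries whose expected chunk appears in the top k results.

    eval_set: list of (query, expected_chunk_id) pairs.
    retrieve: function mapping a query to a ranked list of chunk IDs.
    """
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = sum(1 for query, expected in eval_set if expected in retrieve(query)[:k])
    return hits / len(eval_set)

toy_index = {  # hypothetical retrieval results, best first
    "payment terms": ["c2", "c7", "c1"],
    "termination clause": ["c5", "c9", "c3", "c4"],
    "SKU AX-2297": ["c8", "c6", "c2"],
    "data retention": ["c1", "c2", "c3"],
}
eval_set = [
    ("payment terms", "c2"),
    ("termination clause", "c4"),  # expected chunk is ranked 4th: a miss at k=3
    ("SKU AX-2297", "c8"),
    ("data retention", "c3"),
]
score = recall_at_k(eval_set, lambda q: toy_index[q], k=3)
print(score)
```

    Three of the four toy queries find their expected chunk in the top 3, so the score is 0.75; a 95% figure means the same calculation over the full held-out set yields 0.95.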

    What does the RAG system cost to run in production?

    Operating costs are driven by four factors: embedding model API calls (or self-hosted embedding compute), vector store hosting, LLM API calls (or self-hosted LLM compute), and reranker API calls. On a mid-size corpus with moderate query volume (10,000–50,000 queries/month), all-in API costs for a cloud deployment typically run $500–$3,000/month depending on model choices and query patterns. On-prem deployments have higher upfront infrastructure cost but lower ongoing API costs. We include an operating cost estimate in every scope.

    Stop Debugging Hallucinations. Start Shipping RAG That Works.

    Book a 30-minute call. We'll audit your retrieval pipeline and tell you exactly what needs to change.