Hestur AI

    Vector Embeddings Explained: The Foundation of RAG

    A technical explanation of how vector embeddings work, why they are the foundation of modern RAG systems, and how to choose the right embedding model and vector database for your use case.

    2 min read

    Vector embeddings turn text into high-dimensional numeric vectors where semantic similarity becomes geometric proximity. This enables modern semantic search, RAG, and matching systems to work even when queries and documents use different wording.

    Core ideas:

    • Embeddings as vectors:
      • Text → list of floats (e.g. 1,536-dim vector from text-embedding-3-small).
      • Similar meaning → vectors close together (high cosine similarity / small angle).
      • Different meaning → vectors far apart.
    • Why embeddings beat pure keyword search:
      • Keyword/BM25 rely on exact tokens and miss synonyms, paraphrases, and spelling variants.
      • Embeddings capture semantics, so “cancel subscription” ≈ “unsubscribe” ≈ “stop my plan”.
    • Common embedding models:
      • text-embedding-3-small (OpenAI, 1,536 dims): best cost/performance for most.
      • text-embedding-3-large (OpenAI, 3,072 dims): more accurate, roughly 6.5× the per-token price of text-embedding-3-small at OpenAI's list pricing.
      • embed-english-v3.0 (Cohere, 1,024 dims): strong English retrieval; the sibling model embed-multilingual-v3.0 covers multilingual use.
      • all-MiniLM-L6-v2 (HF, 384 dims): free, local, great for prototyping.
      • mxbai-embed-large (Mixedbread, 1,024 dims): strong open-source.

    More dimensions → potentially finer semantic distinctions, but storage grows linearly with the dimension count and each similarity computation gets slower. For most business use cases, text-embedding-3-small is a solid default.
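
    To make the storage trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes vectors are stored as float32 (4 bytes per value) and ignores index overhead and metadata, so real deployments will be larger. The helper name index_size_bytes is illustrative only.

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of the vectors alone, assuming float32 (4 bytes per value)."""
    return num_vectors * dims * bytes_per_value


# Dimension counts taken from the model list above.
models = [
    ("all-MiniLM-L6-v2", 384),
    ("text-embedding-3-small", 1536),
    ("text-embedding-3-large", 3072),
]

for name, dims in models:
    gb = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{name}: {gb:.2f} GB per million vectors")
```

    Under these assumptions, one million text-embedding-3-small vectors take about 6 GB before any index structures are added.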

    • Vector databases & ANN search:
      • Need efficient nearest-neighbour search over millions of vectors.
      • Use ANN indexes (e.g. HNSW, IVF-Flat) to get millisecond queries.
      • Typical choices:
        • Pinecone: fully managed, fast to start.
        • Weaviate: open-source + managed, strong hybrid search.
        • Qdrant: open-source, performant, good for self-hosting.
        • pgvector: Postgres extension, great if you’re already on Postgres.
        • ChromaDB: simple local dev store.
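
    For intuition, this is the exact brute-force search that ANN indexes approximate. It is a minimal pure-Python sketch with made-up 3-dimensional vectors and hypothetical document IDs, not how any of the databases above implement search. It scores every stored vector, which is why it stops scaling at millions of vectors.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k=3):
    """Score every stored vector against the query: O(n * d) work per search."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in vectors.items())
    return heapq.nlargest(k, scored)


# Toy data: real embeddings have hundreds or thousands of dimensions.
vectors = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.8, 0.2, 0.1],
}

for score, doc_id in exact_top_k([1.0, 0.0, 0.0], vectors, k=2):
    print(doc_id, round(score, 3))
```

    An ANN index such as HNSW avoids the full scan by visiting only a small part of the data, trading a little recall for much lower latency.
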
    • Chunking (critical for retrieval quality):
      • Don’t embed whole long docs as a single vector.
      • Split into chunks (≈256–1,024 tokens) and embed each.
      • Practical strategies:
        • Fixed-size with overlap: N tokens with 10–20% overlap.
        • Semantic chunking: split on paragraphs/sections.
        • Hierarchical / parent-child: small chunks for retrieval + larger parent for context.
      • Common failure mode: chunks too large (e.g. 2,000 tokens) → relevant info buried in noise.
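
    The fixed-size strategy can be sketched as follows. This is a minimal illustration that treats whitespace-separated words as tokens, which is an assumption: a real pipeline would count tokens with the embedding model's own tokenizer. The function name chunk_with_overlap and the default sizes are illustrative.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: whitespace words stand in for model tokens.
tokens = [f"w{i}" for i in range(10)]
for chunk in chunk_with_overlap(tokens, chunk_size=4, overlap=1):
    print(chunk)
```

    The defaults (512 tokens with 64 shared) give a 12.5% overlap, inside the 10–20% range mentioned above.
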
    • Similarity metrics:
      • Cosine similarity: angle only; standard for text embeddings.
      • Dot product: fast, but magnitude-sensitive; identical to cosine similarity when vectors are normalized to unit length.
      • Euclidean distance: less commonly used for text embeddings; on unit-normalized vectors it produces the same ranking as cosine.
      • In practice, use cosine unless your vector DB recommends otherwise for a specific index.
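
    The relationship between the three metrics can be checked on small hand-picked vectors. This is a pure-Python sketch with toy 2-dimensional values, not output from a real embedding model.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    length = norm(a)
    return [x / length for x in a]


a, b = [3.0, 4.0], [4.0, 3.0]
a_scaled = [10 * x for x in a]  # same direction, 10x the magnitude

print(cosine(a, b), cosine(a_scaled, b))  # cosine ignores magnitude
print(dot(a, b), dot(a_scaled, b))        # dot product does not

# After normalization, dot product equals cosine similarity, and
# squared Euclidean distance equals 2 - 2 * cosine.
ua, ub = normalize(a), normalize(b)
print(math.isclose(dot(ua, ub), cosine(a, b)))
print(math.isclose(euclidean(ua, ub) ** 2, 2 - 2 * cosine(a, b)))
```

    Because the squared Euclidean distance on unit vectors is a decreasing function of cosine similarity, all three metrics rank neighbours identically once vectors are normalized.
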
    • Hybrid search (vector + keyword):
      • Pure vector search can underperform on exact identifiers (SKUs, codes, names).
      • Combine BM25 keyword search with vector search and merge via RRF (Reciprocal Rank Fusion).
      • Often best-performing setup for enterprise search.
      • Weaviate supports hybrid natively; Pinecone/pgvector typically combine in app code or via frameworks (e.g. LlamaIndex).
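
    Merging in app code can be as small as the sketch below. It implements Reciprocal Rank Fusion over ranked lists of document IDs, scoring each document as the sum of 1 / (k + rank) across lists. The constant k = 60 is a commonly used default and an assumption here; the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; higher fused score means better rank."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a BM25 query and a vector query.
bm25_hits = ["sku-123", "doc-7", "doc-2"]
vector_hits = ["doc-2", "doc-7", "doc-9"]

print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```

    RRF uses only rank positions, so BM25 scores and cosine similarities never need to be put on a common scale. Documents found by both retrievers rise to the top.
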
    • RAG pipeline overview:
      • Ingest:
        1. Chunk documents.
        2. Embed each chunk.
        3. Store vectors + metadata in a vector DB.
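
    The three ingest steps can be sketched end to end as below. Everything here is a stand-in: toy_embed is a hypothetical hashed bag-of-words function used in place of a real embedding model, the chunker splits on words instead of model tokens, and the "vector DB" is a plain Python list. The point is the shape of the records, not retrieval quality.

```python
import hashlib
import math


def toy_embed(text, dims=8):
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    length = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / length for x in vec]  # unit length, like many real models


def chunk(text, size=5):
    """Step 1: split into fixed-size word windows (no overlap, for brevity)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def ingest(doc_id, text, store):
    """Steps 2 and 3: embed each chunk, then store vector plus metadata."""
    for i, piece in enumerate(chunk(text)):
        store.append({
            "id": f"{doc_id}#{i}",
            "vector": toy_embed(piece),
            "metadata": {"source": doc_id, "text": piece},
        })


store = []  # assumption: an in-memory list standing in for a vector DB
ingest("faq", "To cancel your subscription open account settings and choose stop my plan", store)
for record in store:
    print(record["id"], "|", record["metadata"]["text"])
```

    Keeping the chunk text and source ID in the metadata means a later query can return readable passages, not just vector IDs.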