How RAG Works in Enterprise AI — And Why Your Knowledge Base Architecture Determines Answer Quality
73% of enterprise RAG deployments fail within the first year — due to knowledge base maintenance problems, not model problems (Brainfish AI, 2025). GraphRAG achieves 80% accuracy on complex enterprise queries vs ~50% for standard retrieval (Microsoft Research). Semantic chunking improves recall to 91–92% vs 85–90% for fixed-size (Weaviate). pgvector runs 75–79% cheaper than Pinecone at production scale. Retrieval quality — not model quality — determines whether enterprise AI gives correct answers about your organisation.
In 2020, Patrick Lewis and colleagues at Facebook AI Research published a paper that would become one of the most cited works in applied language model research. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" demonstrated that pairing a language model with a retrieval system produced significantly more accurate, factual, and specific answers than asking the model to answer from training data alone (Lewis et al., NeurIPS 2020). The paper set state-of-the-art on multiple open-domain question-answering benchmarks and established RAG as the standard architecture for enterprise AI deployments.
Five years later, "RAG" is marketing language. Every enterprise AI vendor describes their platform as having a knowledge base or being "trained on your documents." In practice, architectures vary enormously — and so does the quality of the answers they produce. A 2025 analysis found that 73% of enterprise RAG deployments fail within the first year — not due to model or retrieval algorithm problems, but due to knowledge base maintenance failures: stale documents, coverage gaps, and poor extraction quality from source files (Brainfish AI, 2025).
Understanding what actually happens when employee questions flow through a RAG pipeline explains why some deployments produce accurate, cited responses while others hallucinate confidently and disappoint consistently. The failure is almost always architectural, not a shortcoming of the model.
“The bottleneck in most RAG systems is not the language model — it is the retrieval. If you're retrieving the wrong chunks, even the best model in the world will produce a wrong or hallucinated answer. The model can only work with what you give it. Getting retrieval right is the core engineering problem.”
What "Trained on Your Documents" Actually Means
Vendors that describe their platform as "AI trained on your documents" are using a shortcut that obscures how the system works. Actually training a model on new documents means updating its weights — an expensive process taking days to weeks, requiring significant compute, and making the knowledge static from the moment training completes. New documents would require retraining. Changed documents would require retraining.
What actually happens is retrieval at inference time. Your documents are processed into a searchable index. When an employee asks a question, the system searches that index, retrieves the most relevant content, and injects it into the model's context before generating an answer. The model's weights never change — what changes is what the model can see when generating each response.
This distinction has practical consequences. It means your knowledge base can be updated without touching the model. It means answer quality is a function of retrieval quality, not model quality. And it means the architectural decisions about how documents are processed before storage determine whether the retrieval step finds the right content or returns noise.
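In code, the inference-time flow is a thin orchestration layer. Below is a minimal sketch; the three callables stand in for the embedding model, vector index, and language model covered in the stages that follow, and none of the names belong to a specific library's API.

```python
from typing import Callable, Sequence

def answer(
    question: str,
    embed: Callable[[str], Sequence[float]],              # embedding model (Stage 3)
    search: Callable[[Sequence[float], int], list[str]],  # vector index lookup (Stage 4)
    generate: Callable[[str], str],                       # the language model (Stage 5)
) -> str:
    """Retrieval at inference time: the model's weights never change;
    only the context injected into each prompt does."""
    query_vec = embed(question)      # same model that embedded the chunks
    chunks = search(query_vec, 5)    # top-5 most similar chunks
    context = "\n\n".join(chunks)
    prompt = (
        "Answer using only the context below, and cite your sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```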
Stage 1 — Ingest: Why Source Document Quality Matters
The pipeline begins when source documents are added to the knowledge base. PDFs, Word documents, plain text, web pages, and spreadsheets each require different extraction logic to produce clean text. A PDF with text embedded as an image requires OCR; a table formatted for visual layout produces garbled plain text if extracted naively; a document with headers, footers, and navigation menus embeds that boilerplate into every chunk derived from it.
Each ingested document is stored with metadata: title, source type, document ID, and page numbering. This metadata surfaces in citations — when the system retrieves a chunk, it tells the user "Source: Q4 Procurement Policy, page 7" rather than just "your documents." The granularity and accuracy of citations depends entirely on how well metadata was captured at ingest time. Checking extraction quality as part of the ingestion workflow — not after the index is built — prevents corrupted chunks from propagating into the vector store.
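One way to capture this at ingest time is a small metadata record per chunk. The field names below are illustrative, but every citation the system later produces has to be assembled from fields like these:

```python
from dataclasses import dataclass

@dataclass
class ChunkRecord:
    """Metadata captured at ingest time: everything a citation needs."""
    chunk_id: str
    document_id: str
    title: str        # e.g. "Q4 Procurement Policy"
    source_type: str  # "pdf", "docx", "html", ...
    page: int | None  # None for formats without pagination
    text: str         # the extracted chunk content

    def citation(self) -> str:
        page = f", page {self.page}" if self.page is not None else ""
        return f"Source: {self.title}{page}"
```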
Stage 2 — Chunking: The Most Consequential Parameter Nobody Discusses
Documents are too long to embed and retrieve as units. A 200-page policy manual embedded as a single vector produces an embedding that captures the document's general theme but cannot be matched to specific questions about any particular policy within it. The solution is chunking: splitting documents into smaller segments before embedding.
Chunk size is one of the most consequential parameters in RAG system design. Research consistently points to 400–600 token chunks with 50–100 token overlap between consecutive chunks as the most reliable range for enterprise document types. NVIDIA's benchmarking of chunking strategies found that recursive character splitting at 512 tokens with 50–100 token overlap produced 69% accuracy in retrieval testing across a large real-document benchmark (NVIDIA, 2024). Shorter chunks (under 100 tokens) produce precise embeddings but lack enough context to be useful when retrieved. Chunks above 1,000 tokens dilute the embedding and match poorly to specific queries.
| Chunking Strategy | Retrieval Recall | Best For | Trade-off |
|---|---|---|---|
| Fixed-size (512 tokens, 50–100 overlap) | 85–90% | Well-structured policy docs, procedures, regulatory text | May split mid-sentence rather than at natural breaks; fast ingest |
| Semantic chunking | 91–92% | Mixed-format documents: tables, narrative, code blocks | Higher ingest-time compute; better context preservation |
| Hierarchical chunking | ~90% | Documents with clear header/section structure | Requires consistent document formatting to work well |
| Fixed-size, no overlap | 75–82% | Simple, well-delimited FAQ or short-form content only | Content split at boundaries loses cross-boundary context |
| Chunks >1,000 tokens | 60–70% | Not recommended for most enterprise use cases | Diluted embeddings; poor specific-query matching |
Weaviate's research found semantic chunking improves retrieval recall to 91–92% compared to 85–90% for fixed-size approaches — a meaningful improvement at the cost of additional ingest-time computation (Weaviate, 2024). For documents with highly irregular structure — mixed tables, narrative paragraphs, code blocks — semantic chunking outperforms fixed-size by a larger margin. For well-structured policy documents and procedures, fixed-size chunking with appropriate parameters is often sufficient and significantly faster to ingest.
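For reference, fixed-size chunking with overlap reduces to a few lines. The sketch below assumes the tiktoken tokenizer package (any tokenizer with encode and decode methods would do); production pipelines typically reach for a framework's splitter, but the logic is the same:

```python
import tiktoken  # OpenAI's tokenizer library

def chunk_fixed(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Fixed-size token chunking with overlap, as discussed above.
    Consecutive chunks share `overlap` tokens, so content near a
    boundary appears intact in at least one chunk."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start : start + size]
        chunks.append(enc.decode(window))
        if start + size >= len(tokens):
            break
    return chunks
```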
Stage 3 — Embedding: How Meaning Becomes Searchable
Each chunk is converted to a vector embedding — a high-dimensional numerical representation of its semantic meaning. The embedding model's quality determines how well the resulting vectors capture meaning in a way that enables retrieval. The MTEB (Massive Text Embedding Benchmark) leaderboard is the industry reference for embedding model selection in RAG implementations, evaluating models across eight task types (retrieval among them) and dozens of datasets.
OpenAI's text-embedding-3-small and text-embedding-3-large are among the most widely deployed embedding models for enterprise RAG. The generation that replaced ada-002 improved meaningfully on retrieval benchmarks: OpenAI's own numbers showed the MIRACL multi-language retrieval score improving from 31.4% to 44.0% with the newer generation, and the MTEB English score from 61.0% to 62.3% (OpenAI, January 2024). At its default 1536 dimensions, text-embedding-3-small offers a strong balance of cost, speed, and retrieval quality for most enterprise document types.
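Generating embeddings is a single API call. Here is a sketch using OpenAI's Python client; the critical constraint is that queries must later be embedded with the same model that embedded the chunks:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed a batch of chunks. The same model must embed queries later."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # 1536 dimensions by default
        input=chunks,
    )
    # Responses preserve input order: one vector per chunk.
    return [item.embedding for item in response.data]
```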
Embeddings are stored in a vector database purpose-built for similarity search across high-dimensional vectors. pgvector, the PostgreSQL vector extension, is a common choice for organisations already running PostgreSQL. Its HNSW indexing, added in version 0.5.0, enables fast approximate nearest-neighbour search that performs competitively with dedicated vector databases at enterprise knowledge base scales. Production-scale cost comparisons consistently show pgvector running 75–79% cheaper than cloud-native vector databases like Pinecone for equivalent workloads (2026 benchmarks) — a significant operational consideration for deployments managing large knowledge bases.
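A minimal sketch of the storage and search side, using pgvector's SQL interface from Python via psycopg2. The table schema, column names, and connection string are illustrative:

```python
import psycopg2

def vec_literal(v: list[float]) -> str:
    # pgvector accepts vectors as '[x1,x2,...]' text literals
    return "[" + ",".join(f"{x:.6f}" for x in v) + "]"

conn = psycopg2.connect("dbname=kb")  # connection string is illustrative
with conn, conn.cursor() as cur:
    # One-time setup: a vector column sized to the embedding model,
    # plus an HNSW index using cosine distance.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id bigserial PRIMARY KEY,
            title text,
            page int,
            content text,
            embedding vector(1536)
        )
    """)
    cur.execute(
        "CREATE INDEX IF NOT EXISTS chunks_hnsw "
        "ON chunks USING hnsw (embedding vector_cosine_ops)"
    )

def search(query_embedding: list[float], k: int = 5):
    # <=> is pgvector's cosine-distance operator: smallest distance first
    with conn.cursor() as cur:
        cur.execute(
            "SELECT title, page, content FROM chunks "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec_literal(query_embedding), k),
        )
        return cur.fetchall()
```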
Stage 4 — Retrieval: What Separates Accurate Systems from Expensive Noise
[Chart: Retrieval accuracy on complex enterprise queries, by architecture. Sources: Microsoft Research GraphRAG; AWS hybrid search research; NVIDIA chunking benchmarks, 2024.]
When an employee submits a query, the retrieval stage runs two operations before returning results. The first is query rewriting. In an ongoing conversation, queries frequently reference prior context: "What does the policy say about that?" The "that" is meaningful to the user but meaningless to the retrieval system, which sees each query in isolation. A fast model rewrites conversational queries to be self-contained. Research on query rewriting for RAG systems consistently shows accuracy improvements, with some studies finding 3–6 percentage point gains on enterprise question sets where many queries are conversational rather than standalone (arXiv, April 2024).
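A sketch of the rewriting step. The prompt wording and model choice are illustrative; any fast, inexpensive model fits here:

```python
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "Rewrite the user's latest question so it is fully self-contained, "
    "resolving pronouns and references using the conversation history. "
    "Return only the rewritten question."
)

def rewrite_query(history: list[str], question: str) -> str:
    """Make a conversational query self-contained before retrieval."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model: fast and cheap is the requirement
        messages=[
            {"role": "system", "content": REWRITE_PROMPT},
            {"role": "user", "content": "\n".join(history) + "\n\n" + question},
        ],
    )
    return response.choices[0].message.content.strip()
```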
The second is the vector search itself. The rewritten query is embedded using the same model that embedded the chunks, and a cosine similarity search finds chunks whose vectors are geometrically closest to the query vector. The key insight: a chunk about "client contract renewal terms" will be geometrically close to a query about "how do we extend a customer agreement" even with zero lexical overlap. The model bridges the vocabulary gap that keyword search cannot.
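The comparison itself is simple arithmetic; the power comes from the embedding model placing paraphrases near each other before this function ever runs:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal.
    Vector databases run this same comparison, approximated at scale by
    ANN indexes such as HNSW."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```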
More advanced approaches improve accuracy in specific scenarios. Hybrid search combines dense vector retrieval with sparse retrieval (keyword matching via BM25), using fusion methods like Reciprocal Rank Fusion to merge result sets. Research on hybrid retrieval shows that keyword search catches exact matches on proper nouns, product identifiers, and regulatory codes that semantic search may miss — particularly relevant for enterprise knowledge bases heavy in specific identifiers. Re-ranking with a cross-encoder model as a second pass over initial retrieval results further improves precision, at additional latency cost.
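Reciprocal Rank Fusion itself is only a few lines: each document earns a score of 1/(k + rank) from every result list it appears in, so documents ranked well by both retrievers rise to the top. A sketch, with k = 60 as in the original RRF formulation:

```python
def reciprocal_rank_fusion(
    result_lists: list[list[str]], k: int = 60
) -> list[str]:
    """Merge ranked result lists (e.g. one from vector search, one from
    BM25). Each document scores 1/(k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```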
Microsoft Research's GraphRAG project takes a different approach: using an LLM to build a knowledge graph from the document corpus, then querying the graph for complex questions spanning multiple documents. It achieved 80% accuracy on complex enterprise queries compared to approximately 50% for standard vector retrieval — at significantly higher indexing cost (Microsoft Research, 2024). For knowledge bases with heavily interconnected information — regulatory documents that cross-reference each other, client histories with entity dependencies — graph-structured retrieval meaningfully outperforms pure vector approaches.
Stage 5 — Generation and Citation: Why the Source Reference Is the Governance Control
The retrieved chunks are formatted and injected into the model's context before inference. The model generates an answer grounded in the retrieved content, citing specific chunks by source document and page number. Good citation formatting tells the user which document, which page, and what text the answer was derived from — not just that "the knowledge base was consulted."
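A sketch of that formatting step, reusing the illustrative ChunkRecord from Stage 1. Numbered source blocks give the model something concrete to cite:

```python
def build_context(chunks) -> str:
    """Format retrieved chunks so the model can cite document and page,
    not just 'the knowledge base'. Each chunk is a ChunkRecord as
    sketched in Stage 1 (title, page, text)."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        page = f", page {chunk.page}" if chunk.page is not None else ""
        blocks.append(f"[{i}] {chunk.title}{page}\n{chunk.text}")
    return (
        "Answer from the sources below. Cite them as [1], [2], ... "
        "and say so explicitly when they do not cover the question.\n\n"
        + "\n\n".join(blocks)
    )
```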
The model typically generates two sections: what the knowledge base says (with citations), and supplementary general guidance from training where documents do not fully cover the question. Keeping these sections distinct prevents the model from blending retrieved organisational knowledge with general-knowledge inference in a way that makes the mixture impossible to verify.
For regulated industries, the citation is the governance control. An AI that answers "this is our GDPR procedure" with a citation to "GDPR Compliance Manual, page 12, section 4.3" gives a compliance officer the ability to verify the answer and audit the AI's reasoning. An AI that answers the same question without a citation provides confident text with no verifiable basis — which is acceptable for general knowledge and inadequate for any compliance-relevant decision.
Why Do Enterprise RAG Deployments Fail After Launch?
The most common failure mode is staleness. The knowledge base reflects what was ingested, not what is currently true. A policy document updated three months ago but not re-indexed will produce confident, incorrect answers based on the old version. The retrieval system has no mechanism to detect that a document has been superseded — it retrieves the highest-similarity chunk regardless of whether that chunk is still accurate. Knowledge base management requires a workflow for re-indexing documents when source material changes, and flagging content with known update cycles.
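A content hash recorded at ingest time is the simplest staleness detector. A sketch, assuming the pipeline stores one hash per source document:

```python
import hashlib
from pathlib import Path

def needs_reindex(path: Path, indexed_hashes: dict[str, str]) -> bool:
    """Compare the current content hash of a source file against the
    hash recorded when it was last ingested. A changed hash means the
    indexed chunks no longer match the live document."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return indexed_hashes.get(str(path)) != digest
```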
Coverage gaps produce the second major failure mode. RAG can only retrieve what is in the knowledge base. Questions about topics not covered by uploaded documents return either low-confidence retrievals from tangentially related content, or general-knowledge responses that may not reflect the organisation's specific practices. Identifying coverage gaps requires systematically asking the AI about topics the knowledge base might not cover and tracking where it falls back to general guidance rather than citing a document.
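One low-cost way to surface gaps systematically is to log every query whose best retrieval score falls below a threshold. In the sketch below, the threshold is an illustrative starting point, not a universal constant:

```python
LOW_CONFIDENCE = 0.35  # illustrative: tune against your own corpus

def flag_coverage_gap(query: str, top_score: float, gap_log: list[dict]) -> bool:
    """Record queries whose best retrieval similarity is weak. Reviewing
    this log periodically surfaces topics the knowledge base does not
    cover, before users lose trust in the answers."""
    if top_score < LOW_CONFIDENCE:
        gap_log.append({"query": query, "top_score": top_score})
        return True
    return False
```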
Extraction quality from source documents creates a third failure mode easy to overlook. A scanned PDF where text was not accurately extracted produces corrupted chunks with poor embeddings. A table formatted for visual presentation embeds as garbled text. A document with redundant boilerplate on every page produces chunks where a significant portion of content is navigation noise. Checking extraction quality at ingest time prevents these problems from propagating through the entire vector index.
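A few cheap heuristics catch most extraction failures before they reach the index. A sketch, with illustrative thresholds:

```python
def extraction_looks_clean(text: str) -> bool:
    """Cheap checks for bad extraction before indexing: near-empty
    output, Unicode replacement characters, or a low proportion of
    alphanumeric content all suggest OCR or layout problems.
    Thresholds are illustrative starting points."""
    if len(text.strip()) < 200:
        return False
    if "\ufffd" in text:  # replacement char from bad decoding
        return False
    alnum = sum(ch.isalnum() or ch.isspace() for ch in text)
    return alnum / len(text) > 0.85
```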
For the broader context of why retrieval accuracy ultimately determines enterprise AI value — and what persistent memory provides beyond what documents can capture — the post on why enterprise AI fails on company-specific questions covers the three-tier context architecture. For the compliance dimensions of knowledge base access — who retrieves which documents and how that is audited — the post on per-team content policy governance covers how access controls and audit logging interact with the RAG pipeline.
