Sphere Partners

Enterprise RAG Architecture: The 6-Layer Framework That Actually Scales

Six architectural layers that enterprise RAG production systems actually need — plus the orchestration, memory, and governance concerns that turn a demo into something you can put in front of regulated users.

Understanding RAG - part 01

Open any introductory article on retrieval-augmented generation and you'll see the same diagram: three boxes. A document store on the left, a vector search in the middle, a language model on the right. Arrows connecting them. It's a useful mental model for a weekend prototype — and it's exactly why an estimated 60% of enterprise RAG pilots never make it to production.

The gap isn't the idea. RAG works. The gap is that the three-box diagram hides the layers where real enterprise deployments succeed or fail: how data actually gets in, how it's split, how it's indexed, how retrieval is tuned, and how the whole thing is orchestrated, remembered, and governed once real users and real data are involved.

This guide breaks enterprise RAG architecture into the six layers a production system actually needs — plus the orchestration, memory, and governance concerns that turn a demo into something you can put in front of regulated, security-conscious users. If you're scoping a build, use it as a blueprint. If you've already got a pilot that "works in the demo but not in production," use it as a diagnostic.

Why three boxes isn't enough

A prototype answers one question: can the model retrieve a relevant passage and write a coherent answer? For a curated set of clean PDFs and a friendly evaluator, the answer is almost always yes.

Production asks harder questions. Can it ingest from SharePoint, Confluence, a ticketing system, and a 15-year-old document management platform — and keep permissions intact? Does retrieval still surface the right passage when the corpus grows from 500 documents to 5 million? Can you prove why it gave a particular answer when compliance asks six months later? And does it actually learn — or does it forget everything the moment the session ends?

None of those questions live inside the three boxes. They live in the layers most diagrams leave out. Here are the six that matter.

The 6 layers of enterprise RAG architecture

Think of these as a pipeline. Data flows down through ingestion to the index; a query flows back up through retrieval to generation. Each layer has a job, a failure mode, and a set of enterprise-specific decisions.

Layer 1 — Ingestion

Job: Connect to enterprise source systems and pull content in — securely, incrementally, and with permission metadata intact.

This is where most teams underestimate the work. Enterprise knowledge doesn't live in a tidy folder; it lives in dozens of systems, each with its own auth model and permission structure. The ingestion layer needs connectors that authenticate properly, pull only what's changed since the last sync, and — critically — carry source-level permissions along with the content so downstream retrieval can respect who is allowed to see what. In SphereIQ, every connector (Slack, SharePoint, Google Drive, GitHub, Jira, and more) records the source's access groups so that retrieval can later filter to the caller's visible sources.
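As a minimal sketch of the incremental, permission-carrying sync described above: the `SourceDocument` and `IngestedRecord` types and the `incremental_sync` function are hypothetical names for illustration, not SphereIQ's connector API. It assumes each source exposes a modification timestamp and a set of access groups.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceDocument:
    doc_id: str
    text: str
    modified_at: datetime
    access_groups: frozenset[str]  # groups allowed to read this document at the source


@dataclass(frozen=True)
class IngestedRecord:
    doc_id: str
    text: str
    access_groups: frozenset[str]  # carried along so retrieval can filter later


def incremental_sync(documents, last_sync):
    """Return records for documents changed since last_sync, plus the new watermark."""
    changed = [d for d in documents if d.modified_at > last_sync]
    records = [IngestedRecord(d.doc_id, d.text, d.access_groups) for d in changed]
    # If nothing changed, keep the old watermark so the next sync starts from the same point.
    new_watermark = max((d.modified_at for d in changed), default=last_sync)
    return records, new_watermark
```

The point of the design is that permissions travel with the content from the first hop, so no later layer has to look them up again or guess.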

Anti-pattern: Dumping everything into one flat index with no source-level access control. The day a salesperson's query returns a passage from an HR investigation, the project is over. Permissions belong in the architecture from day one.

Layer 2 — Chunking

Job: Split documents into retrievable units that preserve enough context to be useful on their own.

Split too small and a retrieved chunk loses the context that made it meaningful ("the limit is 30 days" — the limit on what?). Split too large and you dilute relevance and waste tokens. The right strategy is content-aware: respect document structure, keep semantically related text together, and attach metadata — source, title, page number, and the exact character range of the text — to every chunk so citations can point back to the precise region later.
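One way to illustrate content-aware splitting with citation metadata is to pack whole paragraphs into chunks and record each chunk's character range. This is a sketch under simple assumptions: paragraphs are separated by blank lines, `max_chars` is an arbitrary illustrative budget, page numbers are omitted because plain text has none, and the `Chunk` type is a hypothetical name.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str
    title: str
    start: int  # inclusive character offset in the original text
    end: int    # exclusive character offset in the original text
    text: str


def paragraph_spans(text):
    """Return (start, end) offsets of paragraphs separated by blank lines."""
    spans, pos = [], 0
    for sep in re.finditer(r"\n\s*\n", text):
        if sep.start() > pos:
            spans.append((pos, sep.start()))
        pos = sep.end()
    if pos < len(text):
        spans.append((pos, len(text)))
    return spans


def chunk_by_paragraph(text, source, title, max_chars=800):
    """Pack whole paragraphs into chunks; never cut inside a paragraph."""
    chunks, start, end = [], None, None
    for p_start, p_end in paragraph_spans(text):
        # Close the current chunk if adding this paragraph would exceed the budget.
        if start is not None and p_end - start > max_chars:
            chunks.append(Chunk(source, title, start, end, text[start:end]))
            start = None
        if start is None:
            start = p_start
        end = p_end
    if start is not None:
        chunks.append(Chunk(source, title, start, end, text[start:end]))
    return chunks
```

A single paragraph longer than `max_chars` is kept whole here rather than cut. A fuller splitter would also treat headings and tables as units, but the offset bookkeeping stays the same.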

Anti-pattern: Naive fixed-length splitting that slices tables in half and severs sentences from their headings. It's the single most common cause of "the answer was technically in the documents but the system couldn't find it."

Layer 3 — Embedding

Job: Convert each chunk into a vector representation the system can search by meaning rather than by keyword.

The embedding model determines what "similar" means. SphereIQ uses 1,536-dimension embeddings stored directly in PostgreSQL via pgvector — which keeps the vectors inside the same governed database as the rest of the enterprise data rather than shipping them to a separate third-party service. Two enterprise concerns dominate here: where the embedding runs (sending proprietary content to an external embedding API may be a non-starter for regulated data) and consistency (you must embed queries with the same model you embedded the corpus, and re-embedding millions of chunks when you switch models is expensive).
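The consistency concern can be made concrete by storing the model name next to every vector and refusing to compare across models. The sketch below is illustrative only; `StoredEmbedding`, `check_compatible`, and `chunks_needing_reembedding` are hypothetical names, and the model identifiers are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredEmbedding:
    chunk_id: str
    model: str                 # name of the model that produced the vector
    vector: tuple[float, ...]


class EmbeddingModelMismatch(ValueError):
    """Raised when a query vector and a stored vector are not comparable."""


def check_compatible(query_model, query_vector, stored):
    """Raise unless the query was embedded with the same model and dimension."""
    if query_model != stored.model:
        raise EmbeddingModelMismatch(
            f"query embedded with {query_model!r}, corpus with {stored.model!r}"
        )
    if len(query_vector) != len(stored.vector):
        raise EmbeddingModelMismatch("vector dimensions differ")


def chunks_needing_reembedding(stored_embeddings, target_model):
    """List chunk ids whose vectors came from a model other than target_model."""
    return [e.chunk_id for e in stored_embeddings if e.model != target_model]
```

Tracking the model per vector also gives a direct count of how much re-embedding a model switch would require.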

Anti-pattern: Picking an embedding model for benchmark scores alone, without checking whether it can run inside your security boundary.

Layer 4 — Index

Job: Store the vectors (and their metadata) in a system that can search them fast at enterprise scale.

This is the vector database layer. SphereIQ runs it on pgvector with an IVFFlat index for fast cosine similarity — meaning the vector store is your existing Postgres, not another system to secure and operate. Whatever the engine, the index must support metadata and permission filtering (so retrieval can be scoped to a department, date range, or access level before similarity ranking), stay fast as the corpus grows, and fit your deployment model — including fully self-hosted when data can't leave your environment.
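For orientation, here is roughly what a pgvector table, an IVFFlat cosine index, and a permission-scoped search can look like, held as SQL strings in Python. The table and column names (`chunks`, `source_id`), the `lists = 100` setting, and the parameter names are assumptions for illustration, not SphereIQ's schema. The statements are not executed here.

```python
# Schema: 1,536-dimension vectors stored next to chunk text and source id.
CREATE_TABLE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    source_id text NOT NULL,
    content   text NOT NULL,
    embedding vector(1536)
);
"""

# IVFFlat index using the cosine operator class; lists is a tuning knob.
CREATE_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
"""

# <=> is pgvector's cosine distance operator, so similarity is 1 - distance.
# Placeholders use the %(name)s style accepted by psycopg.
SEARCH_SQL = """
SELECT id, content, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM chunks
WHERE source_id = ANY(%(allowed_sources)s)
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(values):
    """Format a sequence of numbers as pgvector's text form, e.g. [0.1,0.2]."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"
```

How the planner combines the `WHERE` filter with an approximate index can affect how many rows come back, so filtered recall is worth measuring on real data before relying on it.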

Anti-pattern: Treating the vector store as a commodity and discovering, at scale, that it can't filter by permission efficiently — forcing a choice between slow queries and leaking restricted content.

Layer 5 — Retrieval

Job: Given a query, return the right chunks — not just the vector-nearest ones — and only the ones this user is allowed to see.

Retrieval is where accuracy is won or lost. Two things separate production retrieval from a prototype. First, access control runs before ranking: SphereIQ filters candidate chunks to the caller's permitted sources, then ranks by cosine similarity, returning the top matches above a relevance threshold — so a user can never retrieve from a source they couldn't open directly. Second, the strongest systems use hybrid retrieval — combining semantic (vector) search with keyword/lexical (BM25) search and fusing the results — so exact terms (a product code, a policy number, a name) aren't lost to fuzzy similarity. In Sphere's production engagements, hybrid retrieval with rank fusion has driven step-changes in quality, including a 66% improvement in retrieval accuracy on a regulated tax-and-compliance deployment and a 5x relevance lift on a hybrid lexical-plus-vector system.
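The "filter first, then rank" order can be sketched in a few lines of plain Python. This is an in-memory illustration, not SphereIQ's implementation; the chunk dictionary layout, `k`, and the `min_score` threshold are assumed values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, allowed_sources, k=3, min_score=0.5):
    """Return up to k (score, chunk) pairs from sources the caller may see."""
    # 1. Access control runs before any ranking happens.
    visible = [c for c in chunks if c["source"] in allowed_sources]
    # 2. Score only the visible chunks and drop weak matches.
    scored = [(cosine_similarity(query_vec, c["vector"]), c) for c in visible]
    scored = [(s, c) for s, c in scored if s >= min_score]
    # 3. Best matches first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Because restricted chunks are removed before scoring, a perfect match in a source the caller cannot open never appears in the candidate list at all.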

Anti-pattern: Relying on vector similarity alone — and assuming permissions can be enforced in the UI after the fact. Both assumptions break in the enterprise.
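The article does not say which fusion method Sphere uses. One common choice is reciprocal rank fusion (RRF), sketched here as an assumption: each document scores the sum of 1 / (k + rank) over the rankings it appears in, with k = 60 as a conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_ranking = ["a", "b", "c"]   # hypothetical semantic search results
keyword_ranking = ["c", "a", "d"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

Documents that both retrievers rank well ("a" and "c" here) rise above documents that only one of them found, which is how an exact-match hit from keyword search survives alongside semantic matches.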

Layer 6 — Generation

Job: Synthesize a grounded, cited answer from the retrieved context — and signal its own confidence instead of bluffing.

The model is the most visible layer and the least differentiating. What matters is grounding and control. SphereIQ separates "from your documents" content (which must carry inline citations like [1], [2] back to the exact source passage) from general knowledge (which need not), and derives a confidence level from the strength of the top retrieval match so a low-confidence answer is framed cautiously rather than asserted. Enterprises also want model flexibility: the ability to route across OpenAI and Anthropic models (with automatic failover and per-group restrictions) without re-architecting everything beneath.
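A rough sketch of grounding plus confidence: number the retrieved passages so the model can cite them as [1], [2], and map the top retrieval score to a confidence label. The thresholds (0.8 and 0.6), the prompt wording, and the function names are illustrative assumptions, not SphereIQ's actual values.

```python
def confidence_level(top_score, high=0.8, medium=0.6):
    """Map the strongest retrieval score to a coarse confidence label."""
    if top_score >= high:
        return "high"
    if top_score >= medium:
        return "medium"
    return "low"


def build_grounded_prompt(question, passages):
    """passages: list of (score, source, text) tuples, best match first."""
    top_score = passages[0][0] if passages else 0.0
    level = confidence_level(top_score)
    instructions = "Cite passages inline as [n] for any claim taken from them."
    if level == "low":
        instructions += " The evidence is weak; say so and answer cautiously."
    numbered = [
        f"[{i}] ({source}) {text}"
        for i, (_, source, text) in enumerate(passages, start=1)
    ]
    prompt = "\n".join([instructions, *numbered, f"Question: {question}"])
    return level, prompt
```

The prompt is plain text, so the same construction works whichever model provider the request is routed to.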

Anti-pattern: Treating the LLM as the product. The model is interchangeable; the retrieval quality, citations, and grounding around it are what make answers trustworthy.

The hidden 7th layer: orchestration

The six layers describe the data path. They don't describe the thing that runs the path — and in production, orchestration is where reliability lives. Orchestration decides whether a query needs retrieval at all, how the query is rewritten from conversation history, how many passages to fetch, how memory and documents are assembled into the final prompt, and how to fail gracefully when a source is slow or down. As systems take on multi-step reasoning and tool use — agentic RAG — orchestration stops being glue code and becomes a first-class layer with its own logic and monitoring.
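Failing gracefully can be as simple as collecting what the healthy sources return and recording which ones failed, rather than letting one outage abort the whole answer. The sketch below assumes each retriever is a callable that may raise; the names are hypothetical, and real timeouts would need concurrency handling that is left out here.

```python
def gather_passages(query, retrievers):
    """Query every retriever; return (passages, names_of_failed_retrievers).

    retrievers maps a source name to a callable taking the query and
    returning a list of passages.
    """
    passages, failed = [], []
    for name, retrieve in retrievers.items():
        try:
            passages.extend(retrieve(query))
        except Exception:
            # Record the failure so the answer can disclose the missing source.
            failed.append(name)
    return passages, failed
```

Returning the failed names lets the generation step tell the user which source was unavailable instead of silently answering from partial context.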

The layer most RAG skips: memory

Here's the layer that separates a search box from an assistant, and the one almost no architecture diagram includes. Standard RAG is stateless. It retrieves, answers, and forgets. Ask it the same question tomorrow and it starts from zero, because nothing about the last conversation, decision, or preference was retained.

SphereIQ adds a memory layer called Engram. As the system works, it forms engrams — persistent memory records of nine kinds: facts, entities, decisions, preferences, insights, events, procedures, relationships, and context. Engrams are created automatically from conversations and synced from connected systems, then recalled alongside the document citations on future queries. So when a user asks for a follow-up next week, the assistant already knows the relevant decision was made and who owns it — it cites and remembers, rather than re-deriving everything from scratch.

Engrams also behave like memory should: they have a lifecycle. A new memory starts ephemeral and decays quickly unless it's used; reuse promotes it to working, then consolidated, and finally crystallized — a permanent, company-wide fact that no longer decays. Trivia fades; what matters hardens. This is the practical difference between RAG that answers a question and a system that compounds institutional knowledge over time.
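To make the lifecycle idea concrete, here is a toy model of a memory that decays unless used and is promoted through the four stages named above. The promotion thresholds, decay rates, and the `Memory` class are invented for illustration; the article does not describe how Engram actually computes either.

```python
from dataclasses import dataclass

STAGES = ["ephemeral", "working", "consolidated", "crystallized"]
# Hypothetical: total uses needed to leave each stage.
PROMOTE_AFTER_USES = {"ephemeral": 2, "working": 5, "consolidated": 10}
# Hypothetical: fraction of strength lost per idle day in each stage.
DECAY_PER_DAY = {"ephemeral": 0.5, "working": 0.2, "consolidated": 0.05, "crystallized": 0.0}


@dataclass
class Memory:
    content: str
    stage: str = "ephemeral"
    uses: int = 0
    strength: float = 1.0

    def touch(self):
        """Record a reuse: refresh strength and promote if the threshold is met."""
        self.uses += 1
        self.strength = 1.0
        threshold = PROMOTE_AFTER_USES.get(self.stage)
        if threshold is not None and self.uses >= threshold:
            self.stage = STAGES[STAGES.index(self.stage) + 1]

    def decay(self, days):
        """Weaken the memory for each idle day; crystallized memories do not decay."""
        self.strength *= (1.0 - DECAY_PER_DAY[self.stage]) ** days
```

In this toy version an unused ephemeral memory loses half its strength each day, while a memory reused ten times becomes crystallized and stops decaying.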

Evaluation and governance: the production difference

Two concerns wrap around every layer above, and skipping them is the most expensive mistake in enterprise RAG.

Evaluation is how you know the system is accurate before it ships and while it runs. Without a measurement harness — retrieval precision/recall, answer faithfulness, hallucination rate — "it seems to work" is the only quality signal you have, and it won't survive contact with real queries.
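Retrieval precision and recall are straightforward to compute once a set of queries has labeled relevant chunks. A minimal sketch, assuming chunk ids are comparable values and `relevant` is a hand-labeled set:

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall of the top-k retrieved ids against a labeled set."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical labeled example: one of the top three results is relevant,
# and one of the three relevant chunks was found.
precision, recall = precision_recall_at_k(["a", "b", "c", "d"], {"a", "d", "e"}, k=3)
```

Averaging these over a fixed query set before and after each change is what turns "it seems to work" into a regression signal.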

Governance is what makes the system safe to operate. In SphereIQ that means permission-aware retrieval (RBAC and per-source ACLs), private/self-hosted deployment and BYOK so proprietary data never leaves your environment, full audit logging of every message (with token and cost detail) that can mirror to your SIEM, tamper-evident eDiscovery export for legal hold, PII detection and content policies, prompt-injection scanning, and an EU AI Act compliance registry. For regulated buyers, governance isn't a feature — it's the precondition for the project existing at all.
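The article does not describe how tamper evidence is achieved. One widely used technique is a hash chain, where each audit entry's hash covers the previous entry's hash, so any later edit breaks verification. The sketch below illustrates that general idea only; the entry layout and function names are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(event, prev_hash):
    """SHA-256 over the event and the previous hash, with stable key order."""
    payload = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log, event):
    """Append an event (a JSON-serializable dict) chained to the last entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(event, prev_hash)})


def verify(log):
    """Return True only if every entry still matches its recorded hash chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(entry["event"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this detects modification but does not prevent it, so in practice the latest hash would also be stored somewhere the log's writers cannot alter.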

Architecture anti-patterns Sphere sees in failed deployments

Across enterprise engagements, the same shortcuts show up in stalled pilots:

  • One flat index, no permissions. Access control is treated as a UI problem instead of an architecture problem, and the system can't safely go past a small pilot group.
  • Vector-only retrieval. Exact-match terms get lost; accuracy plateaus; nobody can explain why a "smart" system can't find an obvious answer.
  • Naive chunking. Tables and structured content are shredded, so the most valuable documents become the least retrievable.
  • No memory. Every conversation starts from zero; the system never accumulates institutional knowledge.
  • No evaluation harness. Quality is assessed by vibes, so regressions ship silently and trust erodes.
  • Data egress by default. Embeddings and prompts flow to third-party APIs, which kills the project the moment legal or security reviews it.

Every one of these is a layer problem masquerading as a model problem.

Sphere's approach: architecture as a delivery discipline

A good diagram is necessary but not sufficient; what gets enterprises to production is a repeatable way to build each layer correctly. Sphere's AI Foundry delivery model treats RAG architecture as a sequence — intake, blueprint, build, harden, run — rather than a single design pass:

  • Intake & blueprint map the source systems, security boundary, and use cases, then specify each layer (plus orchestration, memory, and governance) for your environment.
  • Build stands up the pipeline with production concerns wired in from the start — permission-aware connectors, hybrid retrieval, grounded generation with citations and confidence, and the Engram memory layer.
  • Harden adds the evaluation harness, RBAC, audit logging, injection scanning, and guardrails, then validates accuracy against real queries.
  • Run keeps it healthy with monitoring for drift, hallucination, and cost.

The result is what the SphereIQ platform and Sphere's private RAG solution are built around: private or self-hosted deployment, no proprietary data leaving the environment, any-LLM support with no inference markup, memory that compounds, and a 6–8 week path to production — fast because the architecture is treated as deliberate layers, not three hopeful boxes.

Frequently asked questions

A production enterprise RAG architecture has six core layers — ingestion, chunking, embedding, indexing, retrieval, and generation — plus orchestration that runs the pipeline, a memory layer that persists knowledge across conversations, and governance that wraps everything. Introductory three-box diagrams collapse this into "store, search, generate," which is why prototypes built on that model rarely survive production.
Most enterprise RAG pilots fail not on the model but on the hidden layers: ingestion without permissions, naive chunking that destroys context, vector-only retrieval that misses exact-match terms, no memory, no evaluation harness, and data egress that fails security review. These are architecture problems that only surface at real scale and under real security scrutiny.
A vector database alone is not enough for enterprise RAG. It is one layer (the index). Without content-aware chunking, permission-aware hybrid retrieval, grounded generation, a memory layer, and governance around it, a vector store alone produces a demo, not a dependable system.
Standard RAG is stateless — it retrieves, answers, and forgets. A system with a memory layer (in SphereIQ, called Engram) forms persistent records of facts, decisions, entities, and preferences, recalls them alongside document citations on future queries, and lets important knowledge harden while trivia fades. The result cites and remembers, compounding institutional knowledge instead of restarting every session.
Enterprise RAG stays secure and compliant when governance is treated as part of the architecture: private or self-hosted deployment with no data egress, permission-aware retrieval (RBAC and per-source ACLs), BYOK, audit logging that can mirror to a SIEM, tamper-evident eDiscovery, PII detection, prompt-injection scanning, and an EU AI Act registry. Security designed into the layers, not added afterward, is what lets regulated organizations move past a pilot.

Continue the cluster: start with the enterprise RAG pillar guide, then move to the 8-phase RAG implementation playbook, and how to evaluate RAG accuracy (coming soon).
