The memory era of AI infrastructure.
Production agents exposed the limit of classic vector search. The serious design question is no longer which database to buy first. It is what shape of knowledge an agent needs, how that knowledge should be assembled, and how to deliver it with provenance, permissions, and the right amount of context.
Executive Summary. What this brief argues
- RAG and vector search are not the same thing. Retrieval-augmented generation is a broad loop. Vector search is one retrieval primitive inside that loop, and it is often not enough for production agents.
- Agents need bundles, not snippets. Multi-step work often requires customer records, policies, prior tickets, entitlements, metric definitions, workflow state, and access rules together, not a few semantically similar paragraphs.
- Pinecone's Nexus launch is a category signal. Pinecone now frames retrieval as compiled, structured, governed knowledge for agents, not only semantic similarity over chunks.
- PageIndex poses a stronger challenge to chunking. Its public docs argue that long professional documents should preserve hierarchy and section structure rather than being flattened into interchangeable text chunks.
- SAP is betting on structured enterprise memory. SAP's Dremio and Prior Labs acquisitions point toward semantic layers, governed data, and tabular reasoning as first-class ingredients for agent memory.
- The memory problem is broader than retrieval. Microsoft GraphRAG, Cloudflare Agents, and Google Cloud Memory Bank show that memory also includes relationships, runtime state, and long-term memory across sessions.
- The practical sequence is simple. Define the agent's contract with data, write down the bundle it needs, then choose the smallest set of primitives that delivers that bundle reliably.
Why classic RAG is no longer enough.
Classic RAG worked well for narrow question answering. Production agents do something harder. They carry out multi-step work, which means they need complete operating context, not only nearby text.
The first distinction to keep clear is simple. RAG is the loop. A system retrieves information and supplies it to a model for context. Vector search is one retrieval method inside that loop. It is extremely useful for fuzzy prose and semantic similarity. It is not a universal answer to agent memory.
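The loop-versus-primitive distinction can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `rag_answer`, `keyword_retriever`, and the toy corpus are hypothetical names invented here, and the keyword matcher deliberately stands in for any retrieval primitive (vector, SQL, graph) to show that the loop does not depend on which one is plugged in.

```python
from typing import Callable, List

# A retriever is any function that maps a query to a list of passages.
Retriever = Callable[[str], List[str]]

def keyword_retriever(corpus: List[str]) -> Retriever:
    """Hypothetical stand-in for one retrieval primitive."""
    def retrieve(query: str) -> List[str]:
        terms = set(query.lower().split())
        # Keep documents that share at least one whitespace-separated term.
        return [doc for doc in corpus if terms & set(doc.lower().split())]
    return retrieve

def rag_answer(query: str,
               retrievers: List[Retriever],
               generate: Callable[[str], str]) -> str:
    """The RAG loop: retrieve, assemble context, then generate."""
    passages: List[str] = []
    for retrieve in retrievers:
        for passage in retrieve(query):
            if passage not in passages:  # de-duplicate across primitives
                passages.append(passage)
    prompt = "Context:\n" + "\n".join(passages) + "\n\nQuestion: " + query
    return generate(prompt)

# Toy usage: the "model" simply echoes the prompt it was given.
corpus = ["Refund policy: 30 days.", "Shipping takes 5 days."]
prompt_seen = rag_answer("refund window",
                         [keyword_retriever(corpus)],
                         generate=lambda p: p)
print(prompt_seen)
```

Swapping the keyword matcher for a vector index, or adding a second retriever to the list, changes the primitive without changing the loop.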
That distinction matters because many production failures are not search failures in the narrow sense. The agent found some text. It still did not have the full bundle it needed to act. Support agents often need the customer record, entitlement rules, previous tickets, policy excerpts, and the current workflow state together. Finance agents often need governed metric definitions, source tables, exception logic, and reporting schedules together. Legal and procurement agents often need section hierarchy, definitions, schedules, and overrides together. A few semantically similar paragraphs do not solve those jobs.
This is why production agents waste so much effort reconstructing context from scratch on every run. They reopen the same documents, refetch the same records, re-ask known questions, and spend expensive tokens rebuilding context before useful work begins. The core memory problem is not only finding text. It is delivering the right shape of knowledge for the work being done.
The four shapes that keep showing up
The right retrieval unit must match the job. A chunk may be fine for an FAQ. A section tree may be right for a contract. A table may be right for financial work. A graph neighborhood may be right for dependency reasoning. A compiled brief may be right for repeated workflows.
From retrieval to compiled knowledge.
The most important signal in Pinecone's Nexus launch is not that it introduced one more retrieval product. It is that a company built on vector databases now says agents need something broader.
Pinecone's public Nexus product page now describes Nexus as a knowledge engine that compiles enterprise data into trusted knowledge and serves structured answers instead of ranked chunks. Pinecone's companion launch material makes the same argument more directly: the problem in agent systems is not only inference quality, it is the repeated cost of fetching, assembling, and reasoning over raw data at query time.
That is a material change in emphasis. Pinecone is not abandoning vector search. Its own launch language says vector primitives remain foundational. But Pinecone is clearly repositioning vector search as one primitive inside a broader memory layer. The new system compiles artifacts when source knowledge changes, then serves typed outputs to the agent when the task runs. That is much closer to operating context than classic similarity search.
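The compile-on-change pattern described above can be illustrated generically. The sketch below is not Pinecone's API or implementation; `CompiledStore` and `compile_policy` are hypothetical names, and the "compile" step is a trivial parser chosen only to make the example runnable. It shows the general idea: do the expensive assembly when the source changes, then serve a typed artifact at task time.

```python
import hashlib
from typing import Callable, Dict, Tuple

class CompiledStore:
    """Recompiles an artifact only when its source content changes."""

    def __init__(self, compile_fn: Callable[[str], dict]):
        self._compile = compile_fn
        # key -> (hash of source text, compiled artifact)
        self._cache: Dict[str, Tuple[str, dict]] = {}
        self.compile_count = 0

    def update_source(self, key: str, source: str) -> None:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        cached = self._cache.get(key)
        if cached is not None and cached[0] == digest:
            return  # source unchanged: keep the existing artifact
        self._cache[key] = (digest, self._compile(source))
        self.compile_count += 1

    def serve(self, key: str) -> dict:
        """Task-time read: no fetching or re-assembly, just a lookup."""
        return self._cache[key][1]

def compile_policy(text: str) -> dict:
    """Hypothetical compile step: first line is the title, the rest are rules."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {"title": lines[0], "rules": lines[1:]}

store = CompiledStore(compile_policy)
store.update_source("refund", "Refund policy\nWithin 30 days")
store.update_source("refund", "Refund policy\nWithin 30 days")  # no recompile
artifact = store.serve("refund")
```

The design choice worth noting is where the cost lands: compilation happens once per source change, while every agent run pays only for the lookup.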
KnowQL pushes the same idea at the query surface. Pinecone frames it as a shared vocabulary for structured, grounded knowledge in a single call. Pinecone's public design material shows query concepts such as scope, filters, grounded output with citations, and output-shape constraints. That is the right direction if the goal is to deliver bundles that match the job rather than dumping chunks into context and hoping the model can sort them out.
Pinecone also puts numbers on the table. Its Nexus product page claims 30x faster time-to-completion, task completion above 90 percent, and up to 90 percent lower token consumption than traditional agentic retrieval. Those are vendor claims, not neutral benchmarks, but they are still useful as a market signal. They show what the category itself now believes the problem is: too much effort is being spent on retrieval loops and too little on the actual task.
Even if you never buy Nexus, the launch matters. It tells you that the vector-database category itself now accepts that agents need pre-assembled, permission-aware, grounded memory rather than repeated brute-force retrieval.
PageIndex and the wrong retrieval unit.
PageIndex pushes the argument further. Its public docs reject chunking for many long professional documents and instead preserve tree structure for reasoning over sections, subsections, and document hierarchy.
PageIndex describes itself as a vectorless, reasoning-based RAG framework. Its public documentation says it transforms documents into a tree-structured index, lets the model reason over that structure, and avoids both vector databases and chunking for long complex documents. Whether or not one adopts PageIndex itself, the underlying argument is strong: some documents lose meaning when broken into semantically similar chunks.
This matters most in contracts, policies, financial filings, compliance manuals, and technical documentation. A section title can change the meaning of a clause. Definitions can govern language many pages later. Schedules can override general terms. Tables are not interchangeable with surrounding prose. The retrieval unit should respect that reality.
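A small sketch makes the point about retrieval units concrete. It is not PageIndex's API; `Section`, `find_with_path`, and the sample contract are hypothetical, and simple keyword matching replaces the model-driven reasoning that a real tree-based system would use. What it shows is that a tree lets every hit carry its heading path, which a flat chunk discards.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Section:
    title: str
    text: str = ""
    children: List["Section"] = field(default_factory=list)

def find_with_path(node: Section, keyword: str,
                   path: Optional[List[str]] = None
                   ) -> List[Tuple[List[str], str]]:
    """Depth-first search returning (heading path, text) for each match."""
    path = (path or []) + [node.title]
    hits: List[Tuple[List[str], str]] = []
    if keyword.lower() in node.text.lower():
        hits.append((path, node.text))
    for child in node.children:
        hits.extend(find_with_path(child, keyword, path))
    return hits

# Made-up contract used only for illustration.
contract = Section("Master Agreement", children=[
    Section("1. Definitions", '"Fees" means amounts listed in Schedule A.'),
    Section("7. Termination", children=[
        Section("7.2 Refunds", "Prepaid Fees are refundable pro rata."),
    ]),
])

hits = find_with_path(contract, "refundable")
for heading_path, text in hits:
    print(" > ".join(heading_path), "|", text)
```

Because the hit arrives with its path, the agent can see that the refund clause sits under Termination and can follow the capitalized term "Fees" back to the Definitions section.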
Which retrieval unit fits which job
Better embeddings only improve matching inside the retrieval format you already chose. They do not rescue a workflow built on the wrong unit of retrieval.
SAP's bet on tables, catalogs, and semantic layers.
SAP's recent acquisitions make the enterprise-memory argument concrete. Much of the knowledge that matters inside a business lives in governed tables, catalogs, and semantic layers, not only in PDFs and prose.
In May 2026, SAP announced plans to acquire both Dremio and Prior Labs. The two deals fit together cleanly. Dremio gives SAP a lakehouse and open catalog story across SAP and non-SAP data. Prior Labs gives SAP a stronger tabular-model story for reasoning over structured business data.
This is the clearest enterprise statement in the market. If the work is based on ERP, CRM, procurement, finance, or operations data, then the correct memory layer is often not a chunked document store. It is a governed data system with access control, lineage, semantic definitions, and reasoning that respects table structure.
The useful design rule is direct. Give the agent the knowledge in the same shape the business uses it. Documents when the source is document-based. Tables when the source is tabular. Metric definitions when governance matters. Workflow state when action depends on process state. Flattening everything into prose may simplify the pipeline, but it usually degrades the work.
Graphs, runtime state, and long-term memory.
The broader market now treats memory as more than a database query. Relationship memory, runtime state, and persistent memory across sessions are all becoming explicit product surfaces.
Three examples, Microsoft GraphRAG, Cloudflare Agents, and Google Cloud Memory Bank, matter because they describe different kinds of memory. Microsoft focuses on relational structure and global sensemaking across large corpora. Cloudflare focuses on agent runtime state and durable execution. Google focuses on long-term memories that persist across sessions and evolve over time. Together, they reinforce the same conclusion: memory is a multi-part architecture problem.
The industry is converging on a simple idea. Retrieval quality matters, but production agents also need durable state, long-term memory, relationship memory, permissions, provenance, and the right structure for each knowledge source.
Why bigger context windows are not the answer.
More context helps. It does not tell you what belongs in context, which source is authoritative, what the permission boundary is, or how to keep noise from degrading performance.
The most common shortcut in agent design is to assume that larger model context windows will eventually dissolve the retrieval problem. That is attractive because it seems to remove engineering decisions. In practice it only delays them.
A large context window does not decide which source is authoritative. It does not preserve document hierarchy. It does not tell the system whether a retrieved claim came from a governed table, a stale summary, or a previous model inference. It does not enforce permissions. And as more raw context is packed into the prompt, the model still has to spend attention budget sorting signal from noise.
Microsoft Research's later BenchmarkQED work is a useful check on the intuition that very long context makes retrieval design unnecessary. In that benchmark write-up, Microsoft reports strong LazyGraphRAG wins against competing methods, including a vector-based RAG configuration with a 1-million-token context window. The implication is not that long context is useless. It is that long context by itself does not solve memory design.
If a system cannot explain why a specific record, clause, summary, or memory block belongs in context, then the extra context is not a solution. It is just more text.
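One way to make that test operational is to require every candidate context item to carry its provenance, its permission requirement, and a stated reason before it is admitted. The sketch below is an assumption-laden illustration: `ContextItem`, `admit`, the `TRUST_ORDER` ranking, and the sample values are all invented here and do not come from any vendor.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

@dataclass(frozen=True)
class ContextItem:
    text: str
    source: str         # e.g. "governed_table", "summary", "model_inference"
    required_role: str  # role a caller must hold to see this item
    reason: str         # why the item belongs in context for this task

# Assumed ranking for illustration: lower number means more authoritative.
TRUST_ORDER = {"governed_table": 0, "document": 1,
               "summary": 2, "model_inference": 3}

def admit(items: Iterable[ContextItem], roles: Set[str],
          max_items: int) -> List[ContextItem]:
    """Drop items the caller may not see or that lack a reason,
    then keep the most authoritative ones up to the budget."""
    allowed = [i for i in items if i.required_role in roles and i.reason]
    # Unknown source types sort last; sorted() is stable for ties.
    allowed = sorted(allowed,
                     key=lambda i: TRUST_ORDER.get(i.source, len(TRUST_ORDER)))
    return allowed[:max_items]

# Made-up candidates for illustration only.
candidates = [
    ContextItem("Q1 revenue was about 4.1M", "summary", "analyst", "prior brief"),
    ContextItem("revenue_q1 = 4.2M", "governed_table", "analyst", "metric requested"),
    ContextItem("salary bands", "governed_table", "hr", "adjacent record"),
    ContextItem("revenue probably grew", "model_inference", "analyst", ""),
]
admitted = admit(candidates, roles={"analyst"}, max_items=2)
```

A larger context window would raise `max_items`, but it would not supply the source, role, or reason fields. Those still have to come from the memory design.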
A practical build sequence for production agents.
The useful decision order is simple. Do not pick a database first. Start by defining what the agent must receive, in what form, to do its job reliably.
Step 1. Define the agent's contract with data.
Write down the work the agent must do and the evidence it must have to do it safely. Do not say "relevant context." Say exactly what fields, records, sections, tables, policies, and approvals are needed. This usually reveals that the agent depends on multiple systems and that different sources need different handling.
Step 2. Write down the bundle.
Translate the task into a bundle the system must deliver. A support bundle might include account metadata, entitlement status, recent tickets, outage flags, and refund policy excerpts. A legal bundle might include clause hierarchy, definitions, exception schedules, and approval history. A finance bundle might include governed metric definitions, source tables, variance thresholds, and report cutoffs.
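Writing the bundle down can be as literal as declaring a type. The sketch below encodes the support bundle from this step as a dataclass and adds a completeness check. The field names follow the text, while `SupportBundle` and `missing_parts` are hypothetical names, and treating `None` as "not yet delivered" is an assumption made for the example.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class SupportBundle:
    """Everything a support agent must receive before it acts."""
    account_metadata: Optional[dict] = None
    entitlement_status: Optional[str] = None
    recent_tickets: Optional[List[dict]] = None
    outage_flags: Optional[List[str]] = None
    refund_policy_excerpts: Optional[List[str]] = None

def missing_parts(bundle: SupportBundle) -> List[str]:
    """Names of bundle fields that have not been delivered.
    An empty list counts as delivered (e.g. no recent tickets exist)."""
    return [f.name for f in fields(bundle)
            if getattr(bundle, f.name) is None]

partial = SupportBundle(
    account_metadata={"id": "A1"},
    entitlement_status="active",
    recent_tickets=[],
)
print(missing_parts(partial))
```

A check like this gives the system a precise answer to "is the agent ready to act?", where "relevant context" cannot.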
Step 3. Choose the smallest set of primitives that delivers that bundle.
Use vector retrieval for fuzzy prose. Use document trees when hierarchy matters. Use semantic layers and tabular reasoning for governed business data. Use graph retrieval when relationship reasoning matters. Use durable memory or compiled briefs when the same workflow repeats. Most real agents need a mix. The discipline is not adding every layer. It is adding only the layers the job requires.
Teams overbuild when they add a graph, a vector index, a semantic layer, and a memory store to a workflow that only needed one or two of those parts. Simple assistants often do not need a complex memory stack.
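The selection rule in Step 3 can be sketched as a lookup from knowledge shape to primitive, applied only to the shapes the bundle actually contains. The shape labels and the `choose_primitives` helper are hypothetical. The mapping itself restates the pairings given in this section.

```python
from typing import Dict, Iterable, List

# Pairings taken from Step 3; the string labels are illustrative.
PRIMITIVE_FOR_SHAPE: Dict[str, str] = {
    "fuzzy_prose": "vector_retrieval",
    "hierarchical_document": "document_tree",
    "governed_business_data": "semantic_layer_and_tabular_reasoning",
    "relationships": "graph_retrieval",
    "repeated_workflow": "durable_memory_or_compiled_brief",
}

def choose_primitives(bundle_shapes: Iterable[str]) -> List[str]:
    """Return the de-duplicated primitives needed for the given shapes.
    Raises KeyError for a shape the mapping does not know about."""
    chosen: List[str] = []
    for shape in bundle_shapes:
        primitive = PRIMITIVE_FOR_SHAPE[shape]
        if primitive not in chosen:
            chosen.append(primitive)
    return chosen

# A bundle made of policy prose plus governed metrics needs two layers, not five.
needed = choose_primitives(
    ["fuzzy_prose", "governed_business_data", "fuzzy_prose"]
)
print(needed)
```

The function never returns a primitive that no part of the bundle asked for, which is the guard against overbuilding described above.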
What to measure in agent logs.
The cleanest way to see the memory problem is not from architecture diagrams. It is from actual traces. Work logs show whether the system is completing work or repeatedly rebuilding the same context.
One question is usually the most revealing: how often does the agent ask for information the system already has? If that happens often, the memory layer is not actually serving the workflow. It is only storing data somewhere nearby.
A good memory architecture reduces retrieval churn, shortens the path to useful work, lowers repeated token spend, and makes the agent's evidence package more consistent from run to run.
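Retrieval churn can be estimated directly from traces. The sketch below assumes a simplified log format, a sequence of `(run_id, resource)` fetch events, which is an invention for this example. Real agent logs will need to be mapped into something similar. It computes the share of fetches that request a resource the system had already fetched earlier, in the same run or a previous one.

```python
from typing import Iterable, Set, Tuple

def repeat_fetch_rate(trace: Iterable[Tuple[str, str]]) -> float:
    """Fraction of fetch events whose resource was already fetched earlier
    in the trace. Returns 0.0 for an empty trace."""
    seen: Set[str] = set()
    fetches = 0
    repeats = 0
    for _run_id, resource in trace:
        fetches += 1
        if resource in seen:
            repeats += 1
        else:
            seen.add(resource)
    return repeats / fetches if fetches else 0.0

# Made-up trace: run r2 refetches two resources that run r1 already pulled.
trace = [
    ("r1", "policy:refunds"),
    ("r1", "customer:42"),
    ("r2", "policy:refunds"),
    ("r2", "customer:42"),
    ("r2", "ticket:7"),
]
rate = repeat_fetch_rate(trace)
print(f"repeat fetch rate: {rate:.0%}")
```

Tracking this number before and after a memory change is one way to check whether the new layer is reducing churn or only relocating the data.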
Start with the work, not the vendor.
The broader conclusion is straightforward. The memory era of AI infrastructure has arrived because production agents exposed the limits of systems built for chatbots. The core issue is not whether vector search was useful. It was. The issue is that production agents need richer memory systems than chatbots needed.
Pinecone now talks about compiled knowledge and single-call grounded answers. PageIndex argues that some important documents should never be chunked. SAP is investing in open catalogs, semantic layers, and tabular foundation models for the structured data that runs businesses. Microsoft keeps expanding GraphRAG because some work is fundamentally relational. Cloudflare treats durable state as native agent infrastructure. Google now exposes long-term memory as an explicit platform capability. All of these are different responses to the same underlying problem.
The right sequence remains the same. Start with the work. Define the bundle. Respect the shape of the knowledge. Then choose the fewest primitives that deliver that bundle reliably. If a team starts with the vendor, the architecture usually ends up serving the tool. If it starts with the work, the architecture has a chance to serve the agent.
Sources & references.
This brief is grounded in public product pages, documentation, and research posts available as of May 13, 2026. Product claims are identified as vendor claims where applicable.
- Pinecone Nexus product page, May 2026. Public description of Nexus as a knowledge engine for agents, including compile-on-change positioning, typed answers, field-level citations, and vendor-reported performance claims.
- Pinecone Nexus launch posts, May 2026. Public framing of context engineering, compiled artifacts, and the shift from chunk retrieval to knowledge compilation.
- Pinecone KnowQL public design material, May 2026. Public query-surface description including scope, filters, grounded output, and output-shape constraints.
- PageIndex developer documentation, March 2026, plus public GitHub repository. Public description of vectorless, reasoning-based retrieval over tree-structured document indexes without chunking.
- SAP News Center, May 4, 2026. SAP announcements for Dremio and Prior Labs, including Apache Iceberg-native lakehouse positioning, semantic layer and open catalog framing, and SAP's tabular-foundation-model strategy.
- Microsoft Research GraphRAG project and blog posts, 2024 to 2025. Public material covering GraphRAG, DRIFT Search, LazyGraphRAG, and BenchmarkQED for graph-based retrieval over complex private data.
- Cloudflare Agents and Durable Objects documentation, May 2026. Public docs describing stateful agents, persistent SQL-backed state, Session memory, and Durable Objects as the runtime substrate.
- Google Cloud Gemini Enterprise Agent Platform Memory Bank documentation, May 2026, plus Google Cloud ADK memory blog. Public docs describing long-term memories, extraction, consolidation, identity scope, and persistence across sessions.
- Chander Dhall Methodworks analysis, used as the framing thesis for this brief. Public sources above were used to validate product names, platform direction, and current public positioning.