Context Engineering for AI Agents: Curation Beats a Bigger Prompt

For two years the craft had a name: prompt engineering. Find the right phrasing, the right role-play, the right few-shot examples, and the model would behave. Then, sometime in 2025, the field quietly renamed itself. The new craft is context engineering, and the rename is not cosmetic — it reflects a real change in where the difficulty lives.
The reason is unglamorous. The models got good enough that the prompt stopped being the bottleneck. What surrounds the prompt — the documents, the conversation history, the tool results, the retrieved knowledge that all land in the context window — is now what decides whether an AI agent succeeds or quietly goes off the rails. And here is the counterintuitive part, the one this post is about: the fix is almost never "add more." A bigger prompt and a bigger context window are not the answer. Curation is.
TL;DR: Context engineering is the practice of curating the smallest set of high-signal tokens an AI agent needs, not writing a longer prompt or buying a bigger context window. Research shows models degrade as their input grows (the "context rot" effect), so retrieval and curation, not window size, are the real levers. MDflow makes curation a primitive: every folder's description is a ranking signal, and mdflow_get_context retrieves just the few best-matching documents on demand.
What is context engineering?
Context engineering is the practice of curating and maintaining the smallest set of high-signal tokens an AI model needs to do a task — across an entire multi-step agent run, not just a single prompt. Anthropic formalized the term on September 29, 2025, in its engineering note Effective context engineering for AI agents, defining it as "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference." The guiding rule it offers is sharp: good context engineering means finding "the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome."
The cleanest way to see the difference from prompt engineering is to notice what each one optimizes:
| | Prompt engineering | Context engineering |
|---|---|---|
| Unit of work | A single instruction | The whole context window, at every step |
| Core question | "How do I phrase this?" | "What is the smallest high-signal set the model needs right now?" |
| Main lever | Wording and examples | Curation and retrieval |
| Scope | One call | A multi-step, long-horizon agent run |
A prompt is the instruction you write. The context is everything the model can see while it follows that instruction: system prompt, tool definitions, prior messages, retrieved files, the half-finished output of three steps ago. As agents run longer and call more tools, that surrounding material — not the instruction — is what dominates the model's behavior. Most agent failures in production are not model failures; they are context failures. The model was fine. The information you fed it was not.
Why a bigger window is not the fix
The intuitive escape hatch is to make the window bigger and pour everything in. The research says that backfires. In July 2025, Chroma published a study it called "context rot" that tested 18 frontier models — including GPT-4.1, Claude Opus 4, and Gemini 2.5 — and found that every single one gets worse as input length grows. Accuracy dropped by 30 percent or more in some setups, often well before the model's documented context limit was reached. A related, long-known failure mode, "lost in the middle," means models attend well to the start and end of a long context but poorly to the material buried in the middle.
The cause is architectural, not a passing bug. A transformer has an attention budget: every token has to be related to every other token, so the number of pairwise relationships grows with the square of the input. Stretch the context and that budget thins out, with diminishing returns the further you push. As Anthropic puts it, "context windows of all sizes will be subject to context pollution and information relevance concerns" — so treating context as "a precious, finite resource" stays central even as windows grow to a million tokens and beyond.
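The quadratic growth is easy to make concrete. Here is a minimal sketch that counts unordered token pairs as a rough proxy for attention work. This is a simplification: real cost depends on the architecture and its implementation, and the function name is ours, not a library's.

```python
def pairwise_relationships(n_tokens: int) -> int:
    """Count unordered token pairs, n * (n - 1) / 2, as a rough proxy for attention work."""
    return n_tokens * (n_tokens - 1) // 2


# Each tenfold increase in tokens multiplies the pair count by roughly a hundred.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> {pairwise_relationships(n):,} pairs")
```

Going from 1,000 to 10,000 tokens takes the pair count from 499,500 to 49,995,000: ten times the input, about a hundred times the relationships competing for the same attention.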
This is the whole contrarian point in one line: you cannot retrieve your way out of a junk drawer, and you cannot prompt your way out of a polluted window. The lever is not size. The lever is what you choose to put in — and, just as importantly, what you choose to leave out.
Why context engineering is useful
For developers
- Curation is upstream; cleanup is downstream. Much of the published context-engineering advice is about damage control inside a running agent — compaction, summarization, trimming old turns. Those help, but they are patches applied after the context is already bloated. Organizing your knowledge once, so the right small slice is easy to fetch, is the upstream fix that makes the downstream patches rarely necessary.
- A description is a durable artifact; a prompt is a fragile one. A well-written folder description or a curated collection lives in your workspace, gets reviewed, and is reused by every agent and every query. A clever prompt string has to be re-tuned for each new model and each new task.
- It is portable and model-agnostic. Curated, plain-text context works the same whether the agent behind it is Claude, GPT, Gemini, or whatever ships next quarter. You are not betting your knowledge base on one vendor's prompt quirks.
- It controls cost and latency. Every irrelevant token you don't send is money you don't spend and latency you don't pay. Retrieving three high-signal documents beats stuffing thirty mediocre ones, on the bill and on the clock.
For AI agents
- Just-in-time beats load-everything. Anthropic's recommended pattern is for agents to hold "lightweight identifiers" — file paths, queries, links — and pull the actual content into the window only when a step needs it. That keeps each step's context small and sharp.
- High-signal labels beat blind similarity. A human-written description of what a body of knowledge is for is a far better ranking signal than cosine similarity over a pile of undifferentiated chunks. Curation gives retrieval something good to rank against.
- Less rot, better recall. A small, well-ranked context sidesteps both context rot and the lost-in-the-middle effect. The model can actually attend to everything you sent, because you didn't send too much.
- Provenance instead of guesswork. When an agent reads a curated document, it can cite it. When it stitches together loose retrieved fragments, it guesses — and guesses hallucinate.
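The just-in-time pattern can be sketched in a few lines. This is an illustration under our own assumptions, not Anthropic's or MDflow's code: ContextRef, build_step_context, and the in-memory store are hypothetical names standing in for file paths, queries, or links that an agent resolves only when a step needs them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ContextRef:
    """A lightweight identifier: cheap to hold, resolved only when needed."""
    label: str
    load: Callable[[], str]


def build_step_context(refs: list[ContextRef], needed: set[str]) -> str:
    """Resolve only the references this step asks for; the rest stay unloaded."""
    return "\n\n".join(ref.load() for ref in refs if ref.label in needed)


# Hypothetical in-memory "documents" standing in for files, queries, or links.
store = {
    "churn-drivers": "Top churn drivers by segment",
    "pricing-faq": "Pricing FAQ",
    "onboarding": "Onboarding runbook",
}
refs = [ContextRef(label=key, load=lambda key=key: store[key]) for key in store]

# This step only needs the churn document, so only that one is loaded.
context = build_step_context(refs, needed={"churn-drivers"})
print(context)
```

The agent carries three references but pays the token cost of one document. The other two never enter the window unless a later step asks for them.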
Which applications benefit most
- Coding agents and IDE assistants. Tools like Claude Code, Cursor, and Codex already live or die on repo-local context. They lean on CLAUDE.md-style notes plus just-in-time grep and file reads, which is context engineering by another name, and a curated, retrievable knowledge layer is the natural next step up from ad-hoc notes.
- Customer support and product copilots. A governed, curated knowledge base that the agent retrieves from beats a brittle prompt stuffed with pasted help-center articles. Curation is what keeps answers correct and citable.

- Data and analytics assistants. "Talk to your warehouse" tools need curated, described, cross-linked concepts — metric definitions, table semantics, runbooks. This is exactly the territory of Google's Open Knowledge Format, and it is curation-first by design.
- Research agents and "second brain" tools. Personal wikis and note vaults are only useful to an agent if the agent can find the right note. Folders of notes with meaningful descriptions turn a passive archive into a retrievable context source.
- Multi-agent systems. When several agents collaborate over MCP and A2A, each one needs its own scoped, curated slice of context. Hand every agent the whole knowledge base and you have multiplied the context-rot problem by the number of agents.
How MDflow fits
We did not set out to build a "context engineering platform" — the phrase barely existed when we started. But MDflow turns out to be one, because we made the bet the discipline is now converging on: the unit of useful agent context is a small, curated, human-readable set of Markdown — and curation should be a first-class primitive, not an afterthought.
What already lines up today
Folder descriptions are a ranking signal, not a label. In MDflow, every folder carries a description that states what the documents inside it are for. This is not decoration and it is not just organization — it is the primary ranking signal our agent retrieval uses. It is, quite literally, a context-engineering primitive: a human (or an agent) writes the high-signal sentence that tells a retriever "this region of knowledge is about enterprise churn analysis for the 2026 renewal push," and that one curated sentence does work a bigger prompt cannot.
# Folder: Churn analysis
description: >
  Customer churn analysis for the 2026 enterprise renewal push:
  monthly cohort retention, the top churn drivers by segment,
  and save-offer experiment results.
mdflow_get_context is a just-in-time retrieval primitive. MDflow's MCP server exposes mdflow_get_context: give it a topic and it scores folder descriptions first, then folder names and document titles, and returns only the best-matching Markdown bodies — readable context plus structured JSON. The agent fetches a small, high-signal set exactly when it needs it, rather than loading the workspace into the prompt up front.
mdflow_get_context("why are enterprise customers churning?")
→ ranks folder descriptions first, then names, then titles
→ returns the 2–3 best-matching markdown bodies — not the whole workspace
That is the difference between context engineering and prompt stuffing, made concrete. The bigger-prompt approach pastes ten documents and hopes the model finds the relevant two. The curated approach ranks against a description someone wrote on purpose and hands back exactly those two.
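To show the ranking order described above (descriptions first, then folder names, then titles), here is a minimal sketch. It is not MDflow's implementation: the Doc type, the WEIGHTS values, and the plain keyword-overlap scoring are all assumptions chosen for illustration, and a real retriever would at least handle stemming and synonyms.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    folder: str
    folder_description: str
    title: str
    body: str


# Assumed weights: description matches count most, then folder name, then title.
WEIGHTS = {"description": 3.0, "folder": 2.0, "title": 1.0}


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(topic: str, doc: Doc) -> float:
    """Weighted count of topic words found in each curated field."""
    query = tokens(topic)
    return (
        WEIGHTS["description"] * len(query & tokens(doc.folder_description))
        + WEIGHTS["folder"] * len(query & tokens(doc.folder))
        + WEIGHTS["title"] * len(query & tokens(doc.title))
    )


def get_context(topic: str, docs: list[Doc], limit: int = 3) -> list[Doc]:
    """Return up to `limit` documents with a non-zero score, best first."""
    ranked = sorted(docs, key=lambda d: score(topic, d), reverse=True)
    return [d for d in ranked if score(topic, d) > 0][:limit]


docs = [
    Doc("Churn analysis", "Customer churn analysis for the 2026 enterprise renewal push",
        "Top churn drivers by segment", "(body)"),
    Doc("Brand assets", "Logos, fonts, and color palettes for marketing",
        "Logo usage guide", "(body)"),
    Doc("Onboarding", "Runbooks for onboarding new enterprise customers",
        "Kickoff checklist", "(body)"),
]

results = get_context("enterprise churn drivers", docs)
print([d.folder for d in results])
```

The brand-assets document scores zero and never reaches the agent. The churn folder wins mainly on its description, which is the point: the sentence someone wrote on purpose carries most of the ranking weight.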
Collections curate across the folder tree. Some context does not live in one folder. MDflow collections group documents into a named, curated set independent of the folder hierarchy — a hand-picked context bundle an agent can pull as a unit.
Markdown-native, with raw .md twins. Every document is plain Markdown, and any shared link gets a raw .md twin with YAML frontmatter served over open CORS — high-signal, no rendering noise, fetchable and citable in a single request. (This very post has one; the link is at the top.)
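For an agent consuming such a raw twin, the first step is separating the frontmatter from the body. Here is a minimal sketch using only the standard library. It handles flat key: value pairs only, and a real client would use a proper YAML parser. The sample document and its field names are hypothetical, not MDflow's actual schema.

```python
def split_frontmatter(raw: str) -> tuple[dict[str, str], str]:
    """Split a Markdown document into flat `key: value` frontmatter and body."""
    lines = raw.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, raw  # no frontmatter: everything is body
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return {}, raw  # no closing delimiter: treat everything as body
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body


# Hypothetical raw twin; the field names are illustrative only.
raw = """---
title: Top churn drivers by segment
folder: Churn analysis
---

# Top churn drivers

Enterprise accounts are the focus of this note.
"""

meta, body = split_frontmatter(raw)
print(meta["title"])
print(body.splitlines()[0])
```

The metadata gives the agent something to cite and filter on. The body is clean Markdown it can drop straight into the context window.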
Agents read and maintain the curation. Through the MCP server and HTTP API, authenticated with a Personal Access Token, an agent can create documents, move them, and — crucially — keep folder descriptions accurate. The curation layer is not a static config file; it is something agents help maintain. The tedious bookkeeping that makes humans abandon their wikis is exactly what LLMs are good at.
Discovery and governance included. MDflow publishes an llms.txt index, an A2A agent card, and an OpenAPI 3.1 spec so agents can find the surface in the first place — and backs the curated knowledge with private and public sharing, anchored comments, automatic version history, and optional client-side encryption. Context you can trust is context you can govern.
This is also where the RAG-versus-curated-knowledge debate we opened in the OKF post actually resolves. It was never really "retrieval or curation." mdflow_get_context is curated retrieval — retrieval that ranks against human-written descriptions instead of raw embeddings. You keep the dynamism of retrieval and the precision of curation, which is precisely the hybrid the context-engineering literature keeps landing on.
Where we are headed
The following is direction, not a dated commitment, but it is the shape of our thinking:
- Richer ranking signals. Typed documents and tags (aligned with OKF's type and tags fields) so retrieval can filter as well as rank: "the runbooks in this folder," not just "this folder."
- A collections API and richer remote MCP. Serving a whole curated collection to an agent as a single cross-linked bundle, so an agent can pull an entire context set in one call instead of document by document.
- Workspaces as scoped context. Personal workspaces (shipping now) give each project its own scoped set of folders — a natural boundary for "only this context, for this agent."
- Agent-assisted curation. Letting an agent propose and maintain folder descriptions, types, and cross-links for knowledge you already have — turning curation itself into something the agent helps with.
- Capture-to-context. The Web Clipper already turns pages into clean Markdown; the next step is dropping clipped pages straight into a typed, retrievable, agent-ready folder.
The bottom line
The move from prompt engineering to context engineering is the industry admitting that the hard part is no longer the instruction — it is everything around it. And the evidence is now clear that the lazy fix, a bigger window with more pasted in, makes things worse, not better: models rot as their context grows. The durable advantage goes to whoever curates best — whoever can hand an agent the smallest, sharpest, most relevant set of tokens for the step in front of it.
That is the bet MDflow was built on. Write your knowledge as Markdown, give your folders meaning, and let agents retrieve the few documents that matter through a context tool designed for exactly that. A curated folder really does beat a bigger prompt — and now you can see why.
Frequently asked questions
What is context engineering?
Context engineering is the practice of curating and maintaining the smallest set of high-signal tokens an AI model needs to complete a task, across an entire multi-step agent run rather than a single prompt. Anthropic defines it as the set of strategies for curating and maintaining the optimal set of tokens during LLM inference. It treats the context window as a finite resource to be filled deliberately, not stuffed.
What is the difference between context engineering and prompt engineering?
Prompt engineering is about wording a single instruction well. Context engineering is about deciding what information surrounds that instruction in the context window — which documents, history, and tool results the model sees — across a whole agent run. Prompt engineering optimizes one call; context engineering optimizes the full lifecycle of what the model attends to.
Doesn't a bigger context window solve the problem?
No. Chroma's 2025 "context rot" study tested 18 frontier models and found every one degrades as input grows — accuracy can drop 30 percent or more well before the window is full, partly due to the lost-in-the-middle effect. Anthropic notes that context windows of all sizes remain subject to pollution and relevance problems, so curating a small, high-signal context beats supplying more raw tokens.
Is context engineering the same as RAG?
Not quite. Retrieval-augmented generation (RAG) is one technique for filling the context window — embedding and retrieving raw chunks at query time. Context engineering is the broader discipline of deciding what belongs in the window at each step, including curated documents, structured notes, and just-in-time tool reads. RAG is a tool inside context engineering, and it works far better when it retrieves over curated, well-labeled sources.
How does MDflow help with context engineering?
MDflow makes curation a first-class primitive. Every folder carries a description that acts as the primary ranking signal for retrieval, and the mdflow_get_context MCP tool takes a topic, scores folder descriptions first, then names and titles, and returns only the best-matching markdown bodies. Agents fetch a small, high-signal context just in time over the MCP server or HTTP API, instead of loading an entire workspace into the prompt.
Further reading
- Anthropic — Effective context engineering for AI agents
- Chroma Research — Context Rot: How Increasing Input Tokens Impacts LLM Performance
- MDflow — Google's Open Knowledge Format (OKF) · MCP and A2A: The Protocols Powering Agentic Interfaces
- MDflow — Markdown for AI agents · MCP documentation · API documentation · FAQ