
Building for Agents: A Practical Checklist for Developers

June 30, 2026 · 15 min read · by MDflow · view as .md

[Image: a geometric emerald-green wireframe AI agent core approaching a glowing modular structure built from stacked translucent blocks, like a building under construction, with an open gateway of light, on a dark terminal-grid background]

Andrej Karpathy spent part of a June 2025 talk arguing that software is entering a "decade of agents," and put the implication on a slide in three words: "BUILD FOR AGENTS." A year on, that idea has a name — building for agents — and it is no longer a forecast. Search crawlers had thirty years to settle into robots.txt and sitemap.xml. AI agents that read your docs, fill out your forms, and call your API on a person's behalf are arriving in years, not decades, and most of the web was never built with them in mind.

We have already written about the individual pieces of this puzzle: Google's Open Knowledge Format for structuring knowledge, MCP and A2A for how agents connect to tools and each other, llms.txt for how agents discover your site, and context engineering for keeping an agent's input small and sharp. This post is the synthesis: the practical checklist that ties all of it together, plus the ground those deep dives don't cover — AI crawler bots and robots.txt, why Markdown beats JavaScript for agents, and how to write content AI answer engines actually cite.

TL;DR — Building for agents means making your site discoverable, readable, and operable by AI — not just visible to human eyes and indexable by search bots. The practical checklist: allow AI crawlers deliberately in robots.txt, publish a curated llms.txt, serve clean Markdown instead of JS-only HTML, expose a documented API with simple token auth, speak MCP for real operability, add structured data and write answer-first, curate context instead of dumping it, and let agents write back safely. MDflow checks every box today.

What does "building for agents" actually mean?

Building for agents means making your content and functionality reachable, parseable, and operable by AI agents — not just legible to a human eye and crawlable by a search bot. That splits into three layers, and most sites today only handle half of the first one.

Discovery — can an agent find out what you offer and where to look? This is llms.txt, agent cards, sitemaps, and structured data.
Content — can it actually read what's there? This means clean Markdown or plain semantic text, not a page that renders blank until JavaScript runs.
Operability — can it act on your behalf? This is a documented API or an MCP server, with authentication that works for a process instead of a person clicking "Allow."

Search engine optimization only ever asked the first question, for one kind of reader. An agent trying to summarize your pricing page, fill out your signup form, or update a document on your behalf needs all three layers working, in order — discovery is useless if the content underneath can't be read, and content is useless if there's no way to act on it.

Why building for agents matters now

For developers

Agents are now a distinguishable traffic class. OpenAI and Anthropic each ship three separate bot identities — a training crawler, a search/citation crawler, and an on-demand user-agent — specifically because enough traffic now needs telling apart.
"Agent-ready" is becoming a product position, not an afterthought. Stripe ships an MCP server and an open-source Agent Toolkit alongside a co-authored Machine Payments Protocol for agent-initiated payments. Cloudflare's AI Index, in beta since September 2025, auto-generates an MCP server, an llms.txt, and a search API for every customer domain. Vercel publishes its own docs as llms-full.txt and plain Markdown by default. None of them treated agent access as a footnote.
Being unreadable to agents is shaping up to be the next "not indexed by Google": you are invisible to an entire class of users, except that this time the "user" might be acting with someone's money or calendar.

For AI agents

A site built for agents is the difference between finishing a task and failing silently. Hit a JavaScript-only wall, an undocumented endpoint, or an interactive-OAuth dead end, and an agent doesn't ask for help the way a human does — it stalls, retries, or guesses.
Clean discovery and content reduce the chance an agent acts on stale or mangled information, because it's reading the real current page instead of a scraped, half-rendered copy of it.

The building blocks: a practical checklist

None of this requires a platform migration. It's a sequence of small, concrete decisions — here's what they are, in roughly the order they pay off.

1. Let AI crawlers in, deliberately, via robots.txt. Decide on purpose instead of inheriting whatever your framework generated years ago. AI crawlers come in three flavors (training, search indexing, and user-initiated fetches), and robots.txt can treat each one differently:

| Bot | Vendor | Purpose |
| --- | --- | --- |
| `GPTBot` | OpenAI | Model training |
| `OAI-SearchBot` | OpenAI | ChatGPT search indexing |
| `ChatGPT-User` | OpenAI | On-demand fetch for a user's request |
| `ClaudeBot` | Anthropic | Model training |
| `Claude-SearchBot` | Anthropic | Search quality |
| `Claude-User` | Anthropic | On-demand fetch for a user's request |
| `Google-Extended` | Google | Token controlling Gemini/Vertex AI training use (not a separate crawler) |
| `CCBot` | Common Crawl | Open web corpus, widely reused as training data |
| `Bytespider` | ByteDance | Model training (reportedly inconsistent about honoring `robots.txt`) |

(Other AI-search products, Perplexity among them, run their own named crawlers too, so the pattern holds beyond this list.) A robots.txt that allows two training crawlers and opts out of a third looks like:

```text
# Training crawlers this site chooses to allow
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

# Training corpus this site opts out of
User-agent: CCBot
Disallow: /
```

Block training crawlers if you don't want your content used as training data; leave search and user-initiated agents alone unless you want your product invisible to anyone using an AI assistant on their own behalf.
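Before shipping rules like these, it can help to check what they actually permit. The sketch below feeds the example rules to Python's standard-library `urllib.robotparser` and asks, agent by agent, whether a page may be fetched. The URL is a placeholder, and the standard-library matcher is only an approximation: real crawlers apply their own matching logic, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this section, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: CCBot
Disallow: /
"""


def crawler_access(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, per user-agent token, whether these rules let it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "OAI-SearchBot", "ChatGPT-User"]
# example.com is a placeholder domain, not a real deployment.
access = crawler_access(ROBOTS_TXT, AGENTS, "https://example.com/blog/post")

for agent, allowed in access.items():
    print(f"{agent:15} {'allowed' if allowed else 'blocked'}")
```

Under this parser, agents with no matching group (here `OAI-SearchBot` and `ChatGPT-User`) fall through to "allowed", which is the behavior the advice above relies on when it says to leave search and user-initiated agents alone.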

2. Publish a curated llms.txt. A short, hand-picked Markdown map of your most important pages at yoursite.com/llms.txt, proposed by Jeremy Howard of Answer.AI in September 2024. We cover what a good one looks like, the common mistakes, and why it isn't an SEO trick in llms.txt Explained.
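As a rough illustration of the shape such a file takes, here is a minimal sketch that renders one from a hand-picked list of pages. It follows the layout described in the llms.txt proposal (an H1 title, a blockquote summary, then H2 sections of annotated links); the site name, URLs, and descriptions are all made up for the example, and `build_llms_txt` is a hypothetical helper, not part of any tool.

```python
def build_llms_txt(
    title: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render an llms.txt body: H1 title, blockquote summary, H2 link lists."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Every name and URL below is a placeholder chosen for illustration.
llms_txt = build_llms_txt(
    "Example Docs",
    "Hypothetical product docs, curated for agents.",
    {
        "Docs": [
            ("Quickstart", "https://example.com/docs/quickstart.md", "Install and first request"),
            ("API reference", "https://example.com/docs/api.md", "Endpoints and auth"),
        ],
        "Optional": [
            ("Changelog", "https://example.com/changelog.md", "Release history"),
        ],
    },
)
print(llms_txt)
```

The point of generating the file from an explicit list, rather than from a sitemap, is that the list stays curated: a page only appears if someone decided it belongs there.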

3. Serve clean Markdown or plain text, not a JavaScript-only page. Most AI crawlers and fetch tools, by widely reported accounts, don't execute JavaScript, so content that only appears after a client-side render is often invisible to them even though a human visitor sees it fine. (Googlebot is the standing exception: Google documents that it renders pages with a headless-Chrome-based engine before indexing.) The gap is real enough that a third-party product exists purely to paper over it. Jina AI's Reader turns any URL into clean Markdown on the fly by fetching with headless Chrome, extracting the main content with Mozilla's Readability library, and converting it with Turndown, and Jina has even trained a small dedicated model, ReaderLM-v2, just for HTML-to-Markdown conversion. If a third party had to build a converter for content like yours, an agent is already spending an extra step just to read you.
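One quick way to see your page the way a non-rendering fetcher might is to strip the markup from the raw HTML and look at what text is left. The sketch below does that with the standard-library `html.parser`; the two sample pages and the 40-character threshold are arbitrary assumptions for illustration, and a real audit would fetch your own URLs and use a threshold that suits your content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text present in raw HTML, ignoring script and style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def static_text(html: str) -> str:
    """Text available without running any JavaScript."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_js_only(html: str, min_chars: int = 40) -> bool:
    """Heuristic: almost no static text suggests a client-rendered shell."""
    return len(static_text(html)) < min_chars


# Two made-up pages: an empty app shell and a server-rendered page.
SPA_SHELL = (
    '<html><head><title>App</title><script>window.__DATA__={"a":1}</script></head>'
    '<body><div id="root"></div><script src="/bundle.js"></script></body></html>'
)
SERVER_RENDERED = (
    "<html><body><h1>Pricing</h1>"
    "<p>The Team plan costs $12 per user per month, billed annually.</p></body></html>"
)

print(looks_js_only(SPA_SHELL), looks_js_only(SERVER_RENDERED))
```

The heuristic is deliberately crude: it cannot tell a short page from an empty shell, so it is best used to flag pages for a manual look.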

4. Expose a real API, documented and authenticated for a process, not a person. An OpenAPI spec turns "read our docs and guess" into a typed, machine-checkable contract that an agent has a far better chance of calling correctly on the first try. Authentication matters as much as the schema: interactive OAuth assumes a human is sitting at a browser to click "Allow," which doesn't work for a cron job or a background agent. The industry is visibly correcting for this. Atlassian added API-token authentication to its remote MCP server in February 2026 specifically so clients could authenticate "without an interactive user flow in a browser" for "non-interactive environments," and GitHub's official MCP server accepts a personal access token in place of OAuth for the same reason. (MCP's own spec does support OAuth 2.1 for remote HTTP servers when authorization is implemented, with PKCE required and dynamic client registration available for clients that were not registered ahead of time. OAuth isn't disappearing; it's being made less painful for headless clients. But a simple bearer token remains the pragmatic default for most teams today.)
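To make the "authenticated for a process" point concrete, here is a minimal sketch of what a headless client does with a bearer token: it attaches one header and sends the request, with no browser step anywhere. The base URL, path, and token are placeholders, `build_api_request` is a hypothetical helper, and the request is only built, never sent.

```python
import json
import urllib.request
from typing import Optional


def build_api_request(
    base_url: str,
    path: str,
    token: str,
    payload: Optional[dict] = None,
) -> urllib.request.Request:
    """Build (but do not send) a bearer-token request for a background process."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
    }
    data = None
    if payload is not None:
        # A JSON body implies a write, so switch to POST and declare the type.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        headers=headers,
        method="POST" if payload is not None else "GET",
    )


# Placeholder endpoint and token; in practice the token would come from
# an environment variable or a secrets manager, never from source code.
read_request = build_api_request("https://api.example.com/", "/v1/documents", "example-token")
write_request = build_api_request(
    "https://api.example.com", "/v1/documents", "example-token", {"title": "Release notes"}
)

print(read_request.get_method(), read_request.full_url)
print(write_request.get_method(), write_request.full_url)
```

Sending either request is one call to `urllib.request.urlopen`, which is the whole appeal: there is no step at which a cron job or agent needs a person to click anything.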

5. Speak MCP — and consider A2A — for actual two-way operation. Discovery and clean content get an agent to read you. The Model Context Protocol is what lets it act: list, create, update, move — the verbs a human would otherwise click through a UI to perform. If your product is also something other agents might delegate work to, an A2A agent card extends the same discoverability outward. We cover both protocols, who built them, and how they differ in MCP and A2A: The Protocols Powering Agentic Interfaces.
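Under the hood, MCP messages are JSON-RPC 2.0, and the "verbs" arrive as `tools/list` and `tools/call` requests. The sketch below only builds those two messages so the shape is visible; the tool name `create_document` and its arguments are invented for illustration, and a real session would first go through the protocol's `initialize` handshake, normally handled by an MCP SDK rather than by hand.

```python
import itertools
import json
from typing import Optional

# JSON-RPC requests carry an id so responses can be matched to them.
_request_ids = itertools.count(1)


def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request object of the kind MCP clients send."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Ask the server which tools it offers.
list_tools = jsonrpc_request("tools/list")

# Invoke one tool by name. "create_document" and its arguments are
# hypothetical; a real server advertises its own tools and input schemas.
call_tool = jsonrpc_request(
    "tools/call",
    {
        "name": "create_document",
        "arguments": {"title": "Release notes", "folder": "docs"},
    },
)

print(json.dumps(list_tools))
print(json.dumps(call_tool, indent=2))
```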

6. Add structured data, and write answer-first. This is "generative engine optimization" (GEO) — optimizing to be cited or synthesized inside an AI-generated answer, rather than ranked in a list of blue links. The term comes from a 2024 paper by researchers at Princeton, Georgia Tech, IIT Delhi, and the Allen Institute for AI, who built a benchmark (GEO-bench) and found that techniques like adding citations and concrete statistics can lift a page's visibility in generative-engine answers by up to 40%. Two cheap, durable habits do most of the work: marking up FAQs and articles with schema.org JSON-LD (FAQPage, Article), so the structure is machine-explicit even where it no longer earns a visual rich result in classic search; and writing answer-first — putting the direct answer to a heading in the first sentence beneath it, exactly what an extraction-based answer engine is looking for.
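For the structured-data half of that advice, a minimal sketch of generating schema.org `FAQPage` JSON-LD from question-and-answer pairs might look like the following. The helper name `faq_jsonld` is hypothetical, and the sample pair is taken from this post's own FAQ; the output is meant to be placed inside a `<script type="application/ld+json">` element in the page head.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Serialize question/answer pairs as schema.org FAQPage JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


faq_markup = faq_jsonld(
    [
        (
            "Do I need to build an MCP server to be agent-friendly?",
            "No. MCP gives agents real two-way operability, but it's the "
            "deepest rung on a ladder, not the entry price.",
        )
    ]
)
print(faq_markup)
```

Note that the answer text follows the answer-first habit too: the direct answer ("No.") is the first thing in the `text` field.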

7. Curate context instead of dumping it. A folder of a thousand undifferentiated files isn't "context" — it's a haystack. Research on long-context models (Chroma's 2025 "context rot" study) found accuracy degrading as input grows, well before the context window is actually full, so the fix isn't a bigger window, it's a smaller, better-labeled one. We go deep on why curation beats raw retrieval, and what that looks like in practice, in Context Engineering for AI Agents.
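A toy version of "curate, then fit a budget" can be sketched in a few lines. The example below ranks folders by how many query words appear in their description and then adds content only while it fits a character budget. The scoring rule, the folder data, and the budget are all simplifying assumptions made for illustration; this is not how any particular retrieval tool works.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set, used for a crude keyword-overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_context(query: str, folders: list[dict], budget_chars: int) -> list[str]:
    """Pick folders whose description matches the query, within a size budget."""
    query_terms = tokens(query)
    scored = []
    for folder in folders:
        score = len(query_terms & tokens(folder["description"]))
        if score > 0:  # unlabeled or irrelevant folders never enter the context
            scored.append((score, folder))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    picked, used = [], 0
    for _, folder in scored:
        size = len(folder["content"])
        if used + size <= budget_chars:
            picked.append(folder["name"])
            used += size
    return picked


# Made-up workspace: two described folders and one undescribed dumping ground.
FOLDERS = [
    {"name": "billing", "description": "Invoices, refunds, and pricing plans",
     "content": "Refunds are issued within 5 days."},
    {"name": "oncall", "description": "Runbooks for incidents and paging",
     "content": "Page the database owner first."},
    {"name": "misc", "description": "", "content": "Assorted notes. " * 50},
]

picked = select_context("how do refunds work on pricing plans", FOLDERS, budget_chars=200)
print(picked)
```

Even in this toy form, the undescribed `misc` folder is never selected, which is the haystack problem in miniature: content without a label cannot be ranked, only dumped.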

8. Let agents write back, safely. If you want agents to do more than read, you need an audit trail for what they change — line diffs and one-click restore, not blind trust. The same access token that authenticates a read should be scoped and revocable, and every write, human or agent, should land somewhere you can review and undo. (For a worked example of an agent doing sustained, large-scale writing back to a knowledge base, see The Karpathy-Style Wiki.)
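The audit-trail idea is small enough to sketch with the standard library. The hypothetical `VersionedDocument` class below keeps every version with its author, returns a line diff for each write using `difflib.unified_diff`, and treats a restore as one more recorded write instead of rewriting history. It is an in-memory illustration under those assumptions, not a storage design.

```python
import difflib


class VersionedDocument:
    """Keep every version of a document, with a line diff for each write."""

    def __init__(self, text: str, author: str = "human") -> None:
        self.versions: list[tuple[str, str]] = [(author, text)]

    @property
    def text(self) -> str:
        return self.versions[-1][1]

    def write(self, new_text: str, author: str) -> list[str]:
        """Record a new version and return the unified diff against the last one."""
        current = len(self.versions)
        diff = list(
            difflib.unified_diff(
                self.text.splitlines(),
                new_text.splitlines(),
                fromfile=f"v{current}",
                tofile=f"v{current + 1}",
                lineterm="",
            )
        )
        self.versions.append((author, new_text))
        return diff

    def restore(self, version: int, author: str) -> list[str]:
        """Restore a 1-based version number by writing it as a new version."""
        return self.write(self.versions[version - 1][1], author)


doc = VersionedDocument("# Runbook\nRestart the service.\n")
agent_diff = doc.write(
    "# Runbook\nRestart the service.\nThen check the logs.\n", author="agent"
)
print("\n".join(agent_diff))

# A reviewer who disagrees can undo the agent's change in one step.
doc.restore(1, author="human")
```

Because the restore is itself a version, the history still shows that the agent made a change and that a human reverted it.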

Which applications benefit most

Developer docs and API references. The most mature case — IDE assistants and coding agents already live here constantly.
SaaS tools an agent operates on a user's behalf. Notes, project management, scheduling — anywhere "do this for me" is the point.
E-commerce and agentic commerce. Agents that compare products, fill carts, and check out need a machine-operable path, not just a pretty storefront.
Internal knowledge bases and wikis. Runbooks and policies an on-call or support agent reads — and increasingly updates.
Content and publishing sites chasing AI-answer-engine citations. GEO's actual home turf.
Platforms betting their API is the product. Stripe, Cloudflare, and Vercel are all building "agent infrastructure" into their core positioning, not as a side feature.

How MDflow fits

We didn't design MDflow around a checklist that didn't exist yet. But because the product was built on portable Markdown that people and agents can read and write, most of this list was already true by construction.

What already lines up today

Markdown-native storage. Every document is the same plain text a human edits and an agent reads — nothing to convert.
Folder descriptions as a ranking signal, feeding mdflow_get_context for curated, just-in-time retrieval instead of a dump of the whole workspace.
Two real operability surfaces, one auth model. A hosted MCP server (25 tools over Streamable HTTP at /api/mcp, plus a stdio build with one extra setup tool) and a REST API (22 operations across 12 paths, described in an OpenAPI 3.1 spec) — both authenticated with the same Personal Access Token (mdf_…), the pragmatic non-interactive pattern Atlassian and GitHub are also converging on. There's no OAuth flow to get stuck in.
A full discovery surface. An A2A agent card, an llms.txt index, and a self-contained docs.md manual, so an agent can find the rest of this list in the first place.
Raw .md twins on every shared document and collection — three dedicated routes serving plain Markdown with YAML frontmatter and open CORS, the exact "clean text, not a rendered page" pattern this post argues for. (This post has one; look for the link at the top.)
An open robots.txt stance. MDflow doesn't block a single AI crawler by name — the only disallowed paths are authenticated app routes (/documents, /settings, /doc/, /api/, /auth/). The blog, docs, and every shared document are open by default.
This post ships its own GEO, like the rest of the series: FAQPage structured data and answer-first sections, the same advice item 6 above gives you.
Automatic version history on every write — editor, API, or agent — with line diffs and one-click restore, so letting an agent write back is a reviewable decision, not a leap of faith.
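For readers who want to offer something like the raw .md twins on their own site, the general pattern can be sketched as follows: prepend YAML frontmatter to the Markdown body and serve it as `text/markdown` with a permissive CORS header. This is a generic illustration, not MDflow's implementation; `markdown_twin`, the field names, and the sample values are assumptions made for the example.

```python
import json


def markdown_twin(title: str, updated: str, body: str) -> tuple[dict[str, str], str]:
    """Return (HTTP headers, response body) for a plain-Markdown copy of a page."""
    # json.dumps yields a double-quoted, escaped string, which YAML also accepts.
    frontmatter = f"---\ntitle: {json.dumps(title)}\nupdated: {updated}\n---\n\n"
    headers = {
        "Content-Type": "text/markdown; charset=utf-8",
        # Open CORS so browser-based agents and tools can fetch the file directly.
        "Access-Control-Allow-Origin": "*",
    }
    return headers, frontmatter + body


headers, twin = markdown_twin(
    'Building for "Agents"',
    "2026-06-30",
    "# Building for Agents\n\nA practical checklist.\n",
)
print(headers["Content-Type"])
print(twin)
```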

Where we're headed

This is direction, not a dated commitment:

Naming specific AI crawlers in robots.txt, instead of relying on a generic allow.
Moving beyond Personal Access Tokens toward scoped, OAuth-style authorization for agents, following MCP's own authorization extensions.
A collections API and richer remote MCP, so an agent can pull a whole curated bundle in one call — the roadmap item every post in this series keeps landing on.

The bottom line

None of this is exotic anymore. Allowing the right crawlers, publishing a curated map, serving clean Markdown, documenting an API with sane auth, speaking MCP, marking up your content, curating instead of dumping, and keeping an audit trail — each step is small, and together they're the difference between a site agents can use and one they quietly give up on. The web spent thirty years building infrastructure for one kind of visitor. Karpathy's three words are the instruction for the next one: build for agents too.

MDflow was built this way from the start, for people and agents at once: write Markdown in the browser, give your folders meaning, and connect Claude, ChatGPT, Cursor, or Codex the same day.

Start free · Connect an AI agent · Read the API docs

Frequently asked questions

What does "building for agents" mean?

Building for agents means making your site's content and functionality reachable, parseable, and operable by AI agents, not just visible to human eyes and indexable by search bots. It spans three layers: discovery (can an agent find what you offer, via things like llms.txt and structured data), content (can it actually read your pages, which means clean Markdown or text rather than JavaScript-only rendering), and operability (can it act, through a documented API or an MCP server with sane authentication). Most sites today only handle half of the first layer.

Do I need to build an MCP server to be agent-friendly?

No. MCP gives agents real two-way operability, but it's the deepest rung on a ladder, not the entry price. You can become meaningfully more agent-friendly with a curated llms.txt, clean Markdown versions of your key pages, a deliberate robots.txt stance toward AI crawlers, and a documented REST API with simple token authentication. Add an MCP server once you want agents to take actions, not just read.

Should I block AI crawlers like GPTBot and ClaudeBot in robots.txt?

It depends on what each one does. Training crawlers (GPTBot, ClaudeBot, CCBot, and Google's Google-Extended token) feed model training and are the ones a disallow rule is meant for. Search and citation crawlers (OAI-SearchBot, Claude-SearchBot) index pages so AI answer engines can cite them. User-initiated agents (ChatGPT-User, Claude-User) only fetch a page because a real person asked their assistant to, which is closer to a visitor than a crawler. Most agent-friendly sites block none of them and focus on serving clean, readable content instead; block by name only if you have a specific reason, like paywalled content you don't want trained on.

Is building for agents the same thing as SEO?

No. SEO optimizes for ranking in a list of links. Generative engine optimization (GEO) and agent-accessibility optimize for being read, cited, or acted on directly by an AI system. A 2024 study from researchers at Princeton, Georgia Tech, IIT Delhi, and the Allen Institute for AI found that GEO techniques, like adding citations and concrete statistics, can lift a page's visibility in generative-engine answers by up to 40%, using different levers than classic SEO. Building for agents is broader still: it includes letting an agent operate your product, not just cite your content.

What's the single highest-leverage first step?

Serve a clean, readable version of your key pages: plain Markdown or simple semantic HTML that doesn't require JavaScript to render the content. Most AI crawlers and fetch tools reportedly don't execute JavaScript, so a page that's blank without it is often invisible to them even though a human sees it fine. Every other step (llms.txt, GEO, even MCP) assumes there's something legible underneath for it to point at.