Duplicating Cloudflare’s Markdown for Agents

A useful Cloudflare feature is Markdown for Agents. It makes an existing website “AI-friendly” by serving a Markdown representation of its normal HTML pages to agents, crawlers, and LLM-based tools.

What Cloudflare’s Feature Technically Does

## 1. Content Negotiation Trigger

An agent or crawler sends a request like this:

Accept: text/markdown

Cloudflare sees that header on an enabled zone, fetches the normal HTML page from the origin server, converts it at the edge, and returns Markdown instead of HTML.

The response looks roughly like this:

Content-Type: text/markdown; charset=utf-8
Vary: Accept
x-markdown-tokens: <estimated-token-count>
Content-Signal: ai-train=yes, search=yes, ai-input=yes

Cloudflare’s docs describe this as content negotiation: on an enabled zone, only requests that carry the Accept: text/markdown header get the converted response, and the same URL keeps serving HTML to everyone else.
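The negotiation check itself is small. Here is a minimal sketch in Python; the function name `wants_markdown` is hypothetical, and honoring `q=0` as "not acceptable" is an assumption based on standard HTTP semantics, not something the Cloudflare docs quoted here spell out.

```python
def wants_markdown(accept_header: str) -> bool:
    """Return True if the Accept header lists text/markdown with a non-zero q-value."""
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        if fields[0].lower() != "text/markdown":
            continue
        quality = 1.0  # HTTP default when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value.strip())
                except ValueError:
                    quality = 0.0  # treat an unparsable q-value as "not acceptable"
        return quality > 0
    return False
```

A browser-style header such as `text/html,application/xhtml+xml` returns False, so ordinary visitors keep getting HTML.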

Sources:

- Cloudflare Markdown for Agents docs (https://developers.cloudflare.com/fundamentals/reference/markdown-for-agents/)
- Cloudflare announcement (https://blog.cloudflare.com/markdown-for-agents/)

## 2. HTML to Markdown Edge Conversion

The pipeline is roughly:

request
↓
if zone/path has content_converter enabled
↓
if Accept includes text/markdown
↓
fetch original page as HTML from origin
↓
preprocess DOM:
- remove nav/chrome/header/footer
- remove scripts/styles
- preserve JSON-LD only
- remove non-content junk
↓
extract meta tags into YAML frontmatter
↓
convert body DOM to Markdown
↓
append JSON-LD as fenced json block
↓
return text/markdown

Cloudflare’s documented output structure is:

1. YAML frontmatter from page metadata.
2. Markdown body converted from the document body.
3. JSON-LD preserved at the end in a fenced json block.

## 3. Frontmatter Extraction

Cloudflare maps HTML metadata into YAML frontmatter.

Example output:

---
title: ...
description: ...
image: ...
---

Those values are pulled from tags such as:

<meta name="title">
<meta property="og:title">
<meta name="description">
<meta property="og:description">
<meta property="og:image">

When both are present, the standard meta field wins and the OpenGraph tag serves only as a fallback.
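That precedence rule can be sketched with nothing but Python’s standard library. The `FIELDS` mapping and the names `MetaCollector` and `extract_frontmatter` are hypothetical; the mapping only covers the tags listed above and is not Cloudflare’s actual field list.

```python
from html.parser import HTMLParser

# (frontmatter key, standard meta name, OpenGraph property).
# Assumed mapping, limited to the tags listed above.
FIELDS = [
    ("title", "title", "og:title"),
    ("description", "description", "og:description"),
    ("image", None, "og:image"),
]


class MetaCollector(HTMLParser):
    """Collects <meta name=...> and <meta property=...> content values."""

    def __init__(self):
        super().__init__()
        self.by_name = {}
        self.by_property = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if not content:
            return
        # setdefault keeps the first occurrence of each tag
        if attributes.get("name"):
            self.by_name.setdefault(attributes["name"].lower(), content)
        if attributes.get("property"):
            self.by_property.setdefault(attributes["property"].lower(), content)


def extract_frontmatter(html: str) -> dict:
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    result = {}
    for key, name, prop in FIELDS:
        standard = collector.by_name.get(name) if name else None
        value = standard or collector.by_property.get(prop)  # standard wins
        if value:
            result[key] = value
    return result
```

A page with both `<meta name="description">` and `<meta property="og:description">` ends up with the standard description in its frontmatter.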

## 4. JSON-LD Handling

Cloudflare preserves structured data from:

<script type="application/ld+json">...</script>

Then it appends that data to the Markdown output like this:

```json
{ … }
```

All other script and style content is stripped.
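For a clone, pulling the JSON-LD blocks out before the scripts are discarded might look like the sketch below. `JsonLdCollector` and `extract_json_ld` are hypothetical names, and silently skipping malformed JSON is a design assumption rather than documented Cloudflare behavior.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects parsed JSON-LD blocks and ignores every other script."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            script_type = (dict(attrs).get("type") or "").strip().lower()
            self._in_json_ld = script_type == "application/ld+json"
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # assumption: drop malformed JSON-LD instead of emitting it


def extract_json_ld(html: str) -> list:
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks
```

Parsing the block (rather than copying the raw text) also lets the proxy re-serialize it cleanly when appending it to the Markdown output.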

## 5. URL Affordances for Agents

Cloudflare’s own documentation site also exposes agent-friendly URLs such as:

/page/index.md
/llms.txt
/llms-full.txt
/product/llms.txt
/product/llms-full.txt

These are not necessarily part of every customer-zone Markdown-for-Agents deployment, but they are part of Cloudflare’s own “Docs for agents” system.

Source:

- Cloudflare Docs for agents (https://developers.cloudflare.com/docs-for-agents/)

## How We Would Duplicate It

The minimum viable clone is a reverse-proxy or sidecar that detects whether the requester wants Markdown.

### Request Routing

Implement a proxy that handles:

GET /some/page
Accept: text/markdown

If Accept includes text/markdown, return converted Markdown.

Otherwise, proxy the normal HTML unchanged.

Important response headers:

Content-Type: text/markdown; charset=utf-8
Vary: Accept
X-Markdown-Tokens: <count>
Content-Signal: ai-train=yes, search=yes, ai-input=yes
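A small helper can build those headers for each converted response. This is a sketch: `estimate_tokens` and `markdown_response_headers` are hypothetical names, and the four-characters-per-token estimate is a rough stand-in for a real tokenizer such as tiktoken.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token, rounded up.
    # Assumption for illustration; swap in a real tokenizer for accuracy.
    return (len(text) + 3) // 4


def markdown_response_headers(markdown: str) -> dict:
    return {
        "Content-Type": "text/markdown; charset=utf-8",
        # Vary tells caches that the body depends on the request's Accept header
        "Vary": "Accept",
        "X-Markdown-Tokens": str(estimate_tokens(markdown)),
        "Content-Signal": "ai-train=yes, search=yes, ai-input=yes",
    }
```

The `Vary: Accept` header matters most here: without it, a shared cache could hand the Markdown variant to a browser, or the HTML variant to an agent.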

## Conversion Engine

Use tools like:

- HTML parser: parse5, jsdom, cheerio, or Go goquery
- Readability extraction: Mozilla Readability-style algorithm
- Markdown conversion: Turndown, html-to-md, or a rehype / remark stack
- Token counting: tiktoken or an approximate tokenizer
- Cache key: url + normalized Accept + origin ETag/Last-Modified
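The cache key from the last bullet could be derived like this. The sketch assumes “normalized Accept” means collapsing the header to a variant label (`markdown` or `html`), and that the ETag is preferred over Last-Modified when both exist; `cache_key` is a hypothetical name.

```python
import hashlib
from typing import Optional


def cache_key(url: str, accept: str, etag: Optional[str], last_modified: Optional[str]) -> str:
    # Normalize Accept to a variant label so equivalent headers share one entry
    variant = "markdown" if "text/markdown" in accept.lower() else "html"
    # Either validator changes when the origin page changes
    validator = etag or last_modified or ""
    raw = "\n".join([url, variant, validator])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Because the validator is part of the key, a changed origin page naturally misses the cache and gets reconverted.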

## Conversion Rules

Practical duplication rules:

Remove:
- script except application/ld+json
- style
- noscript
- nav
- header
- footer
- aside
- form
- cookie banners
- modals
- ads
- tracking pixels
- SVG icon sprites
- hidden elements
- empty containers

Prefer:
- main
- article
- [role=main]
- schema.org Article/Product/FAQ content
- h1-h6
- p
- ul/ol/li
- table
- blockquote
- pre/code
- img alt text
- canonical URL
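To make the remove/prefer split concrete, here is a deliberately tiny converter built on Python’s standard `html.parser`. It is a sketch under heavy assumptions: it handles only a subset of the lists above (no tables, links, images, or nested blocks), and `SimpleMarkdown` and `html_to_markdown` are hypothetical names. A real clone would lean on a library such as Turndown instead.

```python
from html.parser import HTMLParser

# Subset of the "Remove" list: everything inside these elements is dropped
SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside", "form"}
# Subset of the "Prefer" list: block elements and their Markdown prefixes
BLOCK_PREFIX = {f"h{level}": "#" * level + " " for level in range(1, 7)}
BLOCK_PREFIX.update({"p": "", "li": "- "})


class SimpleMarkdown(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a removed element
        self.prefix = None   # Markdown prefix of the currently open block
        self.text = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_PREFIX:
            self.flush()
            self.prefix = BLOCK_PREFIX[tag]

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_PREFIX:
            self.flush()

    def handle_data(self, data):
        # Text outside a recognized block is ignored in this sketch
        if self.skip_depth == 0 and self.prefix is not None:
            self.text.append(data)

    def flush(self):
        content = " ".join("".join(self.text).split())  # collapse whitespace
        if self.prefix is not None and content:
            self.blocks.append(self.prefix + content)
        self.prefix, self.text = None, []


def html_to_markdown(html: str) -> str:
    parser = SimpleMarkdown()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.blocks)
```

Note that JSON-LD scripts are dropped here along with every other script, so they need to be collected in a separate pass before or alongside this step.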

## Output Format

Example:

---
title: Example Page
description: Short page summary.
image: https://example.com/cover.png
canonical: https://example.com/page
---

# Example Page

Page body converted to clean Markdown.

```json
{"@context":"https://schema.org","@type":"Article"}
```
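Assembling the three parts of that output is mostly string joining. In the sketch below, `build_document` and `yaml_scalar` are hypothetical names. Values are emitted as double-quoted YAML strings via `json.dumps`, which is a safe-quoting shortcut (YAML accepts JSON strings) rather than the unquoted style shown in the example.

```python
import json

FENCE = "`" * 3  # built at runtime so this listing contains no literal fence


def yaml_scalar(value: str) -> str:
    # A JSON string is also a valid YAML double-quoted scalar
    return json.dumps(value, ensure_ascii=False)


def build_document(meta: dict, body_markdown: str, json_ld_blocks: list) -> str:
    frontmatter = ["---"]
    for key, value in meta.items():
        frontmatter.append(f"{key}: {yaml_scalar(value)}")
    frontmatter.append("---")

    parts = ["\n".join(frontmatter), body_markdown.strip()]
    for block in json_ld_blocks:
        serialized = json.dumps(block, ensure_ascii=False)
        parts.append(f"{FENCE}json\n{serialized}\n{FENCE}")
    return "\n\n".join(parts) + "\n"
```

Quoting every value avoids surprises with titles that contain colons or leading special characters, at the cost of slightly noisier frontmatter.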

## llms.txt Support

Add:

/llms.txt
/llms-full.txt

/llms.txt should list important Markdown endpoints:

# Example Site

## Docs

## Products

/llms-full.txt can concatenate all important pages in Markdown for bulk ingestion or RAG.
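Both files can be generated from the same page inventory. The sketch below assumes each section is a list of (title, URL) pairs rendered as Markdown links, and that pages in `/llms-full.txt` are separated by a horizontal rule; `build_llms_txt` and `build_llms_full_txt` are hypothetical names.

```python
def build_llms_txt(site_name: str, sections: dict) -> str:
    """sections maps a heading to a list of (title, url) pairs."""
    lines = [f"# {site_name}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url in links:
            lines.append(f"- [{title}]({url})")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


def build_llms_full_txt(pages: list) -> str:
    """pages is a list of already-converted Markdown documents."""
    # Assumption: a horizontal rule is an adequate page separator
    return "\n\n---\n\n".join(page.strip() for page in pages) + "\n"
```

Calling `build_llms_txt("Example Site", {"Docs": [("Getting Started", "https://example.com/docs/start/index.md")]})` produces an H1, one H2 section, and one link line pointing at a Markdown endpoint.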

## Key Difference From CAPTCHA or Bot Protection

Turnstile verifies humans.

AI Crawl Control manages crawler access.

Markdown for Agents is the “AI-friendly sidecar” piece: the same human website remains available as HTML, but agent-requested Markdown output is served through content negotiation.
