Website Cloning

Import an entire existing website into your sitemd project. The clone tool scrapes text, images, navigation, forms, embeds, and card layouts — mapping each to the closest sitemd component. Styling is not copied; sitemd's three theme modes apply.

This is not a pixel-perfect replica. It extracts content and structure, then produces sitemd markdown that renders with your chosen theme. The result is a working site you can edit in markdown from day one.

Quick start

Use the /clone skill from any AI coding agent:

/clone https://example.com

The agent checks for Puppeteer, installs it if missing, scrapes the site, and walks you through importing the results.

Or use the CLI directly:

sitemd clone https://example.com

How it works

The clone pipeline runs in four phases:

  1. Crawl — Discover all pages via sitemap.xml, robots.txt, and link-following from the homepage
  2. Extract — Parse each page's HTML into structured data using a headless browser (Puppeteer)
  3. Map — Convert the structured data into sitemd markdown syntax
  4. Assets — Download images locally and rewrite URLs

The tool returns a structured JSON report. It does not create any files — the agent reviews the report with you, then creates pages with sitemd_pages_create and updates settings by editing the settings files directly.

Puppeteer dependency

Website cloning requires a headless browser to render JavaScript-heavy pages. The clone tool uses Puppeteer, which downloads Chromium (~150MB) alongside it.

Puppeteer is not bundled with sitemd. Install it only when you need to clone:

npm install puppeteer

The /clone skill handles this automatically — it checks for Puppeteer before starting and tells the agent to install it if missing. If you never use the clone feature, your project stays lightweight.

Page discovery

The crawler finds pages in this order:

  1. Fetch /sitemap.xml and parse all <loc> URLs (including sub-sitemaps)
  2. Fetch /robots.txt for additional sitemap references and Disallow rules
  3. BFS from the homepage, following all same-origin <a href> links
  4. Deduplicate by normalizing URLs (trailing slashes, fragments, query params)

All three sources are combined and deduplicated before extraction begins.
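
The deduplication step can be sketched as a URL normalizer. This is a simplified illustration of the idea (the crawler's exact normalization rules are internal):

```typescript
// Simplified sketch of URL normalization for deduplication (step 4).
function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  u.hash = "";   // drop #fragments
  u.search = ""; // drop ?query params
  if (u.pathname.length > 1 && u.pathname.endsWith("/")) {
    u.pathname = u.pathname.slice(0, -1); // strip trailing slash, keep root "/"
  }
  return u.toString();
}

const found = [
  "https://example.com/about/",
  "https://example.com/about#team",
  "https://example.com/about?ref=nav",
];
const unique = [...new Set(found.map(normalizeUrl))];
// all three collapse to a single entry: https://example.com/about
```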

Limits and rate control

  Setting    Default   Description
  maxPages   50        Maximum pages to crawl (hard cap: 200)
  skipPaths  []        URL prefixes to ignore (e.g. /admin, /api)
  Delay      500ms     Between requests (respects Crawl-delay from robots.txt)
  Timeout    30s       Per-page load timeout

The crawler waits for networkidle2 on each page to capture JS-rendered content, then pauses before the next request.
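
How Crawl-delay overrides the default can be illustrated like this. The function name and its simplified robots.txt parsing are invented for this sketch, not taken from the tool:

```typescript
// Sketch only: pick the larger of the default delay and robots.txt Crawl-delay.
// Crawl-delay is specified in seconds; the crawler's delay is in milliseconds.
function effectiveDelayMs(robotsTxt: string, defaultMs = 500): number {
  const match = robotsTxt.match(/^crawl-delay:\s*(\d+(?:\.\d+)?)/im);
  if (!match) return defaultMs;
  return Math.max(defaultMs, Number(match[1]) * 1000);
}

effectiveDelayMs("User-agent: *\nCrawl-delay: 2"); // → 2000
effectiveDelayMs("User-agent: *");                 // → 500
```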

Content extraction

Each page is parsed for structured content. The extractor walks the main content area (<main>, <article>, or the largest content container) and classifies every element:

  <h1>–<h6> — # through ###### headings
  <p> — Paragraphs with bold, italic, code, and links preserved
  <ul>, <ol> — - and 1. lists with nesting
  <img> — ![alt](src), queued for asset download
  <pre><code> — Fenced code blocks with language hint from CSS class
  <table> — Markdown tables
  <blockquote> — > blockquotes
  <hr> — ---
  <iframe> (YouTube, Vimeo, Spotify, etc.) — embed: URL
  <iframe> (generic) — Inline HTML passthrough
  <form> — form: block with extracted fields
  Grid/flex containers with repeating card-like children — card: / card-text: / card-image: / card-link:
  Button-styled <a> (class contains btn, button, cta) — button: Label: URL
  <details><summary> — Inline HTML (sitemd renders it natively)
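
A few of these mappings sketched in TypeScript. This is a toy version for illustration; the real extractor handles nesting, inline marks, and many more cases:

```typescript
// Toy version of the element → markdown mapping; real extraction is richer.
function toMarkdown(
  tag: string,
  text: string,
  attrs: Record<string, string> = {}
): string {
  const h = /^h([1-6])$/.exec(tag);
  if (h) return "#".repeat(Number(h[1])) + " " + text; // <h2> → "## …"
  switch (tag) {
    case "blockquote": return "> " + text;
    case "hr": return "---";
    case "img": return `![${attrs.alt ?? ""}](${attrs.src ?? ""})`;
    default: return text; // paragraphs and unknown tags pass through as text
  }
}

toMarkdown("h2", "How it works");                         // "## How it works"
toMarkdown("img", "", { alt: "logo", src: "/logo.png" }); // "![logo](/logo.png)"
```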

Page type detection

The extractor classifies each page by URL pattern and content heuristics:

  /blog/, /posts/, /articles/ — Blog post
  /docs/, /documentation/, /guide/ — Documentation
  /changelog — Changelog
  /roadmap — Roadmap
  Everything else — Standard page

Blog and docs pages are automatically assigned to their respective groups.
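
The URL-pattern half of the classification can be sketched as follows (the lowercase labels are illustrative; the content heuristics are a separate pass and not shown):

```typescript
// Sketch of URL-pattern page-type detection; content heuristics not shown.
function detectPageType(path: string): string {
  if (/^\/(blog|posts|articles)\//.test(path)) return "blog";
  if (/^\/(docs|documentation|guide)\//.test(path)) return "docs";
  if (path.startsWith("/changelog")) return "changelog";
  if (path.startsWith("/roadmap")) return "roadmap";
  return "standard";
}

detectPageType("/blog/launch-week"); // "blog"
detectPageType("/pricing");          // "standard"
```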

Card detection

The extractor looks for parent containers using CSS grid or flexbox with 2+ children. If each child contains an image, heading, text, or link, they're extracted as sitemd cards:

card: Product A
card-text: Our flagship offering with enterprise features.
card-image: /media/clone/product-a.jpg
card-link: Learn More: /products/a
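
The heuristic itself might look roughly like this. The element shape (El) is invented for this sketch and is not the extractor's actual data model:

```typescript
// Invented, simplified element shape for illustration only.
interface El {
  display?: string; // computed CSS display value
  children: El[];
  hasImage?: boolean;
  hasHeading?: boolean;
  hasText?: boolean;
  hasLink?: boolean;
}

function isCardGroup(el: El): boolean {
  const gridOrFlex = el.display === "grid" || el.display === "flex";
  if (!gridOrFlex || el.children.length < 2) return false;
  // every child must look card-like: an image, heading, text, or link inside
  return el.children.every(
    (c) => c.hasImage || c.hasHeading || c.hasText || c.hasLink
  );
}
```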

Form extraction

Each <form> is converted into a form: block with its fields extracted. The generated form uses a placeholder webhook URL; replace it with your actual endpoint (Zapier, Make, n8n, a serverless function, etc.). See Forms.

Site-level extraction

The homepage is also parsed for site-wide settings:

  Site title (from <title> or og:site_name) — settings/meta.md (title, brandName)
  Meta description — settings/meta.md (description)
  Navigation structure (<nav> links + dropdowns) — settings/header.md (items)
  Footer content (copyright, link groups) — settings/footer.md
  Social links (detected by domain: github, twitter, linkedin, reddit, etc.) — settings/footer.md (social)
  Accent color (CSS custom properties or button backgrounds) — settings/theme.md (accentColor)
  Logo/favicon — theme/images/

Asset downloading

Images referenced in page content are downloaded to media/clone/, and their URLs in the generated markdown are rewritten to local paths.

Disable asset downloading with --no-assets (CLI) or includeAssets: false (MCP tool).
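
The rewrite can be illustrated like this. Filename handling is simplified here; collision handling for duplicate names is not shown:

```typescript
// Simplified: map a remote image URL to a local media/clone/ path,
// keeping the original filename.
function localAssetPath(srcUrl: string): string {
  const name = new URL(srcUrl).pathname.split("/").pop() || "asset";
  return `/media/clone/${name}`;
}

localAssetPath("https://cdn.example.com/img/hero.jpg"); // "/media/clone/hero.jpg"
```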

The clone report

The tool returns a JSON report — it does not write files directly. The report contains:

{
  "source": "https://example.com",
  "crawled": 12,
  "site": {
    "title": "Example Co",
    "description": "We make things",
    "accentColor": "#e11d48",
    "suggestedMode": "light"
  },
  "navigation": { "header": [...], "footer": {...} },
  "groups": [{ "name": "blog", "items": [...] }],
  "pages": [
    {
      "slug": "/about",
      "title": "About Us",
      "type": "standard",
      "content": "# About Us<br><br>...",
      "components": ["cards", "buttons"],
      "confidence": 0.85,
      "warnings": ["Form action URL needs updating"]
    }
  ],
  "assets": { "downloaded": 8, "skipped": 2 },
  "warnings": [...],
  "unmapped": [{ "url": "/dashboard", "reason": "HTTP 401" }]
}

Each page includes a confidence score (0–1) indicating how well the extraction mapped to sitemd components. Pages below 0.5 are flagged for manual review.
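
For example, a reviewing agent or script can pull the flagged pages straight out of the report (field names taken from the sample above):

```typescript
// Collect pages flagged for manual review (confidence below 0.5).
interface ClonedPage {
  slug: string;
  confidence: number;
}

const pages: ClonedPage[] = [
  { slug: "/about", confidence: 0.85 },
  { slug: "/pricing", confidence: 0.4 },
];

const needsReview = pages
  .filter((p) => p.confidence < 0.5)
  .map((p) => p.slug);
// needsReview: ["/pricing"]
```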

The agent then creates pages via sitemd_pages_create, updates settings, and reports what needs manual attention.

MCP tool

The sitemd_clone tool is available to any MCP-capable agent:

Parameters:

  Name           Required   Description
  url            yes        Website URL to clone
  maxPages       no         Max pages to crawl (default 50, max 200)
  includeAssets  no         Download images locally (default true)
  skipPaths      no         URL path prefixes to skip

See MCP Server for setup and the full tool list.

CLI command

sitemd clone <url> [--max-pages N] [--skip /path1 /path2] [--no-assets]

When run in a terminal, the CLI pretty-prints a summary of what was found. When piped (non-TTY), it outputs the raw JSON report.

What needs manual attention

After cloning, the agent reports what it couldn't fully automate: placeholder form webhook URLs, pages with low extraction confidence, and any URLs that couldn't be crawled. It guides you through each item and suggests what to configure next.