User experiment: MCP Screaming Frog + Python API in Claude Code

Published on nicolasbillia.com — May 2026

[Image: the Screaming Frog frog between two glowing panels, API (Python code, left) and MCP (chat-style interface, right), on a navy grid background]

Disclaimer

This content was produced with AI assistance based on ingested context: half a dozen projects worked in Claude Code using Antonio Maculus’s API (link in the second paragraph) + session logs from the MCP iteration. All the technical breakdown comes from the logs Claude Code returned; the analysis ideas are my own, refined through brainstorming while writing the post.

Screaming Frog announced its official MCP server today. I plugged it into Claude Code as soon as it dropped and ran a smoke test on nicolasbillia.com (40 URLs) to measure, endpoint by endpoint, what it does well and how heavy the output is when it travels to the LLM context.

In parallel, we have been using Antonio Atilio Maculus’s Python library (LinkedIn / repo on GitHub) since early this year on half a dozen real audits: e-commerce, premium retail, higher ed, fashion, news media. That is the comparison baseline: a freshly released MCP versus a tool already battle-tested in production.

The natural question is “which one is better”, but after running both in parallel the answer is different: they are not the same, they do not compete, and combining them unlocks workflows that break with either one alone.

This is a user’s review, not a technical authority piece. Antonio knows his library ten times better than I do, and the official MCP I have only used for a few hours. What I add is the SEO practitioner’s lens: today’s experiment methodology, a capability catalog with absolute numbers, scoring per analysis category, 20 combined-use ideas, step-by-step setup, copy-paste SQL queries, and how to join the crawl with GSC and GA4.

Table of contents

0. How I measured — smoke test methodology
1. Capability catalog — what each one exposes
2. Scoring 1-3 per analysis category (traffic-light)
3. Decision tree — which tool do I use
4. Step-by-step setup
5. 20 combined-use ideas (MCP + API)
6. Classic combined workflow (3 steps)
7. 3-in-1 pipeline: SF + GSC + GA4 with real code
8. Derby APP.* tables glossary (~25)
9. Recipe book — 10 SQL queries ready to copy
10. Tokens consumed in Claude Code per MCP endpoint
11. Real gotchas that tripped us
12. FAQ
Closing

0. How I measured — smoke test methodology

I wanted a small site to avoid burning credits or time, and a familiar one to validate the results. I picked nicolasbillia.com (my personal site, ~40 HTML URLs, WordPress block theme, ES/EN hreflang).

Experiment setup:

Screaming Frog with the MCP server enabled at http://localhost:11435/mcp.
Claude Code connected to the MCP via claude mcp add seospider --transport http http://localhost:11435/mcp.
Antonio’s Python API loading the same crawl with Crawl.load(crawl_id, db_id_backend="derby", csv_fallback=False).

MCP endpoints measured (6 representative calls): sf_crawl, sf_crawl_progress, sf_generate_report (Redirects:All), sf_export_seo_element_urls (Canonicals:Missing), sf_bulk_export_page_content (visible_text), sf_export_embeddings.

What I measured on each call:

File size on disk (KB) — complete output saved to file.
Size of the output that goes back to the LLM context (chars / KB) — what Claude actually “sees” after the tool call.
Estimated tokens consumed in context (chars / 4, conservative).
Errors and dependencies: if the endpoint fails, what prior configuration it requires (PageSpeed API key, hreflang crawl flag, Database storage, etc.).
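
The chars / 4 estimate is simple enough to script. A minimal sketch: the function name and the sample string are illustrative, and the 4-characters-per-token ratio is the heuristic used in this post, not a real tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: character count divided by a fixed ratio."""
    return math.ceil(len(text) / chars_per_token)


# Hypothetical tool output: 600 characters returned to the LLM context.
sample_output = "x" * 600
print(len(sample_output), "chars ->", estimate_tokens(sample_output), "tokens (approx)")
```

Swapping in a real tokenizer would change the absolute numbers, but the relative weight of each endpoint should stay comparable.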

What this smoke test does NOT measure:

It is not a formal benchmark — N=1 site, 40 URLs, one single run.
Tokens do NOT scale linearly: when file_path is set, the returned sample is ~1 URL and does not grow with the site size. Without file_path, it does scale with the entire site.
I did not measure the SF crawl latency itself (SF takes the same time with or without MCP). What I measured is the cost in LLM context and the endpoint’s “ergonomics”.

Where the “one scales to 100K+ URLs and the other does not” claim comes from

Important disclaimer: I have not tested the MCP against a 100K-URL site yet. The MCP has only just launched, so there has been no time to run it on a real audit of that size, and today’s smoke test deliberately used a small site to map out each endpoint’s basic mechanics. Antonio Maculus’s library, by contrast, comes with prior usage: we ran it on real audits of tens of thousands of URLs over the past months. The scalability conclusion is therefore an inference from each solution’s architecture, partially validated through those real cases with Antonio’s library and not yet validated with the MCP on large sites.

Why Antonio’s library scales better in volume (architectural reasoning):

Reads directly from Apache Derby, SF’s internal database. No intermediate NDJSON or CSV parsing.
Queries use native SQL: planner, indexes, efficient JOINs. Query complexity does not inflate what travels to the LLM — only the SELECT result, not the full dataset.
Pre-computed tables like APP.INLINK_COUNTS avoid sitewide aggregation at runtime.
Lazy evaluation + pandas: the result materializes only when you ask for it.
Partial production validation: we ran audits on sites of tens of thousands of URLs without the library breaking, applying the documented workarounds (forced Derby, APP.INLINK_COUNTS instead of links("in").collect()).

Why the official MCP has a lower ceiling in volume (architectural reasoning + smoke test data):

Outputs travel as NDJSON or CSV. No SQL, no indexing — it is a row stream.
With file_path, the file is saved to disk and the LLM gets back a small sample (this is what we measured today). Without file_path, the entire file travels to the context.
In today’s smoke test on 40 URLs, sf_export_embeddings produced 539 KB. Linear extrapolation to 100K URLs: roughly 1.3 GB. The file is generated, but the LLM cannot read it directly: you have to parse it externally with a script.
MCP does not solve cross-table queries (no SQL). To join two bulk exports on a large site, you end up writing Python anyway.
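
The linear extrapolation in the list above is plain arithmetic. A small sketch that reproduces it, assuming size grows strictly in proportion to the URL count (the function name is illustrative):

```python
def extrapolate_kb(sample_kb: float, sample_urls: int, target_urls: int) -> float:
    """Scale an export size linearly with the number of URLs."""
    return sample_kb * target_urls / sample_urls


embeddings_kb = extrapolate_kb(539, 40, 100_000)   # 1,347,500 KB
embeddings_gb = embeddings_kb / 1_000_000          # decimal gigabytes
approx_tokens = embeddings_kb * 1_000 / 4          # chars / 4 heuristic

print(f"{embeddings_gb:.2f} GB, roughly {approx_tokens / 1e6:.0f}M tokens")
```

Whatever the exact ratio, the result is several orders of magnitude beyond any context window, which is why the file has to be processed outside the LLM.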

Honest conclusion: the MCP’s volume ceiling is the LLM context window, not SF itself. SF crawls millions of URLs without issue; the MCP acts as a bridge, and that bridge has limited bandwidth. Antonio’s API skips the bridge and goes straight to the bottom of the pool (the Derby DB), which is why it holds up better.

1. Capability catalog — what each one exposes

1.1 Official Screaming Frog MCP

61 native reports in 13 groups:
- Crawl Overview / Issues / Segments (3)
- Redirects (4): All, Chains, Redirect & Canonical Chains, Redirects to Error
- Canonicals (2): Chains, Non-Indexable
- Pagination (2): Non-200, Unlinked
- Hreflang (7): All, Non-200, Unlinked, Missing Return, Inconsistent Language, Non-Canonical Return, Noindex Return
- Insecure Content / SERP Summary / Orphan Pages (3)
- Structured Data (5): Validation Errors Summary, Validation Errors, Parse Errors, Rich Results Summary, Rich Results
- JavaScript Console Log Summary (1)
- PageSpeed (28): Opportunities Summary, CSS/JS Coverage, Minify, Reduce Unused, Render Blocking, LCP, CLS, Fonts, DOM Size, etc.
- Mobile (4): Viewport, Target Size, Content Sized, Illegible Font
- Accessibility Violations Summary (1)
- Cookies Summary (1)
130+ bulk exports in 16 groups: Queued, Links (12), Web Headers / Cookies (3), Path Type (4), Security (6), Response Codes (31), Content (6), Images (8), Canonicals (12), Directives (17), JavaScript (3), AMP (7), Structured Data (6), Sitemaps (4), Custom Search / Extraction (5), URL Inspection (3), Accessibility (12).
5 extra utilities: URL screenshots, embeddings export (vectors), bulk page content (raw HTML or visible text), Node.js runner for custom post-processing, browser opener.

What we could not use as a user (limitations we hit):

PageSpeed reports — require a PageSpeed Insights API key configured in SF before the crawl. Without it, the 28 reports come back empty.
Accessibility WCAG — requires enabling the Axe module in SF Config > Spider > Crawl > Accessibility before the crawl.
URL Inspection (Rich Results, Referring Pages, Sitemaps) — requires a GSC connection via OAuth inside SF.
Custom Search / Custom Extraction — depend on patterns (XPath, regex) defined in SF before the crawl.
Change Detection — requires 2 comparable crawls of the same site.
Hreflang reports — the crawl must have been run with “Crawl Hreflang” enabled.

1.2 Antonio Maculus’s Python API

159 GUI filters across 12 SEO-element modules:
- response_codes (32 filters)
- directives (18)
- hreflang (15)
- canonicals (12)
- headings H1/H2 (12)
- internal content types (12)
- structured_data (12)
- pagination (11)
- page_titles (10)
- meta_description (9)
- images (8)
- meta_keywords (4)
9 pre-built audit reports: broken_links_report, broken_inlinks_report, nofollow_inlinks_report, title_meta_audit, indexability_audit, orphan_pages_report, security_issues_report, canonical_issues_report, hreflang_issues_report.
~25 Derby APP.* tables accessible via direct SQL (full glossary in section 8).
6 backends: Derby (fast, requires Java), DuckDB (analytical cache), CSV (no external deps), SQLite, CLI, Hybrid (Derby + CSV fallback).
Outputs: pandas / polars / dicts / lazy iterators / .to_sql() to inspect the generated query.
Coverage: 601 / 628 tabs mapped (95.7%), 15,490 / 15,589 fields (99.4%).

What hangs or what we could not use:

crawl.summary() hangs on large crawls (100K+ URLs). Workaround: direct SQL over APP.* tables.
canonical_issues_report() and indexability_audit() hang with the default backend. Force db_id_backend="derby" + csv_fallback=False.
Sitewide links("in").collect() on large sites. Use the pre-computed APP.INLINK_COUNTS instead.
Filters marked TODO in source: pixel-width titles/metas, “Is Relative” canonicals, “Incorrect Language Codes” hreflang, “Background Images”, several pagination and structured-data sub-filters.

2. Scoring 1-3 per analysis category (traffic-light)

Scale: 3 super useful — 2 works with caveats — 1 not recommended.

Column criteria and source of each score:

Speed: time from the query to having the data ready to analyze (call → pandas or file). Score based on today’s smoke test (MCP) + accumulated real audits (API).
Ease: learning curve and typical friction of each endpoint. Score based on today’s smoke test (MCP) + production usage (API).
Volume: inferred capacity to handle 100K+ URLs without breaking. This score is architectural inference (see the scalability reasoning at the end of section 0), partially validated with the API on real audits of tens of thousands of URLs, but not yet tested with the MCP at that size.

Important: what follows is a matrix of reasoned inferences, not closed conclusions. Each cell will be validated against real sites and this post will be updated as we get hard data. If you have a use case or a site where a cell does not hold, let me know and I’ll incorporate it.

Analysis category	MCP Spd	MCP Ease	MCP Vol	API Spd	API Ease	API Vol
Response codes
Redirects (chains, loops)
Canonicals
Hreflang
Robots directives
Page titles / Meta description
Headings H1/H2
Inlinks / Outlinks (granular)
Content duplicates (exact / near / similar)
Images (alt, size, dimensions)
Structured data / Rich Results
PageSpeed / Core Web Vitals
Mobile usability
Accessibility (WCAG)
Screenshots / Embeddings / Node.js

Quick read: for granular inlinks and cross-table queries, the API wins. For PageSpeed, accessibility, screenshots and embeddings, MCP is exclusive or clearly better. For the bulk of standard SEO analysis (canonicals, hreflang, directives, titles), both work fine — the choice depends more on workflow than on tool.

3. Decision tree — which tool do I use

Do you want to launch the crawl from the LLM?
├── Yes ───────────────────────────→ MCP (the API only reads)
└── No (crawl already exists)
    │
    └── What kind of analysis?
        │
        ├── PageSpeed / WCAG / Screenshots / Embeddings → MCP exclusive
        │
        ├── Cross-table (canonical × hreflang × inlinks) → API (direct SQL)
        │
        ├── High volume (1M+ inlinks / 100K+ URLs)
        │   ├── Extraction only → API (pandas, no NDJSON parsing)
        │   └── Native client-grade reports → MCP with file_path
        │
        ├── Reusable pipeline (multiple clients)
        │   └── API (pandas + SQL are cleaner to maintain)
        │
        └── SF catalog reports
            ├── For client deliverable → MCP (native CSV)
            └── For intermediate analysis → either
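
The same tree can be written as a small function, which is handy if you want an agent or a script to pick the tool. This is only a sketch: the category labels are made up for illustration, and the branches simply mirror the tree.

```python
def pick_tool(launch_crawl: bool, analysis: str = "", client_deliverable: bool = False) -> str:
    """Return "MCP", "API" or "either", following the branches of the decision tree."""
    if launch_crawl:
        return "MCP"  # the API only reads existing crawls
    if analysis in {"pagespeed", "wcag", "screenshots", "embeddings"}:
        return "MCP"  # MCP exclusive
    if analysis in {"cross_table", "high_volume_extraction", "reusable_pipeline"}:
        return "API"
    if analysis == "high_volume_native_reports":
        return "MCP"  # always with file_path set
    if analysis == "catalog_report":
        return "MCP" if client_deliverable else "either"
    raise ValueError(f"unknown analysis category: {analysis!r}")


print(pick_tool(False, "cross_table"))       # API
print(pick_tool(False, "catalog_report"))    # either
```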

4. Step-by-step setup

4.1 Claude Desktop

Official reference: Screaming Frog SEO Spider — Configuration / MCP Server. At the time of writing, the public section specifically about the MCP server is not yet detailed at that URL — the setup below is what worked on our install.

1. Open SF and enable the MCP server from Configuration > API Access > MCP Server (option available in recent versions).
2. Confirm the server is listening on http://localhost:11435/mcp (default port).
3. Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS path) and add:

{
  "mcpServers": {
    "seospider": {
      "type": "http",
      "url": "http://localhost:11435/mcp"
    }
  }
}

4. Restart Claude Desktop (quit and reopen).
5. In chat, type “list available tools”. The sf_* tools should appear.
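
If the tools do not appear, a quick way to check that something is actually listening on the port is a plain TCP connection test. A minimal sketch using only the standard library; it confirms that the port is open, not that the MCP protocol itself is answering.

```python
import socket


def is_listening(host: str = "localhost", port: int = 11435, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("MCP port open:", is_listening())
```

A False here usually means SF is closed or the MCP server option is not enabled.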

4.2 Claude Code (the flow we used)

SF must be open with the MCP server enabled (same steps 1 and 2 as above).

# Add the server to Claude Code's config:
claude mcp add seospider --transport http http://localhost:11435/mcp

# Verify it registered:
claude mcp list

Equivalent alternative — edit ~/.claude.json manually:

{
  "mcpServers": {
    "seospider": {
      "type": "http",
      "url": "http://localhost:11435/mcp"
    }
  }
}
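
Editing the JSON by hand is error-prone when the file already holds other settings. A hedged sketch of a merge that adds the entry without touching the rest; the helper name is hypothetical, and it is worth backing up the file before running anything like this.

```python
import json
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, url: str) -> dict:
    """Add or update one HTTP MCP server entry, keeping all other settings."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"type": "http", "url": url}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example call, commented out so nothing is modified by accident:
# add_mcp_server(Path.home() / ".claude.json", "seospider", "http://localhost:11435/mcp")
```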

Once registered, the tools surface inside Claude Code as mcp__seospider__sf_*. A basic crawl:

sf_crawl(crawl_url="https://example.com", crawl_name="Initial audit")
sf_crawl_progress()
sf_generate_report(
  category="Hreflang:Missing Return Links",
  export_type="CSV",
  file_path="hreflang.csv"
)

Playful but real shortcut: if this feels dense, copy this post’s URL, paste it into Claude Code with /url <link>, then ask “set up the Screaming Frog MCP per the instructions in this post”. It’s legitimate. It is exactly the use case MCP was designed for.

5. 20 combined-use ideas (MCP + API)

#	Idea	Combination	Output
1	One-shot full technical audit	MCP launches crawl + API builds prioritized issues DataFrame	CSV sorted by impact + reusable crawl_id
2	Semantic cannibalization with commercial value	MCP exports embeddings + API joins with GA4 conversions	List of URLs to redirect/merge with revenue justification
3	Post-deploy hreflang diagnostic	MCP launches crawl + API SQL on MULTIMAP_HREF_LANG_*	Per-variant fix table
4	Orphan pages with traffic	API extracts orphans + MCP downloads screenshots	Visual list + internal linking recommendation
5	Cross-path canonical audit post-migration	API custom SQL + MCP confirms via bulk export	Validated 301 redirect list
6	User-Agent cloaking detection	MCP Node.js script (curl + UA spoof) + API compares HTML hash	URLs with divergent browser vs Googlebot response
7	WCAG audit grouped by template	MCP Accessibility:All Violations + API groups by path	Per-template priority list, not per-URL
8	Zombie pages with GSC impressions	MCP Directives:Noindex Inlinks + API joins with GSC	Noindex URLs still receiving traffic — candidates for real removal
9	Internal linking gaps in semantic clusters	MCP embeddings + API APP.INLINK_COUNTS + cosine similarity	Highly similar URL pairs with no mutual link
10	Structured data errors per template	MCP Structured Data:Validation Errors + API groups by URL pattern	Template-level errors, not URL-by-URL
11	Internal redirect chains	MCP bulk Response Codes:3xx Inlinks + API SQL traversal	Internal chains map with hop count
12	Images missing alt prioritized by traffic	MCP Images:Missing Alt + API joins APP.INLINK_COUNTS and GA4	Priority list of images by impact
13	PageSpeed comparison template vs template	MCP PageSpeed reports + API groups by path pattern	Slowest templates identified, not isolated URLs
14	JS rendering issues classified	MCP JavaScript Console Log + API filters by severity	Pages with broken JS grouped by error type
15	Duplicates near vs exact per section	API SQL on APP.NEAR_DUPLICATE / DUPLICATES_TITLE	Consolidation decision with proximity criteria
16	Canonicals contradicting hreflang	API multi-table SQL (URLS + MULTIMAP_HREF_LANG_*)	Consolidation-policy conflicts
17	Rich Results eligibility	MCP Rich Results Features + API estimates SERP impact	Rich-snippet eligible URLs prioritized
18	Pre/post deploy diff	API CrawlDiff between 2 crawls + MCP generates diff reports	Automated post-deploy QA
19	Cookie mapping per path (GDPR)	MCP All Cookies bulk + API groups by path	GDPR / consent audit with coverage
20	Mixed content per section	MCP Security:Mixed Content + API groups by path	Priority by crawl depth and traffic
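
To make idea 9 concrete, here is a toy sketch of the pairing logic in pure Python. The input shapes are assumptions: a dict of URL to embedding vector (parsed from the MCP export) and a set of (source, target) link pairs (from the API). "No mutual link" is read here as no link in either direction, and the threshold is arbitrary.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def linking_gaps(embeddings, links, threshold=0.85):
    """Return (url_a, url_b, similarity) for similar pairs with no link either way."""
    gaps = []
    for (url_a, vec_a), (url_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        sim = cosine(vec_a, vec_b)
        linked = (url_a, url_b) in links or (url_b, url_a) in links
        if sim >= threshold and not linked:
            gaps.append((url_a, url_b, round(sim, 3)))
    return sorted(gaps, key=lambda gap: gap[2], reverse=True)


# Toy data: 3-dimensional vectors instead of the real 1,536 dimensions.
toy_embeddings = {"/a": [1.0, 0.0, 0.0], "/b": [0.9, 0.1, 0.0], "/c": [0.0, 1.0, 0.0]}
toy_links = {("/a", "/c")}
print(linking_gaps(toy_embeddings, toy_links))
```

The pairwise loop is quadratic, so on a large site it makes sense to run it per cluster or per section rather than sitewide.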

6. Classic combined workflow (3 steps)

1. MCP launches the crawl: sf_crawl(crawl_url=...) + sf_crawl_progress() until 100%.
2. API reads with SQL: Crawl.load(crawl_id, db_id_backend="derby").sql(...). Multi-table joins, complex aggregations.
3. MCP exports final reports for the client deliverable with sf_generate_report(category=..., file_path=...).

The LLM in Claude orchestrates the three steps in a single conversation.

7. 3-in-1 pipeline: SF + GSC + GA4 with real code

A crawl alone tells half the story. The other half is in GSC (what Google sees) and GA4 (what the user does). Once the crawl data is in pandas, joining it with GSC and GA4 is a pandas merge, the DataFrame equivalent of a SQL join.

import pandas as pd
from screamingfrog import Crawl
# GSC / GA4 helper modules (separate from screamingfrog; swap in your own clients)
from search_console_connect import authenticate as gsc_auth, get_client as gsc_client
from extract_ga4 import get_client as ga4_client, extract_landing_pages

# 1) SF crawl -> URL DataFrame
crawl = Crawl.load("CRAWL_ID", db_id_backend="derby", csv_fallback=False)
df_sf = crawl.sql("""
  SELECT u.URL_PATH AS url,
         u.STATUS_CODE,
         u.INDEXABILITY,
         u.CANONICAL_LINK_ELEMENT,
         ic.INLINKS_COUNT
  FROM APP.URLS u
  LEFT JOIN APP.INLINK_COUNTS ic ON ic.URL_ID = u.URL_ID
  WHERE u.CONTENT_TYPE = 'text/html'
""").to_pandas()

# 2) GSC 90 days -> clicks / impressions / position per URL
gsc = gsc_client(gsc_auth())
response = gsc.searchanalytics().query(
    siteUrl="sc-domain:example.com",
    body={
        "startDate": "2026-02-19",
        "endDate": "2026-05-19",
        "dimensions": ["page"],
        "rowLimit": 25000,  # API default is 1,000 rows; 25,000 is the per-request maximum
    },
).execute()
# A response with no data has no "rows" key, so default to an empty list
df_gsc = pd.DataFrame(
    response.get("rows", []),
    columns=["keys", "clicks", "impressions", "ctr", "position"],
).rename(columns={"keys": "url"})
df_gsc["url"] = df_gsc["url"].str[0]  # "keys" is a one-item list: [page_url]

# 3) GA4 30 days -> sessions / conversions per landing
df_ga4 = extract_landing_pages(ga4_client(), property_id="123456789",
                               start_date="30daysAgo", end_date="today")

# 4) Merge on URL. All three frames must use the same URL format
#    (full URL vs path, trailing slash), otherwise the left joins match nothing.
df = (
    df_sf.merge(df_gsc, on="url", how="left")
         .merge(df_ga4, on="url", how="left")
)

print(df.head())
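
The merge only works if the three sources share one URL format: GSC returns full URLs, while GA4 landing-page dimensions typically return paths. A minimal normalizer sketch, assuming a single-host site where query strings and trailing slashes can be ignored (for multi-host hreflang setups, keep the host in the key):

```python
from urllib.parse import urlsplit


def normalize_url(value: str) -> str:
    """Reduce a full URL or a bare path to a comparable path key."""
    path = urlsplit(value.strip()).path.rstrip("/")
    return path or "/"


print(normalize_url("https://example.com/blog/post/?utm_source=x#top"))  # /blog/post
print(normalize_url("/blog/post"))                                       # /blog/post

# Applied to each frame before merging, for example:
# df_sf["url"] = df_sf["url"].map(normalize_url)
```

After normalizing, it is worth checking for duplicate keys on each side, since several raw URLs can collapse into the same path.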

8. Derby APP.* tables glossary (~25)

Table	Contents
APP.URLS	Crawled URLs with full metadata (status, indexability, canonical, hreflang flags, content type)
APP.LINKS	Raw link graph (SRC_ID, DST_ID, LINK_TYPE). LINK_TYPE=13 = HTML hreflang
APP.UNIQUE_URLS	URL_ID → normalized URL string mapping
APP.INLINK_COUNTS	Pre-computed inlink counts per URL (fast, no aggregation needed)
APP.DUPLICATES_TITLE	URLs that share the same title
APP.DUPLICATES_META_DESCRIPTION	URLs with duplicate meta description
APP.DUPLICATES_H1	URLs with duplicate H1
APP.DUPLICATES_H2	URLs with duplicate H2
APP.MULTIMAP_CANONICALS_PENDING_LINK	Canonicals without confirmed HTML link
APP.MULTIMAP_HREF_LANG_NON_200_LINK	Hreflang pointing to non-200 URLs
APP.MULTIMAP_HREF_LANG_MISSING_CONFIRMATION	Hreflang without reciprocal return links
APP.MULTIMAP_HREF_LANG_INCONSISTENT_LANGUAGE_CONFIRMATION	Return links with inconsistent language code
APP.MULTIMAP_HREF_LANG_CANONICAL_CONFIRMATION	Hreflang return links pointing to non-canonical URLs
APP.MULTIMAP_HREF_LANG_NO_INDEX_CONFIRMATION	Hreflang pointing to noindex URLs
APP.MULTIMAP_PAGINATION_PENDING_LINK	Pagination without anchor link
APP.MULTIMAP_PAGINATION_SEQUENCE_ERROR	rel=prev/next sequence errors
APP.MISSING_ALT_TEXT_TRACKER	Images missing alt text
APP.MISSING_ALT_ATTRIBUTE_TRACKER	Images missing alt attribute
APP.ALT_TEXT_OVER_X_CHARACTERS_TRACKER	Alt text exceeding length
APP.MISSING_SIZE_ATTRIBUTES	Images without width/height (CLS)
APP.HTML_VALIDATION_DATA	HTML validation, tag location (in/outside head)
APP.URL_INSPECTION	GSC URL Inspection data (if integration enabled)
APP.PAGE_SPEED_API	PageSpeed Insights scores/metrics (if API key configured)
APP.AXE_CORE_RESULTS	Axe accessibility audit results (if enabled)
APP.COSINE_SIMILARITY	Content similarity between URLs (requires Content similarity enabled)
APP.NEAR_DUPLICATE	Near duplicates with configurable threshold
APP.LOW_RELEVANCE	URLs with low-relevance / low-value content

9. Recipe book — 10 SQL queries ready to copy

All assume crawl = Crawl.load(crawl_id, db_id_backend="derby", csv_fallback=False) already done.

1. Indexable orphan pages

crawl.sql("""
SELECT u.URL_PATH FROM APP.URLS u
LEFT JOIN APP.INLINK_COUNTS ic ON ic.URL_ID = u.URL_ID
WHERE u.INDEXABILITY = 'Indexable'
  AND (ic.INLINKS_COUNT IS NULL OR ic.INLINKS_COUNT = 0)
""")

2. Pages with near-duplicates > 0.9

crawl.sql("""
SELECT u1.URL_PATH AS url_a, u2.URL_PATH AS url_b, nd.SIMILARITY
FROM APP.NEAR_DUPLICATE nd
JOIN APP.UNIQUE_URLS u1 ON u1.URL_ID = nd.URL_ID_A
JOIN APP.UNIQUE_URLS u2 ON u2.URL_ID = nd.URL_ID_B
WHERE nd.SIMILARITY > 0.9
ORDER BY nd.SIMILARITY DESC
""")

3. Hreflang missing return links

crawl.sql("SELECT * FROM APP.MULTIMAP_HREF_LANG_MISSING_CONFIRMATION")

4. Inlinks distribution (P50/P90/P99)

# Derby has no PERCENTILE_CONT, so fetch the column and compute quantiles in pandas
inlinks = crawl.sql("SELECT INLINKS_COUNT FROM APP.INLINK_COUNTS").to_pandas()
inlinks.iloc[:, 0].quantile([0.50, 0.90, 0.99])

5. Canonical chains of 2+ hops

crawl.sql("""
SELECT u.URL_PATH, u.CANONICAL_LINK_ELEMENT, c.URL_PATH AS canonical_path
FROM APP.URLS u
JOIN APP.URLS c ON c.URL_PATH = u.CANONICAL_LINK_ELEMENT
WHERE c.CANONICAL_LINK_ELEMENT IS NOT NULL
  AND c.CANONICAL_LINK_ELEMENT <> c.URL_PATH
""")

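The query above surfaces chains of two hops. For longer chains or loops, one option is to pull URL_PATH and CANONICAL_LINK_ELEMENT into a dict and walk it in Python. A sketch under that assumption; the function name and the toy mapping are illustrative:

```python
def canonical_chain(canonical_of, start, max_hops=10):
    """Follow canonical targets from start; return (chain, is_loop)."""
    chain = [start]
    current = start
    while len(chain) <= max_hops:
        target = canonical_of.get(current)
        if target is None or target == current:
            return chain, False            # self-canonical or unknown URL: chain ends
        if target in chain:
            return chain + [target], True  # loop detected
        chain.append(target)
        current = target
    return chain, False                    # stopped at max_hops


# Toy mapping: URL -> its declared canonical.
toy = {"/a": "/b", "/b": "/c", "/c": "/c", "/x": "/y", "/y": "/x"}
print(canonical_chain(toy, "/a"))  # (['/a', '/b', '/c'], False)
print(canonical_chain(toy, "/x"))  # (['/x', '/y', '/x'], True)
```

The hop count is len(chain) - 1, so anything at 2 or above is a chain worth flattening.
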
6. Noindex URLs with inlinks (zombie pages in the making)

crawl.sql("""
SELECT u.URL_PATH, ic.INLINKS_COUNT
FROM APP.URLS u
JOIN APP.INLINK_COUNTS ic ON ic.URL_ID = u.URL_ID
WHERE u.META_ROBOTS LIKE '%noindex%' AND ic.INLINKS_COUNT > 0
ORDER BY ic.INLINKS_COUNT DESC
""")

7. Duplicate titles with low word count

crawl.sql("""
SELECT u.URL_PATH, u.WORD_COUNT, u.TITLE
FROM APP.URLS u
JOIN APP.DUPLICATES_TITLE dt ON dt.URL_ID = u.URL_ID
WHERE u.WORD_COUNT < 300
ORDER BY u.WORD_COUNT
""")

8. Top 100 URLs by inlinks

crawl.sql("""
SELECT u.URL_PATH, ic.INLINKS_COUNT
FROM APP.URLS u
JOIN APP.INLINK_COUNTS ic ON ic.URL_ID = u.URL_ID
ORDER BY ic.INLINKS_COUNT DESC
FETCH FIRST 100 ROWS ONLY
""")

9. Internal links pointing to 3xx URLs, with descriptive anchor

crawl.sql("""
SELECT src.URL_PATH AS from_url, dst.URL_PATH AS to_url, l.ANCHOR_TEXT
FROM APP.LINKS l
JOIN APP.URLS src ON src.URL_ID = l.SRC_ID
JOIN APP.URLS dst ON dst.URL_ID = l.DST_ID
WHERE dst.STATUS_CODE BETWEEN 300 AND 399
  AND LENGTH(l.ANCHOR_TEXT) > 10
""")

10. URLs by Schema type

crawl.sql("""
SELECT SCHEMA_TYPE, COUNT(*) AS urls
FROM APP.URLS
WHERE SCHEMA_TYPE IS NOT NULL
GROUP BY SCHEMA_TYPE
ORDER BY urls DESC
""")

10. Tokens consumed in Claude Code per MCP endpoint

Measured on nicolasbillia.com (small site, 40 URLs crawled) in a single run. All endpoints ran with file_path set ("save to disk" mode). These numbers don't scale linearly to larger sites: the returned sample is always ~1 URL.

Endpoint	File on disk	Output to LLM context	Tokens approx
`sf_crawl`	—	~37 chars	~10
`sf_crawl_progress`	—	~60 chars	~15
`sf_generate_report` (Redirects:All)	3.3 KB / 4 rows	~1.2 KB (header + path)	~310
`sf_export_seo_element_urls` (Canonicals:Missing)	1.4 KB / 4 URLs	~600 chars (1-URL sample)	~150
`sf_bulk_export_page_content` (visible_text)	128 KB / 18 URLs	~5.7 KB (full 1-URL sample)	~1,425
`sf_export_embeddings`	539 KB / 40 URLs × 1,536 dims	~200 chars (status + path)	~50

Critical operational warning: if you do NOT pass file_path to bulk exports, the entire file goes into the LLM context:

sf_bulk_export_page_content without file_path on this crawl: ~128 KB ≈ 32,000 tokens
sf_export_embeddings without file_path: ~539 KB ≈ 135,000 tokens (most of a 200K-token context window in a single tool call)

Practical rule: for exports over ~5 KB, always set file_path. Read the file later with a script or a targeted grep.
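
For the "read the file later with a script" part, a streaming reader keeps memory flat and lets you hand the LLM only the rows that matter. A sketch assuming the export is NDJSON (one JSON object per line); the field name "url" in the example is hypothetical, so check the real keys in your file first.

```python
import json
from pathlib import Path


def iter_ndjson(path: Path):
    """Yield one parsed record per non-empty line, without loading the whole file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def find_records(path: Path, field: str, needle: str, limit: int = 5):
    """Return up to `limit` records whose `field` contains `needle`."""
    hits = []
    for record in iter_ndjson(path):
        if needle in str(record.get(field, "")):
            hits.append(record)
            if len(hits) >= limit:
                break
    return hits


# Example call (path and field name are placeholders):
# find_records(Path("page_content.ndjson"), "url", "/blog/", limit=3)
```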

11. Real gotchas that tripped us

Antonio's Python API:

Crawl.load(...) without db_id_backend="derby" triggers the DuckDB cache and hangs on large crawls. Always force Derby + csv_fallback=False.
crawl.summary() and two of the 9 pre-built reports, canonical_issues_report() and indexability_audit(), hang on some crawls. Workaround: direct SQL on APP.URLS.
Sitewide links("in").collect() on 100K+ URL sites hangs. Use the pre-computed APP.INLINK_COUNTS.
Filters marked TODO in source (Antonio documents this): pixel-width titles/metas, "Is Relative" canonicals, "Background Images", several others.
Requires Java 21 (bundled with SF). On macOS: export JAVA_HOME="/Applications/Screaming Frog SEO Spider.app/Contents/jre".
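
If you run the library from a notebook or a scheduled script, the shell export may not be in effect. A hedged sketch that sets JAVA_HOME from Python before loading the crawl; the helper name is hypothetical and the default path is the macOS one quoted above.

```python
import os
from pathlib import Path
from typing import Optional

# macOS path quoted in this post; adjust for your OS and install location.
SF_BUNDLED_JRE = Path("/Applications/Screaming Frog SEO Spider.app/Contents/jre")


def ensure_java_home(candidate: Path = SF_BUNDLED_JRE) -> Optional[str]:
    """Keep an existing JAVA_HOME; otherwise set it to candidate if that directory exists."""
    existing = os.environ.get("JAVA_HOME")
    if existing:
        return existing
    if candidate.is_dir():
        os.environ["JAVA_HOME"] = str(candidate)
        return os.environ["JAVA_HOME"]
    return None


# Call it before Crawl.load(...):
# if ensure_java_home() is None:
#     raise RuntimeError("No Java runtime found; set JAVA_HOME manually")
```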

Official MCP:

Restricted base directory. On our install: /Users/<user>/seo_spider_mcp_server/. To write output to another folder, copy afterward.
The 28 PageSpeed reports return empty without a PageSpeed Insights API key configured in SF.
Hreflang reports return an error if the crawl was not run with "Crawl Hreflang" enabled in SF Config > Spider.
The crawl must have been run with Storage = Database (not Memory). Otherwise some reports have no data.
SF must remain open. If SF closes mid-crawl, progress is lost.

12. FAQ

Can I use Antonio's API without SF open?
For active crawls (in ~/.ScreamingFrogSEOSpider/ProjectInstanceData/), SF must be open because Derby holds an exclusive lock. For exported .seospider or .dbseospider files, no need.

Does it work on Linux / Windows?
API: yes (Python + Java are cross-platform). MCP: yes, SF runs on Win/Mac/Linux, the HTTP endpoint is the same.

How much disk does a crawl take?
In this benchmark, 40 URLs took ~6 MB in Derby format. On large sites (100K URLs) typically 2-4 GB.

Does Antonio accept PRs?
The repo is a public alpha and actively maintained. Best channel: open an issue first, then PR. Repo link.

Does MCP work without a paid SF license?
SF Free allows crawls up to 500 URLs. MCP should work within that limit, but the reports require Database mode (not Memory), which is a paid tier.

Can I automate everything in cron?
Yes, via SF's CLI (headless). MCP requires SF open; for headless cron, prefer SF CLI + Antonio's API reading the output.
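
A sketch of what the cron side could look like from Python. The flags are the ones I know from SF's command-line documentation (--crawl, --headless, --save-crawl, --output-folder); the binary name is the Linux one and differs on macOS and Windows, so treat both as assumptions to verify against --help on your version.

```python
import subprocess
from pathlib import Path


def build_sf_command(url: str, output_dir: Path, binary: str = "screamingfrogseospider") -> list:
    """Build the argument list for a headless crawl that saves the crawl file."""
    return [
        binary,
        "--crawl", url,
        "--headless",
        "--save-crawl",
        "--output-folder", str(output_dir),
    ]


def run_headless_crawl(url: str, output_dir: Path, binary: str = "screamingfrogseospider") -> int:
    """Run the crawl and return the exit code; raises CalledProcessError on failure."""
    output_dir.mkdir(parents=True, exist_ok=True)
    return subprocess.run(build_sf_command(url, output_dir, binary), check=True).returncode


print(" ".join(build_sf_command("https://example.com", Path("crawls/example"))))
```

Once the crawl file is saved, the API can read it on a schedule without the SF GUI being open.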

Minimum SF version required?
For MCP server: the most recent versions (check the official changelog). For Antonio's API: any version that writes crawls to Derby/CSV — tab coverage depends on the SF version.

Closing

The comparison is not "which one wins". It is "what each one does well", so you know when to pick one or combine them. For the audits we run (large e-commerce, news media with fragmented templates, multi-country sites with complex hreflang), the plan is to use both.

Full credit to Antonio Atilio Maculus for building the Python library and keeping it open source. If you work with SF programmatically, his repo is required reading.

If you have tried either (or both), we would like to hear what workflows you have built. Leave a comment or get in touch.