AI Site Grade
legal500.com — AI Site Grade
Legal 500's sitemap.xml and llms.txt return Next.js HTML shells, making them unparseable by AI crawlers and undermining the brand's proactive AI platform registrations.
Legal 500's AI visibility is severely limited by a broken sitemap, fragmented content across subdomains, missing schema on key pages, and a cold-knowledge gap that fails to reflect its AI-optimized positioning.
- Findings: 10
- Evidence checks: 28
- Completed: 30 May 2026
Analysis
The Legal 500 / legal500.com — AI-Visibility Audit
The site's /sitemap.xml and /llms.txt both return a 137KB Next.js HTML shell instead of actual XML or plaintext content — meaning every AI crawler that hits these canonical discovery endpoints receives a JavaScript-dependent page with zero parseable URLs or content.
Crawler Access
All major AI bots — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot, ChatGPT-User, Bytespider, Applebot-Extended, anthropic-ai — receive HTTP 200 with the same ~139KB payload as a browser. No UA-based blocking exists. However, every page served carries cache-control: private, no-cache, no-store, max-age=0, must-revalidate, preventing any CDN-level caching for crawlers. The site runs on Cloudflare with a Next.js stack (x-powered-by: Next.js), and the DNS TXT records confirm explicit verification tokens for OpenAI (openai-domain-verification), Anthropic (anthropic-domain-verification), and Cursor — indicating the brand has proactively registered with these AI platforms, yet the technical infrastructure undermines that intent.
Sitemap and Discovery Failure
The robots.txt is minimal (33 bytes) — only disallowing /summary/ with a single User-Agent: * rule and no AI-bot-specific directives. The sitemap.xml returns a Next.js HTML shell (content-type text/html, not application/xml), making it unparseable by any search engine or AI crawler. The list_known_urls tool found only 24 URLs from the homepage's link graph — a tiny fraction of the thousands of ranking pages, firm profiles, GC Powerlist editions, and comparative guides the site actually hosts. The my.legal500.com subdomain hosts the actual ranking content but is a separate deployment with its own URL structure.
Cold-Knowledge Gap
The LLM knows Legal 500 as a "Big Three" legal directory alongside Chambers and Partners, founded in 1988 in London, with products like GC Magazine and the Green Guide. It also recalls reputational criticism around ranking inconsistency and paid sponsorship blurring editorial independence. The actual site describes itself as "data-driven, AI-optimised" and founded in 1987 (not 1988), benchmarking "over 15 million data points for over 8,000 firms in over 100 countries." The site's self-positioning as an "AI-optimised research platform" is entirely absent from the model's cold knowledge — the LLM knows Legal 500 as a directory, not a data platform.
Schema and Structured Data
Only two schema types appear as JSON-LD anywhere on the site: Corporation and WebSite, found on the /faqs/, /guides/, /developments/, and my.legal500.com pages. The homepage, rankings page, about page, and data-products page have zero schema markup. The FAQ page has no FAQPage schema (despite 30+ Q&A items rendered as <h3> headings), ranking pages have no ItemList or Product schema, and developments pages have no Article or NewsArticle schema. The WebSite schema's SearchAction points to a broken URL pattern (legal500.com/stp#stq=...).
Content Fragmentation
The site operates across at least three distinct deployments: www.legal500.com (Next.js, thin content), my.legal500.com (likely WordPress, richer content), and beta.legal500.com (referenced on the site as a beta). Firm profiles live on my.legal500.com/profiles/ while the www subdomain's /firm-profile/ returns a 404. The GC Powerlist pages on www list historical editions back to 2014 but link to my.legal500.com for the actual content. As a result, AI crawlers indexing www.legal500.com see navigation shells without the substantive ranking data the brand is known for.
Findings
Sitemap.xml returns Next.js HTML shell instead of XML (High)
The sitemap.xml at /sitemap.xml returns a 137KB HTML page (content-type text/html) rather than valid XML, making it unparseable by search engines and AI crawlers. This prevents discovery of thousands of ranking pages, firm profiles, and guides.
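What to change: Configure the Next.js server to serve a valid XML sitemap at /sitemap.xml with content-type application/xml, listing the canonical www.legal500.com URLs. Under the sitemaps protocol a sitemap normally covers only URLs on its own host, so my.legal500.com URLs generally need their own sitemap on that host unless cross-host submission is configured through robots.txt.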
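A minimal sketch of what a valid sitemap body looks like and how to tell it apart from the HTML shell, using only Python's standard library. The function names are illustrative and not part of the site's codebase, and the example URL is simply one of the paths named in this audit. In a Next.js deployment the equivalent logic would live in whatever route serves /sitemap.xml.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document as UTF-8 bytes, one <url><loc> per URL."""
    # Register the sitemap namespace as the default so tags serialize unprefixed.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def is_parseable_sitemap(content_type, body):
    """True if a response looks like a real sitemap rather than an HTML shell."""
    if "xml" not in content_type.lower():
        return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    # Accept either a plain sitemap or a sitemap index as the root element.
    return root.tag in (
        f"{{{SITEMAP_NS}}}urlset",
        f"{{{SITEMAP_NS}}}sitemapindex",
    )


# Hypothetical single-entry sitemap for illustration.
example = build_sitemap(["https://www.legal500.com/faqs/"])
```

The same check can be run against the live response headers and body to confirm the fix once it is deployed.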
llms.txt returns Next.js HTML shell instead of plaintext (High)
The llms.txt file at /llms.txt returns a 137KB HTML page instead of plaintext, making it unreadable by AI crawlers that use it for content discovery.
What to change: Serve a plaintext llms.txt file with content-type text/plain, listing the site's key AI-relevant resources.
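As a rough illustration, the sketch below renders an llms.txt body in the layout commonly described for the format (an H1 title, a blockquote summary, then H2 sections of Markdown links). The layout is an assumption about the convention rather than something this audit verified, and the section name, link titles, and notes are hypothetical placeholders.

```python
def build_llms_txt(name, summary, sections):
    """Render an llms.txt body: H1 title, blockquote summary, H2 link sections.

    `sections` maps a heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# The response should be served as plain text, not as an HTML page.
LLMS_TXT_HEADERS = {"Content-Type": "text/plain; charset=utf-8"}

# Hypothetical content; the paths are ones named in this audit.
example = build_llms_txt(
    "The Legal 500",
    "Placeholder summary of what the site offers.",
    {
        "Key resources": [
            ("FAQs", "https://www.legal500.com/faqs/", "Frequently asked questions"),
            ("Guides", "https://www.legal500.com/guides/", "Comparative guides"),
        ]
    },
)
```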
Robots.txt is minimal and lacks AI-bot directives (Medium)
The robots.txt is only 33 bytes, disallowing only /summary/ with a single User-Agent: * rule. No AI-bot-specific directives exist, missing an opportunity to guide crawlers to parseable content.
What to change: Expand robots.txt with explicit groups for AI-bot user-agents and add a Sitemap: line pointing to the corrected sitemap URL.
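One way to generate and sanity-check such a file is sketched below. The bot list is a subset of the user-agents tested in this audit, the /summary/ rule is the one already present, and the sitemap URL assumes the fixed sitemap stays at /sitemap.xml. Because a crawler that matches a named group ignores the `*` group, the sketch repeats the existing Disallow rule in every group.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot"]


def build_robots_txt(ai_bots, disallow, sitemap_url):
    """Build robots.txt text with one group per AI bot plus a '*' fallback."""
    lines = []
    for bot in list(ai_bots) + ["*"]:
        lines.append(f"User-agent: {bot}")
        # Repeat the rules per group: named groups do not inherit from '*'.
        lines.extend(f"Disallow: {path}" for path in disallow)
        lines.append("")
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


robots = build_robots_txt(
    AI_BOTS, ["/summary/"], "https://www.legal500.com/sitemap.xml"
)

# Parse the generated text with the standard-library parser as a sanity check.
parser = RobotFileParser()
parser.parse(robots.splitlines())
faqs_allowed = parser.can_fetch("GPTBot", "https://www.legal500.com/faqs/")
summary_allowed = parser.can_fetch("GPTBot", "https://www.legal500.com/summary/x")
```

Here `faqs_allowed` should come out true and `summary_allowed` false, matching the intent of the current single rule.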
Cache-control headers prevent CDN caching for crawlers (Medium)
All pages serve cache-control: private, no-cache, no-store, max-age=0, must-revalidate, preventing any CDN-level caching for AI crawlers and increasing load on origin servers.
What to change: Relax cache-control headers for static or cacheable resources to allow CDN caching for crawlers.
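The effect of the observed header can be illustrated with a small parser. This is a simplified sketch: it only checks the two directives that forbid a shared cache such as a CDN from storing a response, and it ignores CDN-specific behaviour. The relaxed header and its 300-second `s-maxage` value are assumptions chosen for illustration, not a recommendation derived from the audit data.

```python
def parse_cache_control(header):
    """Split a Cache-Control header into a {directive: value-or-None} dict."""
    directives = {}
    for part in header.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip()] = value.strip().strip('"') or None
    return directives


def shared_cache_may_store(header):
    """True if neither 'private' nor 'no-store' forbids shared-cache storage.

    'no-cache' alone still permits storage; it only forces revalidation.
    """
    directives = parse_cache_control(header)
    return "private" not in directives and "no-store" not in directives


# Header observed on every page in this audit.
OBSERVED = "private, no-cache, no-store, max-age=0, must-revalidate"
# Hypothetical relaxed header for cacheable pages.
RELAXED = "public, max-age=0, s-maxage=300"

observed_cacheable = shared_cache_may_store(OBSERVED)
relaxed_cacheable = shared_cache_may_store(RELAXED)
```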
Homepage and key pages lack structured data (High)
The homepage, rankings page, about page, and data-products page have zero JSON-LD schema markup. Only /faqs/, /guides/, /developments/, and my.legal500.com carry Corporation and WebSite schemas.
What to change: Add Organization, WebSite, and relevant page-type schemas (e.g., ItemList for rankings) to all key pages.
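For ranking pages, an ItemList could be generated along the lines of the sketch below. The list name, firm names, and URLs are hypothetical placeholders; the real values would come from the ranking data itself.

```python
import json


def ranking_item_list(list_name, firms):
    """Build schema.org ItemList JSON-LD from ordered (name, url) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "ItemList",
        "name": list_name,
        "itemListElement": [
            {"@type": "ListItem", "position": i, "name": name, "url": url}
            for i, (name, url) in enumerate(firms, start=1)
        ],
    }


def as_script_tag(data):
    """Serialize JSON-LD into a <script> tag for the page's static HTML."""
    # Escape "</" so the payload cannot close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical ranking with placeholder firms.
example = ranking_item_list(
    "Example practice area ranking",
    [
        ("Example Firm A", "https://example.com/firm-a"),
        ("Example Firm B", "https://example.com/firm-b"),
    ],
)
tag = as_script_tag(example)
```

Emitting the tag in the server-rendered HTML, rather than injecting it client-side, keeps it visible to crawlers that do not execute JavaScript.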
FAQ page lacks FAQPage schema despite 30+ Q&A items (Medium)
The /faqs/ page contains over 30 Q&A items rendered as <h3> headings but has no FAQPage JSON-LD schema, missing a key opportunity for AI crawlers to extract structured Q&A content.
What to change: Add FAQPage JSON-LD to the /faqs/ page, expressing each Q&A pair as a Question entity whose acceptedAnswer is an Answer.
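A minimal sketch of the FAQPage structure, assuming the question and answer text has already been pulled out of the page's <h3> blocks. The single Q&A pair shown is a hypothetical placeholder, not text from the site.

```python
import json


def faq_page_schema(qa_pairs):
    """Build schema.org FAQPage JSON-LD from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in qa_pairs
        ],
    }


# Hypothetical Q&A pair for illustration only.
faq = faq_page_schema([("Example question?", "Example answer.")])
faq_json = json.dumps(faq, indent=2)
```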
WebSite SearchAction points to broken URL pattern (Medium)
The WebSite schema's SearchAction property points to a broken URL pattern (legal500.com/stp#stq=...), which does not resolve to a functional search endpoint.
What to change: Update the SearchAction target to a working search URL, e.g., https://www.legal500.com/search?q={search_term_string}.
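The corrected WebSite markup might look like the sketch below. The /search?q= path is only the example pattern suggested above and is an assumption; it should be replaced with whatever search endpoint actually resolves on the site.

```python
import json

PLACEHOLDER = "{search_term_string}"


def website_schema(site_url, search_url_template):
    """Build WebSite JSON-LD with a SearchAction for the given URL template."""
    if PLACEHOLDER not in search_url_template:
        raise ValueError(f"template must contain {PLACEHOLDER}")
    return {
        "@context": "https://schema.org",
        "@type": "WebSite",
        "url": site_url,
        "potentialAction": {
            "@type": "SearchAction",
            "target": {
                "@type": "EntryPoint",
                "urlTemplate": search_url_template,
            },
            # Ties the placeholder in the template to the query input name.
            "query-input": "required name=search_term_string",
        },
    }


# Assumed search path; verify that it resolves before publishing.
site = website_schema(
    "https://www.legal500.com/",
    "https://www.legal500.com/search?q={search_term_string}",
)
site_json = json.dumps(site, indent=2)
```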
Content fragmented across three subdomains with broken links (High)
The site operates across www.legal500.com (Next.js, thin content), my.legal500.com (richer content), and beta.legal500.com. Firm profiles live on my.legal500.com/profiles/ while www.legal500.com/firm-profile/ returns a 404. GC Powerlist pages link to my.legal500.com for actual content, causing AI crawlers to see navigation shells without substantive data.
What to change: Consolidate ranking and firm profile content onto a single subdomain, or ensure www.legal500.com includes canonical links to the my.legal500.com pages and fix the 404 on /firm-profile/.
LLM cold knowledge lacks site's AI-optimized positioning (Medium)
The LLM knows Legal 500 as a legal directory but is unaware of its self-description as an 'AI-optimised research platform' with 15 million data points. The site's founding year is also misremembered (1988 vs 1987).
What to change: Ensure the homepage and about page prominently state the AI-optimized platform positioning and correct founding year in visible text and schema markup.
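On the schema side, the founding year and positioning could be carried in the existing Corporation markup, roughly as sketched here. The description string is a paraphrase of the site copy quoted in this audit and the function name is illustrative; the exact wording should come from the brand's own text.

```python
import json


def corporation_schema(name, url, founding_year, description):
    """Build schema.org Corporation JSON-LD with founding year and description."""
    return {
        "@context": "https://schema.org",
        "@type": "Corporation",
        "name": name,
        "url": url,
        # schema.org foundingDate accepts an ISO 8601 date; a bare year is valid.
        "foundingDate": str(founding_year),
        "description": description,
    }


corp = corporation_schema(
    "The Legal 500",
    "https://www.legal500.com/",
    1987,
    # Paraphrase of the site's self-description as quoted in this audit.
    "Data-driven, AI-optimised research platform benchmarking over "
    "15 million data points for over 8,000 firms in over 100 countries.",
)
corp_json = json.dumps(corp, indent=2)
```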
Only 24 URLs discovered from homepage link graph (High)
The list_known_urls tool found only 24 URLs from the homepage's link graph, a tiny fraction of the thousands of ranking pages, firm profiles, and guides the site hosts. This indicates poor internal linking or JavaScript-dependent navigation that crawlers cannot follow.
What to change: Add static HTML links to key sections (rankings, firm profiles, guides) from the homepage and ensure navigation is crawlable without JavaScript.
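To approximate what a crawler without JavaScript can follow, the sketch below collects plain <a href> links from raw HTML using Python's standard library. It is a simplification of how real crawlers discover links, and the sample markup is a hypothetical stand-in for the homepage, not its actual source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href> tags present in static HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip fragments and non-navigational schemes.
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            return
        self.links.add(urljoin(self.base_url, href))


def static_links(html, base_url):
    """Return the sorted, de-duplicated links reachable without running JS."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return sorted(collector.links)


# Hypothetical markup: one real link, one fragment, one JS-only control.
sample = (
    '<a href="/faqs/">FAQs</a>'
    '<a href="#top">Top</a>'
    '<button onclick="go()">Rankings</button>'
)
found = static_links(sample, "https://www.legal500.com/")
```

In the sample, only the /faqs/ anchor is discoverable; the button-driven "Rankings" control contributes nothing to the link graph, which is the pattern the finding describes.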
What's working
- Proactive registrations with OpenAI, Anthropic, and Cursor: DNS TXT records carry verification tokens for all three, indicating the brand has deliberately registered its domain with these AI platforms.
- All major AI bots receive HTTP 200 with no UA blocking: All tested AI bots (GPTBot, ClaudeBot, PerplexityBot, etc.) receive HTTP 200 responses with the same payload as browsers, with no user-agent-based blocking.
- JSON-LD Corporation and WebSite schemas on FAQ, guides, and developments pages: The /faqs/, /guides/, /developments/, and my.legal500.com pages include Corporation and WebSite JSON-LD schemas, giving AI crawlers a basic machine-readable description of the entity.
- The /llms.txt path responds (though with the wrong content): A request to /llms.txt returns HTTP 200, which suggests awareness of the AI discovery convention, even though the body is currently an HTML shell rather than plaintext.
- my.legal500.com subdomain hosts rich ranking content: The my.legal500.com subdomain contains substantive ranking data, firm profiles, and guides with richer content than the www subdomain, providing a foundation for AI visibility if properly linked.