AI Site Grade

ocrolus.com — AI Site Grade

Ocrolus's sophisticated llms.txt is undermined by broken URLs, and the site lacks FAQPage schema despite extensive FAQ content.

Ocrolus has strong AI crawler access and a rare llms.txt file, but broken URLs in the llms.txt, missing FAQPage schema, and a cold-knowledge gap limit its AI visibility.

Findings: 11
Evidence checks: 25
Completed: 30 May 2026

Analysis

Ocrolus.com — AI-Visibility Audit

The site has an unusually sophisticated llms.txt file that is undermined by broken URLs: 3 of its 7 linked sub-pages return 404s or redirect to the homepage, so AI crawlers following the llms.txt will hit dead ends on nearly half of the content map.

Crawler Access

The homepage returns a full 200 with ~192KB of content to GPTBot, OAI-SearchBot, ChatGPT-User, Google-Extended, PerplexityBot, Perplexity-User, anthropic-ai, and Applebot-Extended, all served identically to a browser. This is an excellent baseline. However, ClaudeBot gets a 429 (rate-limited), and GPTBot, despite the 200 noted above, also gets a 429 rather than a 403: Cloudflare is throttling these bots rather than blocking them outright, so their access is inconsistent and depends on rate limits. Bytespider (ByteDance) is fully blocked with a 403. The robots.txt contains no AI-bot directives at all (no mention of GPTBot, ClaudeBot, Google-Extended, or any other AI crawler), even though it explicitly blocks AhrefsBot, SemrushBot, MJ12bot, and Baiduspider. The site is built on WordPress, hosted on WP Engine, and fronted by Cloudflare (CDN/WAF).
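
The per-crawler status checks described above can be approximated with a short Python sketch. It is only an illustration, not the tool used for this audit: the AGENTS list holds bare user-agent tokens chosen for the example, and the names classify_status, probe, and audit are hypothetical. Real bot protection may also key on IP ranges, so results from a spoofed user-agent string are indicative at best.

```python
import urllib.error
import urllib.request

# Assumption: bare tokens for illustration; real crawlers send longer strings.
AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider"]


def classify_status(status: int) -> str:
    """Map an HTTP status code to the access categories used in this audit."""
    if 200 <= status < 300:
        return "served"
    if status == 429:
        return "rate-limited"
    if status in (401, 403):
        return "blocked"
    return "other"


def probe(url: str, agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server gives to one user-agent string."""
    request = urllib.request.Request(url, headers={"User-Agent": agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as exceptions


def audit(url: str) -> dict[str, str]:
    """Probe the URL once per agent and label each result."""
    return {agent: classify_status(probe(url, agent)) for agent in AGENTS}
```

Because a 429 is intermittent by nature, a single call to audit() can miss it; repeating the probe several times per agent, with pauses, would give a fairer picture of throttling.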

llms.txt — Advanced but Broken

The /llms.txt file exists and is well-structured with entity categories, definitions, and a full URL map, a rare and commendable implementation. But it contains broken or redirected URLs, and four were observed: /intelligent-document-processing/ redirects to the homepage, /intelligent-document-processing/bank-statement-processing/ returns a 404, /how-it-works/ returns a 404, and /income-verification/ redirects to the homepage. In addition, /intelligent-document-processing/mortgage-document-processing/ redirects (301) to /mortgage/, which is functional but not the canonical path. An AI crawler consuming this file will waste crawl budget on dead pages.
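
A link check of this kind could be automated along the following lines. This is a minimal sketch under stated assumptions: it assumes the llms.txt uses Markdown-style [title](url) links with absolute URLs, and the function names and the example.com URLs are hypothetical. Fetching each link is left out; classify() only labels a result once the status code and final URL are known.

```python
import re
from urllib.parse import urlsplit

# Assumption: links appear as Markdown [title](https://...) entries.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")


def extract_links(llms_txt: str) -> list[str]:
    """Return every absolute URL found in Markdown-style links."""
    return LINK_PATTERN.findall(llms_txt)


def classify(requested: str, status: int, final_url: str) -> str:
    """Label one fetch result by comparing the requested and final paths."""
    if status == 404:
        return "missing"
    if status >= 400:
        return "error"
    requested_path = urlsplit(requested).path.rstrip("/")
    final_path = urlsplit(final_url).path.rstrip("/")
    if final_path == requested_path:
        return "ok"
    if final_path == "":
        return "homepage-redirect"  # landed on the site root instead
    return "redirect"  # works, but the listed URL is not canonical


sample = (
    "## Products\n"
    "- [Mortgage](https://example.com/idp/mortgage/): hypothetical entry\n"
    "- [How it works](https://example.com/how-it-works/)\n"
)
links = extract_links(sample)
```

Entries labelled "missing" or "homepage-redirect" are the ones to remove or correct; entries labelled "redirect" can be swapped for their final URL.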

Cold-Knowledge Gap

The LLM already knows Ocrolus well: founded 2014 by Sam Bobley and Vikas Bhambri, $110M+ raised, processes millions of documents monthly, clients include PayPal/SoFi/Kabbage. The cold knowledge mentions $80M Series C in 2021 led by Fin Capital — but the site itself says "Raised $110M in venture funding" with no breakdown. The cold knowledge also cites Deloitte Technology Fast 500 recognition, which the site does not mention anywhere. Conversely, the site prominently claims "3,700+ active user accounts" and "Analyzing 750k credit apps/month" — metrics the cold model does not know. The gap is that the model's prior is slightly dated (2021 funding round, older client list) while the site has evolved its positioning to "vertical AI engine" with newer products like Inspect, Encore, and Detect.

Schema Posture

Every page carries the same Organization + WebSite + WebPage + BreadcrumbList JSON-LD schema. This is technically valid, but the FAQ page has no FAQPage schema despite 3,169 words of Q&A content across 7 categories: the audit's has_faq signal is true for that page, yet no structured markup is present. There is no Product or SoftwareApplication schema for the platform itself, and no HowTo schema on the "how it works" page (which is a 404 anyway). The Organization schema uses a logo URL from 2021 and has an empty description field.
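
To make the missing markup concrete, the sketch below generates FAQPage JSON-LD from question and answer pairs using the schema.org types FAQPage, Question, and Answer. The function names are hypothetical, and the pair shown is a placeholder rather than text from the Ocrolus FAQ; how the site's WordPress setup would emit the tag is not covered here.

```python
import json


def faq_page_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Serialize (question, answer) pairs as schema.org FAQPage JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2)


def script_tag(jsonld: str) -> str:
    """Wrap JSON-LD in the script element that pages embed in their HTML."""
    return f'<script type="application/ld+json">\n{jsonld}\n</script>'


# Placeholder content, not taken from the actual FAQ page.
markup = script_tag(faq_page_jsonld([("Placeholder question?", "Placeholder answer.")]))
```

Each question on the page would become one entry in mainEntity. Answers that contain the literal text "</script>" would need escaping before being embedded this way.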

External Signals

The site has zero discoverable external review presence: no Gartner, Capterra, or Reddit mentions surfaced in search. The DNS TXT records reference Anthropic (domain verification), Zoom, Atlassian, DocuSign, Mixpanel, Segment, Twilio, HubSpot, Zendesk, and Salesforce, which points to a sophisticated enterprise stack. Customer stories name Better, Enova, Kapitus, Fora Financial, Argyle, LendingClub, Octane, Nova Credit, and Jackson Hewitt, strong logos that are not reflected in the cold model's knowledge.

Findings

  1. llms.txt contains broken or redirected URLs High

    The llms.txt file links to 7 sub-pages, but 3 return 404s or redirect to the homepage, wasting AI crawl budget.

    What to change: Update the llms.txt file to point only to valid, canonical URLs. Remove or correct the broken links.

  2. FAQ page lacks FAQPage schema markup High

    The FAQ page contains 3,169 words of Q&A content across 7 categories but has no FAQPage JSON-LD schema, reducing its visibility in AI-generated answers.

    What to change: Add FAQPage schema markup to the FAQ page, marking up each question and answer.

  3. ClaudeBot and GPTBot are rate-limited by Cloudflare High

    ClaudeBot and GPTBot receive 429 responses from Cloudflare, leading to inconsistent access and potential crawl gaps.

    What to change: Whitelist ClaudeBot and GPTBot in Cloudflare WAF rules to allow consistent access.

  4. Bytespider is fully blocked with 403 Medium

    ByteDance's Bytespider crawler receives a 403 Forbidden, preventing any content indexing by that bot.

    What to change: Allow Bytespider access if indexing by ByteDance is desired.

  5. robots.txt lacks directives for AI crawlers Medium

    The robots.txt file does not mention GPTBot, ClaudeBot, Google-Extended, or any AI crawler, leaving their access unmanaged.

    What to change: Add explicit allow/disallow rules for major AI crawlers in robots.txt.

  6. No Product or SoftwareApplication schema on platform pages Medium

    The platform and product pages lack structured data for SoftwareApplication or Product, reducing their semantic richness for AI models.

    What to change: Add SoftwareApplication schema to the platform page with relevant properties.

  7. Organization schema has empty description field Low

    The Organization JSON-LD schema includes an empty description field, missing an opportunity to convey the brand's value proposition.

    What to change: Populate the description field in the Organization schema with a concise summary of Ocrolus.

  8. Cold model knowledge is outdated and missing key metrics Medium

    The LLM's prior knowledge references a 2021 funding round and an older client list, while the site now promotes newer products and metrics such as "3,700+ active user accounts".

    What to change: Feature key metrics and recent milestones prominently on the homepage and about page so that AI systems that crawl the site can pick up current facts.

  9. How It Works page returns 404 High

    The /how-it-works/ page returns a 404 error, preventing crawlers and users from accessing this important content.

    What to change: Restore the How It Works page or set up a proper redirect to a relevant page.

  10. Bank statement processing page returns 404 High

    The /intelligent-document-processing/bank-statement-processing/ page returns a 404, breaking the content hierarchy.

    What to change: Restore the page or redirect to a relevant alternative.

  11. No discoverable external reviews on Gartner, Capterra, or Reddit Low

    Searches for Ocrolus reviews on Gartner, Capterra, and Reddit returned no results, limiting external validation signals.

    What to change: Encourage customers to leave reviews on third-party platforms and link to them from the site.
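
For finding 5, a draft robots.txt can be checked before it is deployed using Python's standard urllib.robotparser. The rules below are hypothetical and only illustrate the mechanism; which crawlers to allow or block is the site owner's decision, and the helper name allowed is an assumption for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft rules, not the site's current or recommended policy.
PROPOSED_RULES = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: AhrefsBot
Disallow: /
"""


def allowed(rules: str, agent: str, url: str) -> bool:
    """Report whether the given robots.txt text lets the agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


gptbot_ok = allowed(PROPOSED_RULES, "GPTBot", "https://example.com/faq/")
ahrefs_ok = allowed(PROPOSED_RULES, "AhrefsBot", "https://example.com/faq/")
# An agent with no matching group and no wildcard group is allowed by default.
unlisted_ok = allowed(PROPOSED_RULES, "Bytespider", "https://example.com/faq/")
```

The last check shows why the absence of AI-bot directives leaves access unmanaged: an unlisted crawler is permitted by default. Note that robots.txt is advisory and separate from the Cloudflare layer, so changing it would not by itself remove the 429 and 403 responses described in findings 3 and 4.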

What's working

  • llms.txt file is present and well-structured — The site hosts a comprehensive llms.txt file with entity categories, definitions, and a URL map, a rare and commendable implementation that aids AI understanding.
  • Most AI crawlers receive full content access — GPTBot, OAI-SearchBot, ChatGPT-User, Google-Extended, PerplexityBot, Perplexity-User, anthropic-ai, and Applebot-Extended all receive a full 200 response with content identical to what a browser gets, although GPTBot was also rate-limited (see finding 3).
  • Customer stories page contains detailed case studies — The customer stories page has 9,602 words of detailed case studies featuring strong logos like Better, Enova, and LendingClub, providing rich content for AI models.
  • FAQ page has extensive Q&A content — The FAQ page contains 3,169 words of Q&A content across 7 categories, covering common questions about the platform.
  • Supported documents page lists document types — The supported documents page provides a clear list of document types processed by the platform, aiding AI understanding of capabilities.
  • Consistent JSON-LD schema across all pages — Every page includes Organization, WebSite, WebPage, and BreadcrumbList schema, providing a solid structured data foundation.
  • DNS records show integrations with major enterprise tools — DNS TXT records reveal integrations with Anthropic, Zoom, Atlassian, DocuSign, and others, indicating a sophisticated tech stack.
