AI Site Grade
cbinsights.com — AI Site Grade
CB Insights blocks GPTBot from its most cited research reports and company profiles while serving them to ClaudeBot, creating a fragmented AI-knowledge landscape where different models see radically different versions of the brand.
CB Insights has a two-tier AI crawler strategy that blocks GPTBot from high-value content while allowing ClaudeBot, and its cold knowledge is stuck on a 2023-era product narrative that doesn't reflect the current 'Predictive Intelligence' positioning.
- Findings: 12
- Evidence checks: 29
- Completed: 30 May 2026
Analysis
CB Insights — AI-Visibility Audit
The site's own research reports — its most cited, press-worthy content — are blocked at the CloudFront edge to GPTBot (403) while served normally to ClaudeBot, creating a fragmented AI-knowledge landscape where different models see radically different versions of the brand.
Crawler Access
The homepage returns a 200 with ~886 words of visible text to all major AI crawlers except GPTBot (403 at CloudFront) and Bytespider (403). ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot, and Applebot-Extended all get the same content as a browser. The robots.txt, however, explicitly disallows GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and CCBot from /research/, /company/, /investor/, /esp/, and /compare/. Edge enforcement does not match those rules: CloudFront returns a hard 403 to GPTBot on /research/ paths, while ClaudeBot is redirected to the thin research landing page (56 words), which then returns a 200. The /company/stripe profile page, a 3,979-word, FAQ-rich, table-heavy company dossier, is served to ClaudeBot but blocked to GPTBot. No llms.txt exists (404). The site runs on AWS CloudFront + envoy with W3 Total Cache and sends no Content-Security-Policy or Strict-Transport-Security headers.
Cold-Knowledge Gap
The LLM knows CB Insights as a "market intelligence platform" famous for the AI 100 list, Fintech 250, State of Venture reports, and the Mosaic scoring algorithm. It also recalls reputational scrutiny over data accuracy and aggressive sales tactics. The actual site has repositioned entirely around "Predictive Intelligence" — a term that appears zero times in the cold knowledge. The homepage never mentions "AI 100" or "Fintech 250" in its main content; the research landing page mentions AI 100 2026 in a single sentence. The site now leads with AI Agents (ChatCBI, Personal Briefing, Acquisition Hunter, Competitive Sentinel) and MCP integrations — none of which appear in the model's prior. The cold knowledge is stuck on the 2023-era product narrative.
Schema Posture
The homepage has rich JSON-LD including Organization, WebSite, WebPage, BreadcrumbList, and SearchAction — technically correct. But the Organization schema lacks sameAs links to the company's active social profiles (Twitter/X, LinkedIn), and there is no FAQPage, Product, or SoftwareApplication schema despite the site selling a platform with named products (Strategy Terminal, ChatCBI, Team of Agents). The /company/stripe profile page — the site's most AI-valuable content — has zero JSON-LD despite containing structured data (founded year, stage, total raised, Mosaic Score, patents table, FAQ section). The /pricing page has breadcrumb schema but no Product or Offer schema.
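A schema check of this kind can be approximated by pulling every JSON-LD block out of a page and listing the `@type` values it declares. The sketch below is a minimal illustration using only the Python standard library. The class and function names are hypothetical, the sample HTML is invented for the example, and it is not a description of how this audit collected its evidence.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than abort the check


def schema_types(html):
    """Return the set of @type values found anywhere in a page's JSON-LD."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    found = set()
    stack = list(parser.documents)
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            declared = node.get("@type")
            if isinstance(declared, str):
                found.add(declared)
            elif isinstance(declared, list):
                found.update(t for t in declared if isinstance(t, str))
            stack.extend(node.values())  # descend into nested nodes
        elif isinstance(node, list):
            stack.extend(node)
    return found


# Invented sample page, not the site's actual markup.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org",
 "@graph": [
   {"@type": "Organization", "name": "Example Co"},
   {"@type": "WebSite", "potentialAction": {"@type": "SearchAction"}}
 ]}
</script>
</head><body></body></html>
"""

wanted = {"Organization", "FAQPage", "Product"}
missing = sorted(wanted - schema_types(SAMPLE_HTML))
print(missing)
```

Running the same function over a page with no JSON-LD at all returns an empty set, which is the situation described for /company/stripe.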
External Signals
DNS TXT records reveal an OpenAI domain verification (openai-domain-verification=dv-z97y4j3EbmHbAT3OnzeCLnsa) and two Anthropic domain verifications. On their own these records show only that the domain was verified with each vendor, but they suggest CB Insights has active integrations with both OpenAI and Anthropic, possibly for data licensing or API access. The site prominently advertises an MCP server and ChatCBI as an AI-facing product. Despite this, the site blocks GPTBot from its most valuable content (research reports, company profiles). The cold knowledge references reputational criticism about data accuracy, but no recent press or Reddit threads surfaced to corroborate active controversy.
Surprising Contradictions
The site sells itself as an AI-native intelligence platform with AI agents, MCP connectors, and LLM integrations, yet it blocks GPTBot from /research/ and /company/ at the network edge, preventing OpenAI's crawler from indexing the very content that would make its own platform discoverable through AI search. The research section (/research/) runs on a separate WordPress install (PHP 7.3.17) with a different sitemap structure than the main Next.js site. The result is a two-engine architecture in which individual report URLs like /research/report/state-of-venture-q1-2026/ silently redirect to the thin research landing page. The /company/ directory redirects to the homepage for crawlers, but individual company profile pages like /company/stripe are fully rendered Next.js pages with rich data. The sitemap therefore lists thousands of company profile URLs that may or may not resolve to unique content depending on the crawler's User-Agent.
Findings
GPTBot blocked from research reports and company profiles (High)
GPTBot receives a 403 error from CloudFront on /research/ and /company/ paths, preventing OpenAI's crawler from indexing the site's most cited and press-worthy content. This includes the AI 100 2026 page and the Stripe company profile.
What to change: Allow GPTBot access to /research/ and /company/ paths, or implement a rate-limited allow policy to ensure OpenAI can index the site's high-value content.
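One way to verify the change is to request the same URL under several User-Agent strings and compare the status codes. The sketch below is a minimal version under stated assumptions: the User-Agent values are shortened placeholders rather than the crawlers' full strings, and `http_status` and `compare_access` are hypothetical helper names. An edge rule that also keys on IP ranges would not be triggered by a spoofed header alone.

```python
import urllib.error
import urllib.request

# Shortened, illustrative User-Agent values (assumption); real crawlers
# send longer, versioned strings.
USER_AGENTS = {
    "browser": "Mozilla/5.0",
    "GPTBot": "GPTBot",
    "ClaudeBot": "ClaudeBot",
}


def http_status(url, user_agent, timeout=10):
    """Return the final HTTP status for a GET sent with the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urllib raises 4xx/5xx responses as exceptions


def compare_access(url, fetch=http_status):
    """Map each agent to its status and list agents treated unlike the browser."""
    statuses = {name: fetch(url, ua) for name, ua in USER_AGENTS.items()}
    baseline = statuses["browser"]
    differing = sorted(name for name, code in statuses.items() if code != baseline)
    return statuses, differing
```

Calling `compare_access` with a real URL issues live requests. Passing a stub as `fetch` keeps the comparison logic testable offline.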
Robots.txt disallows multiple AI crawlers from key sections (High)
The robots.txt explicitly disallows GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and CCBot from /research/, /company/, /investor/, /esp/, and /compare/. While CloudFront enforcement varies, the directive itself signals a restrictive stance.
What to change: Remove AI crawler disallow rules from robots.txt for sections that contain public-facing content, or replace them with Crawl-delay directives to manage load (Crawl-delay is non-standard, and not every crawler honors it).
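The effect of such rules on a compliant crawler can be checked offline with Python's standard `urllib.robotparser`. The robots.txt excerpt below is a hypothetical reconstruction modelled on the rules described in this finding, not a copy of the site's file, and `www.example.com` is a placeholder host.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt (assumption), shaped like the rules this finding describes.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /research/
Disallow: /company/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """True if a crawler that obeys robots.txt may fetch the path."""
    return parser.can_fetch(agent, "https://www.example.com" + path)


for agent in ("GPTBot", "ClaudeBot", "Bingbot"):
    print(agent, allowed(agent, "/research/report"), allowed(agent, "/pricing"))
```

This reports only what robots.txt tells a well-behaved crawler. It says nothing about what the CDN actually returns, which is why the two can disagree as they do here.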
No llms.txt file published (Medium)
The site returns a 404 for /llms.txt, missing an opportunity to provide AI crawlers with a curated list of important URLs and context about the site's content.
What to change: Publish an llms.txt file that lists key pages like /research/, /company/, and /pricing, along with a brief description of the site's purpose.
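A file along these lines could be generated from a short list of pages. The sketch below assumes the layout described in the public llms.txt proposal (an H1 title, a blockquote summary, then H2 sections of Markdown links). `build_llms_txt` is a hypothetical helper, the host is an assumed canonical URL, and the summary and notes are placeholders to be replaced with real copy.

```python
def build_llms_txt(title, summary, sections):
    """Render llms.txt text: H1 title, blockquote summary, H2 link sections."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


BASE = "https://www.cbinsights.com"  # assumed canonical host

llms_txt = build_llms_txt(
    "CB Insights",
    "Placeholder one-sentence description of the site.",
    {
        "Key pages": [
            ("Research", f"{BASE}/research/", "placeholder note"),
            ("Company profiles", f"{BASE}/company/", "placeholder note"),
            ("Pricing", f"{BASE}/pricing", "placeholder note"),
        ]
    },
)
print(llms_txt)
```

The output would be served as plain text at /llms.txt.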
Cold knowledge stuck on 2023-era product narrative (Medium)
The LLM's prior knowledge describes CB Insights as a 'market intelligence platform' focused on AI 100 and Fintech 250, but the current site leads with 'Predictive Intelligence' and AI Agents (ChatCBI, MCP integrations) — none of which appear in the model's prior.
What to change: Update homepage and key landing pages to prominently feature 'Predictive Intelligence' and AI Agent offerings in visible text, and ensure these terms appear in page titles and meta descriptions.
Research landing page delivers thin content to crawlers (Medium)
The /research/ page returns only 56 words of visible text to ClaudeBot and other allowed crawlers, while individual report URLs like /research/report/state-of-venture-q1-2026/ silently redirect to the thin landing page. This limits the depth of content indexed.
What to change: Ensure individual report pages return full content to crawlers instead of redirecting to the landing page, or include report summaries and links in the landing page content.
Company profile pages have no JSON-LD schema (Medium)
The /company/stripe page contains structured data (founded year, stage, total raised, Mosaic Score, patents table, FAQ) but has zero JSON-LD markup, missing a chance to help AI crawlers understand and extract this data.
What to change: Add JSON-LD schema (e.g., Organization, FAQPage, Product) to company profile pages to expose structured data to AI crawlers.
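As a rough illustration of the markup involved, the sketch below builds a JSON-LD graph with an Organization node and a FAQPage node from a handful of profile fields. The helper names are hypothetical and the values are invented placeholders, not data from any real profile. Fields with no obvious schema.org equivalent, such as a proprietary score, are left out.

```python
import json


def company_jsonld(name, url, founding_year, faqs):
    """Build a JSON-LD graph with an Organization node and a FAQPage node."""
    organization = {
        "@type": "Organization",
        "name": name,
        "url": url,
        "foundingDate": str(founding_year),
    }
    faq_page = {
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in faqs
        ],
    }
    return {"@context": "https://schema.org", "@graph": [organization, faq_page]}


def as_script_tag(document):
    """Serialize the document as the <script> element a page template would embed."""
    # Escape "<" so text inside the JSON cannot close the script element early.
    body = json.dumps(document, indent=2).replace("<", "\\u003c")
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Placeholder values for illustration only.
document = company_jsonld(
    "Example Co",
    "https://example.com",
    2010,
    [("When was Example Co founded?", "Example Co was founded in 2010.")],
)
print(as_script_tag(document))
```

Because the profile pages are rendered by Next.js, the equivalent logic would more likely live in the page template. The Python version is only meant to show the shape of the output.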
Pricing page lacks Product and Offer schema (Medium)
The /pricing page has breadcrumb schema but no Product or Offer schema, despite listing multiple plans and features. This limits how AI crawlers can understand the pricing structure.
What to change: Add Product and Offer JSON-LD schema to the pricing page to describe plans, features, and pricing tiers.
Organization schema lacks sameAs links (Low)
The homepage's Organization JSON-LD does not include sameAs links to the company's active social profiles (Twitter/X, LinkedIn), which would help AI crawlers connect the brand across platforms.
What to change: Add sameAs URLs for Twitter/X, LinkedIn, and other official social profiles to the Organization schema.
Sitemap index not found at standard location (Low)
The standard /sitemap.xml returns a 404, though a sitemap index exists at /sitemap/sitemap_master.xml. This may confuse some crawlers that expect the standard path.
What to change: Add a redirect or symlink from /sitemap.xml to /sitemap/sitemap_master.xml, or list the correct sitemap in robots.txt.
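For the robots.txt option, a `Sitemap:` line is what crawlers look for. The sketch below checks that a robots.txt body declares one, using `RobotFileParser.site_maps()` from the standard library (available from Python 3.8). The excerpt is hypothetical, the `www` host is an assumed canonical URL, and `declared_sitemaps` is an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt):
    """Return the Sitemap URLs a robots.txt body declares (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []


# Hypothetical excerpt (assumption); the host is an assumed canonical URL.
ROBOTS_TXT = """\
User-agent: *
Allow: /

Sitemap: https://www.cbinsights.com/sitemap/sitemap_master.xml
"""

print(declared_sitemaps(ROBOTS_TXT))
```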
Two-engine architecture creates content inconsistency (Medium)
The main site runs on Next.js while /research/ runs on a separate WordPress install (PHP 7.3.17) with a different sitemap structure. This leads to inconsistent content delivery and redirect behavior for crawlers.
What to change: Unify the content delivery architecture or ensure consistent sitemap and redirect behavior across both platforms.
Missing security headers (CSP, HSTS) (Low)
The site sends neither a Content-Security-Policy nor a Strict-Transport-Security header. Neither header directly affects AI visibility, but both are standard security practice and contribute to trust signals.
What to change: Add Content-Security-Policy and Strict-Transport-Security headers to improve security posture.
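A check for these headers is simple to script. The sketch below takes a mapping of response headers and reports which of the two are absent. `missing_security_headers` is a hypothetical helper, and the header values in the second example are common illustrative settings, not a recommended policy for this site.

```python
RECOMMENDED_HEADERS = ("Content-Security-Policy", "Strict-Transport-Security")


def missing_security_headers(headers):
    """Return the recommended headers absent from a response-header mapping."""
    present = {name.lower() for name in headers}  # header names are case-insensitive
    return [name for name in RECOMMENDED_HEADERS if name.lower() not in present]


# Shaped like the response described in this finding.
print(missing_security_headers({"Server": "CloudFront", "Content-Type": "text/html"}))

# Illustrative values only (assumption).
print(missing_security_headers({
    "content-security-policy": "default-src 'self'",
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
}))
```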
Contradiction: AI partnerships but blocks GPTBot (High)
The site has OpenAI and Anthropic domain verification TXT records and advertises MCP integrations, yet blocks GPTBot from its most valuable content, undermining its own AI-native positioning.
What to change: Align crawler access policy with the stated AI partnership strategy by allowing GPTBot to index key content.
What's working
- Homepage has rich JSON-LD schema — The homepage includes Organization, WebSite, WebPage, BreadcrumbList, and SearchAction JSON-LD, providing AI crawlers with structured context about the site.
- ClaudeBot receives full content on key pages — ClaudeBot receives full HTML for the homepage and company profiles (e.g., Stripe), so Anthropic's models can index the site's rich data. On /research/ it reaches only the thin landing page.
- OpenAI and Anthropic domain verification in place — DNS TXT records show domain verification for both OpenAI and Anthropic, indicating active partnerships or integrations that can facilitate data licensing or API access.
- MCP server and AI agents prominently featured — The site prominently advertises an MCP server and AI agents (ChatCBI, Team of Agents), signaling an AI-forward product strategy that aligns with modern AI discovery patterns.
- Company profile pages contain rich, structured content — The /company/stripe page has 3,979 words of detailed information including FAQ, tables, and financial data, providing high-value content for AI crawlers that can access it.
- Sitemap index available at custom path — A sitemap index with 60 URLs exists at /sitemap/sitemap_master.xml, helping crawlers discover content.
- Homepage accessible to most major AI crawlers — The homepage returns a 200 with full content to 8 out of 11 tested AI crawlers, including Google-Extended, PerplexityBot, and Applebot-Extended.
- Pricing page includes breadcrumb schema — The /pricing page has breadcrumb JSON-LD, providing navigation context to AI crawlers.