AI Crawlers

:::summarybox learn
See exactly when GPTBot, ClaudeBot, PerplexityBot, and 60+ other AI bots visit your site
Tell training, indexing, live-conversation, and agent crawls apart at a glance
Follow every page through the crawl → cite → click conversion funnel
Read AI-generated insights that connect crawls to citations and page health
Catch JavaScript walls, robots.txt blocks, and bot-blocking WAF rules before they cost you visibility

There's a blind spot in your analytics. Google Analytics doesn't show AI bots. Your CDN logs lump them in with everything else. And the bots don't announce themselves with clean labels - they hide behind cryptic user agents like Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot).

But AI crawlers are now the most consequential visitors to your site. Every time GPTBot reads your pricing page, every time PerplexityBot fetches a guide while a user is asking a question, every time Claude-User drops in mid-conversation - those are the events that decide whether ChatGPT recommends you next month.

AI Crawlers is the dashboard for those events. It identifies every AI bot that touches your site, classifies what each one is doing, and connects the dots between crawls, citations, and clicks - so you can see the full pipeline from "AI knows about this page" to "AI sent us a customer."

What you'll learn

Section	What it covers
The blind spot	Why your existing analytics miss the visits that matter most
The three kinds of AI bots	Training, indexing, conversations - and what each tells you
The 60+ bots Trakkr identifies	Every AI provider, mapped to their user-agent variants
How visits are captured	Server-side platform connections, CMS proxying, and webhook sources
The dashboard	Hero stats, the activity chart, Feed view, Pages view, Access view
Page intelligence	The crawl → cite → click funnel for every URL
AI-generated insights	Cross-signal analysis written by an LLM, refreshed every 4 hours
Access checks	robots.txt, llms.txt, JavaScript walls, bot blocks
Alerts	First-seen, silent, spike, and error-surge monitoring
Optimization playbook	The order to work on the data once it's flowing
FAQ	The questions teams ask in week one

The blind spot

Your analytics stack was built for human visitors. It tracks pageviews, sessions, sources, conversions. It does not track:

When GPTBot scraped your site at 03:47 UTC for OpenAI's next training run
When PerplexityBot indexed your guide so it could be cited in tomorrow's answers
When ChatGPT-User opened your page mid-conversation because someone asked ChatGPT about you
When Google-Agent filled out a form on your site on behalf of an AI assistant
When Claude-SearchBot got a 403 because Cloudflare flagged it as suspicious

Multiply that by every AI platform, every bot variant, every page on your site. You're flying blind on the visits that matter most for AI visibility.

AI Crawlers fills the gap. It listens for AI bot user agents at the edge through a direct connection to your CDN, host, CMS proxy, or log forwarder, then surfaces every visit, classified by bot, by page, by intent.

The three kinds of AI bots

This is the most important concept on the page, and the one most teams get wrong. Not all crawlers are the same. They don't tell you the same thing about your visibility, and they don't deserve the same response.

Trakkr classifies every bot into one of three categories. The dashboard's hero stats, chart filter pills, and saved views are all built around these three buckets, so it's worth learning them now.

Category	What the bot is doing	Examples	What it tells you
Training	Scraping content for the next model training cycle	GPTBot, ClaudeBot, Amazonbot, Bytespider, CCBot	"AI is learning about you for the future" - effects show up in 6-18 months
Indexing	Pre-fetching content for an AI search index	OAI-SearchBot, PerplexityBot, Claude-SearchBot, YouBot, Applebot	"AI is preparing to cite you" - your content is being added to a retrieval index for live answers
Conversations	Live retrieval during an active AI chat	ChatGPT-User, Claude-User, Perplexity-User, MistralAI-User, Meta-ExternalFetcher	"A real user is asking AI something right now and it's reading your page" - the strongest signal you may be cited

Each one is a different stage of the AI funnel.

Conversations is the loudest signal. When a Claude-User or ChatGPT-User hits your page, it usually means a real person was asking a question and AI thought your page was the right answer. These visits are often the leading indicator that a citation is about to land.

Indexing sits in the middle. When PerplexityBot or OAI-SearchBot crawls a page, they're storing it in a retrieval index they'll consult during user queries. Heavy indexing of a page is a leading indicator of citations.

Training is the long game. GPTBot crawling your site today may not affect ChatGPT's behavior until the next major release - which could be six to eighteen months out. It's a useful signal for future model exposure, but it's weaker and slower than indexing or conversation traffic.

There's also an emerging fourth category of autonomous agent bots like Google-Agent that act on a site rather than just read it. Volume is currently tiny and Google-Agent is effectively the only resident, so Trakkr tracks agent traffic under a platform filter rather than promoting it to its own hero stat. When that bucket grows, it gets promoted.

The 60+ bots Trakkr identifies

Every major AI provider, plus the long tail of search and research crawlers. Bot detection is pattern-based on the User-Agent header, ordered most-specific-first to avoid false positives (so ChatGPT-User matches before generic ChatGPT).
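
To make the ordering rule concrete, here is a minimal sketch of most-specific-first matching on the User-Agent header. The pattern list is a small illustrative subset, and the function name, the "unclassified" label, and the substring-matching approach are assumptions for the example, not Trakkr's actual detector.

```python
from typing import Optional, Tuple

# Ordered most-specific-first: "chatgpt-user" must be tested before the
# generic "chatgpt" token, otherwise the generic pattern would win.
# Illustrative subset only; categories follow the table above.
BOT_PATTERNS = [
    ("chatgpt-user", "ChatGPT-User", "conversations"),
    ("chatgpt", "ChatGPT", "unclassified"),  # hypothetical generic fallback
    ("oai-searchbot", "OAI-SearchBot", "indexing"),
    ("gptbot", "GPTBot", "training"),
    ("claude-searchbot", "Claude-SearchBot", "indexing"),
    ("claude-user", "Claude-User", "conversations"),
    ("claudebot", "ClaudeBot", "training"),
    ("perplexity-user", "Perplexity-User", "conversations"),
    ("perplexitybot", "PerplexityBot", "indexing"),
]


def classify_user_agent(user_agent: str) -> Optional[Tuple[str, str]]:
    """Return (bot name, category) for the first matching pattern, else None."""
    ua = user_agent.lower()
    for token, name, category in BOT_PATTERNS:
        if token in ua:
            return name, category
    return None


print(classify_user_agent(
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
))
```

If the generic "chatgpt" entry were moved above "chatgpt-user", every ChatGPT-User visit would be mislabeled, which is the false positive the ordering is meant to prevent.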

Provider	Bots tracked
OpenAI	GPTBot, ChatGPT-User, OAI-SearchBot
Anthropic	ClaudeBot, Claude-User, Claude-SearchBot, Claude-Web, Claude-Code, anthropic-ai
Google	Google-Agent, Googlebot, GoogleOther, Google-Extended (control token)
Perplexity	PerplexityBot, Perplexity-User
Microsoft / Bing	Bingbot, BingPreview
Meta	Meta-ExternalAgent, Meta-ExternalFetcher, FacebookBot
Apple	Applebot, Applebot-Extended (control token)
Amazon	Amazonbot
ByteDance	Bytespider
Common Crawl	CCBot
Mistral AI	MistralAI-User
DeepSeek	DeepSeekBot
xAI	GrokBot
Cohere	cohere-ai
You.com	YouBot
Allen AI	AI2Bot
Other	Diffbot, Omgili, Timpibot, ImagesiftBot, Scrapy

New bots are added to the detector as providers ship them. If a bot starts hitting your site that Trakkr doesn't recognize yet, the dashboard surfaces the raw user agent so you can flag it.

How Trakkr captures crawler visits

Crawler visits flow into Trakkr through server-side sources. Most teams connect the platform already sitting in front of the site: Cloudflare, Vercel, Netlify, WordPress, a self-hosted runtime, or a hosted CMS routed through Cloudflare.

Path	What it is	Setup time	Best for
Server-side connection	A direct integration with your CDN, host, or log forwarder	5-10 minutes	Production apps, server-rendered sites, high-traffic platforms behind a CDN
Hosted CMS via Cloudflare	A guided Cloudflare proxy flow for CMS platforms without server logs	10-15 minutes	Webflow, Shopify, HubSpot, Squarespace, Wix, Framer, Ghost

Why server-side is more accurate

Browser scripts can only see crawlers that execute JavaScript, and many crawler requests never get that far. Server-side integrations see every request that reaches your CDN or origin, including bots that never run JS.

If you have the option, server-side is the better choice. It catches:

Crawlers that don't execute JavaScript at all
Crawlers blocked by your CDN or WAF before they reach the page
Crawlers that 404 on URLs that no longer exist
Crawlers that hit non-HTML resources (PDFs, sitemaps, RSS feeds)

You can connect more than one server-side source. Trakkr deduplicates events by hash, so Cloudflare plus a webhook forwarder will not double-count visits.
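
As a rough illustration of hash-based deduplication, the sketch below keys each event on a SHA-256 digest of a few fields. Which fields Trakkr actually hashes is not documented here, so the choice of timestamp, user agent, URL, and status is an assumption, as are the function and field names.

```python
import hashlib


def event_key(timestamp: str, user_agent: str, url: str, status: int) -> str:
    """Build a stable key from the fields assumed to identify one visit."""
    raw = "|".join([timestamp, user_agent, url, str(status)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def deduplicate(events):
    """Keep the first copy of each event, whichever source reported it."""
    seen = set()
    unique = []
    for event in events:
        key = event_key(
            event["timestamp"], event["user_agent"], event["url"], event["status"]
        )
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


visit = {
    "timestamp": "2025-03-12T03:47:10Z",
    "user_agent": "GPTBot/1.0",
    "url": "/pricing",
    "status": 200,
}
# The same visit reported by two sources collapses to one event.
events = [dict(visit, source="cloudflare"), dict(visit, source="webhook")]
print(len(deduplicate(events)))
```

The source field is deliberately left out of the key, so two integrations reporting the same request produce the same hash.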

Server-side integrations

Platform	Auth	Realtime	Notes
Cloudflare	API token	No	Reads Cloudflare's GraphQL analytics with a scoped read-only token. Works on every Cloudflare plan.
Vercel	OAuth	Yes	Installs a Log Drain on your Vercel project. Requires Vercel Pro or Enterprise.
Netlify	OAuth	Yes	Deploys an Edge Function that streams crawler visits.
WordPress	Existing adapter	No	Uses the Trakkr WordPress plugin you already have on connected sites.
Hosted CMS	Cloudflare proxy	No	Webflow, Shopify, HubSpot, Squarespace, Wix, Framer, and Ghost route through Cloudflare.
Manual / Custom CDN	Webhook	Yes	Point any log forwarder at your unique webhook URL. Works with Akamai, Fastly, CloudFront, custom servers.
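
For the webhook path, a log forwarder typically turns each access-log line into a structured event before sending it. The sketch below parses a line in the common "combined" access-log format. The event field names are hypothetical, and the payload shape your webhook URL expects is not specified here, so treat this as a starting point rather than the required format.

```python
import json
import re
from typing import Optional

# Matches the "combined" access-log format used by nginx and Apache.
COMBINED_LOG = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<user_agent>[^"]*)"$'
)


def log_line_to_event(line: str) -> Optional[dict]:
    """Parse one log line into a dict, or return None if it does not match."""
    match = COMBINED_LOG.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])
    return event


line = (
    '203.0.113.7 - - [12/Mar/2025:03:47:10 +0000] "GET /pricing HTTP/1.1" '
    '200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"'
)
print(json.dumps(log_line_to_event(line), indent=2))
```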

Step-by-step guides for each platform live in Crawler Tracking.

Setting it up

Open Crawler in the sidebar
Click Connect platform and choose your host
Follow the platform-specific flow:

Cloudflare - paste a scoped API token
Vercel - OAuth + Log Drain
Netlify - OAuth + Edge Function
WordPress - install the Trakkr plugin
Webflow / Shopify / HubSpot / Squarespace / Wix / Framer / Ghost - use the Hosted CMS tile to route the site through Cloudflare
Next.js / CloudFront / Node / Nginx / Akamai / Fastly / Other - use the webhook setup tile that matches your runtime

Trakkr backfills recent history automatically and starts syncing new visits

The dashboard, explained

Open Crawler in the sidebar. You'll land on a single-page dashboard built around four hero stats and three main tabs.

:::screenshot wide Screenshot: Crawler dashboard with hero stats, activity chart, and Feed tab

The header

Date range - 24h, 7d, 30d, 90d, or custom. Most stats compare the current period to the previous one of equal length, so trend percentages are apples-to-apples.
Refresh - Bypass the cache and re-query.
Connections - Manage your platform integrations and see sync health.
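
The equal-length comparison behind the trend percentages can be sketched as follows. This assumes half-open date ranges (start included, end excluded) and returns no percentage when the previous period had zero visits; both choices are assumptions for the example, and the function names are illustrative.

```python
from datetime import date
from typing import Optional, Tuple


def previous_period(start: date, end: date) -> Tuple[date, date]:
    """Return the window of equal length that ends where the current one starts."""
    length = end - start
    return start - length, start


def percent_change(current: int, previous: int) -> Optional[float]:
    """Percentage change vs the previous period, or None if there is no baseline."""
    if previous == 0:
        return None
    return (current - previous) / previous * 100


# A 7-day window starting 8 March compares against the 7 days before it.
print(previous_period(date(2025, 3, 8), date(2025, 3, 15)))
print(percent_change(150, 120))
```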

The four hero stats

The most important strip on the page. Each card is clickable and filters the chart and table below.

Stat	What it counts	What it means
Total Visits	Every AI bot visit in the period	Top-of-funnel awareness
Conversations	Visits from interaction bots (ChatGPT-User, Claude-User, etc.)	The strongest signal your page may be cited - someone is asking AI right now
Training	Visits from training crawlers (GPTBot, ClaudeBot, etc.)	What AI is learning for future model versions
Indexing	Visits from search crawlers (OAI-SearchBot, PerplexityBot, etc.)	Pre-citation activity - AI is preparing to use this content

Each card has a sparkline showing the trend across the period and a percentage change vs the previous period. Click a card to filter the chart and table below to that category.

The activity chart

A stacked bar chart showing visits over time, broken out by bot category. Use the filter pills above the chart to isolate Conversations, Training, or Indexing, or to view All.

Look for:

Sustained growth in Conversations - your pages are being retrieved in more live chats, which usually precedes more citations
Sudden Training spikes - a new model is being prepped, or a provider is doing a fresh crawl pass
Quiet weeks - usually fine on their own, but sustained silence on a previously-active page is a warning worth investigating
Coordinated drops across providers - almost always a robots.txt or WAF change on your end

Tab 1: Feed

A live stream of recent crawler events, newest first. Each row shows:

Time - relative ("3m ago") plus exact timestamp on hover
Bot - normalized name with provider logo
Page - the URL the bot hit
Status - HTTP status code (200, 404, 403, 500)
Method - GET, POST, HEAD
Verified badge - synthetic verification visits are tagged so you can filter them out

Use the view filter to switch between All, Conversations, Indexing, and Training. This is the view to keep open during launches and big content drops - it shows you AI activity in something close to real time.

Tab 2: Pages

The same data, but page-first instead of event-first. Every URL on your site that AI bots have visited, with:

Total fetches in the period
Unique bots that visited
Most recent visit
Funnel filter - sort by Most Crawled, Recently Crawled, By Citations, or By Clicks

Click any row to open the Page Drawer - the deepest view in the entire dashboard, and the next section.

Tab 3: Access

robots.txt, llms.txt, and per-page accessibility checks. Covered in detail in Access checks below.

Page intelligence: the crawl → cite → click funnel

The Page Drawer is the most important screen in the dashboard, and the one most teams underuse. It's a complete conversion funnel for a single URL, built from crawler data, citation data, click data, and the page health score.

When you click any page in the Pages tab, a drawer slides out with up to six sections.

1. Verdict

A one-paragraph synthesis of what's happening on this page. Trakkr generates this server-side from the page's crawl pattern, citation count, click data, and health score - the same context the AI Insights use.

Example: "This page is your highest-converting AI surface. It's been crawled 1,247 times by 8 distinct bots, cited 31 times in monitored prompts, and drove 89 LLM-referred clicks. Citations have grown 34% week-over-week, but page-load time is in the bottom quartile for your site - fix the LCP and you'll likely see citation gains."

2. Pipeline

The conversion funnel, visualized:

Crawls (1,247) ──▶ Citations (31) ──▶ Clicks (89)
                  ↑ 2.5%               ↑ 287%

The percentages are the conversion rate from one stage to the next, compared to your site average. If a page is converting better than average, you've found a template worth copying. If it's converting worse, you've found a problem worth fixing.
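
The stage-to-stage rates in the example funnel can be reproduced with a few lines of arithmetic. The function name is illustrative, and the comparison against the site average is left out because the averaging method is not described here.

```python
from typing import Optional, Tuple


def stage_rates(
    crawls: int, citations: int, clicks: int
) -> Tuple[Optional[float], Optional[float]]:
    """Return (crawl-to-citation %, citation-to-click %), None where undefined."""
    crawl_to_cite = citations / crawls * 100 if crawls else None
    cite_to_click = clicks / citations * 100 if citations else None
    return crawl_to_cite, cite_to_click


crawl_to_cite, cite_to_click = stage_rates(1247, 31, 89)
print(f"{crawl_to_cite:.1f}% {cite_to_click:.0f}%")
```

A citation-to-click rate above 100% is plausible in this funnel: in the Verdict example, citations are counted in monitored prompts, while clicks are all LLM-referred clicks to the page.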

3. Health

Issues detected on this page that may be affecting AI visibility:

JavaScript walls - the page renders client-side and bots see a blank shell
Bot blocks - Cloudflare, WAF, or robots.txt is blocking the bot Trakkr tested as
404s - the URL is being crawled but no longer exists
Auth walls - the page requires login
Slow LCP / poor Core Web Vitals
Missing meta description or structured data

Each issue links to the relevant fix in Page Analysis or Actions.

4. Next Step

A single, concrete recommendation - the highest-leverage thing to do for this page. Not a list of generic best practices. One specific, actionable next step.

5. Traffic

LLM referral clicks broken down by source. When someone clicks a link in ChatGPT, Claude, Perplexity, or Gemini and lands on this page, it shows up here. Trakkr matches based on referrer URL and UTM parameters across nine sources: ChatGPT, Claude, Perplexity, Gemini, Bing Chat, You.com, Character.AI, Mistral, and Hugging Face.

Source	Visits
ChatGPT	42
Perplexity	28
Gemini	11
Claude	8
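
Referrer and UTM matching can be approximated as below. The hostname map is an assumption covering only four of the nine sources, and Trakkr's real matching rules may differ. The sketch checks the referrer host first, then falls back to a utm_source value on the landing URL.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumed hostnames for a few sources; not an exhaustive or official list.
SOURCE_BY_HOST = {
    "chatgpt.com": "ChatGPT",
    "chat.openai.com": "ChatGPT",
    "claude.ai": "Claude",
    "perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
}


def classify_referral(referrer: str, landing_url: str) -> Optional[str]:
    """Return the AI source for a click, or None if it does not look LLM-referred."""
    host = (urlparse(referrer).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in SOURCE_BY_HOST:
        return SOURCE_BY_HOST[host]
    # Fall back to utm_source when the referrer header is missing or stripped.
    query = parse_qs(urlparse(landing_url).query)
    utm_source = query.get("utm_source", [""])[0].lower()
    return SOURCE_BY_HOST.get(utm_source)


print(classify_referral("https://www.perplexity.ai/search?q=x", "https://example.com/guide"))
print(classify_referral("", "https://example.com/pricing?utm_source=chatgpt.com"))
```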

6. Diagnostics (collapsed by default)

Raw crawler data: bot-by-bot visit counts, status code distribution, request method breakdown, last seen, first seen. Useful when you're debugging a specific anomaly.

AI-generated insights

At the top of the dashboard, the Intelligence Brief card surfaces a short list of AI-generated insights connecting your crawler data to citations, page health, and trends.

These aren't dashboards or static rules. They're written by a language model (Gemini Flash) using a context blob assembled from:

Top 40 pages by fetch count, with citation and click data
Bot category totals for the period and the previous period
Status code distribution (how often bots hit 200, 404, 500)
Page health scores from Page Analysis
Trend deltas

The model returns 2-5 prioritized insights with headlines, explanations, affected pages, and (often) a suggested action you can navigate to. Insights are cached for 4 hours per brand and recomputed when the underlying data changes meaningfully.

Example insights:

"Training crawlers dominating - your site is being learned but not yet retrieved." GPTBot has crawled 412 pages but no Conversation bots have appeared yet, suggesting your content is being collected for training but is not yet being retrieved for live answers.
"Five pages are JavaScript-blocking AI bots." Lists the URLs and links straight to the access fix drawer.
"Citation conversion outperforming average on /guides/." Your guides section converts crawls to citations 3x better than your blog - a template worth copying.
"OAI-SearchBot returning 403 on 14 pages." WAF rule needs adjustment.

Insights have severity (info / warning / error) and link directly to the page or action that resolves them. If the LLM call fails or you don't have enough data, Trakkr falls back to a deterministic rule-based set so the brief is never empty.

Access checks: can bots actually read your site?

The Access tab is a quiet but critical layer. It answers a different question than the rest of the dashboard: if a bot tried to read your content right now, could it?

robots.txt analysis

Trakkr fetches your robots.txt and parses it for AI-relevant rules. The dashboard shows:

Per-bot allow/disallow matrix - which bots are blocked from which paths
Common pitfalls - blocks that look intentional but probably aren't (e.g. Disallow: / under User-agent: GPTBot left over from a defensive copy-paste)
Recommended fixes - one-click diffs you can paste back into your robots.txt
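
You can reproduce a basic allow/disallow matrix locally with Python's standard urllib.robotparser module. This is a sketch of the idea, not Trakkr's parser; the helper name and sample rules are illustrative.

```python
from urllib.robotparser import RobotFileParser


def robots_matrix(robots_txt: str, bots, paths):
    """Return {bot: {path: allowed}} for the given robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        bot: {path: parser.can_fetch(bot, path) for path in paths}
        for bot in bots
    }


SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

matrix = robots_matrix(SAMPLE, ["GPTBot", "ClaudeBot"], ["/", "/pricing"])
for bot, rules in matrix.items():
    print(bot, rules)
```

In this sample GPTBot is blocked everywhere even though the wildcard group allows everything, because a bot-specific group takes precedence over the wildcard group.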

llms.txt analysis

llms.txt is a proposed convention for explicitly telling AI crawlers what content you'd like them to use, what to skip, and where to find your most authoritative pages. It's roughly the AI equivalent of a sitemap, but written for models, not search engines. It is not a formal standard and support varies by provider, so treat it as a hint rather than a control.

If you have an llms.txt, Trakkr parses it. If you don't, Trakkr offers a generated starter file based on your site structure and citation history.

Page visibility detection

For every page Trakkr knows about, it can run a synthetic fetch as GPTBot and inspect the response. The detector flags:

Status	Meaning
Visible	The bot can read full content - no flag
JS-dependent	Bot received a near-empty shell - content needs JavaScript to render
Bot-blocked	403, Cloudflare challenge, or CAPTCHA
Not found	404
Auth required	Login wall
Unknown	Network error - can't determine
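
A simplified version of this classification is sketched below, using only the status code and the amount of visible text in the response body. The 200-character threshold and the helper names are assumptions, and real detection would also need to recognize challenge pages and login redirects that return 200.

```python
from html.parser import HTMLParser
from typing import Optional


class _TextExtractor(HTMLParser):
    """Collects text outside <script>, <style>, and <noscript> elements."""

    SKIPPED = ("script", "style", "noscript")

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    extractor = _TextExtractor()
    extractor.feed(html)
    return len(" ".join(extractor.chunks))


def classify_fetch(status: Optional[int], body: str = "", min_chars: int = 200) -> str:
    """Map a synthetic fetch result onto the statuses in the table above."""
    if status is None:
        return "Unknown"  # network error, nothing to inspect
    if status == 404:
        return "Not found"
    if status == 401:
        return "Auth required"
    if status == 403:
        return "Bot-blocked"
    if status == 200:
        if visible_text_length(body) < min_chars:
            return "JS-dependent"  # near-empty shell without JavaScript
        return "Visible"
    return "Unknown"


shell = '<html><body><div id="root"></div><script>render()</script></body></html>'
print(classify_fetch(200, shell))
```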

Any non-visible status is a leak. JS-dependent pages are especially dangerous because they look fine to humans but are invisible to most AI crawlers. The dashboard groups findings by type, lets you dismiss false positives, and links every issue to a recommended fix.

Alerts and monitoring

Crawler activity changes constantly, and watching the dashboard manually is not a workflow. Trakkr fires four kinds of automated alerts:

Alert	When it fires	Why it matters
First seen	A new bot crawls your site for the first time	A new AI provider is interested - the earliest possible signal you're entering a new model's index
Silent	A bot that was previously active goes quiet for N days	Something may be blocking access (WAF rules, DNS, robots.txt change)
Spike	A bot's volume jumps significantly above its baseline	Either a model retraining cycle or a product launch citing you heavily
Error surge	A bot starts hitting 4xx or 5xx more often	Pages broke, redirects misfire, or your CDN started rate-limiting

Alerts are deduplicated per (brand, bot, page) so you don't get flooded. They feed into Workflows where you can route them to Slack, email, Linear, or any other destination.
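
To show how the Silent and Spike rules could work in principle, here is a sketch over one bot's daily visit counts. The 3-day silence window and the 3x-baseline spike factor are hypothetical thresholds chosen for the example; Trakkr's actual values and baseline method are not documented here.

```python
from statistics import mean
from typing import List


def detect_alerts(
    daily_counts: List[int], silent_days: int = 3, spike_factor: float = 3.0
) -> List[str]:
    """Return alert kinds for one bot, given oldest-first daily visit counts."""
    alerts = []
    if len(daily_counts) <= silent_days:
        return alerts  # not enough history to judge

    recent = daily_counts[-silent_days:]
    earlier = daily_counts[:-silent_days]
    # Silent: the bot was active before, then sent nothing for N days.
    if sum(earlier) > 0 and sum(recent) == 0:
        alerts.append("silent")

    # Spike: today's volume is well above the average of all prior days.
    baseline = mean(daily_counts[:-1])
    if baseline > 0 and daily_counts[-1] > spike_factor * baseline:
        alerts.append("spike")
    return alerts


print(detect_alerts([10, 12, 9, 0, 0, 0]))
print(detect_alerts([10, 12, 9, 11, 10, 60]))
```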

How crawler data connects to other Trakkr features

AI Crawlers isn't a standalone tool. It's a live ground-truth layer that makes the rest of Trakkr smarter.

Feature	How it uses crawler data
Citations	Cross-references which cited pages are actually being crawled - reveals citations from pages AI never actively retrieves
Optimize Site	Shows crawler activity on every audited page so you can prioritize fixes by AI traffic, not vanity scores
Actions	The intelligence brief generates Action items connected directly to crawler findings
Workflows	Crawler alerts can trigger any workflow - Slack pings, content updates, ticket creation
Live Visitors	Cross-references LLM-referred clicks against the pages being crawled, so you see the full crawl-to-click chain
AI Pages	Serves bot-optimized HTML to crawlers without changing what humans see
Reports	Crawler activity is included in stakeholder reports as a leading indicator alongside visibility scores

The optimization playbook

Once crawler data is flowing, here's the order to work on it.

1. Make sure bots can read your site

Open the Access tab. Fix any pages flagged as JS-dependent, bot-blocked, 404, or auth-required. This is table stakes - without it, nothing else matters.

2. Don't block AI bots in robots.txt

Check your robots.txt for AI bot blocks. The most common mistake is blanket Disallow: / rules left over from a defensive copy-paste:

# DON'T do this if you want AI visibility
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

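Unless you have a real reason to opt out of training, allow these bots. If you would rather not name bots individually, a wildcard group allows every crawler that has no group of its own. A leftover bot-specific block such as User-agent: GPTBot still takes precedence over the wildcard, so remove those blocks too:

# Allow all crawlers that have no more specific User-agent group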
User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml

3. Add an llms.txt

Even a minimal llms.txt pointing to your most authoritative pages can be a useful signal for the tools that read it. The Access tab generates a starter file you can use as a base.

# Your Site Name

> A short description of what you do and who it's for.

## Docs
- [Quick start](https://yoursite.com/docs/quickstart): Get up and running in 5 minutes
- [API reference](https://yoursite.com/docs/api): Complete endpoint documentation

## Guides
- [Best practices](https://yoursite.com/guides/best-practices): How experienced teams use the product

4. Find your converting pages and copy them

In the Pages tab, sort by Citations or Clicks. The pages at the top are your AI-visibility templates. What do they have in common - structure, length, schema, internal links? Apply that pattern to similar pages that aren't converting.

5. Watch for First Seen alerts

When a new bot first crawls your site, you have a window. Make sure your most important pages are reachable, fast, and well-structured before that bot's index becomes a citation source.

6. Use the Page Drawer for prioritization

For any page in your top 20 by traffic, open the drawer and look at the funnel. If crawls are high but citations are zero, the issue is content quality or structure. If citations are high but clicks are zero, the issue is snippet quality or attribution.

7. Verify, don't trust

When something looks anomalous - a sudden silence, a spike, a new bot - click Send Verification. Trakkr will fetch your homepage as GPTBot and confirm the detection pipeline is healthy. If verification works but real crawls are missing, the issue is upstream of Trakkr (CDN, robots.txt, DNS).
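
You can run a similar spot check yourself by requesting a page with a GPTBot-style User-Agent. The sketch below uses the user-agent string quoted earlier on this page, and the function names are illustrative. Because the request comes from your own IP rather than OpenAI's ranges, a WAF that verifies bot IPs may respond differently than it would to the real crawler.

```python
import urllib.error
import urllib.request
from typing import Optional

GPTBOT_UA = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"


def build_probe(url: str) -> urllib.request.Request:
    """Build a GET request that identifies itself with a GPTBot-style user agent."""
    return urllib.request.Request(url, headers={"User-Agent": GPTBOT_UA}, method="GET")


def probe_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status the probe received, or None on a network error."""
    try:
        with urllib.request.urlopen(build_probe(url), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 403, 404, and similar still carry a status code
    except (urllib.error.URLError, TimeoutError):
        return None


if __name__ == "__main__":
    # Replace with a page on your own site.
    print(probe_status("https://example.com/"))
```

A 200 here alongside missing real crawls points to a block that targets the real bot upstream, for example an IP-based rule. A 403 suggests a user-agent rule in your CDN or WAF.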

FAQ

Q: Do I need both the pixel and a server-side connection?

No. Pick whichever is easiest to set up. Server-side is more accurate; the pixel is faster to install. You can run both - Trakkr deduplicates events - but most sites won't need to.

Q: How long until I see my first crawler visit?

If you install the pixel and click Send Verification, you'll see a synthetic visit within 30 seconds. Real crawler visits depend on how often AI bots already visit your site - usually within hours for active domains, sometimes a day or two for newer sites.

Q: Why does my Conversations count look low?

Conversation bots (ChatGPT-User, Claude-User) only fire when a real user asks a question that AI thinks your page can answer. Low Conversations means you're not yet a frequent answer. Focus on your Indexing count first - if AI is preparing to cite you, citations and conversations tend to follow.

Q: Are crawler visits the same as visibility?

No. Crawls mean AI is aware of you. Citations mean AI is recommending you. Clicks mean AI is sending you traffic. The Page Drawer's pipeline view shows the conversion at each stage. High crawls + low citations = a quality or structure problem on the page.

Q: Can I block specific bots?

Yes, via your robots.txt (the standard way) or via your CDN (e.g. Cloudflare bot management). Trakkr will show which bots are blocked in the Access tab. Blocking reduces your AI visibility - do it only with intent.

Q: What user agents are tracked?

60+ at last count, including every major AI provider and the long tail of search and research crawlers. The detector is shared across server-side connections so every ingest path agrees on what counts as an AI bot.

Q: How far back does the data go?

Trakkr stores 90 days of detailed visit history. Daily aggregates are kept indefinitely so trend charts cover the full lifetime of the connection.

Q: Does it work behind Cloudflare?

Yes. The Cloudflare integration is the most accurate setup for sites behind Cloudflare - it reads from Cloudflare's analytics API directly, so it sees every bot Cloudflare logs, even ones blocked at the WAF.

Q: What about white-label client portals?

Crawler data is brand-scoped. Each connected brand has its own crawler dashboard, and the white-label portal exposes the same dashboard with portal branding. There's no Trakkr branding leak.

Q: How is this different from Google Search Console?

GSC tracks Google's traditional search crawling and indexing signals. AI Crawlers tracks 60+ bots from every provider and classifies them by intent. They complement each other - GSC for traditional search, AI Crawlers for AI-specific retrieval, training, and user-triggered traffic.

Q: Can I export the data?

Yes. From the Crawler page you can export filtered visits to CSV, and the Reports feature includes crawler activity in stakeholder PDFs. The full dataset is also available via the Crawler API endpoint.

Q: Is the LLM-generated insight the same as Actions?

Related but not identical. The Intelligence Brief is real-time analysis of your crawler data; the Actions feature pulls from a broader signal set including citations, narratives, and page health. Many actions originate from crawler insights, but actions live longer and persist across dashboard sessions.

Q: What plan do I need?

Crawler tracking is available on Growth and Scale plans. Server-side connections, historical dashboards, intelligence, and access analysis require crawler access on your plan.