Crawlers

AI crawlers are bots that read your website on behalf of AI tools. They are doing one of three things: scraping your content for the next training run, indexing it so it can be cited in answers, or fetching it in real time because a user is asking a question right now and the AI wants to read your page before it answers.

If you care whether ChatGPT, Claude, Perplexity, or Gemini recommend you, these are the visits that decide it. A page no AI bot can read might as well not exist as far as AI visibility is concerned. A page heavily crawled by indexing bots is one that AI is preparing to cite. A page hit by a live-conversation bot is, very often, one that's about to appear in someone's answer.

The Crawlers tab tells you, for every URL on your site, which bots are reading it, how often, and what's happening as a result.

The three kinds of bot, and why the distinction matters

The single thing most teams get wrong about crawlers is treating them as one population. They aren't. The bots crawling your site sit in three distinct buckets, and each one tells you something different about your AI visibility. The dashboard's hero stats and filter pills are built around these three.

Category	What the bot is doing	Examples	What it tells you
Training	Scraping content for the next model training cycle	GPTBot, ClaudeBot, Bytespider, CCBot, Amazonbot	AI is learning about you for future model versions, on a 6 to 18 month horizon
Indexing	Pre-fetching content for an AI search index	PerplexityBot, OAI-SearchBot, Claude-SearchBot, Applebot	AI is preparing to cite you. Heavy indexing of a page often precedes the first citation
Conversations	Live retrieval during an active AI chat	ChatGPT-User, Claude-User, Perplexity-User, MistralAI-User	Someone is asking AI a question right now and it is reading this page to answer them

The order in that table is roughly weakest signal to strongest. Training crawls are useful but slow; the effects show up next year. Indexing is the middle of the funnel. Conversations are the loudest signal you can get short of a citation, because they only happen when a real user has just asked a question.
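To make the three buckets concrete, here is a minimal sketch that tallies a list of bot names by intent. The mapping is copied from the table above. The function name and the "other" fallback are illustrative assumptions, not Trakkr's actual implementation.

```python
from collections import Counter

# Bot-to-intent mapping taken from the table above. A real detector
# would cover many more bots than this.
BOT_INTENT = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Bytespider": "training",
    "CCBot": "training",
    "Amazonbot": "training",
    "PerplexityBot": "indexing",
    "OAI-SearchBot": "indexing",
    "Claude-SearchBot": "indexing",
    "Applebot": "indexing",
    "ChatGPT-User": "conversations",
    "Claude-User": "conversations",
    "Perplexity-User": "conversations",
    "MistralAI-User": "conversations",
}


def tally_by_intent(bot_names):
    """Count visits per intent; bots missing from the map land in 'other'."""
    return Counter(BOT_INTENT.get(name, "other") for name in bot_names)


# Illustrative visit stream (hypothetical data).
visits = ["GPTBot", "GPTBot", "PerplexityBot", "ChatGPT-User", "Scrapy"]
counts = tally_by_intent(visits)
```

Keeping the three counts separate, rather than summing them into one "AI traffic" number, is what lets you read each bucket as its own signal.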

There is also an emerging fourth bucket — agent traffic from autonomous bots that act on a site rather than just read it. Today that's effectively just Google-Agent, and because the population is one bot rather than a category of bots, Trakkr tracks it under platform filters instead of giving it its own card. When that bucket grows, it gets promoted.

The bots Trakkr identifies

Trakkr identifies more than 30 AI bots, plus a long tail of generic and content-aggregation crawlers. Detection is pattern-based on the User-Agent header and ordered most-specific-first, so ChatGPT-User matches before plain ChatGPT.

The user-facing AI model lineup, the one Trakkr runs prompts against, is only part of what gets detected. The actual crawler detector covers a broader set, including search and retrieval bots that don't have a public chat product attached:

Provider	Bots tracked
OpenAI	GPTBot, ChatGPT-User, OAI-SearchBot
Anthropic	ClaudeBot, Claude-User, Claude-SearchBot, Claude-Web, Claude-Code, anthropic-ai
Perplexity	PerplexityBot, Perplexity-User
Google	Google-Agent, Google-Extended (control token)
Apple	Applebot, Applebot-Extended (control token)
Mistral	MistralAI-User
Amazon	Amazonbot
ByteDance	Bytespider
Common Crawl	CCBot
xAI	GrokBot
DeepSeek	DeepSeekBot
Cohere	cohere-ai
Allen AI	AI2Bot
Other	Diffbot, Omgili, Timpibot, ImagesiftBot, Scrapy

New providers get added to the detector as they ship. When a bot starts hitting your site that Trakkr does not yet recognise, the dashboard surfaces the raw User-Agent so you can flag it for the team to add.
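The most-specific-first matching described in this section can be sketched as an ordered list of substrings checked against the lowercased User-Agent. The pattern list, the function name, and the simplified User-Agent strings are assumptions for illustration; Trakkr's real pattern set is larger and may be structured differently.

```python
# Hypothetical pattern list. Order matters: specific patterns come first
# so "ChatGPT-User" is matched before the plain "chatgpt" fallback.
BOT_PATTERNS = [
    ("chatgpt-user", "ChatGPT-User"),
    ("oai-searchbot", "OAI-SearchBot"),
    ("gptbot", "GPTBot"),
    ("claude-searchbot", "Claude-SearchBot"),
    ("claude-user", "Claude-User"),
    ("claudebot", "ClaudeBot"),
    ("perplexity-user", "Perplexity-User"),
    ("perplexitybot", "PerplexityBot"),
    ("chatgpt", "ChatGPT (generic)"),
]


def detect_bot(user_agent):
    """Return the first matching bot name, or None for an unrecognised agent."""
    ua = user_agent.lower()
    for needle, name in BOT_PATTERNS:
        if needle in ua:
            return name
    return None  # caller can surface the raw User-Agent for review


# Simplified, illustrative User-Agent strings (not exact real-world values).
live = detect_bot("Mozilla/5.0 (compatible; ChatGPT-User/1.0)")
unknown = detect_bot("SomeNewBot/0.1")
```

If the generic "chatgpt" pattern were listed first, every ChatGPT-User visit would be misfiled under the generic label, which is why the ordering is part of the detector rather than an implementation detail.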

Why blocking AI bots is almost always a mistake

This comes up enough to warrant its own section. The default copy-paste advice in some SEO circles is to block AI training bots in your robots.txt. For most sites, this is the single most expensive mistake you can make in AI visibility.

The logic of the block, "I don't want OpenAI training on my content for free", is rational. The execution is the problem. A blanket Disallow: / under User-agent: GPTBot doesn't just opt you out of training. It also tells OpenAI's other crawlers, which behave differently, that you don't want to participate, and it has a chilling effect on whether your pages get cited at all.

There are three reasonable positions:

Position	What to do	Who it's for
Open	Allow all AI bots. Update `robots.txt` and `llms.txt` to invite them in.	Almost every brand, marketing site, docs site, blog, product page
Surgical	Allow indexing and conversation bots; block only training	Publishers, anyone with a clear commercial argument against being trained on
Closed	Block everything	Internal tools, paywalled archives, regulated content

If you don't have a specific reason to be in the "Surgical" or "Closed" buckets, you want Open. AI visibility is a discoverability channel, and shutting bots out is the equivalent of noindex for search.
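If you do land in the "Surgical" bucket, the sketch below shows one way such a robots.txt might look, and checks it with Python's standard-library parser. The list of training bots comes from the category table earlier on this page. The example URL and the helper name are assumptions, and Python's parser is only one interpretation of the file: each provider's crawler decides for itself how to honour it.

```python
from urllib.robotparser import RobotFileParser

# Block the training bots named earlier on this page; allow everything else.
SURGICAL_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Bytespider
User-agent: Amazonbot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt, bot, url="https://example.com/pricing"):
    """Return True if `bot` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


training_ok = allowed(SURGICAL_ROBOTS_TXT, "GPTBot")
indexing_ok = allowed(SURGICAL_ROBOTS_TXT, "OAI-SearchBot")
conversation_ok = allowed(SURGICAL_ROBOTS_TXT, "ChatGPT-User")
```

Running a check like this before you deploy a robots.txt change is a cheap way to catch a rule that shuts out more bots than you intended.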

How Trakkr tracks crawler activity

AI crawlers generally do not run JavaScript, and most never render a page at all. So a browser pixel, which is fine for human analytics, sees almost nothing on the crawler side. The only reliable place to capture crawler activity is server-side, at the layer where the request actually arrives.

Trakkr ingests crawler hits through one or more server-side connections to your CDN, hosting platform, or CMS proxy. You connect the platform that's already in front of your site and Trakkr starts reading the visit stream. There are eighteen first-class methods covering hosting platforms (Cloudflare, Vercel, Netlify), self-hosted runtimes (Next.js, Node, Nginx, CloudFront, Akamai, Fastly, custom), WordPress, and hosted CMSes (Webflow, Shopify, HubSpot, Squarespace, Wix, Framer, Ghost).

You can connect more than one source — for instance, Cloudflare in front and a Vercel webhook on the origin. Each connection is deduplicated internally by event hash, so retries and overlapping syncs from a single source never double-count. Pick one connection per origin where you can: separate sources can't see each other's hashes, so two parallel feeds reading the same traffic will count each visit twice.
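The per-connection dedup behaviour can be illustrated with a small sketch. The hashed fields, the class, and the sample event are all hypothetical; the source only says that events are deduplicated by hash within a single connection.

```python
import hashlib


def event_hash(event):
    """Hash the fields that identify a visit (field choice is an assumption)."""
    key = "|".join([event["timestamp"], event["url"], event["user_agent"]])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


class Connection:
    """One ingest source with its own private set of seen hashes."""

    def __init__(self, name):
        self.name = name
        self._seen = set()
        self.count = 0

    def ingest(self, event):
        """Count the event unless this connection has already seen it."""
        digest = event_hash(event)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        self.count += 1
        return True


visit = {
    "timestamp": "2025-01-01T12:00:00Z",
    "url": "/pricing",
    "user_agent": "GPTBot/1.0",
}

cloudflare = Connection("cloudflare")
webhook = Connection("origin-webhook")

cloudflare.ingest(visit)
cloudflare.ingest(visit)  # retry from the same source: ignored
webhook.ingest(visit)     # same visit through a second source: counted again

total = cloudflare.count + webhook.count
```

The retry is absorbed because it hits the same set of hashes, while the second connection starts with an empty set and counts the visit again. That is the double-count to expect when two feeds read the same traffic.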

To verify the pipeline is working end to end, the dashboard has a Send test ping button in the header. It writes three synthetic crawler visits — one GPTBot, one PerplexityBot, one ChatGPT-User — straight into the analytics store, labelled as verification rows so they render with a "Verified" badge on the Live tab. They prove the query and render path are healthy. If the test ping shows up but real crawler visits don't, the issue is upstream of Trakkr, usually a robots.txt block, a WAF rule, or a misconfigured source connection.

What's on the dashboard

The page is built around four hero stats, a stacked activity chart, and four tabs.

The hero stats are Total Visits, Conversations, Indexing, and Training, each with a trend vs the previous period. Click any card to filter the chart and tabs below to that category.

The activity chart is a stacked bar chart of visits over time. The view toggle above the chart lets you stack by Platform (one segment per provider), Intent (Conversations / Indexing / Training), or Status (HTTP response code — useful for spotting bots being silently 404'd). Click any segment to filter the tabs below. What you're looking for:

Sustained growth in Conversations means AI is reading your pages for more live answers, which very often precedes more citations.
A sudden Training spike usually means a provider is doing a fresh crawl pass before a new model release.
A coordinated drop across providers almost always means a robots.txt or WAF change on your end.
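The "coordinated drop" pattern is easy to check for yourself if you export per-provider counts. The sketch below compares two periods and reports a drop only when several providers fall together. The 50% threshold, the three-provider minimum, and the sample numbers are illustrative assumptions, not values Trakkr uses.

```python
def coordinated_drop(previous, current, threshold=0.5, min_providers=3):
    """Return providers that fell by at least `threshold`, but only when
    at least `min_providers` of them dropped in the same period."""
    dropped = [
        provider
        for provider, before in previous.items()
        if before > 0
        and (before - current.get(provider, 0)) / before >= threshold
    ]
    return sorted(dropped) if len(dropped) >= min_providers else []


# Hypothetical weekly visit counts per provider.
last_week = {"OpenAI": 400, "Anthropic": 250, "Perplexity": 120}
after_waf_change = {"OpenAI": 90, "Anthropic": 60, "Perplexity": 20}
one_provider_dip = {"OpenAI": 90, "Anthropic": 240, "Perplexity": 118}

broad = coordinated_drop(last_week, after_waf_change)
narrow = coordinated_drop(last_week, one_provider_dip)
```

A drop confined to one provider is more likely something on their side. A drop across all of them at once points back at your own robots.txt or WAF.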

The four tabs, in the order they appear:

Pages (default): every URL AI bots have visited, with totals, unique bots, last seen, and a click-through to the page drawer. This is where most of the work happens.
Live: a live stream of recent crawler events, newest first. The right view to keep open during a launch.
Actions: the recommended next moves Trakkr derives from your crawler activity — surfaced as a prioritised list with counts when there's something waiting on your attention.
Access: robots.txt and llms.txt analysis, plus per-page accessibility checks that tell you whether a bot trying to read each URL right now would actually succeed. A count appears on the tab when there are open findings.

The single most underused screen in the whole feature is the page drawer, which opens when you click any row on the Pages tab. It shows the full crawl → cite → click funnel for one URL: how many bots crawled it, how often it was cited in monitored prompts, and how many human clicks landed on it. Pages with high crawls and low citations are content or structure problems. Pages with high citations and low clicks are snippet or attribution problems. The drawer turns crawler data into a prioritised to-do list.
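The two funnel readings above can be written as a simple rule of thumb. The thresholds and the function name in this sketch are assumptions chosen for illustration; the right cut-offs depend on your own traffic levels.

```python
def diagnose(crawls, citations, clicks,
             min_crawls=100, min_citations=20, low_ratio=0.05):
    """Classify one URL's crawl -> cite -> click funnel.

    All thresholds are illustrative assumptions, not Trakkr defaults.
    """
    # Heavily crawled but rarely cited.
    if crawls >= min_crawls and citations / crawls < low_ratio:
        return "content or structure problem"
    # Frequently cited but rarely clicked.
    if citations >= min_citations and clicks / citations < low_ratio:
        return "snippet or attribution problem"
    return "no obvious gap"


crawled_not_cited = diagnose(crawls=500, citations=3, clicks=1)
cited_not_clicked = diagnose(crawls=500, citations=80, clicks=1)
healthy = diagnose(crawls=500, citations=80, clicks=30)
```

The crawl-to-citation check runs first because a page that is not being cited has no click problem to diagnose yet.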

What to do with the data

The order most teams should work through this in:

Make sure bots can read your site. Open the Access tab. Fix anything flagged as JavaScript-walled, bot-blocked, 404, or behind auth. This is table stakes.
Check robots.txt for accidental blocks. Especially Disallow: / rules under any AI bot user-agent. Trakkr will tell you which bots are currently shut out.
Add an llms.txt. Even a minimal file pointing AI tools at your most authoritative pages is a meaningful signal. Trakkr can generate a starter file from your site structure.
Find your converting pages and copy them. Sort the Pages tab by Citations or Clicks. The pages at the top are your AI visibility templates. Work out what they have in common (length, structure, schema, internal links) and apply it to similar pages that aren't converting.
Watch for new bots appearing. When a new bot first crawls your site, there's a short window where you can optimise specifically for that provider before your competitors notice.
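For the llms.txt step above, a starter file can be as small as a title, a one-line summary, and a short list of links. The sketch below builds one in the Markdown layout used by the public llms.txt proposal (an H1 title, a blockquote summary, then link lists under H2 headings). The function name, section heading, and sample pages are assumptions, and the file Trakkr generates may be laid out differently.

```python
def build_llms_txt(site_name, summary, pages):
    """Build a minimal llms.txt body.

    `pages` is a list of (title, url, note) tuples for your most
    authoritative pages.
    """
    lines = [f"# {site_name}", "", f"> {summary}", "", "## Key pages", ""]
    for title, url, note in pages:
        lines.append(f"- [{title}]({url}): {note}")
    return "\n".join(lines) + "\n"


# Hypothetical site and pages.
llms_txt = build_llms_txt(
    "Example Co",
    "Example Co makes scheduling software for small clinics.",
    [
        ("Pricing", "https://example.com/pricing", "Plans and limits"),
        ("Docs", "https://example.com/docs", "Setup and API reference"),
    ],
)
```

Serve the result at the root of your site as /llms.txt, and keep the list short: the point is to steer AI tools toward the pages you most want read.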

Common questions

Should I block GPTBot to stop OpenAI training on my content?

For almost every marketing site, no. Blocking GPTBot does not just opt you out of training; it has knock-on effects on how OpenAI's other crawlers treat your site and on whether ChatGPT cites you at all. If you genuinely don't want to be trained on, see the "Surgical" approach above: block training, allow indexing and conversations. The Access tab will help you write the precise rules.

Why does my Conversations count look low?

Conversation bots like ChatGPT-User and Claude-User only fire when a real user asks a question that AI thinks your page can answer. Low Conversations usually means you're not yet a frequent answer for the prompts that matter. Focus on Indexing first: if AI is preparing to cite you, conversations and citations will follow.

Are crawler visits the same as visibility?

No. Crawls mean AI is aware of you. Citations mean AI is recommending you. Clicks mean AI is sending traffic. The page drawer shows the conversion at each stage; the gap between them is usually where the real work is.

Do I need more than one connection?

In most cases, no — pick the one closest to your traffic. The CDN-level integrations (Cloudflare, Vercel, Netlify) are the most complete because they capture requests before any application logic. A webhook from your origin gives you per-request fidelity for traffic that reached the server. Adding both is fine for redundancy, but expect duplicate counts on overlapping origins — dedup is scoped to a single connection.

How far back does the data go?

Detailed visit history is kept for 90 days. Daily aggregates are kept indefinitely, so the trend charts cover the full lifetime of the connection.
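One way to picture the two retention tiers: raw events are rolled up into daily totals, and only the raw rows age out. This is a conceptual sketch under assumed data shapes, not a description of how Trakkr stores data.

```python
from collections import Counter
from datetime import date, timedelta

RETENTION_DAYS = 90  # detailed-history window stated above


def apply_retention(events, today):
    """Split events into (recent detailed rows, daily totals).

    `events` is a list of (day, url) tuples. Daily totals cover every
    event; detailed rows older than the window are dropped.
    """
    daily_totals = Counter(day for day, _ in events)
    cutoff = today - timedelta(days=RETENTION_DAYS)
    recent = [event for event in events if event[0] >= cutoff]
    return recent, daily_totals


# Hypothetical events.
events = [
    (date(2025, 1, 10), "/a"),
    (date(2025, 6, 1), "/a"),
    (date(2025, 6, 1), "/b"),
]
recent, daily_totals = apply_retention(events, today=date(2025, 6, 30))
```

The January visit no longer appears in the detailed rows, but it still contributes to that day's total, which is why trend charts can reach back further than the visit-level history.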

How does this differ from Google Search Console?

GSC tracks Google's own crawling and indexing. Crawlers tracks 30+ bots across AI providers and classifies them by intent. They complement each other: GSC for traditional search, Crawlers for AI-specific retrieval, training, and live-conversation traffic.

Does this work behind Cloudflare?

Yes, and the Cloudflare integration is the easiest path for sites that sit behind it. Trakkr reads from Cloudflare's GraphQL Analytics API, which includes bots blocked at the WAF before they ever reach your origin — visits a server-side tag could never see. The trade-off is that the underlying dataset is aggregated and, on high-traffic zones, sampled, so very low-volume crawls can be slightly undercounted. For most sites the coverage gain is worth it; if you need exact per-request fidelity, layer a webhook from your origin on top.