Trakkr Docs

AI Crawlers

:::summarybox learn
- See exactly when GPTBot, ClaudeBot, PerplexityBot, and 60+ other AI bots visit your site
- Tell training, indexing, live-conversation, and agent crawls apart at a glance
- Follow every page through the crawl → cite → click conversion funnel
- Read AI-generated insights that connect crawls to citations and page health
- Catch JavaScript walls, robots.txt blocks, and bot-blocking WAF rules before they cost you visibility


The three kinds of AI bots

This is the most important concept on the page, and the one most teams get wrong. Not all crawlers are the same. They don't tell you the same thing about your visibility, and they don't deserve the same response.

Trakkr classifies every bot into one of three categories. The dashboard's hero stats, chart filter pills, and saved views are all built around these three buckets, so it's worth learning them now.

| Category | What the bot is doing | Examples | What it tells you |
| --- | --- | --- | --- |
| Training | Scraping content for the next model training cycle | GPTBot, ClaudeBot, Amazonbot, Bytespider, CCBot | "AI is learning about you for the future" - effects show up in 6-18 months |
| Indexing | Pre-fetching content for an AI search index | OAI-SearchBot, PerplexityBot, Claude-SearchBot, YouBot, Applebot | "AI is preparing to cite you" - your content is being added to a retrieval index for live answers |
| Conversations | Live retrieval during an active AI chat | ChatGPT-User, Claude-User, Perplexity-User, MistralAI-User, Meta-ExternalFetcher | "A real user is asking AI something right now and it's reading your page" - the strongest signal you may be cited |

Each one is a different stage of the AI funnel.

Conversations is the loudest signal. When a Claude-User or ChatGPT-User hits your page, it usually means a real person was asking a question and AI thought your page was the right answer. These visits are often the leading indicator that a citation is about to land.

Indexing sits in the middle. When PerplexityBot or OAI-SearchBot crawls a page, they're storing it in a retrieval index they'll consult during user queries. Heavy indexing of a page is a leading indicator of citations.

Training is the long game. GPTBot crawling your site today may not affect ChatGPT's behavior until the next major release - which could be six to eighteen months out. It's a useful signal for future model exposure, but it's weaker and slower than indexing or conversation traffic.

There's also an emerging fourth category of autonomous agent bots like Google-Agent that act on a site rather than just read it. Volume is currently tiny and Google-Agent is effectively the only resident, so Trakkr tracks agent traffic under a platform filter rather than promoting it to its own hero stat. When that bucket grows, it gets promoted.


The 60+ bots Trakkr identifies

Every major AI provider, plus the long tail of search and research crawlers. Bot detection is pattern-based on the User-Agent header, ordered most-specific-first to avoid false positives (so ChatGPT-User matches before generic ChatGPT).

| Provider | Bots tracked |
| --- | --- |
| OpenAI | GPTBot, ChatGPT-User, OAI-SearchBot |
| Anthropic | ClaudeBot, Claude-User, Claude-SearchBot, Claude-Web, Claude-Code, anthropic-ai |
| Google | Google-Agent, Googlebot, GoogleOther, Google-Extended (control token) |
| Perplexity | PerplexityBot, Perplexity-User |
| Microsoft / Bing | Bingbot, BingPreview |
| Meta | Meta-ExternalAgent, Meta-ExternalFetcher, FacebookBot |
| Apple | Applebot, Applebot-Extended (control token) |
| Amazon | Amazonbot |
| ByteDance | Bytespider |
| Common Crawl | CCBot |
| Mistral AI | MistralAI-User |
| DeepSeek | DeepSeekBot |
| xAI | GrokBot |
| Cohere | cohere-ai |
| You.com | YouBot |
| Allen AI | AI2Bot |
| Other | Diffbot, Omgili, Timpibot, ImagesiftBot, Scrapy |

New bots are added to the detector as providers ship them. If a bot starts hitting your site that Trakkr doesn't recognize yet, the dashboard surfaces the raw user agent so you can flag it.
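As a rough illustration of most-specific-first matching, the sketch below classifies a User-Agent string against an ordered pattern list. It is not Trakkr's detector: the `BOT_PATTERNS` table, the `classify_user_agent` helper, and the generic `chatgpt` fallback entry are assumptions made for the example, and only a handful of the bots listed above are included.

```python
from typing import Optional, Tuple

# Ordered most-specific-first: "chatgpt-user" must be tested before the
# generic "chatgpt" pattern. This table is a small illustrative subset,
# not Trakkr's real rule list.
BOT_PATTERNS: Tuple[Tuple[str, str, str], ...] = (
    ("chatgpt-user", "ChatGPT-User", "conversations"),
    ("oai-searchbot", "OAI-SearchBot", "indexing"),
    ("gptbot", "GPTBot", "training"),
    ("claude-user", "Claude-User", "conversations"),
    ("claude-searchbot", "Claude-SearchBot", "indexing"),
    ("claudebot", "ClaudeBot", "training"),
    ("perplexity-user", "Perplexity-User", "conversations"),
    ("perplexitybot", "PerplexityBot", "indexing"),
    ("chatgpt", "ChatGPT (generic)", "unclassified"),  # hypothetical fallback
)


def classify_user_agent(user_agent: str) -> Optional[Tuple[str, str]]:
    """Return (bot name, category) for the first matching pattern, else None."""
    ua = user_agent.lower()
    for needle, name, category in BOT_PATTERNS:
        if needle in ua:
            return name, category
    return None  # unrecognized: surface the raw user agent instead
```

Because the loop returns on the first hit, moving the generic `chatgpt` entry to the top of the table would misclassify every ChatGPT-User visit, which is the false positive the ordering is meant to avoid.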


How Trakkr captures crawler visits

Crawler visits flow into Trakkr through server-side sources. Most teams connect the platform already sitting in front of the site: Cloudflare, Vercel, Netlify, WordPress, a self-hosted runtime, or a hosted CMS routed through Cloudflare.

| Path | What it is | Setup time | Best for |
| --- | --- | --- | --- |
| Server-side connection | A direct integration with your CDN, host, or log forwarder | 5-10 minutes | Production apps, server-rendered sites, high-traffic platforms behind a CDN |
| Hosted CMS via Cloudflare | A guided Cloudflare proxy flow for CMS platforms without server logs | 10-15 minutes | Webflow, Shopify, HubSpot, Squarespace, Wix, Framer, Ghost |

Why server-side is more accurate

Browser scripts can only see crawlers that execute JavaScript, and many crawler requests never get that far. Server-side integrations see every request that hits your origin, including bots that never run JS.

If you have the option, server-side is the better choice.

You can connect more than one server-side source. Trakkr deduplicates events by hash, so Cloudflare plus a webhook forwarder will not double-count visits.
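To make hash-based deduplication concrete, here is a minimal sketch. The fields that feed the hash (`timestamp`, `ip`, `user_agent`, `path`) and the helper names are assumptions; this page does not say which fields Trakkr actually hashes.

```python
import hashlib
from typing import Dict, Iterable, List


def event_hash(event: Dict[str, str]) -> str:
    """Hash the fields assumed to identify one visit (hypothetical choice)."""
    parts = (event["timestamp"], event["ip"], event["user_agent"], event["path"])
    # A separator that cannot appear in the fields avoids ambiguous joins.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def deduplicate(events: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first event seen for each hash, whatever source reported it."""
    seen = set()
    unique = []
    for event in events:
        key = event_hash(event)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique
```

The `source` of an event is deliberately left out of the hash, so the same request reported by both Cloudflare and a webhook forwarder collapses to one visit.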

Server-side integrations

| Platform | Auth | Realtime | Notes |
| --- | --- | --- | --- |
| Cloudflare | API token | No | Reads Cloudflare's GraphQL analytics with a scoped read-only token. Works on every Cloudflare plan. |
| Vercel | OAuth | Yes | Installs a Log Drain on your Vercel project. Requires Vercel Pro or Enterprise. |
| Netlify | OAuth | Yes | Deploys an Edge Function that streams crawler visits. |
| WordPress | Existing adapter | No | Uses the Trakkr WordPress plugin you already have on connected sites. |
| Hosted CMS | Cloudflare proxy | No | Webflow, Shopify, HubSpot, Squarespace, Wix, Framer, and Ghost route through Cloudflare. |
| Manual / Custom CDN | Webhook | Yes | Point any log forwarder at your unique webhook URL. Works with Akamai, Fastly, CloudFront, custom servers. |

Step-by-step guides for each platform live in Crawler Tracking.


Setting it up

  1. Open Crawler in the sidebar
  2. Click Connect platform and choose your host
  3. Follow the platform-specific flow
  4. Trakkr backfills recent history automatically and starts syncing new visits

The dashboard, explained

Open Crawler in the sidebar. You'll land on a single-page dashboard built around four hero stats and three main tabs.

:::screenshot wide Screenshot: Crawler dashboard with hero stats, activity chart, and Feed tab


AI-generated insights

At the top of the dashboard, the Intelligence Brief card surfaces a short list of AI-generated insights connecting your crawler data to citations, page health, and trends.

These aren't dashboards or static rules. They're written by a language model (Gemini Flash) from a context blob assembled from your brand's crawler, citation, and page-health data.

The model returns 2-5 prioritized insights with headlines, explanations, affected pages, and (often) a suggested action you can navigate to. Insights are cached for 4 hours per brand and recomputed when the underlying data changes meaningfully.

Insights have severity (info / warning / error) and link directly to the page or action that resolves them. If the LLM call fails or you don't have enough data, Trakkr falls back to a deterministic rule-based set so the brief is never empty.
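The fallback behavior can be pictured with a small sketch. Everything here is hypothetical: `build_brief`, the `visit_count` context key, the `min_visits` threshold, and the two callables stand in for whatever Trakkr uses internally.

```python
from typing import Callable, Dict, List

Insight = Dict[str, str]


def build_brief(
    context: Dict,
    generate_with_llm: Callable[[Dict], List[Insight]],
    rule_based: Callable[[Dict], List[Insight]],
    min_visits: int = 1,
) -> List[Insight]:
    """Prefer LLM insights; fall back to rules so the brief is never empty."""
    # Not enough data: skip the model call entirely (threshold is an assumption).
    if context.get("visit_count", 0) < min_visits:
        return rule_based(context)
    try:
        insights = generate_with_llm(context)
    except Exception:
        # Any failure of the model call triggers the deterministic path.
        return rule_based(context)
    # Cap at five insights, matching the 2-5 range described above.
    return insights[:5] if insights else rule_based(context)
```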


Access checks: can bots actually read your site?

The Access tab is a quiet but critical layer. It answers a different question than the rest of the dashboard: if a bot tried to read your content right now, could it?

robots.txt analysis

Trakkr fetches your robots.txt and parses it for AI-relevant rules. The dashboard shows which AI bots those rules block.

llms.txt analysis

llms.txt is a new convention for explicitly telling AI crawlers what content you'd like them to use, what to skip, and where to find your most authoritative pages. It's the AI equivalent of a sitemap, but written for models, not search engines.

If you have an llms.txt, Trakkr parses it. If you don't, Trakkr offers a generated starter file based on your site structure and citation history.

Page visibility detection

For every page Trakkr knows about, it can run a synthetic fetch as GPTBot and inspect the response. The detector flags:

| Status | Meaning |
| --- | --- |
| Visible | The bot can read full content - no flag |
| JS-dependent | Bot received a near-empty shell - content needs JavaScript to render |
| Bot-blocked | 403, Cloudflare challenge, or CAPTCHA |
| Not found | 404 |
| Auth required | Login wall |
| Unknown | Network error - can't determine |

Any non-visible status is a leak. JS-dependent pages are especially dangerous because they look fine to humans but are invisible to most AI crawlers. The dashboard groups findings by type, lets you dismiss false positives, and links every issue to a recommended fix.
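One way to picture the status mapping is a heuristic classifier over a fetched response. The sketch below is an assumption-heavy approximation, not Trakkr's detector: the `classify_fetch` name, the 200-character threshold for a "near-empty shell", the use of 401 as the login-wall signal, and the `captcha` substring check are all illustrative choices.

```python
import re
from typing import Optional

_SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")


def visible_text(html: str) -> str:
    """Strip scripts, styles, and tags, leaving roughly what a non-JS bot reads."""
    without_code = _SCRIPT_OR_STYLE.sub(" ", html)
    return " ".join(_TAG.sub(" ", without_code).split())


def classify_fetch(status: Optional[int], body: str = "", min_text_chars: int = 200) -> str:
    """Map a response to a visibility status. status=None means a network error."""
    if status is None:
        return "Unknown"
    if status == 404:
        return "Not found"
    if status == 401:
        return "Auth required"
    if status == 403 or "captcha" in body.lower():
        return "Bot-blocked"
    if status == 200:
        # A page whose text only appears after JavaScript runs looks near-empty here.
        if len(visible_text(body)) < min_text_chars:
            return "JS-dependent"
        return "Visible"
    return "Unknown"
```

A single-page app that ships only `<div id="root"></div>` plus a script bundle would come back as JS-dependent under this heuristic, even though it renders fine in a browser.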


Alerts and monitoring

Crawler activity changes constantly, and watching the dashboard manually is not a workflow. Trakkr fires four kinds of automated alerts:

| Alert | When it fires | Why it matters |
| --- | --- | --- |
| First seen | A new bot crawls your site for the first time | A new AI provider is interested - the earliest possible signal you're entering a new model's index |
| Silent | A bot that was previously active goes quiet for N days | Something may be blocking access (WAF rules, DNS, robots.txt change) |
| Spike | A bot's volume jumps significantly above its baseline | Either a model retraining cycle or a product launch citing you heavily |
| Error surge | A bot starts hitting 4xx or 5xx more often | Pages broke, redirects misfire, or your CDN started rate-limiting |

Alerts are deduplicated per (brand, bot, page) so you don't get flooded. They feed into Workflows where you can route them to Slack, email, Linear, or any other destination.
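A minimal sketch of two of these ideas, spike detection against a baseline and deduplication per (brand, bot, page), might look like the following. The mean-based baseline, the `factor=3.0` threshold, and the class and function names are assumptions; this page does not specify how "significantly above its baseline" is measured.

```python
from statistics import mean
from typing import List, Set, Tuple


def is_spike(history: List[int], today: int, factor: float = 3.0) -> bool:
    """True when today's volume exceeds `factor` times the historical mean."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    return baseline > 0 and today > factor * baseline


class AlertDeduper:
    """Suppress repeat alerts for the same (brand, bot, page) key."""

    def __init__(self) -> None:
        self._sent: Set[Tuple[str, str, str]] = set()

    def should_send(self, brand: str, bot: str, page: str) -> bool:
        key = (brand, bot, page)
        if key in self._sent:
            return False
        self._sent.add(key)
        return True
```

In practice a deduper like this would also need an expiry so that a recurring problem can alert again later; that is left out to keep the sketch short.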


How crawler data connects to other Trakkr features

AI Crawlers isn't a standalone tool. It's a live ground-truth layer that makes the rest of Trakkr smarter.

| Feature | How it uses crawler data |
| --- | --- |
| Citations | Cross-references which cited pages are actually being crawled - reveals citations from pages AI never actively retrieves |
| Optimize Site | Shows crawler activity on every audited page so you can prioritize fixes by AI traffic, not vanity scores |
| Actions | The intelligence brief generates Action items connected directly to crawler findings |
| Workflows | Crawler alerts can trigger any workflow - Slack pings, content updates, ticket creation |
| Live Visitors | Cross-references LLM-referred clicks against the pages being crawled, so you see the full crawl-to-click chain |
| AI Pages | Serves bot-optimized HTML to crawlers without changing what humans see |
| Reports | Crawler activity is included in stakeholder reports as a leading indicator alongside visibility scores |

The optimization playbook

Once crawler data is flowing, here's the order to work on it.

1. Make sure bots can read your site

Open the Access tab. Fix any pages flagged as JS-dependent, bot-blocked, 404, or auth-required. This is table stakes - without it, nothing else matters.

2. Don't block in robots.txt

Check your robots.txt for AI bot blocks. The most common mistake is blanket Disallow: / rules left over from a defensive copy-paste:

```text
# DON'T do this if you want AI visibility
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

Unless you have a real reason to opt out of training, allow these bots. If you want to be more permissive without naming bots individually:

```text
# Allow all AI crawlers
User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
```
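If you want to check a robots.txt yourself before relying on the dashboard, Python's standard `urllib.robotparser` can evaluate rules for named user agents. The `blocked_bots` helper and the `AI_BOTS` tuple below are assumptions for illustration, and the standard-library parser may not interpret every edge case the way an individual crawler does.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot")  # illustrative subset


def blocked_bots(robots_txt: str, url: str, bots: Iterable[str] = AI_BOTS) -> List[str]:
    """Return the bots that the given robots.txt text forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network needed
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


if __name__ == "__main__":
    sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
    print(blocked_bots(sample, "https://yoursite.com/pricing"))
```

Note that a bot-specific group takes precedence over the `*` group, so a leftover `User-agent: GPTBot` block still applies even after you add a permissive wildcard rule.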

3. Add an llms.txt

Even a minimal llms.txt pointing to your most authoritative pages is a meaningful signal. The Access tab generates a starter file you can use as a base.

```markdown
# llms.txt
# A guide to the content on this site for LLMs.

> A short description of what you do and who it's for.

## Docs
- [Quick start](https://yoursite.com/docs/quickstart): Get up and running in 5 minutes
- [API reference](https://yoursite.com/docs/api): Complete endpoint documentation

## Guides
- [Best practices](https://yoursite.com/guides/best-practices): How experienced teams use the product
```
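To sanity-check a file shaped like the example above, a short parser can list the links under each section. This is a sketch that assumes the `## Section` and `- [title](url): note` layout shown here; `parse_llms_txt` is a hypothetical helper, not how Trakkr parses the file.

```python
import re
from typing import Dict, List, Tuple

# Matches "- [title](url)" with an optional ": note" suffix.
_LINK = re.compile(r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<note>.*))?$")


def parse_llms_txt(text: str) -> Dict[str, List[Tuple[str, str, str]]]:
    """Group (title, url, note) entries under their '## ' section headings."""
    sections: Dict[str, List[Tuple[str, str, str]]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            match = _LINK.match(line)
            if match:
                sections[current].append(
                    (match["title"], match["url"], match["note"] or "")
                )
    return sections
```

Running it over your own file is a quick way to catch malformed links or empty sections before publishing.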

4. Find your converting pages and copy them

In the Pages tab, sort by Citations or Clicks. The pages at the top are your AI-visibility templates. What do they have in common - structure, length, schema, internal links? Apply that pattern to similar pages that aren't converting.

5. Watch for First Seen alerts

When a new bot first crawls your site, you have a window. Make sure your most important pages are reachable, fast, and well-structured before that bot's index becomes a citation source.

6. Use the Page Drawer for prioritization

For any page in your top 20 by traffic, open the drawer and look at the funnel. If crawls are high but citations are zero, the issue is content quality or structure. If citations are high but clicks are zero, the issue is snippet quality or attribution.
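The two diagnoses above reduce to a simple decision rule, sketched below. The numeric cutoffs for "high" are assumptions chosen for the example; pick values that fit your own traffic.

```python
def diagnose_funnel(
    crawls: int,
    citations: int,
    clicks: int,
    high_crawls: int = 100,   # assumed cutoff for "crawls are high"
    high_citations: int = 10,  # assumed cutoff for "citations are high"
) -> str:
    """Name the likely bottleneck in a page's crawl -> cite -> click funnel."""
    if crawls >= high_crawls and citations == 0:
        return "content quality or structure"
    if citations >= high_citations and clicks == 0:
        return "snippet quality or attribution"
    return "no obvious bottleneck"
```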

7. Verify, don't trust

When something looks anomalous - a sudden silence, a spike, a new bot - click Send Verification. Trakkr will fetch your homepage as GPTBot and confirm the detection pipeline is healthy. If verification works but real crawls are missing, the issue is upstream of Trakkr (CDN, robots.txt, DNS).


FAQ

Q: Do I need both the pixel and a server-side connection?

No. Pick whichever is easiest to set up. Server-side is more accurate; the pixel is faster to install. You can run both - Trakkr deduplicates events - but most sites won't need to.

Q: How long until I see my first crawler visit?

If you install the pixel and click Send Verification, you'll see a synthetic visit within 30 seconds. Real crawler visits depend on how often AI bots already visit your site - usually within hours for active domains, sometimes a day or two for newer sites.

Q: Why does my Conversations count look low?

Conversation bots (ChatGPT-User, Claude-User) only fire when a real user asks a question that AI thinks your page can answer. Low Conversations means you're not yet a frequent answer. Focus on your Indexing count first - if AI is preparing to cite you, citations and conversations will follow.

Q: Are crawler visits the same as visibility?

No. Crawls mean AI is aware of you. Citations mean AI is recommending you. Clicks mean AI is sending you traffic. The Page Drawer's pipeline view shows the conversion at each stage. High crawls + low citations = a quality or structure problem on the page.

Q: Can I block specific bots?

Yes, via your robots.txt (the standard way) or via your CDN (e.g. Cloudflare bot management). Trakkr will show which bots are blocked in the Access tab. Blocking reduces your AI visibility - do it only with intent.

Q: What user agents are tracked?

60+ at last count, including every major AI provider and the long tail of search and research crawlers. The detector is shared across server-side connections so every ingest path agrees on what counts as an AI bot.

Q: How far back does the data go?

Trakkr stores 90 days of detailed visit history. Daily aggregates are kept indefinitely so trend charts cover the full lifetime of the connection.

Q: Does it work behind Cloudflare?

Yes. The Cloudflare integration is the most accurate setup for sites behind Cloudflare - it reads from Cloudflare's analytics API directly, so it sees every bot Cloudflare logs, even ones blocked at the WAF.

Q: What about white-label client portals?

Crawler data is brand-scoped. Each connected brand has its own crawler dashboard, and the white-label portal exposes the same dashboard with portal branding. There's no Trakkr branding leak.

Q: How is this different from Google Search Console?

GSC tracks Google's traditional search crawling and indexing signals. AI Crawlers tracks 60+ bots from every provider and classifies them by intent. They complement each other - GSC for traditional search, AI Crawlers for AI-specific retrieval, training, and user-triggered traffic.

Q: Can I export the data?

Yes. From the Crawler page you can export filtered visits to CSV, and the Reports feature includes crawler activity in stakeholder PDFs. The full dataset is also available via the Crawler API endpoint.

Q: Is the LLM-generated insight the same as Actions?

Related but not identical. The Intelligence Brief is real-time analysis of your crawler data; the Actions feature pulls from a broader signal set including citations, narratives, and page health. Many actions originate from crawler insights, but actions live longer and persist across dashboard sessions.

Q: What plan do I need?

Crawler tracking is available on Growth and Scale plans. Server-side connections, historical dashboards, intelligence, and access analysis require crawler access on your plan.


Next steps