AI Crawlers
Track every AI bot visit to your site, classify training vs indexing vs live conversations, and follow each page through the crawl → cite → click funnel.
- See exactly when GPTBot, ClaudeBot, PerplexityBot, and 60+ other AI bots visit your site
- Tell training, indexing, live-conversation, and agent crawls apart at a glance
- Follow every page through the crawl → cite → click conversion funnel
- Read AI-generated insights that connect crawls to citations and page health
- Catch JavaScript walls, robots.txt blocks, and bot-blocking WAF rules before they cost you visibility
There's a blind spot in your analytics. Google Analytics doesn't show AI bots. Your CDN logs lump them in with everything else. And the bots don't announce themselves with clean labels - they show up under cryptic user agents like `Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)`.
But AI crawlers are now the most consequential visitors to your site. Every time GPTBot reads your pricing page, every time PerplexityBot fetches a guide while a user is asking a question, every time Claude-User drops in mid-conversation - those are the events that decide whether ChatGPT recommends you next month.
AI Crawlers is the dashboard for those events. It identifies every AI bot that touches your site, classifies what each one is doing, and connects the dots between crawls, citations, and clicks - so you can see the full pipeline from "AI knows about this page" to "AI sent us a customer."
What you'll learn
| Section | What it covers |
|---|---|
| The blind spot | Why your existing analytics miss the visits that matter most |
| The four kinds of AI bots | Training, indexing, conversations, agents - and what each tells you |
| The 60+ bots Trakkr identifies | Every AI provider, mapped to their user-agent variants |
| How visits are captured | Tracking pixel vs server-side platform connections |
| The dashboard | Hero stats, the activity chart, Feed view, Pages view, Access view |
| Page intelligence | The crawl → cite → click funnel for every URL |
| AI-generated insights | Cross-signal analysis written by an LLM, refreshed every 4 hours |
| Access checks | robots.txt, llms.txt, JavaScript walls, bot blocks |
| Alerts | First-seen, silent, spike, and error-surge monitoring |
| Optimization playbook | The order to work on the data once it's flowing |
| FAQ | The questions teams ask in week one |
The blind spot
Your analytics stack was built for human visitors. It tracks pageviews, sessions, sources, conversions. It does not track:
- When GPTBot scraped your site at 03:47 UTC for OpenAI's next training run
- When PerplexityBot indexed your guide so it could be cited in tomorrow's answers
- When ChatGPT-User opened your page mid-conversation because someone asked ChatGPT about you
- When Google-Agent filled out a form on your site on behalf of a user's AI assistant
- When Claude-SearchBot got a 403 because Cloudflare flagged it as suspicious
Multiply that by every AI platform, every bot variant, every page on your site. You're flying blind on the visits that matter most for AI visibility.
AI Crawlers fills the gap. It listens for AI bot user agents at the edge - either via a tracking pixel you install, or via a direct connection to your CDN or host - and surfaces every visit, classified by bot, by page, by intent.
The four kinds of AI bots
This is the most important concept on the page, and the one most teams get wrong. Not all crawlers are the same. They don't tell you the same thing about your visibility, and they don't deserve the same response.
Trakkr classifies every bot into one of four categories. The dashboard's hero stats are organized around these four buckets, so it's worth learning them now.
| Category | What the bot is doing | Examples | What it tells you |
|---|---|---|---|
| Training | Scraping content for the next model training cycle | GPTBot, ClaudeBot, Amazonbot, Bytespider, CCBot | "AI is learning about you for the future" - effects show up in 6-18 months |
| Indexing | Pre-fetching content for an AI search index | OAI-SearchBot, PerplexityBot, Claude-SearchBot, YouBot, Applebot | "AI is preparing to cite you" - your content is being added to a retrieval index for live answers |
| Conversations | Live retrieval during an active AI chat | ChatGPT-User, Claude-User, Perplexity-User, MistralAI-User, Meta-ExternalFetcher | "A real user is asking AI something right now and it's reading your page" - the strongest signal you may be cited |
| Agents | Automated action-taking on your site | Google-Agent | "An AI assistant is acting on your behalf" - filling forms, navigating, completing tasks |
Each one is a different stage of the AI funnel.
Conversations is the loudest signal. When a Claude-User or ChatGPT-User hits your page, it usually means a real person was asking a question and AI thought your page was the right answer. These visits are often the leading indicator that a citation is about to land.
Indexing sits in the middle. When PerplexityBot or OAI-SearchBot crawls a page, they're storing it in a retrieval index they'll consult during user queries. Heavy indexing of a page is a leading indicator of citations.
Training is the long game. GPTBot crawling your site today may not affect ChatGPT's behavior until the next major release - which could be six to eighteen months out. It's a useful signal for future model exposure, but it's weaker and slower than indexing or conversation traffic.
Agents are the newest category and the rarest in volume. They're also the most consequential when they appear, because they signal that AI is acting on your site, not just reading it. Google-Agent visits today are a preview of how AI will interact with the web going forward.
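To make the mapping concrete, here's a minimal sketch of a bot-to-category lookup using the example bots above. The names and shape are illustrative, not Trakkr's internal schema.

```ts
// Illustrative sketch only - not Trakkr's internal code.
// Maps normalized bot names to the four categories described above.
type BotCategory = "training" | "indexing" | "conversation" | "agent";

const BOT_CATEGORIES: Record<string, BotCategory> = {
  "GPTBot": "training",
  "ClaudeBot": "training",
  "CCBot": "training",
  "OAI-SearchBot": "indexing",
  "PerplexityBot": "indexing",
  "Claude-SearchBot": "indexing",
  "ChatGPT-User": "conversation",
  "Claude-User": "conversation",
  "Perplexity-User": "conversation",
  "Google-Agent": "agent",
};

function categorize(bot: string): BotCategory | "unknown" {
  return BOT_CATEGORIES[bot] ?? "unknown";
}
```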
The 60+ bots Trakkr identifies
Every major AI provider, plus the long tail of search and research crawlers. Bot detection is pattern-based on the User-Agent header, ordered most-specific-first to avoid false positives (so ChatGPT-User matches before generic ChatGPT).
| Provider | Bots tracked |
|---|---|
| OpenAI | GPTBot, ChatGPT-User, OAI-SearchBot |
| Anthropic | ClaudeBot, Claude-User, Claude-SearchBot, Claude-Web, Claude-Code, anthropic-ai |
| Google | Google-Agent, Googlebot, GoogleOther, Google-Extended (control token) |
| Perplexity | PerplexityBot, Perplexity-User |
| Microsoft / Bing | Bingbot, BingPreview |
| Meta | Meta-ExternalAgent, Meta-ExternalFetcher, FacebookBot |
| Apple | Applebot, Applebot-Extended (control token) |
| Amazon | Amazonbot |
| ByteDance | Bytespider |
| Common Crawl | CCBot |
| Mistral AI | MistralAI-User |
| DeepSeek | DeepSeekBot |
| xAI | GrokBot |
| Cohere | cohere-ai |
| You.com | YouBot |
| Allen AI | AI2Bot |
| Other | Diffbot, Omgili, Timpibot, ImagesiftBot, Scrapy |
New bots are added to the detector as providers ship them. If a bot starts hitting your site that Trakkr doesn't recognize yet, the dashboard surfaces the raw user agent so you can flag it.
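The most-specific-first ordering matters because many bot names share substrings. A minimal sketch of how such a matcher can work - the patterns here are illustrative, and the real detector covers 60+ bots:

```ts
// Minimal sketch of most-specific-first UA matching (illustrative patterns).
interface BotPattern {
  name: string;
  pattern: RegExp;
}

// Ordered most-specific-first: "ChatGPT-User" must be tested before any
// generic "ChatGPT" pattern, or every ChatGPT-User visit would be misfiled.
const PATTERNS: BotPattern[] = [
  { name: "ChatGPT-User", pattern: /ChatGPT-User/i },
  { name: "OAI-SearchBot", pattern: /OAI-SearchBot/i },
  { name: "GPTBot", pattern: /GPTBot/i },
  { name: "Claude-SearchBot", pattern: /Claude-SearchBot/i },
  { name: "Claude-User", pattern: /Claude-User/i },
  { name: "ClaudeBot", pattern: /ClaudeBot/i },
  { name: "PerplexityBot", pattern: /PerplexityBot/i },
];

function detectBot(userAgent: string): string | null {
  const hit = PATTERNS.find((p) => p.pattern.test(userAgent));
  return hit ? hit.name : null; // null = not a recognized AI bot
}

// detectBot("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)")
// => "GPTBot"
```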
How Trakkr captures crawler visits
There are two paths into Trakkr. They are not exclusive - you can run both - but most sites only need one.
| Path | What it is | Setup time | Best for |
|---|---|---|---|
| Tracking pixel | A lightweight JavaScript snippet you embed in your site's `<head>` | 2 minutes | Marketing sites, Webflow, Framer, anywhere you can drop a script tag |
| Server-side connection | A direct integration with your CDN, host, or log forwarder | 5-10 minutes | Production apps, server-rendered sites, high-traffic platforms behind a CDN |
Why server-side is more accurate
The pixel runs in the browser. That means it can only see crawlers that execute JavaScript - some AI crawlers do, many don't, and even those that do aren't always reliable about it. Server-side integrations see every request that hits your origin, including bots that never run JS.
If you have the option, server-side is the better choice. It catches:
- Crawlers that don't execute JavaScript at all
- Crawlers blocked by your CDN or WAF before they reach the page
- Crawlers that 404 on URLs that no longer exist
- Crawlers that hit non-HTML resources (PDFs, sitemaps, RSS feeds)
You can run both at once. Trakkr deduplicates events by hash, so connecting Cloudflare and installing the pixel won't double-count visits.
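The exact fields Trakkr hashes aren't documented here, but the dedup idea is simple: derive the same key from the same visit no matter which source reported it. A sketch, assuming the key covers bot, URL, and a coarse timestamp:

```ts
import { createHash } from "node:crypto";

// Hypothetical dedup key: if the pixel and a platform connection report the
// same visit, hashing the same fields yields the same key, so one is dropped.
// The actual fields Trakkr hashes are assumptions here.
function eventHash(bot: string, url: string, timestampMs: number): string {
  const bucket = Math.floor(timestampMs / 1000); // collapse to 1s resolution
  return createHash("sha256")
    .update(`${bot}|${url}|${bucket}`)
    .digest("hex");
}

const seen = new Set<string>();

function recordVisit(bot: string, url: string, ts: number): boolean {
  const key = eventHash(bot, url, ts);
  if (seen.has(key)) return false; // duplicate from the other source
  seen.add(key);
  return true;
}
```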
The five server-side integrations
| Platform | Auth | Realtime | Notes |
|---|---|---|---|
| Cloudflare | API token | No | Reads Cloudflare's GraphQL analytics with a scoped read-only token. Works on every Cloudflare plan. |
| Vercel | OAuth | Yes | Installs a Log Drain on your Vercel project. Requires Vercel Pro or Enterprise. |
| Netlify | OAuth | Yes | Deploys an Edge Function that streams crawler visits. |
| WordPress | Existing adapter | No | Uses the Trakkr WordPress plugin you already have on connected sites. |
| Manual / Custom CDN | Webhook | Yes | Point any log forwarder at your unique webhook URL. Works with Akamai, Fastly, CloudFront, custom servers. |
Step-by-step guides for each platform live in Crawler Tracking.
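For the Manual / Custom CDN path, anything that can read access logs and POST JSON works. A rough sketch of a forwarder - the payload field names here are assumptions; follow the shape your webhook setup tile specifies:

```ts
// Rough sketch of a log forwarder for the Manual / Custom CDN path.
// Field names are illustrative, not a documented payload contract.
interface CrawlerEvent {
  userAgent: string;
  url: string;
  status: number;
  method: string;
  timestamp: string;
}

async function forward(event: CrawlerEvent, webhookUrl: string): Promise<void> {
  const res = await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(event),
  });
  if (!res.ok) {
    // Don't lose events on transient failures; queue and retry in production.
    console.error(`Webhook rejected event: ${res.status}`);
  }
}
```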
Setting it up
Option A: Install the pixel (fastest)
1. Open Crawler in the sidebar
2. In the empty state, click Install tracking
3. Copy the script tag and paste it into your site's `<head>` element
4. Deploy your site
5. Click Send Verification in Trakkr - this fetches your homepage as `GPTBot` and confirms the detection pipeline works
You should see a "Verified ✓" event appear in the Feed within about 30 seconds. Real crawler visits will start arriving as soon as bots discover your site.
Option B: Connect your platform
1. Open Crawler in the sidebar
2. Click Connect platform and choose your host
3. Follow the platform-specific flow:
   - Cloudflare - paste a scoped API token
   - Vercel - OAuth + Log Drain
   - Netlify - OAuth + Edge Function
   - WordPress - install the Trakkr plugin
   - Next.js / CloudFront / Node / Nginx / Akamai / Fastly / Other - use the webhook setup tile that matches your runtime
4. Trakkr backfills recent history automatically and starts syncing new visits
The dashboard, explained
Open Crawler in the sidebar. You'll land on a single-page dashboard built around four hero stats and three main tabs.
The header
- Date range - 24h, 7d, 30d, 90d, or custom. Most stats compare the current period to the previous one of equal length, so trend percentages are apples-to-apples.
- Refresh - Bypass the cache and re-query.
- Connections - Manage your platform integrations and see sync health.
The four hero stats
The most important strip on the page. Each card is clickable and filters the chart and table below.
| Stat | What it counts | What it means |
|---|---|---|
| Total Visits | Every AI bot visit in the period | Top-of-funnel awareness |
| Conversations | Visits from interaction bots (ChatGPT-User, Claude-User, etc.) | The strongest signal your page may be cited - someone is asking AI right now |
| Training | Visits from training crawlers (GPTBot, ClaudeBot, etc.) | What AI is learning for future model versions |
| Indexing | Visits from search crawlers (OAI-SearchBot, PerplexityBot, etc.) | Pre-citation activity - AI is preparing to use this content |
Each card has a sparkline showing the trend across the period and a percentage change vs the previous period. Click a card to filter the chart and table below to that category.
The activity chart
A stacked bar chart showing visits over time, broken out by bot category. Use the filter pills above the chart to isolate Live, Training, Indexing, or view All.
Look for:
- Sustained growth in Conversations - your pages are being cited more often
- Sudden Training spikes - a new model is being prepped, or a provider is doing a fresh crawl pass
- Quiet weeks - usually fine on their own, but sustained silence on a previously-active page is a warning worth investigating
- Coordinated drops across providers - almost always a robots.txt or WAF change on your end
Tab 1: Feed
A live stream of recent crawler events, newest first. Each row shows:
- Time - relative ("3m ago") plus exact timestamp on hover
- Bot - normalized name with provider logo
- Page - the URL the bot hit
- Status - HTTP status code (200, 404, 403, 500)
- Method - GET, POST, HEAD
- Verified badge - synthetic verification visits are tagged so you can filter them out
Use the view filter to switch between All, Conversations, Indexing, and Training. This is the view to keep open during launches and big content drops - it shows you AI activity in something close to real time.
Tab 2: Pages
The same data, but page-first instead of event-first. Every URL on your site that AI bots have visited, with:
- Total fetches in the period
- Unique bots that visited
- Most recent visit
- Funnel filter - sort by Most Crawled, Recently Crawled, By Citations, or By Clicks
Click any row to open the Page Drawer - the deepest view in the entire dashboard, and the next section.
Tab 3: Access
robots.txt, llms.txt, and per-page accessibility checks. Covered in detail in Access checks below.
Page intelligence: the crawl → cite → click funnel
The Page Drawer is the most important screen in the dashboard, and the one most teams underuse. It's a complete conversion funnel for a single URL, built from crawler data, citation data, click data, and the page health score.
When you click any page in the Pages tab, a drawer slides out with up to six sections.
1. Verdict
A one-paragraph synthesis of what's happening on this page. Trakkr generates this server-side from the page's crawl pattern, citation count, click data, and health score - the same context the AI Insights use.
Example: "This page is your highest-converting AI surface. It's been crawled 1,247 times by 8 distinct bots, cited 31 times in monitored prompts, and drove 89 LLM-referred clicks. Citations have grown 34% week-over-week, but page-load time is in the bottom quartile for your site - fix the LCP and you'll likely see citation gains."
2. Pipeline
The conversion funnel, visualized:
```
Crawls (1,247) ──▶ Citations (31) ──▶ Clicks (89)
              ↑ 2.5%           ↑ 287%
```

The percentages are the conversion rate from one stage to the next, compared to your site average. If a page is converting better than average, you've found a template worth copying. If it's converting worse, you've found a problem worth fixing.
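The two percentages in the example are plain stage-over-stage division - a quick check:

```ts
// Worked version of the example funnel above.
const crawls = 1247, citations = 31, clicks = 89;

const crawlToCitation = (citations / crawls) * 100; // ≈ 2.5%
const citationToClick = (clicks / citations) * 100; // ≈ 287%
```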
3. Health
Issues detected on this page that may be affecting AI visibility:
- JavaScript walls - the page renders client-side and bots see a blank shell
- Bot blocks - Cloudflare, WAF, or robots.txt is blocking the bot Trakkr tested as (synthetic checks fetch as `GPTBot`)
- 404s - the URL is being crawled but no longer exists
- Auth walls - the page requires login
- Slow LCP / poor Core Web Vitals
- Missing meta description or structured data
Each issue links to the relevant fix in Page Analysis or Actions.
4. Next Step
A single, concrete recommendation - the highest-leverage thing to do for this page. Not a list of generic best practices. One specific, actionable next step.
5. Traffic
LLM referral clicks broken down by source. When someone clicks a link in ChatGPT, Claude, Perplexity, or Gemini and lands on this page, it shows up here. Trakkr matches based on referrer URL and UTM parameters across nine sources: ChatGPT, Claude, Perplexity, Gemini, Bing Chat, You.com, Character.AI, Mistral, and Hugging Face.
| Source | Visits |
|---|---|
| ChatGPT | 42 |
| Perplexity | 28 |
| Gemini | 11 |
| Claude | 8 |
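The matching described above - referrer URL plus UTM parameters - can be sketched roughly like this. The hostname list is an assumption based on the nine sources named; Trakkr's exact rules aren't published on this page:

```ts
// Simplified sketch of LLM-referral classification by referrer hostname.
// The hostname map is an illustrative assumption, not Trakkr's actual list.
const LLM_REFERRERS: Record<string, string> = {
  "chatgpt.com": "ChatGPT",
  "claude.ai": "Claude",
  "www.perplexity.ai": "Perplexity",
  "gemini.google.com": "Gemini",
};

function classifyReferral(referrer: string, utmSource?: string): string | null {
  if (utmSource) return utmSource; // explicit UTM wins over referrer
  try {
    return LLM_REFERRERS[new URL(referrer).hostname] ?? null;
  } catch {
    return null; // empty or malformed referrer
  }
}
```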
6. Diagnostics (collapsed by default)
Raw crawler data: bot-by-bot visit counts, status code distribution, request method breakdown, last seen, first seen. Useful when you're debugging a specific anomaly.
AI-generated insights
At the top of the dashboard, the Intelligence Brief card surfaces a short list of AI-generated insights connecting your crawler data to citations, page health, and trends.
These aren't static dashboard widgets or hard-coded rules. They're written by a language model (Gemini Flash) using a context blob assembled from:
- Top 40 pages by fetch count, with citation and click data
- Bot category totals for the period and the previous period
- Status code distribution (how often bots hit 200, 404, 500)
- Page health scores from Page Analysis
- Trend deltas
The model returns 2-5 prioritized insights with headlines, explanations, affected pages, and (often) a suggested action you can navigate to. Cached for 4 hours per brand, recomputed when underlying data changes meaningfully.
Example insights:
- "Training crawlers dominating - your site is being learned but not yet retrieved." GPTBot has crawled 412 pages but no Conversation bots have appeared yet, suggesting indexing is in progress but citations haven't started.
- "Five pages are JavaScript-blocking AI bots." Lists the URLs and links straight to the access fix drawer.
- "Citation conversion outperforming average on /guides/." Your guides section converts crawls to citations 3x better than your blog - a template worth copying.
- "OAI-SearchBot returning 403 on 14 pages." WAF rule needs adjustment.
Insights have severity (info / warning / error) and link directly to the page or action that resolves them. If the LLM call fails or you don't have enough data, Trakkr falls back to a deterministic rule-based set so the brief is never empty.
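The never-empty-brief behavior is a classic LLM-with-deterministic-fallback pattern. A sketch, with the helper functions passed in as stand-ins since their real signatures aren't documented:

```ts
// Sketch of the LLM-with-fallback pattern described above.
interface Insight {
  severity: "info" | "warning" | "error";
  headline: string;
  explanation: string;
}

async function intelligenceBrief(
  context: unknown, // the assembled context blob (top pages, totals, deltas)
  callLlm: (ctx: unknown) => Promise<Insight[]>, // e.g. a Gemini Flash call
  ruleBasedInsights: (ctx: unknown) => Insight[], // deterministic fallback
): Promise<Insight[]> {
  try {
    const insights = await callLlm(context);
    if (insights.length >= 2) return insights; // model returns 2-5 items
  } catch {
    // LLM call failed - fall through to the deterministic set
  }
  return ruleBasedInsights(context); // the brief is never empty
}
```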
Access checks: can bots actually read your site?
The Access tab is a quiet but critical layer. It answers a different question than the rest of the dashboard: if a bot tried to read your content right now, could it?
robots.txt analysis
Trakkr fetches your robots.txt and parses it for AI-relevant rules. The dashboard shows:
- Per-bot allow/disallow matrix - which bots are blocked from which paths
- Common pitfalls - blocks that look intentional but probably aren't (e.g. `Disallow: /` under `User-agent: GPTBot` left over from a defensive copy-paste)
- Recommended fixes - one-click diffs you can paste back into your robots.txt
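The per-bot matrix falls out of grouping Disallow rules by User-agent. A minimal parser sketch - real robots.txt handling also needs Allow precedence, wildcards, and blank-line group boundaries:

```ts
// Minimal robots.txt parser sketch: groups Disallow rules by User-agent.
function parseRobots(txt: string): Map<string, string[]> {
  const rules = new Map<string, string[]>();
  let agents: string[] = [];
  let inAgentRun = false;
  for (const raw of txt.split("\n")) {
    const line = raw.split("#")[0].trim();
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === "user-agent") {
      if (!inAgentRun) agents = []; // a new group of user-agents begins
      inAgentRun = true;
      agents.push(value);
      if (!rules.has(value)) rules.set(value, []);
    } else {
      inAgentRun = false;
      if (field === "disallow" && value !== "") {
        for (const a of agents) rules.get(a)!.push(value);
      }
    }
  }
  return rules;
}

// parseRobots("User-agent: GPTBot\nDisallow: /")
// => Map { "GPTBot" => ["/"] }
```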
llms.txt analysis
llms.txt is a new convention for explicitly telling AI crawlers what content you'd like them to use, what to skip, and where to find your most authoritative pages. It's the AI equivalent of a sitemap, but written for models, not search engines.
If you have an llms.txt, Trakkr parses it. If you don't, Trakkr offers a generated starter file based on your site structure and citation history.
Page visibility detection
For every page Trakkr knows about, it can run a synthetic fetch as GPTBot and inspect the response. The detector flags:
| Status | Meaning |
|---|---|
| Visible | The bot can read full content - no flag |
| JS-dependent | Bot received a near-empty shell - content needs JavaScript to render |
| Bot-blocked | 403, Cloudflare challenge, or CAPTCHA |
| Not found | 404 |
| Auth required | Login wall |
| Unknown | Network error - can't determine |
Any non-visible status is a leak. JS-dependent pages are especially dangerous because they look fine to humans but are invisible to most AI crawlers. The dashboard groups findings by type, lets you dismiss false positives, and links every issue to a recommended fix.
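A check like this can be approximated with a single fetch. A sketch - the text-length threshold for flagging a JS shell is an assumed heuristic, not Trakkr's documented logic:

```ts
// Sketch of a synthetic visibility probe: fetch as GPTBot, classify the
// response like the table above. Thresholds here are assumptions.
type Visibility =
  | "visible" | "js-dependent" | "bot-blocked"
  | "not-found" | "auth-required" | "unknown";

async function checkVisibility(url: string): Promise<Visibility> {
  try {
    const res = await fetch(url, {
      headers: {
        "User-Agent":
          "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
      },
      redirect: "follow",
    });
    if (res.status === 404) return "not-found";
    if (res.status === 401) return "auth-required";
    if (res.status === 403) return "bot-blocked"; // WAF challenge, CAPTCHA
    const html = await res.text();
    // Heuristic: a near-empty body after stripping tags suggests a JS shell.
    const textLength = html.replace(/<[^>]*>/g, "").trim().length;
    if (textLength < 200) return "js-dependent";
    return "visible";
  } catch {
    return "unknown"; // network error - can't determine
  }
}
```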
Alerts and monitoring
Crawler activity changes constantly, and watching the dashboard manually is not a workflow. Trakkr fires four kinds of automated alerts:
| Alert | When it fires | Why it matters |
|---|---|---|
| First seen | A new bot crawls your site for the first time | A new AI provider is interested - the earliest possible signal you're entering a new model's index |
| Silent | A bot that was previously active goes quiet for N days | Something may be blocking access (WAF rules, DNS, robots.txt change) |
| Spike | A bot's volume jumps significantly above its baseline | Either a model retraining cycle or a product launch citing you heavily |
| Error surge | A bot starts hitting 4xx or 5xx more often | Pages broke, redirects misfire, or your CDN started rate-limiting |
Alerts are deduplicated per (brand, bot, page) so you don't get flooded. They feed into Workflows where you can route them to Slack, email, Linear, or any other destination.
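Spike detection and the dedup key can both be sketched in a few lines. The 3x-over-baseline multiplier here is an assumption; Trakkr doesn't publish its threshold:

```ts
// Rough sketch of spike detection: compare today's visit count for a bot
// against its trailing baseline. The 3x multiplier is an assumption.
function isSpike(dailyCounts: number[], today: number, multiplier = 3): boolean {
  if (dailyCounts.length === 0) return false; // no baseline yet
  const baseline =
    dailyCounts.reduce((sum, n) => sum + n, 0) / dailyCounts.length;
  return baseline > 0 && today > baseline * multiplier;
}

// Dedup key per (brand, bot, page) so one incident fires one alert.
function alertKey(brand: string, bot: string, page: string): string {
  return `${brand}:${bot}:${page}`;
}

// isSpike([12, 9, 15, 11], 60) => true (60 > 3 * 11.75)
```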
How crawler data connects to other Trakkr features
AI Crawlers isn't a standalone tool. It's a live ground-truth layer that makes the rest of Trakkr smarter.
| Feature | How it uses crawler data |
|---|---|
| Citations | Cross-references which cited pages are actually being crawled - reveals citations from pages AI never actively retrieves |
| Optimize Site | Shows crawler activity on every audited page so you can prioritize fixes by AI traffic, not vanity scores |
| Actions | The intelligence brief generates Action items connected directly to crawler findings |
| Workflows | Crawler alerts can trigger any workflow - Slack pings, content updates, ticket creation |
| Live Visitors | Cross-references LLM-referred clicks against the pages being crawled, so you see the full crawl-to-click chain |
| AI Pages | Serves bot-optimized HTML to crawlers without changing what humans see |
| Reports | Crawler activity is included in stakeholder reports as a leading indicator alongside visibility scores |
The optimization playbook
Once crawler data is flowing, here's the order to work on it.
1. Make sure bots can read your site
Open the Access tab. Fix any pages flagged as JS-dependent, bot-blocked, 404, or auth-required. This is table stakes - without it, nothing else matters.
2. Don't block in robots.txt
Check your robots.txt for AI bot blocks. The most common mistake is blanket `Disallow: /` rules left over from a defensive copy-paste:

```
# DON'T do this if you want AI visibility
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

Unless you have a real reason to opt out of training, allow these bots. If you want to be more permissive without naming bots individually:
```
# Allow all AI crawlers
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
```

3. Add an llms.txt
Even a minimal llms.txt pointing to your most authoritative pages is a meaningful signal. The Access tab generates a starter file you can use as a base.
```
# llms.txt
# A guide to the content on this site for LLMs.

> A short description of what you do and who it's for.

## Docs
- [Quick start](https://yoursite.com/docs/quickstart): Get up and running in 5 minutes
- [API reference](https://yoursite.com/docs/api): Complete endpoint documentation

## Guides
- [Best practices](https://yoursite.com/guides/best-practices): How experienced teams use the product
```

4. Find your converting pages and copy them
In the Pages tab, sort by Citations or Clicks. The pages at the top are your AI-visibility templates. What do they have in common - structure, length, schema, internal links? Apply that pattern to similar pages that aren't converting.
5. Watch for First Seen alerts
When a new bot first crawls your site, you have a window. Make sure your most important pages are reachable, fast, and well-structured before that bot's index becomes a citation source.
6. Use the Page Drawer for prioritization
For any page in your top 20 by traffic, open the drawer and look at the funnel. If crawls are high but citations are zero, the issue is content quality or structure. If citations are high but clicks are zero, the issue is snippet quality or attribution.
7. Verify, don't trust
When something looks anomalous - a sudden silence, a spike, a new bot - click Send Verification. Trakkr will fetch your homepage as GPTBot and confirm the detection pipeline is healthy. If verification works but real crawls are missing, the issue is upstream of Trakkr (CDN, robots.txt, DNS).
FAQ
Q: Do I need both the pixel and a server-side connection?
No. Pick whichever is easiest to set up. Server-side is more accurate; the pixel is faster to install. You can run both - Trakkr deduplicates events - but most sites won't need to.
Q: How long until I see my first crawler visit?
If you install the pixel and click Send Verification, you'll see a synthetic visit within 30 seconds. Real crawler visits depend on how often AI bots already visit your site - usually within hours for active domains, sometimes a day or two for newer sites.
Q: Why does my Conversations count look low?
Conversation bots (ChatGPT-User, Claude-User) only fire when a real user asks a question that AI thinks your page can answer. Low Conversations means you're not yet a frequent answer. Focus on your Indexing count first - if AI is preparing to cite you, citations and conversations will follow.
Q: Are crawler visits the same as visibility?
No. Crawls mean AI is aware of you. Citations mean AI is recommending you. Clicks mean AI is sending you traffic. The Page Drawer's pipeline view shows the conversion at each stage. High crawls + low citations = a quality or structure problem on the page.
Q: Can I block specific bots?
Yes, via your robots.txt (the standard way) or via your CDN (e.g. Cloudflare bot management). Trakkr will show which bots are blocked in the Access tab. Blocking reduces your AI visibility - do it only with intent.
Q: What user agents are tracked?
60+ at last count, including every major AI provider and the long tail of search and research crawlers. The detector is shared between the tracking pixel and the backend, so the pixel and platform connections always agree on what counts as an AI bot.
Q: How far back does the data go?
Trakkr stores 90 days of detailed visit history. Daily aggregates are kept indefinitely so trend charts cover the full lifetime of the connection.
Q: Does it work behind Cloudflare?
Yes. The Cloudflare integration is the most accurate setup for sites behind Cloudflare - it reads from Cloudflare's analytics API directly, so it sees every bot Cloudflare logs, even ones blocked at the WAF.
Q: What about white-label client portals?
Crawler data is brand-scoped. Each connected brand has its own crawler dashboard, and the white-label portal exposes the same dashboard with portal branding. There's no Trakkr branding leak.
Q: How is this different from Google Search Console?
GSC tracks Google's traditional search crawling and indexing signals. AI Crawlers tracks 60+ bots from every provider and classifies them by intent. They complement each other - GSC for traditional search, AI Crawlers for AI-specific retrieval, training, and user-triggered traffic.
Q: Can I export the data?
Yes. From the Crawler page you can export filtered visits to CSV, and the Reports feature includes crawler activity in stakeholder PDFs. The full dataset is also available via the Crawler API endpoint.
Q: Is the LLM-generated insight the same as Actions?
Related but not identical. The Intelligence Brief is real-time analysis of your crawler data; the Actions feature pulls from a broader signal set including citations, narratives, and page health. Many actions originate from crawler insights, but actions live longer and persist across dashboard sessions.
Q: What plan do I need?
Crawler tracking is available on Growth and Scale plans. Free brands can install the pixel and see live verification, but historical dashboards, intelligence, and access analysis require an upgrade.
Next steps
Set up Crawler Tracking
Connect Cloudflare, Vercel, Netlify, Next.js, CloudFront, WordPress, Node, Nginx, or webhook edge sources.
Citations
See where AI cites you - and which crawled pages they came from.
AI Pages
Serve clean, AI-readable content to crawlers automatically.