Study 008

The Anatomy of an AI Citation

We crawled 1,465 AI-cited pages across 950 domains to understand what the most-cited pages on the web have in common.

1,465 AI-cited pages crawled
68% have schema markup
+45% citation lift from FAQ schema
3x more words than web average

Last updated: April 16, 2026

We crawled 1,465 pages across 950 domains that ChatGPT, Perplexity, and Gemini actively cite, drawing from 28,033+ citation appearances. For each page, we extracted its schema markup, content structure, and technical metadata - then measured how each feature over- or under-indexes relative to web averages from the HTTP Archive.

These are correlational findings. They describe what AI-cited pages look like, not necessarily why they were chosen. Where sample sizes are small or confounds are present, we say so.

[01]

The FAQ Schema Effect

+45% more citations

Pages with FAQPage schema average 45% more citation appearances than pages with no FAQ signal. Pages with FAQ content but no corresponding schema fall in between, suggesting the markup adds signal beyond the content pattern alone - though FAQ schema pages also tend to be substantially longer, which may partially explain the lift.

n=23 pages with FAQ schema. Early signal - sample size warrants caution.

FAQPage shows the strongest positive association with citation volume in the dataset. Most other schema types appear more often on cited pages but don't predict higher citation counts within them.

Avg Citations by FAQ Signal
FAQ Schema + FAQ Content: 36.9 avg citations (n=23)
FAQ Content Only: 27.2 avg citations (n=161)
No FAQ Signal: 25.4 avg citations (n=269)
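The +45% headline falls straight out of the group means above; a minimal sketch reproducing it (figures copied from the chart, nothing else assumed):

```python
# Average citation appearances per group, as reported in the chart above.
avg_citations = {
    "faq_schema_and_content": 36.9,  # n=23
    "faq_content_only": 27.2,        # n=161
    "no_faq_signal": 25.4,           # n=269
}

def lift_vs_baseline(group: str, baseline: str = "no_faq_signal") -> float:
    """Percent lift of a group's average over the no-FAQ baseline."""
    return (avg_citations[group] / avg_citations[baseline] - 1) * 100

print(f"{lift_vs_baseline('faq_schema_and_content'):.0f}%")  # the headline figure
print(f"{lift_vs_baseline('faq_content_only'):.0f}%")        # content without markup
```

The in-between position of content-only pages (roughly +7%) is what motivates the claim that the markup carries signal beyond the content pattern.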
[02]

Schema Types on Cited Pages

Schema Adoption
68% of AI-cited pages have schema markup

vs ~38.5% web average (Web Almanac 2024)

AI-cited pages are nearly twice as likely to have structured data as the web average. The standout types are Person (author attribution), ImageObject, and NewsArticle - each appearing 8-9x more frequently on cited pages than across the web at large.

Web averages from HTTP Archive / Web Almanac 2024. Over-representation indicates the kind of pages AI models cite, not a direct causal effect. Of all types measured, only FAQPage independently correlates with higher citation frequency (Section 01).

Schema Type Lift vs Web Average
Person: 9.4x (18.9% cited vs 2.0% web)
ImageObject: 8.9x (21.4% cited vs 2.4% web)
NewsArticle: 8.7x (10.4% cited vs 1.2% web)
SoftwareApplication: 8.0x (2.4% cited vs 0.3% web)
Service: 6.5x (1.3% cited vs 0.2% web)
BreadcrumbList: 5.2x (37.7% cited vs 7.3% web)
WebPage: 5.1x (29.3% cited vs 5.8% web)
BlogPosting: 4.8x (8.1% cited vs 1.7% web)
ItemList: 4.4x (4.4% cited vs 1.0% web)
WebSite: 4.3x (33.0% cited vs 7.7% web)
Organization: 4.1x (31.5% cited vs 7.6% web)
Article: 3.8x (24.4% cited vs 6.5% web)
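Each lift figure is simply the ratio of adoption among cited pages to adoption web-wide; a small sketch using a few rows from the chart above (percentages copied from the chart):

```python
# Adoption rates (% of pages carrying each schema type): cited sample vs web average.
adoption = {
    "Person":      (18.9, 2.0),
    "ImageObject": (21.4, 2.4),
    "NewsArticle": (10.4, 1.2),
    "Article":     (24.4, 6.5),
}

def lift(cited_pct: float, web_pct: float) -> float:
    """How many times more common a type is on cited pages than on the web."""
    return round(cited_pct / web_pct, 1)

# Rank types by over-representation, mirroring the chart's ordering.
for schema_type, (cited, web) in sorted(
    adoption.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True
):
    print(f"{schema_type}: {lift(cited, web)}x ({cited}% cited vs {web}% web)")
```

Note that lift measures over-representation in the sample, not citation volume; as Section 01 notes, only FAQPage predicts higher citation counts within cited pages.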

[03]

Light Schema Outperforms Heavy

30.5 avg citations for light schema pages

Pages with light schema implementation are cited more frequently than pages with heavy, complex markup. Beyond a modest threshold, additional structured data shows diminishing - and eventually negative - returns.

Focus beats thoroughness. The lightest schema tier consistently earns the most citations in this dataset. Heavier markup shows no benefit - though whether this reflects a preference by AI models or simply the characteristics of high-performing pages is an open question.

Avg Citations by Schema Tier
No Schema: 24.1 avg citations (n=146, 1,843.4 avg words)
Light: 30.5 avg citations (n=135, 2,551.6 avg words)
Medium: 26.6 avg citations (n=89, 2,310 avg words)
Rich: 24.8 avg citations (n=72, 2,646.6 avg words)
Very Rich: 23.7 avg citations (n=12, 2,478.6 avg words)
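The tiers above bucket pages by how many JSON-LD fields they carry. A minimal sketch of that bucketing: the study pegs "light" at 1-20 fields (Section 06), but the higher cutoffs below are illustrative assumptions, not the study's definitions.

```python
import json

def count_fields(node) -> int:
    """Recursively count every key across a parsed JSON-LD structure."""
    if isinstance(node, dict):
        return len(node) + sum(count_fields(v) for v in node.values())
    if isinstance(node, list):
        return sum(count_fields(v) for v in node)
    return 0

def schema_tier(jsonld_blocks: list[str]) -> str:
    """Classify a page's schema footprint by total field count.
    Only the 1-20 'light' band comes from the study; the higher
    cutoffs are assumed for illustration."""
    total = sum(count_fields(json.loads(block)) for block in jsonld_blocks)
    if total == 0:
        return "No Schema"
    if total <= 20:
        return "Light"
    if total <= 60:   # assumed cutoff
        return "Medium"
    if total <= 120:  # assumed cutoff
        return "Rich"
    return "Very Rich"

faq_block = json.dumps({
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{"@type": "Question", "name": "What is lift?"}],
})
print(schema_tier([faq_block]))  # a small block lands in the Light tier
```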
[04]

The Citation Blueprint

Ten on-page features compared between the top 10% most-cited pages and the bottom 50%. The differences are narrower than you might expect - these are structural properties, not external signals like backlinks or domain authority.

Feature: top 10% vs bottom 50% (difference)
Has Any Schema: 80.0% vs 65.6% (+14.4 pts)
Article Schema: 37.8% vs 23.3% (+14.5 pts)
FAQ Schema: 11.1% vs 5.3% (+5.8 pts)
Person Schema: 17.8% vs 19.4% (-1.6 pts)
Word Count: 2,521.1 vs 2,304.7 (+216.4)
Total Headings: 33.7 vs 31.4 (+2.3)
List Items: 146.9 vs 120.3 (+26.6)
Has Tables: 40.0% vs 28.2% (+11.8 pts)
Has FAQ Content: 42.2% vs 38.3% (+3.9 pts)
Has How-To Content: 73.3% vs 70.0% (+3.3 pts)
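The tier comparison reduces to ranking pages by citation count, slicing off the top decile and bottom half, and averaging each feature within the slices. A sketch on made-up records (the per-page data below is hypothetical, not the study's):

```python
from statistics import mean

# Hypothetical per-page records: citation count plus one structural feature.
pages = [
    {"citations": c, "word_count": w}
    for c, w in [(218, 2937), (111, 5785), (82, 3834), (75, 553),
                 (40, 2100), (25, 1900), (12, 1400), (8, 900),
                 (5, 2300), (3, 700)]
]

# Rank by citations, then slice into the comparison tiers.
by_citations = sorted(pages, key=lambda p: p["citations"], reverse=True)
top_10pct = by_citations[: max(1, len(by_citations) // 10)]
bottom_50pct = by_citations[len(by_citations) // 2 :]

top_avg = mean(p["word_count"] for p in top_10pct)
bot_avg = mean(p["word_count"] for p in bottom_50pct)
print(f"top 10%: {top_avg:.1f} words, bottom 50%: {bot_avg:.1f} words")
```

The same averaging is repeated per feature (schema flags, headings, lists, tables) to produce the comparison above.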

Schema adoption shows the clearest separation between tiers. Content structure features - headings, FAQ patterns, how-to content - are remarkably similar, suggesting the baseline quality among AI-cited pages is already high.

[05]

Most-Cited Pages

The most frequently cited pages in the dataset. Note both the pattern and the exceptions: most have focused schema, but several of the highest-ranked pages have none at all.

Domain                Citations   Words   Schema Types
softwarefinder.com    218         2,937   Corporation
rankmyagent.com       174         1,461   FAQPage, RealEstateAgent, ItemList
collegenet.com        123         808     WebPage, BreadcrumbList, VideoObject
dotcom-monitor.com    111         5,785   BreadcrumbList, Person, WebSite, +4 more
runnersworld.com      82          3,834   NewsArticle, ItemList
g-co.agency           80          2,558   None
iiba.org              80          2,806   None
milanote.com          79          1,111   HowTo
offers.hubspot.com    75          553     None
dash.dropbox.com      75          1,474   MobileApplication, SoftwareApplication, Organization, +2 more
nokia.com             72          1,771   BreadcrumbList
ehrinpractice.com     72          1,832   None
skyquestt.com         71          2,993   WebPage, ItemList
readycontacts.com     70          1,857   Person, Article
[06]

Key Takeaways

01

Schema is infrastructure, not advantage

68% of AI-cited pages have structured data - nearly double the web average. But most schema types don't predict citation volume; they describe the kind of pages AI happens to cite. Having schema is common among cited pages, not a differentiator within them.

02

FAQ schema is the exception - with caveats

Pages with FAQPage schema average 45% more citations than pages with no FAQ signal. But the sample is small (n=23), and these pages also tend to be substantially longer. A real association, not yet a proven cause.

03

Focused markup beats comprehensive markup

Pages with light schema (1-20 fields) earn the most citations in the dataset. Heavier implementations show diminishing returns. Schema complexity doesn't help; content quality might.

04

Content depth is the likely foundation

AI-cited pages average 2,289.6 words - 3x the typical web page. When comparing the top 10% to the bottom 50% of cited pages, structural differences are modest. Substance appears to matter more than any single on-page signal.

[07]

Methodology

Data Sources

Citation Data

Top-cited URLs from Trakkr's citation tracking system, drawn from 28,000+ citation appearances across 950 domains. Each URL was crawled live to extract JSON-LD structured data, content characteristics (word count, headings, lists, tables, FAQ patterns), and technical metadata.
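A stdlib-only sketch of the extraction step described above: pull JSON-LD blocks out of fetched HTML and list the schema types they declare. The crawler itself, and any pipeline around this, is assumed; this shows only the parsing idea.

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self.blocks: list[str] = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

def schema_types(html: str) -> set[str]:
    """Return every @type declared in a page's JSON-LD, top level or nested."""
    parser = JSONLDExtractor()
    parser.feed(html)
    types: set[str] = set()

    def walk(node):
        if isinstance(node, dict):
            t = node.get("@type")
            if isinstance(t, str):
                types.add(t)
            for v in node.values():
                walk(v)
        elif isinstance(node, list):
            for v in node:
                walk(v)

    for block in parser.blocks:
        try:
            walk(json.loads(block))
        except json.JSONDecodeError:
            pass  # malformed markup counts as absent, not fatal
    return types

page = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "FAQPage",
 "mainEntity": [{"@type": "Question", "name": "Does schema help?"}]}
</script></head><body>...</body></html>"""
print(schema_types(page))  # the declared types: FAQPage and Question
```

This mirrors the "presence, not quality" caveat below: the extractor records that a type is declared, not that it validates.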

Benchmarks

Web average benchmarks come from the HTTP Archive / Web Almanac 2024. The sample skews toward B2B, SaaS, and DTC brands that use Trakkr - findings are most directly applicable to those verticals.

Limitations

Small FAQ Sample (n=23). The FAQ schema finding is an early, actionable signal rather than a definitive causal claim. Larger samples will sharpen the estimate.
No Non-Cited Control Group. We compare cited pages to general web averages, not to comparable pages that were not cited. Differences may reflect page quality rather than features AI specifically selects for.
Presence, Not Quality. Schema adoption percentages measure whether structured data exists on a page, not whether it is correctly implemented or validated.

Are your pages built to get cited?

See where your pages stand on the metrics that matter: schema markup, FAQ structure, content depth, and citation frequency across ChatGPT, Perplexity, and Gemini.
