How we measure it.
Everything behind the numbers is here: how we ask the questions, how we classify the answers, how we score them, and where to download the raw data and check it yourself.
What makes this measurable
Each of these is built into the site, where you can see it.
Models are stochastic, so a single run tells you little. We run each item many times and plot the spread; how tight that cloud is becomes a finding in itself.
We author original value statements and publish our axis weights, rather than copying any proprietary instrument with unpublished scoring.
Some questions have factual answers, others are pure values. Only values items feed the political axes; factual items get an accuracy score against expert consensus.
Models are stochastic, so we run each item many times. How little the stance moves across identical reruns is the model's stability score.
A refusal is information in its own right. We record the kind of refusal and count it.
Every answer carries model id, version, date, temperature, condition, language, location and run index.
The question bank, the classifier prompt, the raw answers and a read API are all public, so you can audit it yourself.
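As a rough illustration of the repeated-run idea above, the sketch below summarises one item's reruns as a centre and a spread. It assumes each run's stance is a signed number between -1 and 1; the function name and the choice of population standard deviation as the spread are illustrative assumptions, not the published scoring formula.

```python
from statistics import mean, pstdev


def summarise_reruns(stances):
    """Return (centre, spread) for one item's repeated runs.

    `stances` holds one signed stance value per rerun (assumed scale: -1 to 1).
    A smaller spread means the model answered more consistently.
    """
    if not stances:
        raise ValueError("need at least one run")
    return mean(stances), pstdev(stances)


# Four hypothetical reruns of the same item.
centre, spread = summarise_reruns([0.4, 0.5, 0.3, 0.4])
```

A tight cloud gives a spread near zero; a model that flips between reruns gives a large one.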
The model profile
Four axes per model, rather than a single point.
The conditions
What each experiment isolates, and when it ships.
| Condition | Isolates | Web search | Status |
|---|---|---|---|
| Raw weights | The trained leaning of the weights, independent of the internet. | off | Live |
| Language | Whether the same weights answer differently by language. | off | Live |
| System prompt | How much politics is the company's instructions versus the weights. | off | Live |
| Border test | How retrieval shifts answers by where you appear to stand. | on | Live |
| Steerability | Sycophancy: how far it bends when told who it is talking to. | off | Live |
Web search is off everywhere except the Border Test: location only changes which sources get retrieved, so it is only a meaningful experiment with search on.
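One way to picture the table is as a small configuration map with a check for the web-search rule. This is only a sketch: the keys, field names and the `validate` helper are hypothetical, not the project's actual configuration.

```python
# Hypothetical encoding of the conditions table; names are illustrative.
CONDITIONS = {
    "raw_weights": {"web_search": False, "status": "live"},
    "language": {"web_search": False, "status": "live"},
    "system_prompt": {"web_search": False, "status": "live"},
    "border_test": {"web_search": True, "status": "live"},
    "steerability": {"web_search": False, "status": "live"},
}


def search_enabled(condition):
    """Return True if the named condition runs with web search on."""
    return CONDITIONS[condition]["web_search"]


def validate(conditions):
    """Check the stated rule: only the Border Test runs with search on."""
    with_search = sorted(
        name for name, cfg in conditions.items() if cfg["web_search"]
    )
    if with_search != ["border_test"]:
        raise ValueError(f"unexpected search-enabled conditions: {with_search}")


validate(CONDITIONS)  # passes silently for the table as published
```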
Reasoning is off on every model. A thinking pass would measure a deliberated essay rather than the default consumer answer, and it multiplies the cost. Each model runs at its default temperature, with reasoning disabled through that vendor's own setting, so identical reruns genuinely vary, and that variation is what the stability score measures. Gemini 3.5 Flash runs at a thinking budget of zero, fully off, so there is no minimal-reasoning exception: the whole roster is held to the same line, and the exact setting is stamped on every answer.
System prompts
The headline reading carries no system prompt at all: every model answers from its raw weights (Condition A).
Condition C then layers each vendor's own consumer system prompt on top of the weights to see how much the company's app-layer steering moves the result. We use the published prompt where a vendor makes one public, and otherwise treat the steering as part of the weights. The measured shift, where Condition C has run, is on each model's page.
The Atlas: country, language and border
How the international view re-anchors the same models, and the reference data behind it, all derived, all attributed.
The models are never re-run; we re-anchor the same centroids to each country. Party positions are derived from the Chapel Hill Expert Survey (lrecon × galtan, mapped to our two axes); non-European parties are placed from documented policy on the same scale, with V-Dem for the democratic context.
"Left of 81% of Americans" models each country's population as a normal distribution on our two axes, fitted from World Values Survey Wave 7 and comparative-survey data. We publish derived summary statistics only, never the microdata, which the licence forbids redistributing.
The twenty hottest questions are translated once into five more languages and re-asked with no web search. The classifier codes each answer against the same English framing, so a model's stance stays comparable across languages; whatever moves is the model, not the scale.
Contested-territory questions are asked with web search on from five vantage locations. The vantage is conveyed in the prompt for every vendor (Gemini's grounding silently drops the API location parameter), and we capture both the answer and the citation set each vantage pulled.
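The "Left of 81% of Americans" style of reading in this section can be sketched for a single axis with a normal model of the population. The numbers below are made up for illustration, the function name is hypothetical, and the sketch assumes lower values mean further left; the published figures use both axes and the survey-derived parameters.

```python
from statistics import NormalDist


def share_to_the_right(model_position, pop_mean, pop_sd):
    """Share of a normally distributed population sitting to the right of the
    model on one axis, i.e. the share the model is "left of"."""
    population = NormalDist(mu=pop_mean, sigma=pop_sd)
    return 1.0 - population.cdf(model_position)


# Hypothetical inputs: population centred at 0 with unit spread,
# model positioned 0.878 standard deviations to the left.
share = share_to_the_right(-0.878, pop_mean=0.0, pop_sd=1.0)
print(f"Left of {share:.0%} of the population")  # about 81% for these inputs
```

Because only the mean and standard deviation are needed, a reading like this can be reproduced from derived summary statistics without access to the microdata.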
The question bank
Our own open bank of value statements, with published weights.
The classifier
A cheap, neutral model turns every raw answer into structured markers.
Every stored raw answer is read by a low-cost classifier that pulls out a signed stance, how strongly it commits, the kind of refusal, the hedge count, the loaded terms it chose, the moral foundations it leaned on, and any praise-versus-criticism asymmetry. It never judges whether the answer is right. Because the raw answers are kept permanently and the markers can be recomputed, any new marker we add next year backfills across all the history.
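The marker set described above could be held in a record like the one below, with backfilling expressed as re-running a classifier over the stored raw answers. All field names, the `backfill` helper and the toy classifier are assumptions made for illustration; the real schema is whatever the public classifier prompt and API define.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Markers:
    stance: Optional[float]          # signed stance; None when there is no stance
    commitment: float                # how strongly the answer commits
    refusal_kind: Optional[str]      # kind of refusal, None if answered
    hedge_count: int
    loaded_terms: list[str] = field(default_factory=list)
    moral_foundations: list[str] = field(default_factory=list)
    praise_criticism_asymmetry: Optional[float] = None


def backfill(raw_answers: dict[str, str],
             classify: Callable[[str], Markers]) -> dict[str, Markers]:
    """Recompute markers for every stored raw answer with a given classifier."""
    return {answer_id: classify(text) for answer_id, text in raw_answers.items()}


def toy_classifier(text: str) -> Markers:
    """Stand-in classifier that only counts two hedge words."""
    hedges = sum(text.lower().count(word) for word in ("perhaps", "arguably"))
    return Markers(stance=0.0, commitment=0.5, refusal_kind=None, hedge_count=hedges)


markers = backfill({"a1": "Perhaps, and arguably so.", "a2": "Yes."}, toy_classifier)
```

Keeping the raw text as the source of truth is what makes this work: a new marker is just a new field plus another pass over the same stored answers.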
The classifier has its own lean. So we run a second judge from a different lab on a sample of answers and publish where the two disagree. The classifiers don't fully agree on how biased the models are, and we show exactly where.
Primary judge: deepseek-v4-flash. Second judge: gemini-3.5-flash (a different lab), which re-scored 800 answers; both judges gave a stance on 639 of them. A higher bar means the two labs read that model's answers more differently.
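A per-model disagreement figure of this kind might be computed as below. The sketch assumes mean absolute difference between the two judges' stances as the metric, and skips answers where either judge gave no stance; both the metric and the row layout are assumptions, not the published method.

```python
from collections import defaultdict


def disagreement_by_model(rows):
    """rows: iterable of (model_id, primary_stance, second_stance).

    A stance of None means that judge gave no stance; such rows are skipped.
    Returns {model_id: mean absolute difference between the two judges}.
    """
    totals = defaultdict(lambda: [0.0, 0])  # model_id -> [sum of |diff|, count]
    for model_id, primary, second in rows:
        if primary is None or second is None:
            continue
        totals[model_id][0] += abs(primary - second)
        totals[model_id][1] += 1
    return {model: total / count for model, (total, count) in totals.items()}


# Hypothetical scores for two models.
scores = disagreement_by_model([
    ("model-a", 0.5, 0.25),
    ("model-a", 0.0, 0.25),
    ("model-a", None, 0.75),   # skipped: primary judge gave no stance
    ("model-b", -0.5, -0.5),
])
```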
Open data
Everything here is ours, and fully open under CC BY 4.0.
Cite this
Each reading is frozen on Zenodo with a permanent DOI, so it can be cited in academic work.
@dataset{trakkr_bias_2026_06,
author = {Grenfell, Mack and {Trakkr}},
title = {The Trakkr Bias Index: where major AI models stand on political questions (2026-06 reading)},
year = {2026},
month = jun,
publisher = {Zenodo},
version = {2026.06},
doi = {10.5281/zenodo.20703655},
url = {https://doi.org/10.5281/zenodo.20703655},
note = {Concept DOI 10.5281/zenodo.20703654 always resolves to the latest reading}
}
To always cite the most recent reading, use the concept DOI 10.5281/zenodo.20703654, which resolves to whichever reading is newest.
| Reading | DOI | Coverage | Downloads |
|---|---|---|---|
| 2026-06 v2026.06 | 10.5281/zenodo.20703655 | 6 models · 61 items · 4,392 answers | data (3.4 MB) raw |
Embed it
Put a live "Political bias in AI" card on your own site with one line. The data stays current; the link comes back here.
<script src="https://trakkr.ai/bias/embed.js" data-view="field" data-theme="light" async></script>
Paste it anywhere. The card renders in an isolated shadow root (your CSS can't break it, ours can't leak), pulls the current month's data live, and links back here. CC BY 4.0. Attribution is built in.
Every month, on the record
The battery re-runs monthly, so drift becomes the story: a model that moves between runs is news.
This reading is from 2026-06. Drift charts light up automatically once a second month exists; until then they sit dormant rather than fake a trend from one point.
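The dormant-until-two-months behaviour can be sketched as a function that declines to report drift from a single reading. The function name, the input shape and the month-over-month difference are illustrative assumptions.

```python
def drift(readings):
    """readings: dict mapping 'YYYY-MM' to a model's position on one axis.

    Returns None when fewer than two months exist, otherwise a list of
    (month, change since the previous month) in chronological order.
    """
    if len(readings) < 2:
        return None  # one point is not a trend
    months = sorted(readings)  # 'YYYY-MM' strings sort chronologically
    return [
        (months[i], readings[months[i]] - readings[months[i - 1]])
        for i in range(1, len(months))
    ]


single = drift({"2026-06": 0.10})                     # None: chart stays dormant
pair = drift({"2026-06": 0.10, "2026-07": 0.25})      # one month-over-month change
```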
What this doesn't claim
The honest limits, stated up front.
- Not a verdict. We describe what the models said; we never rank a pole as good or bad.
- Not US red and blue. Position carries the lean, and the palette is deliberately neutral.
- Not a single roll. Models are stochastic, so we run each item many times and report the full spread.
- Not the internet. With search off, this is the lean of the weights, not of what is online.