Agreement by query class | Trakkr Research
Cross-model agreement benchmark across major prompt families.
Methodology: Built from 797,644 valid comparisons across 44,088 reports and 8 models, covering 6,439,133 model responses in the observed window.
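
As a rough sanity check on the scale of these totals, the ratios between them can be computed directly. The sketch below is illustrative only: the constant and function names are assumptions, and the resulting averages are derived here rather than reported by the study.

```python
# Totals quoted in the methodology line above.
VALID_COMPARISONS = 797_644
REPORTS = 44_088
MODELS = 8
MODEL_RESPONSES = 6_439_133


def ratio(numerator: int, denominator: int) -> float:
    """Return a plain average, rounded to two decimals for display."""
    return round(numerator / denominator, 2)


# Simple averages; the study itself does not publish these figures.
comparisons_per_report = ratio(VALID_COMPARISONS, REPORTS)
responses_per_report = ratio(MODEL_RESPONSES, REPORTS)
responses_per_comparison = ratio(MODEL_RESPONSES, VALID_COMPARISONS)

print(f"models in benchmark:      {MODELS}")
print(f"comparisons per report:   {comparisons_per_report}")
print(f"responses per report:     {responses_per_report}")
print(f"responses per comparison: {responses_per_comparison}")
```

The responses-per-comparison ratio lands close to the model count of 8, which would be consistent with each comparison drawing on roughly one response per model, although the page does not state this.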
Summary
Comparison prompts are the most stable query class at 50.4% average agreement, while best-of (43.4%) and general (42.2%) prompts agree less consistently across models.
Benchmark rows
| Metric | Value | Context |
|---|---|---|
| Comparison-query agreement | 50.4% | Comparison prompts produce the highest average agreement. |
| General-query agreement | 42.2% | General prompts are less stable across models. |
| Best-of high divergence | 14.8% | Best-of prompts frequently split models. |
Ranked view
| Query class | Average agreement | Detail |
|---|---|---|
| Comparison queries | 50.4% | The highest average agreement rate in the study. |
| How-to queries | 45.3% | More constrained than general prompts, but agreement is still below half. |
| Alternative queries | 44.1% | Moderate agreement with a smaller sample. |
| Best-of queries | 43.4% | Broad buyer-intent prompts still split models materially. |
| General queries | 42.2% | The least stable mainstream prompt family in the benchmark. |
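
To make the ranking easier to reuse, the rows above can be loaded into a small structure and compared. This is a minimal sketch, assuming the five rates are used exactly as listed; the names `AGREEMENT_BY_CLASS` and `rank_classes` are hypothetical, and the unweighted mean it prints is not the study's overall average, which would depend on per-class sample sizes that this page does not list.

```python
# Agreement rates copied from the ranked table above (percent).
AGREEMENT_BY_CLASS = {
    "Comparison": 50.4,
    "How-to": 45.3,
    "Alternative": 44.1,
    "Best-of": 43.4,
    "General": 42.2,
}


def rank_classes(rates: dict[str, float]) -> list[tuple[str, float]]:
    """Sort query classes from highest to lowest agreement."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


ranked = rank_classes(AGREEMENT_BY_CLASS)
top_name, top_rate = ranked[0]
bottom_name, bottom_rate = ranked[-1]

# Gap between the most and least stable classes, in percentage points.
spread = round(top_rate - bottom_rate, 1)

# Unweighted mean across classes; NOT the study's overall average.
unweighted_mean = round(
    sum(AGREEMENT_BY_CLASS.values()) / len(AGREEMENT_BY_CLASS), 2
)

for name, rate in ranked:
    gap = rate - bottom_rate
    print(f"{name:<12} {rate:5.1f}%  ({gap:+.1f} pts vs {bottom_name})")
print(f"spread: {spread} pts, unweighted mean: {unweighted_mean}%")
```

On these figures, the gap between comparison and general queries is 8.2 percentage points.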
Related pages
Continue through the same study cluster.
- Do AI models recommend the same brands? - Related answer page
- How often is there perfect consensus across models? - Related answer page
- Average cross-model agreement is only 43% - Related fact page
Data & Sources
- Same Question, Different AI, Different Answers - Flagship study behind this page
- Page JSON - Machine-readable companion file