Agreement by query class | Trakkr Research

Cross-model agreement benchmark across major prompt families.

Methodology: Built from 797,644 valid comparisons across 44,088 reports and 8 models, covering 6,439,133 model responses in the observed window.
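
As a rough sanity check on the scale of the study, the totals above can be turned into per-report and per-comparison averages. The sketch below uses only the four figures quoted in the methodology line; the variable names are illustrative, and the page does not define how a "valid comparison" is constructed, so the ratios are descriptive only.

```python
# Totals quoted in the methodology line.
valid_comparisons = 797_644
reports = 44_088
models = 8
model_responses = 6_439_133

# Simple averages; these are derived here, not published by the study.
comparisons_per_report = valid_comparisons / reports
responses_per_report = model_responses / reports
responses_per_comparison = model_responses / valid_comparisons

print(f"comparisons per report:   {comparisons_per_report:.2f}")
print(f"responses per report:     {responses_per_report:.2f}")
print(f"responses per comparison: {responses_per_comparison:.2f}")
```

This works out to about 18 comparisons and 146 responses per report, and roughly 8 responses per comparison. The last figure would be consistent with each comparison drawing on about one response per model, though that reading is an assumption rather than something the page states.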

Summary

Comparison prompts are the most stable query class, with 50.4% average cross-model agreement. Broader best-of (43.4%) and general (42.2%) prompts are less portable across models.

Benchmark rows

Metric | Value | Context
Comparison-query agreement | 50.4% | Comparison prompts produce the highest average agreement.
General-query agreement | 42.2% | General prompts are less stable across models.
Best-of high divergence | 14.8% | Best-of prompts frequently split models.

Ranked view

Item | Value | Detail
Comparison queries | 50.4% | The highest average agreement rate in the study.
How-to queries | 45.3% | More constrained than general prompts, but still not highly converged.
Alternative queries | 44.1% | Moderate agreement with a smaller sample.
Best-of queries | 43.4% | Broad buyer-intent prompts still split models materially.
General queries | 42.2% | The least stable mainstream prompt family in the benchmark.
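
To make the gaps in the ranked view concrete, the following sketch sorts the five query classes and computes how far each sits below the top class. The rates are copied from the rows above. The mean is unweighted because per-class sample sizes are not given here, so it should not be read as the study's overall agreement rate.

```python
# Agreement rates (percent) copied from the ranked view.
agreement = {
    "Comparison": 50.4,
    "How-to": 45.3,
    "Alternative": 44.1,
    "Best-of": 43.4,
    "General": 42.2,
}

# Rank classes from highest to lowest agreement.
ranked = sorted(agreement.items(), key=lambda kv: kv[1], reverse=True)
top_name, top_rate = ranked[0]
bottom_name, bottom_rate = ranked[-1]

# Spread between the most and least stable classes, in percentage points.
spread_points = round(top_rate - bottom_rate, 1)

# Unweighted mean across classes (an assumption: sample sizes are unknown).
unweighted_mean = round(sum(agreement.values()) / len(agreement), 2)

# Gap of each class below the top class, in percentage points.
gaps = {name: round(top_rate - rate, 1) for name, rate in ranked}

for name, rate in ranked:
    print(f"{name:<12} {rate:.1f}%  (-{gaps[name]:.1f} pts vs {top_name})")
```

The spread from comparison to general queries is 8.2 percentage points, and the four non-comparison classes sit within about 3 points of one another.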
