When facing an international threat, should diplomacy and sanctions be preferred over military options?
Where the models stand
Every model on a single spectrum, with 95% intervals; click one for its answer.
Whiskers show the 95% interval across reruns. Click a model to read its answer and the markers the classifier pulled.
The short answer
Five of six models leaned toward preferring diplomacy over military force, with values from 0.49 to 0.78. ChatGPT (0.67), Claude (0.6), Grok (0.49), Llama (0.78), and DeepSeek (0.78) all favored diplomacy. Gemini was nearly neutral at 0.03. No model favored using force.
The spread of 0.51 indicates a moderate range of positions, with most models clustered toward diplomacy. Stability ranged from Claude's perfect 100% to Grok's 64%. No model refused to answer. Loaded terms such as 'last resort' and 'aggression' appeared in some responses.
- DeepSeek and Llama tied for strongest preference for diplomacy at 0.78.
- Claude was the most consistent model with 100% stability.
- Gemini was the most neutral model with a value of 0.03.
How the field splits
The models clustered by where they landed.
Strongly toward diplomacy
These models strongly prefer diplomacy, using terms like 'last resort' and 'authoritarian regimes' to advocate for non-military options.
Clearly toward diplomacy
Grok clearly prefers diplomacy but with a lower intensity, using terms like 'aggression' and 'repressive states' to describe threats.
Holds the center
Gemini is nearly neutral with a value of 0.03, indicating a balanced stance without using loaded terms.
Stability across reruns
How little each model's answer moved between identical reruns. Models are stochastic, so consistency is itself a finding.
Common questions
Which model is most in favor of diplomacy?
DeepSeek and Llama are tied for most in favor, both with a value of 0.78 (strongly prefer diplomacy).
Which model is most consistent in its responses?
Claude is the most consistent with a stability of 100%. Grok is the least consistent at 64%.
Did any model refuse to answer this question?
No, all six models had a refusal rate of 0%, meaning they all provided direct responses.
Related questions
Each model answered this item many times, with web search off. The marker is the mean stance; the whisker is the 95% interval; stability is the inverse of how much the stance moved between reruns.