Introduction
SPIN-Bench evaluates LLMs' abilities in strategic planning and social reasoning across a diverse range of tasks and environments. Our benchmark provides a comprehensive assessment through three primary categories:
- Classical Planning: Single-agent planning problems from 21 domains that test LLMs' factual retrieval, spatial reasoning, and multi-step planning abilities.
- Competitive Games: Turn-based board games (Tic-tac-toe, Connect Four, Chess) played against game solvers, assessing adversarial reasoning and strategic thinking (a minimal evaluation-loop sketch follows this list).
- Collaborative Games: Multi-agent coordination challenges, specifically focused on Hanabi, which requires inference about hidden information and coordinated team play.
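To make the competitive-game protocol concrete, below is a minimal sketch of a turn-based evaluation loop in which the model under test plays Chess against a solver, mirroring how the draw-rate metric in the table is obtained. It assumes the `python-chess` package and a local Stockfish binary; `llm_move` is a hypothetical stand-in (here it just plays a random legal move) for the prompting and move-parsing a real harness would perform, and the search depth is illustrative rather than SPIN-Bench's actual setting.

```python
import random

import chess
import chess.engine


def llm_move(board: chess.Board) -> chess.Move:
    """Stand-in for the model under test: a real harness would send the
    position (e.g., FEN plus move history) to the LLM and parse its reply
    into a legal move. Here we simply pick a random legal move."""
    return random.choice(list(board.legal_moves))


def play_vs_solver(engine_path: str = "stockfish", depth: int = 20) -> str:
    """Play one game with the LLM as White against the engine and return
    the result string ("1-0", "0-1", or "1/2-1/2")."""
    board = chess.Board()
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        while not board.is_game_over():
            if board.turn == chess.WHITE:
                board.push(llm_move(board))
            else:
                played = engine.play(board, chess.engine.Limit(depth=depth))
                board.push(played.move)
    return board.result()


if __name__ == "__main__":
    # Draws ("1/2-1/2") across repeated games feed a draw-rate (DR) style metric.
    print(play_vs_solver())
```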
Model Performance Rankings
The table below presents our evaluation of the tested LLMs across all benchmark tasks. The average score summarizes each model's overall performance, combining planning accuracy, competitive game results, and collaborative game coordination.
| Model | Plan Acc ↑ | N-Step ↑ | TTT DR ↑ | C4 DR ↑ | CH DR ↑ | C4 T3 ↑ | CH T3 ↑ | Hanabi 2P ↑ | Hanabi 3P ↑ | Hanabi 4P ↑ | Hanabi 5P ↑ | Avg. Score ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| o1 | **58.59** | **16.09** | 70.0 | 0.0 | 0.0 | 83.1 | 45.9 | **16.4** | 14.8 | **14.8** | **14.2** | **49.8** |
| o4-mini | 46.79 | 11.52 | **80.0** | 0.0 | 0.0 | 81.1 | 50.5 | 12.8 | 11.0 | 12.6 | 13.2 | 45.7 |
| llama4-Maverick | 13.05 | 1.69 | 0.0 | 0.0 | 0.0 | 75.4 | 51.3 | 3.0 | 4.2 | 4.4 | 5.6 | 20.9 |
| o1-mini | 13.20 | 1.95 | 50.0 | 0.0 | 0.0 | **87.0** | 36.5 | 6.8 | 7.4 | 11.4 | 10.2 | 33.0 |
| o3-mini | 51.25 | 13.04 | 20.0 | 0.0 | 0.0 | 74.2 | **52.8** | 8.8 | 7.6 | 8.8 | 8.0 | 33.1 |
| GPT-4o | 8.75 | 0.60 | 0.0 | 0.0 | 0.0 | 84.1 | 32.2 | 6.6 | 4.8 | 4.8 | 4.6 | 20.8 |
| GPT-4-turbo | 5.62 | 0.13 | 60.0 | 0.0 | 0.0 | 83.8 | 38.7 | 5.2 | 5.6 | 5.0 | 6.0 | 27.5 |
| Claude 3.5 Sonnet | 20.55 | 4.44 | 60.0 | 0.0 | 0.0 | 78.9 | 49.5 | 8.2 | 9.4 | 7.4 | 8.4 | 34.3 |
| Claude 3.5 Haiku | 4.22 | 0.30 | 50.0 | 0.0 | 0.0 | 69.6 | 35.9 | 2.4 | 4.0 | 2.8 | 2.8 | 20.8 |
| DeepSeek R1 | 44.30 | 10.71 | 10.0 | 0.0 | 0.0 | 78.9 | 47.8 | 6.0 | **16.0** | 11.3 | 13.0 | 36.6 |
| Llama-3.3-70b | 5.78 | 0.32 | 0.0 | 0.0 | 0.0 | 79.5 | 25.4 | 2.4 | 0.8 | 0.6 | 1.4 | 13.1 |
Table Legend:
- TTT, C4, CH: Tic Tac Toe, Connect Four and Chess
- DR: Draw rate (%) against solvers (for chess, we use stockfish-20)
- T3: Percentage of the model's moves that match one of the solver's top-3 moves, across all games against the solver
- 2P, 3P, 4P, 5P: Average Hanabi score with 2, 3, 4, and 5 players
- Avg. Score: Average of Plan Acc and all game metrics, with Hanabi scores normalized to percentages (divided by the full score of 25); see the sketch below the legend for the exact computation
Bold values indicate the best performance for each metric.
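As a worked example of the aggregation, the short sketch below recomputes the Avg. Score column from a table row. The reported averages are reproduced when the Hanabi scores are rescaled from the 0-25 range to percentages and N-Step (a look-ahead depth rather than a game score) is left out; the function and argument names are ours, not SPIN-Bench identifiers.

```python
def average_score(plan_acc, ttt_dr, c4_dr, ch_dr, c4_t3, ch_t3, hanabi_scores):
    """Recompute Avg. Score: Plan Acc plus the game metrics, with each
    Hanabi score (2P-5P) rescaled from [0, 25] to a percentage."""
    hanabi_pct = [s / 25 * 100 for s in hanabi_scores]
    metrics = [plan_acc, ttt_dr, c4_dr, ch_dr, c4_t3, ch_t3, *hanabi_pct]
    return sum(metrics) / len(metrics)


# o1's row reproduces its reported 49.8 average:
print(round(average_score(58.59, 70.0, 0.0, 0.0, 83.1, 45.9,
                          [16.4, 14.8, 14.8, 14.2]), 1))  # -> 49.8
```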
Key Findings
Our comprehensive evaluation reveals several key insights:
- Planning Capabilities: **o1** excels in classical planning tasks, with the highest planning accuracy (58.59%) and N-Step look-ahead (16.09), demonstrating superior multi-step reasoning.
- Competitive Game Performance: While **o4-mini** achieves the highest draw rate against perfect play in Tic Tac Toe (80%), no model successfully draws against perfect play in Connect Four or Chess.
- Collaborative Play: **o1** achieves the best overall performance in the collaborative Hanabi setting, with particularly strong results in the 2-player (16.4), 4-player (14.8), and 5-player (14.2) scenarios.
- Overall Performance: **o1** leads with the highest average score (49.8%), followed by **o4-mini** (45.7%), demonstrating that larger, more recent models generally outperform smaller or earlier models in strategic and social reasoning tasks.