Introduction
SPIN-Bench evaluates LLMs' abilities in strategic planning and social reasoning across a diverse range of tasks and environments. Our benchmark provides a comprehensive assessment through three primary categories:
- Classical Planning: Single-agent planning problems from 21 domains that test LLMs' factual retrieval, spatial reasoning, and multi-step planning abilities.
- Competitive Games: Turn-based board games (Tic-tac-toe, Connect Four, Chess) played against game solvers, assessing adversarial reasoning and strategic thinking (a minimal evaluation-loop sketch follows this list).
- Collaborative Games: Multi-agent coordination challenges, specifically focused on Hanabi, which requires inference about hidden information and coordinated team play.
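To make the competitive-game protocol concrete, below is a minimal sketch of a turn-based evaluation loop in which the model under test plays Chess against a solver, mirroring how the draw-rate metric in the table is obtained. It assumes the `python-chess` package and a local Stockfish binary; `llm_move` is a hypothetical stand-in (here it just plays a random legal move) for the prompting and move-parsing a real harness would perform, and the search depth is illustrative rather than SPIN-Bench's actual setting.

```python
import random

import chess
import chess.engine


def llm_move(board: chess.Board) -> chess.Move:
    """Stand-in for the model under test: a real harness would send the
    position (e.g., FEN plus move history) to the LLM and parse its reply
    into a legal move. Here we simply pick a random legal move."""
    return random.choice(list(board.legal_moves))


def play_vs_solver(engine_path: str = "stockfish", depth: int = 20) -> str:
    """Play one game with the LLM as White against the engine and return
    the result string ("1-0", "0-1", or "1/2-1/2")."""
    board = chess.Board()
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        while not board.is_game_over():
            if board.turn == chess.WHITE:
                board.push(llm_move(board))
            else:
                played = engine.play(board, chess.engine.Limit(depth=depth))
                board.push(played.move)
    return board.result()


if __name__ == "__main__":
    # Draws ("1/2-1/2") across repeated games feed a draw-rate (DR) style metric.
    print(play_vs_solver())
```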
Model Performance Rankings
The table below presents our evaluation of the tested LLMs across all benchmark tasks. The average score summarizes each model's overall performance, combining planning accuracy, competitive game results, and collaborative game coordination.
| Model | Plan Acc ↑ | N-Step ↑ | TTT DR ↑ | C4 DR ↑ | CH DR ↑ | C4 T3 ↑ | CH T3 ↑ | Hanabi 2P ↑ | Hanabi 3P ↑ | Hanabi 4P ↑ | Hanabi 5P ↑ | Avg. Score ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| o1 | **58.59** | **16.09** | 70.0 | 0.0 | 0.0 | 83.1 | 45.9 | **16.4** | 14.8 | **14.8** | **14.2** | **49.8** |
| o4-mini | 46.79 | 11.52 | **80.0** | 0.0 | 0.0 | 81.1 | 50.5 | 12.8 | 11.0 | 12.6 | 13.2 | 45.7 |
| llama4-Maverick | 13.05 | 1.69 | 0.0 | 0.0 | 0.0 | 75.4 | 51.3 | 3.0 | 4.2 | 4.4 | 5.6 | 20.9 |
| o1-mini | 13.20 | 1.95 | 50.0 | 0.0 | 0.0 | **87.0** | 36.5 | 6.8 | 7.4 | 11.4 | 10.2 | 33.0 |
| o3-mini | 51.25 | 13.04 | 20.0 | 0.0 | 0.0 | 74.2 | **52.8** | 8.8 | 7.6 | 8.8 | 8.0 | 33.1 |
| GPT-4o | 8.75 | 0.60 | 0.0 | 0.0 | 0.0 | 84.1 | 32.2 | 6.6 | 4.8 | 4.8 | 4.6 | 20.8 |
| GPT-4-turbo | 5.62 | 0.13 | 60.0 | 0.0 | 0.0 | 83.8 | 38.7 | 5.2 | 5.6 | 5.0 | 6.0 | 27.5 |
| Claude 3.5 Sonnet | 20.55 | 4.44 | 60.0 | 0.0 | 0.0 | 78.9 | 49.5 | 8.2 | 9.4 | 7.4 | 8.4 | 34.3 |
| Claude 3.5 Haiku | 4.22 | 0.30 | 50.0 | 0.0 | 0.0 | 69.6 | 35.9 | 2.4 | 4.0 | 2.8 | 2.8 | 20.8 |
| DeepSeek R1 | 44.30 | 10.71 | 10.0 | 0.0 | 0.0 | 78.9 | 47.8 | 6.0 | **16.0** | 11.3 | 13.0 | 36.6 |
| Llama-3.3-70b | 5.78 | 0.32 | 0.0 | 0.0 | 0.0 | 79.5 | 25.4 | 2.4 | 0.8 | 0.6 | 1.4 | 13.1 |
Table Legend:
- TTT, C4, CH: Tic Tac Toe, Connect Four and Chess
- DR: Draw rate (%) against solvers (for chess, we use stockfish-20)
- T3: Percentage of the model's moves that match one of the solver's top-3 moves, across all games against the solver
- 2P, 3P, 4P, 5P: Average Hanabi score with 2, 3, 4, and 5 players
- Avg. Score: Average of Plan Acc and all game metrics, with Hanabi scores normalized to percentages (divided by the full score of 25); see the sketch below the legend for the exact computation
Bold values indicate the best performance for each metric.
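As a worked example of the aggregation, the short sketch below recomputes the Avg. Score column from a table row. The reported averages are reproduced when the Hanabi scores are rescaled from the 0-25 range to percentages and N-Step (a look-ahead depth rather than a game score) is left out; the function and argument names are ours, not SPIN-Bench identifiers.

```python
def average_score(plan_acc, ttt_dr, c4_dr, ch_dr, c4_t3, ch_t3, hanabi_scores):
    """Recompute Avg. Score: Plan Acc plus the game metrics, with each
    Hanabi score (2P-5P) rescaled from [0, 25] to a percentage."""
    hanabi_pct = [s / 25 * 100 for s in hanabi_scores]
    metrics = [plan_acc, ttt_dr, c4_dr, ch_dr, c4_t3, ch_t3, *hanabi_pct]
    return sum(metrics) / len(metrics)


# o1's row reproduces its reported 49.8 average:
print(round(average_score(58.59, 70.0, 0.0, 0.0, 83.1, 45.9,
                          [16.4, 14.8, 14.8, 14.2]), 1))  # -> 49.8
```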
Key Findings
Our comprehensive evaluation reveals several key insights:
- Planning Capabilities: **o1** excels in classical planning tasks, with the highest planning accuracy (58.59%) and N-Step look-ahead (16.09), demonstrating superior multi-step reasoning.
- Competitive Game Performance: While **o4-mini** achieves the highest draw rate against perfect play in Tic Tac Toe (80%), no model successfully draws against perfect play in Connect Four or Chess.
- Collaborative Play: **o1** achieves the best overall performance in the collaborative Hanabi setting, with particularly strong results in the 2-player (16.4), 4-player (14.8), and 5-player (14.2) scenarios.
- Overall Performance: **o1** leads with the highest average score (49.8%), followed by **o4-mini** (45.7%), demonstrating that larger, more recent models generally outperform smaller or earlier models in strategic and social reasoning tasks.