SPIN-Bench Rankings

Performance Evaluation of LLMs in Strategic Planning and Social Reasoning

Introduction

SPIN-Bench evaluates LLMs' abilities in strategic planning and social reasoning across a diverse range of tasks and environments. Our benchmark provides a comprehensive assessment through three primary categories:

  1. Classical Planning: Single-agent planning problems from 21 domains that test LLMs' factual retrieval, spatial reasoning, and multi-step planning abilities.
  2. Competitive Games: Turn-based board games (Tic-tac-toe, Connect Four, Chess) that assess adversarial reasoning and strategic thinking.
  3. Collaborative Games: Multi-agent coordination challenges, specifically focused on Hanabi, which requires inference about hidden information and coordinated team play.

Model Performance Rankings

The table below presents our comprehensive evaluation of a range of LLMs across all benchmark tasks. Columns are grouped into Classical Planning (Plan Acc, N-Step), Competitive Games (draw rates and top-3 rates for Tic-Tac-Toe, Connect Four, and Chess), and Collaborative: Hanabi (2–5 players). The Avg. Score column summarizes each model's overall performance, combining planning accuracy, competitive game performance, and collaborative game coordination.

| Model | Plan Acc ↑ | N-Step ↑ | TTT DR | C4 DR | CH DR | C4 T3 | CH T3 | Hanabi 2P ↑ | Hanabi 3P ↑ | Hanabi 4P ↑ | Hanabi 5P ↑ | Avg. Score ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| o1 | **58.59** | **16.09** | 70.0 | 0.0 | 0.0 | 83.1 | 45.9 | **16.4** | 14.8 | **14.8** | **14.2** | **49.8** |
| o4-mini | 46.79 | 11.52 | **80.0** | 0.0 | 0.0 | 81.1 | 50.5 | 12.8 | 11.0 | 12.6 | 13.2 | 45.7 |
| llama4-Maverick | 13.05 | 1.69 | 0.0 | 0.0 | 0.0 | 75.4 | 51.3 | 3.0 | 4.2 | 4.4 | 5.6 | 20.9 |
| o1-mini | 13.20 | 1.95 | 50.0 | 0.0 | 0.0 | **87.0** | 36.5 | 6.8 | 7.4 | 11.4 | 10.2 | 33.0 |
| o3-mini | 51.25 | 13.04 | 20.0 | 0.0 | 0.0 | 74.2 | **52.8** | 8.8 | 7.6 | 8.8 | 8.0 | 33.1 |
| GPT-4o | 8.75 | 0.60 | 0.0 | 0.0 | 0.0 | 84.1 | 32.2 | 6.6 | 4.8 | 4.8 | 4.6 | 20.8 |
| GPT-4-turbo | 5.62 | 0.13 | 60.0 | 0.0 | 0.0 | 83.8 | 38.7 | 5.2 | 5.6 | 5.0 | 6.0 | 27.5 |
| Claude 3.5 Sonnet | 20.55 | 4.44 | 60.0 | 0.0 | 0.0 | 78.9 | 49.5 | 8.2 | 9.4 | 7.4 | 8.4 | 34.3 |
| Claude 3.5 Haiku | 4.22 | 0.30 | 50.0 | 0.0 | 0.0 | 69.6 | 35.9 | 2.4 | 4.0 | 2.8 | 2.8 | 20.8 |
| DeepSeek R1 | 44.30 | 10.71 | 10.0 | 0.0 | 0.0 | 78.9 | 47.8 | 6.0 | **16.0** | 11.3 | 13.0 | 36.6 |
| Llama-3.3-70b | 5.78 | 0.32 | 0.0 | 0.0 | 0.0 | 79.5 | 25.4 | 2.4 | 0.8 | 0.6 | 1.4 | 13.1 |

Table Legend:

  • Plan Acc, N-Step: Classical planning accuracy (%) and N-step look-ahead
  • TTT, C4, CH: Tic-Tac-Toe, Connect Four, and Chess
  • DR: Draw rate (%) against solvers (for chess, we use stockfish-20)
  • T3: Percentage of the model's moves that fall within the solver's top-3 moves, computed over all games against the solver (an illustrative per-move check is sketched after the table notes below)
  • 2P, 3P, 4P, 5P: Average Hanabi score with 2, 3, 4, and 5 players
  • Avg. Score: Average of Plan Acc and all game metrics (N-Step is not included), with Hanabi scores normalized to percentages by dividing by the full score of 25; the sketch below works through this calculation
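
To make the Avg. Score definition concrete, here is a minimal Python sketch of that calculation. The function name and argument layout are illustrative rather than the benchmark's actual code; applied to o1's row above, it reproduces the 49.8 reported in the table.

```python
def average_score(plan_acc, ttt_dr, c4_dr, ch_dr, c4_t3, ch_t3, hanabi_scores):
    """Average of Plan Acc and the game metrics (N-Step is excluded),
    with each Hanabi score rescaled from its 0-25 range to a percentage."""
    hanabi_pct = [s / 25 * 100 for s in hanabi_scores]  # 2P, 3P, 4P, 5P
    metrics = [plan_acc, ttt_dr, c4_dr, ch_dr, c4_t3, ch_t3, *hanabi_pct]
    return sum(metrics) / len(metrics)

# o1's row: Plan Acc, TTT/C4/CH draw rates, C4/CH top-3 rates, Hanabi 2P-5P
print(round(average_score(58.59, 70.0, 0.0, 0.0, 83.1, 45.9,
                          [16.4, 14.8, 14.8, 14.2]), 1))  # -> 49.8
```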

Bold values indicate the best performance for each metric (all models score 0.0 on the Connect Four and Chess draw rates).
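
For the competitive games, DR and T3 are computed move by move against the reference solver. As a rough illustration of the chess T3 check only, and not the benchmark's actual harness, the snippet below uses the python-chess library with a locally installed Stockfish binary; the search depth and the MultiPV-based top-3 extraction are illustrative choices rather than the benchmark's configuration.

```python
import chess
import chess.engine

def move_in_top3(engine, board, model_move_uci, depth=20):
    """Illustrative per-move check behind a CH T3-style rate: is the
    model's move among the engine's top-3 moves for this position?"""
    infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=3)
    top3 = {info["pv"][0] for info in infos if "pv" in info}
    return chess.Move.from_uci(model_move_uci) in top3

# Example: score one move from the starting position
# (assumes a Stockfish binary named "stockfish" is on the PATH).
engine = chess.engine.SimpleEngine.popen_uci("stockfish")
try:
    print(move_in_top3(engine, chess.Board(), "e2e4"))
finally:
    engine.quit()
```

The reported T3 value is then the fraction of a model's moves for which this check succeeds, aggregated over all games against the solver.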

Key Findings

Our comprehensive evaluation reveals several key insights:

  • Planning Capabilities: o1 excels in classical planning tasks with the highest planning accuracy (58.59%) and N-Step look-ahead (16.09), demonstrating superior multi-step reasoning.
  • Competitive Game Performance: While o4-mini achieves the highest draw rate against the solver in Tic-Tac-Toe (80%), no model manages a single draw against the Connect Four solver or against Stockfish in Chess (all draw rates are 0.0).
  • Collaborative Play: o1 achieves the best overall performance in the collaborative Hanabi setting, leading the 2-player (16.4), 4-player (14.8), and 5-player (14.2) configurations, while DeepSeek R1 posts the top 3-player score (16.0).
  • Overall Performance: o1 leads with the highest average score (49.8%), followed by o4-mini (45.7%), demonstrating that larger, more recent models generally outperform smaller or earlier models in strategic and social reasoning tasks.