We introduce Strategic Planning, Interaction, and Negotiation (SPIN-Bench), a comprehensive framework for evaluating long-horizon strategic planning and social intelligence in Large Language Models (LLMs). Unlike prior work that confines itself to narrow planning or isolated single-agent tasks, SPIN-Bench combines formal PDDL challenges, competitive board games, cooperative card games, and multi-agent negotiation scenarios within a single evaluation.
By systematically varying action spaces, state complexity, and the number of interacting agents, SPIN-Bench tests not only methodical, step-wise decision-making but also conceptual inference about hidden information and adversarial or cooperative strategies. Our experiments reveal that while contemporary LLMs handle basic fact retrieval and short-range planning reasonably well, they encounter significant performance bottlenecks in tasks requiring deep multi-hop reasoning over large state spaces and socially adept coordination under uncertainty.
In particular, we find that strong models (e.g., o1) can still struggle with extended-horizon planning when multiple agents and hidden intentions are introduced, and that extensive social interaction can sometimes degrade chain-of-thought coherence. These insights highlight persistent gaps in multi-agent negotiation, alliance formation, and perspective-taking, underscoring where further advances in LLM architectures and training might be needed.
By drawing on both human baselines and domain-specific solvers, our results shed light on the real-world potential and current shortcomings of LLMs for strategic, multi-agent settings. We envision SPIN-Bench as a catalyst for future research on robust multi-agent planning, social reasoning, and human–AI teaming.
The SPIN-Bench framework integrates four distinct environment types: formal PDDL planning tasks, competitive board games, cooperative card games, and multi-agent negotiation scenarios.
This structured progression allows us to systematically pinpoint where LLM reasoning breaks down—whether in state tracking, partial-order reasoning, chain-of-thought coherence, or dynamic social interaction. By combining these environments within a unified evaluation framework, SPIN-Bench provides unprecedented insight into how LLMs transition from basic planning to complex multi-agent reasoning.
Our benchmark includes a diverse set of games and tasks that test strategic planning and social reasoning. Here are some examples of the game trajectories and tasks that we include in our benchmark:
🏁 PDDL
Classical planning tasks testing core reasoning skills through factual retrieval, spatial reasoning, and multi-step planning across 21 domains with varying complexity.
⭕ Tic-Tac-Toe
A simple competitive game played on a 3×3 grid, evaluating LLMs' understanding of basic rules, turn-taking, and elementary strategic planning against solvers and other LLMs.
🔴 Connect Four
An intermediate strategy game with a 6×7 vertical grid where players drop colored discs, requiring foresight to align four discs while blocking opponents' attempts.
♟️ Chess
A complex strategic board game played on an 8×8 checkered board, testing advanced planning, deep calculation, pattern recognition, and sophisticated decision-making.
🎆 Hanabi
A cooperative card game where players see everyone else's cards but not their own, testing coordination with partial information across teams of 2-5 LLM agents.
🌍 Diplomacy
A grand strategy game featuring seven European powers, testing negotiation skills, alliance formation, spatial reasoning, and complex strategic planning in a multi-agent environment.
To establish rigorous baselines, we evaluate LLMs against optimal or near-optimal solvers. These matchups reveal how models perform against mathematically perfect play, highlighting their strategic reasoning capabilities and limitations:
LLMs compete against a perfect Minimax solver that never loses. This tests basic game understanding and ability to achieve draws through optimal play in a theoretically solved game.
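For intuition, here is a minimal minimax sketch of the kind of perfect tic-tac-toe play such a solver provides; the board encoding and function names are illustrative only, not the benchmark's actual solver code.

```python
# Illustrative minimax for 3x3 tic-tac-toe; not the benchmark's actual solver.
WIN_LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, move) from `player`'s perspective: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return (1 if w == player else -1), None
    if "." not in board:
        return 0, None
    best_score, best_move = -2, None
    opponent = "O" if player == "X" else "X"
    for i, cell in enumerate(board):
        if cell == ".":
            child = board[:i] + player + board[i+1:]
            # Opponent's best reply is our worst case, so negate their score.
            score = -minimax(child, opponent)[0]
            if score > best_score:
                best_score, best_move = score, i
    return best_score, best_move

# A perfect player never loses: on an empty board the best outcome is a draw.
print(minimax("." * 9, "X"))  # -> (0, 0)
```

Against such exhaustive search, the best any opponent can achieve is a draw, which is exactly the ceiling the LLMs are measured against.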
LLMs play against a Connect Four solver that can compute the optimal move for any board position, testing deeper tactical awareness and multi-step planning capabilities.
LLMs face the Stockfish chess engine at different skill levels (0, 5, 10, 15, and 20). Even against reduced-strength engines, this reveals significant gaps in deep calculation.
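As a rough illustration of how such reduced-strength opponents can be set up, here is a sketch using the python-chess library; the binary path, time limit, and the placeholder for the LLM's move are assumptions, and this is not the benchmark's actual evaluation harness.

```python
import chess
import chess.engine

# Assumes a Stockfish binary is available on PATH as "stockfish".
engine = chess.engine.SimpleEngine.popen_uci("stockfish")
engine.configure({"Skill Level": 5})  # 0-20; lower values weaken the engine

board = chess.Board()
while not board.is_game_over():
    if board.turn == chess.WHITE:
        # Placeholder for the LLM's move choice (parsed from its response).
        move = next(iter(board.legal_moves))
    else:
        move = engine.play(board, chess.engine.Limit(time=0.1)).move
    board.push(move)

print(board.result())
engine.quit()
```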
Building on the motivations outlined in our introduction, SPIN-Bench's architecture is organized around three progressively complex problem settings for automated action selection: Classical Planning (single-agent, deterministic), Multi-Agent Games (cooperative or competitive), and Strategic Games (mixed cooperation, competition, and negotiation). Each setting introduces additional layers of complexity, requiring increasingly sophisticated reasoning capabilities.
The framework consists of two core components: (1) the Game Agent, which encompasses the LLMs and their adaptive prompting, and (2) the Environment and Evaluation subsystem, which manages game logic, tracks interactions, and quantifies performance. Our flexible interface feeds models the current state description, relevant history, and legal actions, enabling standardized evaluation across diverse scenarios while maintaining game-specific requirements.
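A schematic sketch of this agent-environment loop might look as follows; the class and method names are hypothetical, chosen only to illustrate the interface described above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    """What the interface feeds the model each turn (hypothetical structure)."""
    state_description: str    # textual description of the current game state
    history: List[str]        # relevant prior moves / messages
    legal_actions: List[str]  # actions the agent may choose from

class GameAgent:
    """Wraps an LLM with game-specific adaptive prompting (illustrative only)."""
    def __init__(self, llm, system_prompt: str):
        self.llm = llm
        self.system_prompt = system_prompt

    def act(self, obs: Observation) -> str:
        prompt = (
            f"{self.system_prompt}\n\nState:\n{obs.state_description}\n\n"
            "History:\n" + "\n".join(obs.history) +
            "\n\nLegal actions:\n" + "\n".join(obs.legal_actions)
        )
        reply = self.llm(prompt)
        # Fall back to the first legal action if the reply is not a legal move.
        return reply if reply in obs.legal_actions else obs.legal_actions[0]

def run_episode(env, agents):
    """Environment manages game logic, tracks interactions, and scores play."""
    obs, current = env.reset()
    while not env.done():
        action = agents[current].act(obs)
        obs, current = env.step(current, action)
    return env.scores()
```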
For evaluation, we employ multiple metrics tailored to each environment type. Our rule-based metrics include accuracy and N-Step Look Ahead for planning tasks, move quality comparison against solvers for competitive games, and final scores for cooperative scenarios. We maintain leaderboard-based comparisons with internal Elo ratings to gauge relative performance across models and against human baselines. For negotiation-heavy settings, we utilize six fine-grained, LLM-assisted negotiation metrics that analyze message-strategy alignment, proposal acceptance, deal equity, conflict tendencies, perspective-taking, and conditional negotiation abilities.
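For instance, Elo-style ratings can be maintained with the standard update rule; the sketch below uses the conventional 400-point scale and a K-factor of 32, which are generic defaults rather than values taken from the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1 for an A win, 0.5 for a draw, 0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: a 1500-rated model draws against a 1600-rated baseline.
print(update_elo(1500, 1600, 0.5))  # A gains a few points, B loses them
```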
To investigate whether LLMs' planning deficits stem from weaker spatial understanding, we designed tasks requiring each model to track positions across sequences of relative movements. This figure plots the accuracy of each model against the length of the movement trajectory. Notably, o1-mini and GPT-4o exhibit declining performance as the number of steps increases, whereas o1 sustains perfect accuracy (100%) up to 29 steps.
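A minimal sketch of how such a relative-movement probe can be generated and checked; the move vocabulary, grid, and prompt wording are illustrative assumptions, not the benchmark's exact task format.

```python
import random

# Illustrative move vocabulary; the benchmark's actual task format may differ.
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

def make_trajectory_task(num_steps: int, seed: int = 0):
    """Generate a sequence of relative moves and the resulting final position."""
    rng = random.Random(seed)
    x, y = 0, 0
    steps = []
    for _ in range(num_steps):
        name = rng.choice(list(MOVES))
        dx, dy = MOVES[name]
        x, y = x + dx, y + dy
        steps.append(name)
    prompt = (
        "You start at (0, 0). Apply the moves in order and report the final "
        "position as (x, y): " + ", ".join(steps)
    )
    return prompt, (x, y)

def is_correct(model_answer: str, target) -> bool:
    return model_answer.replace(" ", "") == f"({target[0]},{target[1]})"

prompt, answer = make_trajectory_task(num_steps=10)
print(prompt)
print("Ground truth:", answer)
```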
Here, we investigate whether LLMs can reliably retrieve key facts from a planning trajectory. This figure illustrates how retrieval accuracy varies with trajectory length. Notably, o1 performs most consistently, confirming that it "reads" multi-step expansions more accurately than either GPT-4o or o1-mini.
In Diplomacy, we design and categorize several factual queries into one-hop vs. multi-hop to further check models' factual retrieval in a highly strategic environment. The figure shows that nearly all LLMs do well on basic location or adjacency checks but degrade by a large margin on "Attackable" and "Attack Analysis," which demand deeper, multi-hop inference. Again, o1 and o1-preview lead, but still exhibit significant drops compared to simpler tasks.
The table shows that the solvers always win or draw. Tic-tac-toe reveals that advanced LLMs (e.g., o1, GPT-4-turbo, Claude 3.5 Sonnet) can achieve draws some of the time but otherwise lose to the perfect solver. In Connect Four and Chess, the gap widens: our solver and Stockfish-level engines maintain a 100% win rate across all tested LLMs.
The Top Move distribution shows that while LLMs sometimes pick optimal moves in Connect Four, their accuracy drops drastically in Chess, underscoring how deeper tactics and branching expansions are beyond current LLMs' capacity.
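The top-move agreement statistic behind such a comparison reduces to a simple rate; the sketch below abstracts the solver or engine behind a callable and is illustrative rather than the benchmark's actual scoring code.

```python
def top_move_rate(game_log, best_move_fn) -> float:
    """Fraction of positions where the LLM's move matches the solver's top move.

    game_log: list of (position, llm_move) pairs recorded during play.
    best_move_fn: callable mapping a position to the solver/engine's best move.
    """
    if not game_log:
        return 0.0
    matches = sum(1 for position, move in game_log if move == best_move_fn(position))
    return matches / len(game_log)
```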
Diplomacy also allows variable numbers of participating powers. Detailed results of more multi-agent settings are shown here. As the agent count grows (beyond 2-3 test seats for LLMs), we observe decreasing order accuracy, fewer successful attacks, and minimal supply-center gains. Ultimately, LLMs lose traction in highly interactive scenarios, underscoring how partial observability and shifting alliances further intensify the multi-agent complexity.
We collected 54,977 human-played Hanabi games from BoardGameGeek, spanning 2- to 5-player settings. This figure plots the human score distribution, highlighting quartiles (Q1-Q4) around a typical range of 15-25 points. While some LLMs do show patterns of declining performance with more agents, none approach even the first quartile of human scores. This underscores the significant gap in cooperative planning under hidden-information constraints, despite Hanabi's narrower branching factor relative to some competitive games.
@misc{yao2025spinbenchllmsplanstrategically,
title={SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?},
author={Jianzhu Yao and Kevin Wang and Ryan Hsieh and Haisu Zhou and Tianqing Zou and Zerui Cheng and Zhangyang Wang and Pramod Viswanath},
year={2025},
eprint={2503.12349},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2503.12349},
}