
SPIN-Bench

How Well Do LLMs Plan Strategically and Reason Socially?

Jianzhu Yao*, Kevin Wang*, Ryan Hsieh, Haisu Zhou, Tianqing Zou, Zerui Cheng, Zhangyang Wang, Pramod Viswanath
Princeton University, The University of Texas at Austin
* Equal Contribution

Overview of the Strategic Planning, Interaction, and Negotiation (SPIN-Bench) framework, highlighting its two core components: (1) the Game Agent, which encompasses the LLMs and their adaptive prompting, and (2) the Environment and Evaluation subsystem, which manages game logic, tracks interactions, and quantifies performance.

Introduction

We introduce Strategic Planning, Interaction, and Negotiation (SPIN-Bench), a comprehensive framework for evaluating long-horizon strategic planning and social intelligence in Large Language Models (LLMs). Unlike prior work that confines itself to narrow planning or isolated single-agent tasks, SPIN-Bench combines formal PDDL challenges, competitive board games, cooperative card games, and multi-agent negotiation scenarios within a single evaluation.

By systematically varying action spaces, state complexity, and the number of interacting agents, SPIN-Bench tests not only methodical, step-wise decision-making but also conceptual inference about hidden information and adversarial or cooperative strategies. Our experiments reveal that while contemporary LLMs handle basic fact retrieval and short-range planning reasonably well, they encounter significant performance bottlenecks in tasks requiring deep multi-hop reasoning over large state spaces and socially adept coordination under uncertainty.

In particular, we find that strong models (e.g., o1) can still struggle with extended-horizon planning when multiple agents and hidden intentions are introduced, and that extensive social interaction can sometimes degrade chain-of-thought coherence. These insights highlight persistent gaps in multi-agent negotiation, alliance formation, and perspective-taking, underscoring where further advances in LLM architectures and training might be needed.

By drawing on both human baselines and domain-specific solvers, our results shed light on the real-world potential and current shortcomings of LLMs for strategic, multi-agent settings. We envision SPIN-Bench as a catalyst for future research on robust multi-agent planning, social reasoning, and human–AI teaming.

Task Taxonomy and Environments

The SPIN-Bench framework integrates four distinct environment types:

  1. PDDL Tasks: Classical planning problems across 21 domains (1,280 tasks) spanning factual retrieval, spatial reasoning, and multi-step planning with increasing state spaces.
  2. Competitive Games: Turn-based board games of escalating complexity (Tic-tac-toe, Connect Four, Chess) that test adversarial reasoning from short-range tactics to deeper strategic thinking.
  3. Cooperative Games: Featuring Hanabi, a card game where players see others' cards but not their own, requiring trust-building, inference about hidden states, and coordinated actions.
  4. Strategic Games: Incorporating Diplomacy, where negotiation, alliance formation, and strategic betrayal are integral, testing both planning capabilities and social intelligence.

This structured progression allows us to systematically pinpoint where LLM reasoning breaks down—whether in state tracking, partial-order reasoning, chain-of-thought coherence, or dynamic social interaction. By combining these environments within a unified evaluation framework, SPIN-Bench provides unprecedented insight into how LLMs transition from basic planning to complex multi-agent reasoning.
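
To make the PDDL setting concrete, the following is a minimal illustrative sketch (in Python, not the benchmark's actual harness) of how a model-proposed plan for a toy Blocksworld-style task can be checked step by step against action preconditions and effects; the predicates and the single `stack` action here are simplified stand-ins.

```python
# Toy plan checker for a Blocksworld-style task: a simplified illustration of
# PDDL-style plan validation, not SPIN-Bench's actual evaluation code.

def check_plan(initial_state, goal, plan, actions):
    """Apply each step of `plan`; fail if a precondition is not satisfied."""
    state = set(initial_state)
    for name, args in plan:
        pre, add, delete = actions[name](*args)
        if not pre <= state:                         # precondition check
            return False, f"precondition failed at {name}{args}"
        state = (state - delete) | add               # apply effects
    return goal <= state, "plan executed"

# One simplified action: stack block x onto block y (both clear, x on the table).
def stack(x, y):
    pre    = {f"clear {x}", f"clear {y}", f"ontable {x}"}
    add    = {f"on {x} {y}"}
    delete = {f"clear {y}", f"ontable {x}"}
    return pre, add, delete

initial = {"ontable a", "ontable b", "clear a", "clear b"}
goal    = {"on a b"}
plan    = [("stack", ("a", "b"))]                    # e.g. parsed from an LLM reply

print(check_plan(initial, goal, plan, {"stack": stack}))   # (True, 'plan executed')
```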

Game Trajectory Visualization

Our benchmark includes a diverse set of games and tasks that test strategic planning and social reasoning. Below are examples of the environments and game trajectories it covers:

🏁 PDDL

Classical planning tasks testing core reasoning skills through factual retrieval, spatial reasoning, and multi-step planning across 21 domains with varying complexity.

Tic Tac Toe

A simple competitive game played on a 3×3 grid, evaluating LLMs' understanding of basic rules, turn-taking, and elementary strategic planning against solvers and other LLMs.

🔴 Connect Four

An intermediate strategy game with a 6×7 vertical grid where players drop colored discs, requiring foresight to align four discs while blocking opponents' attempts.

♟️ Chess

A complex strategic board game played on an 8×8 checkered board, testing advanced planning, deep calculation, pattern recognition, and sophisticated decision-making.

🎆 Hanabi

A cooperative card game where players see everyone else's cards but not their own, testing coordination with partial information across teams of 2-5 LLM agents.

🌍 Diplomacy

A grand strategy game featuring seven European powers, testing negotiation skills, alliance formation, spatial reasoning, and complex strategic planning in a multi-agent environment.

LLM vs Solver Game Trajectories

To establish rigorous baselines, we evaluate LLMs against optimal or near-optimal solvers. These matchups reveal how models hold up against (near-)perfect play, highlighting their strategic reasoning capabilities and limitations:

Tic Tac Toe vs Minimax

LLMs compete against a perfect Minimax solver that never loses. This tests basic game understanding and the ability to secure draws through optimal play in a fully solved game.
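
For illustration, a perfect Tic-tac-toe opponent of this kind can be realized with plain minimax; the sketch below is a simplified stand-in for such a solver, not SPIN-Bench's own implementation.

```python
# Minimal minimax player for Tic-tac-toe: a simplified stand-in for the perfect
# solver the LLMs face, not the benchmark's actual solver code.
from functools import lru_cache

WIN_LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    """Return (score, move) from `player`'s view: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:                    # the previous move ended the game
        return (1 if w == player else -1), None
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0, None                   # board full: draw
    best_score, best_move = -2, None
    for m in moves:
        child = board[:m] + player + board[m + 1:]
        opp_score, _ = minimax(child, "O" if player == "X" else "X")
        if -opp_score > best_score:      # opponent's gain is our loss
            best_score, best_move = -opp_score, m
    return best_score, best_move

score, move = minimax(" " * 9, "X")      # X to move on an empty board
print(score, move)                       # 0 <opening square>: perfect play draws
```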

🔴 Connect Four vs Solver

LLMs play against a Connect Four solver that computes optimal moves for any board position, testing deeper tactical awareness and multi-step planning capabilities.

♟️ Chess vs Stockfish

LLMs face the Stockfish chess engine at different skill levels (0, 5, 10, 15, and 20). Even against reduced-strength engines, this reveals significant gaps in deep calculation.
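
For reference, a reduced-strength Stockfish opponent like this can be configured through the engine's UCI "Skill Level" option. The sketch below uses the python-chess package and assumes a local Stockfish binary; `llm_choose_move` is a hypothetical placeholder for the model query, so it illustrates the setup rather than reproducing the benchmark's harness.

```python
# Sketch of an LLM-vs-Stockfish loop at a fixed skill level (0-20).
# Assumes a local Stockfish binary and the python-chess package;
# `llm_choose_move` is a hypothetical placeholder for a model query.
import chess
import chess.engine

def llm_choose_move(board: chess.Board) -> chess.Move:
    """Placeholder: in SPIN-Bench this would be an LLM picking from legal moves."""
    return next(iter(board.legal_moves))

def play_game(stockfish_path: str, skill_level: int = 5) -> str:
    engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
    engine.configure({"Skill Level": skill_level})   # 0 (weakest) .. 20 (full strength)
    board = chess.Board()
    try:
        while not board.is_game_over():
            if board.turn == chess.WHITE:            # LLM plays White
                board.push(llm_choose_move(board))
            else:                                    # Stockfish plays Black
                result = engine.play(board, chess.engine.Limit(time=0.1))
                board.push(result.move)
    finally:
        engine.quit()
    return board.result()                            # e.g. "1-0", "0-1", "1/2-1/2"

# print(play_game("/usr/bin/stockfish", skill_level=0))
```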

Game Settings and Evaluation Metrics

The SPIN-Bench Framework

Building on the motivations outlined in our introduction, SPIN-Bench's architecture is organized around three progressively complex problem settings for automated action selection: Classical Planning (single-agent, deterministic), Multi-Agent Games (cooperative or competitive), and Strategic Games (mixed cooperation, competition, and negotiation). Each setting introduces additional layers of complexity, requiring increasingly sophisticated reasoning capabilities.

The framework consists of two core components: (1) the Game Agent, which encompasses the LLMs and their adaptive prompting, and (2) the Environment and Evaluation subsystem, which manages game logic, tracks interactions, and quantifies performance. Our flexible interface feeds models the current state description, relevant history, and legal actions, enabling standardized evaluation across diverse scenarios while maintaining game-specific requirements.
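
As a rough sketch of that interface (the field names and the `query_llm` helper are assumptions for illustration, not SPIN-Bench's released API), each turn's prompt can be assembled from the state description, recent history, and the legal-action list, and the model's reply mapped back onto a legal action:

```python
# Illustrative sketch of the state -> prompt -> action loop described above.
# Field names and the `query_llm` helper are assumptions, not SPIN-Bench's API.
from dataclasses import dataclass, field

@dataclass
class Observation:
    state_description: str                              # text description of the current state
    history: list[str] = field(default_factory=list)    # recent moves / messages
    legal_actions: list[str] = field(default_factory=list)

def build_prompt(obs: Observation) -> str:
    return (
        "You are playing a turn-based game.\n"
        f"Current state:\n{obs.state_description}\n\n"
        "Recent history:\n" + "\n".join(obs.history[-10:]) + "\n\n"
        "Legal actions:\n" + "\n".join(f"- {a}" for a in obs.legal_actions) + "\n\n"
        "Reply with exactly one legal action."
    )

def select_action(obs: Observation, query_llm) -> str:
    """Query the model and fall back to the first legal action on a bad reply."""
    reply = query_llm(build_prompt(obs)).strip()
    return reply if reply in obs.legal_actions else obs.legal_actions[0]

# Example with a stub model that always answers "e2e4":
obs = Observation("Standard chess opening position.", [], ["e2e4", "d2d4", "g1f3"])
print(select_action(obs, lambda prompt: "e2e4"))   # -> e2e4
```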

For evaluation, we employ multiple metrics tailored to each environment type. Our rule-based metrics include accuracy and N-Step Look Ahead for planning tasks, move quality comparison against solvers for competitive games, and final scores for cooperative scenarios. We maintain leaderboard-based comparisons with internal Elo ratings to gauge relative performance across models and against human baselines. For negotiation-heavy settings, we utilize six fine-grained, LLM-assisted negotiation metrics that analyze message-strategy alignment, proposal acceptance, deal equity, conflict tendencies, perspective-taking, and conditional negotiation abilities.
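
As one concrete piece of that machinery, the internal Elo ratings can be maintained with the standard logistic Elo update shown below (a generic sketch; the K-factor of 32 is an assumed parameter, not necessarily the benchmark's exact setting).

```python
# Standard Elo update: a generic sketch of how pairwise game outcomes can be
# turned into relative ratings; the K-factor here is an assumed parameter,
# not necessarily SPIN-Bench's exact configuration.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Example: a 1500-rated model draws against a 1600-rated one and gains points.
print(elo_update(1500, 1600, 0.5))   # ≈ (1504.5, 1595.5)
```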

Experimental Results


BibTeX

@misc{yao2025spinbenchllmsplanstrategically,
      title={SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?},
      author={Jianzhu Yao and Kevin Wang and Ryan Hsieh and Haisu Zhou and Tianqing Zou and Zerui Cheng and Zhangyang Wang and Pramod Viswanath},
      year={2025},
      eprint={2503.12349},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2503.12349},
}