Abstract
Large language models (LLMs) have demonstrated powerful problem-solving capabilities, in particular when organized in multi-agent systems. However, the advent of such systems also raises questions about the ability of a complex network of agents to effectively self-organize and collaborate. While measuring performance on standard reasoning benchmarks indicates how well multi-agent systems can solve reasoning tasks, it is unclear whether these systems are able to leverage their topology effectively. Here, we propose AgentsNet, a new benchmark for multi-agent reasoning. By drawing inspiration from classical problems in distributed systems and graph theory, AgentsNet measures the ability of multi-agent systems to collaboratively form strategies for problem-solving, self-organization, and effective communication given a network topology. We evaluate a variety of baseline methods on AgentsNet, including homogeneous networks of agents which first have to agree on basic protocols for organization and communication. We find that some frontier LLMs already demonstrate strong performance on small networks but begin to fall off as the network grows. While existing multi-agent benchmarks often focus on a limited number of agents, the coordination tasks of AgentsNet are practically unlimited in size and can scale with new generations of LLMs. As such, we also probe frontier models in a setup with up to 100 agents.
🔬 Key Research Findings
Coordination Challenges: Even the best-performing models (Gemini 2.5 Pro at 80% overall success rate, Claude 3.7 Sonnet at 70%) show declining performance as network size increases, with near-zero success rates on 100-agent networks.
Task Complexity: Performance varies significantly across different coordination tasks, with consensus being easier for most models while vertex cover proves most challenging across all network sizes.
Multi-Agent Coordination as a Research Problem
Language models have rapidly advanced in individual reasoning tasks, but how well can they collaborate? In distributed computing, agents must coordinate without centralized control, limited to local knowledge and message passing. This makes it an ideal playground for testing the limits of language-based multi-agent systems. In AgentsNet, we revisit five classical coordination problems and reformulate them as communication games between language model agents. Each agent only sees its local neighborhood and must reason, infer, and coordinate with peers using natural language.
Five Coordination Tasks from Distributed Computing
AgentsNet evaluates multi-agent systems on five fundamental problems from distributed computing theory. These tasks span different coordination complexities and communication requirements, from local neighborhood decisions to global network-wide agreements.
Theoretical Foundation
Each task represents a different class of distributed computing problem with well-understood theoretical complexity. They range from problems solvable in O(log* n) rounds (like graph coloring) to those requiring O(D) rounds where D is the network diameter (like consensus), providing a comprehensive test of coordination capabilities.
🎨 (Δ + 1)-Coloring
Problem: Each agent must select a color from {1, 2, ..., Δ+1} (where Δ is the maximum degree) such that no two neighboring agents choose the same color.
Application: Role assignment in distributed systems where neighboring nodes must have distinct functions.
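The network-level success criterion for this task can be sketched as a simple checker (a minimal sketch; `adjacency` maps each agent to its neighbor set, and the function name is illustrative, not part of the benchmark):

```python
def is_valid_coloring(adjacency, colors):
    """Check the (Δ+1)-coloring constraints: every agent picked a color
    in {1, ..., Δ+1} and no two neighbors share a color."""
    max_degree = max(len(nbrs) for nbrs in adjacency.values())
    for agent, nbrs in adjacency.items():
        if not 1 <= colors[agent] <= max_degree + 1:
            return False  # color outside the allowed palette
        if any(colors[n] == colors[agent] for n in nbrs):
            return False  # conflict with a neighbor
    return True

# A 4-cycle: Δ = 2, so agents may use colors {1, 2, 3}.
cycle4 = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "A"}}
```

Alternating colors around the cycle succeeds, while giving two adjacent agents the same color fails.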
👑 Leader Election
Problem: Exactly one agent must output "Yes" (becoming the leader) while all others output "No", without any external coordination.
Application: Establishing hierarchy and delegation in distributed systems.
🤝 Consensus
Problem: All agents must agree on the same binary value (0 or 1) through local communication only.
Application: Agreement protocols in distributed databases and blockchain systems.
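Both global tasks reduce to a simple predicate over the agents' final outputs, which can be sketched as follows (hypothetical helper names; `outputs` maps agent names to their final answers):

```python
def leader_elected(outputs):
    """Leader Election succeeds iff exactly one agent answered 'Yes'."""
    return sum(ans == "Yes" for ans in outputs.values()) == 1

def consensus_reached(outputs):
    """Consensus succeeds iff every agent committed to the same value."""
    return len(set(outputs.values())) == 1
```

The difficulty lies not in these checks but in reaching such a configuration through local message passing alone.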
💑 Maximal Matching
Problem: Agents form pairs such that no agent can be in multiple pairs, and no additional pairs can be formed.
Application: Resource allocation and task assignment without conflicts.
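The maximal-matching criterion can be sketched as a checker (a sketch with illustrative names; `matched` maps each agent to its partner, or `None` if unmatched):

```python
def is_maximal_matching(adjacency, matched):
    """Valid: partnerships are mutual and run along edges.
    Maximal: no edge is left with both endpoints unmatched."""
    for agent, partner in matched.items():
        if partner is not None:
            if partner not in adjacency[agent] or matched[partner] != agent:
                return False  # non-mutual or non-edge partnership
    return not any(
        matched[a] is None and matched[b] is None
        for a in adjacency for b in adjacency[a]
    )

# A path A - B - C - D.
path4 = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
```

Note that "maximal" means no further pair can be added, not that the number of pairs is maximum.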
🛡️ Minimal Vertex Cover
Problem: Select a minimal set of "coordinator" agents such that every connection has at least one coordinator as an endpoint.
Application: Monitoring and oversight systems in distributed networks.
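Here "minimal" means inclusion-minimal: no coordinator can be dropped without uncovering an edge. A checker for this reading can be sketched as (illustrative names, not benchmark code):

```python
def is_minimal_vertex_cover(adjacency, cover):
    """Every edge must have an endpoint in `cover`, and no vertex in
    `cover` may be redundant (inclusion-minimality, not minimum size)."""
    edges = {frozenset((a, b)) for a in adjacency for b in adjacency[a]}
    if any(not (e & cover) for e in edges):
        return False  # some edge is uncovered
    for v in cover:
        # v is redundant if every edge touching v is also covered elsewhere.
        if all((e - {v}) & cover for e in edges if v in e):
            return False
    return True

# A path A - B - C - D.
path4 = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
```

On the path, {B, C} is a minimal cover, while {A, B, C} covers every edge but contains the redundant vertex A.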
How the Benchmark Works
Protocol Specifications
Round-Based Communication: Agents operate in a fully synchronous setting. In each round, they perform the following steps:
- Receive: All messages sent by their direct neighbors in the previous round are provided as input.
- Process: The agent reasons over these messages together with static task instructions and its local neighborhood view.
- Respond: It sends one customized message to each of its neighbors for the next round.
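The synchronous receive-process-respond loop above can be sketched as a simulation harness (a minimal sketch; the `agent_step` callback, standing in for an LLM call, and all names are assumptions of this illustration):

```python
def run_rounds(adjacency, num_rounds, agent_step):
    """Synchronous message passing: each round, every agent consumes the
    messages its neighbors sent last round and emits one message per
    neighbor for the next round."""
    inbox = {a: {} for a in adjacency}  # sender -> message, empty at start
    for _ in range(num_rounds):
        outbox = {}
        for agent, nbrs in adjacency.items():
            # Process: the agent sees only its own inbox and neighborhood.
            outbox[agent] = agent_step(agent, sorted(nbrs), inbox[agent])
        # Deliver: all messages become next round's inboxes simultaneously.
        inbox = {a: {n: outbox[n][a] for n in adjacency[a]} for a in adjacency}
    return inbox

# Toy agent: report how many messages it received last round.
def echo(agent, nbrs, received):
    return {n: f"{agent} got {len(received)} msgs" for n in nbrs}

triangle = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}
final_inbox = run_rounds(triangle, 2, echo)
```

The two-phase structure (compute all outboxes, then deliver at once) is what makes the rounds fully synchronous.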
Number of Rounds:
- For global tasks (Leader Election, Consensus), each agent participates in 2D + 1 rounds, where D is the graph's diameter.
- For local tasks (Coloring, Matching, Vertex Cover), the number of rounds grows logarithmically with network size: 4 rounds for 4 nodes, 5 for 8 nodes, and 6 for 16 nodes.
The communication topology is static throughout.
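The round budget can be computed directly from the graph. The sketch below uses BFS for the diameter; the log2(n) + 2 formula for local tasks is our reading of the stated 4/5/6 schedule, not benchmark code:

```python
import math
from collections import deque

def diameter(adjacency):
    """Graph diameter via BFS from every node (fine at benchmark sizes)."""
    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(eccentricity(v) for v in adjacency)

def num_rounds(adjacency, task):
    """2D + 1 rounds for global tasks; ceil(log2 n) + 2 for local tasks,
    which reproduces 4/5/6 rounds for 4/8/16 nodes."""
    if task in {"leader_election", "consensus"}:
        return 2 * diameter(adjacency) + 1
    return math.ceil(math.log2(len(adjacency))) + 2

# A path A - B - C - D has diameter 3.
path4 = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
```

The 2D + 1 budget guarantees that information can traverse the network twice, with one extra round to commit an answer.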
Agent World View: Each agent is assigned a unique name and knows only:
- Its own name
- The names of its immediate neighbors
- The current task description
- The number of rounds
- The set of messages received in the previous rounds
Evaluation: After the final round, each agent returns a decision (e.g., selecting a color, voting for a leader, accepting a match). A run is successful only if the entire network collectively satisfies the problem constraints. This defines a strict binary success criterion for evaluation.
Network Topologies
AgentsNet evaluates coordination across three graph models with different structural properties:
Small-World Networks (Watts-Strogatz): High clustering with short path lengths, modeling hierarchical organizational structures.
Scale-Free Networks (Barabási-Albert): Power-law degree distributions with hub nodes, reflecting internet-like topologies.
Geometric Networks (Delaunay Triangulation): Spatial relationships between nearby agents, modeling geographic or sensor networks.
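In practice these graphs would typically come from standard generators (e.g. networkx's `watts_strogatz_graph` and `barabasi_albert_graph`, or scipy's `Delaunay`); as a self-contained illustration, here is a pure-stdlib sketch of the Watts-Strogatz model, under the usual ring-rewiring construction:

```python
import random

def watts_strogatz(n, k, p, seed=0):
    """Small-world graph: a ring lattice where each node links to its k
    nearest neighbors, then each edge is rewired with probability p."""
    rng = random.Random(seed)
    adjacency = {v: set() for v in range(n)}
    # Ring lattice: connect each node to k/2 neighbors on either side.
    for v in range(n):
        for offset in range(1, k // 2 + 1):
            u = (v + offset) % n
            adjacency[v].add(u)
            adjacency[u].add(v)
    # Rewire: replace each edge with a random chord with probability p.
    for v in range(n):
        for u in list(adjacency[v]):
            if u > v and rng.random() < p:
                candidates = [w for w in range(n)
                              if w != v and w not in adjacency[v]]
                if candidates:
                    adjacency[v].discard(u)
                    adjacency[u].discard(v)
                    w = rng.choice(candidates)
                    adjacency[v].add(w)
                    adjacency[w].add(v)
    return adjacency
```

Low rewiring probabilities preserve the high clustering of the lattice while the occasional chord creates the short path lengths characteristic of small worlds.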
Experimental Results
We evaluated 10 frontier language models on AgentsNet, including Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1 mini, and others. The evaluation covers networks ranging from 4 to 100 agents across three network topologies.
🔑 Key Takeaways
🏆 Top Performers
Gemini 2.5 Pro achieved the highest overall score (80%), followed by Claude 3.7 Sonnet (70%) and Gemini 2.5 Flash (69%). These models showed the most consistent performance across multiple tasks and network sizes.
📈 Scaling Patterns
All models exhibit sharply declining performance as network size increases, with coordination becoming increasingly difficult beyond 16 agents and deteriorating completely for 100 agents. This pattern persists across all five coordination tasks.
📊 Task Difficulty Hierarchy
Consensus was generally easier for most models (>85% for top performers), while Vertex Cover proved most challenging across all network sizes. Leader Election, Coloring, and Matching showed moderate difficulty with high variance across models.
🔍 Reasoning Models
o4-mini and other reasoning-enhanced models showed mixed results, sometimes excelling at complex coordination tasks like Leader Election but struggling with seemingly simpler challenges, suggesting that reasoning capability alone is insufficient for coordination.
Performance Analysis
Below we provide the fraction of solved instances per task across all evaluated models. The results reveal significant variation in coordination capabilities across models and tasks, with performance generally decreasing as network size increases.
Even for small 4-node graphs, no model is able to solve all tasks consistently. The Consensus task is solved by most models with high success rates, while performance on Vertex Cover is notably low for most models, particularly for 8 and 16 nodes. Overall, the best-performing models are Claude 3.7 Sonnet, Gemini 2.5 Pro, and Gemini 2.5 Flash.
Cost-Performance Trade-offs
We analyze the relationship between model performance and API costs to identify cost-effective options for multi-agent coordination tasks. The cost analysis considers the total expense of running complete AgentsNet evaluations, including all message-passing rounds and final answer generation.
Gemini 2.5 Flash emerges as particularly cost-effective, offering performance nearly equivalent to Claude 3.7 Sonnet (69% vs 70%) while being approximately 22 times cheaper to run ($12.05 vs $266.75). This makes it an attractive option for large-scale multi-agent coordination experiments. Gemini 2.0 Flash also represents a strong cost-performance option among the lower-cost models.
Scalability to Large Networks
To probe whether AgentsNet can scale with future model capabilities, we conducted additional experiments with networks of up to 100 agents using Gemini 2.0 Flash, which demonstrates good performance on AgentsNet while remaining cost-efficient. We generated 81 network topologies and ran message-passing for 2D + 1 rounds for all tasks, where D is the graph diameter.
We observe that performance smoothly decreases as the network grows in size. Although the five tasks vary in inherent difficulty, for example, Matching and Coloring are often easier on small graphs than Consensus or Leader Election, all tasks become substantially more challenging as the size of the network increases. For 100-agent networks, performance drops to near zero across all tasks.
This demonstrates that the difficulty of AgentsNet can be gradually increased by considering larger networks. Importantly, this increase in difficulty can be achieved without any changes to the benchmark design, which allows for arbitrary network sizes. This scalability property enables AgentsNet to remain challenging as model capabilities improve.
Interactive Exploration
To bring our benchmark to life, we have developed an interactive tool that visualizes how agents communicate throughout each task. Explore how language model agents reason, plan, and collaborate, entirely through natural language. You can upload real transcripts that were generated by our benchmark or select from example transcripts to see how agents interact in different coordination tasks.
📊 Exploration Guide
Step 1: Choose example transcript or load your own file
Step 2: Click on different agents to examine their communication strategies
Step 3: Use neighbor filtering to analyze specific agent interactions
Step 4: Compare successful and unsuccessful coordination attempts
Recurring Coordination Patterns
In our qualitative analysis we identified several recurring patterns of coordination that commonly emerge in multi-agent interactions across tasks and models:
⏰ Strategy Coordination Failure
Observation: Agents fail to converge on common problem-solving approaches within available communication rounds, sometimes proposing strategies too late or assuming strategies without explicit communication.
Example: In graph coloring tasks, agents often start with different algorithmic approaches without negotiating a unified strategy.
🤔 Uncritical Information Acceptance
Observation: Agents generally accept information from neighbors without verification, even when it contradicts their own observations or contains errors.
Example: In one vertex cover instance, agents accepted incorrect topology information (star vs. complete graph) leading to coordination failure.
🤝 Emergent Helping Behavior
Observation: Agents spontaneously exhibit cooperative behavior, alerting neighbors to detected conflicts without explicit programming for such oversight.
Example: Agents proactively alerting neighbors when they detect conflicting color assignments in graph coloring tasks.
Takeaways and Future Directions
AgentsNet demonstrates that even classic distributed problems can serve as rich testbeds for multi-agent coordination with language models. By recasting tasks like leader election and graph coloring as communication games, we uncover both the strengths and limitations of LLMs when acting as decentralized agents. This work is only a starting point. There are many open directions: enabling asynchronous communication, scaling to larger networks, or incorporating more structured grounding and task memory. We also envision the benchmark evolving to include new task types, from planning and negotiation to real-world applications in multi-agent reasoning.