AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs

Literal Message Passing: Distributed Computing with LLM Agents
📄 ArXiv 💻 GitHub 🤗 Hugging Face
Florian Grötschla1*, Luis Müller2*, Jan Tönshoff2*,
Mikhail Galkin3, Bryan Perozzi3
1ETH Zurich    2RWTH Aachen University    3Google Research
*Equal contribution
Contact: fgroetschla@ethz.ch, luis.mueller@cs.rwth-aachen.de, toenshoff@cs.rwth-aachen.de

Abstract

Large language models (LLMs) have demonstrated powerful problem-solving capabilities, in particular when organized in multi-agent systems. However, the advent of such systems also raises several questions about the ability of a complex network of agents to effectively self-organize and collaborate. While measuring performance on standard reasoning benchmarks indicates how well multi-agent systems can solve reasoning tasks, it is unclear whether these systems are able to leverage their topology effectively. Here, we propose AgentsNet, a new benchmark for multi-agent reasoning. Drawing inspiration from classical problems in distributed systems and graph theory, AgentsNet measures the ability of multi-agent systems to collaboratively form strategies for problem-solving, self-organization, and effective communication given a network topology. We evaluate a variety of baseline methods on AgentsNet, including homogeneous networks of agents that first have to agree on basic protocols for organization and communication. We find that some frontier LLMs already demonstrate strong performance on small networks but begin to fall off as the network grows. While existing multi-agent benchmarks often focus on a limited number of agents, the coordination tasks of AgentsNet are practically unlimited in size and can scale with new generations of LLMs. As such, we also probe frontier models in a setup with up to 100 agents.

🔬 Key Research Findings

Coordination Challenges: Even the best-performing models (Gemini 2.5 Pro at 80% overall success rate, Claude 3.7 Sonnet at 70%) show declining performance as network size increases, with near-zero success rates on 100-agent networks.

Task Complexity: Performance varies significantly across different coordination tasks, with consensus being easier for most models while vertex cover proves most challenging across all network sizes.

Multi-Agent Coordination as a Research Problem

Language models have rapidly advanced in individual reasoning tasks, but how well can they collaborate? In distributed computing, agents must coordinate without centralized control, limited to local knowledge and message passing. This makes it an ideal playground for testing the limits of language-based multi-agent systems. In AgentsNet, we revisit five classical coordination problems and reformulate them as communication games between language model agents. Each agent only sees its local neighborhood and must reason, infer, and coordinate with peers using natural language.

Five Coordination Tasks from Distributed Computing

AgentsNet evaluates multi-agent systems on five fundamental problems from distributed computing theory. These tasks span different coordination complexities and communication requirements, from local neighborhood decisions to global network-wide agreements.

Theoretical Foundation

Each task represents a different class of distributed computing problem with well-understood theoretical complexity. They range from problems solvable in O(log* n) rounds (like graph coloring) to those requiring O(D) rounds where D is the network diameter (like consensus), providing a comprehensive test of coordination capabilities.

🎨 (Δ + 1)-Coloring

Problem: Each agent must select a color from {1, 2, ..., Δ+1} (where Δ is the maximum degree) such that no two neighboring agents choose the same color.

Application: Role assignment in distributed systems where neighboring nodes must have distinct functions.

👑 Leader Election

Problem: Exactly one agent must output "Yes" (becoming the leader) while all others output "No", without any external coordination.

Application: Establishing hierarchy and delegation in distributed systems.

🤝 Consensus

Problem: All agents must agree on the same binary value (0 or 1) through local communication only.

Application: Agreement protocols in distributed databases and blockchain systems.

💑 Maximal Matching

Problem: Neighboring agents form pairs such that no agent is in more than one pair and no additional pair of unmatched neighbors can be formed.

Application: Resource allocation and task assignment without conflicts.

🛡️ Minimal Vertex Cover

Problem: Select a minimal set of "coordinator" agents such that every connection has at least one coordinator as an endpoint.

Application: Monitoring and oversight systems in distributed networks.

How the Benchmark Works

Protocol Specifications

Round-Based Communication: Agents operate in a fully synchronous setting. In each round, they perform the following steps:

  1. Receive: All messages sent by the agent's direct neighbors in the previous round are provided as input.
  2. Process: The agent reasons over these messages together with static task instructions and its local neighborhood view.
  3. Respond: It sends one customized message to each of its neighbors for the next round.

Number of Rounds: For global tasks (Leader Election, Consensus), each agent participates in 2D+1 rounds, where D is the graph’s diameter. For local tasks (Coloring, Matching, Vertex Cover), rounds grow logarithmically with network size: 4 rounds for 4 nodes, 5 for 8 nodes, and 6 for 16 nodes. The communication topology is static throughout.
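To make the protocol concrete, here is a minimal sketch of the synchronous receive-process-respond loop and the round budgets described above. The agent interface (`step`, `final_answer`) and helper names are hypothetical illustrations, not the benchmark's actual API; the round counts simply mirror the numbers given in the text.

```python
import math
import networkx as nx

def num_rounds(task: str, graph: nx.Graph) -> int:
    """Round budget as described above (hypothetical helper)."""
    if task in {"leader_election", "consensus"}:   # global tasks: 2D + 1
        return 2 * nx.diameter(graph) + 1
    # local tasks: 4 rounds for 4 nodes, 5 for 8 nodes, 6 for 16 nodes, ...
    return int(math.log2(graph.number_of_nodes())) + 2

def run_protocol(graph: nx.Graph, agents: dict, task: str) -> dict:
    """Fully synchronous rounds: receive -> process -> respond."""
    inbox = {v: {} for v in graph}                 # neighbor name -> message
    for _ in range(num_rounds(task, graph)):
        outbox = {}
        for v in graph:
            # Each agent sees only its inbox and its local neighborhood.
            outbox[v] = agents[v].step(inbox[v], list(graph.neighbors(v)))
        # Deliver: every agent receives one message per neighbor next round.
        inbox = {v: {u: outbox[u][v] for u in graph.neighbors(v)} for v in graph}
    # After the final round, every agent commits to a decision.
    return {v: agents[v].final_answer() for v in graph}
```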

Agent World View: Each agent is assigned a unique name and knows only:

  • Its own name
  • The names of its immediate neighbors
  • The current task description
  • The number of rounds
  • The messages received in previous rounds

Evaluation: After the final round, each agent returns a decision (e.g., selecting a color, voting for a leader, accepting a match). A run is successful only if the entire network collectively satisfies the problem constraints. This defines a strict binary success criterion for evaluation.
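To illustrate the strict binary success criterion, the sketch below checks final answers for two of the tasks. The data structures (a networkx graph plus a dict mapping agent to final answer) are assumptions made for this example, not the benchmark's evaluation code.

```python
import networkx as nx

def check_coloring(graph: nx.Graph, answer: dict) -> bool:
    """(Δ+1)-Coloring: colors in {1, ..., Δ+1}, no edge monochromatic."""
    max_deg = max(dict(graph.degree()).values())
    if any(answer[v] not in range(1, max_deg + 2) for v in graph):
        return False
    return all(answer[u] != answer[v] for u, v in graph.edges())

def check_consensus(graph: nx.Graph, answer: dict) -> bool:
    """Consensus: every agent outputs the same binary value."""
    values = set(answer.values())
    return len(values) == 1 and values <= {0, 1}

def check_leader_election(graph: nx.Graph, answer: dict) -> bool:
    """Leader Election: exactly one agent answers 'Yes'."""
    return sum(1 for v in graph if answer[v] == "Yes") == 1
```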

A visualization of the communication protocol.
Example: Leader Election via Local Messaging. Three agents (Emily, Zach, and Tom) collaborate to elect a leader by exchanging natural language messages with their neighbors. Each agent sees only local context and neighbor messages. After three rounds of communication, all agents agree that Emily should be the leader.

Network Topologies

AgentsNet evaluates coordination across three graph models with different structural properties:

Small-World Networks (Watts-Strogatz): High clustering with short path lengths, modeling hierarchical organizational structures.

Scale-Free Networks (Barabási-Albert): Power-law degree distributions with hub nodes, reflecting internet-like topologies.

Geometric Networks (Delaunay Triangulation): Spatial relationships between nearby agents, modeling geographic or sensor networks.
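As a rough illustration, graphs of these three families can be generated with networkx and scipy as sketched below; the parameter values are illustrative defaults, not necessarily those used in the benchmark.

```python
import networkx as nx
import numpy as np
from itertools import combinations
from scipy.spatial import Delaunay

def small_world(n: int, k: int = 4, p: float = 0.3, seed: int = 0) -> nx.Graph:
    # Watts-Strogatz ring lattice with rewired edges (kept connected).
    return nx.connected_watts_strogatz_graph(n, k, p, seed=seed)

def scale_free(n: int, m: int = 2, seed: int = 0) -> nx.Graph:
    # Barabasi-Albert preferential attachment with hub nodes.
    return nx.barabasi_albert_graph(n, m, seed=seed)

def geometric(n: int, seed: int = 0) -> nx.Graph:
    # Delaunay triangulation of random points in the unit square.
    points = np.random.default_rng(seed).random((n, 2))
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    for simplex in Delaunay(points).simplices:
        for u, v in combinations(map(int, simplex), 2):
            graph.add_edge(u, v)
    return graph
```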

Experimental Results

We evaluated 10 frontier language models on AgentsNet, including Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1 mini, and others. The evaluation covers networks ranging from 4 to 100 agents across three network topologies.

🔑 Key Takeaways

🏆 Top Performers

Gemini 2.5 Pro achieved the highest overall score (80%), followed by Claude 3.7 Sonnet (70%) and Gemini 2.5 Flash (69%). These models showed the most consistent performance across multiple tasks and network sizes.

📈 Scaling Patterns

All models exhibit sharply declining performance as network size increases, with coordination becoming increasingly difficult beyond 16 agents and deteriorating completely at 100 agents. This pattern persists across all five coordination tasks.

📊 Task Difficulty Hierarchy

Consensus was generally easier for most models (>85% for top performers), while Vertex Cover proved most challenging across all network sizes. Leader Election, Coloring, and Matching showed moderate difficulty with high variance across models.

🔍 Reasoning Models

o4-mini and other reasoning-enhanced models showed mixed results, sometimes excelling at complex coordination tasks like Leader Election but struggling with seemingly simpler challenges, suggesting that reasoning capability alone is insufficient for coordination.

Performance Analysis

Below we provide the fraction of solved instances per task across all evaluated models. The results reveal significant variation in coordination capabilities across models and tasks, with performance generally decreasing as network size increases.

A bar chart showing the main results of the AgentsNet benchmark
Main Results: Success rates across different AI models for the five AgentsNet tasks, broken down by network size (4, 8, and 16 nodes). Each bar represents the percentage of successfully solved instances.

Even on small 4-node graphs, no model solves all tasks consistently. The Consensus task is solved by most models with high success rates, while performance on Vertex Cover is notably low for most models, particularly for 8 and 16 nodes. Overall, the best-performing models are Claude 3.7 Sonnet, Gemini 2.5 Pro, and Gemini 2.5 Flash.

Cost-Performance Trade-offs

We analyze the relationship between model performance and API costs to identify cost-effective options for multi-agent coordination tasks. The cost analysis considers the total expense of running complete AgentsNet evaluations, including all message-passing rounds and final answer generation.

A scatter plot showing the price vs performance of the tested models.
Cost-Performance Analysis: AgentsNet performance versus API costs per experimental run. Gold stars indicate Pareto-optimal models that offer the best performance for their price point.

Gemini 2.5 Flash emerges as particularly cost-effective, offering performance nearly equivalent to Claude 3.7 Sonnet (69% vs 70%) while being approximately 22 times cheaper to run ($12.05 vs $266.75). This makes it an attractive option for large-scale multi-agent coordination experiments. Gemini 2.0 Flash also represents a strong cost-performance option among the lower-cost models.

Scalability to Large Networks

To probe whether AgentsNet can scale with future model capabilities, we conducted additional experiments with networks of up to 100 agents using Gemini 2.0 Flash, which demonstrates good performance on AgentsNet while remaining cost-efficient. We generated 81 network topologies and ran message-passing for 2D + 1 rounds for all tasks, where D is the graph diameter.

A heatmap showing the performance on different tasks for graphs up to 100 nodes.
Scalability Challenge: Performance of Gemini 2.0 Flash as network size increases from 20 to 100 agents. Success rates approach zero for all tasks in very large networks, highlighting fundamental scalability challenges.

We observe that performance decreases smoothly as the network grows. Although the five tasks vary in inherent difficulty (for example, Matching and Coloring are often easier on small graphs than Consensus or Leader Election), all tasks become substantially more challenging as the network size increases. For 100-agent networks, performance drops to near zero across all tasks.

This demonstrates that the difficulty of AgentsNet can be gradually increased by considering larger networks. Importantly, this increase in difficulty can be achieved without any changes to the benchmark design, which allows for arbitrary network sizes. This scalability property enables AgentsNet to remain challenging as model capabilities improve.

Interactive Exploration

To bring our benchmark to life, we have developed an interactive tool that visualizes how agents communicate throughout each task. Explore how language model agents reason, plan, and collaborate, entirely through natural language. You can upload real transcripts that were generated by our benchmark or select from example transcripts to see how agents interact in different coordination tasks.

📊 Exploration Guide

Step 1: Choose example transcript or load your own file

Step 2: Click on different agents to examine their communication strategies

Step 3: Use neighbor filtering to analyze specific agent interactions

Step 4: Compare successful and unsuccessful coordination attempts


Recurring Coordination Patterns

In our qualitative analysis we identified several recurring patterns of coordination that commonly emerge in multi-agent interactions across tasks and models:

⏰ Strategy Coordination Failure

Observation: Agents fail to converge on common problem-solving approaches within available communication rounds, sometimes proposing strategies too late or assuming strategies without explicit communication.

Example: In graph coloring tasks, agents often start with different algorithmic approaches without negotiating a unified strategy.

🤔 Uncritical Information Acceptance

Observation: Agents generally accept information from neighbors without verification, even when it contradicts their own observations or contains errors.

Example: In one vertex cover instance, agents accepted incorrect topology information (star vs. complete graph) leading to coordination failure.

🤝 Emergent Helping Behavior

Observation: Agents spontaneously exhibit cooperative behavior, alerting neighbors to detected conflicts without explicit programming for such oversight.

Example: Agents proactively alerting neighbors when they detect conflicting color assignments in graph coloring tasks.

Takeaways and Future Directions

AgentsNet demonstrates that even classic distributed problems can serve as rich testbeds for multi-agent coordination with language models. By recasting tasks like leader election and graph coloring as communication games, we uncover both the strengths and limitations of LLMs when acting as decentralized agents. This work is only a starting point. There are many open directions: enabling asynchronous communication, scaling to larger networks, or incorporating more structured grounding and task memory. We also envision the benchmark evolving to include new task types, from planning and negotiation to real-world applications in multi-agent reasoning.

📄 Citation

@misc{grötschla2025agentsnetcoordinationcollaborativereasoning,
      title={AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs},
      author={Florian Grötschla and Luis Müller and Jan Tönshoff and Mikhail Galkin and Bryan Perozzi},
      year={2025},
      eprint={2507.08616},
      archivePrefix={arXiv},
      primaryClass={cs.MA},
      url={https://arxiv.org/abs/2507.08616},
}