
How to Automatically Attribute Failures in LLM Multi-Agent Systems Using the Who&When Dataset

Last updated: 2026-05-16 10:16:40 · Science & Space

Introduction

When your LLM-powered multi-agent system fails on a task, you're not just left with a broken output — you're left with a headache. Which agent made the mistake? At what step did things go wrong? Manual log crawling feels like hunting for a single typo in a novel. Fortunately, researchers from Penn State University and Duke University, in collaboration with Google DeepMind, the University of Washington, Meta, Nanyang Technological University, and Oregon State University, have introduced a structured solution: automated failure attribution. Their work, accepted as a Spotlight presentation at ICML 2025, provides the first benchmark dataset (Who&When) and several evaluation methods to pinpoint the root cause of failures. This guide walks you through applying these tools to your own multi-agent systems, saving you hours of frustration.

Source: syncedreview.com

What You Need

  • Python 3.8+ environment
  • Git to clone the official code repository
  • Basic understanding of LLM multi-agent system architectures
  • Access to a Hugging Face account to download the Who&When dataset
  • Compute resources (a machine with at least 16GB RAM, GPU optional but recommended for large models)

Step-by-Step Guide

Step 1: Understand the Task of Failure Attribution

Before diving into code, grasp the core concept. In LLM multi-agent systems, multiple agents collaborate (e.g., via conversation or tool use) to solve a problem. A failure occurs when the final output is incorrect or incomplete. Failure attribution answers two questions: which agent caused the failure and at which point in the interaction (i.e., which timestamp or turn). The Who&When dataset simulates such failures with ground-truth labels, so you can evaluate the accuracy of your attribution method.
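To make this concrete, here is a hypothetical labeled record (the field names are illustrative, not the dataset's exact schema; see Step 4 for how to inspect the real one):

# Hypothetical failure record; field names are illustrative only.
failure_example = {
    "history": [
        {"agent": "planner",  "step": 0, "content": "Split the task into subtasks..."},
        {"agent": "coder",    "step": 1, "content": "def solve(): ..."},   # the actual mistake
        {"agent": "verifier", "step": 2, "content": "Looks correct."},     # missed the bug
    ],
    "is_correct": False,        # the final output was wrong
    "mistake_agent": "coder",   # WHO: the responsible agent
    "mistake_step": 1,          # WHEN: the step where the error first appeared
}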

Step 2: Clone the Repository and Set Up the Environment

  1. Open a terminal and run:
    git clone https://github.com/mingyin1/Agents_Failure_Attribution.git
  2. Navigate to the directory:
    cd Agents_Failure_Attribution
  3. Create a virtual environment (recommended):
    python -m venv venv
    source venv/bin/activate # On Windows: venv\Scripts\activate
  4. Install dependencies:
    pip install -r requirements.txt

Step 3: Download the Who&When Dataset

The dataset is hosted on Hugging Face. Run the provided download script or use the Hugging Face datasets library:

from datasets import load_dataset

# Pull the dataset from the Hugging Face Hub (cached locally after the first run)
dataset = load_dataset("Kevin355/Who_and_When")

Alternatively, visit the dataset page and download the files manually. Place them in a data/ folder within the repository.
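If you prefer scripting the manual route, a short sketch using the huggingface_hub client downloads the raw files into the data/ folder expected by the evaluation script in Step 6:

from huggingface_hub import snapshot_download

# Download the raw dataset files into the repository's data/ folder so the
# path matches the --dataset_path argument used in Step 6.
snapshot_download(
    repo_id="Kevin355/Who_and_When",
    repo_type="dataset",
    local_dir="./data/Who_and_When",
)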

Step 4: Understand the Dataset Structure

The dataset contains multi-agent interaction logs, each labeled with:

  • Failure type (e.g., reasoning error, miscommunication, missing information)
  • Responsible agent (by ID or role)
  • Failure timestamp (step index where the error first manifested)

Familiarize yourself with the format by examining a sample: dataset['train'][0] in Python.
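For instance, a quick inspection like the following reveals the actual schema (a minimal sketch; the field names printed at the end are assumptions based on the list above, so check the keys() output first):

sample = dataset["train"][0]

# List the top-level fields so you can confirm the exact schema.
print(sample.keys())

# Assuming the fields described above (names are illustrative):
print("Who (agent):", sample.get("mistake_agent"))
print("When (step):", sample.get("mistake_step"))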

Step 5: Choose an Attribution Method

The paper introduces several automated methods. Start with the trace-based method, which feeds the entire interaction trace to a pre-trained LLM and asks it to predict the responsible agent and the failure step. More advanced options include:

  • Contrastive attribution: compares failed traces with successful ones to isolate divergences.
  • Causal intervention: simulates “what if” scenarios by modifying agent outputs and checking if the failure is avoided.

The repository includes scripts for each. For your first run, use the default trace-based approach.
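To build intuition for what the trace-based approach does, here is a minimal sketch (assuming an OpenAI-style client; the prompt template and model name are illustrative, not the repository's exact code):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def attribute_failure(trace: str) -> str:
    """Ask an LLM to read a full interaction trace and name who failed and when."""
    prompt = (
        "Below is the complete log of a multi-agent session that ended in failure.\n"
        "Identify (1) which agent made the decisive mistake and (2) the step index\n"
        "where it occurred. Answer in the form 'agent: <name>, step: <index>'.\n\n"
        + trace
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any sufficiently capable chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content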

Step 6: Run Attribution on a Sample Failure

Execute the provided evaluation script:

python run_attribution.py --dataset_path ./data/Who_and_When --method trace_based --split test

This will analyze a batch of test cases and output predictions alongside the ground truth. The script logs the results, including separate accuracy scores for who and when.
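If you want to recompute the metrics yourself from the predictions, the accounting is straightforward (a sketch over hypothetical prediction/label dicts; the real script's output format may differ):

def score(predictions, labels):
    """Compute 'who' and 'when' accuracy over paired predictions and labels."""
    n = len(labels)
    who_acc = sum(p["agent"] == g["agent"] for p, g in zip(predictions, labels)) / n
    when_acc = sum(p["step"] == g["step"] for p, g in zip(predictions, labels)) / n
    return who_acc, when_acc

preds = [{"agent": "coder", "step": 1}, {"agent": "planner", "step": 0}]
golds = [{"agent": "coder", "step": 2}, {"agent": "planner", "step": 0}]
who, when = score(preds, golds)
print(f"who accuracy: {who:.0%}, when accuracy: {when:.0%}")  # who: 100%, when: 50%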

Step 7: Interpret the Results

Check the output summary. A high who accuracy (e.g., >80%) indicates the method reliably identifies the failing agent. A low when accuracy suggests the method struggles to pinpoint the exact step. Examine the misattributions: does the model blame an agent too early or too late? The paper reports baseline metrics (e.g., random guessing gives ~25% who accuracy in a 4-agent system), so compare your numbers accordingly.

Step 8: Apply to Your Own Multi-Agent System

To use this on your custom system, you must log interactions in the same format as the dataset: a JSON or dict with keys for agent names, message content, timestamps, and final success/failure. Modify the attribution scripts to accept your data. The trace_based method can be adapted by feeding your logs to the LLM with a similar prompt template.
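As a starting point, a converter might look like this (a sketch; the output follows the illustrative schema from Step 1 and should be aligned with whatever keys your copy of the dataset actually uses):

import json

def convert_log(agent_messages, final_success):
    """Map a custom run log to a Who&When-style record.

    agent_messages: ordered list of (agent_name, message_content) tuples.
    final_success: whether the system's final output was correct.
    """
    return {
        "history": [
            {"agent": name, "step": i, "content": content}
            for i, (name, content) in enumerate(agent_messages)
        ],
        "is_correct": final_success,
        # Leave the labels empty: they are what the attribution method predicts.
        "mistake_agent": None,
        "mistake_step": None,
    }

with open("my_failed_run.json", "w") as f:
    json.dump(convert_log([("planner", "..."), ("coder", "...")], False), f, indent=2)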

Tips for Success

  • Start simple: Begin with the provided dataset to validate that your environment runs correctly before applying to your own data.
  • Use a strong LLM: The attribution quality improves with more capable models (e.g., GPT-4, Claude, or open-source models like LLaMA-3). The default script supports OpenAI and Hugging Face models.
  • Log everything: For your own system, ensure you capture every input, output, and intermediate state of each agent. Missing logs lead to ambiguous attribution.
  • Combine methods: The paper shows that ensemble methods (e.g., averaging predictions from the trace-based and contrastive approaches) can boost accuracy by 5–10%; a simple voting rule is sketched after this list.
  • Beware of cascading failures: Sometimes an early, seemingly harmless error triggers a later failure. The “when” label may therefore sit earlier than the step where things visibly break; the dataset accounts for this.
  • Iterate: Use failure attribution as a feedback loop. Once you find a common failure pattern (e.g., Agent 3 consistently misreads numeric data), modify that agent’s prompt or tool use.
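
For the “combine methods” tip, a minimal voting rule over the who predictions of several methods might look like this (a sketch, not the paper's ensemble; ties go to the first method listed):

from collections import Counter

def ensemble_who(per_method_predictions):
    """Majority-vote the 'who' prediction across attribution methods.

    per_method_predictions: one predicted agent name per method,
    e.g. ["coder", "coder", "planner"].
    """
    counts = Counter(per_method_predictions)
    best = counts.most_common(1)[0][1]
    # Among tied winners, prefer the earliest method's answer.
    for name in per_method_predictions:
        if counts[name] == best:
            return name

print(ensemble_who(["coder", "planner", "coder"]))  # -> coder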