Introduction
Every software engineer knows the cycle: you spot a repetitive task, automate it, and then find yourself maintaining the system for everyone else. As an AI researcher on the Copilot Applied Science team, I took this a step further by automating not just mechanical work, but intellectual toil—the kind of analysis that used to require hours of poring over code trajectories. This guide walks you through how I built eval-agents, a shared agent framework that lets my team analyze hundreds of thousands of lines of trajectory data in minutes. By the end, you'll know how to apply these principles to your own workflow.

What You Need
- GitHub Copilot (with access to Copilot Chat and agent features)
- Access to evaluation benchmark data (e.g., TerminalBench2, SWEBench-Pro trajectories in JSON format)
- Basic familiarity with Python or a similar scripting language (for authoring agents)
- A shared repository (e.g., GitHub) for collaboration and version control
- Patience and curiosity to identify patterns worth automating
Step-by-Step Guide
Step 1: Identify Your Repetitive Intellectual Toil
Start by reflecting on tasks that consume your mental energy but follow predictable patterns. In my case, it was analyzing agent trajectories—JSON files recording each thought and action an agent took during a benchmark task. With dozens of tasks per benchmark and multiple runs per day, I faced hundreds of thousands of lines of trajectory data. The repetition lay in surfacing common patterns, like where agents got stuck or which actions they overused. Ask yourself: What analysis do I do repeatedly? Where am I reading more than I need to? That's your target for automation.
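To make this concrete, here's a minimal sketch of what one trajectory record might look like. The exact schema varies by benchmark—the field names below (`task_id`, `steps`, `thought`, `action`) are illustrative assumptions, not a real benchmark format:

```python
# Hypothetical trajectory record: most formats boil down to an
# ordered list of thought/action steps for one benchmark task.
trajectory = {
    "task_id": "fix-broken-build",
    "steps": [
        {"thought": "Inspect the failing test", "action": "read_file"},
        {"thought": "Re-run the suite", "action": "run_command"},
        {"thought": "Try the same command again", "action": "run_command"},
    ],
}

def actions_taken(traj):
    """Return the ordered list of action names in a trajectory."""
    return [step["action"] for step in traj["steps"]]

print(actions_taken(trajectory))  # ['read_file', 'run_command', 'run_command']
```

Once your data fits a shape like this, "where did the agent get stuck?" becomes a question you can answer with a few lines of code instead of an afternoon of reading.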
Step 2: Prototype a Pattern‑Finding Loop with Copilot
Before building a full agent, use GitHub Copilot Chat to interactively explore your data. For example, prompt Copilot to summarize a trajectory file, count occurrences of specific actions, or flag anomalies. In my workflow, I would ask Copilot to surface the top failure modes across a batch of trajectories. This reduced my reading from hundreds of thousands of lines to a few hundred. Key insight: This step proves what's automatable and clarifies the patterns your agent will later capture. Save your best prompts as a reference.
Step 3: Design Your Agent Framework
Your agent should encapsulate the patterns you discovered. I designed eval-agents with three goals in mind:
- Easy to share and use – Anyone on the team can run the same agent without setup.
- Easy to author new agents – Adding a new analyzer doesn't require a deep understanding of the framework.
- Agents as the primary contribution vehicle – Instead of ad‑hoc scripts, all analysis tools become agents living in a shared repository.
Decide on the interface (e.g., a command‑line tool or a Copilot extension) and ensure every agent accepts the same input format (e.g., a directory of trajectory files), so agents are interchangeable from the user's point of view.
Step 4: Build and Test Your Agent
Now implement the agent code. Use GitHub Copilot to generate boilerplate and logic. For example, you might write a Python script that loops through trajectory files, applies your pattern‑matching logic, and outputs a summary. Test it on a small subset first. I found that iterating with Copilot Chat—explaining what I wanted and letting it suggest edits—cut development time by half. Tip: Write unit tests for the core analysis functions so others can safely contribute.
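Keeping the core analysis logic as pure functions (data in, summary out) makes those unit tests trivial to write. A hypothetical example—`longest_repeat` is an illustrative "stuck agent" signal, not a function from eval-agents:

```python
def longest_repeat(actions):
    """Length of the longest run of the same consecutive action.

    A long run of identical actions is a cheap signal the agent was stuck.
    """
    best = run = 1 if actions else 0
    for prev, cur in zip(actions, actions[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def test_longest_repeat():
    assert longest_repeat([]) == 0
    assert longest_repeat(["edit"]) == 1
    assert longest_repeat(["run", "run", "run", "edit"]) == 3

test_longest_repeat()
```

Because the function never touches the filesystem, a teammate can refactor the surrounding agent with confidence that the analysis itself still behaves.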

Step 5: Share and Enable Collaboration
Push your agent to a shared GitHub repository. Add clear documentation (README, inline comments) and a quick‑start guide. In my team, we used a GitHub repo called eval-agents where anyone could clone and run agents with a single command. Encourage colleagues to file issues or propose new agents. The goal is to make the framework so simple that others can author their own analyzers without hand‑holding.
Step 6: Maintain and Evolve
Your agent will need updates as benchmarks change or new patterns emerge. Treat it as a living tool. For instance, I now spend time maintaining eval-agents and helping teammates write their own. That's the trade‑off: you automate your own toil, but gain a new responsibility. The payoff compounds, though—every new agent speeds up the whole team. Pro tip: Schedule regular “agent office hours” to review contributions and share best practices.
Tips for Success
- Start small – Automate one pattern first, then expand. My first agent only counted action types; now it can predict benchmark scores.
- Use Copilot’s context – Refer to existing agents in your prompts; Copilot learns from your repo’s style.
- Document your agent's limitations – Honest docs prevent misuse and encourage improvements.
- Celebrate contributions – When a teammate authors a new agent, highlight it. This builds a culture of sharing.
- Watch for diminishing returns – Not every analysis needs an agent. If it takes longer to automate than to do manually, skip it.
By following these steps, you can transform repetitive intellectual work into a shared superpower. The same pattern that automated my trajectory analysis can automate yours—just start with one pattern, one agent, one step at a time.
Ready to dive deeper? Revisit the steps above, or share your own agent stories.