What does the end-to-end stack look like for terminal agents when you combine structured toolkits, synthetic RL environments, and benchmark-aligned evaluations? A team of researchers from CAMEL AI, Eigent AI, and other collaborators has released SETA, a toolkit and environment stack focused on reinforcement learning for terminal agents. The project targets agents that run inside a Unix-style shell and must perform verifiable tasks under a benchmark harness such as Terminal Bench.
The release makes three main contributions:
- State-of-the-art terminal agents on Terminal Bench: the team reports state-of-the-art performance with Claude Sonnet 4.5 based agents on Terminal Bench 2.0 and GPT-4.1 based agents on Terminal Bench 1.0. The comparison is limited to agents that use the same base model.
- Scalable RL training with synthetic terminal environments: the research team releases an initial synthetic dataset of 400 terminal tasks covering a range of difficulty levels. Of these, 260 tasks are used for RLVR finetuning of the Qwen3-8B model.
- A clean agent design that generalizes across training and evaluation frameworks: the same agent implementation is used for both local task runs and the official Terminal Bench evaluation harness.
Terminal Toolkit and Log Structure
The SETA code repository ships a terminal toolkit that turns a language model into an executable terminal agent. For each task run, the framework creates a structured log directory under `evaluation/terminal_bench_run`. The README shows a concrete layout for a task called `play-zork`.
Main files include:
- `chatagent.log`, which records the complete history of agent messages and tool calls, including test results.
- A `session_logs` directory, which captures terminal interactions from the toolkit.
- Inside `session_logs`, files like `blocking_commands.log`, `session_run_zork_1_correct_path.log`, `session_zork-1.log`, and `session_zork_start.log` store command output for different sessions and modes.
- `tests.log` and `tests.log.strip`, which record the test run output, the latter with terminal control characters removed.
This structure gives a concrete way to debug an agent: you can trace high-level decisions in `chatagent.log`, inspect individual shell commands in the session logs, and confirm success or failure from the test logs.
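That debugging flow can be sketched in plain shell. The snippet below builds a mocked-up run directory following the layout described above (real runs live under `evaluation/terminal_bench_run/<task>`; the log contents here are invented for illustration) and then walks the three inspection steps.

```shell
# Mock a run directory matching the documented layout (contents are illustrative).
RUN_DIR="$(mktemp -d)/play-zork"
mkdir -p "$RUN_DIR/session_logs"
printf 'tool_call: shell_exec("ls /games")\n' > "$RUN_DIR/chatagent.log"
printf '$ ls /games\nzork1.z5\n' > "$RUN_DIR/session_logs/session_zork_start.log"
printf 'All tests passed\n' > "$RUN_DIR/tests.log.strip"

# 1. High-level decisions: which tools did the agent call?
grep 'tool_call' "$RUN_DIR/chatagent.log"

# 2. Raw shell interaction for one session.
cat "$RUN_DIR/session_logs/session_zork_start.log"

# 3. Final verdict from the test harness (control characters already stripped).
tail -n 1 "$RUN_DIR/tests.log.strip"
```

The same three commands, pointed at a real run directory, are usually enough to localize a failure to either the agent's reasoning or the shell execution.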
For official Terminal Bench evaluation, the GitHub repository provides a separate entry point, `evaluation/terminal_bench_eval`. A developer changes into that directory and runs `run_eval.sh` for Terminal Bench 1.0 or `run_tb2.sh` for Terminal Bench 2.0.
Results are written to `evaluation/terminal_bench_eval/run/{run_id}/results.json`, and task-specific session logs are kept under `evaluation/terminal_bench_eval/logs/camel_logs/{task_id}`. The agent class that binds the CAMEL agent to the benchmark is implemented in `tbench_camel_agent.py`.
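A `results.json` in that location can be summarized with standard tools. The schema below, a list of per-task `resolved` flags, is an assumption for illustration only; check the actual file produced by your run before relying on these field names.

```shell
# Mock a results.json with an assumed per-task schema (field names illustrative).
RESULTS="$(mktemp -d)/results.json"
cat > "$RESULTS" <<'EOF'
{"results": [
  {"task_id": "play-zork", "resolved": true},
  {"task_id": "fix-git-rebase", "resolved": false}
]}
EOF

# Compute the pass rate over all tasks in the run.
python3 - "$RESULTS" <<'PY'
import json, sys

tasks = json.load(open(sys.argv[1]))["results"]
passed = sum(t["resolved"] for t in tasks)
print(f"accuracy: {passed}/{len(tasks)} = {passed/len(tasks):.1%}")
PY
```

Pointing the same one-liner at `run/{run_id}/results.json` gives a quick aggregate view without opening the benchmark UI.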
Note-Taking Toolkit as Persistent Memory
The research team also introduces a note-taking toolkit, described as persistent memory for long-horizon tasks. Published examples show note-taking tool calls where the agent writes and reads notes in a structured manner while solving terminal tasks. The current public material focuses on the toolkit's existence and usage examples; it does not yet describe the full training recipe for note use.
The important point is that the agent has a dedicated channel for externalizing intermediate results and signals, separate from the raw terminal buffer.
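The pattern itself is simple to illustrate in plain shell: the agent appends findings to a notes store and searches it later, instead of re-scraping the terminal buffer. The real toolkit exposes this as structured tool calls in the CAMEL codebase; the `note`/`recall` helpers and the file layout below are purely illustrative.

```shell
# Illustrative stand-in for a note-taking tool: an append-only notes file.
NOTES="$(mktemp)"

note()   { printf '%s\n' "$*" >> "$NOTES"; }   # write an intermediate finding
recall() { grep -i "$1" "$NOTES"; }            # search earlier findings

# The agent externalizes facts it will need many steps later...
note "zork save file lives at /root/zork.sav"
note "test harness expects score >= 10"

# ...and retrieves them without replaying the whole terminal history.
recall "save file"
```

The value of the dedicated channel is exactly this asymmetry: notes are small, structured, and queryable, while the raw terminal buffer is long and noisy.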
Understanding the Performance
SETA’s agent harness achieves leading results on Terminal Bench. With Claude Sonnet 4.5 as the backbone, the CAMEL Terminal Agent reaches 46.5% accuracy on Terminal Bench 2.0 across 89 real-world tasks, ranking first and outperforming the second system by 3 percentage points, with especially strong results in Git workflow, DevOps automation, and code security tasks. On Terminal Bench 1.0, a GPT-4.1 based agent achieves 35% accuracy, 4.7 percentage points above the next entry, again within the same model family. In comparison, a supervised Qwen3-8B baseline achieves 3.4% on Terminal Bench 2.0, and a Qwen3-8B terminal agent trained with the SETA RL pipeline improves on this baseline using the curated synthetic environment.
Key Takeaways
- SETA is a joint community project that provides both an agent toolkit and a synthetic RL environment specifically aligned with the Terminal Bench evaluation format for terminal agents.
- The framework reports state-of-the-art performance for CAMEL terminal agents on Terminal Bench 1.0 and 2.0 when using Claude Sonnet 4.5 and GPT-4.1 as base models, evaluated against agents built on the same model families.
- The SETA RL dataset on Hugging Face contains 400 synthetic terminal tasks, each packaged as `task.yaml`, `Dockerfile`, and `run-tests.sh`; 260 tasks are used for RLVR finetuning of the Qwen3-8B based agent.
- The open-source SETA codebase exposes a terminal toolkit with structured logging and a note-taking toolkit for long-horizon memory, and integrates directly with Terminal Bench evaluation scripts and logging paths.
- The implementation is available in the seta GitHub repository.
- The overall design demonstrates a clean path from synthetic RL environments to benchmark-verified agents, giving developers a reproducible stack for training, debugging, and evaluating terminal agents rather than relying on ad-hoc tool-calling examples.
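The three-file task format from the dataset can be sketched concretely. The snippet below assembles one minimal synthetic task in that shape; the `task.yaml` field names and the task itself are assumptions for illustration, since the dataset defines the actual schema.

```shell
# Build a minimal synthetic task in the three-file format (schema illustrative).
TASK="$(mktemp -d)/count-lines"
mkdir -p "$TASK"

# Task description the agent sees (field names are assumed, not the real schema).
cat > "$TASK/task.yaml" <<'EOF'
id: count-lines
instruction: "Write the number of lines in /data/input.txt to /data/count.txt"
difficulty: easy
EOF

# Container the task runs in.
cat > "$TASK/Dockerfile" <<'EOF'
FROM ubuntu:22.04
COPY run-tests.sh /run-tests.sh
EOF

# Verifiable reward: exit 0 iff the agent produced the right answer.
cat > "$TASK/run-tests.sh" <<'EOF'
#!/bin/sh
[ "$(cat /data/count.txt)" = "$(wc -l < /data/input.txt)" ]
EOF

ls "$TASK"
```

Packaging each task as an isolated container plus a pass/fail test script is what makes the reward verifiable, which is the property RLVR finetuning depends on.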

Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.