In this tutorial, we explore the lambda/hermes-agent-reasoning-traces dataset to understand how agentic models think, use tools, and generate responses in multi-turn conversations. We start by loading and inspecting the dataset, examining its structure, categories, and conversation format to get a clear picture of the information available. We then build simple parsers to extract key components such as reasoning traces, tool calls, and tool responses, allowing us to separate internal thinking from external actions. We also analyze patterns such as tool usage frequency, conversation length, and error rates to better understand the agent's behavior, and create visualizations to make these trends more intuitive. Finally, we prepare the dataset for training by converting it into a model-friendly format, making it suitable for tasks like supervised fine-tuning.
!pip -q install -U datasets pandas matplotlib seaborn transformers accelerate trl
import json, re, random, textwrap
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset, concatenate_datasets
random.seed(0)
CONFIG = "kimi"
ds = load_dataset("lambda/hermes-agent-reasoning-traces", CONFIG, split="train")
print(ds)
print("Config:", CONFIG, "| Fields:", ds.column_names)
print("Categories:", sorted(set(ds["category"])))
COMPARE_BOTH = False
if COMPARE_BOTH:
    ds_kimi = load_dataset("lambda/hermes-agent-reasoning-traces", "kimi", split="train")
    ds_glm = load_dataset("lambda/hermes-agent-reasoning-traces", "glm-5.1", split="train")
    ds_kimi = ds_kimi.add_column("source", ["kimi"] * len(ds_kimi))
    ds_glm = ds_glm.add_column("source", ["glm-5.1"] * len(ds_glm))
    ds = concatenate_datasets([ds_kimi, ds_glm]).shuffle(seed=0)
    print("Combined:", ds, "→ counts:", Counter(ds["source"]))
sample = ds[0]
print("\n=== Sample 0 ===")
print("id :", sample["id"])
print("category :", sample["category"], "/", sample["subcategory"])
print("task :", sample["task"])
print("turns :", len(sample["conversations"]))
print("system[0] :", sample["conversations"][0]["value"][:220], "...\n")
We install the necessary libraries and import the required modules to set up our environment. We then load the lambda/hermes-agent-reasoning-traces dataset and inspect its structure, fields, and categories. We optionally combine multiple dataset configurations and examine a sample to understand the conversation format.
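Before writing any parsers, it helps to see the raw turn structure. The following quick check (a minimal sketch, assuming each turn is a dict with "from" and "value" keys, as the sample above indicates) prints the role sequence of one trajectory:
# Quick sanity check: print the role sequence of one trajectory
# (assumes each turn carries "from" and "value" keys, as sample 0 suggests).
for i, turn in enumerate(sample["conversations"][:8]):
    preview = turn["value"][:60].replace("\n", " ")
    print(f"{i:2d}. {turn['from']:<7} | {preview} ...")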
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
TOOL_RESP_RE = re.compile(r"<tool_response>\s*(.*?)\s*</tool_response>", re.DOTALL)
def parse_assistant(value: str) -> dict:
    thoughts = [t.strip() for t in THINK_RE.findall(value)]
    calls = []
    for raw in TOOL_CALL_RE.findall(value):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            calls.append({"name": "<malformed>", "arguments": {}})
    final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
    return {"thoughts": thoughts, "tool_calls": calls, "final": final}

def parse_tool(value: str):
    raw = TOOL_RESP_RE.search(value)
    if not raw: return {"raw": value}
    body = raw.group(1)
    try: return json.loads(body)
    except Exception: return {"raw": body}
first_gpt = next(t for t in sample["conversations"] if t["from"] == "gpt")
p = parse_assistant(first_gpt["value"])
print("Thought preview :", (p["thoughts"][0][:160] + "...") if p["thoughts"] else "(none)")
print("Tool calls :", [(c.get("name"), list(c.get("arguments", ).keys())) for c in p["tool_calls"]])
We define regex-based parsers to extract reasoning traces, tool calls, and tool responses from the dataset. We process assistant messages to separate thoughts, actions, and final outputs in a structured way. We then test the parser on a sample conversation to verify that the extraction works correctly.
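Before running the parsers at scale, a tiny synthetic turn (a toy string, not taken from the dataset; the get_weather tool name is invented for the test) exercises all three components:
# Synthetic sanity check for parse_assistant; the tool name is made up.
demo = ('<think>Look up the weather.</think>\n'
        '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>\n'
        'Done.')
out = parse_assistant(demo)
assert out["thoughts"] == ["Look up the weather."]
assert out["tool_calls"][0]["name"] == "get_weather"
assert out["final"] == "Done."
print("parser sanity check passed")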
N = 3000
sub = ds.select(range(min(N, len(ds))))
tool_calls = Counter()
parallel_widths = Counter()
thoughts_per_turn = []
calls_per_traj = []
errors_per_traj = []
turns_per_traj = []
cat_counts = Counter()
for ex in sub:
    cat_counts[ex["category"]] += 1
    n_calls = n_err = 0
    turns_per_traj.append(len(ex["conversations"]))
    for t in ex["conversations"]:
        if t["from"] == "gpt":
            p = parse_assistant(t["value"])
            thoughts_per_turn.append(len(p["thoughts"]))
            if p["tool_calls"]:
                parallel_widths[len(p["tool_calls"])] += 1
                for c in p["tool_calls"]:
                    tool_calls[c.get("name", "<unknown>")] += 1
                n_calls += len(p["tool_calls"])
        elif t["from"] == "tool":
            r = parse_tool(t["value"])
            blob = json.dumps(r).lower()
            if "error" in blob or '"exit_code": 1' in blob or "traceback" in blob:
                n_err += 1
    calls_per_traj.append(n_calls)
    errors_per_traj.append(n_err)
print(f"\nScanned len(sub) trajectories")
print(f"Avg turns/traj : np.mean(turns_per_traj):.1f")
print(f"Avg tool calls/traj : np.mean(calls_per_traj):.1f")
print(f"% with >=1 error : 100*np.mean([e>0 for e in errors_per_traj]):.1f%")
print(f"% parallel turns : 100*sum(v for k,v in parallel_widths.items() if k>1)/max(1,sum(parallel_widths.values())):.1f%")
print("Top 10 tools :", tool_calls.most_common(10))
fig, axes = plt.subplots(2, 2, figsize=(13, 9))
top = tool_calls.most_common(15)
axes[0,0].barh([t for t,_ in top][::-1], [c for _,c in top][::-1], color="teal")
axes[0,0].set_title("Top 15 tools by call volume")
axes[0,0].set_xlabel("calls")
ks = sorted(parallel_widths)
axes[0,1].bar([str(k) for k in ks], [parallel_widths[k] for k in ks], color="coral")
axes[0,1].set_title("Tool-calls per assistant turn (parallel width)")
axes[0,1].set_xlabel("# tool calls in one turn"); axes[0,1].set_ylabel("count")
axes[0,1].set_yscale("log")
axes[1,0].hist(turns_per_traj, bins=40, color="steelblue")
axes[1,0].set_title("Conversation length"); axes[1,0].set_xlabel("turns")
cats, vals = zip(*cat_counts.most_common())
axes[1,1].pie(vals, labels=cats, autopct="%1.0f%%", startangle=90)
axes[1,1].set_title("Category distribution")
plt.tight_layout(); plt.show()
We conduct dataset-wide analysis to measure tool usage, conversation length, and error patterns. We collect data across multiple samples to understand overall agent behavior. We also create visualizations to highlight trends such as tool frequency, parallel calls, and category distribution.
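Since pandas is already imported, an optional follow-up (a small sketch reusing the per-trajectory lists collected in the scan above) aggregates tool calls and errors by category:
# Optional per-category summary, reusing the lists built in the scan loop.
df = pd.DataFrame({
    "category": sub["category"],
    "tool_calls": calls_per_traj,
    "errors": errors_per_traj,
})
print(df.groupby("category")[["tool_calls", "errors"]].mean().round(2))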
def render_trace(ex, max_chars=350):
    print(f"\n{'='*72}\nTASK [{ex['category']} / {ex['subcategory']}]: {ex['task']}\n{'='*72}")
    for t in ex["conversations"]:
        role = t["from"]
        if role == "system":
            continue
        if role == "human":
            print(f"\n[USER]\n{textwrap.shorten(t['value'], 600)}")
        elif role == "gpt":
            p = parse_assistant(t["value"])
            for th in p["thoughts"]:
                print(f"\n[THINK]\n{textwrap.shorten(th, max_chars)}")
            for c in p["tool_calls"]:
                args = json.dumps(c.get("arguments", {}))[:200]
                print(f"[CALL] {c.get('name')}({args})")
            if p["final"]:
                print(f"\n[ANSWER]\n{textwrap.shorten(p['final'], max_chars)}")
        elif role == "tool":
            print(f"[TOOL_RESPONSE] {textwrap.shorten(t['value'], 220)}")
    print("="*72)
idx = int(np.argmin(np.abs(np.array(turns_per_traj) - 10)))
render_trace(sub[idx])
def get_tool_schemas(ex):
    try: return json.loads(ex["tools"])
    except Exception: return []

schemas = get_tool_schemas(sample)
print(f"\nSample 0 has {len(schemas)} tools available")
for s in schemas[:3]:
    fn = s.get("function", {})
    print(" -", fn.get("name"), "—", (fn.get("description") or "")[:80])
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}
def to_openai_messages(conv):
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]
example_msgs = to_openai_messages(sample["conversations"])
print("\nFirst 2 OpenAI messages:")
for m in example_msgs[:2]:
    print(" ", m["role"], "→", m["content"][:120].replace("\n", " "), "...")
We build utilities to render entire conversation traces in a readable format for closer inspection. We also extract the tool schemas and convert the dataset to an OpenAI-style message format for compatibility with training pipelines. This helps us understand both the tool structure and how interactions can be standardized.
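As a quick check on the conversion (assuming, as sample 0 suggests, that each trajectory opens with a system prompt), we can count the mapped roles and confirm the ordering:
# Sanity-check the converted messages: count roles and confirm the system
# prompt comes first. This mirrors the ROLE_MAP defined above.
role_counts = Counter(m["role"] for m in example_msgs)
print("Role counts:", dict(role_counts))
assert example_msgs[0]["role"] == "system", "expected a leading system prompt"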
from transformers import AutoTokenizer
TOK_ID = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(TOK_ID)
def build_masked(conv, tokenizer, max_len=2048):
    msgs = to_openai_messages(conv)
    for m in msgs:
        if m["role"] == "tool":
            m["role"] = "user"
            m["content"] = "[TOOL OUTPUT]\n" + m["content"]
    input_ids, labels = [], []
    for m in msgs:
        text = tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=False)
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids.extend(ids)
        labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))
    return input_ids[:max_len], labels[:max_len]
ids, lbls = build_masked(sample["conversations"], tok)
trainable = sum(1 for x in lbls if x != -100)
print(f"\nTokenized example: len(ids) tokens, trainable trainable (100*trainable/len(ids):.1f%)")
think_lens, call_lens, ans_lens = [], [], []
for ex in sub.select(range(min(500, len(sub)))):
    for t in ex["conversations"]:
        if t["from"] != "gpt": continue
        p = parse_assistant(t["value"])
        for th in p["thoughts"]: think_lens.append(len(th))
        for c in p["tool_calls"]: call_lens.append(len(json.dumps(c)))
        if p["final"]: ans_lens.append(len(p["final"]))
plt.figure(figsize=(10,4))
plt.hist([think_lens, call_lens, ans_lens], bins=40, log=True,
label=["<think>", "<tool_call>", "final answer"], stacked=False)
plt.legend(); plt.xlabel("characters"); plt.title("Length distributions (log y)")
plt.tight_layout(); plt.show()
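# Optional numeric companion to the histogram above: median lengths in
# characters (guarded, in case any of the lists happens to be empty).
if think_lens and call_lens and ans_lens:
    print("Median think chars :", int(np.median(think_lens)))
    print("Median call chars  :", int(np.median(call_lens)))
    print("Median answer chars:", int(np.median(ans_lens)))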
class TraceReplayer:
    def __init__(self, ex):
        self.ex = ex
        self.steps = []
        pending = None
        for t in ex["conversations"]:
            if t["from"] == "gpt":
                if pending: self.steps.append(pending)
                pending = {"think": parse_assistant(t["value"]), "responses": []}
            elif t["from"] == "tool" and pending:
                pending["responses"].append(parse_tool(t["value"]))
        if pending: self.steps.append(pending)
    def __len__(self): return len(self.steps)
    def play(self, i):
        s = self.steps[i]
        print(f"\n── Step {i+1}/{len(self)} ──")
        for th in s["think"]["thoughts"]:
            print(f"💭 {textwrap.shorten(th, 280)}")
        for c in s["think"]["tool_calls"]:
            args = json.dumps(c.get("arguments", {}))[:140]
            print(f"⚙️ {c.get('name')}({args})")
        for r in s["responses"]:
            print(f"📥 {textwrap.shorten(json.dumps(r), 200)}")
        if s["think"]["final"]:
            print(f"💬 {textwrap.shorten(s['think']['final'], 200)}")
rp = TraceReplayer(sample)
for i in range(min(3, len(rp))):
    rp.play(i)
TRAIN = False
if TRAIN:
    import torch
    from transformers import AutoModelForCausalLM
    from trl import SFTTrainer, SFTConfig

    train_subset = ds.select(range(200))

    def to_text(batch):
        msgs = to_openai_messages(batch["conversations"])
        for m in msgs:
            if m["role"] == "tool":
                m["role"] = "user"; m["content"] = "[TOOL]\n" + m["content"]
        batch["text"] = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
        return batch

    train_subset = train_subset.map(to_text)
    model = AutoModelForCausalLM.from_pretrained(
        TOK_ID,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto" if torch.cuda.is_available() else None,
    )
    cfg = SFTConfig(
        output_dir="hermes-sft-demo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=20,
        learning_rate=2e-5,
        logging_steps=2,
        max_seq_length=1024,
        dataset_text_field="text",
        report_to="none",
        fp16=torch.cuda.is_available(),
    )
    SFTTrainer(model=model, args=cfg, train_dataset=train_subset, processing_class=tok).train()
    print("Fine-tune demo finished.")
print("\n✅ Tutorial complete. You now have parsers, analytics, plots, a replayer, "
"tokenized + label-masked SFT examples, and an optional training hook.")
We tokenize conversations and apply label masking so that only assistant responses contribute to training. We analyze the length distributions of thoughts, tool calls, and answers to gain more insight. We also implement a trace replayer to step through the agent's behavior and optionally run a small fine-tuning loop.
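To verify the masking from earlier, one hedged check is to decode only the supervised positions; if the labels are correct, the result should read as assistant text only:
# Decode only the positions whose label is not -100; with correct masking
# this should contain assistant turns and nothing else.
supervised = [tid for tid, lab in zip(ids, lbls) if lab != -100]
print(tok.decode(supervised)[:300])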
Finally, we developed a structured workflow to parse, analyze, and effectively work with agent reasoning traces. We broke interactions down into meaningful components, examined how agents reason step by step, and measured how they interact with tools during problem solving. Using visualization and analytics, we gained insight into common patterns and behaviors in the dataset. Additionally, we converted the data into a format suitable for training language models, including handling tokenization and label masking for assistant responses. Altogether, this process provides a strong foundation for studying, evaluating, and improving tool-using AI systems in a practical, scalable manner.