In this tutorial, we explore the lambda/hermes-agent-reasoning-traces dataset to understand how agentic models think, use tools, and generate responses across multi-turn conversations. We start by loading and inspecting the dataset, examining its structure, categories, and conversational format to get a clear picture of the available information. We then build simple parsers to extract key components such as reasoning traces, tool calls, and tool responses, allowing us to separate internal thinking from external actions. Next, we analyze patterns such as tool-usage frequency, conversation length, and error rates to better understand agent behavior. We also create visualizations to highlight these trends and make the analysis more intuitive. Finally, we prepare the dataset for training by converting it into a model-friendly format, making it suitable for tasks like supervised fine-tuning.
!pip -q install -U datasets pandas matplotlib seaborn transformers accelerate trl
import json, re, random, textwrap
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset, concatenate_datasets
random.seed(0)
CONFIG = "kimi"
ds = load_dataset("lambda/hermes-agent-reasoning-traces", CONFIG, split="train")
print(ds)
print("Config:", CONFIG, "| Fields:", ds.column_names)
print("Categories:", sorted(set(ds["category"])))
COMPARE_BOTH = False
if COMPARE_BOTH:
    ds_kimi = load_dataset("lambda/hermes-agent-reasoning-traces", "kimi", split="train")
    ds_glm = load_dataset("lambda/hermes-agent-reasoning-traces", "glm-5.1", split="train")
    ds_kimi = ds_kimi.add_column("source", ["kimi"] * len(ds_kimi))
    ds_glm = ds_glm.add_column("source", ["glm-5.1"] * len(ds_glm))
    ds = concatenate_datasets([ds_kimi, ds_glm]).shuffle(seed=0)
    print("Combined:", ds, "→ counts:", Counter(ds["source"]))
sample = ds[0]
print("\n=== Sample 0 ===")
print("id        :", sample["id"])
print("category  :", sample["category"], "/", sample["subcategory"])
print("task      :", sample["task"])
print("turns     :", len(sample["conversations"]))
print("system[0] :", sample["conversations"][0]["value"][:220], "...\n")
We install all required libraries and import the necessary modules to set up our environment. We then load the lambda/hermes-agent-reasoning-traces dataset and inspect its structure, fields, and categories. We also optionally combine multiple dataset configurations and examine a sample to understand the conversational format.
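To make the record layout concrete, here is a minimal, hand-written stand-in for one row. The field names match the printout above; the actual values are invented for illustration:

```python
# A toy record mimicking the ShareGPT-style schema of the dataset
# (id / category / subcategory / task / conversations); contents are made up.
toy_record = {
    "id": "demo-0001",
    "category": "coding",
    "subcategory": "debugging",
    "task": "Fix the failing unit test",
    "conversations": [
        {"from": "system", "value": "You are a helpful tool-using assistant."},
        {"from": "human", "value": "The test in test_math.py fails. Why?"},
        {"from": "gpt", "value": "Let me read the test file first."},
    ],
}

# Each turn is a {"from": role, "value": text} pair; the roles cycle through
# system / human / gpt / tool, which is exactly what the parsers below rely on.
roles = [t["from"] for t in toy_record["conversations"]]
print(roles)
```

This is the shape every downstream step (parsing, analytics, conversion) assumes.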
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.DOTALL)
TOOL_RESP_RE = re.compile(r"<tool_response>\s*(.*?)\s*</tool_response>", re.DOTALL)
def parse_assistant(value: str) -> dict:
    thoughts = [t.strip() for t in THINK_RE.findall(value)]
    calls = []
    for raw in TOOL_CALL_RE.findall(value):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            calls.append({"name": "", "arguments": {}})
    final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
    return {"thoughts": thoughts, "tool_calls": calls, "final": final}

def parse_tool(value: str):
    raw = TOOL_RESP_RE.search(value)
    if not raw: return {"raw": value}
    body = raw.group(1)
    try: return json.loads(body)
    except Exception: return {"raw": body}

first_gpt = next(t for t in sample["conversations"] if t["from"] == "gpt")
p = parse_assistant(first_gpt["value"])
print("Thought preview :", (p["thoughts"][0][:160] + "...") if p["thoughts"] else "(none)")
print("Tool calls      :", [(c.get("name"), list(c.get("arguments", {}).keys())) for c in p["tool_calls"]])
We define regex-based parsers to extract reasoning traces, tool calls, and tool responses from the dataset. We process assistant messages to separate thoughts, actions, and final outputs in a structured way. We then test the parser on a sample conversation to verify that the extraction works correctly.
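As a self-contained sanity check of the parsing idea, the snippet below applies the same regex strategy to a synthetic assistant turn, assuming the Hermes-style `<think>` / `<tool_call>` markup; the message itself is invented:

```python
import json, re

# Local copies of the think/tool-call patterns, assuming Hermes-style tags.
think_re = re.compile(r"<think>(.*?)</think>", re.DOTALL)
call_re = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.DOTALL)

# An invented assistant message mixing a thought, a tool call, and prose.
msg = (
    "<think>I should check the weather first.</think>\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>\n'
    "Let me look that up."
)

thoughts = [t.strip() for t in think_re.findall(msg)]        # inner monologue
calls = [json.loads(raw) for raw in call_re.findall(msg)]    # structured actions
final = call_re.sub("", think_re.sub("", msg)).strip()       # user-visible text
print(thoughts, calls, final)
```

Stripping the matched spans and keeping the remainder as the "final" answer is the same trick `parse_assistant` uses above.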
N = 3000
sub = ds.select(range(min(N, len(ds))))
tool_calls = Counter()
parallel_widths = Counter()
thoughts_per_turn = []
calls_per_traj = []
errors_per_traj = []
turns_per_traj = []
cat_counts = Counter()
for ex in sub:
    cat_counts[ex["category"]] += 1
    n_calls = n_err = 0
    turns_per_traj.append(len(ex["conversations"]))
    for t in ex["conversations"]:
        if t["from"] == "gpt":
            p = parse_assistant(t["value"])
            thoughts_per_turn.append(len(p["thoughts"]))
            if p["tool_calls"]:
                parallel_widths[len(p["tool_calls"])] += 1
                for c in p["tool_calls"]:
                    tool_calls[c.get("name", "")] += 1
                n_calls += len(p["tool_calls"])
        elif t["from"] == "tool":
            r = parse_tool(t["value"])
            blob = json.dumps(r).lower()
            if "error" in blob or '"exit_code": 1' in blob or "traceback" in blob:
                n_err += 1
    calls_per_traj.append(n_calls)
    errors_per_traj.append(n_err)
print(f"\nScanned {len(sub)} trajectories")
print(f"Avg turns/traj      : {np.mean(turns_per_traj):.1f}")
print(f"Avg tool calls/traj : {np.mean(calls_per_traj):.1f}")
print(f"% with >=1 error    : {100*np.mean([e>0 for e in errors_per_traj]):.1f}%")
print(f"% parallel turns    : {100*sum(v for k,v in parallel_widths.items() if k>1)/max(1,sum(parallel_widths.values())):.1f}%")
print("Top 10 tools        :", tool_calls.most_common(10))
fig, axes = plt.subplots(2, 2, figsize=(13, 9))
top = tool_calls.most_common(15)
axes[0,0].barh([t for t,_ in top][::-1], [c for _,c in top][::-1], color="teal")
axes[0,0].set_title("Top 15 tools by call volume")
axes[0,0].set_xlabel("calls")
ks = sorted(parallel_widths)
axes[0,1].bar([str(k) for k in ks], [parallel_widths[k] for k in ks], color="coral")
axes[0,1].set_title("Tool calls per assistant turn (parallel width)")
axes[0,1].set_xlabel("# tool calls in a single turn"); axes[0,1].set_ylabel("count")
axes[0,1].set_yscale("log")
axes[1,0].hist(turns_per_traj, bins=40, color="steelblue")
axes[1,0].set_title("Conversation length"); axes[1,0].set_xlabel("turns")
cats, vals = zip(*cat_counts.most_common())
axes[1,1].pie(vals, labels=cats, autopct="%1.0f%%", startangle=90)
axes[1,1].set_title("Category distribution")
plt.tight_layout(); plt.show()
We perform dataset-wide analytics to measure tool usage, conversation lengths, and error patterns. We aggregate statistics across many samples to understand overall agent behavior. We also create visualizations to highlight trends such as tool frequency, parallel calls, and category distribution.
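The error heuristic above is just a substring match on the serialized tool response. A miniature, self-contained version of that aggregation on two invented trajectories looks like this:

```python
import json

# Two invented trajectories: one clean, one with a failing tool response.
trajs = [
    [{"from": "gpt", "value": "calling tool"},
     {"from": "tool", "value": json.dumps({"result": "ok"})},
     {"from": "gpt", "value": "done"}],
    [{"from": "gpt", "value": "calling tool"},
     {"from": "tool", "value": json.dumps({"error": "timeout"})}],
]

# Count tool-response errors per trajectory via the same substring heuristic.
errors_per_traj = []
for conv in trajs:
    n_err = 0
    for t in conv:
        blob = t["value"].lower()
        if t["from"] == "tool" and ("error" in blob or "traceback" in blob):
            n_err += 1
    errors_per_traj.append(n_err)

pct_with_error = 100 * sum(e > 0 for e in errors_per_traj) / len(errors_per_traj)
print(errors_per_traj, pct_with_error)
```

The heuristic is deliberately crude: it will flag any response whose JSON happens to contain the word "error", so treat the resulting percentages as an upper bound.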
def render_trace(ex, max_chars=350):
    print(f"\n{'='*72}\nTASK [{ex['category']} / {ex['subcategory']}]: {ex['task']}\n{'='*72}")
    for t in ex["conversations"]:
        role = t["from"]
        if role == "system":
            continue
        if role == "human":
            print(f"\n[USER]\n{textwrap.shorten(t['value'], 600)}")
        elif role == "gpt":
            p = parse_assistant(t["value"])
            for th in p["thoughts"]:
                print(f"\n[THINK]\n{textwrap.shorten(th, max_chars)}")
            for c in p["tool_calls"]:
                args = json.dumps(c.get("arguments", {}))[:200]
                print(f"[CALL] {c.get('name')}({args})")
            if p["final"]:
                print(f"\n[ANSWER]\n{textwrap.shorten(p['final'], max_chars)}")
        elif role == "tool":
            print(f"[TOOL_RESPONSE] {textwrap.shorten(t['value'], 220)}")
    print("="*72)

idx = int(np.argmin(np.abs(np.array(turns_per_traj) - 10)))
render_trace(sub[idx])
def get_tool_schemas(ex):
    try: return json.loads(ex["tools"])
    except Exception: return []

schemas = get_tool_schemas(sample)
print(f"\nSample 0 has {len(schemas)} tools available")
for s in schemas[:3]:
    fn = s.get("function", {})
    print(" -", fn.get("name"), "—", (fn.get("description") or "")[:80])
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}
def to_openai_messages(conv):
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]

example_msgs = to_openai_messages(sample["conversations"])
print("\nFirst 2 OpenAI messages:")
for m in example_msgs[:2]:
    print("  ", m["role"], "→", m["content"][:120].replace("\n", " "), "...")
We build utilities to render full conversation traces in a readable format for deeper inspection. We also extract tool schemas and convert the dataset into the OpenAI-style message format for compatibility with training pipelines. This helps us better understand both the structure of the tools and how conversations can be standardized.
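The role conversion itself is a small, mechanical step. Here is a self-contained sketch on an invented conversation, including the common workaround of folding tool turns into user turns for chat templates that lack a dedicated tool role:

```python
# ShareGPT-style roles → OpenAI-style roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}

# An invented four-turn conversation in the dataset's {"from", "value"} shape.
conv = [
    {"from": "system", "value": "You can call tools."},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It is 4."},
    {"from": "tool", "value": '{"result": 4}'},
]

msgs = [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]

# Fold tool turns into user turns with a visible marker, so any chat template
# that only knows system/user/assistant can still serialize the conversation.
for m in msgs:
    if m["role"] == "tool":
        m["role"] = "user"
        m["content"] = "[TOOL OUTPUT]\n" + m["content"]

print([m["role"] for m in msgs])
```

The `[TOOL OUTPUT]` marker is an arbitrary convention; the important part is that the environment's output ends up attributed to a non-assistant role so it is masked out during training.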
from transformers import AutoTokenizer
TOK_ID = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(TOK_ID)
def build_masked(conv, tokenizer, max_len=2048):
    msgs = to_openai_messages(conv)
    for m in msgs:
        if m["role"] == "tool":
            m["role"] = "user"
            m["content"] = "[TOOL OUTPUT]\n" + m["content"]
    input_ids, labels = [], []
    for m in msgs:
        text = tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=False)
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids.extend(ids)
        labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))
    return input_ids[:max_len], labels[:max_len]

ids, lbls = build_masked(sample["conversations"], tok)
trainable = sum(1 for x in lbls if x != -100)
print(f"\nTokenized example: {len(ids)} tokens, {trainable} trainable ({100*trainable/len(ids):.1f}%)")
think_lens, call_lens, ans_lens = [], [], []
for ex in sub.select(range(min(500, len(sub)))):
    for t in ex["conversations"]:
        if t["from"] != "gpt": continue
        p = parse_assistant(t["value"])
        for th in p["thoughts"]: think_lens.append(len(th))
        for c in p["tool_calls"]: call_lens.append(len(json.dumps(c)))
        if p["final"]: ans_lens.append(len(p["final"]))
plt.figure(figsize=(10,4))
plt.hist([think_lens, call_lens, ans_lens], bins=40, log=True,
         label=["<think>", "<tool_call>", "final answer"], stacked=False)
plt.legend(); plt.xlabel("characters"); plt.title("Length distributions (log y)")
plt.tight_layout(); plt.show()
class TraceReplayer:
    def __init__(self, ex):
        self.ex = ex
        self.steps = []
        pending = None
        for t in ex["conversations"]:
            if t["from"] == "gpt":
                if pending: self.steps.append(pending)
                pending = {"think": parse_assistant(t["value"]), "responses": []}
            elif t["from"] == "tool" and pending:
                pending["responses"].append(parse_tool(t["value"]))
        if pending: self.steps.append(pending)
    def __len__(self): return len(self.steps)
    def play(self, i):
        s = self.steps[i]
        print(f"\n── Step {i+1}/{len(self)} ──")
        for th in s["think"]["thoughts"]:
            print(f"💭 {textwrap.shorten(th, 280)}")
        for c in s["think"]["tool_calls"]:
            print(f"⚙️ {c.get('name')}({json.dumps(c.get('arguments', {}))[:140]})")
        for r in s["responses"]:
            print(f"📥 {textwrap.shorten(json.dumps(r), 200)}")
        if s["think"]["final"]:
            print(f"💬 {textwrap.shorten(s['think']['final'], 200)}")

rp = TraceReplayer(sample)
for i in range(min(3, len(rp))):
    rp.play(i)
TRAIN = False
if TRAIN:
    import torch
    from transformers import AutoModelForCausalLM
    from trl import SFTTrainer, SFTConfig
    train_subset = ds.select(range(200))
    def to_text(batch):
        msgs = to_openai_messages(batch["conversations"])
        for m in msgs:
            if m["role"] == "tool":
                m["role"] = "user"; m["content"] = "[TOOL]\n" + m["content"]
        batch["text"] = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
        return batch
    train_subset = train_subset.map(to_text)
    model = AutoModelForCausalLM.from_pretrained(
        TOK_ID,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto" if torch.cuda.is_available() else None,
    )
    cfg = SFTConfig(
        output_dir="hermes-sft-demo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=20,
        learning_rate=2e-5,
        logging_steps=2,
        max_seq_length=1024,
        dataset_text_field="text",
        report_to="none",
        fp16=torch.cuda.is_available(),
    )
    SFTTrainer(model=model, args=cfg, train_dataset=train_subset, processing_class=tok).train()
    print("Fine-tune demo finished.")
print("\n✅ Tutorial complete. You now have parsers, analytics, plots, a replayer, "
      "tokenized + label-masked SFT examples, and an optional training hook.")
We tokenize the conversations and apply label masking so that only assistant responses contribute to training. We analyze the length distributions of reasoning, tool calls, and answers to gain further insight. We also implement a trace replayer to step through agent behavior and optionally run a small fine-tuning loop.
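The masking rule itself is independent of any particular tokenizer. This pure-Python sketch shows the core idea with fake token ids standing in for real tokenizer output: non-assistant tokens get label -100 (ignored by the cross-entropy loss), assistant tokens keep their ids:

```python
# Invented per-message token ids; a real tokenizer would produce these.
msgs = [
    {"role": "user", "tokens": [11, 12, 13]},
    {"role": "assistant", "tokens": [21, 22]},
    {"role": "user", "tokens": [31]},
    {"role": "assistant", "tokens": [41, 42, 43]},
]

input_ids, labels = [], []
for m in msgs:
    input_ids.extend(m["tokens"])
    # -100 is the conventional "ignore" label for PyTorch cross-entropy.
    labels.extend(m["tokens"] if m["role"] == "assistant" else [-100] * len(m["tokens"]))

trainable = sum(1 for x in labels if x != -100)
print(len(input_ids), trainable)  # 9 tokens total, 5 trainable
```

`build_masked` above does the same thing per message after rendering each one through the chat template.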
In conclusion, we developed a structured workflow to parse, analyze, and work effectively with agent reasoning traces. We were able to break conversations down into meaningful components, examine how agents reason step by step, and measure how they interact with tools during problem solving. Using the visualizations and analytics, we gained insight into common patterns and behaviors across the dataset. In addition, we converted the data into a format suitable for training language models, including handling tokenization and label masking for assistant responses. This process provides a strong foundation for studying, evaluating, and improving tool-using AI systems in a practical, scalable way.