    A Coding Implementation for Parsing, Analyzing, Visualizing, and Fine-Tuning Agent Reasoning Traces Using the lambda/hermes-agent-reasoning-traces Dataset

    By Naveed Ahmad · 02/05/2026
    In this tutorial, we explore the lambda/hermes-agent-reasoning-traces dataset to understand how agent-based models think, use tools, and generate responses across multi-turn conversations. We start by loading and inspecting the dataset, examining its structure, categories, and conversational format to get a clear idea of the available information. We then build simple parsers to extract key components such as reasoning traces, tool calls, and tool responses, allowing us to separate internal thinking from external actions. Next, we analyze patterns such as tool usage frequency, conversation length, and error rates to better understand agent behavior, and we create visualizations to highlight these trends and make the analysis more intuitive. Finally, we prepare the dataset for training by converting it into a model-friendly format, making it suitable for tasks like supervised fine-tuning.

    !pip -q install -U datasets pandas matplotlib seaborn transformers accelerate trl


    import json, re, random, textwrap
    from collections import Counter, defaultdict
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from datasets import load_dataset, concatenate_datasets


    random.seed(0)


    CONFIG = "kimi"
    ds = load_dataset("lambda/hermes-agent-reasoning-traces", CONFIG, split="train")
    print(ds)
    print("Config:", CONFIG, "| Fields:", ds.column_names)
    print("Categories:", sorted(set(ds["category"])))


    COMPARE_BOTH = False
    if COMPARE_BOTH:
        ds_kimi = load_dataset("lambda/hermes-agent-reasoning-traces", "kimi", split="train")
        ds_glm  = load_dataset("lambda/hermes-agent-reasoning-traces", "glm-5.1", split="train")
        ds_kimi = ds_kimi.add_column("source", ["kimi"] * len(ds_kimi))
        ds_glm  = ds_glm.add_column("source", ["glm-5.1"] * len(ds_glm))
        ds = concatenate_datasets([ds_kimi, ds_glm]).shuffle(seed=0)
        print("Combined:", ds, "→ counts:", Counter(ds["source"]))


    sample = ds[0]
    print("\n=== Sample 0 ===")
    print("id        :", sample["id"])
    print("category  :", sample["category"], "/", sample["subcategory"])
    print("task      :", sample["task"])
    print("turns     :", len(sample["conversations"]))
    print("system[0] :", sample["conversations"][0]["value"][:220], "...\n")

    We install all required libraries and import the necessary modules to set up our environment. We then load the lambda/hermes-agent-reasoning-traces dataset and inspect its structure, fields, and categories. We also optionally combine multiple dataset configurations and examine a sample to understand the conversational format.
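    For orientation, each trajectory follows a ShareGPT-style layout with the fields printed above. The record below is a hand-written stand-in (all values invented for illustration) that mirrors that shape:

    ```python
    # A hypothetical record mirroring the dataset's ShareGPT-style layout.
    # Field names match those printed above; the values are invented.
    record = {
        "id": "example-0",
        "category": "coding",
        "subcategory": "debugging",
        "task": "Fix a failing unit test",
        "conversations": [
            {"from": "system", "value": "You are a helpful agent."},
            {"from": "human", "value": "The test suite fails, please investigate."},
            {"from": "gpt", "value": "<think>Inspect the traceback first.</think>"},
        ],
    }

    def turn_roles(example):
        """Return the ordered speaker roles of one trajectory."""
        return [t["from"] for t in example["conversations"]]

    print(turn_roles(record))  # ['system', 'human', 'gpt']
    ```

    Keeping a small synthetic record like this on hand makes it easy to unit-test downstream parsers without hitting the real dataset.
    
    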

    THINK_RE     = re.compile(r"<think>(.*?)</think>", re.DOTALL)
    TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.DOTALL)
    TOOL_RESP_RE = re.compile(r"<tool_response>\s*(.*?)\s*</tool_response>", re.DOTALL)


    def parse_assistant(value: str) -> dict:
        thoughts = [t.strip() for t in THINK_RE.findall(value)]
        calls = []
        for raw in TOOL_CALL_RE.findall(value):
            try:
                calls.append(json.loads(raw))
            except json.JSONDecodeError:
                calls.append({"name": "<unparseable>", "arguments": {}})
        final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
        return {"thoughts": thoughts, "tool_calls": calls, "final": final}


    def parse_tool(value: str):
        raw = TOOL_RESP_RE.search(value)
        if not raw: return {"raw": value}
        body = raw.group(1)
        try:    return json.loads(body)
        except: return {"raw": body}


    first_gpt = next(t for t in sample["conversations"] if t["from"] == "gpt")
    p = parse_assistant(first_gpt["value"])
    print("Thought preview :", (p["thoughts"][0][:160] + "...") if p["thoughts"] else "(none)")
    print("Tool calls      :", [(c.get("name"), list(c.get("arguments", {}).keys())) for c in p["tool_calls"]])

    We define regex-based parsers to extract reasoning traces, tool calls, and tool responses from the dataset. We process assistant messages to separate thoughts, actions, and final outputs in a structured way. We then test the parser on a sample conversation to verify that the extraction works correctly.
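    As a quick sanity check, the same regex convention can be exercised on a synthetic assistant turn we fully control (the `<think>`/`<tool_call>` tag format follows the parsers above; the `ls` tool and its arguments are invented):

    ```python
    import json, re

    # Same tag convention as the parsers defined above.
    THINK_RE     = re.compile(r"<think>(.*?)</think>", re.DOTALL)
    TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.DOTALL)

    def parse_assistant(value: str) -> dict:
        thoughts = [t.strip() for t in THINK_RE.findall(value)]
        calls = [json.loads(raw) for raw in TOOL_CALL_RE.findall(value)]
        final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
        return {"thoughts": thoughts, "tool_calls": calls, "final": final}

    # A synthetic assistant turn: one thought, one tool call, one visible answer.
    msg = ('<think>I should list the files.</think>'
           '<tool_call>{"name": "ls", "arguments": {"path": "."}}</tool_call>'
           'Listing the directory now.')
    p = parse_assistant(msg)
    print(p["thoughts"])                # ['I should list the files.']
    print(p["tool_calls"][0]["name"])   # 'ls'
    print(p["final"])                   # 'Listing the directory now.'
    ```

    Because the lazy `{.*?}` group only stops at a `}` that is immediately followed by the closing tag, nested JSON objects inside the call survive intact.
    
    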

    N = 3000
    sub = ds.select(range(min(N, len(ds))))


    tool_calls         = Counter()
    parallel_widths    = Counter()
    thoughts_per_turn  = []
    calls_per_traj     = []
    errors_per_traj    = []
    turns_per_traj     = []
    cat_counts         = Counter()


    for ex in sub:
        cat_counts[ex["category"]] += 1
        n_calls = n_err = 0
        turns_per_traj.append(len(ex["conversations"]))
        for t in ex["conversations"]:
            if t["from"] == "gpt":
                p = parse_assistant(t["value"])
                thoughts_per_turn.append(len(p["thoughts"]))
                if p["tool_calls"]:
                    parallel_widths[len(p["tool_calls"])] += 1
                    for c in p["tool_calls"]:
                        tool_calls[c.get("name", "<unknown>")] += 1
                    n_calls += len(p["tool_calls"])
            elif t["from"] == "tool":
                r = parse_tool(t["value"])
                blob = json.dumps(r).lower()
                if "error" in blob or '"exit_code": 1' in blob or "traceback" in blob:
                    n_err += 1
        calls_per_traj.append(n_calls)
        errors_per_traj.append(n_err)


    print(f"\nScanned {len(sub)} trajectories")
    print(f"Avg turns/traj      : {np.mean(turns_per_traj):.1f}")
    print(f"Avg tool calls/traj : {np.mean(calls_per_traj):.1f}")
    print(f"% with >=1 error    : {100*np.mean([e>0 for e in errors_per_traj]):.1f}%")
    print(f"% parallel turns    : {100*sum(v for k,v in parallel_widths.items() if k>1)/max(1,sum(parallel_widths.values())):.1f}%")
    print("Top 10 tools        :", tool_calls.most_common(10))


    fig, axes = plt.subplots(2, 2, figsize=(13, 9))


    top = tool_calls.most_common(15)
    axes[0,0].barh([t for t,_ in top][::-1], [c for _,c in top][::-1], color="teal")
    axes[0,0].set_title("Top 15 tools by call volume")
    axes[0,0].set_xlabel("calls")


    ks = sorted(parallel_widths)
    axes[0,1].bar([str(k) for k in ks], [parallel_widths[k] for k in ks], color="coral")
    axes[0,1].set_title("Tool calls per assistant turn (parallel width)")
    axes[0,1].set_xlabel("# tool calls in a single turn"); axes[0,1].set_ylabel("count")
    axes[0,1].set_yscale("log")


    axes[1,0].hist(turns_per_traj, bins=40, color="steelblue")
    axes[1,0].set_title("Conversation length"); axes[1,0].set_xlabel("turns")


    cats, vals = zip(*cat_counts.most_common())
    axes[1,1].pie(vals, labels=cats, autopct="%1.0f%%", startangle=90)
    axes[1,1].set_title("Category distribution")


    plt.tight_layout(); plt.show()

    We perform dataset-wide analytics to measure tool usage, conversation lengths, and error patterns. We aggregate statistics across many samples to understand overall agent behavior. We also create visualizations to highlight trends such as tool frequency, parallel calls, and category distribution.
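    The per-trajectory statistics collected above also lend themselves to per-category breakdowns. A minimal pandas sketch, using invented rows in place of the real scan results (the category labels and counts here are made up):

    ```python
    import pandas as pd

    # Hypothetical per-trajectory records, shaped like the scan results above;
    # categories and counts are invented for illustration.
    rows = [
        {"category": "coding",   "tool_calls": 6, "errors": 1},
        {"category": "coding",   "tool_calls": 4, "errors": 0},
        {"category": "research", "tool_calls": 9, "errors": 2},
        {"category": "research", "tool_calls": 7, "errors": 0},
    ]
    df = pd.DataFrame(rows)

    # Average tool calls and share of trajectories with at least one error,
    # broken down by category.
    summary = (df.groupby("category")
                 .agg(avg_calls=("tool_calls", "mean"),
                      error_rate=("errors", lambda e: (e > 0).mean()))
                 .reset_index())
    print(summary)
    ```

    The same `groupby` pattern extends directly to the real `sub` scan by appending one row per trajectory inside the loop.
    
    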

    def render_trace(ex, max_chars=350):
        print(f"\n{'='*72}\nTASK [{ex['category']} / {ex['subcategory']}]: {ex['task']}\n{'='*72}")
        for t in ex["conversations"]:
            role = t["from"]
            if role == "system":
                continue
            if role == "human":
                print(f"\n[USER]\n{textwrap.shorten(t['value'], 600)}")
            elif role == "gpt":
                p = parse_assistant(t["value"])
                for th in p["thoughts"]:
                    print(f"\n[THINK]\n{textwrap.shorten(th, max_chars)}")
                for c in p["tool_calls"]:
                    args = json.dumps(c.get("arguments", {}))[:200]
                    print(f"[CALL] {c.get('name')}({args})")
                if p["final"]:
                    print(f"\n[ANSWER]\n{textwrap.shorten(p['final'], max_chars)}")
            elif role == "tool":
                print(f"[TOOL_RESPONSE] {textwrap.shorten(t['value'], 220)}")
        print("="*72)


    idx = int(np.argmin(np.abs(np.array(turns_per_traj) - 10)))
    render_trace(sub[idx])


    def get_tool_schemas(ex):
        try:    return json.loads(ex["tools"])
        except: return []


    schemas = get_tool_schemas(sample)
    print(f"\nSample 0 has {len(schemas)} tools available")
    for s in schemas[:3]:
        fn = s.get("function", {})
        print(" -", fn.get("name"), "—", (fn.get("description") or "")[:80])


    ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}


    def to_openai_messages(conv):
        return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]


    example_msgs = to_openai_messages(sample["conversations"])
    print("\nFirst 2 OpenAI messages:")
    for m in example_msgs[:2]:
        print(" ", m["role"], "→", m["content"][:120].replace("\n", " "), "...")

    We build utilities to render full conversation traces in a readable format for deeper inspection. We also extract tool schemas and convert the dataset into OpenAI-style message format for compatibility with training pipelines. This helps us better understand both the structure of tools and how conversations can be standardized.
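    The role conversion is simple enough to verify on a toy conversation. A minimal sketch with an invented three-turn exchange (the tool name and values are made up):

    ```python
    # ShareGPT speaker labels → OpenAI chat roles, as in the tutorial above.
    ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}

    def to_openai_messages(conv):
        return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]

    # An invented three-turn exchange for illustration.
    conv = [
        {"from": "human", "value": "What files are here?"},
        {"from": "gpt", "value": '<tool_call>{"name": "ls", "arguments": {}}</tool_call>'},
        {"from": "tool", "value": '<tool_response>["a.py", "b.py"]</tool_response>'},
    ]
    msgs = to_openai_messages(conv)
    print([m["role"] for m in msgs])  # ['user', 'assistant', 'tool']
    ```

    An unknown speaker label would raise a `KeyError` here, which is a useful early signal that a trajectory deviates from the expected format.
    
    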

    from transformers import AutoTokenizer
    TOK_ID = "Qwen/Qwen2.5-0.5B-Instruct"
    tok = AutoTokenizer.from_pretrained(TOK_ID)


    def build_masked(conv, tokenizer, max_len=2048):
        msgs = to_openai_messages(conv)
        for m in msgs:
            if m["role"] == "tool":
                m["role"] = "user"
                m["content"] = "[TOOL OUTPUT]\n" + m["content"]
        input_ids, labels = [], []
        for m in msgs:
            text = tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=False)
            ids = tokenizer.encode(text, add_special_tokens=False)
            input_ids.extend(ids)
            labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))
        return input_ids[:max_len], labels[:max_len]


    ids, lbls = build_masked(sample["conversations"], tok)
    trainable = sum(1 for x in lbls if x != -100)
    print(f"\nTokenized example: {len(ids)} tokens, {trainable} trainable ({100*trainable/len(ids):.1f}%)")


    think_lens, call_lens, ans_lens = [], [], []
    for ex in sub.select(range(min(500, len(sub)))):
        for t in ex["conversations"]:
            if t["from"] != "gpt": continue
            p = parse_assistant(t["value"])
            for th in p["thoughts"]: think_lens.append(len(th))
            for c in p["tool_calls"]: call_lens.append(len(json.dumps(c)))
            if p["final"]: ans_lens.append(len(p["final"]))


    plt.figure(figsize=(10,4))
    plt.hist([think_lens, call_lens, ans_lens], bins=40, log=True,
             label=["<think>", "<tool_call>", "final answer"], stacked=False)
    plt.legend(); plt.xlabel("characters"); plt.title("Length distributions (log y)")
    plt.tight_layout(); plt.show()


    class TraceReplayer:
        def __init__(self, ex):
            self.ex = ex
            self.steps = []
            pending = None
            for t in ex["conversations"]:
                if t["from"] == "gpt":
                    if pending: self.steps.append(pending)
                    pending = {"think": parse_assistant(t["value"]), "responses": []}
                elif t["from"] == "tool" and pending:
                    pending["responses"].append(parse_tool(t["value"]))
            if pending: self.steps.append(pending)
        def __len__(self): return len(self.steps)
        def play(self, i):
            s = self.steps[i]
            print(f"\n── Step {i+1}/{len(self)} ──")
            for th in s["think"]["thoughts"]:
                print(f"💭 {textwrap.shorten(th, 280)}")
            for c in s["think"]["tool_calls"]:
                print(f"⚙️  {c.get('name')}({json.dumps(c.get('arguments', {}))[:140]})")
            for r in s["responses"]:
                print(f"📥 {textwrap.shorten(json.dumps(r), 200)}")
            if s["think"]["final"]:
                print(f"💬 {textwrap.shorten(s['think']['final'], 200)}")


    rp = TraceReplayer(sample)
    for i in range(min(3, len(rp))):
        rp.play(i)


    TRAIN = False
    if TRAIN:
        import torch
        from transformers import AutoModelForCausalLM
        from trl import SFTTrainer, SFTConfig


        train_subset = ds.select(range(200))


        def to_text(batch):
            msgs = to_openai_messages(batch["conversations"])
            for m in msgs:
                if m["role"] == "tool":
                    m["role"] = "user"; m["content"] = "[TOOL]\n" + m["content"]
            batch["text"] = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
            return batch


        train_subset = train_subset.map(to_text)


        model = AutoModelForCausalLM.from_pretrained(
            TOK_ID,
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
            device_map="auto" if torch.cuda.is_available() else None,
        )


        cfg = SFTConfig(
            output_dir="hermes-sft-demo",
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            max_steps=20,
            learning_rate=2e-5,
            logging_steps=2,
            max_seq_length=1024,
            dataset_text_field="text",
            report_to="none",
            fp16=torch.cuda.is_available(),
        )
        SFTTrainer(model=model, args=cfg, train_dataset=train_subset, processing_class=tok).train()
        print("Fine-tune demo finished.")


    print("\n✅ Tutorial complete. You now have parsers, analytics, plots, a replayer, "
          "tokenized + label-masked SFT examples, and an optional training hook.")

    We tokenize the conversations and apply label masking so only assistant responses contribute to training. We analyze the length distributions of reasoning, tool calls, and answers to gain further insight. We also implement a trace replayer to step through agent behavior and optionally run a small fine-tuning loop.
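    The masking rule itself is independent of any particular tokenizer. A minimal sketch with fake token ids makes the pattern explicit (the ids here are invented; only the masking logic matters):

    ```python
    # Assistant-only label masking, demonstrated with fake token ids.
    IGNORE = -100  # PyTorch's CrossEntropyLoss default ignore_index

    def mask_labels(segments):
        """segments: list of (role, token_ids). Only assistant tokens keep labels;
        everything else is replaced with IGNORE so it contributes no loss."""
        input_ids, labels = [], []
        for role, ids in segments:
            input_ids.extend(ids)
            labels.extend(ids if role == "assistant" else [IGNORE] * len(ids))
        return input_ids, labels

    # Invented token ids standing in for a user turn, an assistant turn,
    # and a follow-up user turn.
    segments = [
        ("user", [101, 102, 103]),
        ("assistant", [201, 202]),
        ("user", [104]),
    ]
    ids, lbls = mask_labels(segments)
    print(ids)   # [101, 102, 103, 201, 202, 104]
    print(lbls)  # [-100, -100, -100, 201, 202, -100]
    ```

    This is exactly what `build_masked` above does, with the real tokenizer's chat template producing the token ids per message.
    
    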

    In conclusion, we developed a structured workflow to parse, analyze, and work effectively with agent reasoning traces. We were able to break conversations down into meaningful components, examine how agents reason step by step, and measure how they interact with tools during problem solving. Using the visualizations and analytics, we gained insight into common patterns and behaviors across the dataset. In addition, we converted the data into a format suitable for training language models, including handling tokenization and label masking for assistant responses. This process provides a strong foundation for studying, evaluating, and improving tool-using AI systems in a practical, scalable way.


    Check out the full codes with the accompanying notebook.

    Naveed Ahmad

    Naveed Ahmad is a technology journalist and AI writer at ArticlesStock, covering artificial intelligence, machine learning, and emerging tech policy. Read his latest articles.
