
    How to Build and Evolve a Custom OpenAI Agent with A-Evolve Using Benchmarks, Skills, Memory, and Workspace Mutations

    By Naveed Ahmad | 01/04/2026 | Updated: 01/04/2026 | 12 Mins Read


    In this tutorial, we work directly with the A-Evolve framework in Colab and build a complete evolutionary agent pipeline from the ground up. We set up the repository, configure an OpenAI-powered agent, define a custom benchmark, and build our own evolution engine to see how A-Evolve actually improves an agent through iterative workspace mutations. Throughout the code, we use the framework’s core abstractions for prompts, skills, memory, benchmarking, and evolution, which helps us understand not just how to run A-Evolve, but also how to extend it in a practical, Colab-friendly way.

    import os
    import sys
    import json
    import textwrap
    import subprocess
    import shutil
    from pathlib import Path
    from getpass import getpass
    from collections import Counter, defaultdict
    
    
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "openai>=1.30.0", "pyyaml>=6.0", "matplotlib>=3.8"])
    REPO_DIR = Path("/content/a-evolve")
    if REPO_DIR.exists():
       shutil.rmtree(REPO_DIR)
    subprocess.check_call(["git", "clone", "--depth", "1", "https://github.com/A-EVO-Lab/a-evolve.git", str(REPO_DIR)])
    sys.path.insert(0, str(REPO_DIR))
    
    
    if not os.environ.get("OPENAI_API_KEY"):
       os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ").strip()
    
    
    OPENAI_MODEL = "gpt-4o-mini"
    
    
    import yaml
    import matplotlib.pyplot as plt
    
    
    import agent_evolve as ae
    from agent_evolve.protocol.base_agent import BaseAgent
    from agent_evolve.benchmarks.base import BenchmarkAdapter
    from agent_evolve.engine.base import EvolutionEngine
    from agent_evolve.types import Task, Trajectory, Feedback, StepResult
    from agent_evolve.contract.workspace import AgentWorkspace
    from openai import OpenAI
    
    
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    
    
    WORKSPACE_ROOT = Path("/content/a_evolve_demo_workspace")
    if WORKSPACE_ROOT.exists():
       shutil.rmtree(WORKSPACE_ROOT)
    
    
    # One directory per workspace layer the agent reads from.
    (WORKSPACE_ROOT / "prompts").mkdir(parents=True, exist_ok=True)
    (WORKSPACE_ROOT / "skills").mkdir(parents=True, exist_ok=True)
    (WORKSPACE_ROOT / "memory").mkdir(parents=True, exist_ok=True)
    (WORKSPACE_ROOT / "tools").mkdir(parents=True, exist_ok=True)
    
    
    manifest = {
       "name": "colab-aevolve-demo-agent",
       "version": "0.1.0",
       "contract_version": "1.0",
       "agent": {
           "type": "custom",
           "entrypoint": None
       },
       "evolvable_layers": ["prompts", "skills", "memory"],
       "reload_strategy": "hot"
    }
    with open(WORKSPACE_ROOT / "manifest.yaml", "w") as f:
       yaml.dump(manifest, f, sort_keys=False)
    
    
    initial_system_prompt = textwrap.dedent("""
    You are a precise text-transformation agent.

    Solve the task exactly.
    Be concise.
    Return only the final answer with no explanation unless the task explicitly asks for JSON.
    """).strip()
    
    
    (WORKSPACE_ROOT / "prompts" / "system.md").write_text(initial_system_prompt)

    We prepare the full Colab environment needed to run the tutorial from start to finish. We install the required packages, clone the A-Evolve repository, load the framework imports, and securely collect the OpenAI API key for model access. We also define the workspace structure and initialize the manifest and system prompt, giving our evolving agent a valid starting point within the A-Evolve framework.
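Before handing a workspace to the framework, it can help to confirm that the expected layout is actually on disk. The sketch below is a hypothetical helper, not part of A-Evolve: `missing_workspace_entries`, `build_demo_workspace`, and the `REQUIRED_*` constants are names invented here, and the required entries simply mirror the directories and files created in the setup code above. It runs against a temporary directory so it does not touch the real workspace.

```python
import tempfile
from pathlib import Path

# Assumed layout, copied from the tutorial's setup code (not an A-Evolve API).
REQUIRED_DIRS = ("prompts", "skills", "memory", "tools")
REQUIRED_FILES = ("manifest.yaml", "prompts/system.md")


def missing_workspace_entries(root: Path) -> list[str]:
    """Return the expected relative paths that are absent under root."""
    missing = [d + "/" for d in REQUIRED_DIRS if not (root / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root / f).is_file()]
    return missing


def build_demo_workspace(root: Path) -> None:
    """Create a minimal stand-in workspace with placeholder file contents."""
    for d in REQUIRED_DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)
    (root / "manifest.yaml").write_text("name: demo\n")
    (root / "prompts" / "system.md").write_text("You are a precise agent.\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = missing_workspace_entries(root)  # everything is missing
    build_demo_workspace(root)
    after = missing_workspace_entries(root)   # nothing is missing

print("missing before:", before)
print("missing after :", after)
```

Calling `missing_workspace_entries(WORKSPACE_ROOT)` right after the setup cell would be one way to catch a mistyped directory name early.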

    def build_dataset():
       train = [
           {
               "id": "train-01",
               "rule": "json_sum",
               "input": "Numbers: 7, 11, 4",
               "answer": '{"sum":22}'
           },
           {
               "id": "train-02",
               "rule": "json_sum",
               "input": "Numbers: 20, 5, 3, 2",
               "answer": '{"sum":30}'
           },
           {
               "id": "train-03",
               "rule": "acronym_upper",
               "input": "Create the acronym from: retrieval augmented generation",
               "answer": "RAG"
           },
           {
               "id": "train-04",
               "rule": "acronym_upper",
               "input": "Create the acronym from: large language model",
               "answer": "LLM"
           },
           # Entries train-05 and train-06 are corrupted in the source listing; only the
           # fragments `cherry"` and `lion` survive. They appear to be the
           # "pipe_unique_sorted_lower" examples and are left out here rather than guessed.
           {
               "id": "train-07",
               "rule": "vowel_parity",
               "input": "Word: equation",
               # "equation" contains e, u, a, i, o: five vowels, so the parity is odd.
               "answer": "ODD"
           },
           {
               "id": "train-08",
               "rule": "vowel_parity",
               "input": "Word: education",
               "answer": "ODD"
           },
       ]

       holdout = [
           {
               "id": "holdout-01",
               "rule": "json_sum",
               "input": "Numbers: 100, 1, 9",
               "answer": '{"sum":110}'
           },
           {
               "id": "holdout-02",
               "rule": "acronym_upper",
               "input": "Create the acronym from: artificial general intelligence",
               "answer": "AGI"
           },
           # Entry holdout-03 is corrupted in the source listing (only `mango"` survives)
           # and is left out here rather than guessed.
           {
               "id": "holdout-04",
               "rule": "vowel_parity",
               "input": "Word: aeroplane",
               "answer": "ODD"
           },
       ]
       return train, holdout
    
    
    TRAIN_DATA, HOLDOUT_DATA = build_dataset()
    
    
    def normalize_text(x: str) -> str:
       # Compare answers with all spaces removed so "{"sum": 22}" matches '{"sum":22}'.
       return x.strip().replace(" ", "")
    
    
    class MiniTextBenchmark(BenchmarkAdapter):
       def __init__(self):
           self.train = TRAIN_DATA
           self.holdout = HOLDOUT_DATA

       def get_tasks(self, split: str = "train", limit: int = 10):
           # Wrap each dataset row in a Task; the gold answer travels in metadata.
           data = self.train if split == "train" else self.holdout
           tasks = []
           for row in data[:limit]:
               tasks.append(
                   Task(
                       id=row["id"],
                       input=row["input"],
                       metadata={
                           "rule": row["rule"],
                           "answer": row["answer"]
                       }
                   )
               )
           return tasks

       def evaluate(self, task: Task, trajectory: Trajectory):
           # Exact match after whitespace normalization; the score is binary.
           pred = trajectory.output.strip()
           gold = task.metadata["answer"].strip()
           success = normalize_text(pred) == normalize_text(gold)
           detail = {
               "rule": task.metadata["rule"],
               "gold": gold,
               "pred": pred,
               "input": task.input,
               "success": success
           }
           score = 1.0 if success else 0.0
           return Feedback(
               success=success,
               score=score,
               detail=json.dumps(detail, ensure_ascii=False),
               raw=detail
           )
    
    
    SKILL_ROUTING = {
       "json_sum": ["json", "sum"],
       "acronym_upper": ["acronym", "uppercase"],
       "pipe_unique_sorted_lower": ["unique", "sorted", "lowercase", "pipe"],
       "vowel_parity": ["vowel", "odd", "even", "parity"]
    }
    

    We define the training and holdout datasets used to measure the agent before and after evolution. We build a custom benchmark class that packages each example into A-Evolve tasks and evaluates predictions against exact expected outputs. We also set up the routing hints for skills, which prepares the system to connect different task types with the right behavioral patterns later in the workflow.
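Because every rule in this benchmark is deterministic, the gold answers can be cross-checked without calling a model. The sketch below is not part of the tutorial or of A-Evolve: the four solver functions and `REFERENCE_SOLVERS` are hypothetical names, and each solver is one reading of the corresponding rule (the pipe rule follows the steps spelled out in the tutorial's skill template). The `Tokens:` input is an invented example.

```python
import json
import re


def json_sum(text: str) -> str:
    """Sum every integer in the input and emit compact JSON."""
    numbers = [int(n) for n in re.findall(r"-?\d+", text)]
    return json.dumps({"sum": sum(numbers)}, separators=(",", ":"))


def acronym_upper(text: str) -> str:
    """Uppercase first letters of the words after the colon."""
    phrase = text.split(":", 1)[1]
    return "".join(word[0].upper() for word in phrase.split())


def pipe_unique_sorted_lower(text: str) -> str:
    """Lowercase, deduplicate, sort, and pipe-join the comma-separated tokens."""
    raw_tokens = text.split(":", 1)[1].split(",")
    tokens = {t.strip().lower() for t in raw_tokens if t.strip()}
    return "|".join(sorted(tokens))


def vowel_parity(text: str) -> str:
    """Count a, e, i, o, u in the word after the colon and report the parity."""
    word = text.split(":", 1)[1].strip().lower()
    count = sum(ch in "aeiou" for ch in word)
    return "ODD" if count % 2 else "EVEN"


REFERENCE_SOLVERS = {
    "json_sum": json_sum,
    "acronym_upper": acronym_upper,
    "pipe_unique_sorted_lower": pipe_unique_sorted_lower,
    "vowel_parity": vowel_parity,
}

print(REFERENCE_SOLVERS["json_sum"]("Numbers: 7, 11, 4"))
print(REFERENCE_SOLVERS["acronym_upper"]("Create the acronym from: large language model"))
print(REFERENCE_SOLVERS["pipe_unique_sorted_lower"]("Tokens: Pear, apple, pear, FIG"))
print(REFERENCE_SOLVERS["vowel_parity"]("Word: education"))
```

Running each dataset row through its solver and comparing with the stored answer is a cheap way to catch a mislabeled gold value before it is blamed on the agent.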

    class ColabAEResolverAgent(BaseAgent):
       def __init__(self, workspace_dir: str | Path, model: str = OPENAI_MODEL):
           self.model = model
           super().__init__(workspace_dir)

       def _pick_relevant_skills(self, task: Task):
           # Keep skills whose name or description mentions a keyword routed to this rule.
           rule = task.metadata.get("rule", "")
           keywords = SKILL_ROUTING.get(rule, [])
           chosen = []
           for skill in self.skills:
               hay = f"{skill.name} {skill.description}".lower()
               if any(k in hay for k in keywords):
                   chosen.append(skill)
           return chosen[:3]

       def solve(self, task: Task) -> Trajectory:
           relevant_skills = self._pick_relevant_skills(task)
           relevant_skill_texts = []
           for s in relevant_skills:
               relevant_skill_texts.append(self.get_skill_content(s.name))

           # Only the eight most recent memories are injected into the prompt.
           memory_text = "\n".join(
               [f"- {m.get('content', '')}" for m in self.memories[-8:]]
           ).strip()

           skill_block = "\n\n".join(relevant_skill_texts).strip()
           if not skill_block:
               skill_block = "(no skills loaded yet)"

           if not memory_text:
               memory_text = "(no memory yet)"

           user_prompt = textwrap.dedent(f"""
           TASK RULE: {task.metadata.get("rule")}
           TASK INPUT:
           {task.input}

           ACTIVE SYSTEM PROMPT:
           {self.system_prompt}

           RELEVANT SKILLS:
           {skill_block}

           RECENT MEMORIES:
           {memory_text}

           Solve the task exactly.
           Return only the final answer.
           """).strip()

           response = client.chat.completions.create(
               model=self.model,
               temperature=0,
               messages=[
                   {"role": "system", "content": "You are a precise text-transformation agent."},
                   {"role": "user", "content": user_prompt}
               ]
           )

           output = (response.choices[0].message.content or "").strip()

           self.remember(
               content=f"Task {task.id} under rule {task.metadata.get('rule')} produced output: {output}",
               category="episodic"
           )

           return Trajectory(
               task_id=task.id,
               output=output,
               steps=[
                   {
                       "rule": task.metadata.get("rule"),
                       "used_skills": [s.name for s in relevant_skills],
                       "system_prompt_chars": len(self.system_prompt),
                       "memory_items_seen": len(self.memories)
                   }
               ]
           )
    
    
    SKILL_TEMPLATES = {
       "json_sum": textwrap.dedent("""
           ---
           name: json-sum-exact
           description: Add all integers and output strict compact JSON with the single key sum.
           ---
           # JSON Sum Exact

           Procedure:
           1. Extract all integers from the task input.
           2. Add them.
           3. Return exactly one compact JSON object in this format:
              {"sum":NUMBER}
           4. Do not add spaces, explanations, markdown, or extra keys.
       """).strip(),

       "acronym_upper": textwrap.dedent("""
           ---
           name: acronym-upper-exact
           description: Build an uppercase acronym by taking the first letter of each word.
           ---
           # Acronym Upper Exact

           Procedure:
           1. Identify the phrase after the colon.
           2. Take the first letter of each word.
           3. Convert every letter to uppercase.
           4. Return only the final acronym, with no punctuation or explanation.
       """).strip(),

       "pipe_unique_sorted_lower": textwrap.dedent("""
           ---
           name: pipe-unique-sorted-lower
           description: Normalize tokens to lowercase, deduplicate them, sort ascending, and join them with pipes.
           ---
           # Pipe Unique Sorted Lower

           Procedure:
           1. Read the token list after the colon.
           2. Split by commas.
           3. Trim spaces and lowercase every token.
           4. Remove duplicates.
           5. Sort alphabetically ascending.
           6. Join with "|" and return only the final string.
       """).strip(),

       "vowel_parity": textwrap.dedent("""
           ---
           name: vowel-parity-exact
           description: Count vowels in the word and output ODD or EVEN only.
           ---
           # Vowel Parity Exact

           Procedure:
           1. Read the target word after the colon.
           2. Count vowels using a, e, i, o, u.
           3. If the count is odd, output ODD.
           4. If the count is even, output EVEN.
           5. Return only ODD or EVEN with no extra text.
       """).strip(),
    }
    
    
    PROMPT_APPENDIX = textwrap.dedent("""
    ## STRICT OUTPUT CONTRACT
    - Output only the final answer.
    - Never explain your reasoning.
    - If a task expects JSON, return compact JSON with exact keys only.
    - When a relevant skill exists, follow it literally.
    - Exact format is more important than being conversational.
    """).strip()

    We implement the custom A-Evolve agent that reads the active prompt, skills, and memory from the workspace and uses OpenAI to solve each task. We design the agent so it selects relevant skills, injects recent memory, and returns trajectories in the structure expected by the framework. We also define the skill templates and the strict output contract, which serve as the main ingredients that the evolution engine can add to improve performance over time.
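The skill selection step is plain keyword matching, so it can be exercised on its own. In the minimal sketch below, `SkillStub` is a hypothetical stand-in for whatever skill metadata object the framework exposes (only `name` and `description` are assumed), `pick_relevant_skills` is a name invented here, and the routing table repeats the keywords from the tutorial.

```python
from dataclasses import dataclass


@dataclass
class SkillStub:
    """Hypothetical stand-in for a skill's metadata; not an A-Evolve class."""
    name: str
    description: str


# Rule -> keywords, as in the tutorial's SKILL_ROUTING table.
SKILL_ROUTING = {
    "json_sum": ["json", "sum"],
    "acronym_upper": ["acronym", "uppercase"],
    "pipe_unique_sorted_lower": ["unique", "sorted", "lowercase", "pipe"],
    "vowel_parity": ["vowel", "odd", "even", "parity"],
}


def pick_relevant_skills(rule: str, skills: list[SkillStub], limit: int = 3) -> list[SkillStub]:
    """Keep skills whose name or description contains any keyword for the rule."""
    keywords = SKILL_ROUTING.get(rule, [])
    chosen = []
    for skill in skills:
        haystack = f"{skill.name} {skill.description}".lower()
        if any(keyword in haystack for keyword in keywords):
            chosen.append(skill)
    return chosen[:limit]


skills = [
    SkillStub("json-sum-exact", "Add all integers and output compact JSON."),
    SkillStub("vowel-parity-exact", "Count vowels in a word."),
]

print([s.name for s in pick_relevant_skills("json_sum", skills)])
print([s.name for s in pick_relevant_skills("vowel_parity", skills)])
print([s.name for s in pick_relevant_skills("unknown_rule", skills)])
```

Substring matching is deliberately loose: a short keyword such as "sum" or "even" can match unrelated descriptions, so skill descriptions are worth wording with the routing table in mind.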

    class ColabMutationEngine(EvolutionEngine):
       def __init__(self):
           self.cycle_count = 0

       def step(self, workspace: AgentWorkspace, observations, history, trial):
           self.cycle_count += 1

           # Group this cycle's failures by rule so each rule is handled once.
           failed_by_rule = defaultdict(list)
           for obs in observations:
               if not obs.feedback.success:
                   failed_by_rule[obs.task.metadata["rule"]].append({
                       "task_id": obs.task.id,
                       "input": obs.task.input,
                       "gold": obs.task.metadata["answer"],
                       "pred": obs.trajectory.output
                   })

           mutated = False
           summaries = []

           # Mutation 1: append the output contract to the prompt, at most once.
           current_prompt = workspace.read_prompt()
           if "STRICT OUTPUT CONTRACT" not in current_prompt:
               workspace.write_prompt(current_prompt.rstrip() + "\n\n" + PROMPT_APPENDIX + "\n")
               mutated = True
               summaries.append("prompt hardened")

           existing_skill_names = {s.name for s in workspace.list_skills()}

           needed_rule_to_skill_name = {
               "json_sum": "json-sum-exact",
               "acronym_upper": "acronym-upper-exact",
               "pipe_unique_sorted_lower": "pipe-unique-sorted-lower",
               "vowel_parity": "vowel-parity-exact",
           }

           # Mutation 2: add the matching skill for every rule that failed, then log the failure.
           for rule, fails in failed_by_rule.items():
               skill_name = needed_rule_to_skill_name[rule]
               if skill_name not in existing_skill_names:
                   workspace.write_skill(skill_name, SKILL_TEMPLATES[rule])
                   mutated = True
                   summaries.append(f"added skill {skill_name}")

               workspace.add_memory({
                   "content": f"Cycle {self.cycle_count}: rule={rule} failed {len(fails)} time(s). Common failure pattern: output formatting or procedure mismatch. Gold examples must be followed exactly.",
                   "rule": rule,
                   "examples": fails[:2]
               }, category="episodic")

           if not failed_by_rule:
               workspace.add_memory({
                   "content": f"Cycle {self.cycle_count}: all current training tasks succeeded. Preserve exact formatting behavior."
               }, category="episodic")

           summary = " | ".join(summaries) if summaries else "no mutation needed"
           return StepResult(
               mutated=mutated,
               summary=summary,
               metadata={
                   "failed_rules": list(failed_by_rule.keys()),
                   "num_failed_rules": len(failed_by_rule),
                   "cycle": self.cycle_count
               }
           )
    
    
    def evaluate_split(agent, benchmark, split="train"):
       # Run the agent on every task in the split and collect one result row per task.
       tasks = benchmark.get_tasks(split=split, limit=100)
       rows = []
       total = 0
       correct = 0
       for task in tasks:
           traj = agent.solve(task)
           fb = benchmark.evaluate(task, traj)
           rows.append({
               "task_id": task.id,
               "rule": task.metadata["rule"],
               "input": task.input,
               "gold": task.metadata["answer"],
               "pred": traj.output,
               "score": fb.score,
               "success": fb.success
           })
           total += 1
           correct += int(fb.success)
       score = correct / max(total, 1)
       return score, rows
    
    
    def print_table(rows, title, max_rows=20):
       print("\n" + "=" * 110)
       print(title)
       print("=" * 110)
       shown = rows[:max_rows]
       for r in shown:
           print(f"[{r['task_id']}] rule={r['rule']}")
           print(f"  input : {r['input']}")
           print(f"  gold  : {r['gold']}")
           print(f"  pred  : {r['pred']}")
           print(f"  score : {r['score']}  success={r['success']}")
           print("-" * 110)
    
    
    def show_workspace(root: Path):
       print("\n" + "=" * 110)
       print("EVOLVED WORKSPACE SNAPSHOT")
       print("=" * 110)
       for path in sorted(root.rglob("*")):
           rel = path.relative_to(root)
           if path.is_dir():
               print(f"[DIR ] {rel}/")
           else:
               print(f"[FILE] {rel}")
    
    
    def show_skill_contents(root: Path):
       skill_files = sorted((root / "skills").glob("*/SKILL.md"))
       print("\n" + "=" * 110)
       print("SKILL FILES")
       print("=" * 110)
       if not skill_files:
           print("No skill files yet.")
       for sf in skill_files:
           print(f"\n--- {sf.parent.name}/SKILL.md ---")
           print(sf.read_text())

    We build a custom evolution engine that inspects failures and decides how to mutate the workspace. We use it to harden the prompt, add missing skills, and store episodic memory so that the agent gradually learns better formatting and task-specific behavior across cycles. We also define evaluation and reporting utilities that help us score the agent, inspect predictions, and examine the evolved workspace clearly.
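The decision logic of the engine can be separated from the workspace I/O and tested on plain data. The sketch below is a simplified illustration rather than the framework's interface: `plan_mutations` is a hypothetical function, observations are reduced to dictionaries with `rule`, `task_id`, and `success` keys, and the function only returns a list of intended actions instead of writing anything.

```python
from collections import defaultdict

# Rule -> skill name, as used by the tutorial's engine.
RULE_TO_SKILL = {
    "json_sum": "json-sum-exact",
    "acronym_upper": "acronym-upper-exact",
    "pipe_unique_sorted_lower": "pipe-unique-sorted-lower",
    "vowel_parity": "vowel-parity-exact",
}
CONTRACT_MARKER = "STRICT OUTPUT CONTRACT"


def plan_mutations(prompt: str, existing_skills: set[str], observations: list[dict]) -> list[str]:
    """Return the mutations one cycle would apply, without touching any files."""
    failed_by_rule = defaultdict(list)
    for obs in observations:
        if not obs["success"]:
            failed_by_rule[obs["rule"]].append(obs["task_id"])

    plan = []
    # The prompt is hardened once; the marker makes the step idempotent.
    if CONTRACT_MARKER not in prompt:
        plan.append("harden prompt")
    # A skill is added only for rules that failed and have no skill yet.
    for rule in failed_by_rule:
        skill_name = RULE_TO_SKILL[rule]
        if skill_name not in existing_skills:
            plan.append(f"add skill {skill_name}")
    return plan


observations = [
    {"task_id": "train-01", "rule": "json_sum", "success": False},
    {"task_id": "train-02", "rule": "json_sum", "success": False},
    {"task_id": "train-03", "rule": "acronym_upper", "success": True},
]

first_cycle = plan_mutations("You are a precise agent.", set(), observations)
second_cycle = plan_mutations(
    "You are a precise agent.\n\n## STRICT OUTPUT CONTRACT", {"json-sum-exact"}, observations
)
print(first_cycle)
print(second_cycle)
```

The second call returns an empty plan even though `json_sum` still fails, which shows a limit of this template-based design: once the prompt is hardened and the skill exists, the only remaining lever is the episodic memory.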

    benchmark = MiniTextBenchmark()
    agent = ColabAEResolverAgent(WORKSPACE_ROOT, model=OPENAI_MODEL)
    engine = ColabMutationEngine()

    # Score the unevolved agent first so there is a baseline to compare against.
    baseline_train_score, baseline_train_rows = evaluate_split(agent, benchmark, split="train")
    baseline_holdout_score, baseline_holdout_rows = evaluate_split(agent, benchmark, split="holdout")

    print(f"Baseline train score   : {baseline_train_score:.3f}")
    print(f"Baseline holdout score : {baseline_holdout_score:.3f}")

    print_table(baseline_train_rows, "BASELINE TRAIN RESULTS")
    print_table(baseline_holdout_rows, "BASELINE HOLDOUT RESULTS")
    
    
    config = ae.EvolveConfig(
       batch_size=8,
       max_cycles=4,
       egl_window=2
    )
    
    
    evolver = ae.Evolver(
       agent=agent,
       benchmark=benchmark,
       config=config,
       engine=engine
    )
    
    
    result = evolver.run(cycles=4)

    print("\n" + "=" * 110)
    print("A-EVOLVE RUN SUMMARY")
    print("=" * 110)
    print(f"Cycles completed : {result.cycles_completed}")
    print(f"Final train score: {result.final_score:.3f}")
    print(f"Score history    : {result.score_history}")
    print(f"Converged        : {result.converged}")

    # Reload the mutated workspace from disk before re-scoring the agent.
    agent.reload_from_fs()
    final_train_score, final_train_rows = evaluate_split(agent, benchmark, split="train")
    final_holdout_score, final_holdout_rows = evaluate_split(agent, benchmark, split="holdout")

    print(f"\nFinal train score   : {final_train_score:.3f}")
    print(f"Final holdout score : {final_holdout_score:.3f}")

    print_table(final_train_rows, "FINAL TRAIN RESULTS")
    print_table(final_holdout_rows, "FINAL HOLDOUT RESULTS")

    show_workspace(WORKSPACE_ROOT)
    show_skill_contents(WORKSPACE_ROOT)

    print("\n" + "=" * 110)
    print("FINAL SYSTEM PROMPT")
    print("=" * 110)
    print((WORKSPACE_ROOT / "prompts" / "system.md").read_text())

    episodic_path = WORKSPACE_ROOT / "memory" / "episodic.jsonl"
    if episodic_path.exists():
       print("\n" + "=" * 110)
       print("RECENT EPISODIC MEMORY")
       print("=" * 110)
       lines = episodic_path.read_text().strip().splitlines()
       for line in lines[-10:]:
           print(line)

    plt.figure(figsize=(8, 4))
    plt.plot(range(1, len(result.score_history) + 1), result.score_history, marker="o")
    plt.xlabel("Evolution cycle")
    plt.ylabel("Train score")
    plt.title("A-Evolve score history")
    plt.grid(True)
    plt.show()
    
    
    print("\n" + "=" * 110)
    print("COMPARISON")
    print("=" * 110)
    print(f"Train   : {baseline_train_score:.3f} -> {final_train_score:.3f}")
    print(f"Holdout : {baseline_holdout_score:.3f} -> {final_holdout_score:.3f}")

    # Pair baseline and final rows by task id and record rules that flipped from fail to pass.
    improved_rules = []
    for before, after in zip(sorted(baseline_train_rows, key=lambda x: x["task_id"]), sorted(final_train_rows, key=lambda x: x["task_id"])):
       if (not before["success"]) and after["success"]:
           improved_rules.append(after["rule"])

    print(f"Improved train cases by rule: {dict(Counter(improved_rules))}")

    print("\nDone. This notebook used the real A-Evolve framework and demonstrated:")
    print("1) a valid agent workspace")
    print("2) a BaseAgent subclass")
    print("3) a BenchmarkAdapter subclass")
    print("4) an EvolutionEngine subclass")
    print("5) prompt / skill / memory mutations across A-Evolve cycles")

    We put everything together and run the full A-Evolve loop from baseline evaluation to post-evolution analysis. We measure the agent before training, execute several evolution cycles, reload the workspace, and then compare the final train and holdout performance to see what improves. We also inspect the evolved prompt, skills, memory, and score history, which lets us clearly observe how the framework transforms the agent step by step.
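A single aggregate score can hide which rule is responsible for a change, so a per-rule breakdown of the result rows is a useful complement to the comparison above. The sketch below assumes rows shaped like those returned by `evaluate_split` (only the `rule` and `success` keys are used); `accuracy_by_rule` is a hypothetical helper, and the sample rows are invented for illustration, not results from an actual run.

```python
from collections import defaultdict


def accuracy_by_rule(rows: list[dict]) -> dict[str, float]:
    """Fraction of successful tasks per rule."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        totals[row["rule"]] += 1
        hits[row["rule"]] += int(row["success"])
    return {rule: hits[rule] / totals[rule] for rule in totals}


# Invented sample rows, standing in for baseline and final evaluation output.
baseline_rows = [
    {"rule": "json_sum", "success": True},
    {"rule": "json_sum", "success": False},
    {"rule": "vowel_parity", "success": False},
]
final_rows = [
    {"rule": "json_sum", "success": True},
    {"rule": "json_sum", "success": True},
    {"rule": "vowel_parity", "success": False},
]

baseline_acc = accuracy_by_rule(baseline_rows)
final_acc = accuracy_by_rule(final_rows)
for rule in baseline_acc:
    print(f"{rule}: {baseline_acc[rule]:.2f} -> {final_acc[rule]:.2f}")
```

With only a handful of tasks per rule, each figure moves in large steps, so small differences between runs should be read with caution.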

    In conclusion, we successfully built and ran a full A-Evolve workflow rather than just inspecting the repository at a surface level. We created a valid workspace, plugged in a custom agent, benchmarked it on structured tasks, and then evolved its behavior by modifying prompts, adding skills, and storing memory across cycles. We also saw how A-Evolve’s design lets us treat agent improvement as a repeatable engineering process, in which we can measure baseline performance, apply controlled mutations, and observe how the system becomes more accurate over time.





