Meet ‘AutoAgent’: The Open-Source Library That Lets an AI Engineer and Optimize Its Own Agent Harness Overnight

By Naveed Ahmad · 05/04/2026 · 6 Mins Read


There’s a particular kind of tedium that every AI engineer knows intimately: the prompt-tuning loop. You write a system prompt, run your agent against a benchmark, read the failure traces, tweak the prompt, add a tool, rerun. Repeat this a few dozen times and you might move the needle. It’s grunt work dressed up in Python files. Now, a new open-source library called AutoAgent, built by Kevin Gu at thirdlayer.inc, proposes an unsettling alternative: don’t do that work yourself. Let an AI do it.

AutoAgent is an open-source library for autonomously improving an agent on any domain. In a 24-hour run, it hit #1 on SpreadsheetBench with a score of 96.5%, and achieved the #1 GPT-5 score on TerminalBench with 55.1%.

https://x.com/kevingu/status/2039843234760073341

    What Is AutoAgent, Actually?

AutoAgent is described as being ‘like autoresearch but for agent engineering.’ The idea: give an AI agent a task, then let it build and iterate on an agent harness autonomously overnight. It modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards the change, and repeats.

To understand the analogy: Andrej Karpathy’s autoresearch does the same thing for ML training. It loops through propose-train-evaluate cycles, keeping only changes that improve validation loss. AutoAgent ports that same ratchet loop from ML training into agent engineering. Instead of optimizing a model’s weights or training hyperparameters, it optimizes the harness: the system prompt, tool definitions, routing logic, and orchestration strategy that determine how an agent behaves on a task.
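That propose-evaluate-keep ratchet fits in a few lines of Python. The sketch below is purely illustrative: `propose_change`, `run_benchmark`, and the dict-based harness are hypothetical stand-ins, not AutoAgent’s actual API.

```python
import copy
import random

def hill_climb(harness, propose_change, run_benchmark, iterations=10):
    """Generic ratchet loop: keep a proposed change only if the score improves."""
    best_score = run_benchmark(harness)
    for _ in range(iterations):
        candidate = propose_change(copy.deepcopy(harness))  # mutate a copy
        score = run_benchmark(candidate)
        if score > best_score:  # keep only strict improvements
            harness, best_score = candidate, score
        # otherwise the candidate is discarded and the next proposal
        # starts again from the best harness seen so far
    return harness, best_score

# Toy demo: the "harness" is a dict with one tunable knob, and the
# "benchmark" rewards temperatures close to 0.2.
def toy_propose(h):
    h["temperature"] = round(random.uniform(0.0, 1.0), 2)
    return h

def toy_benchmark(h):
    return 1.0 - abs(h["temperature"] - 0.2)

random.seed(0)
best, score = hill_climb({"temperature": 0.9}, toy_propose, toy_benchmark,
                         iterations=50)
print(best, round(score, 3))
```

The same skeleton works whether the proposal step is a random mutation, as here, or a meta-agent rewriting prompts and tool definitions.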

A harness, in this context, is the scaffolding around an LLM: what system prompt it receives, what tools it can call, how it routes between sub-agents, and how tasks are formatted as inputs. Most agent engineers hand-craft this scaffolding. AutoAgent automates the iteration on that scaffolding itself.
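A harness in this sense is just structured data plus glue code. A minimal sketch, with field names invented for illustration rather than taken from AutoAgent:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    """The scaffolding around an LLM: prompt, tools, routing, task formatting."""
    system_prompt: str
    tools: dict[str, Callable] = field(default_factory=dict)  # tool name -> callable
    route: Callable[[str], str] = lambda task: "default"      # task -> sub-agent name
    format_task: Callable[[str], str] = lambda task: task     # task -> model input

# The meta-agent's edit surface is exactly these fields: it can rewrite the
# prompt, add or remove tools, or swap out the routing function.
h = Harness(system_prompt="You are a spreadsheet-editing agent.")
h.tools["read_cell"] = lambda ref: f"value at {ref}"
print(h.route("fix the formula in B2"))
```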

The Architecture: Two Agents, One File, One Directive

The GitHub repo has a deliberately simple structure. agent.py is the entire harness under test in a single file: it contains config, tool definitions, the agent registry, routing/orchestration, and the Harbor adapter boundary. The adapter section is explicitly marked as fixed; the rest is the primary edit surface for the meta-agent. program.md contains instructions for the meta-agent plus the directive (what kind of agent to build), and it is the only file the human edits.

Think of it as a separation of concerns between human and machine. The human sets the direction inside program.md. The meta-agent (a separate, higher-level AI) then reads that directive, inspects agent.py, runs the benchmark, diagnoses what failed, rewrites the relevant parts of agent.py, and repeats. The human never touches agent.py directly.

A critical piece of infrastructure that keeps the loop coherent across iterations is results.tsv, an experiment log automatically created and maintained by the meta-agent. It tracks every experiment run, giving the meta-agent a history to learn from and calibrate what to try next. The full project structure also includes Dockerfile.base, an optional .agent/ directory for reusable agent workspace artifacts like prompts and skills, a tasks/ folder for benchmark payloads (added per benchmark branch), and a jobs/ directory for Harbor job outputs.
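The role of results.tsv can be illustrated in a few lines of Python. The column names and file location here are invented for the sketch; the article doesn’t specify the log’s actual schema.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical experiment log, one row per run (a temp path for the demo).
LOG = Path(tempfile.mkdtemp()) / "results.tsv"

def record_run(experiment: str, change_summary: str, score: float) -> None:
    """Append one experiment to the TSV log, writing a header on first use."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        if new_file:
            writer.writerow(["experiment", "change", "score"])
        writer.writerow([experiment, change_summary, f"{score:.3f}"])

def best_so_far() -> float:
    """What the meta-agent reads back to calibrate its next proposal."""
    with LOG.open(newline="") as f:
        rows = list(csv.reader(f, delimiter="\t"))[1:]  # skip header row
    return max(float(r[2]) for r in rows) if rows else 0.0

record_run("exp-001", "baseline prompt", 0.412)
record_run("exp-002", "added spreadsheet tool", 0.655)
print(best_so_far())
```

Keeping the log as plain TSV means both the meta-agent and the human can inspect the experiment history with ordinary tools.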

The metric is the total score produced by the benchmark’s task test suites. The meta-agent hill-climbs on this score: every experiment produces a numeric result, kept if better and discarded if not, the same loop as autoresearch.

The Task Format and Harbor Integration

Benchmarks are expressed as tasks in Harbor format. Every task lives under tasks/my-task/ and includes a task.toml for config like timeouts and metadata; an instruction.md, which is the prompt sent to the agent; a tests/ directory with a test.sh entry point that writes a score to /logs/reward.txt; and a test.py for verification using either deterministic checks or LLM-as-judge. An environment/Dockerfile defines the task container, and a files/ directory holds reference files mounted into the container. Tests write a score between 0.0 and 1.0 to the verifier logs, and the meta-agent hill-climbs on that score.
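Laid out on disk, a Harbor-style task directory would look roughly like this. The tree only arranges the files named above; the exact schema should be checked against the Harbor documentation.

```text
tasks/my-task/
├── task.toml              # config: timeouts, metadata
├── instruction.md         # the prompt sent to the agent
├── environment/
│   └── Dockerfile         # defines the task container
├── files/                 # reference files mounted into the container
└── tests/
    ├── test.sh            # entry point; writes a score to /logs/reward.txt
    └── test.py            # deterministic checks or LLM-as-judge
```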

The LLM-as-judge pattern here is worth flagging: instead of only checking answers deterministically (like unit tests), the test suite can use another LLM to judge whether the agent’s output is ‘correct enough.’ This is common in agentic benchmarks where correct answers aren’t reducible to string matching.
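A hedged sketch of a verifier that tries a deterministic check first and falls back to a judge. `call_judge_llm` is a placeholder for whatever judge model a task wires in; nothing here is Harbor’s actual API.

```python
def score_output(expected: str, actual: str, call_judge_llm=None) -> float:
    """Return a score in [0.0, 1.0]: exact match first, LLM judge as fallback."""
    if actual.strip() == expected.strip():
        return 1.0  # deterministic pass, no judge needed
    if call_judge_llm is None:
        return 0.0  # no judge available, deterministic fail
    verdict = call_judge_llm(
        f"Expected:\n{expected}\n\nActual:\n{actual}\n\n"
        "Is the actual output correct enough? Answer YES or NO."
    )
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0

# A stub judge standing in for a real LLM call: it accepts any answer
# whose "Actual" section mentions the expected total.
def stub_judge(prompt: str) -> str:
    return "YES" if "42" in prompt.split("Actual:")[1] else "NO"

print(score_output("42", "The total is 42.", call_judge_llm=stub_judge))
```

The point of the pattern is exactly this split: cheap exact checks where they suffice, a semantic judgment only when string matching would wrongly fail a correct answer.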

    Key Takeaways

    • Autonomous harness engineering works: AutoAgent shows that a meta-agent can replace the human prompt-tuning loop entirely, iterating on agent.py overnight without any human touching the harness files directly.
    • Benchmark results validate the approach: In a 24-hour run, AutoAgent hit #1 on SpreadsheetBench (96.5%) and the top GPT-5 score on TerminalBench (55.1%), beating every other entry that was hand-engineered by humans.
    • ‘Model empathy’ may be a real phenomenon: A Claude meta-agent optimizing a Claude task agent appeared to diagnose failures more accurately than when optimizing a GPT-based agent, suggesting that same-family model pairing may matter when designing your AutoAgent loop.
    • The human’s job shifts from engineer to director: You don’t write or edit agent.py. You write program.md, a plain Markdown directive that steers the meta-agent. The distinction mirrors the broader shift in agentic engineering from writing code to setting goals.
    • It’s plug-and-play with any benchmark: Because tasks follow Harbor’s open format and agents run in Docker containers, AutoAgent is domain-agnostic. Any scorable task (spreadsheets, terminal commands, or your own custom domain) can become a target for autonomous self-optimization.

Check out the Repo and Tweet.
