    Poetiq’s Meta-System Automatically Builds a Model-Agnostic Harness That Improved Every LLM Tested on LiveCodeBench Pro Without Fine-Tuning

    By Naveed Ahmad · 15/05/2026 · 6 min read


    Poetiq has just published some very interesting results showing that its Meta-System reached a new state of the art on LiveCodeBench Pro (LCB Pro), a competitive coding benchmark, by automatically building and optimizing its own inference harness, without fine-tuning any underlying model or accessing model internals.

    The result: GPT 5.5 High with Poetiq’s harness scores 93.9% on LCB Pro (25Q2), up from its baseline of 89.6%. Gemini 3.1 Pro, the model the harness was specifically optimized on, jumps from 78.6% to 90.9%, surpassing Google’s own Gemini 3 Deep Think (88.8%), a model that isn’t even accessible via API for external verification.

    https://poetiq.ai/posts/recursive_self_improvement_coding/

    What Is LiveCodeBench Pro?

    Before getting into the mechanics, it helps to understand why the benchmark matters. LiveCodeBench Pro (LCB Pro) is designed to test AI coding ability in a way that resists two common failure modes in benchmarks: data contamination and overfitting.

    LCB Pro pulls problems from major competitive programming competitions and withholds public ground-truth code. Instead, solutions are validated against a comprehensive testing framework. Correct output alone isn’t enough: solutions must also satisfy specific memory and runtime constraints. The benchmark is also subject to continuous updates, which distinguishes it from many standard benchmarks that become stale.
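    The article doesn’t publish the benchmark’s judging code, but a minimal sketch of this style of validation (correct output plus time and memory limits) might look like the following. Limits, paths, and the two-function split are illustrative assumptions, not LCB Pro’s actual implementation; `resource` makes this POSIX-only.

```python
import resource
import subprocess

def run_with_limits(binary, stdin_text, time_limit_s=2.0, mem_limit_mb=256):
    """Run a compiled solution under illustrative time/memory limits.
    Returns (stdout, ok). POSIX-only because of resource limits."""
    def set_limits():
        mem = mem_limit_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (mem, mem))  # cap address space
    try:
        proc = subprocess.run(
            [binary], input=stdin_text, capture_output=True, text=True,
            timeout=time_limit_s, preexec_fn=set_limits,
        )
        return proc.stdout, proc.returncode == 0
    except subprocess.TimeoutExpired:
        return "", False  # exceeding the time limit is a failure

def judge(binary, test_cases):
    """A solution passes only if every case is correct *and* within limits."""
    for stdin_text, expected in test_cases:
        out, ok = run_with_limits(binary, stdin_text)
        if not ok or out.strip() != expected.strip():
            return False
    return True
```

    The key point the benchmark design encodes is that a correct but slow or memory-hungry solution still fails, which is what pushes models toward performant procedural logic.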

    The benchmark focuses on C++ challenges and emphasizes creative coding, testing a model’s capacity for complex problem-solving and high-quality, performant procedural logic. This distinguishes it from datasets like SWE-Bench that evaluate tool usage or bug-fixing workflows. Problems are categorized by difficulty (Easy, Medium, and Hard) based on competitive human solve rates.


    Poetiq’s Strategic Framing: Three LLM Task Categories

    This is Poetiq’s third publicly reported benchmark, and the choice of LCB Pro was deliberate. The research team frames LLM performance around three distinct task categories: reasoning challenges (ARC-AGI is their benchmark here), retrieval challenges (Humanity’s Last Exam, or HLE), and coding challenges, which, as the most pervasive commercial application of AI today, meld reasoning and retrieval with the generation of specialized procedural logic.

    Their coding initiative had three specific, stated goals: first, demonstrate that an intelligent harness can improve efficacy without fine-tuning or special model access; second, validate the Meta-System’s capacity for recursive self-improvement in creating that harness automatically; and third, show that the resulting harness is model-agnostic and can be applied to any model without modification. According to their results, all three were satisfied.

    What Is a Harness, and Why Does It Matter?

    In this context, a harness refers to the infrastructure wrapped around a language model to handle a specific task. Think of it as an orchestration layer: it controls how the model is prompted, how outputs are structured, how answers are assembled across multiple calls, and how solutions are evaluated.

    Traditionally, these harnesses are hand-built by engineers. Poetiq’s claim is that its Meta-System builds and optimizes these harnesses automatically, through recursive self-improvement. Internally, the Meta-System works by creating better strategies for determining what to ask, refining sequential chains of questions, and devising new methods for assembling the answers. The system continually incorporates learnings from previous and current tasks and datasets to create new, customized task-specific harnesses, as well as agents and orchestrators for other task types.

    How Was the Harness Built?

    Poetiq’s Meta-System was given the LCB Pro task and built a harness from scratch using only Gemini 3.1 Pro as the base model. The Meta-System accounted for all three dimensions LCB Pro tests: accuracy, runtime, and memory constraints. The system built on insights from its earlier work on ARC-AGI and HLE when designing the harness. No fine-tuning of the underlying model was performed, and no access to internal model activations was required; only standard API access was used.

    Once the harness was built and optimized for Gemini 3.1 Pro, it was applied to a broad set of other models from different providers and generations, both open-weights and proprietary, without any additional optimization. Every model tested improved.
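    The model-agnostic claim follows from the harness only touching models through a standard text API, so the identical wrapper can be pointed at any provider. A hedged sketch of that evaluation loop, with an assumed `benchmark.score` interface and model identifiers paraphrased from the article:

```python
# Models named in the article; identifiers here are illustrative, not real API IDs.
MODELS = ["gemini-3.1-pro", "gpt-5.5-high", "gemini-3.0-flash", "kimi-k2.6"]

def evaluate_all(harness, benchmark, models=MODELS):
    """Score every model with the *identical* harness: no per-model tuning.
    `harness(model, problem)` returns a candidate solution; `benchmark.score`
    is an assumed interface that runs it over the problem set."""
    return {m: benchmark.score(lambda problem: harness(m, problem))
            for m in models}
```

    The point of the design is that swapping the model is a one-string change; any improvement over baseline is then attributable to the harness itself.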

    The Numbers

    The benchmark results across difficulty tiers are worth looking at in detail. On Hard problems, the category where gaps between models are largest, Gemini 3.1 Pro with Poetiq’s harness scores 58.3%, up from its 7.7% baseline. GPT 5.5 High with the harness reaches 75.0% on Hard, up from 50.0%. Across the Easy and Medium categories, the harness also outperforms all base models.

    Some of the smaller-model results are also notable. Gemini 3.0 Flash improves by 10 percentage points, going from 72.3% to 82.3%, overtaking Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.2 High, all larger and more expensive models. This mirrors a pattern Poetiq previously observed on ARC-AGI, where their optimization allowed a smaller, more economical model to surpass a bigger one. Kimi K2.6 sees the biggest jump: from 50.0% to 79.9%, a roughly 30-percentage-point improvement. Nemotron 3 Super 120B improves by 12.8 percentage points.

    Accuracy numbers are reported directly from the LCB Pro leaderboard at livecodebenchpro.com (25Q2). For models not featured on the leaderboard, Poetiq performed its own evaluations, cross-validating its experimental setup by replicating official leaderboard accuracies for baseline models.
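    For readers keeping score, the percentage-point gains quoted above fall directly out of the reported baseline and harness numbers (overall LCB Pro 25Q2 accuracy):

```python
# (baseline %, with-harness %) pairs as reported in the article.
scores = {
    "GPT 5.5 High":     (89.6, 93.9),
    "Gemini 3.1 Pro":   (78.6, 90.9),
    "Gemini 3.0 Flash": (72.3, 82.3),
    "Kimi K2.6":        (50.0, 79.9),
}
# Percentage-point improvement per model.
deltas = {m: round(h - b, 1) for m, (b, h) in scores.items()}
```

    These deltas (4.3, 12.3, 10.0, and 29.9 points respectively) match the figures quoted in the article and in the takeaways below.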

    Key Takeaways

    • Poetiq’s Meta-System automatically builds task-specific harnesses through recursive self-improvement, with no model fine-tuning or internal model access
    • GPT 5.5 High with the harness reaches 93.9% on LCB Pro (25Q2), up 4.3 percentage points from its 89.6% baseline; Gemini 3.1 Pro jumps 12.3 points (78.6% → 90.9%)
    • The harness is model-agnostic: optimized using only Gemini 3.1 Pro, it improved every other model tested, open-weights and proprietary alike, without modification
    • Gemini 3.0 Flash gains 10 percentage points with the harness (72.3% → 82.3%), surpassing Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.2 High despite being smaller and cheaper
    • Kimi K2.6 shows the biggest gain at ~30 percentage points (50.0% → 79.9%); Nemotron 3 Super 120B improves by 12.8 points

    Check out the technical details here.





    Source link

    Naveed Ahmad

    Naveed Ahmad is a technology journalist and AI writer at ArticlesStock, covering artificial intelligence, machine learning, and emerging tech policy. Read his latest articles.
