OpenAI Releases GPT-5.5, a Fully Retrained Agentic Model That Scores 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval

OpenAI has launched GPT-5.5, its most capable model to date and the first fully retrained base model since GPT-4.5. GPT-5.5 is designed to complete complex, multi-step computer tasks with minimal human direction. Think of it as the difference between an assistant who needs a checklist and one who understands the underlying goal and figures out the steps themselves. The release is rolling out today to Plus, Pro, Business, and Enterprise subscribers across ChatGPT and Codex.

What ‘Agentic’ Actually Means Here

An agentic model doesn’t just respond to a single prompt: it takes a sequence of actions, uses tools (like browsing the web, writing code, running scripts, or operating software), checks its own work, and keeps going until the task is done. Prior models often stalled at handoff points, requiring the user to re-prompt or correct course. GPT-5.5 is built to reduce these interruptions.

OpenAI launched GPT-5.5 as a model targeted at agentic computer use: it writes and debugs code, browses the web, fills out spreadsheets, and keeps working through multi-step tasks without requiring a human to supervise every move.
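The act, observe, check, and continue cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OpenAI's implementation: `Step`, `AgentRun`, `run_agent`, the two toy tools, and the scripted policy that stands in for the model are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One decision by the policy: call a tool, or finish with an answer."""
    tool: Optional[str] = None
    argument: str = ""
    final_answer: Optional[str] = None


@dataclass
class AgentRun:
    history: List[str] = field(default_factory=list)
    answer: Optional[str] = None


def run_agent(
    task: str,
    choose_next_step: Callable[[str, List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> AgentRun:
    """Loop until the policy returns a final answer or the step budget runs out."""
    run = AgentRun()
    for _ in range(max_steps):
        step = choose_next_step(task, run.history)
        if step.final_answer is not None:
            run.answer = step.final_answer
            return run
        if step.tool not in tools:
            # Feed the mistake back as an observation instead of stopping.
            run.history.append(f"error: unknown tool {step.tool!r}")
            continue
        try:
            observation = tools[step.tool](step.argument)
        except Exception as exc:
            observation = f"error: {exc}"
        run.history.append(f"{step.tool}({step.argument!r}) -> {observation}")
    return run


# Hypothetical toy tools: two independent ways to count words.
TOOLS: Dict[str, Callable[[str], str]] = {
    "count_words": lambda text: str(len(text.split())),
    "count_spaces": lambda text: str(text.count(" ") + 1),
}


def scripted_policy(task: str, history: List[str]) -> Step:
    """Stand-in for a model: act, cross-check with a second tool, then finish."""
    if len(history) == 0:
        return Step(tool="count_words", argument=task)
    if len(history) == 1:
        return Step(tool="count_spaces", argument=task)
    first = history[0].rsplit("-> ", 1)[1]
    second = history[1].rsplit("-> ", 1)[1]
    if first == second:
        return Step(final_answer=first)
    return Step(final_answer="results disagree")


result = run_agent("fix the failing test", scripted_policy, TOOLS)
print(result.answer, len(result.history))
```

The step budget (`max_steps`) and the error-as-observation handling are design choices made for this sketch; a real system would replace `scripted_policy` with a model call and the toy tools with a browser, shell, or editor.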

The Four Domains Where Gains Are Concentrated

The gains are concentrated in four areas: agentic coding, computer use, knowledge work, and early scientific research. OpenAI describes these as domains ‘where progress depends on reasoning across context and taking action over time.’

For software engineers, the most immediately relevant benchmark is SWE-Bench Pro, which evaluates real-world GitHub issue resolution across four programming languages. GPT-5.5 resolves 58.6% of tasks end-to-end in a single pass. Worth noting: Claude Opus 4.7 scores higher at 64.3% on this same benchmark, though OpenAI has noted that Anthropic reported signs of memorization on a subset of those problems, which may affect the comparison.

For long-horizon coding specifically, OpenAI also reports results on Expert-SWE, an internal benchmark measuring tasks with a median estimated human completion time of 20 hours. GPT-5.5 outperforms GPT-5.4 on Expert-SWE. This benchmark is significant because it reflects the kind of extended, multi-session engineering work (large refactors, feature builds, debugging deep in a codebase) that agentic tools are increasingly being asked to handle autonomously.

Developers who tested the system early said GPT-5.5 has a better understanding of the “shape” of a software system, and can better understand why something is failing, where the fix is needed, and what else in the codebase would be affected.

https://openai.com/index/introducing-gpt-5-5/

For ML engineers and data scientists who spend significant time in terminal environments orchestrating pipelines and debugging scripts, the Terminal-Bench 2.0 results are the most compelling signal. GPT-5.5 scores 82.7% on Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination, beating Claude Opus 4.7 at 69.4% and Gemini 3.1 Pro at 68.5%. That is a lead of more than 13 percentage points over the nearest competitor, not a marginal one.

For broader knowledge work, GPT-5.5 scores 84.9% on GDPval, which tests agents across 44 knowledge-work occupations. On OSWorld-Verified, a benchmark measuring whether a model can autonomously operate real computer environments, it reaches 78.7%.

GPT-5.5 also ships with a Pro variant built for harder tasks that demand higher accuracy. On BrowseComp, which tests a model’s ability to track down hard-to-find information across the web, GPT-5.5 Pro scores 90.1%, ahead of Gemini 3.1 Pro at 85.9%. The model is also the top-ranked system on the Artificial Analysis Intelligence Index.
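To put the head-to-head numbers in one place, the short sketch below computes each lead in percentage points from the scores quoted in this article. The `REPORTED` table and the `lead_over_best_rival` helper are illustrative names introduced here; only benchmarks for which the article gives a competitor score are included.

```python
# Scores in percent, exactly as quoted in the article.
REPORTED = {
    "SWE-Bench Pro": {"GPT-5.5": 58.6, "Claude Opus 4.7": 64.3},
    "Terminal-Bench 2.0": {
        "GPT-5.5": 82.7,
        "Claude Opus 4.7": 69.4,
        "Gemini 3.1 Pro": 68.5,
    },
    "BrowseComp": {"GPT-5.5 Pro": 90.1, "Gemini 3.1 Pro": 85.9},
}


def lead_over_best_rival(scores: dict, model: str) -> float:
    """Percentage-point gap between `model` and the best other entry.

    A negative value means the model trails the best rival.
    """
    best_rival = max(value for name, value in scores.items() if name != model)
    return round(scores[model] - best_rival, 1)


for benchmark, scores in REPORTED.items():
    # Pick whichever OpenAI entry the article reports for this benchmark.
    model = next(name for name in scores if name.startswith("GPT-5.5"))
    print(f"{benchmark}: {model} {lead_over_best_rival(scores, model):+.1f} pts")
```

The gaps are differences in percentage points, not relative improvements, and they compare vendor-reported figures that may have been measured under different setups.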


Speed and Token Efficiency

One concern with more capable models is that they tend to be slower or more expensive to run. OpenAI addressed this directly. GPT-5.5 matches GPT-5.4’s per-token latency in real-world serving while performing better across nearly every evaluation measured. It also uses significantly fewer tokens to complete the same Codex tasks, meaning shorter, more efficient runs even on complex agentic workflows.

On pricing, the standard GPT-5.5 API will be charged at $5 per million input tokens and $30 per million output tokens. For context, GPT-5.4 was priced at $2.50 per million input tokens and $15 per million output tokens, so the per-token price has doubled. OpenAI staff argued that token efficiency gains offset the cost, since GPT-5.5 completes the same Codex tasks with fewer tokens, meaning cheaper runs overall even at the higher per-token rate. GPT-5.5 Pro, the higher-accuracy variant, is priced at $30 per million input tokens and $180 per million output tokens in the API.

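For teams running Codex at scale, the net math is what matters: if GPT-5.5 completes a task in materially fewer tokens than GPT-5.4, the effective cost per completed workflow can still come out lower despite the higher rate. Because both the input and output rates exactly doubled, the break-even point at a fixed input/output mix is a 50% reduction in tokens; smaller savings leave the workflow more expensive than before.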
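The net math can be checked with a few lines of arithmetic. The per-million-token rates below are the ones quoted in this article; the token counts for the example workflow are hypothetical and chosen only to illustrate the calculation, and `Pricing` and `workflow_cost` are names introduced for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """API rates in US dollars per million tokens."""
    input_per_million: float
    output_per_million: float


# Rates as quoted in the article.
GPT_5_4 = Pricing(input_per_million=2.50, output_per_million=15.00)
GPT_5_5 = Pricing(input_per_million=5.00, output_per_million=30.00)


def workflow_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one completed workflow at the given rates."""
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


# Hypothetical workflow: 400k input tokens and 60k output tokens on GPT-5.4.
BASE_INPUT, BASE_OUTPUT = 400_000, 60_000
baseline = workflow_cost(GPT_5_4, BASE_INPUT, BASE_OUTPUT)

# Same workflow on GPT-5.5 under different assumed token savings.
for saving in (0.0, 0.3, 0.5, 0.6):
    remaining = 1.0 - saving
    cost = workflow_cost(
        GPT_5_5,
        round(BASE_INPUT * remaining),
        round(BASE_OUTPUT * remaining),
    )
    verdict = "cheaper" if cost < baseline else "not cheaper"
    print(f"{saving:.0%} fewer tokens: ${cost:.2f} vs ${baseline:.2f} ({verdict})")
```

In this example the GPT-5.4 run costs $1.90. On GPT-5.5 the same token count costs $3.80, a 50% saving lands back at $1.90, and only savings beyond that make the run cheaper. The article does not state how large the actual token reduction is, so the savings levels here are assumptions.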

Scale and Adoption

OpenAI has seen a surge in Codex usage, with about 4 million developers using the tool weekly. That scale matters for understanding the deployment context: GPT-5.5 is not a research preview but a production model being pushed to a large, active developer base immediately on launch.

Key Takeaways

GPT-5.5 is OpenAI’s first fully retrained base model since GPT-4.5, designed specifically for agentic workflows: it can understand complex goals, use tools, check its own work, and carry multi-step tasks through to completion with minimal human direction.
The biggest performance gains are in agentic coding, computer use, knowledge work, and early scientific research: GPT-5.5 scores 82.7% on Terminal-Bench 2.0, 84.9% on GDPval, and 78.7% on OSWorld-Verified, outperforming both Claude Opus 4.7 and Gemini 3.1 Pro on several key benchmarks.
GPT-5.5 matches GPT-5.4’s per-token latency while being more capable across nearly every benchmark. It also uses significantly fewer tokens to complete the same Codex tasks, meaning better results without a proportional increase in latency or cost per completed workflow.
API pricing increases to $5/M input tokens and $30/M output tokens (up from $2.50 and $15 for GPT-5.4), with GPT-5.5 Pro priced at $30/M input and $180/M output. OpenAI staff argue that token efficiency gains offset the higher per-token rate for most workloads.
GPT-5.5 is rolling out today to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, with roughly 4 million developers already using Codex weekly.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
