OpenAI has launched GPT-5.1-Codex-Max, a frontier agentic coding model designed for long-running software engineering tasks that span millions of tokens and multi-hour sessions. It is available today inside Codex in the CLI, IDE extension, cloud integration and code review surfaces, with API access planned soon.
What is GPT-5.1-Codex-Max optimized for?
GPT-5.1-Codex-Max is built on an update to OpenAI's foundational reasoning model. This base model is trained on agentic tasks across software engineering, math, research and other domains. On top of this, GPT-5.1-Codex-Max is trained on real-world software engineering workloads such as PR creation, code review, frontend coding and Q&A.
The model targets frontier coding evaluations rather than general chat. GPT-5.1-Codex-Max and the broader Codex family are recommended only for agentic coding tasks in Codex or Codex-like environments, not as a drop-in replacement for GPT-5.1 in general-purpose conversations.
GPT-5.1-Codex-Max is also the first Codex model trained to operate in Windows environments. Its training includes tasks that make it a better collaborator in the Codex CLI, including improved behavior when running commands and handling files under the Codex sandbox.
Compaction and long-running tasks
A core feature of GPT-5.1-Codex-Max is compaction. The model still runs within a fixed context window, but it is natively trained to work across multiple context windows by pruning its interaction history while preserving the most important information over long horizons.
In Codex applications, GPT-5.1-Codex-Max automatically compacts its session when it approaches the context window limit. It creates a fresh context window that retains the essential state of the task, then continues execution. This process repeats until the task completes.
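Conceptually, the control flow resembles the sketch below. This is a minimal illustration under stated assumptions, not Codex's implementation: agent_step, summarize and CONTEXT_LIMIT are hypothetical stand-ins, since the actual pruning behavior is trained into the model itself and orchestrated by the Codex harness.

```python
# Minimal conceptual sketch of a compaction loop. All names here
# (agent_step, summarize, CONTEXT_LIMIT) are illustrative assumptions,
# not Codex internals; only the control flow mirrors the description above.

CONTEXT_LIMIT = 200_000   # assumed context window size, in tokens
COMPACT_AT = 0.9          # compact when history nears the limit


def count_tokens(history: list[str]) -> int:
    """Crude stand-in for a real tokenizer (~4 characters per token)."""
    return sum(len(msg) for msg in history) // 4


def summarize(history: list[str]) -> str:
    """Placeholder for compaction: distill the essential task state
    (goal, key decisions, failing tests, remaining TODOs) into one message."""
    return f"[compacted summary of {len(history)} prior messages]"


def agent_step(history: list[str]) -> str:
    """Placeholder for one model step (edit code, run tests, read output)."""
    return "ran tests, fixed one failure"


def run_task(task: str, max_steps: int = 10_000) -> list[str]:
    history = [task]
    for _ in range(max_steps):
        if count_tokens(history) >= COMPACT_AT * CONTEXT_LIMIT:
            # Start a fresh context window seeded only with essential state,
            # then keep executing; this can repeat many times per task.
            history = [summarize(history)]
        history.append(agent_step(history))
        if "DONE" in history[-1]:   # task-completion signal from the model
            break
    return history
```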
OpenAI reports internal evaluations where GPT-5.1-Codex-Max works independently for more than 24 hours on a single task. During these runs, the model iterates on its implementation, fixes failing tests and eventually produces a successful result.
Reasoning effort, speed and token efficiency
GPT-5.1-Codex-Max uses the same reasoning effort control introduced with GPT-5.1, but tuned for coding agents. Reasoning effort selects how many thinking tokens the model uses before committing to an answer.
On SWE-bench Verified, GPT-5.1-Codex-Max at medium reasoning effort achieves higher accuracy than GPT-5.1-Codex at the same effort while using 30% fewer thinking tokens. For non-latency-sensitive tasks, OpenAI introduces a new Extra High reasoning effort, written as xhigh, that lets the model think longer to reach better answers. Medium remains the recommended setting for most workloads.
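Since API access is planned rather than shipped, the snippet below is only a sketch of what selecting reasoning effort might look like. It follows the reasoning-effort pattern of OpenAI's existing Responses API; the model name and the xhigh value are assumptions based on this announcement.

```python
# Hypothetical sketch: selecting reasoning effort once API access lands.
# The reasoning={"effort": ...} parameter exists in OpenAI's Responses API
# for reasoning models today; "gpt-5.1-codex-max" is not yet API-available
# and "xhigh" as an effort value is an assumption from this announcement.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.1-codex-max",       # assumed future model name
    reasoning={"effort": "medium"},  # recommended default; "xhigh" for the
                                     # hardest, non-latency-sensitive tasks
    input="Refactor the flaky retry logic in worker.py and add tests.",
)
print(response.output_text)
```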
These changes show up in benchmark results. With GPT-5.1-Codex evaluated at high reasoning effort and GPT-5.1-Codex-Max at xhigh, OpenAI reports the following scores on the 500 problems of SWE-bench Verified: 73.7% for GPT-5.1-Codex and 77.9% for GPT-5.1-Codex-Max. On SWE-Lancer IC SWE, the scores are 66.3% and 79.9%. On Terminal-Bench 2.0, the scores are 52.8% and 58.1%. All evaluations run with compaction enabled, and Terminal-Bench 2.0 uses the Codex CLI inside the Laude Institute Harbor harness.
In qualitative tests, GPT-5.1-Codex-Max generates high-quality frontend designs with comparable functionality and visual quality to GPT-5.1-Codex, but at lower overall token cost thanks to more efficient reasoning traces.
Key Takeaways
- GPT-5.1-Codex-Max is a frontier agentic coding model built on an updated reasoning base, further trained on real software engineering tasks such as PR creation, code review, frontend coding and Q&A, and is available today across the Codex CLI, IDE, cloud and code review surfaces, with API access coming later.
- The model introduces native support for long-running work through compaction, where it repeatedly compresses its own history to span multiple context windows, enabling autonomous coding sessions that can continue for more than 24 hours over millions of tokens while staying on a single task.
- GPT-5.1-Codex-Max retains the reasoning effort control from GPT-5.1, and at medium effort it outperforms GPT-5.1-Codex on SWE-bench Verified while using about 30% fewer thinking tokens, with an Extra High (xhigh) reasoning mode for the hardest tasks.
- On frontier coding benchmarks with compaction enabled, GPT-5.1-Codex-Max at xhigh effort improves SWE-bench Verified from 73.7% to 77.9%, SWE-Lancer IC SWE from 66.3% to 79.9%, and Terminal-Bench 2.0 from 52.8% to 58.1%, compared with GPT-5.1-Codex at high effort.
GPT-5.1-Codex-Max is a clear statement that OpenAI is doubling down on long-running, agentic coding rather than quick, single-shot edits. Compaction, frontier coding evaluations like SWE-bench Verified and SWE-Lancer IC SWE, and explicit reasoning effort controls make this model a test case for scaling test-time compute in real software engineering workflows, not just benchmarks. The Preparedness Framework and Codex sandbox will be important as this capability moves into production pipelines. Overall, GPT-5.1-Codex-Max is a frontier agentic coding model that operationalizes long-horizon reasoning in practical developer tools.
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.