Anthropic has never published a technical paper on Claude Mythos. That has not stopped the research community from theorizing. A new open-source project called OpenMythos, released on GitHub by Kye Gomez, attempts something ambitious: a first-principles theoretical reconstruction of what the Claude Mythos architecture might actually be, built entirely in PyTorch and grounded in peer-reviewed research.
The project is not a leaked model, a fine-tune, or a distillation. It is a hypothesis rendered in code, and the hypothesis is specific enough to be falsifiable, which is what makes it interesting.
The Main Claim: Claude Mythos Is a Recurrent-Depth Transformer
OpenMythos proposes that Claude Mythos belongs to a class of architectures called Recurrent-Depth Transformers (RDTs), also referred to in the literature as Looped Transformers. The concept is meaningfully different from standard transformer stacks.
In a conventional transformer (GPT, LLaMA, Mistral), the model passes input through a sequence of unique layers, one after another, each with its own independent weights. More capability usually means more layers and more parameters. In a Recurrent-Depth Transformer, a fixed set of weights is applied iteratively across T loop steps within a single forward pass. The same weights run multiple times. Reasoning depth is not a function of how many parameters are stored, but of how many iterations are run at inference time.
Think of it less like reading a book and more like refining a draft: the model returns to the same computational block again and again, improving its internal representation with each pass.
How the Architecture Is Structured
OpenMythos instantiates this as a three-part structure: Prelude → Recurrent Block → Coda. The Prelude and Coda are standard transformer layers that run exactly once. The Recurrent Block is the computational core, looped up to T=16 times.
At each loop step t, the hidden state is updated using the following rule:
hₜ₊₁ = A·hₜ + B·e + Transformer(hₜ, e)
Here, hₜ is the hidden state after loop iteration t, and e is the encoded input from the Prelude, re-injected at every step. The re-injection is deliberate: without it, the hidden state would drift away from the original input signal across deep loops. The learned matrices A and B govern how much of the previous hidden state and the encoded input carry forward at each step.
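Under stated assumptions (all module and function names here are illustrative, not taken from the OpenMythos source, and the transformer core is conditioned on e by simple addition for brevity), the update rule can be sketched in PyTorch as:

```python
import torch
import torch.nn as nn

class RecurrentBlockSketch(nn.Module):
    """Illustrative sketch of one recurrent-depth update:
    h_{t+1} = A·h_t + B·e + Transformer(h_t, e)."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.A = nn.Linear(d_model, d_model, bias=False)  # learned carry-over of h_t
        self.B = nn.Linear(d_model, d_model, bias=False)  # learned re-injection of e
        # Stand-in for the looped transformer core (a single encoder layer here).
        self.core = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # Conditioning the core on (h, e) via addition; one of several plausible choices.
        return self.A(h) + self.B(e) + self.core(h + e)

def recurrent_depth_forward(block: nn.Module, e: torch.Tensor, T: int) -> torch.Tensor:
    """Apply the SAME block T times; depth is an inference-time choice."""
    h = torch.zeros_like(e)  # initial hidden state
    for _ in range(T):
        h = block(h, e)      # e re-injected at every step
    return h

block = RecurrentBlockSketch(d_model=32)
e = torch.randn(2, 5, 32)              # (batch, seq, d_model) from the Prelude
out = recurrent_depth_forward(block, e, T=4)
```

Because the loop count T is an argument rather than a property of the weights, running more steps at inference time deepens the computation without changing any stored parameters.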
The FFN inside the Recurrent Block is not a standard feedforward layer. OpenMythos replaces it with a Mixture-of-Experts (MoE) layer following the design introduced in DeepSeekMoE: a large pool of fine-grained routed experts, with only a sparse top-K subset activated per token, alongside a small set of always-active shared experts that absorb common cross-domain patterns. Crucially, the router selects distinct expert subsets at each loop depth, meaning each iteration is computationally distinct despite sharing the same base weights. MoE provides domain breadth; looping provides reasoning depth.
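A minimal sketch of the idea, assuming a learned depth embedding biases the router (the class and parameter names are hypothetical; production MoE layers dispatch tokens sparsely rather than looping over experts densely as done here for clarity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAwareMoESketch(nn.Module):
    """Illustrative DeepSeekMoE-style layer: many small routed experts plus
    always-active shared experts, with routing conditioned on loop depth."""
    def __init__(self, d_model=32, n_routed=8, n_shared=2, top_k=2, max_depth=16):
        super().__init__()
        self.top_k = top_k
        ffn = lambda: nn.Sequential(nn.Linear(d_model, 2 * d_model), nn.GELU(),
                                    nn.Linear(2 * d_model, d_model))
        self.routed = nn.ModuleList(ffn() for _ in range(n_routed))
        self.shared = nn.ModuleList(ffn() for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_routed)
        # A per-depth bias lets different loop iterations pick different
        # expert subsets despite sharing the same base weights.
        self.depth_embed = nn.Embedding(max_depth, n_routed)

    def forward(self, x: torch.Tensor, depth: int) -> torch.Tensor:
        logits = self.router(x) + self.depth_embed.weight[depth]
        weights, idx = torch.topk(F.softmax(logits, dim=-1), self.top_k, dim=-1)
        out = sum(expert(x) for expert in self.shared)  # shared experts always run
        # Dense loop over routed experts for readability only.
        for k in range(self.top_k):
            for e_id, expert in enumerate(self.routed):
                mask = (idx[..., k] == e_id).unsqueeze(-1)
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out

moe = DepthAwareMoESketch()
x = torch.randn(2, 5, 32)
y0, y3 = moe(x, depth=0), moe(x, depth=3)  # same weights, different routing bias
```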
Attention defaults to Multi-head Latent Attention (MLA) from DeepSeek-V2, which caches a compressed low-rank KV latent rather than full key/value tensors, yielding a 10–20× reduction in KV memory at production scale.
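The core of that memory saving can be sketched as follows (a deliberately simplified illustration of the caching idea only; real MLA also handles rotary-embedding decoupling and per-head projections, and these class names are not from any actual implementation):

```python
import torch
import torch.nn as nn

class LatentKVCacheSketch(nn.Module):
    """Simplified Multi-head Latent Attention idea: store one small latent
    c = down(x) per token, and reconstruct K and V from it on the fly."""
    def __init__(self, d_model=64, d_latent=8):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)  # compress
        self.up_k = nn.Linear(d_latent, d_model, bias=False)  # reconstruct keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)  # reconstruct values

    def cache(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(x)  # (B, S, d_latent): the ONLY thing stored per token

    def expand(self, c: torch.Tensor):
        return self.up_k(c), self.up_v(c)

mla = LatentKVCacheSketch()
x = torch.randn(2, 128, 64)        # (batch, seq, d_model)
c = mla.cache(x)                   # compressed latent goes in the KV cache
k, v = mla.expand(c)               # full-size K/V rebuilt when attending

full_cache = 2 * x.numel()         # elements if full K and V were cached
latent_cache = c.numel()           # elements with the latent cache
```

With these toy dimensions (d_model=64, d_latent=8) the latent cache is 16× smaller than storing K and V outright, in the same ballpark as the 10–20× figure cited above.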
Reasoning in Continuous Latent Space
One of the most important properties of this architecture is that reasoning occurs entirely in continuous latent space. There is no intermediate token emission between loop steps: the model does not produce text mid-thought and then re-read it. This is structurally distinct from chain-of-thought prompting, where reasoning is externalized as token sequences, and has been formally analyzed in both Saunshi et al. (2025) and COCONUT (2024).
Saunshi et al. (2025) formally show that each loop iteration in an RDT is functionally equivalent to one step of chain-of-thought, but operating over real-valued vectors rather than discrete tokens. Continuous latent thoughts can also encode multiple alternative next steps simultaneously, enabling something closer to breadth-first search over the reasoning space within a single forward pass.
This also explains a concrete capability advantage. A standard transformer trained on 5-hop reasoning chains fails when tested on 10-hop chains at inference time; it has no mechanism to extend its depth beyond what it saw during training. A Recurrent-Depth Transformer handles this naturally: running more inference-time loops extends the reasoning chain without any retraining. Harder problems receive more compute; simpler ones exit early.
Solving the Stability Problem
Training looped models has historically been brittle. The hidden state hₜ can grow unboundedly across iterations, a failure mode known as residual explosion. OpenMythos addresses this using a Linear Time-Invariant (LTI) injection constraint borrowed from the Parcae architecture (Prairie et al., 2026): the spectral radius of A, denoted ρ(A), is enforced to be less than 1 by construction, guaranteeing stability regardless of learning rate or gradient noise.
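One simple way to enforce ρ(A) < 1 by construction (an assumption for illustration, not necessarily the exact scheme Parcae or OpenMythos uses) is to rescale A by its spectral norm, since the spectral radius is always bounded by the largest singular value:

```python
import torch

def contractive_matrix(W: torch.Tensor, gamma: float = 0.95) -> torch.Tensor:
    """Rescale W so its spectral norm is gamma < 1. Because
    rho(A) <= ||A||_2, this guarantees rho(A) < 1 by construction."""
    sigma_max = torch.linalg.matrix_norm(W, ord=2)  # largest singular value
    return gamma * W / sigma_max

W = torch.randn(16, 16)            # unconstrained raw parameter
A = contractive_matrix(W)          # constrained matrix used in the update rule
rho = torch.linalg.eigvals(A).abs().max().item()  # spectral radius of A

# With rho(A) < 1, the homogeneous part of h_{t+1} = A·h_t + ... decays
# instead of exploding, no matter how many loop iterations are run:
h = torch.randn(16)
for _ in range(100):
    h = A @ h                      # 100 iterations shrink h toward zero
```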
A second failure mode exists at the other extreme: beyond a certain loop depth, excessive recurrence degrades predictions, as the hidden state drifts past the solution and into noise. This is the 'overthinking' problem. Adaptive Computation Time (ACT) halting addresses it with a learned scalar per position that dynamically decides when to stop looping. Positions that are harder to process receive more computation; tokens that have already converged halt early.
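A minimal sketch of ACT-style halting, assuming the accumulate-and-threshold variant (the class name and threshold are illustrative; the toy step function stands in for the real recurrent block):

```python
import torch
import torch.nn as nn

class ACTHaltingSketch(nn.Module):
    """Illustrative Adaptive Computation Time halting: a learned per-position
    scalar accumulates across loop steps, and a position stops looping once
    its cumulative halting probability crosses a threshold."""
    def __init__(self, d_model=32, threshold=0.99):
        super().__init__()
        self.halt = nn.Linear(d_model, 1)  # learned halting scalar per position
        self.threshold = threshold

    def forward(self, step_fn, h: torch.Tensor, max_steps: int = 16):
        B, S, _ = h.shape
        cum_p = torch.zeros(B, S)                      # cumulative halt probability
        active = torch.ones(B, S, dtype=torch.bool)    # positions still looping
        steps_used = torch.zeros(B, S, dtype=torch.long)
        for _ in range(max_steps):
            if not active.any():
                break                                  # everything has converged
            h_new = step_fn(h)
            # Only still-active positions are updated; halted ones are frozen.
            h = torch.where(active.unsqueeze(-1), h_new, h)
            cum_p = cum_p + active * torch.sigmoid(self.halt(h)).squeeze(-1)
            steps_used = steps_used + active.long()
            active = active & (cum_p < self.threshold)
        return h, steps_used

act = ACTHaltingSketch()
h = torch.randn(2, 5, 32)
out, steps = act(lambda x: 0.9 * x, h)  # toy contraction stands in for the block
```

Different positions in the same batch can halt at different loop depths, which is exactly the per-position compute allocation the article describes.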
Finally, Depth-Wise LoRA adapters introduce a small rank-r adaptation matrix at each iteration depth, giving each loop step slightly distinct behavior without adding substantial parameters, bridging the gap between pure weight-tying and fully distinct layers.
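The idea can be sketched as one shared linear layer plus a tiny rank-r correction indexed by loop depth (names and initialization here are illustrative, following the common LoRA convention of zero-initializing one factor so training starts from the tied weights):

```python
import torch
import torch.nn as nn

class DepthWiseLoRASketch(nn.Module):
    """Illustrative depth-wise LoRA: one shared weight plus a small rank-r
    correction (B_t @ A_t) selected by loop depth t, so each iteration
    behaves slightly differently without duplicating the full layer."""
    def __init__(self, d_model=32, rank=4, max_depth=16):
        super().__init__()
        self.shared = nn.Linear(d_model, d_model)  # tied base weights
        self.lora_a = nn.Parameter(torch.randn(max_depth, rank, d_model) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(max_depth, d_model, rank))

    def forward(self, x: torch.Tensor, depth: int) -> torch.Tensor:
        delta = self.lora_b[depth] @ self.lora_a[depth]  # rank-r, per-depth
        return self.shared(x) + x @ delta.T

lora = DepthWiseLoRASketch()
x = torch.randn(2, 5, 32)
y = lora(x, depth=3)

# Parameter comparison: fully distinct per-depth layers would need
# max_depth * d_model^2 weights; the adapters need far fewer.
full_layers = 16 * 32 * 32
adapters = 16 * 2 * 4 * 32
```

Because lora_b starts at zero, every depth initially behaves like the shared layer, and each loop step's adapter then learns its own small deviation.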
Why Parameter Efficiency Matters
The Parcae paper (Prairie et al., 2026) provides empirical grounding for the efficiency claim. At 770M parameters, an RDT matches a 1.3B standard transformer trained on identical data: roughly half the parameters for equal downstream quality. Optimal recurrence and optimal token count both follow power laws with consistent exponents across scales, establishing the first predictable scaling laws for looped training.
The implication is significant: reasoning depth scales with inference-time compute, not stored parameter count. This reframes one of the dominant assumptions in the scaling debate. The relevant axis is not parameter count at training, but loop depth at inference.
What OpenMythos Contributes
OpenMythos provides four concrete research artifacts: a fully configurable PyTorch implementation of the RDT hypothesis with MoE FFN and Multi-head Latent Attention; LTI-stable recurrent injection integrated as a first-class training primitive; depth-wise LoRA adapters enabling per-iteration behavioral differentiation; and a reproducible research baseline for studying looped-transformer dynamics and inference-time reasoning depth.
Whether or not Mythos is actually an RDT, OpenMythos gives the research community something concrete and runnable: an implementation of an architecture class the literature increasingly suggests is underexplored, and one that may represent a fundamentally different path to capable AI than simply training bigger models.
