    A Coding Tutorial on OpenMythos: Recurrent-Depth Transformers with Depth Extrapolation, Adaptive Computation, and Mixture-of-Experts Routing

    By Naveed Ahmad · 24/04/2026 · 7 Mins Read


    In this tutorial, we explore the implementation of OpenMythos, a theoretical reconstruction of the Claude Mythos architecture that enables deeper reasoning through iterative computation rather than increased parameter count. We build and analyze models using both GQA and MLA attention mechanisms, examine memory efficiency through KV-cache comparisons, and validate stability via the spectral properties of the recurrent update. We then train the model on a structured parity task and study how increasing loop depth at inference improves performance without retraining. Along the way, we also explore adaptive computation via ACT halting and track expert usage in the MoE layers, providing a comprehensive, hands-on understanding of this emerging architecture.

    import subprocess, sys
    try:
        import open_mythos  # noqa: F401
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q",
                               "open-mythos"])


    import math, time, copy
    from collections import Counter, defaultdict


    import numpy as np
    import torch, torch.nn as nn, torch.nn.functional as F
    import matplotlib.pyplot as plt


    from open_mythos.main import (
        OpenMythos, MythosConfig,
        ACTHalting, MoEFFN,
    )


    torch.manual_seed(0); np.random.seed(0)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"▸ device = {device}   |   torch = {torch.__version__}")


    def make_config(attn_type: str, *, dim=128, n_heads=4, n_experts=4,
                    max_loops=8, seq_len=128, vocab=256):
        base = dict(
            vocab_size=vocab, dim=dim, n_heads=n_heads,
            max_seq_len=seq_len, max_loop_iters=max_loops,
            prelude_layers=1, coda_layers=1,
            n_experts=n_experts, n_shared_experts=1,
            n_experts_per_tok=2, expert_dim=dim // 2,
            lora_rank=8, attn_type=attn_type,
        )
        if attn_type == "gqa":
            return MythosConfig(**base, n_kv_heads=2)
        return MythosConfig(
            **base, n_kv_heads=n_heads,
            kv_lora_rank=32, q_lora_rank=64,
            qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,
        )


    cfg_gqa = make_config("gqa")
    cfg_mla = make_config("mla")
    m_gqa = OpenMythos(cfg_gqa).to(device)
    m_mla = OpenMythos(cfg_mla).to(device)


    print("\n─── Part 1 ─ model sizes ──────────────────────────────")
    print(f"GQA  params : {sum(p.numel() for p in m_gqa.parameters()):>10,}")
    print(f"MLA  params : {sum(p.numel() for p in m_mla.parameters()):>10,}")

    We install and import all required dependencies and initialize our environment for running OpenMythos. We construct configurations for both the GQA and MLA attention mechanisms and instantiate their respective models. We also compare their parameter counts to understand how the architectural differences affect model scale.
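    As rough intuition for why fewer KV heads shrink the cache (a standalone sketch with hypothetical sizes, not values from the library): a per-layer cache stores K and V tensors of shape [seq_len, n_kv_heads, head_dim], so halving n_kv_heads halves the footprint.

```python
# Back-of-the-envelope KV-cache sizing (illustrative only):
# per layer, the cache holds K and V of shape [seq_len, n_kv_heads, head_dim].
def kv_cache_kb(seq_len, n_kv_heads, head_dim, dtype_bytes=4, n_layers=1):
    return 2 * seq_len * n_kv_heads * head_dim * dtype_bytes * n_layers / 1024

full = kv_cache_kb(seq_len=64, n_kv_heads=4, head_dim=32)  # MHA-style: 4 KV heads
gqa  = kv_cache_kb(seq_len=64, n_kv_heads=2, head_dim=32)  # GQA: 2 shared KV heads
print(full, gqa)  # 64.0 32.0 — GQA halves the cache by sharing KV heads
```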

    def cache_bytes(kv: dict) -> int:
        total = 0
        for entry in kv.values():
            for t in entry.values():
                total += t.element_size() * t.numel()
        return total


    x = torch.randint(0, 256, (1, 64), device=device)
    ck_gqa, ck_mla = {}, {}
    with torch.no_grad():
        m_gqa(x, n_loops=4, kv_cache=ck_gqa)
        m_mla(x, n_loops=4, kv_cache=ck_mla)


    gqa_kb = cache_bytes(ck_gqa) / 1024
    mla_kb = cache_bytes(ck_mla) / 1024
    print("\n─── Part 2 ─ KV-cache footprint (1×64 tokens, 4 loops) ─")
    print(f"GQA cache : {gqa_kb:6.2f} KB   ({len(ck_gqa)} layer-keys)")
    print(f"MLA cache : {mla_kb:6.2f} KB   ({len(ck_mla)} layer-keys)")
    print(f"ratio      : MLA is ≈{gqa_kb / max(mla_kb, 1e-9):.2f}× smaller")


    def show_stability(model, tag):
        A = model.recurrent.injection.get_A()
        print(f"{tag:3s}  ρ(A): min={A.min():.4f}  max={A.max():.4f}  "
              f"mean={A.mean():.4f}  stable={bool((A < 1).all() and (A > 0).all())}")


    print("\n─── Part 3 ─ spectral radius at init ──────────────────")
    show_stability(m_gqa, "GQA")
    show_stability(m_mla, "MLA")


    opt = torch.optim.Adam(m_mla.parameters(), lr=1.0)
    for _ in range(30):
        loss = m_mla(torch.randint(0, 256, (2, 16), device=device),
                     n_loops=2).square().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    show_stability(m_mla, "MLA after abusive training (lr=1.0, 30 steps)")

    We compute and compare the KV-cache memory footprint for the GQA and MLA attention types across forward passes. We then check the stability of the recurrent component by analyzing the spectral radius of the matrix A. We further stress-test the model under extreme training conditions to confirm that stability is preserved.
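    To build intuition for the stability check, here is a minimal numpy sketch (independent of the library) of why a gate A with entries strictly inside (0, 1) makes the recurrent update h ← A·h + (1−A)·x a contraction toward x, no matter how large the state starts.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Diagonal injection gate with entries strictly inside (0, 1):
# h_{t+1} = A * h_t + (1 - A) * x is a contraction because max(A) < 1.
A = rng.uniform(0.1, 0.9, size=dim)
x = rng.normal(size=dim)

h = rng.normal(size=dim) * 100.0   # deliberately huge initial state
for _ in range(300):
    h = A * h + (1.0 - A) * x

# The error shrinks by a factor of at most max(A) per loop, so the state
# converges to the fixed point h* = x instead of blowing up.
print(np.allclose(h, x, atol=1e-6))  # True
```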

    VOCAB = 64
    SEQ_LEN = 24


    def make_batch(batch=64, seq_len=SEQ_LEN):
        x = torch.randint(1, 3, (batch, seq_len), device=device)
        bits = x - 1
        parity = bits.cumsum(dim=1) % 2
        y = parity + 1
        return x, y


    cfg = MythosConfig(
        vocab_size=VOCAB, dim=64, n_heads=4, n_kv_heads=2,
        max_seq_len=SEQ_LEN + 4, max_loop_iters=16,
        prelude_layers=1, coda_layers=1,
        n_experts=4, n_shared_experts=1, n_experts_per_tok=2,
        expert_dim=32, lora_rank=4, attn_type="gqa",
        act_threshold=0.99,
    )
    model = OpenMythos(cfg).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    T_TRAIN = 3


    print("\n─── Part 5 ─ training (T_train = 3) ───────────────────")
    print(f"params: {sum(p.numel() for p in model.parameters()):,}")
    losses = []
    t0 = time.time()
    for step in range(600):
        x, y = make_batch(64)
        logits = model(x, n_loops=T_TRAIN)
        loss = F.cross_entropy(logits.reshape(-1, VOCAB), y.reshape(-1))
        opt.zero_grad(); loss.backward()
        opt.step()
        losses.append(loss.item())
        if step % 100 == 0 or step == 599:
            with torch.no_grad():
                acc = (logits.argmax(-1) == y).float().mean().item()
            print(f"step {step:3d}   loss={loss.item():.4f}   acc@T3={acc:.3f}")
    print(f"training wallclock: {time.time() - t0:.1f}s")

    We define a cumulative parity task to train our model on a structured sequential problem. We initialize the OpenMythos model with a fixed loop depth and train it using cross-entropy loss. Throughout training, we monitor loss and accuracy to evaluate how well the model learns under constrained depth.
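    The labels produced by make_batch can be traced by hand on a short sequence; here is a standalone numpy sketch of the same cumulative-parity rule, for clarity.

```python
import numpy as np

# Tokens are drawn from {1, 2}; the bit value is token - 1.
x = np.array([1, 2, 1, 1, 2, 2, 1, 2])
bits = x - 1                   # [0, 1, 0, 0, 1, 1, 0, 1]
parity = np.cumsum(bits) % 2   # running parity of the bits seen so far
y = parity + 1                 # shift back into token space {1, 2}
print(y.tolist())              # [1, 2, 2, 2, 1, 2, 2, 1]
```

    Each target token at position i therefore depends on every input up to i, which is what makes loop depth matter: deeper iteration gives the model more steps to propagate the running parity.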

    model.eval()
    T_sweep = [1, 2, 3, 4, 6, 8, 10, 12, 14, 16]
    accs = []
    with torch.no_grad():
        x_eval, y_eval = make_batch(512)
        for T in T_sweep:
            logits = model(x_eval, n_loops=T)
            accs.append((logits.argmax(-1) == y_eval).float().mean().item())


    print("\n─── Part 6 ─ depth extrapolation (T_train=3) ──────────")
    for T, a in zip(T_sweep, accs):
        bar = "█" * int(a * 40)
        marker = "  ← trained here" if T == T_TRAIN else ""
        print(f"T={T:2d}  acc={a:.3f}  {bar}{marker}")


    halt_trace: list[torch.Tensor] = []
    orig_halt = model.recurrent.act.forward


    def halt_hook(self, h):
        p = orig_halt(h)
        halt_trace.append(p.detach().cpu())
        return p
    model.recurrent.act.forward = halt_hook.__get__(model.recurrent.act, ACTHalting)


    with torch.no_grad():
        x_h, _ = make_batch(1)
        _ = model(x_h, n_loops=16)


    model.recurrent.act.forward = orig_halt


    halts = torch.stack(halt_trace, dim=0)[:, 0].numpy()
    print("\n─── Part 7 ─ ACT halting matrix (loops × positions) ───")
    print(f"shape: {halts.shape}  |  "
          f"mean halt-prob per loop: "
          f"{', '.join(f'{v:.2f}' for v in halts.mean(1))}")

    We evaluate the trained model by varying the number of inference loops to study depth extrapolation. We observe how increasing loop depth improves accuracy without retraining the model. We also instrument the ACT mechanism to capture halting probabilities at each sequence position and iteration.
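    Conceptually, the halting rule we instrument is Graves-style ACT: each position accumulates a halt probability per loop and stops once the total crosses act_threshold. A toy sketch with made-up per-loop probabilities (the real ACTHalting internals may differ):

```python
# Graves-style ACT halting sketch (assumed mechanism, hypothetical numbers):
# a position stops looping once its cumulative halt probability crosses
# the threshold, so "easy" positions spend fewer loops than "hard" ones.
threshold = 0.99
halt_probs = [0.2, 0.3, 0.35, 0.4, 0.5]  # hypothetical per-loop halt outputs

cum, loops_used = 0.0, 0
for p in halt_probs:
    loops_used += 1
    cum += p
    if cum >= threshold:
        break

print(loops_used)  # 4 — halts on loop 4, since 0.2+0.3+0.35 < 0.99 <= +0.4
```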

    expert_hits = Counter()
    orig_moe = model.recurrent.block.ffn.forward


    def moe_hook(self, x):
        flat = x.view(-1, x.shape[-1])
        logits = self.router(flat) + self.router_bias
        scores = F.softmax(logits, dim=-1)
        _, idx = scores.topk(self.topk, dim=-1)
        for e in idx.flatten().tolist():
            expert_hits[e] += 1
        return orig_moe(x)


    model.recurrent.block.ffn.forward = moe_hook.__get__(
        model.recurrent.block.ffn, MoEFFN)


    with torch.no_grad():
        x_m, _ = make_batch(32)
        _ = model(x_m, n_loops=T_TRAIN)


    model.recurrent.block.ffn.forward = orig_moe


    print("\n─── Part 8 ─ MoE expert usage ───────────────────")
    total = sum(expert_hits.values())
    for eid in range(cfg.n_experts):
        share = expert_hits.get(eid, 0) / max(total, 1)
        print(f"expert {eid}: {share*100:5.2f}% of topk slots")


    prompt = torch.tensor([[1, 2, 1, 1, 2, 2, 1, 2]], device=device)
    print("\n─── Part 9 ─ generation ───────────────────────────────")
    print(f"prompt (parity pattern): {prompt.tolist()[0]}")
    for T_gen in [1, 4, 12]:
        with torch.no_grad():
            out = model.generate(prompt, max_new_tokens=8,
                                 n_loops=T_gen, temperature=0.1, top_k=2)
        print(f"T_gen={T_gen:2d}  → {out.tolist()[0]}")


    fig, axes = plt.subplots(1, 3, figsize=(15, 4))


    axes[0].plot(losses)
    axes[0].set_title("Training loss (parity task)")
    axes[0].set_xlabel("step"); axes[0].set_ylabel("cross-entropy")
    axes[0].grid(alpha=0.3)


    axes[1].plot(T_sweep, accs, "o-", linewidth=2, markersize=8)
    axes[1].axvline(T_TRAIN, color="red", linestyle="--",
                    label=f"T_train = {T_TRAIN}")
    axes[1].set_title("Depth extrapolation: accuracy vs inference loops")
    axes[1].set_xlabel("n_loops at inference"); axes[1].set_ylabel("accuracy")
    axes[1].legend(); axes[1].grid(alpha=0.3); axes[1].set_ylim(0, 1.05)


    im = axes[2].imshow(halts, aspect="auto", cmap="viridis",
                        vmin=0, vmax=halts.max())
    axes[2].set_title("ACT halting probability\n(loop t × position)")
    axes[2].set_xlabel("position"); axes[2].set_ylabel("loop iteration t")
    plt.colorbar(im, ax=axes[2], fraction=0.046, pad=0.04)


    plt.tight_layout()
    plt.savefig("openmythos_tutorial.png", dpi=120, bbox_inches="tight")
    plt.show()

    We analyze expert usage in the MoE layer by tracking how tokens are routed across experts. We then generate sequences at different loop depths to observe their effects on outputs. Finally, we visualize training loss, depth-extrapolation performance, and ACT halting behavior through plots.
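    The routing statistic collected above can be reproduced in isolation. Here is a minimal numpy sketch of softmax top-k routing with hypothetical router logits (the library's router may differ, e.g. in its bias handling):

```python
import numpy as np

# Top-k MoE routing sketch (hypothetical logits, not the library's router):
# each token's router logits are softmaxed and its top-2 experts get a slot.
rng = np.random.default_rng(0)
n_tokens, n_experts, topk = 1000, 4, 2

logits = rng.normal(size=(n_tokens, n_experts))
scores = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
top_idx = np.argsort(scores, axis=1)[:, -topk:]   # top-2 experts per token

hits = np.bincount(top_idx.ravel(), minlength=n_experts)
print(hits.sum() == n_tokens * topk)              # True — each token fills 2 slots
print((hits / hits.sum()).round(2))               # per-expert share of routed slots
```

    With random logits the shares come out roughly uniform; a trained router that collapses onto one or two experts would show up here (and in Part 8) as a heavily skewed distribution.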

    In conclusion, we demonstrated that OpenMythos effectively leverages looped computation to achieve depth extrapolation, enabling the model to improve accuracy simply by increasing the number of inference-time loops. We observed that the recurrent mechanism remains stable even under extreme training conditions, and that MLA attention significantly reduces KV-cache memory usage compared to GQA. We also saw how ACT enables dynamic computation across sequence positions and how MoE routing distributes the workload across experts. Overall, we established that this architecture offers a compelling path for compute-adaptive reasoning, where we trade extra inference compute for better performance without modifying the model's parameters.



    Naveed Ahmad is a technology journalist and AI writer at ArticlesStock, covering artificial intelligence, machine learning, and emerging tech policy.
