In this tutorial, we explore the implementation of OpenMythos, a theoretical reconstruction of the Claude Mythos architecture that enables deeper reasoning via iterative computation rather than increased parameter count. We build and analyze models using both GQA and MLA attention mechanisms, examine memory efficiency through KV-cache comparisons, and validate stability via the spectral properties of the recurrent update. We then train the model on a structured parity task and study how increasing loop depth at inference improves performance without retraining. Along the way, we also explore adaptive computation via ACT halting and track expert usage in the MoE layers, building a comprehensive, hands-on understanding of this emerging architecture.
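Before diving in, the core idea can be sketched in a few lines of plain PyTorch: one weight-tied update applied repeatedly, so that effective depth comes from the loop count rather than from stacking new parameter-bearing layers. This toy recurrence is illustrative only and is not the OpenMythos implementation:

```python
import torch

torch.manual_seed(0)
dim = 8
W = torch.randn(dim, dim) * 0.1   # one shared weight matrix, reused every loop
x = torch.randn(1, dim)           # "injected" input
h0 = torch.zeros(1, dim)          # initial recurrent state

def loop(h, x, n_loops):
    # Apply the same update n_loops times; parameters never grow with depth.
    for _ in range(n_loops):
        h = torch.tanh(h @ W + x)
    return h

h4 = loop(h0, x, 4)   # "shallow" inference
h8 = loop(h0, x, 8)   # "deep" inference with the exact same weights
print(h4.shape, h8.shape)
```

Running the same block for more iterations is the knob the tutorial later turns at inference time.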
import subprocess, sys
try:
    import open_mythos  # noqa: F401
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q",
                           "open-mythos"])
import math, time, copy
from collections import Counter, defaultdict
import numpy as np
import torch, torch.nn as nn, torch.nn.functional as F
import matplotlib.pyplot as plt
from open_mythos.main import (
    OpenMythos, MythosConfig,
    ACTHalting, MoEFFN,
)
torch.manual_seed(0); np.random.seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"▸ device = {device} | torch = {torch.__version__}")
def make_config(attn_type: str, *, dim=128, n_heads=4, n_experts=4,
                max_loops=8, seq_len=128, vocab=256):
    base = dict(
        vocab_size=vocab, dim=dim, n_heads=n_heads,
        max_seq_len=seq_len, max_loop_iters=max_loops,
        prelude_layers=1, coda_layers=1,
        n_experts=n_experts, n_shared_experts=1,
        n_experts_per_tok=2, expert_dim=dim // 2,
        lora_rank=8, attn_type=attn_type,
    )
    if attn_type == "gqa":
        return MythosConfig(**base, n_kv_heads=2)
    return MythosConfig(
        **base, n_kv_heads=n_heads,
        kv_lora_rank=32, q_lora_rank=64,
        qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,
    )

cfg_gqa = make_config("gqa")
cfg_mla = make_config("mla")
m_gqa = OpenMythos(cfg_gqa).to(device)
m_mla = OpenMythos(cfg_mla).to(device)
print("\n─── Part 1 ─ model sizes ──────────────────────────────")
print(f"GQA params : {sum(p.numel() for p in m_gqa.parameters()):>10,}")
print(f"MLA params : {sum(p.numel() for p in m_mla.parameters()):>10,}")
We install and import all required dependencies and initialize our environment for running OpenMythos. We construct configurations for both GQA and MLA attention and instantiate their respective models. We also compare their parameter counts to see how the architectural differences affect model scale.
def cache_bytes(kv: dict) -> int:
    total = 0
    for entry in kv.values():
        for t in entry.values():
            total += t.element_size() * t.numel()
    return total

x = torch.randint(0, 256, (1, 64), device=device)
ck_gqa, ck_mla = {}, {}
with torch.no_grad():
    m_gqa(x, n_loops=4, kv_cache=ck_gqa)
    m_mla(x, n_loops=4, kv_cache=ck_mla)
gqa_kb = cache_bytes(ck_gqa) / 1024
mla_kb = cache_bytes(ck_mla) / 1024
print("\n─── Part 2 ─ KV-cache footprint (1×64 tokens, 4 loops) ─")
print(f"GQA cache : {gqa_kb:6.2f} KB ({len(ck_gqa)} layer-keys)")
print(f"MLA cache : {mla_kb:6.2f} KB ({len(ck_mla)} layer-keys)")
print(f"ratio : MLA is ≈{gqa_kb / max(mla_kb, 1e-9):.2f}× smaller")
def show_stability(model, tag):
    A = model.recurrent.injection.get_A()
    print(f"{tag:3s} ρ(A): min={A.min():.4f} max={A.max():.4f} "
          f"mean={A.mean():.4f} stable={bool((A < 1).all() and (A > 0).all())}")

print("\n─── Part 3 ─ spectral radius at init ──────────────────")
show_stability(m_gqa, "GQA")
show_stability(m_mla, "MLA")

opt = torch.optim.Adam(m_mla.parameters(), lr=1.0)
for _ in range(30):
    loss = m_mla(torch.randint(0, 256, (2, 16), device=device),
                 n_loops=2).square().mean()
    opt.zero_grad(); loss.backward(); opt.step()
show_stability(m_mla, "MLA after abusive training (lr=1.0, 30 steps)")
We compute and compare the KV-cache memory footprint of the GQA and MLA attention variants across forward passes. We then check the stability of the recurrent component by analyzing the spectral radius of the injection matrix A. Finally, we stress-test the model under extreme training conditions to confirm that stability is preserved.
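To build intuition for why the MLA cache is smaller, here is a rough back-of-envelope estimate under the config above. It assumes the cache stores full K and V heads for GQA, and a compressed latent plus the RoPE component for MLA; the library's exact layout may differ:

```python
# Per token, per layer, counted in floats (fp32), using the tutorial's config.
dim, n_heads, n_kv_heads = 128, 4, 2
head_dim = dim // n_heads                      # 32
kv_lora_rank, qk_rope_head_dim = 32, 16

gqa_floats = 2 * n_kv_heads * head_dim         # separate K and V per kv-head
mla_floats = kv_lora_rank + qk_rope_head_dim   # shared latent + decoupled RoPE key

print(f"GQA: {gqa_floats} floats/token/layer")   # 128
print(f"MLA: {mla_floats} floats/token/layer")   # 48
print(f"ratio ≈ {gqa_floats / mla_floats:.2f}×")
```

The exact ratio printed by the tutorial depends on how the library packs its cache entries, but the direction of the saving follows from this arithmetic.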
VOCAB = 64
SEQ_LEN = 24

def make_batch(batch=64, seq_len=SEQ_LEN):
    x = torch.randint(1, 3, (batch, seq_len), device=device)
    bits = x - 1
    parity = bits.cumsum(dim=1) % 2
    y = parity + 1
    return x, y

cfg = MythosConfig(
    vocab_size=VOCAB, dim=64, n_heads=4, n_kv_heads=2,
    max_seq_len=SEQ_LEN + 4, max_loop_iters=16,
    prelude_layers=1, coda_layers=1,
    n_experts=4, n_shared_experts=1, n_experts_per_tok=2,
    expert_dim=32, lora_rank=4, attn_type="gqa",
    act_threshold=0.99,
)
model = OpenMythos(cfg).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
T_TRAIN = 3
print("\n─── Part 5 ─ training (T_train = 3) ───────────────────")
print(f"params: {sum(p.numel() for p in model.parameters()):,}")
losses = []
t0 = time.time()
for step in range(600):
    x, y = make_batch(64)
    logits = model(x, n_loops=T_TRAIN)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), y.reshape(-1))
    opt.zero_grad(); loss.backward()
    opt.step()
    losses.append(loss.item())
    if step % 100 == 0 or step == 599:
        with torch.no_grad():
            acc = (logits.argmax(-1) == y).float().mean().item()
        print(f"step {step:3d} loss={loss.item():.4f} acc@T3={acc:.3f}")
print(f"training wallclock: {time.time() - t0:.1f}s")
We define a cumulative parity task to train our model on a structured sequential problem. We initialize the OpenMythos model with a fixed loop depth and train it with cross-entropy loss. Throughout training, we track loss and accuracy to evaluate how well the model learns under constrained depth.
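The label scheme is easy to sanity-check in isolation. This snippet mirrors the `make_batch` logic above on a hand-written sequence: tokens are 1/2, bits are 0/1, and the target at each position is the running parity of all bits so far, shifted back into the 1/2 token range:

```python
import torch

x = torch.tensor([[1, 2, 1, 1, 2, 2, 1, 2]])
bits = x - 1                         # [0, 1, 0, 0, 1, 1, 0, 1]
y = bits.cumsum(dim=1) % 2 + 1       # running parity, mapped back to {1, 2}
print(y.tolist()[0])                 # → [1, 2, 2, 2, 1, 2, 2, 1]
```

Because each target depends on every earlier token, the task rewards iterative refinement and makes loop depth directly measurable.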
model.eval()
T_sweep = [1, 2, 3, 4, 6, 8, 10, 12, 14, 16]
accs = []
with torch.no_grad():
    x_eval, y_eval = make_batch(512)
    for T in T_sweep:
        logits = model(x_eval, n_loops=T)
        accs.append((logits.argmax(-1) == y_eval).float().mean().item())
print("\n─── Part 6 ─ depth extrapolation (T_train=3) ──────────")
for T, a in zip(T_sweep, accs):
    bar = "█" * int(a * 40)
    marker = " ← trained here" if T == T_TRAIN else ""
    print(f"T={T:2d} acc={a:.3f} {bar}{marker}")

halt_trace: list[torch.Tensor] = []
orig_halt = model.recurrent.act.forward
def halt_hook(self, h):
    p = orig_halt(h)
    halt_trace.append(p.detach().cpu())
    return p
model.recurrent.act.forward = halt_hook.__get__(model.recurrent.act, ACTHalting)
with torch.no_grad():
    x_h, _ = make_batch(1)
    _ = model(x_h, n_loops=16)
model.recurrent.act.forward = orig_halt
halts = torch.stack(halt_trace, dim=0)[:, 0].numpy()
print("\n─── Part 7 ─ ACT halting matrix (loops × positions) ───")
print(f"shape: {halts.shape} | "
      f"mean halt-prob per loop: "
      f"{', '.join(f'{v:.2f}' for v in halts.mean(1))}")
We evaluate the trained model by varying the number of inference loops to study depth extrapolation. We observe how increasing loop depth improves accuracy without retraining the model. We also instrument the ACT mechanism to capture halting probabilities at each sequence position and iteration.
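For readers unfamiliar with ACT, the halting rule can be illustrated standalone. This is a schematic of Graves-style adaptive computation time, not the OpenMythos internals: each loop emits a halt probability for a position, and the position stops once the running sum crosses the threshold:

```python
import torch

threshold = 0.99                                        # matches act_threshold above
halt_probs = torch.tensor([0.2, 0.5, 0.4, 0.3, 0.6])    # one position, per loop
cum = torch.cumsum(halt_probs, dim=0)                   # running halting mass
halted_at = int((cum >= threshold).nonzero()[0]) + 1    # first loop that crosses
print(f"halts after loop {halted_at} "
      f"(cumulative={cum[halted_at - 1].item():.2f})")
```

Positions that accumulate halting mass quickly get shallow computation; harder positions keep looping, which is what the Part 7 heatmap visualizes.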
expert_hits = Counter()
orig_moe = model.recurrent.block.ffn.forward
def moe_hook(self, x):
    flat = x.view(-1, x.shape[-1])
    logits = self.router(flat) + self.router_bias
    scores = F.softmax(logits, dim=-1)
    _, idx = scores.topk(self.topk, dim=-1)
    for e in idx.flatten().tolist():
        expert_hits[e] += 1
    return orig_moe(x)
model.recurrent.block.ffn.forward = moe_hook.__get__(
    model.recurrent.block.ffn, MoEFFN)
with torch.no_grad():
    x_m, _ = make_batch(32)
    _ = model(x_m, n_loops=T_TRAIN)
model.recurrent.block.ffn.forward = orig_moe
print("\n─── Part 8 ─ MoE expert usage ───────────────────")
total = sum(expert_hits.values())
for eid in range(cfg.n_experts):
    share = expert_hits.get(eid, 0) / max(total, 1)
    print(f"expert {eid}: {share*100:5.2f}% of topk slots")
prompt = torch.tensor([[1, 2, 1, 1, 2, 2, 1, 2]], device=device)
print("\n─── Part 9 ─ generation ───────────────────────────────")
print(f"prompt (parity pattern): {prompt.tolist()[0]}")
for T_gen in [1, 4, 12]:
    with torch.no_grad():
        out = model.generate(prompt, max_new_tokens=8,
                             n_loops=T_gen, temperature=0.1, top_k=2)
    print(f"T_gen={T_gen:2d} → {out.tolist()[0]}")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].plot(losses)
axes[0].set_title("Training loss (parity task)")
axes[0].set_xlabel("step"); axes[0].set_ylabel("cross-entropy")
axes[0].grid(alpha=0.3)
axes[1].plot(T_sweep, accs, "o-", linewidth=2, markersize=8)
axes[1].axvline(T_TRAIN, color="red", linestyle="--",
                label=f"T_train = {T_TRAIN}")
axes[1].set_title("Depth extrapolation: accuracy vs inference loops")
axes[1].set_xlabel("n_loops at inference"); axes[1].set_ylabel("accuracy")
axes[1].legend(); axes[1].grid(alpha=0.3); axes[1].set_ylim(0, 1.05)
im = axes[2].imshow(halts, aspect="auto", cmap="viridis",
                    vmin=0, vmax=halts.max())
axes[2].set_title("ACT halting probability\n(loop t × position)")
axes[2].set_xlabel("position"); axes[2].set_ylabel("loop iteration t")
plt.colorbar(im, ax=axes[2], fraction=0.046, pad=0.04)
plt.tight_layout()
plt.savefig("openmythos_tutorial.png", dpi=120, bbox_inches="tight")
plt.show()
We analyze expert usage in the MoE layer by tracking how tokens are routed across experts. We then generate sequences at different loop depths to observe their effect on the outputs. Finally, we visualize training loss, depth-extrapolation performance, and ACT halting behavior through plots.
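The routing logic probed by the hook above can also be sketched standalone, here with a freshly initialized router rather than the trained model's weights (and without the shared-expert and bias details of the real `MoEFFN`):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_experts, topk, dim = 4, 2, 16
router = torch.nn.Linear(dim, n_experts)   # random init, stand-in for the real router
tokens = torch.randn(8, dim)               # 8 flattened token embeddings

scores = F.softmax(router(tokens), dim=-1)     # routing distribution per token
weights, idx = scores.topk(topk, dim=-1)       # each token selects its top-2 experts
print(idx.shape)                               # one row of expert ids per token
```

Counting how often each id appears in `idx` across a batch is exactly what the Part 8 hook does; a roughly uniform count indicates the load is balanced across experts.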
In conclusion, we demonstrated that OpenMythos effectively leverages looped computation to achieve depth extrapolation, enabling the model to improve accuracy simply by increasing the number of inference-time loops. We observed that the recurrent mechanism remains stable even under extreme training conditions, and that MLA attention significantly reduces KV-cache memory usage compared to GQA. We also saw how ACT enables dynamic computation across sequence positions and how MoE routing distributes the workload across experts. Overall, we established that this architecture offers a compelling path toward compute-adaptive reasoning, where we trade extra inference compute for better performance without modifying the model's parameters.
