**Revolutionizing Language Models: Introducing Engram, a Conditional Memory Axis**
Researchers at DeepSeek.ai have unveiled Engram, a conditional memory module for sparse Large Language Models (LLMs). Engram rethinks how LLMs store and retrieve memorized patterns, opening a new axis for scaling sparse models alongside Mixture-of-Experts (MoE) routing.
**How Engram Enhances DeepSeek Transformers**
The models use the DeepSeek V3 tokenizer and are pre-trained on 262 billion tokens. The backbone is a 30-block Transformer with a hidden size of 2560 and Multi-head Latent Attention. Engram plugs into this framework as a sparse embedding module built from hashed N-gram tables, multi-head hashing into prime-sized buckets, and a context-aware gating scalar that decides how strongly the retrieved memory is injected at each position.
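To make the moving parts concrete, here is a toy sketch of that lookup path: N-grams are hashed by several salted heads into a prime-sized embedding table, the retrieved rows are averaged, and a sigmoid gate scales how much memory is mixed into the hidden state. This is my own illustration under those assumptions, not the paper's implementation; the class name, hash function, and all sizes are made up.

```python
import numpy as np

class EngramSketch:
    """Toy hashed N-gram memory (illustrative sketch, not the official code)."""

    def __init__(self, table_size=10_007, dim=8, n_heads=2, seed=0):
        rng = np.random.default_rng(seed)
        # One shared embedding table; a prime bucket count spreads collisions.
        self.table = rng.standard_normal((table_size, dim)) * 0.02
        self.table_size = table_size
        self.n_heads = n_heads
        self.dim = dim

    def _hash(self, ngram, head):
        # Simple multiplicative hash, salted per head; the paper's hash family may differ.
        h = head * 0x9E3779B1 + 1
        for tok in ngram:
            h = (h * 1_000_003 + tok) & 0xFFFFFFFF
        return h % self.table_size

    def lookup(self, token_ids, n=2):
        # For each position, average the table rows chosen by all hash heads
        # for the N-gram ending at that position (positions without a full
        # N-gram get a zero vector).
        out = np.zeros((len(token_ids), self.dim))
        for i in range(n - 1, len(token_ids)):
            ngram = tuple(token_ids[i - n + 1 : i + 1])
            out[i] = np.mean(
                [self.table[self._hash(ngram, h)] for h in range(self.n_heads)],
                axis=0,
            )
        return out

    def gated_add(self, hidden, memory, gate_logit):
        # Context-aware gate: sigmoid(gate_logit) in (0, 1) scales the memory signal.
        gate = 1.0 / (1.0 + np.exp(-gate_logit))
        return hidden + gate * memory
```

Because the table is indexed by hashing rather than by a learned router, a lookup costs a handful of integer operations plus one embedding gather per head, which is what makes the memory cheap at inference time.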
**Sparsity Allocation: The Key to Unlocking Engram’s Potential**
The decisive question is how to divide the sparse parameter budget between routed experts and conditional memory. Formalizing this as the Sparsity Allocation problem, the authors find a sweet spot: Engram models outperform pure-MoE models even when the ratio of inactive (expert) parameters drops to around 0.25, which corresponds to roughly half as many routed experts as before. The optimal allocation ratio appears to be around 20-25%, and it holds across both compute regimes studied.
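As a back-of-envelope illustration of what such an allocation means in raw parameter counts (the helper name, the 20B total, and the 22% fraction below are made-up round numbers, not figures from the paper):

```python
def split_sparse_budget(total_sparse_params: int, engram_fraction: float):
    """Split a fixed sparse parameter budget between MoE experts and
    Engram memory. Hypothetical helper for illustration only."""
    engram = round(total_sparse_params * engram_fraction)
    moe = total_sparse_params - engram
    return moe, engram

# A made-up 20B sparse budget with ~22% reallocated to Engram memory:
moe, engram = split_sparse_budget(20_000_000_000, 0.22)
# moe == 15_600_000_000, engram == 4_400_000_000
```

The point of the formalization is that both sides draw from the same fixed budget, so growing the memory table necessarily shrinks the expert pool.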
**Crowning Achievement: Giant-Scale Pre-Training Results**
Four models were trained on the same 262-billion-token curriculum, each with 3.8 billion activated parameters: Dense 4B, MoE 27B, Engram 27B, and Engram 40B. On the Pile test set, Engram models achieved markedly lower language-modeling loss, with Engram 40B reaching 1.942. Engram models also consistently outperformed the MoE baseline on knowledge and reasoning benchmarks such as MMLU, CMMLU, C-Eval, ARC, BBH, and DROP (F1).
**Going Long: Extending the Context Window**
After pre-training, the authors extended the context window to 32768 tokens over 5000 steps, using 30 billion high-quality long-context tokens. Comparing MoE-27B and Engram-27B across checkpoints, they found that Engram-27B matched or exceeded MoE-27B in three long-context scenarios while using about 82% of the pre-training FLOPs.
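Those extension numbers imply a sizable global batch. A quick sanity check, assuming (my assumption, not stated in the text) that the 30 billion tokens are spread evenly across the 5000 steps:

```python
total_tokens = 30_000_000_000   # long-context extension tokens
steps = 5_000                   # extension steps
seq_len = 32_768                # extended context window

tokens_per_step = total_tokens // steps    # 6,000,000 tokens per step
seqs_per_step = tokens_per_step / seq_len  # ~183 full-length sequences per batch
```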
**Takeaways from the Engram Revolution**
Engram provides a conditional memory axis for sparse LLMs, enabling fast lookup of frequent N-gram patterns and entities.
Under fixed parameter and FLOPs budgets, reallocating about 20-25% of the sparse capacity from MoE experts into Engram memory makes more effective use of the budget.
In giant-scale pre-training on 262 billion tokens, Engram-27B and Engram-40B, with the same 3.8 billion activated parameters, outperform the MoE-27B baseline on language modeling and on knowledge, reasoning, code, and math benchmarks.
