    DeepSeek Researchers Apply a 1967 Matrix Normalization Algorithm to Repair Instability in Hyper Connections

    By Naveed Ahmad | 04/01/2026 (updated 07/02/2026)

    **From Residual Connections to Hyper Connections: A Game-Changer for Language Models**

    Hey everyone, welcome back to my blog! Today, I’m excited to dive into the world of deep learning and explore the latest advancements in language model development. In this post, we’ll be discussing the concept of hyper connections, an extension of residual connections that has been making waves in the AI community.

    If you’re familiar with residual connections, you might be wondering what’s new and exciting. Well, let me tell you – hyper connections take the concept of residual connections to the next level by introducing additional layers and a unique way of mixing and combining them. This has the potential to significantly improve the performance of language models, and I’m excited to share the details with you.

    So, what are hyper connections? Simply put, they’re a type of connection that builds upon the residual connection architecture. Instead of just passing the input through each layer, hyper connections allow for the mixing and combining of multiple layers, enabling the model to learn more complex and nuanced patterns.

    The key idea behind hyper connections is that the residual stream is widened into an n-stream buffer: each layer carries n hidden vectors instead of one. These streams are combined around the sublayer F (the usual attention or feed-forward block) using three learned mappings:

    * H_l^pre: selects a combination of streams as the layer input
    * H_l^post: writes the sublayer output back into the n-stream buffer
    * H_l^res: mixes streams between layers

    The update rule is x_{l+1} = H_l^res x_l + (H_l^post)^T F(H_l^pre x_l, W_l), where W_l are the weights of the sublayer F.
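    As a toy illustration (my own sketch, not the paper's implementation), the update rule can be written out in NumPy. The mappings are randomly initialized here, and a simple linear map stands in for the sublayer F:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 4, 8  # n-stream expansion rate and hidden width (toy sizes)

    def sublayer(h, W):
        """Stand-in for F: an attention or feed-forward sublayer (here just a linear map)."""
        return h @ W

    # x_l: the n-stream buffer, one d-dimensional vector per stream
    x_l = rng.standard_normal((n, d))
    W_l = rng.standard_normal((d, d)) / np.sqrt(d)

    # The three learned mappings (randomly initialized for illustration)
    H_pre = rng.standard_normal((1, n))   # selects a combination of streams as layer input
    H_post = rng.standard_normal((1, n))  # writes the sublayer output back into the buffer
    H_res = rng.standard_normal((n, n))   # mixes streams between layers

    # x_{l+1} = H_res x_l + H_post^T F(H_pre x_l, W_l)
    h_in = H_pre @ x_l                    # (1, d): mixed layer input
    x_next = H_res @ x_l + H_post.T @ sublayer(h_in, W_l)
    print(x_next.shape)                   # the buffer keeps its (n, d) shape across layers
    ```

    Note how the residual path is no longer the identity: it is the learned matrix H_res, and it is the repeated composition of these matrices across layers that causes the instability discussed next.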

    By introducing hyper connections, researchers have achieved improved downstream performance in language models without a significant increase in computation. An expansion rate of n = 4 appears to be the sweet spot for this architecture.

    However, hyper connections are not without their challenges. One of the main issues is potential instability, which shows up when you consider the product of the residual mixers across many layers. Researchers at DeepSeek studied this composite mapping using a worst-case amplification ("Amax gain") metric, which measures how much the forward and backward signal paths can be amplified. In the hyper connection model, this gain peaks around 3000, far from the expected value of 1!
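    To see why depth makes this dangerous, here is a toy NumPy experiment (my own illustration, not the paper's metric): it composes residual mixers across layers and uses the largest absolute entry of the product as a crude proxy for worst-case gain.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    n, L = 4, 60  # stream count and depth (toy values)

    # Unconstrained residual mixers: identity plus a small random perturbation
    mixers = [np.eye(n) + 0.2 * rng.standard_normal((n, n)) for _ in range(L)]

    # Doubly stochastic mixers stay bounded; uniform averaging is the simplest case
    ds = [np.full((n, n), 1.0 / n) for _ in range(L)]

    def composed_gain(mats):
        """Largest absolute entry of the product M_L ... M_1,
        a crude proxy for worst-case amplification through the residual path."""
        P = np.eye(n)
        for M in mats:
            P = M @ P
        return np.abs(P).max()

    print(composed_gain(mixers))  # typically grows far above 1 with depth
    print(composed_gain(ds))      # stays at 1/n: no amplification
    ```

    Even mild per-layer deviations from a norm-preserving mixer compound multiplicatively with depth, which is exactly the failure mode the doubly stochastic constraint below is designed to rule out.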

    To address this issue, the researchers introduced manifold-constrained hyper connections (mHC). mHC keeps the idea of hyper connections but constrains the residual mixing matrix H_l^res to lie on the manifold of doubly stochastic matrices: every row and column sums to 1, and all entries are non-negative.

    To enforce this constraint, researchers used the Sinkhorn-Knopp algorithm, which alternates row and column normalizations to approximate a doubly stochastic matrix. This adds a small overhead, but it’s manageable.
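    A minimal Sinkhorn-Knopp sketch (my own toy version; the paper's fused-kernel implementation is far more optimized) looks like this:

    ```python
    import numpy as np

    def sinkhorn_knopp(A, iters=20):
        """Approximately project a positive matrix onto the doubly stochastic set
        by alternating row and column normalizations (Sinkhorn & Knopp, 1967)."""
        M = np.asarray(A, dtype=float)
        for _ in range(iters):
            M = M / M.sum(axis=1, keepdims=True)  # make each row sum to 1
            M = M / M.sum(axis=0, keepdims=True)  # make each column sum to 1
        return M

    rng = np.random.default_rng(2)
    # Positive entries, e.g. by exponentiating unconstrained parameters
    # (how the raw parameters are made positive is an assumption here)
    H_raw = np.exp(rng.standard_normal((4, 4)))
    H_res = sinkhorn_knopp(H_raw)

    print(H_res.sum(axis=1))  # each row sums to ~1
    print(H_res.sum(axis=0))  # each column sums to ~1
    ```

    Because the iteration is just a handful of row/column reductions per mixer, it adds only a small constant cost per layer, which is why the overall overhead stays manageable.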

    So, what’s the impact on training? Constraining each residual mixer with Sinkhorn-type iterations does come with a cost. However, the research group addresses this with a few techniques:

    * Fused kernels combine RMSNorm, projections, and gating for the mHC mappings to keep memory traffic low
    * Recompute-based activation checkpointing trades compute for memory by recomputing mHC activations during backpropagation for blocks of layers
    * Integration with a DualPipe-like pipeline schedule overlaps communication and recomputation, so additional work doesn’t stall the training pipeline

    In large-scale in-house training runs, mHC with an expansion rate of n = 4 adds about 6.7% training-time overhead relative to the baseline architecture.

    But what about the results? The research group trained 3B, 9B, and 27B Mixture-of-Experts (MoE) models and evaluated them on a standard language model benchmark suite.

    For the 27B model, the reported numbers on a subset of tasks show the pattern clearly:

    * Baseline: BBH 43.8, DROP F1 47.0
    * With hyper connections: BBH 48.9, DROP F1 51.6
    * With mHC: BBH 51.0, DROP 53.9

    So, hyper connections already show a gain over the baseline residual design, and manifold-constrained hyper connections push performance further while restoring stability.

    In conclusion, mHC stabilizes the widened residual streams by constraining the residual mixing matrices to the manifold of doubly stochastic matrices. This tames the exploding gain, adds expressivity without a large increase in computation, and improves downstream performance in language models. With only a small training overhead, this is a promising direction for future large language models.

    Thanks for reading, and I’ll see you in the next post!
