Zyphra Introduces Tensor and Sequence Parallelism (TSP): A Hardware-Aware Training and Inference Technique That Delivers 2.6x Throughput Over Matched TP+SP Baselines

By Naveed Ahmad · 05/05/2026 · Updated: 05/05/2026 · 8 Mins Read


Training and serving large transformer models at scale is fundamentally a memory management problem. Every GPU in a cluster has a fixed amount of VRAM, and as model sizes and context lengths grow, engineers constantly have to make trade-offs about how to distribute work across hardware. A new technique from Zyphra, called Tensor and Sequence Parallelism (TSP), offers a way to rethink that trade-off: in benchmark tests on up to 1,024 AMD MI300X GPUs, it consistently delivers lower per-GPU peak memory than any of the standard parallelism schemes used today, for both training and inference workloads.

    https://www.zyphra.com/submit/tsp

The Problem TSP Is Solving

To understand why TSP matters, it helps to first understand the two parallelism strategies it folds together.

Tensor Parallelism (TP) splits model weights across GPUs. If you have a weight matrix in an attention or MLP layer, each GPU in the TP group holds only a fraction of that matrix. This directly reduces the per-GPU memory occupied by parameters, gradients, and optimizer states, the "model state" memory. The trade-off is that TP requires collective communication operations (typically all-reduce or reduce-scatter/all-gather pairs) every time a layer is computed. This communication is proportional to activation size, so it becomes increasingly expensive as sequence length grows.

Sequence Parallelism (SP) takes a different approach. Instead of splitting weights, it splits the input token sequence across GPUs. Each GPU processes only a fraction of the tokens, which reduces activation memory and the quadratic cost of attention computation. However, SP leaves model weights fully replicated on every GPU, which means model-state memory stays exactly the same no matter how many GPUs you add to the SP group.
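The contrast between the two schemes can be sketched on a single linear layer. The following is a single-process numpy simulation, not the actual distributed implementation: TP shards the weight's columns across D "devices" and gathers the partial outputs, while SP shards the token sequence and replicates the full weight. All names and shapes here are illustrative.

```python
import numpy as np

def tp_forward(x, W, D):
    """Tensor parallelism: each device holds 1/D of W's columns."""
    shards = np.split(W, D, axis=1)        # per-device weight shard
    partials = [x @ w for w in shards]     # local GEMM on full sequence
    return np.concatenate(partials, axis=1)  # "all-gather" of outputs

def sp_forward(x, W, D):
    """Sequence parallelism: each device holds 1/D of the tokens, full W."""
    chunks = np.split(x, D, axis=0)        # per-device token chunk
    outs = [c @ W for c in chunks]         # same replicated W everywhere
    return np.concatenate(outs, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))           # (tokens, hidden)
W = rng.standard_normal((16, 32))
ref = x @ W
assert np.allclose(tp_forward(x, W, D=4), ref)
assert np.allclose(sp_forward(x, W, D=4), ref)
```

Both reproduce the unsharded result; they differ only in which tensor (weights versus activations) each device stores in full, which is exactly the memory trade-off described above.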

In standard multi-dimensional parallelism, engineers combine TP and SP by placing them on orthogonal axes of a device mesh. If you want a TP degree of T and an SP degree of Σ, your model replica consumes T·Σ GPUs. That is expensive in two ways. First, it uses more GPUs for the model-parallel group, leaving fewer available for data-parallel replicas. Second, if T·Σ is large enough to span multiple nodes, some collective communication has to travel over slower inter-node interconnects like InfiniBand or Ethernet instead of the high-bandwidth intra-node fabric, such as AMD Infinity Fabric or NVIDIA NVLink. Data Parallelism (DP), the other common baseline, avoids these model-parallel costs entirely but replicates all model state on every device, making it impractical for large models or long contexts on its own.

What Folding Actually Means

TSP's core idea is parallelism folding: instead of placing TP and SP on separate, orthogonal mesh dimensions, it collapses both onto a single device-mesh axis of size D. Every GPU in the TSP group simultaneously holds 1/D of the model weights and 1/D of the token sequence. Because both are sharded across the same D GPUs, the per-device memory footprint decreases by 1/D for both parameter memory and activation memory, something no single standard parallelism scheme achieves on its own. TSP is therefore the only scheme that simultaneously reduces weight-proportional memory (parameters, gradients, optimizer states) and activation memory by the same 1/D factor on a single axis without requiring a two-dimensional T·Σ device layout.
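The memory argument reduces to simple per-GPU arithmetic. The sketch below encodes the sharding described above for a weight budget P and activation budget A (both fully replicated at degree 1); the GB figures are hypothetical placeholders, not the paper's measurements.

```python
def per_gpu_memory(P, A, scheme, degree):
    """Back-of-envelope per-GPU memory (same units as P and A)."""
    if scheme == "TP":   # weights sharded, activations replicated
        return P / degree + A
    if scheme == "SP":   # activations sharded, weights replicated
        return P + A / degree
    if scheme == "TSP":  # both sharded on the same axis of size D
        return P / degree + A / degree
    raise ValueError(scheme)

P, A, D = 56.0, 80.0, 8   # hypothetical GB budgets for a long-context run
assert per_gpu_memory(P, A, "TSP", D) < per_gpu_memory(P, A, "TP", D)
assert per_gpu_memory(P, A, "TSP", D) < per_gpu_memory(P, A, "SP", D)
assert per_gpu_memory(P, A, "TSP", D) == (P + A) / D
```

Only the TSP row divides both terms by the degree, which is the whole point of folding the two axes into one.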

The challenge is that if each GPU only has part of the weights and part of the sequence, it needs to coordinate with other GPUs to complete each layer's forward pass. TSP uses two different communication schedules to handle this, one for attention and one for the gated MLP.

For attention, TSP iterates over weight shards. At each step, one GPU broadcasts its packed attention weight shards (WQ, WK, WV, and WO) to all other GPUs in the group. Every GPU then applies these weights to its local sequence tokens to compute local Q, K, and V projections. Since causal attention requires access to the full key/value context, the local K and V tensors are all-gathered across the TSP group and reordered using a zigzag partition scheme before FlashAttention is applied. The zigzag partition ensures that the causal attention workload is balanced across ranks, since later tokens attend to larger prefixes and would otherwise cause load imbalance.
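The load-balancing effect of a zigzag partition is easy to verify numerically. In the common formulation (assumed here; the post does not spell out the exact chunk assignment), the sequence is cut into 2·D chunks and rank r receives chunks r and 2·D-1-r, pairing a cheap early chunk with an expensive late one:

```python
def zigzag_ranks(seq_len, D):
    """Assign each token position to a rank via the zigzag scheme."""
    chunk = seq_len // (2 * D)
    owner = [0] * seq_len
    for r in range(D):
        for c in (r, 2 * D - 1 - r):      # one early + one late chunk per rank
            for t in range(c * chunk, (c + 1) * chunk):
                owner[t] = r
    return owner

def causal_work(owner, D):
    """Under causal masking, token t attends to t+1 keys."""
    work = [0] * D
    for t, r in enumerate(owner):
        work[r] += t + 1
    return work

w = causal_work(zigzag_ranks(seq_len=64, D=4), D=4)
assert len(set(w)) == 1   # every rank ends up with an identical workload
```

A naive contiguous split would give the last rank roughly 2·D-1 times the work of the first; the zigzag pairing makes every rank's total exactly equal.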

For the gated MLP, TSP uses a ring schedule. Each GPU starts with local shards of the gate, up, and down projections. These weight shards circulate around the TSP group via point-to-point send/recv operations, and each GPU accumulates partial outputs locally as the shards arrive. Critically, this eliminates the all-reduce that standard TP requires for the MLP output: the sequence stays local, and only the weights move. The ring is designed to overlap weight transfers with GEMM computation, so communication happens in the background while the GPU is computing.
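The ring schedule can be simulated in one process: each "device" keeps its local token chunk fixed while (gate, up, down) weight shards circulate, and partial outputs are accumulated locally, so no all-reduce over activations is needed. Shapes and the SiLU activation are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def ring_mlp(x_local, gate_shards, up_shards, down_shards):
    """Gated MLP via a ring: sequence stays put, weight shards rotate."""
    D = len(gate_shards)
    out = np.zeros((x_local.shape[0], down_shards[0].shape[1]))
    for step in range(D):  # one ring rotation per step (send/recv in reality)
        g, u, d = gate_shards[step], up_shards[step], down_shards[step]
        out += (silu(x_local @ g) * (x_local @ u)) @ d  # accumulate partial
    return out

rng = np.random.default_rng(2)
h, f, D = 8, 32, 4
x = rng.standard_normal((4, h))
Wg, Wu = rng.standard_normal((h, f)), rng.standard_normal((h, f))
Wd = rng.standard_normal((f, h))
ref = (silu(x @ Wg) * (x @ Wu)) @ Wd     # unsharded gated MLP
gs = np.split(Wg, D, axis=1)             # shard the intermediate dimension
us = np.split(Wu, D, axis=1)
ds = np.split(Wd, D, axis=0)
assert np.allclose(ring_mlp(x, gs, us, ds), ref)
```

Because the gating is element-wise along the sharded intermediate dimension, the partial down-projections sum exactly to the unsharded result, which is why no output all-reduce is required.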

Memory and Throughput Results

Tested on a single 8-GPU MI300X node across sequence lengths from 16K to 128K tokens, TSP achieves the lowest peak memory at every point. At 16K tokens, TSP and TP are nearly equal, 31.0 GB versus 31.5 GB per GPU, because model-state memory dominates at short context. At 128K tokens, the picture changes dramatically: TSP uses 38.8 GB per GPU, compared to 70.0 GB for TP and 85.0 GB and 140.0 GB for two different TP+SP factorizations on the same node. The theoretical figures throughout this analysis are based on a 7B dense decoder-only transformer reference model (hidden dimension h=4096, 32 layers, 32 query heads, 32 KV heads, FFN expansion factor F=4, bf16 precision), providing a reproducible baseline for comparing the schemes.
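To see why the gap widens at long context, it helps to put rough numbers on the weight side. The following back-of-envelope arithmetic (not the paper's exact accounting) prices the 7B reference model's weights in bf16 at 2 bytes per parameter, replicated versus sharded 8 ways:

```python
BYTES_BF16 = 2
GIB = 2**30

def weight_gb(params, shard_degree):
    """Per-GPU weight memory in GiB for a given sharding degree."""
    return params * BYTES_BF16 / shard_degree / GIB

full = weight_gb(7e9, 1)       # replicated, as under SP or pure DP
sharded = weight_gb(7e9, 8)    # 1/D with D = 8, as under TP or TSP
assert round(full, 1) == 13.0   # ~13 GiB of weights per GPU when replicated
assert round(sharded, 2) == 1.63  # ~1.6 GiB per GPU when sharded 8 ways
```

The weight term is fixed regardless of sequence length, so at 16K tokens it dominates and all schemes look similar, while at 128K tokens the activation term takes over and only schemes that also shard activations (SP, TSP) stay low; TSP is the only one that cuts both.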

Throughput results on 128 full nodes (1,024 MI300X GPUs) show TSP consistently outperforming matched TP+SP baselines. At a folded degree of D=8 and a sequence length of 128K tokens, TSP achieves 173 million tokens per second, compared to 66.30 million tokens per second for the matched TP+SP baseline (roughly a 2.6x speedup). The advantage grows with higher parallelism degree and longer sequence length.


Practical Trade-offs to Understand

TSP does increase total communication volume compared to TP alone. It adds a weight-movement term per layer on top of the same activation-proportional K/V all-gather that SP uses. However, the research team shows that when batch size B and sequence length S satisfy BS > 8h (where h is the model's embedding dimension), TSP's forward communication volume is competitive with TP's. This condition is met in most long-context training and inference scenarios.
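The BS > 8h condition is cheap to check for a given workload. Using the reference model's h=4096 (so 8h = 32,768):

```python
def tsp_comm_competitive(B, S, h):
    """True when TSP's forward communication is competitive with TP's,
    per the BS > 8h condition the research team cites."""
    return B * S > 8 * h

h = 4096                                       # reference model hidden size
assert tsp_comm_competitive(B=1, S=128_000, h=h)      # 128K context: holds
assert not tsp_comm_competitive(B=1, S=16_000, h=h)   # 16000 < 32768
assert tsp_comm_competitive(B=4, S=16_000, h=h)       # batching recovers it
```

In other words, even a modest batch size pushes short-context workloads into the regime where the extra weight movement is not the dominant term.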

The key insight the Zyphra team emphasizes is that communication volume and communication cost are not the same thing. Whether extra communication volume translates into wall-clock slowdown depends on whether the collectives are latency-bound or bandwidth-bound, and how much of that traffic can be overlapped with matrix multiplication. Their implementation pipelines weight transfers behind dominant GEMM operations so that weight communication consumes bandwidth without adding to critical-path time.

TSP is not designed to replace TP, SP, or TP+SP in all settings. It is meant as an additional axis in the multi-dimensional parallelism design space. It composes orthogonally with pipeline parallelism, expert parallelism, and data parallelism. This means teams can slot TSP into an existing parallelism configuration wherever the standard layout would otherwise force model-parallel groups across slower inter-node links.

    Key Takeaways

    • Zyphra's Tensor and Sequence Parallelism (TSP) folds tensor parallelism and sequence parallelism onto a single device-mesh axis, so each GPU simultaneously holds 1/D of the model weights and 1/D of the token sequence, reducing memory overhead for both training and inference.
    • TSP is the only parallelism scheme that reduces both weight-proportional memory (parameters, gradients, optimizer states) and activation memory by the same 1/D factor on a single axis, without requiring a two-dimensional T·Σ device mesh.
    • Empirical results on a single 8-GPU MI300X node show TSP uses 38.8 GB per GPU at 128K sequence length, compared to 70.0 GB for TP and 85.0–140.0 GB for TP+SP configurations.
    • At large scale (1,024 MI300X GPUs, 128K context, D=8), TSP achieves 173 million tokens per second versus 66.30 million tokens per second for a matched TP+SP baseline (roughly a 2.6x throughput advantage).
    • TSP composes orthogonally with pipeline, expert, and data parallelism, and is best suited to long-context, memory-constrained training and inference workloads where eliminating weight and activation replication outweighs the added communication volume.

Check out the Paper and technical details for the full methodology and results.


The post Zyphra Introduces Tensor and Sequence Parallelism (TSP): A Hardware-Aware Training and Inference Technique That Delivers 2.6x Throughput Over Matched TP+SP Baselines appeared first on MarkTechPost.





    Naveed Ahmad is a technology journalist and AI writer at ArticlesStock, covering artificial intelligence, machine learning, and emerging tech policy. Read his latest articles.
