**Game-Changing Repository for Large Language Model Pruning: LLM-Pruning Collection**
Hey, fellow ML enthusiasts! Have you been struggling to find a reliable way to prune large language models (LLMs)? You're in luck: the zlab group at Princeton University has launched a repository called LLM-Pruning Collection that brings LLM pruning into a single, reproducible framework. In this post, we'll dig into what the repository contains and what it means for the LLM pruning community.
**What’s in the LLM-Pruning Collection?**
The repository is structured into three main directories:
1. **pruning**: This directory is the meat of the repository, featuring implementations for various pruning strategies, including:
* Minitron: A pruning-and-distillation recipe developed by NVIDIA that compresses LLaMA 3.1 8B to 4B and Mistral NeMo 12B to 8B while largely preserving accuracy.
* ShortGPT: A technique that removes redundant Transformer layers outright by direct layer deletion, outperforming earlier pruning strategies across a range of tasks.
* Wanda, SparseGPT, and Magnitude: Post-training weight-pruning techniques. Wanda scores each weight by the product of its magnitude and the norm of the corresponding input activation, then prunes the lowest-scoring weights; SparseGPT prunes via layer-wise reconstruction using approximate second-order information; magnitude pruning simply drops the smallest-magnitude weights. These methods induce sparsity without retraining, even at billion-parameter scales.
2. **training**: This directory integrates with FMS-FSDP for GPU training and MaxText for TPU training, ensuring seamless deployment on both hardware platforms.
3. **evaluation**: This directory is home to JAX-based evaluation scripts built around lm-eval-harness, with MaxText-backed support that delivers roughly a 2-4x evaluation speedup.
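To make the Wanda scoring rule above concrete, here is a minimal NumPy sketch, not the repository's actual implementation: score each weight by |weight| times the L2 norm of its input channel's calibration activations, then zero the lowest-scoring weights within each output row (function names and the per-row pruning granularity are illustrative).

```python
import numpy as np

def wanda_scores(W, X):
    """Wanda-style importance: |weight| * L2 norm of the matching input channel.

    W: (out_features, in_features) weight matrix
    X: (n_samples, in_features) calibration activations
    """
    act_norm = np.linalg.norm(X, axis=0)        # per-input-channel L2 norm
    return np.abs(W) * act_norm                 # broadcasts across output rows

def prune_by_score(W, scores, sparsity=0.5):
    """Zero out the lowest-scoring weights within each output row."""
    k = int(W.shape[1] * sparsity)
    pruned = W.copy()
    if k > 0:
        idx = np.argsort(scores, axis=1)[:, :k]   # k smallest scores per row
        np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(16, 8))
W_sparse = prune_by_score(W, wanda_scores(W, X), sparsity=0.5)
print((W_sparse == 0).mean())  # -> 0.5
```

Note that if you feed in constant activations, the activation norms become uniform and the rule collapses to plain magnitude pruning, which is one way to see why Wanda is a strict refinement of the magnitude baseline.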
**Key Takeaways**
* The LLM-Pruning Collection is a JAX-based, Apache-2.0 repository that unifies modern LLM pruning strategies with shared pruning, training, and evaluation pipelines for GPUs and TPUs.
* The codebase implements block, layer, and weight-level pruning approaches, including Minitron, ShortGPT, Wanda, SparseGPT, Sheared LLaMA, Magnitude pruning, and LLM-Pruner.
* The repository reproduces key results from prior pruning work, publishing side-by-side “paper vs reproduced” tables for techniques like Wanda, SparseGPT, Sheared LLaMA, and LLM-Pruner, so engineers can validate their runs against recognized baselines.
* The repository is a significant contribution to the field of LLM pruning, providing a unified framework for comparing different pruning strategies and techniques.
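For the layer-level side of the story, here is an illustrative sketch of the redundancy idea behind ShortGPT-style layer deletion, again not the repository's code: rank each layer by how much it actually changes its input (one minus the mean cosine similarity between the layer's input and output hidden states) and drop the layers that change it least. The function names and toy data are assumptions for the demo.

```python
import numpy as np

def block_influence(x_in, x_out):
    """Layer importance as 1 - mean cosine similarity between a layer's
    input and output hidden states (low influence => redundant layer)."""
    cos = np.sum(x_in * x_out, axis=-1) / (
        np.linalg.norm(x_in, axis=-1) * np.linalg.norm(x_out, axis=-1) + 1e-8
    )
    return 1.0 - cos.mean()

def layers_to_drop(per_layer_states, n_drop):
    """Rank layers by influence; return indices of the n_drop least important."""
    scores = [block_influence(x_in, x_out) for x_in, x_out in per_layer_states]
    return sorted(np.argsort(scores)[:n_drop].tolist())

# Toy demo: layer 1 barely changes its input (near-identity), so it ranks lowest.
rng = np.random.default_rng(0)
h = rng.normal(size=(32, 16))
states = [
    (h, h + rng.normal(scale=1.0, size=h.shape)),   # layer 0: large change
    (h, h + rng.normal(scale=0.01, size=h.shape)),  # layer 1: near-identity
    (h, h + rng.normal(scale=1.0, size=h.shape)),   # layer 2: large change
]
print(layers_to_drop(states, n_drop=1))  # -> [1]
```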
