
LLM Pruning & Distillation, Impact of Code in Pre-training, and Transfusion

#20 | Latest AI Research Explained Simply

In this issue: x2 industry news, x3 AI research papers

Aug 19th ~ Aug 25th

🗞️ Industry News in 1 Line

  1. ♥ 3K The Unitree G1 robot has entered mass production, and you can buy it for $16,000. It has an array of sensors, including a 3D lidar, a depth camera, and microphones, plus a 5W speaker, but its 9000mAh battery lasts only about two hours.

    Unitree G1 Humanoid agent

  2. ♥ 2.7k Nous Research has released a preliminary report on DisTrO (Distributed Training Over-the-Internet), a groundbreaking family of optimizers that reduces inter-GPU communication needs by up to 10,000x, without sacrificing performance. DisTrO enables efficient, low-latency training of large neural networks even on slow internet connections and diverse hardware setups.

    DisTrO’s loss comparison

Supercharge Your Workday: 100 ChatGPT Productivity Prompts

Transform your workday with HubSpot's free guide, "Using ChatGPT at Work." Discover how to leverage ChatGPT for efficiency and innovation with 100 actionable prompt ideas. This comprehensive resource covers:

  • Demystifying ChatGPT: Understand its full potential.

  • Practical Insights: Real-world use cases to streamline processes.

  • Best Practices: Expert tips for maximum effectiveness.

Download now and revolutionize your workflow with the power of ChatGPT!

LLM Pruning and Distillation in Practice: The Minitron Approach

Sreenivas et al. [NVIDIA]

♥ 829   LLM Compression

Introduction to LLM Pruning and Distillation

Making really smart AI models is super expensive and takes a ton of time and resources. Companies usually make a bunch of different sized models to fit different needs, but that means doing all that expensive work multiple times. Instead of making a bunch of different sized models from scratch, this paper is trying a clever trick. They start with one big, smart model and then:

  1. "Trim the fat" (they call this pruning) - They cut out parts of the big model to make it smaller.

  2. "Teach the smaller version" (they call this distillation) - They use the knowledge from the big model to train the smaller one, kind of like a teacher helping a student.

This allowed them to make a smaller 8B model that's actually better than other similar-sized models out there. Their 4B models are also really good compared to the bigger ones they started with, and the best part is that the smaller models run faster too, which is great for practical use.

How Does LLM Pruning and Distillation Work?

LLM Pruning and Distillation can be thought of as teaching a small child everything that you know. When teaching a child, we often simplify the complexities and hide unnecessary information – a similar process is followed here:

  1. Teacher Tune-up (Teacher Correction): Think of this like you reviewing your notes before teaching someone else. They take the big, smart AI (the teacher) and give it a quick study session on the specific stuff they want the smaller AI to learn. This helps make sure the teacher is ready to pass on the right knowledge.

  2. Trimming the Fat (Pruning): Now, imagine your brain is like a big, tangled web of connections. Some parts are super important, others... not so much. They look at all these connections and figure out which ones matter most. Then, they start cutting away the less important bits. They do this in two main ways:

    1. Making the model shallower (fewer layers)

    2. Making it thinner overall (fewer connections in various parts)

  3. Teaching the Mini-model (Distillation): This is where the trimmed-down AI (let's call it Mini-Model) tries to copy what the big AI does. It uses Kullback-Leibler (KL) divergence to compare how the teacher model thinks (its probability distribution) to how the student model thinks. The goal is to make the student's thinking process more similar to the teacher's. Based on this KL divergence, the student model's parameters are adjusted (see the code sketch below).

  4. Practice Makes Perfect: They keep doing this teaching process over and over, using tons of examples (we're talking billions!). 

The end result? You get a smaller, faster AI that's almost as smart as the big one. It's learned to think in similar ways and can do a lot of the same tasks.
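
If you're curious what that KL-based teaching step looks like in code, here is a minimal PyTorch-style sketch (the function name, temperature, and reduction are our own illustrative choices, not the exact recipe from the paper):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # Soften both next-token distributions with the same temperature
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student): how far the student's distribution is
    # from the teacher's, averaged over all token positions
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Usage sketch: logits of shape (batch * seq_len, vocab_size)
# loss = distillation_loss(student(batch), teacher(batch).detach())
```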

Benchmarking LLM Pruning and Distillation 

Pruning and distilling LLMs can produce impressive results in terms of model compression and performance retention. They compressed Llama 3.1 8B to 4B parameters and Mistral NeMo 12B to 8B parameters while maintaining, and in some cases even improving, performance across various benchmarks. The compressed MN-Minitron-8B model outperformed similarly sized models, including the recent Llama 3.1 8B, despite using 40 times fewer training tokens.

The Llama-3.1-Minitron 4B models showed favorable performance compared to their teacher model (Llama 3.1 8B) and the previous generation Minitron 4B model, using 150 times fewer training tokens. The width-pruned variant generally outperformed the depth-pruned one in terms of accuracy, while the depth-pruned variant achieved better runtime performance with up to 2.7x speedup over the original model. This means that pruning and distillation approaches can produce smaller, faster models that retain much of the capabilities of larger models.

To Code, or Not To Code? Exploring Impact of Code in Pre-training

Aryabumi et al. [Cohere For AI]

♥ 350   LLM Pre-training

Workflow pipeline of the paper.

Why do We Need Code in Pre-training

We know that big AI models can do all sorts of tasks, like answering questions and even writing code. But people noticed that even when an AI isn't specifically designed to write code, it often gets better at other tasks if it sees code during its training.

We don’t fully understand why this happens, so these researchers decided to dig deeper. They ran a series of experiments where they trained AI models with different amounts of code mixed in with regular text. They wanted to see how this affected the AI's performance on all sorts of tasks, not just coding. They looked at things like how well the AI could reason, how much general knowledge it had, and even how good it was at creative writing. What they found was pretty surprising: adding code to the training actually made the AI better at almost everything.

How to Add Code in Pre-training to Improve LLMs?

The researchers approached their experiment by gathering large amounts of text and code, treating them like a mountain of books and a stack of computer programs, and mixing them in various ways.

In their experimentation, the researchers tried different combinations of text and code, sometimes using more of one than the other. They even experimented with different types of code, such as code from websites or special code created just for this experiment. Additionally, the researchers explored various strategies for incorporating code into the pre-training stages of their AI models to enhance performance across different tasks.

They compared several "recipes" that mixed text and code in different proportions during pre-training, followed by a cooldown phase that further refined the models. Their findings showed that adding code to the mix significantly improved the AI's capabilities, particularly in code-related tasks but also in natural language reasoning and world knowledge.

After each training run, the researchers tested their AI models to see how well they performed on various tasks, like answering questions, solving problems, and even writing code. By experimenting with many different recipes, they discovered that the right balance of code could significantly boost the AI's capabilities.
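
To picture what "mixing text and code in different proportions" can look like in practice, here is a rough sketch that interleaves two document streams according to a chosen code fraction (the 50/50 default and the sampling scheme are illustrative assumptions, not the paper's actual data pipeline):

```python
import random

def mixed_pretraining_stream(text_docs, code_docs, code_fraction=0.5, seed=0):
    """Yield training documents where roughly `code_fraction` of them are code."""
    rng = random.Random(seed)
    text_iter, code_iter = iter(text_docs), iter(code_docs)
    while True:
        source = code_iter if rng.random() < code_fraction else text_iter
        try:
            yield next(source)
        except StopIteration:
            return  # stop once either source is exhausted

# e.g. a "balanced" mixture: stream = mixed_pretraining_stream(text, code, 0.5)
```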

Testing Code in Pre-training 

One of the standout strategies was the "balanced→text" approach, where an equal mix of text and code was used initially, followed by additional text during continued pre-training, and finally, a cooldown phase that reintroduced code. This method led to substantial gains across the board: an 8.2% improvement in natural language reasoning, a 4.2% boost in world knowledge, a 6.6% increase in generative performance, and a remarkable 12x enhancement in code-related tasks compared to models trained solely on text. 

However, for the best results in code-specific tasks, the "balanced-only" approach, which maintained an equal mix of text and code throughout the entire pre-training process, proved most effective. This model achieved a 20% relative gain in code benchmarks over the next best model, though it slightly lagged behind the "balanced→text" model in natural language performance. 
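
One way to think about these recipes is as a phase schedule over the training token budget. The sketch below encodes the "balanced→text" idea as a simple list of phases; the phase names, token shares, and code fractions are placeholder numbers for illustration, not the paper's exact settings:

```python
# Hypothetical "balanced -> text -> cooldown" schedule.
# Each phase: (name, share of total training tokens, fraction of code data)
BALANCED_THEN_TEXT = [
    ("balanced_pretraining", 0.50, 0.50),  # equal mix of text and code
    ("continued_text",       0.40, 0.10),  # mostly text in continued pre-training
    ("cooldown",             0.10, 0.30),  # cooldown reintroduces some code
]

def code_fraction_at(progress, schedule=BALANCED_THEN_TEXT):
    """Look up the code fraction at a given point in training, progress in [0, 1]."""
    elapsed = 0.0
    for _, share, code_frac in schedule:
        elapsed += share
        if progress <= elapsed:
            return code_frac
    return schedule[-1][2]
```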

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Zhou et al. [Meta, Waymo, University of Southern California]

♥ 325   Multimodal LLM

Introduction to Transfusion

Multi-modal AI models struggle to effectively handle both discrete data (like text) and continuous data (like images). Traditionally, language models excel at processing text through next-token prediction, while diffusion models are the go-to choice for generating images. However, combining these approaches into a single model that can seamlessly manage both text and image data is difficult.

Previous methods, such as quantizing images into discrete tokens for language models, simplify the model's architecture but at the cost of losing crucial information, which results in poor performance.

The paper introduces "Transfusion," a new method that aims to fully integrate discrete and continuous modalities into a single transformer model. Transfusion combines the strengths of language modeling and diffusion techniques by training on both text and image data simultaneously, using next-token prediction for text and diffusion for images. This approach allows the model to learn from and generate both types of data without losing information.

Inner-Workings of Transfusion

The Transfusion architecture is designed to handle both text and images in a unified model to effectively integrate different types of data. The key to its functionality lies in using lightweight, modality-specific components tailored to process either text or images.

For text, the architecture uses embedding matrices that convert input text into vector representations and output vectors into a probability distribution over the vocabulary.

For images, the model compresses small patches of image data into single vectors that the transformer can process. It does this in two ways: either through a simple linear layer or through the up- and down-sampling blocks of a U-Net, which allows for more detailed image processing.
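
As a rough illustration of the simpler option, here is a sketch that cuts an image (or its latent) into patches and projects each one into the transformer's hidden size with a single linear layer (the channel count, patch size, and class name are assumptions for illustration):

```python
import torch
import torch.nn as nn

class LinearPatchEmbed(nn.Module):
    """Flatten p x p patches of an image (or latent) into transformer vectors."""
    def __init__(self, in_channels=8, patch_size=2, hidden_size=768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(in_channels * patch_size * patch_size, hidden_size)

    def forward(self, x):                      # x: (batch, C, H, W)
        p = self.patch_size
        b, c, h, w = x.shape
        # Split into non-overlapping p x p patches and flatten each one
        x = x.unfold(2, p, p).unfold(3, p, p)  # (b, c, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        return self.proj(x)                    # (batch, num_patches, hidden_size)
```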

The attention mechanism in Transfusion is also uniquely tailored to manage both text and images. In typical language models, causal masking ensures that each word is predicted based only on the words that come before it, which suits the sequential nature of text. But images require a different approach because their elements (image patches) are not naturally sequential. Transfusion fixes this by combining both causal attention for the text and bidirectional attention within each image, allowing image patches to interact freely with each other while maintaining the sequential processing of text. 
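
Here is a hedged sketch of how such a mask could be built: start from a standard causal mask, then open up full attention inside each image's span of patch positions (the span format and boolean convention are our own assumptions):

```python
import torch

def transfusion_style_mask(seq_len, image_spans):
    """Causal mask overall, but bidirectional inside each image span.

    image_spans: list of (start, end) index pairs covering image patches.
    Returns a (seq_len, seq_len) boolean mask where True = may attend.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    for start, end in image_spans:
        # Patches within the same image can attend to each other freely
        mask[start:end, start:end] = True
    return mask

# Usage sketch: a 10-token sequence with one image occupying positions 3..6
# mask = transfusion_style_mask(10, [(3, 7)])
```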

When it comes to training, Transfusion uses a dual-objective approach. For text, it applies a standard language modeling loss, predicting the next token in a sequence. For images, it uses a diffusion-based loss, where noise is added to the image data, and the model learns to denoise it.

The losses from both tasks are combined with a balancing coefficient, ensuring that the model optimizes for both text and image tasks simultaneously. During inference, the model switches between language modeling and diffusion modes depending on the input.

For example, when generating text, it follows the standard token-by-token sampling process. But when an image is to be generated, the model shifts to diffusion mode, where it iteratively denoises a sequence of image patches. This flexible decoding algorithm allows Transfusion to seamlessly generate any combination of text and images, making it a versatile tool for multi-modal AI tasks.
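
To make the dual-objective training a little more concrete, here is a minimal sketch of how the two losses might be combined; the function and argument names are placeholders, and `balance` stands in for the balancing coefficient mentioned above:

```python
import torch.nn.functional as F

def transfusion_style_loss(text_logits, text_targets, noise_pred, noise, balance=1.0):
    # Language-modeling term: next-token cross-entropy on the text positions
    lm_loss = F.cross_entropy(text_logits.view(-1, text_logits.size(-1)),
                              text_targets.view(-1))
    # Diffusion term: predict the noise that was added to the image patches
    diffusion_loss = F.mse_loss(noise_pred, noise)
    # Single objective, weighted by the balancing coefficient
    return lm_loss + balance * diffusion_loss
```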

Real-World Implications of Transfusion

The researchers did some deep dives into how the Transfusion model works and found a few things that make a big difference in how well it handles both text and images. One of the key discoveries was about how the model pays attention to different parts of an image. Normally, models use causal attention for text, which means they process words in order. But images aren’t like that – they’re more free-flowing.

So, the researchers added "bidirectional attention" for images which lets all parts of the image talk to each other. This tweak really boosted the model’s ability to generate high-quality images, proving that giving image parts the freedom to communicate is super important.

Editing parts of an image with only a text prompt.

They also experimented with the size of the image patches the model uses. Bigger patches make the model run faster and more efficiently, but they can also make the results a little less accurate. Interestingly, while using bigger patches made the text performance drop a bit, it actually helped the model do better with images, especially when they used the more complex U-Net blocks for handling these patches.

This shows that while you can get away with simpler methods for some tasks, adding a bit more complexity, like the U-Net, can give the model a nice performance boost. 

Benchmark results of Transfusion

Enjoyed reading this issue? You might also enjoy the following AI papers:

bycloud’s new upload:
