BitNet a4.8, LoRA vs Full Fine-tuning, and Mixtures of In-Context Learners
#31 | Latest AI Research Explained Simply
In this issue: 3x industry news, 3x AI research papers
Nov 4th ~ Nov 10th
🗞️ Industry News in 1 Line
♥ 1.3k FLUX1.1 has released a new update which offers high-resolution image generation up to 4MP at 10 seconds per sample, as well as a new "raw mode" for a more natural, authentic aesthetic, all at a competitive price of $0.06 per image.
♥ 1k A new version of Claude 3.5 Haiku is fast and particularly strong at coding, outperforming state-of-the-art models on SWE-bench Verified, which measures how well models solve real software issues. During testing, Haiku surpassed Claude 3 Opus, Anthropic's previous flagship model, on many benchmarks at a fraction of the cost. As a result, Anthropic has raised the price of Claude 3.5 Haiku to reflect its increased intelligence, a controversial decision.
♥ 1.5k Qwen has released a number of coder models, and most of the model weights are released under Apache 2.0. Many of these smaller models demonstrate SOTA performance at their model size and can perform complex tasks such as code generation, code repair, and code reasoning.
Mixtures of In-Context Learners
Hong et al. [University of Edinburgh, Miniml.AI]
♥ 529 LLM MoE bycloud’s pick
Introduction to Mixtures of In-Context Learners
There is a fundamental limitation of traditional in-context learning (ICL) with large language models: as you add more examples to the context window, you face quadratic increases in computational complexity and memory usage. This paper proposes MOICL (Mixtures of In-Context Learners) as a solution, which cleverly partitions demonstrations into smaller subsets that act as "experts" and then learns a weighting function to combine their predictions.
This approach not only reduces computational overhead by distributing demonstrations across multiple smaller contexts, but also intelligently weights the contribution of different demonstration subsets, making it more robust to noisy, out-of-distribution, or imbalanced examples.
Understanding Mixtures of In-Context Learners
MOICL starts by taking a set of example demonstrations and splitting them into several smaller groups (called experts). Instead of feeding all demonstrations into the language model at once, which would be computationally expensive, each group of demonstrations is processed separately. Then comes the clever part - the method learns how much to "trust" or "weight" each group's predictions through a weighting function. This weighting function can either be a simple set of numbers (one for each group) or a more sophisticated neural network that looks at all the demonstration groups and decides how much to trust each one.
Analysis of selecting useful demonstrations with the proposed MOICL on the TweetEval Offensive test set on Llama-3-8b-Instruct.
To make the process more efficient, they also introduce a way to only use predictions from the most relevant groups rather than all of them. They do this by adding a mechanism that can automatically select just the top few most important groups for any given input, which helps reduce computational costs while maintaining good performance.
The whole system learns these weights through training - when it makes predictions that turn out to be correct, it adjusts the weights to rely more on the groups that helped make those good predictions. The opposite happens when it makes mistakes, gradually learning which demonstration groups are most reliable for different types of inputs.
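To make this concrete, here is a minimal sketch of the weighting idea in PyTorch. It assumes each "expert" (demonstration subset) has already produced a probability distribution over labels for a query; the names (`expert_probs`, `logits`) and the softmax parameterization are illustrative rather than the authors' code, and the paper also studies unconstrained weights that can go negative (the anti-experts mentioned below).

```python
import torch

# Minimal sketch of MOICL-style expert weighting (illustrative, not the authors' code).
k, num_labels = 10, 2

# Pretend each of the k demonstration subsets ("experts") has already been run
# through the LLM for one training query and returned a distribution over labels.
expert_probs = torch.rand(k, num_labels)
expert_probs = expert_probs / expert_probs.sum(dim=-1, keepdim=True)
target = torch.tensor(1)  # gold label for this query

# One learnable scalar weight per expert (the simplest weighting function;
# the paper also considers a small network over the demonstration subsets).
logits = torch.zeros(k, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)

for _ in range(100):
    weights = torch.softmax(logits, dim=0)                 # mixture weights over experts
    mixed = (weights[:, None] * expert_probs).sum(dim=0)   # weighted vote of the experts
    loss = -torch.log(mixed[target] + 1e-9)                # NLL of the gold label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))  # experts whose subsets helped get larger weights
```

The top-k selection described above would then correspond to keeping only the experts with the largest learned weights before mixing their predictions.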
Evaluating Mixtures of In-Context Learners
During evaluation, MOICL performed better than standard in-context learning on 5 out of 7 classification tasks. This method worked best when splitting demonstrations into 10 or 30 groups. Additionally, using negative weights (anti-experts) improved performance significantly.
Comparison between baseline methods and the proposed Mixture of In-Context Learners across classification tasks using Llama-3-8B-Instruct.
When handling problematic data, MOICL maintained good performance even when 70% of the examples came from an unrelated dataset. The method also worked well across different model sizes (7B, 13B, and 70B parameters), and that's not all: even a small neural network (16M parameters) used to compute the weights produced good results.
BitNet a4.8: 4-bit Activations for 1-bit LLMs
Wang et al. [Microsoft Research, University of Chinese Academy of Sciences]
♥ 371 LLM Compression
The overview of BitNet a4.8 with both weight and activation quantization.
Introduction to BitNet a4.8
A couple of weeks back we discussed BitNet, but it still carries computational overhead. While these 1-bit models have already reduced memory requirements by using ternary (1.58-bit) weights, they still face high computational costs during inference, especially due to how they handle activations (the intermediate values during computation). This paper proposes BitNet a4.8, which uses a hybrid approach: 4-bit quantization for certain parts of the model (attention and feed-forward network inputs) combined with selective sparsification and 8-bit quantization for other parts, keeping only 55% of parameters active.
This clever combination allows them to maintain the same performance as the original 1-bit model while significantly improving inference speed and efficiency, particularly by enabling the use of 4-bit computation kernels and supporting 3-bit KV cache for memory efficiency.
The distribution of the inputs to each projection.
Architecture of BitNet a4.8
BitNet a4.8 is designed to be highly efficient by using very low-precision numbers for both its weights (parameters) and activations (intermediate calculations). It follows the same layout as its predecessor (BitNet b1.58) but it uses special "BitLinear" layers instead of regular linear layers in both its attention mechanism and feed-forward networks.
Smart Data Handling: The model uses different quantization strategies for different parts, based on how the activations are typically distributed (a rough sketch follows this list):
- For most basic calculations (attention and feed-forward inputs), it uses 4-bit precision, since these values follow a nice bell-curve pattern without many extreme outliers.
- For specific intermediate calculations (such as FFN down-projection inputs and attention outputs), it combines 8-bit precision with sparsification: only the most important (largest-magnitude) values are kept and the rest are set to zero, because these tensors have more extreme values that need more precise handling.
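As a rough sketch of these two activation paths, assuming simple absmax-style quantizers and a top-k mask for the sparsification (the exact scaling rules, keep ratio, and thresholds in BitNet a4.8 differ):

```python
import torch

def fake_quant_int(x, bits):
    # Symmetric absmax quantization to `bits`-bit integers, dequantized back for clarity.
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-5) / qmax
    return (x / scale).round().clamp(-qmax, qmax) * scale

x_attn_in = torch.randn(1024)                              # attention/FFN inputs: roughly Gaussian, few outliers
x_attn_out = torch.randn(1024) * (torch.rand(1024) ** 4)   # stand-in for an outlier-heavy intermediate state

# Path 1: plain 4-bit quantization for the well-behaved inputs.
x4 = fake_quant_int(x_attn_in, bits=4)

# Path 2: sparsify, then quantize to 8 bits, for the outlier-heavy intermediate states:
# keep only the largest-magnitude entries, zero the rest (the 50% keep ratio is illustrative).
keep = int(0.5 * x_attn_out.numel())
mask = torch.zeros_like(x_attn_out, dtype=torch.bool)
mask[x_attn_out.abs().topk(keep).indices] = True
x8_sparse = fake_quant_int(x_attn_out * mask, bits=8)
```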
Special Feed-Forward Network (FFN) Design: It uses a "squared ReLU" activation with a gating mechanism, which creates lots of exact zero values (over 80% in some cases). The model takes advantage of these zeros to skip unnecessary calculations, as in the small illustration below.
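A tiny illustration of how a squared-ReLU gate produces exact zeros that downstream kernels can skip; the layer sizes and random inputs here are arbitrary, and the real model applies this inside its quantized BitLinear FFN:

```python
import torch

x = torch.randn(8, 64)
w_gate, w_up = torch.randn(64, 256), torch.randn(64, 256)

# Squared-ReLU gating: every position the ReLU zeroes out stays exactly zero in the
# intermediate state, so those positions can be skipped entirely.
hidden = torch.relu(x @ w_gate) ** 2 * (x @ w_up)
print(f"exact zeros: {(hidden == 0).float().mean():.0%}")  # ~50% on random inputs; BitNet a4.8 reports >80%
```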
Training Process: Training starts with 8-bit precision for all activations and gradually moves to 4-bit precision where possible. The non-differentiable quantization steps are handled with the straight-through estimator (STE), and the model keeps a full-precision copy of the weights during training while using low precision at inference time.
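A common way to implement the straight-through estimator is the "detach trick" below: the forward pass sees quantized values while gradients flow unchanged to the full-precision copy. This is a generic sketch, not the exact BitNet a4.8 recipe:

```python
import torch

def quantize_ste(x, bits=4):
    # Fake-quantize with a straight-through estimator: forward uses the quantized
    # values, backward passes gradients through unchanged to the full-precision tensor.
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-5) / qmax
    x_q = (x / scale).round().clamp(-qmax, qmax) * scale
    return x + (x_q - x).detach()

w = torch.randn(16, 16, requires_grad=True)  # full-precision "master" weights kept during training
loss = quantize_ste(w).sum()
loss.backward()
print(w.grad.abs().mean())  # gradients still reach the full-precision weights
```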
Special Floating-Point Version: This variant uses floating-point numbers instead of integers, which helps it handle extreme values better. Most calculations use 4-bit floating-point numbers, with 8-bit precision only where absolutely necessary, providing a good balance between precision and efficiency.
The distribution of the inputs to the output projection of attention with different quantization and sparsification.
The key innovation here is how the model intelligently uses different levels of precision for different parts based on what those parts typically need, rather than using the same precision everywhere. This makes it both efficient and effective.
Evaluating BitNet a4.8
BitNet a4.8 shows that large language models can be made much more efficient while maintaining similar performance to their full-precision counterparts like LLaMA. The model achieves this by using very low precision (1.58-bit) weights and a mix of 4-bit and 8-bit precision for calculations, with the larger versions (7B parameters) showing almost no performance drop compared to full-precision models.
This model has high sparsity - in the 7B version, only 3.4B parameters are actively used, achieving 44.5% overall sparsity through clever use of zero values and selective computation. Additionally, the model demonstrates that attention mechanisms can run effectively with just 3-4 bits of precision, which is particularly important for processing long sequences efficiently.
Perplexity and results of BitNet a4.8, BitNet b1.58 and LLaMA LLM on the end tasks. The standard variance of error for average scores is 1.06%.
LoRA vs Full Fine-tuning: An Illusion of Equivalence
Shuttleworth et al. [MIT CSAIL]
♥ 1.2k LLM LoRA
Characterizing structural differences between solutions learnt by LoRA vs. full fine-tuning.
Comparing LoRA and Full Fine-tuning in LLMs
There are two main ways to tweak an LLM: LoRA or full fine-tuning. This paper addresses a fundamental question about parameter-efficient fine-tuning methods like LoRA, which can match the performance of full fine-tuning while training far fewer parameters.
While LoRA achieves similar performance metrics on target tasks as full fine-tuning, the researchers discovered that the two methods actually produce structurally different solutions - LoRA creates what they call "intruder dimensions" (high-ranking singular vectors orthogonal to the pre-trained model's structure) which don't appear in full fine-tuning.
These structural differences lead to practical implications: LoRA-tuned models with intruder dimensions tend to forget more of their pre-training knowledge and handle sequential learning tasks less robustly than fully fine-tuned models, though this can be mitigated by using higher-rank LoRA with rank stabilization. This research helps explain why LoRA sometimes struggles with complex tasks like code generation and long-form text generation, while also suggesting ways to improve LoRA's effectiveness.
Which Is Better: LoRA or Full Fine-tuning?
There are two main ways to adapt (or "fine-tune") large language models for specific tasks: update all of the model's parameters (full fine-tuning), or use a more efficient method called LoRA that trains only a small set of additional low-rank parameters. While both methods can achieve similar performance on the target task, this research reveals they work in fundamentally different ways.
Different Internal Structure
The researchers found that LoRA creates what they call "intruder dimensions" - new patterns in the model that are very different from the original pre-trained patterns. Full fine-tuning, on the other hand, mostly makes small adjustments to existing patterns.
Think of it like renovating a house: full fine-tuning makes careful modifications to existing rooms, while LoRA builds completely new rooms that don't match the original architecture.
Impact of cosine similarity threshold ϵ on the number of intruder dimensions.
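As a rough sketch of how intruder dimensions can be counted under this definition, using random stand-in matrices; the threshold ϵ, the number of top singular vectors checked, and the update magnitudes are all illustrative:

```python
import torch

def count_intruder_dims(w_pre, w_ft, top_k=10, eps=0.5):
    # Count top singular vectors of the fine-tuned matrix whose maximum cosine
    # similarity to *any* pre-trained singular vector falls below eps
    # (an illustrative version of the paper's intruder-dimension criterion).
    u_pre, _, _ = torch.linalg.svd(w_pre, full_matrices=False)
    u_ft, _, _ = torch.linalg.svd(w_ft, full_matrices=False)
    sims = (u_ft[:, :top_k].T @ u_pre).abs()   # columns are unit vectors, so these are cosines
    return int((sims.max(dim=1).values < eps).sum())

w_pre = torch.randn(256, 256)                               # stand-in "pre-trained" weight matrix
w_full = w_pre + 0.001 * torch.randn(256, 256)              # small dense update (full fine-tuning stand-in)
w_lora = w_pre + torch.randn(256, 8) @ torch.randn(8, 256)  # low-rank update (LoRA stand-in; magnitude exaggerated)

print(count_intruder_dims(w_pre, w_full), count_intruder_dims(w_pre, w_lora))
# typically 0 for the dense update vs. several for the low-rank one
```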
Impact on Model Behavior
Models fine-tuned with LoRA tend to "forget" more of their original pre-training knowledge. They also handle learning multiple tasks in sequence less effectively than fully fine-tuned models. Using the house analogy: the new rooms (intruder dimensions) built by LoRA might interfere with the flow and functionality of the original house.
Solutions and Improvements
Using a higher-rank LoRA (essentially allowing it to make more complex changes) can help reduce these problems. When LoRA is allowed to make more sophisticated adjustments, it starts behaving more like full fine-tuning. However, even with these improvements, LoRA still makes fundamentally different changes to the model compared to full fine-tuning.
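For intuition on the scaling side of this fix, here is a toy comparison of standard LoRA scaling (α/r) against rank-stabilized scaling (α/√r). The shapes, scales, and random matrices are purely illustrative; in real LoRA, B starts at zero and both factors are learned:

```python
import torch

def lora_delta(rank, alpha=16, d=256, rank_stabilized=False):
    # Illustrative LoRA update B @ A, scaled by alpha/r (standard LoRA) or
    # alpha/sqrt(r) (rank-stabilized LoRA). Random entries of a fixed scale are
    # used only to show how the scaling behaves as the rank grows.
    a = 0.01 * torch.randn(rank, d)
    b = 0.01 * torch.randn(d, rank)
    scale = alpha / (rank ** 0.5 if rank_stabilized else rank)
    return scale * (b @ a)

# Standard scaling makes the update fade as rank increases; rank stabilization
# keeps its magnitude roughly constant, so higher-rank LoRA actually uses its capacity.
for r in (4, 16, 64):
    print(r,
          round(lora_delta(r).norm().item(), 3),
          round(lora_delta(r, rank_stabilized=True).norm().item(), 3))
```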
Practical Implications of LoRA
This study explains why LoRA sometimes struggles with complex tasks like coding or long-form writing. It suggests that while LoRA is more efficient, it might not always be the best choice for every task, and it provides guidance on how to improve LoRA by using higher ranks and proper scaling.
The researchers demonstrated these findings across different model sizes and tasks, showing that this is a consistent pattern rather than a one-off observation. This research is important because it helps us understand the trade-offs between efficiency (LoRA) and comprehensiveness (full fine-tuning), allowing developers to make better choices about which method to use for different applications.
The key takeaway for non-experts is that while LoRA is a more efficient way to customize AI models, it makes more dramatic changes to the model's internal structure, which can lead to some limitations. Understanding these differences helps developers choose the right method for their specific needs and potentially improve LoRA's performance in the future.
🚨This week’s top AI/ML research papers:
- Mixture-of-Transformers
- BitNet a4.8
- LoRA vs Full Fine-tuning: An Illusion of Equivalence
- Mixtures of In-Context Learners
- Emergence of Hidden Capabilities
- DimensionX
- The Surprising Effectiveness of Test-Time Training for… x.com/i/web/status/1…
— The AI Timeline (@TheAITimeline)
10:06 AM • Nov 11, 2024