
Mixture of Million Experts, Binary LLMs, and Test Time Training

#14 | Latest AI Research Explained Simply

In this issue: x3 industry news, x3 AI research papers

July 8th ~ 14th

🗞️ Industry News in 1 Line

  1. ♥ 2k A new attention mechanism, FlashAttention-3, has been introduced, which comes with groundbreaking techniques to leverage asynchrony and low precision (using fewer bits to represent floating-point numbers). It can achieve up to 1.2 PFLOPS on H100 GPUs, making it 1.5-2.0x faster than its predecessor.

  2. ♥ 839 A Chinese AI company, SenseTime, has launched SenseNova 5.5, a state-of-the-art LLM which is supposedly competing head-to-head with GPT-4o on key benchmarks.

  3. ♥ 639 If you use the torch.compile module to speed up your PyTorch code, then you should definitely check out this handy guide/tutorial (in a Google Doc!?) which covers many basic use cases that are not covered in the documentation. It is written by Edward Z. Yang.

Fine-Tune Florence-2 for Object Detection

Open-sourced by Microsoft under the MIT license

Florence-2: An Open Source Lightweight Vision-Language Model

This model demonstrates incredible zero-shot and strong fine-tuning capabilities across various tasks such as:

  • Image Captioning

  • Object Detection

  • Phrase Grounding

  • Image Segmentation

Like other pre-trained foundational models, Florence-2 may lack domain-specific knowledge. This free tutorial will show you how to fine-tune Florence-2 using open-source tools on object detection datasets to improve model performance for your specific use case and outperform the YOLO series.

FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation

Ma et al. [Mohamed bin Zayed University of AI, Carnegie Mellon University]

♥ 347   Binary LLM

Structure of a Fully Binarized Large Language Model

Introduction to FBI-LLM

As LLMs grow in size and capability, they require substantial resources: these models have a large number of parameters (often billions) stored in high precision (32-bit floating point), which leads to intensive computational demands during inference and high energy consumption.

Previous attempts to address this through quantization and binarization have faced limitations: methods based on pruning or retaining salient parameters can lose important information, and approaches that continue training from a full-precision model limit flexibility in model architecture and vocabulary size.

Fully Binarized Large Language Model (FBI-LLM) introduces a method to train a large-scale binary language model from scratch, where each parameter is restricted to {-1, 1}. Unlike previous approaches that start from a pretrained full-precision model, this method trains binarized LLMs from random initialization, which allows for more flexibility in model architecture and vocabulary size.
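As a rough back-of-the-envelope illustration (not a figure from the paper): storing 7B parameters at 32-bit precision takes about 7 × 10⁹ × 4 bytes ≈ 28 GB, while 1-bit weights take roughly 7 × 10⁹ / 8 bytes ≈ 0.9 GB, a ~32x reduction in weight storage before accounting for the few components FBI-LLM keeps in full precision.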

Architecture of FBI-LLM

The FBI-LLM architecture is designed to create a highly efficient language model by replacing most of the full-precision parameters with binary values. It maintains the general structure of transformer-based Large Language Models. The main difference lies in how it handles the parameters within various modules.

  1. FBI-Linear Modules: The core innovation is the replacement of standard linear modules with FBI-Linear (Fully BInarized Linear) modules. In these modules, the main weight matrix consists only of 1 and -1 values, drastically reducing the memory footprint.

  2. Preserved Full-Precision Components:

    • Causal Head: The final layer that predicts the next token remains in full precision, which is important for maintaining accurate output probabilities.

    • Embedding Layer: This is kept as a full-precision layer to preserve rich initial representations of input tokens.

    • Layer Normalization: This step remains in full precision to effectively scale activation values between layers.

  3. Binarization Process: During training, the model starts with full-precision weights; these are then binarized using a sign function, which converts positive values to 1 and non-positive values to -1 (see the sketch after this list).

  4. Training Procedure: The model uses an Autoregressive Distillation (AD) approach to train a student model (FBI-LLM) using a full-precision pre-trained LLM that serves as a teacher model.

  5. Gradient Flow: To handle the non-differentiable nature of the binarization function, the model uses a Straight-Through Estimator (STE) during backpropagation. This allows gradients to flow through the binary layers, enabling effective training.
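To make the FBI-Linear binarization and the straight-through estimator concrete, here is a minimal PyTorch-style sketch. It is an illustration rather than the paper's implementation: the learnable per-output scale and bias (alpha, beta) follow common binarization practice, and the initialization details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FBILinearSketch(nn.Module):
    """Minimal sketch of a fully binarized linear layer (illustrative only).

    Latent weights stay in full precision during training; the forward pass
    uses only their signs (+1 / -1). A straight-through estimator (STE)
    lets gradients flow through the non-differentiable binarization.
    """
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.alpha = nn.Parameter(torch.ones(out_features))   # full-precision scale (assumed)
        self.beta = nn.Parameter(torch.zeros(out_features))   # full-precision bias (assumed)

    def forward(self, x):
        w = self.weight
        # Sign binarization: positive values -> +1, non-positive values -> -1.
        w_bin = torch.where(w > 0, torch.ones_like(w), -torch.ones_like(w))
        # STE: the forward pass uses w_bin, while the backward pass treats the
        # binarization as the identity so gradients reach the latent weights.
        w_ste = w + (w_bin - w).detach()
        return F.linear(x, w_ste) * self.alpha + self.beta
```

At inference time only the weight signs (plus the small full-precision scale and bias vectors) need to be stored, which is where the memory savings come from.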

Results and Real-World Implications of FBI-LLM

The paper shows that the FBI-LLM produces impressive results across different model sizes (130M, 1.3B, and 7B parameters) while maintaining the lowest average bit-width for parameters. The FBI-LLM 1.3B achieves up to 87% of the performance of similar-scale full-precision models in downstream tasks.

Contrary to initial expectations, training from scratch shows comparable or even better stability than continuing from a pre-trained LLM. The flip-flop ratio, training loss, and gradient norm analyses suggest that binarization is not sensitive to parameter initialization.

These results suggest that FBI-LLM could be a promising approach for developing more efficient and deployable large language models, particularly in resource-constrained environments.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Sun et al. [Stanford University, UC San Diego, UC Berkeley, Meta AI] 

♥ 1.6k   ML Theory

Test-Time Training vs RNN

Introduction to Test-Time Training

While recent advancements have improved RNN performance, RNNs still struggle to effectively utilize long context sequences, unlike Transformers. Transformers, on the other hand, have quadratic complexity, making them inefficient for very long sequences.

This paper introduces a sequence modeling technique that aims to combine the efficiency of RNNs with the long-context performance of Transformers. By framing the hidden state update as a self-supervised learning process, Test-Time Training (TTT) layers offer a new way to handle long-range dependencies in sequence models which could potentially address the limitations of both RNNs and Transformers.


What is Test-Time Training (TTT) in LLMs?

The Test-Time Training (TTT) approach was designed to be an efficient way to handle sequence modeling tasks, like language modeling. TTT views processing a sequence as compressing historical context into a hidden state. Unlike traditional RNN layers, which have a fixed-size hidden state, or self-attention, whose key-value cache grows linearly with sequence length, TTT uses a learning-based approach to compress information. The TTT architecture has the following main components:

  1. Inner Model: This can be a simple linear model or a small neural network such as an MLP.

  2. Self-Supervised Task: The inner model is trained on a task derived from the input sequence itself.

  3. Update Rule: This describes how the inner model is updated (typically using gradient descent).

  4. Output Rule: This describes how the current token is processed using the updated inner model.


Test-Time Training architecture.

TTT treats the context (previous tokens) as an unlabeled dataset and uses it to train a small model (called the inner model) in real time as the sequence is processed. The weights of this inner model become the hidden state. For each token in the sequence, the token is first used to update the inner model (training step); the updated model is then used to process the same token (inference step). This repeats for every token, continuously adapting the model.
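A minimal, illustrative version of this per-token loop might look like the sketch below. It uses a toy reconstruction objective and a plain linear inner model; the paper's actual self-supervised task, learnable projections, and mini-batched updates are omitted, so the loss and names here are assumptions.

```python
import torch

def ttt_layer_sketch(tokens, lr=0.1):
    """Toy Test-Time Training loop (illustration, not the paper's TTT-Linear).

    The hidden state is the weight matrix W of a tiny linear inner model.
    Each token first trains W with one gradient step on a self-supervised
    loss, then is processed by the updated W to produce the output.
    """
    d = tokens.shape[-1]
    W = torch.zeros(d, d)                      # W0: initial hidden state
    outputs = []
    for x in tokens:                           # x: (d,) token embedding
        # Training step: one gradient step on a toy self-supervised loss.
        W = W.detach().requires_grad_(True)
        loss = ((x @ W) - x).pow(2).mean()     # placeholder reconstruction task
        (grad,) = torch.autograd.grad(loss, W)
        W = (W - lr * grad).detach()
        # Inference step: the output rule uses the just-updated inner model.
        outputs.append(x @ W)                  # z_t
    return torch.stack(outputs), W             # outputs z_1..z_b, final state W_b
```

In the paper, tokens are grouped into TTT mini-batches so these inner updates can be parallelized on hardware, which is what the computation graph described next illustrates.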

The above image shows a high-level computation graph for the first Test-Time Training (TTT) mini-batch. The graph consists of nodes representing variables and edges representing computations between them. The blue nodes are input variables (W0 is an initial state, while x1 to xb represent a sequence of input tokens). The z nodes represent processed outputs for each input x, while Wb is the final state after processing the batch.


Pseudocode of Test-Time Training algorithm.

Results and Evaluation

This paper looked at how well different types of language models work with long contexts. They tested TTT, Mamba, and Transformer models on contexts ranging from 1,000 to 32,000 tokens. The TTT models, especially TTT-Linear and TTT-MLP, did really well. On very long contexts (32,000 tokens), TTT models beat Mamba, which is impressive. When the training FLOPs are matched, TTT outperforms both Transformers and Mamba.

They also found that the way TTT models are configured can make a big difference. Interestingly, a TTT variant that uses a backbone similar to older (Transformer-style) models showed promise for even bigger models and longer contexts. The paper also pointed out that the context length a model handles is a hyperparameter that can be tuned to get the best results, just like other settings in these models.

Mixture of A Million Experts

Xu Owen He [Google DeepMind]

♥ 922   MoE LLM

Inner architecture of Parameter Efficient Expert Retrieval Mechanism

Introduction to Parameter Efficient Expert Retrieval (PEER)

As transformer models grow larger, the computational costs and memory usage of Feed Forward (FFW) layers increase linearly with their size, which makes it challenging to scale models efficiently. While MoE architectures have been used to address this scaling issue, they are typically limited to a small number of experts due to computational and optimization challenges. Moreover, existing models can’t adapt to new data without forgetting old data, which potentially calls for an ever-growing pool of experts.

This paper introduces the Parameter Efficient Expert Retrieval (PEER) architecture, which uses an extremely large number (over a million) of very small experts instead of a small number of large experts. This approach aims to provide a more efficient way to scale transformer models, unlock better performance-compute trade-offs, and enable more effective lifelong learning in language models.

What is Parameter Efficient Expert Retrieval (PEER)?

The Parameter Efficient Expert Retrieval (PEER) layer improves neural network efficiency by using a Mixture of Experts (MoE) framework. It has three main parts:

  1. Experts: A large pool of simple neural networks, each with only one neuron.

  2. Product Keys: Special vectors used to select the most relevant experts.

  3. Query Network: Transforms input data into a query vector to find the best experts.

The PEER layer processes input data through the following steps:

  1. Expert Retrieval: For a given input, the query network generates a query vector, which is compared against the product keys using inner products to determine the most relevant experts; the top-k experts with the highest inner-product values are selected.

  2. Router Scores: The inner products of the top-k experts are passed through a non-linear activation function (e.g., softmax or sigmoid) to produce router scores.

  3. Expert Output Aggregation: The outputs of the selected experts are linearly combined, weighted by their respective router scores, to produce the final output.

Instead of scoring every expert, the product keys are split into smaller sub-keys, which makes it fast to find the best experts and reduces retrieval complexity. Using many small experts (single-neuron networks) allows the model to scale up without using too much memory, making it both efficient and powerful. Multiple query networks are used to enhance expressiveness, and their results are combined for better performance; the efficient computation and memory use make PEER well suited for large-scale neural networks.
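The sketch below shows one way this retrieval and aggregation could look for a single token. The class name, dimensions, and initialization are illustrative assumptions, and it materializes the full n x n score grid for clarity, whereas a real product-key implementation prunes candidates to stay sublinear in the number of experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEERSketch(nn.Module):
    """Rough sketch of a PEER-style layer (illustrative assumptions only).

    There are n * n single-neuron experts. Product keys split the query in
    half so only 2n sub-keys are scored instead of n * n full keys.
    """
    def __init__(self, d_model=256, d_key=64, n=32, top_k=16):
        super().__init__()
        self.top_k = top_k
        num_experts = n * n
        self.query = nn.Linear(d_model, d_key)                      # query network
        self.keys1 = nn.Parameter(torch.randn(n, d_key // 2) * 0.02)
        self.keys2 = nn.Parameter(torch.randn(n, d_key // 2) * 0.02)
        # Each expert is a single neuron: one down row and one up row.
        self.w_down = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        self.w_up = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)

    def forward(self, x):                                           # x: (d_model,)
        q1, q2 = self.query(x).chunk(2)
        s1, s2 = self.keys1 @ q1, self.keys2 @ q2                   # (n,), (n,)
        # Candidate score for expert (i, j) is s1[i] + s2[j]; take the top-k.
        scores = (s1[:, None] + s2[None, :]).flatten()
        top_scores, top_idx = scores.topk(self.top_k)
        router = F.softmax(top_scores, dim=-1)                      # router scores
        h = torch.relu(self.w_down[top_idx] @ x)                    # (top_k,) expert activations
        return (router * h) @ self.w_up[top_idx]                    # weighted sum of expert outputs
```

Because each expert contributes only two d_model-sized vectors, the expert pool can grow into the millions while each forward pass touches just top_k of them.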

Evaluating Parameter Efficient Expert Retrieval (PEER) Models

In the lower-compute-budget tests, PEER achieved the lowest perplexity scores across all datasets, indicating better language modeling performance. MoE models also performed well but still lagged behind PEER, whereas dense transformers had the highest perplexity scores, indicating less effective language modeling than the other methods. Similarly, in the high-compute-budget tests, PEER led with the lowest perplexity scores, while dense transformers remained the least effective.

The results clearly show that the PEER architecture significantly outperforms dense transformers, coarse-grained MoEs, and product key memory layers across various language modeling tasks and compute budgets. The fine-grained MoE approach, combined with efficient expert selection via product keys, enables PEER to achieve lower perplexity scores which could indicate better language understanding.
