
Self Rewarding LLMs, Mixture of Nested Experts, Modality Aware Experts

#17 | Latest AI Research Explained Simply

In this issue: x5 industry news, x3 AI research papers

July 29th ~ Aug 6th

🗞️ Industry News in 1 Line

  1. ♥ 1.6k Google has released Gemma-2, a 2 billion parameter model which punches way above its weight. The results seem almost too good to be true: the release announcement mentions that this tiny model can surpass Llama 2, GPT-3.5, and Mixtral 8×7B.

  2. ♥ 7.2k Meta has released Segment Anything Model 2 (SAM 2), which can do real-time object segmentation in images and videos. The best part is that this model is licensed under Apache 2.0, which means anyone can use it.

  3. ♥ 1.2k A new text-to-image model called FLUX.1 was released by Black Forest Labs, and it is everything that Stable Diffusion 3 was supposed to be. It is backed by the original authors of Latent Diffusion and key researchers who departed Stability AI. It has both open-source and paid versions, with inference code available on GitHub.

  4. ♥ 895 PyTorch has released torchchat, a lightweight library for running LLMs locally on mobile, desktop, and laptop devices, powered by PyTorch.

  5. ♥ 4.7k A number of high-ranking executives at OpenAI have left the company or stepped back. The trio includes Co-Founder Greg Brockman (stepping away for a break), Co-Founder John Schulman (joining Anthropic), and Head of Product Peter Deng. (source)

Build Gen AI Apps Ready For Export in Minutes with OnDemand

Creating Gen AI Applications in OnDemand

Unlock the future of AI with OnDemand! Our cutting-edge platform offers exclusive tools and insights to streamline businesses of any size, boosting productivity and efficiency. Building AI Powered Apps is as Easy as:

  1. Find plugins on the marketplace

  2. Configure your playground environment

  3. Export your chatbot into any programming language to integrate into your coding IDE

Gain early access to groundbreaking AI solutions and stay ahead of the curve. Whether you're a startup or an established enterprise, OnDemand provides the solutions you need to succeed. Don't miss out on this opportunity to transform your business by joining right now free of charge!

Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge

Wu et al. [Meta FAIR, University of California, Berkeley, New York University]

♥ 394   LLM Post-training
Benchmark results of Self-Improving Alignment Models

Introduction to Self-Improving Alignment Models

LLMs are rapidly advancing, but their improvement relies on human-curated datasets, and creating high-quality human-labeled data is both expensive and time-consuming. Moreover, how will we judge and steer AI systems once their actions surpass human comprehension?

This paper introduces the Meta-Rewarding approach, which enables AI systems to improve without relying on additional human-labeled data. Unlike other methods that focus solely on improving response generation, this technique enhances both the model's ability to generate responses and its ability to judge them. By improving the AI's judgment capabilities, this method could contribute to solving the "Super Alignment" challenge as AI systems become more advanced.

How Does Meta-Rewarding Work?

Meta-Rewarding is an iterative self-improvement mechanism for Large Language Models (LLMs) that enhances both generative and evaluative capabilities without human supervision. The process employs a single LLM in three distinct roles: actor, judge, and meta-judge.

In each step, the actor generates multiple responses to prompts. The judge then evaluates these responses using an LLM-as-a-Judge prompt, assigning scores based on a predefined rubric. To mitigate length bias, a length-control mechanism is implemented in the selection of preferred responses.

Meta-Rewarding iterative training scheme

The meta-judge evaluates pairs of judgments made by the judge to determine which judgment is more accurate according to a given set of rules. It generates chain-of-thought reasoning and decisively selects the better judgment to create a dataset that enables the model to improve its judging capabilities.

The resulting datasets for both actor and judge are used to train the model via Direct Preference Optimization (DPO). This dual optimization approach simultaneously improves the model's ability to generate high-quality responses and to accurately evaluate them. The iterative nature of Meta-Rewarding allows for continuous self-improvement, which could potentially address the challenge of scaling AI capabilities beyond what humans can directly supervise.
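To make the three roles more concrete, here is a minimal sketch of one Meta-Rewarding iteration in Python. The helpers `generate`, `judge`, and `meta_judge`, the tie-breaking rule for length control, and the data format are simplified assumptions for illustration; the paper's actual prompts and selection rules are more involved.

```python
# Minimal sketch of one Meta-Rewarding iteration (illustration only).
# `generate`, `judge`, and `meta_judge` are assumed helpers standing in
# for the actor, judge, and meta-judge prompts described in the paper.
# Each judgement is assumed to expose a numeric .score and a rationale.

def meta_rewarding_iteration(model, prompts, n_samples=4, n_judgements=3):
    actor_pairs, judge_pairs = [], []

    for prompt in prompts:
        # Actor role: sample several candidate responses per prompt.
        responses = [generate(model, prompt) for _ in range(n_samples)]

        # Judge role: score each response a few times with an
        # LLM-as-a-Judge prompt and average the scores.
        judgements = [[judge(model, prompt, r) for _ in range(n_judgements)]
                      for r in responses]
        scores = [sum(j.score for j in js) / n_judgements for js in judgements]

        # Length-controlled selection: prefer the higher score and break
        # ties towards shorter responses to mitigate length bias (simplified).
        best = max(range(n_samples), key=lambda i: (scores[i], -len(responses[i])))
        worst = min(range(n_samples), key=lambda i: scores[i])
        if best != worst:
            actor_pairs.append({"prompt": prompt,
                                "chosen": responses[best],
                                "rejected": responses[worst]})

        # Meta-judge role: compare two judgements of the same response and
        # keep the better one, producing preference data for the judge.
        j_a, j_b = judgements[best][0], judgements[best][1]
        winner = meta_judge(model, prompt, responses[best], j_a, j_b)
        chosen, rejected = (j_a, j_b) if winner == "A" else (j_b, j_a)
        judge_pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})

    # Both preference datasets are then used to update the same model with DPO.
    return actor_pairs, judge_pairs
```

Each iteration produces a fresh pair of preference datasets, and the DPO-updated model then plays the actor, judge, and meta-judge roles for the next round.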

Prompt used for LLM Meta Judge

Evaluating the Meta-Rewarding Approach

The Meta-Rewarding approach shows a significant improvement in model performance. It substantially improved upon the Llama-3-8B-Instruct base model and outperformed both the Self-Rewarding and SPPO methods, despite not using additional human feedback. On the AlpacaEval 2 benchmark, the length-controlled win rate increased from 22.9% (base model) to 39.4% (Meta-Rewarding iteration 4), making it comparable to more advanced models like Claude Opus.

The model continued to show improvements across multiple iterations, indicating the effectiveness of the iterative self-improvement process. This suggests that self-improving models without human supervision could be a viable approach to achieve super alignment in AI systems.

It is important to note that the 5-point judging system used in this paper often leads to ties and makes it hard to see small quality differences. As training went on, scores got too close to the maximum, so it was difficult to identify improvements. A better scoring system is needed in the future to tackle these challenges.

Mixture of Nested Experts: Adaptive Processing of Visual Tokens

Jain et al. [Google DeepMind, University of Washington]

♥ 135   LLM Efficiency
Architecture of (a) Nested Model (b) Mixture of Nested Experts

Introduction to Mixture of Nested Experts

Images and videos naturally contain lots of redundant information, but current Vision Transformer (ViT) models don't do anything to address this redundancy. ViTs process all visual tokens (parts of an image or video) with equal emphasis, regardless of their importance. This leads to unnecessary computational costs and makes them difficult to deploy in scenarios with limited resources or where real-time processing is needed.

This paper aims to solve these problems using the Mixture of Nested Experts (MoNE) approach, which prioritizes tokens and processes them differently based on their importance. It uses a structure of nested experts, where each expert represents a different point on the compute-accuracy trade-off curve. Unlike traditional Mixture of Experts models, MoNE doesn't increase the overall parameter count, yet it significantly reduces inference-time computation.

How does Mixture of Nested Experts Work?

Mixture of Nested Experts (MoNE) is a framework designed to process visual information more efficiently. It divides the input image or video into smaller parts called tokens (for videos, tokens represent both spatial and temporal information), then uses a set of nested sub-models called "experts". These experts are different-sized versions of the same model; smaller experts use less computation but may be less accurate.
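One way to picture "nested" experts: the smaller experts can be realized as slices of the full model's weight matrices, so they add no extra parameters. The PyTorch sketch below is an illustrative assumption (the layer sizes, the slicing scheme, and the `NestedFFN` name are made up), not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedFFN(nn.Module):
    """Illustrative nested feed-forward block: expert i only uses the first
    hidden_dims[i] rows/columns of the shared weights, so smaller experts
    are strict sub-networks of the full model and add no parameters."""

    def __init__(self, d_model=768, hidden_dims=(384, 768, 1536, 3072)):
        super().__init__()
        self.hidden_dims = hidden_dims
        self.w_in = nn.Linear(d_model, max(hidden_dims))
        self.w_out = nn.Linear(max(hidden_dims), d_model)

    def forward(self, x, expert_idx):
        h = self.hidden_dims[expert_idx]
        # Slice the shared weights: a smaller expert does less computation.
        hidden = F.gelu(x @ self.w_in.weight[:h].T + self.w_in.bias[:h])
        return hidden @ self.w_out.weight[:, :h].T + self.w_out.bias
```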

A special network called a router looks at each token and assigns a probability to each expert. An Expert Preferential Routing algorithm then takes the router's probabilities and assigns tokens to experts while prioritizing larger experts. It also ensures that each expert processes a certain number of tokens based on a pre-defined capacity distribution.

Pseudocode of Algorithm used for Expert Preferential Routing

Each token is processed by its assigned expert and the output is then scaled based on the router's original prediction for that expert. For videos, MoNE is applied to the spatial part of the video processing and each frame's tokens are routed independently. This system allows the model to adapt to different computational budgets. It can process less important or redundant information with smaller experts, saving computations, while still having the important parts of the image or video processed with larger and more capable experts.
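Here is a rough sketch of that routing step under simplifying assumptions (the router probabilities arrive as a plain tensor, assignment is greedy from the largest expert down, and capacities are given as token counts); the paper's Expert Preferential Routing algorithm differs in its details.

```python
import torch

def expert_preferential_routing(router_probs, capacities):
    """Greedy token-to-expert assignment, filling larger experts first.

    router_probs: (num_tokens, num_experts) softmax outputs of the router,
                  with larger experts at higher indices.
    capacities:   list of token counts per expert, summing to num_tokens.
    Returns per-token expert indices plus the probabilities that later
    scale each expert's output.
    """
    num_tokens, num_experts = router_probs.shape
    assignment = torch.full((num_tokens,), -1, dtype=torch.long)

    # Walk experts from largest to smallest, greedily giving each expert
    # the unassigned tokens that prefer it the most.
    for expert in reversed(range(num_experts)):
        free = (assignment == -1).nonzero(as_tuple=True)[0]
        if len(free) == 0:
            break
        k = min(capacities[expert], len(free))
        top = torch.topk(router_probs[free, expert], k).indices
        assignment[free[top]] = expert

    return assignment, router_probs
```

Filling the larger experts first means the compute budget is spent on the tokens the router considers most important, while everything left over falls through to the cheaper nested experts.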

In these images, the highlighted regions were sent to the full model while the remaining regions were sent to a nested model.

Testing Mixture of Nested Experts

Tests show that MoNE can reduce computational costs by over two-fold while maintaining equivalent performance to baseline models on image and video datasets. Visualizations showed that tokens routed to larger experts correlated well with regions of interest in images and areas of motion in videos, indicating effective learning of token importance. 

MoNE works well under different compute budgets and performs effectively on both image and video tasks, which makes it stand out. While it has some limitations, such as possible difficulties in applying it to language models that generate text, the overall results show that MoNE can make large vision models easier to deploy and better for the environment, since they use fewer resources.

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

Lin et al. [Meta FAIR]

♥ 284   Multi-Modal Efficiency
Architecture of Mixture of Modality-Aware Experts

Introduction to Mixture of Modality-Aware Experts

Existing mixed-modal, early-fusion language models use a single transformer to process both text and image tokens, which allows for better integration of information across modalities. However, scaling these mixed-modal early-fusion models to greater capacities is computationally challenging and resource-intensive.

The paper introduces MoMa, a modality-aware mixture-of-experts (MoE) architecture designed for pre-training mixed-modal, early-fusion language models. It divides expert modules into modality-specific groups (text experts and image experts) where each group exclusively processes tokens of its designated modality. MoMa is combined with the Chameleon architecture as the base transformer, and you can read our complete breakdown of the Chameleon architecture to learn more.

Inner-Workings of Mixture of Modality-Aware Experts

The modality-aware sparse architecture aims to significantly improve the efficiency while maintaining powerful cross-modal reasoning capabilities. Let’s break down the key components of the Mixture of Modality-Aware Experts (MoMa) model:

  1. Modality-Specific Expert Groups: MoMa divides experts into separate groups for text and image processing. This allows each group to specialize in features relevant to its specific modality (text or image), while still maintaining cross-modal integration through shared self-attention mechanisms in non-MoE layers. 

  2. Hierarchical Routing: The model uses a two-stage routing mechanism (sketched in code after the figure below):

    1. Modality-aware routing: Tokens are first routed to their corresponding modality-specific expert group. 

    2. Intra-modality routing: Within each group, tokens are then routed to specific experts using a learned routing function. 

  3. Expert Choice Routing: Within each modality group, the model uses expert-choice (EC) routing. Each expert has a fixed bucket size and processes the top-kₑ tokens in a batch. This ensures balanced expert utilization during training and eliminates the need for a separate load-balancing loss term.

  4. Addressing Causality in Auto-regressive models: To maintain causality during inference, the model uses two techniques: 

    1. Sigmoid non-linearity in the router scoring function to enable independent calculation of token-to-expert affinity scores. 

    2. Auxiliary routers that predict the likelihood of an expert selecting a token based solely on its hidden state representation. 

  5. Mixture-of-Depths (MoD): In addition to width scaling (MoMa), the model incorporates depth scaling using MoD. This technique allows tokens to selectively skip certain layers, introducing sparsity in the depth dimension. 

  6. Inference: During inference, the model uses auxiliary routers to ensure causality. These routers predict the likelihood of a token being selected by an expert or layer based only on its hidden representation. 

Transformer layer in Mixture of Modality-Aware Experts
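To make the two-stage routing concrete, below is a simplified sketch of a modality-aware MoE layer with expert-choice routing. The module layout, bucket-size handling, and sigmoid scoring are illustrative assumptions rather than the exact Chameleon-MoMa implementation, and the auxiliary routers and Mixture-of-Depths are omitted.

```python
import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    """Simplified MoMa-style layer: tokens are first split by modality,
    then routed within each group using expert-choice (EC) routing."""

    def __init__(self, d_model=1024, n_text_experts=4, n_image_experts=4, capacity=128):
        super().__init__()
        self.capacity = capacity  # fixed bucket size: tokens each expert picks
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.experts = nn.ModuleDict({
            "text": nn.ModuleList([make_expert() for _ in range(n_text_experts)]),
            "image": nn.ModuleList([make_expert() for _ in range(n_image_experts)]),
        })
        self.routers = nn.ModuleDict({
            "text": nn.Linear(d_model, n_text_experts),
            "image": nn.Linear(d_model, n_image_experts),
        })

    def forward(self, tokens, is_image):
        # tokens: (num_tokens, d_model); is_image: bool mask, True for image tokens.
        out = torch.zeros_like(tokens)
        for name, flag in (("text", False), ("image", True)):
            # Stage 1 (modality-aware routing): select this modality's tokens.
            idx = (is_image == flag).nonzero(as_tuple=True)[0]
            if len(idx) == 0:
                continue
            group = tokens[idx]
            # Sigmoid scores keep each token-expert affinity independent of
            # other tokens, which matters for causal inference.
            scores = torch.sigmoid(self.routers[name](group))
            k = min(self.capacity, len(idx))
            # Stage 2 (expert-choice routing): each expert picks its top-k
            # tokens within the modality group and processes only those.
            for e, expert in enumerate(self.experts[name]):
                top = torch.topk(scores[:, e], k).indices
                out[idx[top]] += scores[top, e].unsqueeze(-1) * expert(group[top])
        return out
```

Because every expert always picks a fixed number of tokens, expert loads stay balanced by construction during training; at inference, the auxiliary routers described above replace the batch-level top-k selection so that causality is preserved.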

Evaluating Mixture of Modality-Aware Experts

This paper evaluated the performance of various model configurations, including dense models, Mixture of Experts (MoE), and Modality-aware Mixture of Experts (MoMa) architectures. The 1.4B MoMa model with 1 text and 1 image expert outperformed the dense baseline on most metrics, particularly in image-related tasks. Adding more experts further improved performance, with the 1.4B MoMa containing 4 text and 4 image experts achieving the best overall results in interleaved data modeling.

MoE and MoMa models also showed improvements in text-to-text commonsense reasoning tasks compared to the dense baseline. The best-performing Chameleon-MoMa architecture demonstrated significant improvements over state-of-the-art baselines. These results show the effectiveness of modality-aware sparse architectures in improving performance across various tasks while maintaining efficiency. 
