
Fine-Tuning Diffusion Models for 200x Speedup, Foundational Multimodal LLMs, Gradient Informed MoE

#24 | Latest AI Research Explained Simply

In this issue: x4 industry news, x3 AI research papers

Sep 16th ~ Sep 22nd

🗞️ Industry News in 1 Line

  1. ♥ 1.2k Qwen2.5 is a new family of open-source foundation models that excels at topics such as math and coding, ranking second on LiveBench’s coding category. The models are available in sizes ranging from 0.5B to 72B parameters. Check out the Qwen2.5 weights on Hugging Face.

  2. ♥ 1.5k Moshi is an open-source speech-text foundation model for real-time dialogue. It differs from typical text-to-speech models in that Moshi maintains two streams of audio: one for the user and one for Moshi’s inner monologue. You can read Moshi’s technical report on GitHub, download Moshi’s weights from Hugging Face, or talk to Moshi online.

  3. ♥ 1.5k Google has released NotebookLM, a new AI tool which uses Gemini 1.5’s multimodal capabilities to provide an audio overview of your personal documents. This is an exciting new tool, but it only works with English documents.

  4. ♥ 11k OpenAI’s CEO Sam Altman wrote a blog about “The Intelligence Age”. In summary: “AI advancements, driven by deep learning, will lead to unprecedented prosperity and progress, transforming society while requiring careful management of risks like labor market shifts.”

Master In-Demand Skills In AI Through NVIDIA's Specialized Training and Certification

  • Enhanced Skill Set: Gain in-depth knowledge and hands-on experience with NVIDIA technologies. Learn the latest advancements and techniques in generative AI, deep learning, data science, graphics & simulation, and more. 

  • Career Advancement: Improve job prospects and career opportunities with industry-recognized certification that demonstrates proficiency and commitment to continuous learning. Gain recognition from peers and employers as an expert in NVIDIA technologies. 

  • Access to Advanced Tools and Resources: Utilize innovative NVIDIA tools and platforms during training. Stay current with the latest industry trends and technological advancements. 

  • Networking Opportunities: Connect with a global community of professionals, experts, and peers. Collaborate and share knowledge within the NVIDIA ecosystem. 

  • Free Training for Higher Education Students: Take advantage of free instructor-led training offered on campus via NVIDIA’s University Ambassador and Teaching Kit programs. 

Use code “BYCLOUD” at checkout for 10% off

Applicable only to self-paced courses, instructor-led workshops, and certifications.

Validate your expertise and advance your career with NVIDIA Certification. This program offers certifications across various specializations, demonstrating your proficiency in cutting-edge technologies and enhancing your professional credibility.

Explore a comprehensive learning path designed to help you master Generative AI. This program provides hands-on experience with the latest tools and techniques, guiding you through the essential skills needed to build and deploy AI models.

The AI Learning Essentials hub is designed to equip individuals with the knowledge and skills needed to thrive in the dynamic world of AI.

NVLM: Open Frontier-Class Multimodal LLMs

Dai et al. [NVIDIA]

♥ 324   VLM

Several open multimodal LLMs (MLLMs) have been released, but none of them match proprietary models like GPT-4. While open-source MLLMs have made progress, they often lag behind in performance, especially on tasks that require both visual and textual understanding. Moreover, we still don’t have a good comparison of different MLLM architectures, which makes it challenging to determine the best approach.

Architecture of NVLM model family

This paper introduces NVLM-1.0, a family of open-access, state-of-the-art MLLMs that addresses these shortcomings. It proposes a novel hybrid architecture that combines the strengths of decoder-only and cross-attention-based approaches, resulting in a model that excels at multimodal reasoning while maintaining computational efficiency. The paper also introduces a tile-tagging design for processing high-resolution images, which improves accuracy on OCR-related tasks.

How Does NVLM Work?

Let's break down the architecture of NVLM-1.0, a family of multimodal LLMs designed to understand both images and text. Think of NVLM-1.0 as a team of three different specialists:

  • NVLM-D (Decoder-only): This specialist excels at understanding the details of images, like reading text from them. It works by taking in both text and image information, processing them together in a single, unified way within the LLM's decoder. This allows NVLM-D to make connections and inferences between what it sees and what it reads.

  • NVLM-X (Cross-attention): This specialist is all about efficiency. It processes images quickly and effectively using a technique called cross-attention. This means it can focus on specific parts of the image, like reading a particular section of a document, without being overwhelmed by the whole picture.

  • NVLM-H (Hybrid): This specialist is the best of both worlds. It combines the strengths of NVLM-D and NVLM-X. NVLM-H uses NVLM-X's efficient cross-attention for detailed image processing, while still incorporating NVLM-D's ability to make sense of the entire image through its decoder. This allows for both quick understanding and deep reasoning.

Before going further, let's understand the shared foundation of these specialist models: the vision pathway. This is how NVLM-1.0 sees and interprets the world.

  1. Image Input: The input is an image.

  2. Dynamic Tiling: The image is divided into smaller sections called tiles, which allows the model to process even very large images. A special "thumbnail" tile captures the overall context of the image (a minimal sketch of this step follows the list).

  3. InternViT-6B: This powerful vision encoder analyzes each tile, transforming it into a set of numerical representations (tokens). These tokens represent the key visual features of the tile.

  4. Downsampling: To make the processing even more efficient, the number of tokens is reduced.

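To make the tiling step concrete, here is a minimal sketch assuming Pillow; the tile resolution, grid-selection rule, and tile budget are illustrative assumptions, not the paper's exact implementation.

```python
from PIL import Image

TILE = 448  # illustrative tile resolution, not necessarily NVLM's exact setting

def dynamic_tile(image: Image.Image, tile: int = TILE, max_tiles: int = 6):
    """Split an image into a grid of fixed-size tiles plus a global thumbnail."""
    w, h = image.size
    # Pick a grid that roughly preserves the aspect ratio without exceeding max_tiles.
    cols, rows = max(1, round(w / tile)), max(1, round(h / tile))
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    resized = image.resize((cols * tile, rows * tile))
    tiles = [
        resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows) for c in range(cols)
    ]
    thumbnail = image.resize((tile, tile))  # low-resolution view keeping global context
    return tiles, thumbnail
```

Each tile and the thumbnail would then be passed through the vision encoder (InternViT-6B in the paper), and the resulting tokens are downsampled before reaching the LLM.
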
Now, let's see how each specialist combines text and visual tokens:

  • NVLM-D: All the visual tokens (including the thumbnail) are combined with text tokens and fed directly into the LLM's decoder. This allows the model to draw connections between what it sees and reads.

  • NVLM-X: The visual tokens from each tile are processed using cross-attention. This means the model can selectively focus on certain parts of the image while still considering the text.

  • NVLM-H: The thumbnail tile is processed with the text tokens in the decoder, while the other tiles are processed using cross-attention. This allows for both detailed analysis and holistic understanding.

To help the model understand the structure of the image, each tile is given a unique label called a "tile tag." This allows the model to keep track of where the tile is located within the image and how it relates to the other tiles.
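
To see how the variants differ in what actually reaches the LLM, here is a hedged sketch of assembling the decoder input for NVLM-D versus NVLM-H; the tag format, function names, and token ordering are illustrative assumptions, and the cross-attention path for the regular tiles is omitted.

```python
def tile_tag(i: int) -> list[str]:
    """Illustrative text tag inserted before a tile's tokens; NVLM compares several tag formats."""
    return [f"<tile_{i}>"]

def build_decoder_input(text_tokens, thumbnail_tokens, tile_tokens, variant="D"):
    """Assemble the LLM decoder input sequence for NVLM-D or NVLM-H (simplified)."""
    if variant == "D":
        # NVLM-D: the thumbnail plus every tagged tile is concatenated with the text
        # and processed jointly by the decoder's self-attention.
        visual = tile_tag(0) + thumbnail_tokens
        for i, toks in enumerate(tile_tokens, start=1):
            visual += tile_tag(i) + toks
        return text_tokens + visual
    # NVLM-H: only the thumbnail joins the decoder sequence; the remaining tiles
    # are consumed by cross-attention layers instead (not shown here).
    return text_tokens + tile_tag(0) + thumbnail_tokens
```
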

Results and Real-World Implications of NVLM

NVLM-1.0 achieves state-of-the-art results across a wide range of vision-language benchmarks. It can compete with both leading proprietary models like GPT-4o and open-access models like InternVL and Llama 3. Unlike many open-source models that experience significant degradation in text-only performance after multimodal training, NVLM-1.0 either maintains or even improves its text-only capabilities across benchmarks like MMLU, GSM8K, MATH, and HumanEval. This is a significant achievement, demonstrating that the model's ability to understand images doesn't compromise its text-based skills.

GRIN: GRadient-INformed MoE

Liu et al. [Microsoft]

♥ 718   LLM MoE

Introduction to GRIN MoE

Large language models keep growing, and as they do, they become even harder to train. We still don’t have a good way to improve the efficiency and performance of these models while managing computational resources.

To solve this problem, this paper introduces GRIN (GRadient-INformed MoE training), a method that incorporates sparse gradient estimation for expert routing and reconfigures model parallelism. It uses SparseMixer-v2 to estimate gradients for expert routing, going beyond conventional methods that use gating gradients as proxies. This allows it to achieve better performance on coding and mathematics tasks while using fewer activated parameters than comparable dense models.

How does GRIN MoE Work?

The GRIN MoE (GRadient-INformed Mixture-of-Experts) model is an innovative approach to large language model architecture that aims to improve efficiency and performance. Let’s break down its architecture and understand its inner workings:

  1. Foundation: The model is built on a transformer architecture, which is a stack of transformer blocks. Each block contains two main components: an attention layer and a feedforward layer.

  2. Attention Layer: 

    1. Uses grouped-query attention and sliding window attention for efficiency.

    2. Implements rotary position encoding (RoPE) to handle long contexts.

    3. Utilizes FlashAttention 2 for optimized performance.

  3. Feedforward Layer (Mixture-of-Experts): Instead of a standard feedforward network, this layer is constructed as a Mixture-of-Experts (MoE) system. It has three components: multiple expert networks (feedforward networks), a router network, and a gating mechanism.

  4. MoE Operation (a minimal sketch follows this list):

    1. For each input token, the router network determines which expert(s) to activate.

    2. The model selects the top K (usually 2) experts for each token.

    3. The selected experts process the input.

    4. The gating mechanism combines the outputs of the selected experts.

  5. SparseMixer-v2: This is a gradient estimation technique for the expert routing process. It improves upon conventional MoE training by replacing the TopK function with random sampling during training. It uses Heun's third-order method to approximate the expert routing gradient and constructs a modified backpropagation for more accurate gradient estimation.

  6. Training Process: The model is trained using a combination of techniques to overcome the challenges of sparse computation. It avoids token dropping and uses pipeline and tensor parallelism instead of expert parallelism.

  7. Scaling: The full model has a large number of total parameters (e.g., 42B), but only a fraction (e.g., 6.6B) are activated for any given input. This sparse activation allows the model to achieve performance comparable to much larger dense models while using fewer computational resources.

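To ground the MoE operation described in step 4, here is a minimal PyTorch-style sketch of a top-2 routed feedforward layer. The dimensions and expert count are illustrative, and it shows standard hard top-k routing rather than GRIN's SparseMixer-v2 sampling-based estimator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Simplified top-2 Mixture-of-Experts feedforward layer (illustrative sizes)."""

    def __init__(self, d_model=1024, d_ff=4096, n_experts=16, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # routing score per expert
        probs = F.softmax(logits, dim=-1)
        gate, idx = probs.topk(self.k, dim=-1)   # top-2 experts and their gate weights
        # GRIN replaces this hard TopK with random sampling during training and uses
        # SparseMixer-v2 to estimate the routing gradient; the hard selection below
        # is the standard behaviour, kept here for simplicity.
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e         # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += gate[mask, slot, None] * self.experts[e](x[mask])
        return out
```

During training, GRIN's SparseMixer-v2 applies a Heun-based gradient estimator to the sampled routing decision; that estimator is considerably more involved than this sketch.
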
Evaluating GRIN MoE

GRIN MoE uses a transformer-based structure with a new Mixture-of-Experts approach in the feedforward layers, which allows for sparse activation of experts. Despite having 42B total parameters, only about 6.6B are activated for any given input, letting the model match much larger dense models while using far fewer resources.

To address concerns about reliance on synthetic data, GRIN MoE was tested on the 2024 GAOKAO exam (Chinese national college entrance exam), specifically on math questions. Here are the GAOKAO Math results:

  1. GRIN MoE scored 46 out of 73 points.

  2. Outperformed Llama3 70B by 11 points.

  3. Scored only 6 points less than Gemini Ultra-1.0 and 5 points less than Claude3 Opus.

The strong performance on GAOKAO, which was not part of the training data, suggests that GRIN MoE's capabilities are likely due to effective learning and generalization rather than mere memorization.

Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

Garcia et al. [RWTH Aachen University, Eindhoven University of Technology]

♥ 630   Depth Estimation   bycloud’s pick  

Introduction to Image-Conditional Diffusion Models

Large diffusion models consume a lot of computational resources, which makes them expensive to run. This paper uncovers a crucial flaw in the inference pipeline of existing diffusion-based depth estimation models, which explains their surprisingly poor single-step performance. By correcting this bug, the authors enable single-step inference that makes these models more than 200 times faster while maintaining accuracy.

Furthermore, the paper also introduces an efficient fine-tuning protocol for the single-step model. By using task-specific losses, they achieve a deterministic model that outperforms all previous diffusion-based depth and normal estimation methods.

Training pipeline of Image-Conditional Diffusion Models

Fine-Tuning Image-Conditional Diffusion Models

The depth maps produced by these diffusion models are far from perfect: we often see artifacts such as blurring and over-sharpening that make them look unnatural. These artifacts arise because the diffusion training objective doesn't guarantee optimal performance for the specific task of depth prediction.

To address this, the researchers employ end-to-end fine-tuning by directly optimizing the model for the desired task. However, fine-tuning a diffusion model with multiple inference steps is computationally prohibitive for large models. Fortunately, the bug-fix described earlier enables single-step inference, making fine-tuning feasible.

The fine-tuning process involves training the same UNet used in the initial diffusion training stage, with the following key modifications:

  1. Single-Step Prediction: The timestep 't' is fixed to the final step (T), ensuring the model learns to produce a depth prediction in a single step. This removes the need for backpropagation through multiple inference steps, significantly reducing computational cost.

  2. Zero Noise: The input noise is set to zero, simplifying the training process. Instead of adding noise and then denoising, the model directly learns to convert a clean latent representation into a depth prediction.

  3. Task-Specific Loss: The training objective is replaced with task-specific loss functions. For depth estimation, they use an affine-invariant loss, which is robust to global scaling and shifting of the depth map. This ensures the model learns to predict accurate depth values, regardless of overall scale. For surface normal estimation, they employ a loss based on the angular difference between predicted and ground truth normals, encouraging the model to produce accurate normal vectors.

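Putting the three modifications together, here is a hedged sketch of one fine-tuning step. The `vae`, `unet`, and their call signatures are simplified placeholders rather than the actual pipeline interfaces, and the affine-invariant loss is shown in a basic per-image least-squares scale-and-shift form.

```python
import torch
import torch.nn.functional as F

def affine_invariant_loss(pred, target):
    """Align the prediction to the target up to scale and shift, then penalize the residual."""
    p, g = pred.flatten(1), target.flatten(1)
    p_c = p - p.mean(1, keepdim=True)
    g_c = g - g.mean(1, keepdim=True)
    # Least-squares scale s and shift t so that s * pred + t ≈ target, per image.
    s = (p_c * g_c).mean(1, keepdim=True) / (p_c.pow(2).mean(1, keepdim=True) + 1e-6)
    t = g.mean(1, keepdim=True) - s * p.mean(1, keepdim=True)
    return F.l1_loss(s * p + t, g)

def finetune_step(unet, vae, image, depth_gt, optimizer, T=999):
    """One end-to-end fine-tuning step (placeholder interfaces for unet/vae)."""
    rgb_latent = vae.encode(image)                       # clean conditioning latent
    noise = torch.zeros_like(rgb_latent)                 # (2) zero noise instead of Gaussian
    timestep = torch.full((image.shape[0],), T,          # (1) fixed final timestep
                          device=image.device, dtype=torch.long)
    depth_latent = unet(torch.cat([rgb_latent, noise], dim=1), timestep)
    depth_pred = vae.decode(depth_latent)                # decode latent to a depth map
    loss = affine_invariant_loss(depth_pred, depth_gt)   # (3) task-specific loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```
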
In this paper, the researchers use the following hyperparameters:

  • The UNet is trained for 20,000 iterations using the AdamW optimizer with a base learning rate of 3 × 10⁻⁵ and an exponential learning rate decay after a 100-step warm-up.

  • The batch size is set to 2, with gradient accumulation over 16 steps which effectively creates a batch size of 32. This strategy allows for mixing images with different aspect ratios and resolutions, further improving generalization.

  • This paper uses a mixture of indoor and outdoor images by combining the Hypersim (90%) and Virtual KITTI 2 (10%) training datasets, as both provide high-quality synthetic data with ground-truth annotations.

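These hyperparameters translate fairly directly into an optimizer setup. The following is an illustrative PyTorch configuration, not the authors' released training script; the exponential decay rate, warm-up shape, and the `compute_loss` callable are assumptions.

```python
import torch

def build_optimizer_and_schedule(unet):
    """Illustrative optimizer/schedule mirroring the reported hyperparameters."""
    optimizer = torch.optim.AdamW(unet.parameters(), lr=3e-5)
    warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=100)
    decay = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)  # decay rate assumed
    schedule = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, decay], milestones=[100])
    return optimizer, schedule

def train(unet, loader, compute_loss, num_iters=20_000, accum_steps=16):
    """Batch size 2 with 16 gradient-accumulation steps gives an effective batch size of 32."""
    optimizer, schedule = build_optimizer_and_schedule(unet)
    for step, (image, depth_gt) in enumerate(loader):
        loss = compute_loss(image, depth_gt) / accum_steps  # e.g. the fine-tuning step above
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            schedule.step()
        if (step + 1) // accum_steps >= num_iters:
            break
```
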
Testing Image-Conditional Diffusion Models

We can see that the fixed DDIM scheduler (which corrects the inference bug) leads to a significant improvement in single-step performance compared to the original Marigold model, even when the original uses 50 steps and ensembling. Moreover, ensembling multiple predictions provides noticeable benefits when using more than one inference step; for single-step predictions, however, the results are highly correlated and ensembling doesn't offer significant improvement.

Fine-tuning the fixed Marigold model end-to-end results in further improvements in depth estimation and it outperforms all previous configurations, including those with multiple steps and ensembling. Even directly fine-tuning Stable Diffusion achieves comparable performance to the fine-tuned Marigold model. This indicates that the fine-tuning approach itself is effective, and the choice of pretrained model may not be as critical as previously thought.

This paper also found that fine-tuning the GeoWizard model for both depth and normal estimation leads to substantial improvements in normal estimation, outperforming the original model with multiple steps and ensembling. The improvement in depth estimation is smaller but still consistent, showing that the fine-tuning approach benefits both tasks.
