Inductive Moment Matching
Plus more about Generalized Kullback-Leibler Divergence Loss and Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Mar 10th ~ Mar 17th
#47 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 7.2k Mistral Small 3.1 is a fast, lightweight multimodal model with a 128k token context window, outperforming Gemma 3 and GPT-4o Mini while running efficiently on consumer hardware with 32GB of VRAM. Both the base and instruct models are now available on HuggingFace.
♥ 4.5k ERNIE 4.5, a next-gen multimodal foundation model, and ERNIE X1, a deep-thinking reasoning model, now offer top-tier performance at lower costs, with ERNIE X1 matching DeepSeek R1 at half the price. ERNIE 4.5 excels in reasoning, memory, and hallucination prevention, while both models are freely available via ERNIE Bot and accessible to enterprises through Baidu AI Cloud.

ERNIE 4.5 benchmark
RTX 4080 SUPER Giveaway (RIGHT NOW!) With NVIDIA’s GTC 2025
During NVIDIA’s GTC, the company’s annual flagship AI and developer conference (March 17-21, 2025), there will be plenty of big announcements, events, and sessions you can attend either in person or virtually.
You can virtually discover the latest breakthroughs in generative AI and NVIDIA technologies from subject-matter experts at #GTC25.
By virtually attending sessions, you can join my giveaway for an RTX 4080 SUPER. Currently, only 25 people have joined the giveaway.
All you have to do is take a selfie of yourself attending the LIVE virtual sessions available during GTC (March 17-21), submit it to this Google Form, and you can learn while possibly winning a GPU at the same time! You can find more information on the Google Form.

GTC 2025 Keynote by Jensen Huang
Here is a summary of the Keynote:
NVIDIA Dynamo, infrastructure software that boosts throughput by 30x on Blackwell
The new Blackwell Ultra delivers 1.5x higher inference speed than Blackwell
DGX Spark, the smallest AI supercomputer, comes with 128GB of unified memory
RTX Pro, a new Blackwell-based GPU series for consumers, ranges from 24GB up to 96GB of VRAM
Enterprise agentic AI for all levels, with a focus on reasoning agents
Llama Nemotron Reasoning Models, trained specifically for agentic reasoning use, come in three sizes: Nano, Super, and Ultra
AI-Q, an NVIDIA AI Blueprint that connects reasoning to AI agents, data, and tools so enterprises can connect their data and deploy AI efficiently
Plus many other physical AI and simulation announcements; you can check out the GTC 2025 Keynote replay on YouTube.
Inductive Moment Matching
Zhou et al. [Luma AI, Stanford University]
♥ 2.6k Diffusion Models
Introduction to Inductive Moment Matching
Generative AI faces several challenges that are hard to satisfy at once: high-quality outputs, efficient inference, and stable training. Current approaches like diffusion models produce impressive results but require many inference steps, while attempts to accelerate them through distillation or Consistency Models often lead to instability and extensive hyperparameter tuning.
This paper aims to solve this by introducing Inductive Moment Matching (IMM), a single-stage training procedure that directly learns generative models capable of high-quality one-step or few-step sampling without requiring pre-training or model distillation. By operating on time-dependent marginal distributions and enforcing distribution matching through mathematical induction, IMM guarantees convergence to the data distribution while maintaining stability across various hyperparameters and model architectures.

How Inductive Moment Matching Works
Inductive Moment Matching (IMM) creates AI-generated images in just a few steps instead of hundreds, while maintaining high quality and training stability. IMM uses a clever shortcut - directly learning how to transform noise into images without the lengthy step-by-step process traditional models use.
The Interpolation Framework
IMM creates a continuous path between random noise (t=1) and real images (t=0)
At any point in time t, there exists a distribution of partially-noised images
The key insight: learn to jump directly from any time point to any earlier time point
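To make the interpolation concrete, here is a minimal sketch of building a partially-noised sample x_t with a simple linear schedule; the paper's actual interpolant and flow schedules (e.g., OT-FM, Euler-FM) may differ, so treat the alpha/sigma choices below as illustrative assumptions.

```python
import torch

def interpolate(x0: torch.Tensor, noise: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Build a partially-noised sample x_t on the path from data (t=0) to noise (t=1).

    Uses a simple linear schedule alpha_t = 1 - t, sigma_t = t; the paper's
    flow schedules (e.g., OT-FM, Euler-FM) differ in the exact coefficients.
    """
    alpha_t = (1.0 - t).view(-1, 1, 1, 1)   # weight on the clean image
    sigma_t = t.view(-1, 1, 1, 1)           # weight on the noise
    return alpha_t * x0 + sigma_t * noise

# Example: a batch of 4 "images" at random time points
x0 = torch.randn(4, 3, 32, 32)        # stand-in for real images (t = 0)
noise = torch.randn_like(x0)          # pure Gaussian noise (t = 1)
t = torch.rand(4)                     # one time point per sample
x_t = interpolate(x0, noise, t)       # partially-noised samples
```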
Learning Through Induction
IMM learns by comparing two routes to the same destination:
Going directly from time t to time s
Going from t to an intermediate time r, then from r to s
When these two routes produce the same results, the model has learned correctly
This "inductive bootstrapping" technique guarantees the model converges to the correct distribution

Training Stability
IMM uses "moment matching" (comparing statistical properties between distributions)
Instead of matching single samples, it matches multiple "particles" at once
This makes training much more stable than previous approaches
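One standard way to match moments between two particle sets is a kernel discrepancy such as MMD; the sketch below uses an RBF kernel as an illustrative assumption, not necessarily the exact estimator used in the paper.

```python
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """Pairwise Gaussian (RBF) kernel between two sets of flattened samples."""
    d2 = torch.cdist(a, b).pow(2)                      # squared pairwise distances
    return torch.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd_loss(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Squared Maximum Mean Discrepancy between two particle sets x and y.

    Comparing many particles at once matches distributions (their moments in
    a kernel feature space) instead of individual samples.
    """
    return rbf_kernel(x, x).mean() + rbf_kernel(y, y).mean() - 2.0 * rbf_kernel(x, y).mean()

# Example: 16 "particles" from each of the two routes, flattened to vectors
x = torch.randn(16, 3 * 32 * 32)
y = torch.randn(16, 3 * 32 * 32)
loss = mmd_loss(x, y)
```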

Results and Real-World Implications of Inductive Moment Matching
Researchers tested different approaches to find optimal configurations for Inductive Moment Matching. Across architectures (DDPM++ for CIFAR-10 and DiT-B for ImageNet-256×256), they found that network parameterization and flow schedules significantly impact performance. While Simple-EDM with OT-FM flow excels on smaller datasets, the Euler-FM combination demonstrates superior scalability on larger images, which suggests different optimal configurations based on resolution.

Researchers analyzed different mapping functions and concluded that constant decrements in η_t consistently outperform alternative approaches, though stability requires careful selection of decrements proportional to 2^-k. Weighting functions also proved crucial, with the combination of ELBO factors, α_t weighting, and middle time-step emphasis via 1/(α_t² + σ_t²) yielding substantial improvements.
Generalized Kullback-Leibler Divergence Loss
Cui et al. [Nanyang Technological University, The Chinese University of Hong Kong, The University of Hong Kong, Harbin Institute of Technology, Hefei University of Technology]
♥ 312 KL Divergence
Drawbacks and Limitations of Kullback-Leibler Divergence Loss
This paper addresses limitations in the Kullback-Leibler (KL) Divergence loss function widely used in deep learning. The authors mathematically prove that KL loss can be decoupled into two components: a weighted Mean Square Error and Cross Entropy with soft labels. This decoupling reveals two critical weaknesses: asymmetric optimization that hinders convergence during knowledge distillation, and sample-wise prediction bias.
To address these problems, they propose the Generalized Kullback-Leibler (GKL) Divergence loss, which breaks the asymmetric property while introducing a smoother weight function and class-wise global information. The effectiveness of GKL is demonstrated through impressive experimental results, achieving state-of-the-art adversarial robustness on RobustBench and competitive knowledge distillation performance across CIFAR, ImageNet, and CLIP models.

Comparisons of gradient backpropagation between KL, DKL, and GKL losses.
Understanding KL Loss Decoupling
The authors of this paper argue that Kullback-Leibler (KL) Divergence loss, widely used in deep learning, can be decoupled into two simpler components:
Weighted Mean Square Error (wMSE) - Captures local relationships between pairs of classes
Cross-Entropy with soft labels - Ensures global similarity between probability distributions

This decoupling exposes two key problems in KL loss:
Asymmetric Optimization Issue: In knowledge distillation, where a student model learns from a fixed teacher model, the wMSE component becomes ineffective because the teacher's outputs are detached from gradient calculations. This means only half of the optimization mechanism works, leading to worse performance.
Sample-wise Prediction Bias: Hard examples or outliers with incorrect predictions can mislead the training process when using sample-based weights.
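To see why the asymmetry matters in practice, here is a minimal PyTorch-style sketch of standard knowledge distillation with a KL loss on temperature-softened logits; the teacher logits are detached, so gradients flow only through the student side. This is vanilla KD, not the paper's GKL implementation.

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               temperature: float = 4.0) -> torch.Tensor:
    """Standard knowledge-distillation KL loss KL(teacher || student).

    The teacher is a fixed model, so its outputs carry no gradient; only the
    student's log-probabilities receive updates -- the asymmetry the paper
    points out.
    """
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach() / t, dim=-1)   # detached: no gradient path
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Example usage with random logits for a 10-class problem
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
loss = kd_kl_loss(student_logits, teacher_logits)
loss.backward()   # gradients reach student_logits only
```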
The authors' solution (GKL loss) addresses these issues by:
Breaking the asymmetric property - Allowing gradients to flow through both components
Introducing a smoother weight function - Making training more stable for classes with high predicted scores
Using class-wise information - Reducing bias from individual sample predictions by incorporating global class statistics
These improvements significantly enhance performance in both adversarial training and knowledge distillation without requiring mathematical reformulation of the original loss function.
Results and Evaluation of the Generalized Kullback-Leibler Divergence Loss
The authors' Generalized Kullback-Leibler (GKL) Divergence loss achieved state-of-the-art adversarial robustness on CIFAR-10/100 and delivered competitive knowledge distillation performance across CIFAR-10/100, ImageNet, and CLIP models. By decoupling KL loss into weighted MSE and Cross-Entropy components, breaking asymmetric optimization, and incorporating class-wise global information with smoother weight functions, GKL-AT significantly outperformed baselines across various attacks and perturbation sizes, demonstrating 1.34% higher average robustness than TRADES.

However, despite these advances, the authors acknowledge GKL's effectiveness has only been demonstrated in adversarial training and knowledge distillation, suggesting future work should explore its applications in out-of-distribution robustness and incremental learning scenarios.
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Huang et al. [East China Normal University, Xiaohongshu Inc.]
♥ 568 Vision Reasoning bycloud’s pick
Introduction to Vision-R1
This paper highlights the challenge of creating Multimodal Large Language Models (MLLMs) with robust reasoning capabilities, similar to those recently demonstrated in LLMs using Reinforcement Learning (RL) like DeepSeek-R1. One of the biggest problems is the scarcity of high-quality, multimodal data that includes complex, human-like reasoning steps (Chain-of-Thought or CoT), as opposed to simpler "Pseudo-CoT" data.
To tackle this, the researchers introduce Vision-R1, a novel reasoning MLLM. They first create a large (200K) multimodal CoT dataset, Vision-R1-cold, without manual annotation. This is achieved by an innovative "Modality Bridging" technique where an existing MLLM generates initial "Pseudo-CoT" data, which is then refined using a powerful text-based reasoning LLM (DeepSeek-R1) and filtering. This dataset provides a "cold-start" for Vision-R1.
Then, to overcome the "overthinking" issue observed during RL training, they propose Progressive Thinking Suppression Training (PTST) combined with Group Relative Policy Optimization (GRPO) and a specialized reward function. This progressively shapes the model's reasoning on a smaller (10K) multimodal math dataset, leading to a significant average improvement of ~6% on various benchmarks, with Vision-R1-7B achieving near state-of-the-art performance (73.5% on MathVista, close to OpenAI's O1).

Vision-R1 Pipeline
Can we teach reasoning to MLLMs just with rewards?
The researchers were inspired by DeepSeek-R1-Zero, which used Reinforcement Learning (an approach which gives a computer program rewards for good behavior and penalties for bad). They wanted to see if they could do the same with multimodal models (ones that understand both text and images).
They tried the simplest approach first: They took a bunch of math problems (10,000 of them), and used RL to train a basic MLLM. The MLLM got a reward if it:
Used the correct format: The output had to be in a specific structure: "<think> [reasoning steps] </think> <answer> [answer] </answer>". Think of it like forcing the model to "show its work."
Got the right answer: The final answer had to match the correct solution to the problem.
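A rule-based reward along these lines can be sketched as below; the regex and the way the two criteria are combined are illustrative assumptions rather than the paper's released code.

```python
import re

FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def rule_based_reward(output: str, ground_truth: str) -> float:
    """Toy rule-based reward with a format term and an accuracy term.

    How the two terms are weighted/combined is an assumption here; the
    paper's exact reward definition may differ.
    """
    match = FORMAT_PATTERN.match(output.strip())
    format_reward = 1.0 if match else 0.0
    answer_reward = 0.0
    if match and match.group(1).strip() == ground_truth.strip():
        answer_reward = 1.0
    return format_reward + answer_reward

# Example
print(rule_based_reward("<think>2 + 2 = 4</think> <answer>4</answer>", "4"))  # 2.0
```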
They called this model "Vision-R1-Zero." The result? It didn't work very well. The model struggled to learn complex reasoning, and the reasoning it did produce wasn't very long or sophisticated. Even worse, with prolonged training its performance began to degrade even as its reasoning chains grew longer.

Data generation pipeline incorporating our Modality Bridging method
Inner Workings of Vision-R1
Since the "RL-only" method failed, the researchers tried a different strategy. They combined two ideas:
Cold-Start Initialization: First, they would "pre-train" the MLLM using a special dataset of multimodal problems that already included the reasoning steps (the "Chain-of-Thought" or CoT). This is like giving the model a textbook with worked examples before asking it to solve problems on its own. This pre-trained model is called "Vision-R1-CI."
Reinforcement Learning (again): After the cold-start, they would use RL to fine-tune the model. This time, the RL is used to help the model learn the correct reasoning process, not to teach it reasoning from scratch.
The final model, after both steps, is called "Vision-R1." The key to the cold-start is having a good dataset with examples of human-like reasoning. Existing datasets were often too simple and didn't show the kind of back-and-forth thinking that humans do (like questioning assumptions or correcting mistakes).

Comparison between the CoT processes generated by descriptions with and without “Pseudo-CoT”.
Researchers wanted to use the reasoning abilities of DeepSeek-R1 (the text-only model that was good at reasoning) to help create this dataset. But DeepSeek-R1 can't understand images!
Here's their clever solution, called "Modality Bridging":
Pseudo-CoT: They took an image, a question, and the correct answer, and fed them to a different MLLM. They asked this MLLM to generate a "Pseudo-CoT" – a description of the image and some initial reasoning steps. This is like asking a student to explain the problem and take a first stab at solving it.
Detailed Description: They then took the original image and question, plus the "Pseudo-CoT," and fed them back into the MLLM. This time, they asked for a very detailed description of the image, incorporating the information from the Pseudo-CoT. This is like asking a student to refine their explanation after getting some hints. The Pseudo-CoT helps the MLLM focus on the important visual details.
DeepSeek-R1's Magic: Now they had a detailed textual description that included all the important visual information. They fed this description to DeepSeek-R1 (the text-only reasoning expert). DeepSeek-R1 could then generate high-quality, human-like reasoning steps (the CoT).
Clean Up: They filtered out any reasoning that led to the wrong answer, and did some minor cleaning to make the text more consistent.
Putting it Together: Finally, they combined the original image with the high-quality CoT generated by DeepSeek-R1. This created the "Vision-R1-cold" dataset – a collection of multimodal problems with excellent reasoning examples.
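Putting the pipeline into pseudocode, it might look like the sketch below; every helper (generate_pseudo_cot, describe, reason, extract_answer) is a hypothetical placeholder for the models and filters described above, not an actual API.

```python
def build_vision_r1_cold(samples, mllm, reasoning_llm):
    """Sketch of the Modality Bridging pipeline (hypothetical helpers throughout).

    samples: iterable of (image, question, answer) triples.
    mllm: a multimodal model used for Pseudo-CoT and detailed descriptions.
    reasoning_llm: a text-only reasoner (DeepSeek-R1 in the paper).
    """
    dataset = []
    for image, question, answer in samples:
        # 1) Pseudo-CoT: first-pass description + rough reasoning from the MLLM
        pseudo_cot = mllm.generate_pseudo_cot(image, question, answer)
        # 2) Detailed description guided by the Pseudo-CoT hints
        description = mllm.describe(image, question, hints=pseudo_cot)
        # 3) Text-only reasoner produces a human-like CoT from the description
        cot = reasoning_llm.reason(description, question)
        # 4) Clean up: keep only CoTs that reach the correct answer
        if extract_answer(cot) != answer:
            continue
        # 5) Pair the original image with the high-quality CoT
        dataset.append({"image": image, "question": question, "cot": cot})
    return dataset

def extract_answer(cot: str) -> str:
    """Placeholder answer extractor; the real filtering rules are more involved."""
    return cot.rsplit("Answer:", 1)[-1].strip()
```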
After the cold-start initialization (using the Vision-R1-cold dataset), the model ("Vision-R1-CI") had learned how to reason in a complex way. But it had a new problem: "Overthinking."

GRPO with the proposed PTST strategy.
The model would sometimes spend too much time thinking, even when a shorter reasoning process would have been correct. The correct reasoning was often shorter and simpler. This made it hard for the RL training (in the next step) to work properly, because the model was getting "lost" in long, incorrect reasoning chains.
Teaching the Model to Think Efficiently via Progressive Thinking Suppression Training
To solve the "overthinking" problem, they came up with "Progressive Thinking Suppression Training" (PTST). The idea is simple but powerful:
Start Short: At the beginning of RL training, they forced the model to use short reasoning processes. This prevented it from getting lost in long, incorrect chains. It's like making the model practice solving easy problems first.
Gradually Lengthen: As training went on, they slowly allowed the model to use longer and longer reasoning. This gave the model time to learn the correct reasoning patterns before it was allowed to explore more complex possibilities.
Hard Formatting Reward: They used a strict reward system. The model only got a reward if it used the correct format and got the right answer. No partial credit!
They used a specific RL algorithm called "Group Relative Policy Optimization" (GRPO), but the key idea is the progressive increase in allowed reasoning length.
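A minimal sketch of the PTST idea is shown below: cap the allowed reasoning length per training stage, relax the cap over time, and hand out an all-or-nothing reward. The stage boundaries, token limits, and the way the cap is enforced are illustrative assumptions, not the paper's actual settings.

```python
def max_reasoning_tokens(training_step: int) -> int:
    """Progressively relax the cap on reasoning length as RL training proceeds.

    Stage boundaries and token caps are made-up illustrative values.
    """
    schedule = [(2_000, 1_024), (4_000, 2_048), (6_000, 4_096)]
    for step_limit, token_cap in schedule:
        if training_step < step_limit:
            return token_cap
    return 8_192   # final stage: longest allowed reasoning

def hard_reward(format_ok: bool, answer_ok: bool, n_reasoning_tokens: int, cap: int) -> float:
    """All-or-nothing reward: correct format, correct answer, within the current cap."""
    return 1.0 if (format_ok and answer_ok and n_reasoning_tokens <= cap) else 0.0

# Early in training, only short reasoning chains can earn any reward
cap = max_reasoning_tokens(training_step=500)                                        # -> 1024
print(hard_reward(format_ok=True, answer_ok=True, n_reasoning_tokens=900, cap=cap))  # 1.0
```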

The output examples of Vision-R1-7B on MathVerse benchmark.
Real-World Implications of Vision-R1
The resulting Vision-R1-7B model had impressive mathematical reasoning capabilities. On the MathVista benchmark, Vision-R1-7B nearly matches OpenAI's O1, a leading reasoning model, despite being significantly smaller (7B parameters vs. O1's unspecified, but likely much larger, size). It shows substantial gains (+10% accuracy) over its base model (Qwen-2.5-VL-7B) on challenging sub-tasks, indicating a strong grasp of complex, "human-like" reasoning. Vision-R1-7B also achieves top or near-top performance on other difficult math benchmarks (MathVerse and MM-Math).
The Vision-R1-cold dataset, used for the model's initial training, is shown to be of high quality, containing significantly more instances of human-like cognitive processes (questioning, reflection, etc.) than previous datasets like Mulberry and LLaVA-CoT. This is demonstrated quantitatively and confirmed by the fact that a model trained on Vision-R1-cold (Vision-R1-LlamaV-CI-11B) outperforms models trained on other datasets, achieving state-of-the-art results across various benchmarks.