Apple Joins DiffusionLM Research

Plus more on extrapolating RLVR to general domains without verifiers and how VLMs can think visually without generating pixels

June 23rd ~ June 30th
#62 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 489 Baidu introduces ERNIE 4.5, a family of ten dense and MoE LLM variants ranging from a 0.3B model to a 424B-parameter system with 47B active parameters, which collectively enable scalable language-and-vision reasoning through post-training and multimodal thinking modes. Read their technical report or try it out on Baidu AI Studio.

    ERNIE-4.5

  2. ♥ 1.3k Hunyuan-A13B is an open-source MoE LLM with 80B parameters (13B active) that couples a hybrid fast-and-slow reasoning architecture with advanced agentic tool-calling to match o1 and DeepSeek across mainstream benchmarks and excel at long-text tasks. It also arrives with the new ArtifactsBench and C3-Bench datasets for richer code- and agent-centric evaluation. Try it out now.

    Hunyuan-A13B benchmark

  3. ♥ 1.4k Qwen-VLo is a multimodal generative model that converts rough sketches or multilingual text prompts into high-resolution visuals, enables real-time style and layout edits, and incrementally composes complex scenes via progressive generation that can accelerate creative workflows. Read more about it or try it out now.

    Qwen VLo interaction

Support My Newsletter

As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!

RLPR: Extrapolating RLVR to General Domains without Verifiers 

Yu et al. [Tsinghua University, National University of Singapore, Shanghai Qi Zhi Institute, Harbin Institute of Technology, Beijing University of Posts and Telecommunications, University of Illinois Urbana-Champaign]

♥ 240   LLM RLVR

Introduction to RLPR

In the past few weeks, we have seen that Reinforcement Learning with Verifiable Rewards (RLVR) offers promising results for improving reasoning in language models, especially in structured domains like math and code. But it hits a wall in broader areas like science or open-ended questions. This is mainly because RLVR relies on custom verifiers (rule-based systems or trained models) to judge answers. Building these systems for every new domain is complex and often impossible for free-form responses. This limitation blocks progress in general reasoning.

In this paper, the researchers propose RLPR, which replaces external verifiers with the language model's own confidence: the probability it assigns to the correct reference answer. This simple change extends RLVR to any domain.

RLPR achieves better reasoning capability enhancement on both mathematical and general-domain reasoning benchmarks, even surpassing strong methods that use verifier models.

Inner Mechanism of RLPR

The RLPR mechanism calculates rewards differently. For a given question, the model generates reasoning steps and an answer. Normally, a verifier would score correctness. Here, RLPR replaces the generated answer with the reference one and checks the model’s token-level probabilities for that reference. The average probability across tokens becomes the reward. Using the mean (not product) avoids sensitivity to single low-probability words, making rewards robust.

RLPR features an efficient Probability-based Reward (PR) using average decoding probabilities of reference answers.
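
To make the reward concrete, here is a minimal sketch of the probability-based reward, assuming a causal LM with a HuggingFace-style `model(input_ids).logits` interface; the function name and tensor layout are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def probability_reward(model, prompt_ids, reasoning_ids, reference_ids):
    """Mean per-token probability the model assigns to the reference answer,
    conditioned on the question prompt and the generated reasoning."""
    # Sequence layout: [prompt][generated reasoning][reference answer]
    input_ids = torch.cat([prompt_ids, reasoning_ids, reference_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits          # (1, seq_len, vocab)
    # Logits at position t predict token t+1, so shift by one.
    start = prompt_ids.numel() + reasoning_ids.numel()
    ref_logits = logits[0, start - 1 : start - 1 + reference_ids.numel()]
    token_probs = F.softmax(ref_logits, dim=-1).gather(
        1, reference_ids.unsqueeze(1)
    ).squeeze(1)
    # Mean (not product) so a single low-probability token does not
    # collapse the reward.
    return token_probs.mean().item()
```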

But raw probabilities can be noisy. If the model already assigns high probability to a reference answer without any reasoning, that skews results. RLPR fixes this with debiasing. It subtracts the probability score of generating the reference answer directly, without intermediate reasoning, from the original reward. This isolates the contribution of the reasoning steps themselves.
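
Building on the sketch above, the debiasing step can be expressed as a simple subtraction; representing "no reasoning" as an empty token segment is an assumption made for illustration.

```python
def debiased_reward(model, prompt_ids, reasoning_ids, reference_ids):
    # Probability of the reference given the full reasoning trace ...
    with_reasoning = probability_reward(model, prompt_ids, reasoning_ids, reference_ids)
    # ... minus its probability when the answer follows the prompt directly,
    # with no intermediate reasoning at all.
    no_reasoning = reasoning_ids.new_empty(0)
    baseline = probability_reward(model, prompt_ids, no_reasoning, reference_ids)
    return with_reasoning - baseline
```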

With the new reward, training stability became a challenge: rewards varied wildly between easy and hard prompts. RLPR filters out prompts with low reward variance (indicating overly simple or impossible tasks) using a dynamic threshold. This adaptive curriculum keeps training focused on useful samples and improves reliability.
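
A rough sketch of this variance-based filtering follows; the exponential-moving-average threshold update is an assumed stand-in for the paper's exact dynamic-threshold rule.

```python
import torch

def filter_prompts(prompt_rewards, init_threshold=0.1, momentum=0.9):
    """prompt_rewards: list of (prompt, rewards), where rewards come from
    several rollouts of the same prompt."""
    threshold, kept = init_threshold, []
    for prompt, rewards in prompt_rewards:
        spread = torch.tensor(rewards).std().item()
        # Keep prompts whose rollouts disagree enough to carry a learning signal.
        if spread >= threshold:
            kept.append(prompt)
        # Let the threshold track the reward spread seen during training.
        threshold = momentum * threshold + (1 - momentum) * spread
    return kept
```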

PR exhibits better reward quality compared with rule-based, model-based reward, and naive likelihood as a reward.

Evaluating RLPR on Benchmarks

This paper tested RLPR across seven benchmarks, four general-domain (e.g., MMLU-Pro, TheoremQA) and three math-focused (e.g., Minerva), using models like Gemma, Llama, and Qwen. Without verifiers, it consistently outperformed baselines. On TheoremQA, RLPR beat the verifier-free VeriFree by 7.6 points and surpassed General-Reasoner (which uses a 1.5B verifier model) by 1.6 points on average. It even excelled in math domains, improving Minerva scores by 7.5 points over VeriFree.

RLPR trained with different prompt templates achieves robust reasoning capability enhancement across all of them.

The probability reward proved highly reliable. In tests, it matched human judgments better than rule-based verifiers in general domains and rivaled specialized verifier models in math. By eliminating verifiers, RLPR opens doors to training on diverse, real-world data. In future work, we can see this technique being used to refine reward stability or explore hybrid approaches for niche domains. 

DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

Gong et al. [Apple, The University of Hong Kong]

♥ 294   DiffusionLM   bycloud’s pick

Introduction to DiffuCoder

LLMs can generate paragraphs of text easily but code generation often requires jumping back and forth to refine logic, which doesn’t always fit the left-to-right approach of autoregressive models. Diffusion large language models (dLLMs) offer an alternative by refining entire code sequences in parallel, enabling global planning. However, their decoding behavior and training methods for coding tasks remain underexplored.

To address this, the authors of this paper developed DiffuCoder, a 7-billion-parameter dLLM trained on 130 billion code tokens. They analyzed its unique generation patterns and introduced coupled-GRPO, a reinforcement learning method designed to enhance performance without compromising the model’s non-autoregressive strengths.

Inner Workings of DiffuCoder

The DiffuCoder model uses a masked diffusion approach. During training, it corrupts code sequences by randomly masking tokens and learns to reconstruct them step by step. Unlike autoregressive models that predict tokens sequentially, this method allows flexible refinement of any part of the sequence. The model’s decoding exhibits an "entropy sink" phenomenon: it initially shows high confidence in tokens near the given prefix, creating a bias toward left-to-right generation. But by raising the sampling temperature, the model becomes less autoregressive, diversifying not just which tokens are chosen but also their generation order. This flexibility enables more parallel processing.
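
As a rough illustration of the training objective described above, here is a simplified masked-diffusion step, assuming a model that returns per-position token logits; the masking schedule and loss weighting are simplified relative to the paper.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, code_ids, mask_token_id):
    """One training step: corrupt the code by masking random tokens, then
    train the model to reconstruct the originals at the masked positions."""
    batch, seq_len = code_ids.shape
    # Sample a corruption level per sequence and mask roughly that fraction of tokens.
    t = torch.rand(batch, 1).clamp(min=1e-3)
    mask = torch.rand(batch, seq_len) < t
    corrupted = code_ids.masked_fill(mask, mask_token_id)
    logits = model(corrupted)                # assumed shape: (batch, seq_len, vocab)
    # Only the masked positions contribute to the reconstruction loss.
    return F.cross_entropy(logits[mask], code_ids[mask])
```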

To optimize performance, the team adapted reinforcement learning for diffusion models. Traditional methods rely on inefficient Monte Carlo sampling for token probability estimates, increasing training overhead. Their solution, coupled-GRPO, generates pairs of complementary masked versions of the same code completion. Each token appears unmasked in one version, which ensures full coverage while reducing variance in probability calculations. This approach avoids semi-autoregressive decoding, preserving the model’s ability to refine code globally.
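
The coupling idea can be sketched as two complementary masks over one sampled completion, so every token is unmasked in exactly one of the two forward passes; how the resulting probability estimates enter the GRPO objective is omitted here.

```python
import torch

def complementary_masks(completion_len, mask_ratio=0.5, generator=None):
    """Two masks over a sampled completion such that every token is unmasked
    in exactly one of the two forward passes."""
    mask_a = torch.rand(completion_len, generator=generator) < mask_ratio
    mask_b = ~mask_a
    return mask_a, mask_b

# Every completion token is masked in exactly one of the two passes.
m_a, m_b = complementary_masks(8)
assert torch.all(m_a ^ m_b)
```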

Evaluation of DiffuCoder 

The researchers tested DiffuCoder’s base model and it achieved competitive results on code benchmarks, matching autoregressive models like OpenCoder (67.1% on HumanEval). Instruction tuning alone provided modest gains, but coupled-GRPO training significantly boosted performance: it improved EvalPlus scores by 4.4% using only 21,000 samples. The model also reduced its autoregressive bias, maintaining better accuracy even with half the decoding steps, effectively doubling generation speed. On BigCodeBench, coupled-GRPO increased pass rates by up to 5.6%, demonstrating efficiency on complex tasks.

These results show that dLLMs have real potential for code generation. DiffuCoder surpasses older diffusion-based approaches by using temperature to diversify generation order and coupled masks for more precise reinforcement learning.

Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

Yang et al. [University of Massachusetts, Massachusetts Institute of Technology]

♥ 732   VLM

Introduction to Machine Mental Imagery

Vision-language models are pretty good at understanding images and text, but they have an inherent limitation: they must translate every thought into words, even for tasks that demand visual imagination. As a result, they struggle with spatial reasoning tasks such as solving a jigsaw puzzle, which calls for mentally matching piece contours rather than describing each fragment.

Recent attempts to add image generation capabilities often compromise reasoning quality due to the heavy demands of pixel-level synthesis. This paper introduces Mirage, a new framework that bypasses explicit images entirely. Instead, it lets models weave compact visual tokens into their reasoning, mimicking human mental imagery for stronger performance without the pixel overhead.

Data Generation

Architecture of the Machine Mental Imagery

Mirage improves the vision-language models by letting them interleave latent visual tokens with ordinary text during decoding. When the model decides to "think visually," it generates a special token and reuses its current hidden state as a compact visual embedding. These embeddings act like simplified mental sketches, providing task-relevant cues for subsequent steps without generating full images. For example, in a navigation task, the model might insert a latent token to represent an imagined path arrow, then use it to plan the next move. This approach keeps reasoning lightweight and focused, avoiding the computational cost of external image decoders.
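
A conceptual sketch of this decoding loop is shown below, assuming a decoder that exposes its hidden states; `LATENT_TOKEN_ID`, `lm_head`, and `embed_tokens` are placeholder names, not Mirage’s actual interface.

```python
import torch

LATENT_TOKEN_ID = 32000  # hypothetical id of the special "think visually" token

def decode_with_latents(model, input_embeds, max_steps=64):
    """Greedy decoding loop that interleaves latent visual tokens with text."""
    embeds = input_embeds                                     # (1, t, hidden)
    for _ in range(max_steps):
        hidden = model(inputs_embeds=embeds).last_hidden_state
        next_id = model.lm_head(hidden[:, -1]).argmax(dim=-1)
        if next_id.item() == LATENT_TOKEN_ID:
            # "Think visually": reuse the current hidden state as a compact
            # visual embedding instead of decoding any pixels.
            next_embed = hidden[:, -1:]
        else:
            next_embed = model.embed_tokens(next_id).unsqueeze(1)
        embeds = torch.cat([embeds, next_embed], dim=1)
    return embeds
```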

Latent Grounding Supervision & Joint Optimization

To train this capability, Mirage uses a two-stage process. First, it grounds the latent tokens in actual visual data. The model learns to predict text tokens while reconstructing compressed embeddings from real images, using a combination of cross-entropy loss for text and cosine similarity loss for the visual alignments. This ensures the latent tokens stay meaningful and anchored to visual concepts. After this anchoring phase, the second stage removes direct supervision for the latent tokens. Now, the model generates these tokens autonomously and uses them as flexible priors to guide text generation. This shift allows the latent tokens to adapt freely to the task, optimizing only for accurate text outputs.
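
A minimal sketch of the stage-one objective: cross-entropy on text tokens plus a cosine term anchoring predicted latent tokens to compressed embeddings of the helper images. The `alpha` weighting and the image-compression step producing `visual_embeds` are assumptions.

```python
import torch
import torch.nn.functional as F

def stage_one_loss(text_logits, text_targets, pred_latents, visual_embeds, alpha=1.0):
    """Cross-entropy on text tokens plus a cosine term that keeps latent
    tokens anchored to ground-truth visual embeddings."""
    ce = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )
    cos = 1.0 - F.cosine_similarity(pred_latents, visual_embeds, dim=-1).mean()
    return ce + alpha * cos
```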

Sample output

Finally, reinforcement learning fine-tunes the entire system. The model samples multiple reasoning trajectories, then optimizes for correctness and proper formatting. Gradients flow through both text and latent tokens, refining how visual cues influence decisions. This step boosts performance by encouraging the model to interleave tokens strategically, like placing a visual hint mid-thought to clarify a spatial relationship, which makes the reasoning more intuitive and effective.
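
As an illustration only, here is a toy trajectory reward along the lines described above, scoring correctness plus a small formatting bonus; the actual scoring rule and policy-gradient details are not specified in this summary.

```python
def trajectory_reward(predicted_answer: str, gold_answer: str, well_formatted: bool) -> float:
    """Reward a sampled reasoning trajectory for correctness and formatting."""
    reward = 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0
    if well_formatted:
        reward += 0.1  # assumed small formatting bonus
    return reward
```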

Benchmark Performance of Machine Mental Imagery

Mirage was tested on diverse spatial reasoning benchmarks, including Visual-Spatial Planning (VSP), jigsaw puzzles, and SAT-style tasks. The results showed consistent improvements over text-only baselines. For example, on VSP navigation, Mirage achieved up to 89% accuracy, a 5% gain over chain-of-thought prompting and a 27% jump over zero-shot approaches. It also outperformed unified models like MVoT and Anole, which generate explicit images but falter in complex reasoning. Notably, these gains held even for smaller models, with a 3B-parameter version showing 10% improvements on SAT tasks.

7B VSP benchmark results

Additional tests confirmed that both training stages are essential: skipping the initial visual grounding caused a 7% drop in accuracy, while omitting the adaptive second stage led to a 37% decline. The latent tokens themselves clustered near genuine visual embeddings, which validated their role as compact imagery. However, it still has a few limitations such as reliance on synthesized helper images for training, which could introduce noise. 
