The Unreasonable Effectiveness of Reasonless Intermediate Tokens

Plus more about RL Finetunes Small Subnetworks in LLMs and Multimodal Large Diffusion LMs

May 19th ~ May 26th
#57 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 20k Anthropic has released Claude Opus 4 and Claude Sonnet 4, the next generation of its AI models, featuring a hybrid architecture with both instant and extended reasoning modes. Claude Opus 4 is Anthropic’s largest model and leads on software engineering benchmarks, while Claude Sonnet 4 offers stronger coding and reasoning capabilities than its predecessor.

  2. ♥ 5.3k Mistral has launched Document AI, which is an enterprise-grade document processing solution that uses advanced OCR technology and achieves 99%+ accuracy across multiple languages. The platform processes up to 2,000 pages per minute on a single GPU and offers structured data extraction capabilities for complex documents including tables, forms, and handwritten content.

findmypapers.ai got a tiny new feature 👀

papers preview, go give it a spin!

While we are improving the retrieval quality for finding AI research papers, we still want to make the search experience a bit less boring.

So now, we are able to display the papers that are being searched on! (truly a new tiny feature lol)

and a nicer citation section

Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

Stechly et al. [SCAI, Arizona State University]

♥ 569   LLM Reasoning

Thinking about Chain of Thought

If you’ve followed recent advances in AI reasoning, you would know that it’s common for LLMs to “think” through problems by generating intermediate steps known as a Chain of Thought (CoT). These steps are meant to resemble human reasoning traces and are widely assumed to explain how models arrive at answers.
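
For readers newer to the idea, here is a tiny illustrative sketch (ours, not the paper’s) of what a chain-of-thought style prompt and completion look like; the intermediate sentences before “The answer is…” are exactly the kind of tokens the paper scrutinizes.

```python
# Purely illustrative chain-of-thought example (not from the paper).
prompt = (
    "Q: A farmer has 17 sheep, buys 5 more, then sells 3. How many sheep are left?\n"
    "A: Let's think step by step."
)

# A typical CoT-style completion: intermediate "reasoning" tokens precede the final answer.
completion = (
    " Start with 17 sheep. Buying 5 more gives 17 + 5 = 22. "
    "Selling 3 leaves 22 - 3 = 19. The answer is 19."
)

print(prompt + completion)
```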

But are these intermediate tokens really necessary? This study argues that the semantic validity of CoT steps might matter far less than we think, and that models can excel even when their “thoughts” are nonsensical. Does training models to produce structured, algorithm-like reasoning traces actually teach them to reason algorithmically, or is the performance boost from CoT a byproduct of something else?

Tracing the Path Between Noise and Accuracy in Chain of Thought

To test the relationship between reasoning traces and model performance, the researchers designed a controlled environment using maze-solving tasks. Transformers were trained to generate both solutions (plans) and intermediate traces mimicking the A* search algorithm, a classic pathfinding method. Unlike prior work, this setup included a formal validator to check whether generated traces adhered to A*’s semantics (such as correctly updating node costs and prioritizing paths) while also verifying final solutions.
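
To make the setup concrete, here is a minimal A* sketch on a grid maze (our own illustration, not the authors’ code). The transformers in the paper are trained to emit traces that mimic steps like the expansions and cost updates below, and a validator checks whether those traces respect A* semantics.

```python
import heapq

def astar(grid, start, goal):
    """Minimal A* on a 0/1 grid (1 = wall). Illustrative sketch, not the paper's code."""
    def h(p):  # Manhattan-distance heuristic
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    frontier = [(h(start), 0, start, [start])]  # entries: (f = g + h, g, node, path)
    best_g = {start: 0}
    trace = []  # the kind of intermediate steps the paper's traces mimic

    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        trace.append(("expand", node, g))  # record a node expansion
        if node == goal:
            return path, trace
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng  # record a cost update
                    heapq.heappush(frontier, (ng + h((nr, nc)), ng, (nr, nc), path + [(nr, nc)]))
    return None, trace

maze = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
plan, steps = astar(maze, (0, 0), (2, 0))
print(plan)   # the "plan" whose validity is checked
print(steps)  # the intermediate trace whose semantics the validator checks
```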

Examples of mazes. The left is generated by Wilson’s algorithm and is used for model training. The right is generated by the Drunkard’s Walk algorithm and is used for out-of-distribution evaluation.

Surprisingly, models trained on correct A* traces often produced valid solutions alongside invalid intermediate steps. What’s even more surprising is that performance improved when models were trained on swapped traces, where intermediate steps were randomly paired with unrelated mazes. These models ignored the noisy, irrelevant traces during inference but still achieved higher accuracy, particularly on out-of-distribution mazes. 

After conducting this experiment, the researchers asked exactly the question you are probably asking right now: why would nonsensical traces help? The paper speculates that intermediate tokens act as a form of “prompt augmentation,” nudging the model toward better solutions without requiring semantically meaningful reasoning. Just as adversarial prompts subtly shift model behavior, CoT-like tokens might create an internal context that biases the model toward correct answers.
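
Here is a minimal sketch of the “swapped traces” setup described above: each training example keeps its own maze and solution, but its intermediate trace is taken from a randomly chosen, unrelated problem. The dictionary keys and function name are our own illustration, not the authors’ code.

```python
import random

def swap_traces(dataset, seed=0):
    """Pair each (maze, solution) with an intermediate trace from an unrelated example.

    `dataset` is assumed to be a list of dicts with keys "maze", "trace", and "solution";
    this mirrors the ablation described above, but the interface is hypothetical.
    """
    rng = random.Random(seed)
    donors = dataset[:]       # copy, then shuffle so traces get reassigned across examples
    rng.shuffle(donors)
    swapped = []
    for example, donor in zip(dataset, donors):
        swapped.append({
            "maze": example["maze"],          # the problem stays the same
            "trace": donor["trace"],          # the trace comes from an unrelated maze
            "solution": example["solution"],  # the target plan stays the same
        })
    return swapped
```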

Beyond the Illusion of Reasoning via Chain of Thought

As you might have guessed by now, the results were striking: models trained on swapped traces achieved 51.6% plan validity on Wilson-generated mazes (vs. 50.1% for correct traces) and outperformed baselines on out-of-distribution tasks. For example, on “Drunkard’s Walk” mazes, the swapped-trace model solved 26% of tasks, compared to just 2.5% for the A*-trace model. Meanwhile, trace validity plummeted to 0%, confirming that the model’s success had little to do with executing A* semantics.

These findings challenge two assumptions:

  1. Anthropomorphization of CoT: Intermediate tokens are not reliable indicators of human-like reasoning. Models can generate correct answers despite flawed or irrelevant traces.

  2. Trace semantics as a training target: Enforcing algorithmically valid reasoning steps may be unnecessary (or even counterproductive) for maximizing performance.

Reinforcement Learning Finetunes Small Subnetworks in Large Language Models

Mukherjee et al. [University of Illinois Urbana-Champaign]

♥ 741   Sparse LLMs  

The Mystery of Sparse Updates in RL

When we think about training AI models, the assumption is often that more computation and more parameter updates lead to better performance. But what if most of a model’s parameters barely need to move? This study finds that RL fine-tuning often updates only 5-30% of a model’s parameters yet achieves results nearly identical to full fine-tuning.

Remarkably, this sparsity emerges without explicit regularization or architectural constraints. The researchers found that RL fine-tuning consistently modifies a small, similar subset of parameters across diverse algorithms (PPO, DPO, GRPO, etc.) and model families (Llama, Tulu, DeepSeek). What’s even more intriguing is that freezing the majority of parameters and training only this “subnetwork” reproduces the performance of a fully fine-tuned model.
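
A minimal sketch of how this kind of update sparsity can be measured by comparing checkpoints before and after RL fine-tuning (PyTorch-style; our own illustration, and the tolerance is an assumed threshold rather than the paper’s exact criterion):

```python
import torch

def update_sparsity(model_before, model_after, tol=1e-6):
    """Return the fraction of parameters that changed by more than `tol` during fine-tuning."""
    after = dict(model_after.named_parameters())
    changed, total = 0, 0
    for name, p_before in model_before.named_parameters():
        delta = after[name].detach() - p_before.detach()
        changed += (delta.abs() > tol).sum().item()  # parameters that actually moved
        total += delta.numel()
    return changed / total

# Usage (assumes both models share the same architecture and parameter names):
# frac = update_sparsity(base_model, rl_finetuned_model)
# print(f"{frac:.1%} of parameters were touched by RL fine-tuning")  # ~5-30% per the paper
```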

How RL Discovers and Uses Sparse Subnetworks

The study reveals three core insights about how RL interacts with LLM parameters:

  1. Sparsity is Universal and Intrinsic: The updates occur on a small subnetwork (5-30% of parameters) across 10 models and 7 RL algorithms, while the rest remain nearly unchanged. This sparsity isn’t confined to specific layers or components, as nearly all parameter matrices receive sparse, scattered updates. 

  2. Subnetworks Are Consistent Across Training Conditions: When training with different random seeds, datasets, or RL methods, the updated parameters overlap significantly more than chance. For example, subnetworks trained with DPO and PRIME (two distinct algorithms) share ~59% of their updated parameters. This suggests pretrained LLMs contain a latent, reusable structure that RL selectively activates, regardless of external variables.

  3. Updates Are Sparse but Full-Rank: Despite their sparsity, the parameter changes are full-rank, meaning they span the entire subspace of possible updates for each matrix. This contrasts with methods like LoRA, which constrain updates to low-rank subspaces. In other words, RL fine-tuning isn’t just pruning redundant parameters; it’s identifying a compact set of weights that can represent nearly any adjustment the model needs (see the sketch after this list).
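
A minimal sketch of how the overlap and full-rank claims could be checked for a single weight matrix, given update masks and deltas from two runs (PyTorch-style; the overlap metric here is an illustrative intersection-over-union choice, not necessarily the paper’s exact definition):

```python
import torch

def mask_overlap(mask_a, mask_b):
    """Intersection-over-union of two boolean update masks with the same shape."""
    inter = (mask_a & mask_b).sum().item()
    union = (mask_a | mask_b).sum().item()
    return inter / max(union, 1)

def is_full_rank(delta):
    """Check whether an update matrix delta = W_after - W_before has full rank."""
    rank = torch.linalg.matrix_rank(delta.float()).item()
    return rank == min(delta.shape)

# Usage, assuming masks/deltas were extracted per weight matrix as in the sketch above:
# print(mask_overlap(mask_dpo, mask_prime))  # the paper reports ~59% overlap for DPO vs. PRIME
# print(is_full_rank(delta))                 # sparse yet full-rank, unlike low-rank LoRA updates
```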

Benchmark performance and Implications

  • Performance Parity: Training only the identified subnetwork matches or slightly exceeds the accuracy of full fine-tuning (see the sketch after this list). For example, on MATH500, a subnetwork-trained PRIME model outperformed its fully fine-tuned counterpart by +2.4% overall, with gains up to +5.2% on harder problems.

  • Parameter Replication: Subnetwork-trained models converge to nearly identical parameter values as their fully fine-tuned counterparts (94-100% similarity, depending on numerical tolerance).

  • Efficiency Insights: Sparsity correlates with training on “in-distribution” data (samples close to the model’s current policy). Techniques like KL regularization or gradient clipping—often thought to stabilize training—have minimal impact on sparsity.
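
The “train only the subnetwork” idea can be approximated by zeroing gradients outside the identified subnetwork. Below is a minimal PyTorch-style sketch (our own illustration of the idea, not the authors’ implementation); `masks` is assumed to map parameter names to boolean tensors marking the subnetwork.

```python
import torch

def train_only_subnetwork(model, masks):
    """Register gradient hooks so only parameters inside the subnetwork get updated.

    `masks[name]` is a boolean tensor (True = inside the subnetwork) with the same
    shape as parameter `name`; such masks could come from a prior fine-tuning run.
    """
    handles = []
    for name, param in model.named_parameters():
        mask = masks[name].to(device=param.device, dtype=param.dtype)
        # The hook multiplies each gradient by the mask during backward,
        # freezing every parameter outside the subnetwork.
        handles.append(param.register_hook(lambda grad, m=mask: grad * m))
    return handles  # keep these so the hooks can be removed later if needed
```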

The implications of these findings:

  • Reduced Compute Costs: Early identification of subnetworks could drastically cut RL training time and memory usage.

  • Preserved Pretrained Knowledge: By updating fewer parameters, RL may avoid overwriting valuable pretrained capabilities, addressing a common concern in fine-tuning.

  • Algorithm Design: Future RL methods could explicitly target sparse updates, combining the benefits of full-rank adjustments with parameter efficiency.

MMaDA: Multimodal Large Diffusion Language Models

Yang et al. [Princeton University, Peking University, Tsinghua University, ByteDance Seed]

♥ 393   Multimodal Diffusion LM   bycloud’s pick  

Introduction to MMaDA

When we think of super strong AI, we might imagine a single model that can solve complex math problems, explain visual diagrams, and generate photorealistic images. This has become more attainable with the rise of multimodal foundation models, which aim to unify language, vision, and other modalities under one architecture.

However, a critical gap remains: most existing models struggle to balance post-training refinement (like reinforcement learning) with the demands of diverse tasks, leading to trade-offs between reasoning accuracy, factual consistency, and generative quality.

This paper introduces MMaDA (Multimodal Large Diffusion Language Models), a framework designed to harmonize these objectives by reimagining diffusion models as unified solvers for both textual and visual data.

Specific design choices employed by different unified multimodal foundation model families.

How MMaDA Works

There are three main innovations which make the MMaDA model possible.

  1. Unified Diffusion Architecture
    MMaDA treats text and images as sequences of discrete tokens processed through a shared diffusion pipeline. Unlike traditional hybrid models that use separate components for different modalities, MMaDA uses a single masked token predictor. This predictor learns to reconstruct corrupted inputs (whether text snippets or image patches) using a unified cross-entropy loss. By aligning the noise-corruption process across modalities during pretraining, the model builds an intrinsic understanding of how textual logic and visual semantics interconnect.

  2. Mixed Long Chain-of-Thought Fine-Tuning
    To enhance reasoning, MMaDA introduces mixed long chain-of-thought (CoT) fine-tuning. Here, the model generates step-by-step explanations (e.g., solving a geometry problem) or factual rationales (e.g., describing why a landmark is culturally significant) before producing a final answer or image. These CoT traces follow a standardized format, allowing the model to generalize across tasks. For instance, when generating an image of the Statue of Liberty, MMaDA first infers its historical context (“gifted by France”) and then synthesizes visuals aligned with that knowledge.

  3. Unified Reinforcement Learning with UniGRPO
    The final piece is UniGRPO, a reinforcement learning algorithm tailored for diffusion models. Traditional RL methods falter with diffusion’s iterative denoising process, but UniGRPO cleverly samples diverse “masking ratios” during training. By partially masking outputs (like hiding 25% of the tokens in a math solution or an image caption) and rewarding accurate reconstructions, the model learns to balance correctness, formatting, and human preferences (see the sketch after this list). This approach unifies rewards for text reasoning, multimodal understanding, and image generation, avoiding the need for task-specific tuning.
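
To make points 1 and 3 more concrete, here is a minimal sketch of a masked-token training step with a randomly sampled masking ratio and a unified cross-entropy loss over masked positions (our own PyTorch-style illustration under assumed interfaces, not MMaDA’s actual code):

```python
import torch
import torch.nn.functional as F

def masked_token_step(predictor, tokens, mask_token_id):
    """One illustrative training step for a unified masked-token predictor.

    `tokens` is a (batch, seq_len) tensor of discrete token ids that may mix text
    tokens and image-patch tokens; `predictor` is assumed to return
    (batch, seq_len, vocab_size) logits. Interfaces are hypothetical.
    """
    batch, seq_len = tokens.shape
    ratios = torch.rand(batch, 1)                # one masking ratio per sequence
    mask = torch.rand(batch, seq_len) < ratios   # True = position gets corrupted
    corrupted = tokens.masked_fill(mask, mask_token_id)

    logits = predictor(corrupted)
    # Unified cross-entropy: only masked positions contribute, regardless of modality.
    loss = F.cross_entropy(logits[mask], tokens[mask])
    return loss
```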

Results and Implications of MMaDA

The MMaDA-8B model outperforms specialized models across three categories of tasks:

  • Textual Reasoning: Surpasses LLaMA-3-8B and Qwen2-7B on math benchmarks like GSM8K (73.4% accuracy vs. 53.1% for LLaMA-3).

  • Multimodal Understanding: Achieves 76.7% on VQAv2, eclipsing Show-o (69.4%) and SEED-X (47.9%).

  • Image Generation: Scores 32.46 CLIP points and 1.15 Image Reward, outperforming SDXL (32.12 CLIP) and Janus (1.03 Image Reward).

Evaluation on Multimodal Understanding Benchmarks

Although these results are promising, MMaDA relies on discrete tokenization, which introduces computational overhead. Moreover, its CoT traces occasionally drift into verbose reasoning.
