LLM That Can Modify Itself?
Plus more about "The Diffusion Duality" and "Reinforcement Pre-Training"
JUNE 10th ~ JUNE 16th
#60 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
HOT Google has officially launched the Gemini 2.5 family, along with a new Gemini 2.5 Flash Lite preview.
♥ 5.5k OpenAI o3-pro is now available to all Pro users in ChatGPT and in the API.
o3 Benchmark
♥ 1.1k Hailuo AI released MiniMax-M1, a 456B-parameter hybrid MoE model with lightning attention that supports 1M-token contexts. It is currently somewhat controversial because the test-time scaling vs. FLOPs comparison was not done properly.
♥ 3.7k New Anthropic research dives into how they built a multi-agent system where LLMs autonomously coordinate tool usage and parallelize web searches, offering insights into scalable system architecture, agent reliability, and prompt design for production-ready deployments.
The AI Timeline: Premium Insights
Recently, we introduced a premium membership for The AI Timeline!
With the membership, you receive exclusive insights/explainers on technical AI topics and monthly research trend reports that contain my analysis of 40+ papers.
Plus, we are also scheduling more content in the future, so don’t miss out!
The Diffusion Duality
Sahoo et al. [Cornell Tech, EPFL Lausanne]
♥ 365 LLM Diffusion bycloud’s pick
Unlocking Faster Text Generation with Diffusion Duality
We have seen recent papers where researchers try to build text generation models on diffusion processes. These models promise efficient self-correction, but discrete variants like Uniform-state Diffusion Models (USDMs) often lag behind autoregressive and masked diffusion approaches in both speed and quality.
The biggest challenge with the USDM approach is that it lacks the advanced training and sampling techniques that power its continuous counterparts. This paper introduces Duo, a method that bridges Gaussian and discrete diffusion by revealing a hidden connection: discrete states naturally emerge from underlying Gaussian processes. This duality unlocks game-changing optimizations, accelerating both training and sampling while closing the performance gap.

How Duo Rewires Diffusion Mechanics
The Duo mechanism maps Gaussian diffusion latents to discrete states via a simple argmax operation. When this transformation is applied to noisy Gaussian vectors, argmax collapses them into categorical distributions matching USDMs, with diffusion parameters remapped through a diffusion transformation operator. This has several practical implications. First, curriculum learning leverages the Gaussian backbone to reduce training variance: by starting with a tempered softmax approximation of argmax (which eases reconstruction) and gradually hardening it to the true argmax, models learn faster.
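To make the mapping concrete, here is a minimal PyTorch sketch (not the authors' code; the interpolation form, temperature schedule, and function names are assumptions) of how a noisy Gaussian latent can be collapsed to a discrete state with argmax, and how a tempered softmax can stand in for argmax early in training:

```python
import torch
import torch.nn.functional as F

def gaussian_to_discrete(x0_onehot, alpha, tau=None):
    """x0_onehot: (batch, seq, vocab) one-hot tokens; alpha: signal level in [0, 1]."""
    noise = torch.randn_like(x0_onehot)
    z = alpha * x0_onehot + (1 - alpha ** 2) ** 0.5 * noise   # Gaussian diffusion latent
    if tau is None:
        # Hard mapping: each latent collapses to a categorical (discrete) state.
        return F.one_hot(z.argmax(dim=-1), num_classes=z.size(-1)).float()
    # Curriculum: tempered softmax relaxes argmax; tau -> 0 recovers the hard mapping.
    return F.softmax(z / tau, dim=-1)

def tau_schedule(step, total_steps, tau_start=1.0, tau_end=1e-3):
    """Assumed geometric annealing of the softmax temperature over training."""
    frac = step / max(total_steps, 1)
    return tau_start * (tau_end / tau_start) ** frac
```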

Second, the duality enables Discrete Consistency Distillation (DCD). Since USDMs lack deterministic Probability Flow ODEs, Duo constructs proxy trajectories in Gaussian space: clean data and noise are combined into continuous paths, then discretized via argmax. A student model distills knowledge from a teacher by matching output distributions across these discrete points, skipping stochastic sampling altogether. This reduces the number of sampling steps and speeds up the entire workflow.
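A rough sketch of one distillation step, under the assumption that a single noise draw defines the shared trajectory and that the student at the noisier point is pulled toward the teacher's distribution at a cleaner point (the actual DCD objective and schedules may differ):

```python
import torch
import torch.nn.functional as F

def dcd_step(student, teacher, x0_onehot, alpha_t, alpha_s):
    """alpha_t < alpha_s: the student sees the noisier discretized latent."""
    eps = torch.randn_like(x0_onehot)                      # one noise draw defines the shared path
    z_t = alpha_t * x0_onehot + (1 - alpha_t ** 2) ** 0.5 * eps
    z_s = alpha_s * x0_onehot + (1 - alpha_s ** 2) ** 0.5 * eps
    x_t = z_t.argmax(dim=-1)                               # discrete proxy state (noisier)
    x_s = z_s.argmax(dim=-1)                               # discrete proxy state (cleaner)
    with torch.no_grad():
        target = teacher(x_s).log_softmax(dim=-1)          # teacher's distribution at the cleaner point
    pred = student(x_t).log_softmax(dim=-1)                # student's distribution at the noisier point
    return F.kl_div(pred, target, log_target=True, reduction="batchmean")
```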
Benchmark Gains and Future Horizons
The benchmark results show that Duo's curriculum-trained models outperform autoregressive baselines in zero-shot perplexity on 3 of 7 language tasks, which shows that USDMs can compete with established methods. Sampling efficiency also improved significantly: the DCD approach reduces inference costs by 100× while maintaining quality, notably outpacing masked diffusion in few-step regimes.

However, it is not all sunshine and roses; a few limitations remain, such as handling large vocabularies, where the diffusion transformation narrows. Still, the path forward is clear: by borrowing from Gaussian diffusion's rich toolkit, Duo shows that USDMs can be a viable alternative for real-time applications.

Self-Adapting Language Models
Zweiger et al. [Massachusetts Institute of Technology]
♥ 3.1k LLM RL
Introduction to Self-Adapting LLMs
Large language models often feel frozen in time, unable to integrate new knowledge or adapt to tasks beyond their initial training. This limitation forces reliance on in-context learning or resource-heavy finetuning, both of which struggle with sparse data or suboptimal formats.
This paper introduces SEAL, a framework that enables models to generate their own training directives, self-edit their weights, and evolve autonomously.

Overview of SEAL. In each RL outer loop iteration, the model generates candidate self-edits (SE) — directives on how to update the weights, applies corresponding updates, evaluates performance on a downstream task, and uses the resulting rewards to improve the self-edit generation policy.
Inner Workings of SEAL
The SEAL architecture uses a multi-loop design to iteratively improve. First, the model processes a task context, such as a factual passage or a few-shot example set, and generates a "self-edit": a natural-language instruction specifying synthetic data (e.g., implications of a text) or optimization parameters (e.g., learning rates). For knowledge integration, it rewrites passages into distilled facts; for few-shot reasoning, it selects data augmentations like rotations or resizing.
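As a concrete illustration of the self-edit step for knowledge integration, here is a minimal sketch using a Hugging Face-style generate API; the prompt wording and the parsing into training statements are assumptions, not SEAL's exact interface:

```python
SELF_EDIT_PROMPT = (
    "Read the passage below and list its key implications as standalone training "
    "statements, one per line.\n\nPassage:\n{passage}\n\nImplications:"
)

def generate_self_edit(model, tokenizer, passage, max_new_tokens=256):
    inputs = tokenizer(SELF_EDIT_PROMPT.format(passage=passage), return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    completion = text.split("Implications:")[-1]            # drop the echoed prompt
    # Each non-empty generated line becomes one synthetic training example for SFT.
    return [line.strip("- ").strip() for line in completion.split("\n") if line.strip()]
```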

SEAL Reinforcement Learning Loop. The specific format of the self-edits (SE) are defined per task domain.

These self-edits trigger a supervised fine-tuning pass that updates the model's weights via lightweight LoRA adapters. A reinforcement learning loop then trains the self-edit policy: the model samples multiple edits, applies them, and receives rewards based on downstream performance (e.g., QA accuracy). Only edits that boost performance are reinforced via rejection sampling. This dual-loop design, generation followed by validation, transforms static models into adaptive learners.
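Condensed into pseudocode-style Python, the dual loop looks roughly like the sketch below; the helper callables stand in for SEAL's components and are assumptions, not the paper's API:

```python
def seal_outer_loop(model, contexts, generate_edit, apply_edit, eval_fn, reinforce,
                    n_edits=4, n_iters=2):
    """generate_edit(model, ctx) -> self-edit; apply_edit(model, edit) -> LoRA-updated model;
    eval_fn(model, ctx) -> downstream score; reinforce(model, pairs) -> updated policy."""
    for _ in range(n_iters):                               # outer RL iterations
        accepted = []
        for ctx in contexts:
            baseline = eval_fn(model, ctx)
            for _ in range(n_edits):                       # sample candidate self-edits
                edit = generate_edit(model, ctx)
                candidate = apply_edit(model, edit)        # inner loop: SFT via LoRA adapters
                if eval_fn(candidate, ctx) > baseline:     # reward = downstream improvement
                    accepted.append((ctx, edit))           # rejection sampling keeps winners
        model = reinforce(model, accepted)                 # train the self-edit policy on kept pairs
    return model
```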

Evaluation and Benchmark Results
In knowledge incorporation tests using SQuAD passages, SEAL lifted no-context QA accuracy from 33.5% (base model) to 47.0% after two RL iterations which outperformed GPT-4.1-generated synthetic data. For few-shot reasoning on ARC tasks, it achieved 72.5% success by autonomously configuring augmentations and hyperparameters, vastly exceeding non-RL baselines (20%). However, sequential updates revealed catastrophic forgetting, and computational overhead remains high due to per-edit finetuning.

These results spotlight SEAL’s potential for data-scarce settings, where models must self-distill knowledge. Future work could tackle forgetting via retention-focused rewards or expand to continual pretraining.

Reinforcement Pre-Training
Dong et al. [Microsoft Research, Peking University, Tsinghua University]
♥ 424 LLM Training

Reinforcement Pre-Training for Language Models
Large language models have steadily improved through self-supervised pre-training on large amounts of data scraped from the internet. However, integrating reinforcement learning (RL) into this process has faced several hurdles: costly human feedback data risks reward hacking, while verifiable-reward approaches struggle with limited annotated datasets. This paper introduces Reinforcement Pre-Training (RPT), which bridges this gap by transforming next-token prediction into a reasoning task trained with scalable RL.

How Reinforcement Pre-Training Reinvents Language Modeling
Reinforcement Pre-Training reframes next-token prediction as a reasoning challenge. For any text snippet, the model generates multiple "thinking trajectories", chains of thought exploring why a token should follow, before predicting the next token. Each trajectory earns a verifiable reward: 1 if the prediction matches the ground-truth token from the corpus, 0 otherwise. This rule-based reward sidesteps reward hacking and leverages unannotated text as RL training data.
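The reward itself can be stated in a few lines; this sketch (with an assumed trajectory format) scores a group of sampled reasoning trajectories against the corpus token:

```python
def rpt_reward(predicted_token: str, ground_truth_token: str) -> float:
    """1 if the final prediction matches the corpus token, 0 otherwise."""
    return 1.0 if predicted_token == ground_truth_token else 0.0

def score_trajectories(trajectories, ground_truth_token):
    """trajectories: list of (chain_of_thought, predicted_token) pairs sampled for one context."""
    return [rpt_reward(pred, ground_truth_token) for _, pred in trajectories]
```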

To optimize learning, RPT prioritizes challenging tokens. Before training, a proxy model identifies high-entropy tokens (where predictions are uncertain) so that computational effort is focused where reasoning matters most. During rollout, the model samples multiple reasoning paths per context, using a prefix-matching reward that validates predictions against token boundaries in the corpus. This encourages the model to explore hypotheses, self-correct, and deduce patterns rather than memorize.
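A sketch of both ideas under assumed interfaces: entropy-based filtering on a proxy model's logits, and a prefix-matching check against known token boundaries of the ground-truth continuation:

```python
import torch

def high_entropy_positions(proxy_logits: torch.Tensor, threshold: float) -> torch.Tensor:
    """proxy_logits: (seq_len, vocab). Returns indices of uncertain (hard) positions."""
    probs = proxy_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return (entropy > threshold).nonzero(as_tuple=True)[0]

def prefix_matching_reward(pred_text: str, ground_truth: str, boundaries: set[int]) -> float:
    """Credit a prediction only if it reproduces the ground-truth continuation
    exactly up to a valid token boundary (boundaries: character offsets, assumed given)."""
    return 1.0 if ground_truth.startswith(pred_text) and len(pred_text) in boundaries else 0.0
```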

The approach uses established RL algorithms like GRPO for training, with dynamic sampling to boost efficiency. By integrating reasoning directly into pre-training, RPT aligns the model’s internal "thought process" with token prediction and effectively scales inference-time computation during training itself.
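For the RL update itself, GRPO-style training computes group-relative advantages from the binary rewards without a separate value network; the dynamic-sampling filter below (dropping groups whose rollouts all got the same reward) is an assumption about how uninformative groups are skipped:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_groups, rollouts_per_group) binary RPT rewards.
    Each rollout's advantage is its reward normalized against its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def informative_group_mask(rewards: torch.Tensor) -> torch.Tensor:
    """Assumed dynamic-sampling filter: drop groups with identical rewards,
    since they contribute zero advantage signal."""
    return rewards.std(dim=1) > 0
```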
Performance Gains and Scaling Potential of Reinforcement Pre-Training
RPT significantly boosts next-token accuracy, especially on complex tasks. When tested on math-heavy datasets, a 14B-parameter RPT model matched the performance of a 32B-parameter baseline and achieved up to 45% accuracy on hard tokens, which is nearly 3× higher than standard methods. Additionally, RPT exhibits predictable scaling: accuracy improves consistently with compute across easy, medium, and hard tasks, following a power-law curve.

For building new applications, the RPT approach can serve as a robust foundation. Fine-tuning it with RL on specialized tasks like competition-level math gives faster convergence and higher performance ceilings. Zero-shot evaluations on MMLU-Pro and SuperGPQA showed gains of 7–22 points over baselines, which highlights its generalization power.

These results indicate that RPT can be considered as a scalable alternative to conventional pre-training, minimizing the gap between self-supervised learning and reinforcement fine-tuning.
🚨This week's top AI/ML research papers:
- Self-Adapting Language Models
- V-JEPA 2
- The Illusion of the Illusion of Thinking
- Magistral
- Reinforcement Pre-Training
- VideoDeepResearch
- Unsupervised Elicitation of LMs
- CoRT
- The Diffusion Duality
- Ming-Omni
- One…
— The AI Timeline (@TheAITimeline)
11:15 PM • Jun 15, 2025