The AI Timeline
What even is a >< former (yes >< former)

plus more about Looped World Models, Fixed-Point Reasoners, and ExpRL

by cloud
June 23, 2026

June 17th ~ June 23rd
#113 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

♥ 12k Z.ai has announced the release of GLM-5.2, an open-weights model under the MIT license with a 1-million-token context window and updates in coding, reasoning, and agentic tasks. It has two reasoning effort levels [GLM-5.2 (max) and GLM-5.2 (high)] designed to help developers balance task performance with token efficiency. You can try it on Hugging Face.
♥ 1k Poolside has released the base and post-trained weights for Laguna M.1, a model featuring a 256K context length licensed under Apache 2.0. Alongside the model checkpoints, the team has also released "pool" (an agent harness) designed to let developers run the model locally as a coding agent. You can try it on Hugging Face.
♥ 37k Sakana AI has announced Sakana Fugu, a multi-agent orchestration system accessible through a single model API. The underlying Fugu Ultra model is trained to recursively call and coordinate various LLMs in an agent pool to manage complex, multi-step technical workflows. You can try it on the Sakana AI website.

Intuitive AI Academy - NEW Optimization Chapter!

My latest project, Intuitive AI Academy, has the perfect starting point for you! We focus on building your intuition to understand LLMs, from transformer components to post-training logic. All in one place.

We just added a new chapter on Optimization that goes through the history, the key techniques, and the current state of the optimizers that frontier models use.

We currently have an exclusive offer for newsletter readers: 40% off the yearly plan.

Use code: TIMELINE

Looped World Models

Lu et al. [FaceMind Research Asia]

♥ 473 World Models

World models are AI models that predict how an environment will change in response to actions, which lets an agent plan in the physical world. Most of them spend the same amount of compute on every single moment, whether an object is simply sitting still or dynamically colliding with another. They are also too slow and computationally expensive to run on smaller, everyday devices.

The overall framework of our proposed Looped World Models (LoopWM).

To solve this, the paper introduces Looped World Models, or LoopWM. Instead of using a giant, resource-heavy stack of unique layers, this architecture uses a smaller, shared set of layers and runs them repeatedly in a loop to refine its predictions.

Because the model reuses its existing components rather than adding new ones, it achieves the predictive quality of much larger networks with a fraction of the parameters.
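To make the parameter argument concrete, here is a minimal sketch that compares a stack of unique layers with a small shared set run in a loop. Everything in it is an assumption made for illustration: the toy `make_layer` and `apply_layer` functions, the 8-dimensional state, and the 12-versus-3 layer split are not taken from the paper.

```python
import random

def make_layer(dim, seed):
    """Hypothetical toy layer: a dim x dim matrix of small random weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

def apply_layer(weights, state):
    """Residual update: state + W @ state."""
    return [s + sum(w * x for w, x in zip(row, state))
            for s, row in zip(state, weights)]

def count_params(layers):
    """Total number of weights across a list of layers."""
    return sum(len(w) * len(w[0]) for w in layers)

def run_stacked(layers, state):
    """Classic deep stack: every layer has its own weights and runs once."""
    for weights in layers:
        state = apply_layer(weights, state)
    return state

def run_looped(shared_layers, state, num_loops):
    """Looped model: the same few layers are reused num_loops times."""
    for _ in range(num_loops):
        for weights in shared_layers:
            state = apply_layer(weights, state)
    return state

DIM = 8
stacked = [make_layer(DIM, seed) for seed in range(12)]  # 12 unique layers
shared = [make_layer(DIM, seed) for seed in range(3)]    # 3 shared layers

stacked_params = count_params(stacked)  # 12 * 8 * 8 weights
looped_params = count_params(shared)    # 3 * 8 * 8 weights

# Four passes over three shared layers give 12 layer applications,
# the same effective depth as the stack, with a quarter of the weights.
out = run_looped(shared, [1.0] * DIM, num_loops=4)
```

The sketch only shows the bookkeeping. Whether a looped model actually matches the quality of the larger stack is the empirical claim the paper makes, not something this toy demonstrates.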

Method	Latent dynamics	Intermediate decode	Action injection	Looped depth
Dreamer (Hafner et al., 2020)	RSSM	reward + value at each step	per step	–
MuZero (Schrittwieser et al., 2020)	learned MLP	policy + value + reward	per step	–
PlaNet (Hafner et al., 2019)	RSSM	reconstruction at each step	per step	–
ETD (Koishekenov et al., 2026)	looped layers	decode only at end	– (language)	✓
NE-Dreamer (Bredis et al., 2026)	RSSM	embedding alignment	per step	–
LoopWM-DD (ours)	looped transformer	decode only at step K	per step in latent	✓

This looping structure also allows the model to adjust its computing power on the fly. It can run fewer loops during simple, predictable events and automatically allocate more loops to complex scenarios like physical collisions.

Relative increase over Qwen3.7-max on automatic online performance, compared against baselines.

Furthermore, the architecture introduces "deferred decoding," which allows the system to process a sequence of actions entirely in its abstract internal language. Rather than wasting energy rendering the visual details of every intermediate step, it only decodes the final result at the very end. This combination of parameter efficiency and smart resource allocation offers a promising, practical path forward for real-time AI planning.
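Deferred decoding is easy to picture with a toy rollout. The sketch below assumes a hypothetical `encode`, `latent_step`, and decoder (none of these are the paper's components) and simply counts how many times the expensive decoder runs in each strategy.

```python
def encode(observation):
    """Hypothetical encoder: maps an observation to a latent state."""
    return [0.5 * x for x in observation]

def latent_step(latent, action):
    """Hypothetical latent dynamics: advance one step given an action."""
    return [0.9 * z + action for z in latent]

class CountingDecoder:
    """Hypothetical decoder that records how often it is called."""
    def __init__(self):
        self.calls = 0

    def __call__(self, latent):
        self.calls += 1
        return [2.0 * z for z in latent]

def rollout_decode_every_step(observation, actions, decoder):
    """Baseline: render an output after every single action."""
    latent = encode(observation)
    frames = []
    for action in actions:
        latent = latent_step(latent, action)
        frames.append(decoder(latent))
    return frames[-1]

def rollout_deferred(observation, actions, decoder):
    """Deferred decoding: stay in latent space, decode once at the end."""
    latent = encode(observation)
    for action in actions:
        latent = latent_step(latent, action)
    return decoder(latent)

actions = [0.1, -0.2, 0.3, 0.0]
eager_decoder, lazy_decoder = CountingDecoder(), CountingDecoder()
eager_result = rollout_decode_every_step([1.0, 2.0], actions, eager_decoder)
lazy_result = rollout_deferred([1.0, 2.0], actions, lazy_decoder)
```

Both rollouts end at the same prediction here because the toy latent dynamics never look at decoded outputs. The deferred version just skips the intermediate rendering work.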

Variable-Width Transformers

Wu et al. [MIT-IBM Watson AI Lab]

♥ 437 Transformers bycloud’s pick

When we scale LLMs, we assume that every layer requires the same computational budget, even though different stages of a model's processing pipeline perform entirely different roles.

><former, where different layers have different widths

This paper introduces an hourglass-shaped transformer architecture known as the >< former. This design keeps the initial and final layers wide but significantly narrows the layers in the middle.

The effect of the bottleneck layer index and dimension on language modeling loss, parameterized as a ratio to the total number of layers and the base dimension.

To handle the fluctuating widths without losing information or adding complex math, the researchers implemented a clever, parameter-free residual resizing mechanism. The model maintains a wide global residual stream where narrower middle layers only read from and write to a specific slice. The remaining inactive dimensions are simply carried forward, bypassing the narrow layers entirely so they can be restored later in the network.
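The slice-based residual described above can be sketched in a few lines. This is only an illustration under stated assumptions: the active slice is taken to be the first `width` dimensions, `toy_layer` is a hypothetical stand-in for an attention or MLP block, and the width schedule is made up.

```python
def toy_layer(x):
    """Hypothetical stand-in for a block that operates at width len(x)."""
    return [0.1 * v + 1.0 for v in x]

def apply_with_slice(stream, width, layer):
    """Let a layer of the given width read and write only a slice of the stream.

    Assumption: the active slice is the first `width` dimensions.
    The remaining dimensions are carried forward untouched, with no
    extra parameters for resizing.
    """
    active, passive = stream[:width], stream[width:]
    update = layer(active)
    new_active = [a + u for a, u in zip(active, update)]  # residual add
    return new_active + passive

# Hourglass schedule: wide at both ends, narrow in the middle.
widths = [8, 4, 2, 4, 8]
stream = [float(i) for i in range(8)]
for width in widths:
    stream = apply_with_slice(stream, width, toy_layer)
```

In this toy, the last four dimensions are only touched by the two width-8 layers, so whatever the first wide layer wrote there is still available to the final wide layer.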

Language modeling loss vs. pre-training FLOPs (left) and average layer size (right). ><former produces lower loss at smaller FLOP and average layer size costs.

Across various model sizes, the ><former consistently outperformed traditional, uniform-width baselines in language modeling. By spending width only where it is needed, the architecture achieved a 22% reduction in training compute and a 15% reduction in memory and I/O costs for the key-value (KV) cache.

Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers

Movahedi et al. [ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center, ETH Zurich, Swiss Institute of Bioinformatics, Université Paris Cité, Liquid AI]

♥ 363 Transformers

When solving a difficult puzzle like a complex maze or a challenging Sudoku, humans naturally spend more time thinking. For AI, scaling its computational effort based on the difficulty of a task is hard.

Signal propagation and adaptivity, FPRM vs. TRM:

Some researchers use looped neural networks, which process information by running it through the same internal layers repeatedly. However, these looped models face two problems: looping too many times can destabilize the signals inside the network, and it is notoriously difficult for the model to decide on its own when it has "thought" enough and should stop.

The blessing and the curse of depth in Looped Transformers.

To address these challenges, researchers designed a framework called the Fixed-Point Reasoning Model, or FPRM. The team resolved the stability issue that affects deeply looped networks by restructuring how the model balances its internal signals. By switching to a pre-normalization setup paired with residual scaling, they successfully kept the model's internal data stable and trainable, even when running through many loops.
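For intuition, here is a minimal sketch of a generic pre-normalization update with a scaled residual branch, repeated in a loop. The exact formulation in the paper is not given above, so treat the update rule `x + alpha * block(rms_norm(x))`, the `alpha` value, and the `block` stand-in as assumptions.

```python
import math

def rms_norm(x, eps=1e-6):
    """Rescale x so its root-mean-square is approximately 1."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def block(x):
    """Hypothetical stand-in for the shared attention/MLP block."""
    return [math.tanh(v) for v in x]

def pre_norm_step(x, alpha):
    """Pre-norm residual update: normalize the input to the block,
    then add a scaled version of the block output to the raw stream."""
    update = block(rms_norm(x))
    return [v + alpha * u for v, u in zip(x, update)]

def run_loops(x, num_loops, alpha):
    """Apply the same pre-norm step num_loops times."""
    for _ in range(num_loops):
        x = pre_norm_step(x, alpha)
    return x

start = [3.0, -4.0, 0.5, 1.0]
num_loops, alpha = 64, 0.1
end = run_loops(start, num_loops, alpha)
```

Because the block always sees a normalized input and its output is scaled by `alpha`, each loop can move a coordinate by at most `alpha` in this toy, so the state drifts gradually instead of blowing up as loops are added.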

FPRM architecture

Most importantly, FPRM introduces a self-contained halting mechanism based on mathematical convergence. Instead of relying on an external, complex module to decide when to stop, the model loops until its internal representations settle into a stable state, known as a fixed point.
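Convergence-based halting can be illustrated with a plain fixed-point iteration. The sketch below is not FPRM itself: the update `tanh(weight * h + x)`, the tolerance, and the loop cap are assumptions chosen so that the toy map is guaranteed to settle.

```python
import math

def step(h, x, weight=0.5):
    """Hypothetical looped update. With |weight| < 1 and tanh, this map
    is a contraction in h, so repeated application converges."""
    return [math.tanh(weight * hi + xi) for hi, xi in zip(h, x)]

def solve_fixed_point(x, tol=1e-6, max_loops=200):
    """Loop until the state stops changing, then halt.

    Returns the final state and the number of loops used.
    """
    h = [0.0] * len(x)
    for loops in range(1, max_loops + 1):
        h_next = step(h, x)
        delta = max(abs(a - b) for a, b in zip(h_next, h))
        h = h_next
        if delta < tol:  # the state has settled: stop thinking
            return h, loops
    return h, max_loops

state, loops_used = solve_fixed_point([0.3, -1.2, 2.0])
residual = max(abs(a - b) for a, b in zip(step(state, [0.3, -1.2, 2.0]), state))
```

The stopping rule needs no separate halting module: the loop count falls out of how quickly a given input converges, which is the property the paragraph above describes.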

ExpRL: Exploratory RL for LLM Mid-Training

Xiang et al. [Stanford University, Carnegie Mellon University, OpenAI]

♥ 855 LLM RL

RL helps AI models solve complex reasoning problems, but training collapses when problems are too difficult because the model rarely stumbles upon the correct final answer to receive any feedback. Without a way to reward partial progress and smart intermediate steps, AI models cannot easily discover the creative strategies needed to solve challenging math and science problems.

Exploratory RL (ExpRL)

To solve this, researchers developed a method called ExpRL, which stands for Exploratory Reinforcement Learning. Instead of forcing the AI to blindly copy human solutions, this approach keeps the correct answer hidden from the AI while it attempts a problem.

Pass@k after training with ExpRL on HMMT-Nov-2025 (128 samples).

An automated judge then uses that hidden answer as a customized grading rubric to evaluate the AI's step-by-step thinking. By analyzing these drafts, the judge can award points for productive intermediate progress. This turns a simple pass-fail test into a learning experience that guides the AI's exploration.
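The difference between pass-fail feedback and rubric-based partial credit can be sketched as follows. This is a toy under loud assumptions: the real judge is an automated model, while `toy_judge` here is a hypothetical substring check, and the rubric items and draft are invented for the example.

```python
def outcome_reward(final_answer, reference_answer):
    """Pass-fail feedback: 1.0 only when the final answer matches."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0

def toy_judge(draft, rubric):
    """Hypothetical judge: marks a rubric item as met when its text
    appears in the draft. A real judge would be a model, not a
    substring check."""
    text = draft.lower()
    return [item.lower() in text for item in rubric]

def rubric_reward(draft, rubric):
    """Partial credit: the fraction of rubric items the judge marks as met."""
    marks = toy_judge(draft, rubric)
    return sum(marks) / len(marks)

# Invented example: the rubric is built from a reference solution that
# the policy never sees.
rubric = ["complete the square", "x = 3", "check both roots"]
draft = "First I complete the square, then check both roots, so x = 5."

sparse = outcome_reward("x = 5", "x = 3")
dense = rubric_reward(draft, rubric)
```

The draft reaches the wrong answer, so pass-fail feedback gives it nothing, while the rubric still credits the two productive steps. That non-zero signal on hard problems is what gives the model something to learn from.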

ExpRL training dynamics during Stage-I.

The researchers found that this supportive warm-up phase prepares the AI for subsequent, more rigorous training. Compared to traditional fine-tuning or basic right-or-wrong feedback, models prepared with this method showed a much broader diversity of problem-solving strategies.

Behavior changes after RL priming relative to the base model

The AI naturally began to exhibit helpful thinking habits, such as double-checking its work, self-correcting mistakes mid-calculation, and backtracking when a strategy failed.

How Transparent is DiffusionGemma?

Engels et al. [Google DeepMind]

♥ 210 DiffusionLM

Understanding how AI arrives at its decisions is vital for ensuring safety, debugging errors, and preventing misuse. While traditional LLMs think out loud step-by-step in plain English, newer text diffusion models like DiffusionGemma perform their reasoning in a continuous mathematical space.

A simplified architecture diagram of DiffusionGemma, with the first two denoising steps shown

Researchers tried to translate these hidden mathematical states back into human-readable concepts. They discovered that the complex vector information flowing between the model's iterative denoising steps can actually be mapped into a small bottleneck of natural language tokens.

Breakdown of intermediate state top token identities with different restrictions averaged across WildChat prompts

Restricting the model to just a handful of these mapped words at each step caused almost no drop in its overall performance. By showing that these intermediate states represent interpretable guesses about the final text, the researchers successfully demonstrated that the model’s hidden reasoning depth can be simplified to a level nearly identical to traditional systems.
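One way to picture the token bottleneck is to replace a full intermediate distribution with its top few tokens and rebuild the state from only those. The sketch below is an assumption-heavy illustration, not the paper's procedure: the tiny vocabulary, the probabilities, the embeddings, and the renormalize-then-mix scheme are all hypothetical.

```python
def top_k_bottleneck(probs, k):
    """Keep only the k most likely tokens and renormalize their weights."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

def mix_embeddings(weights, embeddings):
    """Rebuild a state vector as a weighted mix of the kept token embeddings."""
    dim = len(embeddings[0])
    return [sum(w * embeddings[i][d] for i, w in weights.items())
            for d in range(dim)]

# Hypothetical 4-token vocabulary with 2-dimensional embeddings.
vocab = ["cat", "dog", "car", "pet"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
probs = [0.05, 0.6, 0.1, 0.25]

weights = top_k_bottleneck(probs, k=2)
kept_tokens = [vocab[i] for i in weights]
restricted_state = mix_embeddings(weights, embeddings)
```

The kept tokens are readable on their own, which is the point: if the next denoising step works about as well from this restricted state as from the full vector, the intermediate state carried little beyond what a few words can express.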

DiffusionGemma thinks less than Gemma on all monitorability evaluations. Error bars show 95% confidence intervals of mean number of characters in each model’s chain of thought.

The researchers watched the model accurately estimate its final response length before choosing its words, retroactively correct earlier mistakes after formulating its reasoning, and even temporarily weigh multiple sentences at once.

Most importantly, evaluations showed that these complex internal mechanics do not compromise safety; the model remains just as monitorable as traditional architectures.
