Wait, Wait, Wait... Why Do Reasoning Models Loop?
and more on Dead Salmons of AI Interp, GDPO, From Entropy to Epiplexity
Jan 7th ~ Jan 13th
#90 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 6.7k Midjourney has launched Niji V7, and it delivers significantly more realistic anime aesthetics alongside improved text rendering and coherence. Many users are already praising the stunning quality of the new model, as its output is nearly indistinguishable from anime created by traditional artists.

♥ 596 MiniMax has made its debut on the Hong Kong Stock Exchange, achieving a staggering $13.7 billion valuation on its first day. MiniMax is one of the few open-source models dominating global benchmarks; it offers an Anthropic-compatible API that delivers top-tier performance, and you can also access the MiniMax agent on the web.

♥ 486 Axiom has released an analysis of its AxiomProver model, which reveals that the AI successfully solved complex problems (like A6) that stumped human mathematicians, even while struggling with calculus concepts humans consider "obvious." The breakdown highlights a fascinating divergence in logic, where the model often ignores human elegance in favor of brute-force analysis or unexpected geometric strategies to construct valid Lean proofs. View a Lean dependency graph of how the AI solved Putnam problems.

Support My Newsletter
As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!
The Dead Salmons of AI Interpretability
Meloux et al. [Université Grenoble Alpes, Icahn School of Medicine at Mount Sinai]
♥ 4.9k Interpretability
Years ago, neuroscientists famously "detected" brain activity in a dead salmon, a spurious result produced by uncorrected statistical errors. Today, researchers argue that the field of AI interpretability (which tries to explain how complex models think) is facing its own "dead salmon" moment.
As artificial intelligence becomes a part of our lives, we need to know if the methods we use to understand these systems are actually working. The authors of this study tried to determine whether current techniques are finding true insights into machine logic or simply hallucinating patterns in the noise.

The tuple of a target behavior P_U and the computational system with its internal components forms an SCM.
The team discovered that many popular interpretability methods return convincing-looking results even on neural networks that are completely random and untrained. Much like the salmon experiment, these tools generated plausible-sounding explanations for mathematical gibberish, which reveals that the field suffers from significant statistical fragility.
The researchers explained that this happens because current queries are often "non-identifiable," meaning convincing answers can be found even where no real logic exists. They propose that instead of just accepting an explanation at face value, we must treat it as a statistical estimate.

An interpretability task is defined by three elements: E, the hypothesis space; µ, the distribution over causal queries about the SCM (model and behavior); and D, the error measure.
This requires a new framework where findings are rigorously tested against random baselines to prove they describe a real computational mechanism rather than a statistical fluke.
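For intuition, a baseline test along those lines might look something like the sketch below. This is a generic illustration rather than the paper's protocol: reinitialize and score_fn are hypothetical stand-ins for the random "dead salmon" control and for whatever interpretability metric is being evaluated.

```python
import copy
import numpy as np
import torch
import torch.nn as nn

def reinitialize(model: nn.Module) -> nn.Module:
    """Copy the model and re-randomize its weights: the 'dead salmon' control."""
    null_model = copy.deepcopy(model)
    for p in null_model.parameters():
        nn.init.normal_(p, std=0.02)
    return null_model

def null_test(model: nn.Module, probe_inputs: torch.Tensor, score_fn, n_null: int = 100):
    """Compare an interpretability score on the trained model against the
    distribution of scores the same method produces on untrained networks.

    score_fn(model, probe_inputs) -> float stands in for whatever metric the
    interpretability method reports (e.g., faithfulness of a recovered circuit).
    """
    observed = score_fn(model, probe_inputs)
    null_scores = np.array(
        [score_fn(reinitialize(model), probe_inputs) for _ in range(n_null)]
    )
    # Fraction of random networks that score at least as well: a crude p-value.
    p_value = float((null_scores >= observed).mean())
    return observed, null_scores, p_value
```

If the trained model's score is not clearly separated from the random-network distribution, the "explanation" may be describing noise rather than a real mechanism.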
From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence
Zheng et al. [New York University]
♥ 1.9k Entropy Computationally Bounded Intelligence
Traditional information theory suggests that processing data deterministically cannot create new information, yet AI systems like AlphaZero develop superhuman strategies without seeing human data, and models trained on synthetic data often outperform their predecessors.
Researchers realized that treating all observers as having infinite computing power (which is a standard mathematical assumption) misses the point. This paper tries to define how a computer with limited processing power actually perceives value in data to distinguish between random noise and the useful, learnable structures.

Illustration of random vs structural information.
The team introduced a concept called "epiplexity," a measure of the structural information a specific, resource-constrained observer can extract. They found that information isn't just about what is in the data, but how hard a computer has to work to decode it. When models are forced to solve harder problems (such as deducing the logic of a chess game from the board state rather than just predicting the next move), they acquire higher epiplexity.

Experiments on Factorization
This struggle forces the AI to construct richer internal programs and sophisticated mental models. The research also explains why language data, which is dense with logical rules, often builds more versatile intelligence than image data, which contains high randomness but less complex, learnable structure.

Information created with cellular automata.
This framework suggests that by measuring epiplexity, engineers could move beyond trial-and-error to scientifically select training data that maximizes learning.
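The cellular-automaton figure above captures the core intuition: a fully deterministic rule can unfold a trivial seed into patterns that a bounded observer finds rich and learnable, even though no Shannon information has been added. Below is a minimal generator for that kind of data; the choice of Rule 110 and the grid sizes are illustrative, not the paper's experimental setup.

```python
import numpy as np

def elementary_ca(rule: int, width: int = 64, steps: int = 64, seed: int = 0) -> np.ndarray:
    """Run a 1-D elementary cellular automaton and return its space-time grid.

    The update is fully deterministic given the seed row, yet the output can
    contain rich structure that a bounded learner has to work to decode.
    """
    rng = np.random.default_rng(seed)
    # Rule table: 3-cell neighborhood (encoded as 0..7) -> next cell state.
    table = np.array([(rule >> i) & 1 for i in range(8)], dtype=np.uint8)
    state = rng.integers(0, 2, size=width, dtype=np.uint8)
    history = [state.copy()]
    for _ in range(steps - 1):
        left, right = np.roll(state, 1), np.roll(state, -1)
        idx = (left << 2) | (state << 1) | right   # neighborhood code 0..7
        state = table[idx]
        history.append(state.copy())
    return np.stack(history)

# Rule 110 is famously Turing complete: trivial rules, complex learnable patterns.
diagram = elementary_ca(rule=110)
print(diagram.shape)  # (64, 64) grid of 0/1 cells
```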
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Liu et al. [NVIDIA]
♥ 680 GDPO RL
As artificial intelligence evolves, users are no longer satisfied with models that simply output correct facts. We now expect them to juggle complex behaviors simultaneously (writing clean code, adhering to strict formatting, and remaining concise), all while being accurate.
However, teaching models to balance these competing actions has proven surprisingly difficult. The standard training technique, known as GRPO, tends to blur these distinct goals together. When researchers looked closely, they realized this approach causes a "signal collapse," where the mathematical feedback for different levels of success looks identical to the model.
This means that the AI becomes unable to distinguish between a partially correct attempt and a truly successful one, which often causes it to ignore harder tasks in favor of easier ones.

To solve this, researchers at NVIDIA and HKUST introduced a new method called Group reward-Decoupled Normalization Policy Optimization (GDPO). Instead of mixing every piece of feedback into a single, muddy signal, this approach processes each goal independently before combining them.

By normalizing rewards separately, the method preserves the fine-grained resolution of the training signal, ensuring the model understands that hitting two targets is numerically better than hitting just one. In tests spanning mathematical reasoning, coding, and tool usage, this clearer feedback allowed models to significantly outperform previous standards. The AI could finally balance strict constraints, such as keeping answers short, without sacrificing the accuracy of its reasoning.
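As a toy contrast between the two normalization schemes described above (a sketch of the idea, not the paper's exact objective or hyperparameters):

```python
import numpy as np

def grpo_advantages(reward_components: np.ndarray) -> np.ndarray:
    """GRPO-style: sum every reward component per rollout, then normalize the
    totals across the group, blending all goals into one signal."""
    totals = reward_components.sum(axis=1)                 # (group_size,)
    return (totals - totals.mean()) / (totals.std() + 1e-8)

def gdpo_advantages(reward_components: np.ndarray) -> np.ndarray:
    """GDPO-style sketch: normalize each reward component across the group
    separately, then combine, preserving per-goal resolution."""
    normed = (reward_components - reward_components.mean(axis=0)) / (
        reward_components.std(axis=0) + 1e-8
    )                                                      # (group_size, n_rewards)
    return normed.sum(axis=1)

# Toy group of 4 rollouts scored on [correctness, formatting, brevity].
rewards = np.array([
    [1.0, 0.0, 0.0],   # correct only
    [1.0, 1.0, 0.0],   # correct and well formatted
    [0.0, 1.0, 1.0],   # wrong, but tidy and short
    [0.0, 0.0, 0.0],   # misses everything
])
# Rollouts 2 and 3 share the same total, so GRPO hands them identical
# advantages; decoupled normalization still tells them apart.
print(grpo_advantages(rewards))
print(gdpo_advantages(rewards))
```

The paper's actual objective will differ in its details (clipping, weighting, and so on), but the core contrast is what the sketch shows: normalize per reward before combining, rather than after summing.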

This allows researchers to balance incentives used during training and fine-tune models to respect complex, multi-layered preferences without destabilizing the learning process.
How to Set the Learning Rate for Large-Scale Pre-training?
Zhou et al. [Shanghai AI Laboratory, Shanghai Jiao Tong University, Fudan University]
♥ 450 LLM Learning Rate
Training the massive AI systems of tomorrow is incredibly expensive, consuming vast amounts of time and computational power. One of the biggest headaches engineers face is setting the "learning rate"—essentially the speed at which the AI absorbs new information. If the rate is set too slow, training drags on inefficiently; if it is set too fast, the model gets confused and fails to learn. Until now, finding that "Goldilocks" speed for giant models has felt like a high-stakes guessing game, because running trial-and-error tests on such a massive scale costs a fortune. Researchers set out to solve this by determining if we can run small, cost-effective experiments to accurately predict the perfect settings for the big leagues.

Visualization of the optimal learning rate relative to model size N and data size D
The team compared two major strategies: trying to mechanically transfer settings from small models to big ones, versus using mathematical "scaling laws" to predict the best numbers based on trends. The results were illuminating. The researchers found that by analyzing the relationship between the model's size and the amount of data it consumes, they could derive a precise formula to predict the optimal learning speed. This "fitting" approach significantly outperformed older methods. Surprisingly, they also discovered that simpler is often better. While some theories suggest tweaking the learning speed for different parts of the AI’s "brain" separately, this study showed that a single, globally optimized speed works just as well. Modern AI architectures proved robust enough to learn effectively without needing complex, piece-by-piece micromanagement.
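A rough sketch of what such a "fit small, then extrapolate" workflow could look like is below. The power-law form and all of the sweep numbers are invented for illustration; the paper derives its own formula from real small-scale runs.

```python
import numpy as np

# Hypothetical small-scale sweeps: (params N, training tokens D, best LR found).
# Both the power-law form and these numbers are made up for illustration.
runs = [
    (1.0e8, 2.0e9, 1.6e-3),
    (2.5e8, 1.0e10, 9.0e-4),
    (5.0e8, 1.0e10, 8.0e-4),
    (1.0e9, 4.0e10, 5.0e-4),
]

# Fit log(lr) = log(c) + a*log(N) + b*log(D) by least squares.
logN = np.log([r[0] for r in runs])
logD = np.log([r[1] for r in runs])
loglr = np.log([r[2] for r in runs])
X = np.column_stack([np.ones_like(logN), logN, logD])
(log_c, a, b), *_ = np.linalg.lstsq(X, loglr, rcond=None)

def predict_lr(n_params: float, n_tokens: float) -> float:
    """Extrapolate the fitted trend to a much larger training run."""
    return float(np.exp(log_c + a * np.log(n_params) + b * np.log(n_tokens)))

print(predict_lr(7.0e10, 1.5e12))  # predicted LR for a hypothetical large run
```

The appeal is that the expensive part stays cheap: only small models are ever swept exhaustively, and the large run inherits a predicted setting.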

Performance comparison between the global optimal LR (red line) and module-wise optimal LR (blue line) on a 4B model trained for 120B tokens.
This discovery is a breath of fresh air for the future of AI development. It offers a reliable map for engineers, allowing them to skip the expensive, wasteful trial-and-error phase and move straight to efficient training. By understanding exactly how data volume and model size influence the learning process, developers can confidently scale up their systems to unprecedented sizes. It suggests a future where building more capable AI isn't just about having the biggest budget, but about leveraging the fundamental laws that govern how machines learn to build smarter and more sustainable systems.
Wait, Wait, Wait... Why Do Reasoning Models Loop?
Pipis et al. [MIT, Microsoft Research, University of Wisconsin-Madison]
♥ 731 Reasoning Models
AI is evolving from simple chatbots into complex reasoning engines, and a peculiar behavior has emerged: when faced with difficult math or logic puzzles, models sometimes get stuck in endless repetitive loops.
Researchers recently launched a deep dive to understand why these models get stuck in a rut. This paper tries to answer whether adding randomness (raising the "temperature") is a genuine solution or just a temporary band-aid. By investigating how smaller models learn from larger, smarter ones, the team sought to uncover the root causes of this stalling behavior to build more reliable thinkers.

Looping with greedy decoding.
The investigation revealed that looping is often a symptom of "risk aversion" born from imperfect learning. When a smaller "student" model tries to mimic a "teacher," it often fails to grasp the difficult, precise steps required to make progress. Instead, it retreats to safe, easy-to-learn actions, such as restating the problem or repeating a previous thought.
The researchers discovered that while dialing up the randomness does help break these loops, it does not actually fix the model's underlying confusion. Instead of learning the correct path, the model simply explores more chaotically until it potentially stumbles across the solution, resulting in reasoning chains that are much longer than necessary.
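A toy illustration of that dynamic is below. The token names and probabilities are invented for the sketch; real models loop over whole phrases rather than single symbolic tokens.

```python
import numpy as np

# Toy next-token distribution at a "stuck" reasoning step: the student puts
# most mass on restating the problem (the safe, easy-to-learn move) and only
# a little on the hard step that actually makes progress.
tokens = ["<restate_problem>", "<hard_step>", "<give_answer>"]
probs = np.array([0.85, 0.10, 0.05])

def steps_to_escape(temperature: float, max_steps: int = 50, seed: int = 0) -> int:
    """Count decoding steps until the toy 'model' stops restating the problem."""
    rng = np.random.default_rng(seed)
    for step in range(1, max_steps + 1):
        if temperature == 0.0:
            choice = int(np.argmax(probs))            # greedy decoding
        else:
            logits = np.log(probs) / temperature      # temperature sampling
            p = np.exp(logits - logits.max())
            choice = rng.choice(len(tokens), p=p / p.sum())
        if tokens[choice] != "<restate_problem>":
            return step                               # escaped the loop
    return max_steps                                  # still looping

print(steps_to_escape(temperature=0.0))   # greedy: never escapes (hits max_steps)
print(steps_to_escape(temperature=1.0))   # sampling: escapes, but by luck and at length
```

Greedy decoding never leaves the safe token, while temperature sampling escapes only because randomness eventually lands on the hard step, which mirrors the escape-by-luck behavior described above.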

Temporally correlated errors induce low-temperature loops.
Now that we understand looping is caused by specific learning errors rather than just bad settings, engineers can design better training benchmarks. Additionally, instead of relying on randomness to shake models out of a loop, researchers can target these "hard-to-learn" steps.