- The AI Timeline
Generative Recursive Reasoning
plus more on the Benefits of Subword Tokenization, HRM-Text, Probabilistic Tiny Recursive Model, and Vector Policy Optimization
May 19th ~ May 26th
#109 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 525 Tencent has released Hy-MT2, a new open-source multilingual translation model which supports 33 languages across 1.8B, 7B, and 30B parameter scales. This update has improved translation accuracy, stronger instruction-following capabilities, and a highly quantized 1.8B version optimized for local deployment on mobile devices. You can try it on GitHub or Hugging Face.

♥ 4.8k Alibaba has launched Qwen3.7-Max, an LLM designed to support advanced agentic workflows and long-horizon tasks. The model has enhanced capabilities for autonomous end-to-end coding, multi-agent orchestration, and complex tool-calling environments. You can try it on Qwen Studio or access it via the Alibaba Cloud API.

♥ 9.5k Google has introduced Gemini 3.5 Flash, a lightweight model optimized for complex coding and agentic workflows. The model balances fast execution speed with improved performance on multi-step reasoning and developer-focused tasks. You can try it on Google AI Studio or the Gemini App.

♥ 13k Cursor has introduced Composer 2.5, an updated model designed to handle complex programming and long-running development tasks. It is built on Moonshot's open-source Kimi K2.5 base, but this version uses reinforcement learning with text feedback to improve instruction following over extended context windows.

Intuitive AI Academy - NEW Advanced RL Chapter!
My latest project Intuitive AI Academy has the perfect starting point for you! We cover everything from the basics, like transformer architecture, all the way to more advanced topics like LoRA, distillation, Mixture of Experts, and RLHF.
The goal is simple: make frontier AI systems easy to understand with clear explanations, visuals, interactive learning, and a structured path from fundamentals to cutting-edge techniques.
We have just added a new advanced RL chapter that covers the basics of RL and the current state of RLHF! We currently have an early bird offer: early users get 40% off the yearly plan.
Use code: TIMELINE
Generative Recursive Reasoning
Baek et al. [KAIST, Québec AI Institute, New York University, Université de Montréal]
♥ 1.4k Reasoning
Currently, many AI models use a technique called recursive reasoning. Instead of relying on massive parameter sizes, these compact models loop through the same internal functions to repeatedly refine their computations and think deeper about a problem.

Comparison of Latent Reasoning Trajectories.
However, there is a major limitation: these existing models are completely rigid. They lock onto a single train of thought and march toward a single conclusion. True problem-solving requires holding onto multiple hypotheses and exploring alternative strategies, especially when a puzzle has more than one valid answer.

Performance on puzzle benchmarks
To solve this, researchers developed Generative Recursive Reasoning Models, or GRAM. This framework transforms rigid AI logic into a flexible, probabilistic process. Instead of forcing the model to make a predetermined update at each step, GRAM makes each update stochastic, so the same input can lead to different reasoning paths.
When evaluating a problem, the system cycles through inner and outer loops of internal refinement. At each step, it takes its current reasoning state and adds a learned stochastic perturbation (essentially a controlled mathematical variation). This tweak allows the system to branch out and maintain multiple parallel trains of thought simultaneously.
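To make the idea concrete, here is a minimal sketch of a recursive refinement loop with a perturbation added at every step. It is a toy under stated assumptions, not the paper's implementation: the `refine` update, the fixed `sigma` (which GRAM would learn), and all function names are hypothetical.

```python
import random

def refine(state, target):
    """Hypothetical deterministic update: move halfway toward a target state."""
    return [s + 0.5 * (t - s) for s, t in zip(state, target)]

def stochastic_refine(state, target, sigma, rng):
    """Deterministic update plus Gaussian noise.
    Assumption: sigma is a fixed constant here; in GRAM the perturbation is learned."""
    return [x + rng.gauss(0.0, sigma) for x in refine(state, target)]

def sample_trajectories(init, target, n_paths=4, outer=3, inner=2, sigma=0.1, seed=0):
    """Run several independent reasoning paths through nested outer/inner loops."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        state = list(init)
        for _ in range(outer):
            for _ in range(inner):
                state = stochastic_refine(state, target, sigma, rng)
        paths.append(state)
    return paths

# Four parallel paths from the same starting state end in different places.
paths = sample_trajectories([0.0, 0.0], [1.0, -1.0])
```

Setting `sigma=0.0` collapses every path onto the same trajectory, which is the rigid behavior described earlier; raising `n_paths` is the "wider exploration" knob.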

Evaluation on N-Queens and Graph Coloring benchmarks.
By allowing models to be both deep in their thinking and wide in their exploration, GRAM successfully tackles structured puzzles that require balancing hard constraints, like complex Sudoku. It allows an AI’s thinking power to be scaled up on the fly simply by exploring more parallel paths.
Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
Gigant et al. [Nous Research]
♥ 1k Tokenization bycloud’s pick
When AI reads a sentence, it first has to chop the text into smaller pieces, a process known as tokenization. The researchers behind this paper dissect the mechanics of tokenization to understand what gives token-based models a performance advantage over simpler, byte-level models.

To solve this mystery, the researchers trained a model to read raw bytes, but artificially injected the specific benefits of subword tokenization one at a time to see which variables actually moved the needle.
Their experiments revealed two things. First, these tokens act as a powerful form of data compression. By packing more information into fewer structural pieces, the AI can process a significantly larger volume of text using the exact same amount of computing power.
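The compression effect is easy to see with a little arithmetic. The sketch below uses a made-up subword split (not the output of any real tokenizer) to compare how much raw text fits into the same number of sequence positions.

```python
# Hypothetical subword split of a short string; a real tokenizer may differ.
text = "tokenization compresses text"
subwords = ["token", "ization", " compress", "es", " text"]

n_bytes = len(text.encode("utf-8"))      # positions a byte-level model needs
n_tokens = len(subwords)                 # positions a subword model needs
bytes_per_token = n_bytes / n_tokens     # compression ratio for this example

def bytes_covered(position_budget, bytes_per_position):
    """Raw text (in bytes) that fits in a fixed number of sequence positions."""
    return position_budget * bytes_per_position

byte_level = bytes_covered(1000, 1)                # byte model: 1 byte per position
subword_level = bytes_covered(1000, bytes_per_token)
```

With this toy split, each subword position covers 5.6 bytes on average, so the same budget of 1,000 positions spans several times more text than a byte-level model sees.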
In addition to processing speed, the researchers discovered that the boundaries of these subwords serve as crucial structural hints. Because subword chunks naturally align with human semantics, knowing where a chunk begins and ends gives the AI a structural map, making the complex task of predicting language inherently easier.

Validation loss when providing the start or end of subword boundaries
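One way to picture "boundary hints" is as an extra 0/1 flag attached to each byte, marking where a subword starts or ends. The helper below is an illustrative sketch: the function name and the example split are assumptions, and the paper's exact way of feeding this signal to the model is not shown here.

```python
def boundary_flags(subwords):
    """Flatten subwords into bytes and mark subword starts and ends per byte."""
    data = bytearray()
    starts, ends = [], []
    for piece in subwords:
        raw = piece.encode("utf-8")
        for i in range(len(raw)):
            starts.append(1 if i == 0 else 0)             # first byte of the subword
            ends.append(1 if i == len(raw) - 1 else 0)    # last byte of the subword
        data.extend(raw)
    return bytes(data), starts, ends

# Hypothetical split of "unbreakable" into three subwords.
data, starts, ends = boundary_flags(["un", "break", "able"])
```

A byte-level model given `starts` or `ends` alongside `data` sees the same raw bytes as before, but also knows where the meaningful chunks sit.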
By isolating the ingredients that make current language models so successful, researchers no longer have to rely on trial and error to understand their tools. Understanding the basics of tokenization will allow engineers to design even smarter, more elegant architectures.
HRM-Text: Efficient Pretraining Beyond Scaling
Wang et al. [Sapient Intelligence, MIT]
♥ 759 LLM pre-training
Building a foundational language model usually requires massive supercomputers to process internet-scale data. This brute-force approach is incredibly expensive and locks the broader research community out of foundational exploration. It is also very different from how humans learn: most people can grasp complex rules from just a few examples.

HRM-Text architecture
The researchers of this paper took inspiration from how biological brains process information at multiple speeds, and used it to create a new system called HRM-Text. Instead of using the standard architecture that powers most modern models, they built a system that separates thinking into two distinct rhythms.
A slow-evolving strategic layer maintains the big-picture context, while a fast-evolving execution layer handles immediate details. To ensure this deep, looping computation remains stable, the team developed clever mathematical balancing techniques.
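The two-rhythm idea can be sketched as a nested loop: a fast state updates at every step, while a slow state updates only once per cycle. Everything below is a toy assumption (scalar states, invented coefficients, hypothetical function names); the real model uses learned neural modules, and its stabilization techniques are not reproduced here.

```python
import math

def fast_step(low, high, x):
    """Hypothetical fast (execution) update: reacts to the input and both states."""
    return math.tanh(0.5 * low + 0.3 * high + x)

def slow_step(high, low):
    """Hypothetical slow (strategic) update: summarizes the fast state's result."""
    return math.tanh(0.8 * high + 0.4 * low)

def run(x, cycles=3, fast_per_cycle=4):
    low, high = 0.0, 0.0
    low_updates = high_updates = 0
    for _ in range(cycles):
        for _ in range(fast_per_cycle):
            low = fast_step(low, high, x)   # fast layer: every step
            low_updates += 1
        high = slow_step(high, low)         # slow layer: once per cycle
        high_updates += 1
    return low, high, low_updates, high_updates

low, high, low_updates, high_updates = run(0.2)
```

With the defaults, the fast state is updated 12 times and the slow state only 3 times, which is the separation of timescales the paragraph above describes.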

Evaluation results of HRM-Text 1B and contemporary fully open or open-weight models.
Furthermore, they abandoned the traditional method of forcing models to endlessly guess the next word in random internet text. Instead, they trained the system exclusively on instruction-and-response pairs, allowing the model to fully absorb a complete question before efficiently generating an answer.

The researchers trained a model from scratch for roughly fifteen hundred dollars, using a tiny fraction of the standard data. Despite requiring hundreds of times less computing power, this compact system performs competitively against open foundation models that are significantly larger and far more expensive to build.
The result suggests that careful architecture and data design can sharply reduce the cost of entry to foundation-model research and open it up to far more of the research community.
Probabilistic Tiny Recursive Model
Sghaier et al. [Mila – Quebec AI Institute, ILLS & ETS Montreal]
♥ 459 Recursive models
Tiny Recursive Models are small AI systems that tackle complex reasoning tasks like extreme Sudoku. Instead of generating text word by word like massive language models, these tiny systems solve problems by continuously refining a single internal thought until they reach an answer.

However, because the models followed a strictly rigid, deterministic path, taking a wrong mental turn early on trapped them in a dead end. They would get stuck in bad solution basins and were unable to backtrack or brainstorm new approaches.

PTRM mechanism
To solve this, researchers started injecting a small amount of Gaussian noise into the AI's thought process at every step. This gentle disruption allows the model to run dozens of parallel trains of thought and explore diverse possibilities simultaneously.

To choose the best answer from these new paths, the team cleverly repurposed an existing part of the model called a Q head. Originally designed just to tell the AI when to stop thinking during its initial training, this component turned out to be a phenomenal judge of whether a thought trajectory was actually correct.
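Put together, the recipe is "sample many noisy rollouts, then let a scorer pick one". The sketch below illustrates that pattern on a toy constraint. The `refine` update, the hand-written `q_head` scorer (a stand-in for the learned Q head), and the constraint itself are all hypothetical.

```python
import random

TARGET_SUM = 10  # toy constraint: the rounded values should sum to 10

def refine(z, sigma, rng):
    """Hypothetical refinement: pull each value toward its nearest integer, plus Gaussian noise."""
    return [zi + 0.5 * (round(zi) - zi) + rng.gauss(0.0, sigma) for zi in z]

def q_head(z):
    """Stand-in for the learned Q head: 0 when the constraint holds, negative otherwise."""
    return -abs(sum(round(zi) for zi in z) - TARGET_SUM)

def rollout(z0, steps, sigma, rng):
    z = list(z0)
    for _ in range(steps):
        z = refine(z, sigma, rng)
    return z

def best_of_n(z0, n=32, steps=8, sigma=0.4, seed=0):
    """Run n noisy rollouts from the same start and keep the one the scorer prefers."""
    rng = random.Random(seed)
    candidates = [rollout(z0, steps, sigma, rng) for _ in range(n)]
    return max(candidates, key=q_head), candidates

best, candidates = best_of_n([3.2, 3.1, 3.3])
```

Without noise, every rollout from this start settles on the same values (which sum to 9, not 10); the noise gives some rollouts a chance to land elsewhere, and the scorer decides which one to trust.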

PTRM vs. frontier LLMs on PPBench golden.
On a suite of complex pencil puzzles, the tiny seven-million-parameter model jumped from sixty-two percent accuracy to over ninety-one percent. It achieved nearly double the puzzle-solving accuracy of the world's largest language models at less than one-ten-thousandth of the cost, showing that strong machine reasoning does not always require massive supercomputers.
Vector Policy Optimization: Training for Diversity Improves Test-Time Search
Boldi et al. [MIT, Improbable AI Lab, MIT-IBM Computing Research Lab, Sakana AI]
♥ 845 Test time search
AI is used not just to provide a single answer, but to brainstorm multiple possibilities so a larger system can pick the absolute best one. However, standard training methods push language models to obsess over a single, rigid score, forcing them to converge on one theoretically perfect response.

Vector Policy Optimization (VPO)
When we ask AI to generate a variety of solutions, it isn’t able to think creatively. It just repeats the same narrow answer over and over, losing the rich diversity that complex problem-solving requires. Researchers realized that to build a true reasoning engine, they needed to fundamentally change what the AI is rewarded for, separating the act of exploring different ideas from the act of picking the final winner.
To solve this, researchers developed a new training approach called Vector Policy Optimization. Instead of giving the AI a single flat grade, they evaluate responses using a multi-part scorecard that judges distinct aspects like code correctness, logic steps, or formatting.

Outline of Vector Policy Optimization.
During training, the researchers constantly shuffle which part of the scorecard matters most, while simultaneously having the model generate a continuous batch of answers in a single breath. This combination forces the AI to stop putting all its eggs in one basket. Instead of producing identical clones, it learns to offer a diverse menu of highly competent solutions, with each answer mastering a slightly different trade-off.
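The "shuffled scorecard" can be sketched as a reward vector combined with randomly sampled weights. This is only an illustration of the scalarization idea: the reward values, the weight-sampling scheme, and the function names are assumptions, and the part where the model generates a whole set of answers at once is not modeled.

```python
import random

def sample_weights(k, rng):
    """Draw random non-negative weights that sum to 1 (uniform over the simplex)."""
    raw = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]

def scalarize(reward_vec, weights):
    """Collapse a multi-part scorecard into one number under the given weights."""
    return sum(r * w for r, w in zip(reward_vec, weights))

def pick_best(answers, weights):
    """Name of the answer with the highest weighted score."""
    return max(answers, key=lambda name: scalarize(answers[name], weights))

# Hypothetical scorecards: (correctness, logic steps, formatting).
answers = {"A": [1.0, 0.2, 0.9], "B": [0.6, 0.9, 0.5]}

w = sample_weights(3, random.Random(0))          # one random emphasis for this step
favors_correctness = pick_best(answers, [1.0, 0.0, 0.0])
favors_logic = pick_best(answers, [0.0, 1.0, 0.0])
```

Here answer A wins when correctness dominates and answer B wins when logic steps dominate, so a model trained under constantly changing weights has a reason to keep both kinds of answers in its repertoire.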

Best@k on Maze
This new method consistently outperformed traditional training across tasks spanning logical reasoning, digital navigation, and software coding. The more solutions the system was allowed to generate, the wider its advantage became. When paired with advanced evolutionary search tools, this diversity allowed the AI to crack incredibly difficult problems that standard models could not touch at all.
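Why does diversity pay off more as the sample count grows? A quick sketch, assuming best@k means "the best score among the first k sampled answers" (my reading of the metric, with made-up scores):

```python
def best_at_k(scores, k):
    """Best score among the first k sampled answers."""
    return max(scores[:k])

# Hypothetical per-sample scores for two models on one problem.
collapsed = [0.6, 0.6, 0.6, 0.6]   # repeats the same decent answer
diverse = [0.2, 0.9, 0.4, 1.0]     # varied answers, some worse, some better

collapsed_curve = [best_at_k(collapsed, k) for k in range(1, 5)]
diverse_curve = [best_at_k(diverse, k) for k in range(1, 5)]
```

In this toy case the collapsed model stays flat no matter how many samples it gets, while the diverse model starts lower at k=1 and overtakes it as soon as more samples are allowed.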
