Fractal Generative Models
Plus more about SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution, and Reasoning with Latent Thoughts: On the Power of Looped Transformers
Feb 24th ~ Mar 2nd
#45 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 4.5k OpenAI released GPT-4.5, a large base model LLM with an incredible sense of humor and natural-sounding responses. But I was still disappointed.
GPT-4.5 hallucinates much less and is better at factual QA than its predecessors.
♥ 4.8k Inception AI Labs introduced Mercury, the first commercial-scale diffusion LLM. Founded by key researchers behind diffusion models, the lab claims Mercury offers incredible generation speed and structural integrity compared to autoregressive LLMs.
♥ 4.6k After the release of Claude 3.7 Sonnet, Anthropic raised $3.5 billion at a $61.5 billion post-money valuation, led by Lightspeed Venture Partners.
Optimize global IT operations with our World at Work Guide
Explore this ready-to-go guide to support your IT operations in 130+ countries. Discover how:
Standardizing global IT operations enhances efficiency and reduces overhead
Ensuring compliance with local IT legislation safeguards your operations
Integrating Deel IT with EOR, global payroll, and contractor management optimizes your tech stack
Leverage Deel IT to manage your global operations with ease.
Fractal Generative Models
Li et al. [MIT CSAIL, Google DeepMind]
♥ 635 Generative Models
Introduction to Fractal Generative Models
Current AI models struggle to achieve satisfactory results in both likelihood estimation and generation quality, particularly when dealing with non-sequential data that contains intrinsic structures. To address this challenge, researchers have introduced an innovative concept called "fractal generative models," which takes modularization to a new level by treating generative models themselves as atomic modules that can be recursively combined.
This novel approach, inspired by fractal patterns found in nature and biological neural networks, creates self-similar architectures where each parent autoregressive block spawns multiple child blocks, forming a hierarchical structure that can better capture complex data patterns.

How Do Fractal Generative Models Work?
The Fractal Generative Model introduces an innovative architectural approach that mirrors the hierarchical patterns found in nature. The model uses a recursive "divide-and-conquer" strategy, where larger generative tasks are broken down into smaller, manageable components.
The architecture works through multiple levels, with each level containing autoregressive models that act as "generators." These generators operate in a hierarchical fashion: the first level divides the input into larger patches, which are then progressively broken down into smaller patches by subsequent levels. For instance, when processing a 256Ă—256 image, the first generator divides it into 16Ă—16 patches, with each subsequent level handling increasingly smaller sections.

A key feature of the model is its transformer-based design, where each autoregressive model processes both the output from the previous level's generator and the corresponding image patches. The model becomes progressively lighter at deeper levels, with fewer transformer blocks and reduced width, making it computationally efficient. Remarkably, processing a 256Ă—256 image requires only twice the computational power needed for a 64Ă—64 image.
The model offers two main variants: FractalAR, which uses a raster-order, GPT-like causal transformer, and FractalMAR, which uses a random-order, BERT-like bidirectional transformer. This design is particularly efficient compared to traditional scale-space autoregressive models, requiring 4000 times fewer computations at the finest resolution, making it practical for high-resolution image generation at the pixel level.
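To make the recursion concrete, here is a minimal sketch of the divide-and-conquer structure in PyTorch. It illustrates the idea rather than the authors' implementation: each level runs a small (here bidirectional, FractalMAR-style) transformer block whose outputs condition a fixed number of child tokens at the next, narrower level. The `FractalLevel` name, widths, and split factor are all assumptions for the example.

```python
import torch
import torch.nn as nn

class FractalLevel(nn.Module):
    """One level of the fractal hierarchy; recursively owns the next level."""

    def __init__(self, widths, split=4, depth=0, out_dim=3):
        super().__init__()
        d = widths[depth]
        # The per-level "generator" (simplified to a single encoder layer).
        self.block = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.is_leaf = depth == len(widths) - 1
        self.split = split
        if self.is_leaf:
            self.head = nn.Linear(d, out_dim)  # finest level emits pixel values
        else:
            self.proj = nn.Linear(d, widths[depth + 1])
            self.child = FractalLevel(widths, split, depth + 1, out_dim)

    def forward(self, x):
        h = self.block(x)  # (batch, n_patches, d)
        if self.is_leaf:
            return self.head(h)
        # Divide: each parent token conditions `split` child tokens
        # at the next, narrower level.
        b, n, _ = h.shape
        c = self.proj(h).unsqueeze(2).expand(b, n, self.split, -1)
        return self.child(c.reshape(b, n * self.split, -1))

# Levels get progressively narrower, mirroring the "lighter at depth" design.
model = FractalLevel(widths=[256, 128, 64])
out = model(torch.randn(2, 16, 256))  # 16 coarse patch embeddings in
print(out.shape)                      # torch.Size([2, 256, 3]): 16*4*4 leaves
```

Because each child level is narrower and shallower than its parent, the cost of adding finer levels grows slowly, which is the intuition behind the paper's favorable compute scaling at high resolutions.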

What makes this architecture particularly powerful is its ability to capture complex data patterns while maintaining computational efficiency. Unlike previous approaches that treated image generation as a simple sequential task, this model recognizes and leverages the inherent hierarchical structure of the data, similar to how images are naturally composed of sub-images, making it more effective for handling complex generative tasks.
Results and Real-World Implications of Fractal Generative Models
The researchers thoroughly tested their Fractal Generative Model on the ImageNet dataset, focusing on both 64Ă—64 and 256Ă—256 resolution images. The results were impressive across multiple performance metrics:
Likelihood Estimation: The model achieved a significant improvement over previous methods, reaching 3.14 bits per dim compared to the previous best of 3.40. Using more fractal levels proved both more efficient and more effective, suggesting the model successfully captures the hierarchical nature of images.
Image Generation Quality: The largest version (FractalMAR-H) achieved strong results with an FID score of 6.15 and generated high-quality images in about 1.29 seconds per image. It also showed versatility in tasks like image inpainting, outpainting, and class-conditional editing.

The Fractal Generative Model represents a significant step forward in AI image generation, particularly in pixel-by-pixel approaches. Its success lies in breaking down complex tasks into manageable pieces while maintaining high quality and efficiency. The model shows particular promise in three key areas:
Scalability: Performance consistently improves as the model size increases, suggesting potential for even better results with larger models
Interpretability: The pixel-by-pixel generation process is more transparent and understandable than other approaches
Versatility: The model's strong performance across various tasks indicates its potential for broader applications
While there's still room for improvement in some areas, particularly in generating more diverse images, the results suggest this approach could open new possibilities in generative AI, especially for data types that have inherent structural patterns beyond simple sequences.

Reasoning with Latent Thoughts: On the Power of Looped Transformers
Saunshi et al. [Google Research, Toyota Technological Institute at Chicago]
♥ 399 LLM Reasoning
Introduction to Looped Transformers
Large language models are great at reasoning, but the common assumption has been that you need a lot of parameters (a big brain) to do it well. This paper argues you don't need a big brain, just a deep one: many reasoning problems can be solved by a smaller model that thinks in loops, going over the problem multiple times. The authors tested this idea on math problems and language tasks and found that looped models can do just as well as, or even better than, bigger models. They also propose a new way to train models that helps them reason better.

How do Looped Transformers Work?
This study created looped models for solving reasoning tasks, focusing on three specific problems: n-ary addition, p-hop induction, and synthetic grade-school math problems (i-GSM). Looped models work by running a smaller model through multiple iterations, denoted (k ⊗ L), where k is the number of layers in the model and L is the number of loops. The experiments showed that looped models, such as a one-layer model looped 12 times, could achieve near-perfect accuracy on addition tasks, closely matching the performance of a larger 12-layer model while using significantly fewer parameters.
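The core mechanic is simple enough to sketch. Below is a minimal, hypothetical PyTorch version of the (k ⊗ L) setup, a sketch of the idea rather than the paper's code: one k-layer block whose weights are reused for L loops, so effective depth comes from iteration rather than extra parameters.

```python
import torch
import torch.nn as nn

class LoopedTransformer(nn.Module):
    """(k ⊗ L): a k-layer block whose weights are reused for L loops."""

    def __init__(self, d_model=128, nhead=4, k_layers=1, n_loops=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.block = nn.TransformerEncoder(layer, num_layers=k_layers)
        self.n_loops = n_loops

    def forward(self, x):
        # Depth comes from iteration: the same parameters are applied L times.
        for _ in range(self.n_loops):
            x = self.block(x)
        return x

# A (1 ⊗ 12) model: one layer's worth of parameters, 12 layers of depth.
model = LoopedTransformer(k_layers=1, n_loops=12)
out = model(torch.randn(8, 32, 128))  # (batch, seq_len, d_model), shape kept
```

A (1 ⊗ 12) configuration like this stores a single layer's parameters yet applies twelve layers of computation, which is the setup that matched the 12-layer baseline on addition.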

For the p-hop induction task, which requires tracking back through sequences, looped models again demonstrated strong performance. A one-layer model looped six times achieved accuracy comparable to a six-layer model, indicating that depth through looping is crucial for solving such problems. Similarly, in the i-GSM task, which involves solving symbolic math problems, a one-layer model looped eight times outperformed a non-looped eight-layer model, suggesting that looped models can handle more complex reasoning tasks effectively with fewer parameters.
The researchers trained models on specific datasets and evaluated their performance across different numbers of operands for addition, varying p values for p-hop induction, and different problem complexities for i-GSM. The results consistently showed that looped models could match or exceed the performance of larger, non-looped models, highlighting the importance of iterative processing in reasoning tasks. This method leverages the power of depth over parameter count, offering a more efficient way to enhance model reasoning capabilities.
Results and Evaluation
This study introduces a promising new approach with "looped models for reasoning" and demonstrates that these models can effectively solve a variety of reasoning tasks using significantly fewer parameters than traditional models. The researchers also identify an inductive bias in looped models: looping boosts reasoning performance more than memorization, letting looped models compete on reasoning benchmarks even when their perplexity is worse.

Downstream evaluations for language models trained on the Pile dataset.
The experiments in this paper focused on specific reasoning tasks, but the broader applicability of looped models to other forms of reasoning, such as multimodal and common-sense reasoning, remains an open and exciting question. The observed scaling behavior and the connections to latent thoughts and chain-of-thought reasoning provide valuable insights into how looped models can improve reasoning capabilities.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
Wei et al. [FAIR at Meta, University of Illinois Urbana-Champaign, GenAI at Meta, Carnegie Mellon University]
♥ 424 LLM Coding bycloud’s pick
What is SWE-RL
This paper introduces SWE-RL, the first reinforcement learning approach designed to enhance large language models for real-world software engineering tasks by leveraging comprehensive software evolution data (like GitHub pull requests) and rule-based rewards. By teaching LLMs to autonomously recover developers’ reasoning and generate corrective code changes, the authors demonstrate that SWE-RL not only delivers state-of-the-art performance on SWE-bench Verified but also significantly improves the models’ general reasoning skills across diverse out-of-domain tasks.

Inner-Workings of SWE-RL
The mechanism begins with a comprehensive data curation process that transforms raw information from GitHub into self-contained pull request instances. Data is collected from both GitHub events and complete git clones to capture every detail of a pull request, including associated discussions, code changes, and review comments. This process involves cleaning and aggregating the data by predicting which files are relevant, filtering out irrelevant or noisy examples such as bot-generated messages and overly large changes, and finally isolating around 11 million high-quality pull requests that truly represent human-driven software changes.
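As an illustration of that filtering step, here is a hypothetical predicate in Python; the field names, bot markers, and thresholds are assumptions for the example, not the paper's actual pipeline.

```python
BOT_MARKERS = ("[bot]", "dependabot", "github-actions")

def keep_pull_request(pr: dict, max_changed_files: int = 30) -> bool:
    """Keep human-driven, issue-linked PRs; drop bot noise and huge diffs."""
    if any(marker in pr["author"].lower() for marker in BOT_MARKERS):
        return False  # bot-generated changes are filtered out
    if len(pr["changed_files"]) > max_changed_files:
        return False  # overly large changes are too noisy to learn from
    return bool(pr["linked_issue"])  # must be tied to a described issue
```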

Once the curated dataset is ready, the system uses a specially designed prompt template to train the policy language model. For each validated pull request, the model receives an input that includes the issue description and the corresponding code context, along with both modified and certain unmodified files. Tasked with generating search-and-replace edits to address the reported issue, the model produces outputs that are then compared to the correct solution, known as the “oracle patch.” To guide the training, a reward is assigned based on how closely the generated patch matches the oracle, while incorrect formats receive a penalty, driving the model to learn accurate and structured revisions through iterative reinforcement learning.
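As a sketch of this rule-based reward: the generated patch is scored against the oracle patch with a continuous similarity measure, and malformed outputs receive a flat penalty. The version below uses Python's difflib for the similarity; the exact penalty value and function names are illustrative.

```python
import difflib

def swe_rl_reward(generated_patch: str, oracle_patch: str,
                  format_ok: bool) -> float:
    """Rule-based reward: penalize bad formats, else score patch similarity."""
    if not format_ok:
        return -1.0  # malformed search-and-replace edits get a flat penalty
    # Continuous reward in [0, 1]: how closely the edit matches the oracle.
    return difflib.SequenceMatcher(None, generated_patch, oracle_patch).ratio()
```

A continuous score, rather than a binary pass/fail, gives the policy gradient signal even for partially correct edits, which helps early in training when exact matches are rare.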

This training process not only helps the model solve real-world software issues effectively, but it also encourages the development of broader reasoning skills. As the model goes through cycles of self-assessment and refinement, it begins to reflect on alternative approaches and break down complex tasks into manageable parts. Interestingly, even though the reinforcement signal is derived solely from software issue resolution, the model exhibits emergent “aha moments” and the ability to generalize its reasoning to a variety of out-of-domain tasks such as function-level code generation, library usage, and even solving mathematical problems. This indicates that the training mechanism instills a flexible and robust problem-solving skill set that extends well beyond its original programming domain.
Will AI Replace Software Engineers?
The results show that the Llama3-SWE-RL-70B model achieves state-of-the-art performance among small and medium-sized language models, resolving 41.0% of issues on SWE-bench Verified. This result compares favorably with other open-source baselines that incorporate knowledge from proprietary models, highlighting the effectiveness of using reinforcement learning with publicly available software evolution data.

When evaluated on the repair task alone, the reinforcement learning model provides significant improvements over a standard Llama-3.3 model and a supervised fine-tuning (SFT) variant. Although the SFT version shows higher output format accuracy, the reinforcement learning framework enhances the model's ability to reason through the process of issue solving and code repair. This difference suggests that the reinforcement learning approach contributes to deeper reasoning capabilities, leading to better overall repair performance.

Scaling the model shows that performance on SWE-bench improves as both the number of repair samples and reproduction test samples are increased. A noticeable improvement is observed when expanding repair samples from 20 to 160, with further gains plateauing as sample counts reach 320 and beyond. These findings suggest that careful scaling of inputs can optimize performance while underscoring the versatility and robustness of the SWE-RL mechanism.
🚨Last week's top AI/ML research papers:
- SWE-RL
- The FFT Strikes Back
- Chain of Draft
- GPT-4.5 System Card
- Claude-3.7 Sonnet System Card
- Fractal Generative Models
- Reasoning with Latent Thoughts
- BIG-Bench Extra Hard
- Big-Math
- Towards an AI co-scientist
- … x.com/i/web/status/1…
— The AI Timeline (@TheAITimeline)
4:20 PM • Mar 3, 2025