Titans: Learning to Memorize at Test Time
Plus more about MiniMax-01 and Scaling LLM Test-Time Compute
Jan 13th ~ Jan 20th
#39 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 14k DeepSeek has released DeepSeek-R1, a fully open-source model that performs on par with OpenAI-o1, along with several smaller distilled models. It is licensed under the MIT license and can be used commercially for free. Read more about the DeepSeek-R1 model and its powerful math, code, and reasoning capabilities, or start using it via the DeepSeek web application or API.
♥ 1.3k MiniMax has launched an open-source language model series, MiniMax-01, that introduces a Lightning Attention mechanism, which sets it apart from traditional Transformer architectures. The models have an impressive 4-million-token context window, 20-32 times longer than those of current industry leaders. API access is available at just $0.2 per million input tokens and $1.1 per million output tokens, making it one of the most economically competitive options on the market today. Try out MiniMax’s text & vision language models online or continue reading to learn more about the architecture.
♥ 7.3k Luma AI has launched Ray2, its next-generation video model that can create realistic, naturally moving footage from text descriptions. It uses 10 times more computational power than its predecessor and has remarkable capabilities: you can use it to create anything from dynamic action sequences and delicate hand movements to physics simulations and artistic scenes.
Top AI Lab Roboflow Is Hiring!
Open Source Software Engineer at Roboflow
Hybrid (New York City, NY; San Francisco, CA) | Remote (International)
$175,000 - $190,000
About Roboflow:
Roboflow simplifies building and using computer vision models. Today, over 500,000 developers, including those from half the Fortune 100, use Roboflow’s open-source and hosted machine learning tools.
Having raised over $60 million through its Series B, Roboflow is building the tools, community, and resources needed to make the world programmable with artificial intelligence.
What Roboflow is looking for:
Love to work autonomously and build solutions from the ground up
Thrive in an environment where each department runs like a startup within itself
Have technical knowledge of Git, Python, PyTorch, Docker, OpenCV, NumPy, GitHub Actions, PyPI, and Linux
JavaScript, OpenVINO, TensorRT, TFjs, TensorFlow, ONNX, and CUDA are a plus
Have experience building open-source projects
What you will do:
Expand their thriving open-source ecosystem and create a future where computer vision is accessible to every developer. You'll be building upon a foundation of 40k+ GitHub stars across their projects, with over 1 million monthly pip downloads, 500k+ datasets, and 100k+ pre-trained models.
The role focuses 50% on core contributions, 30% on communication/promotion, and 20% on other areas.
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
Ma et al. [NYU, MIT, Google]
♥ 843 Image Generation
Diffusion models can create high-quality, coherent images, but they have one big downside: their inference-time scaling behavior. Increasing the number of denoising steps initially improves sample quality, but the returns diminish rapidly, and after a few dozen steps performance plateaus. This made researchers ask: can we squeeze more potential out of diffusion models through other forms of inference-time scaling?
This paper tries to answer this question by reimagining how diffusion models allocate compute during inference. This approach hinges on a simple yet powerful insight: not all noise seeds are created equal. By strategically searching for better initial noises, diffusion models can achieve substantial quality gains, even with fixed architectures.
Figure 1: Scaling denoising steps vs scaling search for initial noise seeds
Why Denoising Steps Aren’t Enough
Diffusion models generate samples by iteratively denoising random initial latents. Intuitively, more denoising steps should refine outputs. But practically, improvements flatten beyond ~50 steps (figure 1). This occurs because errors compound: each step’s approximation inaccuracies accumulate, capping gains. Traditional solutions focus on optimizing step efficiency, but this paper asks a different question: What if we invest compute not in more steps, but in better noise initialization?
Illustration of Search Algorithms.
Their hypothesis stems from an underappreciated property of diffusion sampling: the initial noise seed critically influences output quality. Some latents map to sharper, more aligned samples; others produce artifacts or misaligned features. By treating noise selection as a search problem, the authors unlock a new axis for inference-time scaling.
This paper answers that question by decoupling inference compute into two components:
Verifiers: Pre-trained models that score samples (e.g., CLIP for text alignment, DINO for visual consistency).
Algorithms: Methods to explore the noise space, prioritizing high-scoring candidates.
Verifiers as Task-Aligned Judges
This paper uses verifiers as surrogate evaluators. In class-conditional ImageNet generation, an “oracle” verifier (InceptionV3) selects samples with high class confidence. In text-to-image tasks, combinations of CLIPScore (text alignment), Aesthetic Predictors (visual appeal), and ImageReward (human preference) provide multi-faceted feedback.
Verifiers must align with the end task; when they don't, the search suffers from verifier hacking: the model overfits to a narrow metric at the expense of diversity, and over-optimizing for one verifier (e.g., aesthetics) can degrade others (e.g., CLIPScore).
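One mitigation, which the paper applies in its text-to-image experiments, is to ensemble verifiers by averaging their rankings rather than their raw scores. Here is a minimal Python sketch; the scorer callables are hypothetical stand-ins for CLIPScore, an aesthetic predictor, and ImageReward:

```python
import numpy as np

def rank_ensemble(images, scorers):
    """Score candidates with several verifiers and average their rank positions.

    `scorers` is a list of callables mapping a batch of images to a 1-D array
    of scores (higher = better). Averaging ranks instead of raw scores keeps
    verifiers with different scales (e.g., CLIPScore vs. aesthetics) comparable.
    """
    n = len(images)
    mean_rank = np.zeros(n)
    for score_fn in scorers:
        scores = np.asarray(score_fn(images))
        # argsort of argsort converts scores into rank positions (0 = worst).
        ranks = scores.argsort().argsort()
        mean_rank += ranks / len(scorers)
    return mean_rank  # pick np.argmax(mean_rank) as the winning candidate
```

A candidate that one verifier loves but the others rank poorly gets pulled down, which is exactly how ensembling blunts overfitting to any single metric.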
Algorithms for Creating Noise
This paper compares three different search strategies for this step (the first two are sketched in code after the list):
Random Search: Generates multiple noise seeds and keeps the top scorer. Simple, but prone to verifier hacking.
Zero-Order Search: Iteratively refines a “pivot” noise by exploring its neighborhood, balancing exploration and exploitation.
Search over Paths: Modifies intermediate noises during denoising, leveraging the diffusion trajectory’s structure.
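The first two strategies are easy to sketch. Below is a hypothetical PyTorch implementation, where `sample` stands in for a full denoising run of the diffusion model and `verifier` for any scorer, such as the rank ensemble above:

```python
import torch

def random_search(sample, verifier, shape, n_candidates=16):
    """Draw n_candidates noise seeds, denoise each, keep the top scorer."""
    noises = torch.randn(n_candidates, *shape)
    scores = torch.tensor([verifier(sample(z)) for z in noises])
    return noises[scores.argmax()]

def zero_order_search(sample, verifier, shape, n_iters=10, n_neighbors=8, step=0.1):
    """Iteratively refine a pivot noise by exploring its local neighborhood."""
    pivot = torch.randn(*shape)
    best_score = verifier(sample(pivot))
    for _ in range(n_iters):
        # Perturb the pivot; renormalize so candidates stay noise-like.
        candidates = pivot + step * torch.randn(n_neighbors, *shape)
        candidates = candidates / candidates.std(
            dim=tuple(range(1, candidates.dim())), keepdim=True
        )
        for z in candidates:
            s = verifier(sample(z))
            if s > best_score:
                pivot, best_score = z, s  # move the pivot to the better neighbor
    return pivot
```

Random Search spends its whole budget on independent draws, while Zero-Order Search reinvests each evaluation in refining the current best region, trading exploration for exploitation.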
On ImageNet, Zero-Order Search with DINO verifiers improves Inception Score by 15% while preserving diversity. For text-to-image tasks, a Verifier Ensemble (averaging CLIP, Aesthetic, and ImageReward rankings) mitigates individual biases and achieves broad quality gains.
Performances of Search with FLUX.1-dev at inference-time.
Compatibility and Future Directions
A key advantage of inference-time search is its compatibility with existing models. Unlike fine-tuning, which alters model weights, search operates post-hoc and preserves the original distribution while shifting modes toward desired traits. This makes it ideal for scenarios where retraining is impractical (e.g., medical imaging).
Titans: Learning to Memorize at Test Time
Behrouz et al. [Google Research]
♥ 2.4k LLM Architecture bycloud’s pick
Introduction to Titans in LLMs
Most LLMs follow one of two main approaches: recurrent models that compress information into a fixed-size memory (potentially losing important details), or attention-based models that can access the full context but scale poorly with sequence length. The authors of this study argue that neither approach fully captures how biological memory systems work, with their distinct but interconnected short-term and long-term memory components.
This paper introduces "Titans," a novel architecture that combines a limited-window attention mechanism (acting as short-term memory) with a new neural long-term memory module that actively learns to memorize important historical information. The memory module determines what to store based on "surprise," measured by the gradient of an associative memory loss, and includes a decay mechanism to manage limited memory capacity.
How Does The Memory Storing Mechanism Work?
The memory storage mechanism of the Titan architecture learns and remembers information while it's being used, rather than just during initial training. The system determines what to remember based on how "surprising" new information is - if something is unexpected or different from what the system has seen before, it's more likely to be stored in memory. They measure surprise by looking at how much the system needs to change to handle new information.
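Here is a minimal PyTorch sketch of such a test-time update, under simplifying assumptions: the memory is a small MLP trained online, "surprise" is the gradient of an associative recall loss, momentum carries past surprise forward, and weight decay plays the role of forgetting. The fixed constants lr, eta, and alpha stand in for the input-dependent gates the paper actually learns.

```python
import torch

class NeuralMemory(torch.nn.Module):
    """Long-term memory: a small MLP mapping keys to values, updated at test time."""
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.SiLU(), torch.nn.Linear(dim, dim)
        )
        # Momentum buffers: the "past surprise" carried across tokens.
        self.momentum = [torch.zeros_like(p) for p in self.net.parameters()]

    @torch.enable_grad()
    def memorize(self, key, value, lr=0.1, eta=0.9, alpha=0.01):
        # Surprise = gradient of the associative recall loss: how much the
        # memory must change to map this key to this value.
        loss = ((self.net(key) - value.detach()) ** 2).mean()
        grads = torch.autograd.grad(loss, self.net.parameters())
        with torch.no_grad():
            for p, m, g in zip(self.net.parameters(), self.momentum, grads):
                m.mul_(eta).add_(g, alpha=-lr)  # accumulate surprise with momentum
                p.mul_(1 - alpha).add_(m)       # decay (forget), then write (memorize)

    def recall(self, query):
        with torch.no_grad():
            return self.net(query)
```

Routine inputs produce small gradients and barely change the memory; surprising ones produce large gradients and get written in, while the (1 - alpha) decay gradually frees capacity.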
Memory as a Context (MAC) Architecture.
This approach mimics how humans tend to remember unusual or unexpected events better than routine ones. The system also includes a "forgetting mechanism" that helps it manage limited memory space by gradually letting go of less important information. The memory system has three main parts working together:
A short-term memory that handles immediate information (like attention in current AI systems)
A long-term memory that learns to store important historical information
A "persistent memory" that holds general knowledge about the task at hand.
This can be thought of as a student taking notes in class - they have their immediate focus on what the teacher is saying (short-term), important concepts they've written down and internalized (long-term), and their general understanding of the subject (persistent). The researchers made this system practical by developing a way to process all this information in parallel, making it much faster than if they had to process each piece of information one at a time.
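Continuing that sketch, a hypothetical Memory as Context (MAC) step could combine the three parts as follows; the single attention callable, the shapes, and the choice to write the segment's outputs back into memory are illustrative assumptions, not the paper's exact wiring.

```python
import torch

def mac_segment_step(segment, memory, persistent_tokens, attention):
    """One MAC step over a segment of shape (seg_len, dim).

    Attention (short-term memory) sees persistent task knowledge, whatever
    the long-term memory retrieves for this segment, and the segment itself.
    """
    retrieved = memory.recall(segment)                 # query long-term memory
    context = torch.cat([persistent_tokens, retrieved, segment], dim=0)
    out = attention(context)                           # windowed attention over the context
    seg_out = out[-segment.shape[0]:]                  # outputs for the current segment
    memory.memorize(segment, seg_out)                  # update long-term memory at test time
    return seg_out
```

Because each segment step only needs the retrieved memory tokens rather than the full history, segments can be processed with a fixed attention window regardless of total sequence length.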
Evaluating the Titan Memory Storage Mechanism
The authors of this paper tested the Titan memory storage mechanism across diverse tasks such as language modeling, commonsense reasoning, needle-in-a-haystack tasks, DNA modeling, and time series forecasting. In language modeling tasks, all three variants of Titans (MAC, MAG, and MAL, which differ in how memory is combined with attention) outperformed existing hybrid models like Samba and Gated DeltaNet-H2, with the MAC (Memory as Context) variant showing particularly strong performance on longer sequences.
In addition to superior performance, the model also shows impressive scalability as it can handle sequences longer than 2 million tokens, which is a significant improvement over traditional Transformers. In the challenging BABILong benchmark, which tests reasoning across extremely long documents, Titans outperformed much larger models including GPT-4, despite having far fewer parameters. This suggests the architecture's memory mechanism is particularly effective at processing and retrieving information from long sequences.
Ablations showed that deeper memory layers generally improved performance but came with computational trade-offs, and that all major components (convolution, momentum in the surprise measure, weight decay, and persistent memory) contributed positively to the model's effectiveness. In specialized tasks, the neural memory module outperformed existing methods in time series forecasting across multiple datasets (ETT, ECL, Traffic, and Weather) and achieved competitive results in DNA modeling tasks.
If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!
MiniMax-01: Scaling Foundation Models with Lightning Attention
MiniMax
♥ 1.8k LLM Architecture
Introduction to MiniMax
Current LLMs perform well on many tasks, but their context windows, which typically range from 32K to 256K tokens, are too small for many practical applications like analyzing entire books or programming projects. This limitation is caused by the quadratic computational complexity of traditional transformer architectures, where processing longer contexts requires quadratically more computing resources.
The researchers at MiniMax tackle this by developing a novel hybrid architecture that combines "lightning attention" (a linear attention variant) with Mixture of Experts (MoE). Their model, MiniMax-Text-01, processes up to 1 million tokens during training and can handle 4 million tokens during inference, while maintaining performance comparable to top models like GPT-4 and Claude-3.5-Sonnet.
MiniMax Model Architecture
The MiniMax model introduces a hybrid architecture that combines two different types of attention mechanisms to achieve both efficiency and performance at scale. This model alternates between two types of attention layers: "lightning attention" (a linear attention variant) and traditional "softmax attention." It uses a pattern of seven lightning attention layers followed by one softmax attention layer, repeating this pattern throughout its 80-layer architecture.
What makes this architecture particularly powerful is how it combines different approaches to overcome their individual limitations. Lightning attention achieves linear computational complexity by splitting computations into smaller chunks (called blocks) and processing them separately, which makes it much more efficient for handling long sequences.
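A toy version of that chunked computation in PyTorch (omitting the feature maps, normalization, and decay a production lightning attention kernel would use): each block attends to itself with a small quadratic computation, and all earlier blocks contribute through a running key-value summary, so total cost grows linearly with sequence length.

```python
import torch

def chunked_linear_attention(q, k, v, block=64):
    """Causal linear attention over (seq_len, dim) tensors in O(seq_len)."""
    n, d = q.shape
    state = torch.zeros(d, d)          # running sum of k_i^T v_i from past blocks
    out = torch.zeros_like(v)
    for s in range(0, n, block):
        qb, kb, vb = q[s:s+block], k[s:s+block], v[s:s+block]
        # Intra-block: causal (lower-triangular) attention within the chunk.
        intra = torch.tril(qb @ kb.T) @ vb
        # Inter-block: every earlier block, folded into a single (d, d) state.
        out[s:s+block] = intra + qb @ state
        state = state + kb.T @ vb
    return out
```

The key trick is that the past never grows: however long the sequence, everything before the current block is compressed into one (dim, dim) matrix.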
However, the researchers found that pure lightning attention struggled with certain tasks, particularly information retrieval. This is where the periodic softmax attention layers come in - they maintain the model's ability to perform complex pattern matching and retrieval tasks while still benefiting from lightning attention's efficiency. The model also uses a Mixture of Experts (MoE) approach, where instead of having a single neural network process all inputs, it has 32 specialized "expert" networks and a routing system that decides which experts should handle each input.
This mechanism is similar to having multiple specialists rather than a single generalist. Having multiple specialists allows the model to be both more efficient (since only a few experts are activated at once) and more capable (since different experts can specialize in different types of processing). These architectural choices allow the model to achieve impressive performance while handling extremely long sequences of up to 4 million tokens - far beyond what conventional architectures can manage.
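Schematically, the 7:1 layer pattern and the expert routing could be wired as follows. This is an illustration under assumptions, not the released implementation: the experts are plain linear layers, top-2 routing is assumed, and the attention modules are left as named placeholders.

```python
import torch

class MoEFFN(torch.nn.Module):
    """Mixture of Experts: each token is routed to its top-k scoring experts."""
    def __init__(self, dim, n_experts=32, top_k=2):
        super().__init__()
        self.router = torch.nn.Linear(dim, n_experts)
        self.experts = torch.nn.ModuleList(
            torch.nn.Linear(dim, dim) for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):  # only the selected experts run per token
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

def build_hybrid_stack(dim, n_layers=80):
    """Seven lightning attention layers, then one softmax attention layer, repeated."""
    return [
        ("softmax_attention" if (i + 1) % 8 == 0 else "lightning_attention", MoEFFN(dim))
        for i in range(n_layers)
    ]
```

Only the routed experts execute for each token, which is why activated compute stays a small fraction of total parameters.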
Benchmark Results of MiniMax
The MiniMax model was tested on several benchmarks to evaluate its performance and capabilities. On the MMLongBench-Doc benchmark, which tests document understanding, the researchers adapted the setup for each model: they concatenated multiple document images into groups of 5 or 10 for open-source models and Claude, while using a default setting of up to 120 image pages at 144 resolution for commercial models and MiniMax-Text-01.
The testing team also used MEGA-Bench, an extensive multimodal testing suite that examines 7 input formats, 6 output formats, and 10 different types of skills across various visual inputs, including both images and videos. Furthermore, they evaluated document understanding and multimodal reasoning with the MMMU and DocVQA benchmarks.
🚨This week's top AI/ML research papers:
> Do generative video models learn physical principles from watching videos?
> Transformer^2: Self-adaptive LLMs
> MiniMax-01
> The Lessons of Developing Process Reward Models in Mathematical Reasoning
> Imagine while Reasoning in Space:… x.com/i/web/status/1…
— The AI Timeline (@TheAITimeline)
5:21 AM • Jan 20, 2025