Multi-Token Attention
Plus more about Inference-Time Scaling for Generalist Reward Modeling and Why do LLMs attend to the first token?
Mar 31st ~ Apr 6th
#50 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 12k Meta has released the Llama-4 series on a SATURDAY. It is a multimodal MoE model family capable of both language and visual understanding. Scout and Maverick are open-weight and now available for download. The highlight of this release is that Scout supports a context window of up to 10M tokens, while Maverick supports 1M. However, the community has been disappointed with its performance. A bycloud video is coming.
Llama-4 model specs
♥ 3k Gemini 2.5 Pro has moved to a preview version, called Gemini 2.5 Pro Preview, and is now available for scaled usage. You can still use it for free on AI Studio with a rate limit.
Gemini 2.5 Pro Preview API pricing
♥ 663 OpenAI raises $40B at a $300B post-money valuation in one of the largest private funding rounds in history. Via TechCrunch
Support My Newsletter
As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!
Multi-Token Attention
Golovneva et al. [FAIR at Meta]
♥ 793 LLM Attention
Introduction to Multi-Token Attention
There is a fundamental limitation in traditional attention mechanisms used in large language models, where attention weights are determined by comparing only single query and key token vectors. This "single token attention" restricts the model's ability to identify relevant context that requires multiple token associations.
To solve this problem, the authors propose Multi-Token Attention (MTA), which applies convolution operations across keys, queries, and attention heads, allowing neighboring tokens to influence each other's attention weights. This approach enables the model to condition its attention on multiple vector pairs simultaneously, facilitating more precise context location in complex scenarios.

Understanding Multi-Token Attention (MTA) Mechanism
Multi-Token Attention solves a fundamental limitation in traditional attention mechanisms by allowing LLMs to consider multiple tokens simultaneously when deciding where to focus. In standard attention, each attention value depends solely on comparing a single query vector with a single key vector. This creates a bottleneck when the model needs to find content that contains multiple elements together (like a sentence mentioning both "Alice" and "rabbit").
MTA introduces three new steps in this process:
Key-Query Convolution: Instead of looking at token pairs in isolation, MTA applies a sliding window (convolution) over the attention matrix before or after the softmax operation. This allows nearby queries and keys to influence each other's attention weights. For example, when searching for "Alice" and "rabbit," this convolution helps the model focus on areas where both appear in proximity.
Head Mixing Convolution: MTA groups attention heads together and allows them to share information through another convolution operation. If one head finds "Alice" and another finds "rabbit," this mixing helps combine these findings to locate where both terms appear together.
Group Normalization with Depth Scaling: This helps maintain balanced gradients throughout the network, preventing the attention signals from being overwhelmed by the residual stream as they flow through deeper layers.

The result is an attention mechanism that can effectively use richer contextual information to locate relevant content. Rather than being limited to what can be encoded in a single vector, MTA enables the model to consider patterns across multiple tokens, making it particularly effective for tasks requiring complex information retrieval from long contexts.
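To make the key-query convolution step more concrete, here is a minimal PyTorch sketch of the pre-softmax variant. It is an illustration under simplifying assumptions (one depthwise 2D kernel per head, odd kernel sizes, naive causal masking), not the authors' implementation, which also covers a post-softmax variant, head mixing, and group normalization.

```python
import torch
import torch.nn.functional as F

def key_query_conv_attention(q, k, v, kernel, mask_value=-1e9):
    """Sketch of an MTA-style pre-softmax key-query convolution (simplified).

    q, k, v: (batch, heads, seq_len, head_dim)
    kernel:  (heads, 1, cq, ck) depthwise conv weights over the
             (query, key) dimensions of the attention logits.
             Odd cq, ck assumed so padding preserves the shape.
    """
    b, h, n, d = q.shape
    logits = torch.einsum("bhqd,bhkd->bhqk", q, k) / d**0.5

    # Zero out future positions before the convolution so they cannot
    # leak into past attention weights through the kernel.
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=q.device))
    logits = logits.masked_fill(~causal, 0.0)

    # Depthwise 2D convolution: each head mixes logits of nearby
    # (query, key) pairs, letting neighbouring tokens reshape each
    # other's attention weights.
    cq, ck = kernel.shape[-2:]
    logits = F.conv2d(logits, kernel, padding=(cq // 2, ck // 2), groups=h)

    # Re-apply the causal mask, then softmax and weighted sum as usual.
    logits = logits.masked_fill(~causal, mask_value)
    attn = logits.softmax(dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", attn, v)

# Toy usage: 2 heads, 16 tokens, kernel spanning 3 queries x 5 keys.
q = k = v = torch.randn(1, 2, 16, 64)
kernel = torch.randn(2, 1, 3, 5) * 0.1
out = key_query_conv_attention(q, k, v, kernel)  # (1, 2, 16, 64)
```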
Evaluating the Performance of Multi-Token Attention (MTA)
The Multi-Token Attention mechanism shows impressive performance improvements across various tasks. On the motivating toy task, MTA achieved nearly perfect results (0.1% error rate) while standard Transformers struggled significantly (31-78% error rates). This shows MTA's fundamental advantage in tasks requiring multi-token information processing.

Validation perplexity for an 880M Transformer model on the SlimPajama dataset
In large language modeling, MTA consistently outperformed baseline models:
Reduced average validation perplexity to 11.09, compared to 11.25 for standard Transformers
Improved performance on benchmark tasks like LAMBADA (13.6 perplexity vs. 17.6)
Achieved higher average scores across nine popular benchmarks (44.4% vs. 43.7%)

Multi-needle retrieval accuracy (%) when varying the number of needles (N).
MTA particularly excelled at long-context tasks:
On Needle-In-A-Haystack with 6 needles, MTA achieved 67.0% accuracy after fine-tuning, compared to just 31.9% for standard Transformers
On BabiLong question-answering tasks, MTA maintained higher accuracy across various distraction text lengths, especially when the context was filled with 4K tokens of distractions
These improvements came with a minimal parameter increase (0.001%) and didn't require applying key-query convolution to every layer: adding it to just 2-6 layers delivered significant gains.
Inference-Time Scaling for Generalist Reward Modeling
Liu et al. [DeepSeek, Tsinghua University]
♥ 555 LLM Test Time Compute
Introduction to Generalist Reward Modeling
There is a big challenge in using reinforcement learning (RL) for large language models: obtaining accurate reward signals across diverse domains beyond just verifiable questions. The authors of this paper propose Self-Principled Critique Tuning (SPCT), a new approach that allows reward models to generate adaptive principles and accurate critiques when evaluating responses.
This study tests this method by sampling multiple reward signals in parallel during inference time and using a meta reward model to guide the voting process. This research suggests that effective inference-time scaling techniques may be more efficient than traditional training-time scaling for improving reward modeling in general domains.

Different paradigms for reward generation
Understanding Self-Principled Critique Tuning (SPCT)
SPCT is a new approach that makes reward models better at evaluating AI responses. Where traditional reward models simply output a score, SPCT teaches the model to:
First generate "principles" - criteria for what makes a good response to a particular query
Then apply these principles to critique and score responses

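A minimal sketch of this two-step evaluation loop is shown below. The `generate` callable stands in for any call to the generative reward model, and the prompt wording and "Score: X/10" format are illustrative assumptions rather than the paper's actual templates.

```python
import re
from typing import Callable

def score_responses(generate: Callable[[str], str],
                    query: str, responses: list[str]) -> list[int]:
    """Sketch of SPCT-style evaluation: principles first, then critiques."""
    # Step 1: derive query-specific principles.
    principles = generate(
        f"Query:\n{query}\n\n"
        "List the principles a good response to this query should satisfy."
    )

    # Step 2: critique each response against those principles and pull
    # out a numeric score (here assumed to end with "Score: X/10").
    scores = []
    for i, response in enumerate(responses):
        critique = generate(
            f"Principles:\n{principles}\n\nQuery:\n{query}\n\n"
            f"Response {i + 1}:\n{response}\n\n"
            "Critique the response against each principle, "
            "then finish with 'Score: X/10'."
        )
        match = re.search(r"Score:\s*(\d+)", critique)
        scores.append(int(match.group(1)) if match else 0)
    return scores
```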
SPCT develops this ability in two phases:
Phase 1: Rejective Fine-Tuning (Cold Start)
The model learns to generate principles and critiques in the correct format
It's trained on examples where it evaluates different numbers of responses
Low-quality outputs are rejected during training
Some training includes "hints" about the correct answer to guide learning
Phase 2: Rule-Based Reinforcement Learning
The model generates principles and critiques for various queries
It receives positive rewards when its evaluations correctly identify the best response
It receives negative rewards when its evaluations are incorrect
This teaches the model to develop useful principles that lead to accurate judgments
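A minimal sketch of such a rule-based reward is given below, assuming the ground-truth preference is provided as the index of the best response; the paper's exact rule may differ in details such as tie handling.

```python
def rule_based_reward(predicted_scores: list[int], best_index: int) -> float:
    """Sketch of a rule-based RL signal for SPCT-style training: a rollout
    (principles + critiques) is rewarded only if its pointwise scores
    single out the labelled best response."""
    top = max(predicted_scores)
    picked_correctly = (
        predicted_scores[best_index] == top
        and predicted_scores.count(top) == 1  # require a unique argmax
    )
    return 1.0 if picked_correctly else -1.0
```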
How It Scales During Inference
What makes SPCT powerful is its ability to improve with more computing power:
Parallel Sampling: The model generates multiple sets of principles and critiques for the same query and responses
Expanded Value Space: By combining multiple evaluations, the model can provide more nuanced scores (like 17/40 instead of just 4/10)
Meta Reward Modeling: An additional model helps filter out low-quality evaluations before voting
The beauty of this approach is that it generates different perspectives (principles) for evaluation automatically, adapting to each specific query. This leads to more accurate and nuanced judgments than traditional reward models.
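Below is a rough sketch of how the parallel sampling, meta-RM filtering, and voting could fit together. Both `score_responses` (the scorer sketched above) and `meta_rm_score` are assumed callables standing in for the trained models, and the sample counts are illustrative.

```python
def vote_with_meta_rm(score_responses, meta_rm_score, query, responses,
                      k=8, keep=4):
    """Sketch of inference-time scaling for a generative reward model.

    score_responses: principles-then-critiques scorer returning one
                     integer score per response.
    meta_rm_score:   stand-in for the meta reward model, rating how
                     trustworthy a single sampled evaluation is.
    """
    # Sample k independent (principles, critiques) evaluations in parallel.
    samples = [score_responses(query, responses) for _ in range(k)]

    # Keep only the evaluations the meta RM trusts most.
    samples = sorted(samples, key=meta_rm_score, reverse=True)[:keep]

    # "Voting": sum per-response scores across the kept samples. This
    # expands the value space (e.g. 0-40 with four kept samples instead
    # of a single 0-10 score).
    totals = [sum(s[i] for s in samples) for i in range(len(responses))]
    return max(range(len(responses)), key=totals.__getitem__)
```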
Results and Evaluation of Generalist Reward Modeling
The benchmark results show that DeepSeek-GRM-27B performs quite well. When using multiple evaluations (scaling at inference time), it outperforms much larger models including some with 340 billion parameters and even matches GPT-4o on benchmarks. Generating multiple sets of principles and combining their evaluations (called "voting") significantly improves performance without needing a larger model. Adding a "meta reward model" to filter out low-quality evaluations boosts performance even further.

Overall results of different methods and models on RM benchmarks
This approach works better than traditional methods that use simple scoring, especially since it avoids biases toward particular types of tasks. Most impressively, their 27 billion parameter model with 32 evaluation samples performed similarly to a massive 671 billion parameter model, suggesting that smart inference techniques can be more efficient than simply building bigger models.

Why do LLMs attend to the first token?
Barbero et al. [University of Oxford, National University of Singapore, Google DeepMind]
♥ 589 LLM Attention bycloud’s pick
Introduction to "Attention Sinks"
When you pass a prompt to an LLM, it pays some amount of attention to each part of the prompt. This new paper explores the curious phenomenon of "attention sinks" in LLMs, where heads disproportionately attend to seemingly meaningless tokens (typically the beginning-of-sequence token), with as much as 80% of attention focused there in models like Llama 405B.
Rather than viewing these sinks as a defect, this paper suggests that they serve a crucial functional purpose: preventing "over-mixing" of information. Their theoretical and empirical analysis suggests attention sinks act as a control mechanism that slows down information propagation through the deep transformer architecture. This avoids representational collapse and maintains distinct token representations throughout the network.

Illustration of how attention sinks are usefully leveraged by decoder-only Transformers.
This perspective helps explain why deeper models and those trained on longer contexts develop stronger sinks, and why this behavior emerges naturally during gradient descent rather than through explicit architectural design. The research connects attention sinks to established concepts like rank collapse and over-squashing, offering a unified framework for understanding this previously puzzling but widespread pattern in modern LLMs.
Understanding "Attention Sinks" in AI Language Models
LLMs use a mechanism called "attention sinks," where they direct a substantial portion of their attention to the first token in a sequence. This new research explains that rather than being inefficient, attention sinks serve as an important control valve that prevents "over-mixing" of information. As text flows through the many layers of an AI model, there's a risk that distinct token representations could blend together too much, causing what experts call "representational collapse." The attention sink effectively slows down this mixing process by redirecting attention away from meaningful interactions.
The researchers show that attention sinks form naturally during the training process, and they emerge gradually as models learn to process text. This phenomenon is stronger in larger models and those trained on longer contexts; for instance, the 405B parameter version of LLaMa 3.1 directs nearly 80% of its attention to sinks, compared to just 46% in the 8B parameter version. Interestingly, the sink always forms at the first position regardless of what specific token appears there, though models perform best when using the same beginning token they were trained with.
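This attention mass is easy to inspect with Hugging Face Transformers. The sketch below averages, over all layers and heads, the attention each query places on position 0; GPT-2 is used only so the snippet runs locally, whereas the paper's measurements come from models such as the Llama 3.1 and Gemma families.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def first_token_attention_mass(model_name: str, text: str) -> float:
    """Average fraction of attention that all heads and layers place
    on the first token of the sequence (the usual "sink" position)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, attn_implementation="eager"  # needed to return weights
    )
    inputs = tok(text, return_tensors="pt")

    with torch.no_grad():
        out = model(**inputs, output_attentions=True)

    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    # Column 0 holds each query's attention to the first token.
    per_layer = [a[..., 0].mean() for a in out.attentions]
    return torch.stack(per_layer).mean().item()

print(first_token_attention_mass("gpt2", "Alice followed the white rabbit."))
```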

The research team verified their theories through both mathematical analysis and empirical experiments. They trained multiple models with different context lengths while keeping the total training tokens constant, confirming that longer-context models develop stronger sinks. They also examined how information propagates through models with and without sinks present, showing that sinks help maintain more distinct token representations throughout the network.

Implications of "Attention Sinks" in AI Language Models
Understanding attention sinks could lead to more efficient model designs that better control information flow without wasting computational resources. This research connects previously disparate concepts like rank collapse, representational collapse, and over-squashing into a unified framework. It provides deeper insights into how transformer-based architectures function at scale.

The study also shows practical applications by examining the LLaMa 3.1 family of models, ranging from 8B to 405B parameters. Their analysis revealed how attention patterns evolve with scale. We now know that architectural decisions in pre-training directly impact how models form these attention sinks. This research advances our understanding of why certain patterns emerge naturally during AI training, potentially guiding the development of future models that can achieve better performance while maintaining computational efficiency.
🚨This week's top AI/ML research papers:
- Inference-Time Scaling for Generalist Reward Modeling
- Multi-Token Attention
- Why do LLMs attend to the first token?
- Command A
- LLMs Pass the Turing Test
- Advances and Challenges in Foundation Agents
- PaperBench
- Effectively…
— The AI Timeline (@TheAITimeline)
7:08 PM • Apr 6, 2025