DeepSeek's Native Sparse Attention
Plus more about Mixture of Block Attention for Long-Context LLMs, and Idiosyncrasies in Large Language Models
Feb 17th ~ Feb 23rd
#44 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 18k Anthropic has announced Claude 3.7 Sonnet, a hybrid reasoning model that can toggle “extended thinking” on or off, with outputs of up to 128k tokens. They also announced Claude Code, a CLI coding assistant that lets Claude read and interact with local repositories.
Claude 3.7 Sonnet benchmark
♥ 21k DeepSeek has announced an #OpenSourceWeek during which they will be releasing 5 of their repos. So far, they have released FlashMLA, an optimized decoding kernel for Hopper GPUs, and DeepEP, an expert-parallel (EP) communication library for MoE model training and inference.
♥ 1.4k Alibaba Wan has announced Wan 2.1, a 14B open-source video generation model. It offers state-of-the-art image-to-video generation and maintains consistent human anatomy even during highly challenging movements. It scores #1 on VBench, a comprehensive benchmark for video models. Wan is now available on GitHub.
Support My Newsletter
As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast; it helps us keep this up for free!
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Yuan et al. [DeepSeek]
♥ 10k LLM Attention
Introduction to Native Sparse Attention
Long-context modeling is essential for next-generation language models, but it is easier said than done: the quadratic cost of standard attention makes long contexts computationally expensive. Existing sparse attention methods often fail to achieve practical speedups in real-world deployments, especially during training.
This paper introduces NSA (Natively trainable Sparse Attention), which combines hierarchical token compression and selective token retention to preserve both coarse-grained and fine-grained context.

How Does Native Sparse Attention Work?
NSA processes each query by creating a smaller, more information-dense set of keys and values rather than operating on every token from the original context. It generates three different kinds of key-value pairs and combines their attention outputs through learned “gate” signals. These three types (compression, selection, and sliding window) ensure that the model can handle broad context while also focusing on key details and local patterns.

Kernel design for NSA. The kernel loads queries by GQA groups (Grid Loop), fetches corresponding sparse KV blocks (Inner Loop), and performs attention computation on SRAM. Green blocks indicate data on SRAM, while blue indicates data on HBM.
COMPRESSION: The compression step groups tokens within consecutive blocks into single representations. A small neural layer then summarizes the entire block’s content into a compressed key and value. This means that much of the broader context is preserved in fewer, higher-level “tokens,” cutting down on the amount of work required during attention calculations.
SELECTION: After compression, NSA identifies the most relevant blocks of tokens to keep in their original, non-compressed form. It does this by reusing or lightly processing the attention scores from the compression step to figure out which blocks are most important to the current query. Only those high-importance blocks remain fully detailed, which helps the model retain finer-grained information for crucial parts of the sequence while discarding less relevant details.
SLIDING WINDOW: In addition to compression and selection, NSA has a short-range attention mechanism. It takes a recent segment of tokens around the current position (like a rolling window) and ensures the model captures local context. Having a dedicated branch for local detail prevents the broader compression or selection pathways from being overwhelmed by shorter patterns, allowing all three branches to specialize.
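To make the compression and selection branches concrete, here is a minimal PyTorch sketch for a single query. The block-pooling linears, the dot-product block scoring, and the omission of causal masking and intra-block positional encodings are simplifications for illustration, not DeepSeek's exact parameterisation.

```python
import torch
import torch.nn as nn

def compress_and_select(K, V, q, block=32, top_k=4):
    """Sketch of NSA's compression and selection branches for one query.

    K, V: (T, d) keys/values seen so far; q: (d,) query vector.
    """
    T, d = K.shape
    n_blocks = T // block
    Kb = K[: n_blocks * block].reshape(n_blocks, block, d)
    Vb = V[: n_blocks * block].reshape(n_blocks, block, d)

    # COMPRESSION: a small learned layer summarises each block into one
    # "coarse" key/value, so attention touches n_blocks tokens instead of T.
    # (In a real model these would be module parameters, not created here.)
    pool_k = nn.Linear(block * d, d)
    pool_v = nn.Linear(block * d, d)
    K_cmp = pool_k(Kb.reshape(n_blocks, -1))          # (n_blocks, d)
    V_cmp = pool_v(Vb.reshape(n_blocks, -1))          # (n_blocks, d)
    scores = (K_cmp @ q) / d ** 0.5                   # per-block relevance
    out_cmp = torch.softmax(scores, dim=-1) @ V_cmp   # coarse-grained output

    # SELECTION: reuse the compression-branch scores to keep only the top-k
    # blocks and attend over their original, uncompressed tokens.
    idx = scores.topk(min(top_k, n_blocks)).indices
    K_sel = Kb[idx].reshape(-1, d)
    V_sel = Vb[idx].reshape(-1, d)
    weights = torch.softmax((K_sel @ q) / d ** 0.5, dim=-1)
    out_sel = weights @ V_sel                         # fine-grained output
    return out_cmp, out_sel
```

In the full method these two outputs are joined by a third sliding-window branch and merged through learned gates, as described next.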
Once the model produces compressed, selected, and local-window key-value sets, it computes three separate attention outputs and merges them via gate signals learned through a small neural network. To run efficiently on modern GPUs, NSA is also implemented with a special kernel design.
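A minimal sketch of that gated merge, assuming a small linear gate with sigmoid activations (the actual gate network in the paper may be parameterised differently):

```python
import torch
import torch.nn as nn

d = 64
# Outputs of the three branches for one query: compression, selection, sliding window.
out_cmp, out_sel, out_win = torch.randn(3, d)
q = torch.randn(d)

gate_net = nn.Linear(d, 3)                 # small learned gate, one score per branch
g = torch.sigmoid(gate_net(q))             # (3,), query-dependent gate values
output = g[0] * out_cmp + g[1] * out_sel + g[2] * out_win
```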

11.6x speed increase on decoding
Rather than reading memory in small, scattered chunks for each query, the kernel groups queries that share the same sparse key-value blocks. This grouping keeps data reads as contiguous as possible, drastically reducing overhead and allowing NSA to achieve high throughput in both training and inference.
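As a rough illustration of why this helps (the shapes and names below are made up for the example), all query heads in one GQA group attend to the same selected KV blocks, so a single contiguous fetch serves the whole group:

```python
import torch

n_heads, n_kv_groups, d, block = 8, 2, 64, 32
heads_per_group = n_heads // n_kv_groups

K_blocks = torch.randn(16, block, d)          # 16 KV blocks resident in slow memory (HBM)
selected = torch.tensor([0, 3, 7])            # blocks chosen for this group's query position

# One gather serves all query heads in the group instead of one scattered
# read per head, keeping memory access contiguous.
K_fetched = K_blocks[selected].reshape(-1, d)     # (3 * block, d), loaded once
q_group = torch.randn(heads_per_group, d)         # the group's query heads
scores = q_group @ K_fetched.T / d ** 0.5         # every head reuses the same fetch
```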
Results and Real-World Implications of Native Sparse Attention
NSA surpasses Full Attention on most tests despite its sparsity. In reasoning-heavy tasks like DROP and GSM8K, NSA shows notable gains, implying that the hierarchical sparse design helps the model retain and prioritize critical details.
On long-context evaluations, such as the 64k-token needle-in-a-haystack retrieval test, NSA achieves perfect accuracy by combining compression for coarse-grained scanning and selection for fine-grained retrieval. This dual-level focus allows NSA to maintain both global awareness and local precision, enabling it to excel on multi-hop QA and code understanding challenges in LongBench.

NSA also supports advanced chain-of-thought reasoning. When distilled and fine-tuned on mathematical reasoning data, NSA outperforms a full-attention baseline in both 8k and 16k token contexts, which shows its ability to capture crucial long-range dependencies.
MoBA: Mixture of Block Attention for Long-Context LLMs
Lu et al. [Moonshot AI, Tsinghua University, Zhejiang Lab/Zhejiang University]
♥ 1.1k LLM Attention
Introduction to MoBA
Traditional attention mechanisms in LLMs face a quadratic increase in computational complexity as sequence lengths grow, making it prohibitively expensive to handle extended contexts. While existing solutions have attempted to address this through predefined structural constraints or linear approximations, they often introduce task-specific biases or require substantial model modifications, which limits their effectiveness in complex reasoning tasks.
This paper introduces Mixture of Block Attention (MoBA), which applies the Mixture of Experts (MoE) principle to attention mechanisms. This innovative approach allows models to autonomously determine where to focus their attention without imposed structural biases, enabling seamless transitions between full and sparse attention modes.

MoBA Architecture
How does MoBA Work?
Mixture of Block Attention (MoBA) rethinks how AI models process information by introducing a smart, selective approach to handling long sequences of data. Think of it as a highly efficient filtering system that knows exactly where to focus its attention.
MoBA breaks down incoming information into manageable blocks and uses a sophisticated routing system to determine which blocks deserve attention. The system employs a "gating mechanism" that acts like a smart traffic controller, directing each query (or information request) to only the most relevant blocks of historical data. This selective approach dramatically reduces computational costs while maintaining high performance.
The architecture consists of three key components:
Block Partitioning: MoBA divides the input context into smaller, manageable blocks
Dynamic Selection: A smart routing system selects the most relevant blocks for each query using a top-k gating mechanism
Causal Protection: Built-in safeguards ensure the model only looks at past and present information, never future data
What makes MoBA particularly innovative is its flexibility. The system can seamlessly switch between full attention (looking at everything) and selective attention (focusing on specific blocks) as needed. This adaptability makes it especially valuable for processing long documents or complex reasoning tasks.
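Here is a minimal sketch of that routing step for a single query, assuming mean-pooled block representations and a simple dot-product gating score; the exact gating and block-selection details in Moonshot AI's implementation may differ.

```python
import torch

def moba_select_blocks(q, K, block=4, top_k=2):
    """Sketch of MoBA-style top-k block routing for one query.

    q: (d,) query for the latest token; K: (T, d) keys of the context so far.
    Returns the indices of the blocks the query will attend to.
    """
    T, d = K.shape
    pos = T - 1                                    # position of the current token
    n_blocks = (T + block - 1) // block
    pad = n_blocks * block - T
    Kp = torch.cat([K, K.new_zeros(pad, d)]) if pad else K
    centroids = Kp.reshape(n_blocks, block, d).mean(dim=1)   # (n_blocks, d) block summaries

    # Gating: score each block against the query, mask out blocks that start
    # in the future (causality), always keep the current block, then take top-k.
    scores = centroids @ q
    future = torch.arange(n_blocks) * block > pos
    scores[future] = float("-inf")
    scores[pos // block] = float("inf")            # current block is always kept
    chosen = scores.topk(min(top_k, (~future).sum().item())).indices
    return chosen                                  # attention then runs only over these blocks
```

Setting top_k to the number of past blocks recovers full attention, which is what allows the seamless switch between the two modes.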

MoBA vs Full Attention
MoBA has already proven its worth in real-world applications. When implemented in Kimi's language model, it achieved significant performance gains while reducing computational overhead. The system processes information up to 5 times faster than traditional methods when handling long sequences, making it a game-changer for applications requiring extensive context processing.
Performance and Scalability Results of MoBA
The evaluation of MoBA shows impressive performance across both standard benchmarks and extreme-length scenarios. Using the Llama 3.1 8B Base Model as a foundation, researchers gradually scaled the context length from 128K to an impressive 1M tokens during pre-training. The model, dubbed Llama-8B-1M-MoBA, achieved this while maintaining a remarkable 95.31% attention sparsity through strategic block sizing (4096) and top-K selection (12).
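That sparsity figure follows directly from the reported settings, assuming “1M” means 2^20 tokens:

```python
# Each query attends to top_k blocks of block_size tokens out of the full context.
block_size, top_k, context = 4096, 12, 1024 * 1024
sparsity = 1 - (top_k * block_size) / context
print(f"{sparsity:.2%}")   # 95.31%
```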
In head-to-head comparisons with full attention models, MoBA demonstrated comparable or superior performance across multiple benchmarks. Notably, on the RULER benchmark at 128K context length, MoBA achieved a score of 0.7818, nearly matching the full attention model's 0.7849, despite operating at 62.5% sparsity. The efficiency gains are even better: MoBA achieved up to 6.5x speedup when processing 1M tokens and a remarkable 16x reduction in attention computation time when scaling to 10M tokens.

The scalability tests pushed boundaries further, successfully extending context lengths to 10 million tokens while maintaining consistent performance. This was achieved through innovative tensor parallelism and strategic block size scaling.
Idiosyncrasies in Large Language Models
Sun et al. [CUHK, CityU, Tencent AI Lab]
♥ 87 LLM Classification bycloud’s pick
Can You Differentiate Between ChatGPT and DeepSeek?
One of the biggest gaps in LLM research is the lack of a systematic understanding of how different LLMs' outputs can be distinguished from one another. This is a challenging problem with implications for model attribution, training data analysis, and understanding model similarities. While previous research has focused extensively on differentiating between human-written and AI-generated content, there has been minimal investigation into distinguishing between outputs from different LLMs.

This paper aims to solve this gap by introducing a novel framework that studies "idiosyncrasies" through a classification task. Their approach demonstrates remarkably high accuracy (97.1%) in distinguishing between major LLMs like ChatGPT, Claude, Grok, Gemini, and DeepSeek. This shows that these models have distinctive "fingerprints" in both their word-level distributions and semantic content.
Understanding Idiosyncrasies of LLMs
The researchers employed three main methods to quantify and understand the differences between LLM outputs:
Text Similarity Analysis:
They used established metrics (ROUGE-1, ROUGE-L, and BERTScore) to measure lexical similarities between outputs
Results showed lower similarity scores between different LLMs compared to outputs from the same LLM, indicating distinct "writing styles"
Word and Letter Level Analysis:
They conducted text shuffling experiments at both word and letter levels (a simplified version is sketched after this list)
Word-level shuffling maintained high classification accuracy (88.9% for chat APIs), suggesting word choice is a key distinguishing factor
Letter-level shuffling drastically reduced accuracy, indicating letter patterns aren't significant identifiers
They identified characteristic phrases for each LLM using TF-IDF analysis and logistic regression
Each LLM showed preferences for specific transition phrases and sentence starters
Markdown Formatting Analysis:
They examined six markdown elements: bold text, italic text, headers, enumeration, bullet points, and code blocks
Classification based solely on markdown patterns achieved 73.1% accuracy for chat APIs
Each LLM showed distinctive formatting preferences (e.g., Claude uses fewer bold texts and headers compared to others)
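For intuition, the shuffling controls referenced above can be approximated with a few lines of Python (placeholder helpers, not the authors' code): a classifier trained on word-shuffled text stays accurate because word choice survives, while letter-shuffled text destroys the signal.

```python
import random

def shuffle_words(text, seed=0):
    """Word-level shuffle: scrambles order but preserves which words were used."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def shuffle_letters(text, seed=0):
    """Letter-level shuffle: destroys word identity as well as order."""
    chars = list(text)
    random.Random(seed).shuffle(chars)
    return "".join(chars)
```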

Taken together, these analyses show that LLMs have unique "fingerprints" in their outputs, manifested through word choices, formatting preferences, and semantic patterns, and that these characteristics are robust enough to reliably identify which LLM generated a particular text.
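As a rough illustration of this kind of fingerprint classifier, here is a simplified sketch using TF-IDF features and logistic regression, in the spirit of the phrase-level analysis above. The texts and labels are toy placeholders; the paper's strongest results come from much larger response sets and stronger classifiers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy placeholder data: responses to the same prompts, labelled by source model.
texts = [
    "Certainly! Here's a concise overview of the topic with a few key points.",
    "Sure, let me break this down step by step for you in plain prose.",
    "Certainly! Below is a structured summary organised under short headers.",
    "Sure, here's a quick rundown without any special formatting.",
]
labels = ["model_a", "model_b", "model_a", "model_b"]

# Word and phrase n-grams capture characteristic wording; the linear model's
# weights then point to each model's preferred phrases.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["Certainly! Here's a short answer to your question."]))
```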
Do LLMs Have Fingerprints?
This study shows us that when different base models are trained on the same synthetic dataset (like UltraChat generated by ChatGPT), they begin to exhibit similar characteristics, reducing their distinguishability from 96.5% to 59.8%. This suggests that the practice of using synthetic data for training might lead to a form of "characteristic inheritance" where new models inadvertently adopt the unique patterns and biases of their training data's source model.

Perhaps even more intriguingly, the research provides a novel framework for understanding relationships between different LLMs, particularly between proprietary and open-source models. The study's model similarity analysis revealed that many models' outputs are frequently classified as coming from ChatGPT, suggesting its significant influence on the field.
🚨Last week's top AI/ML research papers:
- Native Sparse Attention
- Idiosyncrasies in Large Language Models
- SWE-Lancer
- Roadmap to fault tolerant quantum computation using topological qubit arrays
- Qwen2.5-VL Technical Report
- Scaling Test-Time Compute Without Verification… x.com/i/web/status/1…— The AI Timeline (@TheAITimeline)
4:52 PM • Feb 24, 2025