Log-Linear Attention: in-between of mamba & attention?
Dive into AI's latest breakthroughs: Beyond the 80/20 Rule, How Much Do Language Models Memorize, and cutting-edge insights from top research institutions like the Qwen Team and the MIT-Princeton group in this week's AI Timeline update.
June 2nd ~ June 9th
#59 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 1.4k The Qwen3 Embedding and Reranker series introduce multilingual text embedding and relevance-ranking models (available in 0.6B/4B/8B sizes) built on the Qwen3 LLMs, achieving state‑of‑the‑art performance across benchmarks
Qwen-3 Embedding & Reranker benchmarks
♥ 4.1k Google just upgraded Gemini 2.5 Pro (Gemini-2.5-pro-0605), showing substantial Elo gains on LMArena (+24) and WebDevArena (+35) while maintaining top performance in coding (Aider Polyglot) and reasoning benchmarks like GPQA and HLE. Pretty much a new SoTA.
♥ 1k Apple released Foundation Models framework that delivers an on-device 3B-parameter LLM for iOS 26+ platforms, enabling private, offline execution of tasks like summarization, dialog, and entity extraction, while supporting structured output via @Generable and dynamic function execution through tool calling.
The AI Timeline: Premium Insights
Recently, we introduced a premium membership for The AI Timeline!
With the membership, you will receive exclusive insights and explainers on technical AI topics, plus monthly research trend reports containing my analysis of 40+ papers.
Check out the Monthly Research Reports (~4000 words) here:
Deep Dive Blogs:
Plus, we are also scheduling a technical explainer for what FP8 is later this week, so subscribe now to stay tuned!
Log-Linear Attention
Guo et al. [Massachusetts Institute of Technology, Princeton University, Together AI, Carnegie Mellon University]
♥ 1.4k LLM Attention
Efficient Sequence Modeling with Log-Linear Attention
Attention is what lets Transformers model sequences, but it comes with significant challenges: its compute cost grows quadratically and its memory cost linearly with sequence length. Although linear attention and state-space models offer linear-time alternatives, they rely on a fixed-size hidden state, which fundamentally limits their ability to capture extensive context. This degrades their performance in long-context scenarios, such as associative recall tasks.
The researchers of this paper have introduced log-linear attention, which is a novel approach that bridges the gap between efficiency and expressiveness. Instead of using a single hidden state, it maintains a logarithmically growing set of states, and enables richer context modeling without sacrificing hardware-friendly parallelism.

Log-linear attention
How Log-Linear Attention Works
The Log-linear attention mechanism uses a hierarchical partitioning scheme inspired by Fenwick trees. For each token position, the input sequence is divided into disjoint buckets of power-of-two lengths, prioritizing fine-grained resolution for recent tokens and coarser summaries for distant ones. This structure ensures the number of hidden states grows logarithmically with sequence length.
Additionally, each bucket contributes to the output via a data-dependent scalar weight λ, projected from the input. These weights allow the model to adaptively blend information across temporal scales. For instance, recent context might dominate via larger λ values for finer buckets, while older context is compressed into broader summaries.
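To make the bucketing concrete, here is a minimal NumPy sketch of the Fenwick-style partition and the λ-weighted readout. The function names, the plain (ungated) key-value summaries, and the way the λ weights are passed in are illustrative simplifications of the paper's construction, not its exact parameterization.

```python
import numpy as np

def fenwick_buckets(t: int):
    """Partition the prefix [0, t) into disjoint buckets whose sizes are the
    powers of two in the binary representation of t (Fenwick-tree style):
    distant tokens land in the coarsest bucket, recent tokens in the finest."""
    buckets, start = [], 0
    for bit in reversed(range(t.bit_length())):
        if t & (1 << bit):
            buckets.append((start, start + (1 << bit)))   # half-open [start, end)
            start += 1 << bit
    return buckets

def log_linear_readout(q_t, K, V, lambdas):
    """Blend per-bucket linear-attention summaries S_l = sum_i k_i v_i^T with
    data-dependent scalar weights lambda_l (projected from the input in the
    actual model; passed in directly here, one per bucket, coarsest first)."""
    out = np.zeros(V.shape[1])
    for lam, (s, e) in zip(lambdas, fenwick_buckets(K.shape[0])):
        S_l = K[s:e].T @ V[s:e]       # (d_k, d_v) summary of bucket l
        out += lam * (q_t @ S_l)      # weighted multi-scale contribution
    return out

# A 13-token prefix splits into buckets of sizes 8, 4 and 1:
print(fenwick_buckets(13))            # [(0, 8), (8, 12), (12, 13)]
```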

During inference, the model updates states incrementally. When processing a new token, it inserts the current memory into the finest bucket (level 0). Buckets up to a dynamically determined level are merged and promoted to coarser resolutions, maintaining only O(log T) active states. This enables constant-memory decoding with logarithmic time per step. For training, a parallel algorithm decomposes computations into intra-chunk and inter-chunk phases. Intra-chunk interactions use standard matrix multiplications, while inter-chunk dependencies leverage hierarchical scans (applying existing linear-attention primitives across chunks). The result is O(T log T) time complexity, optimized for modern hardware through matmul-rich operations.
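The decode-time update can be pictured as a binary counter over per-level states. The sketch below assumes a plain (ungated) linear-attention state and a fixed merge rule; the actual method folds in each variant's gating/decay and a data-dependent promotion level, so treat this only as an illustration of why at most O(log T) states are alive at any step.

```python
import numpy as np

def decode_step(levels, k_t, v_t):
    """Insert the new token's rank-1 state at the finest level, then carry
    merges upward like a binary counter: whenever a level is already occupied,
    merge the two same-size buckets and promote the result one level coarser.
    At most one state survives per level, so only O(log T) states are kept."""
    carry = np.outer(k_t, v_t)                 # level-0 summary of the new token
    level = 0
    while level < len(levels) and levels[level] is not None:
        carry = levels[level] + carry          # merge two buckets of equal size
        levels[level] = None
        level += 1
    if level == len(levels):
        levels.append(None)
    levels[level] = carry                      # place the promoted bucket
    return levels

# The surviving states are then blended with the per-level lambda weights,
# exactly as in the readout sketch above.
```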

Chunk-wise algorithm for the matrix decomposition.
This framework generalizes linear attention variants like Mamba-2 and Gated DeltaNet. By composing their structured masking matrices with the log-linear hierarchy, these models gain multi-scale memory without altering their core interaction mechanics. For example, log-linear Mamba-2 retains its data-dependent gating but attends to logarithmic states instead of a single fixed state.
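Conceptually, the composition can be pictured as scaling an existing structured mask entry-by-entry with the bucket-level λ weights. The toy sketch below materializes the full T×T matrix purely for illustration (the paper never does this, and the exact composition per variant follows its construction); the elementwise product and the helper reused from the sketch above are my simplifications.

```python
import numpy as np

def fenwick_buckets(n: int):
    """Same helper as in the earlier sketch: power-of-two partition of [0, n)."""
    buckets, start = [], 0
    for bit in reversed(range(n.bit_length())):
        if n & (1 << bit):
            buckets.append((start, start + (1 << bit)))
            start += 1 << bit
    return buckets

def log_linear_mask(base_mask, lambdas):
    """Scale each entry (t, s) of an existing lower-triangular structured mask
    (e.g. a Mamba-2-style cumulative-decay mask) by the data-dependent weight
    of the Fenwick bucket that source position s falls into when attending
    from position t. lambdas has shape (T, num_levels), finest level first."""
    T = base_mask.shape[0]
    M = np.zeros_like(base_mask)
    for t in range(T):
        buckets = list(reversed(fenwick_buckets(t + 1)))   # finest bucket = level 0
        for s in range(t + 1):
            level = next(l for l, (lo, hi) in enumerate(buckets) if lo <= s < hi)
            M[t, s] = base_mask[t, s] * lambdas[t, level]
    return M
```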
Performance and Implications
The initial results show that log-linear attention performs well on the synthetic associative recall task MQAR: it maintains near-perfect accuracy as sequence length scales, while linear baselines like DeltaNet degrade significantly. In language modeling, when pretrained on 50B tokens, log-linear variants of Mamba-2 and Gated DeltaNet reduce perplexity and improve performance on 6–8 of 9 commonsense reasoning benchmarks.

Additionally, log-linear Gated DeltaNet matches or exceeds its linear counterpart on real-world retrieval tasks (e.g., SQuAD and TriviaQA), even at 16K-token contexts. However, custom kernels are needed to optimize intra-chunk operations, and the hierarchical inductive bias may not suit all applications.
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
Wang et al. [Qwen Team, LeapLab]
♥ 340 LLM RLVR
High-Entropy Tokens in LLMs
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced reasoning in large language models, but its inner workings are still unclear. Current methods update all tokens equally during training, overlooking their distinct roles in reasoning chains. This gap led researchers to investigate RLVR through token entropy patterns, which revealed that only a critical minority of tokens steer reasoning paths.
This research proposes focusing updates on these high-entropy "forking tokens" to understand the RLVR process and unlock additional efficiency gains.

How Token Entropy Guides Reasoning
In Chain-of-Thought reasoning, tokens split into two functional groups. Roughly 80% are low-entropy tokens, which contain deterministic elements like word suffixes or code fragments that follow established paths. The remaining 20% are high-entropy "forking tokens," such as logical connectors ("however," "thus") or decision points ("assume," "define"). These introduce uncertainty, branching reasoning into multiple pathways. For instance, in a math problem, a token like "suppose" might pivot the solution between algebraic or geometric approaches.
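Identifying forking tokens only requires the per-token entropy of the policy's next-token distribution. Here is a hedged PyTorch sketch with toy shapes; the 80/20 split mirrors the paper's framing but the exact batching and thresholding details are illustrative.

```python
import torch

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token entropy H_t = -sum_v p_t(v) log p_t(v) of the model's
    next-token distribution at every position."""
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)        # (seq_len,)

# Toy example: split a rollout into ~80% low-entropy tokens and the
# ~20% high-entropy "forking" tokens the paper focuses on.
logits = torch.randn(128, 32000)                   # (seq_len, vocab), dummy values
H = token_entropy(logits)
forking_mask = H >= torch.quantile(H, 0.8)         # top 20% by entropy
```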
During RLVR training, models largely preserve the base model’s entropy distribution. Policy updates primarily adjust high-entropy tokens and subtly increase their exploratory potential. On the other hand, low-entropy tokens show minimal entropy fluctuation, acting as stable anchors. This selective adaptation suggests RLVR refines reasoning by optimizing forks instead of rewriting entire paths.

Average scores of AIME 2024 and AIME 2025.
Building on this observation, the method masks gradients for low-entropy tokens and updates only the top 20% high-entropy tokens during training, taking advantage of the finding that forking tokens drive nearly all performance gains. By concentrating learning where uncertainty matters, the approach reduces computational overhead while amplifying exploration.
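A schematic of the masked update, assuming per-token log-probabilities, advantages, and entropies are already available. This uses a plain policy-gradient surrogate as a stand-in for the paper's clipped DAPO objective; only the masking idea is the point here.

```python
import torch

def masked_pg_loss(logprobs, advantages, entropies, keep_frac=0.2):
    """Restrict policy-gradient updates to the top `keep_frac` highest-entropy
    (forking) tokens by zeroing everything else out of the loss. A schematic
    REINFORCE-style surrogate, not the paper's exact clipped objective."""
    threshold = torch.quantile(entropies, 1.0 - keep_frac)
    mask = (entropies >= threshold).float()        # 1 for high-entropy tokens
    per_token = -(advantages * logprobs)           # gradient flows through logprobs
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```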
Scaling Gains for High-Entropy Tokens
The initial benchmark results show striking efficiency: updating only 20% of tokens matches full-update performance on smaller models like Qwen3-8B. For larger models, it delivers substantial boosts; Qwen3-32B jumped +11.04 on AIME’25 and +7.71 on AIME’24, setting new state-of-the-art scores.

Entropy patterns in the chain of thoughts of LLMs.
On the other hand, training solely on low-entropy tokens caused a sharp decline in performance. Additionally, the performance gains scaled with model size, and the technique generalized to out-of-domain tasks like LiveCodeBench, which hints at broader applicability. The token-entropy metric provides a new perspective on why RLVR outperforms supervised fine-tuning: it optimizes exploratory forks rather than memorizing paths.

Comparison between vanilla DAPO using all tokens and DAPO using only the top 20% high-entropy tokens.
How much do language models memorize?
Morris et al. [FAIR at Meta, Google DeepMind, Cornell University, NVIDIA]
♥ 3.2k LLM Interp bycloud’s pick
Understanding Memorization in Language Models
Language models are getting incredibly powerful, but many people ask the same question about them: are these models really intelligent, or are they only memorizing answers? Existing definitions and approaches often fail to answer this, because merely extracting a string from a model doesn’t prove memorization, and verbatim reproduction isn’t always necessary. This ambiguity makes it hard to measure what’s truly stored in a model’s parameters. This paper tackles the problem by redefining memorization through an information-theoretic lens, offering a clearer way to quantify unintended memorization and generalization.

How Much Language Models Actually Memorize
The authors of this study propose using Kolmogorov complexity, which measures the shortest description length of data. Here, memorization is defined as the reduction in bits needed to encode a data point when using the model as a reference, and unintended memorization captures the sample-specific details stored beyond what’s expected from generalization.
You can think of it this way: a model has “memorized” a data point if that point can be represented more compactly when the model is used as a reference. Concretely, they measure unintended memorization in bits: the difference between the inherent information in a data point and its compressed size when leveraging the model. This separates unintended memorization (sample-specific details) from generalization (learned patterns). For example, a model might memorize an exact phone number (unintended) versus learning arithmetic to generate new ones (generalization).
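In code, the compression view boils down to measuring code length under a model. The sketch below assumes a Hugging Face-style causal LM whose forward pass returns `.logits`; the choice of reference model and the exact estimator follow the paper, so treat this as a schematic of the idea rather than the authors' implementation.

```python
import math
import torch

def bits_to_encode(model, token_ids: torch.Tensor) -> float:
    """Code length of a sequence under a model: sum over positions of
    -log2 p(x_t | x_<t), i.e. the compressed size of the sample when the
    model drives an ideal arithmetic coder."""
    with torch.no_grad():
        logits = model(token_ids.unsqueeze(0)).logits[0, :-1]      # predicts tokens 1..T-1
        logp = torch.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, token_ids[1:].unsqueeze(-1)).squeeze(-1)
    return (-token_logp.sum() / math.log(2)).item()                # nats -> bits

# Unintended memorization of a sample x, in bits: how many fewer bits the
# trained model needs than a reference model that only captures general
# structure (e.g. one trained without x):
#   mem(x) = bits_to_encode(reference_model, x) - bits_to_encode(trained_model, x)
```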

To validate this, the team first eliminated generalization variables using synthetic datasets of random bit strings; here, every bit must be memorized since patterns don’t exist. They trained GPT-style transformers of varying sizes and precisions, finding that models store roughly 3.51 bits per parameter at bfloat16 precision and 3.83 at float32. Next, they applied the method to real text (from the deduplicated FineWeb dataset). Results showed that unintended memorization peaks when dataset size approaches model capacity. Beyond this point, double descent occurs: test loss drops sharply as models shift from memorizing samples to generalizing patterns.
Capacity Limits, Double Descent, and Membership Inference in LLMs
The results of this study reveal clear patterns. On synthetic data, models plateau in memorization once dataset size exceeds their capacity (roughly 3.5 bits per parameter). For real text, double descent emerges precisely when dataset size surpasses this capacity: test loss briefly worsens around that point and then falls as models shift from memorizing samples to learning general patterns. This transition forces models to share information across data points, enabling generalization.

Membership inference attacks (predicting if a sample was in the training data) follow a scaling law based on the capacity-to-dataset ratio. Success drops predictably as datasets grow, nearing random guessing (F1 ~0.5) for large datasets like those in modern LLMs. Validation on GPT-2-scale models confirms this: at high data-to-capacity ratios, attacks fail. The authors predict most large models are trained on too much data for reliable membership inference.
In the future, perhaps we can use this way of quantifying unintended memorization to design architectures that generalize more efficiently without over-retaining data.
🚨This week's top AI/ML research papers:
- Log-Linear Attention
- Beyond the 80/20 Rule
- Why Gradients Rapidly Increase Near the End of Training
- How much do language models memorize?
- General agents need world models
- The Illusion of Thinking
- MiMo-VL Technical Report
— The AI Timeline (@TheAITimeline)
2:01 AM • Jun 9, 2025