MiniMax M3's New Attention: MiniMax Sparse Attention

plus more about FlashMemory-DeepSeek-V4, Trajectory-Refined Distillation, Test-Time Gradient Guidance, and End-to-End Context Compression at Scale

June 8th ~ June 16th
#112 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 13k Moonshot AI has released Kimi-K2.7-Code, which is an open-source coding model with improved instruction following, coding, and agent capabilities compared to its predecessor, K2.6. The updated model shows benchmark gains, including a 21.8% increase on the Kimi Code Bench v2, while reducing reasoning-token usage by 30% for greater efficiency. You can try it on Kimi Platform.

  2. ♥ 8.5k Z.ai has announced the rollout of GLM-5.2, its new flagship model with advanced coding capabilities, a 1-million-token context window, and configurable reasoning modes. The model is currently available to GLM Coding Plan subscribers and is scheduled to be officially open-sourced under the MIT License next week.

  3. ♥ 88k Anthropic has suspended access to its Fable 5 and Mythos 5 models after a US government export control directive citing national security concerns. The restriction impacts all global customers and limits access for foreign nationals. While Anthropic is working to address the regulation, you can explore other open-source models.

  4. ♥ 1.4k Google has introduced DiffusionGemma, a new 26B Mixture of Experts open model designed to explore text diffusion techniques. By generating entire 256-token blocks simultaneously, the model achieves up to four times faster inference on GPUs, though its overall output quality is lower than standard Gemma 4 models. You can try it on Model Garden or Hugging Face.

Intuitive AI Academy - NEW Optimization Chapter!

My latest project: Intuitive AI Academy has the perfect starting point for you! We focus on building your intuition to understand LLMs, from transformer components, to post-training logic. All in one place.

We just added a new chapter on Optimization that goes through the history, the key techniques, and the current state of the optimizers that frontier models use.

We currently have an exclusive offer for newsletter readers: 40% off the yearly plan.

Use code: TIMELINE

End-to-End Context Compression at Scale

Li et al. [New York University, Modal Labs, University of Maryland, Princeton University, Columbia University, Harvard University, Lawrence Livermore National Laboratory, FAIR at Meta]

♥ 430   LLM Context  

There is a fundamental bottleneck in AI: memory. When LLMs try to read massive documents or entire software codebases, the system's memory footprint and processing time grow rapidly as the input gets longer.

Until now, the main workaround has been to selectively forget information by trimming the system's internal memory cache. However, this approach is often remarkably slow, computationally unstable, or degrades the model’s intelligence.

Examples of the three data types used to train LCLMs

This paper introduces Latent Context Language Models (LCLMs), which rethink how machines ingest information. Instead of forcing the main AI engine to read a sprawling mountain of text word by word, the researchers placed a smaller, highly efficient encoder model in front of it.

A from-scratch pre-training sweep identifies the best encoder-decoder compressor architecture.

This encoder acts like a brilliant summarizer. It processes large blocks of text and mathematically compresses them into a much shorter sequence of dense representations called soft tokens. An adapter then translates these soft tokens into a format the main model natively understands. By handling the heavy lifting upfront, the main model processes a fraction of the data.
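To make the data flow concrete, here is a minimal sketch of the compress-then-adapt idea. It is not the paper's architecture: the real encoder and adapter are learned networks, while this toy uses mean pooling as a stand-in encoder and a random linear map as a stand-in adapter. The names `mean_pool_encoder`, `linear_adapter`, and all sizes are assumptions chosen for illustration.

```python
import random

def mean_pool_encoder(token_embeddings, block_size):
    """Toy stand-in for the encoder: average each block of token vectors into one soft token."""
    soft_tokens = []
    for start in range(0, len(token_embeddings), block_size):
        block = token_embeddings[start:start + block_size]
        dim = len(block[0])
        soft_tokens.append([sum(vec[d] for vec in block) / len(block) for d in range(dim)])
    return soft_tokens

def linear_adapter(soft_tokens, weight):
    """Toy stand-in for the adapter: project encoder-space vectors into the decoder's width.

    weight has shape [decoder_dim][encoder_dim].
    """
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight] for vec in soft_tokens]

random.seed(0)
ENC_DIM, DEC_DIM, BLOCK = 4, 6, 16   # hypothetical sizes
context = [[random.gauss(0, 1) for _ in range(ENC_DIM)] for _ in range(128)]
adapter_w = [[random.gauss(0, 0.5) for _ in range(ENC_DIM)] for _ in range(DEC_DIM)]

soft = mean_pool_encoder(context, BLOCK)
decoder_inputs = linear_adapter(soft, adapter_w)
print(len(context), "tokens ->", len(decoder_inputs), "soft tokens of width", len(decoder_inputs[0]))
```

With a block size of 16, the main model in this toy sees 8 vectors instead of 128, which is where the memory and latency savings would come from.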

LCLMs can use tools to retrieve compressed context and improve exact string-match accuracy.

This breakthrough shifts the limits of what models can handle. The researchers found that this architecture reduces both peak memory usage and the time it takes to generate an answer, all while preserving the system's baseline intelligence.

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Zhou et al. [UC Berkeley, Physical Intelligence]

♥ 1k   RL

Researchers have been struggling to scale up reinforcement learning because of a messy training dynamic in which an "actor" system learns to make decisions while a "critic" system simultaneously learns to judge them. Training these two together is highly unstable, especially when using advanced generative models that create complex actions step-by-step.

Illustrative example of 1D denoising process mapping Gaussian noise to a tri-modal distribution

If the critic model updates, the actor gets confused, which makes it incredibly difficult to build larger, smarter robotic control systems. Researchers wanted to know if they could skip this chaotic paired training altogether.

The researchers discovered a remarkably elegant workaround called Q-Guided Flow. Instead of forcing the actor and critic to learn together, they trained them separately. When the AI is actually running (a phase called test time), it generates actions through a gradual process of removing noise. Normally, asking the critic for directions during these noisy, half-finished steps causes computational confusion or requires massively expensive backward math.

However, the researchers found a brilliant shortcut. At each step, the system makes a rapid mathematical guess of what the final, perfectly clean action will look like. It shows this clean guess to the critic, receives reliable feedback, and uses that insight to gently steer the ongoing action toward a higher-value outcome.
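The steering step can be sketched in one dimension. This is a toy under stated assumptions, not the paper's implementation: the behavior policy is a Gaussian with a closed-form clean-action estimate, the critic is a hand-written quadratic, and `guidance_scale`, `MU`, `SIGMA`, and `TARGET` are hypothetical values. At each denoising step the code guesses the final clean action, asks the critic for a gradient at that guess, and adds it to the flow's velocity.

```python
import random

MU, SIGMA = 0.0, 0.5   # toy behavior policy: actions ~ N(MU, SIGMA^2)
TARGET = 1.0           # toy critic prefers actions near TARGET

def q_value(a):
    """Hand-written stand-in for a learned critic."""
    return -(a - TARGET) ** 2

def q_grad(a):
    """Gradient of q_value with respect to the action."""
    return -2.0 * (a - TARGET)

def clean_estimate(x, t):
    """Guess of the final action: E[x1 | x_t] for x_t = (1 - t) * noise + t * x1, noise ~ N(0, 1)."""
    gain = t * SIGMA ** 2 / ((1 - t) ** 2 + (t * SIGMA) ** 2)
    return MU + gain * (x - t * MU)

def sample_action(noise, guidance_scale, steps=20):
    """Euler-integrate the flow from noise to an action, nudged by the critic at each step."""
    x, dt = noise, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x1_hat = clean_estimate(x, t)                 # clean guess shown to the critic
        velocity = (x1_hat - x) / (1 - t)             # unguided flow direction
        velocity += guidance_scale * q_grad(x1_hat)   # steer toward higher value
        x += dt * velocity
    return x

random.seed(0)
noises = [random.gauss(0, 1) for _ in range(500)]
base_q = sum(q_value(sample_action(z, 0.0)) for z in noises) / len(noises)
guided_q = sum(q_value(sample_action(z, 0.5)) for z in noises) / len(noises)
print(f"mean Q without guidance: {base_q:.3f}, with guidance: {guided_q:.3f}")
```

Because the critic is only ever evaluated on the clean guess, it never has to score a noisy, half-finished action, and the actor itself is never retrained.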

Offline RL performance at 500k training steps (20 tasks, 10 seeds)

By applying this simple guidance exactly when the AI is acting, the resulting models are faster, cheaper to run, and smoothly scale up to handle much harder tasks without breaking.

MiniMax Sparse Attention

Lai et al. [MiniMax, Peking University, NVIDIA, Zhejiang University, Huazhong University of Science and Technology, Nanjing University, Hangzhou Dianzi University]

♥ 511   LLM Attention   bycloud’s pick  

Overview of MSA

This paper introduces a new technique called MiniMax Sparse Attention (MSA). Instead of forcing the AI to expend heavy processing power analyzing every single word in its vast memory simultaneously, the researchers designed a highly efficient filtering system. They built a lightweight indexer that quickly scans the data in blocks.

It identifies only the most important information relevant to the current task (while always keeping the most recent context active to maintain stability) and skips the rest. Once these blocks are selected, the model’s main engine focuses its full attention exclusively on them.
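The select-then-attend pattern described above can be sketched as follows. This is an illustrative toy, not MiniMax's kernel: the real indexer is a learned component, while here each block is scored by the dot product between the query and the block's mean key. The function names, `block_size`, and `top_k` are assumptions.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(query, keys, block_size, top_k):
    """Toy indexer: rank blocks by query . mean(key), keep top_k plus the most recent block."""
    n_blocks = math.ceil(len(keys) / block_size)
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append(dot(query, mean_key))
    last = n_blocks - 1   # the most recent block is always kept
    ranked = sorted(range(last), key=lambda b: scores[b], reverse=True)
    return sorted(set(ranked[:top_k]) | {last})

def sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention restricted to the tokens inside the selected blocks."""
    chosen = select_blocks(query, keys, block_size, top_k)
    idx = [i for b in chosen
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    logits = [dot(query, keys[i]) / math.sqrt(len(query)) for i in idx]
    peak = max(logits)   # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total for d in range(dim)]
    return out, chosen

random.seed(0)
DIM, N_TOKENS = 8, 64   # hypothetical sizes
keys = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_TOKENS)]
values = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_TOKENS)]
query = [random.gauss(0, 1) for _ in range(DIM)]

out, chosen = sparse_attention(query, keys, values, block_size=8, top_k=2)
print("attended blocks:", chosen, "of", N_TOKENS // 8)
```

In this toy the attention step touches 3 of 8 blocks, so its cost depends on `top_k` rather than on the full context length; the indexer still scans every block, which is why it has to be cheap.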

Efficiency comparison between GQA and MSA under the shared experimental model configuration.

When the researchers tested this on a massive 109-billion-parameter model, the method maintained the quality of traditional approaches while drastically cutting the workload. The system absorbs initial data fourteen times faster and generates new responses more than seven times faster.

Trajectory-Refined Distillation

Jiang et al. [McGill University, Mila Quebec AI Institute, UT Austin]

♥ 360   LLM Distillation  

Researchers are building smarter AI models by pairing a smaller "student" model with a highly advanced "teacher" model. The student practices solving a problem step-by-step, and the teacher grades its output word-by-word. However, researchers recently identified a major structural roadblock in this process called "prefix failure."

TRD refines student-generated trajectories y_o into improved trajectories y_r, which are then used for distillation.

Let’s imagine a student taking a complex math test and making a logical error in the very first step. Because AI models generate text sequentially, once the student goes down this wrong path, the rest of the answer is doomed. When this happens, the teacher model gets mathematically confused, awkwardly trying to correct the student word-by-word while still trapped in the student's flawed train of thought.

Under prefix failure, the teacher distribution becomes a mixture with two modes.

Until now, fixes merely involved ignoring or adjusting the penalties for these individual bad words, completely failing to address the reality that the underlying logic was already hopelessly broken.

OPD Avg@16 results (%) using Qwen3-8B as the teacher

To solve this, researchers introduced a highly promising approach called Trajectory-Refined Distillation. Rather than relying on word-level nitpicking, this method steps back to look at the big picture. Instead of grading a bad thought process, the system takes the student's initial attempt and allows the teacher to gently revise the entire reasoning path into a cohesive, corrected draft before the actual learning occurs.

Trajectory analysis

By correcting foundational missteps at their source, the student model is exposed to completely new, valid ways to reason through complex problems.

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Wang et al. [Independent Researchers, Tencent, The Hong Kong University of Science and Technology (Guangzhou), Tsinghua University]

♥ 219   LLM Attention  

To generate a single word of a response, an LLM must keep its entire context history actively loaded in working memory. Researchers realized this creates a severe hardware bottleneck. They discovered a striking inefficiency: most of the time, models processing massive contexts only need the most recent sliver of information to form a response.

Architectural overview of LSA vs. CSA.

This paper introduces a new framework called Lookahead Sparse Attention. Instead of passively forcing the AI to carry the full weight of its history, they introduced a "Neural Memory Indexer." Every few steps, this indexer evaluates the AI's current thought process, dynamically predicting and fetching only the critical historical chunks needed for the immediate future.
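A rough sketch of the "refresh every few steps" pattern is shown below. It is an assumption-heavy toy rather than the paper's Neural Memory Indexer: scoring is a plain dot product between a decoding state and a per-chunk summary vector, and `decode_with_lookahead`, `refresh_every`, and `top_k` are hypothetical names and parameters.

```python
import random

def indexer_scores(state, chunk_summaries):
    """Toy indexer: relevance of each history chunk to the current decoding state."""
    return [sum(s * c for s, c in zip(state, summary)) for summary in chunk_summaries]

def decode_with_lookahead(states, chunk_summaries, top_k, refresh_every):
    """Return, for every decoding step, the list of chunk ids held in active memory."""
    resident, history = [], []
    for step, state in enumerate(states):
        if step % refresh_every == 0:
            # Re-rank all chunks and fetch the top_k for the next few steps.
            scores = indexer_scores(state, chunk_summaries)
            ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
            resident = sorted(ranked[:top_k])
        history.append(list(resident))   # between refreshes, reuse the same chunks
    return history

random.seed(0)
DIM, N_CHUNKS, N_STEPS = 8, 40, 12   # hypothetical sizes
chunk_summaries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_CHUNKS)]
states = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_STEPS)]

history = decode_with_lookahead(states, chunk_summaries, top_k=4, refresh_every=4)
print("chunks resident per step:", len(history[0]), "of", N_CHUNKS)
```

The design point this toy illustrates is that the indexer runs only on refresh steps, so the steps in between reuse the chunks that were already fetched.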

The researchers managed to train this lightweight indexer entirely independently from the massive core model, bypassing tremendous computational costs and allowing it to be optimized on its own.

By loading only what is strictly necessary, the system shrinks the active memory footprint down to just 13.5 percent of that of traditional models. At extreme context scales of half a million tokens, memory overhead drops by over ninety percent.
