Language Modeling with Autoregressive U-Nets

plus more about LongLLaDA and MiniMax-M1

June 16th ~ June 22nd
#61 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ SPONSORED The new Warp 2.0 launch introduces an agentic development environment that embeds top-ranked coding agents inside a GPU-accelerated terminal, letting developers prompt, debug, and ship in parallel threads. It currently ranks #1 on Terminal Bench (outperforming Claude Code). Use code “BYCLOUD” for 1 month of free Warp Pro.

    rank #1 on Terminal Bench

  2. ♥ 3.9k Midjourney introduces their first video generation model, the V1 Video Model. Like their image generators, it produces extremely aesthetic results. You can try it out now for just $10 per month.

    V1 Video Model Demo

  3. ♥ 1.3k Moonshot AI introduces Kimi-Researcher, an autonomous LLM agent trained end-to-end with agentic RL that excels at multi-turn search and reasoning, topping xbench-DeepSearch.

    Kimi-Researcher Benchmark

  4. ♥ 868 Prime Intellect introduces SYNTHETIC-2, a planetary-scale, peer-to-peer synthetic dataset generation run across decentralized systems. The goal is to release an open reasoning dataset covering the toughest RL tasks, generated with crowd-sourced GPUs.

    live SYNTHETIC-2 dashboard

LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs

Liu et al. [Fudan University, Shanghai Innovation Institute, Shanghai AI Lab]

♥ 212   Diffusion LM

Introduction to Long-Context Challenges

In the past few weeks, we have seen an uptick in diffusion-based language models (dLLMs), and newer models like LLaDA show promise in addressing limitations of auto-regressive models, such as mitigating the reversal curse and offering better multimodal adaptability. However, we don’t yet know how they handle long-context problems.

Traditional auto-regressive LLMs like LLaMA3 struggle badly when the context exceeds their pre-trained length (e.g., 8k tokens), suffering perplexity spikes and retrieval failures. Diffusion LLMs, by contrast, show surprising stability under direct extrapolation. This paper investigates why, and how their context capabilities can be extended systematically.

Stability and Local Perception in Diffusion LLMs

Diffusion LLMs use bidirectional attention during training, which exposes them to symmetric relative positions (e.g., [-4095, 4095] for a 4k context). This is very different from auto-regressive models, which see only forward positions (e.g., [0, 8191] for 8k). The Rotary Position Embedding (RoPE) in diffusion models therefore captures complete sinusoidal periods for moderate frequencies, which leads to robust extrapolation.
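As a quick illustration of that difference, here is a minimal Python sketch (the helper name is ours, not from the paper) of the relative-distance range each attention pattern exposes during training:

```python
def rope_relative_range(train_len: int, bidirectional: bool) -> tuple[int, int]:
    """Range of relative distances (query_pos - key_pos) seen during training."""
    if bidirectional:
        # Bidirectional attention: every token attends to every other token,
        # so the distances span a symmetric interval around zero.
        return -(train_len - 1), train_len - 1
    # Causal attention: queries only attend to earlier keys, so distances are non-negative.
    return 0, train_len - 1

print(rope_relative_range(4096, bidirectional=True))   # (-4095, 4095), diffusion LLM with a 4k context
print(rope_relative_range(8192, bidirectional=False))  # (0, 8191), auto-regressive LLM with an 8k context
```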

When the context surpasses training length, diffusion LLMs don’t collapse like auto-regressive counterparts. Instead, they show "local perception": focusing on recent segments. For instance, in Needle-In-A-Haystack (NIAH) tasks, LLaDA retrieves information from the latest 4k tokens even at 24k context, acting like a sliding window.

This behavior can be modulated by sampling more steps: increasing the number of steps (e.g., 16 vs. 4) extends the retrievable depth slightly, but the fundamental limit remains tied to the pre-training context. Visualizing QK states via t-SNE projections confirms that diffusion models maintain uniform position-embedding manifolds during extrapolation, avoiding the distribution shifts seen in auto-regressive LLMs.

Extending LLM Context with LongLLaDA

After analyzing this stability, the authors propose LongLLaDA, a training-free method to extend context windows. It adapts NTK-based RoPE extrapolation, originally designed for auto-regressive models, to diffusion LLMs. By calculating a scaling factor λ (e.g., λ=14 for a 16k context) from the rotary base dimensions and training length, LongLLaDA adjusts position embeddings dynamically. At λ=14, LLaDA achieves near-perfect retrieval across 16k contexts, with the "local perception" effect scaling proportionally. Pushing further to λ=31 (24k) introduces a "lost-in-the-middle" pattern, indicating practical limits, while λ=55 (32k) fails. Crucially, scaling laws developed for auto-regressive models transfer seamlessly here.
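To make the mechanism concrete, the sketch below applies the generic NTK-aware trick of rescaling the rotary base by a scale factor; the function names are ours and the paper's exact derivation of λ differs, so we simply plug in the reported λ=14:

```python
import torch

def ntk_scaled_inv_freq(head_dim: int, base: float = 10000.0, scale: float = 1.0) -> torch.Tensor:
    """Inverse RoPE frequencies with an NTK-aware rescaled base.

    Multiplying the rotary base by scale**(head_dim / (head_dim - 2)) stretches the
    low-frequency dimensions so positions beyond the training window stay closer to
    the distribution seen during training, while high-frequency dimensions barely move.
    """
    scaled_base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / scaled_base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)

# Plug in the reported lambda = 14 to target a 16k context for a 4k-trained LLaDA.
inv_freq = ntk_scaled_inv_freq(head_dim=128, scale=14.0)
angles = torch.outer(torch.arange(16384, dtype=torch.float32), inv_freq)  # theta[p, i] = p * inv_freq[i]
print(angles.shape)  # torch.Size([16384, 64])
```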

Performance and Task-Specific Strengths of LongLLaDA

The researchers tested their models on various benchmarks and noticed some nuanced tradeoffs. On NIAH, the diffusion LLMs with LongLLaDA match auto-regressive performance within extended contexts (e.g., 96.4% accuracy at 16k). In LongBench and RULER evaluations:

  • Retrieval tasks: Diffusion LLMs perform comparably to auto-regressive models.

  • Aggregation tasks (e.g., variable tracing): They lag significantly.

  • QA and synthetic tasks: They excel, outperforming LLaMA3 by up to 20% in accuracy.

For example, LLaDA-8B-Instruct with λ=14 scored 88.9% on RULER’s QA subset at 8k context, surpassing LLaMA3’s 63.5%. However, aggregation tasks highlighted weaknesses, with scores dropping below 50%.

From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

Videau et al. [FAIR at Meta, TAU, INRIA and LISN, CNRS & Université Paris-Saclay, INSA Rouen Normandy]

♥ 22k   LLM Architecture    bycloud’s pick  

Introduction to Autoregressive U-Nets

Language models usually start with a fixed tokenization step, where they chop text into predefined units like words or subwords. This approach locks in granularity early, forcing models to work within rigid boundaries. For example, a word-level tokenizer might handle "The quick" as two tokens, while a character-level one sees each letter separately.

However, this inflexibility creates issues: rare tokens become harder to predict. Additionally, morphological relationships like "strawberry" and "strawberries" go unrecognized, and adapting to new languages or dialects is cumbersome. Byte-Pair Encoding (BPE) alleviates some problems but still relies on static embedding tables and a finite vocabulary.

Pooling selects the vectors at the positions specified by the splitting function.

To overcome these constraints, the researchers introduced the Autoregressive U-Net (AU-Net). This method eliminates predefined tokenization by processing raw bytes directly. Instead of embedding tables, it uses attention mechanisms to build contextual representations dynamically. The architecture adapts to multiple levels of granularity (bytes, words, or word groups), creating a flexible hierarchy that evolves with the data.

Inner Workings of Autoregressive U-Nets

The AU-Net architecture uses a U-shaped structure with a contracting path and an expanding path. The contracting path compresses the input sequence progressively. At the first level, it pools information at user-defined split points, like spaces between words, to form higher-level representations. For instance, Stage 1 processes individual bytes, Stage 2 pools at word boundaries, and Stage 3 groups every two words. At each split point, vectors are selected and projected into the next stage’s dimensionality using linear layers. Crucially, self-attention ensures these vectors summarize all preceding context, capturing dependencies like word roots or semantic connections.
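To make the contracting path concrete, here is a minimal PyTorch sketch of pooling at split points; the class and variable names are hypothetical and the real AU-Net implementation differs in detail:

```python
import torch
import torch.nn as nn

class SplitPooling(nn.Module):
    """Contracting-path sketch: select the hidden state at each split point
    (e.g., the last byte of a word) and project it to the next stage's width."""

    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, hidden: torch.Tensor, split_positions: torch.Tensor) -> torch.Tensor:
        # hidden: (seq_len, dim_in) byte-level states after causal self-attention,
        # so each selected vector already summarizes everything before it.
        pooled = hidden[split_positions]   # (num_words, dim_in)
        return self.proj(pooled)           # (num_words, dim_out)

# Toy usage: pool b"The quick brown fox" at word boundaries (last byte of each word).
text = b"The quick brown fox"
splits = torch.tensor(
    [i for i in range(len(text)) if i + 1 == len(text) or text[i + 1] == ord(" ")]
)
byte_states = torch.randn(len(text), 512)              # stand-in for Stage 1 outputs
stage2_inputs = SplitPooling(512, 1024)(byte_states, splits)
print(stage2_inputs.shape)                             # torch.Size([4, 1024])
```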

AU-Net scaling w.r.t compute

Skip connections bridge the contracting and expanding paths, preserving fine-grained details. During expansion, coarse vectors from deeper stages guide finer predictions. For example, a vector representing "The quick" might be upsampled to help predict "brown fox" at the byte level. During upsampling, the model duplicates each coarse vector across its segment and applies position-specific linear transformations. This allows deeper stages, which activate less frequently, to influence spelling or phrasing without constant computation.
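The expanding path can be sketched in the same spirit (again with hypothetical names): each coarse vector is duplicated across the bytes of its segment and passed through an offset-specific linear map before rejoining the byte-level stream:

```python
import torch
import torch.nn as nn

class SegmentUpsample(nn.Module):
    """Expanding-path sketch: broadcast each coarse (word-level) vector over the
    bytes of its segment, with a separate linear map per within-segment offset."""

    def __init__(self, dim_coarse: int, dim_fine: int, max_offset: int = 16):
        super().__init__()
        # One projection per position inside a segment (offset 0, 1, 2, ...).
        self.offset_proj = nn.ModuleList(
            [nn.Linear(dim_coarse, dim_fine) for _ in range(max_offset)]
        )

    def forward(self, coarse: torch.Tensor, segment_ids: torch.Tensor,
                offsets: torch.Tensor) -> torch.Tensor:
        # coarse: (num_segments, dim_coarse); segment_ids/offsets: one entry per byte.
        rows = [
            self.offset_proj[off](coarse[seg])
            for seg, off in zip(segment_ids.tolist(), offsets.tolist())
        ]
        # (seq_len, dim_fine), added to the byte-level states via the skip connection.
        return torch.stack(rows)

# Toy usage: two coarse vectors ("The", "quick") expanded over 3 and 5 bytes.
segment_ids = torch.tensor([0, 0, 0, 1, 1, 1, 1, 1])
offsets     = torch.tensor([0, 1, 2, 0, 1, 2, 3, 4])
fine = SegmentUpsample(1024, 512)(torch.randn(2, 1024), segment_ids, offsets)
print(fine.shape)  # torch.Size([8, 512])
```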

Evaluation and Results of Autoregressive U-Nets

The AU-Net architecture was tested against BPE-based transformers and byte-level baselines across benchmarks like Hellaswag, MMLU, and GSM8K. At a 1B-parameter scale with 370B training tokens, AU-Net-4 scored 73.7% on Hellaswag and 31.7% on MMLU, outperforming BPE’s 70.2% and 27.0%. Multistage hierarchies consistently matched or exceeded baselines, with gains amplifying in reasoning-heavy tasks like ARC Challenge. Efficiency remained practical: AU-Net-2 processed 225K bytes/second on H100 GPUs, rivaling BPE’s 210K.

Additionally, the AU-Net excelled in multilingual settings. For instance, on FLORES-200 translation, it improved BLEU scores for low-resource languages like Faroese (+1.2) and Limburgish (+4.6). In MMLU evaluations across 26 languages, it improved Roman and Germanic languages by 3-4 points, which demonstrates cross-linguistic transfer. 

However, a few limitations remain: AU-Net under-performs on the math-heavy GSM8K benchmark due to sparse math data in training, and it relies on space-based splitting, which is tailored to Latin scripts.

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

MiniMax Team

♥ 424   LLM Training

Efficient Reasoning at Scale with MiniMax-M1

Bigger reasoning LLMs tend to perform better, but scaling them up quickly hits a computational wall. Traditional transformer architectures struggle with the quadratic complexity of attention, which makes long-context tasks, such as processing million-token inputs or generating extensive reasoning chains, prohibitively expensive.

This paper introduces MiniMax-M1, which tackles the problem by rethinking efficiency from the ground up, combining a novel hybrid attention architecture with optimized training to slash computational costs while boosting performance in complex domains.

Inner workings of MiniMax-M1

MiniMax-M1 uses a hybrid Mixture-of-Experts (MoE) foundation, which activates 45.9 billion of its 456 billion parameters per token. The main change is Lightning Attention, an I/O-aware linear attention variant integrated into a hybrid block design: every eighth block uses traditional softmax attention, with Lightning Attention blocks in between. This hybrid approach reduces FLOPs dramatically for long sequences, using just 25% of the compute of models like DeepSeek R1 at 100K-token generations. Lightning Attention also natively supports 1M-token contexts, eight times larger than most competitors.
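The layering schedule can be sketched as below; the two block classes are mere placeholders (not an actual Lightning Attention implementation), and only the interleaving pattern reflects the description above:

```python
import torch
import torch.nn as nn

class LightningAttentionBlock(nn.Module):
    """Placeholder for an I/O-aware linear-attention block (cost linear in sequence length)."""
    def __init__(self, dim: int):
        super().__init__()
        self.mix = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mix(x)

class SoftmaxAttentionBlock(nn.Module):
    """Placeholder for a standard softmax self-attention block (cost quadratic in length)."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.attn(x, x, x, need_weights=False)[0]

def build_hybrid_stack(num_blocks: int, dim: int, period: int = 8) -> nn.Sequential:
    """Every `period`-th block is softmax attention, the rest are lightning attention,
    so the quadratic cost is only paid on 1/period of the layers."""
    return nn.Sequential(*[
        SoftmaxAttentionBlock(dim) if (i + 1) % period == 0 else LightningAttentionBlock(dim)
        for i in range(num_blocks)
    ])

stack = build_hybrid_stack(num_blocks=16, dim=512)
x = torch.randn(1, 1024, 512)     # (batch, seq_len, dim)
print(stack(x).shape)             # torch.Size([1, 1024, 512])
```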

The training process used a three-stage pipeline. First, continual pretraining on 7.5 trillion tokens emphasized STEM and reasoning-heavy data. Next, supervised fine-tuning injected chain-of-thought patterns. The final phase used a novel RL algorithm called CISPO (Clipped Importance Sampling Policy Optimization). Unlike prior methods that clip token updates, CISPO stabilizes training by clipping the importance sampling weights instead, preserving gradient flow for the low-probability "fork" tokens that are critical to reasoning.
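Based on that description, a CISPO-style loss might look like the following sketch; the clipping thresholds, the token-level averaging, and the function signature are assumptions rather than the paper's exact objective:

```python
import torch

def cispo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
               advantages: torch.Tensor, eps_low: float = 0.2,
               eps_high: float = 0.2) -> torch.Tensor:
    """Sketch of a CISPO-style objective: clip (and detach) the importance-sampling
    weight instead of clipping the update itself as PPO-style methods do, so every
    token keeps a gradient signal.

    logp_new / logp_old: per-token log-probs under the current and behavior policies.
    advantages: per-token (or broadcast per-sequence) advantage estimates.
    """
    ratio = torch.exp(logp_new - logp_old.detach())
    # The clipped IS weight is detached, so gradients flow only through logp_new;
    # low-probability "fork" tokens are down-weighted, never zeroed out.
    is_weight = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()
    return -(is_weight * advantages * logp_new).mean()

# Toy usage with made-up per-token values.
logp_new = torch.log(torch.tensor([0.05, 0.60, 0.30], requires_grad=True))
logp_old = torch.log(torch.tensor([0.50, 0.50, 0.40]))
loss = cispo_loss(logp_new, logp_old, advantages=torch.tensor([1.0, 1.0, -0.5]))
loss.backward()
```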

After this, the researchers made a few engineering tweaks that proved essential for RL stability. A precision mismatch between training and inference kernels was resolved by switching the LM head to FP32. They also chose optimizer settings (β1=0.9, β2=0.95, ε=1e-15) that accommodate extreme gradient ranges, while repetition detection truncated degenerate outputs early. Together, these changes enabled full RL training on 512 H800 GPUs in just three weeks.
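A hedged sketch of those training-side settings, assuming an AdamW-style optimizer and using a small stand-in module for the LM head:

```python
import torch

# Stand-in for the LM head; the reported fix keeps its weights and logits in FP32
# to remove the precision mismatch between training and inference kernels.
lm_head = torch.nn.Linear(1024, 32000).to(torch.float32)

# Betas and eps as reported; the tiny eps keeps very small second-moment estimates
# (from extreme gradient ranges in RL) from flattening the update.
optimizer = torch.optim.AdamW(
    lm_head.parameters(),
    lr=1e-6,                 # illustrative value, not taken from the report
    betas=(0.9, 0.95),
    eps=1e-15,
)
```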

Evaluation and Benchmark Results of MiniMax-M1

The MiniMax-M1 model outperforms leading open-weight models (DeepSeek-R1, Qwen3-235B) across software engineering, tool use, and long-context tasks. In agentic benchmarks like TAU-Bench, it surpassed Gemini 2.5 Pro, while outperforming OpenAI o3 and Claude 4 Opus in long-context understanding. However, it is slightly behind DeepSeek-R1-0528 in math and coding competitions, which highlights a trade-off between specialization and versatility.
