DiffusionBlocks: Save 2-3x Training Memory!?

plus more about Bitter Lesson in Data Filtering, Do Language Models Need Sleep, and Neural Weight Norm.

May 26th ~ June 2nd
#110 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 1.5k StepFun has released Step 3.7 Flash, a 198B sparse Mixture-of-Experts model designed to optimize agentic, coding, and multimodal workflows with approximately 11B active parameters. The model offers 256K context support, high-performance tool use, and optimized speed, making it capable of running locally on specialized hardware. You can try it on GitHub or Hugging Face.

  2. ♥ 3.7k Liquid AI has released LFM2.5-8B-A1B, a hybrid Mixture-of-Experts model that significantly upgrades training data to 38T tokens and expands context length to 128k. Designed for agentic workflows, the model enhances instruction following and tool-use capabilities while maintaining efficient local performance across diverse hardware. You can try it on Liquid playground or Hugging Face.

  3. ♥ 67k Anthropic has released Claude Opus 4.8, which introduces improved judgment, enhanced self-assessment capabilities, and a "fast mode" that offers 2.5x speed at a lower price point. The update also brings dynamic workflows to Claude Code, allowing the model to manage complex, multi-file tasks by deploying parallel subagents.

  4. ♥ 529 Tencent has released Hy-MT2, a new open-source multilingual translation model available in sizes ranging from 1.8B to 30B parameters. The series features impressive efficiency, with the 1.8B version leveraging extreme quantization to run locally on mobile devices while outperforming several mainstream commercial APIs. You can try it on GitHub or Hugging Face.

Intuitive AI Academy - NEW Advanced RL Chapter!

My latest project Intuitive AI Academy has the perfect starting point for you! We cover everything from the basics, like transformer architecture, all the way to more advanced topics like LoRA, distillation, Mixture of Experts, and RLHF.

The goal is simple: make frontier AI systems easy to understand with clear explanations, visuals, interactive learning, and a structured path from fundamentals to cutting-edge techniques.

We have just added a new advanced RL chapter that covers the basics of RL and the current state of RLHF! We currently have a special newsletter offer: 40% off the yearly plan!

Use code: TIMELINE

DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation

Shing et al. [Sakana AI]

♥ 2.2k   LLM Training  

Training neural networks uses a lot of GPU memory because standard backpropagation has to store the activations of every layer. This becomes a bottleneck as models get deeper, since memory grows with depth. DiffusionBlocks asks whether we can train only small chunks of the model at a time, without the fragile local objectives that made older block-wise training methods perform badly.

The core idea is to reinterpret residual layers as steps in a diffusion denoising process. Instead of training the whole transformer end-to-end, the model is split into blocks, and each block is assigned a specific noise range. Each block then learns to denoise within that range independently, so training only needs gradients for one block at a time.

The research showed that this works across very different architectures, not just toy classification models. On CIFAR-100, DiffusionBlocks got 59.30% accuracy versus 60.25% for standard ViT while only training 4 layers at a time. For image generation, it matched or improved DiT results on CIFAR-10 and ImageNet with around 3× memory reduction. For autoregressive language modeling, it even improved LM1B MAUVE from 0.50 to 0.71 while training only 3 layers at a time.

The most interesting part is that the blocks are not just random chunks. They use equi-probability partitioning, meaning more capacity is placed around the intermediate noise levels where denoising is hardest, instead of wasting equal space on easy noise regions. This beat uniform partitioning in the ablations. But there is still a tradeoff: moderate block counts like 2 or 3 worked best, while too many blocks hurt quality because each block gets too little capacity.
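To make equi-probability partitioning concrete, here is a minimal sketch. It assumes noise levels are drawn from a log-normal distribution, as in EDM-style diffusion training. The function name and all default values (p_mean, p_std, sigma_min, sigma_max) are illustrative assumptions, not settings confirmed by the paper.

```python
import math
from statistics import NormalDist


def equal_probability_boundaries(num_blocks, p_mean=-1.2, p_std=1.2,
                                 sigma_min=0.002, sigma_max=80.0):
    """Split [sigma_min, sigma_max] into ranges of equal probability mass.

    Assumption: log(sigma) ~ Normal(p_mean, p_std). All defaults are
    illustrative, not values taken from the paper.
    """
    log_sigma = NormalDist(mu=p_mean, sigma=p_std)
    # Probability mass below each end of the allowed noise range.
    lo = log_sigma.cdf(math.log(sigma_min))
    hi = log_sigma.cdf(math.log(sigma_max))

    boundaries = [sigma_min]
    for i in range(1, num_blocks):
        # Quantile that leaves i / num_blocks of the mass below it.
        q = lo + (hi - lo) * i / num_blocks
        boundaries.append(math.exp(log_sigma.inv_cdf(q)))
    boundaries.append(sigma_max)
    return boundaries


if __name__ == "__main__":
    edges = equal_probability_boundaries(3)
    for block, (low, high) in enumerate(zip(edges, edges[1:])):
        print(f"block {block}: sigma in [{low:.4f}, {high:.4f}]")
```

With 3 blocks, the middle block covers a much narrower slice of log-noise than the outer two, so it spends its capacity on the intermediate noise levels, while a uniform split would give every block the same width.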

A Bitter Lesson for Data Filtering

Mohri et al. [Stanford University]

♥ 1.2k   LLM data curation  

A lot of the effort in training LLMs goes into filtering web data to remove "low-quality" or noisy text. This seems intuitive, but filtering throws away the vast majority of the internet's text, which creates a bottleneck because modern models require trillions of words to keep improving. This paper explores whether expensive, human-designed filters are truly necessary as computational power scales, or if models can learn to navigate the web's messy reality on their own.

670M-token CC pool versus junk-injected versions.

The research showed that with enough computing power and sufficiently large models, the optimal data filter is actually no filter at all. While smaller models struggle with cluttered datasets, larger models trained for longer periods are highly robust. They not only tolerate noisy text but actually benefit from seemingly "poor" data. To test this, researchers injected scrambled documents with completely randomized word orders into the training pool.

1B model performance as we vary the pool size; the total needed steps for pool to outperform RefinedWeb grows rapidly. Crossing point as a function of pool size for various model sizes

Surprisingly, the larger models still succeeded, even benefiting from the shuffled text. Because the original words remained, the models could still learn which terms frequently co-occur, helping them build associations despite the chaotic structure.
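The co-occurrence argument can be checked on a toy example. This sketch assumes "co-occurrence" means two words appearing in the same document, which is a simplification for illustration rather than the paper's analysis; scramble and cooccurrence_counts are hypothetical helper names.

```python
import random
from collections import Counter
from itertools import combinations


def scramble(document, seed=0):
    """Return the document with its words in a random order."""
    words = document.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def cooccurrence_counts(document):
    """Count each unordered pair of distinct words found in the document."""
    vocabulary = sorted(set(document.split()))
    return Counter(combinations(vocabulary, 2))


if __name__ == "__main__":
    original = "the cat sat on the mat"
    shuffled = scramble(original)
    print(shuffled)
    # Word order is destroyed, but the set of word pairs is unchanged.
    print(cooccurrence_counts(original) == cooccurrence_counts(shuffled))
```

Shuffling removes syntax but leaves document-level word statistics intact, which is the kind of signal the summary above says larger models can still use.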

The team concluded that as training budgets scale toward the high-compute limits of the near future, training directly on massive, unfiltered web pools will likely become the most effective strategy.

Neural Weight Norm = Kolmogorov Complexity

Musat [ETH Zürich]

♥ 1.1K   Complexity   bycloud’s pick  

It is common to penalize large weights as it helps neural networks generalize to new data, but classical learning theories cannot fully explain why. Because a network trained with weight decay has the same theoretical capacity as one trained without, traditional mathematics struggles to distinguish between a model that genuinely learns and one that merely memorizes noise.

Comparison of weight-norm-vs-complexity bounds. “Two-sided” indicates whether both directions of a sandwich are proved.

This paper tries to build a mathematical bridge connecting weight decay to Solomonoff’s universal prior, the theoretically optimal but uncomputable method for learning. By analyzing neural networks operating under fixed-precision arithmetic, the researchers proved that minimizing any weight norm is equivalent to finding the shortest computer program that outputs a given result. The length of that shortest program is the result's Kolmogorov complexity.

They established this through two tight reductions. First, any program can be preloaded directly into network weights at a cost of one parameter per bit. Second, any fixed-precision network can be compressed back into a program with only a slight logarithmic addressing overhead.
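The two counting arguments can be illustrated with a toy example. This is not the paper's construction, only a sketch of the bookkeeping: one direction stores each program bit as one weight, and the other describes a network by listing its nonzero weights as (address, value) pairs, where each address costs about log2 of the parameter count. All function names are hypothetical.

```python
import math


def program_to_weights(program):
    """Store each bit of a program (bytes) as one weight in {0, 1}."""
    return [(byte >> shift) & 1
            for byte in program
            for shift in range(7, -1, -1)]


def weights_to_program(weights):
    """Read the program back out of the {0, 1} weights."""
    out = bytearray()
    for start in range(0, len(weights), 8):
        byte = 0
        for bit in weights[start:start + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


def sparse_description_bits(weights, precision_bits):
    """Bits needed to list every nonzero weight as (address, value)."""
    nonzero = sum(1 for w in weights if w != 0)
    # Addressing one of len(weights) slots costs about log2(len(weights)) bits.
    address_bits = max(1, math.ceil(math.log2(len(weights))))
    return nonzero * (address_bits + precision_bits)


if __name__ == "__main__":
    weights = program_to_weights(b"hi")
    print(len(weights))                         # one parameter per bit
    print(weights_to_program(weights))          # round trip recovers the program
    print(sparse_description_bits(weights, 1))  # nonzero count times (address + value) bits
```

In this toy setting the description length grows with the number of nonzero weights, with only a logarithmic factor for addressing, which is the flavor of the "sandwich" described above.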

This discovery suggests that training with weight decay implicitly guides a model toward the most computationally elegant hypotheses, successfully bringing an idealized theory of universal learning into practical deep learning.

When Does LeJEPA Learn a World Model?

Klindt et al. [Cold Spring Harbor Laboratory, New York University, Brown University]

♥ 878   JEPA  

We want to build AI systems that truly understand the physical world instead of just memorizing patterns. Self-supervised learning tries to solve this by training models to predict how the world changes, but historically, we lacked a guarantee that these models were actually uncovering the true, underlying physical variables.

Without this guarantee, a model might scramble unrelated concepts, like mixing up an object's velocity with its texture. This kind of entanglement makes it incredibly difficult for an AI agent to plan actions or adapt when its environment changes. To build reliable world models for tasks like robotic planning, we need "linear identifiability," a mathematical assurance that the AI is cleanly separating the world's true degrees of freedom rather than creating a tangled web of observations.

LeJEPA Theory Illustration

This paper provided the first mathematical proof of linear identifiability for a class of self-supervised models known as Joint-Embedding Predictive Architectures. By analyzing how these models align different views of the same scene while regularizing the outputs to fit a Gaussian distribution, the team proved that the model is mathematically forced to recover the world's true latent variables.

Using spectral analysis, they demonstrated that any nonlinear distortion strictly degrades the model's performance, making a clean, linear recovery the optimal solution.

They also proved that the Gaussian is the unique distribution that makes this guarantee possible. Even when these conditions are only approximately met in the real world, the model's accuracy degrades gracefully, enabling optimal planning in latent spaces.
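The two ingredients described above, aligning views and pushing embeddings toward a Gaussian, can be sketched as a toy objective. The Gaussian term here is a simple moment-matching stand-in (zero mean, unit variance per dimension) chosen for illustration; it is not the regularizer used in the paper, and all function names are hypothetical.

```python
from statistics import fmean, pvariance


def alignment_loss(view_a, view_b):
    """Mean squared distance between embeddings of two views of each scene."""
    total, count = 0.0, 0
    for za, zb in zip(view_a, view_b):
        for a, b in zip(za, zb):
            total += (a - b) ** 2
            count += 1
    return total / count


def gaussian_moment_penalty(embeddings):
    """Penalize per-dimension mean away from 0 and variance away from 1.

    Assumption: matching two moments is only a crude proxy for
    'fits a Gaussian distribution'.
    """
    dimensions = list(zip(*embeddings))
    penalty = 0.0
    for values in dimensions:
        penalty += fmean(values) ** 2 + (pvariance(values) - 1.0) ** 2
    return penalty / len(dimensions)


def toy_objective(view_a, view_b, reg_weight=1.0):
    """Alignment term plus the averaged Gaussian penalty of both views."""
    reg = (gaussian_moment_penalty(view_a) + gaussian_moment_penalty(view_b)) / 2
    return alignment_loss(view_a, view_b) + reg_weight * reg


if __name__ == "__main__":
    collapsed = [[0.0], [0.0]]   # every scene maps to the same point
    spread = [[1.0], [-1.0]]     # zero mean, unit variance
    print(toy_objective(collapsed, collapsed))
    print(toy_objective(spread, spread))
```

The collapsed embedding aligns perfectly but is penalized by the Gaussian term, so the trivial solution is ruled out; this shows why alignment alone is not enough.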

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

Lee et al. [Carnegie Mellon University, University of Maryland]

♥ 912   LLM Sleeping  

AI models struggle with long-horizon tasks because their temporary memory, the attention cache, takes immense computational power to keep active. While newer hybrid models attempt to compress this raw data to save space, they often lose the ability to perform complex reasoning over details they can no longer actively see.

Researchers realized that the true bottleneck in AI memory is not just storage capacity, but having enough computational "thinking time" to transform past experiences into a highly organized, usable format. To bridge this gap, they sought a way for models to deeply process and store past details without slowing down their split-second response times.

To tackle this challenge, the research team designed a biologically inspired process they call "sleep." Just as the human brain replays memories during rest to cement them into long-term storage, this new architecture pauses when its active memory window becomes full.

At the eviction boundary, an SSM-attention hybrid performs N offline recurrent passes over the current context before discarding the attention cache.

During this quiet phase, the model runs multiple offline passes over the accumulated context, recursively updating its permanent weights inside its state-space blocks through a learned local rule before wiping its temporary cache clean. This clever design shifts the heavy computational work to the sleep phase, ensuring the model remains fast and efficient during active prediction.
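The control flow of this sleep phase can be sketched with a toy class. Everything here is a stand-in: a list plays the role of the attention cache, a single float plays the role of the persistent state, and a fixed delta-rule update replaces the learned local rule from the paper. The class and parameter names are hypothetical.

```python
class SleepyMemory:
    """Toy model of 'sleep': consolidate the cache offline, then evict it."""

    def __init__(self, window, sleep_passes, step_size=0.5):
        self.window = window              # cache capacity before eviction
        self.sleep_passes = sleep_passes  # N offline passes per sleep
        self.step_size = step_size        # fixed rate; the paper learns its rule
        self.cache = []                   # stand-in for the attention cache
        self.state = 0.0                  # stand-in for persistent state
        self.sleeps = 0

    def observe(self, value):
        """Online step: cheap append, sleep only when the window is full."""
        self.cache.append(value)
        if len(self.cache) == self.window:
            self._sleep()

    def _sleep(self):
        """Offline phase: replay the cached context N times, then wipe it."""
        for _ in range(self.sleep_passes):
            for value in self.cache:
                self.state += self.step_size * (value - self.state)
        self.cache.clear()
        self.sleeps += 1


if __name__ == "__main__":
    memory = SleepyMemory(window=3, sleep_passes=2)
    for value in [1.0, 2.0, 3.0]:
        memory.observe(value)
    print(memory.sleeps, memory.cache, memory.state)
```

Each observe call stays cheap; the nested replay loop only runs at the eviction boundary, which is where the extra compute is spent.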

Recurrence across context windows incurs minimal training overhead; recurrent depth increases cost linearly.

When tested on demanding reasoning tasks, including cellular automata simulations, multi-hop network paths, and complex mathematical equations, the researchers discovered that models with longer sleep durations achieved significantly improved accuracy. The gains were most pronounced on questions that demanded the deepest logic.
