Scaling Up Diffusion Language Models to 100B, Adding 1 Attention Layer & Making Visual Encoders Generate Images, LayerNorm Is Not Needed in Transformers, and more
Basically recapping what I missed in the last 4 months
PretrainZero, Stabilizing RL with LLMs and more
Breaking down "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?"
FreeFlow, DeepSeekMath-V2, Soft Adaptive Policy Optimization, and more
Plus more on Seer, Virtual Width Networks, SAM 3, and Evolution Strategies at the Hyperscale
LeJEPA, The Path Not Taken, and more
From Memorization to Reasoning in the Spectrum of Loss Curvature and Introducing Nested Learning: A new ML paradigm for continual learning
And more on Kimi Linear, Looped Transformers, How FP16 Fixes RL...
How to Compress Long Text into Images to Reduce LLM Tokens, and more
RLM, RAE, Reasoning with Sampling, and more
Plus more about Moloch's Bargain: Emergent Misalignment When LLMs Compete for Audiences and LLM Fine-Tuning Beyond Reinforcement Learning