The AI Timeline

Archive

Why Memorized Knowledge Fails to Generalize in LLM Finetuning

plus more about Single Async Opt for Agentic RL, Remember When It Matters, and Sparse Delta Memory

by cloud

Jul 07, 2026

You Only Need 1 Layer for RLVR?

plus more about AdaJEPA, Program-as-Weights, The World Is In Your Mind, and Dual On-policy Distillation

by cloud

Jun 30, 2026

DeepSeek Just dropped a new speculative decoding method!

plus more about Tapered LMs, Improved LLDMs, AutoData, and You Don't Need To Run Every Eval

by cloud

Jun 23, 2026

What even is a >< former (yes >< former)

plus more about Looped World Models, Fixed-Point Reasoners, and ExpRL

by cloud

Jun 16, 2026

MiniMax M3's New Attention: MiniMax Sparse Attention

plus more about FlashMemory-DeepSeek-V4, Trajectory-Refined Distillation, Test-Time Gradient Guidance, and End-to-End Context Compression at Scale

by cloud

Jun 09, 2026

Microsoft just shared the frontier data engineering secrets

plus more about If LLMs Have Human-Like Attributes, Then So Does Age of Empires II, Cosmos 3, and Robots Need More than VLA and World Models

by cloud

Jun 02, 2026

DiffusionBlocks: Save 2-3x Training Memory!?

plus more about Bitter Lesson in Data Filtering, Do Language Models Need Sleep, and Neural Weight Norm

by cloud

May 26, 2026

Generative Recursive Reasoning

plus more on the Benefits of Subword Tokenization, HRM-Text, Probabilistic Tiny Recursive Model, and Vector Policy Optimization

by cloud

May 19, 2026

Long Context Pre-Training w/ Lighthouse Attention

plus more about Self-distilled Agentic RL, Embedded Language Flows, and Negation Neglect

by cloud

May 12, 2026

Think In Diffusion: Continuous Latent Diffusion Language Model

plus more on Sparser, Faster, Lighter Transformer LMs, Manifold Steering, and Teaching Claude Why

by cloud

May 05, 2026

DeepSeek's Deleted Paper: Thinking With Visual Primitives

can't believe they removed this paper unknowingly

by cloud

Apr 29, 2026

There Will Be a Scientific Theory of Deep Learning

plus more about Hyperloop Transformer, Qwen-3.5 Omni, and Scaling Self-Play with Self-Guidance

