The AI Timeline

Archive

MiniMax M3's New Attention: MiniMax Sparse Attention

plus more about FlashMemory-DeepSeek-V4, Trajectory-Refined Distillation, Test-Time Gradient Guidance, and End-to-End Context Compression at Scale

by cloud

Jun 09, 2026

Microsoft just shared the frontier data engineering secrets

plus more about If LLMs Have Human-Like Attributes, Then So Does Age of Empires II, Cosmos 3, and Robots Need More than VLA and World Models

by cloud

Jun 02, 2026

DiffusionBlocks: Save 2-3x Training Memory!?

plus more about Bitter Lesson in Data Filtering, Do Language Models Need Sleep, and Neural Weight Norm.

by cloud

May 26, 2026

Generative Recursive Reasoning

plus more on the Benefits of Subword Tokenization, HRM-Text, Probabilistic Tiny Recursive Model, and Vector Policy Optimization

by cloud

May 19, 2026

Long Context Pre-Training w/ Lighthouse Attention

plus more about Self-distilled Agentic RL, Embedded Language Flows, and Negation Neglect

by cloud

May 12, 2026

Think In Diffusion: Continuous Latent Diffusion Language Model

plus more on Sparser, Faster, Lighter Transformer LMs, Manifold Steering, and Teaching Claude Why

by cloud

May 05, 2026

DeepSeek's Deleted Paper: Thinking With Visual Primitives

can't believe they removed this paper unknowingly

by cloud

Apr 29, 2026

There Will Be a Scientific Theory of Deep Learning

plus more about Hyperloop Transformer, Qwen-3.5 Omni, and Scaling Self-Play with Self-Guidance

Apr 21, 2026

Kimi Moonshot: Prefill-as-a-Service!?

plus more about Looped Transformers, Nexus, RNN with Memory, and more

by cloud

Apr 14, 2026

Neural Computer: Running an OS within an AI?!

plus more about In-Place TTT, TriAttention, and Interleaved Head Attention.

by cloud

Apr 07, 2026

Embarrassingly Simple Self-Distillation Technique

plus more on Path-Constrained MoE, HISA, and Screening is not enough

by cloud

Mar 31, 2026

LeWorldModel: JEPA but more practical

plus more on Claudini, Composer 2, and self-distillation

by cloud
