Depth Anything 3: Recovering the Visual Space from Any Views
LeJEPA, The Path Not Taken, and more
Nov 10th ~ Nov 19th
#82 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 7.5k OpenAI has begun deploying a new group chat feature in select regions across the globe. This allows multiple users to collaborate with each other and ChatGPT in a single, shared conversation. The feature is being rolled out on both mobile and web for logged-in users across all ChatGPT plans, including free and paid tiers.

♥ 1.5k Baidu has released ERNIE 5.0, its latest natively omni-modal foundational model, which excels in omni-modal understanding, creative writing, and instruction following. The ERNIE 5.0 Preview 1022 variant features stronger text capabilities, while the standard ERNIE 5.0 Preview is the latest overall version.

♥ 11k OpenAI has announced GPT-5.1, an upgraded model series that aims to be more intelligent and conversational. They have also released GPT-5.1 Instant, which is designed to be warmer and better at following instructions, and GPT-5.1 Thinking, an advanced reasoning model that is now faster on simple tasks. The new models, which also feature improved math and coding abilities, are rolling out to paid users first, with API access coming later in the week.

♥ 1.4k Google DeepMind has introduced SIMA 2, which is an AI agent powered by Gemini that can interact with and follow instructions in 3D virtual worlds. Unlike its predecessor, which followed simple commands, SIMA 2 can reason about high-level goals, converse with users, and describe the steps it's taking to complete a task. This new version shows improved generalization, performing well in games it has never been trained on, and features a self-improvement capability, allowing it to learn from its own experiences without additional human data.

Support My Newsletter
As I aim to keep this newsletter free forever, your support is greatly appreciated. If you enjoy reading The AI Timeline, consider sharing it with another research enthusiast. It helps us keep this up for free!
LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics
Balestriero and LeCun [Brown University, New York University, Meta-FAIR]
♥ 2k Self-Supervised Learning
Building useful representations of the world is an important goal for AI models; however, current self-supervised methods rely on a collection of tricks to function properly. This paper proves that, to minimize risk on unknown downstream tasks, embeddings should follow an isotropic Gaussian distribution.

N = 100 samples are drawn from a 1024-dimensional standard Gaussian, and the first two coordinates are altered to produce the “X” distribution.
To push the model's embeddings toward this ideal shape, the researchers introduce a new technique called SIGReg. Instead of comparing complex high-dimensional distributions directly, which is computationally expensive, SIGReg works by projecting embeddings onto many random directions.
It then checks if these simplified, one-dimensional projections match a Gaussian pattern. This elegant approach avoids collapse (a common failure where all inputs map to the same point) without needing common heuristics like stop-gradient or teacher-student networks.
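The projection-and-test idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's exact statistic: `sigreg_penalty`, the direction count, and the CDF grid are all assumptions made here for clarity. It projects embeddings onto random unit directions and penalizes the gap between each one-dimensional empirical CDF and the standard-normal CDF:

```python
import numpy as np
from math import erf

def sigreg_penalty(embeddings, num_directions=64, num_points=17, seed=0):
    """Toy sketch of the SIGReg idea: project embeddings onto random
    unit directions and score how far each 1-D projection's empirical
    CDF is from the standard-normal CDF (a Cramer-von-Mises-style gap)."""
    rng = np.random.default_rng(seed)
    n, d = embeddings.shape
    # Random unit directions in embedding space.
    dirs = rng.standard_normal((d, num_directions))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = embeddings @ dirs                            # (n, num_directions)
    # Compare empirical and standard-normal CDFs on a fixed grid.
    grid = np.linspace(-3.0, 3.0, num_points)
    emp_cdf = (proj[:, :, None] <= grid).mean(axis=0)   # (num_directions, num_points)
    norm_cdf = np.array([0.5 * (1.0 + erf(g / np.sqrt(2.0))) for g in grid])
    return float(((emp_cdf - norm_cdf) ** 2).mean())

rng = np.random.default_rng(1)
gauss = rng.standard_normal((512, 32))                  # isotropic Gaussian embeddings
collapsed = np.ones((512, 32)) + 0.01 * rng.standard_normal((512, 32))
# Collapsed embeddings (everything near one point) score much worse.
assert sigreg_penalty(gauss) < sigreg_penalty(collapsed)
```

Because each check is one-dimensional, the cost grows only linearly with the number of directions, which is what makes the approach cheap compared to testing the full high-dimensional distribution.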

In extensive tests across over 60 architectures and 10 datasets, their resulting framework, LeJEPA, demonstrated strong and stable performance. For example, a Vision Transformer trained with LeJEPA achieved 79% accuracy on ImageNet using a linear probe, which is competitive with more complex methods.

LeJEPA learns rich semantic representations through self-supervised learning.
Depth Anything 3: Recovering the Visual Space from Any Views
Lin et al. [ByteDance Seed]
♥ 1.9k Image bycloud’s pick
Depth Anything 3 simplifies how AI models understand spatial geometry from multiple images. Traditionally, tasks like depth estimation and camera pose prediction required separate, complex models for each scenario. DA3 addresses this with a minimal design, utilizing a single plain transformer to handle multiple views and predict consistent depth and ray maps without requiring specialized architectures.

Pipeline of Depth Anything 3
The model works by building on a standard pretrained vision transformer, which processes visual inputs efficiently. It introduces an input-adaptive cross-view self-attention mechanism that dynamically shares information across all images, enabling the model to produce aligned depth and ray predictions for each view. Training relies on a teacher-student approach, where synthetic data generates high-quality pseudo-labels to refine real-world depth maps, ensuring detailed and accurate geometry without complex multi-task setups.
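A minimal way to picture the cross-view attention trick: per-view layers treat each view as its own sequence, while cross-view layers flatten all views into one long sequence so every token can attend to every other view. The sketch below uses single-head attention with identity projections; the function names and shapes are illustrative assumptions, not DA3's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head attention with identity Q/K/V projections, batch-first."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def cross_view_attention(tokens, cross_view=True):
    """tokens: (views, tokens_per_view, dim). Cross-view layers flatten
    the view axis into the sequence axis so information flows across all
    views; per-view layers keep each view as a separate sequence."""
    v, t, d = tokens.shape
    if cross_view:
        out = self_attention(tokens.reshape(1, v * t, d))
        return out.reshape(v, t, d)
    return self_attention(tokens)

rng = np.random.default_rng(0)
toks = rng.standard_normal((3, 8, 16))  # 3 views, 8 tokens each
out = cross_view_attention(toks)
assert out.shape == (3, 8, 16)
```

Because the same transformer weights serve both modes, the model can adapt to a single image or many views at inference time without architectural changes.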

Comparisons with SOTA methods on pose accuracy.
DA3 achieved state-of-the-art results on a new visual geometry benchmark, improving camera pose accuracy by 35.7% and geometric accuracy by 23.6% over prior methods, while also outperforming Depth Anything 2 in monocular depth tasks. This unified approach could make 3D perception more accessible for robotics and mixed reality, though its reliance on public datasets may limit some applications.
Why Less is More (Sometimes): A Theory of Data Curation
Dohmatob et al. [Concordia University, FAIR at Meta, Mila–Quebec AI Institute]
♥ 680 LLM Training Data
Conventional wisdom says that bigger datasets make smarter models; however, this paper argues that sometimes less data really can lead to better performance. Recent methods like LIMO and s1 have shown that aggressively curating small, high-quality datasets can outperform training on massive collections. The key question is: when does this strategy work, and when is more data still better?

Theory Prediction across four key regimes.
This paper models curation as an imperfect oracle that selects examples based on their difficulty and correctness. In label-agnostic curation, examples are kept or discarded based solely on their features, such as retaining either the hardest or the easiest problems. The framework shows that when the generator is strong, meaning the training labels are already highly accurate, focusing on hard examples effectively refines the model. However, if the generator is weak, keeping simpler examples helps the model grasp the basics, in line with traditional scaling laws.

Strategic pruning prevents model collapse.
When labels are also considered, as in label-aware curation, the oracle filters for both difficulty and correctness; this mirrors real-world methods where only valid and challenging examples are kept. The theory adapts to this setting, showing how the fraction of data retained and the alignment between the oracle, generator, and true labels shape the final performance. In both cases, the framework identifies specific conditions where curated, small sets outperform the full dataset.
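The two curation regimes can be sketched with a toy oracle. Everything here is illustrative: `curate`, the difficulty scores, and the 90%-accurate oracle are assumptions chosen to mirror the paper's setup, not its actual procedure. Label-agnostic curation keeps the hardest fraction of examples; label-aware curation additionally drops examples whose label disagrees with the oracle:

```python
import numpy as np

def curate(difficulty, labels, oracle_labels, keep_fraction=0.3, label_aware=True):
    """Toy oracle-based curation: keep the hardest `keep_fraction` of
    examples (label-agnostic), and optionally also require the example's
    label to agree with an imperfect oracle (label-aware)."""
    order = np.argsort(-difficulty)                    # hardest first
    keep = order[: int(len(difficulty) * keep_fraction)]
    if label_aware:
        keep = keep[labels[keep] == oracle_labels[keep]]
    return keep

rng = np.random.default_rng(0)
n = 1000
difficulty = rng.random(n)
labels = rng.integers(0, 2, n)
oracle = labels.copy()
flip = rng.random(n) < 0.1                             # oracle is 90% accurate
oracle[flip] = 1 - oracle[flip]
kept = curate(difficulty, labels, oracle)
# Label-aware curation keeps at most the hardest 30%, minus disagreements.
assert len(kept) <= 300
```

In the paper's framing, whether this filtered subset beats the full dataset depends on the generator's strength and how well the oracle aligns with the true labels.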
The Path Not Taken: RLVR Provably Learns Off the Principals
Zhu et al. [Meta AI, The University of Texas at Austin]
♥ 487 RLVR
It turns out that reinforcement learning can significantly enhance reasoning skills in large language models while only slightly adjusting a small fraction of their parameters. Researchers wanted to understand why such sparse updates lead to such strong improvements. They found that what appears to be sparsity is actually a sign of a deeper, model-guided optimization pattern: for a given pretrained model, updates consistently land in the same parameter regions, regardless of the dataset or training method used.

Update sparsity in SFT vs. RLVR.
This behavior is explained by the Three-Gate Theory.
First, a KL constraint ensures that each update step remains small, preventing the model from straying too far from its original behavior.
Second, the model’s own geometry steers these small updates toward directions that don’t disrupt its core structure, preserving important weight patterns.
Third, the limited numerical precision in training makes many of these tiny adjustments invisible, causing the overall update pattern to appear sparse, even though learning is occurring across many parameters.

List of analyzed model checkpoints for agentic tasks and RLHF algorithms.
Experiments confirmed that RL updates avoid changing the model’s most influential “principal” weights, which are often targeted by supervised fine-tuning. As a result, RL preserves the model’s original spectral structure and shows minimal rotation in its main learning directions.
On the other hand, supervised fine-tuning tends to alter principal weights more significantly, which results in greater changes in model behavior. These findings suggest that RL operates in a fundamentally different optimization regime, meaning that methods designed for supervised tuning may be poorly suited for reinforcement learning.
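The "minimal rotation" claim can be made concrete with a small subspace-overlap measurement. This is a hypothetical diagnostic, not the paper's exact metric: it compares the top-k left singular (principal) directions of a weight matrix before and after an update, where 1.0 means the principal subspace is perfectly preserved:

```python
import numpy as np

def principal_rotation(w_base, w_tuned, k=5):
    """Mean squared overlap between the top-k principal (left singular)
    subspaces of two weight matrices: 1.0 = preserved, 0.0 = rotated away."""
    u0, _, _ = np.linalg.svd(w_base)
    u1, _, _ = np.linalg.svd(w_tuned)
    overlap = u0[:, :k].T @ u1[:, :k]
    return float(np.sum(overlap ** 2) / k)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
# An RL-like update: small and off the principal directions.
rl_like = w + 1e-3 * rng.standard_normal((64, 64))
# An SFT-like update: directly rescale the top principal component.
u, s, vt = np.linalg.svd(w)
s_mod = s.copy()
s_mod[0] *= 0.2
sft_like = u @ np.diag(s_mod) @ vt
# The RL-like update preserves the principal subspace far better.
assert principal_rotation(w, rl_like) > principal_rotation(w, sft_like)
```

Under this toy measure, the small off-principal update leaves the top subspace nearly intact, while directly editing a principal component reshuffles it, which matches the qualitative contrast the authors draw between RL and supervised fine-tuning.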