DeepSeek-V3.2 Technical Report Is Pure Gold
FreeFlow, DeepSeekMath-V2, Soft Adaptive Policy Optimization, and more
Nov 25th ~ Dec 2nd
#84 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 1.6k Mistral AI introduces Mistral 3, an open-weight model family that includes three SoTA small dense models (14B, 8B, and 3B) and Mistral Large 3, a sparse MoE with 41B active and 675B total parameters.

♥ 332 Arcee AI introduces Trinity, an open-weight MoE family. The series includes Trinity Nano Preview, a 6B-parameter MoE, and Trinity Mini, a 26B-parameter MoE (3B active) that is fully post-trained for reasoning.

♥ 3.9k Runway ML introduces Gen-4.5, a new SoTA video generation model.

Support My Newsletter
As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!
DeepSeek-V3.2 Tech Report
DeepSeek Team
♥ 13k LLM

Benchmark of DeepSeek-V3.2
DeepSeek-V3.2 just achieved a huge milestone for open-source LLMs, effectively closing the performance gap with proprietary frontier models like Gemini-3.0-Pro and even outperforming GPT-5-high. The model’s key architectural breakthrough is DeepSeek Sparse Attention (DSA), which uses a lightweight "Lightning Indexer" to dynamically select the top-k most relevant tokens for each query. This reduces the computational complexity of the core attention mechanism to a near-linear O(Lk), allowing for efficient processing of 128K context windows without performance degradation.
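The top-k selection idea can be sketched in a few lines of NumPy. This is purely illustrative: the indexer scores below are random stand-ins for the Lightning Indexer's output, and the real DSA operates on batched, multi-head tensors.

```python
import numpy as np

def sparse_attention(q, K, V, index_scores, k=4):
    """Top-k sparse attention for ONE query token (illustrative sketch).

    Instead of attending to all L keys (O(L) work per query), an indexer
    scores each key and only the top-k survive, making the core attention
    O(k) per query, i.e. O(L*k) over the whole sequence.
    """
    top = np.argsort(index_scores)[-k:]              # indexer picks k key positions
    scores = q @ K[top].T / np.sqrt(q.shape[-1])     # attend only to selected keys
    w = np.exp(scores - scores.max())                # numerically stable softmax
    w /= w.sum()
    return w @ V[top]

rng = np.random.default_rng(0)
L, d = 16, 8
q = rng.normal(size=d)
K, V = rng.normal(size=(L, d)), rng.normal(size=(L, d))
idx_scores = rng.normal(size=L)                      # stand-in for the Lightning Indexer
out = sparse_attention(q, K, V, idx_scores, k=4)
print(out.shape)  # (8,)
```

With k fixed (or growing slowly with L), the attention cost no longer scales quadratically with context length, which is the source of the long-context savings shown in the report's price-per-token plots.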

The price of tokens vs token position
Beyond the architecture, the paper highlights a massive scaling of Reinforcement Learning (RL), allocating over 10% of the pre-training compute budget to post-training. This is supported by a novel Large-Scale Agentic Task Synthesis pipeline, which generates thousands of synthetic environments (search, coding, data analysis) to boost the model's tool-use and agentic capabilities.

Attention architecture of DeepSeek-V3.2
The release includes the standard, more balanced V3.2 and DeepSeek-V3.2-Speciale, a high-compute, extended-thinking variant trained with relaxed length constraints during RL. The Speciale model achieves reasoning parity with Gemini-3.0-Pro, earning Gold Medal performance at both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI), proving that open models can now rival the industry's best in complex reasoning tasks.
I highly recommend reading this research paper. Every part of it is well worth your time.
Characterizing control between interacting subsystems with deep Jacobian estimation
Eisen et al. [MIT, IBM Research]
♥ 436 Understanding LLMs
It is very difficult to understand how different parts of a complex system, like brain regions, influence each other. This research introduces a new, data-driven method called JacobianODE that directly estimates the mathematical relationships governing these interactions from observed data alone.

Schematic overview of control-theoretic framework applied to neural interactions.
The core idea is to learn a system's Jacobian, which captures how a small change in one part affects the whole. JacobianODE trains a neural network to predict this Jacobian by ensuring its estimates are physically consistent. It uses path integration to predict future states and adds a clever self-supervised "loop closure" loss. This loss ensures that if you theoretically perturb the system in a loop, the total effect sums to zero, which forces the model to learn an accurate representation of how perturbations work in all directions, not just along observed paths.
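The loop-closure constraint can be seen in miniature with a toy vector field: if J(x) is the true Jacobian of f, then path-integrating J along any path recovers f(end) − f(start), so around a closed loop the increments must sum to zero. The sketch below uses an analytic Jacobian in place of JacobianODE's learned network, just to show the label-free signal the loss exploits.

```python
import numpy as np

def f(x):
    # Toy nonlinear vector field.
    return np.array([np.sin(x[0]) + x[1] ** 2, x[0] * x[1]])

def J(x):
    # Its analytic Jacobian (the quantity JacobianODE would learn to estimate).
    return np.array([[np.cos(x[0]), 2 * x[1]],
                     [x[1],         x[0]]])

# A closed loop in state space: a unit circle, traversed once.
ts = np.linspace(0, 2 * np.pi, 2000)
loop = np.stack([np.cos(ts), np.sin(ts)], axis=1)

# Midpoint-rule path integral of J(x) dx along the loop.
total = np.zeros(2)
for a, b in zip(loop[:-1], loop[1:]):
    total += J((a + b) / 2) @ (b - a)

print(np.linalg.norm(total))  # ≈ 0 up to quadrature error
```

An inaccurate Jacobian estimate generally fails this test, so penalizing the residual around synthetic loops pushes the network toward Jacobians that are consistent in all perturbation directions, not just along observed trajectories.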

Jacobian estimation with JacobianODE models.
Initial tests on chaotic systems like Lorenz models show JacobianODE estimates Jacobians much more accurately than other methods, even with noisy data. The researchers then applied it to a multi-area neural network trained on a memory task. They found the model could reveal how control between a "sensory" area and a "cognitive" area changed during learning, with the sensory area gaining more influence.

Mean Frobenius norm error on Jacobian estimation for each system and noise level.
FreeFlow: Flow Map Distillation Without Data
Tong et al. [New York University, MIT]
♥ 549 Image Generation bycloud’s pick
Current flow models create great images but are slow because they require many iterative steps. A common trick to speed them up is "flow map distillation," where a fast student model learns to mimic a powerful, pre-trained teacher. But there's a catch: this process traditionally relies on an external dataset to generate examples for the student to learn from, which can lead to a "Teacher-Data Mismatch." If the static dataset doesn't fully represent what the teacher model can actually generate, the student learns from a flawed guide, limiting its potential.

Teacher-Data Mismatch and the data-free alternative.
To solve this, the researchers developed a data-free method called FreeFlow. The core idea is elegantly simple: instead of using an external dataset, the student learns by starting only from random noise, known as the prior distribution. Since the teacher model is guaranteed to also start from this same noise when generating an image, it completely avoids the mismatch problem. The student learns by trying to predict the teacher's entire creative journey in one big jump.
It samples a noise vector and a random timestep, then tries to predict where the teacher's process would end up. The training forces the student's own internal "generating velocity" (the speed and direction it thinks the image should evolve) to match the teacher's true velocity at that point.

Impact of Teacher-Data Mismatch
It also aligns the student's "noising velocity," which is essentially the reverse process of turning a generated image back into noise. By ensuring this backward flow also matches the teacher's dynamics, the model can actively correct its mistakes and stay on the true path.
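The prior-only training signal for the generating velocity can be sketched with a toy linear "student" and a hand-written stand-in for the teacher's velocity field. This is our illustration of the idea, not FreeFlow's actual objective or parameterization: no dataset appears anywhere, only samples from the noise prior.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2

def teacher_velocity(x, t):
    # Stand-in for a pre-trained teacher flow model's velocity field.
    return -x / (1.0 + t)

# Linear one-jump "student" G(z) = W z; a real student is a neural network.
W = np.eye(d)
lr = 0.05

for step in range(200):
    z = rng.normal(size=d)            # sample ONLY from the prior
    t = rng.uniform()                 # random timestep in [0, 1)
    x_t = (1 - t) * z + t * (W @ z)   # point on the student's straight path
    v_student = W @ z - z             # the student's (constant) generating velocity
    v_teacher = teacher_velocity(x_t, t)
    # SGD on 0.5 * ||v_student - v_teacher||^2 with respect to W.
    grad = np.outer(v_student - v_teacher, z)
    W -= lr * grad

print(W)
```

Because both student and teacher start from the same prior, every training point lies on a trajectory the teacher can actually produce, which is exactly how the mismatch with a static external dataset is avoided.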

When distilling from a top-tier teacher model, FreeFlow set a new state-of-the-art, achieving an impressive FID score of 1.45 on ImageNet at 256x256 resolution using just a single sampling step, significantly outperforming all data-dependent methods. This proves that an external dataset is not necessary for successful distillation.
How to Correctly Report LLM-as-a-Judge Evaluations
Lee et al. [Yonsei University, University of Wisconsin–Madison, KRAFTON]
♥ 718 LLM Training
We often evaluate things like factual accuracy or code quality using LLMs as judges, but these AI judges aren't perfect and can make mistakes. This means the raw score they give us can be misleading, sometimes overestimating or underestimating true performance. This paper introduces a straightforward method to correct for this bias and provide a reliable confidence interval for the true accuracy.

There are two ways an LLM judge can be wrong: it might incorrectly label a correct answer as wrong, or it might mistakenly label a wrong answer as correct. The paper shows that if you know the LLM’s specific error rates, you can mathematically adjust the raw score to get a better estimate of the true accuracy. These error rates can be learned from a small calibration dataset where you know the true human-provided labels.
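Concretely, if the judge passes truly correct answers at rate TPR and falsely passes wrong ones at rate FPR, the raw pass rate q relates to the true accuracy p by q = p·TPR + (1 − p)·FPR, which can be inverted. The numbers below are made up for illustration, and the paper's full estimator also accounts for the uncertainty in TPR and FPR:

```python
def corrected_accuracy(raw_pass_rate, tpr, fpr):
    """Invert q = p*TPR + (1 - p)*FPR to recover the true accuracy p."""
    return (raw_pass_rate - fpr) / (tpr - fpr)

# Example: the judge passes 70% of answers, but a small calibration set
# (with human-provided labels) shows TPR = 0.9 and FPR = 0.2.
p = corrected_accuracy(0.70, tpr=0.9, fpr=0.2)
print(round(p, 4))  # 0.7143 — slightly higher than the raw 70%
```

Note the correction can move the estimate in either direction: a judge with a high FPR inflates the raw score, while a low TPR deflates it.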

After making the correction, we can also measure our uncertainty. The new method constructs a confidence interval that accounts for randomness from both the main test dataset and the separate calibration dataset, giving a much more complete picture of how reliable the final accuracy estimate is. To make this process efficient, the paper also provides an adaptive algorithm that smartly decides how many calibration samples to collect for each type of answer, which helps minimize uncertainty for a given budget.
DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
Shao et al. [DeepSeek-AI]
♥ 3.2k LLM Reasoning
Large language models have become incredibly skilled at solving math problems that have a final numeric answer, often performing at competition levels. However, getting the right answer doesn't always mean the logic used to get there was correct. This is a major issue for more advanced tasks like theorem proving, where the step-by-step reasoning itself is the real goal. DeepSeekMath-V2 tackles this by learning not just to generate proofs, but by checking and improving its own work.

Average proof scores on CNML-level problems by category and model, as evaluated by our verifier.
Researchers first trained a separate "verifier" model to critique proofs. They gave it clear rules to follow, asking it to list any flaws it found and then assign a score. To make sure this verifier was trustworthy and wouldn't invent problems, they added a second layer called "meta-verification". This step reviews the verifier's own critiques to confirm they are accurate and justified, which significantly improved the quality of its feedback.
Next, this reliable verifier was used to train a proof generator. The generator's goal is to write proofs that earn a high score from the verifier. Crucially, the generator was also trained to perform self-verification. It learns to write a proof and then immediately analyze it, just as the verifier would.
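The resulting generate → verify → refine loop can be sketched as follows. The functions here are toy placeholders for model calls (not DeepSeek's API); the stub "quality" score simply stands in for the verifier's judgment improving across revisions.

```python
def generate_proof(problem):
    # Placeholder for the proof generator.
    return {"text": f"proof of {problem}", "quality": 0.5}

def verify(proof):
    # Placeholder verifier: lists flaws and assigns a score.
    flaws = [] if proof["quality"] >= 0.9 else ["gap in step 2"]
    return proof["quality"], flaws

def refine(proof, flaws):
    # Placeholder refinement: each revision addresses the listed flaws.
    return {"text": proof["text"] + " (revised)",
            "quality": min(1.0, proof["quality"] + 0.25)}

def solve(problem, max_rounds=4, accept_score=0.9):
    proof = generate_proof(problem)
    for _ in range(max_rounds):
        score, flaws = verify(proof)
        if score >= accept_score and not flaws:
            break
        proof = refine(proof, flaws)
    return proof

best = solve("sample problem")
print(best["quality"])  # 1.0 after two refinement rounds in this toy setup
```

The key property the paper aims for is that the same model can play both roles, so the generator's self-verification internalizes the verifier's standards rather than merely gaming its score.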

On elite high-school and undergraduate competitions like the IMO and the Putnam exam, DeepSeekMath-V2 achieved gold-medal level performance, scoring 118 out of 120 on the Putnam. When allowed to iteratively refine its proofs based on its own verification, its success rates improved significantly.
Soft Adaptive Policy Optimization
Gao et al. [Qwen Team]
♥ 256 LLM Training
Training advanced AI models is tricky because fine-tuning them with reinforcement learning can often lead to unstable updates. This issue is caused by high variance in the importance ratios assigned to individual tokens, especially in complex models.
To solve this, researchers developed Soft Adaptive Policy Optimization (SAPO). Instead of a hard cutoff, SAPO uses a smooth, temperature-controlled gating function. For each token, this function smoothly reduces the influence of updates when the token's importance ratio is far from normal, rather than abruptly silencing it. This creates a continuous trust region. The system also uses two different temperature settings: one for positive updates and a larger one for negative updates. This makes the model more careful when reducing the probability of tokens, which is a more unstable operation, while still being receptive to positive feedback.
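One plausible form of such a gate is sketched below (the exact functional form is our illustration, not necessarily SAPO's). The token's update weight decays smoothly as its importance ratio r drifts from 1 instead of being cut off at a hard clipping boundary, and the larger temperature on the negative side damps probability-decreasing updates more aggressively.

```python
import math

def soft_gate(r, advantage, tau_pos=1.0, tau_neg=2.0):
    """Smooth, temperature-controlled weight for a token update.

    r: importance ratio pi_new(token) / pi_old(token), r > 0.
    Returns 1.0 at r = 1 and decays smoothly toward 0 as r drifts away;
    the larger negative-side temperature makes the gate stricter when
    the update would reduce a token's probability.
    """
    tau = tau_pos if advantage > 0 else tau_neg
    return math.exp(-((tau * math.log(r)) ** 2))

for r in (1.0, 1.2, 2.0):
    print(r, soft_gate(r, advantage=+1), soft_gate(r, advantage=-1))
```

Unlike hard clipping, the gradient of the gated update never jumps discontinuously at a ratio threshold, which is what creates the "continuous trust region" described above.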

Empirical validation of assumptions (A1)–(A2) on the MoE model
This soft approach is both sequence-coherent and token-adaptive. When most tokens in a sequence are well-behaved, SAPO acts like a sequence-level method, keeping the learning aligned with the overall goal. However, if a few troublesome tokens appear, it doesn't discard the entire sequence. It selectively dampens the noisy tokens while preserving the learning signal from the good ones, which makes training more sample-efficient and stable than methods that apply all-or-nothing clipping.
In tests on mathematical reasoning, SAPO showed better training stability and higher accuracy compared to GRPO and GSPO under the same compute budget. It also consistently improved performance when used to train the large Qwen3-VL model family across various tasks.