The AI Timeline
Video models are zero-shot learners and reasoners
Plus more about Thinking Augmented Pre-training and Reinforcement Learning on Pre-Training Data
Sep 22nd ~ Sep 29th
#75 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 6.3k
DeepSeek-V3.2-Exp
is now available on App, Web, and API platforms with DeepSeek Sparse Attention (DSA) technology that enables faster training and inference on long-context tasks. The model is fully open source with technical documentation and GPU kernels available on Hugging Face. You can test out DeepSeek-V3.2-Exp through the DeepSeek API right now (API pricing has been reduced by over 50%).

♥ 19k Anthropic has announced significant upgrades to Claude, starting with Claude Sonnet 4.5, which brings enhanced capabilities including code execution for data analysis and visualization, a redesigned Claude Code terminal interface, and a checkpoints feature that enables users to save progress and roll back when needed during large tasks.
Additionally, the Claude for Chrome extension is now available to all waitlist members. Max users can access a five-day research preview of "Imagine with Claude", which is an experimental feature that generates software dynamically without predetermined functionality or prewritten code.
♥ 1.8k inclusionAI has released Ring-1T-preview, the first open-source thinking model with 1 trillion parameters. The model solved IMO25 Question 3 in a single attempt and produced partial solutions for Questions 1, 2, 4, and 5. You can visit Hugging Face to test Ring-1T-preview yourself and experience trillion-parameter reasoning in action.
Not actively job hunting? Great, most people on Dex aren’t.
Dex is a conversational AI and career matchmaker that works on behalf of each person. You spend 15-20 minutes on the phone with him, talking about your experience, your ambitions and your non-negotiables.
Dex then scans thousands of roles and companies to identify the most interesting and compatible opportunities.
Once a match is found, Dex connects you to hiring managers and even helps you prep for interviews.
Thousands of exceptional engineers have already signed up, and we’re partnered with many of the UK’s leading start-ups, scale-ups, hedge funds and tech companies.
Don’t waste another day at a job you hate. Speak with Dex today.
Video models are zero-shot learners and reasoners
Wiedemer et al. [Google DeepMind]
♥ 485 LLM Reasoning
Introduction to Video Models as Zero-Shot Learners
For years, computer vision has depended on specialized tools (one model for segmentation, another for object detection). This made the field fragmented and less adaptable.
The Veo 3 research paper shows that when video models are trained on vast amounts of video data with simple generative objectives, they can become general-purpose foundation models for vision, much like LLMs did for language.
Inner Working of Veo 3's Zero-Shot Mechanism
The approach behind Veo 3 is very straightforward: users provide an initial image and a text instruction, and the model generates a short video in response. This method mirrors the prompting strategy that made LLMs so versatile, avoiding the need for fine-tuning or custom architectures.
Veo 3 processes both spatial and temporal information, which allows it to animate scenes frame by frame based on the prompt. This frame-by-frame generation acts like a "chain-of-frames," where each step in the video can represent a logical progression, similar to how chain-of-thought reasoning works in language models.
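Conceptually, this prompting loop can be sketched in a few lines. Note that `VideoModel` and its `generate` method below are hypothetical stand-ins for illustration, not the real Veo 3 API:

```python
# Conceptual sketch of "chain-of-frames" prompting. VideoModel is a
# hypothetical interface, NOT the actual Veo 3 API.

class VideoModel:
    def generate(self, image, prompt, num_frames=16):
        """Animate `image` according to `prompt`; return a list of frames."""
        raise NotImplementedError

def solve_visually(model, puzzle_image, instruction):
    # Each generated frame acts as one step of visual reasoning
    # (analogous to a chain-of-thought step); the final frame is
    # read off as the answer, e.g. a traced path through a maze.
    frames = model.generate(puzzle_image, instruction)
    return frames[-1]
```

The key point is that no task-specific head or fine-tuning is involved: the image-plus-text prompt alone selects the task, and the generated frames carry the intermediate computation.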

Veo 3 zero-shot learning and reasoning examples.
This capability enables Veo 3 to handle a hierarchy of visual tasks, starting with basic perception, like identifying edges or segmenting objects, and then moving to modeling physical properties, such as buoyancy or material interactions.
From there, it progresses to manipulation tasks, such as editing images by changing colors or removing backgrounds, and finally to visual reasoning, where it solves puzzles or navigates mazes over multiple frames. The model's training on diverse video data gives it a broad understanding of visual concepts, which it applies dynamically through this structured generation process.
Evaluation and Benchmark Performance of Veo 3
Veo 3 shows impressive zero-shot performance in tests across various tasks, and sometimes rivals specialized models like Nano Banana. For instance, in edge detection, Veo 3 achieved a pass@10 rate of 0.77, and in segmentation, it reached a mean Intersection over Union of 0.74, comparable to dedicated image editing models.
It excelled in object extraction, correctly identifying and lining up animals in 92% of cases with multiple attempts, and demonstrated strong abilities in image editing, though it sometimes introduced unintended animations.
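For reference, pass@k figures like the ones above are typically computed with the standard unbiased estimator, and segmentation quality with mean Intersection over Union. The sketch below shows both; the paper's exact evaluation harness may differ:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: probability that at least one
    of k samples drawn from n attempts (c of them correct) succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_iou(pred_masks, gt_masks) -> float:
    """Mean Intersection over Union across paired binary masks,
    each mask represented as a set of pixel coordinates."""
    scores = []
    for pred, gt in zip(pred_masks, gt_masks):
        union = pred | gt
        scores.append(len(pred & gt) / len(union) if union else 1.0)
    return sum(scores) / len(scores)
```

With n = 10 attempts per example, a pass@10 of 0.77 simply means that for 77% of examples at least one of the ten generated videos solved the task.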

Testing visual symmetry
On reasoning tasks, Veo 3 solved mazes with up to 78% accuracy on 5x5 grids and handled visual symmetry problems with high success rates, outperforming Veo 2 by wide margins. However, it still makes occasional errors on complex analogies, and better control over unintended scene changes is needed.
Thinking Augmented Pre-training
Wang et al. [Microsoft Research]
♥ 22k Pre-training
Introduction to Thinking Augmented Pre-Training
As large language models grow, the demand for high-quality training data is quickly outpacing the supply of human-written text on the web. In LLMs, some valuable tokens are inherently difficult for a model to learn directly because they summarize a long chain of reasoning in just one step.
Thinking Augmented Pre-Training, or TPT, tackles this by enriching existing text data with automatically generated thinking trajectories. These trajectories act like a step-by-step reasoning guide, and break down complex ideas into simpler parts that are easier for models to digest. This training method boosts data efficiency without requiring more raw documents.

The average few-shot accuracy scores on the GSM8k and MATH datasets with respect to total training tokens.
Inner Working of Thinking Augmented Pre-Training
TPT augments each document in the training set with a thinking trajectory generated by an existing language model. For a given text, such as a math problem or an explanatory passage, the system prompts an off-the-shelf model to simulate an expert’s thought process as they analyze the content.
This thinking text is then appended to the original document, which forms a single, extended training example. The model is then trained on these augmented samples using the standard next-token prediction objective. It means that the model is learning not only from the original content but also from the detailed reasoning that accompanies it.
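The data pipeline can be sketched in a few lines. The prompt wording and the `generate_thinking` helper below are illustrative stand-ins for the paper's actual template and off-the-shelf LLM call:

```python
# Sketch of Thinking Augmented Pre-Training data construction.
# generate_thinking is a hypothetical placeholder for a real LLM call,
# and THINKING_PROMPT is NOT the paper's exact template.

THINKING_PROMPT = (
    "Analyze the following text step by step, as an expert would, "
    "breaking down the key reasoning:\n\n{document}"
)

def generate_thinking(document: str) -> str:
    # Placeholder: in practice, query an off-the-shelf LLM with
    # THINKING_PROMPT.format(document=document).
    return "<think> ... step-by-step analysis of the document ... </think>"

def augment(document: str) -> str:
    """Append a generated thinking trajectory to the original document,
    forming one extended training example."""
    return document + "\n\n" + generate_thinking(document)

# The augmented samples then go through the unchanged training loop:
# loss = cross_entropy(model(tokens[:-1]), tokens[1:])
```

Because only the data changes, TPT plugs into any standard pre-training setup without modifying the model architecture or objective.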

This approach naturally directs more training attention toward high-value or difficult concepts. For example, in domains like mathematics and physics, the generated thinking trajectories tend to be longer, meaning the model spends more time processing and learning from these reasoning-intensive sections.
Evaluation and Benchmark Performance of Thinking Augmented Pre-Training
Experiments with TPT show substantial improvements in both data efficiency and final model performance. When pre-training an 8-billion-parameter model from scratch on 100 billion tokens, the TPT-enhanced version reached performance comparable to LLaMA-3.1-8B, which was trained on 15 trillion tokens (3x improvement in data efficiency).
On reasoning-heavy benchmarks like GSM8k and MATH, TPT models more than doubled the scores of vanilla pre-training, and achieved 50.1% and 21.8% respectively, compared to 19.2% and 9.1% for the baseline.

Ablation studies confirmed that even when using smaller models to generate the thinking trajectories, the performance remained strong. The approach consistently improved results as training data increased, with no signs of plateauing even at 100 billion tokens.
One limitation is that thinking trajectories for expert-level texts were sometimes shorter. This is possibly because such content assumes prior knowledge and requires fewer explanatory steps.
Reinforcement Learning on Pre-Training Data
Li et al. [Tencent, HunYuan Infra Team, The Chinese University of Hong Kong]
♥ 424 LLM Training bycloud’s pick
Introduction to Reinforcement Learning on Pre-Training Data (RLPT)
As large language models grow, simply adding more parameters or training tokens no longer guarantees major gains. This challenge has sparked interest in new ways to use existing data more effectively.
One promising direction comes from a method called Reinforcement Learning on Pre-Training data, or RLPT. Instead of relying only on supervised learning, RLPT applies reinforcement learning directly to the raw text data that models are already pre-trained on.

Overview of RLPT. Raw data from the internet corpora is processed into training samples.
How Does Reinforcement Learning on Pre-Training Data Work?
RLPT works by having the language model predict the next segment of text (like a full sentence or a reasoning step) based on the context that comes before it. This is different from standard next-token prediction, which only looks one token ahead.
By predicting larger chunks of text, the model is encouraged to build more coherent and meaningful thought processes. The training uses two types of tasks: one where the model predicts the next sentence given only the prior context, and another where it fills in a missing middle segment using both preceding and following text. These are called Autoregressive Segment Reasoning and Middle Segment Reasoning, respectively.
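Constructing these two task types from a raw document can be sketched as follows; the sentence-level splitting and field names are illustrative choices, not the paper's exact data format:

```python
import re

def make_rlpt_samples(document: str):
    """Split a document into sentence-level segments and emit the two
    RLPT task types:
      - autoregressive segment reasoning: predict the next sentence
        from the preceding context
      - middle segment reasoning: predict a missing middle sentence
        from both the preceding and following text
    Field names here are illustrative, not the paper's exact schema."""
    sents = re.split(r'(?<=[.!?])\s+', document.strip())
    samples = []
    for i in range(1, len(sents)):
        samples.append({
            "task": "autoregressive_segment",
            "context": " ".join(sents[:i]),
            "target": sents[i],
        })
    for i in range(1, len(sents) - 1):
        samples.append({
            "task": "middle_segment",
            "prefix": " ".join(sents[:i]),
            "suffix": " ".join(sents[i + 1:]),
            "target": sents[i],
        })
    return samples
```

A document of three sentences thus yields two autoregressive samples and one middle-segment sample, all drawn from ordinary pre-training text with no human labels.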

To guide learning, a generative reward model checks whether the predicted segment matches the meaning of the actual text that follows, even if the wording isn’t identical. This reward signal helps the model explore different reasoning paths while staying semantically accurate.
By alternating between the two task types during training, RLPT balances the model’s ability to generate text step-by-step and to understand broader contextual relationships, which improves generalization.
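The paper's reward comes from a generative LLM judge; as a toy stand-in, a bag-of-words F1 score captures the same idea that reward should depend on meaning overlap rather than exact wording:

```python
# Toy stand-in for RLPT's generative reward model. The paper uses an
# LLM judge for semantic equivalence; bag-of-words F1 is only a crude
# illustration of wording-tolerant scoring.

def segment_reward(predicted: str, reference: str) -> float:
    """Return a [0, 1] reward for how well the predicted segment
    matches the reference segment, ignoring word order."""
    p = set(predicted.lower().split())
    r = set(reference.lower().split())
    if not p or not r:
        return 0.0
    overlap = len(p & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)
```

In the full method this scalar feeds a standard policy-gradient update, rewarding predicted segments that are semantically faithful continuations even when phrased differently.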

Evaluation and benchmark performance of RLPT
RLPT was tested on both general knowledge benchmarks like MMLU and MMLU-Pro, and on mathematical reasoning tests such as AIME and MATH-500. Across different model sizes, including Qwen3-4B and Llama3.2-3B, RLPT produced consistent gains. For example, on Qwen3-4B, it improved MMLU accuracy by 3.0 points and MMLU-Pro by 5.1 points. In mathematical reasoning, Pass@1 scores on AIME24 and AIME25 rose by 6.6 and 5.3 points respectively.

Performance on general-domain tasks across different models, with the best results highlighted.
Additionally, when used as a foundation for reinforcement learning with verifiable rewards (RLVR), RLPT provided an extra boost, improving both exploitation and exploration in mathematical problem-solving.