How AI is Learning to Reason: RL Tricks, Policy Optimization, and the New WebWatcher Agent
In this article, we will analyze the use of Reinforcement Learning for LLM reasoning, a new policy optimization method for more concise outputs, and the groundbreaking WebWatcher vision-language research agent.
Aug 11th ~ Aug 17th
#69 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 1.2k Google has launched "Flight Deals", which is a new AI-powered search tool within Google Flights that allows users to find airfare using conversational language. The feature is designed for flexible travelers, using AI to parse natural language queries for specific criteria like budget or destination type and then searching live flight data for relevant options. You can try Flight Deals right now if you are in the US, Canada, or India.
♥ 1.5k StepFun AI has released NextStep-1, a new open-source autoregressive model for generating and editing images from text prompts. The model works by processing sequences of text and continuous image tokens together, which allows it to preserve more visual detail compared to traditional methods. You can try it today by downloading the GitHub repository, which includes pre-trained models for text-to-image generation and image editing.
♥ 424k The U.S. General Services Administration (GSA) has launched USAi, which is a new platform allowing federal agencies to test and adopt generative AI technologies at no cost. This evaluation suite provides government employees with tools for tasks like code generation and document summarization, and enables them to assess different systems before procurement.
♥ 1.3k Former Twitter CEO Parag Agrawal has launched Parallel Web Systems, and it has raised approximately $30 million in funding. The company is developing an AI system that interacts with the public web in real-time to fetch, verify, and organize information. Parallel claims its technology has beaten not only unreleased models like GPT-5 in deep web research but also specialized tools like EXA and human researchers. You can try it today in the Parallel playground.
♥ 2.4k Google has released Imagen 4, its most advanced text-to-image model. This model significantly improves image quality and is much better at rendering text within images than previous versions. All images created with Imagen 4 will include a non-visible SynthID digital watermark to make them easier to detect. Download this Jupyter notebook to start using Imagen today.
Support My Newsletter
As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!
Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
Liu et al. [Alibaba Group, Beijing Jiaotong University, Hong Kong University of Science and Technology, Nanjing University, Peking University, OpenRLHF, CleanRL]
♥ 485 LLM Reasoning
Reinforcement Learning for LLM Reasoning
Reinforcement learning has become a key tool for unlocking advanced reasoning in large language models, driving progress in areas like mathematical problem-solving and code generation. However, as research in “RL for LLM” (RL4LLM) gains popularity, practitioners face a confusing landscape.
Different papers recommend conflicting techniques, like group-level vs. batch-level reward normalization, without clear guidelines. Experimental inconsistencies, from training data to model initialization, further muddy the waters, making it hard to choose effective methods. This paper tackles the chaos head-on by reproducing and evaluating popular RL techniques in a unified framework.

A minimalist two-technique combination that enhances learning capacity in critic-free policies with vanilla PPO loss.
Inner Workings of RL Techniques for LLM Reasoning
The study tests four core techniques shaping RL4LLM. First, advantage normalization stabilizes training by adjusting rewards. Group-level normalization (averaging rewards within responses to a single prompt) proves reliable across settings, while batch-level normalization (averaging across all responses) excels with large-scale rewards.
However, when rewards cluster tightly, like on easy tasks, removing standard deviation from calculations prevents skewed updates. Combining group-level mean with batch-level standard deviation creates a robust hybrid approach.
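The hybrid scheme described above can be sketched in a few lines. This is a minimal NumPy illustration, assuming scalar rewards laid out group-by-group; the array layout and the small epsilon are illustrative choices, not the paper's code:

```python
import numpy as np

def hybrid_advantages(rewards, group_size):
    """Hybrid advantage normalization: subtract the group-level mean
    (per prompt), divide by the batch-level standard deviation.

    rewards: 1-D sequence of scalar rewards, laid out as consecutive
    groups of `group_size` responses to the same prompt.
    """
    r = np.asarray(rewards, dtype=np.float64).reshape(-1, group_size)
    group_mean = r.mean(axis=1, keepdims=True)  # per-prompt baseline
    batch_std = r.std() + 1e-8                  # one std for the whole batch
    return ((r - group_mean) / batch_std).ravel()

# Two prompts, 4 sampled responses each (hypothetical 0/1 correctness rewards)
adv = hybrid_advantages([1, 0, 0, 1, 1, 1, 1, 0], group_size=4)
```

Using one standard deviation for the whole batch avoids the degenerate case where an easy prompt's tightly clustered rewards produce a near-zero group std and blow up the per-group advantages.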
Next, the Clip-Higher approach tweaks PPO’s clipping mechanism and expands the upper bound for policy updates. This encourages exploration in aligned models (already fine-tuned for reasoning) by preserving token diversity. For smaller models, performance scales with the clipping bound; larger models peak at specific values. Token-level analysis reveals that a higher clipping bound frees logical connectors like “therefore” from suppression, enabling more innovative reasoning paths.
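An asymmetric clipping range of this kind can be sketched as follows; the bound values below are illustrative placeholders, not the paper's exact hyperparameters:

```python
import numpy as np

def clip_higher_loss(logp_new, logp_old, advantages,
                     eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with an asymmetric clipping range.
    Raising only the upper bound (eps_high > eps_low) lets low-probability
    tokens gain probability mass faster, encouraging exploration while
    still limiting how hard a token can be pushed down."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Standard pessimistic minimum over unclipped vs clipped objectives
    loss = -np.minimum(ratio * advantages, clipped * advantages)
    return loss.mean()
```

With symmetric clipping (`eps_high == eps_low`) this reduces to the vanilla PPO objective.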

Test accuracy and response length of four model variants
Loss aggregation balances how tokens influence training. Token-level aggregation (weighting each token equally) helps base models learn from lengthy reasoning chains. However, for aligned models, sequence-level aggregation (averaging per-response loss) works better, likely because these models already handle structure well.
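The difference between the two aggregation schemes is easiest to see in code. A minimal sketch with hypothetical per-token loss values (ragged response lengths):

```python
import numpy as np

def aggregate(token_losses, level="token"):
    """token_losses: list of 1-D arrays, one per response (ragged lengths).
    Token-level: every token weighs equally across the batch, so long
    reasoning chains contribute more gradient. Sequence-level: each
    response is averaged first, so all responses weigh equally."""
    if level == "token":
        return np.concatenate(token_losses).mean()
    per_seq = np.array([t.mean() for t in token_losses])
    return per_seq.mean()

# One short response, one long response (hypothetical loss values)
losses = [np.array([0.0]), np.array([3.0, 3.0, 3.0])]
tok = aggregate(losses, level="token")       # long response dominates
seq = aggregate(losses, level="sequence")    # both responses weigh equally
```

The long response pulls the token-level average toward its own loss, which is exactly why this mode helps base models learn from lengthy chains.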
Finally, overlong filtering masks the rewards of responses that exceed the generation length limit. This boosts accuracy for short-to-medium tasks by avoiding penalizing truncated reasoning, but adds little for complex, long-tail problems.
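A minimal sketch of such a mask, assuming a simple length threshold (the threshold and array contents below are hypothetical):

```python
import numpy as np

def overlong_mask(lengths, max_len):
    """Zero out the advantage (and thus the loss) of responses that hit
    the generation length limit, so truncated reasoning is neither
    rewarded nor punished."""
    return (np.asarray(lengths) < max_len).astype(np.float64)

# The second response hit the 512-token cap, so its update is masked out
adv = np.array([1.0, -0.5, 0.8])
masked = adv * overlong_mask([120, 512, 300], max_len=512)
```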
Evaluation and Results of RL Techniques for LLM Reasoning
The experiments tested models with 4B to 8B parameters and datasets of varying difficulty. A minimalist combination (Lite PPO) that uses group-mean and batch-std normalization with token-level loss consistently outperformed complex methods like GRPO and DAPO.
It improved base model accuracy by up to 12% on mathematical benchmarks while simplifying implementation. Clip-Higher lifted aligned model performance by 2–4%, and overlong filtering boosted short-task accuracy by 3–5%.

The key takeaway is that we should use group-level normalization for reliability, Clip-Higher for aligned models, and token-level loss for base models. And if you are just getting started, Lite PPO is a sensible default.
Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning
Shrivastava et al. [Microsoft Research, University of Wisconsin-Madison]
♥ 22k LLM Reasoning
Introduction to Efficient Reasoning with GFPO
LLMs trained with reinforcement learning often produce longer responses to gain accuracy, leading to “filler” text that doesn’t add value. This length inflation is inefficient, especially since longer answers aren’t always more accurate. The paper introduces Group Filtered Policy Optimization (GFPO) to solve this.

GFPO samples larger groups of responses during training and filters them based on key metrics, like response length or token efficiency (reward per token). By learning only from the best responses, GFPO teaches models to generate concise answers.

How GFPO Works
GFPO builds on GRPO (Group Relative Policy Optimization), which samples multiple responses per question and uses their average reward as a baseline. GFPO improves this by sampling a larger group (e.g., 16 or 24 responses instead of 8). It then filters these responses, retaining only the top-k based on a chosen metric, such as shortest length or highest token efficiency.
The advantages (used to update the model) are computed solely for these selected responses, while others are ignored. This filtering acts as implicit reward shaping, steering the model toward desired behaviors without complex reward engineering.
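The sample-then-filter step can be sketched roughly as below; the group values, epsilon, and metric names are illustrative choices, not the paper's implementation:

```python
import numpy as np

def gfpo_select(rewards, lengths, k, metric="length"):
    """Sample a large group, keep the top-k by the chosen metric, and
    compute advantages only over the retained responses. Discarded
    responses get zero advantage and contribute no gradient."""
    rewards = np.asarray(rewards, dtype=np.float64)
    lengths = np.asarray(lengths, dtype=np.float64)
    if metric == "length":
        keep = np.argsort(lengths)[:k]               # shortest k responses
    else:  # "token_efficiency": reward per generated token
        keep = np.argsort(-rewards / (lengths + 1e-8))[:k]
    adv = np.zeros_like(rewards)
    kept = rewards[keep]
    adv[keep] = (kept - kept.mean()) / (kept.std() + 1e-8)
    return adv

# 6 sampled responses; retain the 3 shortest for the policy update
adv = gfpo_select(rewards=[1, 1, 0, 1, 0, 1],
                  lengths=[300, 150, 800, 400, 120, 600],
                  k=3, metric="length")
```

Because the advantage baseline is computed only over the surviving responses, the filter shapes the reward signal without touching the reward function itself.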

A variant called Token Efficiency GFPO ranks responses by reward divided by length, promoting outputs that justify their length with high rewards. This cuts filler tokens more aggressively than length-based filtering alone. Another variant, Adaptive Difficulty GFPO, adjusts the retained group size dynamically based on question difficulty, keeping more responses for harder problems to preserve accuracy.
For instance, it retains eight responses for very hard questions but only four for easy ones. By sampling more during training, GFPO reduces the need for lengthy reasoning chains during actual use.
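A rough sketch of such a difficulty-aware retention rule, using the group's mean reward as a solve-rate proxy; the thresholds and the intermediate size are illustrative guesses, not the paper's values:

```python
import numpy as np

def adaptive_k(group_rewards, k_easy=4, k_hard=8):
    """Choose how many responses to retain based on estimated question
    difficulty: a low mean reward over the sampled group suggests a
    hard question, where keeping more responses protects accuracy."""
    solve_rate = np.mean(group_rewards)
    if solve_rate < 0.25:                 # very hard: filter gently
        return k_hard
    if solve_rate < 0.75:                 # medium difficulty
        return (k_easy + k_hard) // 2
    return k_easy                         # easy: filter aggressively for brevity
```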
Results and Impact of GFPO
GFPO significantly reduces response lengths across benchmarks like AIME, GPQA, and LiveCodeBench. Token Efficiency GFPO achieves the strongest cuts, 71–85% less length inflation than GRPO, while matching accuracy. Adaptive Difficulty GFPO excels on hard problems, reducing length by 60% without accuracy loss. Out-of-distribution tests (e.g., coding tasks) show GFPO not only trims excess length but sometimes improves accuracy.

Pass@1 Accuracy, Response Lengths, and Length Inflation Reduction on AIME 25, AIME 24, and GPQA.
Pareto analysis confirms GFPO often delivers shorter responses with equal or better performance than GRPO. Additionally, GFPO shifts response distributions away from verbosity, reducing ≥20k-token outputs from 32% to 22%. This efficiency demonstrates how targeted training can produce leaner, faster models without sacrificing reasoning quality.
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Geng et al. [Tongyi Lab, Alibaba Group]
♥ 424 Deep Research bycloud’s pick
Introduction to Multimodal Deep Research Agents
AI agents can produce a lot of text, but what if they could also research complex topics like a human expert, searching the web, analyzing documents, and synthesizing answers? While text-based agents excel at these tasks, they stumble when faced with real-world challenges requiring visual understanding, like interpreting scientific diagrams or navigating image-rich websites. This gap limits their usefulness for everyday multimodal problems.
To close this gap, the authors developed WebWatcher, a new multimodal agent that combines visual and textual reasoning with sophisticated tool use. Unlike existing approaches that rely on rigid templates or single-modality tools, WebWatcher integrates web search, image analysis, and code execution to handle high-difficulty tasks where perception alone fails.
Inner Workings of WebWatcher
WebWatcher uses five tools:
Web Image Search retrieves relevant visuals
Web Text Search gathers textual information
Webpage Visit navigates and summarizes sites
Code Interpreter handles calculations
Internal OCR extracts text from images
This toolkit enables multi-step reasoning, for example, identifying an obscure animal in a photo, searching Wikipedia for related details, and then cross-referencing revisions in its edit history.

Comparison of VL reasoning agents.
Training WebWatcher required high-quality multimodal data. Researchers created the BrowseComp-VL benchmark, featuring questions demanding both visual perception and deep research. They first generated complex text-based questions through web crawling, then transformed them into visual queries.
For example, a question about “a railway station in northern India” might pair with relevant images, forcing the agent to combine visual cues with external knowledge.

Domain Distribution for Level 1 and Level 2.
To teach WebWatcher effective tool use, researchers generated synthetic reasoning trajectories. Using GPT-4o, they simulated step-by-step task-solving sequences (e.g., <think>Identify the bird</think><tool_call>Image Search</tool_call>). These trajectories were filtered for correctness, logical consistency, and multi-step depth. The agent then underwent supervised fine-tuning to predict tool actions, followed by reinforcement learning (GRPO algorithm) to refine decision-making. This two-stage training optimized both tool selection and answer accuracy.
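The trajectory format quoted above lends itself to simple parsing. A minimal sketch that splits a hypothetical trajectory string into (thought, action) steps; only the tag names come from the text, the regex and example string are assumptions:

```python
import re

# Non-greedy match of each <think>...</think><tool_call>...</tool_call> pair;
# re.S lets thoughts span multiple lines.
STEP = re.compile(r"<think>(.*?)</think>\s*<tool_call>(.*?)</tool_call>", re.S)

# Hypothetical two-step trajectory in the format described above
trajectory = (
    "<think>Identify the bird</think><tool_call>Image Search</tool_call>"
    "<think>Find its habitat</think><tool_call>Web Text Search</tool_call>"
)

steps = STEP.findall(trajectory)  # list of (thought, tool) tuples
```

Structured tags like these make it straightforward to filter trajectories for multi-step depth, since the step count is just the number of matches.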
Evaluation and Impact on Multimodal AI
WebWatcher outperformed top proprietary and open-source models across four challenging benchmarks. On Humanity’s Last Exam (HLE), it achieved 13.6% accuracy, surpassing GPT-4o (9.8%) and Gemini 2.5 (9.2%). It dominated BrowseComp-VL (27.0% vs. baselines’ ≤13.4%), LiveVQA (58.7% vs. ≤43.9%), and MMSearch (55.3% vs. ≤43.9%). These gains highlight its advantage in tasks requiring visual-textual synthesis, like identifying a snake species using birthplace clues from an image.

Data generation pipelines.
WebWatcher prioritized text search for information-heavy tasks (62% usage in BrowseComp-VL) but balanced image search for visual benchmarks (39% in SimpleVQA). However, WebWatcher still faces challenges in highly specialized domains like advanced physics.

Main results on HLE.