TTRL: Test-Time Reinforcement Learning

Plus more about Process Reward Models That Think and PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Apr 21st ~ Apr 27th
#53 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 7.3k Qwen has released Qwen3, a new series of eight open-weight large language models: 235B-A22B (MoE), 30B-A3B (MoE), 32B, 14B, 8B, 4B, 1.7B, and 0.6B parameters. The flagship Qwen3-235B-A22B model achieves competitive results against other open-source models.

    Qwen3-235B-A22B benchmark

  2. ♥ 12k OpenAI has enhanced ChatGPT with improved search features like better citations and WhatsApp integration, alongside introducing a new lightweight version to expand access to its popular "deep research" capability for both paid and free users. Additionally, an improved shopping experience featuring detailed product information and direct purchase links is currently rolling out to all user tiers.

    ChatGPT native shopping

  3. ♥ 1.1k NousResearch has introduced Minos, a new binary classifier designed to detect refusals from LLMs by estimating the likelihood that a response constitutes a refusal, potentially aiding redteamers and jailbreakers. Built upon AnswerAI's ModernBERT-Large 400M model for quality and speed, Minos is available on HuggingFace along with example scripts for usage.

bycloud’s new project: search AI papers semantically

Hey guys! I am very excited to share with you my latest project that I just shipped, called:

A semantic search engine for 300k+ AI research papers!

It outcompetes Deep Research apps like Grok, OpenAI, Perplexity, and Gemini at finding relevant papers. Check out my demo video on YouTube:

Specifically, there are ~300,000 AI/ML research papers currently indexed in my engine, and by next month, we are planning to increase this by 4x, indexing the entire arXiv.org.

But why ANOTHER search engine? Existing solutions each suffer from one of two problems:

  1. Generative AI models trained on papers are effectively built to serve up hallucinations

  2. Deep Research agents are good, but they waste compute browsing content that is 80% SEO-optimized slop

findmypapers.ai addresses both of these problems, and takes the best of both worlds.

I believe that surveying research shouldn’t be that hard. You can be as specific and technical as you want with your search query, and it won’t give you made-up, useless BS.

Before you try:

  • Search time is long (est. 1~3min depending on search breadth)

  • Limited to AI research papers, but will be expanding soon

  • Broad/wide search is really REALLY useful if you need a big compilation of information, like my own use cases

To celebrate our launch, use code BETAN50 for 50% off for the next 2 months! You can follow our official X account or Discord for updates, and feedback is really appreciated.

TTRL: Test-Time Reinforcement Learning

Zuo et al. [Tsinghua University, Shanghai AI Lab]

♥ 554   LLM RL

Training LLMs Without a Teacher

LLMs are getting better at solving complex tasks, but they sometimes struggle with new test data that wasn’t part of their training dataset. Traditional RL trains models using clear reward signals derived from labeled data, but in dynamic environments labels are scarce or nonexistent. Existing methods, such as majority voting during test-time scaling, help refine outputs but don’t actively improve the model’s underlying reasoning.

This paper introduces the Test-Time Reinforcement Learning (TTRL) approach, which lets models teach themselves using nothing but raw test data. Instead of relying on external labels, the model samples multiple responses to a problem, votes on the best answer internally, and uses that consensus to guide its own learning. This turns unlabeled test data into a self-sustaining training loop.

How Test-Time Reinforcement Learning (TTRL) Turns Guesses Into Lessons

TTRL operates like a group of students debating solutions. For each input (e.g., a math problem), the model generates multiple candidate answers. A majority vote selects the most popular response, which becomes a pseudo-label. The model then rewards itself for answers matching this consensus and penalizes deviations. Over time, this feedback loop sharpens its reasoning.

This uses three key mechanisms (a minimal code sketch follows the list):

  1. Repeated Sampling: By generating dozens of potential answers, the model explores diverse reasoning paths. This diversity ensures the majority vote isn’t skewed by repetitive errors.

  2. Reward From Consensus: Answers aligning with the majority earn a reward of 1; others get 0. Though simplistic, this binary signal stabilizes training by making it more consistent.

  3. Policy Updates: Using RL algorithms like GRPO or PPO, the model adjusts its parameters to favor high-reward behaviors. Crucially, this happens during inference, allowing real-time adaptation.
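To make the consensus-reward loop concrete, here is a minimal sketch of the majority-vote pseudo-labeling and binary reward step, assuming answers are plain strings; the actual policy update (GRPO/PPO) is omitted and the example values are hypothetical:

```python
from collections import Counter

def ttrl_rewards(sampled_answers):
    """Majority-vote pseudo-labeling: the most common answer becomes the
    pseudo-label, and each sample earns reward 1 if it matches it, else 0."""
    consensus, _ = Counter(sampled_answers).most_common(1)[0]
    rewards = [1.0 if a == consensus else 0.0 for a in sampled_answers]
    return consensus, rewards

# Hypothetical example: five sampled answers to one math problem
label, rewards = ttrl_rewards(["42", "42", "17", "42", "23"])
print(label, rewards)  # 42 [1.0, 1.0, 0.0, 1.0, 0.0]
```

These per-sample rewards stand in for the labeled reward signal of standard RL fine-tuning, which is what lets TTRL run on unlabeled test data.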

Benchmarking Results for Test-Time Reinforcement Learning (TTRL)

Researchers tested TTRL on mathematical benchmarks like AIME 2024 and MATH-500 and it performed remarkably well. For instance, Qwen-2.5-Math-7B saw a 159% jump in pass@1 accuracy on AIME after TTRL training, matching models trained with full supervision. These improvements generalized across tasks: a model fine-tuned on math problems also improved its performance on unrelated reasoning benchmarks.

  • Scaling Wins: Larger models benefit more from TTRL. The 7B model outperformed its 1.5B counterpart by wider margins, as its capacity to generate accurate consensus labels fueled better rewards.

  • Surpassing the Vote: Models consistently exceeded the performance ceiling set by their own majority votes. This "bootstrapping" effect suggests TTRL doesn’t just amplify existing knowledge, but it uncovers new reasoning strategies.

However, TTRL isn’t foolproof. Its success hinges on the model’s initial competence: weaker models (e.g., LLaMA-3.1-8B) showed minimal gains on harder tasks, likely due to insufficient prior knowledge. Hyperparameters like temperature and batch size also require careful tuning: too little exploration leads to stagnation, while too much introduces noise.

Process Reward Models That Think

Khalifa et al. [University of Michigan, Mila, LGAIResearch, University of Illinois Urbana-Champaign]

♥ 147   LLM PRM  

Using Generative Process Reward Models to Build Better AI Verifiers with Less Data

Modern AI systems rely heavily on verification to separate correct solutions from plausible-sounding errors. Until now, we have relied on huge labeled datasets or human verifiers, which is extremely slow. Process Reward Models (PRMs) can act as quality inspectors for AI reasoning and check each step in a solution. But training them has required expensive step-by-step annotations. Discriminative PRMs, which classify steps as correct/incorrect, struggle without this data. Meanwhile, zero-shot "LLM-as-a-judge" methods lack reliability, often overcomplicating verification or missing subtle errors.

This paper introduces THINKPRM, a new method that combines generative reasoning with minimal supervision. THINKPRM leverages existing reasoning capabilities in language models more effectively to train generative verifiers using 1% of the labeled data needed by discriminative models while outperforming them. 

How THINKPRM Works

THINKPRM starts with pre-trained reasoning models (like Qwen or Llama variants) and fine-tunes them to generate verification CoTs. Instead of requiring manual annotations for every step, it uses a clever synthetic data pipeline (a brief code sketch follows the list):

  1. Synthetic Verification Chains: A base model critiques solutions, producing step-by-step assessments. These critiques are filtered to match sparse gold labels (e.g., from datasets like PRM800K), ensuring only high-quality examples are kept.

  2. Lightweight Fine-Tuning: Models train on these filtered chains, learning to align their verification reasoning with correct judgments. This process takes just hours on a single GPU, even for billion-parameter models.

  3. Scalable Verification: At test time, THINKPRM generates multiple verification CoTs per solution. These can be aggregated (parallel scaling) or extended via self-correction prompts (sequential scaling) to boost accuracy.
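As a rough illustration of steps 1 and 3, here is a minimal sketch of filtering synthetic verification chains against sparse gold labels and of parallel test-time scaling; the chain format and the verifier.judge call are assumptions for illustration, not the paper’s actual API:

```python
from statistics import mean

def filter_chains(synthetic_chains, gold_step_labels):
    """Keep only synthetic verification CoTs whose per-step verdicts agree
    with the sparse gold step labels (e.g., PRM800K-style annotations).
    Each chain is assumed to be a dict with "verdicts" and "text" fields."""
    return [c["text"] for c in synthetic_chains
            if c["verdicts"] == gold_step_labels]

def parallel_verification_score(verifier, solution, k=8):
    """Parallel test-time scaling: sample k verification CoTs for the same
    solution and average their final correctness scores.
    `verifier.judge` is a hypothetical method returning a score in [0, 1]."""
    return mean(verifier.judge(solution) for _ in range(k))
```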

Benchmark Results for THINKPRM

Researchers tested THINKPRM across math, coding, and science benchmarks, and it consistently outperformed both discriminative PRMs and zero-shot LLM judges.

  • In-Domain Tasks: On MATH-500 and AIME’24, THINKPRM-14B achieved up to 8% higher accuracy than discriminative PRMs trained on 100x more data. 

  • Out-of-Domain Generalization: Without any domain-specific tuning, THINKPRM surpassed discriminative models by 8% on PhD-level physics questions (GPQA) and 4.5% on code generation (LiveCodeBench).

  • Compute Scaling: Allowing THINKPRM to generate longer CoTs or aggregate multiple verifications improved results significantly. For example, doubling verification tokens boosted F1 scores by 15 points on ProcessBench, outperforming fixed-length discriminative checks.

THINKPRM is promising, but it isn’t perfect. Its verification chains can be overconfident, and errors in early steps sometimes cascade into later judgments. It also becomes more computationally expensive with longer CoTs, though the performance gains often justify the expense.

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Qiu et al. [Peking University, Beijing Computational Science Research Center]

♥ 340   LLM Reasoning   bycloud’s pick  

Master Real-World Physics Reasoning with PHYBench 

AI is getting pretty good at creating basic English sentences and can talk about the weather. But it can’t really calculate the tension in a pendulum system after a sudden impact or predict relativistic effects in particle collisions. While today’s LLMs excel at many tasks, their ability to reason about the physical world remains surprisingly limited.

This is because current benchmarks for evaluating AI reasoning often fall short in three ways: they oversimplify problems, rely on abstract math disconnected from real-world scenarios, and use crude metrics like binary accuracy. These limitations make it hard to distinguish between models that genuinely grasp physics and those that merely memorize patterns.

This paper introduces PHYBench, a new benchmark designed to rigorously test how well AI models understand and reason through physics problems. PHYBench contains 500 carefully curated physics problems spanning mechanics, electromagnetism, thermodynamics, and advanced topics like relativity. Each problem mirrors real-world scenarios and requires complex thinking, from calculating string tensions in pendulum systems to analyzing photon-mirror collisions.

It also introduces the Expression Edit Distance (EED) Score, a metric that measures how “close” a model’s answer is to the correct solution, even when it’s not fully right. This granular approach captures partial understanding, offering a clearer picture of where models stumble.

Understanding the PHYBench Benchmark

Creating a benchmark for physical reasoning requires a good understanding of reality. PHYBench’s questions are adapted from human physics exams and olympiads, to ensure they reflect authentic challenges. For example, one problem asks models to determine the tension in a string connecting three suspended balls after one is struck. This task requires spatial reasoning, force analysis, and multi-step calculations. Each question undergoes rigorous review by physics students to eliminate ambiguities and ensure solvability through pure textual descriptions.

Traditional metrics like accuracy fail to reward models for getting part of a solution right. But this benchmark uses EED Score to address this by comparing the structure of a model’s answer to the ground truth using expression trees. If a model forgets a term or miscalculates a coefficient, the EED quantifies how many “edits” (like adding or removing nodes in the tree) would fix the error. For instance, if the correct answer is T = 2mg + 4mv²/l and a model outputs T = 2mg + 2mv²/l, the EED Score reflects this as a minor edit rather than a complete failure.
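As a rough stand-in for this idea (not the paper’s exact metric), the sketch below compares SymPy expression trees by node overlap; the real EED Score uses a proper tree edit distance, so treat this only as an illustration of how a wrong coefficient yields a small penalty rather than a zero score:

```python
import sympy as sp
from collections import Counter

m, g, v, l = sp.symbols("m g v l")

def node_multiset(expr):
    """Multiset of all subexpression nodes in a SymPy expression tree."""
    return Counter(sp.preorder_traversal(expr))

def approx_eed(pred, truth):
    """Illustrative proxy for the EED Score: fraction of tree nodes shared
    between the predicted and ground-truth expressions (the paper uses a
    genuine tree edit distance; this overlap is only an approximation)."""
    p, t = node_multiset(pred), node_multiset(truth)
    common = sum((p & t).values())
    return common / max(sum(p.values()), sum(t.values()))

truth = 2*m*g + 4*m*v**2/l   # ground-truth tension from the article's example
pred = 2*m*g + 2*m*v**2/l    # model answer with one wrong coefficient
print(approx_eed(pred, truth))  # high score: most nodes match, a minor edit
```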

Figure: example questions and errors. The errors are from solutions generated by DeepSeek-R1, illustrating the main parameters and physical processes.

Even state-of-the-art models often misjudge kinematic relationships or make algebraic slips during long chains of equations. PHYBench forces models to demonstrate both skills, exposing weaknesses that simpler benchmarks miss.

Testing AI Models on Real Physics with PHYBench

When tested on PHYBench, even top-tier models like Gemini 2.5 Pro and GPT-4o lag far behind human experts. Humans scored 61.9% accuracy and 70.4 EED, while Gemini 2.5 Pro managed just 36.9% accuracy and 49.5 EED. Smaller models fared worse, with some scoring near zero.

  • Specialization matters: Models fine-tuned for reasoning outperformed general-purpose ones, but none approached human levels.

  • Domain disparities: All models struggled most with thermodynamics and advanced physics, suggesting gaps in handling multi-step processes or abstract concepts.

  • EED’s advantage: The EED Score provided 3x more discrimination between models than traditional accuracy, making it a sharper tool for benchmarking progress.
