
Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Plus more about Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction and RM-R1: Reward Modeling as Reasoning

May 5th ~ May 11th
#55 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 1.5k Prime Intellect has launched INTELLECT-2, which is the first 32B parameter model trained using fully asynchronous reinforcement learning across a decentralized network of permissionless compute contributors.

  2. ♥ 434 ByteDance has released the Seed1.5-VL technical report, a new vision-language foundation model that integrates a 532M-parameter vision encoder with a 20B active-parameter Mixture-of-Experts LLM architecture. Despite its relatively compact design, the model achieves state-of-the-art performance on 38 out of 60 public benchmarks. It also demonstrates exceptional capabilities in agent-centric tasks like GUI control and gameplay, outperforming established systems such as OpenAI CUA and Claude 3.7.

  3. ♥ 4.7k Mistral AI has launched Mistral Medium 3, a new language model that delivers state-of-the-art performance at 8X lower cost. The model performs at or above 90% of Claude Sonnet 3.7 across benchmarks at a significantly reduced price point ($0.4 input / $2 output per million tokens) and outperforms open models like Llama 4 Maverick, particularly in professional use cases such as coding and multimodal understanding. You can use it on Mistral La Plateforme and Amazon SageMaker.

Bhindi is your cursor for apps

It’s your AI tool, connecting 70+ tools to spin up seamless workflows in seconds.

Tell Bhindi what you need, and it’ll stitch together your apps like magic. You can use Bhindi to:

  • Pull tasks from Trello, merge the right PRs on GitHub, and drop updates in Slack.

  • Search products on Amazon, log them in Google Sheets, and notify the team — one simple query.

  • Quote a tweet, set calendar reminders, plan trips, hunt jobs — no tab juggling.

  • Fetch your last emails, highlight the latest one about leave, and keep your records updated.

We’ve got image generation and TTS support across models too.

The new UI is a conversation.

Just tell Bhindi your requirements in plain English and it will handle the rest.

Give it a spin. We’d love your feedback!

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Zhao et al. [Tsinghua University, Beijing Institute for General Artificial Intelligence, Pennsylvania State University]

♥ 1.6k   LLM Reasoning

How AI Models Can Teach Themselves Without Human Data

LLMs are getting better at reasoning by learning from human-curated examples, but this reliance on expert-crafted data is becoming a bottleneck. As models grow more capable, the effort needed to maintain high-quality training datasets becomes unsustainable.

This paper introduces a new approach called Absolute Zero Reasoner (AZR) that offers a way for models to autonomously evolve their reasoning skills, no human input required. Most reasoning models today depend on reinforcement learning with verifiable rewards (RLVR), where feedback comes from outcome-based metrics like code correctness.

Although effective, these methods still need carefully designed question-answer pairs curated by humans. This creates a scalability wall: as tasks grow more complex, manual dataset creation becomes impractical. Worse, if AI eventually outperforms humans, it could stagnate when limited to human-designed challenges. AZR tackles this by eliminating external data entirely. Instead of relying on predefined tasks, the model invents its own problems, solves them, and learns from the results.

How Absolute Zero Reasoner Works

The AZR model uses a continuous loop of task creation and problem-solving, guided by three core reasoning modes. It relies on a code executor, which validates proposed tasks, checks solutions, and provides objective feedback without human intervention.

  1. Dual Roles: Proposer and Solver
    The same model wears two hats. As a proposer, it generates coding tasks, like writing a function or predicting an output, while ensuring they’re neither too easy nor unsolvable. As a solver, it attempts these tasks, refining its reasoning skills through trial and error. Rewards are split: the proposer earns points for creating "Goldilocks" tasks (moderately challenging), while the solver is graded on correctness.

  2. Three Modes of Reasoning
    Tasks fall into three categories, inspired by logical reasoning:

    1. Deduction: Predict an output given code and input (e.g., "What does f(x)=x+2 return for x=3?").

    2. Abduction: Infer an input that produces a specific output (e.g., "Find x so that f(x)=5").

    3. Induction: Write code that matches input-output examples (e.g., "Create a function that maps these pairs").

Each mode targets different cognitive skills, from step-by-step logic (deduction) to creative problem-solving (abduction). By cycling through these tasks, AZR builds a broad, flexible understanding of code and logic.

  3. Code as a Grounded Playground
    Using Python as its environment, AZR validates tasks through execution. For example, if the model proposes a function, the code executor runs it to confirm it works. This ensures rewards are based on objective, verifiable outcomes, avoiding the pitfalls of learned reward models that can be "hacked" by exploiting biases.
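
Putting these pieces together, here is a minimal Python sketch of the propose-solve loop and its reward split. It is an illustration under stated assumptions, not the paper’s code: the helpers `model_propose` and `model_solve` stand in for the single LLM acting in its two roles, and the bare `exec` call is a stand-in for AZR’s Python code executor.

```python
# Minimal, illustrative sketch of AZR's propose-solve loop (assumptions noted above).

def run_program(code: str, inp):
    """Execute a proposed program `f` on an input and return its output."""
    env = {}
    exec(code, env)                      # the Python executor grounds all rewards
    return env["f"](inp)

def propose_task(model_propose, mode: str):
    """Proposer role: generate a (program, input, output) task for one of the
    three modes: 'deduction', 'abduction', or 'induction'."""
    code, inp = model_propose(mode)
    out = run_program(code, inp)         # validated by actually running the code
    return code, inp, out

def learnability_reward(solver_successes: list) -> float:
    """Proposer reward: highest for 'Goldilocks' tasks that the current solver
    sometimes, but not always, gets right."""
    p = sum(solver_successes) / len(solver_successes)
    return 0.0 if p in (0.0, 1.0) else 1.0 - p

def solver_reward(model_solve, mode: str, code: str, inp, out) -> float:
    """Solver reward: 1.0 only if the answer matches the executor's ground truth."""
    answer = model_solve(mode, code, inp, out)
    if mode == "deduction":              # predict the output of the code on inp
        return float(answer == out)
    if mode == "abduction":              # find an input that yields `out`
        return float(run_program(code, answer) == out)
    # induction: write code that reproduces the observed input-output pair
    return float(run_program(answer, inp) == out)
```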

Strengths of Absolute Zero Reasoner

The Absolute Zero Reasoner model was trained entirely without human data and it matches or outperforms models fine-tuned on thousands of expert examples. On coding benchmarks like HumanEval+ and MBPP+, it sets new state-of-the-art scores.

In math reasoning (AIME, AMC), it shows strong cross-domain generalization, even though it was trained solely on code tasks. Key findings include:

  • Scaling Benefits: Larger base models (7B→14B parameters) show bigger performance jumps, which suggests continued gains as models grow.

  • Code Supercharges Reasoning: Models pretrained on code outperformed general-purpose counterparts in math after AZR training, hinting at synergies between programming and abstract reasoning.

  • Emergent Planning: Much like humans, AZR began adding step-by-step comments to its code, mirroring techniques like ReAct prompting, even though this behavior was never explicitly taught.

However, there are caveats. Larger models occasionally produced concerning reasoning chains, underscoring the need for safety safeguards. Moreover, autonomous systems might develop unintended behaviors, and verifying their solutions grows harder as tasks become more abstract.

Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

Gong et al. [Inclusion AI, Ant Group]

♥ 55   Multi-modal Architecture    bycloud’s pick  

The output results and multimodal interactive demos of Ming-Lite-Uni.

Understanding Unified Multimodal Models with Ming-Lite-Uni

When we think of AI, most people imagine a single system that can understand and generate images, text, and other modalities. While models like GPT-4o have shown impressive native image generation, many open-source frameworks struggle with a critical problem: balancing high-fidelity visual synthesis with precise semantic understanding.

Existing models often prioritize pixel-perfect image generation, but this usually comes at the cost of semantic alignment: the images look great, yet they often miss the intent of the user. This paper introduces Ming-Lite-Uni, an open-source multimodal framework designed to unify visual and language tasks without compromising either capability.

How Ming-Lite-Uni Understands Multiple Modalities

The Ming-Lite-Uni model combines frozen and trainable components. It keeps a pre-trained multimodal large language model (MLLM) frozen to retain its understanding of text and images, while fine-tuning a diffusion model to handle generation. This separation avoids the common pitfall of feature mismatch, where visual synthesis drifts away from the original semantic context.

To bridge these components, the team introduced multi-scale learnable tokens: adaptive query tokens that capture visual details at different resolutions. Low-resolution tokens handle global layout and color, medium ones focus on objects, and high-resolution tokens encode fine textures. These tokens act as a universal language between understanding and generation, ensuring the diffusion model stays aligned with the MLLM’s intent.
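
As a rough illustration of the idea, the PyTorch-style sketch below builds learnable query banks at three resolutions and concatenates them into one sequence. The token counts, hidden size, and scale-boundary embeddings are assumptions made for clarity; Ming-Lite-Uni’s actual implementation may differ.

```python
# Illustrative sketch of multi-scale learnable query tokens (shapes assumed).
import torch
import torch.nn as nn

class MultiScaleQueries(nn.Module):
    """Learnable query tokens at several resolutions, concatenated into one
    sequence that bridges the frozen MLLM and the trainable diffusion model."""

    def __init__(self, hidden: int = 2048, counts=(64, 256, 1024)):
        super().__init__()
        # one learnable token bank per scale: low -> global layout and color,
        # mid -> objects, high -> fine textures (as described above)
        self.scales = nn.ParameterList(
            [nn.Parameter(torch.randn(n, hidden) * 0.02) for n in counts]
        )
        # learnable boundary embeddings marking where each scale begins
        self.boundaries = nn.Parameter(torch.randn(len(counts), hidden) * 0.02)

    def forward(self, batch_size: int) -> torch.Tensor:
        seqs = []
        for i, bank in enumerate(self.scales):
            marker = self.boundaries[i].unsqueeze(0)       # (1, hidden)
            seqs.append(torch.cat([marker, bank], dim=0))  # prepend the boundary
        tokens = torch.cat(seqs, dim=0)                    # (total_tokens, hidden)
        return tokens.unsqueeze(0).expand(batch_size, -1, -1)
```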

The AR part of Ming-Lite-Uni.

Ming-Lite-Uni also uses a multi-scale representation alignment strategy, which enforces consistency across hierarchical features. By aligning intermediate outputs of the diffusion model with the final semantic representations from the MLLM, it reduces the discrepancies that typically plague unified models. This approach improves high-resolution reconstruction quality by over 2 dB in PSNR and boosts generation accuracy. The diffusion model itself is trained with a FlowMatching loss, borrowed from recent advances in continuous trajectory modeling, which helps refine details without destabilizing the frozen MLLM.
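
In code, the alignment term reduces to an auxiliary loss that pulls an intermediate diffusion feature toward the MLLM’s final semantic representation. The sketch below is an approximation under stated assumptions: the mean pooling, the trainable projection, and the 0.1 weight are illustrative choices, not the paper’s exact recipe.

```python
# Sketch of a representation alignment term added to the FlowMatching objective.
import torch
import torch.nn.functional as F

def alignment_loss(diffusion_feats: torch.Tensor,
                   mllm_semantics: torch.Tensor,
                   proj: torch.nn.Linear) -> torch.Tensor:
    """diffusion_feats: (B, N, C) intermediate tokens from the diffusion model.
    mllm_semantics: (B, M, D) final-layer representations from the frozen MLLM.
    proj: a trainable linear map from C to D (an assumption of this sketch)."""
    d = F.normalize(proj(diffusion_feats).mean(dim=1), dim=-1)  # (B, D)
    s = F.normalize(mllm_semantics.mean(dim=1), dim=-1)         # (B, D)
    # negative cosine similarity: minimized when the two representations agree
    return -(d * s).sum(dim=-1).mean()

# total_loss = flow_matching_loss + 0.1 * alignment_loss(feats, semantics, proj)
```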

The framework’s architecture also introduces a new multimodal autoregressive module. Instead of reinventing the wheel, it reuses components from existing models like M2-omni and Llama3, modifying positional encodings to handle mixed modalities. This allows Ming-Lite-Uni to process arbitrary-resolution images and variable-length text in a single sequence, making it adaptable to tasks from style transfer to multi-round image editing. By freezing the MLLM and focusing training on the diffusion model, the team sidestepped the computational cost of end-to-end optimization while still achieving strong interoperability.

Performance and Open Challenges of Ming-Lite-Uni

Ming-Lite-Uni shows promising results. On GenEval (a benchmark for text-to-image generation), it scored 0.62 accuracy, surpassing specialized diffusion models like SDXL (0.55) and approaching closed-source tools like DALL-E 3 (0.67). In multimodal understanding tasks, it outperformed similarly sized models on benchmarks like MMB and MMMU, though it lags behind larger closed-source systems like GPT-4o.

Evaluation of text-to-image generation ability on the GenEval benchmark (Ghosh et al., 2024).

However, the framework is still in its alpha stage. It has a few limitations such as a reliance on curated datasets for style transfer and editing, which may restrict generalization. The team also pointed out that scaling the autoregressive component could further close the gap with proprietary models. Future work will focus on expanding the training data and refining the balance between modalities.

RM-R1: Reward Modeling as Reasoning

Chen et al. [University of Illinois Urbana-Champaign, University of California, Texas A&M University]

♥ 184   LLM Reward Modeling

How Reasoning Reward Models Are Shaping AI Alignment

When we align an LLM with human preferences, we rely on a reward model that tells it which behaviors humans actually want. But today’s reward models have a critical limitation: they’re either too opaque to trust or too simplistic to handle nuanced tasks. Traditional reward models fall into two camps:

  • Scalar models spit out numerical scores without explanation, leaving users guessing why one response is better than another.

  • Generative models produce free-form judgments but often default to surface-level critiques. For example, they will point out grammar errors while missing deeper issues like emotional harm. This lack of interpretability and depth becomes a liability in high-stakes scenarios, for instance, when evaluating mental health advice or complex code.

The off-the-shelf instruct model overfits to patterns in supervised data, failing to evaluate the emotional harm and lack of nuance in the rejected response.

This paper introduces Reasoning Reward Models (REASRMS), a new approach that treats reward modeling not as a black-box scoring game, but as a reasoning-intensive task. The RM-R1 project tackles the above problems by asking: What if reward models reasoned like humans?

Training pipeline of RM-R1

How REASRMS Work: Teaching Models to Think Before Judging

REASRMS are trained with a two-stage pipeline designed to bake reasoning into every judgment.

Stage 1: Distilling High-Quality Reasoning

The process starts with a standard instruction-tuned LLM (like Qwen-2.5-14B), which the team first teaches to generate structured critiques. To do this, they distill “reasoning traces” synthesized by stronger models (e.g., Claude-3): detailed evaluations where the model explains its rubric (e.g., “Does this response validate emotions?”) before declaring a winner. This phase helps the model internalize how to evaluate, not just what to choose.
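
For concreteness, one distilled training example might look roughly like the dictionary below. The field names, the example content, and the <rubric>/<eval>/<answer> tag layout are hypothetical placeholders, not the paper’s exact trace format.

```python
# Hypothetical shape of one Stage-1 distillation example (format illustrative).
distilled_example = {
    "prompt": "My friend cancelled on me again and I feel worthless.",
    "response_a": "...",   # the two candidate answers being compared (elided)
    "response_b": "...",
    "target_critique": (
        "<rubric>1) Validates the user's emotions "
        "2) Avoids dismissive advice "
        "3) Offers a concrete, gentle next step</rubric>\n"
        "<eval>Response A acknowledges the feeling of worthlessness before "
        "suggesting anything; Response B jumps straight to advice.</eval>\n"
        "<answer>A</answer>"
    ),
}
```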

Stage 2: Reinforcement Learning with Verifiable Rewards

However, distillation alone risks overfitting to synthetic patterns. To refine the model’s judgment, reinforcement learning (via Group Relative Policy Optimization) rewards the model for correct final verdicts while penalizing deviations from its original knowledge. The reward signal focuses only on whether the model’s final answer matches human preferences, no partial credit for elegant but incorrect reasoning. This forces the model to align its elaborate critiques with ground-truth outcomes.
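
The verifiable reward itself is easy to state: credit only when the final verdict matches the human-preferred response. Here is a minimal sketch, assuming the verdict is emitted inside <answer> tags and using ±1 reward values (both assumptions of this illustration):

```python
import re

def verdict_reward(rollout: str, preferred: str) -> float:
    """Return +1 if the rollout's final verdict matches the human preference,
    -1 otherwise; no partial credit for well-written but wrong critiques."""
    match = re.search(r"<answer>\s*([AB])\s*</answer>", rollout)
    if match is None:
        return -1.0                      # malformed output earns no credit
    return 1.0 if match.group(1) == preferred else -1.0
```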

The user prompt used for RM-R1 rollout (for reasoning models).

The Chain-of-Rubrics Framework

At inference time, RM-R1 classifies tasks into two types:

  • Chat: Generates custom rubrics (e.g., “emotional safety,” “actionable advice”) tailored to the query.

  • Reasoning: Solves the problem itself first, then compares responses against its own solution.

This division lets the model apply domain-specific evaluation strategies. For emotional support queries, it might prioritize empathy; for math problems, correctness reigns supreme.
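
A hedged sketch of that routing logic: classify the query, then build the judging prompt accordingly. The `classify` callable and the prompt wording are placeholders for illustration, not RM-R1’s actual prompts.

```python
def build_judge_prompt(query: str, resp_a: str, resp_b: str, classify) -> str:
    """`classify` stands in for the model's own chat-vs-reasoning task
    classification; the instruction text below is illustrative only."""
    if classify(query) == "reasoning":
        # reasoning track: solve the problem first, then judge against it
        instructions = ("First solve the problem yourself inside <solution> "
                        "tags, then compare each response to your solution.")
    else:
        # chat track: write query-specific rubrics before judging
        instructions = ("First write evaluation rubrics tailored to this query "
                        "inside <rubric> tags, then grade each response "
                        "against them.")
    return (f"{instructions}\n\nQuery: {query}\n\n"
            f"Response A: {resp_a}\n\nResponse B: {resp_b}\n\n"
            "Give your final verdict inside <answer> tags, as A or B.")
```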

Outperforming Big LLMs with Smaller, Smarter Models

The model trained using the REASRMS approach described above performed quite well on a number of benchmarks:

  • On RewardBench, the 32B variant outperformed GPT-4o and Claude-3 by up to 13.8% accuracy.

  • In RM-Bench (a test of sensitivity to subtle errors), it achieved 83.9% accuracy, 12.8% higher than prior models, particularly shining in math and coding evaluations.

  • Even smaller 7B/14B models rivaled or surpassed much larger LLMs, which suggests efficiency gains from focused reasoning training.

Results of proposed method and baselines on the RewardBench.

While REASRMS excel at structured reasoning, they occasionally overfit to synthetic critique formats. Although larger models performed better, the gains from REASRMS’ training pipeline outstripped raw scaling. The team also notes that current benchmarks underemphasize multilingual or multi-modal tasks.
