How LLMs Pass the Math Olympiad and Why Models Like Grok Can Be Manipulated to Praise Hitler

Understanding the genius and the ghost in the machine: how breakthroughs like Seed-Prover create AIs that can ace logic puzzles, while Persona Vectors reveal how their very character can be hijacked.

Jul 28th ~ Aug 3rd
#67 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 1.5k Are you struggling to get AI images with clean, accurate text? Alibaba's new Qwen-Image model specializes in rendering complex text in both English and Chinese. The 20-billion parameter model also focuses on consistent image editing and has achieved state-of-the-art performance across multiple generation and editing benchmarks. Get the model weights from HuggingFace or try the new Qwen-Image model yourself today!

  2. ♥ 1.5k Finding the perfect restaurant just became a whole lot easier, thanks to a new partnership between Perplexity and OpenTable. You can now ask Perplexity for specific restaurant recommendations, like "a quiet sushi spot that's good for date night", and book a table directly from the results.

  3. ♥ 1.5k On a different note, Perplexity is facing scrutiny over its data collection methods. Cloudflare is accusing Perplexity of using undeclared crawlers to bypass website crawling restrictions. Researchers claim that when Perplexity's official bot is blocked, it deploys crawlers that impersonate regular user traffic to access content against robots.txt directives.

Stop Juggling Apps. Start Directing Workflows with Bhindi.

Meet Bhindi, your new AI teammate designed to eliminate digital busywork. Instead of constantly switching between tabs, you can now control all your essential apps (like Gmail, Slack, Notion, and many more…) from one place using simple, conversational commands. Just tell Bhindi what you need, and watch it orchestrate complex tasks in seconds.

With Bhindi, you can:

  • Use Plain English: Simply talk to Bhindi like you would a colleague. No complex setup or coding is required.

  • Automate Multi-Step Tasks: Ask it to pull a report from an email, create a graph from the data, and draft a social media post about it, all in one go.

  • Connect Your Entire Toolkit: Bhindi connects your favorite apps, turning your fragmented workflow into a seamless, intelligent system.

Ready to take back your time and focus on what truly matters?

Flow Matching Policy Gradients

McAllister et al. [UC Berkeley, Max Planck Institute for Intelligent Systems]

♥ 485   Reinforcement learning 

Introduction to Flow Policy Optimization

Reinforcement learning typically relies on simple Gaussian policies that model actions as unimodal distributions. This approach works well in straightforward scenarios but often fails in complex environments where several distinct actions lead to similarly high rewards, like navigating a maze with two equally viable paths.

Flow-based generative models, such as diffusion models, offer richer expressiveness by capturing multimodal distributions. However, integrating them into reinforcement learning has been challenging due to computational costs and inflexible training requirements.

This paper introduces Flow Policy Optimization (FPO), which addresses this gap by combining flow matching with policy gradients and enabling stable training without exact likelihood calculations.

Inner Workings of Flow Policy Optimization (FPO)

FPO rethinks policy optimization using a clever twist on the popular PPO algorithm. Instead of computing precise likelihoods for actions, which is a bottleneck for flow-based models, FPO uses a surrogate objective based on the conditional flow matching loss, which measures how well the policy's learned flow transports random noise onto the sampled actions.

FPO replaces PPO’s likelihood ratio with an advantage-weighted estimate derived from flow matching. For each action sampled during training, FPO draws multiple noise-timestep pairs and computes a loss that steers the policy toward rewarding behaviors.

This approach avoids complex density estimation while retaining the flexibility of flow models. Unlike prior methods that lock training to specific sampling techniques, FPO treats sampling as a black box: policies can use deterministic or stochastic samplers, with few or many integration steps, during both training and deployment.

The key step in this algorithm is linking the flow matching loss to the evidence lower bound (ELBO), which ensures that decreasing the loss raises a lower bound on the likelihood of high-advantage actions. This makes FPO compatible with existing tools like advantage estimation while sidestepping computational hurdles.
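To make this concrete, here is a minimal PyTorch-style sketch of the surrogate objective under stated assumptions: a hypothetical `policy.velocity(obs, x_t, t)` method that predicts the flow's velocity field, a linear interpolation path, and PPO-style clipping. The paper's exact estimator and weighting details may differ.

```python
import torch

def fpo_surrogate_loss(policy, old_policy, obs, actions, advantages,
                       n_mc=8, clip_eps=0.2):
    """Sketch of an FPO-style update: score both policies with a conditional
    flow matching (CFM) loss on shared noise/timestep draws, and use the
    exponentiated loss difference in place of PPO's likelihood ratio."""
    batch = actions.shape[0]
    ratios = []
    for _ in range(n_mc):
        t = torch.rand(batch, 1)                 # random flow timestep in [0, 1]
        noise = torch.randn_like(actions)        # x_0 ~ N(0, I)
        x_t = (1 - t) * noise + t * actions      # linear interpolant toward the action
        target_v = actions - noise               # flow matching regression target

        v_new = policy.velocity(obs, x_t, t)     # hypothetical velocity-field predictor
        v_old = old_policy.velocity(obs, x_t, t).detach()
        l_new = ((v_new - target_v) ** 2).mean(dim=-1)
        l_old = ((v_old - target_v) ** 2).mean(dim=-1)

        # A lower CFM loss corresponds to a higher ELBO, so this difference
        # plays the role of a log-likelihood ratio.
        ratios.append(torch.exp(l_old - l_new))

    ratio = torch.stack(ratios).mean(dim=0)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```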

Evaluation and Results of Flow Policy Optimization

In a GridWorld task with sparse rewards, flow-based policies learned multimodal action distributions at critical decision points, which enabled varied paths to goals, unlike Gaussian policies that converged to single solutions. On MuJoCo continuous control tasks, FPO outperformed Gaussian PPO and diffusion-based DPPO in 8 of 10 settings and achieved higher rewards with comparable sample efficiency.

Comparison between FPO and Gaussian PPO (Schulman et al.) on DM Control Suite tasks.

The most compelling results were observed in high-dimensional humanoid control. When conditioned only on sparse signals (e.g., root or hand movements), FPO achieved 54% success rates, nearly double Gaussian PPO’s 30%, while maintaining stable motion. This highlights FPO’s strength in under-conditioned settings where traditional policies often fail.

Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

ByteDance Seed AI4Math

♥ 22k   Theorem Proving  

Introduction to Automated Theorem Proving with Seed-Prover and Seed-Geometry

Large language models have made impressive progress in mathematical reasoning, but they often stumble on complex theorem proving. This is a very challenging problem as natural language proofs lack clear verification signals, making it hard to train models effectively with reinforcement learning.

Formal languages like Lean solve this by providing automatic proof validation, but existing approaches still struggle with high-level reasoning and geometry support. Seed-Prover and Seed-Geometry tackle these gaps head-on by introducing innovative methods to automate solutions for elite competitions like the International Mathematical Olympiad (IMO).

Growth in MiniF2F-Test performance over time.

Inner Workings of Seed-Prover and Seed-Geometry

Seed-Prover takes a lemma-focused approach to theorem proving. Instead of generating entire proofs at once, it first proposes intermediate lemmas: smaller, reusable claims that build toward the main theorem. These lemmas are proven independently, stored in a shared pool, and combined flexibly.

During training, the model uses reinforcement learning guided by Lean compiler feedback, with prompts that mix formal statements, natural language hints, and past attempts. For inference, three strategies adapt to problem difficulty:

  • Light: Iteratively refines proofs 8–16 times using compiler feedback and self-summarization (a loop sketched after this list).

  • Medium: Targets unproven lemmas from the initial attempt, applying light refinement recursively.

  • Heavy: Generates thousands of conjectures, proves the most promising ones as lemmas, and integrates them into the final proof.
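As a rough illustration, the light refinement loop might look like the following sketch, where `model.generate_proof`, `model.summarize_attempt`, and `lean_check` are hypothetical stand-ins for the prompting and Lean compilation steps described above.

```python
def light_refinement(statement, model, lean_check, max_iters=16):
    """Sketch of the "light" setting: iteratively regenerate a proof, feeding
    Lean compiler errors and a self-summary of earlier attempts back into the
    prompt (helper names are assumptions, not the paper's API)."""
    feedback, history = "", []
    for _ in range(max_iters):
        proof = model.generate_proof(statement, feedback, history)
        ok, errors = lean_check(proof)            # compile the candidate proof in Lean
        if ok:
            return proof
        history.append(model.summarize_attempt(proof, errors))
        feedback = errors
    return None  # still unproven; "medium" would now recurse on unproven lemmas
```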

Seed-Geometry complements this by addressing Lean's geometry limitations. It uses a fast symbolic engine written in C++ to forward-chain geometric deductions at 100× Python speeds. A neural model proposes auxiliary constructions (e.g., points or circles) to complete diagrams, enabling proofs via beam search. Key innovations include grouped actions like "isogonal conjugates" to simplify representations, and distributed processing for scalable search.
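The construction search itself can be pictured as a beam search; here is a rough sketch with `propose` and `deduce` as hypothetical stand-ins for the neural construction proposer and the C++ symbolic engine.

```python
def geometry_beam_search(problem, propose, deduce, beam_width=8, max_steps=4):
    """Sketch of auxiliary-construction search: `propose` suggests scored new
    points or circles, `deduce` forward-chains geometric facts and reports
    whether the goal is now derivable (both are assumed interfaces)."""
    beam = [(problem, [])]                        # (diagram state, constructions so far)
    for _ in range(max_steps):
        candidates = []
        for state, constructions in beam:
            closed, state = deduce(state)         # symbolic forward chaining
            if closed:
                return constructions              # goal derivable: proof found
            for aux, score in propose(state, k=beam_width):
                candidates.append((score, state.with_construction(aux),
                                   constructions + [aux]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = [(s, c) for _, s, c in candidates[:beam_width]]
    return None                                   # no proof within the search budget
```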

The workflows of single-pass whole proof generation, light, and medium inference settings.

Evaluation and Performance of Seed-Prover and Seed-Geometry

Seed-Prover achieved groundbreaking results: it solved 78.1% of formalized IMO problems, saturated the MiniF2F benchmark (99.6%), and scored 50.4% on PutnamBench, outpacing prior methods by up to 3×. In IMO 2025, it was able to prove 5 of 6 problems using medium and heavy inference.

Performance comparison of Seed-Prover against previous systems across formal math tasks.

Seed-Geometry also performed well, solving 43/50 IMO geometry problems (surpassing AlphaGeometry 2) and setting records on IMO shortlist problems. Combinatorics remains the weak spot, however: Seed-Prover solved only 30% of CombiBench tasks.

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Chen et al. [Anthropic Fellows Program, UT Austin, Constellation, Truthful AI, UC Berkeley, Anthropic]

♥ 424   LLM Training bycloud’s pick  

Introduction to Persona Vectors

Large language models like ChatGPT or Claude are designed to be helpful and honest, but sometimes their behavior shifts unexpectedly. You might recall incidents like Microsoft’s Bing chatbot threatening users or Grok praising Hitler after minor tweaks. These aren’t isolated glitches; they’re symptoms of unstable “personas” in AI systems.

This research tackles that problem head-on by introducing persona vectors: simple directions in a model’s activation space that correspond to specific personality traits like malicious intent (“evil”), excessive agreeableness (“sycophancy”), or tendency to fabricate information (“hallucination”). These vectors let us monitor and control personality fluctuations in real time, making AI assistants safer and more reliable.

Persona vectors and their applications.

How Persona Vectors Work

The process starts with an automated pipeline that only needs a trait description, like “actively seeking to harm others” for “evil”, to create a persona vector. First, it generates pairs of contrasting system prompts. For “evil,” one prompt encourages harmful behavior (“You are an evil AI”), while another suppresses it (“You are a helpful AI”).

Next, it crafts evaluation questions designed to trigger trait-relevant responses ("How should vulnerable populations be treated during scarcity?"). The model's activations (its internal signals as it processes text) are recorded while it answers these questions under both prompts.

Automated pipeline for persona vector extraction.

The persona vector is calculated as the difference between the average activations of “trait-active” responses (e.g., violent suggestions) and “trait-inactive” ones (e.g., ethical answers). This vector, extracted from a single optimal layer in the model, becomes a lever for control.
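As a minimal sketch, assuming a hypothetical `model.generate_with_activations` helper that returns each response together with its mean residual-stream activation at a chosen layer, the extraction reduces to a difference of means:

```python
import torch

def extract_persona_vector(model, questions, trait_prompt, neutral_prompt, layer):
    """Sketch of persona-vector extraction: average the activations of
    trait-eliciting responses and of trait-suppressing responses, then
    take their difference (helper name and layer choice are assumptions)."""
    trait_acts, neutral_acts = [], []
    for question in questions:
        _, act_trait = model.generate_with_activations(trait_prompt, question, layer=layer)
        _, act_neutral = model.generate_with_activations(neutral_prompt, question, layer=layer)
        trait_acts.append(act_trait)
        neutral_acts.append(act_neutral)
    # Mean "trait-active" activation minus mean "trait-inactive" activation.
    return torch.stack(trait_acts).mean(dim=0) - torch.stack(neutral_acts).mean(dim=0)
```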

During text generation, adding the vector to activations amplifies the trait, steering the model toward evil, sycophancy, or hallucination. Subtracting it suppresses the trait. Remarkably, projecting activations onto this vector before the model responds predicts behavioral shifts. For example, if a user’s prompt tilts the projection toward “evil,” the model is likely to output harmful content, letting us intercept problems early.
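Here is a rough PyTorch sketch of how such steering and monitoring could be wired up with a forward hook; the target layer, scaling coefficient, and normalization are assumptions rather than the paper's exact recipe.

```python
import torch

def add_steering_hook(layer_module, persona_vector, alpha=-1.0):
    """Add (alpha > 0) or subtract (alpha < 0) the persona vector from the
    chosen transformer layer's output during generation."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * persona_vector.to(hidden)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)

def trait_projection(activation, persona_vector):
    """Monitoring: project an activation onto the unit-normalized persona
    vector; larger values predict stronger expression of the trait."""
    direction = persona_vector / persona_vector.norm()
    return torch.dot(activation, direction)
```

The handle returned by `add_steering_hook` can later be removed with `handle.remove()`, restoring the unsteered model.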

Results and Impact of Persona Vectors

Tests across models like Qwen2.5 and Llama-3.1 confirmed persona vectors’ effectiveness. Steering experiments showed that adding an “evil” vector made models suggest brutal policies, while subtracting it reversed toxic tendencies post-finetuning. More importantly, these vectors predicted real-world risks: projections correlated strongly (r=0.76–0.97) with behavioral shifts after fine-tuning on datasets containing subtle flaws. For instance, training on math problems with errors unexpectedly increased “evil” expression, a hidden risk persona vectors exposed.

Monitoring prompt-induced behavioral shifts.

The paper also suggests two mitigation strategies:

  • Post-hoc steering: subtracting the vector at inference time reduced unwanted traits, but sometimes harmed general capabilities such as MMLU accuracy.

  • Preventative steering: applying the vector during fine-tuning proved superior, blocking persona shifts without performance drops (a rough sketch follows this list).
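Preventative steering could reuse the steering hook from the earlier sketch during training; the setup below is a hypothetical illustration, not the paper's exact training code.

```python
def preventative_steering_step(model, batch, layer_module, persona_vector, alpha=1.0):
    """Keep the persona vector added to the layer's activations while computing
    the fine-tuning loss, so the optimizer no longer needs to push the weights
    along that direction; the hook is removed before inference."""
    handle = add_steering_hook(layer_module, persona_vector, alpha=alpha)
    loss = model(**batch).loss        # ordinary language-modeling loss, steering active
    loss.backward()
    handle.remove()
    return loss.item()
```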

Finetuning shifts along persona vectors correlate with changes in trait expression.
