
AlphaEvolve: A coding agent for scientific and algorithmic discovery

Plus more about Parallel Scaling Law for Language Models and faster matrix multiplications

May 12th ~ May 19th
#56 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 5.4k You can now use GPT-4.1 in ChatGPT if you are a Plus, Pro, or Team user. It's a faster model with a 1 million token context window that is good at coding and following instructions. All users will also get the new GPT-4.1 mini (replacing GPT-4o mini), and Enterprise/Edu users will get GPT-4.1 access in the coming weeks. OpenAI has also released a test version of Codex, a new agentic tool that can work on many coding tasks at the same time, such as finding issues or suggesting code changes. Pro, Enterprise, and Team users can start using Codex in ChatGPT today.

  2. ♥ 1.5k Nous Research has launched Psyche, a decentralized network that allows many different computers to work together to train large AI models. They are currently testing the network by training a 40B LLM called consilience-40b, the largest model of its kind to be trained over the internet so far.

  3. ♥ 12k Google made some really impressive announcements during Google I/O, its annual developer conference. We will highlight a few important ones below, and you can also browse all Google I/O announcements for the full list.

  4. ♥ 11k Google has launched Jules, an experimental coding agent that helps you fix bugs, add documentation, and build new features by connecting to your GitHub repositories. It works on tasks in the background while you do other things, and you can start using it by visiting jules.google.com.

  5. ♥ 13k Google has launched Veo 3, a state-of-the-art video generation model that now includes built-in sound generation, allowing you to add dialogue, sound effects, and background noise. This updated version, which also features extremely realistic generations, is available now in the Gemini App for Google AI Ultra subscribers in the U.S. Check out Veo 3 demos.

  6. ♥ 3.8k Google DeepMind has released Gemini Diffusion, a state-of-the-art text diffusion model. I tried it, and it can generate text at a staggering 1,000 tokens per second. You can sign up for its waitlist here.

We have just revamped findmypapers.ai!

If you look closely… something is different

Remember findmypapers.ai, our project designed to revolutionize how you survey for AI research papers? Well, we've got some exciting news! Since our initial launch, we've been all ears, soaking up your amazing feedback, and have been hard at work making improvements to create an even better experience for you.

This time, we have overhauled everything UI-related: the interface is now much faster and more responsive, so your overall experience is a lot smoother. Check out the new UI below!

For a bit of context, findmypapers.ai is a semantic search engine covering 340k+ AI research papers, and it outcompetes Deep Research tools such as Grok, OpenAI, Perplexity, and Gemini at finding relevant papers.

To celebrate how far we've come with your help, you can still grab 50% off with code BETAN50 if you want to unlock more usage (only a handful of redemptions left!). Head over to findmypapers.ai, give it a spin, and keep the amazing feedback coming on our X account or Discord!

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Novikov et al. [Google DeepMind]

♥ 7k   LLM Ensemble

Introduction to AlphaEvolve

This week Google launched AlphaEvolve, a new evolutionary coding agent that combines LLMs with evolutionary computation to tackle problems ranging from matrix multiplication optimizations to open mathematical conjectures.

Until now, LLMs have shown promise in accelerating scientific discovery, but their ability to generate entirely novel, high-impact algorithms has been limited. This is mainly because they rely on static training data and cannot iteratively refine their solutions. AlphaEvolve addresses this by creating a dynamic feedback loop where code is written, tested, criticized, and improved autonomously.

Inner Workings of AlphaEvolve

The AlphaEvolve architecture uses a distributed evolutionary pipeline. It maintains a population of code variants, each of which represents a potential solution to a user-defined task (e.g., “optimize this matrix multiplication kernel”). The system begins with an initial codebase annotated by developers, who mark the specific components to be evolved, as in the sketch below.
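To make the annotation step concrete, here is a minimal, hypothetical sketch of what a marked-up codebase could look like; the EVOLVE-BLOCK comment markers and the evaluate() hook follow the paper's general description, but the exact names and the toy task are illustrative assumptions.

```python
# Hypothetical annotation of a codebase for AlphaEvolve-style evolution.
# Only the region between the markers is rewritten by the agent; the
# surrounding harness and the automated evaluator stay fixed.
import numpy as np

def multiply(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # EVOLVE-BLOCK-START
    # Naive starting point; the agent proposes diffs against this block.
    return a @ b
    # EVOLVE-BLOCK-END

def evaluate() -> dict:
    """Automated scorer the controller runs on every evolved variant."""
    rng = np.random.default_rng(0)
    a, b = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
    error = float(np.abs(multiply(a, b) - a @ b).max())
    return {"correct": error < 1e-9, "error": error}
```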

After this, a controller orchestrates LLMs (such as Gemini 2.0 Flash and Pro) to propose code modifications in the form of diffs. If you have worked with Git, you will know that a diff is a targeted edit that replaces existing code blocks with new versions. These diffs are generated using rich contextual prompts that include past successful solutions, evaluation metrics, and problem-specific instructions.

Example diff output generated by the LLM
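In the spirit of that example, a diff of this kind might look like the sketch below; the SEARCH/REPLACE marker style follows the paper's description of block-level edits, and the proposed blocked_matmul variant is purely hypothetical.

```
<<<<<<< SEARCH
    # Naive starting point; the agent proposes diffs against this block.
    return a @ b
=======
    # Proposed variant: tile the computation to improve cache behaviour.
    return blocked_matmul(a, b, block_size=2)
>>>>>>> REPLACE
```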

The evolutionary process relies on two key mechanisms. First, an automated evaluation function scores each proposed variant on task-specific metrics (e.g., computational speed, mathematical correctness). Second, a program database acts as a collective memory, storing high-performing solutions and feeding them back into LLM prompts to inspire future generations.

This creates a survival-of-the-fittest dynamic: code that improves performance persists, while less effective variants are pruned. However, unlike traditional genetic programming, the “mutation” operator here isn’t random. The mutations are guided by LLMs capable of reasoning about algorithmic improvements.
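Putting the two mechanisms together, the loop below is a deliberately simplified sketch of the evolve-evaluate-store cycle; the sampling strategy, acceptance rule, and helper callables (propose_diff, apply_diff, evaluate) are assumptions made for illustration, not DeepMind's implementation.

```python
# Simplified evolutionary loop: sample strong parents from the program database,
# ask an LLM for a diff, score the child, and keep it only if it does not regress.
import random

def evolve(initial_program: str, evaluate, propose_diff, apply_diff,
           steps: int = 100) -> str:
    database = [(evaluate(initial_program), initial_program)]   # collective memory
    for _ in range(steps):
        # Bias parent selection toward high scorers ("survival of the fittest").
        sample = random.sample(database, k=min(3, len(database)))
        parents = sorted(sample, key=lambda entry: entry[0], reverse=True)
        context = [program for _, program in parents]           # fed into the LLM prompt
        diff = propose_diff(context)                            # LLM-guided "mutation"
        child = apply_diff(parents[0][1], diff)
        score = evaluate(child)                                 # automated, task-specific metric
        if score >= parents[0][0]:
            database.append((score, child))                     # persist improvements
    best_score, best_program = max(database, key=lambda entry: entry[0])
    return best_program
```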

One of the strongest suits of AlphaEvolve is its ability to handle multi-component optimization. For example, when improving a matrix multiplication algorithm, it might simultaneously evolve the numerical initialization scheme, gradient update rules, and noise injection strategies. The system supports cascading evaluations, where promising candidates undergo increasingly rigorous testing. This balances exploration (trying radically new ideas) with exploitation (refining known good solutions), allowing AlphaEvolve to escape local optima that trap conventional approaches.

Examples of SOTA-breaking mathematical constructions

Benchmark results of AlphaEvolve

AlphaEvolve has already made several computational breakthroughs. In matrix multiplication, which is arguably the most important mathematical operation for linear algebra (and AI), it discovered a 48-scalar-multiplication algorithm for 4×4 complex matrices, breaking a 56-year record held by Strassen’s algorithm. This single improvement could ripple through countless applications, from machine learning training to scientific simulations. Beyond math problems, the system has also optimized critical infrastructure at Google, including data center scheduling algorithms and TPU arithmetic circuits.
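For context on why 48 multiplications breaks the record: Strassen's 1969 scheme multiplies 2×2 matrices with 7 scalar multiplications instead of 8, and applying it recursively to a 4×4 matrix (viewed as a 2×2 matrix of 2×2 blocks) gives the long-standing benchmark AlphaEvolve just edged past:

```latex
\underbrace{7}_{\text{block products}} \times \underbrace{7}_{\text{per block}} = 49
\ \text{scalar multiplications (recursive Strassen, } 4\times4\text{)}
\quad \longrightarrow \quad 48 \ \text{(AlphaEvolve, over } \mathbb{C}\text{)}.
```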

AlphaEvolve improved on the best known solutions for roughly 20% of a curated set of 50+ open math problems, including better constructions for Erdős’ Minimum Overlap Problem and the kissing number problem in 11 dimensions. These successes highlight its versatility: the same architecture that tunes low-level code also navigates abstract combinatorial spaces.

However, AlphaEvolve isn’t without constraints. Its effectiveness depends on having automated evaluation metrics, which limits its applicability to domains requiring manual experimentation. Additionally, while sample-efficient compared to brute-force search, the system still requires substantial compute for complex tasks (hours of parallel accelerator time).

We've just published the very first post in our brand new "Premium Insights" series! We're diving deep into how DeepSeek absolutely CRUSHED formal math, creating the current best math prover. In this in-depth blog post, we explain how DeepSeek achieved a +500% improvement over the previous best. It's insane!

We'd love for you to check it out and then let us know what you think. We're looking forward to hearing your feedback on this first installment!

Parallel Scaling Law for Language Models   

Chen et al. [Zhejiang University, Qwen Team, Alibaba Group]

♥ 642   LLM Scaling    bycloud’s pick  

How to Scale Language Models 

LLMs are getting bigger and bigger, and deploying them is becoming a logistical nightmare. On top of that, the traditional ways of scaling them, whether adding more parameters or generating more inference-time tokens, come with steep memory and latency costs.

This paper from Alibaba’s Qwen Team proposes a new perspective: parallel scaling. Instead of demanding bigger and more powerful hardware to run bigger models, this architecture rethinks how we use the compute we already have.

The Mechanics of Parallel Scaling in LLMs

The idea behind parallel scaling (PARSCALE) is straightforward. Instead of stacking layers or widening networks, the method runs multiple parallel "streams" of computation through the same model and each stream processes a slightly modified version of the input. 

For example, a stream could differ by adding unique learnable prefixes; after processing the streams the results are dynamically aggregated. You can think of it as brainstorming with multiple copies of the same model, each offering a different perspective before combining the best ideas.

For a given input, the model prepends P distinct prefixes (e.g., P=8), processes the resulting streams in parallel, and merges the outputs using a lightweight neural network that learns optimal weights for combining predictions. This approach recycles the model’s core parameters and adds only a tiny fraction of new parameters (roughly 0.2% per stream). Even minor variations in the input prefixes lead to divergent predictions, and aggregating them amplifies the model’s reasoning power.
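As a rough sketch of this mechanism, the module below wraps a shared backbone with P learnable prefixes and a small learned gate that weights the streams. It is a paraphrase of the idea under simplifying assumptions (prefixes concatenated at the embedding level, a single softmax gate), not the Qwen team's exact architecture.

```python
# Minimal sketch of parallel scaling (PARSCALE): one shared backbone, P learnable
# input prefixes, and a small learned gate that aggregates the P streams.
# Shapes and module choices are illustrative, not the paper's exact design.
import torch
import torch.nn as nn

class ParScaleWrapper(nn.Module):
    def __init__(self, backbone: nn.Module, d_model: int,
                 num_streams: int = 8, prefix_len: int = 4):
        super().__init__()
        self.backbone = backbone                                  # parameters shared by all streams
        self.prefixes = nn.Parameter(                             # tiny per-stream parameter cost
            0.02 * torch.randn(num_streams, prefix_len, d_model))
        self.gate = nn.Linear(d_model, 1)                         # learns per-stream weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) token embeddings
        streams = []
        for p in range(self.prefixes.shape[0]):                   # in practice run as one batched pass
            prefix = self.prefixes[p].expand(x.shape[0], -1, -1)
            h = self.backbone(torch.cat([prefix, x], dim=1))      # (batch, prefix_len + seq, d_model)
            streams.append(h[:, -x.shape[1]:, :])                 # drop the prefix positions
        h = torch.stack(streams, dim=1)                           # (batch, P, seq, d_model)
        weights = torch.softmax(self.gate(h).squeeze(-1), dim=1)  # (batch, P, seq)
        return (weights.unsqueeze(-1) * h).sum(dim=1)             # aggregated hidden states
```

Any module that maps (batch, seq, d_model) tensors to the same shape, such as a stack of transformer layers, could serve as the backbone here; the point is that all P streams reuse its weights, and only the prefixes and the gate are new.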

The researchers draw inspiration from classifier-free guidance in diffusion models, where perturbing inputs during inference improves output quality. But PARSCALE takes this further by making both input perturbations and aggregation learnable during training. This turns a heuristic trick into a systematic scaling strategy. 

Performance and Efficiency of PARSCALE Architecture 

The Qwen team tested the effectiveness of PARSCALE through extensive experiments. The biggest finding is that scaling parallel computation to P streams delivers benefits comparable to scaling the parameter count by a factor of O(log P), so an 8-stream PARSCALE model can match a parametrically larger model while reusing the same core weights. The effect holds across tasks but shines brightest in reasoning-heavy domains like math and coding: on GSM8K, scaling to P=8 boosted accuracy by 34% using the same training data.
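One informal way to write this equivalence (a paraphrase of the O(log P) relationship rather than the paper's fitted loss formula, with k standing in for a fitted constant):

```latex
\text{model with } N \text{ parameters and } P \text{ streams}
\;\approx\;
\text{dense model with } N_{\text{eff}} \approx N \cdot \bigl(1 + k \log P\bigr) \text{ parameters}.
```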

Efficiency metrics are equally compelling. Compared to parameter scaling that reaches the same performance, PARSCALE requires up to 22× less additional memory and 6× less additional latency at batch size 1. Even at larger batch sizes, the computational overhead remains manageable, thanks to optimized GPU utilization. The method also supports dynamic adjustment: models can switch P during deployment, scaling capabilities up or down without retraining.

However, PARSCALE still has a few limitations: training multi-stream models demands careful resource allocation, and the interplay between data diversity and parallel scaling needs deeper exploration.

XXᵗ Can Be Faster

Rybin et al. [The Chinese University of Hong Kong, Shenzhen Research Institute of Big Data]

♥ 4.3k   Matrix Computation

Accelerating Matrix Computations with Machine Learning-Guided Algorithms

Matrix multiplication is one of the most important mathematical operations for processing data, used for everything from neural networks to scientific simulations. Even displaying a picture on your screen requires the processor to perform hundreds of thousands of matrix multiplications. Since the operation is so widely used, any optimization to it pays off across the board.

This paper introduces a new approach that rethinks how we compute a fundamental operation: the product of a matrix with its own transpose, XXáµ—. These operations are ubiquitous in machine learning (e.g., covariance estimation, optimizer steps like Shampoo) but are often treated as generic matrix multiplications, ignoring their inherent symmetry.
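To see the symmetry that generic matmuls leave on the table, here is a small numpy sketch: XXᵗ is symmetric, so a general-purpose product computes every off-diagonal entry twice, while a symmetry-aware baseline fills only the upper triangle. This is just the motivating observation, not the RXTX algorithm itself.

```python
# XX^t is symmetric, so only the upper triangle needs to be computed.
# This is the baseline observation RXTX builds on, not RXTX itself.
import numpy as np

def xxt_symmetric(x: np.ndarray) -> np.ndarray:
    n = x.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):                       # j >= i: roughly half the dot products
            out[i, j] = out[j, i] = x[i] @ x[j]
    return out

x = np.random.default_rng(0).standard_normal((6, 10))
assert np.allclose(xxt_symmetric(x), x @ x.T)
```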

How RXTX Multiplies Matrices

The RXTX approach uses recursive block matrix multiplication. It partitions the input matrix into a 4×4 grid of blocks and processes them using a combination of recursive calls and carefully orchestrated intermediate products. Unlike previous approaches that adapt general-purpose algorithms like Strassen’s, RXTX introduces a novel decomposition strategy. By reorganizing arithmetic operations and reusing intermediate terms, it sidesteps redundant calculations inherent in symmetric products.

The RXTX algorithm’s design was discovered through a hybrid search methodology. A reinforcement learning agent proposed candidate computation steps, while combinatorial optimization techniques pruned these candidates to find minimal-operation configurations. This two-stage process effectively navigated the vast space of potential algorithms, balancing exploration (via RL) with rigorous validation (via mixed-integer linear programming, MILP).

This process was further improved by exploiting shared sub-expressions. For example, terms like X₆ + X₇ appear in multiple steps of the computation; by identifying and reusing these overlaps, RXTX reduces the number of additions required, further trimming operational overhead. The result is a recursive blueprint that scales efficiently: for an n×n matrix, the algorithm asymptotically needs only 26/41 of the multiplications used by Strassen’s method, with gains visible even at small n.
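As a sanity check on that constant, suppose each level of the recursion performs 8 recursive XXᵗ calls on quarter-size blocks plus 26 general products computed with Strassen's algorithm (these block counts are inferred from the 4×4 partition and the quoted ratio, so treat the recurrence as illustrative). Summing the resulting geometric series with ω = log₂ 7, so that 4^ω = 49, gives:

```latex
R(n) \;=\; 8\,R\!\left(\tfrac{n}{4}\right) + 26\,M\!\left(\tfrac{n}{4}\right),
\qquad M(m) \sim m^{\omega}
\;\;\Longrightarrow\;\;
R(n) \;\sim\; \frac{26\,(n/4)^{\omega}}{1 - 8/4^{\omega}}
      \;=\; \frac{26}{49}\cdot\frac{49}{41}\,n^{\omega}
      \;=\; \frac{26}{41}\,n^{\omega}.
```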

Benchmark results of RXTX matrix multiplication

The researchers tested the RXTX algorithm against existing methods and found that it consistently outperforms recursive Strassen-based approaches. For large matrices, it achieves a 5% reduction in both multiplications and total operations (additions + multiplications); for smaller cases like n=4, the improvement jumps to 10%. These gains can be quite useful for building LLMs, since several model components perform XXᵗ computations thousands of times during training.

However, the algorithm’s recursive nature introduces implementation complexity. Managing block subdivisions and intermediate terms requires careful engineering, and the benefits depend on using optimized low-level kernels for base-case computations. Additionally, while RXTX is great at solving the XXᵗ operation, using the same method for other similar operations like XᵗX is still not possible.
