"3D" Attention Is Here!

Read more about the new attention technique "3D Attention", plus RL experiments on reasoning transfer and an automated LLM speedrunning benchmark in this week's AI Timeline...

June 30th ~ July 6th
#63 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 1k Google released Gemma 3n, a series of SoTA open models capable of running on mobile devices. The series has a 5B and an 8B model, each of which can run on a setup that would typically only handle a 4B model.

    Gemma 3n Chatbot Arena Score

  2. ♥ ??? Other than that, this week’s AI news is not really that exciting…

The AI Timeline: Premium Insights

Recently, we introduced a premium membership for The AI Timeline!

With the membership, you will receive exclusive insights and explainers on technical AI topics, plus monthly research trend reports containing my insights & analysis of 40+ papers.

Check out the Monthly Research Reports (~4000 words) here:

Plus, we are also scheduling more technical explainers, so subscribe now to stay tuned!

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

Zhao et al. [Meta, University of Edinburgh]

♥ 1k   LLM Self-Improving

Introduction to NanoGPT

Scientific progress relies on trustworthy, reproducible results. But can AI agents actually help by reimplementing and building on existing research? A new benchmark tackles this by testing how well these agents reproduce improvements in training large language models.

The Automated LLM Speedrunning Benchmark uses the NanoGPT Speedrun, a community effort that reduced GPT-2 training times from 45 minutes to under 3 minutes. However, even though the researchers provided detailed hints that described each improvement, current AI agents struggle to match human innovations. This gap indicates that automated reproducibility is a roadblock for AI-driven science.

The Automated LLM Speedrunning Benchmark

Verifying Scientific Claims using LLMs

The researchers evaluated agents on a benchmark of 19 tasks, each of which requires the agent to speed up training starting from the previous record’s code. Agents receive hints in three formats: pseudocode, plain-text descriptions, or mini-papers summarizing the changes.

They operate within a flexible scaffold that iteratively generates, tests, and refines code solutions. This scaffold branches into multiple variations, like Flat (testing many independent ideas) or Multi-AIDE (debugging and improving top candidates). At each step, the agent modifies the training script, runs it on fixed hardware, and analyzes results to guide the next attempt. The process emphasizes practical adjustments, such as optimizing algorithms or hardware usage, without demanding deep theoretical expertise.
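To make this concrete, here is a rough Python outline of such a generate-test-refine scaffold. The helpers `propose_patch` (the LLM call) and `run_training` (the fixed-hardware run), along with the `RunResult` record, are illustrative stand-ins rather than the benchmark's actual API:

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    train_time: float       # wall-clock time to hit the target validation loss
    crashed: bool = False
    error_log: str = ""

def speedrun_search(record_code, hint, propose_patch, run_training,
                    steps=10, branch_factor=4):
    """Generate-test-refine loop in the spirit of the scaffold described above.
    `propose_patch(code, context)` and `run_training(code)` are hypothetical
    stand-ins for the LLM call and the fixed-hardware training run."""
    best_code = record_code
    best_time = run_training(record_code).train_time
    frontier = [best_code]
    for _ in range(steps):
        scored = []
        for code in frontier:
            for _ in range(branch_factor):
                candidate = propose_patch(code, hint)        # propose an edit
                result = run_training(candidate)             # test it on fixed hardware
                if result.crashed:                           # debugging branch
                    candidate = propose_patch(candidate, result.error_log)
                    result = run_training(candidate)
                if not result.crashed:
                    scored.append((result.train_time, candidate))
        scored.sort(key=lambda x: x[0])                      # fastest candidates first
        frontier = [code for _, code in scored[:branch_factor]] or frontier
        if scored and scored[0][0] < best_time:
            best_time, best_code = scored[0]
    return best_code, best_time
```

A "Flat" variant would keep the frontier fixed at the original record and only sample independent ideas, while "Multi-AIDE"-style variants keep refining and debugging the top candidates, as described above.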

After this, the researchers connect these tasks sequentially. The benchmark mirrors real-world research where each advance builds on the last. For example, later tasks might require implementing attention optimizations or mixed-precision training. The scaffold’s design allows agents to learn from failures; buggy solutions trigger debugging, while promising ones spawn further refinements. This structure tests not just coding skill but also how agents handle compounding innovations, a key aspect of scientific progress.

Results and Evaluations

Agents were evaluated using Fraction of Speedup Recovered (FSR), measuring how much of a human record’s training-time improvement they replicated. Without hints, performance was poor, as all models recovered ≤20% of the speedup. Even with hints, results varied: top models like o3-mini achieved 40-46% FSR using pseudocode or combined hints, while others like Gemini-2.5-Pro lagged.
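As a rough sketch of the metric, FSR can be read as the share of the human record's time saving that the agent's solution recovers; the function below is an illustrative reconstruction, and the variable names are mine rather than the paper's:

```python
def fraction_of_speedup_recovered(t_prev, t_record, t_agent):
    """Fraction of Speedup Recovered (illustrative reconstruction).
    t_prev:   training time of the previous speedrun record
    t_record: training time of the human's improved record for this task
    t_agent:  training time achieved by the agent's best solution
    Returns 1.0 if the agent matches the human record, 0.0 if it only
    matches the previous record, and can be negative if it is slower."""
    return (t_prev - t_agent) / (t_prev - t_record)
```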

Performance dropped sharply for later, more complex records, and code-similarity analysis showed that agents often missed key changes. Surprisingly, adding external documentation sometimes hurt performance, suggesting agents struggle to integrate new knowledge.

When the agents built on their own prior solutions, these initial gains faded quickly, and by the third task, speedups vanished. These results highlight a limitation: current agents can’t reliably chain improvements like humans. The benchmark reveals reproducibility as a critical bottleneck for autonomous research. 

Fast and Simplex: 2-Simplicial Attention in Triton

Roy et al. [Meta, University of Texas at Austin]

♥ 1k   Attention  

Introduction to 2-simplicial Transformers

Large language models are advancing rapidly, but they face a growing challenge: the shortage of high-quality training data. Current scaling laws suggest models need ever more tokens as they grow larger, but internet-scale datasets are nearing their limits. This creates a token efficiency problem: how can we achieve better results without endless data? Enter the 2-simplicial Transformer.

This architecture uses a new attention mechanism to extract more value from each token, which is especially useful for tasks that require complex reasoning, such as math and coding. Instead of relying on standard dot-product attention, it introduces a higher-order approach that fundamentally changes how models scale under token constraints.

Inner workings of the 2-simplicial Transformer

The researchers of this study decided to replace the standard bilinear attention mechanism with a trilinear one. Where traditional attention computes pairwise interactions (query-key), the 2-simplicial version adds a third dimension. Each query now interacts with two keys simultaneously through a three-way tensor, capturing more nuanced relationships in sequences.

This allows the model to identify patterns that bilinear attention might miss, such as logical dependencies across multiple tokens. The trilinear operation involves multiplying queries with two distinct key projections, then applying a softmax across the combined dimensions before merging value vectors.
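Below is a minimal, unoptimized PyTorch sketch of the idea for a single head, assuming the logits are a sum over elementwise products of the query with the two key projections and the values are combined elementwise; the actual implementation is a tiled Triton kernel over local windows, and the scaling factor here is a guess:

```python
import math
import torch
import torch.nn.functional as F

def two_simplicial_attention(q, k1, k2, v1, v2):
    """Trilinear (2-simplicial) attention for one head on one sequence.
    q, k1, k2, v1, v2: (n, d) tensors. This naive version is O(n^3) in
    time and memory, which is why the paper restricts it to local windows."""
    n, d = q.shape
    # Three-way logits: A[i, j, k] = sum_d q[i, d] * k1[j, d] * k2[k, d]
    logits = torch.einsum("id,jd,kd->ijk", q, k1, k2) / math.sqrt(d)
    # Softmax jointly over both key dimensions
    attn = F.softmax(logits.reshape(n, n * n), dim=-1).reshape(n, n, n)
    # Combine the two value projections elementwise, weighted by attention
    return torch.einsum("ijk,jd,kd->id", attn, v1, v2)
```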

Visualization of sliding-window 2-simplicial attention and the tiling used to reduce the 2-simplicial einsum.

To manage computational costs, the approach uses localized attention windows. Rather than processing full sequences, which would scale cubically, each query only attends to a limited neighborhood of keys. For example, a window size of 512 × 32 balances efficiency and coverage, keeping latency comparable to standard attention at 48k context lengths. A custom Triton kernel optimizes this by tiling operations across GPU cores and tensor cores, achieving near-peak hardware utilization.
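For intuition, a causal 512 × 32 windowing scheme could be expressed as the boolean mask below. This toy version materializes the full cubic mask, which the real tiled kernel avoids; it is only meant to show which key pairs each query is allowed to see:

```python
import torch

def local_window_mask(n, w1=512, w2=32):
    """Boolean mask for causal sliding-window 2-simplicial attention:
    query i attends to the last w1 positions of the first key sequence
    and the last w2 positions of the second, so the cost scales roughly
    as O(n * w1 * w2) instead of O(n^3)."""
    i = torch.arange(n).view(n, 1, 1)
    j = torch.arange(n).view(1, n, 1)
    k = torch.arange(n).view(1, 1, n)
    return (j <= i) & (i - j < w1) & (k <= i) & (i - k < w2)  # True = keep
```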

The architecture integrates these layers sparingly, using them in every fourth block to distribute computational load. Combined with grouped query attention, this maintains training stability while adding minimal overhead compared to dense attention variants.

Evaluation and implications of 2-simplicial Transformers

The researchers tested this method on reasoning-heavy benchmarks like GSM8K and MMLU, where larger 2-simplicial models (3.5B active parameters) showed consistent gains over standard Transformers. The negative log-likelihood improved by 2.27% on math tasks and 2.15% on complex question answering.

This indicates a better grasp of underlying patterns. Additionally, scaling law analysis revealed a steeper exponent for parameter efficiency, meaning that for fixed token budgets, performance improves faster as models grow. This contrasts with Chinchilla-style scaling, which requires proportional token increases.
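For reference, the Chinchilla-style parametric form that this kind of analysis typically fits is shown below; the claim is that, at a fixed token budget D, 2-simplicial attention yields a steeper effective parameter exponent α, so loss falls faster as N grows (the exact fitted form and coefficients in the paper may differ):

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```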

Negative log-likelihood of Transformer versus 2-simplicial attention.

These results suggest a new way to improve LLMs when data is scarce: the architecture can extract richer signal from each token. Future work could explore hybrid attention hierarchies or refined windowing strategies. For now, the 2-simplicial Transformer shows that rethinking attention mechanisms, not just scaling data, might unlock the next leap in reasoning capabilities.

R² and residuals measuring goodness of fit.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Xing et al. [CUHK, CityU, Tencent AI Lab]

♥ 820   LLM Training   bycloud’s pick  

Introduction to Transferability of Reasoning in LLMs

Large language models have made impressive strides in solving math problems, with new models regularly topping leaderboards on benchmarks like MATH and AIME. But as these scores improve, people have started wondering whether these gains reflect genuine reasoning skills that transfer to other domains, or whether models are simply overfitting to narrow mathematical patterns.

This paper tackles that puzzle by evaluating over 20 reasoning-tuned models across diverse tasks, from scientific QA and coding to conversational dialogue and instruction following. Surprisingly, most models excelling at math failed to generalize their abilities elsewhere. 

Mechanism Behind Transferability Differences

To isolate the impact of fine-tuning methods, researchers conducted controlled experiments using Qwen3-14B models trained exclusively on math data. The researchers compared two approaches: supervised fine-tuning (SFT), where models learn from pre-written solutions, and reinforcement learning (RL), where models optimize for correct answers through trial and error. Both methods improved math performance, but they diverged dramatically elsewhere.

Transferability of mathematical reasoning to other reasoning and non-reasoning tasks.

SFT models showed significant "catastrophic forgetting": their general capabilities eroded. When the researchers tested them on non-math queries, their internal representations drifted substantially. The researchers measured this using principal component analysis on latent activations, which revealed distorted feature spaces that disrupted general task performance. Token distribution analysis further showed SFT models shifting probabilities erratically across many irrelevant tokens, such as adding logical operators to simple emails.
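One simple way to probe this kind of representation drift is sketched below with scikit-learn (an illustrative recipe, not necessarily the paper's exact procedure): fit PCA on the base model's hidden states for a set of non-math prompts, then check how much variance those same components still explain for the fine-tuned model's hidden states on identical prompts.

```python
from sklearn.decomposition import PCA

def representation_drift(base_acts, tuned_acts, n_components=10):
    """base_acts, tuned_acts: (num_prompts, hidden_dim) hidden states from the
    base and fine-tuned models on the *same* non-math prompts.
    Fits PCA on the base model's activations and reports how much variance
    those components still capture for the fine-tuned model; a ratio far
    from 1.0 suggests the feature space has drifted."""
    pca = PCA(n_components=n_components).fit(base_acts)
    base_var = pca.transform(base_acts).var(axis=0).sum()
    tuned_var = pca.transform(tuned_acts).var(axis=0).sum()
    return tuned_var / base_var
```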

On the other hand, the RL models maintained stable representations. Their latent spaces stayed aligned with the base model, preserving versatility. Token shifts were minimal and targeted: only math-relevant terms like "add" or "define" changed during reasoning tasks, while everyday language remained intact. This selective adaptation allowed RL models to extend math gains to coding puzzles or medical QA without compromising conversational ability.
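A token-level counterpart can be sketched as a per-position KL divergence between the fine-tuned and base models' next-token distributions on the same prompt; low, localized values correspond to the targeted shifts described for RL (again, an illustrative measure rather than the paper's exact one):

```python
import torch.nn.functional as F

def token_shift_kl(base_logits, tuned_logits):
    """Per-position KL(tuned || base) between next-token distributions.
    base_logits, tuned_logits: (seq_len, vocab_size) tensors.
    Returns a (seq_len,) tensor; near-zero values mean the tuned model
    left that position's distribution essentially untouched."""
    base_logp = F.log_softmax(base_logits, dim=-1)
    tuned_logp = F.log_softmax(tuned_logits, dim=-1)
    return F.kl_div(base_logp, tuned_logp, log_target=True,
                    reduction="none").sum(-1)
```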

Performance and Implications of Transferable Reasoning

The results of this study were quite surprising. On math tasks, RL models slightly outperformed SFT (53.8% vs. 49.8% average). But the real gap emerged elsewhere: RL surged ahead by 17.1% on coding benchmarks and 24% on non-reasoning tasks like email drafting, while SFT models regressed. The Transferability Index, a metric quantifying cross-domain generalization, confirmed RL’s edge, with positive gains across all categories; SFT models scored negatively on non-reasoning work, losing up to 41% of performance.
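One plausible way to compute such an index, purely as an illustration (the paper's exact definition may differ), is the average relative gain outside math divided by the relative gain on math:

```python
def transferability_index(base_scores, tuned_scores):
    """Illustrative cross-domain transferability measure, not necessarily the
    paper's exact formula. Positive values mean math training also helped
    elsewhere; negative values mean other capabilities regressed.
    base_scores, tuned_scores: dicts like {"math": 0.32, "coding": 0.41, ...}."""
    math_gain = (tuned_scores["math"] - base_scores["math"]) / base_scores["math"]
    others = [d for d in base_scores if d != "math"]
    other_gain = sum(
        (tuned_scores[d] - base_scores[d]) / base_scores[d] for d in others
    ) / len(others)
    return other_gain / math_gain if math_gain else float("nan")
```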

These results challenge common practices: SFT is useful for specialized tasks, but it risks fragmenting a model’s core capabilities. RL’s on-policy learning, while computationally heavier, anchors improvements to the model’s existing knowledge, making reasoning gains portable.

This means that developers should prioritize RL when building general-purpose assistants. Future work could explore hybrid methods, but one lesson is clear: True reasoning isn’t just solving equations; it’s adapting those skills to the messy, multifaceted world beyond the chalkboard.
