The Leaderboard Illusion

Plus more about Phi-4-reasoning Technical Report and Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

Apr 28th ~ May 4th
#54 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 452 Allen Institute released OLMo 2 1B, a small yet powerful model trained on 4T tokens with advanced tuning methods, designed for efficient research iteration and outperforming peers like Gemma 3 1B and Llama 3.2 1B. It is now available on Hugging Face.

    OLMo 2 1B benchmark

  2. ♥ 683 Nous Research announced Atropos, a reinforcement learning environments framework for collecting and evaluating LLM trajectories across diverse environments. They are also hosting an Atropos hackathon in San Francisco on May 18th with a $50,000 prize pool. You can sign up here.

LTX Video Release: Fastest & GPU-friendly Video Gen Model

LTX Studio has announced a new version of LTX Video, its latest video generation model, which renders up to 30x faster than competing models while maintaining high visual quality.

The highlight of LTX Video is that it is entirely open-source under an OpenRAIL license (permitting free commercial use for businesses under $10M in revenue) and is designed to run locally on consumer GPUs.

Key features include realistic multiscale rendering (which creates videos hierarchically, starting with a low-resolution structure and progressively refining it with finer details at higher resolutions) and creative controls like setting start and end keyframes via the LTX Studio platform. This makes LTX Video a compelling new option for efficient, locally run AI video creation.

The Leaderboard Illusion

Singh et al. [Cohere, Princeton University, Stanford University, University of Waterloo, Massachusetts Institute of Technology, Allen Institute for Artificial Intelligence, University of Washington]

♥ 4.3k   LLM Benchmarks

The Hidden Biases and Problems in LLM Benchmarks 

New LLM research papers often use benchmarks to guide AI research, but what happens when the benchmark itself starts pointing in misleading directions? You may have already heard of Chatbot Arena, a crowdsourced platform where users compare anonymized model responses. Unlike static benchmarks, it adapts to real-world use cases.

This paper shows that the Chatbot Arena has systemic biases that risk distorting the field’s perception of progress. From undisclosed private testing to unequal data access, the findings highlight how current practices favor a handful of major players, raising urgent questions about fairness and transparency in AI evaluation.

Maximum observed sampling rate for models from different providers.

How the Chatbot Arena System Skews Results

Chatbot Arena’s ranking system uses the Bradley-Terry (BT) model, a statistical method for estimating skill levels from pairwise comparisons. The BT model assumes unbiased sampling, where every model has an equal chance to prove itself. But private testing violates this principle. For instance, Meta tested 27 private variants of Llama-4 before launch and selectively reported the best-performing version. Simulations show that testing just 10 variants inflates a model’s perceived skill by ~100 points. This is like a runner secretly entering a race under multiple aliases and claiming the fastest time.

Number of privately-tested models per provider based on random-sample battles (January – March 2025).
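To build intuition for why selective reporting inflates scores, here is a minimal simulation sketch (not the paper's code) of best-of-N selection: ten private variants with identical true skill each receive a noisy BT-style rating, and only the maximum is reported. The battle count and rating scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def observed_rating(true_skill, n_battles=500, opponent_skill=0.0):
    """Estimate an Arena-style rating from noisy pairwise battles.
    Under Bradley-Terry, P(win) is the logistic of the skill gap;
    the logit is scaled by 400/ln(10), as in Elo-style ratings."""
    p_win = 1.0 / (1.0 + np.exp(-(true_skill - opponent_skill)))
    wins = rng.binomial(n_battles, p_win)
    rate = np.clip(wins / n_battles, 1e-3, 1 - 1e-3)
    return np.log(rate / (1 - rate)) * 400 / np.log(10)

# Ten private variants with IDENTICAL true skill; only noise differs.
ratings = [observed_rating(true_skill=0.0) for _ in range(10)]
print(f"mean rating:       {np.mean(ratings):+6.1f}")
print(f"best-of-10 rating: {max(ratings):+6.1f}")  # the one that gets reported
```

Even though every variant is the same model, the reported best-of-10 rating sits well above the mean, and the fewer battles each variant receives, the noisier the estimates and the larger the inflation.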

Additionally, unequal data access further exacerbates the problem: proprietary models are sampled more frequently, appearing in up to 34% of battles compared to 3–5% for open-source alternatives. This imbalance directly impacts a model’s performance on the benchmark. Models trained on Arena-specific data achieve up to 112% higher win rates, as they adapt to the platform’s unique distribution of prompts and preferences. Over time, this creates a self-reinforcing cycle: dominant models get more data, which sharpens their edge, while others fall further behind.

How to Reform LLM Benchmarks for Fairer Evaluations

This paper clearly shows that proprietary models consistently outperform open-source counterparts on the leaderboard, but this gap shrinks when controlling for data access and testing advantages. For example, when identical open-weight models were submitted under different aliases, their scores varied by up to 5%, purely due to sampling randomness. Similarly, the silent deprecation of 205 models (mostly open-source) further destabilizes rankings, violating the BT model’s assumptions and eroding trust.

To address these issues, the authors of this paper propose the following reforms:

  1. Ban score retraction: All private tests must be publicly logged to prevent cherry-picking.

  2. Limit private variants: Cap submissions to 3 per provider to curb overtesting.

  3. Equalize data access: Allocate battles and deprecations evenly across model types.

  4. Transparent sampling: Publish deprecation lists and enforce fair sampling policies.

You can also check out my video on LLM benchmark/leaderboard cheating.

Phi-4-reasoning Technical Report

Abdin et al. [Microsoft]

♥ 1.4k   LLM Reasoning  

How Phi-4-Reasoning Competes with Giants

Large language models can do a lot, but they also require a lot of computing power. If we could build compact AI models that solve complex problems without those computational requirements, it would be a game changer. Today’s frontier language models often rely on sheer size to tackle multi-step reasoning, but scaling parameters isn’t the only path forward. That’s why Microsoft built Phi-4-reasoning, a 14-billion-parameter model that prioritizes smarter training over brute-force scaling.

Inner Workings and Training of Phi-4-Reasoning

The Phi-4-reasoning model starts from its base model, Phi-4, which already excels at factual recall and basic reasoning. To specialize it, researchers created a dataset of 1.4 million prompts filtered for “teachability”: problems just beyond the base model’s current capabilities. The prompts spanned math, coding, and safety-critical scenarios, with answers generated by OpenAI’s o3-mini to ensure high-quality reasoning traces.

Rewriting seed data from the web (left) into verifiable synthetic questions for SFT and RL (right)
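In code, a teachability filter might look like the sketch below. This is a hypothetical illustration, not the report's pipeline: the callables (`base_model`, `teacher_model`, `verifier`), the 0.5 pass-rate cutoff, and the sample count are all assumptions.

```python
def is_teachable(prompt, base_model, teacher_model, verifier, n_samples=8):
    """Keep only prompts just beyond the base model's reach: the base
    model usually fails them, but the teacher can still produce a
    verifiably correct answer worth learning from.

    base_model, teacher_model, and verifier are hypothetical callables;
    the 0.5 cutoff and n_samples are illustrative, not from the report."""
    base_passes = sum(
        verifier(prompt, base_model.generate(prompt)) for _ in range(n_samples)
    )
    if base_passes / n_samples > 0.5:
        return False  # too easy: the base model already solves it
    # Hard enough; keep it only if the teacher's trace checks out.
    return verifier(prompt, teacher_model.generate(prompt))
```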

The supervised fine-tuning phase repurposes two tokens, <think> and </think>, to structure the model’s internal reasoning. This simple formatting encourages the model to generate detailed chains of thought before its final answers, mimicking how humans break down problems. Training runs revealed an unexpected benefit: even without explicit guidance, the model began producing concise, verifiable solutions, which is useful for real-world applications. Finally, the researchers added a short reinforcement learning phase with 6,000 math-focused problems that pushed the model to generate longer, more precise reasoning traces.
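As a rough sketch, a training example might be assembled like this. Only the <think>...</think> markers come from the report; the prompt/completion framing and field names are assumptions for illustration.

```python
def format_sft_example(question, reasoning_trace, final_answer):
    """Wrap the teacher's reasoning in <think> tags so the model learns
    to reason before answering. Only the <think>...</think> markers come
    from the report; the prompt/completion fields are assumptions."""
    completion = f"<think>\n{reasoning_trace}\n</think>\n{final_answer}"
    return {"prompt": question, "completion": completion}

example = format_sft_example(
    question="What is the sum of the first 10 positive odd integers?",
    reasoning_trace="The sum of the first n odd integers is n^2, so 10^2 = 100.",
    final_answer="100",
)
print(example["completion"])
```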

Evaluating Phi-4-reasoning Model

In the AIME 2025 math competition (a gateway to the USA Math Olympiad), Phi-4-reasoning matches the performance of DeepSeek-R1, a model 48 times larger, and outperforms distilled versions like DeepSeek-R1-Distill-Llama-70B. In coding (LiveCodeBench), it beats its base model by 25 percentage points. 

Researchers often test their models on small datasets, where minor variations skew results. To combat this, the team tested across multiple runs, reported standard deviations, and used larger test sets. They also showed that a “parallel test-time compute” approach is a viable solution for such models: generating many candidate solutions and picking the best (via majority vote) can push accuracy near theoretical ceilings. For example, with 64 parallel generations, Phi-4-reasoning-plus approaches 95% accuracy on AIME 2025, surpassing even its teacher model, o3-mini.
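A minimal sketch of that majority-vote procedure is below. Here `generate` and `extract_answer` are caller-supplied callables, hypothetical stand-ins for an inference call and an answer parser rather than the paper's evaluation harness.

```python
from collections import Counter

def majority_vote(generate, extract_answer, prompt, n=64):
    """Parallel test-time compute: sample n candidate solutions and
    report the most common final answer. `generate` and `extract_answer`
    are caller-supplied callables (hypothetical stand-ins for an
    inference call and an answer parser)."""
    answers = [extract_answer(generate(prompt)) for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n  # winning answer and its agreement rate
```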

Average Pass@1 accuracy (%) of models on selected reasoning benchmarks.

Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

Zuhri et al. [MBZUAI]

♥ 1.7k   LLM Attention   bycloud’s pick  

Introduction to the Softpick (Softmax’s Replacement)

Transformers have become the backbone of modern AI, but the softmax used in their attention layers hides subtle quirks. One of the biggest issues is the attention sink phenomenon, where models allocate disproportionate focus to tokens like the beginning-of-sequence (BOS) marker. Although attention sinks are harmless for performance, they create massive activations, extreme values in hidden states that complicate quantization and low-precision training.

Figure 1: (left) Comparison between the attention maps when using softmax vs softpick and overall sink rate of the models. (right) Largest hidden state activation per layer of the models

The researchers of this paper introduce Softpick, a new normalization function designed to replace softmax. By relaxing softmax’s sum-to-one constraint and introducing sparsity, Softpick can eliminate these artifacts without sacrificing performance. 

Inner Workings of Softpick

Softpick reworks the attention mechanism by decoupling normalization from strict probability constraints. Instead of exponentiating all inputs and normalizing them to sum to one (as in softmax), Softpick applies a ReLU to shifted exponentials, then normalizes by the sum of their absolute values. This simple tweak allows attention scores to be exactly zero for irrelevant tokens, creating sparse patterns. For example, if a token’s score is negative after shifting, it gets clipped to zero, which effectively prunes its contribution to the output.
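Here is a minimal NumPy sketch of that rectified softmax next to standard softmax. The formula follows the description above (ReLU of shifted exponentials over the sum of their absolute values); the max-subtraction stability trick and the epsilon guard are implementation assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def softpick(x, axis=-1, eps=1e-8):
    """Rectified softmax: ReLU(exp(x) - 1) over the sum of |exp(x) - 1|.
    Outputs can be exactly zero and need not sum to 1. Subtracting the
    max, m, keeps exp() stable; the shared factor exp(m) cancels
    between numerator and denominator."""
    m = np.max(x, axis=axis, keepdims=True)
    e = np.exp(x - m)                        # exp(x) / exp(m)
    shifted = e - np.exp(-m)                 # (exp(x) - 1) / exp(m)
    num = np.maximum(shifted, 0.0)           # ReLU of shifted exponentials
    den = np.abs(shifted).sum(axis=axis, keepdims=True)
    return num / (den + eps)                 # eps guard is an assumption

scores = np.array([2.0, 0.5, -1.0, -3.0])
print(softmax(scores))   # all positive, sums to exactly 1
print(softpick(scores))  # negative scores clip to exactly 0
```

Note that the Softpick outputs need not sum to one, and strictly negative scores map to exactly zero, which is where the sparsity comes from.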

This design preserves critical properties of softmax, such as bounded gradients and training stability, while avoiding its pitfalls. Because the denominator includes absolute values, even negative-scoring tokens contribute to normalization, ensuring gradients flow through all inputs. Moreover, this asymmetry breaks the sum-to-one requirement and eliminates the need for models to “waste” attention on sink tokens. The result is a self-regulating mechanism where heads can dynamically shut off (outputting zeros) when unused, reducing noise in hidden states.

Evaluation of Softpick Normalization

The researchers tested the Softpick normalization approach on standard benchmarks like ARC-E and Wikitext and found that it matches or slightly outperforms softmax in accuracy and perplexity. But its real advantage emerges during quantization: at 2–4 bit precision, Softpick models retain significantly more performance than their softmax counterparts.

Comparison of softpick vs softmax performance for HQQ quantization methods. ( ↑= Higher is Better, ↓= Lower is Better, ∆= softpick - softmax)

When analyzing the performance drops caused by quantization, the researchers found that Softpick has a 0% attention sink rate (vs. 33–63% for softmax) and hidden states whose magnitudes are an order of magnitude smaller. These traits simplify low-precision training and sparsity optimizations. For example, dormant heads (those outputting zeros) could be pruned entirely, saving computation. The sparse attention maps also offer clearer interpretability, as zeroed scores highlight only relevant token interactions.

However, Softpick isn’t without limitations. In long-context tasks like passkey retrieval, its scores can become underscaled due to normalization over many tokens, which weakens the signal. If you want to experiment with quantization or sparse training, you can try Softpick as a drop-in replacement today using the code in its GitHub repo.
