Defeating Nondeterminism in LLM Inference
Plus more on the analog in-memory computing attention mechanism for fast and energy-efficient large language models, and The Majority is not always right: RL training for solution aggregation
Sep 8th ~ Sep 16th
#73 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 4.2k Alibaba's Qwen team has introduced Qwen3-Next-80B-A3B, its latest LLM with 80 billion parameters, which only activates 3 billion per token. This design enables it to train 10 times more cheaply and run inference 10 times faster than the Qwen3-32B model, particularly with context lengths over 32,000 tokens. If you are interested in exploring this new architecture, you can access the model and its technical details through the official blog post and platforms like Hugging Face and ModelScope.
Qwen3-Next Hybrid Architecture: Gated DeltaNet + Gated Attention
♥ 315 Researchers from MBZUAI and G42 have released K2 Think, a new 32B parameter open-source AI system designed for advanced reasoning. The model delivers frontier-level performance and outperforms systems more than 20 times its size by using long chain-of-thought supervised fine-tuning. You can try K2 Think right now in your browser.
♥ 349 The Interaction Company has launched Poke, which is gaining positive attention on social media for its text-based interface that lets users accomplish a variety of tasks. Many people like its conversational nature for everything from getting daily jacket recommendations to analyzing YouTube channel statistics. It can even handle follow-up messages and corrections like a natural conversation.
♥ 1.3k H Company has released the Holo1.5 series, a new family of open-source models designed to power "Computer Use" agents that can interact with web, desktop, and mobile applications on a user's behalf. These state-of-the-art models excel at localizing user interface elements and answering questions about on-screen content, enabling powerful productivity agents. View a replay of the Holo1.5 session in your browser to see what it can do for you.
♥ 413 Google has released VaultGemma, which is a new LLM trained from scratch with differential privacy to protect sensitive information in the training data. This new model is available in a 1B-parameter open version, and it allows researchers and developers to build privacy-preserving AI applications. You can download VaultGemma from Hugging Face today.
The structure of DP scaling laws.
Support My Newsletter
As I aim to keep this newsletter free forever, your support means a lot. If you like reading The AI Timeline, consider forwarding it to another research enthusiast; it helps us keep this up for free!
Defeating Nondeterminism in LLM Inference
He et al. [Thinking Machines]
♥ 7.3k Nondeterminism
Introduction to Defeating Nondeterminism in LLM Inference
Have you ever noticed that asking a language model the same question multiple times gives different answers, even at temperature 0, a setting that should make sampling deterministic? This inconsistency is a real obstacle for researchers and developers who rely on reproducible results. The common explanation is that the nondeterminism stems from floating-point arithmetic combined with concurrency on GPUs.

This paper argues that the bigger culprit is batch size: the number of requests being served together changes the order of floating-point reductions inside key computational kernels, so the same prompt can yield different outputs depending on server load. The researchers tackle this by making these kernels "batch-invariant," ensuring consistent results no matter how many requests are processed together.

From the perspective of an individual user, the other concurrent users are not an "input" to the system but rather a nondeterministic property of the system.
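The underlying numerical fact is that floating-point addition is not associative, so two different reduction orders over the same values can disagree in the last few bits. Here is a minimal Python illustration (my own sketch, not from the paper) of why a reduction order that depends on batch size matters:

```python
import numpy as np

# Floating-point addition is not associative, so summing the same values
# in a different order can change the last few bits of the result.
vals = np.random.RandomState(0).standard_normal(100_000).astype(np.float32)

sum_single_pass = vals.sum()                                  # one reduction order
sum_chunked = sum(chunk.sum() for chunk in np.split(vals, 100))  # another order

print(sum_single_pass == sum_chunked)        # frequently False
print(float(sum_single_pass - sum_chunked))  # typically a tiny nonzero difference
```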
Inner Workings of Batch-Invariant Kernels
The team focused on three core operations in transformer models: RMSNorm, matrix multiplication, and attention. Each of these involves reductions (summing values across dimensions), which are sensitive to batch size changes. Normally, kernels optimize performance by adjusting their reduction strategy based on batch size, but this variability breaks consistency.
For RMSNorm, the solution is straightforward: use a data-parallel approach where each batch element is processed independently within a single core, avoiding inter-core communication that introduces order changes.

Data Parallel RMSNorm
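As a rough PyTorch sketch of this data-parallel idea (an illustration, not the authors' actual kernel), a batch-invariant RMSNorm simply reduces each row on its own, in a fixed order:

```python
import torch

def rmsnorm_batch_invariant(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Each batch element (row) is reduced independently along the hidden
    # dimension, so adding or removing other rows in the batch cannot change
    # the reduction order, and hence the numerics, for this row.
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight
```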
Matrix multiplication poses a bigger challenge due to the use of tensor cores and tile-based processing for efficiency. Here, the researchers enforce a fixed kernel configuration across all batch sizes (for instance, avoiding split-K strategies that would split the reduction dimension differently at small batch sizes), sacrificing some performance but ensuring that reduction orders remain unchanged.
Attention mechanisms add another layer of complexity, as they handle sequences that can be split or cached differently during inference. By standardizing how key-value caches are updated and using fixed split sizes for reductions, they maintain identical numerics regardless of how tokens are processed, making attention batch-invariant as well.

Data Parallel Matmul
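A quick way to see what batch invariance demands, sketched here as a hypothetical check rather than the paper's test harness: the result for one row must be bitwise identical whether it is computed alone or as part of a larger batch, and the same criterion applies to attention outputs.

```python
import torch

torch.manual_seed(0)
A = torch.randn(8, 512)
B = torch.randn(512, 512)

out_full_batch = A @ B        # row 0 computed as part of a batch of 8
out_single_row = A[:1] @ B    # the same row computed with batch size 1

# A batch-invariant kernel must make these bitwise identical; with default
# GPU kernels the tiling/reduction strategy can change with batch size,
# so this check may fail there.
print(torch.equal(out_full_batch[:1], out_single_row))
```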
Evaluation and Implications for AI Research
When tested with a Qwen model, sampling the same prompt 1,000 times under standard inference produced 80 distinct completions, while the batch-invariant approach produced the identical completion every time. Performance benchmarks show a slowdown (from 26 seconds to 42-55 seconds for processing 1,000 sequences), but this is a manageable trade-off for determinism.

More importantly, this work enables true on-policy reinforcement learning: when the sampler and the trainer produce bitwise-identical numerics, the KL divergence between them is exactly zero. This not only makes LLM inference reproducible but also opens the door to more reliable AI systems in research and production.
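Reproducing this kind of check against your own inference stack is straightforward; the sketch below assumes a hypothetical `generate(prompt)` callable that wraps whatever server you are testing.

```python
from collections import Counter

def count_unique_completions(generate, prompt: str, n: int = 100):
    # `generate` is a hypothetical greedy (temperature-0) sampling function
    # wrapping your inference server; under a truly deterministic stack the
    # returned count should be exactly 1.
    completions = [generate(prompt) for _ in range(n)]
    counts = Counter(completions)
    return len(counts), counts.most_common(3)
```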
Analog in-memory computing attention mechanism for fast and energy-efficient large language models
Leroux et al. [Forschungszentrum Jülich, RWTH Aachen]
♥ 2.2k LLM Memory Computing
Introduction to In-Memory Computing for Transformer Efficiency
Transformer models have become the backbone of modern AI, but they come with a significant drawback: high energy consumption and latency, especially during inference. This is largely due to the need to repeatedly load key-value (KV) cache projections from GPU memory into static RAM at each generation step.
This paper introduces a hardware solution that uses in-memory computing with gain cells to store token projections and compute attention operations directly in analog. This approach avoids the costly data transfers that slow down traditional GPUs and opens the door to much faster and more energy-efficient generative transformers.

Building blocks of the analog hardware attention mechanism.
Inner Workings of the Gain-Cell Attention Architecture
This paper suggests using gain-cell arrays to store keys and values while performing the dot products needed for self-attention in the analog domain. Gain cells act as both memory and multipliers: they store multi-level voltages representing token projections and generate output currents proportional to the product of stored values and input pulses. This allows the attention mechanism to compute without repeatedly moving data between memory and processing units.

Analog hardware attention pipeline.
To handle the non-idealities of analog computation (like nonlinearities and value decay over time), the authors designed charge-to-pulse circuits that convert integrated currents into pulse-width modulated signals. These pulses are used for intermediate computation and activation, replacing power-hungry analog-to-digital converters. The architecture also uses sliding window attention to limit the number of tokens attended to at each step, making the hardware design scalable and practical.
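As a rough software model of that attention variant (an illustrative sketch, not the analog circuit itself, and with scaling details assumed), the per-token computation looks roughly like this:

```python
import torch
import torch.nn.functional as F

def sliding_window_analog_style_attention(q, K, V, window: int = 16):
    # Software sketch of the attention variant described above: only the most
    # recent `window` tokens are attended to, and softmax is replaced by a
    # HardSigmoid-style activation, which is far easier to realise with
    # charge-to-pulse circuits than an exponential normalisation.
    K_w, V_w = K[-window:], V[-window:]          # sliding KV window
    scores = (K_w @ q) / (q.shape[-1] ** 0.5)    # dot products done in analog on-chip
    weights = F.hardsigmoid(scores)              # replaces softmax
    return weights @ V_w                         # weighted sum over the window
```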

Multi-tile design and layout for multi-head attention.
The adaptation algorithm allows pre-trained models like GPT-2 to work on this non-ideal hardware without full retraining. By fine-tuning scaling parameters layer by layer, the system matches the statistical behavior of ideal digital models, ensuring that performance remains high even with analog imperfections and quantized operations.

Hardware model adaptation and training.
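The spirit of that adaptation step can be illustrated with a toy statistic-matching routine (a hypothetical sketch, not the paper's algorithm): choose per-layer scales so the simulated hardware activations match the ideal digital ones on calibration data.

```python
import torch

@torch.no_grad()
def calibrate_layer_scale(ideal_out: torch.Tensor, hardware_out: torch.Tensor) -> float:
    # Toy illustration of layer-wise statistic matching: pick a scalar scale
    # so the simulated hardware layer's output has the same standard
    # deviation as the ideal digital layer's output.
    return (ideal_out.std() / hardware_out.std()).item()
```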
Evaluation and Performance of the Hardware Design
The proposed architecture shows remarkable efficiency improvements. Compared to GPUs, it reduces attention latency by up to two orders of magnitude and energy consumption by up to four orders of magnitude. For example, it achieves energy savings of 40,000x over an embedded GPU and 70,000x over a data-center GPU when performing attention computations. These gains come from performing dot products fully in analog, minimizing data movement, and using efficient pulse-based signaling.

Analog hardware attention mechanism accuracy and performance.
In terms of accuracy, the adapted model performs comparably to GPT-2 on standard language tasks like LAMBADA, HellaSwag, and WikiText-2, even with hardware constraints like HardSigmoid activation instead of softmax and low-precision quantization.
The Majority is not always right: RL training for solution aggregation
Zhao et al. [FAIR at Meta, CMU]
♥ 714 LLM RL bycloud’s pick
Introduction to Aggregation in Large Language Models
Scaling up test-time compute by generating multiple solutions and selecting among them has become a common strategy for improving large language models on difficult reasoning tasks. However, standard aggregation methods like majority voting or reward model ranking often fall short, especially when correct answers appear only in the minority.
The paper introduces AggLM, a method that uses reinforcement learning from verifiable rewards to train a model to review, reconcile, and synthesize answers from multiple candidate solutions. By carefully balancing easy and hard examples during training, AggLM learns to recover correct minority answers while still handling straightforward cases effectively.
Inner Workings of AggLM
AggLM works by first sampling multiple independent solutions from a base language model for a given problem. These candidate solutions are then passed to an aggregator model, which is trained to produce a final answer by analyzing and combining the inputs. The aggregator reasons over the solutions, corrects mistakes, and fills in gaps where needed instead of just picking the most frequent answer.
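A minimal sketch of that aggregation step, with an illustrative prompt and a hypothetical `aggregator_generate` callable standing in for the trained model (this is not the paper's exact template):

```python
def aggregate(problem: str, candidate_solutions: list[str], aggregator_generate) -> str:
    # The k sampled solutions are packed into a single prompt and the
    # aggregator model reasons over them to produce one final answer.
    numbered = "\n\n".join(
        f"Solution {i + 1}:\n{s}" for i, s in enumerate(candidate_solutions)
    )
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Here are {len(candidate_solutions)} candidate solutions:\n\n{numbered}\n\n"
        "Review and reconcile these solutions, correct any mistakes, "
        "and give a single final answer."
    )
    return aggregator_generate(prompt)
```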

Given a task and sampled LLM solutions as input, AggLM uses reasoning to review, reconcile, and synthesize a final aggregated solution, which is typically superior to the original solutions.
Training happens through reinforcement learning with verifiable rewards. For each problem, the model receives a reward of 1 if its aggregated answer matches the ground truth and 0 otherwise. Training mixes easy examples, where the majority answer among the candidates is already correct, with hard ones, where the majority is wrong. This balance helps the model learn both to trust correct majorities and to recover correct minority answers.
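In code, the verifiable reward and the easy/hard split reduce to very simple checks; the exact-match answer normalization below is an assumption for illustration.

```python
from collections import Counter

def verifiable_reward(final_answer: str, ground_truth: str) -> float:
    # 1 if the aggregated final answer matches the ground truth, 0 otherwise.
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0

def is_hard_example(candidate_answers: list[str], ground_truth: str) -> bool:
    # "Hard" problems are those where the majority-voted candidate answer is
    # wrong; training balances these against easy, majority-correct ones.
    majority_answer, _ = Counter(a.strip() for a in candidate_answers).most_common(1)[0]
    return majority_answer != ground_truth.strip()
```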
The aggregator can be the same model as the solution generator or a separate one; in practice, the paper shows that both setups work well. Training uses Group-Relative Policy Optimization (GRPO), a reinforcement learning method that helps the model improve its aggregation policy over time based on group-level rewards.
Evaluation and Performance of AggLM
AggLM was tested on several challenging math competition datasets, including AIME and HMMT problems. When aggregating solutions from a 1.7B parameter model, AggLM outperformed majority voting and reward-based selection methods, improving accuracy from 35% to 50% on one dataset. It also generalized well to solutions from stronger models, like an 8B parameter model, even though it was only trained on data from the smaller model.
The method also proved more token-efficient than generating a larger number of solutions for majority voting, and it performed especially well when candidate solutions were diverse and the correct answer was not in the majority.

Comparison of training the solution model versus training the aggregator model on the same data, in either separate or multitask settings.