Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Plus more about One-Minute Video Generation with Test-Time Training and Gaussian Mixture Flow Matching Models
Apr 7th ~ Apr 13th
#51 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 3.9k OpenAI has announced GPT-4.1, a next-gen non-reasoning multimodal model. GPT-4.1 is the first OpenAI model series to offer a context window of up to 1 million tokens. These models are only accessible via the OpenAI API, while GPT-4.5 Preview will be deprecated.
♥ 1k Moonshot AI has announced Kimi-VL and Kimi-VL-Thinking, new open-source Vision-Language models with reasoning capabilities. The highlights of this release are multimodal reasoning and support for long context windows of up to 128K tokens. Both models are now available via Hugging Face, with the research paper available on GitHub.
Kimi-VL benchmarks
♥ 1.1k Google has shared DolphinGemma, an AI model designed to analyze dolphin communication patterns. DolphinGemma utilizes the Open Gemma models and has been trained on The Dolphin Project’s acoustic database of wild Atlantic spotted dolphins. The model is capable of processing complex sequences of dolphin sounds, identifying patterns, and predicting likely subsequent sounds.
♥ 1.1k xAI has made the Grok-3 series available via the xAI API, featuring text generation, image understanding, and image generation. These models support context windows of up to 131,072 tokens.
Grok-3 API pricing
Thunder Compute: The Cheapest Cloud GPU
Thunder Compute is the cheapest way to get GPU cloud instances for AI, machine learning, or data science. You can get an A100 hosted in Google Cloud, in a US data center, with best-in-class reliability and networking for $0.57/hr, compared with $3.50/hr directly from Google.
To make this possible, Thunder Compute invented virtualization software to network-attach GPUs. This increases the utilization of GPUs on the platform by 5x. Less downtime means lower prices for you.
Thunder Compute uses a simple CLI to create and connect to instances: just run tnr create and tnr connect [instance_id] to start. You can also use their VSCode extension to develop directly on cloud GPUs in one click!
Create a Thunder Compute instance for free with $20 per month of credit.
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Rodionov et al. [Yandex, HSE University, ITMO University, IST Austria]
♥ 317 LLM Attention
When humans tackle complex problems, we rarely work in isolation. We brainstorm, divide tasks, and adjust strategies dynamically. What if LLMs could do the same? Modern LLMs excel at tasks requiring long reasoning chains, but their sequential token-by-token generation limits efficiency. Previous attempts to parallelize LLM inference relied on predefined strategies: voting on answers, splitting problems into subtasks, or assigning specialized roles (e.g., "debugger" or "judge"). While these methods work for specific problems, they struggle when tasks don’t fit their structure. For instance, splitting a problem into subtasks fails if the initial plan is flawed, leaving workers stuck on irrelevant steps.

This paper introduces Hogwild! Inference, a method that allows multiple LLM instances to collaborate in parallel by sharing their "thoughts" in real time. This approach sidesteps rigid coordination frameworks, letting the models themselves decide how to work together.
How Hogwild! Inference Works
The Hogwild! technique uses a shared Key-Value (KV) cache that lets multiple LLM workers access each other’s intermediate outputs instantly. Instead of running isolated threads, these workers dynamically stitch their attention contexts together. To understand this better, let’s imagine two assistants, Alice and Bob, solving a math problem: Alice might start by suggesting a task division, while Bob immediately notices an error in her approach and pivots. Their shared cache allows them to see and react to each other’s progress token-by-token.

To make this practical, the method takes advantage of Rotary Position Embeddings (RoPE), which encode token positions as rotational angles in the attention mechanism. By rotating cached tokens to their new positions for each worker, Hogwild! avoids recomputing representations. The researchers tested three cache layouts for this:
Contiguous: Workers append tokens to private blocks, akin to collaborative document editing.
Interleaved: Workers share completed reasoning steps in a chat-like history.
Combined: A hybrid where workers see both real-time progress and shared history.
The system prompt encourages collaboration by periodically asking workers to check for redundant work (e.g., “Wait, am I doing redundant work?”). Surprisingly, models like QwQ-32B and DeepSeek-R1 adapt naturally to this setup, often redistributing tasks or revising plans without explicit training.
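To make the position-rotation trick above concrete, here is a minimal sketch (our own illustration, not the authors’ code) of shifting a worker’s cached RoPE keys to a new position by applying only the incremental rotation for the offset, so another worker can reuse them without recomputing the layer. It assumes the adjacent-channel-pair RoPE convention; real implementations vary.

```python
# Minimal sketch: rotate cached RoPE keys forward by a positional offset so a
# shared KV-cache block can be spliced in at a new position without recompute.
import torch

def rope_angles(offset: int, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles for a positional offset, one angle per channel pair."""
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)
    return offset * inv_freq                      # shape: (dim // 2,)

def rotate_keys(keys: torch.Tensor, offset: int) -> torch.Tensor:
    """Rotate cached keys of shape (seq_len, dim) forward by `offset` positions."""
    theta = rope_angles(offset, keys.shape[-1])
    cos, sin = theta.cos(), theta.sin()
    k_even, k_odd = keys[..., 0::2], keys[..., 1::2]
    out = torch.empty_like(keys)
    out[..., 0::2] = k_even * cos - k_odd * sin   # standard 2D rotation per pair
    out[..., 1::2] = k_even * sin + k_odd * cos
    return out

# Worker B splices Worker A's cached block in at a new starting position:
# shifted_keys = rotate_keys(worker_a_keys, offset=new_start - old_start)
```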
Performance and Trade-offs
The researchers tested the Hogwild! technique on a range of synthetic and complex reasoning tasks (e.g., the GSM8k and LIMO datasets):
With a token budget of 4096, Hogwild!’s combined layout solved 68.2% of LIMO tasks, outperforming independent workers (48.4%) and single-threaded baselines (52.3%).
Smaller budgets (1024 tokens) favor the contiguous layout, where immediate synchronization helps workers coordinate faster.
At higher budgets (8192 tokens), the interleaved layout catches up, as step-wise synchronization reduces noise from overlapping token streams.

However, there are caveats. Coordination overhead can hurt performance with too many workers: four workers underperformed a single worker on synthetic tasks due to excessive time spent negotiating roles.
One-Minute Video Generation with Test-Time Training
Dalal et al. [NVIDIA, Stanford University, UCSD, UC Berkeley, UT Austin]
♥ 5.7k Video Generation
Video generation models are getting better day by day, but telling coherent, multi-scene stories longer than 30 seconds is still challenging. This is because self-attention in Transformers scales poorly with context length, while modern RNN alternatives like Mamba lack the expressive power to handle dynamic motion and complex scene transitions.

A new method based on Test-Time Training (TTT) layers takes a hybrid approach that treats hidden states as trainable neural networks. By updating these states through gradient descent during inference, the model dynamically adapts to retain critical story elements across scenes.
TTT layers reimagine the hidden state as a two-layer MLP that evolves with each frame. Unlike static matrices in Mamba or DeltaNet, this MLP trains on-the-fly using a self-supervised task: reconstructing corrupted versions of input frames. For each token in the sequence, the layer:
Corrupts the input (e.g., masking parts of a frame),
Updates the hidden MLP by minimizing reconstruction loss,
Predicts the original input using the refined MLP.
This process embeds a learning loop directly into the forward pass. The hidden state creates a model that actively adapts to fill in gaps, preserving details like character movements and scene layouts.
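Here is a minimal sketch of that inner learning loop, under our own assumptions (a simple masking corruption and one gradient step per token); the paper’s actual layer and training recipe differ in their details.

```python
# Sketch of a TTT-style layer: the hidden state is a small MLP that is updated
# by gradient descent on a self-supervised reconstruction loss as tokens stream in.
import torch
import torch.nn.functional as F

class TTTLayer(torch.nn.Module):
    def __init__(self, dim: int, hidden: int, inner_lr: float = 0.1):
        super().__init__()
        # Initial fast weights: the "hidden state" is this two-layer MLP.
        self.w1 = torch.nn.Parameter(torch.randn(dim, hidden) * 0.02)
        self.w2 = torch.nn.Parameter(torch.randn(hidden, dim) * 0.02)
        self.inner_lr = inner_lr

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim). The fast weights evolve with every token.
        w1, w2 = self.w1.clone(), self.w2.clone()
        outputs = []
        for t in range(x.shape[0]):
            token = x[t:t + 1]                                   # (1, dim)
            corrupted = token * (torch.rand_like(token) > 0.2)   # crude masking
            # 1) Self-supervised inner step: reconstruct the clean token.
            recon = F.gelu(corrupted @ w1) @ w2
            loss = F.mse_loss(recon, token)
            g1, g2 = torch.autograd.grad(loss, (w1, w2), create_graph=True)
            # 2) Update the hidden-state MLP by gradient descent.
            w1, w2 = w1 - self.inner_lr * g1, w2 - self.inner_lr * g2
            # 3) Predict with the refined fast weights.
            outputs.append(F.gelu(token @ w1) @ w2)
        return torch.cat(outputs, dim=0)
```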

To integrate TTT into existing architectures, researchers added gated connections to a pre-trained 5B-parameter Diffusion Transformer. Self-attention layers handle local 3-second segments, while TTT layers stitch these segments globally. A bidirectional processing trick allows the model to learn from both past and future context without violating causality, crucial for maintaining continuity across scene cuts.
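A hedged sketch of how such a gated insertion could look, assuming a learnable gate initialized at zero so the frozen pre-trained block is unchanged at the start of fine-tuning (the paper’s exact gating may differ):

```python
# Gated residual wrapper around a TTT layer; with a zero gate the pre-trained
# Diffusion Transformer behaves exactly as before.
import torch

class GatedTTT(torch.nn.Module):
    def __init__(self, dim: int, ttt_layer: torch.nn.Module):
        super().__init__()
        self.ttt = ttt_layer
        self.gate = torch.nn.Parameter(torch.zeros(dim))  # starts closed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual path: output equals the input when the gate is zero.
        return x + torch.tanh(self.gate) * self.ttt(x)
```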

Evaluating Test-Time Training Approach
In human evaluations, TTT-based videos outperformed Mamba 2 and sliding-window attention by 34 Elo points, a margin comparable to GPT-4’s lead over GPT-3.5. During testing, the researchers observed:
Temporal consistency: Characters maintained appearance across scene changes.
Motion naturalness: Dynamic actions (e.g., chases) flowed smoothly between segments.
Story adherence: Multi-step plots (like Jerry stealing a pie through coordinated tricks) stayed on track.
However, the videos still contain a few artifacts. Additionally, TTT layers are less efficient, adding 2.5× inference latency compared to Mamba, though they are far cheaper than full self-attention.

Human evaluation results for one-minute videos.
Gaussian Mixture Flow Matching Models
Chen et al. [Stanford University, Adobe Research, Hillbot]
♥ 301 Image Generation bycloud’s pick
Introduction to Gaussian Mixture Flow Matching (GMFlow)
Generative models are getting better, but even state-of-the-art methods face stubborn challenges: generating high-quality samples in just a few steps often leads to artifacts, and popular guidance techniques like classifier-free guidance (CFG) tend to oversaturate colors.
Traditional diffusion and flow matching models simplify the denoising process by assuming the distribution of "flow velocity" (the direction and speed at which noise transitions to data) follows a single Gaussian. This works well when taking tiny, incremental steps during sampling. But when you try to generate images in just a handful of steps, that approximation breaks down. Large step sizes introduce errors, leading to blurry or distorted outputs.

Gaussian Mixture Flow Matching (GMFlow) addresses these limitations by replacing the single-Gaussian assumption with a Gaussian mixture model (GMM). Instead of predicting a single flow velocity, GMFlow estimates a multi-modal distribution of velocities. This allows the model to capture complex, overlapping pathways from noise to data, which enables higher-quality generation with fewer steps while avoiding guidance-induced artifacts.
How GMFlow Works: Mixtures, Guidance, and Analytic Solvers
The GMFlow framework has three parts:
Modeling Multi-Modal Velocity Distributions
GMFlow predicts a mixture of Gaussians to represent the possible flow velocities at each denoising step. For every noisy input, the model outputs:
Means (directions) for each Gaussian component.
Weights (probabilities) for selecting among components.
A shared variance (spread) across all components.
Training uses a KL divergence loss to align the predicted mixture with the true velocity distribution. This generalizes traditional flow matching, which uses an L2 loss to regress a single mean. By capturing multi-modality, GMFlow better approximates the true denoising dynamics, even when large steps are taken.
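As a rough sketch of this parameterization (our assumptions, not the released code), such a head could predict K component means, mixture logits, and a shared variance, and be trained by maximizing the likelihood of the observed flow velocity, which is the sample-based form of the KL objective described above:

```python
# Sketch of a GMFlow-style velocity head with a mixture negative log-likelihood loss.
import math
import torch
import torch.nn.functional as F

class GMVelocityHead(torch.nn.Module):
    def __init__(self, feat_dim: int, out_dim: int, k: int = 8):
        super().__init__()
        self.k, self.out_dim = k, out_dim
        self.means = torch.nn.Linear(feat_dim, k * out_dim)   # K component means
        self.logits = torch.nn.Linear(feat_dim, k)            # K mixture weights
        self.log_var = torch.nn.Parameter(torch.zeros(()))    # shared variance

    def forward(self, h: torch.Tensor):
        means = self.means(h).view(-1, self.k, self.out_dim)  # (B, K, D)
        log_w = F.log_softmax(self.logits(h), dim=-1)          # (B, K)
        return means, log_w, self.log_var

def gm_nll_loss(means, log_w, log_var, target_v):
    """Negative log-likelihood of the true velocity under the predicted mixture."""
    d = means.shape[-1]
    sq_dist = ((target_v.unsqueeze(1) - means) ** 2).sum(-1)  # (B, K)
    log_prob = -0.5 * (sq_dist / log_var.exp() + d * (log_var + math.log(2 * math.pi)))
    return -torch.logsumexp(log_w + log_prob, dim=-1).mean()
```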

Probabilistic Guidance
Classifier-free guidance (CFG) amplifies conditional signals by extrapolating between conditional and unconditional predictions. However, this extrapolation often pushes samples outside the training data distribution, causing oversaturation. GMFlow’s probabilistic guidance avoids this by reweighting the Gaussian mixture components instead of extrapolating.
The model estimates both conditional (e.g., "a cat") and unconditional ("an image") velocity distributions as GMMs.
Guidance strengthens the conditional signal by reweighting the mixture probabilities toward components that align with the condition.
Unlike CFG, this keeps samples within the data distribution, preventing oversaturation while improving alignment.
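The contrast with CFG can be shown with a toy reweighting rule; this illustrates the reweight-and-renormalize idea, not GMFlow’s exact guidance formula:

```python
# Toy illustration: components preferred by the conditional model get boosted,
# and the renormalized result is still a valid mixture over the data distribution.
import torch.nn.functional as F

def reweight_components(cond_log_w, uncond_log_w, guidance_scale: float = 2.0):
    """Strengthen the conditional signal by reweighting mixture components."""
    boosted = cond_log_w + guidance_scale * (cond_log_w - uncond_log_w)
    return F.log_softmax(boosted, dim=-1)   # renormalize: stays a proper mixture

# Contrast with classifier-free guidance, which extrapolates point predictions
# and can push samples outside the data distribution:
#   v_guided = v_uncond + s * (v_cond - v_uncond)
```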

Sampling with Analytic Precision
GMFlow introduces specialized solvers (GM-SDE and GM-ODE) that leverage the analytic properties of Gaussian mixtures. When predicting the next denoising step, these solvers:
Compute the exact transition distribution by combining the predicted mixture components.
Use closed-form solutions to integrate velocity fields, reducing discretization errors.
This allows GMFlow to take larger steps without sacrificing accuracy. For example, in a 2D toy experiment, GMFlow reconstructs a checkerboard pattern in just four steps, while traditional methods require 16+ steps to avoid severe artifacts.

Comparison among vanilla flow models with different solvers and GMFlow.
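For contrast with the analytic solvers described above, here is a deliberately naive sampler step written under our own assumptions: it just samples a velocity from the predicted mixture and integrates it with an Euler step, whereas GM-SDE/GM-ODE compute the transition distribution in closed form and thereby avoid this discretization error.

```python
# Naive step: sample a velocity from the predicted Gaussian mixture, then Euler-integrate.
import torch

def naive_gm_step(x_t, means, log_w, log_var, dt: float):
    """x_t: (B, D); means: (B, K, D); log_w: (B, K); log_var: scalar tensor (shared)."""
    # Pick one component per sample according to the predicted mixture weights.
    comp = torch.distributions.Categorical(logits=log_w).sample()      # (B,)
    mu = means[torch.arange(x_t.shape[0]), comp]                       # (B, D)
    # Sample a velocity from the chosen Gaussian (shared isotropic variance).
    v = mu + (0.5 * log_var).exp() * torch.randn_like(mu)
    return x_t + dt * v                                                # Euler step
```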
Benchmark Results for GMFlow
GMFlow challenges the long-held assumption that single-Gaussian dynamics are sufficient for diffusion and flow models. By embracing multi-modality, it opens the door to faster and better sampling. GMFlow was evaluated on two benchmarks: a synthetic 2D dataset and ImageNet 256×256 generation.
2D Checkerboard Analysis
With just four steps, GMFlow nearly perfectly reconstructs the checkerboard, while baseline methods (DDPM, DPM++) show blurred or fragmented patterns.
Increasing the number of Gaussians (K) improves sample fidelity. K=64 achieves near-perfect results, while K=1 (equivalent to standard flow matching) struggles.
ImageNet 256×256
GMFlow achieves a Precision score of 0.942 with only six sampling steps, outperforming flow matching baselines by a wide margin. At 32 steps, it reaches a state-of-the-art Precision of 0.950.
Probabilistic guidance cuts saturation levels by 30% compared to CFG, aligning outputs closer to natural image statistics.
Despite its complexity, GMFlow adds minimal computational overhead, just 0.005 seconds per step on an A100 GPU.

ImageNet evaluation results at best Precision
Limitations and Trade-offs
Pixel-Wise Factorization: GMFlow models each pixel independently, which simplifies training but ignores spatial correlations. The authors propose spectral sampling to inject spatial coherence, though this remains an area for improvement.
Component Sensitivity: Performance plateaus at around K=8 components for images, suggesting diminishing returns beyond a certain complexity.
🚨This week's top AI/ML research papers:
- Scaling Laws for Native Multimodal Models
- Quantization Hurts Reasoning?
- The AI Scientist -v2
- Parallel LLM Generation via Concurrent Attention
- Gaussian Mixture Flow Matching Models
- VAPO
- Are Reasoning Models losing Critical…
— The AI Timeline (@TheAITimeline)
4:10 AM • Apr 14, 2025