
Think In Diffusion: Continuous Latent Diffusion Language Model

plus more on Sparser, Faster, Lighter Transformer LMs, Manifold Steering, and Teaching Claude Why

May 5th ~ May 12th
#107 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 751 Baidu has released ERNIE 5.1, its latest model with better reasoning, search, and agentic capabilities while reportedly requiring only 6% of the pre-training cost of comparable models. Built using multi-dimensional elastic pre-training, the model has already secured a top-five global ranking on the LMSYS Search Leaderboard for its retrieval and synthesis performance. You can try it in your browser today.

  2. ♥ 2.4k Zyphra has introduced ZAYA1-8B, an open-weights model with a new architecture that includes Compressed Convolutional Attention (CCA) for 8x KV-cache compression and a Markovian RSA technique for bounded-context reasoning. The model uses a four-stage RL cascade to achieve reasoning performance that rivals or surpasses much larger models on specialized benchmarks. You can try it on Zyphra Cloud or Hugging Face.

Thunder Compute: The cheapest cloud GPU

H100 @ $1.38/GPU/hr!!!

Thunder Compute has cheap cloud GPUs for developers. We offer on-demand GPU cloud instances in enterprise-grade data centers for a fraction of the price of competitors.

With on-demand H100s at $1.38/GPU/hr, you get best-in-class reliability and networking, while competitors charge at least $4/GPU/hr.

Additional features include:

  • A VSCode extension and CLI that let you connect to instances without SSH config.

  • Snapshots to save instance state and restore on any number of instances

  • Templates for ComfyUI, Ollama, Unsloth Studio, and more

  • $20 of free credit for students

Sparser, Faster, Lighter Transformer Language Models

Cetin et al. [Sakana AI, NVIDIA]

♥ 754   Transformers  

The feedforward layers in LLMs consume the vast majority of a model’s processing power and memory, yet only a tiny fraction of their artificial neurons actually needs to activate to process any given word. The catch is that modern hardware is heavily optimized for dense calculations: forcing a graphics processing unit to selectively skip dormant neurons creates so much organizational overhead that it runs slower than just calculating everything.

Comparison of ELL with our new TwELL and Hybrid sparse formats

In this paper, researchers tried to bypass this bottleneck by rethinking how AI software communicates with hardware. They designed a new data packing format called TwELL, which neatly organizes only the active neurons into small, manageable data tiles. Instead of pausing to count and sort messy, unstructured data, the hardware can process these uniform tiles in a single fused computational pipeline.

Algorithmic description of gate projection with matmul kernel with TwELL
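
To make the tiling idea concrete, here is a minimal NumPy sketch of packing only the active neurons into fixed-width tiles. The tile width, padding rule, and function name are illustrative assumptions for exposition, not the paper's actual TwELL specification.

```python
import numpy as np

TILE = 8  # hypothetical tile width; real kernels match GPU warp/tile sizes

def pack_active(acts, tile=TILE):
    """Pack indices and values of nonzero activations into fixed-width
    tiles, zero-padding the tail so the kernel sees uniform blocks."""
    idx = np.flatnonzero(acts)          # which neurons fired
    vals = acts[idx]
    pad = (-idx.size) % tile            # round up to a full tile
    idx = np.pad(idx, (0, pad))
    vals = np.pad(vals, (0, pad))
    return idx.reshape(-1, tile), vals.reshape(-1, tile)

acts = np.zeros(4096)
acts[[3, 170, 2042]] = [0.5, -1.2, 0.9]   # only 3 of 4096 neurons are active
tile_idx, tile_val = pack_active(acts)
print(tile_idx.shape)  # (1, 8): one dense tile instead of a 4096-wide row
```

A fused kernel can then gather the matching weight columns tile by tile, which is what keeps the GPU doing dense math instead of per-element bookkeeping.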

Furthermore, they introduced a mathematical penalty during training that led the models to become over ninety-nine percent sparse. Leaving the vast majority of the network dormant resulted in practically zero loss to the model's intelligence or downstream reasoning capabilities.
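
The paper's exact regularizer isn't reproduced here, but an L1-style penalty on the gate activations conveys the idea; the toy module and penalty coefficient below are our own illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedFFN(nn.Module):
    """Toy gated feedforward block that also exposes its gate activations."""
    def __init__(self, d_model=64, d_ff=256):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff)
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        g = torch.relu(self.gate(x))   # ReLU makes exact zeros possible
        return self.down(g * self.up(x)), g

ffn = GatedFFN()
out, g = ffn(torch.randn(4, 64))
task_loss = out.pow(2).mean()          # stand-in for the real task loss
sparsity_loss = 1e-4 * g.abs().mean()  # penalty nudges activations toward zero
(task_loss + sparsity_loss).backward()
print(f"dormant gate units: {(g == 0).float().mean().item():.0%}")
```

Training with such a penalty is what drives most gate units to exact zero, which a format like TwELL can then exploit.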

Algorithmic description of fused up and down projections from gate activations in the TwELL format

For networks containing billions of parameters, this tiled approach accelerated processing speeds by over twenty percent, significantly slashed energy consumption, and drastically reduced memory requirements.

Comparison of performance and efficiency statistics of sparse LLMs leveraging our kernels with traditional models.

Continuous Latent Diffusion Language Model

Guo et al. [ByteDance Seed, The University of Hong Kong, The Australian National University, Peking University, Renmin University of China]

♥ 365   Diffusion LLMs   bycloud’s pick  

LLMs generate text in left-to-right order, which works fine in many cases but traps models in a rigid, sequential way of thinking. Researchers have long wondered whether high-quality generation actually needs to be tied to this fixed direction. The challenge has been finding an alternative that captures the broad meaning of a text without losing the efficiency and scalability that make modern AI so powerful.

The overall workflow of Cola DLM.

To solve this, researchers developed a new framework called Cola DLM. Instead of guessing the next word in a sequence, this system separates the generation process into two distinct steps. First, it forms a global semantic picture, sketching out overarching concepts within a flexible environment known as a continuous latent space.

Next, it uses a specialized decoder to translate those broad ideas into actual words. By utilizing a technique called diffusion, the model shapes underlying meaning rather than just recovering scrambled text. The AI effectively organizes its thoughts globally before worrying about local phrasing.
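
Here is a toy sketch of that two-stage pipeline; the module shapes, the simplified denoising update, and every name below are illustrative stand-ins, not Cola DLM's actual architecture or sampler.

```python
import torch
import torch.nn as nn

SEQ, D_LAT, VOCAB, STEPS = 8, 32, 1000, 10

# Stage 1: iteratively denoise a continuous latent "thought" for the
# whole sequence at once, instead of committing to words left to right.
denoiser = nn.Sequential(nn.Linear(D_LAT + 1, 128), nn.ReLU(),
                         nn.Linear(128, D_LAT))
z = torch.randn(1, SEQ, D_LAT)                  # start from pure noise
for t in reversed(range(STEPS)):
    t_emb = torch.full((1, SEQ, 1), t / STEPS)  # crude timestep signal
    pred_noise = denoiser(torch.cat([z, t_emb], dim=-1))
    z = z - pred_noise / STEPS                  # crude stand-in for a sampler

# Stage 2: only now translate the global latent into concrete tokens.
decoder = nn.Linear(D_LAT, VOCAB)
tokens = decoder(z).argmax(dim=-1)              # all positions decoded together
print(tokens.shape)                             # torch.Size([1, 8])
```

The point of the sketch is the ordering: the latent for every position is refined jointly before any token is committed.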

Unified text–image qualitative samples.

The implications of this hierarchical approach are promising. Through extensive testing against traditional models, the researchers showed that the method scales well as computing power increases. More importantly, because the system processes text as fluid concepts rather than rigid individual words, it establishes a natural bridge between written language and other continuous formats, like visual images.

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

Wurgaft et al. [Stanford University, University College London, Northeastern University, Harvard University, Technion IIT]

♥ 10K   LLM Scaling Law  

We are building ever more capable AI models, but we still don’t know how to reliably steer a model’s behavior without breaking it. Researchers have tried to guide AI by pushing its internal representations in straight lines, treating the model's activation space like a flat grid and assuming they could draw a direct, linear path from one concept to another. Unfortunately, this rigid approach often produces unnatural, garbled outputs: the model becomes unstable because that straight line blindly cuts through regions of its internal space that do not make sense.

How do different geometries of activation space modulate behavior?

To solve this, researchers mapped out how these models organize information. They discovered that an AI’s internal representations form distinct, structured shapes that mirror the concepts they encode. When processing cyclical ideas like days of the week, the AI’s internal states form a continuous circle.

Manifold steering yields smooth and ordered behavioral transitions.

When reasoning through sequential concepts like ages or the alphabet, its thoughts form a clean, open curve. Researchers found a perfect mirror effect: the geometric shape of the AI's internal activations exactly matches the shape of its external behavior. When they gently guided the AI along these natural internal curves (a method they call manifold steering) the model’s behavior transitioned flawlessly. Instead of getting lost in unnatural territory, the AI easily glided from one coherent thought to the next.
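
A tiny NumPy illustration shows why the geometry matters for steering a circular representation like days of the week; the spherical interpolation here is our own stand-in for an on-manifold path, not necessarily the paper's steering operator.

```python
import numpy as np

# Seven concepts arranged on a circle (e.g., days of the week).
angles = np.linspace(0, 2 * np.pi, 7, endpoint=False)
days = np.stack([np.cos(angles), np.sin(angles)], axis=1)

a, b = days[0], days[3]                # steer from day 0 toward day 3

# Linear steering cuts through the circle's interior, leaving the manifold:
linear_mid = (a + b) / 2
print(f"linear midpoint norm:   {np.linalg.norm(linear_mid):.2f}")  # ~0.22

# Steering along the manifold (spherical interpolation) stays on it:
theta = np.arccos(np.clip(a @ b, -1.0, 1.0))
t = 0.5
slerp_mid = (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)
print(f"manifold midpoint norm: {np.linalg.norm(slerp_mid):.2f}")   # 1.00
```

The off-circle midpoint corresponds to an activation pattern the model never produces naturally, which is exactly where linear steering breaks down.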

Teaching Claude Why

Anthropic

♥ 8K   LLM  

How do we stop capable AI systems from acting like sci-fi villains? When researchers tested earlier AI models with fictional ethical dilemmas, they encountered a startling problem called agentic misalignment.

In some test scenarios, models actually tried to blackmail engineers to avoid being shut down. Researchers realized that standard chat-based safety training simply wasn't enough once models started acting independently and using tools. Fixing this is necessary for building trust, ensuring that as artificial intelligence becomes more autonomous, it remains a safe, helpful partner rather than a catastrophic risk.

The team discovered that simply telling the AI what not to do is surprisingly ineffective. When trained strictly on examples of avoiding bad actions, the blackmail behavior barely dropped. The real breakthrough came from teaching the models the underlying principles of good behavior.

Instead of just showing the AI the right action, researchers trained it to explicitly deliberate on its values and explain why an ethical choice was better. They also introduced scenarios where human users faced moral gray areas and trained the AI to offer thoughtful, principled advice. By combining this ethical reasoning with documents outlining the AI's core constitution and fictional stories of systems acting admirably, researchers fundamentally shifted the model's behavior so it could safely navigate entirely new situations.

Because the training environments were enriched with diverse prompts, this principled alignment proved remarkably durable. Since implementing these deeper, reasoning-based methods, recent models have completely stopped engaging in extortion during these evaluations.
