Compress Context... Into a LoRA!?

plus more on Learning Without Training and The Geometry of Noise

Feb 24th ~ Mar 4th
#97 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 10k Google has begun rolling out Nano Banana 2, its latest image generation model. The updated model uses real-time web search data to improve real-world accuracy and introduces the ability to render clear, multilingual text for designs like posters and logos. Additionally, Nano Banana 2 brings faster generation speeds alongside enhancements to lighting, textures, and overall image detail. Try it today via the Gemini app and web interface.

  2. ♥ 3.7k OpenAI has signed a classified deployment contract with the Department of War, insisting that a "cloud-only" architecture, an internal safety stack, and cleared engineers will somehow strictly prevent the military from using their models for autonomous lethal weapons or mass NSA surveillance. Entrusting a tech corporation to independently self-police lethal and intelligence applications sounds straight out of a Black Mirror episode.

  3. ♥ 20k Alibaba's Qwen team has launched the Qwen 3.5 Small Model Series, a family of native multimodal models ranging from a highly compact 0.8B to a remarkably capable 9B designed for edge devices and lightweight agents. Both the base and instruct models are now available on Hugging Face and ModelScope. Additionally, the entire suite is already optimized for local deployment via Ollama.

  4. ♥ 8k Following the Small series, Alibaba has also introduced the Qwen 3.5 Medium Model Series, which includes the 27B, 35B-A3B, and 122B-A10B models designed to bridge the gap between mid-sized and frontier AI capabilities. Highlighting a shift toward architectural efficiency over parameter size, the new 35B-A3B model notably outperforms the previous generation's massive 235B model thanks to improved data quality and reinforcement learning. Try it in browser.

  5. ♥ 8.2k Google has announced Gemini 3.1 Flash-Lite, its fastest and most cost-effective Gemini 3 model to date. With dynamic "thinking levels", the model instantly processes high-volume queries while scaling its reasoning for complex edge cases, delivering a 2.5X faster time-to-first-token than its 2.5 Flash predecessor. Try it in browser.

Intuitive AI Academy - NEW MoE Chapter!

My latest project, Intuitive AI Academy, has the perfect starting point for you! We focus on building your intuition for understanding LLMs, from transformer components to post-training logic. All in one place.

We just added a new chapter on MoE that covers the history, the key techniques, and the current state of MoE in frontier models, with over 10,000 words written.

We currently have an early bird offer: early users get 40% off the yearly plan.

Use code: TIMELINE

Learning Without Training

Ryan O’Dowd [Claremont Graduate University]

♥ 720   LLM Training  

Engineers usually build machine learning models by guessing a structure and running exhaustive optimization processes to train them. The researchers wanted to know if there is a more elegant way to overcome the hurdles of high-dimensional, noisy data without relying on these brute-force training methods.

The current approach assumes a model will eventually learn the underlying patterns, but it lacks constructive mathematical guarantees. By rooting their approach in classical approximation theory, the authors can save immense computational power while tackling complex problems like tracking brain diseases or analyzing hyperspectral images.

Normalized histogram of the density of interest (left), paired with our density estimation by σ128 based on 3900 samples (right).

This paper shows how to mathematically construct highly accurate models directly on unknown, complex data surfaces, known as manifolds. The method bypasses the need to map out the entire geometry of the dataset first; it only requires knowing the dimension of the data.

The team unlocked a breakthrough in transfer learning by figuring out how to successfully lift learned information from just a localized portion of one data space and apply it to a completely different domain. As a result, adapting a massive model to a new problem no longer requires processing the entire original dataset.

Finally, the researchers reimagined data classification by treating it like a signal separation problem. By mathematically estimating the underlying sources of these signals, their new algorithm quickly zeroes in on the most informative data points.
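To get a taste of what "learning without training" can mean, here is a generic example from classical approximation theory: a kernel interpolant fitted to scattered samples with one closed-form linear solve instead of gradient descent. This is only an illustrative sketch of the flavor of the approach, not the paper's localized manifold operators:

```python
import numpy as np

# "Training-free" function approximation: a closed-form radial basis
# function (RBF) interpolant on scattered samples, obtained by a single
# direct linear solve rather than iterative optimization.

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))               # samples on a 2-D domain
y = np.sin(np.pi * X[:, 0]) * np.cos(np.pi * X[:, 1])

def rbf_kernel(A, B, gamma=10.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# One linear solve plays the role of "training".
K = rbf_kernel(X, X) + 1e-6 * np.eye(len(X))        # tiny ridge for stability
coef = np.linalg.solve(K, y)

# Evaluate the interpolant at a held-out point.
x_new = np.array([[0.3, -0.2]])
pred = rbf_kernel(x_new, X) @ coef
true = np.sin(np.pi * 0.3) * np.cos(np.pi * -0.2)
print(pred[0], true)  # close agreement, no gradient descent involved
```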

Doc-to-LoRA: Learning to Instantly Internalize Contexts

Charakorn et al. [Sakana AI, Minerva University]

♥ 1.4k   LoRA   bycloud’s pick  

Suppose you ask an AI to analyze a massive technical manual. Currently, every time you ask a follow-up question, the system has to re-read the entire document. This repetitive reading eats up massive amounts of computing power, memory, and time.

While researchers can technically train the AI to memorize the document permanently, that traditional training process is painfully slow, expensive, and completely impractical for quick updates.

To solve this, researchers developed a brilliant workaround called Doc-to-LoRA, or D2L. Instead of forcing the main AI to constantly re-read text or undergo grueling training, they built a specialized, lightweight helper system called a hypernetwork.

This helper reads the document exactly once and instantly generates a tiny, customized plug-in. Think of it like instantly downloading a new skill directly into the AI's brain. Once this plug-in is attached, the main AI can answer subsequent queries fluidly without ever needing the original text in its prompt. It performs this complex mental compression in just a single step, completely bypassing the steep costs of standard training.
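The "plug-in" here is just a LoRA update produced in one forward pass. Below is a minimal numpy sketch of the idea, using a hypothetical `hypernetwork` function with random weights; none of the names or shapes come from Sakana AI's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def hypernetwork(doc_embedding, d_model=1024, rank=8):
    """Toy stand-in for the D2L idea: one map from a document embedding
    to the flattened LoRA factors A and B. (Weights here are random;
    the real hypernetwork is trained end to end.)"""
    W = rng.standard_normal((2 * d_model * rank, doc_embedding.size)) * 0.01
    flat = W @ doc_embedding
    A = flat[: d_model * rank].reshape(d_model, rank)
    B = flat[d_model * rank :].reshape(rank, d_model)
    return A, B

doc_embedding = rng.standard_normal(768)  # stand-in for the encoded manual
A, B = hypernetwork(doc_embedding)        # one forward pass, no training loop
delta_W = A @ B                           # low-rank "plug-in" weight update
print(delta_W.shape)                      # (1024, 1024), but only rank 8
```

The key point is that generating `delta_W` costs a single forward pass, after which the document never needs to appear in the prompt again.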

QA performance on SQuAD compared to the used context length ratio (left), update latency (middle), and additional memory needed for model updates (right).

The results are incredibly promising. In testing, D2L successfully hunted down specific facts hidden inside massive walls of text, achieving near-perfect accuracy on documents over four times larger than the AI’s normal limits.

It works drastically faster and uses far less memory than previous memorization methods. It can even translate visual information from image-based models into these text plug-ins, allowing a text-only AI to classify images.

Long document QA performance. LLMLingua-2 compresses the input with [20%, 40%, 60%, 80%, 90%] compression rates from right to left (gray dots).

The Geometry of Noise: Why Diffusion Models Don't Need Noise Conditioning

Sahraee-Ardakan et al. [Google]

♥ 382   Diffusion Models  

Can you restore a severely damaged painting without knowing how much damage was originally done? Standard diffusion models avoid this problem by relying on a strict timer that tells them exactly how much "noise" or corruption they are dealing with at any given step.

Recently, researchers have been incredibly hopeful about "autonomous" models that strip away this timer, learning a single rule to handle everything from pure static to nearly perfect data. However, this creates a profound mathematical paradox.

As these blind models approach the clean data, the underlying mathematical landscape forms an infinitely deep pit. The directional signals diverge completely, creating a severe geometric singularity. By all conventional logic, these models should become hopelessly unstable and crash.

The Singular Geometry of the Marginal Energy Landscape.

Yet, researchers have beautifully resolved this mystery. They discovered that these autonomous systems are actually charting a course across a unified map called "Marginal Energy."

More importantly, the scientists proved that the models naturally develop a hidden geometric shock absorber. As the AI approaches that infinitely deep mathematical pit, this built-in feature perfectly counteracts the extreme steepness. It transforms a catastrophic plunge into a smooth, stable descent known as a Riemannian gradient flow.

Generative performance on Fashion MNIST.

Models attempting to directly predict the noise act like faulty amplifiers, magnifying tiny errors until the system catastrophically breaks down. Conversely, models based on predicting "velocity" inherently absorb that uncertainty into a smooth, stable drift.
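One standard way to see the "faulty amplifier" effect uses the usual v-parameterization from the diffusion literature (v = α·ε − σ·x₀ with α² + σ² = 1). This is a generic numerical illustration of that error-amplification gap, not necessarily the paper's exact derivation:

```python
import numpy as np

# VP-style noising: x_t = alpha * x0 + sigma * eps, with alpha^2 + sigma^2 = 1.
# Recovering x0 from a noise prediction divides by alpha:
#     x0_hat = (x_t - sigma * eps_hat) / alpha
# so a fixed prediction error delta is amplified by sigma / alpha, which
# explodes at high noise levels (alpha -> 0). Recovering x0 from a
# velocity prediction (v = alpha * eps - sigma * x0) instead gives
#     x0_hat = alpha * x_t - sigma * v_hat
# where the error is only scaled by sigma <= 1, so it stays bounded.

delta = 0.01  # fixed error in the network's raw prediction
for sigma in [0.1, 0.9, 0.999, 0.999999]:
    alpha = np.sqrt(1.0 - sigma**2)
    eps_error = delta * sigma / alpha  # x0 error from eps-prediction
    v_error = delta * sigma            # x0 error from v-prediction
    print(f"sigma={sigma}: eps-param x0 error {eps_error:.3f}, "
          f"v-param x0 error {v_error:.5f}")
```

The noise-predicting model's error grows without bound as the amplifying factor diverges, while the velocity-predicting model's error never exceeds the raw prediction error.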

dLLM: Simple Diffusion Language Modeling

Zhou et al. [UC Berkeley, UIUC]

♥ 120   Diffusion LLM  

Language models traditionally generate text strictly left to right. But recently, researchers have found a promising alternative: diffusion language models. These systems can generate words in any order and iteratively refine their answers, unlocking highly flexible AI.

However, the underlying code was scattered across complex, isolated research repositories, making it incredibly difficult for developers to reproduce results or build upon each other’s work. To solve this, researchers created dLLM, a unified open-source framework that elegantly standardizes the development pipeline so the community can innovate together.

Inference pipeline: sampler swap from vanilla to FastdLLM MDLM sampler.

The framework seamlessly connects the core pillars of AI development: training, generation, and testing. Using a highly modular design, dLLM allows developers to snap different components together effortlessly. A researcher can easily swap out a training method or plug in a high-speed generation algorithm without rewriting the model's core architecture.
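The kind of sampler a developer might swap in can be sketched in a few lines. This is a generic confidence-based unmasking loop in the spirit of masked diffusion decoding, not dLLM's actual API; the `model` callable, the `MASK` sentinel, and the schedule are all made up for illustration:

```python
import numpy as np

MASK = -1  # sentinel id for a still-masked position

def sample_masked_diffusion(model, length, steps):
    """Toy confidence-based sampler for a masked diffusion LM: at each
    step the model scores every position and we commit only the most
    confident predictions among the still-masked slots, so decoding can
    proceed in any order. `model` is a hypothetical callable mapping the
    current token array to (predicted_ids, confidences)."""
    tokens = np.full(length, MASK)
    for step in range(steps):
        masked = np.where(tokens == MASK)[0]
        if masked.size == 0:
            break
        pred_ids, conf = model(tokens)
        # Unmask an even share of the remaining slots, most confident first.
        k = int(np.ceil(masked.size / (steps - step)))
        commit = masked[np.argsort(conf[masked])[-k:]]
        tokens[commit] = pred_ids[commit]
    return tokens

# Dummy stand-in model: predicts token id 7 everywhere, random confidence.
rng = np.random.default_rng(0)
dummy_model = lambda toks: (np.full(toks.size, 7), rng.random(toks.size))
out = sample_masked_diffusion(dummy_model, length=12, steps=4)
print(out)  # every position decoded, no MASK left
```

Because the sampler only touches the model through that one callable, swapping in a faster decoding algorithm means replacing a single function, which is exactly the kind of modularity the framework standardizes.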

By standardizing how these models are evaluated, researchers also uncovered a hidden quirk: diffusion models are intensely sensitive to tiny adjustments in generation settings. A single tweaked parameter can drastically alter a model's performance, highlighting exactly why a transparent, shared testing environment is vital for meaningful progress.

Terminal Visualizer showing transition from masked to decoded tokens.

Perhaps the most hopeful breakthrough is how this framework democratizes AI research. The team proved that building these dynamic models does not require massive supercomputers. Using their new recipes, they successfully transformed standard, off-the-shelf systems (including traditional discriminative architectures like BERT) into functional diffusion chatbots.

Sensitivity to decoding hyperparameters.

They achieved this with minimal computing power and simple fine-tuning, requiring no structural changes to the original models.
