The AI Timeline
Kimi Moonshot: Prefill-as-a-Service!?
plus more about Looped Transformers, Nexus, RNN with Memory, and more
Apr 14th ~ Apr 20th
#104 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 16K Moonshot AI has released Kimi K2.6, their open-source coding model that achieves SoTA results across major benchmarks like SWE-bench Pro and Math Vision. It offers long-horizon coding capabilities, supports over 4,000 continuous tool calls, and features improved "Agent Swarms" that enable massively parallel execution across complex, multi-file projects. You can explore the technical details or try the model on Moonshot or Hugging Face.

♥ 2.2K PrismML has launched Ternary Bonsai, a new family of models using 1.58-bit ternary weights to achieve a 9x reduction in size compared to standard 16-bit models. Available in 1.7B, 4B, and 8B parameter sizes under the Apache 2.0 license, these models offer a high intelligence-to-memory ratio, making them highly efficient for resource-constrained deployments. You can try it on GitHub or Hugging Face.

♥ 81K Anthropic has released Claude Opus 4.7, with improved instruction following, and a new self-verification capability. The update also brings significantly higher resolution vision processing and new API tools, including a flexible "xhigh" effort level and task budget management for long-running workflows. You can try it on the Claude website or through the API.

♥ 11K Alibaba has released Qwen3.6-35B-A3B, a new sparse Mixture-of-Experts (MoE) model that delivers high-level agentic coding and multimodal reasoning with only 3 billion active parameters. You can try it on Qwen Studio or Hugging Face.

Thunder Compute: The cheapest cloud GPU
Thunder Compute has cheap cloud GPUs for developers. We offer on-demand GPU cloud instances in enterprise-grade data centers for a fraction of the price of competitors.
With on-demand H100s at $1.38/GPU/hr, you get best-in-class reliability and networking, while competitors charge at least $4/GPU/hr.
With additional features like:
VSCode extension and CLI that let you connect to instances without SSH config
Snapshots to save instance state and restore it on any number of instances
Templates for ComfyUI, Ollama, Unsloth Studio, and more
$20 of free credit for students
Parcae: Scaling Laws For Stable Looped Language Models
Prairie et al. [University of California, San Diego, Together AI]
♥ 1.2k LLM Scaling
Researchers have been exploring a clever AI architecture known as "looped architectures." Instead of adding new layers, these models route information through the same layers multiple times in a loop. This keeps the model's parameter footprint small while increasing the effective depth, and compute, applied to each token.

Parcae and the Scaling Laws of Looping.
Unfortunately, this recycling process has historically been incredibly unstable. The math inside the loop tends to spiral out of control, causing the model’s learning to randomly spike or entirely collapse. This leaves the promising approach too unpredictable to scale.
To solve this, researchers analyzed the looping process through the lens of classical control theory, treating it as a continuous feedback system. They pinpointed the exact point of failure: the parameters responsible for injecting new information into the loop were growing unrestrained, essentially blowing out the system's memory.

Optimal µrec and Tokens Follow Predictable Power Laws
Using this information, the team created a new model architecture called Parcae. Parcae acts like a smart governor on an engine. By mathematically constraining these injection parameters, it ensures they stay safely balanced, preventing the system from overloading. Alongside a stabilizing step for incoming data, Parcae keeps internal signals beautifully calm and safely controlled.
Parcae completely eliminated the chaotic learning spikes of past models, proving that looping is a genuinely viable way to build smarter AI without simply making it larger. In fact, a Parcae model successfully matched the quality and performance of a traditional model twice its size.
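The paper's exact constraint isn't reproduced here, but the "governor" idea can be sketched numerically. In this toy sketch (the threshold `sigma_max`, the tanh nonlinearity, and the rescaling scheme are illustrative assumptions, not the paper's specification), the matrix that injects new information into the loop is kept from growing unrestrained by capping its largest singular value:

```python
import numpy as np

def constrain_spectral_norm(W, sigma_max=1.0):
    """Rescale W so its largest singular value is at most sigma_max.

    Bounding the injection matrix's spectral norm prevents the signal
    fed into the loop from being amplified without limit across iterations.
    """
    sigma = np.linalg.norm(W, ord=2)  # largest singular value
    if sigma > sigma_max:
        W = W * (sigma_max / sigma)
    return W

def looped_block(x, W_loop, W_inject, n_loops=8):
    """Toy looped layer: the same weights are reused n_loops times,
    with new information injected each iteration through W_inject."""
    h = np.zeros_like(x)
    W_inject = constrain_spectral_norm(W_inject)
    for _ in range(n_loops):
        h = np.tanh(h @ W_loop + x @ W_inject)
    return h

rng = np.random.default_rng(0)
W_loop = rng.normal(scale=0.3, size=(16, 16))
W_inject = rng.normal(scale=2.0, size=(16, 16))  # deliberately too large
x = rng.normal(size=(4, 16))
out = looped_block(x, W_loop, W_inject)
print(np.abs(out).max())  # stays bounded across loop iterations
```

Without the rescaling step, repeatedly injecting through an unconstrained matrix is exactly the "blown-out memory" failure mode the researchers identified.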
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
Qin et al. [New York University]
♥ 2.8K KV cache bycloud’s pick
When large language models generate text, they perform two distinctly different jobs: reading your prompt, known as the prefill phase, which demands massive computational power, and writing the response, or the decode phase, which relies heavily on memory speed.
Because these tasks need entirely different hardware to run efficiently, engineers have long wanted to split them up. The problem is that reading a long prompt creates a massive temporary memory file called a KVCache.

Comparison of two deployment paradigms for PD-disaggregated LLM serving.
Until now, this data was so enormous that moving it between different servers would completely jam the network, forcing companies to cram all their AI operations into single, ultra-expensive computing facilities.
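To make "enormous" concrete, here is a back-of-envelope KV cache size calculation. The model configuration below (80 layers, grouped-query attention with 8 KV heads, head dimension 128, fp16 cache) is an illustrative 70B-class assumption, not a figure from the paper:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the KV cache: 2 tensors (K and V) per layer,
    each of shape [kv_heads, seq_len, head_dim]."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 70B-class config, fp16 cache, 128K-token prompt:
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=128_000)
print(f"{size / 1e9:.1f} GB")  # ~41.9 GB for a single long request
```

Tens of gigabytes per long request is why streaming the cache between datacenters over ordinary networks was long considered impractical.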
Researchers wanted to break this physical wall, hoping to decouple the prefill and decode hardware across separate locations to make AI infrastructure drastically more flexible and cost-effective.

Deployment topology of the PrfaaS-PD architecture
The researchers found a remarkably elegant solution by developing a new architecture called Prefill-as-a-Service. Rather than trying to force every single user request across a network, their system acts as an intelligent traffic cop. It identifies exceptionally long, complex prompts and selectively offloads only those heavy prefill tasks to dedicated, high-performance computing clusters.
To make the transfer possible, they paired this routing with newer hybrid AI models that naturally condense the KVCache footprint. By shrinking the data size and only moving the most demanding tasks, the system can smoothly stream the resulting memory files over standard, everyday Ethernet cables to separate decode clusters without causing a network traffic jam. Short or simple requests just stay local, completely avoiding unnecessary travel.
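The traffic-cop routing decision can be sketched in a few lines. The token threshold and the cluster labels below are assumptions for illustration; the actual system's policy is more involved:

```python
def route_request(prompt_tokens: int, remote_threshold: int = 32_000) -> str:
    """Send only heavy prefill work to a dedicated prefill cluster.

    Short prompts are cheap to prefill locally, so their KV cache never
    crosses the network; long prompts are prefilled remotely and the
    condensed KV cache is streamed back to the decode cluster.
    """
    if prompt_tokens >= remote_threshold:
        return "remote_prefill"  # offload to the prefill cluster
    return "local_prefill"       # prefill on the decode machine itself

print(route_request(200_000))  # remote_prefill
print(route_request(1_500))    # local_prefill
```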
By allowing different hardware to handle what it does best across separate physical locations, the researchers achieved a 54 percent increase in overall processing throughput and cut the wait times for long requests by over 60 percent.
Language models transmit behavioural traits through hidden signals in data
Cloud et al. [Anthropic, Truthful AI, Warsaw University of Technology, Oxford Martin AI Governance Initiative, Alignment Research Center, University of California, University of Cambridge]
♥ 2.6K LLM hidden signals
As artificial intelligence systems grow more advanced, developers increasingly use large "teacher" models to generate data and train smaller, more efficient "student" models. To keep these new systems safe, creators carefully filter this training data to scrub away any biased or harmful content. But a profound question has lingered: can a student inherit a teacher’s hidden traits even if all obvious evidence is erased from the data?
Researchers recently explored this mystery to understand the invisible lineage of AI. Solving this puzzle represents a deeply hopeful step forward, giving developers the insight needed to ensure that as systems learn from one another, they pass down only safe and beneficial behaviors.

Schematic overview of the subliminal learning effect.
Researchers have discovered a phenomenon they call "subliminal learning". They gave a teacher model a specific hidden trait, like a disproportionate fondness for owls or a tendency to produce misaligned, unsafe responses.
They then asked this teacher to generate completely unrelated, harmless data, such as simple number sequences or basic math reasoning. Even after scientists rigorously filtered this data to guarantee absolutely no mention of owls or dangerous concepts remained, the student models trained on these neutral numbers still adopted the teacher’s hidden traits.
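The filtering step of this experimental setup is easy to sketch. The banned-word list and the samples below are illustrative stand-ins, not the paper's actual filter:

```python
import re

BANNED = {"owl", "owls"}  # illustrative trait-related terms

def is_clean(sample: str) -> bool:
    """Reject any teacher output that mentions the trait explicitly."""
    tokens = re.findall(r"[a-z]+", sample.lower())
    return not any(t in BANNED for t in tokens)

teacher_outputs = [
    "231, 495, 882, 104",    # neutral number sequence, kept
    "I love owls: 7, 7, 7",  # explicit mention, filtered out
    "12, 19, 33, 47, 56",    # neutral, kept
]
filtered = [s for s in teacher_outputs if is_clean(s)]
print(len(filtered))  # 2 samples survive the filter
```

The paper's striking result is that students trained on data that passes a filter like this can still inherit the teacher's hidden preference: the trait rides along in statistical patterns no keyword filter can see.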

The structure of our main experiments to test subliminal learning.
The students effectively read between the lines, inheriting complex behaviors through subtle, invisible patterns woven deeply into the basic data.
Through mathematical proofs, the researchers revealed that this invisible transmission happens when the teacher and student share the same base model. Because their neural architectures match, the student instinctively aligns with the teacher's broader properties during training.

Students reliably express increased animal preference only when trained on numbers generated by teachers with the same initialization.
This shows that evaluating AI safety requires looking beyond just the text in a dataset to examine the entire family tree of the models involved.
Memory Caching: RNNs with Growing Memory
Behrouz et al. [Google Research, Cornell University, USC]
♥ 987 LLM Memory
Current top-tier AI models remember every single word of their context, but doing so requires an enormous amount of computing power and memory. Conversely, more efficient models called Recurrent Neural Networks read with the equivalent of a tiny notepad.
They compress information into a fixed-size memory to save processing power, but they inevitably forget important details from earlier chapters. Researchers have been searching for a perfect middle ground, a way to give AI brilliant recall without the crushing computational cost.

The Overall Memory Caching Method.
To solve this, scientists developed a remarkably clever technique called Memory Caching. Instead of forcing the AI to memorize everything or squash it all into a single overflowing notepad, this method allows the model to periodically save "checkpoints" of its thoughts.
As the AI processes long text, it breaks the information into segments, summarizes the data into a memory state, and safely caches that summary. When the model needs to recall something, it doesn't just rely on its immediate memory. It actively looks back through its library of saved checkpoints. By combining its current context with these stored historical summaries, the model's memory capacity naturally grows as the text gets longer.

Sparse Selective Caching (SSC) of Memories.
The team designed different ways to use the checkpoints, including a highly efficient method where the model intelligently selects only the most relevant past memories to retrieve, rather than computing every single one. When tested on complex tasks like finding hidden information in long documents, this caching method dramatically boosted the performance of efficient models.
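A rough numpy sketch of the checkpoint-and-retrieve idea follows. Mean pooling as the segment summary and dot-product top-k scoring are simplifying assumptions: in the actual method, the memory state comes from a learned recurrent update, not a pooling operation.

```python
import numpy as np

def summarize(segment):
    """Compress a segment of token vectors into one fixed-size memory state.
    (Mean pooling stands in for the learned recurrent update.)"""
    return segment.mean(axis=0)

def read_with_memory_cache(tokens, seg_len=64, top_k=2):
    cache = []    # growing library of cached memory checkpoints
    outputs = []
    for start in range(0, len(tokens), seg_len):
        current = summarize(tokens[start:start + seg_len])
        if cache:
            # Sparse selective caching: score cached checkpoints against
            # the current state and retrieve only the top-k most relevant,
            # instead of attending over every saved checkpoint.
            scores = np.array([c @ current for c in cache])
            picked = [cache[i] for i in np.argsort(scores)[-top_k:]]
            outputs.append(current + np.mean(picked, axis=0))
        else:
            outputs.append(current)
        cache.append(current)  # checkpoint this segment's memory
    return outputs, cache

rng = np.random.default_rng(0)
tokens = rng.normal(size=(512, 32))  # 512 toy token vectors
outputs, cache = read_with_memory_cache(tokens)
print(len(cache))  # 8 checkpoints for 512 tokens in segments of 64
```

Note how the cache grows with the text length, which is exactly the middle ground the section describes: memory capacity scales with the input, while each step still works on fixed-size states.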

Needle-In-A-Haystack experiments with three levels of difficulty
Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
Khatri et al. [Meta, UT Austin, UCL, UC Berkeley, Harvard University, Periodic Labs]
♥ 270 LLM pretraining
When building Large Language Models, the vast majority of time and computing power is spent on pretraining, where the model learns from a mixture of many data sources (code, math, web text, and more) at once. When an AI tries to master all these different subjects simultaneously, does it find a compromise that simply looks good on average, or does it discover a geometric sweet spot close to the perfect solution for each individual subject?
It turns out that standard training methods typically settle for the compromise. They stop when the overall average error is low, even if the model's internal settings end up geometrically distant from the ideal setup for specific tasks.

Illustration of two types of minimizer
To solve this, researchers developed a new optimization approach called Nexus. Instead of just chasing a good average score, Nexus actively guides the model toward an intersection where the ideal solutions for all these different subjects naturally overlap. It achieves this by maximizing "gradient similarity," ensuring that the mathematical directions the model follows while learning remain aligned across different data sources.
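The toy per-domain losses below are illustrative assumptions, but they make the "compromise minimum" picture concrete: plain averaged-gradient training converges to a point where the two domains' gradients point in opposite directions, exactly the misalignment that Nexus's gradient-similarity objective is designed to push against. (The sketch only measures the alignment signal; Nexus additionally differentiates through it during training.)

```python
import numpy as np

def domain_grad(w, target):
    """Gradient of a toy per-domain loss 0.5 * ||w - target||^2."""
    return w - target

def grad_cosine(g1, g2):
    """Cosine similarity between two domains' gradients: the alignment
    signal that Nexus pushes upward during pretraining."""
    return g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))

rng = np.random.default_rng(0)
w = rng.normal(size=8)
t_code, t_math = rng.normal(size=8), rng.normal(size=8)

for step in range(200):
    g1, g2 = domain_grad(w, t_code), domain_grad(w, t_math)
    # Plain multi-domain training: follow the averaged gradient.
    # Nexus would add a term that also increases grad_cosine(g1, g2),
    # steering w toward a region where the domains agree.
    w -= 0.05 * (g1 + g2) / 2

final = grad_cosine(domain_grad(w, t_code), domain_grad(w, t_math))
print(round(float(final), 3))  # → -1.0: at the averaged minimum, the
                               # two domains' gradients anti-align
```

That cosine of -1 is the signature of a compromise minimizer: the average loss is low, but the model sits geometrically between the two domains' ideal solutions rather than at a point both share.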

By forcing the model to seek this geometric closeness, Nexus unlocked up to a 15 percent accuracy improvement on complex reasoning tasks. Remarkably, it achieved these massive gains while reaching the exact same overall training loss as traditional methods.

This discovery offers a hopeful glimpse into the future of AI development. It proves that we do not necessarily need infinitely more data or larger supercomputers to build better models; sometimes, we just need to help them navigate their learning landscape a little more elegantly.

Analysis of Gradient Similarity, Loss, and Benchmarks

