There Will Be a Scientific Theory of Deep Learning

plus more about Hyperloop Transformer, Qwen-3.5 Omni, and Scaling Self-Play with Self-Guidance

Apr 23rd ~ Apr 29th
#105 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 52k OpenAI has introduced GPT-5.5, its latest model with multi-step task execution through enhanced tool use and self-correction. The update maintains the per-token latency of GPT-5.4 while delivering superior performance in coding, computer use, and research-intensive applications with significantly higher token efficiency.

  2. ♥ 1.5k Tencent has released the open-source preview of Hy3, a 295B parameter model featuring a 21B active parameter architecture optimized for advanced reasoning and agentic tasks. The release shows high cost efficiency and competitive performance within its size class ahead of the full official launch. You can try it on GitHub or Hugging Face.

  3. ♥ 11k Alibaba's Qwen team has released Qwen3.6-27B, a dense open-source model under the Apache 2.0 license that delivers flagship-level coding and multimodal reasoning capabilities. Despite its smaller 27B parameter size, the model outperforms the much larger Qwen3.5-397B-A17B on major benchmarks and natively supports both thinking and non-thinking modes for text, image, and video tasks. You can try it on GitHub or Hugging Face.

  4. ♥ 15k OpenAI has launched ChatGPT Images 2.0, a major update to its integrated image generation engine with significantly enhanced visual fidelity and more precise prompt adherence. The new version focuses on improving multimodal reasoning within the chat interface to deliver more realistic and contextually accurate outputs.

    This image was generated by ChatGPT Images 2.0

Intuitive AI Academy - NEW Advanced RL Chapter!

My latest project, Intuitive AI Academy, is the perfect starting point for you! We focus on building your intuition for understanding LLMs, from transformer components to post-training logic. All in one place.

We have just added a new advanced RL chapter that covers the basics of RL and the current state of RLHF!

We currently have an early bird offer: early users get 40% off the yearly plan.

Use code: TIMELINE

Hyperloop Transformers

Zeitoun et al. [MIT]

♥ 370   Transformers  

Bigger AI models consume more memory. Researchers are eager to bring powerful language models directly to smartphones, but mobile hardware simply lacks the memory to store giant AI models.

To solve this, researchers designed a simple architecture called the Hyperloop Transformer. Normally, information in an AI passes through a long sequence of unique computational layers. Instead, the team organized their model into a beginning, a middle, and an end, and programmed the middle section to repeatedly "loop" over itself.

(Left) A vanilla middle-cycle looped Transformer architecture with two loops. (Right) A Hyperloop Transformer, which uses parallel residual streams that are written to after each loop using hyper-connections

Reusing these middle layers drastically cuts down the memory required. However, simply looping layers usually makes a model less accurate, so the team introduced a clever structural fix called "hyper-connections." At the very end of each loop, the model temporarily splits the flow of data into multiple parallel streams. This allows the model to process information flexibly and shift its internal perspective between loops, avoiding the rigid thinking of standard looping models while adding almost no extra computational cost.
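The looping idea above can be sketched in a few lines. This is a minimal toy, not the paper's implementation: one shared "middle" block is reused every loop, and made-up per-loop read/write weights mix the block's output into several parallel residual streams, standing in for hyper-connections. All sizes and weight initializations here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, STREAMS, LOOPS = 8, 4, 3  # hidden size, parallel residual streams, loop count

# One shared "middle" block, reused on every loop: this reuse is the memory saving.
W_mid = rng.standard_normal((D, D)) / np.sqrt(D)

# Toy hyper-connection weights: how streams are read as the next loop's input,
# and how the block output is written back to each stream, varying per loop.
read_w = rng.standard_normal((LOOPS, STREAMS))
write_w = rng.standard_normal((LOOPS, STREAMS))

def looped_forward(x):
    streams = np.tile(x, (STREAMS, 1))                   # (STREAMS, D)
    for t in range(LOOPS):
        h_in = read_w[t] @ streams                       # mix streams into one input
        h_out = np.tanh(h_in @ W_mid)                    # same weights reused each loop
        streams = streams + np.outer(write_w[t], h_out)  # write back to every stream
    return streams.mean(axis=0)                          # final read-out

y = looped_forward(rng.standard_normal(D))
print(y.shape)  # (8,)
```

Note the trade being made: the shared block stores a single D×D weight matrix no matter how many loops run, whereas a stack of unique layers would need LOOPS separate matrices, while the tiny per-loop read/write vectors let each pass through the shared block behave differently.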

Perplexity numbers as the number of loops is varied for the 135M (left) and 579M (right) parameter looped models.

Researchers found that their Hyperloop Transformer actually outperforms traditional models of the same depth while using fifty percent fewer parameters. This high performance holds strong even when the model undergoes additional memory-saving compression techniques after training.

Scaling Self-Play with Self-Guidance

Bailey et al. [Stanford University]

♥ 264   Self guidance   bycloud’s pick  

Teaching AI to solve complex mathematical problems is tricky, and when problems are too hard, the AI hits a wall and stops learning. Researchers previously tried a clever workaround called self-play, where one part of the AI creates practice problems for another part to solve.

Unfortunately, this often breaks down. The problem-creator figures out how to game the system, generating artificially convoluted puzzles that technically score as "difficult" but do not actually help the solver improve. Researchers wanted to know how to stop this plateau and keep the AI learning continuously on its own.

Intuition behind SGS. Bottom, the space of problems solvable over course of SGS

To solve this, researchers developed an inspiring new framework called Self-Guided Self-Play, where the AI takes on three roles: a solver, a problem-creator, and a guide. When the system faces a target problem it cannot crack, the problem-creator invents a simpler, stepping-stone version of it.

The guide then acts as an internal quality controller. It reviews the new practice problem to ensure it is genuinely relevant to the original goal, rather than just a messy collection of rules meant to cheat the scoring metric. If a synthetic problem is poorly constructed, the guide penalizes it.

By having the AI judge its own practice material, the researchers prevented the system from spiraling into useless problem generation. Instead, it sustained steady progress for far longer than previous methods.
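The three-role loop described above can be caricatured as a tiny simulation. This is a hypothetical toy, not the paper's algorithm: problems are reduced to a numeric "difficulty" and a topic label, the creator occasionally drifts off-topic (the reward-hacking failure mode), and the guide's only job is to discard those irrelevant problems. Every number below is made up.

```python
import random

random.seed(0)
TOPICS = ["algebra", "geometry"]

def solver(problem, skill):
    # Toy solver: succeeds whenever its skill covers the problem's difficulty.
    return skill >= problem["difficulty"]

def creator(target, skill):
    # Toy creator: proposes a stepping stone near the solver's current frontier,
    # but sometimes drifts off-topic (the gaming behavior the guide must catch).
    topic = target["topic"] if random.random() < 0.7 else random.choice(TOPICS)
    difficulty = min(target["difficulty"], skill * random.uniform(0.7, 1.0))
    return {"difficulty": difficulty, "topic": topic}

def guide(candidate, target):
    # Toy guide: accepts only practice problems relevant to the target.
    return candidate["topic"] == target["topic"]

def self_guided_self_play(target, skill=1.0, steps=50):
    for _ in range(steps):
        if solver(target, skill):
            break                      # target problem finally cracked
        practice = creator(target, skill)
        if not guide(practice, target):
            continue                   # penalized: off-topic problem discarded
        if solver(practice, skill):
            skill *= 1.1               # relevant stepping stones build skill
    return skill

final_skill = self_guided_self_play({"difficulty": 3.0, "topic": "algebra"})
print(final_skill)
```

Without the `guide` filter, the creator in this toy could pad the curriculum with off-topic problems that waste training steps; the filter is what keeps the solver's skill climbing toward the target.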

Using this self-guided method, a relatively small AI model was able to solve more mathematical problems than a standard model nearly a hundred times its size.

Qwen3.5-Omni Technical Report

Qwen Team

♥ 242   Multi-modal  

Most AI models are just a text-in, text-out box. While human interaction is rich and multi-sensory, machines have historically struggled to fluidly combine sight, sound, and language in real time. Qwen3.5-Omni can be used to create digital assistants that move beyond merely answering text prompts to truly understanding the messy, overlapping reality of human communication.

The overview of Qwen3.5-Omni.

To achieve this, researchers designed a brilliant "Thinker-Talker" architecture. The Thinker absorbs massive amounts of data, up to ten hours of audio or hundreds of seconds of high-definition video, using precise timestamps to ensure every sight and sound remains perfectly synchronized.

The overview of AuT. Trained on 40 million hours of supervised data, including more multilingual data, the AuT encoder in Qwen3.5-Omni obtains stronger general-purpose audio representations at 6.25 Hz.

The Talker then acts as the voice. Historically, streaming AI speech suffers from awkward pauses and robotic glitches because text and audio process at different speeds. To fix this, the team invented a dynamic alignment technique that seamlessly knits text and speech units together on the fly. This allows the system to converse with genuine, human-like emotional nuance across dozens of languages and even seamlessly adopt custom voices from a brief user sample.

What makes this a massive leap forward is how these synchronized senses unlock entirely new autonomous skills. Because the model perfectly aligns what it sees with what it hears, it can independently search the web or use software tools to solve complex problems.

Most remarkably, researchers witnessed the emergence of a brand-new capability they call Audio-Visual Vibe Coding. The AI can watch a video, listen to verbal instructions, and instantly write executable computer code based on that combined experience. By uniting our sensory world with immense computational power, this discovery brings us much closer to technology that naturally adapts to us.

There Will Be a Scientific Theory of Deep Learning

Simon et al. [UC Berkeley and Imbue, Harvard University, University of Pennsylvania, Flatiron Institute, New York University, Stanford University, Astera Institute]

♥ 1.4K   LLM Explainability  

For years, artificial intelligence has felt more like alchemy than rigorous science. Engineers know these neural networks can achieve incredible things, but the systems are opaque black boxes built mostly through trial and error. Researchers are trying to solve a fundamental problem: how can we predict, control, and fully understand what happens inside these models when they learn?

Large and small network output multipliers are sufficient to induce lazy and rich training dynamics.

Researchers propose that a unified mathematical theory, which they call "learning mechanics," is beginning to emerge. Just as physics explains how natural forces move objects through space, this new field explains the invisible forces driving a neural network’s journey through the training process.

To uncover these underlying mechanics, researchers synthesized major trends to reveal how seemingly chaotic systems actually follow universal rules. They found that by stripping away the complexity of modern networks, like imagining them stretching to infinite sizes or removing certain quirks, the math beautifully simplifies.

The loss of large neural networks decays according to predictable neural scaling laws.

In these idealized states, researchers can map out exactly how models acquire knowledge. Furthermore, the paper highlights how macroscopic behaviors, like a model's ultimate performance, consistently obey strict scaling laws based merely on the data and computing power used.
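A concrete way to see what a scaling law buys you: because a power law L(C) = a · C^(−b) is a straight line in log-log space, a simple linear fit over small-scale measurements recovers the exponent and lets you extrapolate to budgets you never trained at. The loss values and the exponent below are synthetic, invented purely to illustrate the fitting procedure.

```python
import numpy as np

# Hypothetical loss measurements at increasing compute budgets (FLOPs);
# generated from a made-up power law L(C) = 50 * C^(-0.05).
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss = 50.0 * compute ** -0.05

# In log-log space the power law is linear: log L = log a - b * log C,
# so ordinary least squares recovers the scaling exponent b and constant a.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a_hat, b_hat = np.exp(intercept), -slope
print(b_hat, a_hat)

# Extrapolate to a budget 10x larger than any measurement.
predicted = a_hat * (1e23) ** -b_hat
```

This is exactly why scaling laws matter in practice: a handful of cheap training runs pins down (a, b), and the fitted line then forecasts the loss of a run too expensive to try blindly.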

Even the endless numerical tuning knobs of model training can be mathematically disentangled to reveal a clear, underlying system of cause and effect. By focusing on the broad, aggregate dynamics of the learning process rather than tracking every individual artificial neuron, scientists are laying the groundwork for a robust theoretical foundation.
