
Diffusion models as Game Engines, Generative Verifiers, and Distributed Training Over-The-Internet

#21 | Latest AI Research Explained Simply

In this issue: x3 industry news, x3 AI research papers

Aug 26th ~ Sep 1st

๐Ÿ—ž๏ธ Industry News in 1 Line

  1. ♥ 2.6k Magic AI, another relatively new AI company, has launched LTM-2-Mini, a state-of-the-art LLM with a context length of 100 million tokens. This model can handle nearly 10 million lines of code or 75 novels in a single conversation, which is pretty impressive.

  2. ♥ 18k An AI startup called 1X is giving Boston Dynamics a run for their money. They have announced NEO, a completely autonomous humanoid robot built with soft limbs instead of rigid hydraulics. The robot's vision and movement can be taken over by a human operator if needed, which would raise many safety and ethical concerns if it were ever hacked.

    Neo humanoid robot by 1X

  3. ♥ 1.4k LLMs are pretty useful but they can be slow. The AI hardware startup Cerebras has launched Cerebras Inference, a new inference platform that can run popular LLMs like Llama 3.1-70B 20 times faster than GPU-based alternatives at merely 20% of the cost. It supports 16-bit precision for better accuracy and costs only $0.60 per million tokens.

Thinkbuddy: The macOS AI Lab for Power Users

🌀 LLM Remix | 🤖 Multi-Models | ⚡️ Shortcuts + 🎙️ Whisper | 🔑 All-Access Single Subscription

Hey, AI enthusiasts! We've created the ultimate macOS companion for power users. Thinkbuddy isn't just another ChatGPT wrapper; it's deeply integrated with macOS, leveraging shortcuts, prompt templates, and advanced features like AI model mixing for ultimate responses.

You don't need to pay $25 for Claude, $40 for ChatGPT + Gemini, or juggle different UIs from various inference companies just to test Meta's new Llama 3.1 405B. Our solution offers all of these in one place!

Features:

🤖 LLM Remix: Combine GPT-4o + Claude Sonnet 3.5 + Gemini 1.5 Pro

🔀 10+ leading models, no extra subscriptions - free to download!

💻 50+ Local LLMs (coming soon)

🍎 Deep macOS integration

🌍 Multilingual support (80+ languages)

🎙️ Audio transcription & 📸 quick screenshot analysis

📄 PDF, DOCX, XLSX, JSON, SQL file support

⚡ Async LLM requests & remix under 10 seconds

🔒 Privacy-focused: Local chat storage

🌐 Web 📱 Mobile 💻 Windows (coming soon)

Curious?

✨ Access all assistants with one subscription. Remix their best answers for the ultimate AI response. No need to pay $20 each. Download now for free!

First 20 sales only: pay once, get all models for life at 30% off ($130) with coupon code BYCLOUD, and watch Alex's review to see all the features in 5 minutes.

(Not satisfied? 30-day return policy - No questions asked!)

Diffusion Models Are Real-Time Game Engines

Valevski et al. [Google Research, Tel Aviv University, Google DeepMind]

♥ 1.9k   Diffusion Based World Sim

Gameplay Videos of GameNGen

Introduction to GameNGen

We've seen Doom running on everything from toasters to treadmills, but those are just emulations of the original game code. So far, no one has successfully replaced the code with a neural network that learns to simulate the game on its own while remaining interactive.

This paper presents GameNGen, a revolutionary game engine powered entirely by a neural network. Unlike traditional game engines, which rely on manually written code and rules, GameNGen learns to simulate games from data alone. Think of it as the ultimate answer to the age-old question, "Can it run Doom?"

It can simulate complex interactions like attacking enemies, opening doors, and updating the game state, all in real time. While not a perfect replica of the original game, GameNGen's visual quality is impressive, comparable to lossy JPEG compression.

How Does GameNGen Work?

Let's break down how GameNGen, the neural network game engine, actually works! First, the game world is represented as a set of latent states (like the information in the game's memory) and observations (what the player sees on the screen), plus a set of actions the player can take (moving, shooting, etc.). The goal is to build a system that can predict the next frame of the game based on the current state, previous frames, and the player's actions.

Architecture pipeline of GameNGen

GameNGen's Two-Stage Process:

  1. Training an Agent: First, they train an AI agent to play the game. This agent is trained to act in a way that's similar to a human player, creating a diverse dataset of gameplay actions and observations.

  2. Training the Diffusion Model: Once the agent has played a lot of the game, they train a diffusion model. This model learns to predict the next frame in the game based on the agent's actions and the previous frames.
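To make the two-stage process concrete, here is a minimal sketch of stage 1 in Python. The environment class, the policy function, and all of their internals are illustrative stand-ins, not the paper's actual code (the paper uses an RL-trained agent playing real Doom):

```python
import random

class DoomEnvStub:
    """Stand-in for a real game environment; returns dummy frames."""
    def reset(self):
        return [[0.0] * 64 for _ in range(64)]  # dummy 64x64 frame

    def step(self, action):
        next_frame = [[random.random()] * 64 for _ in range(64)]
        reward, done = 0.0, random.random() < 0.01
        return next_frame, reward, done

def agent_policy(frame):
    """Stand-in policy; the paper trains this with RL to act like a human player."""
    return random.choice(["forward", "back", "turn_left", "turn_right", "fire", "use"])

# Stage 1: roll out the agent and record (frame, action) pairs; these
# trajectories become the training data for the diffusion model in stage 2.
env, dataset = DoomEnvStub(), []
frame = env.reset()
for _ in range(1000):
    action = agent_policy(frame)
    next_frame, _, done = env.step(action)
    dataset.append((frame, action))
    frame = env.reset() if done else next_frame
```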

Key Mechanics:

  • Diffusion Model: A diffusion model is a neural network trained to reverse a gradual noising process: noise is added to training images, and the model learns to remove it step by step, which lets it generate new images starting from pure noise.

  • Conditioning: The diffusion model is trained to predict the next frame conditioned on the player's actions and the previous frames. This is like giving the model a set of clues about what should happen next.

  • Noise Augmentation: To keep errors from accumulating over time, they add noise to the context frames during training (and tell the model the noise level). This makes the model more robust and less prone to drift in its predictions.

  • Auto-Regressive Generation: The model generates frames one by one, feeding its own previously generated frames back in as input, like predicting the next chapter of a story from the previous chapters (see the sketch after this list).

  • Context Length: GameNGen only utilizes the past 3 seconds of gameplay (about 60 frames) as its context window; increasing the historical context didn't yield a significant performance gain. As a consequence, all game state (e.g., player location, items, and health) must be inferred from on-screen content, for instance to determine whether an enemy has already been defeated.

  • Agent Play: Training the diffusion model on data generated by an AI agent that learns to play the game improves performance compared to training on random actions. This shows the importance of having relevant training data that captures the dynamics of human gameplay.
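Putting the conditioning, noise augmentation, and auto-regressive pieces together, here is a hedged sketch of the data flow. The `denoise` function is a placeholder for the conditioned diffusion model (the paper fine-tunes a Stable Diffusion model); the window sizes and noise levels are illustrative:

```python
import random

CONTEXT = 60  # roughly the past 3 seconds of frames used as conditioning

def denoise(noisy_frame, past_frames, past_actions, context_noise_level):
    """Placeholder for the diffusion model: given a noisy target frame plus past
    frames/actions (and how corrupted the context is), predict the next frame."""
    return noisy_frame  # a real model would run its denoising steps here

def add_noise(frame, level):
    return [px + random.gauss(0.0, level) for px in frame]

# Training-time noise augmentation: corrupt the *context* frames (and tell the
# model how much) so it learns to tolerate its own imperfect generations later.
def make_training_example(past_frames, past_actions, target_frame):
    level = random.uniform(0.0, 0.7)  # illustrative corruption strength
    noisy_context = [add_noise(f, level) for f in past_frames]
    return noisy_context, past_actions, level, target_frame

# Inference: auto-regressive rollout, feeding generated frames back as context.
frames = [[0.0] * 4096 for _ in range(CONTEXT)]  # dummy flattened 64x64 frames
actions = ["noop"] * CONTEXT
for player_action in ["forward", "forward", "fire"]:
    actions = actions[1:] + [player_action]       # slide the action window
    pure_noise = [random.gauss(0.0, 1.0) for _ in range(4096)]
    new_frame = denoise(pure_noise, frames, actions, context_noise_level=0.0)
    frames = frames[1:] + [new_frame]             # slide the frame window
```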

Results and Real-World Implications of GameNGen

GameNGen is capable of simulating complex games like Doom in real-time at a quality comparable to the original game. It's a major breakthrough in using neural networks to create game engines, opening up a world of possibilities for game development.

  • Image Quality: In terms of PSNR (Peak Signal-to-Noise Ratio), it's comparable to lossy JPEG compression, meaning the simulated images are very close to the original game. What's even more impressive is that human raters had difficulty distinguishing between short clips of the simulated game and the actual game! One notable shortcoming is the enemies' death animations, which turn into blobs most of the time.

  • Video Quality: While single frames look great, the quality does degrade slightly over longer videos. This is due to the accumulation of small errors in the prediction process. However, the simulated videos are still very convincing because they capture the overall content and visual style of the game.

Imagine a future where games are no longer limited by code but can be created automatically by AI! With GameNGen, we might be one step closer to that reality. GameNGen opens up the possibility of creating game engines entirely from data which shifts the focus from writing code to providing training data.

Generative Verifiers: Reward Modeling as Next-Token Prediction

Zhang et al. [Google DeepMind, University of Toronto, Mila, UCLA]

♥ 575   LLM Verification

GenRM workflow to verify output of LLMs

Introduction to Generative Verifiers

LLMs are getting pretty advanced, but they still make mistakes, so we need a way to check whether their answers are correct. The usual approach is kind of clunky: we judge an answer by assigning it a single score, without really looking at what the model did to reach that answer. Maybe the LLM produced the correct answer, but the reasoning behind it was completely wrong.

This paper suggests training a special AI "tutor" to check the answers of LLMs. This tutor, called a generative verifier (GenRM), actually tries to understand how the model got its answer: it asks the LLM to show its reasoning and produce step-by-step results, so the tutor can catch any mistakes the model made along the way.

How do Generative Verifiers Work?

Traditional methods for verifying AI solutions often rely on assigning a numerical score to a given answer by effectively treating it as a "right" or "wrong" classification. However, this approach fails to leverage the full potential of large language models (LLMs), which are inherently designed for generating text. GenRM, on the other hand, utilizes the LLM's text generation capabilities in a novel way.

Instead of assigning a raw score, GenRM frames verification as a text prediction task. It prompts the LLM with a simple question like "Is the answer correct? Yes or No?", and uses the probability the LLM assigns to the token "Yes" as the verification score. This harnesses the LLM's natural language generation abilities, allowing it to reason about the answer's correctness in a more intuitive and nuanced way.
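As a minimal sketch of this scoring rule, the snippet below loads an off-the-shelf Hugging Face causal LM and reads off the next-token probability of "Yes". The model choice ("gpt2") and the prompt template are our placeholders, not the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder verifier model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def genrm_score(question: str, solution: str) -> float:
    """Return p("Yes") as the verifier's score for a proposed solution."""
    prompt = f"{question}\nProposed solution: {solution}\nIs the answer correct? "
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # next-token logits
    yes_id = tok.encode(" Yes", add_special_tokens=False)[0]
    no_id = tok.encode(" No", add_special_tokens=False)[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()                         # probability mass on "Yes"

print(genrm_score("What is 2 + 2?", "4"))
```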

Prompts used for GenRM approach

Additionally, GenRM uses a unified training approach by simultaneously training the LLM on both generating correct solutions and verifying those solutions. This dual-training method allows for a positive transfer of knowledge between these two tasks, making the LLM more adept at both generating accurate solutions and identifying potential errors in its own reasoning.

GenRM also uses the concept of "chain-of-thought" reasoning by asking the LLM to explicitly verbalize its reasoning steps during the problem-solving process. This "thinking out loud" approach provides GenRM with a deeper understanding of the LLM's reasoning process which enables it to identify subtle errors that might otherwise go unnoticed.
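GenRM-CoT combines this "thinking out loud" with sampling: draw several verification rationales, score each one, and average. A sketch, reusing `genrm_score` from above with a hypothetical `sample_rationale` helper (in practice this would be another LLM call at non-zero temperature):

```python
def sample_rationale(question: str, solution: str, temperature: float = 1.0) -> str:
    """Hypothetical helper: prompt the LLM to verify the solution step by step
    and sample its chain-of-thought at the given temperature."""
    return "Let's verify each step of the solution..."  # placeholder text

def genrm_cot_score(question: str, solution: str, k: int = 32) -> float:
    """Average p("Yes") over k sampled rationales (majority voting)."""
    scores = []
    for _ in range(k):
        rationale = sample_rationale(question, solution)
        # Condition the final Yes/No prediction on the sampled rationale.
        scores.append(genrm_score(question, solution + "\n" + rationale))
    return sum(scores) / len(scores)
```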

Testing Generative Verifiers

GenRM consistently outperforms traditional discriminative verifiers (ORM), self-consistency, and LLM-as-a-Judge across various reasoning tasks, including Last Letter Concatenation, Word Sorting, and GSM8K (grade-school math). This shows that GenRM is an effective approach to train verifiers for reasoning tasks. Here's a breakdown of the key findings:

  • Chain-of-Thought (CoT) Advantage: GenRM-CoT, which integrates chain-of-thought reasoning with majority voting, further enhances performance, especially on algorithmic tasks where oracle CoTs are available. 

  • Data and Model Scaling: GenRM's performance scales favorably with both increased dataset size and model capacity. This means it can effectively learn from more data and benefit from larger, more powerful models.

  • Inference-Time Compute: GenRM-CoT effectively utilizes additional inference-time compute through majority voting, allowing for further performance gains.

  • Synthetic Rationale Quality: Even synthetic rationales generated by LLMs can improve performance, but the quality of these rationales matters. Using reference-guided grading significantly improves results, which shows that LLMs can identify reasoning errors when they have a reference solution to compare against.

Benchmark results of the GenRM technique

DisTrO: Distributed Training Over-The-Internet

Peng et al. [Nous Research]

♥ 3.1k   Distributed LLM

Introduction to DisTrO

If you have trained your own neural networks with PyTorch or TensorFlow on your computer, you know that training them takes a lot of resources and is a slow process. As these networks get bigger, they require even more resources, and it is quite common for AI researchers to spread training across a cluster of computers with multiple accelerators (e.g., GPUs) to speed up the process.

Traditional methods like Distributed Data Parallelism (DDP) synchronize gradients between all accelerators after each training step, which creates a significant communication bottleneck. Because of this, the machines need high-speed interconnects, which are quite expensive. This limits training to powerful, dedicated data centers with massive infrastructure costs.

The paper introduces "Distributed Training Over-The-Internet" (DisTrO), a family of distributed optimizers designed to drastically reduce inter-GPU communication requirements during training. DisTrO operates by optimizing the gradient sharing process, allowing for efficient training over slower internet connections and heterogeneous networking hardware.

The idea of distributing a compute workload isn't new: programs such as Folding@home, LHC@Home, MilkyWay@Home, World Community Grid, and BOINC have long let normal people like us contribute to cutting-edge research by donating compute power from home. But unlike previous low-communication optimizers, DisTrO is architecture-agnostic, network-agnostic, and supports distributed data parallelism with minimal overhead.

Inner-Workings of DisTrO

DisTrO is a novel distributed optimizer designed to reduce inter-GPU communication requirements during large-scale neural network training. Let's see how it achieved a significant bandwidth reduction when training a 1.2B-parameter LLM.

The Setup:

  • Model: A 1.2B parameter Llama 2 LLM architecture, chosen for its popularity and similarity to other widely used LLMs.

  • Training Data: 105B tokens from the Dolma v1.7 dataset.

  • Hardware: 32 H100 GPUs, each with the full model loaded in VRAM.

  • Optimizer: AdamW with a cosine decay learning rate schedule, with DisTrO-AdamW replacing AdamW for the DisTrO training run.

  • Baseline: Standard AdamW+All-Reduce, which synchronizes gradients across all GPUs after each step.

DisTrO in Action:

  1. Replacing All-Reduce: DisTrO-AdamW eliminates the All-Reduce operation, which is responsible for the bulk of communication in traditional methods (see the sketch after this list). This significantly reduces the amount of data exchanged between GPUs.

  2. No Optimizer State Synchronization: Unlike other distributed training methods, DisTrO does not synchronize the optimizer state across GPUs. This further contributes to reduced communication.

  3. Stateless Variant: DisTrO also has a stateless variant, but the experiment focused on the stateful version, which shows its flexibility: it produces good results even without relying on statelessness.
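For reference, this is the per-step gradient exchange that DisTrO eliminates. The sketch below shows only the standard PyTorch All-Reduce baseline (assuming an initialized process group); the report does not disclose DisTrO's internal mechanism, so none is shown:

```python
import torch
import torch.distributed as dist

def ddp_gradient_sync(model: torch.nn.Module):
    """The All-Reduce step DisTrO removes: after every backward pass, standard
    DDP averages the full gradient of every parameter across all ranks."""
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # full-gradient exchange
            p.grad /= world                                # average across ranks
```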

Benchmarking DisTrO

DisTrO-AdamW achieved comparable convergence rates to standard AdamW+All-Reduce which resulted in similar training loss after 25,000 steps. The paper reported an 857x reduction in inter-node bandwidth requirements with DisTrO-AdamW compared to AdamW+All-Reduce. 
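A quick back-of-the-envelope check of what that gap means per step, assuming 16-bit gradients for the 1.2B-parameter model (the byte counts are our assumption, not the paper's accounting):

```python
params = 1.2e9                            # model size
allreduce_bytes = params * 2              # 16-bit gradients: ~2.4 GB per step
distro_bytes = allreduce_bytes / 857      # reported 857x reduction
print(f"All-Reduce: {allreduce_bytes / 1e9:.1f} GB per step")
print(f"DisTrO:     {distro_bytes / 1e6:.1f} MB per step")  # ~2.8 MB
```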

This shows that DisTrO can effectively reduce communication requirements during large-scale neural network training without compromising convergence speed. This breakthrough can enable training AI models over slower internet connections, which opens up new possibilities for democratizing large-scale model development.

Speculating the Real-World Implications of DisTrO

This can have very interesting implications: maybe in the future we will see a decentralized AI model trained not by any one company but by many companies and ordinary people together. If we can figure out a way to make the process truly decentralized (somewhat similar to a blockchain), then maybe this would finally make the models trustworthy.

Moreover, many companies don't want to share their private data with third parties to train models. If this federated-learning-like concept takes off, companies could train on their private data locally and still contribute to AI research without revealing sensitive information.

Furthermore, by enabling training on more accessible infrastructure, DisTrO could reduce the environmental impact of AI development, lessening the reliance on energy-intensive centralized data centers. This would also allow us to shift our workloads to regions with sustainable sources of power without sacrificing performance.
