The AI Timeline #1 - AI Plays Minecraft for the First Time
Latest AI Research Explained Simply
Research Papers x 3: ⚡ Instant3D, Music ControlNet, JARVIS-1 | Industry News x 2: Google’s Lyria, Meta’s Emu Video & Edit
Research Papers
⚡ Instant3D: Instant Text-to-3D Generation
Ming Li, Pan Zhou, Jia-Wei Liu, Jussi Keppo, Min Lin, Shuicheng Yan, Xiangyu Xu
Instant3D official demo (they look kinda unhinged lol)
Overview: A new framework for converting text into 3D objects.
Previous challenges: Traditional NeRF-based text-to-3D methods (e.g. DreamFusion) require hours to create a single 3D object, since they optimize a neural field from random initialization for every text prompt.
What’s better: Instant3D speeds this up to about 25 ms by using a text-conditioned NeRF, requiring far less computation and generalizing strongly to unseen prompts.
SoTA text-to-3D models vs. the Instant3D model
Innovation: Utilizes an additional trainable decoder network to "bridge text and 3D". This allows 3D generation in under a second, unlike earlier methods that require "per-prompt fine-tuning after each network inference".
Instant3D Model Structure
Model Structure Breakdown: Instant3D combines three key modules: cross-attention, style injection, and token-to-plane transformation.
Cross-Attention Module: Adapted from text-to-image models, it merges text embeddings with decoder feature maps. Initial tests showed limitations in handling text ambiguity, which motivated the next module.
Style Injection Module: Integrates Adaptive Instance Normalization (AdaIN) with text features and random noise, enhancing the model's ability to navigate text ambiguities and improving control over 3D generation (see the sketch at the end of this section).
Token-to-Plane Transformation: A novel approach that dynamically predicts the base tensor from text embeddings, replacing the static base tensor used in conventional methods, thus better aligning with conditioned text.
Performance: Instant3D demonstrated strong generalization capabilities on new prompts, eliminating the need for retraining for specific outputs.
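To make the style-injection idea concrete, here is a minimal PyTorch sketch of AdaIN whose scale and shift are predicted from the text embedding concatenated with random noise. The module name, dimensions, and layer choices are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class StyleInjection(nn.Module):
    """AdaIN-style injection sketch: the scale/shift applied to decoder feature
    maps is predicted from the text embedding plus random noise. Names and
    dimensions are illustrative, not from the official implementation."""
    def __init__(self, feat_channels: int, text_dim: int, noise_dim: int = 64):
        super().__init__()
        self.noise_dim = noise_dim
        # Predict per-channel scale (gamma) and shift (beta) from [text ; noise]
        self.to_gamma_beta = nn.Linear(text_dim + noise_dim, 2 * feat_channels)
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) decoder feature map; text_emb: (B, text_dim)
        noise = torch.randn(text_emb.size(0), self.noise_dim, device=text_emb.device)
        gamma, beta = self.to_gamma_beta(torch.cat([text_emb, noise], dim=-1)).chunk(2, dim=-1)
        normed = self.norm(feat)  # strip the feature map's own statistics
        # Re-style the normalized features with text-conditioned statistics
        return gamma[:, :, None, None] * normed + beta[:, :, None, None]

# Usage sketch: inject text-conditioned style into a 64-channel feature map
inject = StyleInjection(feat_channels=64, text_dim=512)
styled = inject(torch.randn(2, 64, 32, 32), torch.randn(2, 512))
```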
🎵 Music ControlNet: Advanced Music Generation Tool
Shih-Lun Wu, Chris Donahue, Shinji Watanabe, Nicholas J. Bryan
Overview: Music ControlNet is a diffusion-based music generation model with precise, time-varying controls.
Previous challenges: Older models excel in manipulating global musical attributes such as genre, mood, and tempo but fall short in controlling time-varying aspects like beat positions and dynamic changes in music.
What’s better: It is 49% more faithful to input melodies than recent models like MusicGen, while using far fewer parameters, requiring less training data, and offering time-varying controls
Music ControlNet Architecture
Structure: The model outputs a Mel-spectrogram, which is then converted to audio via a vocoder
Conditional Generative Model: Generates audio waveforms given global text controls (e.g. genre & mood tags) and a set of time-varying controls
Mel-spectrogram: Spectrograms are used to model the joint distribution of waveforms and controls
Time-Varying Controls: melody, dynamics, and rhythm, each directly extractable from spectrograms, eliminating the need for manual annotations.
Innovation: A novel masking strategy during training enables partial specification of controls over time; creators can specify controls for only part of a piece and the model fills in the gaps (see the sketch after this section).
Authors’ Note: Despite sharing the name, it is unrelated to image generation’s ControlNet, though it draws inspiration from Uni-ControlNet, another image generation work.
Limitation: Code is not currently available; music demos can be found on the project page.
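For a rough idea of what such a masking strategy could look like in training code, here is a small NumPy sketch; the function name, control format, and masking probability are assumptions for illustration, not the paper's code.

```python
import numpy as np

def mask_controls(controls: dict, mask_prob: float = 0.5, rng=None) -> dict:
    """Illustrative training-time masking: for each time-varying control
    (e.g. melody, dynamics, rhythm), randomly keep only a contiguous span of
    frames and zero out the rest, so the model learns to fill in the
    unspecified regions at inference time. Not the official code."""
    rng = np.random.default_rng() if rng is None else rng
    masked = {}
    for name, signal in controls.items():     # signal: (T, ...) array of control frames
        out = signal.copy()
        if rng.random() < mask_prob:
            T = signal.shape[0]
            start = int(rng.integers(0, T))
            end = int(rng.integers(start, T + 1))
            keep = np.zeros(T, dtype=bool)
            keep[start:end] = True             # the creator "specifies" only this span
            out[~keep] = 0.0                   # everything else is left for the model
        masked[name] = out
    return masked

# Usage sketch: melody/dynamics/rhythm controls over 400 spectrogram frames
controls = {"melody": np.random.rand(400, 12),
            "dynamics": np.random.rand(400, 1),
            "rhythm": np.random.rand(400, 2)}
partially_specified = mask_controls(controls)
```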
🤖 JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang
JARVIS-1 Navigating tasks in Minecraft
Overview: JARVIS-1 showcased an advanced ability to learn, adapt, and autonomously improve over time using both vision and text within the Minecraft universe.
Previous challenges: Earlier agents are bad at processing visual/multimodal data, cannot perform consistent and accurate long-term planning, and lack the ability to learn and evolve in a lifelong fashion
What’s better:
Incorporates multimodal language models
Generates advanced plans as task complexity increases
Utilizes multimodal memory for planning and task success
JARVIS-1’s Architecture
Model Structure: Interactive planner + goal-conditioned controller + multimodal memory of past experiences
Multimodal Memory: Supports planning with both pre-trained knowledge and real-world experiences, improving planning correctness and consistency without additional model updates.
Memory-Augmented MLM: Generates plans and guides low-level actions.
Self-Improvement Mechanism: autonomously proposes tasks to enhance its planning abilities, and improves planning on familiar tasks using accumulated experiences.
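Here is a minimal sketch of how these pieces could fit together, assuming hypothetical planner and controller interfaces and an illustrative memory schema (none of this is the authors' code): plan with retrieved experiences, execute sub-goals, and write successful runs back into memory so planning improves without model updates.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Experience:
    """One multimodal memory entry (illustrative schema, not the paper's exact format)."""
    task: str
    observation: str       # e.g. a caption or embedding of the visual observation
    plan: List[str]        # the sub-goal sequence that was attempted
    success: bool

@dataclass
class MultimodalMemory:
    entries: List[Experience] = field(default_factory=list)

    def add(self, exp: Experience) -> None:
        # Only store successful runs so later retrieval surfaces plans that worked
        if exp.success:
            self.entries.append(exp)

def agent_step(task, observation, planner, controller, memory: MultimodalMemory) -> bool:
    """One JARVIS-1-style cycle with hypothetical planner/controller interfaces:
    plan with retrieved memories, act via the goal-conditioned controller,
    then write the outcome back so planning improves without model updates."""
    retrieved = planner.retrieve(memory.entries, task, observation)   # hypothetical call
    plan = planner.generate_plan(task, observation, retrieved)        # hypothetical call
    success = all(controller.execute(goal) for goal in plan)          # execute each sub-goal
    memory.add(Experience(task, observation, plan, success))
    return success
```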
JARVIS-1’s Interactive Planning Process
Interactive Planning Process
Initial Plan Generation: Upon receiving task instructions and observations, JARVIS-1 formulates a preliminary plan.
Self-Check: Identifies and corrects potential errors in the plan, marked in red for clarity.
Adaptive Error Correction: In case of execution errors, JARVIS-1 reasons about the next move based on environmental feedback (self-explain), enhancing plan robustness.
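The loop above could be sketched roughly like this; self_check, self_explain, and the environment interface are hypothetical names used only to illustrate the flow described in the paper.

```python
def interactive_plan(task, observation, planner, environment, max_retries: int = 3):
    """Sketch of the interactive planning loop (hypothetical interfaces):
    generate a plan, self-check it, execute, and on failure ask the planner
    to explain the error from environment feedback before re-planning."""
    plan = planner.generate_plan(task, observation)

    # Self-check: let the language model critique and revise its own plan
    issues = planner.self_check(plan, task)
    if issues:
        plan = planner.revise_plan(plan, issues)

    for _ in range(max_retries):
        result = environment.execute(plan)
        if result.success:
            return plan
        # Self-explain: reason about the failure from environmental feedback,
        # then re-plan with that explanation as extra context
        explanation = planner.self_explain(result.feedback, plan)
        plan = planner.generate_plan(task, result.observation, hint=explanation)
    return None  # give up after max_retries failed attempts
```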
JARVIS-1’s Query Generation Strategy
Query Generation Strategy
Backward Reasoning: Determines intermediate sub-goals based on the current task and observation.
Depth-Limited Reasoning: Limits the extent of backward reasoning for efficiency.
Memory Integration: Combines relevant memory entries with current observations for comprehensive query formulation.
Ranking and Retrieval: Matches entries to the text query, ranking them by relevance and retrieving the most pertinent ones for each sub-goal.
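A rough sketch of this retrieval step, assuming a generic text encoder embed and a simple dict-based entry format (both are assumptions): build a query from the sub-goal plus the current observation, then rank memory entries by cosine similarity.

```python
import numpy as np

def retrieve_for_subgoal(subgoal: str, observation: str, memory_entries, embed, top_k: int = 3):
    """Rank memory entries against a query built from the sub-goal and the current
    observation, returning the top matches. `embed` is an assumed text encoder that
    maps a string to a vector; entries are assumed to carry a 'text' field."""
    query = f"{subgoal} | {observation}"            # combine sub-goal with current context
    q = embed(query)                                # (D,) query embedding
    scored = []
    for entry in memory_entries:
        e = embed(entry["text"])                    # (D,) embedding of the stored entry
        cos = float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-8))
        scored.append((cos, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]
```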
The success rate of different models in the ObtainDiamondPickaxe challenge over time
Evaluation:
A skilled human player takes around 20 minutes to obtain a diamond pickaxe; JARVIS-1 takes around 60 minutes, with a success rate of 12%.
Over a prolonged game (60 min), JARVIS-1 re-plans when its pickaxe breaks, while VPT RL (Video Pre-Training, the previous SoTA) fails to do so and thus fails to find diamonds.
Breakthrough: Previous Minecraft AI agents rely on imitation learning (e.g. VPT RL). JARVIS-1, however, is designed to be multi-task and is not fine-tuned through imitation learning or reinforcement learning on task-specific data
Authors’ Notes: The success of these AI agents in complex simulations could plausibly transfer to real-life settings in a few years, since the learning is not task-specific.
Industry News
📽️ Meta’s Emu Video and Emu Edit
Emu Edit: Precise Image Editing via Recognition and Generation Tasks
Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra
view on their website for better resolution
Emu Edit:
The new Emu Edit model focuses on instruction-based image editing. Basically Photoshop Generative Fill without the need for manual masking.
Capable of precise edits without affecting unrelated parts of the image; a multi-skill base model, unlike other image editing models.
Offers features like masking, super-resolution, and even segmentation with few-shot fine-tuning.
Although not on par with the original Emu model in image generation, it excels in editing capabilities.
The types of training data used for Emu Edit
Authors’ Notes: Emu Edit is capable of image segmentation and masking, and the fact that it is trained WITHIN a generative model is INSANE. Check out bycloud’s video on it
Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, Yaniv Taigman
view on their website for better resolution
Emu Video:
Emu Video is a text-to-video model that produces SoTA quality, with much cleaner 512×512 video than competing models.
The model generates 4-second videos at 16 fps.
Uses diffusion models in two steps: first generating an initial image from the text prompt, then generating the subsequent frames conditioned on both the prompt and that initial image (see the sketch after this list).
Outperforms other models in quality and faithfulness, with only Google’s Imagen Video coming close in faithfulness.
Generated videos separate foreground, subject, and background extremely well.
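The factorized generation boils down to two stages; the function handles below are hypothetical placeholders for the two diffusion models, not Meta's API.

```python
def generate_video(prompt: str, text_to_image, text_image_to_video, num_frames: int = 64):
    """Two-stage ("factorized") generation: diffuse a single image from the text,
    then diffuse the video frames conditioned on both the text and that image.
    The two model handles are hypothetical placeholders; 64 frames is ~4 s at 16 fps."""
    first_frame = text_to_image(prompt)                          # stage 1: text -> image
    frames = text_image_to_video(prompt, first_frame,            # stage 2: (text, image) -> video
                                 num_frames=num_frames)
    return frames
```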
Emu Video’s rough architecture
Emu Video benchmark with other text-to-video generators
Authors’ Notes: You need to look at the official results on the webpage to really see the clarity in the videos.
🎧 Google’s Lyria: Transforming the future of music creation
Example user interface of their newest music AI tools.
Overview:
Lyria: Google DeepMind’s most advanced AI music generation model (they didn’t benchmark it against SoTA though). Excels in creating high-quality music with vocals and instrumentals, and maintains musical continuity in complex compositions.
Dream Track: A YouTube Shorts experiment for music creation. A limited group of creators can produce soundtracks with AI-generated voices and styles of popular artists, generating 30-second soundtracks based on a user-input topic and artist selection.
Music AI Tools: A tool developed with artists and producers to aid their creative processes.
SynthID: Watermarking technology for AI-generated audio, ensuring traceability and responsible use. Apparently inaudible to humans.
Authors’ Notes: SynthID is yet another interesting approach to maintaining content authenticity, in this case specifically for AI-generated audio.
that’s a wrap for this issue!
THANK YOU
Want to promote your service, website or product? Reach out at [email protected]