🚨This week’s top AI/ML research papers - Sep 28

(Sep 22 ~ Sep 28, 2024)

  • Molmo and PixMo

  • MaskLLM

  • Are We Closer to an AI Doctor?

  • Programming Every Example

  • MIMO

  • Pixel-Space Post-Training of Latent Diffusion Models

  • Phantom of Latent for Large Language and Vision Models

  • Making Text Embedders Few-Shot Learners

  • Discovering the Gems in Early Layers

  • Imagine yourself

  • Improvements to SDXL in NovelAI Diffusion V3

  • MaskBit

  • MonoFormer

  • Instruction Following without Instruction Tuning

  • HelloBench

  • YesBut

  • EMOVA

  • LLaVA-3D

  • Boosting Healthcare LLMs Through Retrieved Context

  • RACER

  • Present and Future Generalization of Synthetic Image Detectors

  • Time-MoE

  • Reflecting Reality

overview for each + authors' explanations ⬇️ 

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Overview:

Molmo introduces a family of VLMs that achieve state-of-the-art performance using open weights and data.

The innovation centers on a novel, detailed image caption dataset collected via speech-based descriptions from human annotators.

To support diverse user interactions, the authors also introduce a varied dataset for fine-tuning, including in-the-wild Q&A and 2D pointing data.

The 72B model within the Molmo family outperforms other open models and compares favorably with proprietary systems on various benchmarks and human evaluations; the authors attribute this success to the quality of the newly collected datasets and a carefully tuned model architecture.

Paper:

Author's Explanation:

MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models

Overview:

MaskLLM introduces a learnable pruning method that establishes semi-structured (N:M) sparsity in LLMs to reduce computational overhead during inference.

It utilizes Gumbel Softmax sampling to model N:M patterns as a learnable distribution, facilitating end-to-end training on large datasets.

This method offers high-quality mask learning and transferability of sparsity across domains or tasks, demonstrating significant improvements over state-of-the-art methods with notable reductions in perplexity on benchmarks like Wikitext.

MaskLLM effectively scales to LLMs ranging from 843M to 15B parameters, achieving superior performance with reduced computational resources.
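
To make the core trick concrete, here is a minimal PyTorch sketch (not the authors' code) of sampling a differentiable 2:4 mask with Gumbel-Softmax; the contiguous grouping and the straight-through `hard=True` setting are our assumptions for illustration:

```python
import torch
import torch.nn.functional as F
from itertools import combinations

def candidate_masks(n=2, m=4):
    """Enumerate all length-m binary masks with exactly n ones (C(m, n) candidates)."""
    masks = []
    for kept in combinations(range(m), n):
        mask = torch.zeros(m)
        mask[list(kept)] = 1.0
        masks.append(mask)
    return torch.stack(masks)  # (C, m)

class GumbelNMMask(torch.nn.Module):
    """Learnable N:M mask: one categorical distribution over candidates per group."""
    def __init__(self, num_groups, n=2, m=4):
        super().__init__()
        self.register_buffer("cands", candidate_masks(n, m))  # (C, m)
        self.logits = torch.nn.Parameter(torch.zeros(num_groups, self.cands.shape[0]))

    def forward(self, tau=1.0):
        # Differentiable sampling: hard one-hot forward, soft gradient backward.
        choice = F.gumbel_softmax(self.logits, tau=tau, hard=True)  # (G, C)
        return choice @ self.cands                                  # (G, m)

# Usage: mask a weight matrix in contiguous groups of 4 along each row.
W = torch.randn(8, 16)
masker = GumbelNMMask(num_groups=W.numel() // 4)
W_sparse = W * masker().reshape(W.shape)  # exactly 2 of every 4 weights survive
```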

Paper:

Author's Explanation:

Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

Overview:

This paper introduces Programming Every Example (ProX), a framework enabling small language models to refine pre-training corpora by generating and executing fine-grained operations for each example.

Models trained on ProX-curated data achieve over 2% performance improvement across various benchmarks, and the framework proves especially effective in domain-specific continual pre-training.

Notably, models trained on OpenWebMath refined by ProX improve average accuracy by up to 20.3%, significantly reducing training FLOPs compared to traditional methods.
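
As a rough illustration of the "program per example" idea, the sketch below defines a toy executor for LM-emitted refining operations; the operation names (`remove_lines`, `normalize`, `drop_doc`, `keep_doc`) are hypothetical stand-ins, not ProX's actual API:

```python
import re

def remove_lines(doc: str, start: int, end: int) -> str:
    """Drop lines [start, end) from the document (e.g., boilerplate headers)."""
    lines = doc.splitlines()
    return "\n".join(lines[:start] + lines[end:])

def normalize(doc: str) -> str:
    """Collapse runs of whitespace and trim the document."""
    return re.sub(r"[ \t]+", " ", doc).strip()

def execute_program(doc: str, program: list[str]) -> str | None:
    """Apply a sequence of LM-generated ops; returning None drops the example."""
    for op in program:
        if op == "drop_doc()":
            return None
        if op == "keep_doc()":
            continue
        if m := re.fullmatch(r"remove_lines\((\d+),\s*(\d+)\)", op):
            doc = remove_lines(doc, int(m.group(1)), int(m.group(2)))
        elif op == "normalize()":
            doc = normalize(doc)
    return doc

# program = small_lm(doc)  # the small model emits the per-example program
print(execute_program("ads here\nmore ads\nActual   content.",
                      ["remove_lines(0, 2)", "normalize()"]))  # -> "Actual content."
```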

Paper:

Author's Explanation:

A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

Overview:

This paper explores the capabilities of OpenAI's o1, the first LLM with an internalized chain-of-thought technique via reinforcement learning, within medical contexts.

The analysis spans 37 medical datasets and evaluates o1 in understanding, reasoning, and multilinguality.

o1 improves accuracy by around 6.4% on average across multiple datasets compared to GPT-4, suggesting potential for clinical utility.

However, notable issues such as hallucination and inconsistent multilingual performance still persist.

Paper:

Author's Explanation:

MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling

Overview:

MIMO introduces a novel framework for character video synthesis, enabling controllable attributes such as character, motion, and scene through user inputs.

It achieves scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes.

The method encodes 2D video frames into compact spatial codes by lifting them into 3D using monocular depth estimators, decomposing video clips into hierarchical layers based on depth.

This spatial decomposed modeling allows for flexible user control and complex motion expression, with experimental results demonstrating the method's effectiveness and robustness.
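
A minimal sketch of the depth-based layer decomposition described above, assuming a depth map and a human mask are already available from off-the-shelf estimators; the three-way split into occlusion/human/scene layers follows the paper's description, while everything else is illustrative:

```python
import numpy as np

def decompose(frame: np.ndarray, depth: np.ndarray, human_mask: np.ndarray):
    """Split a frame into occlusion, human, and scene layers by depth ordering."""
    human_depth = np.median(depth[human_mask])         # typical depth of the character
    occlusion = (~human_mask) & (depth < human_depth)  # objects in front of the human
    scene = (~human_mask) & ~occlusion                 # everything behind

    def layer(mask):
        return np.where(mask[..., None], frame, 0.0)   # keep masked pixels only

    return layer(occlusion), layer(human_mask), layer(scene)

# Toy inputs; in practice depth comes from a monocular depth estimator
# and human_mask from a segmentation model (smaller depth = closer, assumed).
frame = np.random.rand(256, 256, 3).astype(np.float32)
depth = np.random.rand(256, 256).astype(np.float32)
human_mask = np.zeros((256, 256), dtype=bool)
human_mask[64:192, 96:160] = True
occ_layer, human_layer, scene_layer = decompose(frame, depth, human_mask)
```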

Demo:

Paper:

Pixel-Space Post-Training of Latent Diffusion Models

Overview:

This paper proposes incorporating pixel-space supervision in the post-training process of Latent Diffusion Models (LDMs) to address issues with generating high-frequency details and complex compositions.

By adding a pixel-space objective, the approach significantly improves supervised quality fine-tuning and preference-based post-training.

Empirical results demonstrate substantial enhancements in visual quality and flaw metrics for both DiT- and U-Net-based diffusion models, while maintaining text alignment quality.
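
A toy sketch of the idea: decode the model's denoised latents back to pixels and add a pixel-space MSE term to the usual latent-space loss. The `TinyUNet`/`TinyVAEDecoder` modules, the loss weighting, and the one-step denoising arithmetic are stand-ins for illustration, not the paper's training recipe:

```python
import torch
import torch.nn.functional as F

class TinyUNet(torch.nn.Module):
    """Stand-in noise predictor operating on 4-channel latents."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(4, 4, 3, padding=1)
    def forward(self, z, t):
        return self.net(z)

class TinyVAEDecoder(torch.nn.Module):
    """Stand-in decoder lifting latents back to 3-channel pixel space (8x upsample)."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.ConvTranspose2d(4, 3, 8, stride=8)
    def forward(self, z):
        return self.net(z)

def post_training_loss(unet, decoder, latents, images, w_pix=0.1):
    noise = torch.randn_like(latents)
    noisy = latents + noise                       # simplified forward diffusion step
    pred_noise = unet(noisy, t=None)
    latent_loss = F.mse_loss(pred_noise, noise)   # standard latent-space objective

    pred_images = decoder(noisy - pred_noise)     # decode denoised latents to pixels
    pixel_loss = F.mse_loss(pred_images, images)  # added pixel-space supervision
    return latent_loss + w_pix * pixel_loss

loss = post_training_loss(TinyUNet(), TinyVAEDecoder(),
                          latents=torch.randn(2, 4, 8, 8),
                          images=torch.randn(2, 3, 64, 64))
loss.backward()
```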

Paper:

Phantom of Latent for Large Language and Vision Models

Overview:

The paper introduces Phantom, a new efficient LLVM family with model sizes of 0.5B, 1.8B, 3.8B, and 7B parameters.

To enhance learning capabilities, Phantom temporarily increases the latent hidden dimension during multi-head self-attention, allowing for greater vision-language knowledge without significantly increasing model size.

The authors also introduce Phantom Optimization (PO), which uses autoregressive supervised fine-tuning and a DPO-like concept to improve performance.

Phantom outperforms many larger open- and closed-source LLVMs on its evaluation benchmarks.
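
A minimal sketch of how such a temporary "phantom" dimension could look in PyTorch, with all widths and module choices assumed for illustration rather than taken from the released architecture:

```python
import torch
import torch.nn as nn

class PhantomAttention(nn.Module):
    """Project hidden states to a wider dimension only inside self-attention."""
    def __init__(self, d_model=1024, d_phantom=2048, n_heads=16):
        super().__init__()
        self.up = nn.Linear(d_model, d_phantom)    # temporary dimension increase
        self.attn = nn.MultiheadAttention(d_phantom, n_heads, batch_first=True)
        self.down = nn.Linear(d_phantom, d_model)  # restore the original width

    def forward(self, x):
        h = self.up(x)
        h, _ = self.attn(h, h, h)
        return self.down(h)  # model width outside attention is unchanged

x = torch.randn(2, 128, 1024)  # (batch, seq, d_model)
out = PhantomAttention()(x)    # same shape as x
```

The extra capacity exists only transiently inside the attention block, which is why the parameter count stays close to that of the smaller base model.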

Paper:

Author's Explanation:

Making Text Embedders Few-Shot Learners

Overview:

Leveraging the in-context learning (ICL) capabilities of LLMs, this paper introduces a model called bge-en-icl that generates high-quality text embeddings using few-shot examples.

By integrating task-related examples directly into the query, the approach significantly improves performance across various tasks.
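
A sketch of what such few-shot embedding queries might look like: task demonstrations are prepended to the query text before it is embedded, so the encoder sees the task performed in-context. The template tokens below are illustrative, not bge-en-icl's exact format:

```python
def build_icl_query(instruction: str,
                    examples: list[tuple[str, str]],
                    query: str) -> str:
    """Prepend task instruction and (query, response) demos to the query text."""
    demos = "\n".join(f"<query> {q}\n<response> {r}" for q, r in examples)
    return f"<instruct> {instruction}\n{demos}\n<query> {query}"

text = build_icl_query(
    "Given a web search query, retrieve relevant passages.",
    [("what is a latent diffusion model",
      "A diffusion model that operates in a VAE's latent space.")],
    "how does pixel-space post-training work",
)
# embedding = model.encode(text)  # any instruction-following embedding model
```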
