🚨This week’s top AI/ML research papers - Sep 28
(Sep 22 ~ Sep 28, 2024)
🚨This week’s top AI/ML research papers:
Molmo and PixMo
MaskLLM
Are We Closer to an AI Doctor?
Programming Every Example
MIMO
Pixel-Space Post-Training of Latent Diffusion Models
Phantom of Latent for Large Language and Vision Models
Making Text Embedders Few-Shot Learners
Discovering the Gems in Early Layers
Imagine yourself
Improvements to SDXL in NovelAI Diffusion V3
MaskBit
MonoFormer
Instruction Following without Instruction Tuning
HelloBench
YesBut
EMOVA
LLaVA-3D
Boosting Healthcare LLMs Through Retrieved Context
RACER
Present and Future Generalization of Synthetic Image Detectors
Time-MoE
Reflecting Reality
overview for each + authors' explanations ⬇️
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Overview:
Molmo introduces a family of VLMs that achieve state-of-the-art performance using open weights and data.
The innovation centers on a novel, detailed image caption dataset collected via speech-based descriptions from human annotators.
To support diverse user interactions, the authors also introduce a varied dataset for fine-tuning, including in-the-wild Q&A and 2D pointing data.
The 72B model in the Molmo family outperforms other open models and compares favorably with proprietary systems on a range of benchmarks and in human evaluations; the authors attribute this success to the quality of the newly collected datasets and to the optimized model architecture.
Paper:
Author's Explanation:
Meet Molmo: a family of open, state-of-the-art multimodal AI models.
Our best model outperforms proprietary systems, using 1000x less data.
Molmo doesn't just understand multimodal data—it acts on it, enabling rich interactions in both the physical and virtual worlds.
Try it… x.com/i/web/status/1…
— Ai2 (@allen_ai)
2:58 PM • Sep 25, 2024
MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models
Overview:
MaskLLM introduces a learnable pruning method that establishes semi-structured (N:M) sparsity in LLMs to reduce computational overhead during inference.
It utilizes Gumbel Softmax sampling to model N:M patterns as a learnable distribution, facilitating end-to-end training on large datasets.
The learned masks are high quality, and the sparsity transfers across domains and tasks; MaskLLM delivers significant improvements over state-of-the-art pruning methods, with notable perplexity reductions on benchmarks such as WikiText.
MaskLLM effectively scales to LLMs ranging from 843M to 15B parameters, achieving superior performance with reduced computational resources.
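For intuition, here is a minimal PyTorch sketch of the core mechanism: treat each group of four weights as a learnable categorical choice over the six possible "keep 2 of 4" patterns and sample it with Gumbel-Softmax. The class, grouping scheme, and hyperparameters are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

class Learnable24Mask(torch.nn.Module):
    """Sketch: learn a 2:4 sparsity mask for a frozen weight matrix by
    picking, per group of 4 weights, one of the C(4,2) = 6 patterns
    that keep exactly 2 weights (MaskLLM-style idea, simplified)."""

    def __init__(self, weight: torch.Tensor, tau: float = 1.0):
        super().__init__()
        assert weight.numel() % 4 == 0
        self.register_buffer("weight", weight)  # frozen LLM weight
        # All 6 binary patterns that keep exactly 2 of 4 weights.
        patterns = torch.tensor([[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1],
                                 [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 1, 1]],
                                dtype=weight.dtype)
        self.register_buffer("patterns", patterns)
        # Learnable logits over the 6 patterns, one row per weight group.
        self.logits = torch.nn.Parameter(torch.zeros(weight.numel() // 4, 6))
        self.tau = tau

    def forward(self) -> torch.Tensor:
        # Differentiable one-hot pattern choice per group (straight-through).
        choice = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)
        mask = choice @ self.patterns  # (n_groups, 4) binary mask
        return self.weight * mask.reshape(self.weight.shape)
```

The `hard=True` straight-through sample keeps the forward pass exactly 2:4-sparse while still letting gradients flow into the pattern logits, which is what makes end-to-end mask training possible.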
Paper:
🚀 @NeurIPSConf Spotlight! 🥳 Imagine fine-tuning an LLM with just a sparsity mask! In our latest work, we freeze the LLM and use 2:4 structured sparsity to learn binary masks for each linear layer. Thanks to NVIDIA Ampere’s 2:4 sparsity, we can achieve up to 2x compute… x.com/i/web/status/1…
— Pavlo Molchanov (@PavloMolchanov)
3:06 AM • Sep 27, 2024
Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
Overview:
This paper introduces Programming Every Example (ProX), a framework enabling small language models to refine pre-training corpora by generating and executing fine-grained operations for each example.
ProX-curated data yields more than a 2% performance improvement across various benchmarks and proves especially effective for domain-specific continual pre-training.
Notably, models trained on OpenWebMath refined by ProX improve average accuracy by up to 20.3%, while requiring significantly fewer training FLOPs than traditional methods.
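The "program, then execute" loop is easy to picture. Below is a toy Python sketch in which the operation vocabulary and the `refiner_lm` interface are hypothetical stand-ins for whatever small model and op set ProX actually uses:

```python
import re

# Hypothetical fine-grained ops a small refiner LM might emit per example.
def remove_lines(doc: str, start: int, end: int) -> str:
    """Drop noisy lines [start, end) from a document."""
    lines = doc.splitlines()
    return "\n".join(lines[:start] + lines[end:])

def normalize_whitespace(doc: str) -> str:
    """Collapse runs of spaces/tabs left over from HTML extraction."""
    return re.sub(r"[ \t]+", " ", doc).strip()

OPS = {"remove_lines": remove_lines,
       "normalize_whitespace": normalize_whitespace}

def refine(doc: str, refiner_lm) -> str:
    """ProX-style loop (sketch): a small LM proposes a per-example edit
    program, which is then executed deterministically on the document.
    `refiner_lm(doc)` is assumed to return something like
    [("remove_lines", (0, 2)), ("normalize_whitespace", ())]."""
    for op_name, args in refiner_lm(doc):
        doc = OPS[op_name](doc, *args)
    return doc
```

Because the LM only emits small programs rather than rewriting documents token by token, the refinement stays cheap and auditable at corpus scale.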
Paper:
🚀 Still relying on human-crafted rules to improve pretraining data? Time to try Programming Every Example(ProX)! Our latest efforts use LMs to refine data with unprecedented accuracy, and brings up to 20x faster training in general and math domain!
👇 Curious about the details?
— Fan Zhou (@FaZhou_998)
4:07 AM • Sep 26, 2024
A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?
Overview:
This paper explores the capabilities of OpenAI's o1, the first LLM with an internalized chain-of-thought technique via reinforcement learning, within medical contexts.
The analysis spans 37 medical datasets and evaluates o1 in understanding, reasoning, and multilinguality.
o1 improves accuracy by around 6.4% on average across multiple datasets compared with GPT-4, suggesting potential clinical utility.
However, notable issues such as hallucination and inconsistent multilingual performance still persist.
Paper:
OpenAI’s new o1(-preview) model has shown impressive reasoning capabilities across various general NLP tasks, but how does it hold up in the medical domain? A big thank you to @OpenlifesciAI for sharing our latest research, *"A Preliminary Study of o1 in Medicine: Are We Getting… x.com/i/web/status/1…
— yuyin zhou (@yuyinzhou_cs)
1:59 AM • Sep 25, 2024
MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling
Overview:
MIMO introduces a novel framework for character video synthesis, enabling controllable attributes such as character, motion, and scene through user inputs.
It achieves scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes.
The method encodes 2D video frames into compact spatial codes by lifting them into 3D using monocular depth estimators, decomposing video clips into hierarchical layers based on depth.
This spatial decomposed modeling allows for flexible user control and complex motion expression, with experimental results demonstrating the method's effectiveness and robustness.
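As a rough picture of the depth-based decomposition (heavily simplified: the band thresholds and three-layer split below are assumptions, and the real method lifts frames into structured 3D spatial codes rather than masking pixels):

```python
import torch

def decompose_by_depth(frame: torch.Tensor, depth: torch.Tensor,
                       boundaries=(0.33, 0.66)) -> dict[str, torch.Tensor]:
    """Illustrative split of one video frame into hierarchical layers by
    monocular depth (MIMO-style idea, much simplified). `frame` is
    (3, H, W); `depth` is (H, W) in [0, 1], smaller = closer to camera."""
    near, far = boundaries
    masks = {
        "character": depth < near,                  # main subject
        "occlusion": (depth >= near) & (depth < far),
        "scene": depth >= far,                      # background
    }
    # Each layer keeps only its depth band's pixels; downstream modules
    # would encode each layer into its own compact spatial code.
    return {name: frame * m.unsqueeze(0) for name, m in masks.items()}
```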
Demo:
Paper:
Alibaba presents MIMO
Controllable Character Video Synthesis with Spatial Decomposed Modeling
Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes. As a fundamental problem in the computer vision and graphics community, 3D… x.com/i/web/status/1…
— AK (@_akhaliq)
4:14 AM • Sep 25, 2024
Pixel-Space Post-Training of Latent Diffusion Models
Overview:
This paper proposes incorporating pixel-space supervision in the post-training process of Latent Diffusion Models (LDMs) to address issues with generating high-frequency details and complex compositions.
By adding a pixel-space objective, the approach significantly improves supervised quality fine-tuning and preference-based post-training.
Empirical results show substantial improvements in visual quality and visual-flaw metrics for both DiT- and U-Net-based diffusion models, while maintaining text-alignment quality.
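Here is a minimal sketch of what "adding a pixel-space objective" can look like in a DDPM-style fine-tuning step, assuming an epsilon-prediction model and a decodable VAE; the loss weighting and the decode point are illustrative choices, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def posttrain_loss(model, vae, x0, t, noise, alpha_bar, lam: float = 0.1):
    """Sketch of latent + pixel-space supervision for LDM post-training.
    `model` predicts noise in latent space; `vae` exposes encode/decode;
    `alpha_bar` is the cumulative noise schedule. `lam` is an assumed
    weighting, not taken from the paper."""
    z0 = vae.encode(x0)                                   # clean latents
    a = alpha_bar[t].view(-1, 1, 1, 1)
    zt = a.sqrt() * z0 + (1 - a).sqrt() * noise           # forward diffusion
    eps_hat = model(zt, t)                                # predicted noise

    latent_loss = F.mse_loss(eps_hat, noise)              # standard LDM loss

    # Pixel-space term: estimate the clean latent from the prediction,
    # decode it, and supervise directly against the ground-truth image,
    # where high-frequency detail actually lives.
    z0_hat = (zt - (1 - a).sqrt() * eps_hat) / a.sqrt()
    pixel_loss = F.mse_loss(vae.decode(z0_hat), x0)

    return latent_loss + lam * pixel_loss
```

The key point is that the latent-space MSE alone never "sees" decoder artifacts, whereas the decoded term penalizes them directly.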
Paper:
Phantom of Latent for Large Language and Vision Models
Overview:
The paper introduces Phantom, a new efficient LLVM family with model sizes of 0.5B, 1.8B, 3.8B, and 7B parameters.
To enhance learning capabilities, Phantom temporarily increases the latent hidden dimension during multi-head self-attention, allowing for greater vision-language knowledge without significantly increasing model size.
The authors also introduce Phantom Optimization (PO), which uses autoregressive supervised fine-tuning and a DPO-like concept to improve performance.
Phantom outperforms many larger open- and closed-source LLVMs on its benchmarks.
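A toy PyTorch sketch of the "temporarily enlarge the latent" idea follows; the base width, expansion factor, and layer layout are assumptions for illustration, not Phantom's actual architecture:

```python
import torch
import torch.nn as nn

class PhantomStyleAttention(nn.Module):
    """Sketch: project tokens into a wider hidden dimension only for
    multi-head self-attention, then project back, so the rest of the
    model keeps its small base width."""

    def __init__(self, dim: int = 512, expand: int = 2, heads: int = 8):
        super().__init__()
        wide = dim * expand
        self.up = nn.Linear(dim, wide)    # temporary dimension increase
        self.attn = nn.MultiheadAttention(wide, heads, batch_first=True)
        self.down = nn.Linear(wide, dim)  # back to the base width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.up(x)                    # (B, T, dim) -> (B, T, wide)
        h, _ = self.attn(h, h, h)         # attention in the wider space
        return self.down(h)               # (B, T, wide) -> (B, T, dim)
```

The extra capacity exists only inside the attention block, which is how the model can absorb more vision-language knowledge without growing its overall parameter count much.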
Paper:
Excited to share our super efficient and frontier vision language models (0.5B, 1.8B, 3.8B, and 7B). Phantom has been released at github.com/ByungKwanLee/P…. You can also learn about it on arXiv: arxiv.org/abs/2409.14713.
— Byung-Kwan Lee (@BKLEE_NANO)
5:41 PM • Sep 24, 2024
Making Text Embedders Few-Shot Learners
Overview:
Leveraging the in-context learning (ICL) capabilities of LLMs, this paper introduces a model called bge-en-icl that generates high-quality text embeddings using few-shot examples.
By integrating task-related examples directly into the query, the approach significantly improves performance across various tasks.
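In practice this amounts to prepending demonstrations to the text before encoding. A minimal sketch, with a template that is illustrative rather than bge-en-icl's exact prompt format:

```python
def build_icl_query(task: str, examples: list[tuple[str, str]],
                    query: str) -> str:
    """Sketch of few-shot embedding input construction: prepend
    task-related demonstrations to the query before it is encoded.
    The template below is an assumption for illustration."""
    demos = "\n".join(f"Instruct: {task}\nQuery: {q}\nResponse: {r}"
                      for q, r in examples)
    return f"{demos}\nInstruct: {task}\nQuery: {query}"

# The returned string is what gets fed to the embedding model;
# the demonstrations steer the resulting embedding toward the task.
```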