🚨This week’s top AI/ML research papers - Oct 20th

(Oct 13 – Oct 20, 2024)

🚨This week’s top AI/ML research papers:

  • REPA: Representation Alignment for Generation

  • Sabotage evaluations for frontier models

  • Janus

  • What Matters in Transformers? Not All Attention is Needed

  • The Curse of Multi-Modalities

  • When Attention Sink Emerges in Language Models

  • Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free

  • Sample what you can't compress

  • Mix Data or Merge Models? 

  • Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

  • SeedLM

  • LOKI

  • Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations

  • Baichuan-Omni Technical Report

  • KV Prediction for Improved Time to First Token

  • Thinking LLMs

  • MoH

  • WorldCuisines

  • Fluid

  • FlatQuant

  • A Comparative Study on Reasoning Patterns of OpenAI's o1 Model

  • Revealing the Barriers of Language Agents in Planning

  • HumanEval-V

  • EvolveDirector

  • Self-Data Distillation for Recovering Quality in Pruned Large Language Models

  • CoTracker3

  • SANA

  • LLM X MapReduce

  • MLLM can see?

  • Animate-X

  • Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

  • Model Swarms

  • Fundamental Limitations on Subquadratic Alternatives to Transformers

  • Inference Scaling for Long-Context Retrieval Augmented Generation

  • Refined LLC

  • TorchTitan

  • You Know What I'm Saying: Jailbreak Attack via Implicit Reference

  • Strong Model Collapse

overview for each + authors' explanations ⬇️ 

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Overview:

This paper introduces REPresentation Alignment (REPA), a regularization technique that enhances diffusion model training by aligning projections of the noisy input's hidden states with clean-image representations from pretrained self-supervised visual encoders.

By incorporating high-quality external visual representations, REPA significantly improves training efficiency and generation quality in diffusion and flow-based transformers such as DiTs and SiTs.

This approach accelerates SiT training by over 17.5 times while achieving state-of-the-art generation results, such as an FID score of 1.42 with classifier-free guidance.
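For intuition, here is a minimal sketch of what a REPA-style alignment term could look like in training code. The projection head, the loss weighting, and the use of plain cosine similarity below are illustrative assumptions, not the paper's exact recipe; `f_clean` stands in for features of the clean image from a frozen pretrained encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepaHead(nn.Module):
    """Small MLP that projects the diffusion transformer's hidden states
    onto the frozen encoder's feature dimension (illustrative shape)."""
    def __init__(self, hidden_dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, target_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # (B, N, hidden_dim)
        return self.proj(h)

def repa_alignment_loss(h_noisy: torch.Tensor,
                        f_clean: torch.Tensor,
                        head: RepaHead,
                        lam: float = 0.5) -> torch.Tensor:
    """Negative patch-wise cosine similarity between projected noisy-input
    states and frozen clean-image features (higher similarity -> lower loss)."""
    z = head(h_noisy)                              # (B, N, D)
    cos = F.cosine_similarity(z, f_clean, dim=-1)  # (B, N)
    return -lam * cos.mean()

# Usage sketch: total loss = standard denoising objective + alignment term.
B, N, H, D = 2, 16, 384, 768
head = RepaHead(H, D)
h_noisy = torch.randn(B, N, H)   # intermediate states for the noisy input
f_clean = torch.randn(B, N, D)   # frozen pretrained features of the clean image
print(repa_alignment_loss(h_noisy, f_clean, head).item())
```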

Paper:

Author's Explanation:

Sabotage Evaluations for Frontier Models 

Overview: 

This paper examines the potential for AI models to engage in sabotage, such as undermining oversight, evading behavior monitoring, or interfering with deployment decisions. 

The authors develop threat models and evaluations to assess whether a model could successfully disrupt the activities of a major organization. Tests on Anthropic’s Claude 3 Opus and Claude 3.5 Sonnet suggest that minimal mitigations are currently sufficient to address these risks, though stronger measures will likely be needed as AI capabilities advance.

The study also highlights the benefits of mitigation-aware evaluations and simulating large-scale deployments through smaller-scale tests.

Paper:

Author's Explanation:

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Overview:

Janus is an autoregressive framework designed to unify multimodal understanding and generation by decoupling visual encoding into separate pathways, while employing a single transformer architecture.

This design addresses the suboptimal performance that arises when a single visual encoder serves both tasks, since understanding and generation require different levels of information granularity.

By allowing independent selection of encoding methods for understanding and generation, Janus enhances flexibility and effectively mitigates encoder role conflicts.

Experiments indicate that Janus not only outperforms previous unified models but also rivals or surpasses the performance of task-specific models, showcasing its potential as a leading multimodal framework.
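A rough structural sketch of the decoupled design follows. The encoder stub, dimensions, and the use of a generic transformer without causal masking are simplifying assumptions for illustration, not the released implementation: the key point is that understanding and generation each get their own visual pathway while sharing one transformer.

```python
import torch
import torch.nn as nn

class UnderstandingEncoder(nn.Module):
    """Stand-in for a semantic vision encoder producing continuous patch features."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.out_dim = out_dim
        self.patchify = nn.Conv2d(3, out_dim, kernel_size=16, stride=16)

    def forward(self, image: torch.Tensor) -> torch.Tensor:     # (B, 3, H, W)
        return self.patchify(image).flatten(2).transpose(1, 2)  # (B, N, out_dim)

class JanusStyleModel(nn.Module):
    """One shared transformer, two decoupled visual pathways."""
    def __init__(self, d_model: int = 512, text_vocab: int = 1024, image_codebook: int = 512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)  # causal masking omitted for brevity
        self.text_embed = nn.Embedding(text_vocab, d_model)
        # Understanding pathway: continuous semantic features plus an adaptor.
        self.und_encoder = UnderstandingEncoder()
        self.und_adaptor = nn.Linear(self.und_encoder.out_dim, d_model)
        # Generation pathway: discrete image codes with their own embedding.
        self.gen_embed = nn.Embedding(image_codebook, d_model)

    def understand(self, image: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        vis = self.und_adaptor(self.und_encoder(image))
        return self.llm(torch.cat([vis, self.text_embed(text_ids)], dim=1))

    def generate_step(self, text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
        seq = torch.cat([self.text_embed(text_ids), self.gen_embed(image_codes)], dim=1)
        return self.llm(seq)

model = JanusStyleModel()
out = model.understand(torch.randn(1, 3, 64, 64), torch.randint(0, 1024, (1, 8)))
print(out.shape)  # (1, 16 image patches + 8 text tokens, 512)
```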

Paper:

Code & Model:

Author's Explanation:

---

What Matters in Transformers? Not All Attention is Needed

Overview:

This paper investigates redundancy in Transformer-based LLMs, focusing on MLP and Attention layers, using a similarity-based metric.

It reveals that many attention layers can be pruned without significant performance loss, exemplified by Llama-2-70B, which realized a 48.4% speedup with only a 2.4% performance reduction by pruning half of its attention layers.

Additionally, a joint method for dropping both Attention and MLP layers is proposed, enabling more aggressive pruning, as demonstrated by Llama-2-13B retaining 90% performance on the MMLU task after dropping 31 layers.
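A minimal sketch of how such a similarity-based redundancy score could be used to choose attention layers to drop; the calibration activations, the keep ratio, and the helper names are illustrative assumptions rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def layer_redundancy(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Cosine similarity between a block's input and output activations.
    A value near 1.0 means the block barely transforms its input,
    making it a candidate for pruning."""
    sim = F.cosine_similarity(hidden_in.flatten(1), hidden_out.flatten(1), dim=-1)
    return sim.mean().item()

def select_layers_to_drop(per_layer_states, keep_ratio: float = 0.5):
    """Rank attention blocks by redundancy and drop the most redundant ones."""
    scores = [layer_redundancy(i, o) for i, o in per_layer_states]
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    n_drop = int(len(scores) * (1 - keep_ratio))
    return sorted(order[:n_drop])  # indices of layers to remove

# Example with random activations standing in for a small calibration set:
states = [(torch.randn(4, 16, 64), torch.randn(4, 16, 64)) for _ in range(8)]
print(select_layers_to_drop(states, keep_ratio=0.5))
```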

Paper:

GitHub:

---

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
