LLaVA-o1 Vision Model Reasons Step-by-Step

Reasoning via Marco-o1, and Rapid Response Jailbreak Mitigation

Nov 18th ~ Nov 24th
#33 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 3.9k Anthropic has chosen AWS as its primary partner for training and deploying AI models. Amazon will invest an additional $4 billion to deepen the collaboration on advanced AI technologies.

  2. ♥ 4k DeepSeek has launched its R1-Lite-Preview, an open-source AI model with o1-style test-time compute reasoning that matches OpenAI-o1 on benchmarks like AIME and MATH. It follows a transparent thought process and will be accessible via API in an upcoming release. Try DeepSeek right now in your browser.

  3. ♥ 1.4k Black Forest Labs has launched FLUX.1 Tools, a suite of open-access models that improve image manipulation by adding unprecedented control and steerability to their base text-to-image model. The suite includes four features (FLUX.1 Fill, Depth, Canny, and Redux) that enable advanced inpainting, outpainting, structural guidance, and image recreation through text prompts and various input maps.

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Xu et al. [Peking University, Tsinghua University, Peng Cheng Laboratory, AI for Science (AI4S)-Preferred Program, Alibaba DAMO Academy, Lehigh University]

♥ 1.4k   VLM   bycloud’s pick  

Introduction to LLaVA-o1

Language models have made great improvements in their reasoning capabilities, but Vision-Language Models (VLMs) still struggle with complex visual reasoning tasks. Most existing VLMs use a direct prediction approach where they immediately generate answers without systematic analysis. This leads to poor performance on tasks requiring multi-step logical reasoning.

This paper introduces LLaVA-o1, an innovative VLM that introduces autonomous multi-stage reasoning. Instead of using traditional chain-of-thought prompting, LLaVA-o1 breaks down visual reasoning into distinct stages: summarization, visual interpretation, logical reasoning, and conclusion generation. 

Performance of LLaVA-o1 and other models across six multimodal reasoning benchmarks.

How Does LLaVA-o1 Work

LLaVA-o1 uses a structured, four-stage reasoning process that breaks down complex visual-language tasks into manageable components. The model begins with a Summary Stage, where it creates a high-level interpretation of the question. This is followed by a Caption Stage specifically for image-related queries, where it provides a focused description of relevant visual elements. The model then moves to a Reasoning Stage for systematic logical analysis, and finally arrives at a Conclusion Stage where it synthesizes all previous information into a final answer.

What makes this architecture unique is its autonomous stage management system. The model uses special tags (<SUMMARY>, <CAPTION>, <REASONING>, <CONCLUSION>) to organize its thinking process, but unlike traditional chain-of-thought approaches, it activates these stages independently based on its judgment. All stages are completed in a single inference pass, making the process efficient while maintaining structure. The model is trained on the carefully curated LLaVA-o1-100k dataset, which includes both general-purpose and science-targeted visual question-answering samples.
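To make the staged format concrete, here is a minimal sketch in Python of how a tagged, single-pass response could be parsed. The paired closing tags and the toy example response are illustrative assumptions, not the paper's exact output format.

```python
import re

STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_stages(response: str) -> dict:
    # Extract the text between each pair of stage tags.
    parsed = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", response, re.DOTALL)
        if match:
            parsed[stage] = match.group(1).strip()
    return parsed

example = ("<SUMMARY>Count the apples in the image.</SUMMARY>"
           "<CAPTION>A bowl holding three red apples.</CAPTION>"
           "<REASONING>Each apple is distinct, so the count is 3.</REASONING>"
           "<CONCLUSION>There are 3 apples.</CONCLUSION>")
print(parse_stages(example)["CONCLUSION"])  # -> There are 3 apples.
```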

It also uses a stage-level beam search method during inference where it produces N candidates for each reasoning stage and selectively chooses the best option before moving to the next stage. This is much faster than generating multiple complete responses or working at the sentence level. This granular approach to inference scaling allows the model to maintain coherent reasoning while improving accuracy. 
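The selection loop can be pictured with a short sketch. Here `generate_stage` and `score` are assumed helpers standing in for the model's sampling and candidate-ranking steps, which the paper may implement differently.

```python
def stage_beam_search(prompt, generate_stage, score, n=4):
    # For each reasoning stage, sample n candidate continuations, keep only
    # the highest-scoring one, and extend the context before the next stage.
    context = prompt
    for stage in ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]:
        candidates = [generate_stage(context, stage) for _ in range(n)]
        context += max(candidates, key=score)
    return context
```

Because candidates are pruned after every stage, the search cost grows linearly with the number of stages rather than multiplying across them, which is why this granularity sits between best-of-N and sentence-level beam search.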

Process flow for generating the LLaVA-o1-100k dataset.

Evaluating and Benchmarking LLaVA-o1 

Using the stage-level beam search strategy within the LLaVA-o1 architecture yielded substantial gains in inference accuracy on complex reasoning tasks. The authors performed comparative evaluations across multiple benchmarks (MMStar, MMBench V1.1, MMVet, MathVista, AI2D, and HallusionBench) and found that this approach significantly outperformed baseline inference-scaling methods: best-of-N achieved only a modest 0.6% improvement, while sentence-level beam search actually degraded performance by 1.9%.

Experimental results of LLaVA-o1 and state-of-the-art models on reasoning benchmarks.

Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions

Zhao et al. [MarcoPolo Team, Alibaba International Digital Commerce]

♥ 644   LLM Reasoning

Introduction to Marco-o1

OpenAI's o1 and other similar LLMs have demonstrated remarkable reasoning capabilities in structured domains such as mathematics, physics, and coding. However, these models struggle with open-ended problems where clear standards are absent and traditional reward mechanisms are challenging to define. The primary limitation is that existing models excel in domains with well-defined metrics but falter in complex, real-world scenarios which require nuanced, flexible reasoning that cannot be easily quantified through standard reinforcement learning techniques.

A classic question reasoned through by the Marco-o1 model: “How many ‘r’s are in ‘strawberry’?”

Marco-o1 is a new model that combines three key techniques: Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), and a novel Reasoning Action Strategy. CoT fine-tuning teaches the model to produce structured reasoning traces, the MCTS component dynamically explores different solution strategies using confidence scores, and the Reasoning Action Strategy introduces variable granularity in reasoning steps to optimize search efficiency.

How Does Marco-o1's Reasoning Architecture Work

Marco-o1 uses an advanced Monte Carlo Tree Search (MCTS) framework with a probabilistic reasoning strategy. It transforms traditional MCTS by representing reasoning as a dynamic computational graph (sketched in code after the list below) where:

  • Nodes represent discrete reasoning states

  • Actions are probabilistic language model outputs

  • Search paths are evaluated through a sophisticated confidence scoring mechanism
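A minimal sketch of that computational graph, assuming a hypothetical `llm_sample` helper that returns a candidate next step together with its confidence-based reward:

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningNode:
    state: str                    # prompt plus all reasoning steps so far
    reward: float = 0.0           # confidence-based path reward v
    children: list = field(default_factory=list)

def expand(node: ReasoningNode, llm_sample, n_candidates: int = 4) -> ReasoningNode:
    # Sample candidate next steps (the probabilistic "actions"), attach them
    # as child states, and return the most promising child to explore next.
    for _ in range(n_candidates):
        step, reward = llm_sample(node.state)  # assumed helper
        node.children.append(ReasoningNode(node.state + step, reward))
    return max(node.children, key=lambda c: c.reward)
```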

Confidence Scoring Mechanism: The Marco-o1 model calculates token-level confidence through a normalized softmax-based approach (sketched in code after this list):

  1. For each generated token, compute relative confidence by comparing its log probability against top-5 alternative token probabilities

  2. Normalize confidence scores between 0-1 using exponential scaling

  3. Aggregate token confidences to derive an overall path reward score (v)
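In code, the scoring above might look like the following sketch. The chosen token's log probability and its top-5 alternatives are assumed to come from the decoding loop; averaging is one natural aggregation for the path reward v.

```python
import math

def token_confidence(chosen_logprob: float, top5_logprobs: list) -> float:
    # Exponentiate log probabilities and normalize against the top-5
    # alternatives, yielding a relative confidence in (0, 1].
    denom = sum(math.exp(lp) for lp in top5_logprobs)
    return math.exp(chosen_logprob) / denom

def path_reward(chosen_logprobs: list, alternatives: list) -> float:
    # Average per-token confidences into an overall path reward v.
    scores = [token_confidence(lp, alts)
              for lp, alts in zip(chosen_logprobs, alternatives)]
    return sum(scores) / len(scores)
```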

Reasoning Action Strategies: The Marco-o1 model introduces multi-granular reasoning exploration (see the chunking sketch after this list):

  • Step-level Actions: Generate complete reasoning steps

  • Mini-step Actions: Subdivide reasoning into 32-64 token segments

  • Together, these enable more nuanced path exploration compared to traditional coarse-grained approaches
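As a rough illustration of the mini-step granularity, a token sequence can be chopped into fixed-size segments so the search tree branches more finely; the step sizes below mirror the 32-64 token mini-steps described above.

```python
def mini_steps(tokens: list, step_size: int = 64) -> list:
    # Subdivide a token sequence into fixed-size mini-steps (e.g. 32 or 64
    # tokens) so the search can branch more finely than whole steps.
    return [tokens[i:i + step_size] for i in range(0, len(tokens), step_size)]

print(len(mini_steps(list(range(200)), 64)))  # -> 4 segments
```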

Self-Reflection Mechanism: A meta-cognitive layer is integrated through an explicit self-critique prompt: "Wait! Maybe I made some mistakes! I need to rethink from scratch." This triggers (see the sketch after this list):

  • Internal error detection

  • Voluntary reasoning path reconstruction

  • Approximately 50% improvement on complex problem-solving tasks
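A small sketch of how the reflection prompt could slot into the search loop, reusing the hypothetical `llm_sample` helper from the MCTS sketch above; how the revision is scored and merged back is an assumption.

```python
REFLECTION_PROMPT = ("Wait! Maybe I made some mistakes! "
                     "I need to rethink from scratch.")

def reflect(state: str, llm_sample) -> str:
    # Append the self-critique prompt and let the model revise its own
    # reasoning chain from the current state.
    revised_step, _reward = llm_sample(state + "\n" + REFLECTION_PROMPT)
    return revised_step
```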

Evaluation of Marco-o1

Marco-o1 shows significant improvements in reasoning tasks, achieving +6.17% accuracy on the English MGSM dataset and +5.60% on the Chinese MGSM dataset. It also shows exceptional translation capabilities, particularly in handling complex colloquial expressions, providing nuanced, contextually accurate translations that outperformed standard tools like Google Translate.

Moreover, they showcased the model's nuanced reasoning through a compelling translation example, where it successfully translated a complex Chinese colloquial expression capturing contextual and cultural subtleties. By expanding the solution space and implementing a reflective reasoning mechanism, Marco-o1 represents a promising step towards more adaptable and context-aware reasoning models that can handle open-ended, real-world challenges more effectively.

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

Peng et al. [Anthropic, New York University, MATS]

♥ 1.1k   LLM Jailbreak

Introduction to RapidResponseBench

LLMs are becoming more powerful, and ensuring their safety against misuse is increasingly important. Existing defense methods are not robust against adversarial attacks. This paper proposes a novel "Jailbreak Rapid Response" approach that focuses on quickly detecting and blocking jailbreak strategies after observing only a few attack examples, rather than seeking a perfect static defense.

They also developed RapidResponseBench, a benchmark that measures defense effectiveness. On it, they demonstrate a method that reduces attack success rates by a factor of 240 on in-distribution jailbreaks and a factor of 15 on out-of-distribution jailbreaks.

How Does RapidResponseBench Work

The researchers developed a "Jailbreak Rapid Response" approach that quickly adapts to novel jailbreak strategies: from a few observed attack instances, they use jailbreak proliferation to generate many similar examples, producing 1,000 proliferation attempts per jailbreak strategy with Llama-3.1-70B-Instruct. They then tested five baseline rapid response techniques across different language models (a minimal sketch of the simplest one follows the list):

  1. Regex

  2. Guard Fine-tuning (most promising method)

  3. Guard Few-shot

  4. Defense Prompt

  5. Embedding
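As a concrete taste of the simplest technique, here is a hedged sketch of a regex guard. The pattern is hypothetical and stands in for whatever the observed attack examples have in common; in the paper, the filters would be derived from the proliferated examples.

```python
import re

# Hypothetical pattern distilled from a handful of observed jailbreaks;
# in practice the filter would be derived from the proliferated examples.
KNOWN_JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
]

def regex_guard(prompt: str) -> bool:
    """Return True if the prompt matches a known jailbreak pattern."""
    return any(p.search(prompt) for p in KNOWN_JAILBREAK_PATTERNS)

print(regex_guard("Please ignore all previous instructions and ..."))  # True
```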

Evaluating RapidResponseBench

The tests showed that by carefully training the AI's "defense mechanism," the researchers could dramatically reduce the chances of harmful interactions. Specifically, they found that with just one example of a potential attack, they could decrease successful manipulation attempts by more than 15 times.

Technical Methodology and Performance Metrics:

  • Attack Vector Reduction: Guard Fine-tuning achieved a >15x decrease in vulnerability exposure

  • Sample Efficiency: Enhanced defense effectiveness with progressive jailbreak example proliferation

  • Generalization Capability: Demonstrated robust performance across in-distribution and out-of-distribution scenarios

  • Adaptive Learning: Dynamic response mechanisms that maintain low false-positive rates for legitimate queries

The result is an intelligent system that doesn't just block known attacks but learns to recognize and neutralize increasingly sophisticated intrusion attempts. The system continuously adapts, learning from each interaction to build more resilient defense protocols. While promising, the approach requires continuous refinement, acknowledging the cat-and-mouse dynamics inherent in cybersecurity and AI safety. The core objective remains developing an anticipatory, self-evolving defense mechanism that can proactively identify and mitigate potential misuse risks with minimal human intervention.
