Fully Autonomous AI Agents Should Not be Developed
Plus more about OmniHuman-1, and Simple test-time scaling
Feb 3rd ~ Feb 9th
#42 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 12k Mistral AI has launched le Chat, a comprehensive AI assistant available on both iOS and Android. It offers features like Flash Answers, advanced document processing, and code interpretation. The service includes a free tier with access to core capabilities, and Pro subscriptions start at $14.99 per month.
Le Chat Interface
♥ 1.5k Pika Labs has introduced two major updates: Pikadditions, a new feature that lets users seamlessly integrate people or objects into any video, and Pika 2.1, which brings enhanced 1080p resolution and improved human character rendering. New users get fifteen free Pikadditions generations on the pika.art website.
♥ 3.6k Google has released Gemini 2.0 Flash Thinking Experimental, which is currently ranked as the world's best AI model on lmarena.ai. It integrates with popular Google services like YouTube, Search, and Maps, though this experimental version has limitations and does not include real-time information access.
current rankings on LM Arena
Thunder Compute: The Cheapest Cloud GPU
Thunder Compute is the cheapest way to get GPU cloud instances for AI, machine learning, or data science. You can get an A100 hosted in Google Cloud, in a US data center, with best-in-class reliability and networking for $0.57/hr, compared with $3.50/hr directly from Google.
To make this possible, Thunder Compute invented virtualization software to network-attach GPUs, which increases GPU utilization on the platform by 5x. Less idle time means lower prices for you.
Thunder Compute uses a simple CLI to create and connect to instances. Just run tnr create [--gpu a100 --vcpus 8] and tnr connect [instance_id] to start.
All instances have native notebook and VSCode support with convenient templates to launch instances with Ollama, ComfyUI, Flux, and more.
Create an instance for free with $20 per month of credit.
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
Lin et al. [ByteDance]
♥ 1.4k Face Animation
Introduction to OmniHuman-1
Current AI animation models can produce fairly realistic results, but their practical applications are severely limited by their reliance on heavily filtered datasets. For instance, audio-driven models typically only work for facial animations, while pose-driven models are restricted to full-body animations with static backgrounds. This limitation stems from the need to filter out data containing "irrelevant" factors: for audio-driven models, this means removing data with significant body poses or background changes, while pose-driven models require specific camera angles and clean backgrounds.

This paper introduces OmniHuman, a mixed-condition training strategy that combines multiple types of inputs (text, audio, and pose) during the training phase. This approach allows the model to utilize data that would normally be discarded in single-condition models, as data unsuitable for one condition type can still be valuable for another. For example, a video with complex body movements that would be filtered out for audio-driven training can still be used for text or pose-driven generation.
Understanding OmniHuman-1
OmniHuman can produce realistic human animations by combining multiple types of input conditions - text, audio, pose, and reference images - into a single unified framework. The model starts by taking a reference image that defines the subject's appearance and identity. It then processes various driving signals (like audio for speech, pose for movement, or text descriptions) through specialized encoders. For audio, it uses wav2vec to extract acoustic features; for pose, it uses a pose guider to encode movement information; and for text, it maintains the original MMDiT text processing workflow.
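As a rough, hypothetical sketch of what per-modality condition encoders could look like, the PyTorch snippet below projects wav2vec-style audio features and pose keypoints into a shared hidden size. The module names, dimensions, and projections are assumptions for illustration, not ByteDance's actual implementation.

```python
# Hypothetical sketch of per-modality condition encoders (not the paper's code).
import torch
import torch.nn as nn

class ConditionEncoders(nn.Module):
    def __init__(self, audio_dim=768, pose_dim=134, hidden_dim=1024):
        super().__init__()
        # Project wav2vec-style acoustic features into the backbone's hidden size
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # A small MLP standing in for the "pose guider"
        self.pose_guider = nn.Sequential(
            nn.Linear(pose_dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, audio_feats, pose_feats):
        return self.audio_proj(audio_feats), self.pose_guider(pose_feats)

enc = ConditionEncoders()
audio = torch.randn(1, 49, 768)   # dummy wav2vec features (batch, frames, dim)
pose = torch.randn(1, 49, 134)    # dummy per-frame pose keypoints
audio_tokens, pose_tokens = enc(audio, pose)
print(audio_tokens.shape, pose_tokens.shape)
```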

The system is built on a DiT (Diffusion Transformer) architecture that processes all these inputs simultaneously. Rather than using separate networks for different features, OmniHuman cleverly combines them through cross-attention and self-attention mechanisms. For example, audio features interact with video generation through cross-attention blocks, while pose information is directly concatenated with the latent representations. The reference image features are processed alongside the video tokens, which allows the model to maintain consistent appearance throughout the generation process.
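To make the two injection routes concrete, here is a minimal, hypothetical DiT-style block in the same spirit: pose tokens are concatenated with the video latents and projected back down, while audio tokens enter through cross-attention. All names and dimensions are illustrative assumptions, not the actual OmniHuman architecture.

```python
# Toy DiT-style block combining conditions via concatenation and cross-attention.
import torch
import torch.nn as nn

class MixedConditionBlock(nn.Module):
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pose_merge = nn.Linear(dim * 2, dim)  # fold concatenated pose back to dim
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_latents, audio_tokens, pose_tokens):
        # Pose information is concatenated with the latent representation, then projected
        x = self.pose_merge(torch.cat([video_latents, pose_tokens], dim=-1))
        # Self-attention over the (reference + video) token stream
        x = x + self.self_attn(x, x, x)[0]
        # Audio features interact with generation through cross-attention
        x = x + self.cross_attn(x, audio_tokens, audio_tokens)[0]
        return x + self.mlp(x)

block = MixedConditionBlock()
out = block(torch.randn(1, 49, 1024), torch.randn(1, 49, 1024), torch.randn(1, 49, 1024))
print(out.shape)  # torch.Size([1, 49, 1024])
```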

The researchers used a three-stage training strategy that starts with basic text and image conditioning, then gradually incorporates audio, and finally adds pose information. Combining this staged approach with carefully balanced training ratios (giving more weight to weaker conditions) allows the model to learn effectively from a much larger dataset than traditional single-condition models. During inference, the model can use any combination of these conditions to generate videos. It also uses a special annealing strategy for classifier-free guidance that helps maintain both expressiveness and visual quality.
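As a hedged illustration of that staged, ratio-balanced recipe, the toy script below drops conditions with stage-specific probabilities and linearly anneals the classifier-free guidance scale across denoising steps. The ratios and schedule are made up for illustration, not the paper's actual values.

```python
# Toy illustration of staged, ratio-balanced condition training and CFG annealing.
import random

STAGES = [
    {"name": "stage1_text_image", "cond_keep_prob": {"text": 0.9, "audio": 0.0, "pose": 0.0}},
    {"name": "stage2_add_audio",  "cond_keep_prob": {"text": 0.5, "audio": 0.5, "pose": 0.0}},
    {"name": "stage3_add_pose",   "cond_keep_prob": {"text": 0.25, "audio": 0.5, "pose": 0.25}},
]

def sample_active_conditions(stage):
    """Randomly drop conditions according to the stage's keep probabilities."""
    return [c for c, p in stage["cond_keep_prob"].items() if random.random() < p]

def cfg_scale(step, total_steps, start=7.5, end=3.0):
    """Simple linear annealing of the classifier-free guidance scale."""
    return start + (end - start) * step / max(total_steps - 1, 1)

for stage in STAGES:
    print(stage["name"], sample_active_conditions(stage))
print([round(cfg_scale(s, 5), 2) for s in range(5)])
```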

Results and Evaluations
OmniHuman shows remarkable performance improvements over existing specialized models in both portrait and body animation tasks. It achieves superior scores across key metrics, with notably higher IQA (3.875), ASE (2.656), and Sync-C (5.199) for portrait animation, and significantly better gesture generation metrics (HKV: 47.561) for body animation. The model consistently outperforms established methods like SadTalker, Hallo, and CyberHost while maintaining lower FID and FVD scores.

OmniHuman is even more impressive for its versatility and generalization capabilities. Unlike existing methods that require separate models for different scenarios, OmniHuman handles various input sizes, aspect ratios, and body proportions with a single model.
Fully Autonomous AI Agents Should Not be Developed
Mitchell et al. [Hugging Face]
♥ 499 AI Agents
Will AI Doom Us?
Many AI companies are rushing to integrate LLMs into autonomous systems capable of executing multiple tasks without human intervention. However, little attention is being paid to the fundamental risks this shift may pose.
The researchers of this paper found a direct correlation between increased AI autonomy and heightened risks to human safety, privacy, and security. This problem is particularly pressing because current development trajectories are headed towards fully autonomous systems that could potentially override human control.
In this study, the researchers aim to demonstrate why maintaining human control elements in AI systems offers a better risk-benefit profile. This balanced approach would allow for technological advancement while mitigating the most severe potential harms identified in their analysis.

Will AI Agents Cause a Robot Uprising?
Before we answer that question, let’s first quickly understand what we mean by AI agents. AI agents are computer software systems capable of creating context-specific plans in non-deterministic environments. This paper analyzes various AI agent levels and their associated ethical trade-offs. The researchers present a graduated scale of AI agent autonomy, ranging from simple processors to fully autonomous agents. At the lowest level, models have no impact on program flow, while at the highest level, models can create and execute new code independently.
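As a rough illustration of that spectrum (with hypothetical function names, not the paper's exact taxonomy), the snippet below shows how much control the model's output has over program flow at three points on the scale.

```python
# Hedged sketch of the autonomy spectrum: from output that cannot affect program
# flow to model-written code that runs directly. Names are illustrative only.

def simple_processor(llm_output: str) -> str:
    # Lowest level: the model's output has no impact on program flow
    return llm_output.strip()

def router(llm_decision: str) -> str:
    # Intermediate level: the model chooses which pre-written branch runs
    return "run_path_a()" if llm_decision == "a" else "run_path_b()"

def fully_autonomous(llm_generated_code: str) -> None:
    # Highest level: model-written code is created and executed independently --
    # the regime the paper argues against (exec shown only to make the risk concrete)
    exec(llm_generated_code)

print(simple_processor("  hello  "))
print(router("a"))
```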
The researchers found that inherent risks are present at all autonomy levels, but several critical dimensions are directly affected by the degree of autonomy: accuracy, where errors compound with increased complexity; safety and security, where autonomy increases unpredictable actions and expands attack surfaces; and privacy, where data exposure risks escalate at higher autonomy levels. As AI agents become more autonomous, human control diminishes, which can lead to unintended consequences.

Furthermore, the researchers found that increased autonomy can enhance efficiency and assistance while simultaneously introducing new risks of compounding errors and loss of control. Similarly, greater human likeness can improve human-computer interaction but may lead to misplaced trust and psychological dependence.
The paper concludes that fully autonomous AI agents present significant risks that outweigh potential benefits. While increased autonomy can improve efficiency, assistance, and relevance, it also increases risks related to accuracy, safety, security, and truthfulness. Furthermore, inherent risks and biases from base models can propagate through all autonomy levels, which can directly affect critical values like consistency, equity, and trust.
Arguments For Developing Autonomous AI Agents
This paper highlights a significant gap in how we distinguish between different levels of autonomy. The authors argue that treating all AI agents as a single category has led to widespread confusion and potential risks. However, there are notable counterarguments to the paper's position. Some researchers argue that AI agents could advance our understanding of human intelligence and potentially help address global challenges like climate change and hunger.
The AGI development community sees full autonomy as a pathway to significant technological and economic advancement. However, the authors of this study advocate for maintaining human oversight in all AI systems, including AGI. They support this position with historical evidence, notably citing the 1980 incident where automated systems falsely detected over 2,000 Soviet missiles heading toward North America. This near-catastrophic event was only averted through human verification. This historical lesson shows that even as AI capabilities advance, human judgment is essential for preventing potentially catastrophic automated decisions.
s1: Simple test-time scaling
Muennighoff et al. [Stanford University, University of Washington, Allen Institute for AI, Contextual AI]
♥ 1.3k LLM Reasoning bycloud’s pick
Introduction To Test-Time Scaling
OpenAI's o1 model has impressive test-time scaling capabilities (using extra computational resources during testing to improve performance), but since it is a proprietary model, its methodology has not been published. Many researchers have tried to replicate it, but none have successfully reproduced clear test-time scaling behavior while maintaining open-source accessibility.
This paper proposes a simple solution called "s1-32B" that achieves both strong reasoning performance and test-time scaling using just 1,000 carefully selected training samples and a technique called "budget forcing." This study fine-tunes the Qwen2.5-32B-Instruct model and implements budget forcing to control the model's thinking duration. This efficient approach challenges the assumption that large-scale reinforcement learning and massive datasets are necessary for achieving strong reasoning capabilities in language models.
How Does Test-Time Scaling Work?
The researchers created a highly efficient training dataset to study test-time scaling. They started with 59,029 questions from 16 diverse sources and systematically refined the collection through three stages. The initial data came from established sources like NuminaMATH and OlympicArena, plus original content from Stanford University's Statistics PhD exams and quantitative trading interview questions.
The filtering process focused on three key principles: quality, difficulty, and diversity. For quality, they removed examples with formatting issues and API errors. To ensure difficulty, they tested questions against two AI models, keeping only the challenging ones that both models struggled with. Finally, they achieved diversity by categorizing questions across 50 different domains using the Mathematics Subject Classification system, carefully selecting representatives from each field.
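A simplified sketch of that three-stage filter is shown below. The field names and grading helpers are assumptions: the real pipeline tests difficulty against two reference models and assigns domains with the Mathematics Subject Classification, which are only stubbed out here.

```python
# Simplified quality -> difficulty -> diversity filtering, with hypothetical fields.
import random
from collections import defaultdict

def passes_quality(q):
    # Drop examples with formatting issues or API errors
    return not q.get("has_api_error") and not q.get("has_formatting_issue")

def is_difficult(q, weak_model_solved, strong_model_solved):
    # Keep only questions that both reference models fail to solve
    return not weak_model_solved(q) and not strong_model_solved(q)

def select_diverse(questions, domain_of, k=1000):
    # Bucket questions by domain, then sample across domains until k are chosen
    by_domain = defaultdict(list)
    for q in questions:
        by_domain[domain_of(q)].append(q)
    selected, domains = [], list(by_domain)
    while len(selected) < k and domains:
        d = random.choice(domains)
        if by_domain[d]:
            selected.append(by_domain[d].pop())
        else:
            domains.remove(d)
    return selected
```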

s1K dataset of 1,000 high-quality and diverse questions with reasoning traces.
This ultimately produced a refined dataset of just 1,000 high-quality samples, showing that carefully curated smaller datasets can be more effective than larger, less refined ones for this kind of training. The researchers then introduced "budget forcing", a simple yet effective approach to test-time scaling in language models. This method controls the model's thinking duration by either forcing it to end early when it exceeds a maximum token limit or encouraging additional thinking by appending "Wait" when more computation is desired.
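Here is a minimal sketch of the budget-forcing idea, assuming a generic generate(prompt, stop=...) decoding call as a stand-in for a real model API: truncate thinking once it exceeds the budget, or suppress the end-of-thinking token and append "Wait" to extend reasoning.

```python
# Minimal budget-forcing sketch; `generate` is a stand-in callable, not a real API.

END_OF_THINKING = "</think>"

def budget_forced_decode(generate, prompt, max_thinking_tokens=None, num_waits=0):
    thinking = generate(prompt, stop=END_OF_THINKING)
    if max_thinking_tokens is not None and len(thinking.split()) > max_thinking_tokens:
        # Force an early end to thinking by truncating at the token budget
        thinking = " ".join(thinking.split()[:max_thinking_tokens])
    for _ in range(num_waits):
        # Suppress the end-of-thinking token and append "Wait" to extend reasoning
        thinking += "\nWait"
        thinking += generate(prompt + thinking, stop=END_OF_THINKING)
    # Close the thinking block and decode the final answer
    return generate(prompt + thinking + END_OF_THINKING)
```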
During inference, the researchers evaluate test-time scaling methods using three key metrics: Control (the ability to stay within specified computational limits), Scaling (the average improvement rate as compute increases), and Performance (the maximum achievement on benchmarks). They also distinguish between sequential methods, where later computations build on previous results, and parallel methods like majority voting.
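The toy functions below illustrate the spirit of those three metrics over a list of (thinking_tokens, accuracy) points and a token budget; the paper's formal definitions differ in details, so treat this as a loose approximation rather than the official evaluation code.

```python
# Loose approximations of Control, Scaling, and Performance for illustration only.

def control(runs, budget):
    # Fraction of runs whose compute stayed within the requested budget
    return sum(tokens <= budget for tokens, _ in runs) / len(runs)

def scaling(runs):
    # Average improvement in accuracy per extra unit of compute between runs
    runs = sorted(runs)
    slopes = [(a2 - a1) / (t2 - t1) for (t1, a1), (t2, a2) in zip(runs, runs[1:]) if t2 > t1]
    return sum(slopes) / len(slopes)

def performance(runs):
    # Best accuracy reached at any compute level
    return max(a for _, a in runs)

runs = [(512, 0.30), (1024, 0.42), (2048, 0.50), (4096, 0.53)]
print(control(runs, 2048), round(scaling(runs), 6), performance(runs))
```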

Benchmark Results of Test-Time Scaling
The experimental results show that effective reasoning training can be achieved with just 1,000 samples, making s1-32B the most sample-efficient open-data reasoning model available. The study compared two scaling approaches: sequential scaling through budget forcing and parallel scaling via majority voting. Sequential scaling proved more effective but showed diminishing returns beyond 6x scaling, primarily due to repetitive loops when end-of-thinking tokens were overly suppressed.

In performance comparisons, s1-32B significantly outperformed its base model (Qwen2.5-32B-Instruct) and achieved comparable results to Gemini 2.0 Thinking on AIME24 tests. While r1-32B showed superior performance, it required 800 times more training samples, which raises questions about the optimal balance between sample size and model performance.

🚨This week's top AI/ML research papers:
- Demystifying Long Chain-of-Thought Reasoning in LLMs
- OmniHuman-1
- LIMO
- s1: Simple test-time scaling
- Process Reinforcement through Implicit Rewards
- Iterate to Accelerate
- Efficient Reasoning with Hidden Thinking
- Fully… x.com/i/web/status/1…— The AI Timeline (@TheAITimeline)
10:42 PM • Feb 9, 2025