
Can LLMs Produce Ideas, Windows Agent Arena, AdEMAMix Optimizer

#23 | Latest AI Research Explained Simply

In this issue: x2 industry news, x3 AI research papers

Sep 9th ~ Sep 15th

🗞️ Industry News in 1 Line

  1. ♥ 17k OpenAI has released a new series of AI models called o1, which are designed to think through and evaluate their responses using a chain-of-thought process before answering. This allows them to solve harder math problems, and o1 can beat 89% of the students who participate in a qualifier for the USA Math Olympiad.

  2. ♥ 5.3k Mistral, the French AI company popular for dropping SOTA model weights via torrent links, has released its first multimodal model, Pixtral 12B. It can process both text and images and can perform tasks like captioning images.

Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

Si et al. [Stanford]

♥ 1.3k   LLM

Are LLMs Creative?

LLMs can generate text, but the output is often little more than a remix of their training data. Right now, it is unclear whether they can actually generate novel, expert-level research ideas. Coming up with ideas is a crucial first step of the scientific process, and no one has definitively shown whether AI can do this as well as human experts.

In this paper, the authors set up a controlled experiment to compare research ideas generated by AI against those written by human experts in the field of Natural Language Processing (NLP). This gives us the first statistically significant evidence on whether current LLMs can generate novel research ideas that are comparable to those of human experts.

Testing LLMs for Novelty of Their Ideas

The experiment focused on prompting-based NLP research and defined seven specific research topics extracted from recent NLP conference calls for papers. Both the human participants and the AI agent were given the same set of instructions, including topic descriptions, idea templates, and demonstration examples.

To generate ideas, the researchers built an AI agent around an LLM with retrieval-augmented generation (RAG). The agent retrieved relevant papers using the Semantic Scholar API, generated a large pool of candidate ideas, and then ranked them.
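For intuition, here is a minimal sketch of such a retrieve-generate-rank pipeline. Only the Semantic Scholar search endpoint is a real public API; the `llm` helper, the prompts, and the ranking heuristic are hypothetical placeholders for illustration, not the paper's actual agent.

```python
import requests

def llm(prompt: str) -> str:
    """Placeholder for a call to whichever LLM client you use."""
    raise NotImplementedError("plug in an LLM client here")

def retrieve_papers(topic: str, limit: int = 20) -> list[dict]:
    """Fetch related papers from the public Semantic Scholar search API."""
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": topic, "limit": limit, "fields": "title,abstract"},
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

def generate_ideas(topic: str, papers: list[dict], n: int = 100) -> list[str]:
    """Generate a large pool of candidate ideas grounded in the retrieved papers."""
    context = "\n".join(p["title"] for p in papers)
    prompt = f"Topic: {topic}\nRelated work:\n{context}\nPropose one novel research idea."
    return [llm(prompt) for _ in range(n)]

def rank_ideas(ideas: list[str]) -> list[str]:
    """Score each candidate with the LLM and return the pool best-first."""
    scored = [(float(llm(f"Rate the novelty of this idea from 1 to 10:\n{idea}")), idea)
              for idea in ideas]
    return [idea for _, idea in sorted(scored, key=lambda s: s[0], reverse=True)]
```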

For human participants, they recruited 49 highly qualified NLP researchers from various institutions to write ideas. The human experts were allowed to select their preferred topics, and the AI agent generated ideas for the same topics to ensure an equal distribution. If you think you are a good fit for this, you can sign up for this experiment too!

For evaluation, the researchers recruited 79 expert reviewers to conduct blind reviews of the ideas from three conditions: human-written ideas, AI-generated ideas, and AI-generated ideas reranked by a human expert. They developed a standardized review form with clear criteria and numerical scales to ensure consistent evaluations.

To eliminate potential biases from writing style, they used an LLM to normalize the style of all ideas before review. This comprehensive setup allowed the researchers to make statistically rigorous comparisons between human experts and state-of-the-art LLMs in generating novel research ideas.

Are LLMs More Creative than Humans?

This paper used three different statistical tests to answer this question:

  1. Test 1: Treated each review as an independent datapoint

  2. Test 2: Treated each idea as an independent datapoint

  3. Test 3: Treated each reviewer as an independent datapoint

Across all three tests, the results consistently showed that AI-generated ideas were rated significantly higher in novelty compared to human-generated ideas (p < 0.05). The mean novelty scores for AI ideas (both with and without human reranking) were approximately 5.6-5.8 out of 10, while human ideas scored around 4.8-4.9.

Statistical analysis of test results.

Interestingly, AI ideas were comparable to human ideas on other metrics such as excitement, feasibility, expected effectiveness, and overall score. In some tests, AI ideas even showed slight advantages in excitement and overall scores, though these were less consistent across all tests.
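The three tests above differ only in what counts as an independent data point. Here is a toy sketch of that aggregation logic, using Welch's t-test as a stand-in for the paper's exact statistical procedure; the DataFrame layout and column names are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import ttest_ind, ttest_1samp

# Assumed layout: one row per review with columns
# 'condition' ("human" or "ai"), 'idea_id', 'reviewer_id', 'novelty'.
reviews = pd.read_csv("reviews.csv")
ai = reviews[reviews.condition == "ai"]
human = reviews[reviews.condition == "human"]

# Test 1: every review is an independent data point.
test1 = ttest_ind(ai.novelty, human.novelty, equal_var=False)

# Test 2: average the reviews of each idea first, then compare ideas.
by_idea = reviews.groupby(["condition", "idea_id"]).novelty.mean().reset_index()
test2 = ttest_ind(by_idea[by_idea.condition == "ai"].novelty,
                  by_idea[by_idea.condition == "human"].novelty,
                  equal_var=False)

# Test 3: for each reviewer, compare their mean AI score with their mean
# human score, then test whether the paired difference is nonzero.
per_reviewer = reviews.pivot_table(index="reviewer_id", columns="condition",
                                   values="novelty", aggfunc="mean")
test3 = ttest_1samp((per_reviewer["ai"] - per_reviewer["human"]).dropna(), 0.0)
```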

Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Bonatti et al. [Microsoft, Carnegie Mellon University, Columbia University]

♥ 744   LLM Agent

Introduction to WINDOWS AGENT ARENA

LLMs show great potential as computer agents to help with various tasks, but it's challenging to measure how well these agents perform in realistic environments. Current benchmarks are often limited to specific areas like text-only tasks or web navigation, and they can take days to run a full evaluation because of the step-by-step nature of the tasks.

To address this, the researchers have created WINDOWS AGENT ARENA, a new benchmark environment that focuses on the Windows operating system. This environment allows AI agents to use the same wide range of applications and tools that human users can access when solving tasks.

The benchmark includes over 150 diverse Windows tasks that test the agent's abilities in planning, understanding what's on the screen, and using different tools.

How Does WINDOWS AGENT ARENA Work?

WINDOWS AGENT ARENA is a benchmark environment for testing AI agents in a Windows operating system. The arena provides a virtual Windows environment where AI agents can perform various tasks, just like a human user would on a real computer. These tasks range from simple actions like editing documents to more complex operations like customizing system settings.

  1. Task structure: Each task in the arena has a specific instruction, an initial setup (like opening a particular program), and a way to evaluate if the task was completed successfully. The agent tries to complete the task by taking actions in the virtual environment.

  2. Agent interaction: The agent receives information about the current state of the system (like what's on the screen) and can perform actions such as moving the mouse, typing, or clicking. It keeps taking actions until it completes the task, gives up, or runs out of time (a minimal sketch of this loop follows the list).

  3. Evaluation: After the agent finishes, the system checks if the task was completed correctly. This might involve examining files, checking system settings, or running scripts to verify the outcome.

  4. Technical implementation: The arena uses a Docker container to host a Windows 11 virtual machine. A special server inside the virtual machine communicates with the agent, allowing it to send commands and receive information about the system state.

  5. Scalability: One of the key features of WINDOWS AGENT ARENA is its ability to run many tasks in parallel using cloud computing (specifically, Microsoft Azure). This allows for much faster evaluation of agents compared to running tasks one at a time on a single computer.
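A stripped-down version of the observe-act-evaluate loop from steps 1-3 could look like the sketch below. The class and method names (`env.screenshot`, `agent.next_action`, and so on) are illustrative, not the benchmark's actual API.

```python
import time

class WindowsTask:
    """Illustrative container for one benchmark task: a natural-language
    instruction, an initial setup step, and a success check."""
    def __init__(self, instruction, setup, check_success):
        self.instruction = instruction
        self.setup = setup                  # e.g. open a document in Notepad
        self.check_success = check_success  # e.g. verify the saved file's contents

def run_episode(agent, env, task, max_steps=20, timeout_s=600):
    """Let the agent observe the screen, act, and repeat until done or out of budget."""
    task.setup(env)
    deadline = time.time() + timeout_s
    for _ in range(max_steps):
        if time.time() > deadline:
            break
        obs = env.screenshot()                             # current screen state
        action = agent.next_action(task.instruction, obs)  # e.g. click, type, scroll
        if action == "DONE":                               # agent believes it finished
            break
        env.execute(action)                                # mouse/keyboard command
    return task.check_success(env)   # file, registry, or script-based verification
```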

Real World Implications of WINDOWS AGENT ARENA

WINDOWS AGENT ARENA shows that the best-performing agent, using UIA with Omniparser and GPT-4V-1106, only achieved a 19.5% success rate, much lower than the 74.5% human performance. This highlights the difficulty in creating AI agents that can match human abilities.

The study also found that precise visual prompts (Set-of-Marks) are crucial for better performance, and they improve results by up to 57% for some models. However, many failures were due to the agent's struggle to align text output with visual understanding.
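For context, Set-of-Marks prompting overlays numbered markers on detected UI elements so the model can answer with "click mark 3" instead of raw pixel coordinates. Below is a minimal sketch using Pillow; the bounding boxes are assumed to come from some UI element detector (such as OmniParser), which is not shown.

```python
from PIL import Image, ImageDraw

def add_marks(screenshot_path: str,
              boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Overlay numbered marks on detected UI elements (Set-of-Marks style)."""
    img = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (left, top, right, bottom) in enumerate(boxes):
        draw.rectangle((left, top, right, bottom), outline="red", width=3)
        draw.text((left + 4, top + 4), str(i), fill="red")
    return img

# The annotated screenshot is sent to the multimodal model, which then refers to
# elements by mark index, and the harness maps that index back to coordinates.
```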

This paper showed that there is a need for better AI agents that can assist with everyday computer tasks. Better accuracy and reliability in these agents can make them more useful for people.

Future research could focus on combining human input with AI to improve performance and developing specialized sub-systems for specific tasks.

The AdEMAMix Optimizer: Better, Faster, Older

Pagliardini et al. [EPFL, Apple]

♥ 2.3k   LLM Optimizer

Introduction to AdEMAMix Optimizer

When training AI models, optimizers use information from past gradients to guide future steps. Traditional optimization algorithms like Stochastic Gradient Descent (SGD) with momentum and Adam do this with Exponential Moving Averages (EMAs) of past gradients. These EMAs mostly weight recent information (on the order of the last 6 steps or so), because the assumption is that older gradients are no longer useful.

However, the researchers found that older gradients can actually be very helpful. The problem is that if we try to make a single EMA cover much older information, the optimizer becomes less responsive to new gradients, which can slow down learning.

To address this issue, the researchers propose a novel optimizer called AdEMAMix (Adaptive EMA Mixture). This method aims to better utilize past gradients by combining a "fast-changing" EMA (with a smaller β value) and a "slow-changing" EMA (with a larger β value).

This combination allows the optimizer to benefit from the speed-up provided by a large momentum term while still remaining responsive to small changes in the loss landscape.
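To see why one EMA cannot do both jobs, look at how quickly the contribution of an old gradient decays. The fraction of a gradient's initial weight that survives k steps later is β^k, so with β = 0.9 a gradient from 100 steps ago is essentially forgotten, while with β = 0.9999 it still keeps about 99% of its weight:

```python
# Fraction of a gradient's initial EMA contribution remaining after k more steps.
k = 100
for beta in (0.9, 0.9999):
    print(f"beta={beta}: remaining weight after {k} steps = {beta**k:.2e}")
# beta=0.9:    2.66e-05  -> effectively forgotten
# beta=0.9999: 9.90e-01  -> still ~99% of the original contribution
```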

Comparing AdamW and AdEMAMix on language modeling

How Does AdEMAMix Optimizer Work?

AdEMAMix is a novel optimization algorithm designed to improve upon traditional methods like Adam. Here's how it works:

  1. Moment Calculation: The algorithm maintains two types of "memories" of past gradients:

    1. A fast-changing memory of recent gradients (similar to Adam's first moment), along with a running average of squared gradients (similar to Adam's second moment)

    2. A slow-changing memory (which retains information from much older gradients).

  2. Gradient Mixing: Instead of using just the fast-changing memory, AdEMAMix combines it with the slow-changing memory. This combination allows the algorithm to benefit from both recent and older gradient information.

  3. Adaptive Step Size: The algorithm uses the squared gradients to adapt the step size for each parameter. This helps in adjusting the learning rate based on the historical gradient information.

  4. Parameter Update: The final update to the model parameters is calculated using:

    1. The mixed gradient information from step 2

    2. The adaptive step size from step 3

    3. A weight decay term to help prevent overfitting

  5. Memory Update: After each update, the algorithm refreshes its memories (fast, slow, and squared gradients). This ensures that the next iteration has access to the most recent information.

AdEMAMix Optimizer algorithm

The key innovation of AdEMAMix is its ability to leverage both recent and very old gradient information effectively. This helps it navigate complex loss landscapes more efficiently, which potentially leads to faster convergence and better final results compared to traditional methods like Adam.
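Putting these steps together, a simplified single-tensor update in the spirit of AdEMAMix might look like the sketch below. It assumes an AdamW-style layout with bias correction on the fast moments only, a slow EMA coefficient beta3, and a mixing weight alpha; the paper also warms up alpha and beta3 over training, which is omitted here, so treat this as an approximation rather than a reference implementation.

```python
import numpy as np

def ademamix_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                  beta3=0.9999, alpha=5.0, eps=1e-8, weight_decay=0.01):
    """One simplified AdEMAMix-style update for a single parameter array."""
    state["t"] += 1
    t = state["t"]
    # Fast EMA of gradients (Adam's first moment) and EMA of squared gradients.
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grad
    state["nu"] = beta2 * state["nu"] + (1 - beta2) * grad ** 2
    # Slow EMA that retains information from much older gradients.
    state["m2"] = beta3 * state["m2"] + (1 - beta3) * grad
    # Bias correction for the Adam-style terms.
    m1_hat = state["m1"] / (1 - beta1 ** t)
    nu_hat = state["nu"] / (1 - beta2 ** t)
    # Mix fast and slow momentum, scale by the adaptive step size, add weight decay.
    update = (m1_hat + alpha * state["m2"]) / (np.sqrt(nu_hat) + eps)
    return theta - lr * (update + weight_decay * theta)

# Usage: initialize the optimizer state once, then call ademamix_step per gradient.
theta = np.zeros(10)
state = {"t": 0, "m1": np.zeros_like(theta),
         "nu": np.zeros_like(theta), "m2": np.zeros_like(theta)}
```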

Comparing Adam and AdEMAMix on the Rosenbrock function.

Evaluating AdEMAMix Optimizer

Switching to AdEMAMix during training improved performance. The earlier the switch, the better the final loss. This indicates that AdEMAMix’s benefits are not just due to early training dynamics but also late training improvements.

  1. Transformer LLM Training:

    • Setup: Used transformer models with 110M, 335M, and 1.3B parameters. Training involved sequences of 1,024 tokens and a cosine learning rate decay schedule.

    • Results: AdEMAMix outperformed AdamW in all model sizes. For example, a 110M parameter model trained for 256k iterations with AdEMAMix matched the performance of an AdamW model trained for 500k iterations.

  2. Vision Transformer (ViT) Training:

    • Setup: Trained ViT models with 24M and 86M parameters on ImageNet datasets.

    • Results: AdEMAMix consistently reduced training loss more efficiently than AdamW, especially in scenarios with large data volumes relative to model size. For example, a 24M parameter model trained on 11M images showed significant improvements with AdEMAMix.

Training time comparison & starting AdEMAMix from AdamW.

AdEMAMix was slightly slower than AdamW due to the additional EMA. However, the performance gains outweighed the extra time. For instance, training a 1.3B parameter model with AdEMAMix for 770k steps was as effective as training an AdamW model for 1.5M iterations.
