
Agent Laboratory: Using LLM Agents as Research Assistants

Plus more about Towards System 2 Reasoning in LLMs and Memory Layers at Scale

Jan 6th ~ Jan 12th
#38 Latest AI Research Explained Simply

🗞️ Industry News in 1 Line

  1. ♥ 1.2k Mistral has announced Codestral 25.01, a coding model with a 256k context window. It is available for testing on Continue.dev and in JetBrains IDEs; however, at the time of writing, its parameter count is unknown.

    Copilot Arena on Chatbot Arena LLM Leaderboard

  2. ♥ 710 Goodfire has announced Sparse Autoencoders (SAEs) for Llama 3.3 70B and Llama 3.1 8B, the first ever released for models at this scale and capability level. SAEs decompose neural activations into interpretable features, which helps us understand LLMs mechanistically.

  3. ♥ 3.6k Phi-4 14B weights are now available on Hugging Face. Phi-4, known as a small but strong STEM-focused model family, is primarily trained on data from filtered public-domain websites, acquired academic books, and Q&A datasets. Read the Phi-4 Technical Report.

    Phi-4 Benchmark from Phi-4 Technical Report

Ball Tracking + Camera Calibration = Football AI!

Learn how to combine the latest open source computer vision and machine learning techniques in action completely for free

From ball tracking in football…

To camera calibration for position tracking…

Roboflow’s Football AI Tutorial teaches you how to effectively develop a sophisticated open source vision pipeline from the ground up completely for free!

1.5 hours of pure learning on the latest CV techniques

In this video, you will learn about YOLOv8 Fine-Tuning, Multi-Object Tracking, Homography Application, Spatial Analysis, and more.

Agent Laboratory: Using LLM Agents as Research Assistants

Schmidgall et al. [AMD, Johns Hopkins University]

♥ 1k   LLM Agents

Introduction to Agent Laboratory

Writing scientific papers demands a great deal of time, money, and resources. Researchers must carefully prioritize their ideas based on predicted impact, which leaves many potentially valuable research concepts unexplored. In our previous issue, we explained how LLMs can generate novel research ideas, but those systems operate independently of human input and are limited in many ways.

Agent Laboratory addresses these challenges by introducing a human-centric autonomous research framework. Unlike previous approaches like ResearchAgent or The AI Scientist that generate their own research ideas, Agent Laboratory takes human research concepts as input and supports their development through a three-stage pipeline: literature review, experimentation, and report writing. The system produces comprehensive research outputs, including code repositories and research reports, and allows researchers to provide feedback at each stage. This approach maintains human creativity and direction while automating the time-consuming aspects of research implementation.

How Does Agent Laboratory Work?

The Agent Laboratory framework uses a sophisticated three-phase workflow designed to assist researchers with machine learning projects. The system takes advantage of multiple specialized AI agents that work collaboratively through different stages of the research process.

The first phase, Literature Review, is handled by a PhD agent that interfaces with the arXiv API. This agent performs iterative searches to gather relevant papers, creates summaries and extracts full texts as needed. The process continues until it reaches a predetermined maximum number of relevant papers (N=max), building a comprehensive foundation for the research.
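
As a rough illustration (this is not the paper's code), the PhD agent's iterative search could be built on the public arXiv export API, accumulating paper summaries until the predetermined maximum is reached. The query list and max_papers cutoff below stand in for the agent's own decisions:

from urllib.parse import quote
import feedparser  # pip install feedparser

# Rough sketch of an iterative literature-search loop over the public arXiv API
def literature_review(queries, max_papers=20):
    papers = []
    for query in queries:  # search queries proposed by the agent
        url = ("http://export.arxiv.org/api/query?"
               f"search_query=all:{quote(query)}&start=0&max_results=10")
        for entry in feedparser.parse(url).entries:
            papers.append({"title": entry.title, "summary": entry.summary})
            if len(papers) >= max_papers:  # stop at the predetermined maximum
                return papers
    return papers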

The Experimentation phase is slightly more complex and consists of three key components. First, PhD and Postdoc agents collaborate on plan formulation, defining specific experimental parameters and methodologies. Then, during the data preparation stage, an ML Engineer agent writes and tests code in Python. The most sophisticated part is experiment execution, which uses a specialized mle-solver module. This solver uses an iterative refinement process with five key mechanisms (a sketch of the loop follows the list below):

  1. Command Execution: Iteratively refines programs through REPLACE and EDIT operations

  2. Code Execution: Tests code compilation with automatic repair attempts (up to 3 tries)

  3. Program Scoring: Uses an LLM reward model to evaluate code effectiveness (0-1 scale)

  4. Self Reflection: Generates insights from successes and failures to improve future iterations

  5. Performance Stabilization: Maintains code quality through top program sampling and batch-parallelization
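
Here is a minimal sketch of how such a refinement loop might be wired together; the helper functions (llm_edit, llm_repair, run_code, score_program, reflect) are illustrative placeholders, not the paper's actual interface:

# Minimal sketch of an mle-solver-style refinement loop; llm_edit, llm_repair,
# run_code, score_program and reflect are hypothetical placeholders.
def refine(program, task, max_iters=10, max_repairs=3):
    best_program, best_score, insights = program, 0.0, []
    for _ in range(max_iters):
        candidate = llm_edit(best_program, task, insights)  # REPLACE/EDIT command
        for _ in range(max_repairs):                        # up to 3 repair attempts
            ok, error = run_code(candidate)
            if ok:
                break
            candidate = llm_repair(candidate, error)
        score = score_program(candidate, task)              # LLM reward model, 0-1 scale
        insights.append(reflect(candidate, score))          # self-reflection notes
        if score > best_score:                              # keep only the top program
            best_program, best_score = candidate, score
    return best_program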

The final Report Writing phase uses a paper-solver module operated by PhD and Professor agents. This module follows a structured approach with four main components: scaffold generation, ArXiv research integration, iterative report editing with LaTeX verification, and automated paper review based on NeurIPS conference guidelines. The system maintains academic standards while ensuring the research is presented in a clear, methodical format.

This entire pipeline is designed to be interactive, allowing human researchers to provide feedback at any stage while automating the time-consuming aspects of research implementation. 

Evaluating The Performance of Agent Laboratory 

Agent Laboratory was tested with three different backend models (gpt-4o, o1-mini, and o1-preview) across five research topics. The o1-preview backend proved the most useful for research assistance, achieving the highest usefulness score of 4.4/5 and report quality of 3.4/5. Interestingly, o1-mini demonstrated superior experimental quality with a score of 3.2/5, while gpt-4o consistently scored lower across all metrics.

The system's performance varied significantly by research topic, with the word order topic achieving the highest report quality (3.8/5) and usefulness (4.5/5), while the cognitive bias topic scored best in experimental quality (3.2/5). This variability suggests that Agent Laboratory's effectiveness may depend on both the chosen backend and the specific research domain.

The generated papers were evaluated against NeurIPS-style criteria, but the system showed mixed results. Human reviewers rated the papers below the typical NeurIPS acceptance threshold (5.9), with average overall ratings ranging from 3.5/10 (gpt-4o) to 4.0/10 (o1-preview).

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

Xiang et al. [SynthLabs.ai, Stanford University, UC Berkeley]

♥ 665   LLM Chain of Thought   bycloud’s pick  

Introduction to Meta Chain-of-Thought (Meta-CoT)

Although LLMs can handle straightforward problems, they often fail at more complex mathematical and logical reasoning tasks, even when these tasks have deterministic solutions. Traditional Chain-of-Thought (CoT) prompting is helpful in improving performance but it doesn't fully capture the non-linear and iterative nature of complex human-like reasoning. 

The paper introduces Meta Chain-of-Thought (Meta-CoT), a framework that extends traditional CoT by explicitly modeling the underlying reasoning process itself. Rather than just generating a linear sequence of thoughts, Meta-CoT attempts to capture the latent "thinking" process that includes exploration, verification, and iterative refinement. This is somewhat similar to Chain of Continuous Thought (Coconut), a recent paper we discussed in our newsletter. However, the Meta-CoT framework implements this differently:

  1. Process supervision and synthetic data generation using search algorithms like MCTS and A*

  2. Instruction tuning with linearized search traces

  3. Reinforcement learning post-training to internalize effective reasoning strategies

The authors propose integrating these components into a single end-to-end system that can perform more sophisticated reasoning tasks by better emulating the actual cognitive processes involved in complex problem-solving.

Scaling trends for verifier models

Inner Workings of Self-Taught Meta Chain-of-Thought Training

The researchers had two main motivations for creating Meta Chain-of-Thought (Meta-CoT) reasoning: they expected it would either produce more efficient LLMs or potentially give rise to superintelligent reasoning. They suspected that by incorporating search within the context window, models could leverage previously explored paths and handle semantically similar content more efficiently than traditional search approaches. Furthermore, this internalization potentially enables models to discover novel reasoning algorithms through reinforcement learning.

Meta-CoT uses the Self-Taught Reasoner (STaR) approach, which involves the following steps (a sketch of the pipeline follows the list below):

  1. Generating search traces and solutions using a base policy combined with search procedures

  2. Verifying correct solutions

  3. Creating training datasets from verified traces

  4. Training the model to internalize these search patterns using supervised learning
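
A hedged sketch of that data pipeline might look as follows; search, verify, and the final fine-tuning step are hypothetical stand-ins rather than the paper's implementation:

# Hedged sketch of a STaR-style pipeline for Meta-CoT; `search` and `verify` are
# placeholders for an MCTS/A*-style procedure and an answer checker.
def build_meta_cot_dataset(problems, policy, search, verify):
    dataset = []
    for problem in problems:
        trace, answer = search(policy, problem)  # 1. generate a search trace + solution
        if verify(problem, answer):              # 2. keep only verified-correct traces
            dataset.append({"prompt": problem,   # 3. linearized trace becomes the target
                            "target": trace + answer})
    return dataset
# 4. The model is then fine-tuned on `dataset` with ordinary supervised learning,
#    so the search behaviour is internalized into its own chain of thought.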

Reasoning via Planning (RAP) search procedure

The authors use a Meta-RL approach to solve the reasoning problem by treating it as a Partially Observable Markov Decision Process (POMDP), where the reward function for each prompt remains unknown until testing. This creates epistemic uncertainty - the model doesn't know beforehand which solutions will be accepted or rejected for new tasks. Unlike traditional reinforcement learning, which can perform poorly on novel reasoning domains, Meta-RL focuses on training agents to quickly explore and adapt to new environments. This teaches models how to learn across a distribution of tasks rather than optimizing for immediate rewards on individual problems.

The RL2 approach implements this by having the agent interact with tasks over multiple episodes while maintaining persistent memory. This allows the model to accumulate experience and refine its strategy. In language model applications, this framework extends beyond simple episodic learning to include capabilities like early termination and state resetting within context. 
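
As a rough sketch (not the paper's code), an RL2-style interaction might run several episodes of the same task while carrying one persistent context forward; `task` and `policy` here are hypothetical placeholders:

# Rough sketch of an RL2-style loop: the context persists across episodes of the
# same task, so later attempts can exploit what earlier attempts revealed.
def meta_episode(task, policy, num_episodes=4):
    context = []                           # persistent memory across episodes
    for _ in range(num_episodes):
        state, done = task.reset(), False  # the environment resets, the context does not
        while not done:
            action = policy(state, context)
            state, reward, done = task.step(action)
            context.append((state, action, reward))
    return context                         # used when updating the policy across tasks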

E-RL2 search objective

Evaluations and Results

This paper shows that the Tool-Integrated Reasoning (TIR) approach is comparable to traditional Chain-of-Thought (CoT) approaches despite using only 25% of the training data, and it shows superior scaling properties across all sample sizes. Models trained with a uniformly random number of in-context solutions (0-7 during training) demonstrate adaptive behavior and automatically allocate more computational resources to more difficult problems. For instance, Level 1 problems received an average of 2.45 solution attempts, while Level 5 problems received 5.84 attempts, which indicates successful internalization of complexity-based resource allocation.

The benefits of reinforcement learning for language model reasoning.

Furthermore, the most striking result is in the low-sample regime (2^0–2^3 samples), where TIR achieves approximately double the accuracy of CoT methods. This suggests that offloading computations to external tools significantly enhances problem-solving efficiency both during training and inference.

However, applying standard supervised fine-tuning on top of RL post-trained models degraded performance. The authors observed only a 2% improvement over the base LLaMA 3.1 8B Instruct model and attribute the limited gains to mismatches between instruction tuning and off-policy supervised fine-tuning. These findings collectively suggest that while Meta-CoT internalization is achievable, optimal training procedures require careful consideration of the interaction between pre-training, instruction-tuning, and fine-tuning phases.

If you like reading The AI Timeline, consider forwarding it to another research enthusiast. It helps us keep this up for free!

- bycloud

Memory Layers at Scale

Facebook AI Research at Meta

♥ 1k   LLM Architecture

What are Memory Layers

LLMs need massive computational resources to store and access information, since they primarily use dense feed-forward networks. This is inefficient, especially for storing simple associative information like facts, dates, and relationships.

This paper introduces an improved memory layer system that uses a trainable key-value lookup mechanism that adds parameters without increasing computational costs (FLOPs). The key innovation is scaling these memory layers far beyond previous attempts, reaching up to 128 billion parameters. The researchers replaced some feed-forward networks in transformer layers with these memory layers while keeping other components unchanged.

Parallel EmbeddingBag implementation for a “Memory Group” of two GPUs.

How Do Memory Layers Work?

The memory layer mechanism is similar to attention, but with two crucial differences:

  1. It uses trainable parameters for keys and values instead of activations

  2. It handles millions of key-value pairs, requiring sparse lookup

Here's how the core mechanism works:

import torch
# Basic memory layer operation: q is the query, K the keys, V the values, k the number selected
_, I = torch.topk(K @ q, k)            # indices of the k most relevant keys
s = torch.softmax(K[I] @ q, dim=-1)    # attention scores over the selected keys
y = s @ V[I]                           # weighted sum of the selected values

On the left, the regular memory layer. On the right, the Memory+ block, with the added projection, gating and silu non-linearity.

To make this system memory-efficient and fast, they implemented several key optimizations:

  1. Product-Key Lookup: Instead of storing N full keys of dimension n, they split keys into two smaller sets (K1, K2), each holding √N keys of dimension n/2. This reduces both memory usage and computation.

import torch
# Product-key lookup: split the query and search each half-key set separately
q1, q2 = q.chunk(2)                       # split the query into two halves
s1, I1 = torch.topk(K1 @ q1, k)           # top-k scores and indices in the first key set
s2, I2 = torch.topk(K2 @ q2, k)           # top-k scores and indices in the second key set
# Combine the k x k candidate pairs and keep the best overall matches
pair_scores = s1[:, None] + s2[None, :]   # combined score of every (i, j) key pair
best = torch.topk(pair_scores.flatten(), k).indices
final_indices = I1[best // k] * K2.shape[0] + I2[best % k]   # indices into the full key grid

  2. Parallel Memory Implementation:

    1. Values are sharded across GPUs along the embedding dimension

    2. Each GPU handles lookup for all indices but only on its portion of embeddings

    3. Results are aggregated across GPUs efficiently

    4. They achieved near-theoretical memory bandwidth (3TB/s on H100)

  3. Custom CUDA Kernels: They optimized the EmbeddingBag operation (see the sketch after this list) with three strategies:

    1. Atomic additions

    2. Row-level atomic locks (efficient for embedding dim > 128)

    3. Atomic-free using reverse indices

    These optimizations made the EmbeddingBag operation 6x faster than PyTorch's implementation.

  4. Stability Improvements: They added input-dependent gating as shown below:

import torch.nn.functional as F
# Enhanced output computation with input-dependent gating
gated_output = (y * F.silu(x @ W1)) @ W2    # silu(x) = x * sigmoid(x)

The system shares memory parameters across multiple layers (up to 3) to maximize parameter efficiency while maintaining performance. This architecture allows them to scale to 128B memory parameters while keeping computation costs low, making it particularly efficient for storing and retrieving factual information.
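
For intuition, the weighted aggregation of the selected values behaves like PyTorch's EmbeddingBag in sum mode with per-sample weights. Below is a single-GPU sketch with made-up sizes; the paper shards the embedding dimension across GPUs and uses custom kernels, both of which are omitted here:

import torch
import torch.nn as nn
# Single-GPU sketch of the value lookup as a weighted EmbeddingBag (sharding and
# custom kernels omitted; sizes are illustrative only)
num_values, dim, k, batch = 1_000_000, 1024, 32, 4
values = nn.EmbeddingBag(num_values, dim, mode="sum")
indices = torch.randint(0, num_values, (batch, k))      # top-k value indices per query
scores = torch.softmax(torch.randn(batch, k), dim=-1)   # attention scores per query
y = values(indices, per_sample_weights=scores)          # weighted sum of selected values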

Evaluating Performance Gains using Memory Layers

The researchers conducted three main sets of experiments to evaluate their memory-augmented architecture:

  1. Fixed Memory Size Comparison:

    1. Memory+ models with 1 million embeddings consistently outperformed dense baselines and MOE models

    2. They achieved performance comparable to dense models using 2-4x more compute

    3. The improved Memory+ variant (using 3 memory layers) performed better than both the vanilla Memory model and PEER architecture

    4. Performance gains were particularly strong on QA tasks, with the 1.3B Memory+ model achieving significant improvements on NaturalQuestions and TriviaQA

  2. Memory Scaling Analysis:

    1. With a fixed 1.3B base model, increasing memory size led to predictable improvements in factual QA performance

    2. At 128B memory parameters (64M keys), the model approached the performance of Llama 2 7B, despite using 10x less compute

    3. This demonstrated efficient scaling of knowledge capacity without proportional compute increases

  3. Large-Scale 8B Results:

    1. The 8B Memory+ model with 64B memory parameters showed strong performance across diverse tasks

    2. At 200B tokens, it demonstrated faster learning of factual information compared to dense baselines

    3. After 1T tokens, it approached the performance of Llama 3.1 8B (trained on 15T tokens) on multiple benchmarks

What makes this work significant is its practicality - the researchers have created a fully parallelizable implementation that works effectively at modern scale. This addresses previous challenges where memory layers were difficult to optimize for hardware accelerators. Their approach even outperforms mixture-of-experts models when matched for compute and parameters.

Benchmark results of 8B base model
