Smarter AI Agents, Realistic Virtual Try-Ons, and Better Memory
How AI is Learning to Reason, Dress, and Remember: A Look at GLM-4.5, Voost, and Memp
Aug 4th ~ Aug 10th
#68 Latest AI Research Explained Simply
🗞️ Industry News in 1 Line
♥ 1.4k Pika Labs has announced a new audio-driven performance model designed to generate hyper-real expressions in video in near real-time. The new model can process video of any length or style in under six seconds and is rolling out now in the Pika Labs iOS app.
♥ 1.5k Microsoft has introduced Copilot 3D, which is a new experimental feature within Copilot Labs that enables users to generate a 3D model from a single 2D image. It lowers the barrier to entry for 3D creation for use in projects like gaming, animation, and 3D printing. The feature, which outputs models in the GLB format, is currently rolling out to a subset of users globally and requires a personal Microsoft Account for access on copilot.microsoft.com.
♥ 11k MiniMax has launched Speech 2.5, which is a text-to-speech and voice cloning model that now supports 40 languages. It can produce highly realistic voice clones that preserve the nuanced details of the original speaker, including their accent, age, and emotional tone. It can be used for anything from film dubbing by retaining an actor's original performance across languages to powering hyper-personalized digital assistants and enabling creators to generate audio content in their own voice. You can try it yourself at minimax.io.
AI leaders only: Get $100 to explore high-performance AI training data.
Train smarter AI with Shutterstock’s rights-cleared, enterprise-grade data across images, video, 3D, audio, and more, enriched by 20+ years of metadata. With 600M+ assets and scalable licensing, we help AI teams improve performance and simplify data procurement. If you’re an AI decision maker, book a 30-minute call; qualified leads may receive a $100 Amazon gift card.
For complete terms and conditions, see the offer page.
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
Zhipu AI & Tsinghua University
♥ 485 LLM Reasoning
Introduction to GLM-4.5
Open-source LLMs are improving by the day and can tackle an increasingly wide range of problems. However, it is still hard to combine agentic, reasoning, and coding capabilities in a single system. Proprietary models like Claude and GPT-4 excel in specific areas, but no open-source solution has matched their performance across all three.
GLM-4.5, a new Mixture-of-Experts (MoE) model, bridges this gap. Developed with 355 billion total parameters (32 billion activated per query), it introduces hybrid reasoning that switches between reflective "thinking" for complex tasks and direct responses for simpler queries. Trained on 23 trillion tokens, it targets real-world productivity by integrating tool interaction, logical deduction, and code generation into a single open framework.
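To make the hybrid-mode idea concrete, here is a toy dispatcher in Python. The keyword heuristic below is ours for illustration only; in GLM-4.5 the choice between reflective "thinking" and a direct response is learned by the model itself.

```python
# Minimal sketch of hybrid-mode routing (illustrative, not GLM-4.5's real
# dispatch logic): complex queries get a reflective "thinking" pass,
# simple ones get a direct answer.

def route_mode(prompt: str) -> str:
    # Toy heuristic standing in for the model's learned mode selection.
    hard_markers = ("prove", "debug", "step by step", "plan")
    return "thinking" if any(m in prompt.lower() for m in hard_markers) else "direct"

print(route_mode("What is the capital of France?"))          # direct
print(route_mode("Prove that the sum of two odds is even"))  # thinking
```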

Overview of the Slime RL infrastructure.
Inner Workings of GLM-4.5
GLM-4.5 uses a MoE architecture to balance computational efficiency and performance. Unlike dense models, it activates only 32 billion parameters per query via specialized "experts," reducing inference costs.
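As a rough sketch of how sparse activation works (with toy sizes, not GLM-4.5’s 355B/32B configuration), a top-k router sends each token to only a few experts, so most parameters never run:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy Mixture-of-Experts layer: a router scores all experts per token,
    but only the top-k experts run, so most parameters stay inactive."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):              # run only the selected experts
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```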
The design prioritizes depth over width, since more layers improve reasoning, and incorporates innovations like Grouped-Query Attention for faster processing and QK-Norm to stabilize training. During pre-training, data is carefully filtered: web content is ranked by quality, code repositories undergo rule-based and model-based screening, and math/science materials are up-sampled to boost reasoning.
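QK-Norm itself is compact enough to sketch: normalizing queries and keys bounds the attention logits, which is what stabilizes training. A minimal version, independent of GLM-4.5’s exact implementation:

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, eps=1e-6):
    """Attention with QK-Norm: L2-normalize queries and keys so their
    dot products stay bounded, which stabilizes training at scale."""
    q = q / (q.norm(dim=-1, keepdim=True) + eps)
    k = k / (k.norm(dim=-1, keepdim=True) + eps)
    scale = q.shape[-1] ** 0.5          # a learned scale in practice; fixed here
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

q, k, v = (torch.randn(2, 8, 16) for _ in range(3))  # (batch, seq, head_dim)
print(qk_norm_attention(q, k, v).shape)              # torch.Size([2, 8, 16])
```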

Pre-training and mid-training stages for GLM-4.5. We adopt a multi-stage training recipe and extend the sequence length from 4K to 128K.
The researchers used repo-level code training, which links files within GitHub projects to teach cross-file dependencies, together with synthetic reasoning data to enhance problem-solving. The context window expands from 4K to 128K tokens, accommodating long-form tasks like browsing agent trajectories.
For fine-tuning, they used a hybrid approach that combines supervised learning with reinforcement learning. They also used an XML-like function-calling template to minimize character escaping in code parameters, streamlining tool use.
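The escaping problem is easiest to see with a concrete tool call. In a JSON template, a code argument must escape every quote and newline, while an XML-like template carries it verbatim (the tag names below are illustrative, not GLM-4.5’s actual template):

```python
import json

code = 'print("hello")\nprint("world")'

# JSON-style tool call: the code argument needs quote and newline escaping.
json_call = json.dumps({"name": "run_python", "arguments": {"code": code}})
print(json_call)   # ... "code": "print(\"hello\")\nprint(\"world\")" ...

# XML-like template (illustrative tags): the payload is embedded verbatim,
# so nothing inside the argument body needs escaping.
xml_call = f"<tool_call name='run_python'>\n<code>\n{code}\n</code>\n</tool_call>"
print(xml_call)
```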
After training, they combined specialized "experts" (reasoning, agent, chat) through self-distillation. They used reinforcement learning to optimize each domain: curriculum learning escalates task difficulty dynamically, while single-stage training at full context length avoids capability loss.
Evaluation and Impact of GLM-4.5
GLM-4.5 achieves top-tier results across 12 benchmarks. On agentic tasks, it scores 70.1% on TAU-Bench (e.g., retail/airline simulations) and 26.4% on BrowseComp, outperforming Claude Opus 4 in web interactions.
For reasoning, it hits 91.0% on AIME 24 (math competitions) and 79.1% on GPQA (science questions), roughly on par with Claude Sonnet 4. In coding, it reaches 64.2% on SWE-bench Verified (real GitHub fixes) and 37.5% on Terminal-Bench, surpassing GPT-4.1. Despite having fewer parameters than rivals like Kimi K2, GLM-4.5 ranks 3rd overall among leading models and 2nd in agentic tasks.

The compact GLM-4.5-Air (106B parameters) also excels, matching larger models in efficiency. Both versions advance open-source AI by offering state-of-the-art reasoning and tool use.

Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
Lee and Kwak [NXN Labs]
♥ 22k Image Generation
Introduction to Virtual Try-On and Try-Off with Voost
We buy pretty much everything online, from clothes to books, so it should be no surprise that virtual try-on technology can revolutionize online shopping. However, accurately placing digital garments onto diverse body shapes is a stubborn challenge. Existing methods often struggle with pose variations, fabric details, or occlusions, leading to unrealistic results.
This paper tackles the problem with Voost, a streamlined framework that jointly learns two complementary tasks: virtual try-on (placing garments onto people) and virtual try-off (reconstructing garments from dressed images). This bidirectional approach strengthens garment-body reasoning without extra networks or labels.

Example of Virtual Try-On outputs produced by Voost
Inner Workings of Voost
Voost uses a single diffusion transformer to handle both try-on and try-off. The model combines a garment image and a person image into a horizontally concatenated layout, and a task token specifies both the operation direction (try-on or try-off) and the garment category.
For try-on, the person’s clothing region is masked, prompting the model to synthesize the dressed result. For try-off, the garment area is masked instead, asking the model to reconstruct the original garment. This shared setup exposes the model to diverse spatial relationships, improving its grasp of how fabrics interact with bodies.
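A toy version of this input construction might look as follows; the pixel-space layout and mask conventions here are simplifications of our own, since Voost actually operates on latents with a learned task token:

```python
import torch

def build_input(garment, person, task: str):
    """Concatenate garment and person images side by side and mask the
    region the model must synthesize (toy pixel-space layout; Voost works
    on latents with a learned task token rather than raw images)."""
    assert task in ("try_on", "try_off")
    canvas = torch.cat([garment, person], dim=-1)   # (C, H, 2W) side-by-side
    mask = torch.zeros_like(canvas[:1])             # 1 = region to generate
    W = garment.shape[-1]
    if task == "try_on":
        mask[..., :, W:] = 1.0   # hide the person's clothing side
    else:
        mask[..., :, :W] = 1.0   # hide the garment side for reconstruction
    return canvas * (1 - mask), mask

garment, person = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
x, m = build_input(garment, person, "try_on")
print(x.shape, m.shape)  # torch.Size([3, 64, 128]) torch.Size([1, 64, 128])
```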

During training, Voost adopts a flow matching strategy that simplifies optimization by predicting displacement vectors between noisy and clean latents. Only the transformer’s attention layers are fine-tuned, preserving the model’s generative capabilities while adapting to garment-body dynamics.
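The flow-matching objective itself is only a few lines: sample a point on the straight path between noise and data, then regress the displacement. A generic rectified-flow sketch, not Voost’s actual training code:

```python
import torch

def flow_matching_loss(model, clean_latents):
    """Generic flow-matching step: interpolate between noise and data,
    then train the model to predict the displacement (data - noise)."""
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.shape[0], 1, 1, 1)  # one timestep per sample
    x_t = (1 - t) * noise + t * clean_latents        # point on the straight path
    target = clean_latents - noise                   # displacement to regress
    pred = model(x_t, t.flatten())
    return torch.mean((pred - target) ** 2)

# e.g. with a stand-in model that just echoes its input:
print(flow_matching_loss(lambda x, t: x, torch.randn(4, 3, 8, 8)))
```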
The researchers use two techniques to improve inference accuracy: attention temperature scaling adjusts for resolution mismatches by modulating attention sharpness based on token counts, while self-corrective sampling alternates between try-on and try-off predictions to refine outputs using their mutual consistency.
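One plausible form of the temperature adjustment scales the attention logits by the ratio of log token counts between inference and training resolution, so attention does not diffuse at higher resolutions (a sketch of the idea; Voost’s exact schedule may differ):

```python
import math
import torch
import torch.nn.functional as F

def scaled_attention(q, k, v, n_train_tokens: int):
    """Temperature-scaled attention for resolution mismatch: when inference
    uses more tokens than training, sharpen the logits proportionally
    (one plausible schedule, not necessarily Voost's exact formula)."""
    n_test = q.shape[-2]
    temp = math.log(n_test) / math.log(n_train_tokens)  # >1 at higher resolution
    logits = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return F.softmax(logits * temp, dim=-1) @ v

q, k, v = (torch.randn(1, 4096, 64) for _ in range(3))
print(scaled_attention(q, k, v, n_train_tokens=1024).shape)  # (1, 4096, 64)
```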

Evaluation and Results of Voost
The researchers tested Voost against existing state-of-the-art models, and it outperformed specialized baselines across the VITON-HD and DressCode benchmarks. It achieved state-of-the-art scores in Fréchet Inception Distance (5.27 vs. 6.34 in prior work) and structural metrics like SSIM (0.898 vs. 0.881). User studies also favored Voost for photorealism and detail preservation.

Quantitative results on VITON-HD and DressCode for the try-on task.
Memp: Exploring Agent Procedural Memory
Fang et al. [Zhejiang University, Alibaba Group]
♥ 424 LLM Agents bycloud’s pick
Introduction to Procedural Memory in LLM Agents
LLM-based agents are getting pretty good at complex tasks like data analysis and web research. Still, they have a big limitation: their procedural memory is either rigidly hand-crafted or frozen within static parameters.
When environments shift, say, a website layout changes or a tool fails, agents can't adapt quickly, which forces them to restart tasks from scratch. This brittleness wastes time and computational resources.
The Memp framework tackles this by giving agents a dynamic, learnable procedural memory that evolves with experience. By converting past successes into reusable knowledge, Memp helps agents handle new challenges faster and more reliably.

With procedural memory, agents can improve their success rate (accuracy ↑) and execution efficiency (steps ↓) when solving similar tasks.
How Memp Builds and Uses Adaptive Memory
Memp transforms raw task trajectories into procedural memory through three key phases: Build, Retrieve, and Update. During the Build phase, it distills successful task completions into two formats: fine-grained step-by-step instructions (e.g., "grab the egg, microwave it for 30 seconds") and higher-level script abstractions (e.g., "heat perishable items before disposal"). This dual approach captures both concrete actions and general principles.
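A toy version of the Build phase, with data structures of our own design (Memp itself uses an LLM to perform the distillation):

```python
from dataclasses import dataclass

@dataclass
class ProceduralMemory:
    task: str
    steps: list[str]   # fine-grained step-by-step trajectory
    script: str        # higher-level abstraction of the same procedure
    uses: int = 0

def build_memory(task: str, trajectory: list[str], abstract) -> ProceduralMemory:
    """Distill a successful trajectory into dual-format memory.
    `abstract` stands in for an LLM call that summarizes the steps."""
    return ProceduralMemory(task=task, steps=trajectory, script=abstract(trajectory))

mem = build_memory(
    "heat an egg and discard it",
    ["grab the egg", "microwave it for 30 seconds", "throw it in the trash"],
    abstract=lambda steps: "heat perishable items before disposal",
)
print(mem.script)  # heat perishable items before disposal
```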
For the Retrieve phase, Memp uses vector similarity matching to fetch relevant memories when a new task arrives. Instead of random recall, it compares the new task’s description or keywords to stored memories, prioritizing the most relevant ones. For example, asking "How do I reheat leftovers?" might retrieve memories tagged with "microwave" or "food preparation."
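The Retrieve phase then reduces to nearest-neighbor search over task embeddings. In the sketch below, the tiny 3-dimensional vectors stand in for a real sentence-embedding model:

```python
import numpy as np

def retrieve(query_vec, memories, embeddings, top_k=1):
    """Return the stored memories whose task embeddings are most similar
    to the new task's embedding (cosine similarity)."""
    sims = embeddings @ query_vec / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    best = np.argsort(sims)[::-1][:top_k]
    return [memories[i] for i in best]

# Toy 3-dim "embeddings" standing in for a real embedding model.
memories = ["microwave the egg", "book a flight", "wash the mug"]
embeddings = np.array([[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.2, 0.0, 0.9]])
print(retrieve(np.array([0.8, 0.2, 0.1]), memories, embeddings))
# ['microwave the egg']
```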

The procedural memory framework.
The Update phase ensures memory stays current. After each task, Memp refines its repository by adding new successes, correcting flawed memories (e.g., if a step caused failure), and discarding outdated entries. This continuous tuning prevents memory bloat and keeps knowledge aligned with real-world dynamics. Together, these phases let agents bypass repetitive trial-and-error, directly applying proven strategies to similar problems.
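The Update phase is essentially repository maintenance. A minimal sketch, reusing the ProceduralMemory records from above and assuming each finished task reports a clear success flag:

```python
def update_repository(memories: list, new_memory, succeeded: bool, max_size=100):
    """Keep procedural memory current: add fresh successes, drop a memory
    that just led to a failure, and evict stale entries when over capacity."""
    if succeeded:
        memories.append(new_memory)
    elif new_memory in memories:
        memories.remove(new_memory)           # correct/discard a flawed entry
    if len(memories) > max_size:              # prevent memory bloat
        memories.sort(key=lambda m: m.uses)   # evict least-used entries first
        del memories[: len(memories) - max_size]
    return memories
```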
Performance Gains and Future Directions
On benchmarks like TravelPlanner and ALFWorld, Memp improved task success rates by up to 38% while cutting execution steps by 30–40%. For instance, in ALFWorld household tasks, agents using Memp completed "heat an egg and discard it" in 14 steps instead of 23, saving 685 tokens.
The framework also showed strong transferability: procedural memories built by powerful models like GPT-4o lifted the performance of smaller models like Qwen, raising their accuracy by 5% despite lower base capabilities. While this research is promising, it has one big limitation: it relies on clear success/failure signals, which aren’t always available in messy real-world scenarios.

Comparing trajectories with and without procedural memory shortens the process by nine steps and saves 685 tokens.