ManimAgent: Self-Evolving Multimodal Agents for Visual Education

¹University of Alberta    ²Southeast University    ³Virginia Tech    ⁴Emotion Machine Inc.    ⁵Vivavia Inc.
TL;DR. Reflection lets an agent recover within one task, but every lesson is thrown away before the next task — cross-task forgetting. ManimAgent carries that experience forward in a dual-channel Episodic Memory Bank grown from its own task stream, with no weight updates and no human seeds.

ManimAgent closes the cross-task forgetting gap that reflection alone leaves open. (Left) Without cross-task memory, each task is an isolated episode: on Task 1 the agent notices a wrong formula, revises locally, and reaches the correct scene; on Task 2 a structurally similar prompt triggers the same class of error and the full reflection cost is paid again. (Right) ManimAgent promotes each completed task into a dual-channel Episodic Memory Bank (EMB) that stores logical deductions that worked and blind guesses that failed, so on Task 2 it succeeds on the first try instead of re-deriving the same lesson.

Abstract

Multi-round reflection lets agents built on large language models recover from failures within a single task, but each task remains an isolated episode: lessons learned across many reflection rounds on one task are discarded before the next begins. We study this gap on a code-generation task: from a scientific paper section, the agent writes Python in the open-source Manim library to render a mathematical animation. We present ManimAgent, a self-evolving multimodal agent that carries reflection experience across tasks through a dual-channel Episodic Memory Bank grown entirely from its own task stream, with no weight updates and no human seeds.

After each animation converges, a vision–language model scores the rendered keyframes; the resulting signals populate a positive channel M+ that stores success rationales as soft Reference Examples, and a negative channel M− that stores validated failure patterns as hard Known Pitfalls. On a fixed-probe evaluation against no-memory, matched-budget retrieval-augmented generation, and shuffled-memory baselines, blind human Pass@1 rises and reflection rounds fall as memory size grows. We will release the code, frozen memory snapshots, and the task stream.

Key Contributions

Cross-task memory, no weight updates

A self-evolving multi-agent system narrows the inter-task gap with a VLM reward source and a dual-channel Episodic Memory Bank (EMB) — grown entirely from the agent's own task stream, with no human seeds and no fine-tuning.

Two complementary channels

Two LLM distillers turn each reflection trace into a free-form success rationale (M+, soft Reference Examples) and a structured failure lesson (M−, hard Known Pitfalls), each written only under explicit causal-attribution gates.

Fixed-probe evaluation

We release a JSON-indexed paper-section animation dataset with an output-level human-scoring protocol, and evaluate on frozen EMB snapshots in read-only mode — isolating cross-task memory from task-order and test-time-learning effects.
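
One way to picture read-only evaluation on a frozen snapshot is the minimal Python sketch below. It is an illustration under stated assumptions, not the released evaluation code: the JSON layout (one key per channel), the function names `load_frozen_snapshot` and `evaluate`, and the boolean `solve` callback are all hypothetical, and in the paper Pass@1 comes from blind human scoring rather than from an automatic check.

```python
import json
from types import MappingProxyType


def load_frozen_snapshot(snapshot_json: str) -> MappingProxyType:
    """Parse a memory snapshot and expose it as a read-only mapping.

    Assumed layout: {"positive": [...], "negative": [...]} (hypothetical).
    """
    data = json.loads(snapshot_json)
    # Tuples make each channel immutable; the proxy blocks key reassignment.
    frozen = {channel: tuple(entries) for channel, entries in data.items()}
    return MappingProxyType(frozen)


def evaluate(probes, snapshot, solve) -> float:
    """Return Pass@1 (%) over a fixed probe set.

    `solve(probe, snapshot)` is a hypothetical callback that may read the
    snapshot but cannot write to it, so nothing is learned at test time.
    """
    passed = sum(1 for probe in probes if solve(probe, snapshot))
    return 100.0 * passed / len(probes)
```

Because the snapshot cannot be mutated, every probe sees the same memory regardless of the order in which probes are run, which is the isolation property the fixed-probe protocol is after.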

Method

ManimAgent architecture: a VLM-guided repair loop with dual-channel cross-task memory.

ManimAgent closes a VLM-guided repair loop with dual-channel cross-task memory. A Storyboarder, Coder, Renderer, text Reviewer, VLM Reviewer, and Visual Reviser form a standard reflection pipeline. For each scene the loop has three nested layers: a memory-retrieval call, a text-reflection loop that repairs render crashes, and a visual-reflection loop that fixes renderable-but-weak output. The visual reviewer's score u(t) drives auto-pass and revision control, while the selected score u* and validated improvement transitions govern separate success- and failure-memory writes. The system reads from the EMB before generating and writes to it after converging.
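
The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every callback name (`retrieve`, `generate`, `render`, `repair`, `score`, `revise`), the round limits, and the `AUTO_PASS` value are assumptions introduced here, and nesting the text loop inside the visual loop is one plausible reading of "three nested layers".

```python
from typing import Callable

AUTO_PASS = 85  # assumed auto-pass threshold; the page states 85 only for M+ writes


def run_scene(
    prompt: str,
    retrieve: Callable[[str], str],
    generate: Callable[[str, str], str],
    render: Callable[[str], tuple[bool, str]],
    repair: Callable[[str, str], str],
    score: Callable[[str], float],
    revise: Callable[[str, float], str],
    max_text_rounds: int = 3,
    max_visual_rounds: int = 3,
) -> tuple[str, float]:
    """Return the best code seen and its selected score u*."""
    context = retrieve(prompt)                 # layer 1: read from memory
    code = generate(prompt, context)
    best_code, best_score = code, float("-inf")
    for _ in range(max_visual_rounds):         # layer 3: visual reflection
        ok, log = render(code)
        for _ in range(max_text_rounds):       # layer 2: text reflection on crashes
            if ok:
                break
            code = repair(code, log)
            ok, log = render(code)
        if not ok:
            break                              # still crashing: give up on this scene
        u = score(code)                        # visual reviewer's score u(t)
        if u > best_score:
            best_code, best_score = code, u    # track the selected score u*
        if u >= AUTO_PASS:
            break                              # good enough: stop revising
        code = revise(code, u)
    return best_code, best_score
```

Memory writes are deliberately left out of this function; in the description above they happen only after the loop has converged.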

M+: success rationales (u* ≥ 85), injected as soft Reference Examples
M−: validated failure patterns, injected as hard Known Pitfalls
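
As a rough illustration of the two channels and their write gates, consider the sketch below. The class and method names are hypothetical, the M+ gate uses the u* ≥ 85 threshold stated above, and the M− gate (store a failure lesson only when the revision raised the score) is an assumed reading of "validated", not a confirmed rule.

```python
from dataclasses import dataclass, field

SUCCESS_THRESHOLD = 85  # u* >= 85 gate for M+, as stated on this page


@dataclass
class EpisodicMemoryBank:
    """Hypothetical two-channel store; field names are illustrative."""

    positive: list[str] = field(default_factory=list)   # M+: free-form success rationales
    negative: list[dict] = field(default_factory=list)  # M-: structured failure lessons

    def write_success(self, rationale: str, selected_score: float) -> bool:
        # Store a rationale only when the selected score u* clears the threshold.
        if selected_score >= SUCCESS_THRESHOLD:
            self.positive.append(rationale)
            return True
        return False

    def write_failure(self, pattern: str, fix: str,
                      score_before: float, score_after: float) -> bool:
        # Assumed gate: a lesson counts as validated only if the fix raised the score.
        if score_after > score_before:
            self.negative.append({"pattern": pattern, "fix": fix})
            return True
        return False

    def render_prompt_context(self) -> str:
        # Soft examples and hard pitfalls are injected under separate headings.
        lines = ["Reference Examples (soft):"]
        lines += [f"- {r}" for r in self.positive]
        lines.append("Known Pitfalls (hard):")
        lines += [f"- avoid: {p['pattern']}; instead: {p['fix']}" for p in self.negative]
        return "\n".join(lines)
```

Keeping the channels separate lets the prompt present successes as optional guidance and failures as constraints, which matches the soft versus hard distinction in the legend.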

Results

Human Pass@1 (%), EMB@0 → EMB@200: 62.0 → 84.9
Reflection rounds (lower is better): 12.2 → 6.5
Human quality (1–5): 3.26 → 3.88

EMB@K scales monotonically and overtakes the strong static-RAG baseline by K=200. The solid blue curve tracks the frozen EMB snapshot at K ∈ {0, 50, 100, 200} on each of the three headline metrics; horizontal references are Manim-code RAG (dashed red), Random EMB (dash-dot orange), and VLM-only (dotted grey). For Human Pass@1 and quality the EMB curve crosses Manim-code RAG between K=100 and K=200; for reflection rounds the gap to RAG widens sharply at K=200 (6.5 vs. 11.4).
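
To put the headline numbers in perspective, the short Python snippet below computes absolute and relative changes. It assumes each reported pair is ordered EMB@0 then EMB@200 (the page states this explicitly only for Pass@1); the dictionary keys and helper names are illustrative.

```python
# Headline numbers copied from this page, assumed ordered (EMB@0, EMB@200).
headline = {
    "human_pass_at_1_pct": (62.0, 84.9),
    "reflection_rounds": (12.2, 6.5),
    "human_quality_1_to_5": (3.26, 3.88),
}


def absolute_change(before: float, after: float) -> float:
    """Difference after - before, rounded to two decimals."""
    return round(after - before, 2)


def relative_change_pct(before: float, after: float) -> float:
    """Percentage change relative to the EMB@0 value, rounded to one decimal."""
    return round(100.0 * (after - before) / before, 1)


summary = {
    name: (absolute_change(b, a), relative_change_pct(b, a))
    for name, (b, a) in headline.items()
}

for name, (delta, pct) in summary.items():
    print(f"{name}: {delta:+} ({pct:+}%)")
```

Under that assumption, Pass@1 rises by 22.9 points, reflection rounds fall by 5.7 (about 46.7%), and human quality rises by 0.62 on the 1–5 scale.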

BibTeX

@misc{jiang2026manimagentselfevolvingmultimodalagents,
  title         = {ManimAgent: Self-Evolving Multimodal Agents for Visual Education},
  author        = {Wenjia Jiang and Zongyuan Cai and Yuanhang Shao and Chenru Wang
                   and Boyan Han and Zhixue Song and Keyu Chen and Shengwei An
                   and Xu Yang and Zhou Yang},
  year          = {2026},
  eprint        = {2606.30296},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2606.30296}
}