EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds

1State Key Lab of CAD&CG, Zhejiang University, 2The Chinese University of Hong Kong, 3Shanghai Jiao Tong University, 4Shanghai Artificial Intelligence Laboratory
ICCV 2025

EgoAgent jointly learns (i) visual representations of egocentric observations, (ii) 3D human action prediction through skeletal motion, and (iii) future world state prediction in semantic feature space. We demonstrate that these three capabilities can be acquired through imitation of real-world human interactions and experiences.

Abstract

Learning an agent model that behaves like humans—capable of jointly perceiving the environment, predicting the future, and taking actions from a first-person perspective—is a fundamental challenge in computer vision.

Existing methods typically train separate models for these abilities, which fails to capture their intrinsic relationships and prevents them from learning from each other. Inspired by how humans learn through the perception-action loop, we propose EgoAgent, a unified agent model that simultaneously learns to represent, predict, and act within a single transformer.

EgoAgent explicitly models the causal and temporal dependencies among these abilities by formulating the task as an interleaved sequence of states and actions. It further introduces a joint embedding–action–prediction architecture with temporally asymmetric predictor and observer branches, enabling synergistic optimization across all three capabilities. Comprehensive evaluations of EgoAgent on representative tasks such as image classification, egocentric future state prediction, and 3D human motion prediction demonstrate the superiority of our method.
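To make the interleaved formulation concrete, here is a minimal PyTorch sketch of one way such a model could be organized. The module names, feature dimensions, and the use of precomputed image features are our own assumptions for illustration, not the paper's implementation; the temporally asymmetric observer (target) branch is omitted here and sketched separately below.

# Hedged sketch of an interleaved state-action transformer (not the official code).
import torch
import torch.nn as nn

class JointPredictiveAgent(nn.Module):
    def __init__(self, dim=256, num_layers=4, action_dim=66):
        super().__init__()
        # Observer branch: maps raw observation features to semantic state tokens.
        self.observer = nn.Sequential(nn.Linear(512, dim), nn.GELU(), nn.Linear(dim, dim))
        # Shared causal transformer over the interleaved (state, action) sequence.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Predictor branch: predicts the next world state in feature space.
        self.state_head = nn.Linear(dim, dim)
        # Action head: predicts the next 3D skeletal action.
        self.action_head = nn.Linear(dim, action_dim)
        self.action_embed = nn.Linear(action_dim, dim)

    def forward(self, obs_feats, actions):
        # obs_feats: (B, T, 512) precomputed image features (assumption)
        # actions:   (B, T, action_dim) skeletal motions
        states = self.observer(obs_feats)       # (B, T, dim)
        act_tok = self.action_embed(actions)    # (B, T, dim)
        B, T, D = states.shape
        # Interleave tokens as s_1, a_1, s_2, a_2, ...
        seq = torch.stack([states, act_tok], dim=2).reshape(B, 2 * T, D)
        mask = nn.Transformer.generate_square_subsequent_mask(2 * T)
        h = self.backbone(seq, mask=mask)
        # State positions predict the next action; action positions predict the next state.
        pred_actions = self.action_head(h[:, 0::2])
        pred_states = self.state_head(h[:, 1::2])
        return pred_states, pred_actions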

Video

Experiments and Findings

Synergy between the Three Learning Tasks

We demonstrate that removing any one of the three tasks (prediction, action, and representation) during training (lines b–d), or training on a single task alone (lines e–g), degrades performance compared to the full joint model. Additionally, we find that learning in a high-level semantic feature space reduces the overall learning difficulty compared to learning in a pixel-level latent space from a pretrained VQGAN (lines h–i).

Ablation Study Table
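For illustration, the sketch below shows one plausible way the future-state and action objectives could be combined, with an EMA-updated target encoder providing feature-space targets. The loss types, weights, and momentum value are assumptions, not the paper's exact training recipe.

# Hedged sketch of a joint objective over prediction and action (assumed losses/weights).
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target, online, momentum=0.996):
    # Slowly track the online observer to produce stable feature-space targets.
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.data.mul_(momentum).add_(p_o.data, alpha=1.0 - momentum)

def joint_loss(pred_states, pred_actions, target_states, gt_actions,
               w_state=1.0, w_action=1.0):
    # Future-state prediction in semantic feature space (targets are detached).
    state_loss = F.smooth_l1_loss(pred_states, target_states.detach())
    # 3D skeletal action regression.
    action_loss = F.mse_loss(pred_actions, gt_actions)
    return w_state * state_loss + w_action * action_loss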

Feature Map and 3D Action Prediction Visualization

EgoAgent extracts semantic features from first-person RGB images and predicts the next 3D human action, as if anticipating what a person would do next based on what they have seen.

World Model Predictions in Learned Semantic Space

EgoAgent also predicts the global semantics of the next world state after the predicted action is taken. These predicted states can be interpreted and verified through image retrieval within the video.

World Model by Retrieval
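As a rough illustration of this verification step, a predicted state feature can be compared against frame features from the same video and the top matches retrieved. The snippet below sketches this with cosine similarity; the shapes and similarity metric are assumptions for illustration.

# Hedged sketch of verifying predicted world states by nearest-neighbor frame retrieval.
import torch
import torch.nn.functional as F

def retrieve_frames(pred_state, frame_feats, topk=5):
    # pred_state:  (D,) predicted next-state feature from the agent
    # frame_feats: (N, D) features of all video frames from the same encoder
    sims = F.cosine_similarity(pred_state.unsqueeze(0), frame_feats, dim=-1)  # (N,)
    scores, indices = sims.topk(topk)
    return indices, scores  # indices of frames that best match the prediction

# Example: retrieve the 5 frames closest to a predicted state.
pred = torch.randn(256)
frames = torch.randn(1000, 256)
idx, score = retrieve_frames(pred, frames)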

Transferring to Egocentric Embodied Manipulation

We freeze the pretrained EgoAgent vision-action backbone and add a three-layer MLP as the policy network, which we train on the TriFinger benchmark. EgoAgent achieves higher success rates than vision-only representation models trained on egocentric datasets, such as DINO and DoRA.
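A minimal sketch of this transfer setup is shown below, assuming a generic frozen encoder and placeholder feature and action dimensions; it is not the actual TriFinger training code.

# Hedged sketch: frozen pretrained backbone + trainable three-layer MLP policy head.
import torch
import torch.nn as nn

class FrozenBackbonePolicy(nn.Module):
    def __init__(self, backbone, feat_dim=768, hidden_dim=256, action_dim=9):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep pretrained representations fixed
        self.policy = nn.Sequential(  # three-layer MLP policy head
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, obs):
        with torch.no_grad():
            feats = self.backbone(obs)  # frozen feature extraction
        return self.policy(feats)

# Only the policy head's parameters are optimized:
# backbone = load_pretrained_egoagent_encoder()  # hypothetical loader
# model = FrozenBackbonePolicy(backbone)
# optim = torch.optim.Adam(model.policy.parameters(), lr=1e-4)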

Related Links

Several excellent works were introduced around the same time as ours.

HMA proposes a unified model that simultaneously learns to represent and generate action-conditioned video dynamics across diverse robotic domains and embodiments.

UVA introduces a joint video–action model that learns a shared latent representation to predict both video frames and corresponding actions efficiently.

BibTeX

@article{chen2025egoagent,
  title={EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds},
  author={Chen, Lu and Wang, Yizhou and Tang, Shixiang and Ma, Qianhong and He, Tong and Ouyang, Wanli and Zhou, Xiaowei and Bao, Hujun and Peng, Sida},
  journal={arXiv preprint arXiv:2502.05857},
  year={2025}
}