Learning an agent model that behaves like humans—capable of jointly perceiving the environment, predicting the future, and taking actions from a first-person perspective—is a fundamental challenge in computer vision.
Existing methods typically train separate models for these abilities, which fail to capture their intrinsic relationships and prevent them from learning from each other. Inspired by how humans learn through the perception-action loop, we propose EgoAgent, a unified agent model that simultaneously learns to represent, predict, and act within a single transformer.
EgoAgent explicitly models the causal and temporal dependencies among these abilities by formulating the task as an interleaved sequence of states and actions. It further introduces a joint embedding–action–prediction architecture with temporally asymmetric predictor and observer branches, enabling synergistic optimization across all three capabilities. Comprehensive evaluations of EgoAgent on representative tasks such as image classification, egocentric future state prediction, and 3D human motion prediction demonstrate the superiority of our method.
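To make the formulation concrete, below is a minimal PyTorch-style sketch of the general idea: a single causal transformer consumes an interleaved state-action sequence, predicts the next state in feature space and the next action, and is supervised by a temporally asymmetric "observer" (an EMA copy of the image encoder that only sees the future frame). All module names, dimensions, and the EMA rule are illustrative assumptions rather than EgoAgent's released implementation.

```python
# Minimal sketch of a joint embedding-action-prediction setup over an
# interleaved [s_1, a_1, s_2, a_2, ...] sequence. Dimensions and modules are
# illustrative assumptions, not EgoAgent's actual code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

D, A = 256, 48  # feature dim and 3D human action dim (assumed values)

class FrameEncoder(nn.Module):
    """Stand-in image encoder: maps an RGB frame to a D-dim semantic feature."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=4), nn.GELU(),
            nn.Conv2d(32, 64, 4, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, D))
    def forward(self, x):                      # x: (B, 3, H, W)
        return self.net(x)

class JointPredictor(nn.Module):
    """Causal transformer over interleaved state/action tokens."""
    def __init__(self, depth=4, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D, heads, 4 * D, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.act_in = nn.Linear(A, D)          # lift actions into token space
        self.state_head = nn.Linear(D, D)      # next-state feature prediction
        self.action_head = nn.Linear(D, A)     # next 3D human action prediction

    def forward(self, state_feats, actions):   # (B, T, D), (B, T, A)
        tokens = torch.stack([state_feats, self.act_in(actions)], dim=2)
        tokens = tokens.flatten(1, 2)          # interleave -> (B, 2T, D)
        causal = torch.triu(torch.ones(tokens.size(1), tokens.size(1),
                                       dtype=torch.bool, device=tokens.device), 1)
        h = self.encoder(tokens, mask=causal)  # each token attends only to the past
        h_state, h_act = h[:, 0::2], h[:, 1::2]
        # Actions are predicted from state positions; next-state features are
        # predicted from action positions (i.e., after the action is "taken").
        return self.state_head(h_act), self.action_head(h_state)

encoder = FrameEncoder()
observer = copy.deepcopy(encoder)              # temporally asymmetric target branch
for p in observer.parameters():
    p.requires_grad_(False)
predictor = JointPredictor()

def training_step(frames, actions, tau=0.996):
    """frames: (B, T+1, 3, H, W) egocentric video; actions: (B, T, A) 3D motions."""
    B, T1 = frames.shape[:2]
    feats = encoder(frames[:, :-1].flatten(0, 1)).view(B, T1 - 1, D)
    pred_state, pred_action = predictor(feats, actions)
    with torch.no_grad():                      # observer encodes the *future* frames
        target = observer(frames[:, 1:].flatten(0, 1)).view(B, T1 - 1, D)
    loss_pred = 1 - F.cosine_similarity(pred_state, target, dim=-1).mean()
    loss_act = F.mse_loss(pred_action, actions)   # placeholder action supervision
    with torch.no_grad():                      # EMA update keeps the observer slow
        for po, pe in zip(observer.parameters(), encoder.parameters()):
            po.lerp_(pe, 1 - tau)              # (in practice, after the optimizer step)
    return loss_pred + loss_act
```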
We demonstrate that removing any one of the three tasks (prediction, action, and representation) during training (lines b–d), or training on a single task alone (lines e–g), degrades performance relative to the full model. Additionally, we find that learning in a high-level semantic feature space reduces the overall difficulty of learning compared to learning in a pixel-level latent space from a pretrained VQGAN (lines h–i).
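As a conceptual illustration (not the paper's code), the two target spaces lead to quite different objectives: a smooth regression loss on continuous semantic features versus classification over a large discrete codebook of pixel-level VQGAN latents. All shapes below are assumed.

```python
# Contrast between a feature-space target and a pixel-level VQGAN-code target.
import torch
import torch.nn.functional as F

B, T, D, V = 2, 8, 256, 1024                   # batch, time, feature dim, codebook size

# (a) semantic feature space: regress the observer's continuous features
pred_feat, target_feat = torch.randn(B, T, D), torch.randn(B, T, D)
loss_feat = 1 - F.cosine_similarity(pred_feat, target_feat, dim=-1).mean()

# (b) pixel-level latent space: classify each spatial token's discrete code
pred_logits = torch.randn(B, T, 16 * 16, V)    # assumed 16x16 code grid per frame
target_codes = torch.randint(V, (B, T, 16 * 16))
loss_pix = F.cross_entropy(pred_logits.flatten(0, 2), target_codes.flatten())
```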
EgoAgent extracts semantic features from first-person RGB images and predicts the next 3D human action, as if anticipating what a human would do next based on what they have seen.
EgoAgent also predicts the global semantics of the next world state after the predicted action is taken. These predicted states can be interpreted and verified through image retrieval within the video.
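For illustration, this retrieval-based verification can be as simple as ranking the video's own frame features by cosine similarity to the predicted state feature and inspecting the top matches; the function and variable names below are assumptions, not part of EgoAgent's release.

```python
# Verify a predicted world-state feature by nearest-neighbor image retrieval
# within the same video (illustrative sketch).
import torch
import torch.nn.functional as F

def retrieve_frames(pred_state, frame_feats, k=5):
    """pred_state: (D,) predicted next-state feature; frame_feats: (N, D) per-frame features."""
    sims = F.cosine_similarity(pred_state.unsqueeze(0), frame_feats, dim=-1)  # (N,)
    scores, idx = sims.topk(k)                 # indices of the most similar frames
    return idx.tolist(), scores.tolist()

# Example: check whether the prediction matches a near-future frame.
frame_feats = torch.randn(300, 256)            # e.g. features of 300 video frames
pred_state = torch.randn(256)
top_idx, top_scores = retrieve_frames(pred_state, frame_feats)
print(top_idx, top_scores)
```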
We freeze the pretrained EgoAgent vision-action backbone and train a three-layer MLP policy network on top of it on the TriFinger benchmark. EgoAgent achieves higher success rates than vision-only representation models, such as DINO and DoRA, trained on egocentric datasets.
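A rough sketch of this frozen-backbone protocol is shown below: only the three-layer MLP policy head is trained, e.g., by behavior cloning on top of fixed visual features. The feature and action dimensions and the data interface are placeholders rather than the actual TriFinger setup.

```python
# Frozen-backbone policy evaluation: fix the pretrained backbone and train
# only a three-layer MLP head (placeholder dimensions and interface).
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, ACT_DIM = 256, 9     # assumed dims; the TriFinger robot has 9 actuated joints

def build_policy(backbone: nn.Module) -> nn.Module:
    """Freeze the pretrained backbone and attach a trainable 3-layer MLP policy head."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    backbone.eval()
    return nn.Sequential(
        nn.Linear(FEAT_DIM, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, ACT_DIM))

def bc_step(backbone, policy_head, optimizer, obs, expert_actions):
    """One behavior-cloning step: regress expert actions from frozen features."""
    with torch.no_grad():
        feats = backbone(obs)                  # (B, FEAT_DIM) frozen visual features
    loss = F.mse_loss(policy_head(feats), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```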
Several excellent works were introduced concurrently with ours.
HMA proposes a unified model that simultaneously learns to represent and generate action-conditioned video dynamics across diverse robotic domains and embodiments.
UVA introduces a joint video–action model that learns a shared latent representation to predict both video frames and corresponding actions efficiently.
@article{chen2025egoagent,
  title={EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds},
  author={Chen, Lu and Wang, Yizhou and Tang, Shixiang and Ma, Qianhong and He, Tong and Ouyang, Wanli and Zhou, Xiaowei and Bao, Hujun and Peng, Sida},
  journal={arXiv preprint arXiv:2502.05857},
  year={2025}
}