World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

Zuyao Lin; Jianhui Zhang; Peidong Jia; Xiaoguang Zhao; Shanghang Zhang; Xingyu Chen

Zuyao Lin^1,2,3; Jianhui Zhang^3,4; Peidong Jia^5; Xiaoguang Zhao^1; Shanghang Zhang^5; Xingyu Chen^3,✉

^1 Institute of Automation, Chinese Academy of Sciences; ^2 University of Chinese Academy of Sciences; ^3 Zhongguancun Academy; ^4 Shanghai Jiao Tong University; ^5 Peking University
^✉ Corresponding author

WEM decomposes embodied future prediction into persistent world evolution and robot-centric ego dynamics for long-horizon hybrid navigation-manipulation tasks.

Abstract

World models are widely explored in embodied intelligence, yet they typically predict the distinct evolutions of the world and the ego within a single stream. Here the world captures persistent, instruction-agnostic scene regularities, and the ego captures robot-centric, instruction-conditioned dynamics. This world-ego entanglement degrades performance in long-horizon embodied scenarios, particularly in hybrid tasks that interleave navigation and manipulation. In this paper, we introduce World-Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives (motion-, semantic-, and intention-based views) and analyze three disentanglement strategies: post-, pre-, and full decoupling. We then instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit planner with separate world and ego states and a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K training video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.

Highlights

World-Ego Modeling. We propose World-Ego Modeling, a new conceptual paradigm for embodied world models that decomposes future evolution into the world and the ego. We define the world-ego boundary from motion-, semantic-, and intention-based views and analyze the necessity of world-ego disentanglement for embodied evolution.
World-Ego Model. We design WEM, a video-based embodied world model that instantiates world-ego modeling with a role-conditioned attention (RCA) planner and a CP-MoE generator, targeting long-horizon video rollout for hybrid navigation-manipulation tasks.
HTEWorld Benchmark. We construct HTEWorld, the first training dataset, benchmark, and metric protocol for long-horizon world evolution with hybrid navigation-manipulation behaviors. WEM achieves state-of-the-art performance on HTEWorld and remains competitive on existing manipulation-only benchmarks.

World-Ego Modeling: Concept, Model, and Benchmark

This section expands the three main contributions: the World-Ego Modeling paradigm, the World-Ego Model architecture, and the Hybrid-Task Embodied World Benchmark.

World-Ego Modeling

We treat the world and the ego as two predictive roles for embodied evolution. The world-ego boundary is defined from motion-, semantic-, and intention-based views, and the necessity of world-ego disentanglement is studied through post-, pre-, and full decoupling strategies.

WEM

We instantiate the paradigm as the World-Ego Model (WEM) under the semantic-based world-ego view with full disentanglement. WEM uses a role-conditioned attention planner to infer separate world and ego states, and a cascade-parallel mixture-of-experts (CP-MoE) generator to produce long-horizon video rollouts.
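To make the separation of roles concrete, the following toy sketch runs a multi-turn rollout loop in which the world branch never sees the instruction while the ego branch does. It is not the WEM implementation: `WorldState`, `EgoState`, `plan_world`, `plan_ego`, `generate_chunk`, and `rollout` are hypothetical names, the states are single integers rather than learned latents, and the "frames" are integers standing in for video frames.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WorldState:
    """Toy stand-in for the persistent, instruction-agnostic scene state."""
    scene_code: int


@dataclass(frozen=True)
class EgoState:
    """Toy stand-in for the robot-centric, instruction-conditioned state."""
    intent_code: int


def plan_world(prev: WorldState, last_frame: int) -> WorldState:
    # Hypothetical world branch: uses the previous world state and the
    # latest frame, but never the instruction text.
    return WorldState(scene_code=(prev.scene_code + last_frame) % 97)


def plan_ego(prev: EgoState, last_frame: int, instruction: str) -> EgoState:
    # Hypothetical ego branch: additionally conditioned on the instruction.
    text_code = sum(ord(ch) for ch in instruction)
    return EgoState(intent_code=(prev.intent_code + last_frame + text_code) % 97)


def generate_chunk(world: WorldState, ego: EgoState, start: int, length: int) -> List[int]:
    # Placeholder generator: each "frame" is an integer mixing both states.
    return [(start + world.scene_code + ego.intent_code + t) % 256 for t in range(length)]


def rollout(instructions: List[str], first_frame: int = 0,
            chunk_len: int = 4) -> Tuple[List[int], List[WorldState]]:
    """Generate one chunk of frames per instruction, carrying both states forward."""
    world, ego = WorldState(0), EgoState(0)
    frames = [first_frame]
    world_trace = []
    for instruction in instructions:
        world = plan_world(world, frames[-1])
        ego = plan_ego(ego, frames[-1], instruction)
        world_trace.append(world)
        frames.extend(generate_chunk(world, ego, frames[-1], chunk_len))
    return frames, world_trace


frames, world_trace = rollout(["Pick up the soda can.", "Raise the soda can."])
print(len(frames), world_trace)
```

The only point of the sketch is the interface: `plan_world` has no instruction argument, so two rollouts that start from the same frame share the same first world state whatever they are asked to do, whereas `plan_ego` changes with the instruction.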

HTEWorld

We construct the Hybrid-Task Embodied World Benchmark (HTEWorld) for long-horizon world evolution with unified navigation-manipulation behaviors. It provides training clips, multi-turn evaluation trajectories, and metrics for evaluating continuous hybrid-task rollouts.
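As a rough sense of scale, the rounded figures quoted in the abstract (125K clips, over 4.5M frames, 300 trajectories, over 2K instructions) imply the averages computed below. The variable names are ours, and because the frame and instruction counts are stated as "over", the results are approximate lower bounds rather than exact dataset statistics.

```python
# Rounded scale figures as quoted in the abstract; "over" means lower bounds.
train_clips = 125_000
train_frames = 4_500_000
eval_trajectories = 300
eval_instructions = 2_000

frames_per_clip = train_frames / train_clips
instructions_per_trajectory = eval_instructions / eval_trajectories

print(f"frames per training clip: about {frames_per_clip:.0f}")
print(f"instructions per evaluation trajectory: about {instructions_per_trajectory:.1f}")
```

On these figures, a training clip averages about 36 frames and an evaluation trajectory averages about 6.7 instructions.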

General framework of World-Ego Modeling. (a) A state predictor infers separate world-ego states from vision-language tokens. CP-MoE is designed to form different degrees of world-ego decoupling. (b) Pre-disentanglement. (c) Post-disentanglement. (d) Full disentanglement.

Overview of the World-Ego Model. The predictor takes multi-turn instructions for hybrid navigation-manipulation tasks and predicts long-horizon world and ego states. The generator separately evolves the world and ego with the generated semantic proxy.

Statistics of the proposed HTEWorld benchmark. HTEWorld provides large-scale training clips and multi-turn evaluation trajectories for hybrid embodied world modeling. Left. Hybrid-task vocabulary spanning manipulation, navigation, objects, and scenes. Middle. Training-set composition, including training/evaluation scale, action-oriented clip types, and annotation categories. Right. Evaluation-trajectory composition, including instruction-round distribution and the manipulation/navigation proportion at each length.

Quantitative Results

We report the two main HTEWorld comparisons from the paper. Higher is better for all metrics.

HTEWorld Results Under WorldArena Metrics

Comparison on HTEWorld using WorldArena's normalized metric suite. The table keeps the full metric breakdown, with expanded headers for readability on the project page.

Model	EWMScore	Visual Quality / Image Quality	Visual Quality / Aesthetic Quality	Visual Quality / JEPA Similarity	Motion Quality / Dynamic Degree	Motion Quality / Flow Score	Motion Quality / Motion Smoothness	Content Consistency / Subject Consistency	Content Consistency / Background Consistency	Content Consistency / Photometric Consistency	Physics Adherence / Interaction Quality	Physics Adherence / Trajectory Accuracy	3D Accuracy / Depth Accuracy	3D Accuracy / Perspectivity	Controllability / Instruction Following	Controllability / Semantic Alignment	Controllability / Action Following
WoW-7B	53.44	64.72	49.74	1.30	22.76	25.49	67.74	63.06	66.86	38.08	80.90	28.16	82.41	95.14	78.42	86.75	3.47
Cosmos-Predict 2.5-2B	54.83	64.40	50.21	1.26	24.62	27.23	69.33	66.88	71.54	42.53	83.02	28.78	82.11	95.28	79.60	87.05	3.52
Cosmos-Predict 2.5-14B	55.41	62.14	50.02	1.38	29.37	32.34	71.63	68.65	73.85	35.48	84.70	28.81	82.60	94.40	80.20	87.46	3.55
PAN-style Baseline	58.40	65.48	49.00	1.70	38.14	47.43	79.47	74.79	80.24	33.15	86.33	28.75	82.39	95.08	80.40	87.67	4.42
WEM	61.48	66.82	50.30	2.49	41.52	49.21	82.70	82.07	87.92	35.95	90.80	34.51	84.55	97.60	82.00	90.74	4.50
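The table is wide, so a small script helps to see where WEM leads and by how much. The sketch below copies the values from the table and, for each metric, reports the strongest baseline and WEM's margin over it; `margins_over_best_baseline` is a helper name introduced here for illustration only.

```python
METRICS = [
    "EWMScore", "Image Quality", "Aesthetic Quality", "JEPA Similarity",
    "Dynamic Degree", "Flow Score", "Motion Smoothness", "Subject Consistency",
    "Background Consistency", "Photometric Consistency", "Interaction Quality",
    "Trajectory Accuracy", "Depth Accuracy", "Perspectivity",
    "Instruction Following", "Semantic Alignment", "Action Following",
]

# Values copied row by row from the WorldArena-metric table.
ROWS = {
    "WoW-7B": [53.44, 64.72, 49.74, 1.30, 22.76, 25.49, 67.74, 63.06, 66.86,
               38.08, 80.90, 28.16, 82.41, 95.14, 78.42, 86.75, 3.47],
    "Cosmos-Predict 2.5-2B": [54.83, 64.40, 50.21, 1.26, 24.62, 27.23, 69.33, 66.88, 71.54,
                              42.53, 83.02, 28.78, 82.11, 95.28, 79.60, 87.05, 3.52],
    "Cosmos-Predict 2.5-14B": [55.41, 62.14, 50.02, 1.38, 29.37, 32.34, 71.63, 68.65, 73.85,
                               35.48, 84.70, 28.81, 82.60, 94.40, 80.20, 87.46, 3.55],
    "PAN-style Baseline": [58.40, 65.48, 49.00, 1.70, 38.14, 47.43, 79.47, 74.79, 80.24,
                           33.15, 86.33, 28.75, 82.39, 95.08, 80.40, 87.67, 4.42],
    "WEM": [61.48, 66.82, 50.30, 2.49, 41.52, 49.21, 82.70, 82.07, 87.92,
            35.95, 90.80, 34.51, 84.55, 97.60, 82.00, 90.74, 4.50],
}


def margins_over_best_baseline(rows, metrics, ours="WEM"):
    """Map each metric to (best baseline name, ours minus that baseline's score)."""
    result = {}
    for i, metric in enumerate(metrics):
        best_name, best_value = max(
            ((name, values[i]) for name, values in rows.items() if name != ours),
            key=lambda item: item[1],
        )
        result[metric] = (best_name, round(rows[ours][i] - best_value, 2))
    return result


margins = margins_over_best_baseline(ROWS, METRICS)
not_best = [metric for metric, (_, margin) in margins.items() if margin <= 0]

for metric, (baseline, margin) in margins.items():
    print(f"{metric:26s} {margin:+6.2f} vs {baseline}")
print("WEM is not the top model on:", not_best)
```

On these numbers WEM leads the strongest baseline on EWMScore by 3.08 points (61.48 versus 58.40 for the PAN-style Baseline), and it trails only on Photometric Consistency, where Cosmos-Predict 2.5-2B scores 42.53 against 35.95.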

HTEWorld Navigation-Manipulation Metrics

Comparison on HTEWorld with navigation-manipulation metrics in their original scale. These metrics evaluate continuous multi-turn generation and unified navigation-manipulation behavior.

Model	Rollout Chunk-Boundary Dynamics	Late-Prefix State Alignment	Chunk Instruction-Step Retrieval	Phase-Matched Motion Profile Alignment	Cross-Phase Discriminative Margin	Frontier Phase-Hop State Consistency
WoW-7B	0.23	0.83	0.49	0.45	0.47	0.85
Cosmos-Predict 2.5-2B	0.24	0.83	0.50	0.47	0.48	0.86
Cosmos-Predict 2.5-14B	0.26	0.83	0.51	0.48	0.49	0.85
PAN-style Baseline	0.27	0.86	0.49	0.50	0.46	0.88
WEM	0.31	0.87	0.57	0.54	0.52	0.89
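Because these metrics sit at different points of their original scales, relative gains are easier to compare than raw differences. The sketch below copies the table values and ranks the metrics by WEM's percentage gain over the strongest baseline; `relative_gains` is an illustrative helper name, not part of any released evaluation code.

```python
NAV_MANIP_METRICS = [
    "Rollout Chunk-Boundary Dynamics",
    "Late-Prefix State Alignment",
    "Chunk Instruction-Step Retrieval",
    "Phase-Matched Motion Profile Alignment",
    "Cross-Phase Discriminative Margin",
    "Frontier Phase-Hop State Consistency",
]

# Values copied row by row from the navigation-manipulation table.
BASELINES = {
    "WoW-7B": [0.23, 0.83, 0.49, 0.45, 0.47, 0.85],
    "Cosmos-Predict 2.5-2B": [0.24, 0.83, 0.50, 0.47, 0.48, 0.86],
    "Cosmos-Predict 2.5-14B": [0.26, 0.83, 0.51, 0.48, 0.49, 0.85],
    "PAN-style Baseline": [0.27, 0.86, 0.49, 0.50, 0.46, 0.88],
}
WEM_SCORES = [0.31, 0.87, 0.57, 0.54, 0.52, 0.89]


def relative_gains(ours, baselines, metrics):
    """Return (metric, percent gain over the best baseline), largest gain first."""
    gains = []
    for i, metric in enumerate(metrics):
        best = max(values[i] for values in baselines.values())
        gains.append((metric, round((ours[i] / best - 1.0) * 100.0, 1)))
    return sorted(gains, key=lambda item: item[1], reverse=True)


ranked = relative_gains(WEM_SCORES, BASELINES, NAV_MANIP_METRICS)
for metric, gain in ranked:
    print(f"{metric:40s} +{gain:.1f}%")
```

Read this way, the largest relative gains are on Rollout Chunk-Boundary Dynamics and Chunk Instruction-Step Retrieval, while the two state-alignment metrics, where every model already scores 0.83 or higher, move by only about one percent.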

Qualitative Results

Representative rollouts show long-horizon prediction quality, hybrid task consistency, and the effect of world-ego specialization.

Comparison with Baselines

More WEM Outputs

1. Move toward the soda can with a trash can.
2. Continue forward with the trash can.
3. Stop in front of the soda can.
4. Move closer to the soda can.
5. Lower the right arm to the can.
6. Pick up the soda can.
7. Raise the soda can.
8. Drop the soda can into the trash can.

1. Reach for the hinged jar lid.
2. Open the lid while holding the jar.
3. Move left with the open jar.
4. Move forward through the kitchen.
5. Approach the wooden countertop.
6. Continue toward the countertop.
7. Stop beside the countertop with the jar.

1. Move toward the grey trash can.
2. Continue through the room toward the trash can.
3. Lower the left arm to grasp the trash can.
4. Lift the trash can.
5. Turn right while holding the trash can.
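The three instruction sequences above show how navigation and manipulation interleave within a single trajectory. As a rough illustration, the sketch below tags each instruction by its leading verb and counts how often the phase changes. The verb lists and the resulting navigation/manipulation labels are our own heuristic for these examples, not annotations from HTEWorld.

```python
# Heuristic verb lists chosen for the three examples only (an assumption).
NAVIGATION_VERBS = {"move", "continue", "stop", "approach", "turn"}
MANIPULATION_VERBS = {"reach", "open", "lower", "pick", "raise", "drop", "lift"}

TRAJECTORIES = [
    ["Move toward the soda can with a trash can.", "Continue forward with the trash can.",
     "Stop in front of the soda can.", "Move closer to the soda can.",
     "Lower the right arm to the can.", "Pick up the soda can.",
     "Raise the soda can.", "Drop the soda can into the trash can."],
    ["Reach for the hinged jar lid.", "Open the lid while holding the jar.",
     "Move left with the open jar.", "Move forward through the kitchen.",
     "Approach the wooden countertop.", "Continue toward the countertop.",
     "Stop beside the countertop with the jar."],
    ["Move toward the grey trash can.", "Continue through the room toward the trash can.",
     "Lower the left arm to grasp the trash can.", "Lift the trash can.",
     "Turn right while holding the trash can."],
]


def phase_of(instruction: str) -> str:
    """Label an instruction by its first word; 'unknown' if the verb is unlisted."""
    verb = instruction.split()[0].lower()
    if verb in NAVIGATION_VERBS:
        return "navigation"
    if verb in MANIPULATION_VERBS:
        return "manipulation"
    return "unknown"


def count_phase_switches(instructions) -> int:
    """Count adjacent instruction pairs whose phase labels differ."""
    phases = [phase_of(text) for text in instructions]
    return sum(1 for a, b in zip(phases, phases[1:]) if a != b)


switches = [count_phase_switches(trajectory) for trajectory in TRAJECTORIES]
print(switches)  # [1, 1, 2]
```

Under this labeling, the first two sequences switch phase once (navigation to manipulation, and manipulation to navigation), and the third switches twice, returning to navigation after the grasp.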

BibTeX

@article{wem2026,
  title={World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks},
  author={Lin, Zuyao and Zhang, Jianhui and Jia, Peidong and Zhao, Xiaoguang and Zhang, Shanghang and Chen, Xingyu},
  journal={arXiv preprint arXiv:2605.19957},
  year={2026}
}