World-Ego Modeling
for Long-Horizon Evolution in Hybrid Embodied Tasks

Zuyao Lin 1,2,3, Jianhui Zhang 3,4, Peidong Jia 5, Xiaoguang Zhao 1, Shanghang Zhang 5, Xingyu Chen 3,✉
1 Institute of Automation, Chinese Academy of Sciences; 2 University of Chinese Academy of Sciences; 3 Zhongguancun Academy; 4 Shanghai Jiaotong University; 5 Peking University
✉ Corresponding author
World-Ego Modeling teaser

WEM decomposes embodied future prediction into persistent world evolution and robot-centric ego dynamics for long-horizon hybrid navigation-manipulation tasks.

Abstract

World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction-agnostic scene regularities and the ego captures robot-centric instruction-conditioned dynamics. This world-ego entanglement leads to a degradation in long-horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce World-Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives, i.e., motion-, semantic-, and intention-based views, and analyze three disentanglement strategies with post-, pre-, and full decoupling. Further, we instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we further construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K training video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.

Highlights

  • World-Ego Modeling. We propose World-Ego Modeling, a new conceptual paradigm for embodied world models that decomposes future evolution into world and ego components. We define the world-ego boundary from motion-, semantic-, and intention-based views and analyze the necessity of world-ego disentanglement for embodied evolution.
  • World-Ego Model. We design WEM, a video-based embodied world model that instantiates world-ego modeling with a role-conditioned attention (RCA) planner and a cascade-parallel mixture-of-experts (CP-MoE) generator, targeting long-horizon video rollout for hybrid navigation-manipulation tasks.
  • HTEWorld Benchmark. We construct HTEWorld, the first training dataset, benchmark, and metric protocol for long-horizon world evolution with hybrid navigation-manipulation behaviors. WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.

World-Ego Modeling: Concept, Model, and Benchmark

This section expands the three main contributions: the World-Ego Modeling paradigm, the World-Ego Model architecture, and the Hybrid-Task Embodied World Benchmark.

World-Ego Modeling

We treat the world and the ego as two predictive roles for embodied evolution. The world-ego boundary is defined from motion-, semantic-, and intention-based views, and the necessity of world-ego disentanglement is studied through post-, pre-, and full decoupling strategies.

WEM

We instantiate the paradigm as the World-Ego Model (WEM) under the semantic-based world-ego view and full disentanglement. WEM uses a role-conditioned attention (RCA) planner to infer separate world and ego states, and a cascade-parallel mixture-of-experts (CP-MoE) generator to produce long-horizon video rollouts.
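To make the idea of inferring separate world and ego states from one shared token sequence concrete, here is a minimal toy sketch. It is not the WEM planner, whose design is not specified on this page. It only illustrates one generic technique, cross-attention with one query vector per role, under the assumption that each role pools the same tokens with its own query. All names (`attend`, `predict_role_states`, `role_queries`) and the random inputs are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, tokens):
    """Scaled dot-product attention of one query over token vectors.

    Tokens serve as both keys and values (no learned projections) to keep
    the sketch short.
    """
    dim = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim) for tok in tokens]
    weights = softmax(scores)
    state = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
    return state, weights


def predict_role_states(tokens, role_queries):
    """Return one pooled state per role, each computed from its own query."""
    return {role: attend(query, tokens)[0] for role, query in role_queries.items()}


# Hypothetical stand-ins: random vectors replace real vision-language tokens
# and learned role embeddings.
rng = random.Random(0)
DIM = 8
tokens = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(6)]
role_queries = {
    "world": [rng.gauss(0.0, 1.0) for _ in range(DIM)],
    "ego": [rng.gauss(0.0, 1.0) for _ in range(DIM)],
}

states = predict_role_states(tokens, role_queries)
print({role: [round(x, 3) for x in vec] for role, vec in states.items()})
```

Because the two roles share the tokens but not the query, each role weights the same evidence differently, which is the minimal property a separate world-ego state predictor needs.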

HTEWorld

We construct the Hybrid-Task Embodied World Benchmark (HTEWorld) for long-horizon world evolution with unified navigation-manipulation behaviors. It provides training clips, multi-turn evaluation trajectories, and metrics for evaluating continuous hybrid-task rollouts.

General framework of world-ego modeling

General framework of World-Ego Modeling. (a) A state predictor infers separate world-ego states from vision-language tokens. CP-MoE is designed to form different degrees of world-ego decoupling. (b) Pre-disentanglement. (c) Post-disentanglement. (d) Full disentanglement.

Overview of the World-Ego Model architecture

Overview of the World-Ego Model. The predictor takes multi-turn instructions for hybrid navigation-manipulation tasks and predicts long-horizon world and ego states. The generator separately evolves the world and ego with the generated semantic proxy.

HTEWorld dataset and benchmark overview

Statistics of the proposed HTEWorld benchmark. HTEWorld provides large-scale training clips and multi-turn evaluation trajectories for hybrid embodied world modeling. Left. Hybrid-task vocabulary spanning manipulation, navigation, objects, and scenes. Middle. Training-set composition, including training/evaluation scale, action-oriented clip types, and annotation categories. Right. Evaluation-trajectory composition, including instruction-round distribution and the manipulation/navigation proportion at each length.
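As a rough illustration of the manipulation/navigation proportion statistic, the sketch below represents one multi-turn trajectory as a list of labeled instruction steps and computes the share of each behavior type. The instructions are taken from one of the qualitative examples on this page. The `Step` structure and the navigation/manipulation labels are assumptions assigned by hand for illustration, not HTEWorld annotations or its data format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    instruction: str
    kind: str  # "navigation" or "manipulation"; hand-assigned, hypothetical label


# One multi-turn trajectory; labels are illustrative guesses, not dataset annotations.
trajectory = [
    Step("Move toward the soda can with a trash can.", "navigation"),
    Step("Continue forward with the trash can.", "navigation"),
    Step("Stop in front of the soda can.", "navigation"),
    Step("Move closer to the soda can.", "navigation"),
    Step("Lower the right arm to the can.", "manipulation"),
    Step("Pick up the soda can.", "manipulation"),
    Step("Raise the soda can.", "manipulation"),
    Step("Drop the soda can into the trash can.", "manipulation"),
]


def kind_proportions(steps):
    """Fraction of steps of each behavior type in one trajectory."""
    counts = Counter(step.kind for step in steps)
    return {kind: counts[kind] / len(steps) for kind in ("navigation", "manipulation")}


proportions = kind_proportions(trajectory)
print(len(trajectory), proportions)
```

Grouping such per-trajectory proportions by the number of instruction rounds would give a breakdown of the kind described for the right panel.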

Quantitative Results

We report the two main HTEWorld comparisons from the paper. Higher is better for all metrics.

HTEWorld Results Under WorldArena Metrics

Comparison on HTEWorld using WorldArena's normalized metric suite. The table keeps the full metric breakdown, with expanded headers for readability on the project page.

Rows are metrics grouped by dimension; columns are models.

Dimension            Metric                   WoW-7B  Cosmos-Predict 2.5-2B  Cosmos-Predict 2.5-14B  PAN-style Baseline  WEM
Overall              EWMScore                 53.44   54.83                  55.41                   58.40               61.48
Visual Quality       Image Quality            64.72   64.40                  62.14                   65.48               66.82
Visual Quality       Aesthetic Quality        49.74   50.21                  50.02                   49.00               50.30
Visual Quality       JEPA Similarity          1.30    1.26                   1.38                    1.70                2.49
Motion Quality       Dynamic Degree           22.76   24.62                  29.37                   38.14               41.52
Motion Quality       Flow Score               25.49   27.23                  32.34                   47.43               49.21
Motion Quality       Motion Smoothness        67.74   69.33                  71.63                   79.47               82.70
Content Consistency  Subject Consistency      63.06   66.88                  68.65                   74.79               82.07
Content Consistency  Background Consistency   66.86   71.54                  73.85                   80.24               87.92
Content Consistency  Photometric Consistency  38.08   42.53                  35.48                   33.15               35.95
Physics Adherence    Interaction Quality      80.90   83.02                  84.70                   86.33               90.80
Physics Adherence    Trajectory Accuracy      28.16   28.78                  28.81                   28.75               34.51
3D Accuracy          Depth Accuracy           82.41   82.11                  82.60                   82.39               84.55
3D Accuracy          Perspectivity            95.14   95.28                  94.40                   95.08               97.60
Controllability      Instruction Following    78.42   79.60                  80.20                   80.40               82.00
Controllability      Semantic Alignment       86.75   87.05                  87.46                   87.67               90.74
Controllability      Action Following         3.47    3.52                   3.55                    4.42                4.50
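The reported EWMScore values appear consistent with an unweighted mean of the 16 sub-metric scores, up to rounding. This is an observation drawn from the numbers in the table, not a definition stated on this page, so the sketch below should be read as a sanity check rather than the official scoring code. The dictionary layout and the helper name `mean_gap` are illustrative.

```python
from statistics import mean

# Per model: (reported EWMScore, the 16 sub-metric scores in table order).
RESULTS = {
    "WoW-7B": (53.44, [64.72, 49.74, 1.30, 22.76, 25.49, 67.74, 63.06, 66.86,
                       38.08, 80.90, 28.16, 82.41, 95.14, 78.42, 86.75, 3.47]),
    "Cosmos-Predict 2.5-2B": (54.83, [64.40, 50.21, 1.26, 24.62, 27.23, 69.33, 66.88, 71.54,
                                      42.53, 83.02, 28.78, 82.11, 95.28, 79.60, 87.05, 3.52]),
    "Cosmos-Predict 2.5-14B": (55.41, [62.14, 50.02, 1.38, 29.37, 32.34, 71.63, 68.65, 73.85,
                                       35.48, 84.70, 28.81, 82.60, 94.40, 80.20, 87.46, 3.55]),
    "PAN-style Baseline": (58.40, [65.48, 49.00, 1.70, 38.14, 47.43, 79.47, 74.79, 80.24,
                                   33.15, 86.33, 28.75, 82.39, 95.08, 80.40, 87.67, 4.42]),
    "WEM": (61.48, [66.82, 50.30, 2.49, 41.52, 49.21, 82.70, 82.07, 87.92,
                    35.95, 90.80, 34.51, 84.55, 97.60, 82.00, 90.74, 4.50]),
}


def mean_gap(reported, sub_scores):
    """Absolute gap between the reported score and the plain mean of sub-scores."""
    return abs(mean(sub_scores) - reported)


gaps = {model: mean_gap(reported, subs) for model, (reported, subs) in RESULTS.items()}
for model, gap in gaps.items():
    print(f"{model}: |mean - EWMScore| = {gap:.3f}")
```

Every gap stays within rounding error of the two-decimal table entries, which also serves as a check that each row's values are assigned to the right columns.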

HTEWorld Navigation-Manipulation Metrics

Comparison on HTEWorld with navigation-manipulation metrics in their original scale. These metrics evaluate continuous multi-turn generation and unified navigation-manipulation behavior.

Rows are metrics; columns are models.

Metric                                  WoW-7B  Cosmos-Predict 2.5-2B  Cosmos-Predict 2.5-14B  PAN-style Baseline  WEM
Rollout Chunk-Boundary Dynamics         0.23    0.24                   0.26                    0.27                0.31
Late-Prefix State Alignment             0.83    0.83                   0.83                    0.86                0.87
Chunk Instruction-Step Retrieval        0.49    0.50                   0.51                    0.49                0.57
Phase-Matched Motion Profile Alignment  0.45    0.47                   0.48                    0.50                0.54
Cross-Phase Discriminative Margin       0.47    0.48                   0.49                    0.46                0.52
Frontier Phase-Hop State Consistency    0.85    0.86                   0.85                    0.88                0.89
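To read the navigation-manipulation comparison at a glance, the following sketch computes WEM's absolute margin over the strongest baseline on each metric, using only the values reported in the table. The helper name `margins_over_best_baseline` is illustrative, and margins are rounded to two decimals to match the table's precision.

```python
METRICS = [
    "Rollout Chunk-Boundary Dynamics",
    "Late-Prefix State Alignment",
    "Chunk Instruction-Step Retrieval",
    "Phase-Matched Motion Profile Alignment",
    "Cross-Phase Discriminative Margin",
    "Frontier Phase-Hop State Consistency",
]

# Scores per model, in the same order as METRICS.
SCORES = {
    "WoW-7B": [0.23, 0.83, 0.49, 0.45, 0.47, 0.85],
    "Cosmos-Predict 2.5-2B": [0.24, 0.83, 0.50, 0.47, 0.48, 0.86],
    "Cosmos-Predict 2.5-14B": [0.26, 0.83, 0.51, 0.48, 0.49, 0.85],
    "PAN-style Baseline": [0.27, 0.86, 0.49, 0.50, 0.46, 0.88],
    "WEM": [0.31, 0.87, 0.57, 0.54, 0.52, 0.89],
}


def margins_over_best_baseline(scores, target="WEM"):
    """Map each metric to (strongest baseline, target score minus that baseline's score)."""
    baselines = {model: vals for model, vals in scores.items() if model != target}
    result = {}
    for i, metric in enumerate(METRICS):
        best_model = max(baselines, key=lambda model: baselines[model][i])
        margin = round(scores[target][i] - baselines[best_model][i], 2)
        result[metric] = (best_model, margin)
    return result


margins = margins_over_best_baseline(SCORES)
for metric, (best_model, margin) in margins.items():
    print(f"{metric}: +{margin:.2f} over {best_model}")
```

On these numbers the margin is positive on all six metrics, largest on Chunk Instruction-Step Retrieval and smallest on the two state-alignment and state-consistency metrics.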

Qualitative Results

Representative rollouts show long-horizon prediction quality, hybrid task consistency, and the effect of world-ego specialization.

Comparison with Baselines

More WEM Outputs

  1. Move toward the soda can with a trash can.
  2. Continue forward with the trash can.
  3. Stop in front of the soda can.
  4. Move closer to the soda can.
  5. Lower the right arm to the can.
  6. Pick up the soda can.
  7. Raise the soda can.
  8. Drop the soda can into the trash can.

  1. Reach for the hinged jar lid.
  2. Open the lid while holding the jar.
  3. Move left with the open jar.
  4. Move forward through the kitchen.
  5. Approach the wooden countertop.
  6. Continue toward the countertop.
  7. Stop beside the countertop with the jar.

  1. Move toward the grey trash can.
  2. Continue through the room toward the trash can.
  3. Lower the left arm to grasp the trash can.
  4. Lift the trash can.
  5. Turn right while holding the trash can.

BibTeX

@article{wem2026,
  title={World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks},
  author={Lin, Zuyao and Zhang, Jianhui and Jia, Peidong and Zhao, Xiaoguang and Zhang, Shanghang and Chen, Xingyu},
  journal={arXiv preprint arXiv:2605.19957},
  year={2026}
}