Project 1: World Modelling in Embodied AI

We study world modelling for embodied AI in two settings: autonomous driving and embodied robots. By building generative world models that synthesize rare, rule-intensive scenarios, closing the loop between simulation and policy learning, and distilling raw data into higher-level concepts, we aim to make embodied agents more reliable, more generalizable, and substantially more data-efficient.


Part 1: Enhancing Rule Understanding in Autonomous Driving Systems Using Generative World Models


Our Objective:

The core of an autonomous driving system lies in its deep understanding of the surrounding environment and traffic rules. However, current autonomous driving technology still struggles with complex or rare high-level rules: for instance, how should the system decide when a police officer's hand gestures conflict with the traffic signal? Such situations expose the shortcomings of today's systems, so effectively integrating these complex rules into driving models is critical for improving reliability. The root cause is the long-tail effect of real-world data: rare but crucial hazardous scenarios are extremely sparse in datasets, and collecting them manually is not only costly but also entails significant safety risks. To address this challenge, we use generative AI to construct a data feedback loop framework. A generative world model creates large-scale simulated scenes that contain complex high-level rules together with complete 3D ground truth; these high-quality synthetic data are then used to train the main driving model, compensating for its weaknesses in rule understanding and ultimately improving its decision-making ability in real-world environments.
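To make the loop concrete, here is a minimal, schematic Python sketch of such a data feedback cycle. The object interfaces (generate, evaluate, finetune) and the failure-mining step are illustrative assumptions, not our actual implementation.

```python
# Schematic sketch of the generative data feedback loop (illustrative only).
# All object interfaces below are hypothetical placeholders, not our real API.

def data_feedback_loop(world_model, driving_model, eval_scenarios, rounds=3):
    """Alternate between mining failure cases, synthesizing targeted data,
    and retraining the driving model on that data."""
    for _ in range(rounds):
        # 1. Evaluate the driving model and collect scenarios it handles poorly,
        #    e.g. rare rule conflicts such as police gestures vs. traffic lights.
        failures = [s for s in eval_scenarios if not driving_model.evaluate(s)]

        # 2. Ask the generative world model for large-scale variations of those
        #    scenarios, each paired with complete 3D ground truth for supervision.
        synthetic_data = [world_model.generate(scenario=f, num_variants=100)
                          for f in failures]

        # 3. Fine-tune the driving model on the synthetic data so the next
        #    evaluation round starts from an improved policy.
        driving_model.finetune(synthetic_data)

    return driving_model
```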


Our Achievements So Far:

    a. Static World Construction: Achieved fine-grained control in generating static 3D driving scenes. (Yang* et al., 2023)
    b. Dynamic World Synthesis: Extended the model along the temporal dimension, enabling continuous 4D video generation and demonstrating its performance in dynamic environments. (Ma* et al., 2024)
    c. Model Self-Correction: Built the data feedback loop framework, enabling the system to learn from the generated data and perform self-correction and optimization. (Ma* et al., 2024)
    d. Multimodal Joint Generation: By unifying multimodal features through a shared Bird's Eye View (BEV) space, we achieve consistent joint generation of multimodal sensor data, enhancing the multimodal perception capability of autonomous driving systems; a simplified sketch of this shared-BEV design follows the list. (“OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving,” 2025)
    e. Trajectory Risk Prediction Enhancement: Using synthetic trajectory data, we improved a Vision-Language Model's (VLM) ability to predict the risks associated with planned trajectories, helping the system foresee potential hazards in real driving scenarios and make safer decisions. (Hou* et al., 2025)
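To illustrate the shared-BEV idea in item d, the deliberately simplified PyTorch sketch below fuses camera and LiDAR features in one BEV grid and decodes every modality from that shared representation. The class, layer shapes, and the assumption that both inputs are already projected into BEV coordinates are ours for illustration; this is not the OmniGen architecture.

```python
# Illustrative toy module (not our actual model): camera and LiDAR features
# are encoded into one shared BEV feature map, so generated sensor outputs
# stay spatially consistent with each other.
import torch
import torch.nn as nn

class SharedBEVGenerator(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.camera_encoder = nn.Conv2d(3, dim, kernel_size=3, padding=1)
        self.lidar_encoder = nn.Conv2d(1, dim, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)
        # One lightweight decoder head per sensor modality.
        self.camera_head = nn.Conv2d(dim, 3, kernel_size=1)
        self.lidar_head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, camera_bev, lidar_bev):
        # Both inputs are assumed to be already projected into BEV coordinates.
        cam_feat = self.camera_encoder(camera_bev)
        lid_feat = self.lidar_encoder(lidar_bev)
        bev = self.fuse(torch.cat([cam_feat, lid_feat], dim=1))  # shared BEV space
        # Decoding every modality from the same BEV features keeps them consistent.
        return self.camera_head(bev), self.lidar_head(bev)

# Toy usage with random tensors standing in for projected sensor features.
model = SharedBEVGenerator()
cam = torch.randn(1, 3, 200, 200)
lidar = torch.randn(1, 1, 200, 200)
image_out, range_out = model(cam, lidar)
```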


Future Outlook

We will continue to optimize the self-correcting feedback loop framework, enhance the realism of the simulator, and improve high-level closed-loop capabilities. Our goal is to reduce data collection and training costs and further enhance the robustness and generalization of autonomous driving models, ultimately moving autonomous driving technology toward a smarter and safer future.



Part 3: From Simulation to Reality: Toward Autonomous Embodied Robots


Our Objective:

Humans can quickly adapt to new environments and tasks by leveraging prior knowledge, physical intuition, and high-level reasoning. In contrast, robots often rely on massive-scale data collection and environment-specific training, yet still struggle to generalize across diverse scenarios and perform reliably in the real world. Our research focuses on bridging this gap by developing agents capable of transferring skills from simulation to the real world (Sim2Real), while building structured world models that enable generalizable reasoning, robust planning, and adaptive decision-making. We aim to move beyond brute-force imitation learning toward data-efficient, autonomous intelligence.


Our Achievements So Far:

    a. Generative Simulation Module: We propose a generative simulation module capable of generating diverse training scenes. This enables continuous policy generation and refinement within a closed-loop pipeline, where simulated experiences directly guide policy learning, and improved policies, in turn, trigger the automatic generation of more challenging environments.
    b. Socially-Enhanced Navigation Policy: To handle complex multi-agent environments, we introduce a socially-enhanced navigation policy that incorporates social attributes and interaction modeling into policy learning. By embedding social compliance, comfort metrics, and group dynamics into the reward structure, our agents achieve significantly better performance in crowded, dynamic, and socially-constrained navigation tasks; an illustrative reward sketch follows this list.
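As a concrete illustration of the reward shaping behind item b, the sketch below combines a goal-progress term with collision, personal-space comfort, and group-dynamics penalties. The specific terms, thresholds, and weights are illustrative assumptions rather than the exact reward our policy is trained with.

```python
# Illustrative socially-aware navigation reward (hypothetical weights/terms).

def social_navigation_reward(progress, collided, min_human_dist,
                             group_crossed, comfort_radius=1.2):
    """Combine task progress with social-compliance terms.

    progress:        metres of progress toward the goal this step
    collided:        True if the robot hit an obstacle or a person
    min_human_dist:  distance (m) to the closest pedestrian
    group_crossed:   True if the planned motion cut through a pedestrian group
    """
    reward = 1.0 * progress                  # task term: make progress to the goal
    if collided:
        reward -= 10.0                       # hard safety penalty
    if min_human_dist < comfort_radius:
        # Comfort term: penalty grows smoothly as the robot intrudes
        # into a pedestrian's personal space.
        reward -= 2.0 * (comfort_radius - min_human_dist) / comfort_radius
    if group_crossed:
        reward -= 1.0                        # group-dynamics term: do not split groups
    return reward

# Example: a small step forward while slightly inside someone's comfort zone.
print(social_navigation_reward(progress=0.3, collided=False,
                               min_human_dist=0.9, group_crossed=False))
```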


Future Outlook

We plan to extend our framework to more diverse tasks and heterogeneous robotic platforms, enabling agents to autonomously collect data, learn concepts, and adapt policies online with minimal human supervision. Our long-term goal is to establish a unified paradigm for autonomous embodied intelligence, where agents not only learn efficiently from limited data but also generalize robustly to unseen environments, novel tasks, and real-world uncertainties.



Part 4: From Data to Concepts: Toward Efficient Driving Intelligence


Our Objective:

Human drivers typically require only limited formal training—covering traffic rules, basic operations, simulated road practice, and driving norms—before gradually adapting to increasingly complex and unfamiliar situations with experience. In contrast, autonomous driving systems have consumed millions or even billions of data samples, yet fully reliable performance remains a challenge. Our research explores whether autonomous agents can move beyond brute-force data accumulation by developing a higher-level conceptual understanding from their training, somewhat akin to the human ability to generalize through intuition. We believe such an approach could lead to more data-efficient learning and enable agents to handle rare and long-tail scenarios more effectively.


Our Achievements So Far:

    a. Concept-Learning Module: We are developing a concept-learning module that complements existing end-to-end driving architectures. This module aims to build a structured understanding of driving scenarios in parallel with traditional perception and control pipelines; a minimal sketch of this parallel design follows.
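The toy PyTorch sketch below shows one way such a parallel concept branch could sit next to an end-to-end planner: a shared backbone feeds both a waypoint head and a concept head. The concept vocabulary, feature sizes, and layer choices are illustrative assumptions, not our actual module.

```python
# Toy sketch (illustrative assumptions only): a concept head attached in
# parallel to an end-to-end driving backbone, so scene-level concepts are
# predicted alongside the usual planning output.
import torch
import torch.nn as nn

CONCEPTS = ["intersection", "pedestrian_nearby", "yield_required", "lane_merge"]

class DrivingModelWithConcepts(nn.Module):
    def __init__(self, in_dim=1024, feat_dim=256, num_waypoints=6):
        super().__init__()
        self.num_waypoints = num_waypoints
        self.backbone = nn.Sequential(                 # stand-in perception encoder
            nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.planner = nn.Linear(feat_dim, num_waypoints * 2)   # (x, y) waypoints
        self.concept_head = nn.Linear(feat_dim, len(CONCEPTS))  # parallel branch

    def forward(self, sensor_features):
        feats = self.backbone(sensor_features)
        waypoints = self.planner(feats).view(-1, self.num_waypoints, 2)
        concept_logits = self.concept_head(feats)      # structured scene concepts
        return waypoints, concept_logits

# The concept branch can be supervised separately (e.g. with weak labels),
# while the planning branch keeps its original objective.
model = DrivingModelWithConcepts()
waypoints, concept_logits = model(torch.randn(1, 1024))
```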


Future Outlook

We plan to extend this framework using more diverse datasets to improve the generalization of the concept model, with the ultimate goal of enhancing performance in unseen and complex situations while advancing the broader pursuit of data-efficient learning in autonomous driving.


References

2025

  1. OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving
    arXiv preprint arXiv:2508.xxxxx, 2025
  2. DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction
    Zhiyi Hou*, Enhui Ma*, Fang Li*, and 8 more authors
    arXiv preprint arXiv:2507.02948, 2025

2024

  1. Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation
    Enhui Ma*, Lijun Zhou*, Tao Tang, and 8 more authors
    arXiv preprint arXiv:2406.01349, 2024

2023

  1. BEVControl: Accurately Controlling Street-View Elements with Multi-Perspective Consistency via BEV Sketch Layout
    Kairui Yang*, Enhui Ma*, Jibin Peng, and 3 more authors
    arXiv preprint arXiv:2308.01661, 2023