Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation

1Westlake University,
2Li Auto Inc.,
3Tianjin University,
4Shenzhen Campus, Sun Yat-sen University,
5Southeast University,
6Harbin Engineering University,
7Harbin Institute of Technology (Shenzhen)

(*Co-first authors. Corresponding Author)
Overview of our method. We show that (a) Delphi can generate consecutive videos of up to 40 frames, while (b) the best existing method generates only 8 frames. (c) With the failure-case driven framework equipped with Delphi, (d) we can significantly boost end-to-end model performance at a much smaller cost.

Abstract

Using generative models to synthesize new data has become a de facto standard in autonomous driving to address data scarcity. Although existing approaches can boost perception models, we find that they fail to improve the planning performance of end-to-end autonomous driving models, because the generated videos are usually shorter than 8 frames and their spatial and temporal inconsistencies are not negligible. To this end, we propose Delphi, a novel diffusion-based long video generation method with a shared noise modeling mechanism across the multiple views to increase spatial consistency, and a feature-aligned module to achieve both precise controllability and temporal consistency. Our method can generate up to 40 frames of video without loss of consistency, about 5 times longer than state-of-the-art methods. Instead of randomly generating new data, we further design a sampling policy that lets Delphi generate new data similar to the failure cases, improving sample efficiency. This is achieved by building a failure-case driven framework with the help of pre-trained visual language models. Our extensive experiments demonstrate that Delphi generates higher-quality long videos, surpassing previous state-of-the-art methods. Consequently, while generating data amounting to only 4% of the training dataset size, our framework goes beyond perception and prediction tasks and, for the first time to the best of our knowledge, boosts the planning performance of an end-to-end autonomous driving model by a margin of 25%.
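The abstract does not spell out how the shared noise mechanism is implemented. As a rough illustration of the general idea only, the sketch below correlates the initial diffusion noise across camera views by mixing one shared Gaussian draw with per-view independent draws. The function name `shared_multiview_noise`, the mixing weight `alpha`, and the latent shape are all assumptions made for this example, not details taken from Delphi.

```python
import numpy as np

def shared_multiview_noise(num_views, shape, alpha=0.5, seed=0):
    """Return noise of shape (num_views, *shape) whose views share a common component.

    `alpha` is a hypothetical mixing weight: 0 gives fully independent views,
    1 gives identical noise in every view. Each element keeps unit variance
    because alpha + (1 - alpha) = 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal(shape)                      # one draw reused by every view
    independent = rng.standard_normal((num_views, *shape))   # a separate draw per view
    return np.sqrt(alpha) * shared[None] + np.sqrt(1.0 - alpha) * independent

# Example: six camera views (as in nuScenes) with an assumed 64x64x4 latent per view.
noise = shared_multiview_noise(num_views=6, shape=(64, 64, 4), alpha=0.5)
print(noise.shape)
```

With this mixing, the correlation between the noise of any two views is approximately `alpha`, which is one simple way to make neighbouring views start from related noise while each view still receives standard Gaussian noise.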

Long video generation on nuScenes dataset

Long videos generated by Delphi (up to 40 frames) on the nuScenes dataset. For readability, we play the video at 5x speed.

Precise controllability

Visual comparison of local regions generated by different generative models. Our method maintains a consistent spatial and temporal appearance where previous methods fail.

Visualization of (a) instance-level editing, including appearance attributes of all vehicles, and (b) scene-level editing, including weather and time.

Our failure-case driven framework boosts the end-to-end planning model

Performance comparison of end-to-end models fine-tuned from the open-source UniAD model by applying different data sampling strategies, numbers of data cases, data engines, and data sources in the failure-case driven framework. The baseline performance is presented in the first row of the table.

Visualization of four examples before and after applying our framework. (a) Four hard examples from the validation set, covering "large objects in the front" and "unprotected left turn at intersection". (b) Our framework fixes these four examples without using them during training.
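This page does not describe the sampling policy in detail. One plausible way to pick training scenes that resemble failure cases, shown here purely as an assumption-laden sketch, is nearest-neighbour retrieval over scene embeddings (for example, embeddings of scene descriptions produced by a visual language model). The function `select_similar_scenes` and the toy 2-D embeddings are hypothetical and are not part of the Delphi release.

```python
import numpy as np

def select_similar_scenes(failure_emb, train_emb, budget):
    """Return indices of the `budget` training scenes closest to any failure case.

    Both inputs are 2-D arrays of (hypothetical) scene embeddings, one row per scene.
    Closeness is cosine similarity to the nearest failure case.
    """
    f = failure_emb / np.linalg.norm(failure_emb, axis=1, keepdims=True)
    t = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    sim = t @ f.T                       # (num_train, num_failures) cosine similarities
    best = sim.max(axis=1)              # similarity of each scene to its nearest failure case
    order = np.argsort(-best, kind="stable")  # most similar first
    return order[:budget]

# Toy example: four training scenes, one failure case, budget of two scenes.
train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
failures = np.array([[1.0, 0.1]])
print(select_similar_scenes(failures, train, budget=2))
```

In a setting like the one reported in the abstract, `budget` would be chosen so that the generated data amounts to roughly 4% of the training set; the selected scenes would then serve as conditions for the video generator.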

Scaling up

By training on a private multi-view driving dataset (about 50 times larger than nuScenes), Delphi can generate spatiotemporally consistent videos of up to 120 frames. This demonstrates the scalability of Delphi.

Application: visual renderer for closed-loop evaluation

We note that Delphi can also be used as a data engine with photorealistic image generation capabilities, which further supports closed-loop evaluation of end-to-end models such as UniAD. Here we show a demo of closed-loop evaluation on nuScenes. The top row shows an open-loop evaluation scenario on nuScenes: the ego car drives at a constant speed. The bottom row shows a closed-loop evaluation scenario on nuScenes using Delphi: the ego car accelerates, resulting in a dangerous distance from the vehicle in front.
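The difference between the two evaluation modes can be illustrated with a toy one-dimensional car-following rollout. This is a minimal sketch under simplifying assumptions: the `rollout` function, the always-accelerating planner, and all numeric values are hypothetical, and the rendering step that a generator such as Delphi would provide in a real closed-loop setup is omitted.

```python
def rollout(planner, logged_ego_speeds, lead_speed, gap0, dt=0.5, closed_loop=False):
    """Return the gap (in metres) to the lead vehicle after each step.

    Open loop: the ego state replays the log, so the planner's output never
    affects what happens next. Closed loop: the planner's acceleration is
    applied, so the next state depends on the planner's own decisions.
    """
    gap, speed = gap0, logged_ego_speeds[0]
    gaps = []
    for logged_speed in logged_ego_speeds:
        accel = planner(gap, speed)        # planner decision at this step
        if closed_loop:
            speed = speed + accel * dt     # the decision changes the ego speed
        else:
            speed = logged_speed           # the decision is ignored; the log is replayed
        gap = gap + (lead_speed - speed) * dt
        gaps.append(gap)
    return gaps

# Hypothetical planner that always accelerates at 1 m/s^2.
always_accelerate = lambda gap, speed: 1.0
log = [10.0] * 10  # logged ego speed: constant 10 m/s

open_gaps = rollout(always_accelerate, log, lead_speed=10.0, gap0=20.0)
closed_gaps = rollout(always_accelerate, log, lead_speed=10.0, gap0=20.0, closed_loop=True)
print(open_gaps[-1], closed_gaps[-1])
```

In the open-loop run the gap never changes because the logged constant speed is replayed, whereas in the closed-loop run the same planner steadily closes the gap. Exposing this kind of behaviour requires new sensor views for states that were never logged, which is where a visual renderer comes in.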