𝒳-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability

¹Zhejiang University    ²National University of Singapore

Abstract

Diffusion models are advancing autonomous driving by enabling realistic data synthesis, predictive end-to-end planning, and closed-loop simulation, with a primary focus on temporally consistent generation. However, the generation of large-scale 3D scenes, which requires spatial coherence, remains underexplored. In this paper, we propose 𝒳-Scene, a novel framework for large-scale driving scene generation that achieves both geometric intricacy and appearance fidelity while offering flexible controllability. Specifically, 𝒳-Scene supports multi-granular control, including low-level conditions such as user-provided or text-driven layouts for detailed scene composition, and high-level semantic guidance such as user intent and LLM-enriched text prompts for efficient customization. To enhance geometric and visual fidelity, we introduce a unified pipeline that sequentially generates 3D semantic occupancy and the corresponding multi-view images while ensuring alignment between the two modalities. Additionally, we extend the generated local region into a large-scale scene through consistency-aware scene outpainting, which extrapolates new occupancy and images conditioned on the previously generated area, enhancing spatial continuity and preserving visual coherence. The resulting scenes are lifted into high-quality 3DGS representations, supporting diverse applications such as scene exploration. Comprehensive experiments demonstrate that 𝒳-Scene significantly advances controllability and fidelity for large-scale driving scene generation, empowering data generation and simulation for autonomous driving.

Method

Pipeline of 𝒳-Scene for scalable driving scene generation: (a) Multi-granular controllability supports both high-level text prompts and low-level geometric constraints for flexible specification; (b) Joint occupancy-image generation synthesizes aligned 3D voxels and multi-view images via conditional diffusion; (c) Large-scale extrapolation and reconstruction enables coherent scene expansion through consistency-aware outpainting.
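
The bookkeeping behind step (c) can be illustrated with a toy sketch. It is not the 𝒳-Scene implementation: the grid is a small 2D label map rather than 3D occupancy, and `toy_generator`, `outpaint`, and their parameters (`window`, `overlap`) are hypothetical names introduced only for illustration. The sketch shows one plausible reading of "conditioned on the previously generated area": each new window overlaps the known region, the generator sees that overlap, and cells that were already generated are never overwritten.

```python
from typing import Callable, List, Optional

# None marks a cell that has not been generated yet.
Grid = List[List[Optional[int]]]


def toy_generator(window: Grid) -> List[List[int]]:
    """Hypothetical stand-in for a conditional generative model.

    Fills each unknown cell with the nearest known label to its left in the
    same row, or label 0 when no known cell precedes it.
    """
    filled = []
    for row in window:
        last = 0
        out_row = []
        for cell in row:
            if cell is not None:
                last = cell
            out_row.append(last)
        filled.append(out_row)
    return filled


def outpaint(
    seed: List[List[int]],
    total_width: int,
    window: int,
    overlap: int,
    generate: Callable[[Grid], List[List[int]]] = toy_generator,
) -> Grid:
    """Extend `seed` to `total_width` columns with overlapping windows."""
    seed_width = len(seed[0])
    if not 0 < overlap < window:
        raise ValueError("overlap must satisfy 0 < overlap < window")
    if overlap > seed_width or total_width < seed_width:
        raise ValueError("seed must cover the overlap and fit in the canvas")

    # Start from the seed region and pad the rest with unknown cells.
    canvas: Grid = [row[:] + [None] * (total_width - seed_width) for row in seed]
    known = seed_width  # columns [0, known) are already generated

    while known < total_width:
        start = known - overlap                  # reuse `overlap` known columns
        end = min(start + window, total_width)   # clip the last window
        patch = [row[start:end] for row in canvas]
        result = generate(patch)
        for r, row in enumerate(canvas):
            for c in range(start, end):
                if row[c] is None:               # keep previously generated cells
                    row[c] = result[r][c - start]
        known = end
    return canvas


scene = outpaint([[1, 1], [2, 2]], total_width=7, window=4, overlap=2)
print(scene)  # [[1, 1, 1, 1, 1, 1, 1], [2, 2, 2, 2, 2, 2, 2]]
```

Writing back only to unknown cells is a deliberate simplification; a diffusion-based system would more likely enforce consistency inside the sampling process itself, which this sketch does not attempt to model.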

Scene Generation Results

1. Layout-Conditioned Generation

2. Text-to-Scene Generation

3. Large-Scale Scene Generation

BibTeX

@article{yang2025xscene,
  title={X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability},
  author={Yang, Yu and Liang, Alan and Mei, Jianbiao and Ma, Yukai and Liu, Yong and Lee, Gim Hee},
  journal={arXiv preprint arXiv:2506.13558},
  year={2025}
}