SIU3R: SImultaneous Scene Understanding and 3D Reconstruction
Beyond Feature Alignment

arXiv 2025

Qi Xu1,2*, Dongxu Wei1,3*†, Lingzhe Zhao1, Wenpu Li1, Zhangchi Huang1,4, Shunping Ji2†, Peidong Liu1†

1 Westlake University   2 Wuhan University   3 Westlake Institute for Advanced Study   4 Zhejiang University  

Abstract

Simultaneous understanding and 3D reconstruction plays an important role in developing end-to-end embodied intelligent systems. To achieve this, recent approaches resort to a 2D-to-3D feature alignment paradigm, which leads to limited 3D understanding capability and potential semantic information loss. In light of this, we propose SIU3R, the first alignment-free framework for generalizable simultaneous understanding and 3D reconstruction from unposed images. Specifically, SIU3R bridges the reconstruction and understanding tasks via a pixel-aligned 3D representation, and unifies multiple understanding tasks into a single set of learnable queries, enabling native 3D understanding without the need for alignment with 2D models. To encourage collaboration between the two tasks through the shared representation, we further conduct in-depth analyses of their mutual benefits and propose two lightweight modules to facilitate their interaction. Extensive experiments demonstrate that our method achieves state-of-the-art performance not only on the individual tasks of 3D reconstruction and understanding, but also on the task of simultaneous understanding and 3D reconstruction, highlighting the advantages of our alignment-free framework and the effectiveness of the mutual benefit designs.

Method Overview

SIU3R is a feed-forward method that performs simultaneous 3D scene understanding and reconstruction from unposed images. In particular, SIU3R does not require feature alignment with 2D VLMs (e.g., CLIP, LSeg) to enable understanding, which lets it serve as a unified model for multiple 3D understanding tasks (i.e., semantic, instance, panoptic and text-referred segmentation). Moreover, tailored mutual-benefit designs further boost SIU3R's performance by encouraging bi-directional promotion between reconstruction and understanding.
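To illustrate how a single set of queries can serve several segmentation tasks at once, here is a minimal NumPy sketch. It assumes a Mask2Former-style head in which each query emits class logits (with a trailing "no object" column) and a per-pixel mask. The function name `decode_queries`, the tensor shapes, and the thresholds are hypothetical and are not taken from the SIU3R implementation. Text-referred segmentation is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode_queries(class_logits, mask_logits, score_thresh=0.5, mask_thresh=0.5):
    """Decode semantic, instance, and panoptic maps from one set of queries.

    Assumed (hypothetical) shapes:
      class_logits: (Q, C + 1), last column is the "no object" class
      mask_logits:  (Q, H, W), per-query mask logits
    """
    probs = softmax(class_logits, axis=-1)[:, :-1]  # (Q, C), drop "no object"
    masks = sigmoid(mask_logits)                    # (Q, H, W)

    # Semantic map: sum each query's mask weighted by its class probabilities.
    sem_scores = np.einsum("qc,qhw->chw", probs, masks)
    semantic = sem_scores.argmax(axis=0)            # (H, W) class ids

    # Instance map: keep confident queries; each pixel goes to the best one.
    labels = probs.argmax(axis=-1)                  # (Q,) class per query
    scores = probs.max(axis=-1)                     # (Q,) confidence per query
    keep = np.flatnonzero(scores > score_thresh)
    instance = np.full(mask_logits.shape[1:], -1, dtype=np.int64)
    if keep.size > 0:
        weighted = scores[keep, None, None] * masks[keep]
        best = weighted.argmax(axis=0)              # index into `keep`
        best_mask = np.take_along_axis(masks[keep], best[None], axis=0)[0]
        instance = np.where(best_mask > mask_thresh, keep[best], -1)

    # Panoptic map: class of the owning query per pixel (-1 where unassigned);
    # together with `instance` this gives (class, instance id) pairs.
    panoptic_class = np.where(instance >= 0, labels[np.maximum(instance, 0)], -1)
    return semantic, instance, panoptic_class
```

Calling `decode_queries(class_logits, mask_logits)` returns three `(H, W)` integer maps. In the instance and panoptic maps, `-1` marks pixels that no confident query claims.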

Visualization Results

[Figure: results for two input views (image1, image2), shown as RGB (rgb), semantic segmentation (sem), and instance segmentation (ins) outputs.]

Zero-shot to Real-world Captured Data Visualization Results

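[Figure: zero-shot results on real-world captures for two input views (image1, image2), shown as RGB (rgb), semantic segmentation (sem), and instance segmentation (ins) outputs.]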

Extend to 4 Input Views Visualization Results

[Figure: results for four input views (image1 to image4), shown as RGB (rgb), semantic segmentation (sem), and instance segmentation (ins) outputs.]

Extend to 8 Input Views Visualization Results

[Figure: results for eight input views (image1 to image8), shown as RGB (rgb), semantic segmentation (sem), and instance segmentation (ins) outputs.]

Zero-shot to ScanNet++ and Replica Visualization Results

[Figure: zero-shot results for two input views (image1, image2), shown as RGB (rgb), semantic segmentation (sem), and instance segmentation (ins) outputs.]

Text-referred Segmentation Visualization Results

[Figure: input image with segmentation results for the text queries "refrigerator right of another" and "refrigerators".]

[Figure: input image with segmentation results for the text queries "the chair on the right" and "chairs".]

Depth Visualization for Objects with Complex or Curved Geometries

[Figure: two input images (Image 1, Image 2), each paired with its depth map (Depth 1, Depth 2).]

Citation

@misc{xu2025siu3r,
    title={SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment},
    author={Qi Xu and Dongxu Wei and Lingzhe Zhao and Wenpu Li and Zhangchi Huang and Shunping Ji and Peidong Liu},
    year={2025},
    eprint={2507.02705},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}