PhD Student in Artificial Intelligence

Qi Xu 徐淇

I am a PhD student at Westlake University, advised by Prof. Peidong Liu. My research focuses on embodied AI, multimodal large language models, and 3D vision.

I am interested in building intelligent systems that perceive, reconstruct, reason about, and act within physical environments from visual and multimodal observations. I previously received my master's and bachelor's degrees from Wuhan University.

Education & Experience

Westlake University

PhD Student in Artificial Intelligence, supervised by Prof. Peidong Liu.

Research: Embodied AI, Multimodal Large Language Models.

ByteDance, TikTok Live

Research Intern, LLM Applications Team. Mentored by Xiangtai Li. Worked on multimodal model research and large-scale training systems.

Wuhan University

Master's degree in Pattern Recognition and Intelligent Systems, supervised by Prof. Shunping Ji.

Research: Computer Vision, 3D Reconstruction, Multimodal Learning.

Wuhan University

B.S. in Spatial Information and Digital Technology.

News

HiCI was accepted to ICML 2026.

Towards One-to-Many Temporal Grounding was accepted to ICML 2026.

SIU3R received a NeurIPS 2025 Spotlight.

Research Interests

Embodied AI

Perception, memory, and spatial reasoning foundations for agents that interact with physical environments.

Multimodal LLMs

Spatial-temporal reasoning over images, videos, language, and 3D observations for grounded multimodal intelligence.

3D Vision

Scene reconstruction, generation, and geometric representations from sparse, unposed, or multimodal observations.

Selected Publications

Qi Xu is underlined. * denotes equal contribution.

NeurIPS 2025 Spotlight

SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment

Qi Xu*, Dongxu Wei*, Lingzhe Zhao, Wenpu Li, Zhangchi Huang, Shunping Ji, Peidong Liu

Alignment-free framework for state-of-the-art 3D reconstruction and scene understanding from unposed images via a shared pixel-aligned 3D representation.

arXiv 2026

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

Dongxu Wei*, Qi Xu*, Zhiqi Li, Hangning Zhou, Cong Qiu, Hailong Qin, Mu Yang, Zhaopeng Cui, Peidong Liu

Performs 3D scene generation directly within an implicit 3D latent space using 3D Diffusion Transformers.

ICML 2026

HiCI: Hierarchical Construction-Integration for Long-Context Attention

Xiangyu Zeng, Qi Xu, Yunke Wang, Chang Xu

Efficient long-context modeling with hierarchical attention, extending LLaMA-2 to 100K tokens with only 5.5% additional parameters.

NeurIPS 2025

E-MoFlow: Learning Egomotion and Optical Flow from Event Data via Implicit Regularization

Wenpu Li, Bangyan Liao, Yi Zhou, Qi Xu, Pian Wan, Peidong Liu

Unsupervised framework for joint ego-motion and optical flow estimation using implicit neural representations and geometric constraints.

Contact

I am open to research conversations and collaborations on embodied AI, multimodal reasoning, 3D vision, and vision-language systems that connect perception with action.