PhD Student in Artificial Intelligence

Qi Xu 徐淇

I am a PhD student at Westlake University, advised by Prof. Peidong Liu. My research focuses on multimodal large language models, especially video reasoning, spatio-temporal grounding, and post-training.

I am interested in building grounded multimodal systems that can find and verify visual evidence, reason over long videos, and understand the spatial structure of physical environments. My earlier work spans 3D reconstruction and scene understanding. I received my master's and bachelor's degrees from Wuhan University.

Email GitHub CV Google Scholar

Education & Experience

2026 - Present

Westlake University

PhD Student in Artificial Intelligence, supervised by Prof. Peidong Liu.

Research: Embodied AI, Multimodal Large Language Models.

2025 - 2026

ByteDance, TikTok Global Live

Research Intern, LLM Strategy Department. Mentored by Xiangtai Li.

Developed video MLLMs for temporal grounding with SFT and RL.
Built scalable data and annotation pipelines for multimodal post-training.
Worked on large-scale distributed training, evaluation, and stability analysis.
Designed LLM-agent workflows for real-time interaction analysis.

2023 - 2026

Wuhan University

Master's degree in Pattern Recognition and Intelligent Systems, supervised by Prof. Shunping Ji.

Research: Computer Vision, 3D Reconstruction, Multimodal Learning.

2019 - 2023

Wuhan University

B.S. in Spatial Information and Digital Technology.

News

2026

VideoZeroBench was released with code and data for evidence-grounded long-video evaluation.

2026

Any 3D Scene is Worth 1K Tokens was released on arXiv.

2026

Two papers with advised students were accepted to ECCV 2026.

2026

Watch, Remember, Reason was released on arXiv.

2026

HiCI was accepted to ICML 2026.

2026

Towards One-to-Many Temporal Grounding was accepted to ICML 2026.

2025

SIU3R received a NeurIPS 2025 Spotlight.

Research Interests

Multimodal Post-Training

SFT and reinforcement learning for grounded multimodal reasoning, with verifiable task-specific rewards.

Video Reasoning & Grounding

Long-video understanding, temporal localization, evidence verification, benchmark construction, and video agents.

3D & Spatial Intelligence

Scene reconstruction, understanding, generation, and spatial representations grounded in physical environments.

Selected Publications

Qi Xu is underlined. * denotes equal contribution.

NeurIPS 2025 Spotlight

SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment

Qi Xu*, Dongxu Wei*, Lingzhe Zhao, Wenpu Li, Zhangchi Huang, Shunping Ji, Peidong Liu

Alignment-free framework for state-of-the-art 3D reconstruction and scene understanding from unposed images via a shared pixel-aligned 3D representation.

My contributions: Project lead, core idea contributor, and primary code implementer; led benchmarking, ablations, and paper writing.

Project ArXiv Code

ICML 2026

Towards One-to-Many Temporal Grounding

Qi Xu*, Yue Tan*, Shihao Chen, Jiahao Meng, Anran Wang, Shunping Ji, Hao Fei, Xiangtai Li

Introduces the first systematic solution for one-to-many temporal grounding with a 56K-sample dataset and SFT/RL training using temporal and CoT-based caption rewards.

My contributions: Initiated the project and contributed the core idea; led data construction, SFT/RL pipelines, reward design, model training, benchmarking, and paper writing.

Project ArXiv

arXiv 2026

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

Dongxu Wei*, Qi Xu*, Zhiqi Li, Hangning Zhou, Cong Qiu, Hailong Qin, Mu Yang, Zhaopeng Cui, Peidong Liu

Performs 3D scene generation directly within an implicit 3D latent space using 3D Diffusion Transformers.

My contributions: Built large-scale datasets, conducted ablations and evaluation, and contributed to paper writing and revision.

Project ArXiv Code

ICML 2026

HiCI: Hierarchical Construction-Integration for Long-Context Attention

Xiangyu Zeng, Qi Xu, Yunke Wang, Chang Xu

Efficient long-context modeling with hierarchical attention, extending LLaMA-2 to 100K tokens with only 5.5% additional parameters.

My contributions: Contributed to method design and paper revision.

ArXiv Code

arXiv 2026

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Xiangtai Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang

A human-view survey organizing video MLLM research around watching, remembering, and reasoning for long, multimodal, and knowledge-intensive video understanding.

My contributions: Organized literature for the memory section, created figures and tables, and contributed to writing and revision.

ArXiv GitHub

arXiv 2026

VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Jiahao Meng, Yue Tan, Qi Xu, Haochen Wang, Zhongwei Ren, Weisong Liu, Yuhao Wang, Renrui Zhang, Yunhai Tong, Haodong Duan

A hierarchical long-video QA benchmark that verifies answers with temporal intervals and spatial boxes under a five-level evidence protocol.

My contributions: Co-designed the five-level evaluation protocol, implemented the annotation tool and guidelines, and constructed about 200 complete QA-evidence samples.

Project ArXiv Code

NeurIPS 2025

E-MoFlow: Learning Egomotion and Optical Flow from Event Data via Implicit Regularization

Wenpu Li, Bangyan Liao, Yi Zhou, Qi Xu, Pian Wan, Peidong Liu

Unsupervised framework for joint ego-motion and optical flow estimation using implicit neural representations and geometric constraints.

My contributions: Contributed to method design and paper revision.

Project ArXiv Code

Publications with Advised Students

ECCV 2026

SupIR-GS: Thermal Infrared Super-Resolution Novel View Synthesis with Imaging-Calibrated 3DGS

Jin Liu, Haodong Li, Jiagang Chen, Dabin Leng, Jiguang Li, Zhao Huang, Xiaoshuai Zhang, Qi Xu†, Zhiwen Zheng, Xingru Huang†

Reconstructs high-resolution 3D thermal scenes from low-resolution inputs via physics-informed degradation modeling.

ECCV 2026

Physically Grounded Dual-Opacity Gaussian Splatting for Joint RGB-TIR Reconstruction

Jin Liu, Dabin Leng, Jiagang Chen, Haodong Li, Jiguang Li, Zhao Huang, Xiaoshuai Zhang, Zhiwen Zheng, Xingru Huang†, Qi Xu†

A unified RGB-TIR reconstruction framework resolving cross-spectral visibility conflicts through thermal field modeling.

Contact

I am open to research conversations and collaborations on embodied AI, multimodal reasoning, 3D vision, and vision-language systems that connect perception with action.

insomniaaac@qq.com github.com/insomniaaac Google Scholar