About Me
I am a Research Scientist at Core AI, Meta Reality Labs, where I build multimodal spatial vision systems for scalable video and world understanding. My research focuses on developing spatial intelligence models that integrate vision, language, motion, and geometry to enable robust perception, reasoning, and generation across modalities. Representative work includes VideoAutoThink (CVPR 2026), VLM-3R (CVPR 2026 and Best Workshop Paper at ACM Multimedia 2025), and MV-DUSt3R+ (CVPR 2025 Oral). In addition to research, I work on the on-device perception stack powering Meta Quest experiences, including semantic segmentation and real-time MR/VR spatial understanding.
Before joining Meta Reality Labs, I was a technical lead and senior machine learning/computer vision engineer in the Video Engineering Group at Apple Inc. I led algorithm development and delivered multiple groundbreaking products, including Room Tracking on VisionPro, RoomPlan Enhancement, and RoomPlan. Additionally, I collaborated with Apple AIML on 3D scene style generation, where we pioneered RoomDreamer, the first work to enable text-driven 3D indoor scene synthesis with coherent geometry and texture.
I received my Ph.D. and M.S. degrees from the University of Maryland, College Park, where I was advised by Prof. Rama Chellappa. I completed my B.S. degree in Electrical Engineering and Information Science at the University of Science and Technology of China. Additionally, I completed internships at Snap Research and the Palo Alto Research Center.
Highlights
- Feb, 2026. 🚀✨ Three papers — 1) VideoAutoThink, 2) VLM-3R, and 3) MoS (Mixture of States) — have been accepted to CVPR 2026! Huge thanks and congratulations to all co-authors and collaborators 🙌🎉
  - 🧠🎬 VideoAutoThink: Video Auto Reasoning via Thinking Once, Answering Twice
    An adaptive video reasoning framework that challenges unconditional chain-of-thought by adopting a thinking-once, answering-twice paradigm, enabling confidence-based reasoning activation for improved accuracy and efficiency.
    📄 Paper [arXiv] · 📦 GitHub [Code] · 🤗 Model [Hugging Face]
  - 🧭📐 VLM-3R: Instruction-Aligned 3D Reconstruction and Reasoning
    A spatial vision-language model aligning natural language instructions with 3D reasoning from monocular video.
    💻 GitHub [Code] · 📦 Project Page [Link]
  - 🎨⚡ Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
    A dynamic token-wise routing mechanism for multimodal diffusion models, enabling adaptive layer selection and input-dependent text–vision alignment for scalable generation and editing.
    📄 Paper [arXiv]
- Jan, 2026. VideoAuto-R1 is now online — try our 🤗 Demo [Hugging Face]. 📦 GitHub [Code] · Model [Hugging Face]
- Dec, 2025. MoS (Mixture of States) is now online. Check out our [Paper]. Congrats to Haozhe Liu and Ding Liu for leading the work.
- Nov, 2025. We’re grateful to have received the Best Paper Award from the ACM MM 2025 Multimodal Foundation Models for Spatial Intelligence Workshop. Congrats to Zhiwen for leading the work, and thanks to all collaborators. 📦 GitHub and 📄 Paper
- Nov, 2025. DynamicVerse: Physically-Aware Multimodal Modeling for Dynamic 4D Worlds is accepted to NeurIPS 2025. 📦 GitHub
- Jun, 2025. 🚀 VLM-3R is online! Check out our Project Website, read the arXiv Paper, and explore the Code.
- Mar, 2025. MV-DUSt3R+ is accepted as an Oral at CVPR 2025. Check out our Demo and Project. Congratulations to Zhenggang Tang, Yuchen Fan, Dilin Wang, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan!
- Jan, 2025. MV-DUSt3R+ is open-sourced. Let’s further push the boundary!
- Dec, 2024. MV-DUSt3R+ is online: a single-stage, multi-view, multi-path model that reconstructs large-scale scenes from sparse, unconstrained views in just 2 seconds!
- Jun, 2024. Room Tracking on VisionPro is unveiled at Apple WWDC 2024. This technology identifies room boundaries, supports precisely aligned geometries, and recognizes transitions between rooms.
- Oct, 2023. Our paper “RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture” is accepted to ACM Multimedia 2023. [arXiv] [demo]. Congratulations to Liangchen Song, Liangliang Cao, and all co-authors.
- Jun, 2023. RoomPlan Enhancement is introduced at Apple WWDC 2023. It added numerous powerful features to RoomPlan, including multi-room scanning, multi-room layout, object attributes, polygon walls, improved furniture representation, room-type identification, and floor-shape recognition.
- Oct, 2022. Our research article, “3D Parametric Room Representation with RoomPlan,” is published at Apple Machine Learning Research. Read our research article to learn more!
- Jun, 2022. RoomPlan is first released at Apple WWDC 2022. Combining the power of Apple LiDAR, state-of-the-art 3D machine learning, and an intuitive scanning UI, RoomPlan empowers developers to create innovative solutions in interior design, architecture, real estate, and e-commerce.
