MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion

in submission

Carnegie Mellon University

Given sparse-view videos of dynamic scenes, our approach reconstructs 3D geometry and motion, enabling extreme novel view synthesis, 3D tracking, and feature distillation. Our sparse-view (4-camera) setup strikes a balance between ill-posed reconstructions from casual monocular captures and well-constrained reconstructions from dense multi-view studio captures.


4D Scene Reconstruction that Supports Free-View Synthesis


We show comprehensive results covering all categories of Ego-Exo4D, even complex, highly occluded scenes. (We omit the training views for a clearer visualization experience.)

Bike Repair


EgoView Synthesis (Follow the Dance!)


With a 4-camera setup, we can enable ego-view synthesis anywhere in the scene, opening up possible embodied applications! [Left: ground truth provided by Ego-Exo4D; Right: our synthesized view] (Colour differences arise from different camera sensors; the foreground in our reconstruction has been removed.)


More Results!


HealthCare - CPR

Music - Piano

Cooking - Scrambled Eggs

Sports - Football

Panoptic - Baseball


Abstract

We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures using dozens of calibrated cameras (e.g. Panoptic Studio), or short monocular videos with limited information (e.g. DAVIS). In contrast, we aim to reconstruct diverse dynamic human behaviors, such as repairing a bike or dancing, from sparse-view videos.

We repurpose state-of-the-art monocular reconstruction methods for sparse-view reconstruction and find that careful initialization from time- and view-consistent monocular depth estimators produces more accurate reconstructions. Specifically, our method predicts dense surface points across all training views, and uses confidence-aware pixel alignment to initialize scene geometry. We further distill per-point semantic features from 2D foundation models, and use feature clustering to encode a compact set of motion bases. Finally, we employ a gradient-based joint optimization framework to simultaneously learn scene geometry and motion.
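The confidence-aware geometry initialization above can be sketched as back-projecting each view's depth map into world space, keeping only pixels whose depth confidence is high, and fusing the resulting point clouds. This is a minimal illustrative sketch, not the paper's actual implementation; the function names, camera convention (pinhole intrinsics K, camera-to-world extrinsics), and confidence threshold are all assumptions.

```python
import numpy as np

def unproject_depth(depth, conf, K, cam2world, conf_thresh=0.5):
    """Back-project a per-view depth map into world-space 3D points,
    keeping only pixels whose depth confidence exceeds a threshold.
    (Illustrative sketch; the threshold value is an assumption.)"""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    mask = conf > conf_thresh
    # Pixel rays in camera coordinates: K^{-1} [u, v, 1]^T, scaled by depth.
    pix = np.stack([u[mask], v[mask], np.ones(mask.sum())], axis=0)
    cam_pts = (np.linalg.inv(K) @ pix) * depth[mask]
    # Homogeneous transform from camera to world coordinates.
    world = cam2world @ np.vstack([cam_pts, np.ones(mask.sum())])
    return world[:3].T  # (N, 3)

def fuse_views(depths, confs, Ks, cam2worlds):
    """Fuse confidence-filtered point clouds from all training views."""
    clouds = [unproject_depth(d, c, K, T)
              for d, c, K, T in zip(depths, confs, Ks, cam2worlds)]
    return np.concatenate(clouds, axis=0)
```

In practice the fused cloud would also need cross-view alignment (since independently predicted depth maps disagree in scale and placement), which is where the time- and view-consistent depth estimators come in.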

Notably, our approach achieves state-of-the-art performance on challenging sequences from the Ego-Exo4D dataset.

Contribution

  • We repurpose Ego-Exo4D for sparse-view reconstruction and highlight the challenge of reconstructing skilled human behaviors in dynamic environments.
We demonstrate that monocular reconstruction methods can be extended to the sparse-view setting by carefully incorporating monocular depth and foundation-model priors.
  • We extensively ablate our design choices and show that we achieve state-of-the-art performance on challenging sequences from Ego-Exo4D.

Method Overview



    As dynamic scene reconstruction from sparse views is extremely challenging, we present two key insights to initialize plausible geometry and motion:

  • Initializing consistent scene geometry via confidence-aware spatio-temporal alignment
  • Initializing motion trajectories by clustering per-point 3D semantic features distilled from 2D foundation models
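The second insight, clustering per-point semantic features into a compact set of motion bases, can be sketched with a minimal k-means over the distilled features: points with similar semantics (e.g. belonging to the same limb) are assigned to the same basis. This is a simplified stand-in under assumed names; the actual number of bases and clustering details are design choices of the method, not reproduced here.

```python
import numpy as np

def cluster_motion_bases(features, num_bases=16, iters=10, seed=0):
    """Group per-point semantic features (N, D) into `num_bases` clusters
    with a minimal k-means; each cluster shares one motion basis.
    (Illustrative sketch; parameter values are assumptions.)"""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), num_bases, replace=False)].copy()
    for _ in range(iters):
        # Assign each point to its nearest cluster center.
        dist = np.linalg.norm(features[:, None] - centers[None], axis=-1)
        labels = dist.argmin(axis=1)
        # Update centers (keep the old center if a cluster is empty).
        for k in range(num_bases):
            if (labels == k).any():
                centers[k] = features[labels == k].mean(axis=0)
    return labels, centers
```

During the joint optimization, each basis would carry its own rigid trajectory over time, and a point's motion is taken from (or blended across) the bases it is assigned to.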

Experiments


    Concrete steps to validate our points:

    • Near-perfect training-view renderings (verifying a correct implementation)
    • Accurate near-novel (5°) view synthesis (showing the method works, as prior approaches do)
    • Strong extreme-novel (45°) view synthesis (demonstrating free-viewpoint rendering, which prior methods have not shown)
    • Held-out camera evaluation (90°): train on 3 of the 4 cameras and leave 1 out for qualitative results

    Selected Qualitative Results

    45° Novel View Synthesis Comparison

    Existing monocular methods, and their naive multi-view extensions, render poorly from drastically different novel views. MV-SOM improves upon SOM in 45° novel-view synthesis; our method's careful point-cloud initialization and feature-based motion bases further improve upon MV-SOM.

    BibTeX

    Yet to come; I will send you an email to ask for a citation (just kidding).

    Acknowledgements

    Zihan would like to thank all the coauthors for discussions and paper writing. Beyond that, Jeff for deep discussions about this project, Tarasha for debugging advice, Neehar for high-level insights, and Prof. Deva Ramanan for insightful guidance and advice. Outside of the author list, we would like to thank Nikhil Keetha and Jay Karhade for their great suggestions. We thank ourselves for surviving. We also extend our gratitude to Zihan for his generous payment of tuition fees to Carnegie Mellon University, which made this possible.