We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with dozens of calibrated cameras (e.g., Panoptic Studio) or short monocular videos with limited information (e.g., DAVIS). In contrast, we aim to reconstruct diverse dynamic human behaviors, such as repairing a bike or dancing, from sparse-view videos.
We repurpose state-of-the-art monocular reconstruction methods for sparse-view reconstruction and find that careful initialization from time- and view-consistent monocular depth estimators produces more accurate reconstructions.
Specifically, our method predicts dense surface points across all training views, and uses confidence-aware pixel alignment to initialize scene geometry.
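The initialization step above can be sketched as follows. This is a minimal toy illustration, not the paper's actual pipeline: the function name, the confidence threshold, and the pinhole unprojection are all illustrative assumptions.

```python
import numpy as np

def init_points(depth, conf, K, conf_thresh=0.5):
    """Unproject per-pixel depth into 3D points, keeping only pixels
    whose predicted confidence exceeds a threshold (a toy stand-in
    for confidence-aware pixel alignment)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    mask = conf > conf_thresh          # discard low-confidence pixels
    z = depth[mask]
    x = (u[mask] - K[0, 2]) * z / K[0, 0]  # pinhole back-projection
    y = (v[mask] - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)

# Toy example: a 4x4 depth map with one low-confidence pixel.
K = np.array([[100., 0., 2.], [0., 100., 2.], [0., 0., 1.]])
depth = np.ones((4, 4))
conf = np.full((4, 4), 0.9)
conf[0, 0] = 0.1                       # this pixel is dropped
pts = init_points(depth, conf, K)      # 15 of 16 pixels survive
```

In practice such points would be aggregated across all training views; here a single view suffices to show the mechanics.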
We further distill per-point semantic features from 2D foundation models, and use feature clustering to encode a compact set of motion bases.
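Feature clustering into motion bases can be illustrated with a small k-means sketch, assuming per-point features have already been distilled. The deterministic center initialization and the function name are illustrative, not from the paper.

```python
import numpy as np

def cluster_motion_bases(features, k, iters=20):
    """Toy k-means over per-point features; each cluster index would
    select one shared motion basis (illustrative, not the paper's code)."""
    # deterministic init: spread initial centers across the point set
    idx = np.linspace(0, len(features) - 1, k).astype(int)
    centers = features[idx]
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(features[:, None] - centers[None], axis=-1)
        assign = d.argmin(axis=1)
        # recompute centers as cluster means
        for j in range(k):
            if (assign == j).any():
                centers[j] = features[assign == j].mean(axis=0)
    return assign, centers

# Two well-separated feature blobs -> two motion bases.
feats = np.concatenate([np.zeros((10, 8)), np.ones((10, 8))])
assign, centers = cluster_motion_bases(feats, k=2)
```

Points sharing a cluster would then share motion parameters, which is what makes the motion representation compact.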
Finally, we employ a gradient-based joint optimization framework to simultaneously learn scene geometry and motion.
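The flavor of such a joint optimization can be shown with a deliberately tiny example: one "geometry" parameter and one "motion" parameter are updated together by gradient descent on a shared reconstruction loss. The 1D setup, learning rate, and loss are all illustrative assumptions, not the paper's objective.

```python
import numpy as np

# Ground-truth trajectory: a point at p=2.0 moving with speed w=0.5.
t = np.arange(5.0)
obs = 2.0 + 0.5 * t

# Jointly optimize geometry (p) and motion (w) with gradient descent
# on the mean squared reconstruction error.
p, w = 0.0, 0.0
lr = 0.02
for _ in range(2000):
    resid = (p + w * t) - obs
    # analytic gradients of mean(resid**2) w.r.t. p and w
    p -= lr * 2 * resid.mean()
    w -= lr * 2 * (resid * t).mean()
```

Both parameters converge near their ground-truth values; the real system does the analogous update over millions of point and motion-basis parameters.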
Notably, our approach achieves state-of-the-art performance on challenging sequences from the Ego-Exo4D dataset.