MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion

in submission

Carnegie Mellon University

Given sparse-view videos of dynamic scenes, our approach reconstructs 3D geometry and motion, enabling extreme novel view synthesis, 3D tracking, and feature distillation. Our sparse-view (4-camera) setup strikes a balance between ill-posed reconstructions from casual monocular captures and well-constrained reconstructions from dense multi-view studio captures.


4D Scene Reconstruction that Supports Free-View Synthesis


We show comprehensive results covering all categories of Ego-Exo4D, even complex, highly occluded scenes. (We omit the training views for a clearer visualization experience.)

Bike Repair


EgoView Synthesis (Follow the Dance!)


With a 4-camera setup, we can enable ego-view synthesis anywhere in the scene, opening up possible embodied applications! [Left: ground truth provided by Ego-Exo4D; Right: our synthesized view] (Colour differences arise from different camera sensors; the foreground in our reconstruction has been removed.)


More Results!


HealthCare - CPR

Music - Piano

Cooking - Scrambled Eggs

Sports - Football

Panoptic - Baseball


Abstract

We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures using dozens of calibrated cameras (e.g. Panoptic Studio), or short monocular videos with limited information (e.g. DAVIS). In contrast, we aim to reconstruct diverse dynamic human behaviors, such as repairing a bike or dancing, from sparse-view videos.

We repurpose state-of-the-art monocular reconstruction methods for sparse-view reconstruction and find that careful initialization from time- and view-consistent monocular depth estimators produces more accurate reconstructions. Specifically, our method predicts dense surface points across all training views, and uses confidence-aware pixel alignment to initialize scene geometry. We further distill per-point semantic features from 2D foundation models, and use feature clustering to encode a compact set of motion bases. Finally, we employ a gradient-based joint optimization framework to simultaneously learn scene geometry and motion.
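The confidence-aware geometry initialization above can be sketched as back-projecting each view's depth map into world space, keeping only pixels whose depth confidence is high, and fusing the resulting point clouds. This is a minimal illustrative sketch, not the paper's actual implementation; the function names, camera convention (pinhole intrinsics K, camera-to-world extrinsics), and confidence threshold are all assumptions.

```python
import numpy as np

def unproject_depth(depth, conf, K, cam2world, conf_thresh=0.5):
    """Back-project a per-view depth map into world-space 3D points,
    keeping only pixels whose depth confidence exceeds a threshold.
    (Illustrative sketch; the threshold value is an assumption.)"""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    mask = conf > conf_thresh
    # Pixel rays in camera coordinates: K^{-1} [u, v, 1]^T, scaled by depth.
    pix = np.stack([u[mask], v[mask], np.ones(mask.sum())], axis=0)
    cam_pts = (np.linalg.inv(K) @ pix) * depth[mask]
    # Homogeneous transform from camera to world coordinates.
    world = cam2world @ np.vstack([cam_pts, np.ones(mask.sum())])
    return world[:3].T  # (N, 3)

def fuse_views(depths, confs, Ks, cam2worlds):
    """Fuse confidence-filtered point clouds from all training views."""
    clouds = [unproject_depth(d, c, K, T)
              for d, c, K, T in zip(depths, confs, Ks, cam2worlds)]
    return np.concatenate(clouds, axis=0)
```

In practice the fused cloud would also need cross-view alignment (since independently predicted depth maps disagree in scale and placement), which is where the time- and view-consistent depth estimators come in.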

Notably, our approach achieves state-of-the-art performance on challenging sequences from the Ego-Exo4D dataset.

Contribution

  • We repurpose Ego-Exo4D for sparse-view reconstruction and highlight the challenge of reconstructing skilled human behaviors in dynamic environments.
We demonstrate that monocular reconstruction methods can be extended to the sparse-view setting by carefully incorporating monocular depth and foundation-model priors.
  • We extensively ablate our design choices and show that we achieve state-of-the-art performance on challenging sequences from Ego-Exo4D.

Method Overview



    As dynamic scene reconstruction from sparse views is extremely challenging, we present two key insights to initialize plausible geometry and motion:

  • Initializing consistent scene geometry via confidence-aware spatio-temporal alignment
  • Initializing motion trajectories by clustering per-point 3D semantic features distilled from 2D foundation models
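The second insight, clustering per-point semantic features into a compact set of motion bases, can be sketched with a minimal k-means over the distilled features: points with similar semantics (e.g. belonging to the same limb) are assigned to the same basis. This is a simplified stand-in under assumed names; the actual number of bases and clustering details are design choices of the method, not reproduced here.

```python
import numpy as np

def cluster_motion_bases(features, num_bases=16, iters=10, seed=0):
    """Group per-point semantic features (N, D) into `num_bases` clusters
    with a minimal k-means; each cluster shares one motion basis.
    (Illustrative sketch; parameter values are assumptions.)"""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), num_bases, replace=False)].copy()
    for _ in range(iters):
        # Assign each point to its nearest cluster center.
        dist = np.linalg.norm(features[:, None] - centers[None], axis=-1)
        labels = dist.argmin(axis=1)
        # Update centers (keep the old center if a cluster is empty).
        for k in range(num_bases):
            if (labels == k).any():
                centers[k] = features[labels == k].mean(axis=0)
    return labels, centers
```

During the joint optimization, each basis would carry its own rigid trajectory over time, and a point's motion is taken from (or blended across) the bases it is assigned to.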

Experiments


    Concrete steps to validate our points:

    • Near-perfect training-view renderings (verifying a correct implementation)
    • Accurate near-novel (5°) view synthesis (showing the method works, as prior approaches do)
    • Strong extreme-novel (45°) view synthesis (demonstrating free-viewpoint rendering, which prior methods have not shown)
    • Held-out camera evaluation (90°): train on 3 of the 4 cameras and leave 1 out for qualitative results

    Selected Qualitative Results

    45° Novel View Synthesis Comparison

    Existing monocular methods, and their naive multi-view extensions, render poorly from drastically different novel views. MV-SOM improves upon SOM in 45° novel-view synthesis; our method's careful point-cloud initialization and feature-based motion bases further improve upon MV-SOM.

    BibTeX

    Yet to come; I will send you an email to ask for a citation (just kidding).

    Acknowledgements

    Zihan would like to thank all the coauthors for discussions and paper writing. Beyond that, Jeff for deep discussions about this project, Tarasha for debugging advice, Neehar for high-level insights, and Prof. Deva Ramanan for insightful guidance and advice. Outside of the author list, we would like to thank Nikhil Keetha and Jay Karhade for their great suggestions. We thank ourselves for surviving. We also extend our gratitude to Zihan for his generous payment of tuition fees to Carnegie Mellon University, which made this possible.