MuM: Why Masking Multiple Views Beats Single-View Pretraining for 3D

We love MAE-like tricks for 2D images. But what if we want features that understand 3D — depth, pose, geometry — not just semantics? MuM shows that by simply masking multiple views at once and training a decoder to reconstruct pixels, you get drastically better 3D features. Here's why that matters for real-world vision pipelines.
Introduction
If you've spent time building 3D vision systems — for structure-from-motion (SfM), 3D reconstruction, dense matching, pose estimation — you know that standard pretraining (on semantic tasks) gives you nice features for recognizing "this is a dog," but sucks for "reconstruct the geometry of that scene."
Most self-supervised learning (SSL) approaches for images — like MAE or semantically driven models — target semantics. That leaves geometric reasoning to downstream fine-tuning, which often requires lots of labelled 3D data.
MuM flips the game: it shows that by extending masked image modeling to multi-view data (multiple images of the same scene, from different views), you can learn representations that carry real 3D information — without any explicit geometry supervision.
If you care about 3D pipelines: this is the closest thing to "pretraining for geometry" that's worked out of the box.
How it Works
At core, MuM extends the standard masked auto-encoder (MAE) idea to sets of images of the same scene but from different viewpoints.
- Input: a variable-length sequence of images (2 up to ~24 views) from the same scene. Each image is "patchified" (like in a ViT), and a subset of patches in each view is randomly masked — same uniform masking ratio for all views.
- Architecture: a Vision Transformer (ViT) encoder processes the visible patches of each view independently. A ViT-B decoder then applies alternating attention: first attention within each view, then global attention across all views — no reference "anchor" frame is chosen.
- Loss / Target: for every masked patch, the model regresses normalized RGB values (i.e. raw pixel reconstruction, like in MAE); see the sketch right after this list. For single-view inputs, this reduces exactly to standard MAE.
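To make the objective concrete, here is a minimal PyTorch-style sketch of the uniform per-view masking and the normalized-pixel regression target. The helper names (patchify, sample_mask, multiview_mae_loss) are my own, not the paper's code, and the 0.75 masking ratio is an assumption borrowed from MAE since the exact value isn't restated here.

import torch

def patchify(imgs, patch=16):
    # (B, 3, H, W) -> (B, N, patch*patch*3) non-overlapping patches, as in a ViT
    B, C, H, W = imgs.shape
    h, w = H // patch, W // patch
    x = imgs.reshape(B, C, h, patch, w, patch)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, patch * patch * C)

def sample_mask(B, V, N, ratio=0.75):
    # same uniform masking ratio in every view; 1 = masked (ratio assumed, MAE default)
    idx = torch.rand(B, V, N).topk(int(round(N * ratio)), dim=-1).indices
    return torch.zeros(B, V, N).scatter_(-1, idx, 1.0)

def multiview_mae_loss(pred, views, mask, patch=16, eps=1e-6):
    # views: (B, V, 3, H, W) images of one scene from V viewpoints
    # pred:  (B, V, N, D) decoder output for every patch of every view
    # mask:  (B, V, N) from sample_mask
    B, V = views.shape[:2]
    target = patchify(views.flatten(0, 1), patch).unflatten(0, (B, V))
    mean, var = target.mean(-1, keepdim=True), target.var(-1, keepdim=True)
    target = (target - mean) / (var + eps).sqrt()   # normalized RGB, MAE-style
    loss = (pred - target).pow(2).mean(-1)          # per-patch MSE
    return (loss * mask).sum() / mask.sum()         # only masked patches count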
Key design choices / "magic numbers" (collected into a config sketch after this list):
- Masking ratio: same across all views, uniform.
- Random sequence length between 2 and 24 views per training sample.
- Pretraining dataset: a large mixture of 3D / multi-view / video datasets — collectively ~20 million frames.
- Training budget: 500k steps with AdamW and a cosine LR schedule (25k warmup steps); the peak learning rate is scaled to the global batch size of 6144.
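Pulled together, the recipe reads roughly like the config below. This is just the list above restated as a Python dict for skimmability; the peak learning rate itself isn't quoted here, so it stays as a comment.

pretrain_config = dict(
    views_per_sample=(2, 24),      # sequence length sampled per training example
    masking="uniform, same ratio for every view",
    data="mixture of 3D / multi-view / video datasets, ~20M frames",
    steps=500_000,
    optimizer="AdamW",
    lr_schedule="cosine",
    warmup_steps=25_000,
    global_batch_size=6144,        # peak LR scaled to this batch size (value not restated here)
)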
The insight: by forcing the network to reconstruct masked pixels from multiple views, the model is nudged to learn a representation that "knows" how different views of the same scene correlate — implicitly encoding geometry, depth, and viewpoint differences — rather than just independent semantics per image.
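The mechanism that lets a masked patch "borrow" evidence from other views is the decoder's alternating attention. Here is a minimal sketch of one such stage, assuming off-the-shelf PyTorch transformer layers rather than the authors' exact blocks:

import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    # One decoder stage: self-attention inside each view, then global
    # attention over the concatenated tokens of all views (no anchor frame).
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.intra = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True, norm_first=True)
        self.cross = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True, norm_first=True)

    def forward(self, tokens):
        # tokens: (B, V, N, D) with B scenes, V views, N patch tokens per view
        B, V, N, D = tokens.shape
        x = self.intra(tokens.reshape(B * V, N, D))   # frame-wise attention
        x = self.cross(x.reshape(B, V * N, D))        # attention across all views
        return x.reshape(B, V, N, D)

# A full decoder would stack several of these blocks over the visible tokens
# plus learned mask tokens, then project each token back to pixel space.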
Comparison
The paper benchmarks MuM against two baselines:
- DINOv3 — a top-tier self-supervised model optimized for semantic understanding.
- CroCo v2 — a previous masked-image approach designed for 2-view geometry.
Multi-view 3D reconstruction (n ≥ 2 views)
- On multi-view feed-forward reconstruction tasks (pose + depth + point cloud), a frozen MuM encoder + a simple ViT-based head outperforms DINOv3 and CroCo v2 on multiple datasets (indoor, outdoor, object-centric).
- With distillation finetuning (imitating a heavier 3D model), MuM achieves strong camera pose estimation (AUC@30°) and point-cloud reconstruction (median accuracy/completeness), outperforming both baselines.
Two-view dense matching & pose estimation
- Linear-probe dense matching: MuM significantly reduces end-point-error (EPE) and increases robustness compared to DINOv3 and CroCo v2. For example, on the MegaDepth dataset, MuM reaches EPE ≈ 10.2 px vs ~19 px for DINOv3.
- Plugging MuM features into a full matching pipeline (e.g. RoMa) improves matching accuracy over the baselines.
- Relative pose estimation (after finetuning) also sees gains: on several datasets, MuM delivers higher accuracy at angular thresholds (5°, 10°, 20°) than CroCo v2 and DINOv3. (Both EPE and this angular-threshold accuracy are sketched in code right after this list.)
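For reference, the two metrics quoted above are straightforward to compute. A NumPy sketch with my own helper names; over a dataset you would average these per-pair numbers:

import numpy as np

def end_point_error(pred_flow, gt_flow, valid=None):
    # pred_flow, gt_flow: (H, W, 2) dense pixel displacements; EPE = mean L2 error
    err = np.linalg.norm(pred_flow - gt_flow, axis=-1)
    return err[valid].mean() if valid is not None else err.mean()

def rotation_accuracy(R_pred, R_gt, thresholds_deg=(5, 10, 20)):
    # angular error of the predicted relative rotation, thresholded
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    err_deg = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return {t: float(err_deg <= t) for t in thresholds_deg}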
Single-view tasks (depth, normals, semantics)
- On purely semantic tasks (e.g. classification / segmentation), MuM lags behind DINOv3 — no surprise, since its objective prioritizes geometry over semantics.
- But for geometry-related single-view tasks (depth, normals), MuM produces competitive results — showing that its learned features generalize beyond multi-view settings.
In short: MuM gives you geometry-aware features that outperform semantic pretrained models on geometry tasks — without expensive per-task supervision.
The Playground
Here are three pseudocode-like examples / recipes showing how one might use MuM in practice:
# Feature extraction for dense matching
images = load_views_of_scene(n_views=4)            # e.g. 4 images of the same scene
features = MuM_encoder(images)                     # returns one feature map per image
matches = dense_matcher(features[0], features[1])  # dense correspondences between views 0 and 1
Expected: high-quality dense correspondences across views, better than DINOv3-based features — useful for SfM, depth fusion, mesh reconstruction.
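As a follow-on, correspondences like these drop straight into classical two-view geometry. A sketch using OpenCV, assuming matches is an (N, 4) array of (x1, y1, x2, y2) pixel coordinates from the recipe above and K is the shared 3x3 intrinsics matrix:

import cv2
import numpy as np

pts1 = matches[:, :2].astype(np.float64)
pts2 = matches[:, 2:].astype(np.float64)
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
# R, t: relative rotation and unit-scale translation between the two views,
# i.e. the basic building block of an SfM pipeline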
# Multi-view depth + pose estimation (feed-forward)
images = load_views_of_scene(n_views=8)
features = MuM_encoder(images)
# pass features to a small decoder (ViT-based) that predicts:
# - per-image camera intrinsics & extrinsics
# - per-pixel depth for each image
# - optional point cloud (via depth + pose)
camera_poses, depth_maps, point_cloud = depth_pose_decoder(features)
Expected: you get a full 3D reconstruction from an arbitrary number of views — usable in pipelines for 3D scanning or scene capture.
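And if the decoder hands you depths and poses but you want the point cloud yourself, the unprojection is a few lines. A NumPy sketch, assuming K is the 3x3 intrinsics and cam_to_world a 4x4 camera-to-world pose in the decoder's output convention:

import numpy as np

def depth_to_points(depth, K, cam_to_world):
    # back-project every pixel of one view and move it into world coordinates
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                   # camera-space rays
    pts_cam = rays * depth.reshape(-1, 1)                             # scale by depth
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]                            # world-space points

# Concatenating depth_to_points over all views (each with its own pose) gives
# the fused point cloud from the recipe above.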
# Using MuM for transfer learning — two-view relative pose estimation
# Step 1: extract features from the frozen MuM encoder
f1, f2 = MuM_encoder(img1, img2)
# Step 2: feed them into a small regression head for rotation & translation
rel_pose = pose_regressor(f1, f2)
Expected: better pose prediction accuracy, particularly when viewpoint difference is large or illumination changes — compared to using vanilla DINOv3 features.
Practical Insight
- If you are building a 3D reconstruction, SLAM, or multi-view vision pipeline, MuM is potentially a drop-in pretrained backbone that can replace semantic-only backbones.
- Because the objective and architecture are relatively simple (ViT encoder + decoder; pixel reconstruction), it's easier to retrain or adapt than heavy student–teacher distillation pipelines.
- Pretraining cost is modest compared to some large SSL runs: according to the paper, MuM uses around 4,608 A100-hours vs 61,440 H100-hours for something like DINOv3-7B (when scaled up).
- That said — for purely semantic tasks (classification, segmentation) MuM will underperform. So it makes sense as a geometry backbone, not a "universal vision backbone."
In short: For any system where geometry matters more than semantics — e.g. 3D scanning, pose estimation, SfM, AR — MuM is a practical backbone that is straightforward to adopt.
Conclusion
MuM delivers a deceptively simple yet powerful idea: extend masked image modeling to multi-view inputs. The result is a self-supervised model that learns geometry-aware features — depth, pose, matching — directly from unlabeled image collections. The empirical results show clear gains over both semantic pretrained models and previous two-view geometry approaches.
If you care about 3D vision and want a robust, generalizable backbone — start playing with MuM. It may not give you "vision that sees meaning," but it gives you "vision that understands space."


