MuM: Why Masking Multiple Views Beats Single-View Pretraining for 3D

We love MAE-like tricks for 2D images. But what if we want features that understand 3D — depth, pose, geometry — not just semantics? MuM shows that by simply masking multiple views at once and training a decoder to reconstruct pixels, you get drastically better 3D features. Here's why that matters for real-world vision pipelines.
Introduction
If you've spent time building 3D vision systems — for structure-from-motion (SfM), 3D reconstruction, dense matching, pose estimation — you know that standard pretraining (on semantic tasks) gives you nice features for recognizing "this is a dog," but sucks for "reconstruct the geometry of that scene."
Most self-supervised learning (SSL) approaches for images — like MAE or semantically driven models — target semantics. That leaves geometric reasoning to downstream fine-tuning, which often requires lots of labelled 3D data.
MuM flips the game: it shows that by extending masked image modeling to multi-view data (multiple images of the same scene, from different views), you can learn representations that carry real 3D information — without any explicit geometry supervision.
If you care about 3D pipelines: this is the closest thing to "pretraining for geometry" that's worked out of the box.
How it Works
At core, MuM extends the standard masked auto-encoder (MAE) idea to sets of images of the same scene but from different viewpoints.
- Input: a variable-length sequence of images (2 up to ~24 views) from the same scene. Each image is "patchified" (like in a ViT), and a subset of patches in each view is randomly masked — same uniform masking ratio for all views.
- Architecture: a Vision Transformer (ViT) encoder processes the visible patches of each view independently. A ViT-B decoder then applies alternating attention: first attention within each view, then global attention across all views — no reference "anchor" frame is chosen.
- Loss / Target: for every masked patch, the model regresses normalized RGB values (i.e. raw pixel reconstruction, like in MAE); see the sketch right after this list. For single-view inputs, this reduces exactly to standard MAE.
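To make the objective concrete, here is a minimal PyTorch-style sketch of the uniform per-view masking and the normalized-pixel regression target. The helper names (patchify, sample_mask, multiview_mae_loss) are my own, not the paper's code, and the 0.75 masking ratio is an assumption borrowed from MAE since the exact value isn't restated here.

import torch

def patchify(imgs, patch=16):
    # (B, 3, H, W) -> (B, N, patch*patch*3) non-overlapping patches, as in a ViT
    B, C, H, W = imgs.shape
    h, w = H // patch, W // patch
    x = imgs.reshape(B, C, h, patch, w, patch)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, patch * patch * C)

def sample_mask(B, V, N, ratio=0.75):
    # same uniform masking ratio in every view; 1 = masked (ratio assumed, MAE default)
    idx = torch.rand(B, V, N).topk(int(round(N * ratio)), dim=-1).indices
    return torch.zeros(B, V, N).scatter_(-1, idx, 1.0)

def multiview_mae_loss(pred, views, mask, patch=16, eps=1e-6):
    # views: (B, V, 3, H, W) images of one scene from V viewpoints
    # pred:  (B, V, N, D) decoder output for every patch of every view
    # mask:  (B, V, N) from sample_mask
    B, V = views.shape[:2]
    target = patchify(views.flatten(0, 1), patch).unflatten(0, (B, V))
    mean, var = target.mean(-1, keepdim=True), target.var(-1, keepdim=True)
    target = (target - mean) / (var + eps).sqrt()   # normalized RGB, MAE-style
    loss = (pred - target).pow(2).mean(-1)          # per-patch MSE
    return (loss * mask).sum() / mask.sum()         # only masked patches count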
Key design choices / "magic numbers" (collected into a config sketch after this list):
- Masking ratio: same across all views, uniform.
- Random sequence length between 2 and 24 views per training sample.
- Pretraining dataset: a large mixture of 3D / multi-view / video datasets — collectively ~20 million frames.
- Training budget: 500k steps with AdamW and a cosine LR schedule (25k warmup steps); the peak learning rate is scaled to the global batch size of 6144.
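Pulled together, the recipe reads roughly like the config below. This is just the list above restated as a Python dict for skimmability; the peak learning rate itself isn't quoted here, so it stays as a comment.

pretrain_config = dict(
    views_per_sample=(2, 24),      # sequence length sampled per training example
    masking="uniform, same ratio for every view",
    data="mixture of 3D / multi-view / video datasets, ~20M frames",
    steps=500_000,
    optimizer="AdamW",
    lr_schedule="cosine",
    warmup_steps=25_000,
    global_batch_size=6144,        # peak LR scaled to this batch size (value not restated here)
)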
The insight: by forcing the network to reconstruct masked pixels from multiple views, the model is nudged to learn a representation that "knows" how different views of the same scene correlate — implicitly encoding geometry, depth, and viewpoint differences — rather than just independent semantics per image.
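The mechanism that lets a masked patch "borrow" evidence from other views is the decoder's alternating attention. Here is a minimal sketch of one such stage, assuming off-the-shelf PyTorch transformer layers rather than the authors' exact blocks:

import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    # One decoder stage: self-attention inside each view, then global
    # attention over the concatenated tokens of all views (no anchor frame).
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.intra = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True, norm_first=True)
        self.cross = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True, norm_first=True)

    def forward(self, tokens):
        # tokens: (B, V, N, D) with B scenes, V views, N patch tokens per view
        B, V, N, D = tokens.shape
        x = self.intra(tokens.reshape(B * V, N, D))   # frame-wise attention
        x = self.cross(x.reshape(B, V * N, D))        # attention across all views
        return x.reshape(B, V, N, D)

# A full decoder would stack several of these blocks over the visible tokens
# plus learned mask tokens, then project each token back to pixel space.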
Comparison
The paper benchmarks MuM against two baselines:
- DINOv3 — a top-tier self-supervised model optimized for semantic understanding.
- CroCo v2 — a previous masked-image approach designed for 2-view geometry.
Multi-view 3D reconstruction (n ≥ 2 views)
- On multi-view feed-forward reconstruction tasks (pose + depth + point cloud), a frozen MuM encoder + a simple ViT-based head outperforms DINOv3 and CroCo v2 on multiple datasets (indoor, outdoor, object-centric).
- With distillation finetuning (imitating a heavier 3D model), MuM achieves strong camera pose estimation (AUC@30°) and point-cloud reconstruction (median accuracy/completeness), outperforming both baselines.
Two-view dense matching & pose estimation
- Linear-probe dense matching: MuM significantly reduces end-point-error (EPE) and increases robustness compared to DINOv3 and CroCo v2. For example, on the MegaDepth dataset, MuM reaches EPE ≈ 10.2 px vs ~19 px for DINOv3.
- Plugging MuM features into a full matching pipeline (e.g. RoMa) improves matching accuracy over the baselines.
- Relative pose estimation (after finetuning) also sees gains: on several datasets, MuM delivers higher accuracy at angular thresholds (5°, 10°, 20°) than CroCo v2 and DINOv3. (Both EPE and this angular-threshold accuracy are sketched in code right after this list.)
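For reference, the two metrics quoted above are straightforward to compute. A NumPy sketch with my own helper names; over a dataset you would average these per-pair numbers:

import numpy as np

def end_point_error(pred_flow, gt_flow, valid=None):
    # pred_flow, gt_flow: (H, W, 2) dense pixel displacements; EPE = mean L2 error
    err = np.linalg.norm(pred_flow - gt_flow, axis=-1)
    return err[valid].mean() if valid is not None else err.mean()

def rotation_accuracy(R_pred, R_gt, thresholds_deg=(5, 10, 20)):
    # angular error of the predicted relative rotation, thresholded
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    err_deg = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return {t: float(err_deg <= t) for t in thresholds_deg}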
Single-view tasks (depth, normals, semantics)
- On purely semantic tasks (e.g. classification / segmentation), MuM lags behind DINOv3 — no surprise, since its objective prioritizes geometry over semantics.
- But for geometry-related single-view tasks (depth, normals), MuM produces competitive results — showing that its learned features generalize beyond multi-view settings.
In short: MuM gives you geometry-aware features that outperform semantic pretrained models on geometry tasks — without expensive per-task supervision.
The Playground
Here are three pseudocode-like examples / recipes showing how one might use MuM in practice:
# Feature extraction for dense matching
images = load_views_of_scene(n_views=4)            # e.g. 4 images of the same scene
features = MuM_encoder(images)                     # returns one feature map per image
matches = dense_matcher(features[0], features[1])  # dense correspondences between views 0 and 1
Expected: high-quality dense correspondences across views, better than DINOv3-based features — useful for SfM, depth fusion, mesh reconstruction.
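As a follow-on, correspondences like these drop straight into classical two-view geometry. A sketch using OpenCV, assuming matches is an (N, 4) array of (x1, y1, x2, y2) pixel coordinates from the recipe above and K is the shared 3x3 intrinsics matrix:

import cv2
import numpy as np

pts1 = matches[:, :2].astype(np.float64)
pts2 = matches[:, 2:].astype(np.float64)
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
# R, t: relative rotation and unit-scale translation between the two views,
# i.e. the basic building block of an SfM pipeline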
# Multi-view depth + pose estimation (feed-forward)
images = load_views_of_scene(n_views=8)
features = MuM_encoder(images)
# pass features to a small decoder (ViT-based) that predicts:
# - per-image camera intrinsics & extrinsics
# - per-pixel depth for each image
# - optional point cloud (via depth + pose)
camera_poses, depth_maps, point_cloud = depth_pose_decoder(features)
Expected: you get a full 3D reconstruction from an arbitrary number of views — usable in pipelines for 3D scanning or scene capture.
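And if the decoder hands you depths and poses but you want the point cloud yourself, the unprojection is a few lines. A NumPy sketch, assuming K is the 3x3 intrinsics and cam_to_world a 4x4 camera-to-world pose in the decoder's output convention:

import numpy as np

def depth_to_points(depth, K, cam_to_world):
    # back-project every pixel of one view and move it into world coordinates
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                   # camera-space rays
    pts_cam = rays * depth.reshape(-1, 1)                             # scale by depth
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]                            # world-space points

# Concatenating depth_to_points over all views (each with its own pose) gives
# the fused point cloud from the recipe above.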
# Using MuM for transfer learning — two-view relative pose estimation
# Step 1: extract features from the frozen MuM encoder
f1, f2 = MuM_encoder(img1, img2)
# Step 2: feed them into a small regression head for rotation & translation
rel_pose = pose_regressor(f1, f2)
Expected: better pose prediction accuracy, particularly when viewpoint difference is large or illumination changes — compared to using vanilla DINOv3 features.
Practical Insight
- If you are building a 3D reconstruction, SLAM, or multi-view vision pipeline, MuM is potentially a drop-in pretrained backbone that can replace semantic-only backbones.
- Because the objective and architecture are relatively simple (ViT encoder + decoder; pixel reconstruction), it's easier to retrain or adapt than heavy student–teacher distillation pipelines.
- Pretraining cost is modest compared to some large SSL runs: according to the paper, MuM uses around 4,608 A100-hours vs 61,440 H100-hours for something like DINOv3-7B (when scaled up).
- That said — for purely semantic tasks (classification, segmentation) MuM will underperform. So it makes sense as a geometry backbone, not a "universal vision backbone."
In short: For any system where geometry matters more than semantics — e.g. 3D scanning, pose estimation, SfM, AR — MuM is a practical backbone that is straightforward to adopt.
Conclusion
MuM delivers a deceptively simple yet powerful idea: extend masked image modeling to multi-view inputs. The result is a self-supervised model that learns geometry-aware features — depth, pose, matching — directly from unlabeled image collections. The empirical results show clear gains over both semantic pretrained models and previous two-view geometry approaches.
If you care about 3D vision and want a robust, generalizable backbone — start playing with MuM. It may not give you "vision that sees meaning," but it gives you "vision that understands space."


