Beyond pixels: Chorus lets AI understand the 3D world it moves through

12 June 2026

When a robot walks through a room, or an AR headset maps your living room, it sees the world as a cloud of 3D points and tiny coloured blobs, not as “chair”, “sofa” or “mug”. Bridging that gap between raw geometry and human-level understanding is one of the hardest problems in computer vision. A team from the Computer Vision group at IvI has now taken a major step forward. Their work, “Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding”, led by Yue Li, has been selected as an Award Candidate and for an oral presentation at CVPR 2026. It is one of only 74 award candidates out of 16,092 submissions worldwide, which is under half a percent.

Letting multiple expert AIs teach 3D

Modern AI systems trained on 2D images—so‑called foundation models—are remarkably good at recognising objects, describing scenes in natural language, or picking out fine details. But 3D scene representations, such as 3D Gaussian Splatting (3DGS), still lack equally powerful, general-purpose “brains”. Chorus tackles this by letting several powerful 2D AI systems “coach” a single 3D model at the same time. Instead of learning on its own, this 3D model listens to multiple expert AIs that each look at the scene with different strengths: one is especially good at linking images to words, another has broad visual understanding, and a third is particularly sharp at spotting object shapes and boundaries. Each teacher looks at images rendered from the same 3D scene and gives its own guidance. In this way, Chorus learns a shared internal “language” for 3D scenes that covers everything from high-level meaning—such as “this is a sofa”—down to fine details of geometry and edges.
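
For readers who want a concrete picture of the “several teachers, one student” idea, the toy sketch below shows one common way such a training signal can be written. It is an illustration under stated assumptions, not the Chorus implementation: the teacher names, the per-teacher linear heads, the cosine-based loss and all array sizes are hypothetical, and random arrays stand in for features that a real system would obtain from rendered views.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cosine_distill_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) between matching rows."""
    s = l2_normalize(student_feats)
    t = l2_normalize(teacher_feats)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))


def multi_teacher_loss(shared_feats, heads, teacher_targets, weights=None):
    """Weighted sum of per-teacher losses.

    Each teacher gets its own linear head (an assumption for this sketch)
    that maps the shared student features into that teacher's feature space.
    """
    total = 0.0
    for name, target in teacher_targets.items():
        projected = shared_feats @ heads[name]  # (P, D_shared) @ (D_shared, D_t)
        w = 1.0 if weights is None else weights[name]
        total += w * cosine_distill_loss(projected, target)
    return total


# Toy data: 64 pixels of one rendered view, one shared 32-dim student feature each.
num_pixels, d_shared = 64, 32
# Hypothetical teachers with different feature sizes.
teacher_dims = {"language_aligned": 16, "general_visual": 24, "boundary_aware": 8}

shared = rng.normal(size=(num_pixels, d_shared))
heads = {k: rng.normal(size=(d_shared, d)) for k, d in teacher_dims.items()}
targets = {k: rng.normal(size=(num_pixels, d)) for k, d in teacher_dims.items()}

loss = multi_teacher_loss(shared, heads, targets)
print(f"combined distillation loss: {loss:.3f}")
```

In a sketch like this, training would adjust the student and the heads so that the combined loss falls, which pushes the single shared feature to carry what every teacher cares about at once.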

Instant 3D understanding, reused everywhere

Once Chorus has been trained, it can look at a new 3D scene just once and build a rich internal “mental map” of it. That same map can then be reused for many different purposes, for example to automatically name objects in the scene in everyday language or to quickly learn new tasks even when only a small amount of training data is available.

Beyond Gaussian splats and into new domains

Crucially, the approach also works beyond Gaussian splats. A variant of Chorus that only uses the centres, colours and normals of Gaussians transfers well to pure point cloud benchmarks and outperforms point-cloud baselines while using almost forty times fewer training scenes. The team further introduces a “render-and-distill” adaptation step, which allows the model to be fine-tuned to new domains without retraining from scratch. That matters when deploying vision systems across different sensors, environments or applications.
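
To make the point-cloud variant more tangible, the sketch below shows what “only centres, colours and normals” could look like as data. It is a minimal illustration, not the authors' code: the dictionary keys, the nine-column layout and the helper name `gaussians_to_points` are assumptions made for this example, and the scene is random toy data.

```python
import numpy as np


def gaussians_to_points(gaussians):
    """Keep only centre, colour and normal per Gaussian as an (N, 9) array.

    Other Gaussian attributes (for example scale, rotation, opacity) are
    ignored, so the result looks like an ordinary coloured point cloud
    with normals.
    """
    centres = np.asarray(gaussians["centres"], dtype=np.float32)
    colours = np.asarray(gaussians["colours"], dtype=np.float32)
    normals = np.asarray(gaussians["normals"], dtype=np.float32)

    if centres.ndim != 2 or centres.shape[1] != 3:
        raise ValueError("centres must have shape (N, 3)")
    if colours.shape != centres.shape or normals.shape != centres.shape:
        raise ValueError("colours and normals must match centres in shape")

    # Rescale normals to unit length, guarding against zero vectors.
    lengths = np.linalg.norm(normals, axis=1, keepdims=True)
    normals = normals / np.maximum(lengths, 1e-8)

    return np.concatenate([centres, colours, normals], axis=1)


# Toy scene with 5 Gaussians; keys and values are hypothetical.
rng = np.random.default_rng(0)
scene = {
    "centres": rng.normal(size=(5, 3)),
    "colours": rng.uniform(size=(5, 3)),
    "normals": rng.normal(size=(5, 3)),
    "scales": rng.uniform(size=(5, 3)),     # ignored by the point-cloud variant
    "opacities": rng.uniform(size=(5, 1)),  # ignored by the point-cloud variant
}

points = gaussians_to_points(scene)
print(points.shape)
```

Because the output has the same kind of per-point attributes that many point-cloud datasets provide, an encoder fed this way can in principle be evaluated on point-cloud benchmarks without any Gaussian-specific inputs.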

A strong CVPR showing for IvI Computer Vision

Chorus headlines a broader set of contributions from the Computer Vision team at CVPR 2026, including:

Unblur-SLAM: Dense Neural SLAM for Blurry Inputs – robust dense mapping and localisation even when the camera images are blurred.
Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction – using language to query and understand 3D scenes efficiently.
Gaussian Mapping for Evolving Scenes – handling 3D scenes that change over time, not just static environments.
RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos – inferring how people interact with previously unseen objects from a single video stream.

With Chorus and these companion works, the IvI Computer Vision group shows how future machines might not just capture 3D worlds but truly understand them.

More information

Chorus project website