12 June 2026
Modern AI systems trained on 2D images, so-called foundation models, are remarkably good at recognising objects, describing scenes in natural language and picking out fine details. But 3D scene representations, such as 3D Gaussian Splatting (3DGS), still lack equally powerful, general-purpose “brains”. Chorus tackles this by letting several powerful 2D AI systems “coach” a single 3D model at the same time. Instead of learning on its own, this 3D model listens to multiple expert AIs, each with different strengths: one is especially good at linking images to words, another has broad visual understanding, and a third is particularly sharp at spotting object shapes and boundaries. Each teacher looks at images rendered from the same 3D scene and gives its own guidance. In this way, Chorus learns a shared internal “language” for 3D scenes that covers everything from high-level meaning, such as “this is a sofa”, down to fine details of geometry and edges.
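For readers who want a more concrete picture, the coaching idea can be sketched as a multi-teacher distillation loss. The snippet below is only an illustration under stated assumptions, not the team's implementation: the teacher names, the feature sizes, the linear projection "heads" and the cosine-distance objective are all hypothetical choices made for this example. It assumes the 3D model has already been rendered into one shared feature vector per pixel, and that each teacher supplies its own feature vector for the same pixels.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-8):
    """Mean (1 - cosine similarity) between matching rows of a and b."""
    a_unit = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_unit = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a_unit * b_unit, axis=1)))

def multi_teacher_loss(student_feats, heads, teacher_feats, weights=None):
    """Weighted sum of one distillation loss per teacher.

    student_feats: (N, D) shared features rendered from the 3D scene.
    heads:         dict name -> (D, D_t) matrix mapping shared features
                   into that teacher's feature space (hypothetical).
    teacher_feats: dict name -> (N, D_t) features from that 2D teacher.
    weights:       optional dict name -> float, defaults to 1.0 each.
    """
    total = 0.0
    for name, target in teacher_feats.items():
        projected = student_feats @ heads[name]
        weight = 1.0 if weights is None else weights[name]
        total += weight * cosine_distance(projected, target)
    return total

# Toy example with random numbers standing in for real features.
rng = np.random.default_rng(0)
num_pixels, shared_dim = 16, 8
# Hypothetical teachers and feature sizes, chosen only for illustration.
teacher_dims = {"language": 12, "general": 10, "boundary": 6}

student = rng.normal(size=(num_pixels, shared_dim))
heads = {name: rng.normal(size=(shared_dim, dim))
         for name, dim in teacher_dims.items()}
teachers = {name: rng.normal(size=(num_pixels, dim))
            for name, dim in teacher_dims.items()}

loss = multi_teacher_loss(student, heads, teachers)
print(f"combined distillation loss: {loss:.3f}")
```

In a real training loop, this combined loss would be minimised by updating both the 3D model and the projection heads, so that one shared representation has to satisfy all teachers at once.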
Once Chorus has been trained, it can look at a new 3D scene just once and build a rich internal “mental map” of it. That same map can then be reused for many different purposes, for example to automatically name objects in the scene in everyday language or to quickly learn new tasks even when only a small amount of training data is available.
Crucially, the approach also works beyond Gaussian splats. A variant of Chorus that uses only the centres, colours and normals of Gaussians transfers well to pure point cloud benchmarks and outperforms point-cloud baselines while using almost forty times fewer training scenes. The team further introduces a “render-and-distill” adaptation step, allowing the model to be fine-tuned to new domains without retraining from scratch, which matters when deploying vision systems across different sensors, environments or applications.
Chorus headlines a broader set of contributions from the Computer Vision team at CVPR 2026.
With Chorus and these companion works, the IvI Computer Vision group shows how future machines might not just capture 3D worlds but truly understand them.