Do Multimodal LLMs Understand Intraoral Dental Data? Dataset, Platform, and Baselines

Lumetti, Luca; Rizzo, Federico; Cremonini, Francesca; Candeloro, Ettore; Luca, Lombardo; Grana, Costantino; Bolelli, Federico

Progress in dental computer vision is limited by the absence of large-scale multimodal datasets that jointly capture 3D intraoral geometry and 2D appearance across diverse clinical settings. Existing resources are typically unimodal, which hinders robust cross-modal learning and generalization. We assemble and release a multi-center dataset of 1,000 patients comprising 2,000 registered upper/lower intraoral scans, 5,000 paired intraoral photographs, and 2,403 clinician-authored reports. This combination links detailed 3D dental geometry with complementary 2D evidence, supporting occlusal and orthodontic analysis. Moreover, to enable scalable and privacy-preserving acquisition and annotation across distributed centers, we introduce an open platform that supports multimodal ingestion and structured labeling. Experiments indicate that state-of-the-art multimodal models fail to generate clinically faithful reports, motivating geometry-aware adaptation. We therefore propose IOS-Qwen, which fuses a PointTransformer 3D encoder with Qwen3-VL to generate structured, point-cloud-conditioned reports. Together, the dataset, the platform, and the baselines establish a foundation for multimodal dental AI research. Code is publicly released at https://github.com/AImageLab-zip/IOS-Report.
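As a rough illustration of how the dataset counts in the abstract fit together, the following Python sketch models one patient as a record holding an upper and a lower scan, a list of photographs, and a list of reports. The class name `PatientRecord`, its field names, the file paths, and the helper functions are hypothetical and are not taken from the released code. The figure of five photographs per patient is only the average implied by 5,000 photographs over 1,000 patients; the abstract does not state that the split is uniform, nor how the 2,403 reports are distributed, so the toy data leaves reports empty.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PatientRecord:
    """Hypothetical per-patient record; all field names are illustrative."""
    patient_id: str
    upper_scan: str                                    # path to the upper-arch intraoral scan
    lower_scan: str                                    # path to the lower-arch intraoral scan
    photos: List[str] = field(default_factory=list)    # paired intraoral photographs
    reports: List[str] = field(default_factory=list)   # clinician-authored reports


def summarize(records: List[PatientRecord]) -> Dict[str, int]:
    """Count patients, scans, photographs, and reports across all records."""
    return {
        "patients": len(records),
        "scans": 2 * len(records),  # one upper and one lower scan per patient
        "photos": sum(len(r.photos) for r in records),
        "reports": sum(len(r.reports) for r in records),
    }


def build_toy_records(n_patients: int = 1000,
                      photos_per_patient: int = 5) -> List[PatientRecord]:
    """Build placeholder records (assumption: uniform photo count per patient)."""
    records = []
    for i in range(n_patients):
        pid = f"patient_{i:04d}"
        records.append(
            PatientRecord(
                patient_id=pid,
                upper_scan=f"{pid}/upper.ply",  # hypothetical file layout
                lower_scan=f"{pid}/lower.ply",
                photos=[f"{pid}/photo_{k}.jpg" for k in range(photos_per_patient)],
            )
        )
    return records


summary = summarize(build_toy_records())
print(summary)
```

Under these assumptions the summary reproduces the 2,000 scans and 5,000 photographs quoted in the abstract. The report count stays at zero because the toy records carry no reports.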

Do Multimodal LLMs Understand Intraoral Dental Data? Dataset, Platform, and Baselines / Lumetti, L., Rizzo, F., Cremonini, F., Candeloro, E., Luca, L., Grana, C., Bolelli, F. (2026). 19th European Conference on Computer Vision (ECCV 2026), Malmö, Sweden, Sep 8-12.