Do Multimodal LLMs Understand Intraoral Dental Data? Dataset, Platform, and Baselines / Lumetti, L., Rizzo, F., Cremonini, F., Candeloro, E., Luca, L., Grana, C., Bolelli, F. - (2026). (19th European Conference on Computer Vision, ECCV 2026, Malmo, Sweden, Sep 8-12).
Do Multimodal LLMs Understand Intraoral Dental Data? Dataset, Platform, and Baselines
Lumetti, Luca; Candeloro, Ettore; Grana, Costantino; Bolelli, Federico
2026
Abstract
Progress in dental computer vision is limited by the absence of large-scale multimodal datasets that jointly capture 3D intraoral geometry and 2D appearance across diverse clinical settings. Existing resources are typically unimodal, which hinders robust cross-modal learning and generalization. We assemble and release a multi-center dataset of 1,000 patients comprising 2,000 registered upper/lower intraoral scans, 5,000 paired intraoral photographs, and 2,403 clinician-authored reports. This combination links detailed 3D dental geometry with complementary 2D evidence, supporting occlusal and orthodontic analysis. Moreover, to enable scalable and privacy-preserving acquisition and annotation across distributed centers, we introduce an open platform that supports multimodal ingestion and structured labeling. Experiments indicate that state-of-the-art multimodal models fail to generate clinically faithful reports, motivating geometry-aware adaptation. We therefore propose IOS-Qwen, which fuses a PointTransformer 3D encoder with Qwen3-VL to generate structured, point-cloud-conditioned reports. Together, the dataset, the platform, and the baselines establish a foundation for multimodal dental AI research. Code is publicly released (https://github.com/AImageLab-zip/IOS-Report).

| File | Access | Type | Size | Format |
|---|---|---|---|---|
| ECCV2026_bite2text.pdf | Open access | AAM - Author's version, revised and accepted for publication | 38.69 MB | Adobe PDF |
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.