While several approaches to bringing vision and language together are emerging, none has yet addressed the digital humanities domain, which is nevertheless a rich source of visual and textual data. To foster research in this direction, we investigate the learning of visual-semantic embeddings for historical document illustrations, devising both supervised and semi-supervised approaches. We exploit the joint visual-semantic embeddings to automatically align illustrations and textual elements, thus providing an automatic annotation of the visual content of a manuscript. Experiments are performed on the Borso d'Este Holy Bible, one of the most sophisticated illuminated manuscripts of the Renaissance, which we manually annotate by aligning every illustration with textual commentaries written by experts. Experimental results quantify the domain shift between ordinary visual-semantic datasets and the proposed one, validate the proposed strategies, and suggest directions for future work along the same line.
Aligning Text and Document Illustrations: towards Visually Explainable Digital Humanities / Baraldi, Lorenzo; Cornia, Marcella; Grana, Costantino; Cucchiara, Rita. - (2018), pp. 1097-1102. (Paper presented at the International Conference on Pattern Recognition, held in Beijing, China, August 20th-24th, 2018) [10.1109/ICPR.2018.8545064].