Visual Saliency for Image Captioning in New Multimedia Services / Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita. - (2017), pp. 309-314. (Paper presented at the 2017 IEEE International Conference on Multimedia and Expo Workshops, held in Hong Kong, July 10-14, 2017) [10.1109/ICMEW.2017.8026277].

Visual Saliency for Image Captioning in New Multimedia Services

Cornia, Marcella; Baraldi, Lorenzo; Serra, Giuseppe; Cucchiara, Rita
2017

Abstract

Image and video captioning are important tasks in visual data analytics, as they concern the capability of describing visual content in natural language. They are the pillars of query answering systems, improve indexing and search, and allow a natural form of human-machine interaction. Even though promising deep learning strategies are becoming popular, the heterogeneity of large image archives keeps this task far from solved. In this paper we explore how visual saliency prediction can support image captioning. Recently, several forms of unsupervised machine attention mechanisms have emerged, but the role of human attention prediction has never been examined extensively for captioning. We propose a machine attention model driven by saliency prediction to generate captions for images, which can be exploited in many cloud and multimedia services. Experimental evaluations are conducted on the SALICON dataset, which provides ground truths for both saliency and captioning, and on the large Microsoft COCO dataset, the most widely used for image captioning.
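The core idea of saliency-driven attention can be sketched as follows. This is a minimal illustration under stated assumptions, not the architecture proposed in the paper: we assume the saliency predictor yields a per-region saliency score, and combine it with the decoder's attention scores as a multiplicative prior (the function name and the combination rule are illustrative choices).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def saliency_driven_attention(regions, att_scores, saliency):
    """Combine machine attention scores with a predicted saliency prior.

    regions:    (N, D) array of image region features
    att_scores: (N,) unnormalized attention scores from the caption decoder
    saliency:   (N,) predicted human-saliency values in (0, 1]

    Returns the attended context vector and the attention weights.
    Adding log-saliency to the scores acts as a multiplicative prior
    after the softmax, so regions humans are predicted to look at
    receive larger attention weights.
    """
    eps = 1e-8
    weights = softmax(att_scores + np.log(saliency + eps))
    context = weights @ regions  # (D,) weighted sum of region features
    return context, weights

# Toy usage: 4 image regions with 3-dimensional features
rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 3))
att_scores = np.array([0.2, 1.0, 0.1, 0.5])
saliency = np.array([0.9, 0.1, 0.8, 0.2])
context, weights = saliency_driven_attention(regions, att_scores, saliency)
```

In this toy example the second region has the highest raw attention score, but its low saliency pushes its final weight down; the context vector fed to the caption decoder is biased toward regions that a human would fixate.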
Year: 2017
Conference: 2017 IEEE International Conference on Multimedia and Expo Workshops
Location: Hong Kong
Dates: July 10-14, 2017
Pages: 309-314
Files in this record:
File: main.pdf (Open access)
Type: Author's version, revised and accepted for publication
Size: 2.22 MB
Format: Adobe PDF
Creative Commons License
Metadata in IRIS UNIMORE are released under a Creative Commons CC0 1.0 Universal license, while publication files are released under an Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright violation, contact Iris Support.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1130904
Citations
  • PMC: not available
  • Scopus: 20
  • Web of Science: 8