FOIL it! Find One mismatch between Image and Language caption

Shekhar, Ravi; Pezzelle, Sandro; Klimovich, Yauhen; Herbelot, Aurelie; Nabi, Moin; Sangineto, Enver; Bernardi, Raffaella

doi:10.18653/v1/P17-1024

In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities. To this end, we propose an extension of the MS-COCO dataset, FOIL-COCO, which associates images with both correct and ‘foil’ captions, that is, descriptions of the image that are highly similar to the original ones, but contain one single mistake (‘foil word’). We show that current LaVi models fall into the traps of this data and perform badly on three tasks: a) caption classification (correct vs. foil); b) foil word detection; c) foil word correction. Humans, in contrast, have near-perfect performance on those tasks. We demonstrate that merely utilising language cues is not enough to model FOIL-COCO and that it challenges the state-of-the-art by requiring a fine-grained understanding of the relation between text and image.
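As an illustration only, the data layout implied by the abstract can be sketched as follows: a foil caption differs from a correct caption in exactly one word, and each of the three tasks has a gold answer that follows from that single substitution. The example caption, the image identifier, and all function and field names below are invented for this sketch and are not taken from FOIL-COCO or from the authors' code. The sketch covers only the gold labels, not any LaVi model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CaptionExample:
    """One image-caption pair (hypothetical layout, not the FOIL-COCO schema)."""
    image_id: str                      # hypothetical identifier
    caption: str
    is_foil: bool                      # gold label for task (a): classification
    foil_word: Optional[str] = None    # gold answer for task (b): detection
    target_word: Optional[str] = None  # gold answer for task (c): correction


def make_foil(original: CaptionExample, target_word: str, foil_word: str) -> CaptionExample:
    """Replace exactly one occurrence of target_word to build a foil caption."""
    tokens = original.caption.split()
    if tokens.count(target_word) != 1:
        raise ValueError("target word must occur exactly once in the caption")
    foiled = [foil_word if tok == target_word else tok for tok in tokens]
    return CaptionExample(
        image_id=original.image_id,
        caption=" ".join(foiled),
        is_foil=True,
        foil_word=foil_word,
        target_word=target_word,
    )


def mismatch_position(correct_caption: str, foil_caption: str) -> int:
    """Return the index of the single token where the two captions differ."""
    a, b = correct_caption.split(), foil_caption.split()
    if len(a) != len(b):
        raise ValueError("captions must have the same number of tokens")
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diffs) != 1:
        raise ValueError("captions must differ in exactly one token")
    return diffs[0]


# Invented example caption; not drawn from MS-COCO.
correct = CaptionExample(image_id="img-0001", caption="a dog sits on a bench", is_foil=False)
foil = make_foil(correct, target_word="dog", foil_word="cat")

position = mismatch_position(correct.caption, foil.caption)
print(foil.caption)                 # caption with the single substituted word
print(foil.is_foil)                 # task (a): correct vs. foil
print(position, foil.foil_word)     # task (b): which word is the foil
print(foil.target_word)             # task (c): the word that repairs the caption
```

A model evaluated in this setting would see only the image and the caption; the fields `is_foil`, `foil_word`, and `target_word` are what its predictions would be scored against.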

FOIL it! Find One mismatch between Image and Language caption / Shekhar, Ravi; Pezzelle, Sandro; Klimovich, Yauhen; Herbelot, Aurelie; Nabi, Moin; Sangineto, Enver; Bernardi, Raffaella. - (2017), pp. 255-265. (55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, July 30th - August 4th, 2017) [10.18653/v1/P17-1024].

Publication year: 2017
Conference title: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017
Conference location: Vancouver
Conference dates: July 30th - August 4th, 2017
DOI: https://dx.doi.org/10.18653/v1/P17-1024
WoS ID: WOS:000493984800024
Scopus ID: 2-s2.0-85040908564
First page: 255
Last page: 265
Type: Paper in conference proceedings

Files in this item:

File	Size	Format
foil_acl17.pdf (Open access; Type: VOR - version published by the publisher; License: [IR] creative-commons)	2.99 MB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International license (CC BY 4.0), unless otherwise indicated.
In case of copyright infringement, contact Iris Support.