Vision and language integration: Moving beyond objects / Shekhar, R.; Pezzelle, S.; Herbelot, A.; Nabi, M.; Sangineto, E.; Bernardi, R. - (2017). (Paper presented at the 12th International Conference on Computational Semantics, IWCS 2017, held in Montpellier, France, 19-22 September 2017).

Vision and language integration: Moving beyond objects

Sangineto E.;
2017

Abstract

Recent years have seen an explosion of work on the integration of vision and language data. New tasks like Image Captioning and Visual Question Answering have been proposed and impressive results have been achieved. There is now a shared desire to gain an in-depth understanding of the strengths and weaknesses of those models. To this end, several datasets have been proposed to try to challenge the state of the art. Those datasets, however, mostly focus on the interpretation of objects (as denoted by nouns in the corresponding captions). In this paper, we reuse a previously proposed methodology to evaluate the ability of current systems to move beyond objects and deal with attributes (as denoted by adjectives), actions (verbs), manner (adverbs) and spatial relations (prepositions). We show that the coarse representations given by current approaches are not informative enough to interpret attributes or actions, whilst spatial relations fare somewhat better, but only in attention models.
Year: 2017
Conference: 12th International Conference on Computational Semantics, IWCS 2017
Venue: Montpellier, France
Dates: 19-22 September 2017
Authors: Shekhar, R.; Pezzelle, S.; Herbelot, A.; Nabi, M.; Sangineto, E.; Bernardi, R.
Files in this record:
File: Vision.pdf
Access: Restricted (copy available on request)
Type: Author's version, revised and accepted for publication
Size: 591.98 kB
Format: Adobe PDF

Use this identifier to cite or link to this record: https://hdl.handle.net/11380/1281620
Citations
  • PMC: not available
  • Scopus: 10
  • Web of Science (ISI): not available