The Unreasonable Effectiveness of CLIP features for Image Captioning: an Experimental Analysis / Barraco, Manuele; Cornia, Marcella; Cascianelli, Silvia; Baraldi, Lorenzo; Cucchiara, Rita. - 2022-:(2022), pp. 4661-4669. (Paper presented at the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022, held in New Orleans, Louisiana, June 19-24, 2022) [10.1109/CVPRW56347.2022.00512].

The Unreasonable Effectiveness of CLIP features for Image Captioning: an Experimental Analysis

Manuele Barraco; Marcella Cornia; Silvia Cascianelli; Lorenzo Baraldi; Rita Cucchiara
2022

Abstract

Generating textual descriptions from visual inputs is a fundamental step towards machine intelligence, as it entails modeling the connections between the visual and textual modalities. For years, image captioning models have relied on pre-trained visual encoders and object detectors, trained on relatively small sets of data. Recently, it has been observed that large-scale multi-modal approaches like CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, provide a strong zero-shot capability on various vision tasks. In this paper, we study the advantage brought by CLIP in image captioning, employing it as a visual encoder. Through extensive experiments, we show how CLIP can significantly outperform widely-used visual encoders and quantify its role under different architectures, variants, and evaluation protocols, ranging from classical captioning performance to zero-shot transfer.
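As a rough illustration of the setup the abstract describes, the sketch below shows how CLIP patch features can stand in for detector-based region features as the input to a captioning decoder. The checkpoint name, the HuggingFace transformers API, the example image path, and the toy three-layer decoder are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: CLIP as a frozen visual encoder for image captioning.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
pixels = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():  # the CLIP encoder stays frozen; only the decoder is trained
    feats = encoder(pixel_values=pixels).last_hidden_state  # (1, 50, 768) for ViT-B/32

# The grid of patch features (plus the [CLS] token) takes the place of the
# region features produced by an object detector, and an autoregressive
# Transformer decoder attends to it via cross-attention while generating words.
layer = torch.nn.TransformerDecoderLayer(d_model=768, nhead=8, batch_first=True)
decoder = torch.nn.TransformerDecoder(layer, num_layers=3)
tokens = torch.randn(1, 20, 768)  # stand-in for embedded caption tokens
hidden = decoder(tgt=tokens, memory=feats)
print(hidden.shape)  # torch.Size([1, 20, 768]); projected to the vocabulary in practice
```

In this arrangement, swapping the visual backbone only changes what is fed as `memory` to the decoder, which is what makes a controlled comparison between CLIP and conventional encoders straightforward.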
Year: 2022
Conference: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022
Location: New Orleans, Louisiana
Dates: June 19-24, 2022
Pages: 4661-4669
Files in this record:
2022_CVPRW_CLIP_Features.pdf (Adobe PDF, 396.33 kB)
Open access
Type: AAM, author's version revised and accepted for publication

Creative Commons License
Metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact Supporto Iris.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1271225
Citations
  • PMC: not available
  • Scopus: 56
  • Web of Science (ISI): 48