
Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation

Sara Sarto; Manuele Barraco; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara
2023

Abstract

The CLIP model has recently been shown to be highly effective for a variety of cross-modal tasks, including the evaluation of captions generated by vision-and-language models. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely the Positive-Augmented Contrastive learning Score (PAC-S), which unifies, in a novel way, the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics such as CIDEr and SPICE and reference-free metrics such as CLIP-Score. Finally, we test the system-level correlation of the proposed metric on popular image captioning approaches and assess the impact of employing different cross-modal features. We publicly release our source code and trained models.
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Vancouver
Jun 18-22
Sarto, Sara; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation / Sarto, Sara; Barraco, Manuele; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita. - (2023). (Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, held in Vancouver, Jun 18-22).
Files in this item:

2023_CVPR_Captioning_Evaluation.pdf
  Access: Open access
  Type: Author's version, revised and accepted for publication
  Size: 2.08 MB
  Format: Adobe PDF

Creative Commons License
The metadata in IRIS UNIMORE is released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact Iris Support

Use this identifier to cite or link to this item: https://hdl.handle.net/11380/1298505