Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits, effectively evaluating images generated by different editing models with a strong correlation with human judgment. We publicly release our source code, models, and data.

What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models / Baraldi, Lorenzo; Bucciarelli, Davide; Betti, Federico; Cornia, Marcella; Baraldi, Lorenzo; Sebe, Nicu; Cucchiara, Rita. - (2025). ( IEEE/CVF International Conference on Computer Vision Honolulu, Hawaii Oct 19 – 23th, 2025).

What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

Lorenzo Baraldi;Marcella Cornia;Lorenzo Baraldi;Nicu Sebe;Rita Cucchiara
2025

Abstract

Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits, effectively evaluating images generated by different editing models with a strong correlation with human judgment. We publicly release our source code, models, and data.
2025
IEEE/CVF International Conference on Computer Vision
Honolulu, Hawaii
Oct 19 – 23th, 2025
Baraldi, Lorenzo; Bucciarelli, Davide; Betti, Federico; Cornia, Marcella; Baraldi, Lorenzo; Sebe, Nicu; Cucchiara, Rita
What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models / Baraldi, Lorenzo; Bucciarelli, Davide; Betti, Federico; Cornia, Marcella; Baraldi, Lorenzo; Sebe, Nicu; Cucchiara, Rita. - (2025). ( IEEE/CVF International Conference on Computer Vision Honolulu, Hawaii Oct 19 – 23th, 2025).
File in questo prodotto:
Non ci sono file associati a questo prodotto.
Pubblicazioni consigliate

Licenza Creative Commons
I metadati presenti in IRIS UNIMORE sono rilasciati con licenza Creative Commons CC0 1.0 Universal, mentre i file delle pubblicazioni sono rilasciati con licenza Attribuzione 4.0 Internazionale (CC BY 4.0), salvo diversa indicazione.
In caso di violazione di copyright, contattare Supporto Iris

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11380/1381189
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact