Verifier Matters: Enhancing Inference-Time Scaling for Video Diffusion Models / Baraldi, Lorenzo; Bucciarelli, Davide; Zeng, Zifan; Zhang, Chongzhe; Zhang, Qunli; Cornia, Marcella; Baraldi, Lorenzo; Liu, Feng; Hu, Zheng; Cucchiara, Rita. - (2025). ( British Machine Vision Conference Sheffield, UK 24th - 27th November 2025).

Verifier Matters: Enhancing Inference-Time Scaling for Video Diffusion Models

Lorenzo Baraldi; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara
2025

Abstract

Inference-time scaling has recently gained attention as an effective strategy for improving the performance of generative models without requiring additional training. Although this paradigm has been successfully applied in text and image generation tasks, its extension to video diffusion models remains relatively underexplored. Indeed, video generation presents unique challenges due to its spatiotemporal complexity, particularly in evaluating intermediate generated samples, a procedure that is required by inference-time scaling algorithms. In this work, we systematically investigate the role of the verifier: the scoring mechanism used to guide sampling. We show that current verifiers, when applied at early diffusion steps, face significant reliability challenges due to noisy samples. We further demonstrate that fine-tuning verifiers on partially denoised samples significantly improves early-stage evaluation and leads to gains in generation quality across multiple inference-time scaling algorithms, including Greedy Search, Beam Search, and a novel Successive Halving baseline.
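The abstract names Successive Halving as one of the search strategies but does not describe how the paper formulates it. As a rough illustration only, the sketch below shows the generic successive halving idea applied to a pool of candidate samples: denoise every surviving candidate for a few steps, score the partially denoised candidates with a verifier, and keep the better half. All names here (`successive_halving`, `toy_denoise`, `toy_verifier`) and the scalar "samples" are hypothetical stand-ins, not the authors' implementation or any library's API.

```python
import random
from typing import Callable, List


def successive_halving(
    candidates: List[float],
    denoise_step: Callable[[float, int], float],
    verifier: Callable[[float], float],
    num_steps: int,
    rounds: int,
) -> float:
    """Generic successive halving over candidate samples (illustrative only).

    Each round advances all survivors by the same number of denoising steps,
    scores them with the verifier, and discards the lower-scoring half.
    """
    if rounds < 1 or num_steps < rounds or not candidates:
        raise ValueError("need candidates, rounds >= 1 and num_steps >= rounds")

    steps_per_round = num_steps // rounds
    survivors = list(candidates)
    step = 0

    for _ in range(rounds):
        # Spend compute only on candidates that are still alive.
        for _ in range(steps_per_round):
            survivors = [denoise_step(x, step) for x in survivors]
            step += 1
        # Score intermediate (still noisy) samples and keep the top half.
        if len(survivors) > 1:
            survivors.sort(key=verifier, reverse=True)
            survivors = survivors[: max(1, len(survivors) // 2)]

    # Finish any steps left over by the integer division above.
    while step < num_steps:
        survivors = [denoise_step(x, step) for x in survivors]
        step += 1

    return max(survivors, key=verifier)


def toy_denoise(x: float, step: int) -> float:
    """Hypothetical denoiser: move the scalar halfway to its nearest integer."""
    return x + 0.5 * (round(x) - x)


def toy_verifier(x: float) -> float:
    """Hypothetical verifier: larger values count as better samples."""
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    pool = [rng.uniform(0.0, 10.0) for _ in range(8)]
    best = successive_halving(pool, toy_denoise, toy_verifier, num_steps=6, rounds=3)
    print("selected sample:", best)
```

In this sketch the verifier is called on intermediate samples, which is exactly where the abstract reports that existing verifiers become unreliable; a verifier that misranks noisy candidates in an early round would discard good samples before they are fully denoised.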
British Machine Vision Conference, Sheffield, UK, 24th - 27th November 2025

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1383749