Audio-Visual Speech Inpainting with Deep Learning / Morrone, Giovanni; Michelsanti, Daniel; Tan, Zheng-Hua; Jensen, Jesper. - (2021), pp. 6653-6657. (Paper presented at the 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021, held in Toronto, Canada, 6-11 June 2021) [10.1109/ICASSP39728.2021.9413488].
Audio-Visual Speech Inpainting with Deep Learning
Giovanni Morrone; Daniel Michelsanti; Zheng-Hua Tan; Jesper Jensen
2021
Abstract
In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, whose structure differs substantially from that of speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different durations. We also experiment with a multi-task learning approach in which a phone recognition task is learned jointly with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly as gaps grow longer, while the proposed audio-visual approach plausibly restores the missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.
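To make the task setup concrete, below is a minimal PyTorch sketch of the audio-visual inpainting scenario the abstract describes: a spectrogram with a contiguous masked gap, per-frame visual features, and a multi-task head for phone recognition. The architecture, feature dimensions, loss weighting, and all names (`AVInpainter`, the BLSTM encoder, the frame-level phone labels) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the audio-visual speech inpainting setup with an
# auxiliary phone-recognition head. Assumes log-magnitude spectrogram
# inputs and per-frame visual features pre-aligned to the same rate;
# all dimensions are illustrative.
import torch
import torch.nn as nn

class AVInpainter(nn.Module):
    def __init__(self, n_freq=257, n_video=68, n_phones=40, hidden=256):
        super().__init__()
        # Audio and video frames are concatenated per time step.
        self.blstm = nn.LSTM(n_freq + n_video, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.inpaint_head = nn.Linear(2 * hidden, n_freq)  # restores masked frames
        self.phone_head = nn.Linear(2 * hidden, n_phones)  # auxiliary task

    def forward(self, spec_masked, video):
        x = torch.cat([spec_masked, video], dim=-1)  # (B, T, n_freq + n_video)
        h, _ = self.blstm(x)
        return self.inpaint_head(h), self.phone_head(h)

# Simulate a gap: zero out a contiguous span of frames, e.g. 1600 ms at
# a 10 ms hop = 160 frames, and mark it with a binary mask.
B, T, n_freq, n_video = 2, 300, 257, 68
spec = torch.randn(B, T, n_freq)
video = torch.randn(B, T, n_video)
mask = torch.zeros(B, T, 1)
mask[:, 100:260] = 1.0                 # 160-frame gap
spec_masked = spec * (1 - mask)

model = AVInpainter()
spec_hat, phone_logits = model(spec_masked, video)

# Multi-task loss: reconstruction over the gap plus a phone-recognition
# term (cross-entropy against hypothetical frame-level phone labels).
phones = torch.randint(0, 40, (B, T))
loss = (nn.functional.mse_loss(spec_hat * mask, spec * mask)
        + 0.1 * nn.functional.cross_entropy(
              phone_logits.reshape(-1, 40), phones.reshape(-1)))
loss.backward()
```

The 0.1 weight on the auxiliary term is an arbitrary placeholder; in a multi-task setting this trade-off between reconstruction and recognition would be tuned on validation data.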
| File | Access | Description | Type | Size | Format |
|---|---|---|---|---|---|
| Morrone_AVSpeechInpainting___ICASSP2021.pdf | Open access | Working paper | Author's original version submitted for publication | 363.18 kB | Adobe PDF |
| AV_Speech_Inpainting___ICASSP2021__camera_ready.pdf | Restricted access (copy available on request) | | Version published by the publisher | 364.67 kB | Adobe PDF |
Metadata in IRIS UNIMORE are released under a Creative Commons CC0 1.0 Universal license, while publication files are released under an Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact Supporto Iris.