Audio-Visual Speech Inpainting with Deep Learning

Morrone, Giovanni; Michelsanti, Daniel; Tan, Zheng-Hua; Jensen, Jesper
2021

Abstract

In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, which exhibit a structure very different from that of speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different durations. We also experiment with a multi-task learning approach in which a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large, while the proposed audio-visual approach is able to plausibly restore the missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.
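For readers who want a concrete picture of the approach described in the abstract, the following is a minimal, hypothetical PyTorch sketch of an audio-visual inpainting network with an auxiliary phone-recognition head for multi-task learning. The abstract does not specify the architecture or features: the choice of log-magnitude spectrogram frames, face-derived visual features, a BLSTM backbone, the layer sizes, and the loss weighting are all assumptions made purely for illustration, and every name below is hypothetical rather than taken from the paper.

import torch
import torch.nn as nn


class AudioVisualInpainter(nn.Module):
    """Hypothetical sketch of an audio-visual speech inpainting network.

    Assumptions (not specified in the abstract): log-magnitude spectrogram
    frames as audio features, per-frame visual features extracted from the
    speaker's face (e.g., 68 landmarks x 2 coordinates), a BLSTM backbone,
    and an auxiliary phone-recognition head for multi-task learning.
    """

    def __init__(self, n_audio=257, n_video=136, n_hidden=250, n_phones=40):
        super().__init__()
        self.blstm = nn.LSTM(
            input_size=n_audio + n_video,
            hidden_size=n_hidden,
            num_layers=3,
            batch_first=True,
            bidirectional=True,
        )
        # Main head: reconstruct the spectrogram frames inside the gap.
        self.inpaint_head = nn.Linear(2 * n_hidden, n_audio)
        # Auxiliary head: frame-level phone posteriors for multi-task learning.
        self.phone_head = nn.Linear(2 * n_hidden, n_phones)

    def forward(self, audio, video, gap_mask):
        # audio: (B, T, n_audio) spectrogram with the gap zeroed out
        # video: (B, T, n_video) visual features, assumed reliable everywhere
        # gap_mask: (B, T, 1) with 1 inside the gap, 0 in the reliable context
        x = torch.cat([audio, video], dim=-1)
        h, _ = self.blstm(x)
        inpainted = self.inpaint_head(h)
        # Keep the reliable context untouched; predict only the missing frames.
        output = gap_mask * inpainted + (1.0 - gap_mask) * audio
        phone_logits = self.phone_head(h)
        return output, phone_logits


def multitask_loss(output, target, phone_logits, phone_targets, gap_mask, alpha=0.1):
    # Reconstruction loss restricted to the gap, plus a weighted
    # phone-recognition term (frame-level cross-entropy here; the exact
    # auxiliary loss and the weight alpha are assumptions).
    rec = ((output - target) ** 2 * gap_mask).sum() / gap_mask.sum().clamp(min=1.0)
    ce = nn.functional.cross_entropy(
        phone_logits.reshape(-1, phone_logits.size(-1)),
        phone_targets.reshape(-1),  # integer phone labels per frame
    )
    return rec + alpha * ce

The masking convention shown here, where the network predicts only the frames inside the gap while the reliable context is copied through unchanged, reflects the task definition in the abstract: the surrounding audio context is assumed reliable, and only the missing segment, from 100 ms up to 1600 ms, has to be restored.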
Year: 2021
Conference: 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021
Location: Toronto, Canada
Dates: 6-11 June 2021
Volume: 2021-
Pages: 6653-6657
Authors: Morrone, Giovanni; Michelsanti, Daniel; Tan, Zheng-Hua; Jensen, Jesper
Audio-Visual Speech Inpainting with Deep Learning / Morrone, Giovanni; Michelsanti, Daniel; Tan, Zheng-Hua; Jensen, Jesper. - 2021-:(2021), pp. 6653-6657. (Paper presented at the 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021, held in Toronto, Canada, 6-11 June 2021) [10.1109/ICASSP39728.2021.9413488].
Files in this record:

Morrone_AVSpeechInpainting___ICASSP2021.pdf
Access: Open access
Description: Working paper
Type: Author's original version submitted for publication
Size: 363.18 kB
Format: Adobe PDF

AV_Speech_Inpainting___ICASSP2021__camera_ready.pdf
Access: Restricted access
Type: Version published by the publisher
Size: 364.67 kB
Format: Adobe PDF
Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1220454
Citations
  • Scopus: 20
  • Web of Science: 11