An Analysis of Speech Enhancement and Recognition Losses in Limited Resources Multi-talker Single Channel Audio-Visual ASR

Morrone, Giovanni; Badino, Leonardo
2020

Abstract

In this paper, we analyzed how audio-visual speech enhancement can help with the ASR task in a cocktail party scenario. To this end, we considered two simple end-to-end LSTM-based models that perform single-channel audio-visual speech enhancement and phone recognition, respectively. We then studied how the two models interact and how training them jointly affects the final result. We analyzed different training strategies, which reveal some interesting and unexpected behaviors. The experiments show that during optimization of the ASR task the speech enhancement capability of the model significantly decreases, and vice versa. Nevertheless, the joint optimization of the two tasks yields a remarkable drop in Phone Error Rate (PER) compared to audio-visual baseline models trained only to perform phone recognition. We analyzed the behavior of the proposed models on two limited-size datasets, namely the mixed-speech versions of GRID and TCD-TIMIT.
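The abstract describes joint training of an LSTM enhancement model and an LSTM phone recognizer with two task-specific losses. As a rough illustration only, the following PyTorch-style sketch shows one way such a joint objective could be wired up; the module names, feature dimensions, the MSE/CTC loss choices, and the weighting factor alpha are assumptions made for illustration and are not taken from the paper.

# Minimal sketch (not the paper's exact architecture) of joint audio-visual
# enhancement + phone recognition training with two weighted losses.
# Feature sizes, layer widths, loss choices, and `alpha` are illustrative.
import torch
import torch.nn as nn

class AVEnhancer(nn.Module):
    """LSTM that maps a noisy spectrogram plus video features to a T-F mask."""
    def __init__(self, n_freq=257, n_video=128, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_freq + n_video, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)

    def forward(self, noisy_spec, video_feat):
        x = torch.cat([noisy_spec, video_feat], dim=-1)   # (B, T, F + V)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.mask(h))                 # mask values in [0, 1]

class PhoneRecognizer(nn.Module):
    """LSTM phone recognizer over the enhanced spectrogram, CTC-style output."""
    def __init__(self, n_freq=257, hidden=256, n_phones=40):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_phones + 1)          # +1 for the CTC blank

    def forward(self, enhanced_spec):
        h, _ = self.lstm(enhanced_spec)
        return self.out(h).log_softmax(dim=-1)              # (B, T, n_phones + 1)

def joint_loss(enhancer, recognizer, noisy, video, clean, phones,
               in_lens, tgt_lens, alpha=0.5):
    """Weighted sum of an enhancement (MSE) loss and a recognition (CTC) loss."""
    mask = enhancer(noisy, video)
    enhanced = mask * noisy                                  # masked spectrogram
    l_enh = nn.functional.mse_loss(enhanced, clean)
    log_probs = recognizer(enhanced).transpose(0, 1)         # CTC expects (T, B, C)
    l_asr = nn.functional.ctc_loss(log_probs, phones, in_lens, tgt_lens, blank=40)
    return alpha * l_enh + (1.0 - alpha) * l_asr

Setting alpha to 1.0 or 0.0 would correspond to training on only one of the two losses, analogous to the single-task settings the abstract compares against.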
2020
14 May 2020
45th IEEE International Conference on Acoustics, Speech and Signal Processing
Barcelona, Spain
4-8 May, 2020
Pasa, Luca; Morrone, Giovanni; Badino, Leonardo
An Analysis of Speech Enhancement and Recognition Losses in Limited Resources Multi-talker Single Channel Audio-Visual ASR / Pasa, Luca; Morrone, Giovanni; Badino, Leonardo. - (2020). (Paper presented at the 45th IEEE International Conference on Acoustics, Speech and Signal Processing, held in Barcelona, Spain, 4-8 May 2020) [10.1109/ICASSP40776.2020.9054697].
Files in this record:

analysis_jointavenhasr.pdf
  Description: Main article
  Type: Publisher's published version
  Access: Restricted
  Size: 222.33 kB
  Format: Adobe PDF

slides_paper#3109.pdf
  Description: Presentation slides
  Type: Other
  Access: Restricted
  Size: 1.21 MB
  Format: Adobe PDF

1904.08248.pdf
  Type: Author's accepted manuscript
  Access: Open access
  Size: 143.07 kB
  Format: Adobe PDF

Use this identifier to cite or link to this record: https://hdl.handle.net/11380/1176481
Citations
  • PMC: n/a
  • Scopus: 2
  • Web of Science (ISI): 2