Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras

Ander Arriandiaga; Giovanni Morrone; Luca Pasa; Leonardo Badino; Chiara Bartolozzi
2019

Abstract

In this work, we propose a new method for audio-visual target speaker extraction in multi-talker environments using event-driven cameras. Existing audio-visual speech separation approaches rely on frame-based video to extract visual features. However, frame-based cameras usually operate at 30 frames per second, which makes it difficult to process an audio-visual signal with low latency. To overcome this limitation, we propose using event-driven cameras, which offer high temporal resolution and low latency. Recent work showed that landmark motion features are very important for audio-visual speech separation. Thus, we use event-driven vision sensors, from which motion can be extracted at low latency and low computational cost. A stacked bidirectional LSTM is trained to predict an Ideal Amplitude Mask, which is then post-processed to obtain a clean audio signal. The performance of our model is close to that of frame-based approaches.
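
For readers unfamiliar with the setup: the Ideal Amplitude Mask (IAM) is the ratio between the magnitude spectrogram of the target speech and that of the mixture, and the enhanced signal is obtained by multiplying the mixture magnitude by the estimated mask and reconstructing a waveform (typically with the mixture phase). Below is a minimal sketch of such a mask estimator, not the authors' implementation: a stacked bidirectional LSTM over concatenated audio and visual feature streams. All dimensions, layer counts, and the feature concatenation scheme are illustrative assumptions.

# Minimal sketch (not the authors' code): a stacked bidirectional LSTM that maps
# audio-visual features to an amplitude mask. Feature and layer sizes are
# illustrative assumptions, not taken from the paper.
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    def __init__(self, audio_dim=257, visual_dim=64, hidden_dim=256, num_layers=3):
        super().__init__()
        # Stacked BiLSTM over concatenated audio (mixture magnitude spectrogram)
        # and visual (event-driven motion) features, frame-synchronous in time.
        self.blstm = nn.LSTM(
            input_size=audio_dim + visual_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )
        # Project BiLSTM outputs to one mask value per frequency bin.
        self.proj = nn.Linear(2 * hidden_dim, audio_dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, time, audio_dim)  mixture magnitude spectrogram
        # visual_feats: (batch, time, visual_dim) motion features from the event camera
        x = torch.cat([audio_feats, visual_feats], dim=-1)
        h, _ = self.blstm(x)
        # ReLU keeps the predicted amplitude mask non-negative (the IAM can exceed 1).
        return torch.relu(self.proj(h))

# Usage: apply the predicted mask to the mixture magnitude; the waveform would then
# be reconstructed with the mixture phase (a common post-processing choice, assumed here).
model = MaskEstimator()
mixture_mag = torch.rand(1, 100, 257)   # dummy mixture spectrogram
motion = torch.rand(1, 100, 64)         # dummy event-driven motion features
mask = model(mixture_mag, motion)
enhanced_mag = mask * mixture_mag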
December 2019
https://arxiv.org/abs/1912.02671
Arriandiaga, Ander; Morrone, Giovanni; Pasa, Luca; Badino, Leonardo; Bartolozzi, Chiara
Arriandiaga, Ander, Giovanni Morrone, Luca Pasa, Leonardo Badino, and Chiara Bartolozzi. "Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras." Working paper, 2019.
Files in this record:
avse_edc.pdf — Main article (author's original version submitted for publication), Adobe PDF, 166.28 kB, open access

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1185303