Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras

Arriandiaga, Ander; Morrone, Giovanni; Pasa, Luca; Badino, Leonardo; Bartolozzi, Chiara

In this work, we propose a new method for audio-visual target speaker extraction in multi-talker environments using event-driven cameras. All existing audio-visual speech separation approaches extract visual features from frame-based video. However, frame-based cameras usually work at 30 frames per second, which makes it difficult to process an audio-visual signal with low latency. To overcome this limitation, we propose using event-driven cameras, which offer high temporal resolution and low latency. Recent work showed that landmark motion features are very important for obtaining good results in audio-visual speech separation. Thus, we use event-driven vision sensors, from which motion can be extracted at lower latency and computational cost. A stacked Bidirectional LSTM is trained to predict an Ideal Amplitude Mask, which is then post-processed to obtain a clean audio signal. The performance of our model is close to that of frame-based approaches.
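To make the masking step concrete, here is a minimal sketch of how an Ideal Amplitude Mask is commonly defined and applied: the ratio of the clean-speech magnitude spectrogram to the mixture magnitude spectrogram, multiplied back onto the mixture. This is not the authors' code. The function names, the clipping value `max_value`, the stabilizer `eps`, and the toy spectrograms are all assumptions made for illustration. In the approach described above, the network would predict the mask at test time, when the clean signal is not available.

```python
import numpy as np

def ideal_amplitude_mask(clean_mag, mix_mag, eps=1e-8, max_value=10.0):
    """Ratio of clean to mixture magnitudes, clipped to [0, max_value].

    eps and max_value are illustrative assumptions, not values from the paper.
    """
    mask = clean_mag / (mix_mag + eps)
    return np.clip(mask, 0.0, max_value)

def apply_mask(mask, mix_stft):
    """Scale the mixture spectrogram by a real, non-negative mask.

    Multiplying the complex STFT by such a mask rescales the magnitude
    and leaves the mixture phase unchanged.
    """
    return mask * mix_stft

# Toy 2x2 complex "spectrograms" (hypothetical values, not real speech).
clean = np.array([[1 + 1j, 2 + 0j], [0 + 3j, 1 - 1j]])
interference = np.array([[0.5 + 0j, 0 - 1j], [1 + 1j, 0.5 + 0.5j]])
mix = clean + interference

mask = ideal_amplitude_mask(np.abs(clean), np.abs(mix))
estimate = apply_mask(mask, mix)
```

With the ideal mask, the estimate recovers the clean magnitude while keeping the mixture phase, which is why a further step (for example an inverse STFT) is still needed to return to a waveform.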

Arriandiaga, Ander, Giovanni Morrone, Luca Pasa, Leonardo Badino, and Chiara Bartolozzi. "Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras." Working paper, 2019.

Publication year	2019
Publication month	December
Web address	https://arxiv.org/abs/1912.02671
All authors	Arriandiaga, Ander; Morrone, Giovanni; Pasa, Luca; Badino, Leonardo; Bartolozzi, Chiara
Citation	Arriandiaga, Ander, Giovanni Morrone, Luca Pasa, Leonardo Badino, and Chiara Bartolozzi. "Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras." Working paper, 2019.
Type	Working paper

Files in this item:

File	Description	Type	Access	Size	Format
avse_edc.pdf	Main article	AO - Author's original version submitted for publication	Open access	166.28 kB	Adobe PDF


The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International license (CC BY 4.0), unless otherwise indicated.