Far-Field Location Guided Target Speech Extraction using
End-to-End Speech Recognition Objectives

Authors: Aswin Shanmugam Subramanian, Chao Weng, Meng Yu, Shi-Xiong Zhang, Yong Xu, Shinji Watanabe and Dong Yu
Abstract: Target speech extraction is a specific case of source separation in which auxiliary information, such as the location of the target speaker or pre-saved anchor speech examples, is used to resolve the permutation ambiguity. Traditionally, such systems are optimized with signal reconstruction objectives. Recently, end-to-end automatic speech recognition (ASR) methods have made it possible to optimize source separation systems with only a transcription-based objective. This paper proposes a method to jointly optimize a location-guided target speech extraction module and a speech recognition module using only an ASR error minimization criterion. Experimental comparisons with corresponding conventional pipeline systems verify that this task can be realized by end-to-end ASR training objectives without using parallel clean data. We show promising target speech recognition results on mixtures of two speakers and noise, and discuss interesting properties of the proposed system in terms of speech enhancement/separation objectives and word error rates. Finally, we design a system that takes both location and anchor speech as input at the same time and show that performance can be further improved.

Audio Demos

Legend: TA - Target Angle, IA - Interference Angle, NA - Noise Angle, PF - Postfiltering, AS - Anchor Speech, FT - Fine Tuning.
Sample 1 (SIR: 0 dB, SNR: 0 dB): TA - 215, IA - 290, NA - 48
Sample 2 (SIR: -5 dB, SNR: 0 dB): TA - 175, IA - 252, NA - 326
Sample 3 (SIR: 5 dB, SNR: 20 dB): TA - 236, IA - 242, NA - 284

Method:
Unprocessed Mixture (Ch-1)
Ground Truth (Dry)
Pipeline
E2E MVDR-1
E2E MVDR-2
E2E GDR
E2E LCMV
E2E MVDR-2 + PF
E2E MVDR-2 + AS + FT
E2E MVDR-2 + AS + PF + FT