Abstract: Target speech extraction is a special case of source separation in which auxiliary information, such as the location of the target speaker or pre-saved anchor speech examples, is used to resolve the permutation ambiguity. Traditionally, such systems are optimized with signal reconstruction objectives. Recently, end-to-end automatic speech recognition (ASR) methods have made it possible to optimize source separation systems with only a transcription-based objective. This paper proposes a method to jointly optimize a location-guided target speech extraction module and a speech recognition module using only an ASR error minimization criterion. Experimental comparisons with the corresponding conventional pipeline systems verify that this task can be realized with end-to-end ASR training objectives, without using parallel clean data. We show promising target speech recognition results on mixtures of two speakers and noise, and discuss interesting properties of the proposed system in terms of speech enhancement/separation objectives and word error rates. Finally, we design a system that takes both location and anchor speech as input at the same time and show that this further improves performance.
Audio Demos
Legend: TA - Target Angle, IA - Interference Angle, NA - Noise Angle, PF - Postfiltering, AS - Anchor Speech, FT - Fine Tuning.
Method
Sample 1 (SIR: 0 dB, SNR: 0 dB) TA: 215, IA: 290, NA: 48
Sample 2 (SIR: -5 dB, SNR: 0 dB) TA: 175, IA: 252, NA: 326
Sample 3 (SIR: 5 dB, SNR: 20 dB) TA: 236, IA: 242, NA: 284