A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement

Or Tal¹, Moshe Mandel¹, Felix Kreuk², Yossi Adi¹,²
¹The Hebrew University of Jerusalem, Israel
²Meta AI Research
Interspeech 2022

Abstract

Speech enhancement has seen great improvement in recent years using end-to-end neural networks. However, most models are agnostic to the spoken phonetic content. Recently, several studies suggested phonetic-aware speech enhancement, mostly using perceptual supervision. Yet, injecting phonetic features during model optimization can take additional forms (e.g., model conditioning). In this paper, we conduct a systematic comparison between different methods of incorporating phonetic information in a speech enhancement model. By conducting a series of controlled experiments, we observe the influence of different phonetic content models as well as various feature-injection techniques on enhancement performance, considering both causal and non-causal models. Specifically, we evaluate three settings for injecting phonetic information, namely: i) feature conditioning; ii) perceptual supervision; and iii) regularization. Phonetic features are obtained from an intermediate layer of either a supervised pre-trained Automatic Speech Recognition (ASR) model or a pre-trained Self-Supervised Learning (SSL) model. We further observe the effect of choosing different embedding layers on performance, considering both manual and learned configurations. Results suggest that phonetic features from an SSL model outperform those from the ASR model in most cases. Interestingly, the conditioning setting performs best among the evaluated configurations.
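To make the injection settings more concrete, the following is a minimal illustrative sketch in plain Python, not the implementation used in the paper. It assumes that a "learned" layer configuration is a softmax-weighted sum over per-layer feature vectors, that conditioning concatenates phonetic features to a latent frame of the enhancement model, and that perceptual supervision adds a feature-space L1 term to a signal-space L1 loss. All function names, the toy vectors, and the `weight` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of layer scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_feats, layer_scores):
    """Learned layer selection (assumed form): softmax-weighted sum of
    per-layer feature vectors, one vector per layer."""
    weights = softmax(layer_scores)
    dim = len(layer_feats[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, layer_feats))
        for d in range(dim)
    ]


def condition(latent, phonetic):
    """Feature conditioning (assumed form): concatenate phonetic features
    to a latent frame of the enhancement model."""
    return latent + phonetic


def mean_l1(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def supervised_loss(enhanced, clean, phi_enhanced, phi_clean, weight=0.5):
    """Perceptual supervision (assumed form): signal-space L1 plus a
    weighted L1 between phonetic features of enhanced and clean speech."""
    return mean_l1(enhanced, clean) + weight * mean_l1(phi_enhanced, phi_clean)


# Toy usage: two layers with equal scores reduce to a plain average.
layer_feats = [[1.0, 0.0], [0.0, 1.0]]
phonetic = mix_layers(layer_feats, [0.0, 0.0])
conditioned = condition([0.2, 0.4], phonetic)
```

In the manual configuration, `mix_layers` would be replaced by picking a single fixed layer (for example, the "Layer-6" column in the audio examples), whereas the learned configuration lets the layer scores be optimized together with the enhancement model.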

Interspeech 2022 Presentation

Audio Examples

Columns: Session, Noisy, Clean, Baseline, Conditioning (Layer-6), Conditioning (Learned), Regularization, Supervision
p232_001
p257_001
p232_002
p257_002
p232_006
p257_006
p232_009
p257_009
p232_010
p257_010
p232_014
p257_014
p232_017
p257_017
p232_022
p257_022
p232_097
p257_097

BibTeX

@article{tal2022systematic,
          title    =   {A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement},
          author   =   {Tal, Or and Mandel, Moshe and Kreuk, Felix and Adi, Yossi},
          journal  =   {arXiv preprint arXiv:2206.11000},
          year     =   {2022}
        }