Speaking Style Conversion With Discrete Self-Supervised Units

Gallil Maimon, Yossi Adi
School of Computer Science and Engineering
The Hebrew University of Jerusalem, Israel

We formalise the setup of speaking style conversion (SSC), as opposed to traditional voice conversion (VC), which considers voice texture only. While VC methods (here, speech resynthesis) change the spectral features so the converted utterance sounds like the target speaker, they do not change the rhythm and pitch. SSC, on the other hand, also matches the target speaker's faster speaking style and their tendency to finish sentences with a pitch increase. White lines on the spectrograms show the pitch contour. This is a real sample converted by our approach. As there is variance within a speaking style, we do not expect the converted speech to match the target exactly.
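Since the spectrogram figure itself is not reproduced here, the sketch below shows how such a pitch-contour overlay can be produced for any waveform with an off-the-shelf pitch tracker. It is a minimal illustration using librosa and matplotlib, not the plotting code used for the figure; the input path "sample.wav" and the output file name are hypothetical placeholders.

# Minimal, illustrative sketch: spectrogram with a white pitch-contour overlay.
# "sample.wav" is a hypothetical placeholder path, not a file from this page.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("sample.wav", sr=None)

# Log-magnitude spectrogram as the background image.
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

# F0 (pitch) contour with the pYIN tracker; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
times = librosa.times_like(f0, sr=sr)

fig, ax = plt.subplots(figsize=(8, 4))
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax)
ax.plot(times, f0, color="white", linewidth=2, label="pitch contour")
ax.set(ylim=(0, 4000), title="Spectrogram with pitch contour overlay")
ax.legend(loc="upper right")
plt.tight_layout()
plt.savefig("pitch_overlay.png")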

Abstract

Voice conversion is the task of making a spoken utterance by one speaker sound as if uttered by a different speaker, while keeping other aspects, like content, the same. Existing methods focus primarily on spectral features like timbre, but ignore the unique speaking style of individuals, which often impacts prosody. In this study we introduce a method for converting not only the timbre, but also the rhythm and pitch changes, to those of the target speaker. In addition, we do so in the many-to-many setting with no paired data. We use pretrained, self-supervised, discrete units, which make our approach extremely lightweight. We introduce a suite of quantitative and qualitative evaluation metrics for this setup, and show that our approach outperforms existing methods.
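At a high level, the approach described above decouples content (discrete self-supervised units) from prosody (rhythm and pitch), which are then re-predicted for the target speaker before vocoding. The following is a minimal, self-contained sketch of how those pieces could fit together; it is not the DISSC implementation, and the class names, dimensions, toy k-means quantizer, and prosody predictor are all illustrative assumptions.

# Illustrative sketch only, NOT the authors' DISSC code. A frozen self-supervised
# encoder is assumed to yield frame features; k-means turns them into discrete
# units, and a small per-speaker model re-predicts duration (rhythm) and pitch.
import torch
import torch.nn as nn


class KMeansQuantizer(nn.Module):
    """Maps continuous SSL features to the index of the nearest centroid."""
    def __init__(self, centroids: torch.Tensor):
        super().__init__()
        self.register_buffer("centroids", centroids)  # (K, D)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (T, D) -> (T,)
        dists = torch.cdist(feats, self.centroids)            # (T, K)
        return dists.argmin(dim=-1)


class ProsodyPredictor(nn.Module):
    """Toy stand-in: predicts per-unit duration and pitch for a target speaker."""
    def __init__(self, num_units: int, num_speakers: int, dim: int = 128):
        super().__init__()
        self.unit_emb = nn.Embedding(num_units, dim)
        self.spk_emb = nn.Embedding(num_speakers, dim)
        self.head = nn.Linear(dim, 2)  # [log-duration, log-F0] per unit

    def forward(self, units: torch.Tensor, speaker: torch.Tensor) -> torch.Tensor:
        h = self.unit_emb(units) + self.spk_emb(speaker)       # (T, dim)
        return self.head(h)                                    # (T, 2)


if __name__ == "__main__":
    K, D, S = 100, 768, 4                     # units, feature dim, speakers (made up)
    quantizer = KMeansQuantizer(torch.randn(K, D))
    predictor = ProsodyPredictor(K, S)

    ssl_feats = torch.randn(250, D)           # e.g. frame features of one utterance
    units = quantizer(ssl_feats)              # discrete content units
    target_spk = torch.tensor([2])            # index of the target speaker
    prosody = predictor(units, target_spk.expand_as(units))
    # `units` plus predicted durations/pitch would then drive a unit-based vocoder.
    print(units.shape, prosody.shape)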

Audio Examples

VCTK

Samples p231_020 and p245_019, each with audio for: Source, Target, Speech Resynthesis, AutoPST, DISSC_Rhythm, DISSC_Both.

Emotional Speech Dataset

Samples 0017Happy_020 and 0019Sad_028, each with audio for: Source, Target, Speech Resynthesis, AutoPST, DISSC_Rhythm.

Synthetic VCTK

Samples p270_001, p231_021 and p245_014, each with audio for: Source, Target, Speech Resynthesis, DISSC_Pitch, DISSC_Rhythm, DISSC_Both.

Irregular Rhythm

Sample p239_010, with audio for: Abnormal, Original, AutoPST, DISSC_Rhythm.

In the Wild Samples

This section contains samples with unseen speaker and content, which we recorded ourselves and converted to target speakers from Syn_VCTK. This demonstrates the any-to-many capability of the approach.
Each sample has audio for: Source, Target Speaker, DISSC_Both.

BibTeX

@misc{maimon2022speaking,
  doi = {10.48550/ARXIV.2212.09730},
  url = {https://arxiv.org/abs/2212.09730},
  author = {Maimon, Gallil and Adi, Yossi},
  keywords = {Sound (cs.SD), Computation and Language (cs.CL), Machine Learning (cs.LG), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering},
  title = {Speaking Style Conversion With Discrete Self-Supervised Units},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}