Speaking Style Conversion With Discrete Self-Supervised Units

Gallil Maimon, Yossi Adi

School of Computer Science and Engineering
The Hebrew University of Jerusalem, Israel
Findings of EMNLP 2023

We formalise the setup of speaking style conversion (SSC) as opposed to traditional voice conversion (VC), which considers voice texture only. While VC methods, in this case speech resynthesis, change the spectral features so that the new utterance sounds like the target speaker, they do not change the rhythm and pitch. SSC, on the other hand, also matches the target speaker's faster speaking style and tendency to finish sentences with a pitch increase. White lines on the spectrograms show the pitch contour. This is a real sample converted by our approach.

Abstract

We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour and timbre of a recording to those of a target speaker in a textless manner. Unlike DISSC, most voice conversion (VC) methods focus primarily on timbre and ignore people's unique speaking style (prosody). The proposed approach uses a pretrained, self-supervised model for encoding speech to discrete units, which makes it simple, effective, and fast to train. All conversion modules are trained only on reconstruction-like tasks and are thus suitable for any-to-many VC with no paired data. We introduce a suite of quantitative and qualitative evaluation metrics for this setup, and empirically demonstrate that DISSC significantly outperforms the evaluated baselines.
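As a rough illustration of why discrete units are convenient for rhythm conversion, the sketch below collapses a frame-level unit sequence into (unit, duration) pairs and re-expands it with new durations. It rests on assumptions not stated on this page: that rhythm is represented as per-unit run lengths, and that a fixed halving rule can stand in for a learned, speaker-conditioned duration model. The unit IDs and the function names `deduplicate` and `expand` are hypothetical and are not taken from the DISSC code.

```python
from itertools import groupby


def deduplicate(frame_units):
    """Collapse runs of repeated units into (unit, run_length) pairs."""
    return [(unit, len(list(run))) for unit, run in groupby(frame_units)]


def expand(units, durations):
    """Repeat each unit by its duration (in frames) to rebuild a frame-level sequence."""
    frames = []
    for unit, duration in zip(units, durations):
        frames.extend([unit] * duration)
    return frames


# Made-up frame-level unit IDs, as a discrete speech encoder might emit them.
frame_units = [12, 12, 12, 7, 7, 93, 93, 93, 93, 7]

pairs = deduplicate(frame_units)
units = [unit for unit, _ in pairs]
src_durations = [duration for _, duration in pairs]

# Hypothetical stand-in for a learned duration predictor:
# a "faster" target speaker halves every duration (at least one frame each).
tgt_durations = [max(1, duration // 2) for duration in src_durations]

converted = expand(units, tgt_durations)
print(pairs)
print(converted)
```

In this toy setup the unit identities (the content) stay fixed while only their durations change, which is one simple way to picture rhythm being converted independently of what is said.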

Audio Examples

VCTK

Sample	Source	Target	Speech Resynthesis	AutoPST	DISSC_Rhythm	DISSC_Both
p231_020
p245_019

Emotional Speech Dataset

Sample	Source	Target	Speech Resynthesis	AutoPST	DISSC_Rhythm
0017Happy_020
0019Sad_028

Synthetic VCTK

Sample	Source	Target	Speech Resynthesis	DISSC_Pitch	DISSC_Rhythm	DISSC_Both
p270_001
p231_021
p245_014

Irregular Rhythm

Sample	Abnormal	Original	AutoPST	DISSC_Rhythm
p239_010

In the Wild Samples

This section contains samples with an unseen speaker and unseen content, which we recorded ourselves and converted to target speakers from Syn_VCTK. This demonstrates the any-to-many ability of the approach.

Source	Target Speaker	DISSC_Both

Low Resource Languages

This section contains samples from an unseen language (as well as an unseen speaker and unseen content). We demonstrate the ability to take a Hebrew utterance by an unseen speaker and convert it to either a fast or a slow VCTK speaker (p231 and p270, respectively), both of whom speak only English. None of the models, namely the HuBERT encoder, the HiFi-GAN vocoder and both prosody conversion models, saw Hebrew data during training. Training any of these on Hebrew data in a self-supervised manner would very likely improve results. However, these results already demonstrate the potential of the approach where ASR-TTS methods struggle (lower-resource languages).

Source	DISSC fast speaker	DISSC slow speaker

BibTeX

@inproceedings{maimon-adi-2023-speaking,
    title = "Speaking Style Conversion in the Waveform Domain Using Discrete Self-Supervised Units",
    author = "Maimon, Gallil  and Adi, Yossi",
    editor = "Bouamor, Houda  and Pino, Juan  and Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.541",
    pages = "8048--8061"}