A Language Modeling Approach to Diacritic-Free Hebrew TTS

School of Computer Science and Engineering
The Hebrew University of Jerusalem, Israel

Abstract

We tackle the task of text-to-speech (TTS) in Hebrew. Traditional Hebrew contains diacritics ("Niqqud"), which dictate how a given word should be pronounced; modern Hebrew, however, rarely uses them. Without diacritics, readers are expected to infer the correct pronunciation, and which phonemes to use, from context. This poses a fundamental challenge for TTS systems, which must map text to speech accurately. In this study, we propose a diacritic-free language modeling approach to Hebrew TTS. The language model (LM) operates on discrete speech representations and is conditioned on text tokenized by a word-piece tokenizer. We optimize the proposed method using in-the-wild, weakly supervised recordings and compare it to several diacritic-based Hebrew TTS systems. Results suggest that the proposed method is superior to the evaluated baselines in both content preservation and naturalness of the generated speech.
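To make the term "diacritic-free" concrete, the following is a minimal sketch of how Niqqud can be removed from vocalized Hebrew text. It is an illustration only, not part of the proposed method: the function name `strip_niqqud` is hypothetical, and the sketch assumes that dropping every Unicode nonspacing mark (category `Mn`) is acceptable, which also removes cantillation marks and combining accents in other scripts.

```python
import unicodedata


def strip_niqqud(text: str) -> str:
    """Return text with Hebrew diacritics (and other combining marks) removed.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Decompose first so precomposed forms split into base letter + marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Hebrew points such as qamats (U+05B8) and holam (U+05B9) have
    # Unicode category "Mn" (nonspacing mark); base letters do not.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# "shalom" written with Niqqud: shin, shin dot, qamats, lamed, vav, holam, final mem
vocalized = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
plain = strip_niqqud(vocalized)
print(len(vocalized), len(plain))  # 7 4
```

The diacritic-free form keeps only the four base letters, which is how the word usually appears in modern Hebrew text; the pronunciation then has to be inferred from context.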

Audio Samples

In the following section, we present audio samples generated by the publicly available model.



seen speakers [1] | unseen speakers [2]
Speakers | Ran Levi | Reem Sherman | Shaul Amsterdamski | Danny Kushmaro | Omer Adam
Diacritic-Free Hebrew Prompt / Audio Prompt

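מה שיפה בסרטונים שלה הוא כמה הם ספציפיים.
(English: "What's nice about her videos is how specific they are.")

ואם אתה מצליח לשפר את איכות החיזוי אתה יכול להציל חיי אדם
(English: "And if you manage to improve the quality of the prediction, you can save human lives")

אני חושב שאחת הסיבות שאני לוקח את זה נורא נורא קשה
(English: "I think one of the reasons that I take this really, really hard")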



[1] Speakers whose data the model was trained on.

[2] Speakers whose data the model was not trained on and which it never "saw" during training, i.e., zero-shot inference.

Comparison With Previous Methods

A comparison of MMS and Roboshaul with our method using two different tokenizers: a character tokenizer and a word-piece tokenizer, as detailed in the paper. The first four recordings are generated with a recording of Ran Levi (Making History podcast) as the acoustic prompt. The fifth is a zero-shot sample generated using an acoustic prompt of Shaul Amsterdamski (Hayot Kiss podcast).
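The difference between the two tokenizers can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the function names, the tiny vocabulary, and the `[UNK]` token are hypothetical, and the word-piece function uses the standard greedy longest-match scheme with a `##` continuation prefix, which may differ from the tokenizer actually used.

```python
def char_tokenize(text):
    """Character tokenizer: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def wordpiece_tokenize(text, vocab, unk="[UNK]"):
    """Greedy longest-match-first word-piece tokenization (toy sketch)."""
    tokens = []
    for word in text.split():
        start, pieces = 0, []
        while start < len(word):
            end, piece = len(word), None
            # Try the longest remaining substring first, then shrink it.
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # marks a word-internal piece
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                # No piece matches: the whole word becomes unknown.
                pieces = [unk]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens


# Hypothetical toy vocabulary: the definite-article prefix and two stems.
toy_vocab = {"ה", "##כלכלה", "##הונגרית"}
text = "הכלכלה ההונגרית"  # "the Hungarian economy"

print(char_tokenize(text))
print(wordpiece_tokenize(text, toy_vocab))
```

With this toy vocabulary, each word splits into the prefix piece plus one stem piece, so the word-piece sequence is much shorter than the character sequence for the same text.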



Diacritic-Free Hebrew Prompt | MMS | Roboshaul | Ours (chars) | Ours (word-piece)

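תגידו, גנבו לכם פעם את האוטו ופשוט ידעתם שאין טעם להגיש תלונה במשטרה.
(English: "Tell me, have you ever had your car stolen and simply known there was no point in filing a complaint with the police.")

הדרבי תמיד היה המשחק הכי חשוב, אך בשנים האחרונות הוא נעשה כמעט הדבר היחיד שחשוב.
(English: "The derby was always the most important game, but in recent years it has become almost the only thing that matters.")

בראשית הייתה חללית מסוג נחתת.
(English: "In the beginning there was a lander-type spacecraft.")

אני חושב שאחת הסיבות שאני לוקח את זה נורא נורא קשה.
(English: "I think one of the reasons that I take this really, really hard.")

מה שבהגדרה משאיר את הכלכלה ההונגרית מאחור, אפילו ביחס למדינות כמו פולין.
(English: "Which by definition leaves the Hungarian economy behind, even relative to countries like Poland.")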

BibTeX

@misc{roth2024languagemodelingapproachdiacriticfree,
  title={A Language Modeling Approach to Diacritic-Free Hebrew TTS},
  author={Amit Roth and Arnon Turetzky and Yossi Adi},
  year={2024},
  eprint={2407.12206},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.12206},
}