HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing

Arnon Turetzky¹, Or Tal¹, Yael Segal², Yehoshua Dissen², Ella Zeldes¹, Amit Roth¹, Eyal Cohen², Yosi Shrem², Roni Chernyak², Olga Seleznova², Joseph Keshet², Yossi Adi¹

¹The Hebrew University of Jerusalem

²Technion–Israel Institute of Technology


Abstract

We present HebDB, a weakly supervised dataset for spoken language processing in the Hebrew language. HebDB offers roughly 2500 hours of natural and spontaneous speech recordings in the Hebrew language, consisting of a large variety of speakers and topics. We provide raw recordings together with a pre-processed, weakly supervised, and filtered version. The goal of HebDB is to further enhance research and development of spoken language processing tools for the Hebrew language. Hence, we additionally provide two baseline systems for Automatic Speech Recognition (ASR): (i) a self-supervised model; and (ii) a fully supervised model. We present the performance of these two methods optimized on HebDB and compare them to current multi-lingual ASR alternatives. Results suggest the proposed method reaches better results than the evaluated baselines considering similar model sizes. Both HebDB and the baseline systems will be publicly available upon acceptance.

HebDB

HebDB contains natural dialogues of spontaneous speech. It comprises testimonies from World War II survivors and five podcasts covering a wide range of subjects and speakers. While the testimonies provide firsthand accounts of historical events, the majority of the dataset consists of podcasts covering diverse topics such as economy, politics, sports, culture, science, history, and music, to name a few. We provide two versions of the dataset: raw and pre-processed.

The raw version of our dataset contains in-the-wild audio, allowing researchers and practitioners to explore different pre-processing alternatives and methods. A detailed description and statistics of this version are provided in the following table.

The raw recordings are constructed from full podcast episodes and testimonies and, hence, contain long audio sources and plenty of non-speech segments such as music, environmental sounds, and silence. Such in-the-wild conditions make model optimization challenging and require a pre-processing step.

To handle that, we apply the following pre-processing pipeline to the raw version of HebDB: (1) resample to mono 16 kHz; (2) apply voice activity detection (VAD) and filter out empty and noisy parts; (3) transcribe using a pre-trained ASR model; and (4) employ a forced aligner, using an alternative model, to generate a confidence score. After the pre-processing step, we are left with ~1690 hours of speech partitioned into segments of varied length, with the vast majority of the segmented files being shorter than 10 seconds. We provide the following statistics for the pre-processed version of HebDB.
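As a rough illustration of steps (2) and (4), the sketch below marks speech spans with a simple frame-energy detector and then keeps only segments whose aligner confidence clears a cutoff. This is not the pipeline used to build the dataset: the energy-based detector stands in for whatever VAD was actually applied, and the names `frame_energy_vad`, `Segment`, and `keep_confident`, as well as the threshold values and the score scale, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


def frame_energy_vad(
    samples: Sequence[float],
    sample_rate: int = 16000,   # step (1) of the pipeline resamples to 16 kHz mono
    frame_ms: int = 30,         # assumed frame length
    threshold: float = 1e-3,    # assumed mean-energy cutoff
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans whose mean frame energy exceeds the threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    spans: List[Tuple[float, float]] = []
    start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        t = i * frame_len / sample_rate
        if energy > threshold and start is None:
            start = t                      # a speech span opens
        elif energy <= threshold and start is not None:
            spans.append((start, t))       # the span closes at the first quiet frame
            start = None
    if start is not None:                  # audio ended while still in speech
        spans.append((start, n_frames * frame_len / sample_rate))
    return spans


@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    text: str      # weak transcript from the ASR model
    score: float   # forced-aligner confidence (scale is an assumption)


def keep_confident(segments: Sequence[Segment], min_score: float) -> List[Segment]:
    """Drop segments whose aligner confidence is below min_score."""
    return [s for s in segments if s.score >= min_score]
```

In practice a trained VAD model would replace the energy heuristic, and the confidence cutoff would be chosen by inspecting the score distribution rather than fixed in advance.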

The table shows the subdivision of processed audio by source. The box plots present, for each source, the quartile distributions of the processed instances over audio duration in seconds and over the number of transcribed words, discarding outliers. The histogram shows the forced-aligner score distribution.
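For readers who want to recompute per-source summaries of this kind, a minimal sketch using only the Python standard library follows. The function names are hypothetical, and the 1.5 × IQR rule for discarding outliers is the common box-plot convention, assumed here rather than taken from the paper.

```python
import statistics
from typing import Dict, List, Sequence, Tuple


def quartiles(values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (Q1, median, Q3) of the given values."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return q1, median, q3


def drop_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR] (assumed box-plot convention)."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def summarize_by_source(
    durations: Dict[str, List[float]],
) -> Dict[str, Tuple[float, float, float]]:
    """Quartiles of segment durations (seconds) for each source, after outlier removal."""
    return {src: quartiles(drop_outliers(vals)) for src, vals in durations.items()}
```

The same helpers apply unchanged to word counts per segment.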

Baseline systems

We provide two baseline systems together with HebDB. The first is a self-supervised (SSL) model, namely HuBERT. The second is a fully supervised model, namely Conformer. Both models were optimized using HebDB. The table presents Word Error Rate (WER) results computed over the Fleurs benchmark. Compared to Whisper models, the provided baseline systems reach comparable or superior performance up to a model size of 769M parameters.
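For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference transcript, divided by the number of reference words. The sketch below is a plain dynamic-programming implementation for a single utterance; the function name is ours, and it applies no text normalization, which evaluation scripts typically do before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Corpus-level WER is usually computed by summing edit counts and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.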

BibTeX

@article{turetzky2024hebdb,
  title={HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing},
  author={Turetzky, Arnon and Tal, Or and Segal-Feldman, Yael and Dissen, Yehoshua and Zeldes, Ella and Roth, Amit and Cohen, Eyal and Shrem, Yosi and Chernyak, Bronya R and Seleznova, Olga and others},
  journal={arXiv preprint arXiv:2407.07566},
  year={2024}
}