Hey, I'm Gallil Maimon

I am a Computer Science PhD student in the SLP lab at the Hebrew University of Jerusalem, under the supervision of Dr. Yossi Adi. I have a broad research interest in language modelling and learning. My current research focuses on Speech Language Modelling - from evaluation to representation and modelling. I'm also a self-proclaimed beer geek and homebrewer.


Select Publications

An image illustrating the trace format.

Self-Execution Simulation Improves Coding Models

Pre-Print
Gallil Maimon, Ori Yoran, Felix Kreuk, Michael Hassid, Gal Cohen, Pierre Chambon, Yossi Adi

TL;DR - We post-train Code LLMs to reason about and simulate code execution for given inputs. We then show that models can use this ability to self-verify and self-fix their own generated code, giving an additional boost over standard reasoning models in competitive coding.

A gif demonstrating an overview of StressTest.

StressTest: Can YOUR Speech LM Handle the Stress?

ACL 2026 (Findings)
Iddo Yosha, Gallil Maimon, Yossi Adi

TL;DR - We present a new benchmark for detecting and understanding stress in speech. Although stress is a key element of spoken communication, existing models often overlook it and perform poorly. Using an automatic data synthesis approach, we achieve strong performance by fine-tuning a speech-aware LM.

An image illustrating the trace format.

CWM: An Open-Weights LLM for Research on Code Generation with World Models

Tech Report
Meta FAIR CodeGen, Gallil Maimon

TL;DR - We release CWM, a 32B-parameter LLM, to advance research on code generation with world models. To improve code understanding beyond what can be learned from training on static code alone, we mid-train CWM on a large amount of observation-action trajectories from Python interpreter and agentic Docker environments, and perform extensive multi-task reasoning RL.

An image demonstrating the survey overview.

Discrete Audio Tokens: More Than a Survey!

TMLR 2025
Pooneh Mousavi*, Gallil Maimon*, Adel Moumen*, Darius Petermann*, Jiatong Shi*, Haibin Wu*, Haici Yang*, Anastasia Kuznetsova*, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli

TL;DR - We perform an extensive empirical evaluation of audio tokenisers across varied use cases and acoustic domains. We also provide an extensive taxonomy and a continually updated database of tokenisers.

An image comparing scaling to speech-only SLMs.

Scaling Analysis of Interleaved Speech-Text Language Models

COLM 2025
Gallil Maimon, Michael Hassid, Amit Roth, Yossi Adi

TL;DR - We conduct the first scaling analysis of interleaved speech-text LMs. We find that they scale much more efficiently than speech-only SLMs, making them feasible. Furthermore, their scaling dynamics differ significantly, suggesting that one should allocate much more of the compute budget to model parameters rather than training data, relative to existing guidelines.

An image demonstrating Slam's efficiency.

Slamming: Training a Speech Language Model on One GPU in a Day

ACL 2025 (Findings)
Gallil Maimon*, Avishai Elmakies*, Yossi Adi

TL;DR - We introduce Slam, a recipe for training high-quality Speech LMs on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, and preference optimisation with synthetic data. We empirically demonstrate that the recipe scales up, achieving results on par with leading SLMs at a fraction of the compute.

An image illustrating the SALMon benchmark.

A Suite for Acoustic Language Model Evaluation

ICASSP 2025 (Oral)
Gallil Maimon*, Amit Roth*, Yossi Adi

TL;DR - We introduce SALMon🍣, a novel suite for evaluating the acoustic modelling abilities of speech LMs. The proposed benchmarks evaluate both the consistency of the inspected acoustic element and how well it matches the spoken text. We show that leading models perform poorly compared to humans.

An image illustrating speaking style conversion vs. traditional VC.

Speaking Style Conversion With Discrete Self-Supervised Units

EMNLP 2023 (Findings)
Gallil Maimon, Yossi Adi

TL;DR - We formalise the task and evaluation of speaking style conversion, going beyond converting timbre to also converting pitch contour and rhythm. We introduce a simple and effective method for speaking style conversion based on pre-trained discrete speech tokens.

An image illustrating Universal Adversarial Policies.

A Universal Adversarial Policy for Text Classifiers

Neural Networks 2022
Gallil Maimon, Lior Rokach

TL;DR - We introduce a new adversarial setup against text classifiers, named universal adversarial policies. In this setup, one learns a single perturbation policy which, given a text and a classifier, selects the optimal perturbations (which words to replace) to produce an adversarial text. The policy must generalise to many unseen texts. We learn such a policy with reinforcement learning, and it successfully generalises to unseen texts after training on as few as 500 texts.