🍣SALMon: Suite for Acoustic Language Model evaluation🍣

ICASSP 2025 (Oral) & SALMA Workshop

🍣SALMon Leaderboard🍣

Method	Sentiment Consistency	Speaker Consistency	Gender Consistency	Background Consistency (In-Domain)	Background Consistency (Random)	Room Consistency	Sentiment Alignment	Background Alignment	sWUGGY
Human Baseline	97.2	91.2	98.6	83.1	88.7	94.4	93.3	95.7
LAST 1.3B	65.0	64.5	68.5	56.0	61.0	62.5	53.5	53.0	73.6
TWIST 7B	61.5	71.0	70.0	55.0	60.5	62.0	51.5	54.5	82.8
pGSLM	40.5	83.0	88.5	57.0	66.0	53.5	55.5	53.5	74.1
SPIRIT LM	54.5	69.5	67.0	53.5	55.5	54.5	48.0	51.5	75.5
SPIRIT LM Expr.	73.5	81.0	85.0	55.0	64.0	54.5	52.0	59.5	72.7
TASLM 1B (token)	59.0	68.0	70.5			61.0
TASLM 1B (embedding)	57.5	67.0	75.5			50.0
Flow-SLM-270M	61.5	75.5	78.0	69.0	67.0	73.5	60.0	55.5	77.6
Flow-SLM-1B-ext	65.0	76.5	80.0	70.0	64.5	73.5	57.0	53.0	73.2
LLaMa-Mimi-1.3B	79.0	85.0	83.5		73.5	92.0	48.5	53.5	68.7
LLaMa-Mimi-8B	76.5	86.5	85.5		73.0	92.0	46.5	52.5	68.8
CAST-0.7B	81.8	90.8	90.0	80.0	77.5	90.0	51.0	56.0	65.6
CAST-1B	81.8	90.0	90.0	78.0	68.5	91.0	48.5	51.5	67.0
CAST-1B (speech+text)	73.0	83.5	83.5	75.0	71.5	84.5	54.5	58.0	73.7

Abstract

Speech language models have recently demonstrated great potential as universal speech processing systems. Such models can model the rich acoustic information in audio signals beyond the spoken content, such as emotion and background noise. Despite this, evaluation benchmarks that measure awareness of a wide range of acoustic aspects are lacking. To help bridge this gap, we introduce SALMon🍣, a novel evaluation suite encompassing background noise, emotion, speaker identity, and room impulse response. The proposed benchmarks evaluate both the consistency of the inspected element and how well it matches the spoken text. We follow a modelling-based approach, measuring whether a model gives correct samples higher scores than incorrect ones. This approach makes the benchmark fast to compute even for large models. We evaluated several speech language models on SALMon🍣, highlighting the strengths and weaknesses of each evaluated method. Code and data are publicly available.
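
The modelling-based metric described in the abstract can be illustrated with a minimal sketch. It assumes the model has already assigned one score (for example, a log-likelihood) to each positive and each negative sample of a pair. The function name `pairwise_accuracy`, the tie-handling rule, and the example scores are hypothetical and are not taken from the SALMon code.

```python
from typing import Sequence


def pairwise_accuracy(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Percentage of pairs where the positive sample outscores the negative one."""
    if len(pos_scores) != len(neg_scores):
        raise ValueError("expected one negative score per positive score")
    if not pos_scores:
        raise ValueError("need at least one pair")
    # Assumption: a tie counts as a miss, since the model did not prefer the correct sample.
    wins = sum(p > n for p, n in zip(pos_scores, neg_scores))
    return 100.0 * wins / len(pos_scores)


# Hypothetical log-likelihoods for four (positive, negative) pairs.
pos = [-120.5, -98.2, -143.0, -110.7]
neg = [-131.9, -97.4, -150.2, -118.3]
print(pairwise_accuracy(pos, neg))  # 3 of 4 pairs ranked correctly -> 75.0
```

Under this formulation a model that scores samples at random would land near 50, which gives a reference point when reading the leaderboard values.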

Acoustic Consistency

HuggingFace

Google Drive

Audio samples table. Columns: Task, Sample, Positive Audio, Negative Audio. Tasks:

Sentiment Consistency
Speaker Consistency
Gender Consistency
Background Consistency (Random)
Background Consistency (In-Domain)
Room Impulse Response Consistency

Semantic-Acoustic Alignment

Audio samples table. Columns: Task, Sample, Positive Audio, Negative Audio. Tasks:

Sentiment Alignment
Background Alignment

BibTeX

@INPROCEEDINGS{maimon2025salmon,
  author={Maimon, Gallil and Roth, Amit and Adi, Yossi},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Salmon: A Suite for Acoustic Language Model Evaluation},
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Measurement;Codes;Publishing;Computational modeling;Pipelines;Benchmark testing;Signal processing;Acoustics;Background noise;Speech processing;Speech Language Models;Acoustic Modelling},
  doi={10.1109/ICASSP49660.2025.10888561}
}