šŸ£SALMon: Suite for Acoustic Language Model evaluationšŸ£

* indicates equal contribution

Abstract

Speech language models have recently demonstrated great potential as universal speech processing systems. Such models can capture the rich acoustic information present in audio signals beyond the spoken content, such as emotion and background noise. Despite this, benchmarks that evaluate awareness of a wide range of acoustic aspects are lacking. To help bridge this gap, we introduce SALMon🍣, a novel evaluation suite encompassing background noise, emotion, speaker identity, and room impulse response. The proposed benchmarks evaluate both the consistency of the inspected element and how well it matches the spoken text. We follow a modelling-based approach, measuring whether a model gives correct samples higher scores than incorrect ones. This approach makes the benchmark fast to compute even for large models. We evaluated several speech language models on SALMon🍣, highlighting the strengths and weaknesses of each evaluated method. Code and data are publicly available.
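
The modelling-based comparison described above can be illustrated with a minimal sketch. It is not the released evaluation code: `pairwise_accuracy` and `mean_log_likelihood` are hypothetical names, the "samples" are toy lists of per-frame log-likelihoods standing in for real model outputs, and counting ties as incorrect is an assumption.

```python
from typing import Callable, Iterable, Sequence, Tuple


def mean_log_likelihood(frame_scores: Sequence[float]) -> float:
    """Hypothetical scorer: average per-frame log-likelihood of one sample."""
    return sum(frame_scores) / len(frame_scores)


def pairwise_accuracy(
    pairs: Iterable[Tuple[Sequence[float], Sequence[float]]],
    score: Callable[[Sequence[float]], float],
) -> float:
    """Percentage of (positive, negative) pairs where the positive scores higher.

    Ties are counted as incorrect (an assumption of this sketch).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (positive, negative) pair is required")
    correct = sum(1 for pos, neg in pairs if score(pos) > score(neg))
    return 100.0 * correct / len(pairs)


# Toy data: each pair is (positive sample, negative sample).
toy_pairs = [
    ([-1.0, -1.2], [-2.0, -2.5]),  # positive scores higher: correct
    ([-0.5], [-0.4]),              # negative scores higher: incorrect
    ([-3.0, -1.0], [-2.5, -2.5]),  # positive scores higher: correct
]
accuracy = pairwise_accuracy(toy_pairs, mean_log_likelihood)
print(f"{accuracy:.1f}")
```

Because each item is a two-way choice, a scorer with no preference lands near 50 on this kind of metric, and only one scoring pass per sample is needed.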

šŸ£SALMon LeaderboardšŸ£

| Method | Sentiment Consistency | Speaker Consistency | Gender Consistency | Background Consistency (In-Domain) | Background Consistency (Random) | Room Consistency | Sentiment Alignment | Background Alignment | sWUGGY |
|---|---|---|---|---|---|---|---|---|---|
| Human Baseline | 97.2 | 91.2 | 98.6 | 83.1 | 88.7 | 94.4 | 93.3 | 95.7 | - |
| LAST 1.3B | 65.0 | 64.5 | 68.5 | 56.0 | 61.0 | 62.5 | 53.5 | 53.0 | 73.6 |
| TWIST 7B | 61.5 | 71.0 | 70.0 | 55.0 | 60.5 | 62.0 | 51.5 | 54.5 | 82.8 |
| pGSLM | 40.5 | 83.0 | 88.5 | 57.0 | 66.0 | 53.5 | 55.5 | 53.5 | 74.1 |
| SPIRIT LM | 54.5 | 69.5 | 67.0 | 53.5 | 55.5 | 54.5 | 48.0 | 51.5 | 75.5 |
| SPIRIT LM Expr. | 73.5 | 81.0 | 85.0 | 55.0 | 64.0 | 54.5 | 52.0 | 59.5 | 72.7 |

We encourage you to report results for your speech language models by emailing gallil.maimon@mail.huji.ac.il, and we will add them here.

Acoustic Consistency

We provide some data samples from the benchmark, but encourage you to test out the entire benchmark on 🤗HuggingFace or Google Drive.
Each task below has three sample pairs, each consisting of a positive audio clip and a negative audio clip (audio players not reproduced here).

Sentiment Consistency
Speaker Consistency
Gender Consistency
Background Consistency (Random)
Background Consistency (In-Domain)
Room Impulse Response Consistency





Semantic-Acoustic Alignment


Each task below has three sample pairs, each consisting of a positive audio clip and a negative audio clip (audio players not reproduced here).

Sentiment Alignment
Background Alignment

BibTeX

@article{maimon2024salmon,
  title={A Suite for Acoustic Language Model Evaluation},
  author={Maimon, Gallil and Roth, Amit and Adi, Yossi},
  journal={arXiv preprint arXiv:2409.07437},
  year={2024}
}