Existing scaling analyses of Speech Language Models (SLMs) paint a bleak picture: they predict that SLMs require far more compute and data than text models, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from pre-trained TextLMs using speech-text interleaving to enable knowledge transfer. This raises the question: do interleaved SLMs scale more efficiently than textless SLMs? In this paper we answer with a resounding yes! We conduct a scaling analysis of interleaved SLMs by training several dozen models and analysing the scaling trends. Under this setup, SLMs scale more efficiently with compute. Moreover, the scaling dynamics differ significantly from those of textless SLMs, suggesting one should allocate notably more of the compute budget to increasing model size rather than training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential. Our scaled-up model achieves performance comparable to leading models on speech semantic metrics while using less compute and data than other approaches. We open-source all models to support further research into scaling SLMs.
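The compute-allocation trade-off described above can be made concrete with a small sketch. Assuming the common approximation C ≈ 6·N·D (total training FLOPs for N parameters and D tokens) and a hypothetical allocation exponent `a` — which is illustrative only, not a value fitted in the paper — one can split a fixed FLOP budget between model size and training tokens:

```python
# Sketch of compute-optimal budget splitting under C ~= 6 * N * D.
# The exponent `a` is a hypothetical knob, NOT a fitted value from the paper:
# a = 0.5 gives a balanced (Chinchilla-style) split, while a > 0.5 tilts the
# budget toward model size, in the direction the abstract suggests for
# interleaved SLMs.

def optimal_split(compute_flops: float, a: float = 0.5) -> tuple[float, float]:
    """Return (params N, tokens D) with N = (C/6)^a and D = (C/6)^(1-a),
    so that 6 * N * D recovers the full budget C."""
    base = compute_flops / 6.0
    n_params = base ** a
    n_tokens = base ** (1.0 - a)
    return n_params, n_tokens

if __name__ == "__main__":
    budget = 6.9e20  # FLOP budget, matching the larger model in the table below
    n_bal, d_bal = optimal_split(budget, a=0.5)
    n_big, d_big = optimal_split(budget, a=0.6)
    print(f"balanced:    N={n_bal:.2e} params, D={d_bal:.2e} tokens")
    print(f"model-heavy: N={n_big:.2e} params, D={d_big:.2e} tokens")
```

For any exponent, the two factors multiply back to the full budget; raising `a` simply trades tokens for parameters at fixed compute.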
| Prompt | SIMS (Qwen 7B, 6.9e20 FLOPs) | SIMS (LLaMa 3B, 6e19 FLOPs) |
|---|---|---|
| one two three four five | | |
| The capital of France is | | |
| who liked very much to hear himself talk. One day, he was trying to say something, but only came out with a few of the same sound. Oh, he said. It's a bit of an accident, and I shall remember it some day. This occurred in the year 1944. It was the next day, however, that the computer said to Tommy, What was that strange sound? I have no idea. said tommy | the story of a computer, or a computer, that talked. It was a little computer that lived in a room. The computer was a friend of the boy who lived in the house. The boy's name was Tom. Tom and the little computer liked to play games together. They played all day long. One day, Tom and the little computer went to the park. They played on the swings and the slide. They had a lot of fun. But then, Tom got hungry. He wanted a snack. He looked in his lunch box. There were no snacks. Tom was sad. The little computer saw Tom was sad. It wanted to help Tom. The little computer had an idea. It said, "Tom, you can eat my | |
| as Massachusetts, New York, Ohio, and Virginia, as well as many smaller states. But it does not matter very much for the purposes of describing the United States, if he tries to think of the whole. | as this. The United States is a federation of states. Each state is a separate political entity. Each state is also an integral part of the national government. | |
| The names of the continents are North Africa, Asia, America, Europe, and Oceania. | the name of the continent is asia, the name of the continent is europe, the name of the continent is africa, the name of the continent is america, the name of the continent is oceania. | |
| can travel fifteen and a half this afternoon and see if the mail will bring us to the house. We planned to be here a week ago we have only thirty-two miles left and have been averaging ten miles an hour this trip. Mr. gregory said if you will go with me to washington i'll make it. | can't go any farther, I am afraid, sir, said the guide. It is the first time we have seen the water. But if we could get a boat, we could make a raft, and it would be easier to row a raft than to walk. We have seen some of our men do it. We could get a boat if we could find one. I will send one for you, if you like. Yes, thank you. You see, I can't go on any farther. I am very tired. I am afraid I must stop here. I will send a boat to take you home. Thank you very much, sir. We will go now. The guide was about to leave, but the guide paused, and | |
| D. E. F. G. H. I. J. K. L. M. N. O. P. R. D. S C. D. D. | D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, A. B. C. D. E. F. G. H. I. J. K. L. M. N. O. P. Q. R. S. T. U. V. W. X. Y. Z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, | |

Model continuations are transcribed verbatim, including their errors; empty cells correspond to audio-only samples on the original demo page.
@misc{maimon2025scaling,
title={Scaling Analysis of Interleaved Speech-Text Language Models},
author={Gallil Maimon and Michael Hassid and Amit Roth and Yossi Adi},
year={2025},
eprint={2504.02398},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.02398},
}