Marmoset monkeys (Callithrix jacchus) exhibit complex vocal communication, challenging the traditional view that primate vocalization is entirely innate. However, since marmosets communicate solely through vocalizations, with no written form, applying standard LLM approaches is not straightforward. In this work, we introduce Generative Marmoset Spoken Language Modeling (GmSLM), an optimized spoken language model pipeline for marmoset vocalizations. We evaluate GmSLM using zero-shot metrics and weakly labeled conversational marmoset data, demonstrating its superiority over a naive human-speech-based baseline. Additionally, we show that generated vocalizations closely match real resynthesized samples. Despite being fully unsupervised, GmSLM effectively distinguishes authentic marmoset conversations from artificial ones, establishing a foundation for future research in primate vocal communication and advancing studies of vocal learning across neuroscience, bioacoustics, and evolutionary biology.
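To make the pipeline idea concrete, here is a minimal toy sketch of the generic spoken-language-modeling recipe (quantize acoustic features into discrete units, then model unit sequences, then score candidate conversations). All data, names, and the bigram model below are hypothetical simplifications for illustration; they are not the GmSLM implementation.

```python
# Toy sketch of a unit-based spoken LM: quantize frames -> train a
# sequence model over units -> score authentic vs. artificial sequences.
# All values here are synthetic; a 1-D feature and a bigram model stand
# in for real acoustic embeddings and a real language model.
import math

def quantize(features, centroids):
    """Map each (1-D) feature frame to the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: abs(f - centroids[k]))
            for f in features]

def train_bigram(sequences, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram transition probabilities over unit sequences."""
    counts = [[alpha] * vocab_size for _ in range(vocab_size)]
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return [[c / sum(row) for c in row] for row in counts]

def log_likelihood(seq, probs):
    """Score a unit sequence under the bigram model."""
    return sum(math.log(probs[a][b]) for a, b in zip(seq, seq[1:]))

# Synthetic "calls": authentic sequences follow a repeating 0-1-2-3 pattern.
centroids = [0.0, 1.0, 2.0, 3.0]
train_feats = [[0.1, 1.1, 2.1, 3.1] * 10 for _ in range(5)]
train_units = [quantize(f, centroids) for f in train_feats]
probs = train_bigram(train_units, len(centroids))

real = quantize([0.1, 1.1, 2.1, 3.1] * 2, centroids)   # authentic pattern
artificial = [0, 0, 2, 1, 3, 3, 1, 0]                  # scrambled pattern
ll_real = log_likelihood(real, probs)
ll_artificial = log_likelihood(artificial, probs)
# The model assigns a higher score to the authentic-pattern sequence,
# mirroring the paper's real-vs-artificial conversation distinction.
```

In GmSLM the analogous components are far richer (learned acoustic units over real recordings and a neural language model rather than a smoothed bigram), but the evaluation logic sketched here, comparing sequence likelihoods, is the same in spirit.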
We provide generation examples for different conditions. Click on the play buttons to listen to the samples.
| id | Original | Re-synthesized | Generated |
|---|---|---|---|
| 1 | (audio) | (audio) | (audio) |
| 2 | (audio) | (audio) | (audio) |
| 3 | (audio) | (audio) | (audio) |
| 4 | (audio) | (audio) | (audio) |
| 5 | (audio) | (audio) | (audio) |
| 6 | (audio) | (audio) | (audio) |
| 7 | (audio) | (audio) | (audio) |