AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

Guy Yariv1,4, Itai Gat2, Lior Wolf3, Yossi Adi1,*, Idan Schwartz3,4,*
The Hebrew University of Jerusalem, Israel1
Technion - Israel Institute of Technology2
Tel-Aviv University3
NetApp4

Abstract

In recent years, image generation has seen a great leap in performance, with diffusion models playing a central role. Although such models generate high-quality images, they are mainly conditioned on textual descriptions. This begs the question: how can we adapt such models to be conditioned on other modalities? In this paper, we propose a novel method that utilizes latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered an adaptation layer between the audio and text representations. Such a modeling paradigm requires a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest the proposed method is superior to the evaluated baselines on both objective and subjective metrics.

How does AudioToken work?


[Figure: AudioToken method overview]

We forward an audio recording through a pre-trained audio encoder and then through an Embedder network, which maps it to a dedicated audio token. A pre-trained text encoder then encodes the prompt tokens produced by the tokenizer together with this audio token, and the generative model is fed the concatenated tensor of representations. Note that only the Embedder (MLP + attentive pooling) parameters are trained during this process; the audio encoder, text encoder, and diffusion model remain frozen. A minimal sketch of such an Embedder is given below.
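The following PyTorch sketch illustrates the trainable Embedder described above; it is not the released implementation. The feature dimensions, number of attention heads, and names (AudioTokenEmbedder, audio_dim, text_dim) are illustrative assumptions. It maps frame-level features from a frozen audio encoder to a single token in the text-embedding space of the frozen text encoder.

import torch
import torch.nn as nn

class AudioTokenEmbedder(nn.Module):
    """Maps frame-level audio features to a single token-like embedding."""

    def __init__(self, audio_dim=768, hidden_dim=1024, text_dim=768, num_heads=8):
        super().__init__()
        # Frame-wise MLP projecting audio features toward the text-embedding space.
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, text_dim),
        )
        # Attentive pooling: a learned query attends over the projected frames
        # and collapses them into one "audio token".
        self.query = nn.Parameter(torch.randn(1, 1, text_dim))
        self.pool = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim) from the frozen audio encoder.
        h = self.mlp(audio_feats)                  # (batch, frames, text_dim)
        q = self.query.expand(h.size(0), -1, -1)   # (batch, 1, text_dim)
        token, _ = self.pool(q, h, h)              # (batch, 1, text_dim)
        return token

# Example: one audio token per recording from dummy frame-level features.
embedder = AudioTokenEmbedder()
features = torch.randn(2, 250, 768)   # (batch, frames, audio_dim), dummy values
audio_token = embedder(features)      # -> shape (2, 1, 768)

In use, the resulting audio token would be spliced into the embedded prompt (e.g. in place of a placeholder word) before the frozen text encoder, so the latent diffusion model is conditioned on it through its usual cross-attention mechanism.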

Comparison of audio-to-image generation on VGGSound

[Figure: qualitative comparison of audio-to-image generation on VGGSound]

Fine-grained details

[Figure: examples showing fine-grained details]

Multiple entities

[Figure: examples with multiple entities]

BibTeX

@article{yariv2023audiotoken,
  title={AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation},
  author={Yariv, Guy and Gat, Itai and Wolf, Lior and Adi, Yossi and Schwartz, Idan},
  journal={arXiv preprint arXiv:2305.13050},
  year={2023}
}