The Hebrew University of Jerusalem, Israel¹; Technion²; Tel-Aviv University³; NetApp⁴

Abstract:


We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples and further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, both with respect to content and temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse.
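To make the peak-comparison idea behind AV-Align concrete, the following is a minimal sketch and not the metric's actual definition, which the abstract does not specify. It assumes each modality has already been reduced to a per-frame energy sequence of equal length; how those sequences are obtained is left open. The helper names `find_peaks` and `peak_alignment_score`, the `tolerance` window, and the intersection-over-union scoring are all illustrative assumptions.

```python
def find_peaks(energy, threshold=0.0):
    """Return indices of local maxima whose value exceeds `threshold`.

    A plateau counts once, at its first frame. Endpoints are never peaks.
    """
    peaks = []
    for i in range(1, len(energy) - 1):
        rises = energy[i] > energy[i - 1]
        does_not_rise_after = energy[i] >= energy[i + 1]
        if energy[i] > threshold and rises and does_not_rise_after:
            peaks.append(i)
    return peaks


def peak_alignment_score(audio_energy, video_energy, tolerance=1, threshold=0.0):
    """Hypothetical alignment score: IoU of audio and video energy peaks.

    An audio peak matches at most one video peak lying within `tolerance`
    frames of it. The score is matched / (audio + video - matched).
    Assumption: if neither signal has a peak, the score is defined as 0.0.
    """
    audio_peaks = find_peaks(audio_energy, threshold)
    unmatched_video = find_peaks(video_energy, threshold)
    n_video = len(unmatched_video)
    if not audio_peaks and not unmatched_video:
        return 0.0

    matched = 0
    for a in audio_peaks:
        # Candidate video peaks close enough in time to this audio peak.
        candidates = [v for v in unmatched_video if abs(v - a) <= tolerance]
        if candidates:
            nearest = min(candidates, key=lambda v: abs(v - a))
            unmatched_video.remove(nearest)  # each video peak is used once
            matched += 1

    union = len(audio_peaks) + n_video - matched
    return matched / union


if __name__ == "__main__":
    audio = [0, 1, 0, 0, 1, 0, 0, 0]
    video = [0, 0, 1, 0, 1, 0, 0, 0]
    print(peak_alignment_score(audio, video, tolerance=1))
    print(peak_alignment_score(audio, video, tolerance=0))
```

Under this sketch, a larger `tolerance` forgives small temporal offsets between a sound event and the corresponding visual change, while `tolerance=0` requires peaks to fall on the same frame.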

[Figure: qualitative results, four panels.
VGGSound: our method vs. zeroscope_v2_576w (text-to-video), for the prompts "fireworks banging", "underwater bubbling", "playing drum kit", "chicken crowing", and "dog barking".
Landscape: our method vs. MM-Diffusion.
AudioSet Drum: our method vs. TATS.
Joint Audio-Text to Video Generation, for the prompts "A video of {TemporalAudioTokens}, with vibrant red and orange foliage", "A video of {TemporalAudioTokens} in abstract colors", "A video of {TemporalAudioTokens} on the moon", "A painting of {TemporalAudioTokens}", "A video of {TemporalAudioTokens} in the Desert", and "Isometric digital art of {TemporalAudioTokens}".]