Aero: Audio Super Resolution in the Spectral Domain

School of Computer Science and Engineering
The Hebrew University of Jerusalem, Israel

Abstract

We present AERO, a audio super-resolution model that processes speech and music signals in the spectral domain. AERO is based on an encoder-decoder architecture with U-Net like skip connections. We optimize the model using both time and frequency domain loss functions. Specifically, we consider a set of reconstruction losses together with perceptual ones in the form of adversarial and feature discriminator loss functions. To better handle phase information the proposed method operates over the complex-valued spectrogram using two separate channels. Unlike prior work which mainly considers low and high frequency concatenation for audio super-resolution, the proposed method directly predicts the full frequency range. We demonstrate high performance across a wide range of sample rates considering both speech and music. AERO outperforms the evaluated baselines considering Log-Spectral Distance, ViSQOL, and the subjective MUSHRA test.

Section Ⅰ: Examples for samples upsampled from 4kHz to 16kHz.

The model is trained on the first 100 speakers of the VCTK dataset. The following samples are generated from the remaining 8 speakers.

Original low resolution
(4 kHz -> 16 kHz)
Original high resolution (16 kHz) Sinc (16 kHz) TFiLM (16 kHz) SEANet (16 kHz) Ours (16 kHz)
Audio
Linear
Spectrogram
Audio
Linear
Spectrogram

Section Ⅱ: Examples for samples upsampled from 8kHz to 16kHz.

The model is trained on the first 100 speakers of the VCTK dataset. The following samples are generated from the remaining 8 speakers.

Original low resolution
(4 kHz -> 16 kHz)
Original high resolution (16 kHz) Sinc (16 kHz) TFiLM (16 kHz) SEANet (16 kHz) Ours (16 kHz)
Audio
Linear
Spectrogram
Audio
Linear
Spectrogram

Section Ⅲ: Examples for samples upsampled from 8kHz to 24kHz.

The model is trained on the first 100 speakers of the VCTK dataset. The following samples are generated from the remaining 8 speakers.

Original low resolution
(8 kHz -> 24 kHz)
Original high resolution (24 kHz) Sinc (24 kHz) SEANet (24 kHz) Ours, hl=256 (24 kHz) Ours, hl=128 (24 kHz) Ours, hl=64 (24 kHz)
Audio
Linear
Spectrogram
Audio
Linear
Spectrogram
Audio
Linear
Spectrogram

Section Ⅳ: Examples for samples upsampled from 12kHz to 48kHz.

The model is trained on the first 100 speakers of the VCTK dataset. The following samples are generated from the remaining 8 speakers.

Original low resolution
(12 kHz -> 48 kHz)
Original high resolution (48 kHz) Sinc (48 kHz) SEANet (48 kHz) Nu-wave 2 (48 kHz) Ours (48 kHz)
Audio
Linear
Spectrogram
Audio
Linear
Spectrogram
Audio
Linear
Spectrogram

Section Ⅴ: Examples for samples upsampled from 11.025kHz to 44.1kHz.

The model is trained on the train set of the MusDB-HQ dataset. The following samples are generated from the test set.

Original low resolution
(11 kHz -> 44 kHz)
Original high resolution (44 kHz) Sinc (44 kHz) SEANet (44 kHz) BEHM-Gan(44 kHz) Ours, hl=256 (44 kHz) Ours, hl=128 (44 kHz) Ours, hl=64 (44 kHz)
Audio
Linear
Spectrogram
Audio
Linear
Spectrogram
Audio
Linear
Spectrogram
Audio
Linear
Spectrogram
Audio
Linear
Spectrogram

Section Ⅵ: Examples for adversarial ablation study.

The model is trained on the first 100 speakers of the VCTK dataset. The following samples are generated from the remaining 8 speakers.

Original low resolution
(4 kHz -> 16 kHz)
Original high resolution (16 kHz) Non adv. (16 kHz) 3 MSD: Adv. loss only (16 kHz) 3 MSD: Feature loss only (16 kHz) 1 MSD (16 kHz) 3 MSD (16 kHz)
Audio
Linear
Spectrogram
Audio
Linear
Spectrogram