We propose ConTuner, an efficient diffusion model for high-fidelity Singing Voice Beautifying. The diffusion model is combined with modified conditions to generate mel-spectrograms. We also reduce the number of sampling steps t by using generator-based methods. For automatic pitch correction, we establish a mapping from MIDI and the spectral envelope to pitch. To make amateur singing more expressive, we propose an expressiveness enhancer that operates in the latent space and converts the amateur vocal tone into a professional one.
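
To give a feel for what pitch correction against a MIDI score involves, here is a minimal sketch. It is not ConTuner's pitch predictor, which learns its mapping from MIDI and the spectral envelope. It only shows the standard equal-temperament conversion from a MIDI note number to a frequency, the deviation of a sung frequency from that note in cents, and a naive baseline that snaps every voiced frame to the score pitch. The function names and the 440 Hz reference tuning are assumptions made for illustration.

```python
import math

A4_MIDI = 69     # MIDI note number of A4
A4_HZ = 440.0    # reference tuning (assumed; not stated on this page)

def midi_to_hz(note: float) -> float:
    """Equal-temperament frequency of a MIDI note number."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)

def cents_off(f0_hz: float, target_note: int) -> float:
    """Signed deviation of a sung frequency from the target note, in cents."""
    return 1200.0 * math.log2(f0_hz / midi_to_hz(target_note))

def snap_to_score(f0_track, target_note):
    """Naive baseline: replace each voiced frame (f0 > 0) with the score pitch.

    Unvoiced frames (f0 == 0) are left at 0.
    """
    target = midi_to_hz(target_note)
    return [target if f0 > 0 else 0.0 for f0 in f0_track]

# Hypothetical amateur F0 track (Hz) for a note that should be A4.
amateur_f0 = [0.0, 428.0, 431.5, 436.0, 0.0]
corrected_f0 = snap_to_score(amateur_f0, 69)
```

Hard snapping of this kind flattens natural pitch movement such as vibrato and note transitions, which is one plausible reason to prefer a learned mapping over a rule like this.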

Model Architecture

Figure 1. The overall architecture of ConTuner.

Singing Audio Samples

There are four models in total: [1] GTMel amateur (A) and [2] GTMel professional (P), where we first convert the ground-truth audio into mel-spectrograms and then convert the mel-spectrograms back to audio via the vocoder; [3] w/o Expressiveness Enhancer, where the expressiveness enhancer is removed from ConTuner, which means that the pitch predictor takes part in the beautifying; [4] ConTuner, the proposed model.

All four models have a slight electrical sound because we use Griffin-Lim as the vocoder. Please pay more attention to the pitch and expressiveness of the songs.
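
As background for the GTMel baselines, the sketch below shows the frequency warping behind a mel-spectrogram. It uses the common HTK-style mel formula and computes the center frequencies of a bank of mel bands. The page does not state the mel configuration, so the band count and frequency range here are hypothetical values chosen only for illustration.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel scale: roughly linear at low and logarithmic at high frequencies."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> list:
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]

# Hypothetical configuration: 80 bands between 0 Hz and 12 kHz.
centers = mel_band_centers(n_mels=80, f_min=0.0, f_max=12000.0)
```

A mel-spectrogram keeps only band magnitudes, so the phase has to be estimated when converting back to audio. Griffin-Lim does this iteratively, which is consistent with the slight artifacts mentioned above.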



I. 在我心中曾经有一个梦(zai wo xin zhong ceng jing you yi ge meng)

GT Amateur | GT Professional | w/o Expressiveness Enhancer | ConTuner

II. 你总说毕业遥遥无期转眼就各奔东西(ni zong shuo bi ye yao yao wu qi zhuan yan jiu ge ben dong xi)

GT Amateur | GT Professional | w/o Expressiveness Enhancer | ConTuner

III. 明天你是否还惦记曾经最爱哭的你(ming tian ni shi fou hai dian ji ceng jing zui ai ku de ni)

GT Amateur | GT Professional | w/o Expressiveness Enhancer | ConTuner



IV. Because when the sun shines, we’ll shine together. Told you I’ll be here forever

GT Amateur | GT Professional | w/o Expressiveness Enhancer | ConTuner

V. I said, no one has to know what we do

GT Amateur | GT Professional | w/o Expressiveness Enhancer | ConTuner

VI. Said I’ll always be a friend, took an oath. I’ma stick it out till the end

GT Amateur | GT Professional | w/o Expressiveness Enhancer | ConTuner