Recent generative models have shown promising results in audio generation across various domains, including human speech, singing voice, and multi-instrument music synthesis. Such acoustic models are typically specialized, with separate systems for speech, singing, and instrumental music. However, real-world audio often comprises multiple domains—for instance, musical recordings that combine a sung melody or spoken lyrics with instrumental accompaniment. This highlights the need for more general-purpose approaches to audio synthesis that can handle such integration. As an initial step towards universal synthesis, in this work we compare different acoustic models originating from distinct domains—instrumental music synthesis and speech synthesis—on the task of human voice conversion. Through an extensive evaluation across singing and speech, we demonstrate that a diffusion-based instrumental music synthesis model can be effectively adapted to human voice conversion, achieving performance comparable to or surpassing that of a dedicated speech synthesis model. To facilitate training on large-scale, minimally curated datasets, we demonstrate that off-the-shelf feature extractors for phonetics, pitch and acoustics provide effective conditioning signals for the synthesizer, enabling self-supervised training.
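To make this setup concrete, the sketch below illustrates a self-supervised training step of the kind described above: frozen, off-the-shelf extractors compute phonetic, pitch, and loudness conditioning from the training clip itself, and the synthesizer learns to reconstruct that clip from these signals plus a speaker identity; at inference, the identity is swapped to perform conversion. All modules, names, and dimensions here are hypothetical placeholders, not the extractors or architecture actually used in this work.

```python
# Illustrative sketch only: placeholder extractors and synthesizer, not the
# actual models used in the paper.
import torch
import torch.nn as nn

class FrozenExtractor(nn.Module):
    """Stand-in for an off-the-shelf, frozen feature extractor (PPG, F0, loudness)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, mel):                    # mel: (batch, frames, in_dim)
        return self.proj(mel)

class Synthesizer(nn.Module):
    """Stand-in for the conditional acoustic model (e.g. a diffusion decoder)."""
    def __init__(self, cond_dim, n_speakers, mel_dim=80):
        super().__init__()
        self.spk = nn.Embedding(n_speakers, 64)
        self.net = nn.GRU(cond_dim + 64, mel_dim, batch_first=True)

    def forward(self, cond, speaker_id):
        spk = self.spk(speaker_id).unsqueeze(1).expand(-1, cond.size(1), -1)
        out, _ = self.net(torch.cat([cond, spk], dim=-1))
        return out

mel_dim = 80
ppg_ext  = FrozenExtractor(mel_dim, 72)        # phonetic content
f0_ext   = FrozenExtractor(mel_dim, 1)         # pitch
loud_ext = FrozenExtractor(mel_dim, 1)         # acoustics / loudness
model    = Synthesizer(cond_dim=72 + 1 + 1, n_speakers=100)

mel = torch.randn(4, 200, mel_dim)             # features of the training clip
speaker_id = torch.randint(0, 100, (4,))       # the clip's own speaker

# Self-supervised training: reconstruct the clip from its own extracted
# conditioning signals, so no transcriptions or parallel data are needed.
cond = torch.cat([ppg_ext(mel), f0_ext(mel), loud_ext(mel)], dim=-1)
loss = nn.functional.mse_loss(model(cond, speaker_id), mel)

# Conversion at inference: keep the source conditioning, swap the identity.
converted = model(cond, torch.full_like(speaker_id, 7))
```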
The samples in this section are generated using our proposed attention-based T5-Voc model, trained on both speech and singing data.
The first column ("Source") presents real vocal excerpts that are converted across different singers.
Each subsequent column corresponds to a target singer, and the first row ("Target") provides a real reference excerpt for each.
Originally developed for instrumental music synthesis, this T5-based diffusion model has been effectively adapted for singing voice conversion.
The version conditioning mechanism—implemented using FiLM layers and initially designed for acoustic conditioning in instrumental music synthesis—can be repurposed to condition on singer identity.
Additionally, we use an off-the-shelf phonetic posteriorgram (PPG) extractor, trained only on speech, to condition the model on phonetic content.
Despite being trained exclusively on speech, the PPG extractor successfully captures the phonetic content of singing, which is preserved during singer conversion.
For a comparison with the FlowMAC model (based on MatchaTTS and ForwardTacotron), refer to the singing MUSHRA-like listening test section.
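As a concrete illustration of the FiLM-based conditioning described above, the sketch below shows how a learned singer embedding can scale and shift intermediate features of the synthesizer. The layer sizes, embedding dimension, and wiring are illustrative assumptions rather than the exact T5-Voc configuration.

```python
# Minimal sketch of FiLM conditioning on singer identity (illustrative only).
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: per-channel scale and shift."""
    def __init__(self, cond_dim, n_channels):
        super().__init__()
        self.to_scale = nn.Linear(cond_dim, n_channels)
        self.to_shift = nn.Linear(cond_dim, n_channels)

    def forward(self, features, cond):
        # features: (batch, frames, channels); cond: (batch, cond_dim)
        scale = self.to_scale(cond).unsqueeze(1)   # (batch, 1, channels)
        shift = self.to_shift(cond).unsqueeze(1)
        return features * (1 + scale) + shift

# The conditioning pathway that originally encoded the acoustic "version"
# of an instrumental recording is repurposed to encode singer identity.
singer_embedding = nn.Embedding(num_embeddings=32, embedding_dim=128)
film = FiLM(cond_dim=128, n_channels=512)

hidden = torch.randn(2, 200, 512)       # intermediate decoder features
singer_id = torch.tensor([3, 17])       # target singers for the two examples
modulated = film(hidden, singer_embedding(singer_id))
```

Such modulation layers can be applied at several depths of the decoder so that every stage of the synthesis is aware of the target identity.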
The samples in this section are generated using the attention-based T5-Voc model, trained on vocal data and conditioned on well-known singers from popular music
as well as singers from the Schubert Winterreise dataset. Vocal tracks were obtained through source separation software (see the repositories section for details).
Following the format of the previous section, each column corresponds to a different singer. The first row ("Target") provides a real reference excerpt for each singer,
while the subsequent rows ("Generated") contain singing excerpts that have been converted across the different singers, demonstrating the model’s ability to preserve phonetic and musical content while adapting timbre and style.
This section presents speech voice conversion samples generated by our proposed attention-based T5-Voc model, trained on both speech and singing data.
The first column ("Source") contains real speech excerpts, which are converted across target speakers shown in the subsequent columns.
The first row ("Target") provides a real reference excerpt for each target speaker.
As in the previous sections, the model leverages the FiLM-based version conditioning mechanism, originally designed for acoustic conditioning in instrumental music synthesis,
to effectively condition on speaker identity in the speech domain.
For a comparative evaluation with the FlowMAC model (based on MatchaTTS and ForwardTacotron),
please refer to the speech MUSHRA-like listening test section.
The samples in this section are from the speech MUSHRA-like listening test. The following models are evaluated:
PAD-Voc: a voice conversion model based on ForwardTacotron, trained with a reconstruction loss.
MAC-Voc: the FlowMAC model, trained with a flow-matching objective and conditioned on the output of PAD-Voc.
T5-Voc: our proposed attention-based T5 diffusion model, adapted from instrumental music synthesis to human voice conversion.
T5-All: the same model as T5-Voc, but trained on a broader dataset that includes instrumental music and vocal-instrumental mixtures in addition to pure vocal data.
The samples in this section are from the singing MUSHRA-like listening test. The following models are evaluated:
PAD-Voc: a voice conversion model based on ForwardTacotron, trained with a reconstruction loss.
MAC-Voc: the FlowMAC model, trained with a flow-matching objective and conditioned on the output of PAD-Voc.
T5-Voc: our proposed attention-based T5 diffusion model, adapted from instrumental music synthesis to human voice conversion.
T5-All: the same model as T5-Voc, but trained on a broader dataset that includes instrumental music and vocal-instrumental mixtures in addition to pure vocal data.
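For reference, the sketch below contrasts the two training objectives mentioned in the model descriptions above: a plain reconstruction loss (as used for PAD-Voc) and a standard conditional flow-matching loss of the kind used by MAC-Voc. The networks are generic stand-ins, and the exact formulation in FlowMAC may differ from this textbook version.

```python
# Illustrative comparison of the two objectives; generic stand-in networks.
import torch
import torch.nn as nn

def reconstruction_loss(decoder, cond, target_mel):
    """PAD-Voc-style objective: directly regress the target features."""
    return nn.functional.l1_loss(decoder(cond), target_mel)

def flow_matching_loss(vector_field, cond, target_mel):
    """Conditional flow matching (textbook form): regress the constant velocity
    of a straight path from noise x0 to data x1 at a random time t."""
    x1 = target_mel
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0)).view(-1, 1, 1)  # one random time per example
    xt = (1 - t) * x0 + t * x1                 # point on the straight path
    return nn.functional.mse_loss(vector_field(xt, t, cond), x1 - x0)

class TinyVectorField(nn.Module):
    """Generic stand-in for the conditional vector-field network."""
    def __init__(self, mel_dim=80, cond_dim=80):
        super().__init__()
        self.proj = nn.Linear(mel_dim + cond_dim + 1, mel_dim)

    def forward(self, xt, t, cond):
        t = t.expand(-1, xt.size(1), 1)        # broadcast time over frames
        return self.proj(torch.cat([xt, cond, t], dim=-1))

cond = torch.randn(4, 200, 80)   # e.g. features produced by a first-stage model
mel  = torch.randn(4, 200, 80)   # target mel-spectrogram frames
loss_fm  = flow_matching_loss(TinyVectorField(), cond, mel)
loss_rec = reconstruction_loss(nn.Linear(80, 80), cond, mel)
```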
The samples in this section are taken from the singer similarity listening test. In this test, the same source excerpt is rendered in two different voices:
that of the target singer and that of a different, randomly sampled singer. Listeners are asked to rate each sample's similarity to the target singer.
We compare two models: T5-Voc and MAC-Voc, resulting in four converted samples per source excerpt. For each converted sample,
we also provide a reference excerpt (of different content) from the target singer, to guide the similarity judgment.
The samples in this section are generated using the attention-based T5-All model, trained on a combined dataset of vocal and instrumental music.
The instrumental content was derived from automatically transcribed piano rolls.
For vocal-only samples generated by this model, refer to the MUSHRA-like listening test sections for speech and singing.
While this paper focuses on the evaluation of vocal synthesis, we plan to explore the generation of vocal-instrumental mixtures more extensively in future work.