Recent generative models have shown promising results in audio generation across various domains, including human speech, singing voice, and multi-instrument music synthesis. Such acoustic models are typically specialized, with separate systems for speech, singing, and instrumental music. However, real-world audio often comprises multiple domains—for instance, musical recordings that combine a sung melody or spoken lyrics with instrumental accompaniment. This highlights the need for more general-purpose approaches to audio synthesis that can handle such integration. As an initial step towards universal synthesis, in this work we compare different acoustic models originating from distinct domains—instrumental music synthesis and speech synthesis—on the task of human voice conversion. Through an extensive evaluation across singing and speech, we demonstrate that a diffusion-based instrumental music synthesis model can be effectively adapted to human voice conversion, achieving performance comparable to or surpassing that of a dedicated speech synthesis model. To facilitate training on large-scale, minimally curated datasets, we demonstrate that off-the-shelf feature extractors for phonetics, pitch and acoustics provide effective conditioning signals for the synthesizer, enabling self-supervised training.
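To make this setup concrete, the sketch below illustrates a self-supervised training step of the kind described above: frozen, off-the-shelf extractors compute phonetic, pitch, and loudness conditioning from the training clip itself, and the synthesizer learns to reconstruct that clip from these signals plus a speaker identity; at inference, the identity is swapped to perform conversion. All modules, names, and dimensions here are hypothetical placeholders, not the extractors or architecture actually used in this work.

```python
# Illustrative sketch only: placeholder extractors and synthesizer, not the
# actual models used in the paper.
import torch
import torch.nn as nn

class FrozenExtractor(nn.Module):
    """Stand-in for an off-the-shelf, frozen feature extractor (PPG, F0, loudness)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, mel):                    # mel: (batch, frames, in_dim)
        return self.proj(mel)

class Synthesizer(nn.Module):
    """Stand-in for the conditional acoustic model (e.g. a diffusion decoder)."""
    def __init__(self, cond_dim, n_speakers, mel_dim=80):
        super().__init__()
        self.spk = nn.Embedding(n_speakers, 64)
        self.net = nn.GRU(cond_dim + 64, mel_dim, batch_first=True)

    def forward(self, cond, speaker_id):
        spk = self.spk(speaker_id).unsqueeze(1).expand(-1, cond.size(1), -1)
        out, _ = self.net(torch.cat([cond, spk], dim=-1))
        return out

mel_dim = 80
ppg_ext  = FrozenExtractor(mel_dim, 72)        # phonetic content
f0_ext   = FrozenExtractor(mel_dim, 1)         # pitch
loud_ext = FrozenExtractor(mel_dim, 1)         # acoustics / loudness
model    = Synthesizer(cond_dim=72 + 1 + 1, n_speakers=100)

mel = torch.randn(4, 200, mel_dim)             # features of the training clip
speaker_id = torch.randint(0, 100, (4,))       # the clip's own speaker

# Self-supervised training: reconstruct the clip from its own extracted
# conditioning signals, so no transcriptions or parallel data are needed.
cond = torch.cat([ppg_ext(mel), f0_ext(mel), loud_ext(mel)], dim=-1)
loss = nn.functional.mse_loss(model(cond, speaker_id), mel)

# Conversion at inference: keep the source conditioning, swap the identity.
converted = model(cond, torch.full_like(speaker_id, 7))
```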
The samples in this section are generated using our proposed attention-based T5-Voc model, trained on both speech and singing data.
The first column ("Source") presents real vocal excerpts that are converted across different singers.
Each subsequent column corresponds to a target singer, and the first row ("Target") provides a real reference excerpt for each.
Originally developed for instrumental music synthesis, this T5-based diffusion model has been effectively adapted for singing voice conversion.
The version conditioning mechanism—implemented using FiLM layers and initially designed for acoustic conditioning in instrumental music synthesis—can be repurposed to condition on singer identity.
Additionally, we use an off-the-shelf phonetic posteriorgram (PPG) extractor, trained only on speech, to condition the model on phonetic content.
Despite being trained exclusively on speech, the PPG extractor successfully captures the phonetic content of singing, which is preserved during singer conversion.
For a comparison with the FlowMAC model (based on MatchaTTS and ForwardTacotron), refer to the singing MUSHRA-like listening test section.
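As a concrete illustration of the FiLM-based conditioning described above, the sketch below shows how a learned singer embedding can scale and shift intermediate features of the synthesizer. The layer sizes, embedding dimension, and wiring are illustrative assumptions rather than the exact T5-Voc configuration.

```python
# Minimal sketch of FiLM conditioning on singer identity (illustrative only).
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: per-channel scale and shift."""
    def __init__(self, cond_dim, n_channels):
        super().__init__()
        self.to_scale = nn.Linear(cond_dim, n_channels)
        self.to_shift = nn.Linear(cond_dim, n_channels)

    def forward(self, features, cond):
        # features: (batch, frames, channels); cond: (batch, cond_dim)
        scale = self.to_scale(cond).unsqueeze(1)   # (batch, 1, channels)
        shift = self.to_shift(cond).unsqueeze(1)
        return features * (1 + scale) + shift

# The conditioning pathway that originally encoded the acoustic "version"
# of an instrumental recording is repurposed to encode singer identity.
singer_embedding = nn.Embedding(num_embeddings=32, embedding_dim=128)
film = FiLM(cond_dim=128, n_channels=512)

hidden = torch.randn(2, 200, 512)       # intermediate decoder features
singer_id = torch.tensor([3, 17])       # target singers for the two examples
modulated = film(hidden, singer_embedding(singer_id))
```

Such modulation layers can be applied at several depths of the decoder so that every stage of the synthesis is aware of the target identity.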
The samples in this section are generated using the attention-based T5-Voc model, trained on vocal data and conditioned on well-known singers from popular music
as well as singers from the Schubert Winterreise dataset. Vocal tracks were obtained through source separation software (see the repositories section for details).
Following the format of the previous section, each column corresponds to a different singer. The first row ("Target") provides a real reference excerpt for each singer,
while the subsequent rows ("Generated") contain singing excerpts that have been converted across the different singers, demonstrating the model’s ability to preserve phonetic and musical content while adapting timbre and style.
This section presents speech voice conversion samples generated by our proposed attention-based T5-Voc model, trained on both speech and singing data.
The first column ("Source") contains real speech excerpts, which are converted across target speakers shown in the subsequent columns.
The first row ("Target") provides a real reference excerpt for each target speaker.
As in the previous sections, the model leverages the FiLM-based version conditioning mechanism, originally designed for acoustic conditioning in instrumental music synthesis,
to effectively condition on speaker identity in the speech domain.
For a comparative evaluation with the FlowMAC model (based on MatchaTTS and ForwardTacotron),
please refer to the speech MUSHRA-like listening test section.
The samples in this section are from the speech MUSHRA-like listening test. The following models are evaluated:
PAD-Voc: a voice conversion model based on ForwardTacotron, trained with a reconstruction loss.
MAC-Voc: the FlowMAC model, trained with a flow-matching objective and conditioned on the output of PAD-Voc.
T5-Voc: our proposed attention-based T5 diffusion model, adapted from instrumental music synthesis to human voice conversion.
T5-All: the same model as T5-Voc, but trained on a broader dataset that includes instrumental music and vocal-instrumental mixtures in addition to pure vocal data.
The samples in this section are from the singing MUSHRA-like listening test. The following models are evaluated:
PAD-Voc: a voice conversion model based on ForwardTacotron, trained with a reconstruction loss.
MAC-Voc: the FlowMAC model, trained with a flow-matching objective and conditioned on the output of PAD-Voc.
T5-Voc: our proposed attention-based T5 diffusion model, adapted from instrumental music synthesis to human voice conversion.
T5-All: the same model as T5-Voc, but trained on a broader dataset that includes instrumental music and vocal-instrumental mixtures in addition to pure vocal data.
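For reference, the sketch below contrasts the two training objectives mentioned in the model descriptions above: a plain reconstruction loss (as used for PAD-Voc) and a standard conditional flow-matching loss of the kind used by MAC-Voc. The networks are generic stand-ins, and the exact formulation in FlowMAC may differ from this textbook version.

```python
# Illustrative comparison of the two objectives; generic stand-in networks.
import torch
import torch.nn as nn

def reconstruction_loss(decoder, cond, target_mel):
    """PAD-Voc-style objective: directly regress the target features."""
    return nn.functional.l1_loss(decoder(cond), target_mel)

def flow_matching_loss(vector_field, cond, target_mel):
    """Conditional flow matching (textbook form): regress the constant velocity
    of a straight path from noise x0 to data x1 at a random time t."""
    x1 = target_mel
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0)).view(-1, 1, 1)  # one random time per example
    xt = (1 - t) * x0 + t * x1                 # point on the straight path
    return nn.functional.mse_loss(vector_field(xt, t, cond), x1 - x0)

class TinyVectorField(nn.Module):
    """Generic stand-in for the conditional vector-field network."""
    def __init__(self, mel_dim=80, cond_dim=80):
        super().__init__()
        self.proj = nn.Linear(mel_dim + cond_dim + 1, mel_dim)

    def forward(self, xt, t, cond):
        t = t.expand(-1, xt.size(1), 1)        # broadcast time over frames
        return self.proj(torch.cat([xt, cond, t], dim=-1))

cond = torch.randn(4, 200, 80)   # e.g. features produced by a first-stage model
mel  = torch.randn(4, 200, 80)   # target mel-spectrogram frames
loss_fm  = flow_matching_loss(TinyVectorField(), cond, mel)
loss_rec = reconstruction_loss(nn.Linear(80, 80), cond, mel)
```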
The samples in this section are taken from the singer similarity listening test. In this test, the same source excerpt is rendered in two different voices:
that of the target singer and that of a different, randomly sampled singer. Listeners are asked to rate each sample's similarity to the target singer.
We compare two models: T5-Voc and MAC-Voc, resulting in four converted samples per source excerpt. For each converted sample,
we also provide a reference excerpt (of different content) from the target singer, to guide the similarity judgment.
The samples in this section are generated using the attention-based T5-All model, trained on a combined dataset of vocal and instrumental music.
The instrumental content was derived from automatically transcribed piano rolls.
For vocal-only samples generated by this model, refer to the MUSHRA-like listening test sections for speech and singing.
While this paper focuses on the evaluation of vocal synthesis, we plan to explore the generation of vocal-instrumental mixtures more extensively in future work.