Multi-Aspect Conditioning for Diffusion-Based Music Synthesis: Enhancing Realism and Acoustic Control

International Audio Laboratories Erlangen, Germany
Tel Aviv University

Abstract

Music synthesis aims to generate audio from symbolic music representations, traditionally using techniques like concatenative synthesis and physical modeling. These methods offer good control but often lack expressiveness and realism in timbre. Recent advancements in diffusion-based models have enhanced the realism of synthesized audio, yet these models struggle with precise control over aspects like acoustics and timbre and are limited by the availability of high-quality annotated training data. In this paper, we introduce an advanced diffusion-based framework for music synthesis that further improves realism and introduces control through multi-aspect conditioning. This allows the synthesis from symbolic representations to accurately replicate specific performance and acoustic conditions. To address the need for precise multi-instrument target annotations, we propose using MIDI-aligned scores and automatic multi-instrument transcription based on neural networks. These methods effectively train our diffusion model with authentic audio, enhancing realism and capturing subtle nuances in performance and acoustics. As a second major contribution, we adopt conditioning techniques to gain control over multiple aspects, including score-related aspects like notes and instrumentation, as well as version-related aspects like performance and acoustics. This multi-aspect conditioning restores control over the music generation process, leading to greater fidelity in achieving the desired acoustic and stylistic outcomes. Finally, we validate our model's efficacy through systematic experiments, including qualitative listening tests and quantitative evaluation using Fréchet Audio Distance to assess version similarity, confirming the model's ability to generate realistic and expressive music, with acoustic control. Supporting evaluations and comparisons are detailed on our website (benadar293.github.io/multi-aspect-conditioning).



Improvements

This work complements and extends our ICASSP 2024 paper, with 3 significant improvements:

Synthesized Samples (Alignment-Conditioned Model)

We provide generated samples of diverse instruments and ensembles, using different Version Conditions. Orchestral sound is obtained by simply using the 'Orchestra' instrument in the input MIDI, and does not require many violin channels.
For each sample, we provide the input MIDI (upper), and our generated sample (lower). Samples generated with the same Version Condition are marked with the same color.
See the following sections for a demonstration of the Version Conditioning effect, and for a comparison with Hawthorne et al.
All samples on this page were generated with the same T5 model from the paper and vocoded with Soundstream.

MIDI: Brahms Piano Concerto 1
Version Condition: Mitsuko Uchida & Kurt Sanderling playing Beethoven Piano Concertos 1-4
Input MIDI:

Ours:
MIDI: Bach 4th Orchestral Suite Overture
Version Condition: Bach 3rd Orchestral Suite (unknown performer)
Input MIDI:

Ours:
MIDI: Bach 2nd Harpsichord Concerto
Version Condition: Bach Sonatas for Violin & Harpsichord (unknown performer)
Input MIDI:

Ours:
MIDI: Beethoven 6th Symphony (Pastoral)
Version Condition: Ferenc Fricsay & Berlin Radio Orchestra playing Brahms' Haydn Variations
Input MIDI:

Ours:

MIDI: Beethoven Piano Concerto 5
Version Condition: Mitsuko Uchida & Kurt Sanderling playing Beethoven Piano Concertos 1-4
Input MIDI:

Ours:
MIDI: Bach 1st Orchestral Suite Overture
Version Condition: Bach 3rd Orchestral Suite (unknown performer)
Input MIDI:

Ours:
MIDI: Bach 1st Harpsichord Concerto
Version Condition: Bach Sonatas for Violin & Harpsichord (unknown performer)
Input MIDI:

Ours:
MIDI: Mozart 40th Symphony Part 4
Version Condition: Ferenc Fricsay & Berlin Radio Orchestra playing Brahms' Haydn Variations
Input MIDI:

Ours:



Synthesized Samples (Transcription-Conditioned Model)

We provide generated samples of diverse instruments and ensembles, using different Version Conditions. Orchestral sound is obtained by simply using the 'Orchestra' instrument in the input MIDI, and does not require many violin channels. Samples in this section are from a model trained with score conditions obtained from transcriptions predicted by an automatic transcriber.
For each sample, we provide the input MIDI (upper), and our generated sample (lower). Samples generated with the same Version Condition are marked with the same color.
See the following sections for a demonstration of the Version Conditioning effect, and for a comparison with Hawthorne et al.


MIDI: Mendelssohn Trio for Piano, Violin and Cello
Version Condition: Trio Élégiaque playing Beethoven's Piano Trios
Input MIDI:

Ours:
MIDI: Gershwin Rhapsody in Blue
Version Condition: Mitsuko Uchida & Kurt Sanderling playing Beethoven Piano Concertos 1-4
Input MIDI:

Ours:
MIDI: Beethoven 5th Symphony Part 1
Version Condition: Czech Symphony Orchestra playing Beethoven's Coriolan Overture
Input MIDI:

Ours:
MIDI: Bach Little Fugue
Version Condition: Kay Johannsen playing Bach's Organ Trio Sonatas
Input MIDI:

Ours:

MIDI: Mendelssohn Trio for Piano, Violin and Cello
Version Condition: Trio Élégiaque playing Beethoven's Piano Trios
Input MIDI:

Ours:
MIDI: Mozart 40th Symphony Part 1
Version Condition: Czech Symphony Orchestra playing Beethoven's 3rd Symphony (Eroica)
Input MIDI:

Ours:
MIDI: Beethoven Pastoral Symphony Part 4
Version Condition: Czech Symphony Orchestra playing Beethoven's Coriolan Overture
Input MIDI:

Ours:
MIDI: Bach Mass in B Minor Part 11 (left-out)
Version Condition: Bach Mass in B Minor (unknown performer)
Input MIDI:

Ours:



Version Conditioning with Pitch-Only Input

In the following, we demonstrate the effect of Version Conditioning when providing pitch-only score input.
The exact same MIDI with pitch-only information (without instrumentation) is synthesized with different Version Conditions.
Control over the instrument in these samples is obtained only through the FiLM-based version condition, and not through the score encoder.
For each sample, we provide a segment from the conditioning version (upper), and our generated sample (lower).
For a fair comparison, all segments from the conditioning versions are vocoded with the Soundstream vocoder, also used for our generated samples.
Samples in this section are from the transcription-conditioned model.
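As a rough illustration of how a FiLM-based version condition can steer timbre without any score-side information, the sketch below applies feature-wise linear modulation: a version embedding is projected to a per-channel scale and shift that modulate intermediate activations. This is a minimal numpy sketch under assumed shapes; the names, dimensions, and random projections are illustrative, not the paper's actual implementation.

```python
import numpy as np

def film(features, version_emb, W_gamma, b_gamma, W_beta, b_beta):
    """Feature-wise Linear Modulation (FiLM): project the version
    embedding to a per-channel scale (gamma) and shift (beta), then
    modulate the activations.  features: (time, channels)."""
    gamma = version_emb @ W_gamma + b_gamma  # (channels,)
    beta = version_emb @ W_beta + b_beta     # (channels,)
    return gamma * features + beta           # broadcast over time

# Toy shapes: 8 time steps, 4 channels, 3-dim version embedding.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))            # intermediate activations
v = rng.standard_normal(3)                 # version embedding
W_g, b_g = rng.standard_normal((3, 4)), np.zeros(4)
W_b, b_b = rng.standard_normal((3, 4)), np.zeros(4)
y = film(x, v, W_g, b_g, W_b, b_b)         # modulated (8, 4) activations
```

Because gamma and beta depend only on the version embedding, the same score input produces different timbres under different version conditions.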


Bach's 7th Harpsichord Concerto
Input MIDI:

Version Condition:
Ours:
Version Condition:

Ours:
Version Condition:
Ours:
Version Condition:
Ours:


Mozart's 40th Symphony
Input MIDI:

Version Condition:
Ours:
Version Condition:

Ours:
Version Condition:
Ours:
Version Condition:
Ours:


Bach's Mass in B Minor
Input MIDI:

Version Condition:
Ours:
Version Condition:

Ours:
Version Condition:
Ours:
Version Condition:
Ours:





Version Conditioning for Different Instances of The Same Instrument

In the following, we demonstrate the use of Version Conditioning to obtain different instances of the same instrument, for example different types of harpsichords, different church organs, or different room acoustics and recording environments. In these samples, the exact same MIDI (including instrumentation) is synthesized with different version conditions.
Samples in this section are from the alignment-conditioned model.


Bach's 8th Invention Synthesized on 8 Different Harpsichords
In the last 2 examples, the timbre of the harpsichord is extracted from a mixture with a violin.

Input MIDI:

Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:

Version Condition:

Ours:

Version Condition:

Ours:
Version Condition:

Ours:
Version Condition (mixture with violin):
Ours:
Version Condition (mixture with violin):
Ours:

Orchestra - Beethoven Symphony 6 (Pastoral)
In this example we provide the same MIDI of Beethoven's 6th symphony, synthesized with 4 different orchestras.

Input MIDI:

Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:

Church Organ
Bach's Toccata and Fugue synthesized on 2 different church organs:

Input MIDI:

Version Condition:
Ours:
Version Condition:
Ours:

Violin
Paganini's Capriccio on 4 different violins:

Input MIDI:

Version Condition (mixture with piano):
Ours:
Version Condition (mixture with harpsichord):
Ours:
Version Condition (mixture with harpsichord):
Ours:
Version Condition:
Ours:

Guitar
Kansas' Dust In The Wind synthesized on 3 different guitars:

Input MIDI:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:

Violin and Piano/Harpsichord
Bach's 2nd Concerto for Orchestra and Harpsichord with different violins, and the keyboard part played on a piano (left) or a harpsichord (right):

Input MIDI (Piano as Keyboard):
Version Condition (Cello, Violin & Piano):
Ours:
Input MIDI (Harpsichord as Keyboard):
Version Condition (Violin & Harpsichord):
Ours:



Comparison to Hawthorne et al.

Our main comparison is with Hawthorne et al. (2022). Although we use the same T5 backbone, there are 2 main differences between the models:
(1) Data - We train on real data alone (~58 hours), including orchestral symphonies, while Hawthorne et al. train mainly on synthetic data (~1500 hours).
(2) We incorporate Version Conditioning.



Input MIDI:

Hawthorne et al. (flat timbre):

Ours Version Condition I:

Ours Version Condition II:

Input MIDI:

Hawthorne et al. (flat timbre):

Ours Version Condition I:

Ours Version Condition II:

Input MIDI:

Hawthorne et al. (flat timbre, timbre drift):

Ours Version Condition I:

Ours Version Condition II:

Input MIDI:

Hawthorne et al.:

Ours Version Condition I:

Ours Version Condition II:



Pop, Rock, and Jazz Music with Zero-Shot Transcription-Conditioning

In this section we provide samples of performances generated by a synthesizer trained on rock and pop music albums. For note conditions, we used transcriptions predicted by an existing transcriber. In the model used in this section, all training performances are unknown to the transcriber. The transcriber was trained mainly on rock albums with unaligned MIDIs, following Maman and Bermano, 2022. However, instrumentation in MIDIs of pop and rock music is extremely prone to inaccuracies and inconsistencies, so we rely on the synthesizer's generative power, and on Version Conditioning, to produce plausible instrumentation. Improving instrument disentanglement in such genres is an important direction for future work, and we believe improved transcription will enable more control in the generation process.

Note the model's ability to generate drums: the MIDI input contains only pitch and instrument information, without drums.
The model can also generate human voice, but it is not conditioned on lyrics; this is important future work.

MIDI: Bee Gees, More Than A Woman
Version Condition: Guns N' Roses, Appetite for Destruction
Input MIDI:

Ours:
MIDI: Beatles, Let It Be
Version Condition: Guns N' Roses, Appetite for Destruction
Input MIDI:

Ours:
MIDI: Beatles, Martha My Dear
Version Condition: Dave Brubeck, Time Out

Input MIDI:

Ours:
MIDI: Eric Clapton, I Shot The Sheriff
Version Condition: Dave Brubeck, Time Out

Input MIDI:

Ours:

MIDI: Doors, Hello I Love You
Version Condition: Metallica, Master of Puppets

Input MIDI:

Ours:
MIDI: Beatles, Let It Be
Version Condition: Queen, A Night at The Opera

Input MIDI:

Ours:
MIDI: Frank Zappa, Dancin' Fool
Version Condition: Daft Punk, Random Access Memories
Input MIDI:

Ours:
MIDI: Depeche Mode, People Are People
Version Condition: Yes, Close to The Edge

Input MIDI:

Ours:

Version Conditioning Effect in Pop, Rock, and Jazz Music

In this section, we demonstrate the effect of Version Conditioning in genres such as pop, rock, and jazz music. Each Version Condition corresponds to an album.


Britney Spears, One More Time
In this example, Britney Spears' One More Time is generated in two versions: ABBA's Super Trouper album, and Dick Schory's Music For Bang, Baaroom and Harp, which comprises mainly pitched percussion instruments.

Input MIDI:
Version Condition: ABBA, Super Trouper

Beatles, Let It Be
In this example, The Beatles' Let It Be is generated in the styles of two different rock albums. Notice the significant difference in the sound of the guitar.
Input MIDI:



Unseen Versions using TRILL Embeddings

In this section we provide samples of performances generated with version conditions derived from audio samples, enabling conditioning on unseen versions. For this, we fine-tuned the model so that the version condition is defined by the mean TRILL embedding over the version.
To isolate the version condition from the instrument condition, we use pitch-only input.
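The mean-embedding condition can be sketched as follows. This is a toy numpy version: in practice the per-frame embeddings come from the TRILL model applied to the version's audio, which we do not reproduce here, and the shapes below are illustrative.

```python
import numpy as np

def version_condition(frame_embeddings):
    """Average per-frame embeddings (e.g. TRILL outputs for a whole
    version) over time to obtain a single conditioning vector."""
    frame_embeddings = np.asarray(frame_embeddings, dtype=float)
    return frame_embeddings.mean(axis=0)

# Toy per-frame embeddings of shape (num_frames, emb_dim).
frames = np.array([[0.0, 2.0],
                   [2.0, 4.0],
                   [4.0, 6.0]])
cond = version_condition(frames)  # -> array([2., 4.])
```

Averaging over the whole version discards note-level detail and keeps what is roughly constant across it: timbre, acoustics, and recording conditions.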

Bach Cantata BWV 29
Input MIDI:

Version Condition:
Ours:
Version Condition:

Ours:
Version Condition:
Ours:
Version Condition:
Ours:


Brahms Piano Concerto 1
Input MIDI:

Version Condition:

Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:



Version Interpolation

We provide samples for version interpolation. In each sample, the interpolation weights vary linearly across time: the first version's weight decreases linearly from 1 to 0 while the other's increases linearly from 0 to 1, the two always summing to 1. We then apply the same process in the opposite direction. Weights change by 0.2 at each segment relative to the previous one.
The result is a version that shifts smoothly back and forth between the two versions.
We use two different interpolation techniques: interpolation in the version embedding space, and interpolation in the epsilon (noise prediction) space.

To isolate the interpolated version condition from the instrument condition, we use pitch-only input.
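The weight schedule described above can be sketched as follows. Only the 0.2 step comes from the text; the helper and its names are illustrative.

```python
import numpy as np

def interpolation_weights(step=0.2):
    """Per-segment weights for two versions: the first version's weight
    decreases linearly from 1 to 0 (the second's increases from 0 to 1,
    both summing to 1), changing by `step` per segment; the schedule is
    then reversed to move back to the first version."""
    n = int(round(1.0 / step)) + 1            # segments per direction
    w1 = np.linspace(1.0, 0.0, n)             # 1.0, 0.8, ..., 0.0
    forward = np.stack([w1, 1.0 - w1], axis=1)
    return np.concatenate([forward, forward[::-1]])

weights = interpolation_weights()             # shape (12, 2)
# Per segment t, the condition would blend the two version embeddings:
#   cond_t = weights[t, 0] * emb_v1 + weights[t, 1] * emb_v2
```

The same schedule applies to interpolation in epsilon space, except the blend is taken over the two noise predictions instead of the two embeddings.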


Version 1 (organ):
Version 2 (piano):
Input MIDI:


Ours, Interpolation in Version Embedding Space
(back and forth between organ & piano):

Ours, Interpolation in Epsilon Space
(back and forth between organ & piano):

Version 1 (guitar):

Version 2 (wind quintet):

Input MIDI:


Ours, Interpolation in Version Embedding Space
(back and forth between guitar & wind):

Ours, Interpolation in Epsilon Space
(back and forth between guitar & wind):

Version 1 (guitar):
Version 2 (orchestra):
Input MIDI:


Ours, Interpolation in Version Embedding Space
(back and forth between guitar & orchestra):

Ours, Interpolation in Epsilon Space
(back and forth between guitar & orchestra):

Version 1 (harpsichord):

Version 2 (violin):

Input MIDI:


Ours, Interpolation in Version Embedding Space
(back and forth between harpsichord & violin):

Ours, Interpolation in Epsilon Space
(back and forth between harpsichord & violin):



Vocoder Quality

The Soundstream vocoder's quality is an upper bound on the quality of our generated performances.
The following examples demonstrate the vocoder's quality: for each example, we show the original segment and its vocoded version (i.e., reconstructed from the mel spectrogram).


Original:

Soundstream vocoder:
Original:

Soundstream vocoder:
Original:

Soundstream vocoder:
Original:

Soundstream vocoder: