Music synthesis aims to generate audio from symbolic music representations, traditionally using techniques like concatenative synthesis and physical modeling. These methods offer good control but often lack expressiveness and realism in timbre. Recent advancements in diffusion-based models have enhanced the realism of synthesized audio, yet these models struggle with precise control over aspects like acoustics and timbre, and are limited by the availability of high-quality annotated training data. In this paper, we introduce an advanced diffusion-based framework for music synthesis that further improves realism and introduces control through multi-aspect conditioning, allowing synthesis from symbolic representations to accurately replicate specific performance and acoustic conditions.
To address the need for precise multi-instrument target annotations, we propose using MIDI-aligned scores and automatic multi-instrument transcription based on neural networks. These methods effectively train our diffusion model with authentic audio, enhancing realism and capturing subtle nuances in performance and acoustics.
As a second major contribution, we adopt conditioning techniques to gain control over multiple aspects, including score-related aspects like notes and instrumentation, as well as version-related aspects like performance and acoustics. This multi-aspect conditioning restores control over the music generation process, leading to greater fidelity in achieving the desired acoustic and stylistic outcomes.
Finally, we validate our model's efficacy through systematic experiments, including qualitative listening tests and quantitative evaluation using Fréchet Audio Distance to assess version similarity, confirming the model's ability to generate realistic and expressive music with acoustic control. Supporting evaluations and comparisons are detailed on our website (benadar293.github.io/multi-aspect-conditioning).
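For reference, the Fréchet Audio Distance used in our quantitative evaluation is the Fréchet distance between two Gaussians fitted to embedding statistics of reference and generated audio. The following is a minimal numpy sketch of that formula (function names are ours; the embedding model that produces the per-frame vectors is assumed to be given):

```python
import numpy as np

def _sqrtm_psd(mat):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_audio_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2),
    each fitted to audio embeddings of one set of recordings:
        ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrt(S1 S2)).
    Tr(sqrt(S1 S2)) is computed via the symmetric form sqrt(sqrt(S1) S2 sqrt(S1)).
    """
    s1_half = _sqrtm_psd(sigma1)
    covmean = _sqrtm_psd(s1_half @ sigma2 @ s1_half)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

A lower distance between the statistics of generated audio and those of the target version indicates higher version similarity.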
This work complements and expands our ICASSP 2024 paper, with three significant improvements:
We perform extensive listening tests, evaluating both realism and similarity to the target version.
For score conditions, we use predictions of a transcriber rather than alignments, and obtain comparable results. Using predicted transcriptions is significantly more favourable from a practical point of view, since unaligned MIDIs do not always exist.
We provide initial results showing that the model can generate performances from pitch-only note conditions, while Version Conditioning implicitly controls instrumentation.
Synthesized Samples (Alignment-Conditioned Model)
We provide generated samples of diverse instruments and ensembles, using different Version Conditions. Orchestral sound is obtained by simply using the 'Orchestra' instrument in the input MIDI, and does not require many violin channels.
For each sample, we provide the input MIDI (upper), and our generated sample (lower). Samples generated with the same Version Condition are marked with the same color.
See next sections for a demonstration of the Version Conditioning effect, and for a comparison with Hawthorne et al.
All samples on this page were generated with the same T5 model from the paper and vocoded with Soundstream.
MIDI: Brahms Piano Concerto 1 Version Condition: Mitsuko Uchida & Kurt Sanderling playing Beethoven Piano Concertos 1-4
Input MIDI:
Ours:
MIDI: Bach 4th Orchestral Suite Overture Version Condition: Bach 3rd Orchestral Suite (unknown performer)
Input MIDI:
Ours:
MIDI: Bach 2nd Harpsichord Concerto Version Condition: Bach Sonatas for Violin & Harpsichord (unknown performer)
Input MIDI:
Ours:
MIDI: Beethoven 6th Symphony (Pastoral) Version Condition: Ferenc Fricsay & Berlin Radio Orchestra playing Brahms' Haydn Variations
Input MIDI:
Ours:
MIDI: Beethoven Piano Concerto 5 Version Condition: Mitsuko Uchida & Kurt Sanderling playing Beethoven Piano Concertos 1-4
Input MIDI:
Ours:
MIDI: Bach 1st Orchestral Suite Overture Version Condition: Bach 3rd Orchestral Suite (unknown performer)
Input MIDI:
Ours:
MIDI: Bach 1st Harpsichord Concerto Version Condition: Bach Sonatas for Violin & Harpsichord (unknown performer)
Input MIDI:
Ours:
MIDI: Mozart 40th Symphony Part 4 Version Condition: Ferenc Fricsay & Berlin Radio Orchestra playing Brahms' Haydn Variations
Input MIDI:
Ours:
MIDI: Brahms 5th Hungarian Dance Version Condition: Trio Élégiaque playing Beethoven's Piano Trios
Input MIDI:
Ours:
MIDI: Bach 3rd English Suite Prelude Version Condition: Christophe Rousset playing Bach's Goldberg Variations
Input MIDI:
Ours:
MIDI: Bach's Toccata and Fugue in D Minor Version Condition: Kay Johannsen playing Bach's Organ Trio Sonatas
Input MIDI:
Ours:
MIDI: Grieg Peer Gynt Part 1 Version Condition: Richard Bonynge & National Phil. Orchestra playing Tchaikovsky's Swan Lake
Input MIDI:
Ours:
MIDI: Mozart String Quartet Version Condition: Trio Élégiaque playing Beethoven's Piano Trios
Input MIDI:
Ours:
MIDI: Gershwin's Rhapsody in Blue Version Condition: Trio Élégiaque playing Beethoven's Piano Trios
Input MIDI:
Ours:
MIDI: Jobim's Felicidad Version Condition: Fernando Sor's 24 Studies No.19 Op.31
Input MIDI:
Ours:
MIDI: Grieg Peer Gynt Part 3 Version Condition: Czech Symphony Orchestra playing Brahms 3rd Symphony
Input MIDI:
Ours:
Synthesized Samples (Transcription-Conditioned Model)
As in the previous section, we provide generated samples of diverse instruments and ensembles, using different Version Conditions. Orchestral sound is obtained by simply using the 'Orchestra' instrument in the input MIDI, and does not require many violin channels.
Samples in this section are from a model trained with score conditions obtained from transcriptions predicted by an automatic transcriber.
For each sample, we provide the input MIDI (upper), and our generated sample (lower). Samples generated with the same Version Condition are marked with the same color.
See next sections for a demonstration of the Version Conditioning effect, and for a comparison with Hawthorne et al.
MIDI: Mendelssohn Trio for Piano, Violin and Cello Version Condition: Trio Élégiaque playing Beethoven's Piano Trios
Input MIDI:
Ours:
MIDI: Gershwin Rhapsody in Blue Version Condition: Mitsuko Uchida & Kurt Sanderling playing Beethoven Piano Concertos 1-4
Input MIDI:
Ours:
MIDI: Beethoven 5th Symphony Part 1 Version Condition: Czech Symphony Orchestra playing Beethoven's Coriolan Overture
Input MIDI:
Ours:
MIDI: Bach Little Fugue Version Condition: Kay Johannsen playing Bach's Organ Trio Sonatas
Input MIDI:
Ours:
MIDI: Mendelssohn Trio for Piano, Violin and Cello Version Condition: Trio Élégiaque playing Beethoven's Piano Trios
Input MIDI:
Ours:
MIDI: Mozart 40th Symphony Part 1 Version Condition: Czech Symphony Orchestra playing Beethoven's 3rd Symphony (Eroica)
Input MIDI:
Ours:
MIDI: Beethoven Pastoral Symphony Part 4 Version Condition: Czech Symphony Orchestra playing Beethoven's Coriolan Overture
Input MIDI:
Ours:
MIDI: Bach Mass in B Minor Part 11 (held-out) Version Condition: Bach Mass in B Minor (unknown performer)
Input MIDI:
Ours:
MIDI: Bach 2nd Harpsichord Concerto Part 1 Version Condition: Bach Sonatas for Violin & Harpsichord (unknown performer)
Input MIDI:
Ours:
MIDI: Bach 7th Harpsichord Concerto Part 3 Version Condition: Bach Sonatas for Violin & Harpsichord (unknown performer)
Input MIDI:
Ours:
MIDI: Beethoven Pastoral Symphony Part 5 Version Condition: Czech Symphony Orchestra playing Beethoven's Coriolan Overture
Input MIDI:
Ours:
MIDI: Ravel Bolero Version Condition: Richard Bonynge & National Phil. Orchestra playing Tchaikovsky's Swan Lake
Input MIDI:
Ours:
MIDI: Mozart 40th Symphony Part 3 Version Condition: Brahms 3rd Symphony
Input MIDI:
Ours:
MIDI: Bach 2nd Partita Version Condition: Bach WTC II (unknown performer)
Input MIDI:
Ours:
MIDI: Brahms 1st Piano Concerto Part 3 Version Condition: Mitsuko Uchida & Kurt Sanderling playing Beethoven Piano Concertos 1-4
Input MIDI:
Ours:
MIDI: Brahms 1st Piano Concerto Part 1 Version Condition: Mitsuko Uchida & Kurt Sanderling playing Beethoven Piano Concertos 1-4
Input MIDI:
Ours:
Version Conditioning with Pitch-Only Input
In the following, we demonstrate the effect of Version Conditioning when providing pitch-only score input.
The exact same MIDI with pitch-only information (without instrumentation) is synthesized with different Version Conditions.
Control over the instrument in these samples is obtained only through the FiLM-based version condition, and not through the score encoder.
For each sample, we provide a segment from the conditioning version (upper), and our generated sample (lower).
For a fair comparison, all segments from the conditioning versions are vocoded with the Soundstream vocoder, also used for our generated samples.
Samples in this section are from the transcription-conditioned model.
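As a rough illustration of the FiLM-based version conditioning mentioned above (this is not our exact architecture; names, shapes, and the random projections are hypothetical), FiLM projects the version embedding to per-channel scale and shift parameters that modulate intermediate activations of the synthesis network:

```python
import numpy as np

def film(features, version_emb, W_gamma, W_beta):
    """Feature-wise Linear Modulation: gamma(v) * h + beta(v).

    features:    (time, channels) activations of some intermediate layer
    version_emb: (emb_dim,) embedding of the conditioning version
    W_gamma, W_beta: learned (emb_dim, channels) projections;
                     random here, purely for illustration
    """
    gamma = version_emb @ W_gamma   # per-channel scale, shape (channels,)
    beta = version_emb @ W_beta     # per-channel shift, shape (channels,)
    return features * gamma + beta

# Toy usage: modulate random activations with a random version embedding.
rng = np.random.default_rng(0)
T, C, D = 16, 8, 4
h = rng.standard_normal((T, C))
v = rng.standard_normal(D)
out = film(h, v, rng.standard_normal((D, C)), rng.standard_normal((D, C)))
assert out.shape == (T, C)
```

Because the scale and shift depend only on the version embedding and not on the score, instrumentation and timbre can be steered even when the score input contains pitch alone.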
Bach's 7th Harpsichord Concerto
Input MIDI:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Mozart's 40th Symphony
Input MIDI:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Bach's Mass in B Minor
Input MIDI:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Mozart's 15th String Quartet
Input MIDI:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Bach's 4th Orchestral Suite (Bouree)
Input MIDI:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Version Conditioning for Different Instances of The Same Instrument
In the following, we demonstrate the use of Version Conditioning to obtain different instances of the same instrument. This includes, for example, different types of harpsichords, different church organs, and different room acoustics or recording environments.
In the following samples, the exact same MIDI (including instrumentation) is synthesized with different version conditions.
Samples in this section are from the alignment-conditioned model.
Bach's 8th Invention Synthesized on 8 Different Harpsichords
In the last 2 examples, the timbre of the harpsichord is extracted from a mixture with a violin.
Input MIDI:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition (mixture with violin):
Ours:
Version Condition (mixture with violin):
Ours:
Orchestra - Beethoven Symphony 6 (Pastoral)
In this example, we provide the same MIDI of Beethoven's 6th Symphony, synthesized with 4 different orchestras.
Input MIDI:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Church Organ
Bach's Toccata and Fugue synthesized on 2 different church organs:
Input MIDI:
Version Condition:
Ours:
Version Condition:
Ours:
Violin
Paganini's Capriccio on 4 different violins:
Input MIDI:
Version Condition (mixture with piano):
Ours:
Version Condition (mixture with harpsichord):
Ours:
Version Condition (mixture with harpsichord):
Ours:
Version Condition:
Ours:
Guitar
Kansas' Dust In The Wind synthesized on 3 different guitars:
Input MIDI:
Version Condition:
Ours:
Version Condition:
Ours:
Version Condition:
Ours:
Violin and Piano/Harpsichord
Bach's 2nd Harpsichord Concerto with different violins, and the keyboard part played on a piano (left) or a harpsichord (right):
Input MIDI (Piano as Keyboard):
Version Condition (Cello, Violin & Piano):
Ours:
Input MIDI (Harpsichord as Keyboard):
Version Condition (Violin & Harpsichord):
Ours:
Comparison with Hawthorne et al.
Our main comparison is with Hawthorne et al. (2022). Although we use the same T5 backbone, there are two main differences between the models:
(1) Data - we train on real data alone (~58 hours), including orchestral symphonies, while Hawthorne et al. train mainly on synthetic data (~1500 hours).
(2) We incorporate Version Conditioning.
Input MIDI:
Hawthorne et al. (flat timbre):
Ours Version Condition I:
Ours Version Condition II:
Input MIDI:
Hawthorne et al. (flat timbre):
Ours Version Condition I:
Ours Version Condition II:
Input MIDI:
Hawthorne et al. (flat timbre, timbre drift):
Ours Version Condition I:
Ours Version Condition II:
Input MIDI:
Hawthorne et al.:
Ours Version Condition I:
Ours Version Condition II:
Pop, Rock, and Jazz Music with Zero-Shot Transcription-Conditioning
In this section, we provide samples of performances generated by a synthesizer trained on rock and pop music albums. For note conditions, we used predicted transcriptions from an existing transcriber; all training performances used for this model are unknown to the transcriber. The transcriber was trained mainly on rock albums with unaligned MIDIs, following Maman and Bermano, 2022. However, instrumentation in MIDIs of pop and rock music is extremely prone to inaccuracies and inconsistencies, so we rely on the synthesizer's generative power and on Version Conditioning to produce plausible instrumentation. Improving instrument disentanglement in such genres is an important direction for future work, and we believe improved transcription will enable more control over the generation process.
Note the model's ability to generate drums, since the MIDI input contains only pitch and instrument information (no drums).
The model can generate human voice, but it is not conditioned on lyrics; this is important future work.
MIDI: Bee Gees, More Than A Woman Version Condition: Guns N' Roses, Appetite for Destruction
Input MIDI:
Ours:
MIDI: Beatles, Let It Be Version Condition: Guns N' Roses, Appetite for Destruction
Input MIDI:
Ours:
MIDI: Beatles, Martha My Dear Version Condition: Dave Brubeck, Time Out
Input MIDI:
Ours:
MIDI: Eric Clapton, I Shot The Sheriff Version Condition: Dave Brubeck, Time Out
Input MIDI:
Ours:
MIDI: Doors, Hello I Love You Version Condition: Metallica, Master of Puppets
Input MIDI:
Ours:
MIDI: Beatles, Let It Be Version Condition: Queen, A Night at The Opera
Input MIDI:
Ours:
MIDI: Frank Zappa, Dancin' Fool Version Condition: Daft Punk, Random Access Memories
Input MIDI:
Ours:
MIDI: Depeche Mode, People Are People Version Condition: Yes, Close to The Edge
Input MIDI:
Ours:
Version Conditioning Effect in Pop, Rock, and Jazz Music
In this section, we demonstrate the effect of Version Conditioning in genres such as pop, rock, and jazz music. Each Version Condition corresponds to an album.
Britney Spears, One More Time
In this example, Britney Spears' One More Time is generated in two versions: ABBA's Super Trouper album, and Dick Schory's Music For Bang, Baaroom and Harp, which comprises mainly pitched percussion instruments.
Beatles, Let It Be
In this example, The Beatles' Let It Be is generated in the styles of two different rock albums. Notice the significant difference in the sound of the guitar.
Input MIDI:
Unseen Versions using TRILL Embeddings
In this section, we provide samples of performances generated with version conditions provided by audio samples, enabling conditioning on unseen versions. For this, we fine-tuned the model so that the version condition is defined by the mean TRILL embedding over the version.
To isolate the interpolated version condition from the instrument condition, we use pitch-only input.
We provide samples for version interpolation. In each sample, interpolation weights vary linearly across time: the first version's weight decreases linearly from 1 to 0 while the other's increases from 0 to 1, with the two weights summing to 1. We then apply the same process in the opposite direction. Interpolation weights change by 0.2 from each segment to the next.
The result is the version shifting smoothly back and forth between the two versions.
We use two different interpolation techniques:
Interpolating version embeddings ("Version Embedding Space").
Interpolating predictions ("Epsilon Space").
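The back-and-forth interpolation schedule described above can be sketched as follows (a minimal numpy sketch; function names are ours). "Version Embedding Space" mixes the two version embeddings before conditioning, while "Epsilon Space" mixes the two conditional denoiser predictions:

```python
import numpy as np

def back_and_forth_weights(step=0.2):
    # First version's weight: 1 -> 0, then back 0 -> 1,
    # changing by `step` at each segment w.r.t. the previous one.
    down = np.arange(1.0, -1e-9, -step)
    up = down[::-1][1:]  # reverse direction, skipping the repeated endpoint
    return np.clip(np.concatenate([down, up]), 0.0, 1.0)

def interp_version_embedding(e1, e2, w):
    # "Version Embedding Space": condition the model on w*e1 + (1-w)*e2.
    return w * e1 + (1.0 - w) * e2

def interp_epsilon(eps1, eps2, w):
    # "Epsilon Space": run the denoiser once per version condition and
    # mix the two noise predictions instead of the embeddings.
    return w * eps1 + (1.0 - w) * eps2
```

With step 0.2 this yields the segment weights 1.0, 0.8, ..., 0.0, ..., 0.8, 1.0, so the generated audio shifts smoothly from the first version to the second and back.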
Version 1 (organ):
Version 2 (piano):
Input MIDI:
Ours, Interpolation in Version Embedding Space
(back and forth between organ & piano):
Ours, Interpolation in Epsilon Space
(back and forth between organ & piano):
Version 1 (guitar):
Version 2 (wind quintet):
Input MIDI:
Ours, Interpolation in Version Embedding Space
(back and forth between guitar & wind):
Ours, Interpolation in Epsilon Space
(back and forth between guitar & wind):
Version 1 (guitar):
Version 2 (orchestra):
Input MIDI:
Ours, Interpolation in Version Embedding Space
(back and forth between guitar & orchestra):
Ours, Interpolation in Epsilon Space
(back and forth between guitar & orchestra):
Version 1 (harpsichord):
Version 2 (violin):
Input MIDI:
Ours, Interpolation in Version Embedding Space
(back and forth between harpsichord & violin):
Ours, Interpolation in Epsilon Space
(back and forth between harpsichord & violin):
Soundstream Vocoder Quality
The Soundstream vocoder's quality is an upper bound on the quality of our generated performances. Here we provide examples demonstrating the vocoder's quality: for each example, we show the original segment and its vocoded version (i.e., reconstructed from the mel spectrogram).