Generating multi-instrument music from symbolic music representations is a central task in Music Information Retrieval (MIR). Current methods for this task still struggle with performance quality and control. In this work, we propose a multi-instrument music synthesis framework that significantly improves both aspects. Building on state-of-the-art diffusion-based music generative models, we first demonstrate that contemporary off-the-shelf transcription is mature enough to guide music generation of unprecedented quality and realism. We then introduce performance conditioning, a simple tool enabling precise control, instructing the generative model to synthesize music with the style and timbre of specific instruments taken from specific performances. Our prototype is trained on uncurated performances with diverse instrumentation, and achieves state-of-the-art FAD realism scores while allowing novel timbre and style control.
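The realism metric mentioned above, Fréchet Audio Distance (FAD), measures the Fréchet distance between Gaussians fitted to embeddings of real and generated audio. Below is a minimal numpy sketch of that computation; the embedding model (e.g., VGGish) is assumed to run upstream, and the function names are illustrative, not taken from the paper's codebase.

```python
import numpy as np

def gaussian_stats(embeddings):
    """Mean and covariance of an (n_samples, dim) embedding matrix."""
    return embeddings.mean(axis=0), np.cov(embeddings, rowvar=False)

def frechet_audio_distance(mu1, sigma1, mu2, sigma2):
    """FAD = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    # Tr((S1 S2)^{1/2}) equals Tr((S2^{1/2} S1 S2^{1/2})^{1/2}); the inner
    # matrix is symmetric PSD, so a numpy eigendecomposition suffices.
    vals2, vecs2 = np.linalg.eigh(sigma2)
    sqrt_s2 = vecs2 @ np.diag(np.sqrt(np.clip(vals2, 0, None))) @ vecs2.T
    inner_vals = np.linalg.eigvalsh(sqrt_s2 @ sigma1 @ sqrt_s2)
    tr_sqrt = np.sum(np.sqrt(np.clip(inner_vals, 0, None)))
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)
```

Identical real and generated statistics give a FAD of zero; the score grows as the generated-audio embedding distribution drifts from the real one.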
Synthesized Samples
We provide generated samples of diverse instruments and ensembles, using different performance conditions. Orchestral sound is obtained simply by using the 'Orchestra' instrument in the input MIDI, without requiring separate channels for violins, other strings, etc.
For each sample, we provide the input MIDI (upper), and our generated sample (lower). Samples generated with the same performance condition are marked with the same color.
See the next sections for a demonstration of the performance conditioning effect and for a comparison with Hawthorne et al.
All samples on this page were generated with the same T5 model from the paper and vocoded with Soundstream.
MIDI: Brahms Piano Concerto 1
Performance Condition: Mitsuko Uchida & Kurt Sanderling playing Beethoven Piano Concertos 1-4
Input MIDI:
Ours:
MIDI: Bach 4th Orchestral Suite Overture
Performance Condition: Bach 3rd Orchestral Suite (unknown performer)
Input MIDI:
Ours:
MIDI: Grieg Peer Gynt Part 3
Performance Condition: Czech Symphony Orchestra playing Brahms 3rd Symphony
Input MIDI:
Ours:
Performance Conditioning Effect
In the following, we demonstrate the effect of performance conditioning. The exact same MIDI (including instrumentation) is synthesized with different performance conditions.
For each sample, we provide a segment from the conditioning performance (upper), and our generated sample (lower).
For a fair comparison, all segments from the conditioning performances are vocoded with the Soundstream vocoder, also used for our generated samples.
Bach's 8th Invention Synthesized on 8 Different Harpsichords
In the last 2 examples, the timbre of the harpsichord is extracted from a mixture with a violin.
Input MIDI:
Performance Condition:
Ours:
Performance Condition:
Ours:
Performance Condition:
Ours:
Performance Condition:
Ours:
Performance Condition:
Ours:
Performance Condition:
Ours:
Performance Condition (mixture with violin):
Ours:
Performance Condition (mixture with violin):
Ours:
Orchestra - Beethoven Symphony 6 (Pastoral)
In this example, the same MIDI of Beethoven's 6th Symphony is synthesized with 4 different orchestras.
Input MIDI:
Performance Condition:
Ours:
Performance Condition:
Ours:
Performance Condition:
Ours:
Performance Condition:
Ours:
Church Organ
Bach's Toccata and Fugue synthesized on 2 different church organs:
Input MIDI:
Performance Condition:
Ours:
Performance Condition:
Ours:
Violin
Paganini's Capriccio on 4 different violins:
Input MIDI:
Performance Condition (mixture with piano):
Ours:
Performance Condition (mixture with harpsichord):
Ours:
Performance Condition (mixture with harpsichord):
Ours:
Performance Condition:
Ours:
Guitar
Kansas' Dust In The Wind synthesized on 3 different guitars:
Input MIDI:
Performance Condition:
Ours:
Performance Condition:
Ours:
Performance Condition:
Ours:
Violin and Piano/Harpsichord
Bach's 2nd Concerto for Orchestra and Harpsichord with different violins, and the keyboard part played on a piano (left) or a harpsichord (right):
Comparison with Hawthorne et al.
Our main comparison is with Hawthorne et al. (2022). Although we use the same T5 backbone, there are two main differences between the models:
(1) Data - We train on real data alone (~58 hours), including orchestral symphonies, while Hawthorne et al. train mainly on synthetic data (~1500 hours).
(2) Conditioning - We incorporate performance conditioning.
Input MIDI:
Hawthorne et al. (flat timbre):
Ours Performance Condition I:
Ours Performance Condition II:
Input MIDI:
Hawthorne et al. (flat timbre):
Ours Performance Condition I:
Ours Performance Condition II:
Input MIDI:
Hawthorne et al. (flat timbre, timbre drift):
Ours Performance Condition I:
Ours Performance Condition II:
Input MIDI:
Hawthorne et al.:
Ours Performance Condition I:
Ours Performance Condition II:
Vocoder Quality
The Soundstream vocoder's quality is an upper bound on the quality of our generated performances.
The following examples demonstrate the vocoder's quality. For each example, we show the original segment and its vocoded version (i.e., reconstructed from its mel spectrogram).
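As a rough illustration of what the vocoder reconstructs from, here is a minimal numpy sketch of a log-mel spectrogram. The sample rate, FFT size, hop size, and mel count below are illustrative assumptions, not the settings used in the paper or by Soundstream.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):   # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):  # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(audio, sr=16000, n_fft=1024, hop=256, n_mels=128):
    """Frame, window, FFT magnitude, mel projection, log compression."""
    window = np.hanning(n_fft)
    n_frames = 1 + max(len(audio) - n_fft, 0) // hop
    spec = np.empty((n_frames, n_fft // 2 + 1))
    for t in range(n_frames):
        frame = audio[t * hop : t * hop + n_fft] * window
        spec[t] = np.abs(np.fft.rfft(frame))
    fb = mel_filterbank(sr, n_fft, n_mels)
    return np.log(spec @ fb.T + 1e-6)  # epsilon avoids log(0)
```

A neural vocoder such as Soundstream then maps a representation like this back to a waveform; the round-trip fidelity of that mapping caps the audio quality of any spectrogram-domain generator.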