Performance Conditioning for Diffusion-Based Multi-Instrument Music Synthesis

1 Tel Aviv University
2 International Audio Laboratories Erlangen, Germany

Abstract

Generating multi-instrument music from symbolic music representations is an important task in Music Information Retrieval (MIR). Current approaches still struggle with performance quality and control. In this work, we propose a multi-instrument music synthesis framework that significantly improves both aspects. Building on state-of-the-art diffusion-based generative music models, we first demonstrate that contemporary off-the-shelf transcription is mature enough to guide music generation of unprecedented quality and realism. We then introduce performance-based conditioning, a simple tool enabling precise control: it instructs the generative model to synthesize music with the style and timbre of specific instruments, taken from specific performances. Our prototype is evaluated using uncurated performances with diverse instrumentation as training data, and achieves state-of-the-art Fréchet Audio Distance (FAD) realism scores while allowing novel timbre and style control.
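For readers unfamiliar with the FAD score mentioned in the abstract: the Fréchet Audio Distance is commonly defined as the Fréchet distance between two Gaussians fitted to audio embeddings, one for a reference set of real recordings and one for the generated set. The formula below is the standard definition, given here only as a reminder; the embedding model and reference set used for the reported scores are not stated on this page, and the symbol names are our own choice.

```latex
\mathrm{FAD} \;=\; \lVert \mu_r - \mu_g \rVert_2^2
  \;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Here \mu_r, \Sigma_r are the mean and covariance of the embeddings of the reference audio, and \mu_g, \Sigma_g are those of the generated audio. Lower values indicate that the generated audio is statistically closer to the reference audio.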


Synthesized Samples

We provide generated samples of diverse instruments and ensembles, using different performance conditions. Orchestral sound is obtained by simply using the 'Orchestra' instrument in the input MIDI, and does not require separate channels for the individual instruments (e.g., many violin channels).
For each sample, we provide the input MIDI (upper), and our generated sample (lower). Samples generated with the same performance condition are marked with the same color.
See the next sections for a demonstration of the performance conditioning effect and for a comparison with Hawthorne et al.
All generated samples on this page were produced with the same T5 model from the paper, and vocoded with Soundstream.

MIDI: Brahms Piano Concerto 1
Performance Condition: Mitsuko Uchida & Kurt Sanderling playing Beethoven Piano Concertos 1-4
Input MIDI:

Ours:
MIDI: Bach 4th Orchestral Suite Overture
Performance Condition: Bach 3rd Orchestral Suite (unknown performer)
Input MIDI:

Ours:
MIDI: Bach 2nd Harpsichord Concerto
Performance Condition: Bach Sonatas for Violin & Harpsichord (unknown performer)
Input MIDI:

Ours:
MIDI: Grieg Peer Gynt Part 1
Performance Condition: Richard Bonynge & National Phil. Orchestra playing Tchaikovsky's Swan Lake
Input MIDI:

Ours:

MIDI: Beethoven Piano Concerto 5
Performance Condition: Mitsuko Uchida & Kurt Sanderling playing Beethoven Piano Concertos 1-4
Input MIDI:

Ours:
MIDI: Bach 1st Orchestral Suite Overture
Performance Condition: Bach 3rd Orchestral Suite (unknown performer)
Input MIDI:

Ours:
MIDI: Bach 1st Harpsichord Concerto
Performance Condition: Bach Sonatas for Violin & Harpsichord (unknown performer)
Input MIDI:

Ours:
MIDI: Beethoven 6th Symphony (Pastoral)
Performance Condition: Ferenc Fricsay & Berlin Radio Orchestra playing Brahms' Haydn Variations
Input MIDI:

Ours:

MIDI: Brahms 5th Hungarian Dance
Performance Condition: Trio Élégiaque playing Beethoven's Piano Trios
Input MIDI:

Ours:
MIDI: Bach 3rd English Suite Prelude
Performance Condition: Christophe Rousset playing Bach's Goldberg Variations
Input MIDI:

Ours:
MIDI: Bach's Toccata and Fugue in D Minor
Performance Condition: Kay Johannsen playing Bach's Organ Trio Sonatas
Input MIDI:

Ours:
MIDI: Mozart 40th Symphony Part 4
Performance Condition: Ferenc Fricsay & Berlin Radio Orchestra playing Brahms' Haydn Variations
Input MIDI:

Ours:

Performance Conditioning Effect

In the following, we demonstrate the effect of performance conditioning. The exact same MIDI (including instrumentation) is synthesized with different performance conditions.
For each sample, we provide a segment from the conditioning performance (upper), and our generated sample (lower).
For a fair comparison, all segments from the conditioning performances are vocoded with the Soundstream vocoder, also used for our generated samples.
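To make the idea concrete, here is a minimal sketch of how a performance condition might steer a model while the MIDI input stays fixed. This page does not describe the conditioning mechanism, so the sketch assumes feature-wise linear modulation (FiLM), one common way to inject such a condition. All names (`make_film_table`, `film`, `condition_on_performance`) and the random parameters are hypothetical; in a trained model the per-performance scale and shift would be learned.

```python
import random


def make_film_table(num_performances, num_channels, seed=0):
    """Hypothetical lookup table: one (scale, shift) pair per performance ID.

    Values are random here purely for illustration; a real model would learn them.
    """
    rng = random.Random(seed)
    table = []
    for _ in range(num_performances):
        gamma = [1.0 + rng.uniform(-0.5, 0.5) for _ in range(num_channels)]
        beta = [rng.uniform(-0.5, 0.5) for _ in range(num_channels)]
        table.append((gamma, beta))
    return table


def film(features, gamma, beta):
    """Apply feature-wise linear modulation to a [time][channel] feature map."""
    return [[g * x + b for x, g, b in zip(frame, gamma, beta)]
            for frame in features]


def condition_on_performance(features, performance_id, table):
    """Modulate note features with the parameters of one performance."""
    gamma, beta = table[performance_id]
    return film(features, gamma, beta)


# Toy stand-in for hidden features computed from one MIDI segment
# (2 time frames, 3 channels).
note_features = [[0.2, -1.0, 0.5], [1.0, 0.0, -0.3]]
table = make_film_table(num_performances=4, num_channels=3)

# Same notes, two different performance conditions.
out_a = condition_on_performance(note_features, 0, table)
out_b = condition_on_performance(note_features, 1, table)
```

The note features are identical in both calls; only the performance ID changes, which mirrors the setup of the samples in this section, where one MIDI file is rendered under several performance conditions.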


Bach's 8th Invention Synthesized on 8 Different Harpsichords
In the last 2 examples, the timbre of the harpsichord is extracted from a mixture with a violin.

Input MIDI:

Performance Condition:
Ours:
Performance Condition:
Ours:
Performance Condition:
Ours:

Performance Condition:

Ours:

Performance Condition:

Ours:
Performance Condition:

Ours:
Performance Condition (mixture with violin):
Ours:
Performance Condition (mixture with violin):
Ours:

Orchestra - Beethoven Symphony 6 (Pastoral)
In this example, we provide the same MIDI of Beethoven's 6th Symphony, synthesized with 4 different orchestras.

Input MIDI:

Performance Condition:
Ours:
Performance Condition:
Ours:
Performance Condition:
Ours:
Performance Condition:
Ours:

Church Organ
Bach's Toccata and Fugue synthesized on 2 different church organs:

Input MIDI:

Performance Condition:
Ours:
Performance Condition:
Ours:

Violin
Paganini's Capriccio on 4 different violins:

Input MIDI:

Performance Condition (mixture with piano):
Ours:
Performance Condition (mixture with harpsichord):
Ours:
Performance Condition (mixture with harpsichord):
Ours:
Performance Condition:
Ours:

Guitar
Kansas' Dust In The Wind synthesized on 3 different guitars:

Input MIDI:
Performance Condition:
Ours:
Performance Condition:
Ours:
Performance Condition:
Ours:

Violin and Piano/Harpsichord
Bach's 2nd Concerto for Orchestra and Harpsichord with different violins, and the keyboard part played on a piano (left) or a harpsichord (right):

Input MIDI (Piano as Keyboard):
Performance Condition (Cello, Violin & Piano):
Ours:
Input MIDI (Harpsichord as Keyboard):
Performance Condition (Violin & Harpsichord):
Ours:

Comparison

Our main comparison is with Hawthorne et al. (2022). Although we use the same T5 backbone, there are 2 main differences between the models:
(1) Data - We train on real data alone (~58 hours), including orchestral symphonies, while Hawthorne et al. train mainly on synthetic data (~1500 hours).
(2) Conditioning - We incorporate performance conditioning.



Input MIDI:

Hawthorne et al. (flat timbre):

Ours Performance Condition I:

Ours Performance Condition II:

Input MIDI:

Hawthorne et al. (flat timbre):

Ours Performance Condition I:

Ours Performance Condition II:

Input MIDI:

Hawthorne et al. (flat timbre, timbre drift):

Ours Performance Condition I:

Ours Performance Condition II:

Input MIDI:

Hawthorne et al.:

Ours Performance Condition I:

Ours Performance Condition II:

Vocoder Quality

The Soundstream vocoder's quality is an upper bound on the quality of our generated performances.
The examples below demonstrate the vocoder's quality.
For each example, we show the original segment and its vocoded version (i.e., reconstructed from its mel spectrogram).


Original:

Soundstream vocoder:
Original:

Soundstream vocoder:
Original:

Soundstream vocoder:
Original:

Soundstream vocoder: