Performance Conditioning for Diffusion-Based Multi-Instrument Music Synthesis

We provide here both listening tests described in the paper: one to assess realism and one to assess version similarity.

Listening Test 1 - Realism

We provide the first listening test for realism, which was conducted according to the MUSHRA protocol, using the webMUSHRA implementation by Schoeffler et al. (see paper). We synthesize the same score excerpt using various methods and ask the listener to rate the realism of each result. We also provide a reference sample from a real musical performance of the same score excerpt.
The following were compared:

Reference sample from real performance of the same score (Real)
Vocoded version of the Reference (Vocoded)
Concatenative synthesis using the Windows GM soundfont (GM)
Concatenative synthesis using the Fluid R3 GM soundfont (Fluid)
The T5 model trained by Hawthorne et al., mainly on synthetic data (SLAKH); note that this model does not implement version conditioning (Hawth.)
Our T5 model without version conditioning (Uncond.)
Our T5 model with version conditioning (Cond.)

Question 1

Real

Vocoded

Fluid

Hawth.

Uncond.

Cond.

Question 2

Real

Vocoded

Fluid

Hawth.

Uncond.

Cond.

Question 3

Real

Vocoded

Fluid

Hawth.

Uncond.

Cond.

Question 4

Real

Vocoded

Fluid

Hawth.

Uncond.

Cond.

Question 5

Real

Vocoded

Fluid

Hawth.

Uncond.

Cond.

Question 6

Real

Vocoded

Fluid

Hawth.

Uncond.

Cond.

Question 7

Real

Vocoded

Fluid

Hawth.

Uncond.

Cond.

Question 8

Real

Vocoded

Fluid

Hawth.

Uncond.

Cond.

Question 9

Real

Vocoded

Fluid

Hawth.

Uncond.

Cond.

Question 10

Real

Vocoded

Fluid

Hawth.

Uncond.

Cond.

Listening Test 2 - Similarity

We provide the second listening test for version similarity. The goal of this test is to measure the effectiveness of version conditioning in obtaining perceptual characteristics of the reference version, including acoustics, timbre, and style.
We randomly choose a reference version and provide the listener with an audio excerpt from the corresponding recording. We then use our model to synthesize each score excerpt under three different version conditions with the same instrumentation: the reference version and two other randomly sampled versions.
We ask the listener to rate the similarity of each synthesized score excerpt to the reference audio excerpt.
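
The selection of version conditions described above can be illustrated with a minimal Python sketch. It is not the code used for the test. The function name `pick_version_conditions`, the `versions_by_instrumentation` mapping, and the version labels are hypothetical, and shuffling the presentation order is an assumption that the text does not state.

```python
import random


def pick_version_conditions(versions_by_instrumentation, instrumentation,
                            n_conditions=3, seed=None):
    """Pick a reference version plus other versions of the same instrumentation.

    Hypothetical helper: `versions_by_instrumentation` maps an instrumentation
    label to the list of version identifiers available for it.
    """
    rng = random.Random(seed)
    candidates = list(versions_by_instrumentation[instrumentation])
    if len(candidates) < n_conditions:
        raise ValueError("not enough versions with this instrumentation")

    # Randomly choose the reference version.
    reference = rng.choice(candidates)

    # Sample the remaining conditions, without replacement, from the other versions.
    others = rng.sample([v for v in candidates if v != reference],
                        n_conditions - 1)

    # Assumption: shuffle so the position does not reveal the reference condition.
    conditions = [reference] + others
    rng.shuffle(conditions)
    return reference, conditions


if __name__ == "__main__":
    # Made-up catalog used only for illustration.
    catalog = {"string quartet": ["version_a", "version_b", "version_c", "version_d"]}
    reference, conditions = pick_version_conditions(catalog, "string quartet", seed=0)
    print("reference:", reference)
    print("conditions:", conditions)
```

Each returned condition would then be used to synthesize the same score excerpt, and the listener compares every result to the audio excerpt from the reference recording.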

Question 1

Reference