We provide here both of the listening tests described in the paper: one to assess realism and one to assess version similarity.
Listening Test 1 - Realism
We provide the first listening test for realism, which was conducted according to the MUSHRA protocol, using the webMUSHRA implementation by Schoeffler et al. (see paper).
We synthesize the same score excerpt using various methods and ask the listener to rate the realism of each result. We also provide a reference sample from a real musical performance of the same score excerpt.
The following were compared:
Reference sample from real performance of the same score (Real)
Vocoded version of the Reference (Vocoded)
Concatenative synthesis using the Windows GM soundfont (GM)
Concatenative synthesis using the Fluid R3 GM soundfont (Fluid)
The T5 model trained by Hawthorne et al., mainly on synthetic data (SLAKH); note that this model does not implement version conditioning (Hawth.)
Our T5 model without version conditioning (Uncond.)
Our T5 model with version conditioning (Cond.)
Question 1
Real
Vocoded
GM
Fluid
Hawth.
Uncond.
Cond.
Question 2
Real
Vocoded
GM
Fluid
Hawth.
Uncond.
Cond.
Question 3
Real
Vocoded
GM
Fluid
Hawth.
Uncond.
Cond.
Question 4
Real
Vocoded
GM
Fluid
Hawth.
Uncond.
Cond.
Question 5
Real
Vocoded
GM
Fluid
Hawth.
Uncond.
Cond.
Question 6
Real
Vocoded
GM
Fluid
Hawth.
Uncond.
Cond.
Question 7
Real
Vocoded
GM
Fluid
Hawth.
Uncond.
Cond.
Question 8
Real
Vocoded
GM
Fluid
Hawth.
Uncond.
Cond.
Question 9
Real
Vocoded
GM
Fluid
Hawth.
Uncond.
Cond.
Question 10
Real
Vocoded
GM
Fluid
Hawth.
Uncond.
Cond.
Listening Test 2 - Similarity
We provide the second listening test for version similarity.
The goal of this test is to measure the effectiveness of version conditioning in obtaining perceptual characteristics of the reference version, including acoustics, timbre, and style.
We randomly choose a reference version and provide the listener with an audio excerpt from the corresponding recording. We then use our model to synthesize each score excerpt under three different version conditions that share the same instrumentation: the reference version and two randomly sampled versions.
We ask the listener to rate the similarity of each synthesized score excerpt to the reference audio excerpt.
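The trial construction described above can be sketched roughly as follows. This is only an illustration, not the code used for the test: the function name `build_similarity_trial`, the mapping `versions_by_instrumentation`, and the version labels are hypothetical. The sketch also assumes that the two randomly sampled versions differ from the reference and from each other, and that the three conditions are presented in shuffled order; the text does not state either detail.

```python
import random


def build_similarity_trial(versions_by_instrumentation, rng):
    """Pick a reference version and three version conditions for one trial.

    versions_by_instrumentation: dict mapping an instrumentation label to a
    list of version identifiers (hypothetical structure, for illustration).
    """
    # Only instrumentations with at least three versions can supply one
    # reference plus two distinct alternatives (assumption: no repeats).
    eligible = sorted(
        name for name, versions in versions_by_instrumentation.items()
        if len(versions) >= 3
    )
    if not eligible:
        raise ValueError("need an instrumentation with at least 3 versions")

    instrumentation = rng.choice(eligible)
    versions = versions_by_instrumentation[instrumentation]

    # Randomly choose the reference version.
    reference = rng.choice(versions)

    # Sample two other versions of the same instrumentation.
    others = [v for v in versions if v != reference]
    alternatives = rng.sample(others, 2)

    # Shuffle so the position does not reveal the reference (assumption).
    conditions = [reference] + alternatives
    rng.shuffle(conditions)

    return {
        "instrumentation": instrumentation,
        "reference": reference,
        "conditions": conditions,
    }
```

For example, calling `build_similarity_trial({"piano": ["v1", "v2", "v3", "v4"]}, random.Random(0))` returns one reference version and three conditions, exactly one of which matches the reference. Passing a seeded `random.Random` instance keeps the trial selection reproducible.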