Multi-instrument Automatic Music Transcription (AMT), or the decoding of a musical recording into semantic musical content, is one of the holy grails of Music Information Retrieval. Current AMT approaches are restricted to piano and (some) guitar recordings, due to the difficulty of data collection. To overcome data-collection barriers, previous AMT approaches attempt to employ musical scores in the form of a digitized version of the same song or piece. The scores are typically aligned using audio features and strenuous human intervention to generate training labels.
We introduce NoteEM, a method for simultaneously training a transcriber and aligning the scores to their corresponding performances, in a fully automated process. Using this unaligned supervision scheme, complemented by pseudo-labels and pitch-shift augmentation, our method enables training on in-the-wild recordings with unprecedented accuracy and instrumental variety. Using only synthetic data and unaligned supervision, we report SOTA note-level accuracy on the MAPS dataset, and large favorable margins on cross-dataset evaluations. We also demonstrate robustness and ease of use; we report comparable results when training on a small, easily obtainable, self-collected dataset, and we propose alternative labels for the MusicNet dataset, which we show to be more accurate.
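At a high level, the training loop alternates between transcribing each recording with the current model, warping the unaligned score onto the prediction (e.g., with DTW) to obtain frame-level pseudo-labels, and retraining on those labels with pitch-shift augmentation applied consistently to the audio and the labels. The sketch below illustrates this idea; it is not the released implementation, and `transcribe`, `align_with_dtw`, and `train_on` are hypothetical placeholders.

```python
# Minimal sketch of an EM-style unaligned-supervision loop (illustrative only).
# `transcribe`, `align_with_dtw`, and `train_on` are hypothetical placeholders.
import numpy as np
import librosa

def pitch_shift_pair(audio, label_roll, sr, semitones):
    """Pitch-shift augmentation: shift the audio and its label roll together."""
    audio_shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=semitones)
    label_shifted = np.roll(label_roll, semitones, axis=0)  # axis 0 = pitch bins
    return audio_shifted, label_shifted

def em_training(model, recordings, unaligned_scores, sr=16000, n_iters=3):
    for _ in range(n_iters):
        # E-step: transcribe each recording and warp its unaligned score onto
        # the prediction to obtain frame-level pseudo-labels.
        pseudo_labels = [
            align_with_dtw(score, transcribe(model, audio))
            for audio, score in zip(recordings, unaligned_scores)
        ]
        # M-step: retrain the transcriber on the aligned labels, with a random
        # pitch shift applied consistently to audio and labels.
        for audio, label in zip(recordings, pseudo_labels):
            shift = np.random.randint(-5, 6)
            train_on(model, *pitch_shift_pair(audio, label, sr, shift))
    return model
```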
We provide here example transcriptions of famous pieces and songs produced by our system, together with quantitative results on various benchmarks. We also provide here our improved labels for the MusicNet dataset (the original dataset can be found here). The labels were generated automatically by our method. We refer to MusicNet recordings with our labels as MusicNetEM. We provide a baseline for training from scratch on MusicNetEM, including cross-dataset evaluation. For more information, take a look at our ICML 2022 paper on arXiv.
We show here qualitative comparisons with two models: MT3 (Gardner et al., 2021) and Omnizart (Wu et al., 2020):
We show initial results for pop music, including human singing, obtained by fine-tuning on a small pop dataset:
We show here quantitative results on MAESTRO, MAPS, GuitarSet, and MusicNetEM
(MusicNet recordings with our generated labels).
We do not use MAESTRO, MAPS, or GuitarSet for training. The system is initially
trained on synthetic data ("Synth", fourth row from the bottom), and then further
trained on real data with labels generated by our method. Therefore our method
belongs to the Self-/Weakly-Supervised/Zero-Shot category.
We use two settings (two bottom rows): training on MusicNet recordings with
our generated labels (MusicNetEM), and training on self-collected data with our
generated labels (Self-Collected). For comparison, we also further train the
initial model on MusicNet with the original labels (third row from the bottom).
The results clearly show that our MusicNetEM labels are significantly more
accurate than the original labels, especially in onset timing.
Model | MAESTRO Note | MAESTRO Frame | MAPS Note | MAPS Frame | GuitarSet Note | GuitarSet Frame | MusicNetEM Note | MusicNetEM Frame |
---|---|---|---|---|---|---|---|---|
Supervised | | | | | | | | |
Hawthorne et al., 2019 | 95.3 | 90.2 | 86.4 | 84.9 | - | - | - | - |
Gardner et al., 2021 | 96.0 | 88.0 | - | - | 90.0 | 89.0 | - | - |
Weakly/self-supervised/ZS | | | | | | | | |
Gardner et al., 2021 ZS | 28.0 | 60.0 | - | - | 32.0 | 58.0 | - | - |
Cheuk et al., 2021 | - | - | 75.2 | 79.5 | - | - | - | - |
Synth | 83.8 | 74.7 | 79.1 | 76.6 | 68.4 | 72.9 | 72.0 | 59.8 |
MusicNet | 57.5 | 57.9 | 53.4 | 74.3 | 10.0 | 57.2 | 41.5 | 66.7 |
MusicNetEM (ours) | 89.7 | 76.0 | 87.3 | 79.6 | 82.9 | 81.6 | 88.8 | 82.8 |
Self-collected (ours) | 89.6 | 76.8 | 86.6 | 80.9 | 82.2 | 79.3 | - | - |
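Note-level scores of this kind are typically computed with mir_eval's transcription metrics (onset within 50 ms of the reference, matching pitch, offsets ignored), and frame-level scores over binarized piano rolls. A minimal sketch of such a computation, assuming reference and estimated notes are given as (onset, offset) interval arrays and pitch arrays in Hz:

```python
# Sketch of note- and frame-level F1 computation with mir_eval / numpy.
import numpy as np
import mir_eval

def note_f1(ref_intervals, ref_pitches, est_intervals, est_pitches):
    # Note-level metric: onset within 50 ms, pitch within a quarter tone,
    # offsets ignored (offset_ratio=None).
    _, _, f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
        ref_intervals, ref_pitches, est_intervals, est_pitches,
        onset_tolerance=0.05, offset_ratio=None)
    return f1

def frame_f1(ref_roll, est_roll):
    # Frame-level F1 over binarized piano rolls of equal shape (pitches x frames).
    tp = np.logical_and(ref_roll, est_roll).sum()
    precision = tp / max(est_roll.sum(), 1)
    recall = tp / max(ref_roll.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```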
We provide here our improved labels for the MusicNet dataset (the original dataset can be found here). The labels are in the form of MIDI files aligned with the audio, and include instrument information. Onset timing accuracy of the labels is 32 ms, which is sufficient to train a transcriber. Onset timings in the original MusicNet labels are not accurate enough for this.
Our labels were generated automatically by an EM process similar to the one described in our paper Unaligned Supervision for Automatic Music Transcription in The Wild. We improved the alignment algorithm, and, to obtain more accurate labels, we divided the dataset into three groups based on ensemble type: piano solo, strings, and wind. We performed the EM process on each group separately.
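The labels can be read with any standard MIDI library. Below is a small sketch using pretty_midi that extracts (onset, offset, pitch, instrument) tuples from one label file; the file path is a placeholder, not an actual file name from the release.

```python
# Sketch: reading a MusicNetEM label file with pretty_midi.
# Labels are plain MIDI files whose tracks carry the General MIDI program
# (instrument) of each note. The path below is a placeholder.
import pretty_midi

def load_label(path="musicnet_em/example.mid"):
    midi = pretty_midi.PrettyMIDI(path)
    notes = []
    for inst in midi.instruments:
        name = pretty_midi.program_to_instrument_name(inst.program)
        for note in inst.notes:
            notes.append((note.start, note.end, note.pitch, name))
    return sorted(notes)  # sorted by onset time
```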
The table below shows results of a model trained from scratch on MusicNetEM, including cross-dataset evaluation:

Dataset | Note F1 | Note-with-inst. F1 | Frame F1 | Note-with-offset F1 |
---|---|---|---|---|
MAPS | 82.0 | 82.0 | 69.1 | 37.7 |
MAESTRO | 85.0 | 85.0 | 65.2 | 31.9 |
GuitarSet | 72.8 | - | 68.4 | 30.7 |
MusicNetEM | 91.4 | 88.1 | 82.5 | 71.4 |
MusicNetEM wind | 88.5 | 79.9 | 83.1 | 65.0 |
MusicNetEM strings | 89.1 | 85.5 | 82.6 | 77.7 |
MusicNetEM strings* | 85.9 | 81.1 | 79.0 | 75.1 |
Test instrument | Note-with-inst. F1 |
---|---|
Violin | 87.3 |
Viola | 61.1 |
Cello | 79.9 |
Bassoon | 78.0 |
Clarinet | 86.8 |
Horn | 75.0 |