Unaligned Supervision for Automatic Music Transcription in The Wild

Ben Maman1, Amit H. Bermano1
1Tel Aviv University

Abstract

Multi-instrument Automatic Music Transcription (AMT), or the decoding of a musical recording into semantic musical content, is one of the holy grails of Music Information Retrieval. Current AMT approaches are restricted to piano and (some) guitar recordings, due to difficult data collection. In order to overcome data collection barriers, previous AMT approaches attempt to employ musical scores in the form of a digitized version of the same song or piece. The scores are typically aligned using audio features and strenuous human intervention to generate training labels.

We introduce NoteEM, a method for simultaneously training a transcriber and aligning the scores to their corresponding performances, in a fully automated process. Using this unaligned supervision scheme, complemented by pseudo-labels and pitch-shift augmentation, our method enables training on in-the-wild recordings with unprecedented accuracy and instrumental variety. Using only synthetic data and unaligned supervision, we report SOTA note-level accuracy on the MAPS dataset, and large favorable margins on cross-dataset evaluations. We also demonstrate robustness and ease of use; we report comparable results when training on a small, easily obtainable, self-collected dataset, and we propose an alternative labeling for the MusicNet dataset, which we show to be more accurate.

We provide here example transcriptions of famous pieces and songs produced by our system, together with quantitative results on various benchmarks. We also provide here our improved labels for the MusicNet dataset (the original dataset can be found here). The labels were generated automatically by our method. We refer to MusicNet recordings with our labels as MusicNetEM. We provide a baseline for training from scratch on MusicNetEM, including cross-dataset evaluation. For more information, see our ICML 2022 paper on arXiv.

Bach Concerto original

Bach Concerto transcription

Source: https://www.youtube.com/watch?v=R66fz9yxzAk&ab_channel=SoliDeoGloria8550

Carmen original

Carmen transcription

Source: https://www.youtube.com/watch?v=jL-Csf1pNCI&ab_channel=FranceMusique

Eine Kleine Nachtmusik original

Eine Kleine Nachtmusik transcr.

Source: https://www.youtube.com/watch?v=oy2zDJPIgwc&ab_channel=AllClassicalMusic



Mozart Symphony 40 original

Mozart Symphony 40 transcription

Source: https://www.youtube.com/watch?v=wqkXqpQMk2k

Beethoven Wind Sextet original

Beethoven Wind Sextet transcription

Source: MusicNet 2416

Bach Invention original

Bach Invention transcription

Source: https://www.youtube.com/watch?v=whbFffxr2q4&ab_channel=NetherlandsBachSociety



Indiana Jones original

Indiana Jones transcription

Source: https://www.youtube.com/watch?v=-bTpp8PQSog&ab_channel=Vyrium

Beethoven Concerto original

Beethoven Concerto transcription

Source: https://www.youtube.com/watch?v=TahrEIVu4nQ&ab_channel=pianoconc2

Beethoven String Quartet original

Beethoven String Quartet transcr.

Source: MusicNet 2382


Comparisons

We show here qualitative comparisons with two models: MT3 (Gardner et al., 2021) and Omnizart (Wu et al., 2020):


ABBA Gimme original

ABBA Gimme MT3 transcription

ABBA Gimme our transcription

Source: https://www.youtube.com/watch?v=JWay7CDEyAI&ab_channel=CraigGagn%C3%A9



Pulp Fiction original

Pulp Fiction MT3

Pulp Fiction Ours

Source: https://www.youtube.com/watch?v=1hLIXrlpRe8



Stars and Stripes original

Stars and Stripes Omnizart

Stars and Stripes Ours

Source: https://www.youtube.com/watch?v=a-7XWhyvIpE&ab_channel=UnitedStatesMarineBand

Hungarian Dance original

Hungarian Dance MT3

Hungarian Dance Ours

Source: https://www.youtube.com/watch?v=Nzo3atXtm54&ab_channel=MelosKonzerte



Barber of Seville original

Barber of Seville MT3

Barber of Seville Ours

Source: https://www.youtube.com/watch?v=OloXRhesab0&t=2s&ab_channel=ClassicalMusicOnly



Brahms original

Brahms Omnizart

Brahms Ours

Source: https://www.youtube.com/watch?v=YzZy1is6ZRU&ab_channel=Levan



Pop Music & Singing

We show initial results for pop music, including human singing, obtained by fine-tuning on a small pop dataset:

Voyage original

Voyage transcription

Source: https://www.youtube.com/watch?v=NlgmH5q9uNk&ab_channel=Desireless

La Isla Bonita original

La Isla Bonita transcription

Source: https://www.youtube.com/watch?v=zpzdgmqIHOQ&ab_channel=Madonna

Toto Africa original

Toto Africa transcription

Source: https://www.youtube.com/watch?v=FTQbiNvZqaY&ab_channel=TotoVEVO


Quantitative Results

We show here quantitative results on MAESTRO, MAPS, GuitarSet, and MusicNetEM (MusicNet recordings with our generated labels). We do not use MAESTRO, MAPS, or GuitarSet for training. The system is initially trained on synthetic data ("Synth", fourth row from the bottom), and then further trained on real data with labels generated by our method. Our method therefore belongs to the Self-/Weakly-Supervised/Zero-Shot category.
We use two settings (the two bottom rows): training on MusicNet recordings with our generated labels (MusicNetEM), and training on self-collected data with our generated labels (Self-Collected). For comparison, we also further train the initial model on MusicNet with the original labels (third row from the bottom). The results clearly show that our MusicNetEM labels are significantly more accurate than the original labels, especially in onset timing.



| Method (Note / Frame F1)  | MAESTRO     | MAPS        | GuitarSet   | MusicNetEM  |
|---------------------------|-------------|-------------|-------------|-------------|
| Supervised                |             |             |             |             |
| Hawthorne et al., 2019    | 95.3 / 90.2 | 86.4 / 84.9 | - / -       | - / -       |
| Gardner et al., 2021      | 96.0 / 88.0 | - / -       | 90.0 / 89.0 | - / -       |
| Weakly/self-supervised/ZS |             |             |             |             |
| Gardner et al., 2021 ZS   | 28.0 / 60.0 | - / -       | 32.0 / 58.0 | - / -       |
| Cheuk et al., 2021        | - / -       | 75.2 / 79.5 | - / -       | - / -       |
| Synth                     | 83.8 / 74.7 | 79.1 / 76.6 | 68.4 / 72.9 | 72.0 / 59.8 |
| MusicNet                  | 57.5 / 57.9 | 53.4 / 74.3 | 10.0 / 57.2 | 41.5 / 66.7 |
| MusicNetEM (ours)         | 89.7 / 76.0 | 87.3 / 79.6 | 82.9 / 81.6 | 88.8 / 82.8 |
| Self-collected (ours)     | 89.6 / 76.8 | 86.6 / 80.9 | 82.2 / 79.3 | - / -       |
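For reference, the note metrics above can be computed with mir_eval. A minimal sketch, assuming the standard 50 ms onset tolerance and ignoring offsets; the note arrays below are placeholders, not actual data:

```python
# Sketch: note-level F1 with mir_eval (50 ms onset tolerance, offsets ignored).
# The note arrays below are placeholders; in practice they come from the
# reference labels and the transcriber's output.
import numpy as np
import mir_eval

# Each note is an (onset, offset) interval in seconds plus a pitch in Hz.
ref_intervals = np.array([[0.50, 1.00], [1.00, 1.50]])
ref_pitches = np.array([440.00, 493.88])
est_intervals = np.array([[0.52, 0.98], [1.01, 1.40]])
est_pitches = np.array([440.00, 493.88])

precision, recall, f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05,  # 50 ms onset tolerance
    offset_ratio=None,     # ignore offsets: "note" metric, not "note-with-offset"
)
print(f"note F1: {100 * f1:.1f}")
```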

MusicNetEM

We provide here our improved labels for the MusicNet dataset (the original dataset can be found here). The labels are in the form of MIDI files aligned with the audio, and include instrument information. The onset timing accuracy of the labels is 32 ms, which is sufficient to train a transcriber; onset timings in the original MusicNet labels are not accurate enough for this. Our labels were generated automatically by an EM process similar to the one described in our paper Unaligned Supervision for Automatic Music Transcription in The Wild. We improved the alignment algorithm and, in order to get more accurate labels, divided the dataset into three groups based on ensemble type: piano solo, strings, and wind. We performed the EM process on each group separately.
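Since the labels are plain MIDI files aligned to the audio, they can be read with any MIDI library. A minimal sketch using pretty_midi (the file path is a placeholder); each track carries the instrument program, and each note its onset, offset, and pitch:

```python
# Sketch: reading a MusicNetEM label file with pretty_midi.
# The file path below is a placeholder; point it at one of the provided labels.
import pretty_midi

midi = pretty_midi.PrettyMIDI("MusicNetEM/2382.mid")

notes = []
for track in midi.instruments:
    if track.is_drum:
        continue
    name = pretty_midi.program_to_instrument_name(track.program)
    for note in track.notes:
        # Onset/offset times are in seconds, already aligned to the recording.
        notes.append((note.start, note.end, note.pitch, track.program, name))

notes.sort()  # sort by onset time
print(f"{len(notes)} notes, first: {notes[0]}")
```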

Baseline

You can train the architecture from the MAESTRO paper from scratch on MusicNet recordings with our labels. For note-with-instrument transcription, use N_KEYS * N_INSTRUMENTS onset classes, one for each note/instrument combination, plus an additional N_KEYS classes for pitch independent of instrument, i.e. N_KEYS * (N_INSTRUMENTS + 1) classes in total. We used 88 * 12 = 1056 classes (a sketch of this class-index layout appears at the end of this section). Training in this manner for 101K steps with batch size 8, without any augmentation, we reached:

| Evaluation set      | note F1 | note-with-inst. F1 | frame F1 | note-with-offset F1 |
|---------------------|---------|--------------------|----------|---------------------|
| MAPS                | 82.0    | 82.0               | 69.1     | 37.7                |
| MAESTRO             | 85.0    | 85.0               | 65.2     | 31.9                |
| GuitarSet           | 72.8    | -                  | 68.4     | 30.7                |
| MusicNetEM          | 91.4    | 88.1               | 82.5     | 71.4                |
| MusicNetEM wind     | 88.5    | 79.9               | 83.1     | 65.0                |
| MusicNetEM strings  | 89.1    | 85.5               | 82.6     | 77.7                |
| MusicNetEM strings* | 85.9    | 81.1               | 79.0     | 75.1                |
The split we use for wind and string instruments is the same as in Cheuk et al., 2021. We also show evaluation on pieces from the MusicNet test set that include string instruments only (last row, MusicNetEM strings*).
When evaluating by instrument, we achieved the following in this setting:

| Test instrument | note-with-inst. F1 |
|-----------------|--------------------|
| Violin          | 87.3               |
| Viola           | 61.1               |
| Cello           | 79.9               |
| Bassoon         | 78.0               |
| Clarinet        | 86.8               |
| Horn            | 75.0               |
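For reference, here is a minimal sketch of the onset-class layout described above; the function name, instrument ordering, and pitch range are illustrative assumptions, not our exact code:

```python
# Sketch of the onset-class layout: N_KEYS * (N_INSTRUMENTS + 1) classes in total,
# where the extra "+1" slot is the pitch-only (instrument-independent) group.
# The instrument ordering and pitch range (MIDI 21-108) are assumptions for illustration.
N_KEYS = 88          # piano keys, MIDI pitches 21..108
N_INSTRUMENTS = 11   # instrument groups; with the pitch-only group: 88 * 12 = 1056 classes


def onset_class(midi_pitch: int, instrument: int) -> int:
    """Map a (pitch, instrument) pair to a class index.

    instrument in [0, N_INSTRUMENTS - 1] selects a specific instrument;
    instrument == N_INSTRUMENTS selects the pitch-only (instrument-independent) class.
    """
    key = midi_pitch - 21            # 0..87
    assert 0 <= key < N_KEYS
    assert 0 <= instrument <= N_INSTRUMENTS
    return instrument * N_KEYS + key


# Example: the instrument-independent onset class of middle C (MIDI 60):
print(onset_class(60, N_INSTRUMENTS))  # 11 * 88 + (60 - 21) = 1007
```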

Credits

For MIDI visualization we used the MIDI visualizer tool created by Simon Rodriguez, available here.
For the website we used the source code from the Nerfies website.