Multi-instrument Automatic Music Transcription (AMT), or the decoding of a musical recording into semantic musical content, is one of the holy grails of Music Information Retrieval. Current AMT approaches are restricted to piano and (some) guitar recordings, due to the difficulty of data collection. To overcome data-collection barriers, previous AMT approaches attempt to employ musical scores in the form of a digitized version of the same song or piece. The scores are typically aligned using audio features and strenuous human intervention to generate training labels.
We introduce NoteEM, a method for simultaneously training a transcriber and aligning the scores to their corresponding performances, in a fully automated process. Using this unaligned supervision scheme, complemented by pseudo-labels and pitch-shift augmentation, our method enables training on in-the-wild recordings with unprecedented accuracy and instrumental variety. Using only synthetic data and unaligned supervision, we report SOTA note-level accuracy on the MAPS dataset, and large favorable margins on cross-dataset evaluations. We also demonstrate robustness and ease of use; we report comparable results when training on a small, easily obtainable, self-collected dataset, and we propose alternative labels for the MusicNet dataset, which we show to be more accurate.
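At a high level, the training loop alternates between transcribing each recording with the current model, warping the unaligned score onto the prediction (e.g., with DTW) to obtain frame-level pseudo-labels, and retraining on those labels with pitch-shift augmentation applied consistently to the audio and the labels. The sketch below illustrates this idea; it is not the released implementation, and `transcribe`, `align_with_dtw`, and `train_on` are hypothetical placeholders.

```python
# Minimal sketch of an EM-style unaligned-supervision loop (illustrative only).
# `transcribe`, `align_with_dtw`, and `train_on` are hypothetical placeholders.
import numpy as np
import librosa

def pitch_shift_pair(audio, label_roll, sr, semitones):
    """Pitch-shift augmentation: shift the audio and its label roll together."""
    audio_shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=semitones)
    label_shifted = np.roll(label_roll, semitones, axis=0)  # axis 0 = pitch bins
    return audio_shifted, label_shifted

def em_training(model, recordings, unaligned_scores, sr=16000, n_iters=3):
    for _ in range(n_iters):
        # E-step: transcribe each recording and warp its unaligned score onto
        # the prediction to obtain frame-level pseudo-labels.
        pseudo_labels = [
            align_with_dtw(score, transcribe(model, audio))
            for audio, score in zip(recordings, unaligned_scores)
        ]
        # M-step: retrain the transcriber on the aligned labels, with a random
        # pitch shift applied consistently to audio and labels.
        for audio, label in zip(recordings, pseudo_labels):
            shift = np.random.randint(-5, 6)
            train_on(model, *pitch_shift_pair(audio, label, sr, shift))
    return model
```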
We provide here example transcriptions of famous pieces and songs produced by our system, together with quantitative results on various benchmarks. We also provide here our improved labels for the MusicNet dataset (the original dataset can be found here). The labels were generated automatically by our method. We refer to MusicNet recordings with our labels as MusicNetEM. We provide a baseline for training from scratch on MusicNetEM, including cross-dataset evaluation. For more information, take a look at our ICML 2022 paper on arXiv.
We show here qualitative comparisons with two models: MT3 (Gardner et al., 2021) and Omnizart (Wu et al., 2020):
We show initial results for pop music, including human singing, obtained by fine-tuning on a small pop dataset:
We show here quantitative results on MAESTRO, MAPS, GuitarSet, and MusicNetEM
(MusicNet recordings with our generated labels).
We do not use MAESTRO, MAPS, or GuitarSet for training. The system is initially
trained on synthetic data ("Synth", fourth row from the bottom), and then further
trained on real data with labels generated by our method. Therefore our method
belongs to the Self-/Weakly-Supervised/Zero-Shot category.
We use two settings (two bottom rows): training on MusicNet recordings with
our generated labels (MusicNetEM), and training on self-collected data with our
generated labels (Self-Collected). For comparison, we also further train the
initial model on MusicNet with the original labels (third row from the bottom).
The results clearly show that our MusicNetEM labels are significantly more
accurate than the original labels, especially in onset timing.
Model | MAESTRO Note | MAESTRO Frame | MAPS Note | MAPS Frame | GuitarSet Note | GuitarSet Frame | MusicNetEM Note | MusicNetEM Frame |
---|---|---|---|---|---|---|---|---|
Supervised | | | | | | | | |
Hawthorne et al., 2019 | 95.3 | 90.2 | 86.4 | 84.9 | - | - | - | - |
Gardner et al., 2021 | 96.0 | 88.0 | - | - | 90.0 | 89.0 | - | - |
Weakly/self-supervised/ZS | | | | | | | | |
Gardner et al., 2021 ZS | 28.0 | 60.0 | - | - | 32.0 | 58.0 | - | - |
Cheuk et al., 2021 | - | - | 75.2 | 79.5 | - | - | - | - |
Synth | 83.8 | 74.7 | 79.1 | 76.6 | 68.4 | 72.9 | 72.0 | 59.8 |
MusicNet | 57.5 | 57.9 | 53.4 | 74.3 | 10.0 | 57.2 | 41.5 | 66.7 |
MusicNetEM (ours) | 89.7 | 76.0 | 87.3 | 79.6 | 82.9 | 81.6 | 88.8 | 82.8 |
Self-collected (ours) | 89.6 | 76.8 | 86.6 | 80.9 | 82.2 | 79.3 | - | - |
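Note-level scores of this kind are typically computed with mir_eval's transcription metrics (onset within 50 ms of the reference, matching pitch, offsets ignored), and frame-level scores over binarized piano rolls. A minimal sketch of such a computation, assuming reference and estimated notes are given as (onset, offset) interval arrays and pitch arrays in Hz:

```python
# Sketch of note- and frame-level F1 computation with mir_eval / numpy.
import numpy as np
import mir_eval

def note_f1(ref_intervals, ref_pitches, est_intervals, est_pitches):
    # Note-level metric: onset within 50 ms, pitch within a quarter tone,
    # offsets ignored (offset_ratio=None).
    _, _, f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
        ref_intervals, ref_pitches, est_intervals, est_pitches,
        onset_tolerance=0.05, offset_ratio=None)
    return f1

def frame_f1(ref_roll, est_roll):
    # Frame-level F1 over binarized piano rolls of equal shape (pitches x frames).
    tp = np.logical_and(ref_roll, est_roll).sum()
    precision = tp / max(est_roll.sum(), 1)
    recall = tp / max(ref_roll.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```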
We provide here our improved labels for the MusicNet dataset (the original dataset can be found here). The labels are in the form of MIDI files aligned with the audio, and include instrument information. Onset timing accuracy of the labels is 32 ms, which is sufficient to train a transcriber. Onset timings in the original MusicNet labels are not accurate enough for this.
Our labels were generated automatically by an EM process similar to the one described in our paper Unaligned Supervision for Automatic Music Transcription in The Wild. We improved the alignment algorithm, and, to obtain more accurate labels, we divided the dataset into three groups based on ensemble type: piano solo, strings, and wind. We performed the EM process on each group separately.
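The labels can be read with any standard MIDI library. Below is a small sketch using pretty_midi that extracts (onset, offset, pitch, instrument) tuples from one label file; the file path is a placeholder, not an actual file name from the release.

```python
# Sketch: reading a MusicNetEM label file with pretty_midi.
# Labels are plain MIDI files whose tracks carry the General MIDI program
# (instrument) of each note. The path below is a placeholder.
import pretty_midi

def load_label(path="musicnet_em/example.mid"):
    midi = pretty_midi.PrettyMIDI(path)
    notes = []
    for inst in midi.instruments:
        name = pretty_midi.program_to_instrument_name(inst.program)
        for note in inst.notes:
            notes.append((note.start, note.end, note.pitch, name))
    return sorted(notes)  # sorted by onset time
```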
The table below shows results of a model trained from scratch on MusicNetEM, including cross-dataset evaluation:

Dataset | Note F1 | Note-with-inst. F1 | Frame F1 | Note-with-offset F1 |
---|---|---|---|---|
MAPS | 82.0 | 82.0 | 69.1 | 37.7 |
MAESTRO | 85.0 | 85.0 | 65.2 | 31.9 |
GuitarSet | 72.8 | - | 68.4 | 30.7 |
MusicNetEM | 91.4 | 88.1 | 82.5 | 71.4 |
MusicNetEM wind | 88.5 | 79.9 | 83.1 | 65.0 |
MusicNetEM strings | 89.1 | 85.5 | 82.6 | 77.7 |
MusicNetEM strings* | 85.9 | 81.1 | 79.0 | 75.1 |
Test instrument | Note-with-inst. F1 |
---|---|
Violin | 87.3 |
Viola | 61.1 |
Cello | 79.9 |
Bassoon | 78.0 |
Clarinet | 86.8 |
Horn | 75.0 |