AIM at the 3rd International Workshop on Reading Music Systems

The third edition of the International Workshop on Reading Music Systems (WoRMS) was held in a hybrid live/virtual setting on the 23rd of July. WoRMS is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition (OMR), with other researchers and practitioners who could benefit from such systems, such as librarians or musicologists. This edition brought together many researchers working in OMR, from both academia and industry. Eleven papers covering a broad range of topics in OMR were presented, and Anthony Wilkes (Organum Ltd) gave an outstanding keynote on the design of ReadScoreLib.

One of the presented papers was work by me (Elona Shatri) and George Fazekas on the newly created DoReMi dataset. We presented some of the challenges of OMR, specifically the lack of a well-annotated dataset that supports more than one stage of OMR, and how DoReMi moves closer to such a dataset. We also presented statistics on the dataset and baseline object detection experiments using Faster R-CNN models. DoReMi is a product of our collaborative work with Steinberg’s Dorico team. Using Dorico, we generated six types of data that can be used in different steps of OMR: PNG images of scores (binary or colour), MusicXML files, XML files with metadata such as bounding boxes for each object together with musical information, original Dorico projects, MEI files and MIDI files. Here you can find the DoReMi documentation and here you can download the dataset.
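To illustrate how per-object bounding-box metadata of this kind can be consumed, here is a minimal sketch that parses a toy XML fragment into box coordinates. The element and attribute names (`node`, `classname`, `top`, `left`, `width`, `height`) are illustrative assumptions, not the actual DoReMi schema.

```python
import xml.etree.ElementTree as ET

# Toy annotation fragment; element/attribute names are hypothetical,
# not the actual DoReMi schema.
XML = """
<page>
  <node classname="noteheadBlack" top="120" left="340" width="14" height="12"/>
  <node classname="stem" top="80" left="352" width="2" height="52"/>
</page>
"""

def load_boxes(xml_text):
    """Return a list of (class_name, (x0, y0, x1, y1)) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for node in root.iter("node"):
        x0 = int(node.get("left"))
        y0 = int(node.get("top"))
        x1 = x0 + int(node.get("width"))
        y1 = y0 + int(node.get("height"))
        boxes.append((node.get("classname"), (x0, y0, x1, y1)))
    return boxes

boxes = load_boxes(XML)
```

Boxes in this `(x0, y0, x1, y1)` corner format can be fed directly to object detection tooling such as the Faster R-CNN baselines mentioned above.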

Below you can find a summary of some of the papers presented at the workshop.

Hybrid Annotation Systems for Music Transcription

This paper dwells on the idea of bringing human annotation and automated methods together for music transcription: how can a non-specialist carry out music transcription through carefully designed task interaction with AI-automated methods? Among the 144 workers who completed tasks on MTurk, those with formal musical training were rare. Audio extracts of the target music scores were offered to increase performance, especially for short segments of one or two measures. For longer segments, audio extracts gave better results than textual measures, but a combination of the two was found preferable.

Implementation and evaluation of a neural network for the recognition of handwritten melodies

This research arose from the current need for digital archiving and digitisation of music at the University Library of Regensburg. It evaluates whether existing state-of-the-art deep learning architectures can recognise handwritten monophonic scores for digitisation. Based on existing work, the architecture includes two neural networks: a stave recognition network using autoencoders and an end-to-end note recognition network using recurrent convolutional networks. One limitation mentioned is the small amount of annotated data available for this research.

The Challenge of Reconstructing Digits in Music Scores

Pacha presented focused research he is currently conducting at enote on recognising and reconstructing digit elements in sheet music. He showed the main challenges posed by the ambiguity of variations within digit classes, their contextual nature and further computer vision issues. He then showed results of using deep learning to recognise digits. The network was trained on synthetic samples and achieved a validation accuracy of 95%, which does not carry over to real-world scores. To address this, the network was fine-tuned on 7,000 manually annotated real scores, but accuracy still did not reach 60%. This opened up a long discussion at the workshop on why this happens and how to tackle it.

Detecting Staves and Measures in Music Scores with Deep Learning

This paper investigates strategies for detecting measures, staves and system measures using machine learning, to aid the detection of structural elements as a basis for an OMR system. A neural network is trained on handwritten music scores to generate annotations for typeset music. Detectron2 was used as the framework and Faster R-CNN as the model to predict bounding boxes in images. The MUSCIMA++ and AudioLabs datasets were used for training. They applied the model in three settings: single-class models (system measures, stave measures or staves), a two-class model (system measures & staves) and a three-class model (system measures, stave measures & staves). The first setting performs best. However, considering that the training data lacks diversity, the model might not work well for every kind of sheet music.
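A setup along these lines can be approximated with Detectron2's standard configuration API. The fragment below is a hedged sketch, not the authors' actual configuration: the class count follows the three-class setting described above, while the baseline model choice and the dataset names are illustrative assumptions.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
# Start from a standard Faster R-CNN baseline from the model zoo
# (the specific backbone here is an assumption, not the paper's choice).
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
# Three-class setting: system measures, stave measures, staves.
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3
# Hypothetical dataset names; registering the MUSCIMA++/AudioLabs data
# in Detectron2's DatasetCatalog is left out of this sketch.
cfg.DATASETS.TRAIN = ("muscima_train",)
cfg.DATASETS.TEST = ("audiolabs_val",)
```

With a config like this, Detectron2's `DefaultTrainer` handles the training loop, so the experiment largely reduces to data registration and class mapping.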

Unsupervised Neural Document Analysis for Music Score Images

Given the lack of large annotated training sets, this study suggests using Domain Adaptation (DA) based on adversarial training. The authors propose combining DA and Selectional Auto-Encoders for unsupervised document analysis. They use three corpora manually labelled at the layer level, SALZINNES, EINSIEDELN and CAPITAN, with the F-score as the evaluation metric. The results show that the proposed method slightly improves on the state of the art, but such adaptation should not be carried out for every type of layer.
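Since layer extraction is evaluated with the F-score, here is a minimal pixel-wise implementation, assuming predicted and ground-truth layers are given as binary masks (a simplification for illustration; the paper's exact evaluation protocol may differ).

```python
def pixel_f_score(pred, truth):
    """Pixel-wise F-score between two binary masks (nested lists of 0/1)."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1          # pixel correctly assigned to the layer
            elif p and not t:
                fp += 1          # spurious layer pixel
            elif t and not p:
                fn += 1          # missed layer pixel
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: one spurious pixel in a 2x3 mask.
pred  = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 1, 0], [0, 0, 0]]
score = pixel_f_score(pred, truth)
```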

Multimodal Audio and Image Music Transcription

This paper draws attention to the similarities between Optical Music Recognition (OMR) and Automatic Music Transcription (AMT) and exploits them to assist each field. It presents a proof of concept that combines the predictions of end-to-end AMT and OMR systems over a set of monophonic scores. Using the Symbol Error Rate (SER), they show that a fusion of the two models can slightly improve the error rate of OMR.
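SER is commonly computed as the edit distance between the predicted and reference symbol sequences, normalised by the reference length. A minimal sketch, with hypothetical symbol tokens for a monophonic score:

```python
def symbol_error_rate(predicted, reference):
    """SER: Levenshtein distance between two symbol sequences,
    normalised by the reference length."""
    m, n = len(predicted), len(reference)
    prev = list(range(n + 1))          # edit distances for the empty prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == reference[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return prev[n] / n

# Toy sequences: one substituted symbol out of four.
ref  = ["clef-G2", "note-C4_quarter", "note-D4_quarter", "barline"]
pred = ["clef-G2", "note-C4_quarter", "note-E4_quarter", "barline"]
ser = symbol_error_rate(pred, ref)
```

A fusion system as described above would aim to push this figure below what either the OMR or the AMT predictions achieve alone.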

Sequential Next-Symbol Prediction for Optical Music Recognition

This study proposes to address the lack of large training sets with a sequential classification-based approach for music scores: symbol locations and their respective music-notation labels are predicted one at a time using Convolutional Neural Networks (CNNs).
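The sequential formulation can be sketched as a loop in which a classifier repeatedly predicts the next symbol's label and location until an end marker is produced. The classifier below is a hypothetical stub standing in for the paper's CNN, and the symbol tokens are invented for illustration.

```python
END = ("<end>", None)

def stub_classifier(image, history):
    """Hypothetical stand-in for a trained CNN: returns the next
    (label, x_position) given the score image and the symbols so far."""
    script = [("clef-G2", 10), ("note-C4", 42), ("barline", 80), END]
    return script[len(history)]

def transcribe(image, classifier, max_symbols=100):
    """Predict symbols one at a time until the end marker appears."""
    history = []
    for _ in range(max_symbols):
        label, x = classifier(image, history)
        if (label, x) == END:
            break
        history.append((label, x))
    return history

symbols = transcribe(None, stub_classifier)
```

The appeal of this formulation is that each step is an ordinary classification problem, so it needs far less data than training a full sequence-to-sequence model.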

Completing Optical Music Recognition with Agnostic Transcription and Machine Translation

This work focuses on the last stage of OMR, encoding, where the outputs recognised from images are converted into a structured score encoding format. The paper investigates recognition pipelines that use Machine Translation to perform this encoding.


Original article:

Author: Elona Shatri


AIM students and staff to join the Alan Turing Institute

The Alan Turing Institute, the UK’s national institute for data science and artificial intelligence, is a project partner of the AIM CDT, and three AIM PhD students and six AIM supervisors will join the Turing in Autumn 2021.

The following AIM PhD students will join the Turing as Enrichment students in 2021/22:

  • Lele Liu – Enrichment project: Cross-domain automatic music audio-to-score transcription
  • Ilaria Manco – Enrichment project: Multimodal deep learning for music information retrieval
  • Luca Marinelli – Enrichment project: Gender-coded sound: A multimodal data-driven analysis of gender encoding strategies in sound and music for advertising

The Turing’s Enrichment scheme offers students enrolled on a doctoral programme at a UK university an opportunity to boost their research project with a placement at the Turing for up to 12 months.

The following AIM supervisors have been appointed Turing Fellows in 2021/22:

Turing Fellows are scholars with proven research excellence in data science, artificial intelligence or a related field whose research would be significantly enhanced through active involvement with the Turing network of universities and partners.

AIM at IJCNN 2021

On 18-22 July 2021, AIM researchers will participate virtually in the IEEE International Joint Conference on Neural Networks (IJCNN 2021), the flagship conference of the IEEE Computational Intelligence Society and the International Neural Network Society.

The AIM CDT will have a strong presence at the conference, with the following papers authored/co-authored by AIM members to be presented at IJCNN 2021:

  • MusCaps: Generating Captions for Music Audio
    Ilaria Manco, Emmanouil Benetos, Elio Quinton and Gyorgy Fazekas
  • A Modulation Front-End for Music Audio Tagging
    Cyrus Vahidi, Charalampos Saitis and Gyorgy Fazekas
  • Revisiting the Onsets and Frames Model with Additive Attention
    Kin Wai Cheuk, Yin-Jyun Luo, Emmanouil Benetos and Dorien Herremans

Original news post here.