This page outlines PhD topics proposed by AIM academics and industry partners for September 2023 entry. You are welcome to apply for one or more of the topics, or to propose your own PhD topic, according to the application guidelines specified here. In both cases, we strongly encourage you to contact your chosen supervisor for an informal chat; this will also help you put together your research proposal, which is an integral part of your application. Your research proposal should describe your one preferred PhD topic in detail, but we also strongly encourage you to list 3 possible topics in your application.

| PhD Topic | Supervisor / Industry Partner |
| --- | --- |
| Advancing music generation via accelerated deep learning | Ahmed Sayed / HPC-AI Technology |
| Artificial neuroscience for understanding, engineering and control of neural audio systems | Mark Sandler |
| Auditory masking: advances and creative applications | Josh Reiss |
| Automatic background music recommendation for micro-videos | Lin Wang |
| Beyond supervised deep learning for musical audio | George Fazekas / Universal Music Group |
| Deep learning for low-resource music | Emmanouil Benetos / ByteDance |
| Emotion-driven personalised music recommendation | Mathieu Barthet / Deezer |
| Explainability of AI music generation | Nick Bryan-Kinns |
| Film score composer AI assistant: generating expressive mockups | Mathieu Barthet / Spitfire Audio |
| Identification of interactive motion patterns in response to music | Ekaterina Ivanova |
| Incorporating hierarchical structure into automatic music labelling | Johan Pauwels |
| Machine learning of meta-composition heuristics | Simon Dixon / DAACI |
| Machine learning of physical models | Josh Reiss / Nemisindo |
| Multi-sensor mapping for participatory music performances over a local network | Stefan Poslad |
| Multimodal AI for musical collaboration in immersive environments | Mathieu Barthet / PatchXR |
| Multimodal learning for music understanding through 3D hand pose estimation from images and sounds | Shanxin Yuan |
| Multitask learning for annotation of jazz recordings | Simon Dixon |
| Navigable audio transformations for the digital audio workstation | George Fazekas |
| Neuro-symbolic automated music composition | Simon Colton |
| Perceptually-based mastering-referenced replay-system compensation | Mark Sandler / KEF |
| Probabilistic learning of sequential structure in music cognition | Marcus Pearce |
| Timbre tools for the digital instrument maker | Charalampos Saitis / Bela |

Advancing Music Generation via Accelerated Deep Learning

Supervisor: Dr. Ahmed Sayed
in collaboration with HPC-AI Technology

The capability of generating music in real time from large-scale music data is becoming a key topic in generative art. Artificial Intelligence (AI) has been widely used to generate images [1] and, more recently, to generate music through deep learning techniques [2]. Deep music generation uses Deep Neural Network (DNN) architectures to generate music automatically [3]. Unfortunately, to cope with the growth in music data, DNN models have also grown in parameter count, from many millions (e.g., RNNs and LSTMs) to many billions (e.g., GPT-3) [4]. Consequently, the time, computational cost and carbon footprint required to train and deploy DNN models for music generation have grown dramatically.

The PhD project aims to investigate, propose and implement novel optimizations at the algorithmic and system levels to accelerate the training and inference of DNN models for music generation in HPC or Cloud environments. This involves solving a hard multi-objective optimization problem that trades off time, computation and energy efficiency. Based on these optimizations, we will develop software framework(s) that can accelerate Deep Music Generation tasks. This will help advance the field of music generation by making it more sustainable and affordable.

[1] Eva Cetinic and James She. 2022. Understanding and Creating Art with AI: Review and Outlook. ACM Trans. Multimedia Comput. Commun. Appl. 18, 2.
[2] Emma Frid, Celso Gomes, and Zeyu Jin. 2020. Music Creation by Example. In Proceedings of the Conference on Human Factors in Computing Systems.
[3] Ji, S., Luo, J. and Yang, X., 2020. A comprehensive survey on deep music generation: Multi-level representations, algorithms, evaluations, and future directions. arXiv preprint arXiv:2011.06801.
[4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and Agarwal, S., 2020. Language models are few-shot learners. NeurIPS.

Artificial Neuroscience for understanding, engineering and control of Neural Audio systems

Supervisor: Prof. Mark Sandler

The original Neural Networks were inspired by the (large) clusters of neurons found in biological nervous systems. Just as the study and manipulation of biological brains is called Neuroscience, we coin the term Artificial Neuroscience to cover a variety of ways that look at how artificial brains work, and then apply that understanding to control, simplify and otherwise manipulate Deep Learning (DL) systems. These studies will be in the context of Neural Audio systems.

The branch of Artificial Neuroscience that we focus on is the application of Linear Algebra and Signal Processing to the measurement and control of the dynamics of Neural Networks. Specifically, DL network layers, weights and loss functions are represented as matrices for which Singular Value Decompositions (SVD, or Principal Component Analysis, PCA) can be computed and used to track the evolution of networks during training. SVD techniques can also be used to tune the learning and make network computations more tractable, particularly by resorting to so-called low-rank approximations of these internal matrices. Only recently have papers relating to these concepts begun to appear, e.g. [1], and they cover such topics as initialising DL networks, controlling gradients in loss functions, pruning layers and interpretability.
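As a small illustration of the low-rank idea described above, the sketch below estimates the largest singular value of a toy weight matrix by power iteration and builds the corresponding rank-1 approximation. The matrix values and all function names are hypothetical, and the code uses only the Python standard library for self-containment; a real study would more likely call a library SVD routine on actual layer weights.

```python
import math


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def top_singular_triplet(w, iters=200):
    """Estimate (sigma, u, v) for the largest singular value of w
    by power iteration on w^T w."""
    wt = transpose(w)
    v = [1.0] * len(w[0])              # arbitrary non-zero start vector
    for _ in range(iters):
        v = matvec(wt, matvec(w, v))
        n = norm(v)
        v = [x / n for x in v]
    wv = matvec(w, v)
    sigma = norm(wv)
    u = [x / sigma for x in wv]
    return sigma, u, v


def rank1_approximation(w):
    """Best rank-1 approximation sigma * u * v^T of w."""
    sigma, u, v = top_singular_triplet(w)
    return [[sigma * ui * vj for vj in v] for ui in u]


# Hypothetical 3x2 "weight matrix" that is almost rank 1.
W = [[2.0, 4.0],
     [1.0, 2.1],
     [3.0, 6.0]]

sigma, u, v = top_singular_triplet(W)
W1 = rank1_approximation(W)

# Frobenius norm of what the rank-1 approximation fails to capture.
error = math.sqrt(sum((a - b) ** 2
                      for row_w, row_w1 in zip(W, W1)
                      for a, b in zip(row_w, row_w1)))
```

Recomputing `sigma` and `error` after each training step would give one simple way to track how close a layer stays to a low-rank matrix as training proceeds.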

The PhD will explore these principles within a Neural Audio topic to be decided based on the student’s background and interests, which should be discussed in the Research Proposal. Topics include Audio Source Separation, Neural Audio Synthesis, multi-mic recording techniques, Harmonic Analysis, Automatic Transcription and Music Structure Decomposition. Note: the goal is not necessarily to chase State of the Art performance, but to understand, control and improve audio and music Deep Learning. However, it is anticipated that high performing systems will inevitably result from this research.

[1] B Bermeitinger, T Hrycej, and S Handschuh. Singular Value Decomposition and Neural Networks. 28th International Conference on Artificial Neural Networks, Germany, September, 2019.

Auditory masking: advances and creative applications

Supervisor: Prof. Josh Reiss

Auditory masking occurs whenever the presence of one sound (the masker) makes another sound (the signal) less perceptible. It is a well-studied phenomenon in auditory science, and the basis of many audio codecs. But the phenomenon is far less understood in many audio production contexts. For instance, sound engineering students are often taught that a goal of mixing is to reduce masking, but is this really the case? In many situations, musical instruments might be intended to blend together, to form a horn section for instance. And what are the best approaches to reduce masking? Should the same approaches be applied when there are many sources or just a few? Would an intelligent system take the same approach as a human engineer? Would new machine learning approaches outperform more established optimisation methods?

The aim of this PhD topic is to investigate masking in real-world sound engineering and music production contexts. The work will propose and evaluate masking measures in different contexts, e.g. overall masking when there are multiple and diverse sources. It will try to establish and quantify preferences for masking, as well as best practices for masking reduction in music production. It will explore new approaches, both the creation of autonomous, intelligent systems that mimic the actions of a professional sound engineer, and novel, creative approaches. These creative approaches could include neural network architectures that are not constrained by the use of traditional audio effects, or reducing masking by placing different frequency bins of each source in different spatial locations.

For this topic, the researcher will have access to a large data set of multitrack content and mixes, as well as evaluation of those mixes. It also builds on a large body of prior research here. Knowledge of deep learning is preferred, but the research is exploratory and may be taken in different directions.

Automatic background music recommendation for micro-videos

Supervisor: Dr. Lin Wang

Micro-videos are becoming popular web media due to the prevalence of social media sharing platforms such as YouTube, Flickr, and TikTok. Adding suitable background music to a video is an important step in making the video impressive, as it helps convey the video's content and emotion. However, manually selecting the background music is a painstaking task for ordinary users, because it requires professional knowledge and the amount of candidate music is ever-growing. Traditional music-video matching methods are based on manually annotated emotion tags of music and video clips, and thus are time-consuming and imprecise due to the dynamically varying semantic structure of the media content.

The project aims to develop an automatic background music recommendation system for micro-videos to be shared on social media platforms. This is a challenging task as the system requires understanding the semantics of both music and video, and constructing a learning system that can automatically identify the best-matched music-video pair. A large-scale music and video dataset will be constructed to build such an artificial intelligence system. Deep learning techniques will be employed to solve potential scientific challenges, including
1) Music semantic analysis: to extract semantic features embedded in the music clips, such as timbre, rhythm, melody, and emotion;
2) Video semantic analysis: to extract semantic features embedded in the audio and video clips, such as content, scene, rhythm, and emotion;
3) Music-video association and synchronization: to identify the best-matched music and video clips efficiently and align them semantically.
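Step 3 above can be sketched as a nearest-neighbour search, assuming that steps 1 and 2 have already mapped music and video clips into a shared embedding space. The embeddings, clip names and function names below are made up for illustration, and cosine similarity is only a simple stand-in for whatever learned matching function the project would develop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_music_for_video(video_vec, music_library):
    """Return music clip names ordered from best to worst match."""
    scored = [(cosine_similarity(video_vec, vec), name)
              for name, vec in music_library.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical 3-dimensional embeddings (real ones would be learned).
music_library = {
    "calm_piano": [0.9, 0.1, 0.0],
    "upbeat_pop": [0.1, 0.9, 0.2],
    "dark_ambient": [0.0, 0.2, 0.9],
}
video_vec = [0.2, 0.8, 0.1]

ranking = rank_music_for_video(video_vec, music_library)
print(ranking)
```

With these toy vectors the video embedding points mostly along the second axis, so `upbeat_pop` is ranked first.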

Through this PhD project, the candidate will acquire a variety of skills including audio and video analysis, language and music understanding, information retrieval, and artificial intelligence.

[1] Yoshida, OtoPittan: A music recommendation system for making Impressive videos, ISM 2016.
[2] Liu, Background music recommendation based on latent factors and moods, Knowledge-Based Systems, 2018.
[3] Yi, Cross-modal variational auto-encoder for content-based micro-video background music recommendation, IEEE Transactions on Multimedia, 2021.
[4] Gu, Video-music retrieval: A dual-path cross-modal network, arXiv 2022.

Beyond supervised deep learning for musical audio

Supervisor: Dr. George Fazekas
in collaboration with Universal Music Group

With support and co-supervision from Universal Music Group (UMG), a world leader in music-based entertainment, this PhD will investigate the use and development of advanced machine learning models with music industry applications, leveraging the large amount of data available in the modern digital music ecosystem. Depending on the candidate’s interests this PhD may focus on areas such as: self-supervised learning, semantically meaningful representation learning, multimodal deep learning, multitask models, small data training.

The adoption of data-driven and machine learning based methods, often using deep learning, is already significant in many areas of the music industry such as music identification, music discovery, personalization of fan experience, catalogue management etc. When first adopted in the audio music domain, deep learning methods tended to be supervised, unimodal (only considering audio input) and tackling a single task at once. Such models were therefore limited to a single task and modality, and required manually annotated datasets, which are expensive to acquire. In the end this means limitations on what practical use cases can be addressed with these models.

Recent research shows promising avenues to overcome these limitations. Self-supervised learning enables representation learning without the need for annotated data. Multi-modal deep learning, by connecting data sources of different nature, has the potential to help better understand, extract and analyse the structure and trends present in large, often unstructured, datasets. Multi-task models have also recently shown promising performance. Most modern methods rely on large training sets, but many practical scenarios are in the small data regime. Addressing such use cases is also of prime interest for industry applications. A few (non-exhaustive) research questions that would be of interest in the PhD: 1) How can self-supervised methods learn semantically meaningful (e.g. disentangled) representations? 2) How can self-supervised methods be made more robust (e.g. generalise better) and more versatile (e.g. cover more tasks)? 3) How can multi-modal models help learn more complete, more relevant, and more effective representations? 4) How can modern methods be adapted to the small data regime?

Deep learning for low-resource music

Supervisor: Dr. Emmanouil Benetos
in collaboration with ByteDance

The field of music information retrieval (MIR) has been growing for more than 20 years, with recent advances in deep learning having revolutionised the way machines can make sense of music data. At the same time, the MIR community is constrained by the data available, and most methods are focused on extracting information from mainstream music styles (mostly pop, rock, and classical music), using predefined sets of commonly used musical instruments, and when relevant assuming high-resource languages for singing voice analysis. Inspired by recent developments in the field of speech technology for low-resource languages, this PhD project will investigate and develop deep learning methods for making sense of music data in low-resource conditions, whether these refer to under-represented music styles, new musical instruments, or low-resource singing corpora. Methods based on few-shot and zero-shot learning will be investigated, along with methods for open-set recognition, meta-learning or semi-supervised learning, applied to various MIR tasks including but not limited to music tagging, music transcription, lyrics recognition, or audio matching or cover song detection.

The successful candidate will investigate, propose and develop novel methods for analysing low-resource music corpora, resulting in models that can rapidly learn or adapt from small or unlabelled datasets of under-represented music styles, musical instruments, or sung languages.

Emotion-driven Personalised Music Recommendation

Supervisor: Dr. Mathieu Barthet
in collaboration with Deezer

Listeners often seek emotional qualities in music that match their context and activities (e.g. commuting, working, exercising, relaxing, partying, etc.). Music recommender systems can leverage mood tags associated with songs (e.g. chill, love, brutal, etc.) to generate mood-based playlists. Such mood tags can be crowd-sourced, curated, or inferred from audio using machine learning. However, music perception and cognition studies show that emotional associations with music tend to be listener-dependent, being influenced by individual and hedonic factors (e.g. culture, experience, personality traits, musical taste). The integration of such listener factors into music emotion recognition models poses research challenges.

This PhD project will investigate how to establish emotion-driven music recommendation systems that provide personalised song suggestions based on user factors. The methodology will rely on graphs which provide a compact way to represent data coming from various modalities and can add context and depth to data-driven machine learning techniques. Several approaches can be considered, for example, graph neural networks (GNNs), which apply deep learning to graph-structured data, or knowledge graphs, which add semantic information to graphs (meaning underlying the data) in a flexible way and support reasoning.

The project will investigate how to combine various data sources for music emotion modelling such as audio, metadata, user-music item interaction, and symbolic music representation, when available. The models will be assessed using benchmark music emotion datasets and/or new datasets developed for the task. User studies will be conducted on prototypes to collect feedback and test the scalability of the approach.

The candidate will collaborate with Deezer, including the possibility to conduct feature extraction and train models on a large and diverse music catalogue with user interaction data. The candidate should have experience in at least one of the following scientific areas or equivalent: recommender systems, machine learning, (knowledge) graphs, deep learning, music emotion recognition, signal processing, human-computer interaction. Software engineering skills (incl. programming, web development) and a musical background are desirable.

Explainability of AI Music Generation

Supervisor: Prof. Nick Bryan-Kinns

There have been substantial advances in generative Artificial Intelligence (AI) applications from music generation to performance, but most end-users of AI for music have little understanding of how the AI actually works and why it makes the decisions it does. This PhD will research how to create and evaluate more explainable generative AI models for music. The candidate has the opportunity to explore many different approaches to generative music AI and centre their research on its explainability around their own interests. For example, how to build generative AI systems which make their latent space more interactive and understandable for performing musicians. The PhD will build on recent research in the Centre for Digital Music on how generative AI music systems can be made more understandable, explainable, and interactive. It could, for example, take existing RNN approaches to music generation as a start point and research the effectiveness of different training sets, latent space regularisation techniques, and user interfaces for real-time interaction. Further research may include examining different architectures for AI music generation and their usefulness for human-AI music generation depending on the interest and skills of the candidate. The candidate does not need to have existing expertise in AI, HCI, or music, though some would be advantageous.

Film score composer AI assistant: generating expressive mockups

Supervisor: Dr. Mathieu Barthet
in collaboration with Spitfire Audio

Contemporary film score composers often need to produce mockups of an orchestral arrangement before going into the studio. Mockups play an important role in the scoring process as they are used by film directors to assess the relevance and quality of a proposed soundtrack. The quality and size of orchestral sample libraries have greatly improved in the past decades; however, making computer-generated score mockups sound realistic and expressive is still a challenging and time-consuming task.

This PhD will investigate how AI can be used to produce expressive renditions of orchestral arrangements using digital audio workstations. Interviews with film score composers will be conducted to identify relationships between musical structure and expressive features at the instrument and orchestral levels. This will inform the design of deep learning (DL) techniques that predict control parameter automations acting on musical attributes such as dynamics, timbre, and articulations. Transformer models handling sequential data will be considered for musical interpretation tasks that consist of translating nominal score information into expressive features. Symbolic MIDI and token-based music performance datasets will be used to train and evaluate the DL models. Content-based feature extraction techniques can also be considered to extract expressive information from audio and augment the symbolic music datasets.

The candidate will collaborate with Spitfire Audio and have access to software and data such as the Spitfire Symphony Orchestra. This state-of-the-art library includes woodwinds, brass & string instrument samples in a huge selection of playing styles, with numerous dynamic layers, release triggers, round robins, and true and performance legatos.

The candidate should have experience in at least one of the following scientific areas or equivalent: music information retrieval, machine learning, music signal processing, musical timbre modeling, new interfaces for musical expression, human-computer interaction. Programming skills (e.g. Python, C/C++) and music performance/composition/production backgrounds are desirable.

Identification of interactive motion patterns in response to music

Supervisor: Dr. Ekaterina Ivanova

The first time you practice ice skating, holding the hand of an experienced skater may give you confidence and help you improve performance. Indeed, we recently observed that haptic human-human interaction improves sensorimotor performance and learning in both partners (Ivanova et al. 2022, Sci. Rep.). We also know that music induces neuroplasticity and can help boost motor training (Ripollés et al., 2016, Brain Imaging Behav.). Even though the role of music in synchronising motions between partners is evident in activities like dancing, the links between music and haptic interaction have not been systematically investigated.

This PhD project will investigate how different aspects of music affect haptic human-human interaction, and how connected partners integrate auditory information with haptic feedback during interactive scenarios. The subjects’ motor behaviour will be studied using a dual robotic interface and electromyography (EMG) to measure muscle activity. Bayesian theory and machine learning methods will be applied to identify and model the music-mediated interactive motion patterns. The results may help design human-like multimodal robotic systems and provide optimal assistance during neurorehabilitation or physical training.

Incorporating hierarchical structure into automatic music labelling

Supervisor: Dr. Johan Pauwels

Hierarchies are omnipresent in music labels. For example, consider a 4-note chord, which can be reduced to its 3 most important notes or extended to a number of 5-note chords. Likewise, a time signature tells us both the number of beats per measure (2, 3, 4) and the way beats are subdivided (into 2 or 3 parts). Since many approaches to automatic music labelling consist of training a deep learning classifier, it makes intuitive sense to use a hierarchy of classes instead of a single flat level. Otherwise the separation between all classes will be considered equally important, which is clearly not the case for hierarchical labels. Nonetheless, flat classification has been the default approach up to now. However, new techniques have proven promising in the context of musical instrument and everyday sound recognition. The aim of this PhD project is to explore how existing hierarchical classification techniques can be adapted to specific music labelling tasks and to develop new techniques. The proposed labelling tasks are chord recognition and time signature determination, but these can be adapted to suit the candidate’s interests.
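To illustrate why a flat classifier treats all confusions as equally bad while a hierarchy does not, the sketch below encodes chord labels as paths in a tree and measures the tree distance between two labels. The three-level taxonomy (root, family, chord type) and the function name are hypothetical and only serve as an example; they are not a proposal for the actual label hierarchy.

```python
def hierarchical_distance(label_a, label_b):
    """Number of tree edges between two labels, each given as a
    path from the top of the hierarchy down to the leaf."""
    common = 0
    for a, b in zip(label_a, label_b):
        if a != b:
            break
        common += 1
    return (len(label_a) - common) + (len(label_b) - common)


# Hypothetical taxonomy: (root note, chord family, chord type).
C_MAJ = ("C", "major", "maj")
C_MAJ7 = ("C", "major", "maj7")
C_MIN = ("C", "minor", "min")
G_MAJ = ("G", "major", "maj")

# A flat 0/1 error would score all three confusions below as 1.
print(hierarchical_distance(C_MAJ, C_MAJ7))  # 2: same root and family
print(hierarchical_distance(C_MAJ, C_MIN))   # 4: same root only
print(hierarchical_distance(C_MAJ, G_MAJ))   # 6: nothing in common
```

A distance of this kind could, for instance, weight a training loss or an evaluation metric so that confusing a chord with its own extension costs less than confusing it with an unrelated chord.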

The ideal candidate has an interest in the latest deep learning techniques, but is willing to fall back on classical machine learning to create strong baselines or if the situation otherwise requires it. An amateur-level understanding of music theory is useful, but its absence should not hold you back from applying, since a motivated candidate can easily learn it.

Machine Learning of Meta-Composition Heuristics

Supervisor: Prof. Simon Dixon
in collaboration with DAACI

In collaboration with DAACI this research will focus on automating the process of analysing and generating meta composition rules (heuristics) from existing scores and real audio in a data-driven manner.

This research will fuel the ability for a compositional meta-sequencer to analyse its input and generate new models for output. In general, a sequencer allows you to enter specific notes that are fixed when played back. In contrast, a meta-sequencer allows you to express how these notes can be generated from heuristics that are capable of creating generative composition.

The project will involve collecting data using state of the art deep-learning-based MIR, including advanced chord transcriptions, structural segmentation into meaningful musical units, and the extraction and tagging of musical textures. This will enable a comprehensive approach to creating an artificial musical mind that can absorb musical culture and reproduce it in a meaningful way for different audiences with different requirements and understandings of different codes and conventions.

We would welcome a student who can appreciate the cross-disciplinary nature of this work and propose their own ideas for this project based on the application of DL/AI for extracting and generating compositional heuristics for generation of emotionally responsive music.

This project is a collaboration with DAACI, a London-based music generation start-up. DAACI will provide access to a range of proprietary software and high-quality datasets and their team of trained musicians and musicologists will be available for task-specific data annotation tasks.

Machine learning of physical models

Supervisor: Prof. Josh Reiss
in collaboration with Nemisindo

Physical models of sound generating phenomena are widely used in digital musical instruments, noise and vibration modelling, and sound effects. They can be incredibly high quality, but they also often have a large number of free parameters that may not be specified just from an understanding of the phenomenon.

Machine learning from sample libraries could be the key to improving the physical models and speeding up the design process. Not only can optimisation approaches be used to select parameter values such that the output of the model matches samples, the accuracy of such an approach will give us insight into the limitations of a model. It also provides the opportunity to explore the overall performance of different physical modelling approaches, and to find out whether a model can be generalised to cover a large number of sounds, with a relatively small number of exposed parameters.

This work will explore such approaches. It will build on recent high impact research from the team in relation to optimisation of sound effect synthesis models. Existing physical models will be used, with parameter optimisation based on gradient descent. Performance will be compared against recent neural synthesis approaches, which often provide high quality synthesis but lack a physical basis. It will also seek to measure the extent to which entire sample libraries could be replaced by a small number of physical models with parameters set to match the samples in the library.
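The idea of setting model parameters by gradient descent so that the output matches a sample can be sketched on a toy problem. Here a single exponentially damped sinusoid stands in for a physical model, a synthetic signal stands in for a recorded sample, and the frequency is assumed known so that only amplitude and decay are fitted. All names, constants and the choice of mean squared error are illustrative assumptions, not the team's actual models or loss.

```python
import math

SAMPLE_RATE = 100   # samples per second (toy value)
FREQ_HZ = 5.0       # assumed known, so only two parameters are fitted


def render(amplitude, decay, n=SAMPLE_RATE):
    """Toy 'physical model': one exponentially damped sinusoid."""
    return [amplitude * math.exp(-decay * (i / SAMPLE_RATE))
            * math.cos(2 * math.pi * FREQ_HZ * (i / SAMPLE_RATE))
            for i in range(n)]


def fit(target, amplitude=0.5, decay=0.5, lr=0.5, steps=8000):
    """Plain gradient descent on mean squared error, using
    hand-derived gradients for amplitude and decay."""
    n = len(target)
    times = [i / SAMPLE_RATE for i in range(n)]
    carrier = [math.cos(2 * math.pi * FREQ_HZ * t) for t in times]
    for _ in range(steps):
        grad_a = 0.0
        grad_d = 0.0
        for t, c, y in zip(times, carrier, target):
            basis = math.exp(-decay * t) * c       # d(model)/d(amplitude)
            residual = amplitude * basis - y
            grad_a += 2 * residual * basis / n
            grad_d += 2 * residual * (-t * amplitude * basis) / n
        amplitude -= lr * grad_a
        decay -= lr * grad_d
    return amplitude, decay


# Synthetic target standing in for a sample from a library.
target = render(amplitude=1.0, decay=2.0)
a_hat, d_hat = fit(target)
print(round(a_hat, 3), round(d_hat, 3))
```

Because the target here is produced by the same model family, the fit can in principle be exact; with real samples, the residual error that remains after optimisation is what would indicate the limitations of the model.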

The student will have the opportunity to work closely with research engineers from the start-up company Nemisindo, though will also have the freedom to take the work in promising new directions. Publishing research in high impact venues will be encouraged.

Multi-sensor mapping for participatory music performances over a local network

Supervisor: Dr. Stefan Poslad

In participatory music performances, the audience takes part in creating rather than just consuming music. While this concept is long established and can take many forms, this idea has recently been combined with networked music performance. In this setup, an audience is equipped with sensors (often mobile phones) connected through a local network, which allow each person to manipulate the sound. To let the creators keep a degree of artistic control, multiple mappings need to be defined between the sensors and musical controls.

In this PhD project, machine learning will be used to create new mappings, both on a technical and a creative level. First, a desirable control surface would be the absolute position of the sensors (i.e. audience members). GPS does not work on such a small scale, often indoors, so alternative localisation techniques need to be explored. Second, to avoid breaking down into random noise, all possible control combinations need to be mapped to a subset of creator-approved combinations.
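The second mapping problem above, restricting arbitrary control combinations to creator-approved ones, can be illustrated with a simple non-learned baseline: snap each incoming control vector to its nearest approved preset. The preset values, the three-control layout and the function name are hypothetical, and Euclidean nearest-neighbour lookup is only a stand-in for the learned mappings the project would investigate.

```python
import math


def snap_to_approved(controls, approved):
    """Return the approved control combination closest (in Euclidean
    distance) to the raw combination coming from the audience."""
    return min(approved, key=lambda preset: math.dist(controls, preset))


# Hypothetical presets: (filter cutoff, reverb amount, tempo scale),
# each normalised to the range 0..1.
APPROVED = [
    (0.0, 0.0, 0.5),
    (1.0, 0.2, 0.5),
    (0.5, 0.9, 0.1),
]

raw = (0.9, 0.3, 0.4)              # aggregated sensor readings
chosen = snap_to_approved(raw, APPROVED)
print(chosen)
```

Here the raw reading is closest to the second preset, so that preset is what would be sent to the sound engine.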

Multimodal AI for musical collaboration in immersive environments

Supervisor: Dr. Mathieu Barthet
in collaboration with PatchXR

There is little research on the application of automatic music generation using deep learning to immersive environments such as virtual reality (VR). VR lends itself well to AI-based interfaces for music co-creation given that it supports embodied interaction, audio, and visual feedback through animated avatars. Music making using VR musical instruments provides a way to collect multimodal data related to musical control, body language, spatial position, and musical content. Such rich data can be harnessed to build intelligent systems for interactive musical collaboration between humans and machines in VR.

This PhD will research interactive music generation techniques facilitating musical collaborations between human and machine-based avatar performers in VR. Deep learning models suitable for real-time human-computer interaction will be developed following two scenarios: call and response, where the human performer and the machine play in turn, and accompaniment generation, where the machine follows the human performer. Reinforcement learning will be considered by developing reward models leveraging multimodal VR and music data. Multiplayer VR environments supporting human-machine music making in genres such as EDM, techno, rock, jazz, and hip hop will be developed in collaboration with PatchXR.
The research will follow an iterative design approach by conducting user studies to assess the models taking into account factors such as creativity support, presence, and evaluation methods for computer-supported cooperative work.

The candidate will collaborate with PatchXR and have access to their VR creative environment technologies including interactive virtual worlds, 50+ VR instruments, modular DSP synths, and effects.

The candidate should have experience in at least one of the following scientific areas or equivalent: machine learning, computational creativity, music information retrieval, music signal processing, new interfaces for musical expression, human-computer interaction. Skills in the following areas are desirable: programming (e.g. Python, C/C++), VR (Unity), node-based language (e.g. Max/MSP, PureData) and/or modular synthesiser, music.

Multimodal Learning for Music Understanding through 3D Hand Pose Estimation from Images and Sounds

Supervisor: Dr. Shanxin Yuan

3D hand pose estimation (HPE) is the task of estimating the 3D locations of hand joints. It has the potential to empower music understanding with the ability to quantify finger movements. Quantitative hand poses enable us to conduct several studies: 1) AI-aided teaching and learning. Statistical analysis of hand poses can help to understand the difficulty levels of different music pieces. Real-time HPE can provide visual guidance for instrument learning. 2) Guiding a transcription system, where hand poses complement sounds to constrain the output space of AI transcription models. Existing acoustic transcription methods have difficulty in dealing with the polyphonic nature of musical sounds. Existing HPE methods take images as input and achieve satisfactory performance on several public datasets, but they have difficulty when fingers are occluded. This problem can be alleviated by using multi-view images, or multimodal images (depth and RGB), as input, but the improvement is incremental. For music understanding, sounds and finger movements are clearly synchronized, yet this fact has been neglected in the HPE community. Moreover, information theory shows that additional information (sounds or images) can improve the performance of a task (HPE from images or music transcription).

This project will explore novel algorithms for music understanding through 3D hand pose estimation from both images and sounds. My existing publications on 3D hand pose estimation [1-7] explore HPE from depth and/or RGB images, and they are well recognized by the community, with 1073 citations combined at the time of writing. My paper [6] explores different modalities as input for HPE, where the depth image is used in training as privileged information. We will first collect a small dataset of images and sounds with annotated hand poses, and then leverage my existing work, using sounds as additional input or as privileged information, to achieve a strong baseline. Then we will explore the latest developments in deep learning to design novel multimodal learning frameworks using sounds and images as input. We will explore 1) recent work on vision transformers [8] as a backbone architecture for image feature extraction, 2) recent work on recurrent neural networks for sound feature extraction, and 3) a novel fusion approach to derive knowledge from the sound and image features. The resulting multimodal learning method is expected to significantly improve the performance of 3D hand pose estimation for music understanding.

[1] Spatial Attention Deep Net with Partial PSO for Hierarchical Hybrid Hand Pose Estimation, Yuan et al., ECCV 2016.
[2] BigHand2.2M Benchmark: Hand Pose Data Set and State of the Art Analysis, Yuan et al., CVPR 2017.
[3] First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations, Yuan et al., CVPR 2018.
[4] Depth-Based 3D Hand Pose Estimation: From Current Achievements to Future Goals, Yuan et al., CVPR 2018.
[5] Opening the Black Box: Hierarchical Sampling Optimization for Estimating Human Hand Pose, Yuan et al., TPAMI 2018.
[6] 3D Hand Pose Estimation from RGB Using Privileged Learning with Depth Data, Yuan et al., ICCVW 2019.
[7] The 2017 Hands in the Million Challenge on 3D Hand Pose Estimation, Yuan et al., arXiv 2017.
[8] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, Liu et al., ICCV 2021 (Marr Prize winning paper).

Multitask Learning for Annotation of Jazz Recordings

Supervisor: Prof. Simon Dixon

Many tools exist for automatic annotation of audio, for example transcription of specific [1,2] or multiple instruments [3], beat-tracking [4], chord estimation [5], structure analysis [6], main melody extraction [7] and alignment of audio with scores [8]. While it is possible to chain together such tools to solve practical data processing problems, errors tend to compound as potentially useful information is discarded along the processing chain. In machine learning, improved generalisation is reported for models trained on multiple tasks, where the tasks have access to a shared representation of the data. This project will employ multi-task learning to transcribe improvised solos in their rhythmic and harmonic context, possibly making use of auxiliary materials such as lead sheets. Building on the work of two recent projects, Dig That Lick and JazzDAP, one objective of this project would be to facilitate analysis of jazz improvisation and provide insight into the development and spread of musical ideas such as patterns and licks.

[1] Kong et al. (2021), High-Resolution Piano Transcription With Pedals by Regressing Onset and Offset Times, IEEE Trans. Audio, Speech and Language Processing, 29:3707-3717.
[2] Hawthorne et al. (2018), Onsets and Frames: Dual-Objective Piano Transcription, 19th International Society for Music Information Retrieval Conference.
[3] Gardner et al. (2021), MT3: Multi-Task Multitrack Music Transcription, arXiv preprint arXiv:2111.03017.
[4] Böck et al. (2016), Joint Beat and Downbeat Tracking with Recurrent Neural Networks, 17th International Society for Music Information Retrieval Conference, 255-261.
[5] Korzeniowski & Widmer (2018), Improved Chord Recognition by Combining Duration and Harmonic Language Models, 19th International Society for Music Information Retrieval Conference.
[6] Nieto et al. (2020), Audio-Based Music Structure Analysis: Current Trends, Open Challenges, and Applications, Trans. International Society for Music Information Retrieval, 3(1):246-263.
[7] Basaran et al. (2018), Main Melody Estimation with Source-Filter NMF and CRNN, 19th International Society for Music Information Retrieval Conference, 82-89.
[8] Agrawal et al. (2022), A Convolutional-Attentional Neural Framework for Structure-Aware Performance-Score Synchronization, IEEE Signal Processing Letters, 29:344-348.

Navigable audio transformations for the Digital Audio Workstation

Supervisor: Dr George Fazekas
in collaboration with a well-known music technology company

While most users of audio equipment develop the capability to navigate the acoustics of sound intuitively, e.g. are able to differentiate a piano sound from a saxophone at the simplest level, or to differentiate between several piano shapes at a more subtle level, it is usually difficult for them to relate acoustic differences to audio transformation operators such as effect controls or synthesiser controls.

A fundamental challenge behind this is how to encode transformations of timbre rather than timbre itself. Fundamental research in this direction has been started, e.g. at the MIT Media Lab by Charles Holbrow [1, 2]. The goal of this PhD is therefore to research ways to encode and navigate the space of audio transformations, such that this encoding can support automations which will make audio content production more accessible to the general public. Rather than starting from a holistic and exhaustive view of the whole music production process [1], the proposed PhD project will adopt a stepwise approach through increasing levels of processing complexity:
• Is it possible to encode subtractive synthesis, which applies a series of transformations to a basic oscillating sound, in a way that makes it invertible, so that users could discover the synthesis parameters necessary to produce a sound that they have heard before?
• Which processes can be encoded and simplified in the domain of live console mixing, where a series of transformations is applied to incoming live audio?
• Ultimately, can the full chain of transformations applied in a digital audio workstation be encoded and automated, and if not, which proportion of it can be encoded, inverted and automated?

The proposed PhD is thus expected to address the fundamental question of the simplified encoding and automation of audio transformations, via investigations and proofs of concept applied to topics of interest for the audio industry.
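The first question above, recovering synthesis parameters from a sound, can be illustrated with a deliberately tiny analysis-by-synthesis sketch. It is not the method proposed for the project: it assumes a single unknown parameter (the cutoff of a one-pole low-pass filter), a known oscillator frequency, and a noise-free target, and it simply re-synthesises each candidate setting and keeps the closest match. All function names and values are hypothetical.

```python
import math

SAMPLE_RATE = 8000  # Hz; kept low so the pure-Python search stays fast


def sawtooth(freq_hz, num_samples, sample_rate=SAMPLE_RATE):
    """Naive (non-band-limited) sawtooth oscillator in the range [-1, 1)."""
    return [2.0 * ((freq_hz * n / sample_rate) % 1.0) - 1.0
            for n in range(num_samples)]


def one_pole_lowpass(signal, cutoff_hz, sample_rate=SAMPLE_RATE):
    """One-pole low-pass filter: y[n] = y[n-1] + a * (x[n] - y[n-1])."""
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in signal:
        y += a * (x - y)
        out.append(y)
    return out


def synthesise(cutoff_hz, freq_hz=220.0, num_samples=512):
    """Toy subtractive synth: sawtooth oscillator followed by a low-pass filter."""
    return one_pole_lowpass(sawtooth(freq_hz, num_samples), cutoff_hz)


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def estimate_cutoff(target, candidates):
    """Grid search: re-synthesise every candidate and keep the closest one."""
    return min(candidates,
               key=lambda c: mean_squared_error(synthesise(c), target))


# Hypothetical search grid: 100 Hz to 3000 Hz in 100 Hz steps.
candidates = [100.0 * k for k in range(1, 31)]
target = synthesise(1200.0)  # the "sound heard before"
estimated = estimate_cutoff(target, candidates)
print(estimated)
```

A realistic system would face many interacting parameters, unknown pitch and phase, and perceptual rather than waveform-level similarity, which is where learned encodings of the transformation space become relevant.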

[1] Holbrow, Charles. (2019). Turning the DAW Inside Out. Proc. Audio Engineering Society Conference: 146th Convention.
[2] Holbrow, Charles. (2021) Fluid Music – A New Model for Radically Collaborative Music Production. PhD, MIT.

Neuro-symbolic Automated Music Composition

Supervisor: Prof. Simon Colton

Generative deep learning is revolutionising the creative industries and is already producing amazing tools for music composition. The standard approach of training a neural model over a corpus of musical compositions, then using the trained model to produce music from a seed, has many use cases. However, it is difficult to get this approach to produce communicable insights into music composition which could potentially add to musical culture. In this project, we will explore neuro-symbolic approaches to automated music composition, where both generative deep learning (such as RNNs, LSTMs, Transformers, Diffusion Models, etc.) and more traditional symbolic AI approaches (such as constraint solving, planning, rule-based systems and evolutionary computation) are combined. One such approach will be to write symbolic AI systems able to produce music according to rules of harmony, melody, counterpoint, rhythm, form, etc., but where certain decisions are informed by a pre-trained neural model. This will mimic how human composers write music within a musical system or genre, yet constantly play and listen to their compositions, so that they choose the best, according to their musical taste and/or an understanding of the genre, from certain options during the composition process.

This approach (and other neuro-symbolic methods that we will explore) has the potential not only to produce communicable musical knowledge, but also to produce musical compositions with purpose and direction. Moving to the meta-level, it should be possible to automatically produce rule-based generative systems, again informed by deep learning, from which we can learn new things about music. This project will contribute to the explainable AI and computational creativity subfields of AI, as well as to musical culture.
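As a rough illustration of the first approach described above, the sketch below lets hard symbolic rules enumerate the legal next notes and lets a scoring function choose among them. The scoring function is only a hand-written stand-in for a pre-trained neural model, and the rules (C major, leaps of at most a fifth, no repeated notes) are assumptions made for the example rather than part of the proposal.

```python
C_MAJOR_PITCH_CLASSES = {0, 2, 4, 5, 7, 9, 11}
MAX_LEAP = 7  # semitones; hypothetical melodic rule


def legal_next_notes(previous, low=60, high=72):
    """Symbolic part: enumerate MIDI notes permitted by the hard rules."""
    return [n for n in range(low, high + 1)
            if n % 12 in C_MAJOR_PITCH_CLASSES
            and n != previous
            and abs(n - previous) <= MAX_LEAP]


def stub_neural_score(melody, candidate):
    """Stand-in for a pre-trained model's preference (hypothetical heuristic)."""
    interval = abs(candidate - melody[-1])
    score = -abs(interval - 2)  # favours stepwise motion by a major second
    if len(melody) >= 2 and candidate == melody[-2]:
        score -= 3  # discourages oscillating between two notes
    return score


def compose(seed, length, scorer=stub_neural_score):
    """Rule-based generation where each choice is ranked by the scorer."""
    melody = list(seed)
    while len(melody) < length:
        options = legal_next_notes(melody[-1])
        melody.append(max(options, key=lambda n: scorer(melody, n)))
    return melody


melody = compose([60], 8)
print(melody)
```

Because the rules are explicit, every note in the output can be traced to the constraint that allowed it and the score that selected it, which is the kind of communicable knowledge a purely neural generator does not readily provide.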

Perceptually-based mastering-referenced replay-system compensation

Supervisor: Prof. Mark Sandler
in collaboration with KEF

Audiophiles strive to reproduce recorded music in their homes as faithfully and accurately as possible. Focus on “signal integrity” is frequently the primary aim for audiophile music reproduction, but this emphasis seems misplaced. Toole’s “circle of confusion” concept [1] hypothesises that the inverse characteristics of the loudspeaker and room used to make or produce recorded music are inadvertently embedded into all audio masters.

This project seeks to explore the use of data collection, DSP and machine learning to break the “circle of confusion” problem. The general objective is to switch the replay target to be based not on the mastered audio signal, but instead on the audio that was perceived in the mastering control room. Practically, this will involve development and application of AI style transfer techniques that can automatically adapt a consumer replay system to sound more like the replay system on which the original music recording was mastered.

The research topic will therefore focus on investigating one or more of the following challenges:

1. Working with KEF’s R&D team, design an effective measurement method to identify key parameters that influence the overall audio characteristics of a replay system (loudspeakers, room and other playback hardware), based on easy and simple measurements.
2. Develop AI-driven, automatic DSP compensation algorithms that use data derived from this measurement method to make a first replay system sound more similar to a second replay system.
3. Working with KEF R&D and third party industry and academic partners, curate a measurement database of mastering facility characteristics, which can be accompanied by a suitably varied and extensive corpus of music recordings mastered in those studios.
4. Using appropriate machine learning techniques informed by this database and corpus, extend the correction method so that it can be applied, approximately, to music recordings mastered in facilities that do not feature in the database.

It is expected that some combination of Diffusion Networks and DDSP (Differentiable DSP) will be a suitable Deep Learning approach initially.
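For orientation only, a much simpler non-learned baseline for the second challenge can be sketched as follows. It assumes both replay systems are summarised by a magnitude response in dB over the same frequency bands and computes a clamped per-band gain that moves the first system towards the second. The band layout, the response values, and the 12 dB limit are invented for illustration and are not measurements.

```python
def compensation_gains_db(source_db, target_db, max_gain_db=12.0):
    """Per-band gain (dB) to apply to the source system to approximate the target."""
    if len(source_db) != len(target_db):
        raise ValueError("responses must cover the same frequency bands")
    return [max(-max_gain_db, min(max_gain_db, t - s))
            for s, t in zip(source_db, target_db)]


def apply_gains_db(response_db, gains_db):
    """Gains in dB add to a magnitude response expressed in dB."""
    return [r + g for r, g in zip(response_db, gains_db)]


# Illustrative octave bands and made-up responses (dB relative to 1 kHz).
bands_hz = [63, 125, 250, 500, 1000, 2000, 4000, 8000]
consumer_db = [-6.0, -2.0, 1.0, 0.0, 0.5, 2.0, -1.0, -4.0]
mastering_db = [0.0, 0.5, 0.0, 0.0, 0.0, -0.5, 0.0, -1.0]

gains = compensation_gains_db(consumer_db, mastering_db)
corrected = apply_gains_db(consumer_db, gains)

for band, gain in zip(bands_hz, gains):
    print(f"{band:>5} Hz: {gain:+.1f} dB")
```

Such a static equaliser ignores time-domain behaviour, room interaction, and perception, which is part of the motivation for the learned approaches named above.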

[1] F. E. Toole, Sound reproduction: loudspeakers and rooms. Amsterdam; Boston: Elsevier, 2008.

Probabilistic learning of sequential structure in music cognition

Supervisor: Dr. Marcus Pearce

This project will combine computational modelling and empirical experiments with human participants to understand how listeners learn the syntactic structure of music and how this learning impacts on perception and aesthetic experience of music. Our existing research shows that listeners generate probabilistic predictions for the pitch, timing and harmony of forthcoming musical structures, which derive from implicit statistical learning over timescales ranging from long-term acquisition of the structure of musical styles to short-term learning of repeating structure within a piece of music. This represents a dynamic process of model construction, in which our brains attempt to extract as much structure as possible from the auditory environment to predict more accurately forthcoming auditory events. Expectations also have a special role to play in musical appreciation, since pleasure can arise both from predictable events, signalling a successful predictive model, and surprising events, which increase physiological arousal and drive learning. Aesthetic pleasure is maximal at intermediate degrees of unpredictability and uncertainty.
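The notion of probabilistic prediction learned from exposure can be made concrete with a minimal sketch. It assumes a first-order (bigram) model over MIDI pitches with add-one smoothing, trained on a made-up toy corpus, and measures how unexpected a continuation is as its information content, -log2 p(next | previous). Models used in this line of research, such as the PPM-Decay model listed in the references below, are considerably richer, so this is only a simplified stand-in.

```python
import math
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Count how often each pitch follows each other pitch."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def probability(counts, alphabet, prev, nxt):
    """Add-one smoothed estimate of p(nxt | prev)."""
    total = sum(counts[prev].values())
    return (counts[prev][nxt] + 1) / (total + len(alphabet))


def information_content(counts, alphabet, prev, nxt):
    """Surprisal in bits: higher means less expected."""
    return -math.log2(probability(counts, alphabet, prev, nxt))


# Toy "listening history" of melodies as MIDI pitches (illustrative only).
corpus = [
    [60, 62, 64, 65, 67],
    [60, 62, 64, 62, 60],
    [67, 65, 64, 62, 60],
]
alphabet = sorted({pitch for seq in corpus for pitch in seq})
model = train_bigram(corpus)

expected = information_content(model, alphabet, 62, 64)    # heard before
surprising = information_content(model, alphabet, 62, 67)  # never heard
print(f"62 -> 64: {expected:.2f} bits, 62 -> 67: {surprising:.2f} bits")
```

Training the same model on a different corpus yields different surprisal values for the same melody, which is a simple way to see how listening history could shape expectation.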

There are several promising avenues for further investigation: 1) Research is required to establish the precise psychological mechanisms by which statistical learning and probabilistic prediction give rise to musical pleasure. 2) Statistical learning implies that individuals with different listening histories will perceive music in different ways as a function of their experience; research is required to test this hypothesis by developing embodied artificial systems that simulate developmental trajectories in acquisition of culture-specific musical knowledge and predicting differences between musical cultures. 3) Current models can perform better than humans, motivating research on memory constraints to better simulate human learning of structural regularities. 4) Research is required to extend existing models to process sound and music at different hierarchical levels and time scales, ranging from high-level musical form (motifs, phrases, sections, parts), through symbolic notes, to acoustic processing of raw auditory input, including polyphonic music and multi-channel auditory scenes. The overall goal is to develop a complete computational model of music cognition. Within this general approach, there is scope to focus the project on different musical parameters (e.g., melody, rhythm, harmony), empirical methods (behavioural, EEG), computational approaches (e.g., structured probabilistic models, empirical Bayesian methods, neural networks) and musical styles, as well as non-musical auditory sequences.

Kaplan, T., Cannon, J., Jamone, L., & Pearce, M. (2022). Modeling enculturated bias in entrainment to rhythmic patterns. PLOS Computational Biology, 18(9), e1010579.
Harrison, P.M.C., Bianco, R., Chait, M., & Pearce, M. T. (2020). PPM-Decay: A computational model of auditory prediction with memory decay. PLOS Computational Biology, 16, e1008304.
Pearce, M. T. (2018). Statistical learning and probabilistic prediction in music cognition: mechanisms of stylistic enculturation. Annals of the New York Academy of Sciences, 1423, 378-395.

Timbre Tools for the Digital Instrument Maker

Supervisor: Dr. Charalampos Saitis
in collaboration with Bela

This PhD addresses the role of timbre in the design of sound synthesis and AI tools for digital instrument makers. Timbre is among the most evocative yet elusive attributes of music. It is through timbre that musicians can emote by manipulating the physical response of their acoustic instruments. Yet timbre is conspicuously absent from the digital luthier’s toolbox. Much synth design is still based on concepts from early analog and digital synthesis, or emulation of it using more recent techniques, while an audio engineer’s workbench is still based mainly on historical tools like oscilloscopes and signal generators.

Teaming up with Bela, this project will implement novel Bela IDE features at the intersection of creativity, education, and research, enabling makers to create digital interactions with an understanding of how we listen. This will involve developing flexible, open-ended, real-time tools for analysis and visualisation of timbre (and possibly for synthesising sounds), which could be plugged into any Bela project under development. Examples include tools enabling interaction with and interpretation of psychoacoustical and semantic timbre spaces, as well as leveraging neural audio synthesis techniques for real-time timbral mappings and sound morphing. The premise of this project is to promote a learn-by-making approach: through creating digital instruments using in-browser timbre tools amongst other tools, people would learn more about sound synthesis and make more interesting instruments.