This page outlines PhD topics proposed by AIM academics and industry partners for September 2022 entry. You are welcome to apply for one or more of these topics, or to propose your own PhD topic, according to the application guidelines specified here. In either case, we strongly encourage you to contact your chosen supervisor for an informal chat; this will also help you to put together your research proposal, which is an integral part of your application. Your research proposal should describe your preferred PhD topic in detail, but we also strongly encourage you to list three possible topics in your application.

PhD Topic – Supervisor / Industry Partner
Active learning for interactive music transcription – Johan Pauwels
Automatic construction of meta-composition heuristics from MIDI scores and real audio – Simon Colton / DAACI
Combining deep learning and music domain knowledge – Simon Dixon / Apple
Computational orchestration models for creative audio and music generation – Charalampos Saitis and George Fazekas
Creative deep learning approaches to automatic mixing – Josh Reiss
Deep audio inpainting for musical signals – Lin Wang
Film score composer AI assistant: generating expressive mockups – Mathieu Barthet / Spitfire Audio
Generative models for music audio representation and understanding – Simon Dixon / Spotify
High-performance embedded hardware for intelligent musical instruments – Andrew McPherson
Intelligent audio and music editing with deep learning – George Fazekas
Latent dynamics for music-conditioned dance generation – Changjae Oh
Latent spaces for human-AI music generation – Nick Bryan-Kinns
Machine learning of physical models – Josh Reiss / Nemisindo
Modelling and synthesising articulation on acoustic and digital instruments – Andrew McPherson / OHMI
Multimodal AI for musical collaboration in immersive environments – Mathieu Barthet / PatchXR
Multitask modelling for overlapping sound sources – Huy Phan
Musification of physical world changes and interaction – Stefan Poslad
Performance rendering for music generation systems – Simon Dixon / DAACI
Probabilistic learning of sequential structure in music cognition – Marcus Pearce
Real-time timbral mapping for synthesized percussive performance – Andrew McPherson and Charalampos Saitis / Ableton
Resource-efficient models for music understanding – Emmanouil Benetos and Phillip Stanley-Marbell
Self-supervision in machine listening – Emmanouil Benetos / Bytedance
Timbre tools for the digital instrument maker – Charalampos Saitis / Bela
User-driven deep music generation in digital audio workstations – Mathieu Barthet and Gaëtan Hadjeres / Sony CSL
Using Signal-informed Source Separation (SISS) principles to improve instrument separation from legacy recordings – Mark Sandler and Emmanouil Benetos

Active Learning for Interactive Music Transcription

Supervisor: Dr. Johan Pauwels

Manually transcribing music is labour-intensive, yet it remains the dominant way of creating data to train automatic transcription systems (in the hope that one day they can replace manual transcription). One reason the manual process is so time-consuming is that annotations are typically made for entire pieces, even for sections that in hindsight contribute little to the improvement of an automatic system, because the previous version of the system already transcribed them successfully. The underlying cause is that the current state of the art in music transcription, regardless of the exact characteristic being transcribed (e.g. tempo, melody, instrumentation, harmony), is based on deep learning techniques, which are not very good at representing uncertainty. The field of active learning aims to include uncertainty in the training process, so that an iterative workflow can be established in which the segment deemed most informative is presented for transcription first. In this project, the leading active learning approaches will be adapted to one or more music transcription tasks. The result will be integrated into a browser-based transcription tool, which will subsequently be used to study differences in personal preference and subjectivity between transcribers.
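The iterative loop described above can be sketched with a simple uncertainty-sampling heuristic: rank unannotated segments by the model's predictive entropy and present the most uncertain one to the annotator first. A minimal illustration, assuming per-frame softmax outputs (all function names here are hypothetical, not part of any existing system):

```python
import numpy as np

def segment_entropy(probs):
    """Mean per-frame predictive entropy of one segment.

    probs: (frames, classes) array of softmax outputs."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=1)))

def most_informative(segments):
    """Return the index of the segment the annotator should see first:
    the one whose transcription the model is least certain about."""
    return max(range(len(segments)), key=lambda i: segment_entropy(segments[i]))

# Toy example: two 3-frame segments over 4 pitch classes.
confident = np.array([[0.97, 0.01, 0.01, 0.01]] * 3)   # model is sure
uncertain = np.array([[0.25, 0.25, 0.25, 0.25]] * 3)   # model is guessing
print(most_informative([confident, uncertain]))  # -> 1
```

A real system would recompute these scores after each round of annotation and retraining, so that newly confident segments drop out of the queue.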

Automatic construction of meta-composition heuristics from MIDI scores and real audio

Supervisor: Prof. Simon Colton
in collaboration with DAACI

This research will give a compositional meta-sequencer the ability to analyse its input and generate new models for its output. In general, a sequencer lets you enter specific notes and recordings, which are then performed. A meta-sequencer, in contrast, lets you express how those notes can be generated from heuristics capable of producing compositions. This research will focus on automating, in a data-driven manner, the process of generating meta-composition heuristics from existing scores and real audio. This includes collecting data using state-of-the-art deep-learning-based MIR, including advanced chord transcription, structural segmentation into meaningful musical units (“form atoms”), and the extraction and tagging of musical textures. This will enable a comprehensive approach to creating an artificial musical mind that can absorb musical culture and reproduce it in a meaningful way for different audiences with different requirements and understandings of different codes and conventions. We would welcome a student who appreciates the cross-disciplinary nature of this work and can contribute their own ideas on applying DL/AI to extract and generate compositional heuristics for the generation of emotionally responsive music.

This project is a collaboration with DAACI, a London-based music generation start-up. DAACI will provide access to a range of proprietary software and high-quality datasets and their team of trained musicians and musicologists will be available for task-specific data annotation tasks.

Combining deep learning and music domain knowledge
Supervisor: Prof. Simon Dixon
in collaboration with Apple

The advent of deep learning has significantly boosted the accuracy of analytic and generative music models, providing an efficient framework to extract knowledge from the data. However, the paradigm shift from feature-engineering and logic to latent spaces and trainable operations comes at the cost of interpretability and expressive power. We invite you to submit a project proposal for a PhD exploring the combination of deep learning and the rich body of prior knowledge that we can derive from music theory. We imagine that prior knowledge can be used to complement or constrain the kind of patterns that are learnable, leading to more interpretable models and lower data requirements. Depending on your interests, you might choose to focus on analytic or generative models in the audio or symbolic music domain. We look forward to your proposal!

The successful applicant would be co-supervised by Prof. Simon Dixon (QMUL) and Dr. Bruno Di Giorgi (Apple). The goal of the collaboration is to advance fundamental methods in MIR. The expertise of the supervisory team relates to Western music.

Computational orchestration models for creative audio and music generation

Supervisors: Dr. Charalampos Saitis and Dr. George Fazekas

This PhD project addresses the problem of modelling perceptual effects of orchestration in generative systems aimed at enhancing musical creativity through artificial intelligence. Orchestration is known as the art of combining musical sounds to achieve a particular auditory result. While existing generative music systems have been successful in either producing a symbolic score or synthesising high quality audio, there is still little practical knowledge of how the two domains can be unified to create systems that are more complete and more creative. Progress is hindered by the lack of computational orchestration models capable of linking symbolic scores, audio signals, and perceptual analyses. The project will address this gap, building on recent efforts to develop a psychological foundation for a theory of orchestration practice based on perceptual principles associated with auditory scene analysis and musical timbre. Specifically, we aim to develop deep neural network architectures that can learn to analyse multi-track audio recordings on the basis of ontology and taxonomic structures for orchestral grouping effects and auditory grouping processes. Applications of such models include predicting the perceptual results of combining multiple musical sources, as well as the development of neural synthesis methods for complex, multi-timbre orchestral textures.

Creative deep learning approaches to automatic mixing

Supervisor: Prof. Josh Reiss

Automatic mixing aims to intelligently generate a high quality sound mix out of an unknown set of multichannel inputs, with minimal assumptions about the content and minimal intervention by a human engineer. Most approaches so far have focused on devising systems that aim to reproduce the mixing decisions of a skilled audio engineer. Hence, they seek to find settings on popular audio effects that a human might use, given that input content.

This is also the case with the deep learning approaches to automatic mixing attempted so far, which embed audio effects into the network. This results in easily interpretable and adjustable mixes, but limits the types of mixes that can result.

The aim of this PhD topic is to use neural network architectures to explore more radical approaches to automatic mixing. What if the constraints (applying traditional effects, mimicking human mixing approaches) were removed? What if the intelligent system were free to apply any processing it can, so long as the output mix is in some sense preferred? Perhaps new approaches to mixing could be found. For instance, perceptual masking could be reduced by placing different frequency bins of each source in different spatial locations. This approach does not mirror natural spatial positioning and would be far too time-consuming to perform manually, but it might create a far more immersive mix than other approaches.
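The frequency-bin idea can be made concrete in a deliberately crude, single-frame sketch: interleave the FFT bins of each source between the left and right channels, with a per-source offset so that overlapping bins of different sources land in different ears. This is an illustrative toy, not a production mixing method; a real system would work frame by frame with an STFT and a perceptual masking model:

```python
import numpy as np

def bin_panned_mix(sources):
    """Pan alternating FFT bins of each mono source hard left/right,
    offset per source, so spectrally overlapping content of different
    sources ends up in different channels. Returns a (2, n) stereo mix."""
    n = len(sources[0])
    left = np.zeros(n)
    right = np.zeros(n)
    for s_idx, src in enumerate(sources):
        spec = np.fft.rfft(src)
        # Even bins of source 0 go left; for source 1 the parity flips.
        mask = (np.arange(len(spec)) + s_idx) % 2 == 0
        left += np.fft.irfft(np.where(mask, spec, 0), n=n)
        right += np.fft.irfft(np.where(mask, 0, spec), n=n)
    return np.stack([left, right])
```

By linearity of the inverse FFT, the sum of the two channels reconstructs the plain mono mix, so no energy is lost; only its spatial distribution changes.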

For this topic, the researcher will have access to a large data set of multitrack content and mixes, as well as evaluation of those mixes. It also builds on a large body of automatic mixing research at QMUL. Knowledge of deep learning is preferred, but the research is exploratory. Hence, the researcher will be encouraged to experiment with neural network architectures, with variations on the training data, and with choices of target and how it is reached.

Deep Audio Inpainting for Musical Signals

Supervisor: Dr. Lin Wang

Real-life audio signals often suffer from local degradation and lost information. Examples include short audio intervals corrupted by impulse noise and clicks, or a clip of audio wiped out due to damaged digital media or packet loss in audio transmission. Audio inpainting is a class of techniques that aim to restore the lost information with newly generated samples without introducing audible artifacts [1]. In addition to digital restoration, audio inpainting also finds wide applications in audio editing (e.g. noise removal in live music recording) and music enhancement (e.g. audio bandwidth extension and super-resolution).

Approaches to audio inpainting can be classified by the length of the lost information, i.e. the gap. In declicking and declipping, for example, corruption may occur frequently but is mostly confined to a few milliseconds or less. On the other hand, gaps on a scale of hundreds of milliseconds or even seconds may occur due to digital media damage, transmission loss, and audio editing. While intensive work has been done on inpainting short gaps, long audio inpainting remains a challenging problem due to the high-dimensional, complex and uncorrelated nature of audio features. Recently, inspired by the tremendous success of deep learning in image and video inpainting, deep-learning-based approaches have started attracting attention in the research community, but the work is still at an early stage.

The PhD project intends to investigate the possibility of adapting deep learning frameworks from various domains, including audio synthesis and image inpainting, to audio inpainting. A particular focus will be placed on recovering musical signals with long gaps of missing information, and on reconstructing super-resolution audio signals through bandwidth extension, both challenging tasks in the current state of the art. The time-frequency sparsity, structure and repetition of musical signals, as well as auditory perception psychology and music semantics, will be jointly exploited to achieve this goal. The research will combine one or more of the following methodologies:
1) Traditional musical signal processing approaches, e.g. exemplar-based method [2].
2) Deep learning approaches, e.g. convolutional and generative adversarial network [3, 4].
3) Audio-visual approaches exploiting visual context, e.g. video of instrument performance [5].
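To give a flavour of the exemplar-based route, a naive single-exemplar version can be written in a few lines: find the candidate segment whose left and right contexts best match the contexts around the gap, and copy it in. This is a toy sketch (all names hypothetical) that ignores the crossfading and graph structure a real similarity-graph method would use:

```python
import numpy as np

def exemplar_fill(x, gap_start, gap_len, ctx=64):
    """Fill x[gap_start:gap_start+gap_len] with the best-matching exemplar:
    the candidate segment whose surrounding context is closest (L2 distance)
    to the context around the gap."""
    left = x[gap_start - ctx:gap_start]
    right = x[gap_start + gap_len:gap_start + gap_len + ctx]
    best, best_cost = None, np.inf
    for i in range(ctx, len(x) - gap_len - ctx):
        if abs(i - gap_start) < gap_len + ctx:  # skip the gap region itself
            continue
        cost = (np.sum((x[i - ctx:i] - left) ** 2)
                + np.sum((x[i + gap_len:i + gap_len + ctx] - right) ** 2))
        if cost < best_cost:
            best, best_cost = i, cost
    y = x.copy()
    y[gap_start:gap_start + gap_len] = x[best:best + gap_len]
    return y
```

On strongly repetitive material (e.g. a steady periodic signal) this already recovers the gap almost exactly, which is precisely the structure-and-repetition property of music that the project proposes to exploit.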

[1] Adler, Audio inpainting, IEEE-TASLP 2011.
[2] Perraudin, Inpainting of long audio segments with similarity graphs, IEEE-TASLP 2018.
[3] Marafioti, GACELA: A Generative adversarial context encoder for long audio inpainting of music, IEEE-JSTSP 2020.
[4] Mokry, Audio inpainting: Revisited and reweighted, IEEE-TASLP 2020.
[5] Zhou, Vision-infused deep audio inpainting, ICCV 2019.

Film score composer AI assistant: generating expressive mockups

Supervisor: Dr. Mathieu Barthet
in collaboration with Spitfire Audio

Contemporary film score composers often need to produce mockups of an orchestral arrangement before going into the studio. Mockups play an important role in the scoring process, as film directors use them to assess the relevance and quality of a proposed soundtrack. The quality and size of orchestral sample libraries have greatly improved in the past decades; however, making computer-generated score mockups sound realistic and expressive remains a challenging and time-consuming task.

This PhD will investigate how AI can be used to produce expressive renditions of orchestral arrangements using digital audio workstations. Interviews with film score composers will be conducted to identify relationships between musical structure and expressive features at the instrument and orchestral levels. This will inform the design of deep learning (DL) techniques aimed at predicting control parameter automations acting on musical attributes such as dynamics, timbre, and articulation. Transformer models handling sequential data will be considered for musical interpretation tasks that translate nominal score information into expressive features. Symbolic MIDI and token-based music performance datasets will be used to train and evaluate the DL models. Content-based feature extraction techniques may also be considered to extract expressive information from audio and augment the symbolic music datasets.

The candidate will interact with Spitfire Audio and have access to software and data such as the Spitfire Symphony Orchestra. This state-of-the-art library includes woodwinds, brass & string instrument samples in a huge selection of playing styles, with numerous dynamic layers, release triggers, round robins, and true and performance legatos.

The candidate should have experience in at least one of the following scientific areas or equivalent: music information retrieval, machine learning, music signal processing, musical timbre modeling, new interfaces for musical expression, human-computer interaction. Programming skills (e.g. Python, C/C++) and music performance/composition/production backgrounds are desirable.

Generative models for music audio representation and understanding

Supervisor: Prof. Simon Dixon
in collaboration with Spotify

Generative models for musical audio are seeing rapid progress in their capacity to produce realistic and musical audio. It has recently been observed that the representations such models produce and employ to generate music can be exploited for downstream tasks, e.g. to produce state-of-the-art results for melody extraction, genre classification, emotion recognition and auto-tagging, and thus to surpass models specifically trained for the task at hand using discriminative algorithms. These surprising results are not well understood and their potential has not yet been harnessed. Research questions include whether generative models yield a better approach to representation learning than existing methods, why that would be the case, and how their advantageous properties can be leveraged to obtain a step change in audio understanding.
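A standard way to quantify how much task-relevant structure a frozen representation carries is a lightweight probe: keep the embeddings fixed and fit only a trivial classifier on top. A sketch with a nearest-class-centroid probe on synthetic "embeddings" (everything here is illustrative, not tied to any particular model):

```python
import numpy as np

def centroid_probe(train_emb, train_labels, test_emb):
    """Evaluate frozen embeddings with a nearest-class-centroid probe:
    if simple geometry in the representation already separates the
    classes, the representation has captured task-relevant structure."""
    classes = np.unique(train_labels)
    centroids = np.stack([train_emb[train_labels == c].mean(axis=0)
                          for c in classes])
    # Distance of every test embedding to every class centroid.
    d = np.linalg.norm(test_emb[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]
```

Comparing probe accuracy across representations (generative vs. discriminative, different layers, different model sizes) is one concrete way to study the research questions above.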

High-performance embedded hardware for intelligent musical instruments

Supervisor: Prof. Andrew McPherson

Interest is growing in moving audio machine learning applications from high-performance workstations onto mobile or embedded hardware. Such deployment is particularly useful for creating digital musical instruments. However, most current embedded systems are not optimised for the stringent requirements of performing musicians, including high I/O bandwidth and guaranteed low latency, and making effective use of the complex resources of a modern embedded system-on-chip can be a challenge.

This project will explore a new high-performance embedded hardware architecture for real-time music-AI applications, in close collaboration with the team behind Bela, the open-source platform for ultra-low-latency embedded audio. The architecture will deploy the multi-core processing resources of the latest embedded systems-on-chip to deliver optimised audio and sensor processing that meets stringent real-time requirements. The project will also explore ways of deploying lightweight machine learning models for audio feature extraction and sensor fusion, creating demonstrator musical instruments featuring rich, nuanced interaction. The project also seeks generalisable principles that other researchers can use to deploy their algorithms on an embedded platform.

The ideal candidate for this position will have an interest in musical instruments and experience programming in C or C++. Familiarity with hardware design and real-time audio are beneficial but not required.

Intelligent Audio and Music Editing with Deep Learning

Supervisor: Dr. George Fazekas

This PhD topic seeks novel ways of interacting with audio in music production, particularly in recording or post-production editing of multitrack audio. Recent advances in audio analysis using deep learning enable accurate recognition and labelling of sound events in audio recordings. Labels could mark, for instance, musical notes, chords or other artefacts in audio recordings, and include higher-level semantic annotations corresponding to temporal aspects of music such as beats or structural boundaries. These annotations present an opportunity to enhance and ease the audio editing workflow: viewing, navigating, editing or otherwise manipulating audio recordings in novel ways, going beyond waveform display and manual methods of splicing and editing audio in digital audio workstations.

Challenges remain, however, in providing accurate annotations well suited to particular editing tasks, for instance where sample-level accuracy is required, or where incorrect annotations are counterproductive in audio engineering workflows. At the same time, an opportunity exists in the parallel analysis of the multiple correlated audio tracks usually present in a digital audio workstation.

Several applications are within the scope of this PhD, each with different challenges. These include, but are not limited to, highly accurate labelling of audio events using multitrack audio analysis, novel visualisation and navigation of multitrack audio, and automatic or semi-automatic editing supported by multitrack audio analysis. Several ways of addressing these problems are possible, including multi-input multi-output neural networks, parallel networks, and transfer learning approaches. Students engaged in this research will have the freedom to focus on the task and approach most inspiring to them or closest to their practical experience and interest in audio engineering.

Latent Dynamics for Music-conditioned Dance Generation

Supervisor: Dr. Changjae Oh

Humans can dance by listening to music and following its beats and melody. Artificial intelligence (AI) driven machines can also listen to music and generate dances once trained on a dataset. However, trained machines are commonly deterministic, whereas humans can express various choreographies from a single piece of music.

This project aims to study machine/deep learning-based approaches to creating choreographies from music. The project will investigate latent dynamics that can abstract high-dimensional inputs, e.g. images and audio, into compact state spaces, enabling the imagination of motion trajectories such as dances. The main challenges in this project include audio-visual dance video dataset collection, cross-modal perception, one-to-many stochastic formulation of dance generation from music, and evaluation of the created choreographies. The project mainly requires good programming skills and knowledge of audio-visual signal processing and machine learning.

Latent Spaces for Human-AI Music Generation

Supervisor: Prof. Nick Bryan-Kinns

Latent space models have been successfully used for creative AI applications including music generation, music inpainting, and music interpolation. This PhD will research how to build generative AI systems that make their latent space more interactive and understandable for users. It will build on recent research (2021) on how generative AI music systems can be made more understandable, explainable, and interactive. The PhD will take existing RNN approaches to music generation as a starting point and research the effectiveness of different training sets, latent space regularisation techniques, and user interfaces for real-time interaction. Further research may include examining different architectures for AI music generation and their usefulness for human-AI music generation. The PhD student would need to have or develop skills in AI and machine learning, as well as some skills or interest in visualisation and human-computer interaction. Musical skills are not necessary but would be advantageous.
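Much latent-space interaction comes down to simple geometry: for example, interpolating between the codes of two musical fragments and decoding each intermediate point. A minimal sketch of the interpolation step (the encoder and decoder are assumed to exist elsewhere and are not shown):

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps=5):
    """Linearly interpolate between two latent codes; feeding each row
    to the decoder would morph one musical fragment into the other."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]  # (steps, 1) weights
    return (1 - alphas) * z_a + alphas * z_b        # (steps, dim) path
```

A user interface for this PhD might expose the `steps` positions as a slider, which is one concrete sense in which a latent space can be made "more interactive".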

Bryan-Kinns, N., Banar, B., Ford, C., Reed, C. N., Zhang, Y., Colton, S., & Armitage, J. (2021). Exploring XAI for the Arts: Explaining Latent Space in Generative Music. eXplainable AI Approaches for Debugging and Diagnosis Workshop @ NeurIPS2021.

Machine learning of physical models

Supervisor: Prof. Josh Reiss
in collaboration with Nemisindo

Physical models of sound generating phenomena are widely used in digital musical instruments, noise and vibration modelling, and sound effects. They can be incredibly high quality, but they also often have a large number of free parameters that may not be specified just from an understanding of the phenomenon.

Machine learning from sample libraries could be the key to improving physical models and speeding up the design process. Not only can optimisation approaches be used to select parameter values such that the output of the model matches samples; the accuracy of such an approach will also give insight into the limitations of a model. It further provides the opportunity to explore the overall performance of different physical modelling approaches, and to find out whether a model can be generalised to cover a large number of sounds with a relatively small number of exposed parameters.

This work will explore such approaches. It will build on recent high-impact research from the team on the optimisation of sound effect synthesis models. Existing physical models will be used, with parameter optimisation based on gradient descent. Performance will be compared against recent neural synthesis approaches, which often provide high-quality synthesis but lack a physical basis. The work will also seek to measure the extent to which entire sample libraries could be replaced by a small number of physical models with parameters set to match the samples in the library.
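The optimisation loop can be illustrated on a deliberately tiny "physical model", an exponentially damped sinusoid, whose decay parameter is fitted to a target sample by gradient descent on the waveform mean-squared error. All constants (sample rate, learning rate, step count) are illustrative, and a finite-difference gradient stands in for the differentiable implementations a real project would use:

```python
import numpy as np

def damped_sine(freq, decay, n=2048, sr=16000):
    """Toy physical model: an exponentially damped sinusoid."""
    t = np.arange(n) / sr
    return np.exp(-decay * t) * np.sin(2 * np.pi * freq * t)

def fit_decay(target, freq, decay0=1.0, lr=50.0, steps=500):
    """Fit the decay parameter so the model output matches the target
    sample, using finite-difference gradient descent on the MSE."""
    def loss(d):
        return np.mean((damped_sine(freq, d) - target) ** 2)
    decay, eps = decay0, 1e-4
    for _ in range(steps):
        grad = (loss(decay + eps) - loss(decay - eps)) / (2 * eps)
        decay -= lr * grad
    return decay
```

The residual error after fitting is exactly the kind of diagnostic the paragraph above mentions: if no parameter setting matches the sample well, the model itself is too limited.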

The student will have the opportunity to work closely with research engineers from the start-up company Nemisindo, though will also have the freedom to take the work in promising new directions. Publishing research in high impact venues will be encouraged.

Modelling and Synthesising Articulation on Acoustic and Digital Instruments

Supervisor: Prof. Andrew McPherson
in collaboration with the OHMI Trust

This project investigates musical articulation, the complex transient behaviour exhibited by most instruments at the beginning of a note. Many acoustic instruments, especially bowed strings and winds, are capable of an extremely diverse palette of articulations. However, most digital musical instruments still operate on a simplified model based on the MIDI standard, where notes are discrete events characterised by a velocity parameter. This simplification is one of the major factors which limit the richness of digital instruments in comparison to their acoustic counterparts.

This project will investigate methods of AI-based modelling and synthesising articulations, with a particular focus on real-time synthesis for use in digital musical instruments. Depending on the interest of the student, the project could include modelling of acoustic string or wind instrument transients, neural audio synthesis of articulations, and/or physical interfaces for performing with such models. The project is in partnership with the OHMI Trust, a charity of musicians with physical disabilities, and there will be opportunities to apply the work toward the creation of accessible digital musical instruments which emulate the richness and nuance of traditional acoustic instruments.

Multimodal AI for musical collaboration in immersive environments

Supervisor: Dr. Mathieu Barthet
in collaboration with PatchXR

There is little research on applying deep-learning-based automatic music generation to immersive environments such as virtual reality (VR). VR lends itself well to AI-based interfaces for music co-creation, given that it supports embodied interaction and audio and visual feedback through animated avatars. Music making with VR musical instruments provides a way to collect multimodal data on musical control, body language, spatial position, and musical content. Such rich data can be harnessed to build intelligent systems for interactive musical collaboration between humans and machines in VR.

This PhD will research interactive music generation techniques facilitating musical collaborations between human and machine-based avatar performers in VR. Deep learning models suitable for real-time human-computer interaction will be developed following two scenarios: call and response, where the human performer and the machine play in turn, and accompaniment generation, where the machine follows the human performer. Reinforcement learning will be considered, with reward models leveraging multimodal VR and music data. Studies in multiplayer VR environments supporting human-machine music making in genres such as EDM, techno, rock, jazz, and hip hop will be conducted in collaboration with PatchXR.

The research will follow an iterative design approach by conducting user studies to assess the models taking into account factors such as creativity support, presence, and evaluation methods for computer-supported cooperative work.

The candidate will interact with PatchXR and have access to their VR creative environment technologies including interactive virtual worlds, 50+ VR instruments, modular DSP synths and effects.

The candidate should have experience in at least one of the following scientific areas or equivalent: machine learning, computational creativity, music information retrieval, music signal processing, new interfaces for musical expression, human-computer interaction. Skills in the following areas are desirable: programming (e.g. Python, C/C++), VR (Unity), node-based language (e.g. Max/MSP, PureData) and/or modular synthesiser, music.

Multitask modelling for overlapping sound sources

Supervisor: Dr. Huy Phan

Overlapping sound sources are the main source of error in modelling systems. Examples include polyphonic audio events in audio event detection and polyphonic music in multi-instrument transcription. In a deep learning context, the most common approach to event overlap is to treat the modelling task as a multi-label classification problem, which inherently amounts to multiple one-vs.-rest classification problems solved jointly by a single (i.e. shared) network. This project investigates framing the task instead as a multi-class classification problem, by considering each possible label combination as one class. To circumvent the combinatorial explosion in the number of classes, decomposition of the label space will be explored to form multiple groups of category labels, yielding a multi-task problem in a divide-and-conquer fashion, where each task is a multi-class classification problem. Network architectures will then be devised for multi-task modelling. The approach will be validated on databases with a high degree of source overlap, for polyphonic audio event detection and polyphonic music transcription.
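The decomposition described above can be sketched concretely: within each group of labels, the combination of active labels is encoded as a single class index, so each group becomes an ordinary multi-class problem instead of one huge label-powerset problem. A toy encoding step (the group assignments are illustrative; a real project would learn or design them):

```python
import numpy as np

def to_group_classes(y, groups):
    """Decompose a multi-label target into one multi-class target per
    group: within each group, the active-label combination is encoded
    as a single class index (its binary code). This turns one problem
    with 2^K joint classes into several much smaller ones.

    y: (batch, num_labels) binary matrix; groups: lists of label indices."""
    out = []
    for g in groups:
        bits = y[:, g]                              # (batch, len(g)) 0/1
        codes = bits.dot(1 << np.arange(len(g)))    # binary code per sample
        out.append(codes)
    return out
```

Each returned array can then be fed to a softmax head of its own, giving the multi-task network one small multi-class objective per label group.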

Musification of physical world changes and interaction

Supervisor: Dr. Stefan Poslad

Music can be used not only as entertainment and art in its own right, but also as edutainment: to promote or gamify physical-world interaction, and to draw attention to societal issues in a novel way. Examples include piano stairs in a train station, which encouraged people to take the stairs rather than the escalator and thus increase their physical exertion; turning air pollution into music to draw attention to air quality; and music-driven therapy. Such musification of physical-world state changes and interaction raises interesting IoT, AI and music challenges. In the piano stairs example, only one person or group plays the piano at a time, generating a shared public sound for anyone in earshot; this does not scale to multiple people exercising on the stairs simultaneously. The music AI research questions raised include: How can we fuse music generated independently by different actuators into a coherent individual or group performance? How can we model and orchestrate such music as a multi-agent system whose agents may cooperate and compete at different times?

The main objective of this PhD is to identify and investigate how IoT combined with AI can be used to draw more focused attention to physical world and societal issues via musification.

Performance Rendering for Music Generation Systems

Supervisor: Prof. Simon Dixon
in collaboration with DAACI

Expressive Performance Rendering refers to the task of automatically assigning performance parameters (e.g. velocity, micro-timing, pedal use or bowing types) to MIDI scores to make the resulting synthesized audio approach a natural sounding performance. Existing research has mainly focused on developing instrument-specific models in an isolated scenario, for example for piano [1], drums [2] or guitar [3]. In addition, the correct choice of musical expression parameters can be style and mood dependent, a factor which has not been addressed by prior research. Expanding existing work towards a generic style-aware model that analyses a multi-track score as a whole and predicts performance parameters in a single step is a crucial step towards a fully-automatic composition pipeline. This work ranges from an assessment of the generalisation capabilities of existing methods to advancing the state of the art using contemporary data-driven approaches, including sequence-aware deep neural networks.

This project is a collaboration with DAACI, a London-based music generation start-up. DAACI will provide access to a range of proprietary software and high-quality datasets and their team of trained musicians and musicologists will be available for task-specific data annotation tasks.

[1] Widmer, Gerhard, and Werner Goebl. “Computational models of expressive music performance: The state of the art.” Journal of new music research 33.3 (2004): 203-216.
[2] Burloiu, Grigore, and CINETic UNATC. “Adaptive Drum Machine Microtiming with Transfer Learning and RNNs.” Extended Abstracts for the Late-Breaking Demo Session of the International Society for Music Information Retrieval Conference (ISMIR). 2020.
[3] Giraldo, Sergio, et al. “A Machine Learning Approach to Study Expressive Performance Deviations in Classical Guitar.” Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Cham, 2019.

Probabilistic learning of sequential structure in music cognition

Supervisor: Dr. Marcus Pearce

This project will combine computational modelling and empirical experiments with human participants to understand how listeners learn the syntactic structure of music and how this learning impacts on perception and aesthetic experience of music. Our existing research shows that listeners generate probabilistic predictions for the pitch, timing and harmony of forthcoming musical structures, which derive from implicit statistical learning over timescales ranging from long-term acquisition of the structure of musical styles to short-term learning of repeating structure within a piece of music. This represents a dynamic process of model construction, in which our brains attempt to extract as much structure as possible from the auditory environment to predict more accurately forthcoming auditory events. Expectations also have a special role to play in musical appreciation, since pleasure can arise both from predictable events, signalling a successful predictive model, and surprising events, which increase physiological arousal and drive learning. Aesthetic pleasure is maximal at intermediate degrees of unpredictability and uncertainty.

There are several promising avenues for further investigation. 1) research is required to establish the precise psychological mechanisms by which statistical learning and probabilistic prediction give rise to musical pleasure. 2) statistical learning implies that individuals with different listening histories will perceive music in different ways as a function of their experience; research is required to test this hypothesis by developing embodied artificial systems that simulate developmental trajectories in acquisition of culture-specific musical knowledge and predicting differences between musical cultures. 3) current models can perform better than humans, motivating research on memory constraints to better simulate human learning of structural regularities. 4) research is required to extend existing models to process sound and music at different hierarchical levels, at different time scales, ranging from high-level musical form (motifs, phrases, sections, parts), through symbolic notes, to acoustic processing of raw auditory input, including polyphonic music and multi-channel auditory scenes. The overall goal is to develop a complete computational model of music cognition. Within this general approach, there is scope to focus the project on different musical parameters (e.g., melody, rhythm, harmony), empirical methods (behavioural, EEG), computational approaches (e.g., structured probabilistic models, empirical Bayesian methods, neural networks) and musical styles, including non-musical auditory sequences.

Gold, B., Pearce, M. T., Mas-Herrero, E., Dagher, A., & Zatorre, R. J. (2019). Predictability and uncertainty in the pleasure of music: a reward for learning? Journal of Neuroscience, 39, 9397-9409.
Harrison, P.M.C., Bianco, R., Chait, M., & Pearce, M. T. (2020). PPM-Decay: A computational model of auditory prediction with memory decay. PLOS Computational Biology, 16, e1008304.
Pearce, M. T. (2018). Statistical learning and probabilistic prediction in music cognition: mechanisms of stylistic enculturation. Annals of the New York Academy of Sciences, 1423, 378-395.

Real-time timbral mapping for synthesized percussive performance

Supervisors: Prof. Andrew McPherson and Dr. Charalampos Saitis
in collaboration with Ableton

The project addresses the problem of generating a timbral map between an acoustic performance to a synthesized one in real-time. Given a played drum part, the challenge is to map the audio from a microphone to MIDI and parameter data for a known synthesizer in such a way that the acoustic qualities of the original performance are preserved within the synthesized one, including dynamics and spectral qualities. The project could build on known onset detection techniques to examine how a synthesizer might use MPE/MIDI 2.0 and its own parameters to most closely approximate a real-time source. Such a system could be used to augment a drum in real-time. For example, a snare drum could be augmented with a synthesized snare in such a way that the synthesized version models the timbral characteristics of the real one. Given the possibilities of striking the instrument at different points of the skin, or metal-ware, with different velocities, and that the acoustic drum sound’s evolution will differ from the original, this is a complex problem.

To accomplish this, a model might be constructed for the effect of parameter changes on the spectral properties of the synthesized sound. Previous work has also focused on analysing the spectral properties across the parameter space. It is then possible to modulate parameters via a UI that uses the timbre space. In practice, the dimensionality of the timbral space can exceed three dimensions and thus dimensionality reduction has been used to present a usable interface. The system is designed to work in real-time and this might impose additional constraints. Applications include the ability to record electronic percussion using a variety of real-world objects as well as augmentation of percussive parts in live performance.

Relevant Work:
Making music through real-time voice timbre analysis: machine learning and timbral control Dan Stowell
Automatic Programming of VST Sound Synthesizers using Deep Networks and Other Techniques. Matthew John Yee-King, Leon Fedden and Mark d’Inverno
TSAM: a tool for analyzing, modeling, and mapping the timbre of sound synthesizers. Stefano Fasciani
A Self-Organizing Gesture Map for a Voice-Controlled Instrument Interface. Stefano Fasciani and Lonce Wyse
Real-time Human Interaction with Supervised Learning Algorithms for Music Composition and Performance Rebecca Fiebrink

Resource-efficient models for music understanding

Supervisors: Dr. Emmanouil Benetos and Prof. Phillip Stanley-Marbell

State-of-the-art models for music understanding and music information research are often very hard to run on small and embedded devices such as mobile phones, single-board computers, and other microprocessors. At the same time, the computational cost, footprint, and environmental impact for building and deploying deep learning models for music understanding is constantly increasing. This PhD project will investigate methods for creating resource-efficient models for music understanding, applied to various tasks in music information research that involve music audio data, such as automatic music transcription, audio fingerprinting, or music tagging. Methods to be investigated can include but are not limited to sparse training, network pruning, binary neural networks, post-training inference, and knowledge distillation.

The successful candidate will investigate, propose and develop novel machine learning methods and software tools for resource-efficient music understanding, and will apply them to address tasks of their choice within the wider field of music information research. This will result in models that can be deployed on small or embedded devices, or on offline models where learning and inference times and computational resources are drastically reduced.

Self-supervision in machine listening

Supervisor: Dr. Emmanouil Benetos
in collaboration with Bytedance

Self-supervised learning methods aim to provide an alternative to supervised representation learning, eliminating the need for large annotated datasets. Self-supervision has advanced rapidly in recent years with applications across several modalities, and can be ideally used in machine listening and music understanding tasks which have been historically data-deprived compared to other domains. This PhD project will investigate methods for self-supervised learning applied to various tasks in applied to various tasks in music information research that involve music audio data, such as automatic music transcription, audio fingerprinting, or music tagging. Methods to be investigated can include but are not limited to contrastive self-supervised learning, formulation of appropriate pretext tasks, transferability to downstream tasks, and links between self-supervised and semi-supervised learning for music understanding.

The successful candidate will investigate, propose and develop novel self-supervised representation learning methods and software tools for music understanding, and will apply them to address tasks of their choice within the wider field of music information research. This will result in models that can learn from unlabelled data while performing comparably or surpassing supervised learning methods.

Timbre Tools for the Digital Instrument Maker

Supervisor: Dr. Charalampos Saitis
in collaboration with Bela

This PhD addresses the role of timbre in the design of sound synthesis and AI tools for digital instrument makers. Timbre is among the most evocative yet elusive attributes of music. It is through timbre that musicians can emote by manipulating the physical response of their acoustic instruments. Yet timbre is conspicuously absent from the digital luthier’s toolbox. Much synth design is still based on concepts from early analog and digital synthesis, or emulation of it using more recent techniques, while an audio engineer’s workbench is still based mainly on historical tools like oscilloscopes and signal generators. Teaming up with Bela, this project will implement novel Bela IDE features at the intersection of creativity, education, and research, enabling makers to create digital interactions with an understanding of how we listen. This will involve developing flexible, open-ended tools for analysis and visualisation of timbre (and maybe synthesising sounds) which could be plugged into any Bela project under development. Examples include tools enabling interaction with and interpretation of psychoacoustical and semantic timbre spaces, as well as real-time light-weight neural audio synthesis models for timbre transfer and sound morphing. The premise of this project is to promote a learn-by-making approach: through creating digital instruments using in-browser timbre tools amongst other tools, people would learn more about sound synthesis and make more interesting instruments.

User-driven deep music generation in digital audio workstations

Supervisors: Dr. Mathieu Barthet and Dr Gaëtan Hadjeres
in collaboration with Sony CSL

Although deep music generation has made great progress in the past decade, current approaches still offer little creative control to the user. Systems that highly or fully automate the music making process inherently limit the role of musicians. Besides responsible innovation and ethical considerations, this may reduce cognitive reward, user engagement, and hinder the adoption of such systems in the music industry. Combining creative agency with automation presents interesting challenges both for deep music generation and human-computer interaction.

The PhD will investigate AI-based music production assistive techniques that integrate seamlessly to user workflows in digital audio workstations and offer flexible editing control. Interviews with musicians will be conducted to identify creative affordances desired by users. The knowledge gained will inform the design of deep learning models that can be constrained based on high-level user controls and domain-specific knowledge. Several computational music creativity tasks will be considered including music inpainting (continuation of a musical composition in context) and harmonisation. Previous work on piano inpainting will be extended to small ensembles (e.g. jazz trio). Training will combine symbolic and audio datasets by developing models yielding suitable mid-level representations.

User evaluation will be conducted using HCI methods suitable for music making including creativity support assessment. The PhD will also advance computational methods to identify overfitting situations and to characterise the novelty of deep generated content given training corpus data.

The candidate will collaborate with the Sony CSL music team (, and have the opportunity to interact with experienced music AI researchers and software developers. Access to specialised datasets and artist network for user studies and public engagement will also be possible.

The candidate should have experience in at least one of the following scientific areas or equivalent: music informatics, machine learning, music signal processing, musical timbre modeling, new interfaces for musical expression, human-computer interaction. Programming skills (e.g. Python/PyTorch, C/C++, Javascript), audio plugin development (e.g. JUCE, VST), and music performance/composition/production backgrounds are desirable.

Using Signal-informed Source Separation (SISS) principals to improve instrument separation from legacy recordings

Supervisors: Prof. Mark Sandler and Dr. Emmanouil Benetos

The recently proposed Signal-Informed Source Separation (SISS) paradigm from c4dm belongs to the broader category of Informed Source Separation (ISS), with the unique and specific attribute of using one audio signal to inform the separation of another. The informing source is a close approximation of a coherent component in the mixture. A current AIM PhD is examining this paradigm for live ensemble recordings when the spot mic signal informs the separation of the main mix. An alternative viewpoint is that the informing signal as a caricature of its corresponding component in the main feed. This suggests that we should investigate other caricatures for musical instrument separation and modification, especially for re-mixing and up-mixing of legacy commercial recordings. For example, a session musician could play the guitar line in the Beatles’ “She Loves You” to separate Harrison’s part, or it could be rendered from a MIDI transcription. Preliminary, confirmatory evidence for this approach appears in [1 & 2] which explore crude implementations with good outcomes but do not develop the approach further.

This PhD will develop skills in deep learning, especially architectures employing conditioning, as well as novel cost functions, perhaps incorporating physical models of the instruments to be separated. Applicants would benefit from a background in Machine Learning and DSP, coupled with knowledge of modern music recording, processing and mixing techniques.

[1] P. Smaragdis & G. J. Mysore, ‘Separation by “humming”: User-guided sound extraction from monophonic mixtures’, in IEEE WASPAA, 2009.
[2] Y. Li et al, ‘Learning to Denoise Historical Music’, in ISMIR, 2020.