This page outlines PhD topics proposed by AIM academics and industry partners for September 2020 entry. You are welcome to apply for one or more of the topics, or to propose your own PhD topic, according to the application guidelines specified here. In both cases, we strongly encourage you to contact your chosen supervisor for an informal chat, as this will also help you to put together your research proposal, which is an integral part of your application. We also strongly encourage you to list more than one PhD topic in your application, preferably three.
Supervisor: Dr Dan Stowell
Deep learning is now a widespread tool for audio analysis tasks. Most commonly, these tasks use “convolutional” and/or “recurrent” neural networks. But what will be the next generation of deep learning methods? There are many exciting ideas to build on, such as deep Gaussian processes, capsule networks, gauge equivariant networks, neural ordinary differential equations, normalising flows, and point process networks. In this project you will explore such possibilities and their usefulness for analysing and generating audio data, and then refine selected methods to create a new generation of audio deep learning that is not only powerful but also insightful. The applications range from automatic music transcription and audio event detection to understanding the fine details of vocal expression. For this project a strong mathematical background is particularly desirable (e.g. calculus, statistics).
Supervisor: Dr Lin Wang
Real-life audio signals often suffer from local degradation and lost information. Examples include short audio intervals corrupted by impulse noise and clicks, or a clip of audio wiped out due to damaged digital media or packet loss in audio transmission. Audio inpainting is a class of techniques that aim to restore the lost information with newly generated samples without introducing audible artifacts. In addition to digital restoration, audio inpainting also finds wide applications in audio editing (e.g. removing audience noise in live music recording) and music enhancement (e.g. audio bandwidth extension and super-resolution).
Approaches to audio inpainting can be classified by the length of the lost information, i.e. the gap. For example, in declicking and declipping, corruption may occur frequently but is mostly confined to durations of a few milliseconds or less. On the other hand, gaps on a scale of hundreds of milliseconds or even seconds may arise from digital media damage, transmission loss, and audio editing. While intensive work has been done on inpainting short gaps, long-gap audio inpainting remains a challenging problem due to the high-dimensional, complex and non-correlated nature of audio features. Recently, inspired by the tremendous success of deep learning in image and video inpainting, deep learning based approaches have started attracting attention in the research community, but they are still in their infancy.
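To make the short-gap case concrete, here is a minimal sketch of the simplest possible baseline: filling a gap of a few samples by linear interpolation between its known neighbours. It is not one of the methods proposed for this project, and the function name `inpaint_linear` is hypothetical. A baseline like this is only plausible for gaps of a few samples, which illustrates why long gaps call for models that generate new content.

```python
def inpaint_linear(samples, gap_start, gap_end):
    """Fill samples[gap_start:gap_end] by linear interpolation.

    Assumes the gap lies strictly inside the signal, so that a known
    sample exists immediately before and after it.
    """
    if not 0 < gap_start < gap_end < len(samples):
        raise ValueError("gap must lie strictly inside the signal")
    restored = list(samples)
    left, right = samples[gap_start - 1], samples[gap_end]
    span = gap_end - gap_start + 1  # number of steps from left to right neighbour
    for offset in range(1, span):
        restored[gap_start - 1 + offset] = left + (right - left) * offset / span
    return restored


# Usage: three lost samples (indices 3 to 5) in a short ramp signal.
damaged = [0, 1, 2, 0, 0, 0, 6, 7]
restored = inpaint_linear(damaged, 3, 6)
```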
The PhD project intends to investigate the possibility of adapting deep learning frameworks from various domains, including audio synthesis and image inpainting, to audio inpainting. A particular focus will be given to recovering musical signals with long gaps of missing information, and to reconstructing super-resolution audio signals through bandwidth extension, both of which are challenging tasks for the state of the art. The time-frequency sparsity, the structure and repetition of the musical signals, as well as auditory perception psychology and music semantics will be jointly exploited to help achieve this goal.
The research will be conducted by combining one or more of the methodologies below.
1) Traditional musical signal processing approaches, e.g. exemplar-based methods.
2) Deep learning based approaches, e.g. convolutional neural networks and generative adversarial networks [3-5].
3) Audio-visual approaches exploiting additional visual context, e.g. video recordings of instrument performance.
[1] Adler, Audio inpainting, IEEE-TASLP 2011.
[2] Perraudin, Inpainting of long audio segments with similarity graphs, IEEE-TASLP 2018.
[3] Chang, Deep-long audio inpainting, arXiv 2019.
[4] Marafioti, A context encoder for audio inpainting, IEEE-TASLP 2019.
[5] Lim, Time-frequency networks for audio super-resolution, ICASSP 2018.
[6] Zhou, Vision-infused deep audio inpainting, ICCV 2019.
Supervisor: Dr Andrew McPherson
Expert performers spend many years developing skills on their instruments, and few performers have the time and motivation to learn new and unfamiliar digital instruments. This project explores an approach to creating new instruments that build on the existing skills of expert players, in which a familiar instrument is used as a controller for other types of musical sounds.
Starting with an electric violin, electric guitar or similar instrument, a real-time audio analysis algorithm will be developed based on deep learning techniques such as variational recurrent autoencoders (VRAEs) to construct a low-dimensional latent space representation of the timbre and articulation. The results will be evaluated for perceptual salience to the performer, and resynthesis techniques will be explored which use the latent space representation to control other sound synthesis algorithms. The desired end result is an instrument which remains familiar to the performer in its physical form, whose sound is creatively new but also responds in predictable ways to the performer’s actions. A priority is to look beyond typical high-level transcription features such as note onsets to focus on the nuanced, micro-level timbral control that expert performers expect from their instruments.
Supervisor: Dr Andrew McPherson
in collaboration with OHMI
People with physical disabilities face a number of barriers to full participation in musical life. Technology has played an important role in removing some of these barriers and continues to present opportunities for further progress in this area. However, few adapted instruments exist which have fully succeeded in providing a platform for undifferentiated participation with respect to concepts such as virtuosity, access to existing repertoire, or acceptability in existing ensemble forms (for example orchestras, rock bands and folk groups). Augmenting instruments with AI is an approach which has been seen in existing ‘smart instruments’, but it is an as yet under-explored means by which disabled musicians may be given full access to musical performance. This studentship provides an opportunity to explore the ways in which AI can potentially be used to overcome the technical and aesthetic challenges of adapting instruments for disabled musicians, while meeting the highest musical challenges of virtuosity and expressiveness and allowing access to repertoire and cultural acceptability. Example topics might include using AI technology to predict and respond to the intentions of a performer in order to alter aspects of an instrument’s performance (perhaps through embouchure control of a wind instrument, or hand and finger adjustments to control pitch, vibrato and bowing of a string instrument).
Every songwriter, composer and producer who has ever made a piece of music was influenced, in some way, by his or her predecessors. The history of those influences can be thought of as a genealogical network, rather like an evolutionary tree, reaching into the past. Such networks are usually constructed informally by experts who hear, or think they hear, certain similarities in the work of various artists (e.g., AllMusic’s lists of influential artists). This project will develop objective, statistical methods to infer influence networks.
The project will focus on a collection of around 10,000 traditional British and Irish folk songs hosted at the British Library. The first part of the project will characterise the songs by investigating and developing new music and audio descriptors using music information retrieval (MIR) methodologies. Given a set of MIR descriptors, we will then use methods inspired by evolutionary biology to infer, and represent, their relationships to each other as Temporally Directed Acyclic Graphs (tDAGs). The resulting influence network (or genealogical graph) will, in turn, be used to test hypotheses about the origin and spread of musical forms in the British Isles: a history that we believe reaches back for centuries. Candidates are encouraged to apply the developed methods to other music styles and cultures.
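As a rough illustration of what a temporally directed acyclic graph over songs could look like, the sketch below links each song to its most similar strictly earlier song. This nearest-earlier-neighbour rule is a deliberately simple stand-in, not the evolutionary-biology-inspired inference the project describes. The function name, the toy descriptor vectors and the use of Euclidean distance are all assumptions made for the example.

```python
import math


def infer_influence_edges(songs):
    """Link each song to its most similar strictly earlier song.

    `songs` maps a song id to (year, descriptor_vector). Returns a list of
    (earlier_id, later_id) edges. Every edge points forward in time, so
    the resulting graph cannot contain a cycle.
    """
    edges = []
    for song_id, (year, vector) in songs.items():
        earlier = [
            (math.dist(vector, other_vector), other_id)
            for other_id, (other_year, other_vector) in songs.items()
            if other_year < year
        ]
        if earlier:
            _, parent = min(earlier)  # smallest descriptor distance wins
            edges.append((parent, song_id))
    return edges


# Toy data: years and two-dimensional descriptors are invented for illustration.
songs = {
    "a": (1800, [0.0, 0.0]),
    "b": (1850, [0.1, 0.0]),
    "c": (1900, [1.0, 1.0]),
    "d": (1950, [0.9, 1.0]),
}
edges = infer_influence_edges(songs)
```

Restricting candidate parents to earlier songs is what makes the graph temporally directed; a real method would also need to handle uncertain dates and multiple influences per song.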
The ideal candidate would have skills in computational statistics or statistical machine learning, scientific computing (Python and R), as well as an interest in comparative musicology.
Music recommender systems (MRSs) have to date primarily focused on song or playlist recommendation for listening purposes, but much less work has been done on the recommendation of audio content for music production. Contemporary music production tools commonly include large digital audio libraries of loops (repeatable musical parts typically used in grid-based music), virtual instrument samples (based on synthesis or recordings) and sound packs (collections of sounds). Such richness of content can however impede creativity as it can be daunting for musicians to navigate tens of thousands of sounds to find items matching the style of their production and intent.
This PhD will research, develop and assess composition-aware music recommendation systems for music production, enabling musicians to get the best musical value out of their creative digital audio libraries.
Computer music makers often produce music by combining different instrumental parts together (e.g. drums, bass, lead, vocals, etc.). One of the challenges will be to provide new audio items that can be meaningfully mixed with other audio elements in the composition. The PhD will investigate methods to assess the musical compatibility between a set of audio items given constraints such as a composer’s musical taste and creative style.
Different music recommendation paradigms and AI techniques will be researched in order to minimise the impact of cold-start and sparsity issues, and maximise the interpretability of the results. The project will compare content-based filtering (CBF) using audio or metadata (e.g. “analog dirty bass”, “crisp bright lead”, etc.), collaborative filtering (CF) based on user-item interactions, and hybrid models. Graph-based solutions for composition-aware recommendation will be investigated, e.g. by developing multi-layer structures that take into account user preferences and the musical compatibility of audio items. The project will also research deep learning techniques for (i) automatic feature learning from audio (e.g. to model perceptual distances between sounds to find sound-alikes), (ii) modelling the musical compatibility of audio items, and (iii) extracting latent factors from user-item interactions.
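The difference between content-based, collaborative and hybrid scoring can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: item features are small hand-written vectors, the collaborative score per item is supplied as a plain dictionary, and the blending weight `alpha` and all item names are hypothetical. It does not represent any Focusrite or Ampify system.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(query_features, library, interaction_scores=None, alpha=1.0, top_k=3):
    """Rank library items for a query sound.

    alpha=1.0 is pure content-based filtering; lower values blend in a
    per-item collaborative score (assumed to come from user-item data).
    """
    interaction_scores = interaction_scores or {}
    ranked = []
    for item_id, features in library.items():
        content = cosine_similarity(query_features, features)
        collaborative = interaction_scores.get(item_id, 0.0)
        ranked.append((alpha * content + (1 - alpha) * collaborative, item_id))
    ranked.sort(reverse=True)
    return [item_id for _, item_id in ranked[:top_k]]


# Invented two-dimensional features, e.g. (bass-likeness, brightness).
library = {
    "bass_dirty": [1.0, 0.0],
    "lead_bright": [0.0, 1.0],
    "bass_clean": [0.8, 0.2],
}
content_only = recommend([1.0, 0.1], library, top_k=2)
hybrid = recommend([1.0, 0.1], library, {"lead_bright": 1.0}, alpha=0.5, top_k=1)
```

With `alpha=1.0` an item with no interaction history can still be recommended, which is why content-based scoring helps with the cold-start problem mentioned above.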
Evaluations of the models will be conducted taking into account how they support creative agency and provide interpretable recommendations to the user. The candidate will have access to data and software provided by Focusrite including a catalogue of over 10k WAV loops with metadata, a collection of about 30k synth patches (with about 200 parameters), and musical compatibility indicators. The recommendation models will be tested on Ampify iOS music apps such as Launchpad, to make and remix music, and Blocs Wave, to make and record music (https://ampifymusic.com/).
The candidate should have experience in at least one of the following scientific areas or equivalent: music recommendation, music signal processing, machine learning, graph theory, deep learning, musical timbre modeling, music emotion recognition, human computer interaction. Programming skills (e.g. Python, C/C++, Objective-C, Swift, Matlab) and musical background are desirable.
Supervisor: Dr Mathieu Barthet
in collaboration with Sensing Feeling Ltd and Ovomind
Live music poses a challenge for affective computing due to the complex and multidimensional nature of perceptual modalities involved and social human factors. The production and reception of live music result in a wide range of expressions which pertain to the emotional states of performers and audiences. These expressions can be explicit (e.g. the strumming gesture of a guitar chord, an audience dancing and cheering), or implicit (e.g. an increase in heart rate), and highly depend on context and performance social norms.
This PhD research aims to establish multimodal deep learning and sensor fusion methods to sense, recognise and model the emotional states of audiences and performers dynamically over the course of live music performances. The work will investigate how predictions can be improved by combining several modalities including visual (video), audio (e.g. crowd sounds, music), inertial (linked to audience/performer motion), and/or physiological (e.g. electrodermal activity, pulse) attributes. A deep learning-based sensor fusion system will be developed using data captured with Internet of Things (IoT) sensors, wearables and microphones. Different models will be devised to infer emotional states at individual or collective (crowd, band) levels. Data captured with the sensors will also be used to better understand the correspondence between visual valence and arousal of audiences in a number of musical styles.
Supervisor: Dr Marcus Pearce
This project will combine computational modelling and empirical experiments with human participants to understand how listeners learn the sequential syntactic structure of music and how this learning impacts on perception and aesthetic experience.
Our existing research shows listeners generate probabilistic predictions for the pitch, timing and harmony of forthcoming musical structures, which derive from implicit statistical learning over timescales ranging from long-term acquisition of the structure of musical styles to short-term learning of repeating structure within a piece of music. This represents a dynamic process of model construction, in which our brains attempt to extract as much structure as possible from the auditory environment to predict more accurately forthcoming auditory events. Surprising events constitute failures of prediction by the model and can therefore drive learning. Expectations have a special role to play in musical appreciation, since pleasure can arise both from predictable events, signalling a successful predictive model, and surprising events, which generate interest and even thrilling heightening of physiological arousal. Aesthetic pleasure is maximal at intermediate degrees of unpredictability and uncertainty.
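The notions of statistical learning, probabilistic prediction and surprise can be illustrated with a toy model. The sketch below learns first-order transition counts from a few pitch sequences and measures the surprise of a continuation as its information content, -log2 P(next | previous). It is a deliberately simplified illustration, not the models used in this research; the function names, the add-one smoothing and the example melodies are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Count how often each pitch follows each other pitch."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def information_content(counts, prev, nxt, alphabet_size):
    """Surprisal -log2 P(nxt | prev), with add-one smoothing."""
    context = counts.get(prev, Counter())
    prob = (context[nxt] + 1) / (sum(context.values()) + alphabet_size)
    return -math.log2(prob)


# Two invented melodies as MIDI pitch numbers; the alphabet is {60, 62, 64, 65}.
melodies = [[60, 62, 64, 62, 60], [60, 62, 64, 65, 64]]
model = train_bigram(melodies)
expected = information_content(model, 60, 62, alphabet_size=4)    # heard twice
surprising = information_content(model, 60, 65, alphabet_size=4)  # never heard
```

A continuation that has been heard often after a given pitch receives low information content, while an unheard one receives high information content, which mirrors the idea that surprising events are failures of prediction.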
There are several promising avenues for further investigation. 1) Research is required to establish the precise psychological mechanisms by which statistical learning and probabilistic prediction give rise to musical pleasure. 2) Statistical learning implies that individuals with different listening histories will perceive music in different ways as a function of their experience; research is required to test this hypothesis by developing embodied artificial systems that simulate developmental trajectories in the acquisition of culture-specific musical knowledge and by predicting differences between musical cultures. 3) Current models can perform better than humans, motivating research on memory constraints to better simulate human learning of structural regularities. 4) Research is required to extend our existing models to process sound and music at different hierarchical levels and time scales, ranging from high-level musical form (motifs, phrases, sections, parts), through symbolic notes, to acoustic processing of raw auditory input. This will also involve developing a model of auditory stream segregation and fusion in polyphonic music and multi-channel auditory scenes.
The overall goal is to develop a complete computational model of music cognition. Within this general approach, there is scope to focus the project on different musical parameters (e.g., melody, rhythm, harmony), empirical methods (behavioural, EEG), computational approaches (e.g., structured probabilistic models, empirical Bayesian methods, neural networks) and musical styles, including non-musical auditory sequences.
Barascud, N., Pearce, M. T., Griffiths, T. D., Friston, K. J., & Chait, M. (2016). Brain responses in humans reveal ideal observer-like sensitivity to complex acoustic patterns. Proceedings of the National Academy of Sciences, 113, E616-E625.
Gold, B., Pearce, M. T., Mas-Herrero, E., Dagher, A., & Zatorre, R. J. (2019). Predictability and uncertainty in the pleasure of music: a reward for learning? Journal of Neuroscience, 39(47), 9397-9409.
Pearce, M. T. (2018). Statistical learning and probabilistic prediction in music cognition: mechanisms of stylistic enculturation. Annals of the New York Academy of Sciences, 1423, 378-395.
Supervisor: Dr Charalampos Saitis
in collaboration with the BBC and ISI Foundation
Digital platforms for music streaming, including radio and podcasts, offer access to behavioural signals that can be used to learn about the characteristics and preferences of individuals. The goal of the proposed research is to leverage the power of low-level digital trails of music listening behaviours (most frequently listened to genres/artists, how often one listens to new music/artist suggestions, playlists, etc.). Such knowledge can then be employed to build predictive models of complex high-level psychological constructs like moral and human values, as well as of complex demographic attributes.
The doctoral student will collect large-scale demographic data from UK listeners, closely following the UK population with respect to gender, age, geographical distribution, education, and income. Along with the demographic data, they will collect self-reported assessments on validated psychometric questionnaires for moral traits and basic human values, and combine this information with passively (but entirely voluntarily) collected multimodal digital data from online music/radio/podcast streaming platforms, such as BBC Sounds and the BBC iPlayer, and from Facebook “likes” on music-related pages (e.g., artists, bands, communities).
Recent research between academics and the BBC investigates how global human values are shaped in an ever-changing world (shorturl.at/FQS28). The proposed project will extend this work by looking at the links between global human values and musical preferences. Previous studies typically sought to infer such connections through specially designed questionnaires. The novelty of the proposed project lies in using passively collected digital trails of music listening behaviours outside the laboratory and over a period of time that extends beyond filling in a questionnaire.
The data-informed models developed in this project will help unlock the potential of personalised, uniquely tailored listener experiences, recommendation systems, and communication strategies in digital music (a strategic focus area of the UKRI-EPSRC portfolio), with applications in the creative industries and healthcare in particular (e.g., understanding well-being from musical choices). Furthermore, the cultural mosaic of UK society offers a unique setting in which to explore cultural influences on how moral and human values are reflected in musical choices.
Supervisor: Prof Nick Bryan-Kinns
Music making is a rich domain in which to explore AI, human-robot interaction, creativity, engagement, and Human-Computer Interaction (HCI). Manipulating the ways robots interact with people in real-time provides opportunities for systematic evaluation of the effect of human-robot interaction features on people’s creativity and engagement with and through music making. However, such evaluation is a non-trivial challenge given the subjective nature of music and creativity and the inherently experiential nature of music making which reduces the value of conventional efficiency measures. There is the potential to use measures of people’s interaction with robots and to assess features of the creative products themselves to evaluate participants’ creativity, engagement, and user experience, in addition to self-reporting by people. This PhD would research the use of AI and Machine Learning techniques to create music in real-time for performance by robots with human counterparts. The PhD student would need skills in AI and Machine Learning as well as some skills or interest in HCI user study techniques and human-robot interaction design and build. Musical skills are not necessary but would be advantageous.
Supervisor: Dr George Fazekas
This PhD explores the synergies between computational analysis of audio and the abundance of big music data available today. The broader aim is to discover and analyse patterns in music data to enable the study of how musical style develops and evolves over time, identify leaders, innovators and influencers in artist or listener communities, or facilitate the use of music to address societal challenges. For instance, can music bring together social media users and reduce the adverse effects of isolated communities and information bubbles? There are several key AI challenges in answering these questions. These include how to use the available information effectively and how to deal with sparsity and uncertainty in data (see https://arxiv.org/pdf/1706.02361.pdf for some relevant challenges). While the Semantic Web is an excellent source of Linked Open Data about music artists, albums, tracks, styles etc., the data is incomplete and heavily biased by popularity. Data about listener communities is also sparse because users and services provide access to personal information to varying degrees. The combination of different data sources and inference mechanisms can therefore become key in addressing these challenges. The PhD research will investigate: 1) How to complete knowledge graphs obtained from Semantic Web resources such as Wikipedia, DBpedia, Wikidata, etc. using other information sources such as musical features extracted from audio, music similarity, playlists or listening data. 2) How to infer new knowledge from the analysis of relationship patterns in music data, both at the level of instances (e.g. artist networks) and music ontologies (http://musicontology.com). A proposed approach will use Graph Convolutional Networks (https://arxiv.org/pdf/1609.02907.pdf) to obtain an embedding space in which new relationships between tracks or artists may be uncovered.
These may stem from common musical patterns or other relationships, e.g., common appearance in playlists or common social tags. The new relationships will enhance the knowledge graph and may also yield generic musical knowledge encoded as rules or web ontologies to support logical inference. There is scope to explore both of these aspects of the topic. The research has industry relevant applications (e.g. music recommendation) or could enable new musicological research in digital humanities.
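For readers unfamiliar with Graph Convolutional Networks, the sketch below shows a single propagation step in the form given in the paper linked above: node features are averaged over each node's neighbourhood using the symmetrically normalised adjacency matrix with self-loops, then passed through a linear map and a ReLU. It is a minimal pure-Python illustration with invented toy values, not a trainable implementation; the helper names are hypothetical.

```python
import math


def normalised_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2), where D is the degree matrix of A + I."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in with_loops]
    return [[with_loops[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One GCN propagation step: ReLU(A_norm @ features @ weights)."""
    propagated = matmul(normalised_adjacency(adj), features)
    return [[max(0.0, value) for value in row] for row in matmul(propagated, weights)]


# Toy graph: two artists linked by one edge, each with a single feature.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
embedding = gcn_layer(adjacency, features, weights=[[1.0]])
```

After one step, linked nodes end up with similar embeddings, which is the property that makes nearby points in the embedding space candidates for new relationships.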
This proposed PhD project focusses on development of tools and techniques to support the curation of collections of music tracks and production of music mixes by expert users in a professional environment. The project will investigate how music discovery and recommendation techniques can be applied in professional tools to assist these expert curators in their work; enhancing their creative role, rather than replacing them with an automated system.
A range of state-of-the-art techniques for music discovery should be considered and evaluated in this specific context of use. This includes content-based analysis, collaborative filtering, and auxiliary sources of music-related knowledge and data. Beyond simple recommendation of songs, other features should be considered such as human-readable explanations of why tracks are being recommended and suggestions for playlist ordering. This project will involve user research to understand the needs of the target professional users and the constraints in which they work, as well as evaluation of tools developed during the project in a professional context.
The project will be carried out in collaboration with BBC Research & Development. This gives the opportunity to work directly with teams at the BBC and to evaluate the research outcomes in a real-world application context both with professional users and large audiences.
We would like to invite you to submit a project proposal for a PhD on methods for artificial musicality. You can decide whether to focus on music understanding or creativity. A PhD is your opportunity to do research without the constraints of an industrial environment — we prefer fresh ideas, novel methods and careful experiments. We are a team of music information retrieval veterans and love to share our academic and industrial experience. We will work as supervisors with the successful candidate, providing technical guidance (neural networks, Bayesian networks, music signal processing) and advice on the applicability of the research to industry use cases.
Supervisor: Prof Simon Dixon
Recent years have witnessed an increase in approaches that combine different music analysis tasks (e.g. beat tracking, chord recognition, structure analysis, instrument recognition), making use of the relationships between the different aspects of musical organisation. The feasibility of multimodal approaches, extending the analysis of audio data with scores and/or video data, has also been demonstrated. In this project you will develop new multi-task learning approaches to analyse and annotate music collections, making use of shared representations across the different tasks. The project will make use of an archive of five decades of recordings from a jazz and popular music festival. The high-level musical features computed in this project will then be used to characterise and model musical styles at individual and collective levels (e.g. by genre, date, geographic origin).
Supervisor: Prof Ioannis Patras
In this project we will develop methodologies that learn from large collections of music video data to associate image regions with specific sounds. In this way, we will be able to design detectors that localise musical instruments in video, and to adjust the volume or other properties of the overall sound signal, or of specific instruments in a video that depicts more than one. The methods will be trained in an unsupervised manner, i.e., without assuming known correspondences between images and sounds (such as a video where a violin is played), or with only a few such correspondences. The methodologies will be based on Deep Neural Networks that are at the crossroads of Computer Vision and Audio Processing.
This PhD project aims to intelligently automate the production of audio mixes of multiple input sources. The target application is in broadcast production of live events, such as pop and classical music concerts, and sports. Live broadcast environments operate under great time pressure with the need for staff to perform many complex tasks. The aim of the project is to ease the work of the skilled sound engineer, by automating production decisions with artificial intelligence. However, the challenge is not to replace the role of the engineer, but to improve quality and creativity in their work. Therefore, it should be considered how to allow human input to steer high-level aesthetic decisions, while automating many mundane tasks and producing a professional quality output by default.
The project will be carried out in collaboration with BBC Research & Development. This gives the opportunity to work directly with teams at the BBC and to evaluate the research outcomes in a real-world application context. There will be opportunities to inform the research with study of and data from live broadcast productions.
In professional multi-microphone recording scenarios, there is often cross-talk between sound sources and undesirable influence of room acoustics. This can cause quality issues in mixing. The combination of multiple correlated microphone signals leads to timbral colouration. It is desirable to reduce such interference, separating the mixtures of sound sources, to enable more precise control of the mix balance in a way that increases sound quality.
While source separation has been around for many years, the techniques can often introduce undesirable artefacts, which reduce quality. Blind source separation describes the scenario where there is little a priori information about the mixture of sources. In a multi-microphone production scenario, such as in a music or broadcast recording studio, there is often information available to inform the source separation process. One approach could be to utilise information from multiple correlated microphone sources. For music recordings, there is often also a priori information from the score or from previous multitrack recordings of the pieces and instruments. Similarly for speech signals, such as a panel show or drama recording, the voices of performers are generally known in advance.
The project will be carried out in collaboration with BBC Research & Development. This gives the opportunity to work directly with teams at the BBC and to evaluate the research outcomes in a real-world application context with professional users.
Supervisor: Prof Mark Sandler
in collaboration with L-Acoustics UK
In augmented and cross reality, room acoustics mapping (acoustic simulation at every possible listening position) has recently attracted scientific attention as it allows a virtual acoustic environment to react to the positioning of the listener in a room.
To date, room acoustic mapping is typically performed by modelling the propagation of sound via a ray-based method (image-source model, ray tracing…) or wave-based method (surface element method, volume element method…) and requires extensive computing time and very detailed maps of the room geometry and structures. This project explores a new way of performing room acoustic mapping via room impulse response (RIR) synthesis based on quickly measurable data (such as impulse responses and/or photo spheres) and the usage of AI algorithms.
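To give a feel for the ray-based family mentioned above, here is a minimal sketch of the image-source idea restricted to first-order reflections in an empty rectangular ("shoebox") room. Each wall reflection is modelled as a mirrored copy of the source, and every path contributes a delayed, attenuated impulse. The single frequency-independent reflection coefficient, the 1/distance attenuation, and the function name are simplifying assumptions; practical simulators use many reflection orders and measured wall materials.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def first_order_paths(room, source, listener, reflection=0.8):
    """Return sorted (delay_seconds, amplitude) pairs for the direct path and
    the six first-order wall reflections.

    `room` is (length_x, length_y, length_z) with one corner at the origin.
    Assumes the source and listener are at different positions.
    """
    images = [(tuple(source), 1.0)]  # the real source, with unit gain
    for axis, length in enumerate(room):
        # Mirror the source across the wall at 0 and the wall at `length`.
        for mirrored in (-source[axis], 2 * length - source[axis]):
            image = list(source)
            image[axis] = mirrored
            images.append((tuple(image), reflection))
    paths = []
    for position, gain in images:
        distance = math.dist(position, listener)
        paths.append((distance / SPEED_OF_SOUND, gain / distance))
    return sorted(paths)


# Invented geometry in metres: a 4 x 5 x 3 room, source and listener 2 m apart.
paths = first_order_paths((4.0, 5.0, 3.0), (1.0, 1.0, 1.0), (3.0, 1.0, 1.0))
```

The returned list is a very sparse approximation of a room impulse response for one source and listener position; recomputing it for every listener position is what makes dense room acoustic mapping expensive.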
The successful candidate will research the usage of supervised learning algorithms such as neural networks and deep neural networks to synthesise predicted RIRs for given listener and source positions based on a minimal set of measured data in a room. After creating a dataset using measurements or acoustic simulation techniques, one of the challenges will be to identify and extract relevant information from this dataset to train the algorithm. This project will thus investigate different signal-based methods (possibly paired with computer-vision methods) for feature extraction.
The candidate will have access to an extensive database of geometrical models (CAD files) of concert halls and L-Acoustics will facilitate the access to venues and concert halls for measurements.
The candidate should have experience in at least one of the following scientific areas: audio signal processing, machine learning, deep learning, room acoustics, virtual acoustics, audio information retrieval, programming skills (e.g. Python, C/C++, Matlab). A musical background and/or acute critical listening skills are a plus.
Automatic methods for understanding the content of image, video and audio data have led to considerable technological advances in recent years. Smart speakers would be impossible without modern speech recognition systems, while self-driving cars would be unthinkable without modern computer vision systems. Many of these success stories were supported by the availability of large datasets, which enabled the use of large capacity models, such as neural networks. In this PhD, you will develop automatic methods for understanding sound from music and audio recordings. By exploiting large-scale datasets, your methods will unlock the development of intelligent, efficient and intuitive ways to search, re-use, explore or process audio.