Congratulations to AIM members Aditya Bhattacharjee and Christos Plachouras, and C4DM member Sungkyun Chang, who secured first place in the Query-by-Vocal Imitation (QbVI) Challenge, held as part of the AES International Conference on Artificial Intelligence and Machine Learning for Audio (AES AIMLA 2025), taking place September 8–10, 2025.
The winning entry addressed the task of retrieving relevant audio clips from a database using only a vocal imitation as the query. This is a particularly difficult problem due to the variability in how people vocalise sounds and the acoustic diversity across sound categories: successful approaches must bridge the gap between vocal and non-vocal audio while handling the unpredictability of human-generated imitations.
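At its core, such a system embeds the vocal query and every candidate clip into a shared space and ranks clips by similarity. The sketch below illustrates this retrieval step only; the embeddings are random stand-ins, not the team's model, and the function name is a hypothetical choice.

```python
import numpy as np

def rank_clips(query_emb: np.ndarray, clip_embs: np.ndarray) -> np.ndarray:
    """Rank database clips by cosine similarity to a vocal-imitation query.

    query_emb: (d,) embedding of the vocal imitation.
    clip_embs: (n, d) embeddings of the n candidate sound clips.
    Returns clip indices sorted from most to least similar.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    sims = c @ q              # cosine similarity of each clip to the query
    return np.argsort(-sims)  # best match first

# Usage with random stand-in embeddings; a real system would obtain
# these from learned audio encoders for the query and the clips.
rng = np.random.default_rng(0)
query = rng.normal(size=128)
clips = rng.normal(size=(1000, 128))
top10 = rank_clips(query, clips)[:10]
```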
The team’s submission, titled “Effective Finetuning Methods for Query-by-Vocal Imitation”, advances the state of the art in QbVI by combining a triplet-based regularisation objective with supervised contrastive learning. The method addresses the scarcity of training data by sampling from a previously unused subset of the VocalSketch dataset, comprising practice recordings and human-rejected vocal imitations. While these recordings are unsuitable as positive matches, they serve as confounding examples during training: they enlarge the pool of negative examples drawn upon by the added regularisation objective.
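To make the training recipe concrete, here is a minimal PyTorch-style sketch of one way a supervised contrastive objective (which reduces to InfoNCE when each anchor has a single positive) could be combined with a triplet-margin regulariser that uses the rejected imitations purely as extra negatives. The function names, the margin, and the weighting `lam` are illustrative assumptions, not the winning entry’s exact formulation.

```python
import torch
import torch.nn.functional as F

def supcon_loss(q: torch.Tensor, p: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Contrastive loss over a batch of imitation/clip embedding pairs.

    q: (B, d) L2-normalised imitation embeddings.
    p: (B, d) L2-normalised embeddings of their matching reference clips.
    Each imitation's positive is its own clip; all other clips in the
    batch act as in-batch negatives.
    """
    logits = q @ p.t() / tau                    # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

def triplet_reg(q: torch.Tensor, p: torch.Tensor, extra_neg: torch.Tensor,
                margin: float = 0.2) -> torch.Tensor:
    """Triplet-style regulariser using practice/rejected imitations.

    extra_neg: (B, d) embeddings of confounding examples, e.g. sampled
    from the unused VocalSketch subset. They only ever appear on the
    negative side of the margin and are never treated as positives.
    """
    pos_sim = (q * p).sum(dim=-1)               # similarity to the true clip
    neg_sim = (q * extra_neg).sum(dim=-1)       # similarity to the confounder
    return F.relu(neg_sim - pos_sim + margin).mean()

def total_loss(q, p, extra_neg, lam: float = 0.5) -> torch.Tensor:
    # lam is an assumed weighting between the two terms.
    q, p, extra_neg = (F.normalize(x, dim=-1) for x in (q, p, extra_neg))
    return supcon_loss(q, p) + lam * triplet_reg(q, p, extra_neg)
```

The key design point, as described above, is that the rejected imitations appear only as negatives, so their noisy labels cannot corrupt the positive pairs while still enlarging the negative pool.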
The proposed method surpassed state-of-the-art baselines on both subjective and objective evaluation metrics, opening up scope for product innovations and software tools that let artists efficiently search large repositories of sound effects.