This AI Paper Proposes A Self-Supervised Music Understanding Model Called MERT That Attains Overall SOTA Performance on 14 MIR Tasks


Self-supervised learning is prominently used in artificial intelligence to build intelligent systems. Transformer models such as BERT and T5 have recently become popular for their strong performance and apply the idea of self-supervision to natural language processing tasks: they are first trained on massive amounts of unlabeled data and then fine-tuned on labeled samples. Although self-supervised learning has been used successfully in a number of fields, including speech processing, computer vision, and natural language processing, its application to music audio remains underexplored. The main obstacle is modeling musical knowledge, such as the tonal and pitched characteristics of music.

To address this issue, a team of researchers has introduced MERT, short for 'Music undERstanding model with large-scale self-supervised Training.' This acoustic model uses teacher models to generate pseudo labels for the pre-training phase in the manner of masked language modeling (MLM). By integrating the teachers, MERT helps the student model, a BERT-style transformer encoder, learn better representations of music audio.
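The core pre-training idea can be illustrated with a minimal sketch: a teacher assigns each audio frame a discrete pseudo-label, some frames are masked, and the student is penalized only on the masked positions. All sizes, masking rates, and the toy codebook below are illustrative, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_frames(num_frames, mask_prob=0.3):
    """Pick a random subset of frame indices to mask (illustrative rate)."""
    return rng.random(num_frames) < mask_prob

# Toy setup: a teacher maps each of 10 frames to one of 8 codewords,
# and the student produces logits over that codebook for every frame.
teacher_labels = rng.integers(0, 8, size=10)
student_logits = rng.standard_normal((10, 8))

mask = mask_frames(10)

def masked_cross_entropy(logits, labels, mask):
    """Cross-entropy computed only on masked frames, as in masked LM pre-training."""
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    losses = -np.log(probs[np.arange(len(labels)), labels])
    return losses[mask].mean() if mask.any() else 0.0

loss = masked_cross_entropy(student_logits, teacher_labels, mask)
```

Restricting the loss to masked positions forces the student to infer the hidden frames' content from surrounding context, which is what gives the representations their predictive structure.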

This generalizable and affordable pre-trained acoustic music model follows a speech self-supervised learning paradigm and employs teacher models to generate pseudo targets for sequential audio clips, using a multi-task setup to balance acoustic and musical representation learning. To make the learned representations more robust, MERT introduces an in-batch noise mixture augmentation: audio recordings are mixed with random clips, distorting them so that the model must pick out the relevant content even under obscured conditions. This improves the model's ability to generalize to situations where music is mixed with irrelevant audio.
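A minimal sketch of such in-batch mixing: with some probability, each clip has another randomly chosen clip from the same batch added at reduced gain. The `mix_prob` and `gain` values here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(42)

def inbatch_mixup(batch, mix_prob=0.5, gain=0.3):
    """In-batch noise mixture (sketch): with probability mix_prob, add a
    different clip from the same batch, scaled by gain, to each waveform."""
    batch = np.asarray(batch, dtype=np.float64)
    out = batch.copy()
    n = len(batch)
    for i in range(n):
        if n > 1 and rng.random() < mix_prob:
            j = rng.choice([k for k in range(n) if k != i])
            out[i] = batch[i] + gain * batch[j]
    return out

clips = rng.standard_normal((4, 16000))  # 4 one-second clips at 16 kHz
augmented = inbatch_mixup(clips)
```

Because the "noise" comes from the batch itself, the augmentation costs no extra data loading, which is one reason in-batch mixing is a common choice for audio robustness.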

The team arrived at an effective combination of teacher models that outperforms conventional audio and speech baselines: an acoustic teacher based on Residual Vector Quantization - Variational AutoEncoder (RVQ-VAE) and a musical teacher based on the Constant-Q Transform (CQT). The acoustic teacher uses RVQ-VAE to provide a discretized acoustic-level summarization of the music signal, capturing its acoustic characteristics, while the CQT-based musical teacher focuses on the tonal and pitched aspects of the music. Together, these teachers guide the student model toward meaningful representations of music audio.
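Why the CQT suits a musical teacher can be seen from its frequency layout: bins are spaced geometrically, so each bin spans a fixed musical interval rather than a fixed number of hertz. The sketch below computes the standard CQT center-frequency grid; the defaults (C1 at about 32.70 Hz, 12 bins per octave, 84 bins) are common illustrative choices, not necessarily the paper's configuration.

```python
import numpy as np

def cqt_center_frequencies(f_min=32.70, bins_per_octave=12, n_bins=84):
    """Center frequencies of a Constant-Q Transform: geometrically spaced
    so every bin covers the same musical interval (here a semitone)."""
    k = np.arange(n_bins)
    return f_min * 2.0 ** (k / bins_per_octave)

freqs = cqt_center_frequencies()
# The ratio between adjacent bins is constant (2**(1/12), about 1.0595),
# so the representation aligns with equal-tempered pitch, making tonal
# structure directly visible to the musical teacher.
```

By contrast, a plain STFT spaces bins linearly in hertz, so low octaves get very few bins, which is why pitch-oriented models often prefer CQT features.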

The team also explored training settings to address the instability of acoustic language model pre-training. By optimizing these settings, they scaled MERT from 95M to 330M parameters, yielding a more powerful model capable of capturing intricate details of music audio. In evaluation, the experimental results demonstrated MERT's effectiveness at generalizing across music understanding tasks: the model achieved overall SOTA scores on 14 different tasks, showcasing its strong performance and generalization ability.

In conclusion, the MERT model addresses the gap in applying self-supervised learning to music audio, offering a generalizable and affordable pre-trained model for music understanding.

Check out the Paper and GitHub link.


The post This AI Paper Proposes A Self-Supervised Music Understanding Model Called MERT That Attains Overall SOTA Performance on 14 MIR Tasks appeared first on MarkTechPost.







