Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

Cited by: 0
Authors
Niizumi, Daisuke [1 ]
Takeuchi, Daiki [1 ]
Ohishi, Yasunori [1 ]
Harada, Noboru [1 ]
Kashino, Kunio [1 ]
Affiliations
[1] NTT Corporation, NTT Communication Science Laboratories, Atsugi, Kanagawa, Japan
Keywords
Self-supervised learning; General-purpose Audio Representation; Masked Autoencoders; Masked Spectrogram Modeling
DOI
Not available
CLC number
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Recent general-purpose audio representations show state-of-the-art performance on various audio tasks. These representations are pre-trained by self-supervised learning methods that create training signals from the input. For example, typical audio contrastive learning uses temporal relationships among input sounds to create training signals, whereas some methods use differences among input views created by data augmentations. However, these training signals do not provide information derived from the intact input sound, which we think is suboptimal for learning representations that describe the input as it is. In this paper, we seek to learn audio representations from the input itself as supervision, using a pretext task of auto-encoding masked spectrogram patches: Masked Spectrogram Modeling (MSM), a variant of Masked Image Modeling applied to the audio spectrogram. To implement MSM, we use Masked Autoencoders (MAE), an image self-supervised learning method. MAE learns to efficiently encode a small number of visible patches into latent representations that carry the information essential for reconstructing a large number of masked patches. During training, MAE minimizes the reconstruction error, which uses the input itself as the training signal, thereby achieving our goal. We conducted experiments on our MSM using MAE (MSM-MAE) models under the evaluation benchmark of the HEAR 2021 NeurIPS Challenge. Our MSM-MAE models outperformed the HEAR 2021 Challenge results on seven out of 15 tasks (e.g., accuracies of 73.4% on CREMA-D and 85.8% on LibriCount), while not reaching the top results on the other tasks, where specialized models perform better. We also investigate how the design choices of MSM-MAE affect performance and conduct a qualitative analysis of visualization outcomes to understand the learned representations. We have made our code available online for further improvements and applications of the MSM framework.
Pages: 1-24
Page count: 24