Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

Cited: 0
Authors
Niizumi, Daisuke [1 ]
Takeuchi, Daiki [1 ]
Ohishi, Yasunori [1 ]
Harada, Noboru [1 ]
Kashino, Kunio [1 ]
Affiliations
[1] NTT Corp, NTT Commun Sci Labs, Atsugi, Kanagawa, Japan
Keywords
Self-supervised learning; General-purpose Audio Representation; Masked Autoencoders; Masked Spectrogram Modeling;
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent general-purpose audio representations show state-of-the-art performance on various audio tasks. These representations are pre-trained by self-supervised learning methods that create training signals from the input. For example, typical audio contrastive learning uses temporal relationships among input sounds to create training signals, whereas some methods use differences among input views created by data augmentations. However, these training signals do not provide information derived from the intact input sound, which we think is suboptimal for learning representations that describe the input as it is. In this paper, we seek to learn audio representations from the input itself as supervision, using a pretext task of auto-encoding masked spectrogram patches: Masked Spectrogram Modeling (MSM, a variant of Masked Image Modeling applied to audio spectrograms). To implement MSM, we use Masked Autoencoders (MAE), an image self-supervised learning method. MAE learns to efficiently encode a small number of visible patches into latent representations that carry the essential information for reconstructing a large number of masked patches. During training, MAE minimizes the reconstruction error, which uses the input as the training signal, thereby achieving our goal. We conducted experiments on our MSM using MAE (MSM-MAE) models under the evaluation benchmark of the HEAR 2021 NeurIPS Challenge. Our MSM-MAE models outperformed the HEAR 2021 Challenge results on seven out of 15 tasks (e.g., accuracies of 73.4% on CREMA-D and 85.8% on LibriCount), while remaining competitive on the other tasks, where specialized models perform better. We also investigate how the design choices of MSM-MAE affect performance and conduct a qualitative analysis of visualization outcomes to better understand the learned representations. We have made our code available online for further improvements and applications of the MSM framework.
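The MSM pretext task described in the abstract can be illustrated with a toy NumPy sketch: split a spectrogram into non-overlapping patches, hide a large fraction (MAE typically masks ~75%), reconstruct the hidden patches, and score the reconstruction error on the masked patches only. The mean-of-visible-patches "decoder" below is a hypothetical stand-in for MAE's Transformer encoder/decoder, used only to make the loss computation concrete; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(spec, patch=(16, 16)):
    """Split a (freq, time) spectrogram into non-overlapping flat patches."""
    F, T = spec.shape
    pf, pt = patch
    patches = spec.reshape(F // pf, pf, T // pt, pt).transpose(0, 2, 1, 3)
    return patches.reshape(-1, pf * pt)

def msm_loss(spec, mask_ratio=0.75, patch=(16, 16)):
    """MSM objective: mask most patches, reconstruct them, and compute
    MSE against the original input on the masked patches only."""
    patches = patchify(spec, patch)
    n = len(patches)
    n_masked = int(mask_ratio * n)
    perm = rng.permutation(n)
    masked_idx, visible_idx = perm[:n_masked], perm[n_masked:]
    # Stand-in "decoder": predict every masked patch as the mean visible
    # patch. A real MAE encodes visible patches and decodes the rest.
    pred = np.tile(patches[visible_idx].mean(axis=0), (n_masked, 1))
    return np.mean((pred - patches[masked_idx]) ** 2)

spec = rng.standard_normal((80, 208))  # e.g., an 80-bin log-mel spectrogram
print(f"patches: {len(patchify(spec))}, loss: {msm_loss(spec):.3f}")
```

Because the loss is taken only over masked patches, the training signal comes entirely from the intact input itself, which is the property the abstract contrasts with augmentation-based contrastive signals.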
Pages: 1-24 (24 pages)
Related Papers (50 total)
  • [1] GROUP MASKED MODEL LEARNING FOR GENERAL AUDIO REPRESENTATION
    Atito, Sara; Awais, Muhammed; Alex, Tony; Kittler, Josef
    2023 IEEE International Conference on Image Processing (ICIP), 2023: 2600-2604
  • [2] Enhancing Representation Learning of EEG Data with Masked Autoencoders
    Zhou, Yifei; Liu, Sitong
    Augmented Cognition, Pt II, AC 2024, 2024, 14695: 88-100
  • [3] BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation
    Niizumi, Daisuke; Takeuchi, Daiki; Ohishi, Yasunori; Harada, Noboru; Kashino, Kunio
    2021 International Joint Conference on Neural Networks (IJCNN), 2021
  • [4] EXTENDING AUDIO MASKED AUTOENCODERS TOWARD AUDIO RESTORATION
    Zhong, Zhi; Shi, Hao; Hirano, Masato; Shimada, Kazuki; Tateishi, Kazuya; Shibuya, Takashi; Takahashi, Shusuke; Mitsufuji, Yuki
    2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023
  • [5] CONTRASTIVE LEARNING OF GENERAL-PURPOSE AUDIO REPRESENTATIONS
    Saeed, Aaqib; Grangier, David; Zeghidour, Neil
    2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), 2021: 3875-3879
  • [6] SELF-SUPERVISED LEARNING METHOD USING MULTIPLE SAMPLING STRATEGIES FOR GENERAL-PURPOSE AUDIO REPRESENTATION
    Kuroyanagi, Ibuki; Komatsu, Tatsuya
    2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 3263-3267
  • [7] MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
    Baade, Alan; Peng, Puyuan; Harwath, David
    Interspeech 2022, 2022: 2438-2442
  • [8] Learn from Incomplete Tactile Data: Tactile Representation Learning with Masked Autoencoders
    Cao, Guanqun; Jiang, Jiaqi; Bollegala, Danushka; Luo, Shan
    2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023: 10800-10805
  • [9] Contextual Representation Learning beyond Masked Language Modeling
    Fu, Zhiyi; Zhou, Wangchunshu; Xu, Jingjing; Zhou, Hao; Li, Lei
    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), Vol 1: (Long Papers), 2022: 2701-2714
  • [10] Improving Masked Autoencoders by Learning Where to Mask
    Chen, Haijian; Zhang, Wendong; Wang, Yunbo; Yang, Xiaokang
    Pattern Recognition and Computer Vision, PRCV 2023, Pt VIII, 2024, 14432: 377-390