EXTENDING AUDIO MASKED AUTOENCODERS TOWARD AUDIO RESTORATION

Cited by: 1
Authors
Zhong, Zhi [1]
Shi, Hao [2]
Hirano, Masato [1]
Shimada, Kazuki [3]
Tateishi, Kazuya [1]
Shibuya, Takashi [3]
Takahashi, Shusuke [1]
Mitsufuji, Yuki [1, 3]
Affiliations
[1] Sony Grp Corp, Tokyo, Japan
[2] Kyoto Univ, Kyoto, Japan
[3] Sony Res, Kyoto, Japan
Keywords
Audio classification; audio restoration; speech enhancement; masked autoencoder; vision transformer
DOI
10.1109/WASPAA58266.2023.10248171
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Audio classification and restoration are among the major downstream tasks in audio signal processing. However, restoration benefits less from pretrained models than classification, where pretrained models have been overwhelmingly successful. Because of this imbalance, there has been rising interest in how to improve the performance of pretrained models on restoration tasks, e.g., speech enhancement (SE). Previous works have shown that the features extracted by pretrained audio encoders are effective for SE tasks, but these speech-specialized encoder-only models usually require extra decoders to become compatible with SE, and involve complicated pretraining procedures or complex data augmentation. Therefore, in pursuit of a universal audio model, the audio masked autoencoder (MAE), whose backbone is the autoencoder of Vision Transformers (ViT-AE), is extended from audio classification to SE, a representative restoration task with well-established evaluation standards. ViT-AE learns to restore masked audio signals via a mel-to-mel mapping during pretraining, which is similar to restoration tasks like SE. We propose variations of ViT-AE for better SE performance, where the mel-to-mel variations yield high scores in non-intrusive metrics and the STFT-oriented variation is effective on intrusive metrics such as PESQ. Different variations can be used depending on the scenario. Comprehensive evaluations reveal that MAE pretraining is beneficial to SE tasks and helps the ViT-AE generalize better to out-of-domain distortions. We further found that large-scale noisy data of general audio sources, rather than clean speech, is sufficiently effective for pretraining.
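The abstract describes pretraining as restoring masked audio through a mel-to-mel mapping. As a rough illustration of that idea only, the NumPy sketch below hides a random subset of mel-spectrogram patches and scores a reconstruction on the hidden patches. It is not the authors' implementation: the patch size (16), the mask ratio (0.75), the masked-patch mean squared error, the helper names, and the random stand-in spectrogram are all assumptions made for this example, and the all-zero prediction is a placeholder for a real decoder output.

```python
import numpy as np


def patchify(mel, patch=16):
    """Split a (n_mels, n_frames) array into flattened, non-overlapping patches."""
    n_mels, n_frames = mel.shape
    assert n_mels % patch == 0 and n_frames % patch == 0
    rows, cols = n_mels // patch, n_frames // patch
    # (rows, patch, cols, patch) -> (rows, cols, patch, patch) -> (rows*cols, patch*patch)
    x = mel.reshape(rows, patch, cols, patch).transpose(0, 2, 1, 3)
    return x.reshape(rows * cols, patch * patch)


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean array that is True where a patch is hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask


def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked patches only (assumed loss)."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))


rng = np.random.default_rng(0)
mel = rng.standard_normal((128, 256))        # random stand-in for a log-mel spectrogram
patches = patchify(mel)                      # one row per 16x16 patch
mask = random_mask(len(patches), 0.75, rng)  # assumed mask ratio
visible = patches[~mask]                     # the only patches an encoder would see
pred = np.zeros_like(patches)                # placeholder for a decoder's mel output
loss = masked_mse(pred, patches, mask)
```

In an actual model, `visible` would be fed to the encoder and `pred` would come from the decoder; for speech enhancement the same mel-to-mel shape applies, with a noisy spectrogram as input and a clean one as target.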
Pages: 5
Related papers
50 in total
  • [21] INTERPOLATION OF MISSING SAMPLES FOR AUDIO RESTORATION
    ORUANAIDH, JJK
    FITZGERALD, WJ
    ELECTRONICS LETTERS, 1994, 30 (08) : 622 - 623
  • [22] AUDIOCLIP: EXTENDING CLIP TO IMAGE, TEXT AND AUDIO
    Guzhov, Andrey
    Raue, Federico
    Hees, Joern
    Dengel, Andreas
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 976 - 980
  • [23] Extending Audio Notetaker to Browse WebASR Transcriptions
    Tucker, Roger
    Fry, Dan
    Wan, Vincent
    Wrigley, Stuart
    Hain, Thomas
    12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 3336 - +
  • [24] Using Deep Autoencoders for In-vehicle Audio Anomaly Detection
    Pereira, Pedro Jose
    Coelho, Gabriel
    Ribeiro, Alexandrine
    Matos, Luis Miguel
    Nunes, Eduardo C.
    Ferreira, Andre
    Pilastri, Andre
    Cortez, Paulo
    KNOWLEDGE-BASED AND INTELLIGENT INFORMATION & ENGINEERING SYSTEMS (KSE 2021), 2021, 192 : 298 - 307
  • [25] GROUP MASKED MODEL LEARNING FOR GENERAL AUDIO REPRESENTATION
    Atito, Sara
    Awais, Muhammed
    Alex, Tony
    Kittler, Josef
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 2600 - 2604
  • [26] Phase-Aware Transformations in Variational Autoencoders for Audio Effects
    Cámara, Mateo
    Blanco, José Luis
    AES: Journal of the Audio Engineering Society, 2022, 70 (09): : 731 - 741
  • [27] SOME NEW POSSIBILITIES IN AUDIO-RESTORATION
    ALLEN, JS
    ASSOCIATION FOR RECORDED SOUND COLLECTIONS-JOURNAL, 1990, 21 (01): : 39 - 44
  • [28] Multi-channel audio statistical restoration
    Liu, Yinhong
    Godsill, Simon
    2020 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, COMMUNICATIONS AND COMPUTING (IEEE ICSPCC 2020), 2020,
  • [29] Objective and subjective comparison of audio restoration methods
    Canazza, S
    Coraddu, G
    De Poli, G
    Mian, GA
    JOURNAL OF NEW MUSIC RESEARCH, 2001, 30 (01) : 93 - 102
  • [30] 50 QUESTIONS ON AUDIO RESTORATION AND TRANSFER TECHNOLOGY
    OWEN, T
    ASSOCIATION FOR RECORDED SOUND COLLECTIONS-JOURNAL, 1983, 15 (2-3): : 39 - A45