EXTENDING AUDIO MASKED AUTOENCODERS TOWARD AUDIO RESTORATION

Cited by: 1
Authors
Zhong, Zhi [1]
Shi, Hao [2]
Hirano, Masato [1]
Shimada, Kazuki [3]
Tateishi, Kazuya [1]
Shibuya, Takashi [3]
Takahashi, Shusuke [1]
Mitsufuji, Yuki [1, 3]
Affiliations
[1] Sony Grp Corp, Tokyo, Japan
[2] Kyoto Univ, Kyoto, Japan
[3] Sony Res, Kyoto, Japan
Keywords
Audio classification; audio restoration; speech enhancement; masked autoencoder; vision transformer
DOI
10.1109/WASPAA58266.2023.10248171
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Audio classification and restoration are among the major downstream tasks in audio signal processing. However, restoration benefits far less from pretrained models than classification, where pretrained models have been overwhelmingly successful. Because of this imbalance, there has been rising interest in how to improve the performance of pretrained models on restoration tasks such as speech enhancement (SE). Previous works have shown that the features extracted by pretrained audio encoders are effective for SE tasks, but these speech-specialized, encoder-only models usually require extra decoders to become compatible with SE, and they involve complicated pretraining procedures or complex data augmentation. Therefore, in pursuit of a universal audio model, the audio masked autoencoder (MAE), whose backbone is the autoencoder of Vision Transformers (ViT-AE), is extended from audio classification to SE, a representative restoration task with well-established evaluation standards. ViT-AE learns to restore masked audio signals via a mel-to-mel mapping during pretraining, which is similar to restoration tasks like SE. We propose variations of ViT-AE for better SE performance: the mel-to-mel variations yield high scores on non-intrusive metrics, and the STFT-oriented variation is effective on intrusive metrics such as PESQ. Different variations can be chosen depending on the scenario. Comprehensive evaluations reveal that MAE pretraining is beneficial to SE tasks and helps the ViT-AE generalize better to out-of-domain distortions. We further found that large-scale noisy data from general audio sources, rather than clean speech, is sufficiently effective for pretraining.
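The masked mel-to-mel pretraining objective mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The 16x16 patch size, the 0.75 mask ratio, the loss restricted to masked patches, and the helper names `patchify`, `random_mask`, and `masked_mse` are all assumptions borrowed from common masked-autoencoder practice, and the all-zero "prediction" merely stands in for a decoder output.

```python
import numpy as np


def patchify(mel, patch=16):
    """Split a (n_mels, n_frames) array into flattened, non-overlapping patches."""
    n_mels, n_frames = mel.shape
    assert n_mels % patch == 0 and n_frames % patch == 0
    h, w = n_mels // patch, n_frames // patch
    # (h, patch, w, patch) -> (h, w, patch, patch) -> (h * w, patch * patch)
    x = mel.reshape(h, patch, w, patch).transpose(0, 2, 1, 3)
    return x.reshape(h * w, patch * patch)


def random_mask(num_patches, mask_ratio=0.75, rng=None):
    """Return a boolean array that is True where a patch is hidden from the encoder."""
    rng = np.random.default_rng() if rng is None else rng
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask


def masked_mse(pred, target, mask):
    """Mean squared error computed on the masked patches only (assumed convention)."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))


rng = np.random.default_rng(0)
mel = rng.standard_normal((128, 256))      # random stand-in for a log-mel spectrogram
patches = patchify(mel)                    # 8 x 16 = 128 patches of 256 values each
mask = random_mask(len(patches), 0.75, rng)
visible = patches[~mask]                   # the part an encoder would receive
pred = np.zeros_like(patches)              # placeholder for a decoder's reconstruction
loss = masked_mse(pred, patches, mask)
```

In an SE setting the same mel-to-mel shape would apply with a noisy spectrogram as input and a clean one as target, which is the similarity between pretraining and restoration that the abstract points to.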
Pages: 5