DEGramNet: effective audio analysis based on a fully learnable time-frequency representation

Cited by: 1
Authors
Foggia, Pasquale [1 ]
Greco, Antonio [1 ]
Roberto, Antonio [1 ]
Saggese, Alessia [1 ]
Vento, Mario [1 ]
Affiliations
[1] Univ Salerno, Via Giovanni Paolo II 132, Fisciano, SA, Italy
Source
NEURAL COMPUTING & APPLICATIONS | 2023, Vol. 35, Issue 27
Keywords
Deep learning; Audio representation learning; Signal processing; Sound event classification; Speaker identification; NEURAL-NETWORKS; RECOGNITION;
DOI
10.1007/s00521-023-08849-7
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Current state-of-the-art deep learning algorithms for audio analysis rely on hand-crafted spectrogram-like representations, which are more compact than descriptors obtained from the raw waveform; the latter, in turn, generalize poorly when little training data is available. However, spectrogram-like representations have two main limitations: (1) the parameters of the filters are defined a priori, regardless of the specific audio analysis task; (2) such representations do not perform any denoising of the audio signal, either in the time domain or in the frequency domain. To overcome these limitations, we propose a new general-purpose convolutional architecture for audio analysis tasks, DEGramNet, which is trained on audio samples described with a novel, compact and learnable time-frequency representation that we call DEGram. The proposed representation is fully trainable: it learns the frequencies of interest for the specific audio analysis task and, in addition, performs denoising through a custom time-frequency attention module that amplifies the frequency and time components in which the sound of interest is actually located. This means that the representation can easily be adapted to the problem at hand, for instance by giving more importance to the voice frequencies when the network is used for speaker recognition. DEGramNet achieved state-of-the-art performance on the VGGSound dataset (sound event classification) and accuracy comparable to a complex, special-purpose approach based on neural architecture search on the VoxCeleb dataset (speaker identification). Moreover, we demonstrate that DEGram achieves high accuracy with lightweight neural networks that can run in real time on embedded systems, making the solution suitable for cognitive robotics applications.
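As an illustration of the idea described in the abstract, the following PyTorch sketch shows a minimal front end of this kind: an STFT magnitude followed by a trainable frequency-mapping layer (standing in for a fixed mel filterbank) and a simple time-frequency attention gate. All names (LearnableFrontEnd, TimeFrequencyAttention) and hyperparameters are illustrative assumptions and do not reproduce the authors' published DEGram implementation.

```python
import torch
import torch.nn as nn


class TimeFrequencyAttention(nn.Module):
    """Rescales each frequency bin and each time frame with learned sigmoid
    gates, so the network can amplify the regions where the target sound is
    actually located (the denoising idea described in the abstract)."""

    def __init__(self, n_bins: int):
        super().__init__()
        self.freq_gate = nn.Sequential(nn.Linear(n_bins, n_bins), nn.Sigmoid())
        self.time_gate = nn.Sequential(nn.Linear(n_bins, 1), nn.Sigmoid())

    def forward(self, spec):                          # spec: (batch, time, bins)
        freq_w = self.freq_gate(spec.mean(dim=1))     # (batch, bins)
        time_w = self.time_gate(spec)                 # (batch, time, 1)
        return spec * freq_w.unsqueeze(1) * time_w


class LearnableFrontEnd(nn.Module):
    """Hypothetical DEGram-style front end: STFT magnitude, a trainable
    frequency-mapping layer in place of a fixed mel filterbank, and the
    time-frequency attention block above."""

    def __init__(self, n_fft: int = 512, hop: int = 160, n_bins: int = 64):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.register_buffer("window", torch.hann_window(n_fft))
        self.filterbank = nn.Linear(n_fft // 2 + 1, n_bins, bias=False)  # learnable
        self.attention = TimeFrequencyAttention(n_bins)

    def forward(self, wave):                          # wave: (batch, samples)
        spec = torch.stft(wave, self.n_fft, self.hop, window=self.window,
                          return_complex=True).abs()  # (batch, freq, time)
        spec = torch.log1p(self.filterbank(spec.transpose(1, 2)))
        return self.attention(spec)                   # (batch, time, n_bins)


# Example: a batch of one-second 16 kHz clips yields a (2, 101, 64) tensor
# that can feed any downstream CNN for sound event or speaker classification.
features = LearnableFrontEnd()(torch.randn(2, 16000))
```

Because both the filterbank weights and the attention gates are trained jointly with the downstream classifier, such a front end can emphasize task-relevant frequency bands (e.g., voice frequencies for speaker recognition), which is the adaptability the abstract attributes to DEGram.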
Pages: 20207-20219 (13 pages)