Degramnet: effective audio analysis based on a fully learnable time-frequency representation

Cited by: 1
Authors
Foggia, Pasquale [1]
Greco, Antonio [1]
Roberto, Antonio [1]
Saggese, Alessia [1]
Vento, Mario [1]
Affiliations
[1] Univ Salerno, Via Giovanni Paolo II 132, Fisciano, SA, Italy
Source
NEURAL COMPUTING & APPLICATIONS | 2023 / Vol. 35 / Issue 27
Keywords
Deep learning; Audio representation learning; Signal processing; Sound event classification; Speaker identification; NEURAL-NETWORKS; RECOGNITION;
DOI
10.1007/s00521-023-08849-7
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Current state-of-the-art audio analysis algorithms based on deep learning rely on hand-crafted spectrogram-like audio representations, which are more compact than descriptors obtained from the raw waveform; the latter, in turn, are far from achieving good generalization when little training data is available. However, spectrogram-like representations have two main limitations: (1) the parameters of the filters are defined a priori, regardless of the specific audio analysis task; (2) such representations do not perform any denoising operation on the audio signal, either in the time domain or in the frequency domain. To overcome these limitations, we propose a new general-purpose convolutional architecture for audio analysis tasks that we call DEGramNet, which is trained with audio samples described with a novel, compact and learnable time-frequency representation that we call DEGram. The proposed representation is fully trainable: it is able to learn the frequencies of interest for the specific audio analysis task; in addition, it performs denoising through a custom time-frequency attention module, which amplifies the frequency and time components in which the sound is actually located. This means that the proposed representation can be easily adapted to the specific problem at hand, for instance by giving more importance to the voice frequencies when the network is used for speaker recognition. DEGramNet achieved state-of-the-art performance on the VGGSound dataset (for Sound Event Classification) and accuracy comparable to that of a complex, special-purpose approach based on network architecture search on the VoxCeleb dataset (for Speaker Identification). Moreover, we demonstrate that DEGram makes it possible to achieve high accuracy with lightweight neural networks that can be used in real time on embedded systems, making the solution suitable for Cognitive Robotics applications.
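The abstract describes the attention module only at a high level. The snippet below is a minimal sketch of the general idea of time-frequency gating, not the DEGram module itself: the function name `time_frequency_attention`, the sigmoid gates, and the four scalar parameters are all assumptions introduced for illustration, standing in for weights that a network would learn.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def time_frequency_attention(spec, freq_scale, freq_bias, time_scale, time_bias):
    """Reweight a time-frequency map with per-frequency and per-frame gates.

    spec: list of rows, one per frequency bin, each a list of frame energies.
    The four scalars are hypothetical stand-ins for learned parameters.
    """
    n_freq = len(spec)
    n_time = len(spec[0])

    # Average energy of each frequency bin over time.
    freq_mean = [sum(row) / n_time for row in spec]
    # Average energy of each time frame over frequency.
    time_mean = [sum(spec[f][t] for f in range(n_freq)) / n_freq
                 for t in range(n_time)]

    # Gates in (0, 1); with positive scales, bins and frames that carry
    # more energy receive larger weights.
    freq_gate = [sigmoid(freq_scale * m + freq_bias) for m in freq_mean]
    time_gate = [sigmoid(time_scale * m + time_bias) for m in time_mean]

    # Apply both gates to every cell of the map.
    return [[spec[f][t] * freq_gate[f] * time_gate[t] for t in range(n_time)]
            for f in range(n_freq)]


# Toy input: 3 frequency bins x 4 frames; the sound sits in bin 1, frames 1-2.
toy_spec = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 2.0, 2.0, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
attended = time_frequency_attention(toy_spec, 4.0, -2.0, 4.0, -2.0)
```

Because each gate lies in (0, 1), this sketch emphasizes the active region only in relative terms: cells where the sound is located are attenuated less than the background cells.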
Pages: 20207 - 20219
Page count: 13
Related papers
50 in total
  • [41] EVALUATION OF AUDIO COMPANDORS IN THE TIME-FREQUENCY DOMAIN
    SKRITEK, P
    HLAWATSCH, F
    JOURNAL OF THE AUDIO ENGINEERING SOCIETY, 1986, 34 (05): 386 - 386
  • [42] Time-frequency algorithm of audio signal compression
    Rabinovich, E. V.
    Shekhirev, A. V.
    APEIE-2006 8TH INTERNATIONAL CONFERENCE ON ACTUAL PROBLEMS OF ELECTRONIC INSTRUMENT ENGINEERING PROCEEDINGS, VOL 1, 2006, : 147 - +
  • [43] JOINT TIME-FREQUENCY SCATTERING FOR AUDIO CLASSIFICATION
    Anden, Joakim
    Lostanlen, Vincent
    Mallat, Stephane
    2015 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, 2015,
  • [44] Time-frequency analysis of the first heart sound. Part 2: An appropriate time-frequency representation technique
    Chen, D
    Durand, LG
    Guo, Z
    Lee, HC
    MEDICAL & BIOLOGICAL ENGINEERING & COMPUTING, 1997, 35 (04) : 311 - 317
  • [45] Persistent Time-Frequency Shrinkage for Audio Denoising
    Siedenburg, Kai
    Doerfler, Monika
    JOURNAL OF THE AUDIO ENGINEERING SOCIETY, 2013, 61 (1-2): 29 - 38
  • [46] Classification of Time-Frequency Regions in Stereo Audio
    Harma, Aki
    JOURNAL OF THE AUDIO ENGINEERING SOCIETY, 2011, 59 (10): 707 - 720
  • [47] Time-frequency domain fast audio transcoding
    Ju, Fu-Shing
    Fang, Ce-Min
    ISM 2006: EIGHTH IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA, PROCEEDINGS, 2006, : 750 - 753
  • [48] Perception-Based Audio Authentication Watermarking in the Time-Frequency Domain
    Zmudzinski, Sascha
    Steinebach, Martin
    INFORMATION HIDING, 2009, 5806 : 146 - 160
  • [49] Method of signal time-frequency representation
    Qiang, Lin
    Xi'an Shiyou Xueyuan Xuebao/Journal of Xi'an Petroleum Institute (Natural Science Edition), 1997, 12 (04): 50 - 53
  • [50] Assessment of time-frequency representation techniques for thoracic sounds analysis
    Reyes, B. A.
    Charleston-Villalobos, S.
    Gonzalez-Camarena, R.
    Aljama-Corrales, T.
    COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, 2014, 114 (03) : 276 - 290