DEGramNet: effective audio analysis based on a fully learnable time-frequency representation

Cited by: 1
Authors
Foggia, Pasquale [1]
Greco, Antonio [1]
Roberto, Antonio [1]
Saggese, Alessia [1]
Vento, Mario [1]
Affiliations
[1] Univ Salerno, Via Giovanni Paolo II 132, Fisciano, SA, Italy
Source
NEURAL COMPUTING & APPLICATIONS | 2023, Vol. 35, Issue 27
Keywords
Deep learning; Audio representation learning; Signal processing; Sound event classification; Speaker identification; NEURAL-NETWORKS; RECOGNITION;
DOI
10.1007/s00521-023-08849-7
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Current state-of-the-art audio analysis algorithms based on deep learning rely on hand-crafted spectrogram-like audio representations, which are more compact than descriptors obtained from the raw waveform; the latter, in turn, are far from achieving good generalization when little training data is available. However, spectrogram-like representations have two main limitations: (1) the parameters of the filters are defined a priori, regardless of the specific audio analysis task; (2) such representations do not perform any denoising operation on the audio signal, either in the time domain or in the frequency domain. To overcome these limitations, we propose a new general-purpose convolutional architecture for audio analysis tasks that we call DEGramNet, which is trained with audio samples described with a novel, compact and learnable time-frequency representation that we call DEGram. The proposed representation is fully trainable: it is able to learn the frequencies of interest for the specific audio analysis task; in addition, it performs denoising through a custom time-frequency attention module, which amplifies the frequency and time components in which the sound is actually located. This implies that the proposed representation can be easily adapted to the specific problem at hand, for instance by giving more importance to the voice frequencies when the network needs to be used for speaker recognition. DEGramNet achieved state-of-the-art performance on the VGGSound dataset (for Sound Event Classification) and accuracy comparable with that of a complex, special-purpose approach based on network architecture search on the VoxCeleb dataset (for Speaker Identification). Moreover, we demonstrate that DEGram makes it possible to achieve high accuracy with lightweight neural networks that can be used in real time on embedded systems, making the solution suitable for Cognitive Robotics applications.
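The two ideas named in the abstract, filters whose frequencies are parameters rather than fixed constants and a time-frequency attention step that reweights the representation, can be illustrated with a minimal sketch. This is not the DEGram implementation: the Gaussian filter shape, the parameter-free sigmoid attention, and every function name below are assumptions made for illustration. In a trainable model the values in `centers_hz` and the attention weights would be learned by gradient descent instead of being set by hand.

```python
import math

def power_frames(signal, frame_len, hop):
    """Power spectrum of each frame via a naive DFT (rectangular window)."""
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        spectrum = []
        for k in range(n_bins):
            re = sum(x * math.cos(2 * math.pi * k * n / frame_len)
                     for n, x in enumerate(chunk))
            im = sum(x * math.sin(2 * math.pi * k * n / frame_len)
                     for n, x in enumerate(chunk))
            spectrum.append(re * re + im * im)
        frames.append(spectrum)
    return frames  # indexed [frame][bin]

def gaussian_filterbank(centers_hz, bandwidth_hz, n_bins, sample_rate):
    """One Gaussian band-pass filter per center frequency (assumed shape).

    The center frequencies are the quantities a learnable front end would tune.
    """
    bin_hz = (sample_rate / 2.0) / (n_bins - 1)
    return [[math.exp(-0.5 * ((k * bin_hz - fc) / bandwidth_hz) ** 2)
             for k in range(n_bins)]
            for fc in centers_hz]

def band_energies(frames, bank):
    """Log-compressed energy per band and frame, indexed [band][frame]."""
    return [[math.log1p(sum(w * p for w, p in zip(row, frame)))
             for frame in frames]
            for row in bank]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def time_frequency_attention(rep):
    """Hand-written stand-in for a learned attention module.

    Bands and frames with above-average energy get weights above 0.5,
    the rest get weights below 0.5.
    """
    n_bands, n_frames = len(rep), len(rep[0])
    mean_all = sum(sum(row) for row in rep) / (n_bands * n_frames)
    freq_w = [sigmoid(sum(row) / n_frames - mean_all) for row in rep]
    time_w = [sigmoid(sum(rep[b][t] for b in range(n_bands)) / n_bands - mean_all)
              for t in range(n_frames)]
    out = [[rep[b][t] * freq_w[b] * time_w[t] for t in range(n_frames)]
           for b in range(n_bands)]
    return out, freq_w, time_w

# Toy input: a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]

frames = power_frames(signal, frame_len=64, hop=32)
centers_hz = [500.0, 1000.0, 3000.0]  # hypothetical "learned" values
bank = gaussian_filterbank(centers_hz, 200.0, len(frames[0]), sample_rate)
rep = band_energies(frames, bank)
attended, freq_w, time_w = time_frequency_attention(rep)
```

With this toy input, the band centered on 1000 Hz receives the largest frequency weight, which mirrors the abstract's description of amplifying the components in which the sound is actually located.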
Pages: 20207-20219
Page count: 13
Related papers
50 in total
  • [12] A new time-frequency representation for music signal analysis: Resonator Time-Frequency Image
    Zhou, Ruohua
    Mattavelli, Marco
    2007 9TH INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND ITS APPLICATIONS, VOLS 1-3, 2007, : 1278 - 1281
  • [13] Performance on a Combined Representation for Time-Frequency Analysis
    Lin, Rongping
    Du, Chunhui
    Luo, Shan
    Xu, Qi
    2017 2ND INTERNATIONAL CONFERENCE ON IMAGE, VISION AND COMPUTING (ICIVC 2017), 2017, : 858 - 862
  • [14] Audio Fingerprint Extraction Based on Time-Frequency Domain
    Liu, Zhengzheng
    Li, Cong
    Cao, Sanxing
    2016 2ND IEEE INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATIONS (ICCC), 2016, : 1975 - 1979
  • [15] Content based audio classification and retrieval using joint time-frequency analysis
    Esmaili, S
    Krishnan, S
    Raahemifar, K
    2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL V, PROCEEDINGS: DESIGN AND IMPLEMENTATION OF SIGNAL PROCESSING SYSTEMS INDUSTRY TECHNOLOGY TRACKS MACHINE LEARNING FOR SIGNAL PROCESSING MULTIMEDIA SIGNAL PROCESSING SIGNAL PROCESSING FOR EDUCATION, 2004, : 665 - 668
  • [16] HILBERT SPECTRUM IN TIME-FREQUENCY REPRESENTATION OF AUDIO SIGNALS CONSIDERING DISJOINT ORTHOGONALITY
    Molla, Md. Khademul Islam
    Hirose, Keikichi
    ADVANCES IN DATA SCIENCE AND ADAPTIVE ANALYSIS, 2010, 2 (03) : 313 - 336
  • [17] Tracking of frequency in a time-frequency representation
    Roguet, W
    Martin, N
    Chehikian, A
    PROCEEDINGS OF THE IEEE-SP INTERNATIONAL SYMPOSIUM ON TIME-FREQUENCY AND TIME-SCALE ANALYSIS, 1996, : 341 - 344
  • [18] Time-Frequency Processing for Spatial Audio
    Rumsey, Francis
    JOURNAL OF THE AUDIO ENGINEERING SOCIETY, 2010, 58 (7-8) : 655 - 659
  • [19] Time-Frequency Based Thermal Imaging: An Effective Tool for Quantitative Analysis
    G. V. P. Chandra Sekhar Yadav
    V. S. Ghali
    S. K. Subhani
    Russian Journal of Nondestructive Testing, 2023, 59 : 1165 - 1176
  • [20] Multi-Gabor dictionaries for audio time-frequency analysis
    Wolfe, PJ
    Godsill, SJ
    Dörfler, M
    PROCEEDINGS OF THE 2001 IEEE WORKSHOP ON THE APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, 2001, : 43 - 46