DEGramNet: effective audio analysis based on a fully learnable time-frequency representation

Cited by: 1
Authors
Foggia, Pasquale [1]
Greco, Antonio [1]
Roberto, Antonio [1]
Saggese, Alessia [1]
Vento, Mario [1]
Affiliations
[1] Univ Salerno, Via Giovanni Paolo II 132, Fisciano, SA, Italy
Source
NEURAL COMPUTING & APPLICATIONS | 2023 / Vol. 35 / Issue 27
Keywords
Deep learning; Audio representation learning; Signal processing; Sound event classification; Speaker identification; NEURAL-NETWORKS; RECOGNITION;
DOI
10.1007/s00521-023-08849-7
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Current state-of-the-art deep learning algorithms for audio analysis rely on hand-crafted spectrogram-like audio representations, which are more compact than descriptors obtained from the raw waveform; the latter, in turn, generalize poorly when little training data is available. However, spectrogram-like representations have two main limitations: (1) the parameters of the filters are defined a priori, regardless of the specific audio analysis task; (2) such representations do not perform any denoising of the audio signal, in either the time domain or the frequency domain. To overcome these limitations, we propose a new general-purpose convolutional architecture for audio analysis tasks, which we call DEGramNet; it is trained on audio samples described with a novel, compact and learnable time-frequency representation that we call DEGram. The proposed representation is fully trainable: it learns the frequencies of interest for the specific audio analysis task and, in addition, performs denoising through a custom time-frequency attention module, which amplifies the frequency and time components in which the sound is actually located. As a result, the proposed representation can be easily adapted to the specific problem at hand, for instance by giving more importance to voice frequencies when the network is used for speaker recognition. DEGramNet achieved state-of-the-art performance on the VGGSound dataset (for Sound Event Classification) and accuracy comparable to that of a complex, special-purpose approach based on network architecture search on the VoxCeleb dataset (for Speaker Identification). Moreover, we demonstrate that DEGram makes it possible to achieve high accuracy with lightweight neural networks that can run in real time on embedded systems, making the solution suitable for Cognitive Robotics applications.
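The idea of an attention step that amplifies the time and frequency components where the sound is located can be illustrated with a minimal NumPy sketch. This is not the DEGram module: this record does not describe the authors' attention design, so the function name `time_frequency_attention`, the sigmoid masks, and the `freq_gain` and `time_gain` parameters are all hypothetical, and the gains are fixed constants here rather than learned weights.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def time_frequency_attention(spec, freq_gain=4.0, time_gain=4.0):
    """Reweight a spectrogram-like array of shape (n_freq, n_frames).

    Illustrative assumption, not the DEGram attention module:
    each frequency band and each time frame gets a mask in (0, 1)
    that grows with its average energy relative to the global mean.
    In a trainable model the gains would be learned parameters.
    """
    # Average energy per frequency band, shape (n_freq, 1).
    freq_profile = spec.mean(axis=1, keepdims=True)
    # Average energy per time frame, shape (1, n_frames).
    time_profile = spec.mean(axis=0, keepdims=True)

    # Bands/frames above the mean energy get masks above 0.5,
    # those below the mean get masks below 0.5.
    freq_mask = sigmoid(freq_gain * (freq_profile - freq_profile.mean()))
    time_mask = sigmoid(time_gain * (time_profile - time_profile.mean()))

    # Broadcasting applies both masks to every time-frequency bin.
    return spec * freq_mask * time_mask


# Toy input: a flat noise floor with one strong time-frequency bin.
spec = np.full((8, 10), 0.1)
spec[3, 5] = 10.0

out = time_frequency_attention(spec)

# Ratio between the strong bin and a background bin, before and after.
contrast_before = spec[3, 5] / spec[0, 0]
contrast_after = out[3, 5] / out[0, 0]
print(contrast_before, contrast_after)
```

In this toy case the masks attenuate the background band and frame more than the band and frame containing the strong bin, so the contrast between signal and noise floor increases, which is the kind of denoising effect the abstract attributes to the attention module.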
Pages: 20207 - 20219
Page count: 13
Related papers
50 in total
  • [21] Time-frequency analysis for audio event detection in real scenarios
    Saggese, Alessia
    Strisciuglio, Nicola
    Vento, Mario
    Petkov, Nicolai
    2016 13TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS), 2016, : 438 - 443
  • [22] Time-Frequency Based Thermal Imaging: An Effective Tool for Quantitative Analysis
    Yadav, G. V. P. Chandra Sekhar
    Ghali, V. S.
    Subhani, S. K.
    RUSSIAN JOURNAL OF NONDESTRUCTIVE TESTING, 2023, 59 (11) : 1165 - 1176
  • [24] Audio fingerprinting based on analyzing time-frequency localization of signals
    Lu, CS
    PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 2002, : 174 - 177
  • [25] ADVANCED TIME-FREQUENCY REPRESENTATION IN VOICE SIGNAL ANALYSIS
    Mika, Dariusz
    Jozwik, Jerzy
    ADVANCES IN SCIENCE AND TECHNOLOGY-RESEARCH JOURNAL, 2018, 12 (01): : 251 - 259
  • [26] Analysis of the time-frequency representation using the gamma filter
    Celebi, S
    Principe, JC
    1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 2587 - 2590
  • [27] A new time-frequency representation: Analysis of the combustion noise
    Cerdá, S
    Romero, J
    Navasquillo, J
    Zurita, G
    ACUSTICA, 2001, 87 (03): : 423 - 425
  • [28] An Eigen Based Feature on Time-Frequency Representation of EMG
    Sueaseenak, Direk
    Pintavirooj, Chuchart
    Sangworasil, Manas
    Chanwimalueang, Theerasak
    Praliwanon, Chaleeya
    2009 IEEE-RIVF INTERNATIONAL CONFERENCE ON COMPUTING AND COMMUNICATION TECHNOLOGIES: RESEARCH, INNOVATION AND VISION FOR THE FUTURE, 2009, : 73 - +
  • [29] An effective frequency-domain feature of atrial fibrillation based on time-frequency analysis
    Hu, Yusong
    Zhao, Yantao
    Liu, Jihong
    Pang, Jin
    Zhang, Chen
    Li, Peizhe
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2020, 20 (01)
  • [30] Environmental Sound Classification based on Time-frequency Representation
    Thwe, Khine Zar
    War, Nu
    2017 18TH IEEE/ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNDP 2017), 2017, : 251 - 255