DEGramNet: effective audio analysis based on a fully learnable time-frequency representation

Cited by: 1
Authors
Foggia, Pasquale [1]
Greco, Antonio [1]
Roberto, Antonio [1]
Saggese, Alessia [1]
Vento, Mario [1]
Affiliation
[1] Univ Salerno, Via Giovanni Paolo II 132, Fisciano, SA, Italy
Source
NEURAL COMPUTING & APPLICATIONS | 2023, Vol. 35, Issue 27
Keywords
Deep learning; Audio representation learning; Signal processing; Sound event classification; Speaker identification; NEURAL-NETWORKS; RECOGNITION;
DOI
10.1007/s00521-023-08849-7
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Current state-of-the-art audio analysis algorithms based on deep learning rely on hand-crafted spectrogram-like audio representations, which are more compact than descriptors obtained from the raw waveform; the latter, in turn, are far from achieving good generalization when little training data is available. However, spectrogram-like representations have two main limitations: (1) the parameters of the filters are defined a priori, regardless of the specific audio analysis task; (2) such representations do not perform any denoising of the audio signal, in either the time domain or the frequency domain. To overcome these limitations, we propose a new general-purpose convolutional architecture for audio analysis tasks, called DEGramNet, which is trained with audio samples described by a novel, compact and learnable time-frequency representation called DEGram. The proposed representation is fully trainable: it learns the frequencies of interest for the specific audio analysis task and, in addition, performs denoising through a custom time-frequency attention module, which amplifies the frequency and time components in which the sound is actually located. As a result, the representation can be easily adapted to the specific problem at hand, for instance by giving more importance to voice frequencies when the network is used for speaker recognition. DEGramNet achieved state-of-the-art performance on the VGGSound dataset (for Sound Event Classification) and accuracy comparable to that of a complex, special-purpose approach based on network architecture search on the VoxCeleb dataset (for Speaker Identification). Moreover, we demonstrate that DEGram makes it possible to achieve high accuracy with lightweight neural networks that can run in real time on embedded systems, making the solution suitable for Cognitive Robotics applications.
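The two ideas named in the abstract, a filterbank whose frequencies are parameters rather than fixed constants and a time-frequency attention step that emphasizes where the sound is located, can be illustrated with a small NumPy sketch. This is not the DEGram implementation: the Gaussian filter shape, the function names, and the energy-based gates are all assumptions made for illustration. In a trained model the centers, bandwidths, and gates would be learned by gradient descent, whereas here they are fixed by hand or computed from simple statistics.

```python
import numpy as np

def power_spectrogram(signal, n_fft=256, hop=128):
    """Short-time power spectrum, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    return (np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2).T

def gaussian_filterbank(centers_hz, bandwidths_hz, n_fft, sample_rate):
    """One Gaussian band-pass filter per row (hypothetical filter shape).

    centers_hz and bandwidths_hz stand in for the trainable parameters.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    z = (freqs[None, :] - centers_hz[:, None]) / bandwidths_hz[:, None]
    return np.exp(-0.5 * z ** 2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def time_frequency_attention(rep):
    """Reweight bands and frames with gates in (0, 1).

    Placeholder gates: derived from mean energy, not learned. Bands and
    frames with above-average energy are attenuated less than the rest,
    which emphasizes them in relative terms.
    """
    freq_gate = sigmoid(rep.mean(axis=1, keepdims=True) - rep.mean())
    time_gate = sigmoid(rep.mean(axis=0, keepdims=True) - rep.mean())
    return rep * freq_gate * time_gate

# Usage: a 1 kHz tone sampled at 8 kHz, analysed with three bands.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 1000 * t)

spec = power_spectrogram(signal)
bank = gaussian_filterbank(np.array([500.0, 1000.0, 2000.0]),
                           np.array([100.0, 100.0, 100.0]),
                           n_fft=256, sample_rate=sample_rate)
rep = np.log1p(bank @ spec)            # compact band-by-frame representation
attended = time_frequency_attention(rep)
print(rep.shape, int(np.argmax(attended.mean(axis=1))))
```

Moving the centers in `centers_hz` changes which frequencies the representation resolves, which is the sense in which such a front end could be adapted to a task such as speaker recognition.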
Pages: 20207-20219
Page count: 13