DEGramNet: effective audio analysis based on a fully learnable time-frequency representation

Cited by: 1

Authors:

Foggia, Pasquale [1]

Greco, Antonio [1]

Roberto, Antonio [1]

Saggese, Alessia [1]

Vento, Mario [1]

Affiliations:

[1] Univ Salerno, Via Giovanni Paolo II 132, Fisciano, SA, Italy

Source:

NEURAL COMPUTING & APPLICATIONS | 2023 / Volume 35 / Issue 27

Keywords:

Deep learning; Audio representation learning; Signal processing; Sound event classification; Speaker identification; NEURAL-NETWORKS; RECOGNITION;

DOI:

10.1007/s00521-023-08849-7

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Current state-of-the-art audio analysis algorithms based on deep learning rely on hand-crafted spectrogram-like audio representations, which are more compact than descriptors obtained from the raw waveform; the latter, in turn, generalize poorly when little training data is available. However, spectrogram-like representations have two main limitations: (1) the parameters of the filters are defined a priori, regardless of the specific audio analysis task; (2) such representations do not perform any denoising operation on the audio signal, in either the time domain or the frequency domain. To overcome these limitations, we propose a new general-purpose convolutional architecture for audio analysis tasks, which we call DEGramNet, trained on audio samples described with a novel, compact and learnable time-frequency representation, which we call DEGram. The proposed representation is fully trainable: it learns the frequencies of interest for the specific audio analysis task and, in addition, performs denoising through a custom time-frequency attention module, which amplifies the frequency and time components in which the sound is actually located. As a result, the proposed representation can be easily adapted to the specific problem at hand, for instance by giving more importance to the voice frequencies when the network is used for speaker recognition. DEGramNet achieved state-of-the-art performance on the VGGSound dataset (for Sound Event Classification) and accuracy comparable to that of a complex, special-purpose approach based on network architecture search on the VoxCeleb dataset (for Speaker Identification). Moreover, we demonstrate that DEGram makes it possible to achieve high accuracy with lightweight neural networks that can run in real time on embedded systems, making the solution suitable for Cognitive Robotics applications.
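
The time-frequency attention idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' module: the abstract gives no architectural details, so the gating scheme below (a sigmoid over the mean energy of each frequency band and of each time frame) and the names `time_frequency_attention`, `freq_scale`, `freq_bias`, `time_scale` and `time_bias` are assumptions made purely for illustration. In a trained network the scale and bias values would be learned; here they are fixed by hand.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def time_frequency_attention(spec, freq_scale=1.0, freq_bias=0.0,
                             time_scale=1.0, time_bias=0.0):
    """Reweight a spectrogram-like matrix spec[f][t] (hypothetical sketch).

    Each frequency band and each time frame receives a gate in (0, 1)
    computed from its mean energy; every cell is multiplied by both gates,
    so cells in energetic bands and frames are attenuated less than the rest.
    """
    n_freq = len(spec)
    n_time = len(spec[0])
    # Mean energy per frequency band (row) and per time frame (column).
    freq_energy = [sum(row) / n_time for row in spec]
    time_energy = [sum(spec[f][t] for f in range(n_freq)) / n_freq
                   for t in range(n_time)]
    # Assumed gating: scale and bias stand in for learnable parameters.
    freq_gate = [sigmoid(freq_scale * e + freq_bias) for e in freq_energy]
    time_gate = [sigmoid(time_scale * e + time_bias) for e in time_energy]
    return [[spec[f][t] * freq_gate[f] * time_gate[t] for t in range(n_time)]
            for f in range(n_freq)]


# Toy input: one strong event in the middle band and middle frame,
# surrounded by low-level background.
spec = [
    [0.1, 0.1, 0.1],
    [0.1, 4.0, 0.1],
    [0.1, 0.1, 0.1],
]
out = time_frequency_attention(spec, freq_scale=2.0, freq_bias=-2.0,
                               time_scale=2.0, time_bias=-2.0)
```

On the toy input, the cell containing the event keeps a larger fraction of its energy than the background cells, which is the relative emphasis the abstract describes as denoising in both time and frequency.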

Pages: 20207-20219

Page count: 13

50 entries in total

[21] Time-frequency analysis for audio event detection in real scenarios
Saggese, Alessia
Strisciuglio, Nicola
Vento, Mario
Petkov, Nicolai
2016 13TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS), 2016, : 438 - 443
[22] Time-Frequency Based Thermal Imaging: An Effective Tool for Quantitative Analysis
Yadav, G. V. P. Chandra Sekhar
Ghali, V. S.
Subhani, S. K.
RUSSIAN JOURNAL OF NONDESTRUCTIVE TESTING, 2023, 59 (11) : 1165 - 1176
[23] Methods of Time-Frequency Analysis in Authentication of Digital Audio Recordings
Korycki, Rafal
INTERNATIONAL JOURNAL OF ELECTRONICS AND TELECOMMUNICATIONS, 2010, 56 (03) : 257 - 261
[24] Audio fingerprinting based on analyzing time-frequency localization of signals
Lu, CS
PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 2002, : 174 - 177
[25] ADVANCED TIME-FREQUENCY REPRESENTATION IN VOICE SIGNAL ANALYSIS
Mika, Dariusz
Jozwik, Jerzy
ADVANCES IN SCIENCE AND TECHNOLOGY-RESEARCH JOURNAL, 2018, 12 (01) : 251 - 259
[26] Analysis of the time-frequency representation using the gamma filter
Celebi, S
Principe, JC
1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 2587 - 2590
[27] A new time-frequency representation: Analysis of the combustion noise
Cerdá, S
Romero, J
Navasquillo, J
Zurita, G
ACUSTICA, 2001, 87 (03) : 423 - 425
[28] An Eigen Based Feature on Time-Frequency Representation of EMG
Sueaseenak, Direk
Pintavirooj, Chuchart
Sangworasil, Manas
Chanwimalueang, Theerasak
Praliwanon, Chaleeya
2009 IEEE-RIVF INTERNATIONAL CONFERENCE ON COMPUTING AND COMMUNICATION TECHNOLOGIES: RESEARCH, INNOVATION AND VISION FOR THE FUTURE, 2009, : 73 - +
[29] An effective frequency-domain feature of atrial fibrillation based on time-frequency analysis
Hu, Yusong
Zhao, Yantao
Liu, Jihong
Pang, Jin
Zhang, Chen
Li, Peizhe
BMC MEDICAL INFORMATICS AND DECISION MAKING, 2020, 20 (01)
[30] Environmental Sound Classification based on Time-frequency Representation
Thwe, Khine Zar
War, Nu
2017 18TH IEEE/ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNDP 2017), 2017, : 251 - 255