Using various pre-trained models for audio feature extraction in automated audio captioning

Cited by: 0

Authors:

Won, Hyejin [1]

Kim, Baekseung [1]

Kwak, Il-Youp [1]

Lim, Changwon [1,2]

Affiliations:

[1] Chung Ang Univ, Dept Appl Stat, Seoul 06974, South Korea

[2] Chung Ang Univ, Inst Community Care & Hlth Equ, Seoul 06974, South Korea

Source:

Expert Systems with Applications | 2023 / Vol. 231

Funding:

National Research Foundation of Singapore;

Keywords:

Audio captioning; Acoustic scene detection; Transfer learning; Encoder-decoder; Convolutional neural network; Transformer;

DOI:

10.1016/j.eswa.2023.120664

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The DCASE automated audio captioning challenge aimed to construct a model that generates captions describing given audio. Our team developed a CNN14 encoder (pre-trained on AudioSet data) paired with a Transformer decoder, and the model placed sixth in the competition. Many teams utilized pre-trained networks, and it was evident that more research into their utilization was required. This paper presents comprehensive experiments conducted with various encoder networks for the proposed system, including CNN10, CNN14, ResNet54, AST, VGGNet, and EfficientNet. The pre-trained networks of CNN10, CNN14, ResNet54, and AST were trained on AudioSet data, while the pre-trained networks of AST, VGGNet, and EfficientNet were trained on ImageNet data. The best results were achieved when CNN10 pre-trained on AudioSet data was used as the encoder, the Transformer served as the decoder, and fine-tuning was applied. Moreover, a qualitative study confirmed that our model generates plausible captions for different types of audio.

Pages: 11
