Using various pre-trained models for audio feature extraction in automated audio captioning

Cited by: 0

Authors:

Won, Hyejin [1]

Kim, Baekseung [1]

Kwak, Il-Youp [1]

Lim, Changwon [1,2]

Affiliations:

[1] Chung Ang Univ, Dept Appl Stat, Seoul 06974, South Korea

[2] Chung Ang Univ, Inst Community Care & Hlth Equ, Seoul 06974, South Korea

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2023 / Vol. 231

Funding:

National Research Foundation of Singapore;

Keywords:

Audio captioning; Acoustic scene detection; Transfer learning; Encoder-decoder; Convolutional neural network; Transformer;

DOI:

10.1016/j.eswa.2023.120664

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The DCASE automated audio captioning challenge aimed to construct a model that generates captions describing given audio. Our team developed a CNN14 encoder (pre-trained on AudioSet data) along with a Transformer decoder model that ranked sixth place in the competition. Many teams utilized pre-trained networks, and it was evident that more research into their utilization was required. This paper presented comprehensive experiments conducted with various encoder networks for the proposed system, including CNN10, CNN14, ResNet54, AST, VGGNet, and EfficientNet. The pre-trained networks of CNN10, CNN14, ResNet54, and AST were trained on AudioSet data, while the pre-trained networks of AST, VGGNet, and EfficientNet were trained on ImageNet data. The best outcomes were achieved when the pre-trained CNN10, trained on AudioSet data, was utilized as an encoder with the Transformer serving as a decoder, and fine-tuning applied. Moreover, a qualitative study confirmed that our model generates plausible captions for different types of audio.


Pages: 11

50 entries in total

[21] Automated LOINC Standardization Using Pre-trained Large Language Models
Tu, Tao
Loreaux, Eric
Chesley, Emma
Lelkes, Adam D.
Gamble, Paul
Bellaiche, Mathias
Seneviratne, Martin
Chen, Ming-Jun
[J]. MACHINE LEARNING FOR HEALTH, VOL 193, 2022, 193 : 343 - 355
[22] Basic investigation of sign language motion classification by feature extraction using pre-trained network models
Kawaguchi, Kaito
Nishimura, Hiromitsu
Wang, Zhizhong
Tanaka, Hiroshi
Ohta, Eiji
[J]. 2019 IEEE PACIFIC RIM CONFERENCE ON COMMUNICATIONS, COMPUTERS AND SIGNAL PROCESSING (PACRIM), 2019
[23] WhisPAr: Transferring pre-trained audio models to fine-grained classification via Prompt and Adapter
Shi, Bin
Wang, Hao
Lu, Chenchen
Zhao, Meng
[J]. KNOWLEDGE-BASED SYSTEMS, 2024, 300
[24] Audio-Aware Spoken Multiple-Choice Question Answering with Pre-Trained Language Models
Kuo, Chia-Chih
Chen, Kuan-Yu
Luo, Shang-Bao
[J]. IEEE/ACM Transactions on Audio Speech and Language Processing, 2021, 29 : 3170 - 3179
[25] Audio-Aware Spoken Multiple-Choice Question Answering With Pre-Trained Language Models
Kuo, Chia-Chih
Chen, Kuan-Yu
Luo, Shang-Bao
[J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 3170 - 3179
[26] Multi-Stage Audio-Visual Fusion for Dysarthric Speech Recognition With Pre-Trained Models
Yu, Chongchong
Su, Xiaosu
Qian, Zhaopeng
[J]. IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, 2023, 31 : 1912 - 1921
[27] Composing General Audio Representation by Fusing Multilayer Features of a Pre-trained Model
Niizumi, Daisuke
Takeuchi, Daiki
Ohishi, Yasunori
Harada, Noboru
Kashino, Kunio
[J]. 2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 200 - 204
[28] AUTOMATED AUDIO CAPTIONING WITH RECURRENT NEURAL NETWORKS
Drossos, Konstantinos
Adavanne, Sharath
Virtanen, Tuomas
[J]. 2017 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA), 2017, : 374 - 378
[29] Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning
Chen, Chen
Hou, Nana
Hu, Yuchen
Zou, Heqing
Qi, Xiaofeng
Chng, Eng Siong
[J]. INTERSPEECH 2022, 2022, : 2773 - 2777
[30] Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
Duan, Haoyi
Xia, Yan
Zhou, Mingze
Tang, Li
Zhu, Jieming
Zhao, Zhou
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
