Using various pre-trained models for audio feature extraction in automated audio captioning

Cited by: 0
Authors
Won, Hyejin [1]
Kim, Baekseung [1]
Kwak, Il-Youp [1]
Lim, Changwon [1, 2]
Affiliations
[1] Chung Ang Univ, Dept Appl Stat, Seoul 06974, South Korea
[2] Chung Ang Univ, Inst Community Care & Hlth Equ, Seoul 06974, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Audio captioning; Acoustic scene detection; Transfer learning; Encoder-decoder; Convolutional neural network; Transformer;
DOI
10.1016/j.eswa.2023.120664
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
The DCASE automated audio captioning challenge aimed to construct a model that generates captions describing a given audio clip. Our team developed a model that pairs a CNN14 encoder (pre-trained on AudioSet data) with a Transformer decoder, and it placed sixth in the competition. Many teams utilized pre-trained networks, and it was evident that more research into their utilization was required. This paper presents comprehensive experiments conducted with various encoder networks for the proposed system, including CNN10, CNN14, ResNet54, AST, VGGNet, and EfficientNet. The pre-trained networks of CNN10, CNN14, ResNet54, and AST were trained on AudioSet data, while the pre-trained networks of AST, VGGNet, and EfficientNet were trained on ImageNet data. The best results were achieved when CNN10, pre-trained on AudioSet data, was used as the encoder with a Transformer as the decoder and with fine-tuning applied. Moreover, a qualitative study confirmed that our model generates plausible captions for different types of audio.
Pages: 11