Using various pre-trained models for audio feature extraction in automated audio captioning

Cited by: 0
Authors
Won, Hyejin [1]
Kim, Baekseung [1]
Kwak, Il-Youp [1]
Lim, Changwon [1, 2]
Affiliations
[1] Chung Ang Univ, Dept Appl Stat, Seoul 06974, South Korea
[2] Chung Ang Univ, Inst Community Care & Hlth Equ, Seoul 06974, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Audio captioning; Acoustic scene detection; Transfer learning; Encoder-decoder; Convolutional neural network; Transformer;
DOI
10.1016/j.eswa.2023.120664
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
The DCASE automated audio captioning challenge aimed to construct a model that generates captions describing given audio. Our team developed a CNN14 encoder (pre-trained on AudioSet data) along with a Transformer decoder model that ranked sixth in the competition. Many teams utilized pre-trained networks, and it was evident that more research into their utilization was required. This paper presented comprehensive experiments conducted with various encoder networks for the proposed system, including CNN10, CNN14, ResNet54, AST, VGGNet, and EfficientNet. The pre-trained networks of CNN10, CNN14, ResNet54, and AST were trained on AudioSet data, while the pre-trained networks of AST, VGGNet, and EfficientNet were trained on ImageNet data. The best outcomes were achieved when the pre-trained CNN10, trained on AudioSet data, was utilized as an encoder with the Transformer serving as a decoder, and fine-tuning was applied. Moreover, a qualitative study confirmed that our model generates plausible captions for different types of audio.
Pages: 11