Using various pre-trained models for audio feature extraction in automated audio captioning

Cited by: 0
Authors
Won, Hyejin [1]
Kim, Baekseung [1]
Kwak, Il-Youp [1]
Lim, Changwon [1, 2]
Affiliations
[1] Chung Ang Univ, Dept Appl Stat, Seoul 06974, South Korea
[2] Chung Ang Univ, Inst Community Care & Hlth Equ, Seoul 06974, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Audio captioning; Acoustic scene detection; Transfer learning; Encoder-decoder; Convolutional neural network; Transformer;
DOI
10.1016/j.eswa.2023.120664
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
The DCASE automated audio captioning challenge aimed to construct a model that generates captions describing the given audio. Our team developed a model with a CNN14 encoder (pre-trained on AudioSet data) and a Transformer decoder, which ranked sixth in the competition. Many teams utilized pre-trained networks, and it was evident that more research into their utilization was required. This paper presents comprehensive experiments with various encoder networks for the proposed system, including CNN10, CNN14, ResNet54, AST, VGGNet, and EfficientNet. The pre-trained CNN10, CNN14, ResNet54, and AST networks were trained on AudioSet data, while the pre-trained AST, VGGNet, and EfficientNet networks were trained on ImageNet data. The best results were achieved when CNN10, pre-trained on AudioSet data, was used as the encoder with a Transformer as the decoder and fine-tuning was applied. Moreover, a qualitative study confirmed that our model generates plausible captions for different types of audio.
Pages: 11