Leveraging Pre-trained BERT for Audio Captioning

Cited by: 0
Authors
Liu, Xubo [1 ]
Mei, Xinhao [1 ]
Huang, Qiushi [2 ]
Sun, Jianyuan [1 ]
Zhao, Jinzheng [1 ]
Liu, Haohe [1 ]
Plumbley, Mark D. [1 ]
Kilic, Volkan [3 ]
Wang, Wenwu [1 ]
Affiliations
[1] Univ Surrey, Ctr Vis Speech & Signal Proc CVSSP, Guildford, Surrey, England
[2] Univ Surrey, Dept Comp Sci, Guildford, Surrey, England
[3] Izmir Katip Celebi Univ, Dept Elect & Elect Engn, Izmir, Turkey
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
audio captioning; language models; BERT; Pre-trained Audio Neural Networks (PANNs); deep learning;
DOI
Not available
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Audio captioning aims at using language to describe the content of an audio clip. Existing audio captioning systems are generally based on an encoder-decoder architecture, in which acoustic information is extracted by an audio encoder and a language decoder is then used to generate the captions. Training an audio captioning system often encounters the problem of data scarcity. Transferring knowledge from pre-trained audio models such as Pre-trained Audio Neural Networks (PANNs) has recently emerged as a useful method to mitigate this issue. However, less attention has been paid to exploiting pre-trained language models for the decoder than for the encoder. BERT is a pre-trained language model that has been extensively used in natural language processing tasks. Nevertheless, the potential of using BERT as the language decoder for audio captioning has not been investigated. In this study, we demonstrate the efficacy of the pre-trained BERT model for audio captioning. Specifically, we apply PANNs as the encoder and initialize the decoder from publicly available pre-trained BERT models. We conduct an empirical study on the use of these BERT models for the decoder in the audio captioning model. Our models achieve results competitive with existing audio captioning methods on the AudioCaps dataset.
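The sketch below is a minimal, hypothetical illustration of the architecture the abstract describes, not the authors' released code: a decoder initialized from a public pre-trained BERT checkpoint (via Hugging Face transformers, with cross-attention added) attends over frame-level embeddings assumed to come from a PANNs-style encoder such as CNN14. The class and parameter names (BertAudioCaptioner, audio_proj, audio_dim=2048) are illustrative assumptions.

import torch
import torch.nn as nn
from transformers import BertConfig, BertLMHeadModel, BertTokenizer

class BertAudioCaptioner(nn.Module):
    def __init__(self, audio_dim=2048, bert_name="bert-base-uncased"):
        super().__init__()
        # Configure BERT as an autoregressive decoder with cross-attention over
        # the audio features, keeping its pre-trained weights.
        config = BertConfig.from_pretrained(bert_name)
        config.is_decoder = True
        config.add_cross_attention = True
        self.decoder = BertLMHeadModel.from_pretrained(bert_name, config=config)
        # Project audio embeddings to BERT's hidden size so cross-attention shapes match.
        self.audio_proj = nn.Linear(audio_dim, config.hidden_size)

    def forward(self, audio_embeds, input_ids, attention_mask, labels=None):
        # audio_embeds: (batch, n_frames, audio_dim) frame embeddings, assumed to be
        # precomputed by a PANNs-style encoder (e.g. 2048-d CNN14 features).
        enc = self.audio_proj(audio_embeds)
        out = self.decoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            encoder_hidden_states=enc,
            labels=labels,  # BertLMHeadModel shifts labels internally for the LM loss
        )
        return out.loss if labels is not None else out.logits

# Toy training step with random features standing in for real PANNs outputs.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertAudioCaptioner()
audio = torch.randn(4, 32, 2048)  # 4 clips x 32 frames x 2048-d embeddings
batch = tokenizer(["a dog barks while a car passes by"] * 4, padding="max_length",
                  max_length=20, truncation=True, return_tensors="pt")
labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)
loss = model(audio, batch["input_ids"], batch["attention_mask"], labels=labels)
loss.backward()

In this sketch the pre-trained BERT weights are reused directly, and only the newly added cross-attention layers and the audio projection start from random initialization; the actual fine-tuning recipe in the paper may differ.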
Pages: 1145-1149
Page count: 5
Related Papers
50 items in total
  • [1] Using various pre-trained models for audio feature extraction in automated audio captioning
    Won, Hyejin
    Kim, Baekseung
    Kwak, Il-Youp
    Lim, Changwon
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2023, 231
  • [2] BERT-NAR-BERT: A Non-Autoregressive Pre-Trained Sequence-to-Sequence Model Leveraging BERT Checkpoints
    Sohrab, Mohammad Golam
    Asada, Masaki
    Rikters, Matiss
    Miwa, Makoto
    [J]. IEEE ACCESS, 2024, 12 : 23 - 33
  • [3] Patent classification with pre-trained Bert model
    Kahraman, Selen Yuecesoy
    Durmusoglu, Alptekin
    Dereli, Tuerkay
    [J]. JOURNAL OF THE FACULTY OF ENGINEERING AND ARCHITECTURE OF GAZI UNIVERSITY, 2024, 39 (04): 2485 - 2496
  • [4] Interpreting Art by Leveraging Pre-Trained Models
    Penzel, Niklas
    Denzler, Joachim
    [J]. 2023 18TH INTERNATIONAL CONFERENCE ON MACHINE VISION AND APPLICATIONS, MVA, 2023,
  • [5] The Lottery Ticket Hypothesis for Pre-trained BERT Networks
    Chen, Tianlong
    Frankle, Jonathan
    Chang, Shiyu
    Liu, Sijia
    Zhang, Yang
    Wang, Zhangyang
    Carbin, Michael
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [6] Modeling essay grading with pre-trained BERT features
    Sharma, Annapurna
    Jayagopi, Dinesh Babu
    [J]. APPLIED INTELLIGENCE, 2024, 54 (06) : 4979 - 4993
  • [7] Leveraging Pre-Trained Embeddings for Welsh Taggers
    Ezeani, Ignatius M.
    Piao, Scott
    Neale, Steven
    Rayson, Paul
    Knight, Dawn
    [J]. 4TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP (REPL4NLP-2019), 2019, : 270 - 280
  • [8] Sharing Pre-trained BERT Decoder for a Hybrid Summarization
    Wei, Ran
    Huang, Heyan
    Gao, Yang
    [J]. CHINESE COMPUTATIONAL LINGUISTICS, CCL 2019, 2019, 11856 : 169 - 180
  • [9] Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models
    Lai, Yuxuan
    Liu, Yijia
    Feng, Yansong
    Huang, Songfang
    Zhao, Dongyan
    [J]. 2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 1716 - 1731
  • [10] MF-BERT: Multimodal Fusion in Pre-Trained BERT for Sentiment Analysis
    He, Jiaxuan
    Hu, Haifeng
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 454 - 458