ACTUAL: Audio Captioning With Caption Feature Space Regularization

Cited by: 5
Authors
Zhang, Yiming [1]
Yu, Hong [2]
Du, Ruoyi [1]
Tan, Zheng-Hua [3]
Wang, Wenwu [4]
Ma, Zhanyu [1]
Dong, Yuan [1]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Pattern Recognit & Intelligent Syst Lab, Beijing 100876, Peoples R China
[2] Ludong Univ, Sch Informat & Elect Engn, Dept Artificial Intelligence, Yantai 264025, Shandong, Peoples R China
[3] Aalborg Univ, Dept Elect Syst, DK-9220 Aalborg, Denmark
[4] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, England
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation
Keywords
Audio captioning; contrastive learning; cross-modal task; caption consistency regularization;
DOI
10.1109/TASLP.2023.3293015
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Audio captioning aims at describing the content of audio clips with human language. Due to the ambiguity of audio content, different people may perceive the same audio clip differently, resulting in caption disparities (i.e., the same audio clip may be described by several captions with diverse semantics). In the literature, the one-to-many strategy is often employed to train the audio captioning models, where a related caption is randomly selected as the optimization target for each audio clip at each training iteration. However, we observe that this can lead to significant variations during the optimization process and adversely affect the performance of the model. In this article, we address this issue by proposing an audio captioning method, named ACTUAL (Audio Captioning with capTion featUre spAce reguLarization). ACTUAL involves a two-stage training process: (i) in the first stage, we use contrastive learning to construct a proxy feature space where the similarities between captions at the audio level are explored, and (ii) in the second stage, the proxy feature space is utilized as additional supervision to improve the optimization of the model in a more stable direction. We conduct extensive experiments to demonstrate the effectiveness of the proposed ACTUAL method. The results show that proxy caption embedding can significantly improve the performance of the baseline model and the proposed ACTUAL method offers competitive performance on two datasets compared to state-of-the-art methods.
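The two-stage idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the loss functions, so the supervised-contrastive form in stage one, the cosine-distance regularizer in stage two, and every name here (`supervised_contrastive_loss`, `proxy_regularizer`, `stage_two_loss`, `temperature`, `weight`) are assumptions made for the example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def supervised_contrastive_loss(caption_emb, audio_ids, temperature=0.1):
    """Stage 1 (assumed form): captions of the same audio clip are positives.

    Pulls together embeddings of captions that describe the same clip and
    pushes apart captions of different clips. Requires at least one clip
    with two or more captions in the batch.
    """
    z = l2_normalize(caption_emb)
    ids = np.asarray(audio_ids)
    n = z.shape[0]
    sim = z @ z.T / temperature
    not_self = ~np.eye(n, dtype=bool)

    # Log-softmax over all *other* captions in the batch (numerically stable).
    masked = np.where(not_self, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    positives = (ids[:, None] == ids[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / pos_count[has_pos]).mean())


def proxy_regularizer(predicted_emb, proxy_emb):
    """Stage 2 (assumed form): mean cosine distance to the proxy embeddings."""
    p = l2_normalize(predicted_emb)
    q = l2_normalize(proxy_emb)
    return float(np.mean(1.0 - np.sum(p * q, axis=1)))


def stage_two_loss(caption_loss, predicted_emb, proxy_emb, weight=1.0):
    """Captioning loss plus the weighted proxy-space term (hypothetical weighting)."""
    return caption_loss + weight * proxy_regularizer(predicted_emb, proxy_emb)
```

In this sketch the proxy embeddings would come from a caption encoder trained in stage one and then held fixed, so the extra term gives each audio clip a single, stable target even when the sampled reference caption changes between iterations.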
Pages: 2643-2657
Page count: 15