ACTUAL: Audio Captioning With Caption Feature Space Regularization

Cited by: 5

Authors:

Zhang, Yiming [1]

Yu, Hong [2]

Du, Ruoyi [1]

Tan, Zheng-Hua [3]

Wang, Wenwu [4]

Ma, Zhanyu [1]

Dong, Yuan [1]

Affiliations:

[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Pattern Recognit & Intelligent Syst Lab, Beijing 100876, Peoples R China

[2] Ludong Univ, Sch Informat & Elect Engn, Dept Artificial Intelligence, Yantai 264025, Shandong, Peoples R China

[3] Aalborg Univ, Dept Elect Syst, DK-9220 Aalborg, Denmark

[4] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, England

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2023 / Vol. 31

Funding:

National Natural Science Foundation of China; Beijing Natural Science Foundation;

Keywords:

Audio captioning; contrastive learning; cross-modal task; caption consistency regularization;

DOI:

10.1109/TASLP.2023.3293015

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Audio captioning aims at describing the content of audio clips with human language. Due to the ambiguity of audio content, different people may perceive the same audio clip differently, resulting in caption disparities (i.e., the same audio clip may be described by several captions with diverse semantics). In the literature, the one-to-many strategy is often employed to train the audio captioning models, where a related caption is randomly selected as the optimization target for each audio clip at each training iteration. However, we observe that this can lead to significant variations during the optimization process and adversely affect the performance of the model. In this article, we address this issue by proposing an audio captioning method, named ACTUAL (Audio Captioning with capTion featUre spAce reguLarization). ACTUAL involves a two-stage training process: (i) in the first stage, we use contrastive learning to construct a proxy feature space where the similarities between captions at the audio level are explored, and (ii) in the second stage, the proxy feature space is utilized as additional supervision to improve the optimization of the model in a more stable direction. We conduct extensive experiments to demonstrate the effectiveness of the proposed ACTUAL method. The results show that proxy caption embedding can significantly improve the performance of the baseline model and the proposed ACTUAL method offers competitive performance on two datasets compared to state-of-the-art methods.
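The training ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (sample_target, caption_contrastive_loss, regularized_loss), the temperature value, the cosine-based regularizer, and the toy embeddings are all assumptions chosen for illustration. The sketch shows (a) one-to-many target selection, (b) a contrastive loss that treats captions of the same audio clip as positives, and (c) a caption loss augmented with a proxy-embedding term.

```python
import math
import random


def sample_target(captions, rng):
    """One-to-many strategy: pick one reference caption at random per iteration."""
    return rng.choice(captions)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def caption_contrastive_loss(embeddings, clip_ids, temperature=0.1):
    """Contrastive loss over caption embeddings (illustrative assumption).

    Captions that share a clip id are positives; all other captions in the
    batch are negatives. Returns the mean loss over all positive pairs.
    """
    n = len(embeddings)
    total, count = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in others}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        for j in others:
            if clip_ids[j] == clip_ids[i]:
                total += log_denom - logits[j]
                count += 1
    return total / count if count else 0.0


def regularized_loss(caption_loss, decoder_feature, proxy_embedding, weight=1.0):
    """Caption loss plus a hypothetical proxy-space term (1 - cosine similarity)."""
    return caption_loss + weight * (1.0 - cosine(decoder_feature, proxy_embedding))


# Toy batch: two clips with two captions each (made-up 2-D embeddings).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_clip_ids = [0, 0, 1, 1]
toy_loss = caption_contrastive_loss(toy_embeddings, toy_clip_ids)
```

In this toy batch the loss is small because captions of the same clip already lie close together; assigning the clip ids so that dissimilar captions are paired raises it, which is the signal a first-stage contrastive objective of this kind would use to shape the caption feature space.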

Pages: 2643-2657

Number of pages: 15
