Enhancing Image Captioning with Transformer-Based Two-Pass Decoding Framework

Cited by: 0

Authors:

Su, Jindian ^{[1]}

Mou, Yueqi ^{[1]}

Xie, Yunhao ^{[2]}

Affiliations:

[1] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou, Peoples R China

[2] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China

Source:

ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT I, ICIC 2024 | 2024 / Vol. 14875

Keywords:

Image Captioning; Two-Pass Decoding; Transformer

DOI:

10.1007/978-981-97-5663-6_15

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The two-pass decoding framework significantly enhances image captioning models. However, existing two-pass models are often trained from scratch and so fail to fully leverage the pre-trained knowledge of single-pass models, which increases training cost and complexity. In this paper, we propose a unified two-pass decoding framework comprising three core modules: a pre-trained Visual Encoder, a pre-trained Draft Decoder, and a Deliberation Decoder. To enable effective information alignment and complementation between the image and the draft caption, we design a Cross-Modality Fusion (CMF) module in the Deliberation Decoder, forming a Cross-Modality Fusion-based Deliberation Decoder (CMF-DD). During training, we facilitate the transfer of foundational knowledge by extensively sharing parameters between the Draft and Deliberation Decoders. At the same time, we fix the parameters from the single-pass baseline and update only a small subset within the Deliberation Decoder to reduce cost and complexity. Additionally, we introduce a Dominance-Adaptive reward scoring algorithm in the reinforcement learning stage to specifically enhance the quality of refinements. Experiments on MS COCO datasets demonstrate that our method achieves substantial improvements over single-pass decoding baselines and competes favorably with other two-pass decoding methods.


Pages: 171-183

Page count: 13
