Quantifying the Impact of Complementary Visual and Textual Cues Under Image Captioning

Cited by: 0

Authors:

Akilan, Thangarajah [1]

Thiagarajan, Amitha [2]

Venkatesan, Bharathwaaj [2]

Thirumeni, Sowmiya [2]

Chandrasekaran, Sanjana Gurusamy [2]

Affiliations:

[1] Lakehead Univ, Dept Software Engn, Thunder Bay, ON, Canada

[2] Lakehead Univ, Dept Comp Sci, Thunder Bay, ON, Canada

Source:

2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC) | 2020

Keywords:

Image captioning; multimodal learning; NLP

DOI:

10.1109/smc42975.2020.9283183

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Describing an image with a natural-language sentence without human involvement requires knowledge of both image processing and Natural Language Processing (NLP). Most existing works are based on unimodal representations of the visual and textual content using an Encoder-Decoder (EnDec) Deep Neural Network (DNN), where the input images are encoded by a Convolutional Neural Network (CNN) and the caption is generated by a Recurrent Neural Network (RNN). This paper examines a basic image captioning model to quantify the impact of a multimodal representation of the visual and textual cues. The multimodal representation is obtained via an early fusion of encoded visual cues from different CNNs, along with combined textual features from different word embedding techniques. The resulting multimodal representation of the visual and textual cues is used to train a Long Short-Term Memory (LSTM)-based baseline caption generator to quantify the impact of various levels of complementary feature mutations. The ablation study, which involves two different CNN feature extractors and two types of textual feature extractors, shows that exploiting the complementary information significantly outperforms the unimodal representations with a tolerable timing overhead.
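The early fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the features are combined, so this sketch assumes plain concatenation, a common form of early fusion. The function names, the toy feature values, and the two lookup tables standing in for the word embedding techniques are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def early_fuse(feature_sets: Sequence[Sequence[float]]) -> Vector:
    """Fuse several feature vectors by concatenating them end to end.

    Assumption: concatenation stands in for the paper's early fusion,
    whose exact form the abstract does not specify.
    """
    fused: Vector = []
    for features in feature_sets:
        fused.extend(float(x) for x in features)
    return fused


# Toy visual features standing in for the outputs of two different CNNs
# (hypothetical values and sizes).
cnn_a_features = [0.1, 0.7, 0.2]
cnn_b_features = [0.9, 0.4]
fused_visual = early_fuse([cnn_a_features, cnn_b_features])

# Toy lookup tables standing in for two word embedding techniques
# (hypothetical values and sizes).
embedding_a: Dict[str, Vector] = {"dog": [0.5, 0.1], "runs": [0.3, 0.8]}
embedding_b: Dict[str, Vector] = {"dog": [0.2], "runs": [0.6]}


def fuse_caption(tokens: Sequence[str]) -> List[Vector]:
    """Build one fused textual vector per caption token."""
    return [early_fuse([embedding_a[t], embedding_b[t]]) for t in tokens]


fused_text = fuse_caption(["dog", "runs"])

print(len(fused_visual))  # 5
print(fused_text[0])      # [0.5, 0.1, 0.2]
```

In a captioning pipeline of the kind the abstract describes, the fused visual vector and the per-token fused textual vectors would then be fed to the LSTM-based caption generator. The unimodal baseline corresponds to passing a single feature set to `early_fuse`.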

Pages: 389-394

Page count: 6
