Quantifying the Impact of Complementary Visual and Textual Cues Under Image Captioning

Cited by: 0

Authors:

Akilan, Thangarajah [1]

Thiagarajan, Amitha [2]

Venkatesan, Bharathwaaj [2]

Thirumeni, Sowmiya [2]

Chandrasekaran, Sanjana Gurusamy [2]

Affiliations:

[1] Lakehead Univ, Dept Software Engn, Thunder Bay, ON, Canada

[2] Lakehead Univ, Dept Comp Sci, Thunder Bay, ON, Canada

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC) | 2020

Keywords:

Image captioning; multimodal learning; NLP;

DOI:

10.1109/smc42975.2020.9283183

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Describing an image with a natural sentence without human involvement requires knowledge of both image processing and Natural Language Processing (NLP). Most existing works are based on unimodal representations of the visual and textual contents using an Encoder-Decoder (EnDec) Deep Neural Network (DNN), where the input images are encoded using a Convolutional Neural Network (CNN) and the caption is generated by a Recurrent Neural Network (RNN). This paper examines a basic image captioning model to quantify the impact of multimodal representation of the visual and textual cues. The multimodal representation is carried out via an early fusion of encoded visual cues from different CNNs, along with combined textual features from different word embedding techniques. The resulting multimodal representation of the visual and textual cues is employed to train a Long Short-Term Memory (LSTM)-based baseline caption generator to quantify the impact of various levels of complementary feature mutations. The ablation study, which involves two different CNN feature extractors and two types of textual feature extractors, shows that exploiting the complementary information significantly outperforms the unimodal representations with a tolerable timing overhead.
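The early fusion mentioned in the abstract can be illustrated with a minimal sketch. It assumes that fusion means concatenating feature vectors from two extractors; the abstract does not state the exact operator. The L2 normalization step, the function names `l2_normalize` and `early_fuse`, and the toy vectors are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (an assumed preprocessing step)."""
    norm = sum(x * x for x in vec) ** 0.5
    if norm == 0.0:
        # Leave an all-zero vector unchanged to avoid dividing by zero.
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def early_fuse(*features: Sequence[float]) -> List[float]:
    """Concatenate normalized feature vectors into one fused vector."""
    fused: List[float] = []
    for vec in features:
        fused.extend(l2_normalize(vec))
    return fused


# Toy stand-ins for the outputs of two CNN encoders (hypothetical values).
visual_a = [3.0, 4.0]
visual_b = [0.0, 2.0, 0.0]

# Toy stand-ins for two word-embedding lookups of one token (hypothetical values).
text_a = [1.0, 0.0]
text_b = [0.0, 0.0, 5.0]

fused_visual = early_fuse(visual_a, visual_b)  # length 2 + 3 = 5
fused_text = early_fuse(text_a, text_b)        # length 2 + 3 = 5
```

In a captioning pipeline of the kind the abstract describes, vectors like `fused_visual` and `fused_text` would then feed the LSTM-based caption generator; how the paper wires them in is not specified in this record.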


Pages: 389-394

Page count: 6
