Quantifying the Impact of Complementary Visual and Textual Cues Under Image Captioning

Cited by: 0
Authors
Akilan, Thangarajah [1]
Thiagarajan, Amitha [2]
Venkatesan, Bharathwaaj [2]
Thirumeni, Sowmiya [2]
Chandrasekaran, Sanjana Gurusamy [2]
Affiliations
[1] Lakehead Univ, Dept Software Engn, Thunder Bay, ON, Canada
[2] Lakehead Univ, Dept Comp Sci, Thunder Bay, ON, Canada
Keywords
Image captioning; multimodal learning; NLP
DOI
10.1109/smc42975.2020.9283183
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Describing an image with a natural sentence without human involvement requires knowledge of both image processing and Natural Language Processing (NLP). Most existing works are based on unimodal representations of the visual and textual content using an Encoder-Decoder (EnDec) Deep Neural Network (DNN), where the input images are encoded using a Convolutional Neural Network (CNN) and the caption is generated by a Recurrent Neural Network (RNN). This paper examines a basic image captioning model to quantify the impact of multimodal representation of the visual and textual cues. The multimodal representation is carried out via an early fusion of encoded visual cues from different CNNs, along with combined textual features from different word embedding techniques. The resulting multimodal representation of the visual and textual cues is employed to train a Long Short-Term Memory (LSTM)-based baseline caption generator to quantify the impact of various levels of complementary feature mutations. The ablation study, which involves two different CNN feature extractors and two types of textual feature extractors, shows that exploiting the complementary information outperforms the unimodal representations significantly, with a tolerable timing overhead.
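The early fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which fusion operator is used, so plain concatenation is assumed here; the function name `early_fuse` and all feature values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def early_fuse(*features: Sequence[float]) -> List[float]:
    """Early fusion by concatenation (an assumed operator, not confirmed by the paper).

    Joins feature vectors from several sources into a single vector
    before it is passed to a downstream model.
    """
    fused: List[float] = []
    for vec in features:
        fused.extend(vec)
    return fused


# Hypothetical toy features standing in for the outputs of two CNN encoders
cnn_a = [0.1, 0.2, 0.3]
cnn_b = [0.4, 0.5]

# Hypothetical toy features standing in for two word embedding techniques
emb_a = [1.0, 0.0]
emb_b = [0.5, 0.5, 0.5]

visual = early_fuse(cnn_a, cnn_b)   # combined visual cue
textual = early_fuse(emb_a, emb_b)  # combined textual cue

print(len(visual), len(textual))
```

With concatenation, the fused dimensionality is the sum of the source dimensionalities, which is one plausible source of the timing overhead the abstract mentions.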
Pages: 389-394
Page count: 6