Quantifying the Impact of Complementary Visual and Textual Cues Under Image Captioning

Cited by: 0
Authors
Akilan, Thangarajah [1 ]
Thiagarajan, Amitha [2 ]
Venkatesan, Bharathwaaj [2 ]
Thirumeni, Sowmiya [2 ]
Chandrasekaran, Sanjana Gurusamy [2 ]
Affiliations
[1] Lakehead Univ, Dept Software Engn, Thunder Bay, ON, Canada
[2] Lakehead Univ, Dept Comp Sci, Thunder Bay, ON, Canada
Keywords
Image captioning; multimodal learning; NLP;
DOI
10.1109/smc42975.2020.9283183
CLC Number
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Describing an image with a natural sentence without human involvement requires knowledge of both image processing and Natural Language Processing (NLP). Most existing works are based on unimodal representations of the visual and textual contents using an Encoder-Decoder (EnDec) Deep Neural Network (DNN), where the input images are encoded by a Convolutional Neural Network (CNN) and the caption is generated by a Recurrent Neural Network (RNN). This paper dives into a basic image captioning model to quantify the impact of multimodal representation of the visual and textual cues. The multimodal representation is carried out via an early fusion of encoded visual cues from different CNNs, along with combined textual features from different word embedding techniques. The resulting multimodal representations of the visual and textual cues are employed to train a Long Short-Term Memory (LSTM)-based baseline caption generator to quantify the impact of various levels of complementary feature mutations. The ablation study, which involves two different CNN feature extractors and two types of textual feature extractors, shows that exploiting the complementary information significantly outperforms the unimodal representations with endurable timing overhead.
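The early-fusion step described above can be sketched minimally as feature concatenation. The code below is an illustration only, not the authors' implementation: the feature dimensions, the choice of two CNN encoders, and the two word-embedding views are hypothetical placeholders standing in for whichever extractors the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoded visual cues from two different CNN feature
# extractors (dimensions chosen for illustration only).
cnn_a_feat = rng.standard_normal(4096)
cnn_b_feat = rng.standard_normal(2048)

# Early fusion of visual cues: concatenate into a single vector.
visual_fused = np.concatenate([cnn_a_feat, cnn_b_feat])

# Two word-embedding views of the same token (illustrative 300-d each),
# combined the same way to form the fused textual representation.
embed_a = rng.standard_normal(300)
embed_b = rng.standard_normal(300)
textual_fused = np.concatenate([embed_a, embed_b])

# In the full model, these fused visual and textual representations
# would condition an LSTM-based caption generator; only the fusion
# step is shown here.
print(visual_fused.shape, textual_fused.shape)  # (6144,) (600,)
```

Concatenation is the simplest early-fusion operator; the point of the ablation described in the abstract is to compare such fused representations against each unimodal feature used alone.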
Pages: 389 - 394
Number of pages: 6
Related Papers
50 records in total
  • [31] Improving Visual Question Answering by Image Captioning
    Shao, Xiangjun
    Dong, Hongsong
    Wu, Guangsheng
    IEEE ACCESS, 2025, 13 : 46299 - 46311
  • [32] Textual Primacy Online: Impression Formation Based on Textual and Visual Cues in Facebook Profiles
    Pelled, Ayellet
    Zilberstein, Tanya
    Tsirulnikov, Alona
    Pick, Eran
    Patkin, Yael
    Tal-Or, Nurit
    AMERICAN BEHAVIORAL SCIENTIST, 2017, 61 (07) : 672 - 687
  • [35] RVAIC: Refined visual attention for improved image captioning
    Al-Qatf, Majjed
    Hawbani, Ammar
    Wang, XingFu
    Abdusallam, Amr
    Alsamhi, Saeed
    Alhabib, Mohammed
    Curry, Edward
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2024, 46 (02) : 3447 - 3459
  • [36] Image captioning in Bengali language using visual attention
    Masud, Adiba
    Hosen, Md. Biplob
    Habibullah, Md.
    Anannya, Mehrin
    Kaiser, M. Shamim
    PLOS ONE, 2025, 20 (02):
  • [37] Character-Oriented Video Summarization With Visual and Textual Cues
    Zhou, Peilun
    Xu, Tong
    Yin, Zhizhuo
    Liu, Dong
    Chen, Enhong
    Lv, Guangyi
    Li, Changliang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (10) : 2684 - 2697
  • [38] Image Captioning With Visual-Semantic Double Attention
    He, Chen
    Hu, Haifeng
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2019, 15 (01)
  • [39] Visual contextual relationship augmented transformer for image captioning
    Su, Qiang
    Hu, Junbo
    Li, Zhixin
    APPLIED INTELLIGENCE, 2024, 54 (06) : 4794 - 4813
  • [40] Image Captioning with Text-Based Visual Attention
    Chen He
    Haifeng Hu
    Neural Processing Letters, 2019, 49 : 177 - 185