Quantifying the Impact of Complementary Visual and Textual Cues Under Image Captioning

Cited: 0
Authors
Akilan, Thangarajah [1 ]
Thiagarajan, Amitha [2 ]
Venkatesan, Bharathwaaj [2 ]
Thirumeni, Sowmiya [2 ]
Chandrasekaran, Sanjana Gurusamy [2 ]
Affiliations
[1] Lakehead Univ, Dept Software Engn, Thunder Bay, ON, Canada
[2] Lakehead Univ, Dept Comp Sci, Thunder Bay, ON, Canada
Keywords
Image captioning; multimodal learning; NLP;
DOI
10.1109/smc42975.2020.9283183
Chinese Library Classification
TP3 [Computing technology, computer technology];
Subject Classification Code
0812 ;
Abstract
Describing an image with a natural sentence without human involvement requires knowledge of both image processing and Natural Language Processing (NLP). Most existing works are based on unimodal representations of the visual and textual contents using an Encoder-Decoder (EnDec) Deep Neural Network (DNN), where the input images are encoded by a Convolutional Neural Network (CNN) and the caption is generated by a Recurrent Neural Network (RNN). This paper dives into a basic image captioning model to quantify the impact of multimodal representation of the visual and textual cues. The multimodal representation is carried out via an early fusion of encoded visual cues from different CNNs, along with combined textual features from different word embedding techniques. The resulting multimodal representations of the visual and textual cues are employed to train a Long Short-Term Memory (LSTM)-based baseline caption generator to quantify the impact of various levels of complementary feature mutations. The ablation study, involving two different CNN feature extractors and two types of textual feature extractors, shows that exploiting the complementary information significantly outperforms the unimodal representations with endurable timing overhead.
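The early-fusion idea the abstract describes — concatenating visual features from multiple CNN encoders and textual features from multiple word-embedding techniques before feeding an LSTM caption generator — can be sketched as below. This is a minimal illustration, not the authors' implementation; the feature dimensions and encoder choices are hypothetical assumptions.

```python
import numpy as np

# Hypothetical feature sizes (assumptions, not taken from the paper):
# two CNN encoders and two word-embedding techniques per the ablation setup.
VIS_DIM_A, VIS_DIM_B = 4096, 2048   # e.g. penultimate-layer sizes of two CNNs
EMB_DIM_A, EMB_DIM_B = 300, 100     # e.g. sizes of two word-embedding models

def early_fuse(feature_vectors):
    """Early fusion: concatenate per-modality feature vectors into one."""
    return np.concatenate(feature_vectors, axis=-1)

rng = np.random.default_rng(0)

# Stand-in encoder outputs for one image and one caption token.
vis_a = rng.normal(size=VIS_DIM_A)   # visual cue from CNN encoder A
vis_b = rng.normal(size=VIS_DIM_B)   # visual cue from CNN encoder B
txt_a = rng.normal(size=EMB_DIM_A)   # token embedding from technique A
txt_b = rng.normal(size=EMB_DIM_B)   # token embedding from technique B

visual_cue = early_fuse([vis_a, vis_b])    # fused multimodal visual input
textual_cue = early_fuse([txt_a, txt_b])   # fused multimodal textual input

# These fused vectors would then drive the LSTM-based caption generator.
print(visual_cue.shape, textual_cue.shape)  # (6144,) (400,)
```

The design choice illustrated here is fusion before the decoder (early fusion), as opposed to training separate unimodal streams and merging their predictions later.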
Pages: 389 - 394
Page count: 6
Related Papers
50 records in total
  • [21] Combining textual and visual cues for content-based image retrieval on the World Wide Web
    La Cascia, M
    Sethi, S
    Sclaroff, S
    IEEE WORKSHOP ON CONTENT-BASED ACCESS OF IMAGE AND VIDEO LIBRARIES - PROCEEDINGS, 1998, : 24 - 28
  • [22] Unifying textual and visual cues for content-based image retrieval on the World Wide Web
    Sclaroff, S
    La Cascia, M
    Sethi, S
    Taycher, L
    COMPUTER VISION AND IMAGE UNDERSTANDING, 1999, 75 (1-2) : 86 - 98
  • [24] Image Captioning with Visual-Semantic LSTM
    Li, Nannan
    Chen, Zhenzhong
    PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 793 - 799
  • [25] Image captioning improved visual question answering
    Himanshu Sharma
    Anand Singh Jalal
    Multimedia Tools and Applications, 2022, 81 : 34775 - 34796
  • [26] Visual to Text: Survey of Image and Video Captioning
    Li, Sheng
    Tao, Zhiqiang
    Li, Kang
    Fu, Yun
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2019, 3 (04): : 297 - 312
  • [27] Rich Visual and Language Representation with Complementary Semantics for Video Captioning
    Tang, Pengjie
    Wang, Hanli
    Li, Qinyu
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2019, 15 (02)
  • [28] Towards local visual modeling for image captioning
    Ma, Yiwei
    Ji, Jiayi
    Sun, Xiaoshuai
    Zhou, Yiyi
    Ji, Rongrong
    PATTERN RECOGNITION, 2023, 138
  • [29] Image Captioning Based on Visual and Semantic Attention
    Wei, Haiyang
    Li, Zhixin
    Zhang, Canlong
    MULTIMEDIA MODELING (MMM 2020), PT I, 2020, 11961 : 151 - 162