Multimodal object description network for dense captioning

Cited by: 2
Authors
Wang, Weixuan [1 ]
Hu, Haifeng [1 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Elect & Informat Engn, Guangzhou 510006, Guangdong, Peoples R China
Keywords
DOI
10.1049/el.2017.0326
CLC classification
TM [Electrical engineering]; TN [Electronics and communication technology];
Discipline codes
0808 ; 0809 ;
Abstract
A new multimodal object description network (MODN) model for dense captioning is proposed. The model comprises a vision module and a language module. The vision module uses a modified Faster Region-based Convolutional Neural Network (Faster R-CNN) to detect salient objects and extract their inherent features. The language module combines semantic features with the object features obtained from the vision module and calculates the probability distribution of each word in the sentence. Unlike existing methods, the proposed MODN framework adopts a multimodal layer that effectively extracts discriminative information from both object and semantic features. Moreover, MODN can generate object descriptions rapidly without an external region proposal stage. The effectiveness of MODN is verified on the well-known VOC2007 and Visual Genome datasets.
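The multimodal layer described above can be sketched in a few lines: it fuses an object feature vector (e.g. from a Faster R-CNN region) with a semantic feature vector (e.g. the embedding of the previous word) and maps the result to a probability distribution over the vocabulary. This is a minimal illustrative sketch, not the authors' implementation; all dimensions, weight initialisations, and the choice of tanh fusion are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the abstract does not specify them.
D_OBJ, D_SEM, D_MULTI, VOCAB = 512, 256, 128, 1000

# Randomly initialised matrices stand in for learned parameters.
W_obj = rng.normal(0.0, 0.01, (D_OBJ, D_MULTI))
W_sem = rng.normal(0.0, 0.01, (D_SEM, D_MULTI))
W_out = rng.normal(0.0, 0.01, (D_MULTI, VOCAB))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multimodal_layer(obj_feat, sem_feat):
    """Fuse object and semantic features, then score every vocabulary word."""
    fused = np.tanh(obj_feat @ W_obj + sem_feat @ W_sem)  # multimodal fusion
    return softmax(fused @ W_out)  # probability of each word in the sentence

obj_feat = rng.normal(size=D_OBJ)  # e.g. a detected object's feature vector
sem_feat = rng.normal(size=D_SEM)  # e.g. embedding of the previous word
probs = multimodal_layer(obj_feat, sem_feat)
print(probs.shape)  # (1000,); the entries sum to 1
```

At generation time such a layer would be applied once per time step, feeding the sampled word's embedding back in as the next semantic feature.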
Pages: 1041 / +
Page count: 2
Related papers
50 records in total
  • [41] Context and Attribute Grounded Dense Captioning
    Yin, Guojun
    Sheng, Lu
    Liu, Bin
    Yu, Nenghai
    Wang, Xiaogang
    Shao, Jing
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 6234 - 6243
  • [42] An Efficient Framework for Dense Video Captioning
    Suin, Maitreya
    Rajagopalan, A. N.
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 12039 - 12046
  • [43] Image Graph Production by Dense Captioning
    Sahba, Amin
    Das, Arun
    Rad, Paul
    Jamshidi, Mo
    2018 WORLD AUTOMATION CONGRESS (WAC), 2018, : 193 - 198
  • [44] MMT: A Multimodal Translator for Image Captioning
    Liu, Chang
    Sun, Fuchun
    Wang, Changhu
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, PT II, 2017, 10614 : 784 - 784
  • [45] Multimodal Image Captioning for Marketing Analysis
    Harzig, Philipp
    Brehm, Stephan
    Lienhart, Rainer
    Kaiser, Carolin
    Schallner, Rene
    IEEE 1ST CONFERENCE ON MULTIMEDIA INFORMATION PROCESSING AND RETRIEVAL (MIPR 2018), 2018, : 158 - 161
  • [46] Multimodal Feature Learning for Video Captioning
    Lee, Sujin
    Kim, Incheol
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2018, 2018
  • [47] Improving multimodal datasets with image captioning
    Thao Nguyen
    Gadre, Samir Yitzhak
    Ilharco, Gabriel
    Oh, Sewoong
    Schmidt, Ludwig
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
  • [48] Deep multimodal embedding for video captioning
    Lee, Jin Young
    MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (22) : 31793 - 31805
  • [49] A multimodal fusion approach for image captioning
    Zhao, Dexin
    Chang, Zhi
    Guo, Shutao
    NEUROCOMPUTING, 2019, 329 : 476 - 485
  • [50] Dense-Captioning Events in Videos
    Krishna, Ranjay
    Hata, Kenji
    Ren, Frederic
    Fei-Fei, Li
    Niebles, Juan Carlos
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 706 - 715