Multimodal object description network for dense captioning

Cited by: 2
Authors
Wang, Weixuan [1 ]
Hu, Haifeng [1 ]
Affiliation
[1] Sun Yat Sen Univ, Sch Elect & Informat Engn, Guangzhou 510006, Guangdong, Peoples R China
Keywords
DOI
10.1049/el.2017.0326
Chinese Library Classification
TM [Electrotechnics]; TN [Electronic Technology, Communication Technology];
Discipline Codes
0808 ; 0809 ;
Abstract
A new multimodal object description network (MODN) model for dense captioning is proposed. The model is constructed from a vision module and a language module. In the vision module, a modified Faster Region-based Convolutional Neural Network (Faster R-CNN) detects salient objects and extracts their inherent features. The language module combines semantic features with the object features obtained from the vision module and calculates the probability distribution of each word in the sentence. In contrast to existing methods, the proposed MODN framework adopts a multimodal layer that can effectively extract discriminative information from both object and semantic features. Moreover, MODN can generate object descriptions rapidly without external region proposals. The effectiveness of MODN is verified on the widely used VOC2007 and Visual Genome datasets.
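The fusion step the abstract describes — a multimodal layer that combines object features from the vision module with semantic features before predicting each word — can be sketched roughly as follows. All dimensions, weight names, and the choice of tanh fusion are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not values from the paper):
# object feature, semantic feature, multimodal layer, vocabulary size.
d_obj, d_sem, d_mm, vocab = 512, 256, 128, 1000

# Hypothetical projection weights of the multimodal layer.
W_obj = rng.standard_normal((d_obj, d_mm)) * 0.01
W_sem = rng.standard_normal((d_sem, d_mm)) * 0.01
W_out = rng.standard_normal((d_mm, vocab)) * 0.01

def multimodal_step(obj_feat, sem_feat):
    """Fuse object and semantic features; return a word distribution."""
    # Project both modalities into a shared space and fuse them.
    fused = np.tanh(obj_feat @ W_obj + sem_feat @ W_sem)
    logits = fused @ W_out
    # Numerically stable softmax over the vocabulary.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

p = multimodal_step(rng.standard_normal(d_obj), rng.standard_normal(d_sem))
# p is a length-`vocab` probability vector (non-negative, sums to 1)
```

In a full captioning model this step would run once per generated word, with the semantic features coming from the previously emitted words.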
Pages: 1041 / +
Number of pages: 2
Related Papers
50 records in total
  • [31] MDC-Net: Multimodal Detection and Captioning Network for Steel Surface Defects
    Chazhoor, Anthony Ashwin Peter
    Hu, Shanfeng
    Gao, Bin
    Woo, Wai Lok
    ROBOTICS, COMPUTER VISION AND INTELLIGENT SYSTEMS, ROBOVIS 2024, 2024, 2077 : 316 - 333
  • [32] DVC-Net: A deep neural network model for dense video captioning
    Lee, Sujin
    Kim, Incheol
    IET COMPUTER VISION, 2021, 15 (01) : 12 - 23
  • [33] Element-Centered Multi-granularity Network for Dense Video Captioning
    Dane, Xuan
    Wang, Guolong
    Wu, Xun
    Qin, Zheng
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT X, 2025, 15040 : 445 - 459
  • [34] CDKM: Common and Distinct Knowledge Mining Network With Content Interaction for Dense Captioning
    Deng, Hongyu
    Xie, Yushan
    Wang, Qi
    Wang, Jianjun
    Ruan, Weijian
    Liu, Wu
    Liu, Yong-Jin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 10462 - 10473
  • [35] A news image captioning approach based on multimodal pointer-generator network
    Chen, Jingqiang
    Zhuge, Hai
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2022, 34 (07):
  • [36] Dense Receptive Field Network: A Backbone Network for Object Detection
    Gao, Fei
    Yang, Chengguang
    Ge, Yisu
    Lu, Shufang
    Shao, Qike
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2019: IMAGE PROCESSING, PT III, 2019, 11729 : 105 - 118
  • [37] Object Hallucination in Image Captioning
    Rohrbach, Anna
    Hendricks, Lisa Anne
    Burns, Kaylee
    Darrell, Trevor
    Saenko, Kate
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 4035 - 4045
  • [38] Deep multimodal embedding for video captioning
    Jin Young Lee
    Multimedia Tools and Applications, 2019, 78 : 31793 - 31805
  • [39] Dense Captioning of Natural Scenes in Spanish
    Gomez-Garay, Alejandro
    Raducanu, Bogdan
    Salas, Joaquin
    PATTERN RECOGNITION, 2018, 10880 : 145 - 154
  • [40] Weakly Supervised Dense Video Captioning
    Shen, Zhiqiang
    Li, Jianguo
    Su, Zhou
    Li, Minjun
    Chen, Yurong
    Jiang, Yu-Gang
    Xue, Xiangyang
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 5159 - 5167