Multimodal object description network for dense captioning

Cited by: 2
Authors
Wang, Weixuan [1 ]
Hu, Haifeng [1 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Elect & Informat Engn, Guangzhou 510006, Guangdong, Peoples R China
DOI
10.1049/el.2017.0326
CLC classification
TM [Electrical engineering]; TN [Electronic and communication technology];
Discipline codes
0808 ; 0809 ;
Abstract
A new multimodal object description network (MODN) model for dense captioning is proposed. The model consists of a vision module and a language module. In the vision module, a modified Faster R-CNN (regions with convolutional neural network features) detects salient objects and extracts their inherent features. The language module combines semantic features with the object features obtained from the vision module and calculates the probability distribution of each word in the sentence. In contrast to existing methods, the proposed MODN framework adopts a multimodal layer that effectively extracts discriminative information from both object and semantic features. Moreover, MODN can generate object descriptions rapidly without external region proposals. The effectiveness of MODN is verified on the VOC2007 and Visual Genome datasets.
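The fusion step the abstract describes can be illustrated with a minimal sketch: a multimodal layer that projects an object feature (from the vision module) and a semantic feature (e.g. the previous word's embedding) into a shared space, fuses them, and emits a probability distribution over the vocabulary. This is NOT the authors' code; all dimensions, weights, and function names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper).
OBJ_DIM, SEM_DIM, MM_DIM, VOCAB = 512, 256, 128, 1000

# Randomly initialised projections, standing in for learned parameters.
W_obj = rng.normal(scale=0.01, size=(MM_DIM, OBJ_DIM))
W_sem = rng.normal(scale=0.01, size=(MM_DIM, SEM_DIM))
W_out = rng.normal(scale=0.01, size=(VOCAB, MM_DIM))

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def multimodal_step(obj_feat, sem_feat):
    """Fuse one object feature with the current semantic feature and
    return a probability distribution over the next word."""
    fused = np.tanh(W_obj @ obj_feat + W_sem @ sem_feat)  # multimodal layer
    return softmax(W_out @ fused)                          # word distribution

obj_feat = rng.normal(size=OBJ_DIM)  # e.g. an R-CNN region feature
sem_feat = rng.normal(size=SEM_DIM)  # e.g. embedding of the previous word
probs = multimodal_step(obj_feat, sem_feat)
```

In a full dense-captioning model this step would run once per generated word, with the semantic feature updated by a recurrent language model; the sketch only shows the fusion and the per-word distribution.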
Pages: 1041 / +
Page count: 2
Related papers (showing 10 of 50)
  • [1] Multimodal Pretraining for Dense Video Captioning
    Huang, Gabriel; Pang, Bo; Zhu, Zhenhai; Rivera, Clara; Soricut, Radu
    1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (AACL-IJCNLP 2020), 2020: 470-490
  • [2] Learning Object Context for Dense Captioning
    Li, Xiangyang; Jiang, Shuqiang; Han, Jungong
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019: 8650-8657
  • [3] Multimodal Semantic Attention Network for Video Captioning
    Sun, Liang; Li, Bing; Yuan, Chunfeng; Zha, Zhengjun; Hu, Weiming
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019: 1300-1305
  • [4] Dense semantic embedding network for image captioning
    Xiao, Xinyu; Wang, Lingfeng; Ding, Kun; Xiang, Shiming; Pan, Chunhong
    PATTERN RECOGNITION, 2019, 90: 285-296
  • [5] Region-Focused Network for Dense Captioning
    Huang, Qingbao; Li, Pijian; Huang, Youji; Shuang, Feng; Cai, Yi
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (06)
  • [6] Multimodal feature fusion based on object relation for video captioning
    Yan, Zhiwen; Chen, Ying; Song, Jinlong; Zhu, Jia
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2023, 8 (01): 247-259
  • [7] Cascaded Revision Network for Novel Object Captioning
    Feng, Qianyu; Wu, Yu; Fan, Hehe; Yan, Chenggang; Xu, Mingliang; Yang, Yi
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (10): 3413-3421
  • [8] Multimodal graph neural network for video procedural captioning
    Ji, Lei; Tu, Rongcheng; Lin, Kevin; Wang, Lijuan; Duan, Nan
    NEUROCOMPUTING, 2022, 488: 88-96
  • [9] An Object Localization-based Dense Image Captioning Framework in Hindi
    Mishra, Santosh Kumar; Harshit; Saha, Sriparna; Bhattacharyya, Pushpak
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (02)
  • [10] Multi-scale Hierarchical Residual Network for Dense Captioning
    Tian, Yan; Wang, Xun; Wu, Jiachen; Wang, Ruili; Yang, Bailin
    JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2019, 64: 181-196