Multimodal object description network for dense captioning

Cited by: 2
Authors
Wang, Weixuan [1 ]
Hu, Haifeng [1 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Elect & Informat Engn, Guangzhou 510006, Guangdong, Peoples R China
Keywords
DOI
10.1049/el.2017.0326
CLC classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline codes
0808; 0809;
Abstract
A new multimodal object description network (MODN) model for dense captioning is proposed. The model consists of a vision module and a language module. In the vision module, a modified Faster R-CNN (regions with convolutional neural network features) detects salient objects and extracts their inherent features. The language module combines semantic features with the object features obtained from the vision module and calculates the probability distribution of each word in the sentence. In contrast to existing methods, the proposed MODN framework adopts a multimodal layer that effectively extracts discriminative information from both object and semantic features. Moreover, MODN generates object descriptions rapidly without external region proposals. The effectiveness of MODN is verified on the well-known VOC2007 and Visual Genome datasets.
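The abstract's central idea, a multimodal layer that projects object features (from the detector) and semantic features (from the word context) into a shared space before predicting the next word, can be sketched in plain Python. This is a hedged illustration of the general fusion scheme, not the paper's actual implementation: the additive fusion, the tiny dimensions, and all weight matrices (`W_obj`, `W_sem`, `W_out`) here are invented for demonstration.

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def multimodal_layer(obj_feat, sem_feat, W_obj, W_sem, W_out):
    # Project each modality into a shared space, fuse by addition
    # (one common fusion choice), then score every vocabulary word
    fused = [o + s for o, s in zip(matvec(W_obj, obj_feat),
                                   matvec(W_sem, sem_feat))]
    return softmax(matvec(W_out, fused))

# Toy setup: 2-D object feature, 3-D semantic feature,
# shared dimension 2, a 4-word toy vocabulary
obj_feat = [1.0, 0.5]
sem_feat = [0.2, 0.8, 0.1]
W_obj = [[0.3, -0.1], [0.2, 0.4]]                            # 2x2 object projection
W_sem = [[0.1, 0.5, -0.2], [0.0, 0.3, 0.6]]                  # 2x3 semantic projection
W_out = [[0.5, -0.5], [0.2, 0.1], [-0.3, 0.4], [0.0, 0.2]]   # 4x2 vocabulary scorer

probs = multimodal_layer(obj_feat, sem_feat, W_obj, W_sem, W_out)
print(probs)  # a probability distribution over the 4-word toy vocabulary
```

In a real dense-captioning pipeline the object feature would come from the region pooled by Faster R-CNN and the semantic feature from the language model's hidden state, with the fused representation fed back into the recurrent decoder at each time step.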
Pages: 1041+
Number of pages: 2
Related papers
50 records
  • [1] Multimodal Pretraining for Dense Video Captioning
    Huang, Gabriel
    Pang, Bo
    Zhu, Zhenhai
    Rivera, Clara
    Soricut, Radu
    1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (AACL-IJCNLP 2020), 2020, : 470 - 490
  • [2] Learning Object Context for Dense Captioning
    Li, Xiangyang
    Jiang, Shuqiang
    Han, Jungong
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 8650 - 8657
  • [3] Multimodal Context Fusion Based Dense Video Captioning Algorithm
    Li, Meiqi
    Zhou, Ziwei
    ENGINEERING LETTERS, 2025, 33 (04) : 1061 - 1072
  • [4] Dense semantic embedding network for image captioning
    Xiao, Xinyu
    Wang, Lingfeng
    Ding, Kun
    Xiang, Shiming
    Pan, Chunhong
    PATTERN RECOGNITION, 2019, 90 : 285 - 296
  • [5] Region-Focused Network for Dense Captioning
    Huang, Qingbao
    Li, Pijian
    Huang, Youji
    Shuang, Feng
    Cai, Yi
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (06)
  • [6] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
    Sun, Liang
    Li, Bing
    Yuan, Chunfeng
    Zha, Zhengjun
    Hu, Weiming
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1300 - 1305
  • [7] Multimodal feature fusion based on object relation for video captioning
    Yan, Zhiwen
    Chen, Ying
    Song, Jinlong
    Zhu, Jia
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2023, 8 (01) : 247 - 259
  • [8] Cascaded Revision Network for Novel Object Captioning
    Feng, Qianyu
    Wu, Yu
    Fan, Hehe
    Yan, Chenggang
    Xu, Mingliang
    Yang, Yi
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (10) : 3413 - 3421
  • [9] Multimodal graph neural network for video procedural captioning
    Ji, Lei
    Tu, Rongcheng
    Lin, Kevin
    Wang, Lijuan
    Duan, Nan
    NEUROCOMPUTING, 2022, 488 : 88 - 96
  • [10] An Object Localization-based Dense Image Captioning Framework in Hindi
    Mishra, Santosh Kumar
    Harshit
    Saha, Sriparna
    Bhattacharyya, Pushpak
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (02)