Multimodal object description network for dense captioning

Cited by: 2
Authors
Wang, Weixuan [1 ]
Hu, Haifeng [1 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Elect & Informat Engn, Guangzhou 510006, Guangdong, Peoples R China
Keywords
DOI
10.1049/el.2017.0326
CLC classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline codes
0808; 0809;
Abstract
A new multimodal object description network (MODN) model for dense captioning is proposed. The model consists of a vision module and a language module. In the vision module, a modified Faster R-CNN (regions with convolutional neural network features) detects salient objects and extracts their inherent features. The language module combines semantic features with the object features obtained from the vision module and calculates the probability distribution of each word in the sentence. In contrast to existing methods, the proposed MODN framework adopts a multimodal layer that effectively extracts discriminative information from both object and semantic features. Moreover, MODN generates object descriptions rapidly without external region proposals. The effectiveness of MODN is verified on the well-known VOC2007 and Visual Genome datasets.
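The abstract's central idea, a multimodal layer that projects object features (from the detector) and semantic features (from the word context) into a shared space before predicting the next word, can be sketched in plain Python. This is a hedged illustration of the general fusion scheme, not the paper's actual implementation: the additive fusion, the tiny dimensions, and all weight matrices (`W_obj`, `W_sem`, `W_out`) here are invented for demonstration.

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def multimodal_layer(obj_feat, sem_feat, W_obj, W_sem, W_out):
    # Project each modality into a shared space, fuse by addition
    # (one common fusion choice), then score every vocabulary word
    fused = [o + s for o, s in zip(matvec(W_obj, obj_feat),
                                   matvec(W_sem, sem_feat))]
    return softmax(matvec(W_out, fused))

# Toy setup: 2-D object feature, 3-D semantic feature,
# shared dimension 2, a 4-word toy vocabulary
obj_feat = [1.0, 0.5]
sem_feat = [0.2, 0.8, 0.1]
W_obj = [[0.3, -0.1], [0.2, 0.4]]                            # 2x2 object projection
W_sem = [[0.1, 0.5, -0.2], [0.0, 0.3, 0.6]]                  # 2x3 semantic projection
W_out = [[0.5, -0.5], [0.2, 0.1], [-0.3, 0.4], [0.0, 0.2]]   # 4x2 vocabulary scorer

probs = multimodal_layer(obj_feat, sem_feat, W_obj, W_sem, W_out)
print(probs)  # a probability distribution over the 4-word toy vocabulary
```

In a real dense-captioning pipeline the object feature would come from the region pooled by Faster R-CNN and the semantic feature from the language model's hidden state, with the fused representation fed back into the recurrent decoder at each time step.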
Pages: 1041+
Number of pages: 2
Related papers
50 records
  • [1] Multimodal Pretraining for Dense Video Captioning
    Huang, Gabriel
    Pang, Bo
    Zhu, Zhenhai
    Rivera, Clara
    Soricut, Radu
    1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (AACL-IJCNLP 2020), 2020, : 470 - 490
  • [2] Learning Object Context for Dense Captioning
    Li, Xiangyang
    Jiang, Shuqiang
    Han, Jungong
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 8650 - 8657
  • [3] Multimodal Context Fusion Based Dense Video Captioning Algorithm
    Li, Meiqi
    Zhou, Ziwei
    ENGINEERING LETTERS, 2025, 33 (04) : 1061 - 1072
  • [4] Dense semantic embedding network for image captioning
    Xiao, Xinyu
    Wang, Lingfeng
    Ding, Kun
    Xiang, Shiming
    Pan, Chunhong
    PATTERN RECOGNITION, 2019, 90 : 285 - 296
  • [5] Region-Focused Network for Dense Captioning
    Huang, Qingbao
    Li, Pijian
    Huang, Youji
    Shuang, Feng
    Cai, Yi
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (06)
  • [6] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
    Sun, Liang
    Li, Bing
    Yuan, Chunfeng
    Zha, Zhengjun
    Hu, Weiming
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1300 - 1305
  • [7] Multimodal feature fusion based on object relation for video captioning
    Yan, Zhiwen
    Chen, Ying
    Song, Jinlong
    Zhu, Jia
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2023, 8 (01) : 247 - 259
  • [8] Cascaded Revision Network for Novel Object Captioning
    Feng, Qianyu
    Wu, Yu
    Fan, Hehe
    Yan, Chenggang
    Xu, Mingliang
    Yang, Yi
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (10) : 3413 - 3421
  • [9] Multimodal graph neural network for video procedural captioning
    Ji, Lei
    Tu, Rongcheng
    Lin, Kevin
    Wang, Lijuan
    Duan, Nan
    NEUROCOMPUTING, 2022, 488 : 88 - 96
  • [10] An Object Localization-based Dense Image Captioning Framework in Hindi
    Mishra, Santosh Kumar
    Harshit
    Saha, Sriparna
    Bhattacharyya, Pushpak
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (02)