Multimodal object description network for dense captioning

Cited by: 2
Authors
Wang, Weixuan [1 ]
Hu, Haifeng [1 ]
Affiliation
[1] Sun Yat Sen Univ, Sch Elect & Informat Engn, Guangzhou 510006, Guangdong, Peoples R China
Keywords
DOI
10.1049/el.2017.0326
Chinese Library Classification
TM [Electrotechnics]; TN [Electronic Technology, Communication Technology];
Discipline Codes
0808 ; 0809 ;
Abstract
A new multimodal object description network (MODN) model for dense captioning is proposed. The model is constructed from a vision module and a language module. In the vision module, a modified Faster Region-based Convolutional Neural Network (Faster R-CNN) detects salient objects and extracts their inherent features. The language module combines semantic features with the object features obtained from the vision module and calculates the probability distribution of each word in the sentence. In contrast to existing methods, the proposed MODN framework adopts a multimodal layer that can effectively extract discriminative information from both object and semantic features. Moreover, MODN can generate object descriptions rapidly without external region proposals. The effectiveness of MODN is verified on the widely used VOC2007 and Visual Genome datasets.
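The fusion step the abstract describes — a multimodal layer that combines object features from the vision module with semantic features before predicting each word — can be sketched roughly as follows. All dimensions, weight names, and the choice of tanh fusion are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not values from the paper):
# object feature, semantic feature, multimodal layer, vocabulary size.
d_obj, d_sem, d_mm, vocab = 512, 256, 128, 1000

# Hypothetical projection weights of the multimodal layer.
W_obj = rng.standard_normal((d_obj, d_mm)) * 0.01
W_sem = rng.standard_normal((d_sem, d_mm)) * 0.01
W_out = rng.standard_normal((d_mm, vocab)) * 0.01

def multimodal_step(obj_feat, sem_feat):
    """Fuse object and semantic features; return a word distribution."""
    # Project both modalities into a shared space and fuse them.
    fused = np.tanh(obj_feat @ W_obj + sem_feat @ W_sem)
    logits = fused @ W_out
    # Numerically stable softmax over the vocabulary.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

p = multimodal_step(rng.standard_normal(d_obj), rng.standard_normal(d_sem))
# p is a length-`vocab` probability vector (non-negative, sums to 1)
```

In a full captioning model this step would run once per generated word, with the semantic features coming from the previously emitted words.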
Pages: 1041 / +
Number of pages: 2
Related Papers
50 records in total
  • [31] MDC-Net: Multimodal Detection and Captioning Network for Steel Surface Defects
    Chazhoor, Anthony Ashwin Peter
    Hu, Shanfeng
    Gao, Bin
    Woo, Wai Lok
    ROBOTICS, COMPUTER VISION AND INTELLIGENT SYSTEMS, ROBOVIS 2024, 2024, 2077 : 316 - 333
  • [32] DVC-Net: A deep neural network model for dense video captioning
    Lee, Sujin
    Kim, Incheol
    IET COMPUTER VISION, 2021, 15 (01) : 12 - 23
  • [33] Element-Centered Multi-granularity Network for Dense Video Captioning
    Dane, Xuan
    Wang, Guolong
    Wu, Xun
    Qin, Zheng
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT X, 2025, 15040 : 445 - 459
  • [34] CDKM: Common and Distinct Knowledge Mining Network With Content Interaction for Dense Captioning
    Deng, Hongyu
    Xie, Yushan
    Wang, Qi
    Wang, Jianjun
    Ruan, Weijian
    Liu, Wu
    Liu, Yong-Jin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 10462 - 10473
  • [35] A news image captioning approach based on multimodal pointer-generator network
    Chen, Jingqiang
    Zhuge, Hai
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2022, 34 (07):
  • [36] Dense Receptive Field Network: A Backbone Network for Object Detection
    Gao, Fei
    Yang, Chengguang
    Ge, Yisu
    Lu, Shufang
    Shao, Qike
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2019: IMAGE PROCESSING, PT III, 2019, 11729 : 105 - 118
  • [37] Object Hallucination in Image Captioning
    Rohrbach, Anna
    Hendricks, Lisa Anne
    Burns, Kaylee
    Darrell, Trevor
    Saenko, Kate
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 4035 - 4045
  • [38] Deep multimodal embedding for video captioning
    Jin Young Lee
    Multimedia Tools and Applications, 2019, 78 : 31793 - 31805
  • [39] Dense Captioning of Natural Scenes in Spanish
    Gomez-Garay, Alejandro
    Raducanu, Bogdan
    Salas, Joaquin
    PATTERN RECOGNITION, 2018, 10880 : 145 - 154
  • [40] Weakly Supervised Dense Video Captioning
    Shen, Zhiqiang
    Li, Jianguo
    Su, Zhou
    Li, Minjun
    Chen, Yurong
    Jiang, Yu-Gang
    Xue, Xiangyang
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 5159 - 5167