Multirate Multimodal Video Captioning

Cited by: 7

Authors:

Yang, Ziwei [1]

Xu, Youjiang [1]

Wang, Huiyun [1]

Wang, Bo [1]

Han, Yahong [1]

Affiliations:

[1] Tianjin University, School of Computer Science and Technology, Tianjin, People's Republic of China

Source:

PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17) | 2017

Keywords:

Video Captioning; GRUs; Multimodal; CNN

DOI:

10.1145/3123266.3127904

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

Automatically describing videos in natural language is a central challenge in video understanding. Compared with images, videos have a distinctive spatio-temporal structure and carry information in several modalities. In this paper, we propose a Multirate Multimodal Approach for video captioning. Because the speed of motion in videos varies constantly, we use a Multirate GRU to capture the temporal structure of videos: it encodes video frames sampled at different intervals, which helps it cope with variation in motion speed. Because videos contain cues from different modalities, we design a dedicated multimodal fusion method that combines visual, motion, and topic information into a single video representation. This representation is then fed into an RNN-based language model that generates natural language descriptions. We evaluate our approach on "Microsoft Research-Video to Text" (MSR-VTT), a large-scale benchmark for video understanding. Our approach achieves strong performance in the 2nd MSR Video to Language Challenge.
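The abstract does not give the internals of the Multirate GRU, so the following is only a minimal sketch of the general idea it describes: the same frame sequence is encoded several times, each time sampled at a different interval, and the resulting states are combined. Everything specific here is an assumption made for illustration: the intervals (1, 2, 4), the feature and hidden sizes, the random weights, the bias-free GRU cell, and the use of concatenation to combine the per-rate states. None of it is taken from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


class GRUCell:
    """Minimal bias-free GRU cell with random weights (illustration only)."""

    def __init__(self, input_dim, hidden_dim, rng):
        def init():
            cols = input_dim + hidden_dim
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(hidden_dim)]

        self.hidden_dim = hidden_dim
        self.W_z, self.W_r, self.W_h = init(), init(), init()

    def step(self, x, h):
        xh = x + h                                      # concatenate input and state
        z = [sigmoid(a) for a in matvec(self.W_z, xh)]  # update gate
        r = [sigmoid(a) for a in matvec(self.W_r, xh)]  # reset gate
        gated = x + [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a) for a in matvec(self.W_h, gated)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def multirate_encode(frames, cells, intervals):
    """Encode the frame sequence once per sampling interval, then concatenate."""
    states = []
    for cell, interval in zip(cells, intervals):
        h = [0.0] * cell.hidden_dim
        for x in frames[::interval]:                    # keep every interval-th frame
            h = cell.step(x, h)
        states.extend(h)
    return states


rng = random.Random(0)
# Hypothetical input: 16 frames, each an 8-dimensional feature vector.
frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
intervals = (1, 2, 4)                                   # assumed sampling intervals
cells = [GRUCell(8, 4, rng) for _ in intervals]
video_vec = multirate_encode(frames, cells, intervals)
print(len(video_vec))  # 12
```

With an interval of 4 the encoder sees only 4 of the 16 frames, so fast motion spans few steps, while an interval of 1 follows slow motion frame by frame. A trained model would learn the weights and might share or couple the per-rate states differently.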


Pages: 1877-1882

Page count: 6
