Learning Multimodal Attention LSTM Networks for Video Captioning

Cited by: 104

Authors:

Xu, Jun [1]

Yao, Ting [2]

Zhang, Yongdong [1]

Mei, Tao [2]

Affiliations:

[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China

[2] Microsoft Res, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17) | 2017

Funding:

National Natural Science Foundation of China;

Keywords:

Multimodal Fusion; Video Captioning; CNN; LSTM; Deep Learning;

DOI:

10.1145/3123266.3123448

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Subject classification code:

081202 ;

Abstract:

Automatic generation of video captions is a challenging task, as video is an information-intensive medium with complex variations. Most existing methods, whether based on language templates or sequence learning, treat video as a flat data sequence and ignore its intrinsically multimodal nature. Observing that different modalities (e.g., frame, motion, and audio streams), as well as the elements within each modality, contribute differently to sentence generation, we present a novel deep framework to boost video captioning by learning Multimodal Attention Long Short-Term Memory networks (MA-LSTM). Our proposed MA-LSTM fully exploits both multimodal streams and temporal attention to selectively focus on specific elements during sentence generation. Moreover, we design a novel child-sum fusion unit in the MA-LSTM to effectively combine the different encoded modalities into the initial decoding states. Unlike existing approaches that employ the same LSTM structure for different modalities, we train modality-specific LSTMs to capture the intrinsic representations of individual modalities. Experiments on two benchmark datasets (MSVD and MSR-VTT) show that our MA-LSTM significantly outperforms state-of-the-art methods, achieving 52.3 BLEU@4 and 70.4 CIDEr-D on the MSVD dataset.
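
The abstract names two mechanisms: temporal attention within each modality and a child-sum fusion unit that merges the encoded modalities into the initial decoder state. The NumPy sketch below is only an illustration of those two ideas under stated assumptions, not the authors' implementation. The additive attention scoring, the gate equations (modeled on the child-sum Tree-LSTM, with each modality treated as a child), and all names, shapes, and parameters (`temporal_attention`, `child_sum_fusion`, `W_f`, `W_q`, `v`, `params`) are hypothetical choices made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def temporal_attention(features, query, W_f, W_q, v):
    """Soft attention over the T time steps of one modality (assumed additive form).

    features: (T, D) per-step features, query: (H,) decoder state,
    W_f: (A, D), W_q: (A, H), v: (A,) are hypothetical learned parameters.
    """
    scores = np.tanh(features @ W_f.T + W_q @ query) @ v  # (T,)
    scores = scores - scores.max()                         # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()        # softmax over time
    context = weights @ features                           # (D,) weighted sum
    return context, weights


def child_sum_fusion(h_list, c_list, p):
    """Fuse K modality encoders into one (h, c) pair.

    Assumption: gates follow the child-sum Tree-LSTM pattern, with one
    forget gate per modality and no external input term.
    """
    h_sum = np.sum(h_list, axis=0)                 # sum of modality hidden states
    i = sigmoid(p["W_i"] @ h_sum + p["b_i"])       # input gate
    o = sigmoid(p["W_o"] @ h_sum + p["b_o"])       # output gate
    u = np.tanh(p["W_u"] @ h_sum + p["b_u"])       # candidate update
    c = i * u
    for h_k, c_k in zip(h_list, c_list):
        f_k = sigmoid(p["W_f"] @ h_k + p["b_f"])   # per-modality forget gate
        c = c + f_k * c_k
    h = o * np.tanh(c)
    return h, c


# Toy usage with random values standing in for learned parameters.
rng = np.random.default_rng(0)
T, D, H, A = 5, 6, 4, 3

features = rng.normal(size=(T, D))
query = rng.normal(size=H)
context, weights = temporal_attention(
    features, query,
    rng.normal(size=(A, D)), rng.normal(size=(A, H)), rng.normal(size=A),
)

params = {name: rng.normal(scale=0.1, size=(H, H))
          for name in ("W_i", "W_o", "W_u", "W_f")}
params.update({name: np.zeros(H) for name in ("b_i", "b_o", "b_u", "b_f")})

# Three modalities (e.g., frame, motion, audio), each with a final (h, c).
h_list = [rng.normal(size=H) for _ in range(3)]
c_list = [rng.normal(size=H) for _ in range(3)]
h0, c0 = child_sum_fusion(h_list, c_list, params)

print(weights.sum(), context.shape, h0.shape, c0.shape)
```

In this sketch the per-modality forget gates let the fused cell state weight each modality differently, which is one plausible reading of the abstract's claim that modalities contribute differently to sentence generation.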


Pages: 537-545

Number of pages: 9

50 items in total

[1] Multimodal architecture for video captioning with memory networks and an attention mechanism
Li, Wei
Guo, Dashan
Fang, Xiangzhong
[J]. PATTERN RECOGNITION LETTERS, 2018, 105 : 23 - 29
[2] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
Sun, Liang
Li, Bing
Yuan, Chunfeng
Zha, Zhengjun
Hu, Weiming
[J]. 2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1300 - 1305
[3] Multimodal Feature Learning for Video Captioning
Lee, Sujin
Kim, Incheol
[J]. MATHEMATICAL PROBLEMS IN ENGINEERING, 2018, 2018
[4] Residual attention-based LSTM for video captioning
Li, Xiangpeng
Zhou, Zhilong
Chen, Lijiang
Gao, Lianli
[J]. WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2019, 22 (02): : 621 - 636
[5] Residual attention-based LSTM for video captioning
Xiangpeng Li
Zhilong Zhou
Lijiang Chen
Lianli Gao
[J]. World Wide Web, 2019, 22 : 621 - 636
[6] Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
Song, Jingkuan
Gao, Lianli
Guo, Zhao
Liu, Wu
Zhang, Dongxiang
Shen, Heng Tao
[J]. PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 2737 - 2743
[7] Multimodal attention-based transformer for video captioning
Hemalatha Munusamy
Chandra Sekhar C
[J]. Applied Intelligence, 2023, 53 : 23349 - 23368
[8] Multimodal attention-based transformer for video captioning
Munusamy, Hemalatha
Sekhar, C. Chandra
[J]. APPLIED INTELLIGENCE, 2023, 53 (20) : 23349 - 23368
[9] Video Captioning With Attention-Based LSTM and Semantic Consistency
Gao, Lianli
Guo, Zhao
Zhang, Hanwang
Xu, Xing
Shen, Heng Tao
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2017, 19 (09) : 2045 - 2055
[10] Attention-based Densely Connected LSTM for Video Captioning
Zhu, Yongqing
Jiang, Shuqiang
[J]. PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 802 - 810
