Learning Multimodal Attention LSTM Networks for Video Captioning

Cited by: 104
Authors
Xu, Jun [1]
Yao, Ting [2]
Zhang, Yongdong [1]
Mei, Tao [2]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
[2] Microsoft Res, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multimodal Fusion; Video Captioning; CNN; LSTM; Deep Learning;
DOI
10.1145/3123266.3123448
CLC Number
TP301 [Theory, Methods];
Discipline Code
081202;
Abstract
Automatic generation of video captions is a challenging task, as video is an information-intensive medium with complex variations. Most existing methods, whether based on language templates or sequence learning, treat video as a flat data sequence and ignore its intrinsic multimodal nature. Observing that different modalities (e.g., frame, motion, and audio streams), as well as the elements within each modality, contribute differently to sentence generation, we present a novel deep framework that boosts video captioning by learning Multimodal Attention Long Short-Term Memory networks (MA-LSTM). Our MA-LSTM fully exploits both multimodal streams and temporal attention to selectively focus on specific elements during sentence generation. Moreover, we design a novel child-sum fusion unit in the MA-LSTM to effectively combine the different encoded modalities into the initial decoding states. Unlike existing approaches that employ the same LSTM structure for all modalities, we train modality-specific LSTMs to capture the intrinsic representations of individual modalities. Experiments on two benchmark datasets (MSVD and MSR-VTT) show that MA-LSTM significantly outperforms state-of-the-art methods, achieving 52.3 BLEU@4 and 70.4 CIDEr-D on the MSVD dataset.
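As a rough illustration of the child-sum fusion idea described in the abstract, the sketch below fuses the final (hidden, cell) states of several modality-specific LSTM encoders into a single initial decoder state using Tree-LSTM-style child-sum gating: one shared input/output/update gate over the summed hidden states, plus a separate forget gate per modality. This is a minimal NumPy sketch under those assumptions, not the authors' exact formulation; the function and parameter names (child_sum_fusion, W_i, b_i, ...) are illustrative.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def child_sum_fusion(states, p):
        # states: list of (h_m, c_m) final states, one per modality encoder
        h_sum = sum(h for h, _ in states)            # summed hidden states
        i = sigmoid(p["W_i"] @ h_sum + p["b_i"])     # shared input gate
        o = sigmoid(p["W_o"] @ h_sum + p["b_o"])     # shared output gate
        u = np.tanh(p["W_u"] @ h_sum + p["b_u"])     # candidate memory update
        c = i * u
        for h_m, c_m in states:                      # one forget gate per modality
            f_m = sigmoid(p["W_f"] @ h_m + p["b_f"])
            c = c + f_m * c_m                        # child-sum of gated memories
        h = o * np.tanh(c)
        return h, c                                  # initial decoder state (h0, c0)

    # Toy usage: fuse frame, motion, and audio encoder states of dimension 4.
    d = 4
    rng = np.random.default_rng(0)
    p = {k: rng.standard_normal((d, d)) * 0.1 for k in ("W_i", "W_f", "W_o", "W_u")}
    p.update({k: np.zeros(d) for k in ("b_i", "b_f", "b_o", "b_u")})
    states = [(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(3)]
    h0, c0 = child_sum_fusion(states, p)
    print(h0.shape, c0.shape)  # -> (4,) (4,)

The per-modality forget gates are what distinguish this from simple averaging: each encoder's memory cell is weighted by how relevant its own hidden state looks, so stronger modalities contribute more to the initial decoding state.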
Pages: 537-545
Number of pages: 9
Related Papers
50 records in total
  • [1] Multimodal architecture for video captioning with memory networks and an attention mechanism
    Li, Wei
    Guo, Dashan
    Fang, Xiangzhong
    [J]. PATTERN RECOGNITION LETTERS, 2018, 105 : 23 - 29
  • [2] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
    Sun, Liang
    Li, Bing
    Yuan, Chunfeng
    Zha, Zhengjun
    Hu, Weiming
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1300 - 1305
  • [3] Multimodal Feature Learning for Video Captioning
    Lee, Sujin
    Kim, Incheol
    [J]. MATHEMATICAL PROBLEMS IN ENGINEERING, 2018, 2018
  • [4] Residual attention-based LSTM for video captioning
    Li, Xiangpeng
    Zhou, Zhilong
    Chen, Lijiang
    Gao, Lianli
    [J]. WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2019, 22 (02) : 621 - 636
  • [5] Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
    Song, Jingkuan
    Gao, Lianli
    Guo, Zhao
    Liu, Wu
    Zhang, Dongxiang
    Shen, Heng Tao
    [J]. PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 2737 - 2743
  • [6] Multimodal attention-based transformer for video captioning
    Munusamy, Hemalatha
    Sekhar, C. Chandra
    [J]. APPLIED INTELLIGENCE, 2023, 53 (20) : 23349 - 23368
  • [7] Video Captioning With Attention-Based LSTM and Semantic Consistency
    Gao, Lianli
    Guo, Zhao
    Zhang, Hanwang
    Xu, Xing
    Shen, Heng Tao
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2017, 19 (09) : 2045 - 2055
  • [8] Attention-based Densely Connected LSTM for Video Captioning
    Zhu, Yongqing
    Jiang, Shuqiang
    [J]. PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 802 - 810