Multimodal architecture for video captioning with memory networks and an attention mechanism

Cited by: 26
Authors
Li, Wei [1]
Guo, Dashan [1]
Fang, Xiangzhong [1]
Affiliation
[1] Shanghai Jiao Tong Univ, Dept Elect Engn, Shanghai 200240, Peoples R China
Keywords
Video captioning; Memory network; Attention mechanism
DOI
10.1016/j.patrec.2017.10.012
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Automatically describing videos that contain rich, open-domain activities is a very challenging task for computer vision and machine learning research. Accurate descriptions of video content require an understanding of both visual concepts and their temporal dynamics. Much effort has been devoted to understanding visual concepts in still-image tasks, e.g., image classification and object detection. However, the combination of visual concepts and temporal dynamics has not received sufficient attention. To delve deeper into this unique characteristic of videos, we propose a novel video captioning architecture that integrates both visual concepts and temporal dynamics. In this paper, an attention mechanism and memory networks are combined into a multimodal framework together with a feature selection algorithm. Specifically, we use the soft attention mechanism to choose frames relevant to visual concepts based on previously generated words, while the temporal dynamics are memorized by the memory networks, which excel at retaining long-term information. The visual concepts and temporal dynamics are then integrated into our multimodal architecture. Moreover, the feature selection algorithm is applied to select the more relevant of the two feature streams according to the part of speech of the word being generated. Finally, we test the proposed framework on the MSVD and MSR-VTT datasets and achieve competitive performance compared with other state-of-the-art methods. (c) 2017 Elsevier B.V. All rights reserved.
Pages: 23-29
Page count: 7
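
To make the mechanism described in the abstract more concrete, the following is a minimal PyTorch-style sketch of soft attention over video frame features conditioned on the decoder state after the previously generated word, in the spirit of the architecture the abstract outlines. It is an illustration only, not the authors' implementation: the names and dimensions (FrameAttention, feat_dim, hidden_dim, attn_dim) are assumptions, and the paper's memory network and part-of-speech-based feature selection components are not reproduced here.

```python
# Minimal sketch (not the authors' code): soft attention over video frame
# features conditioned on the caption decoder state, as commonly used in
# video captioning. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class FrameAttention(nn.Module):
    """Scores each frame against the current decoder hidden state and
    returns an attention-weighted summary of the frame features."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)     # project frame features
        self.proj_state = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)                # scalar relevance score

    def forward(self, frame_feats: torch.Tensor, dec_state: torch.Tensor):
        # frame_feats: (batch, num_frames, feat_dim)
        # dec_state:   (batch, hidden_dim) -- state after the previous word
        energy = torch.tanh(
            self.proj_feat(frame_feats) + self.proj_state(dec_state).unsqueeze(1)
        )                                                    # (batch, num_frames, attn_dim)
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)  # (batch, num_frames)
        context = (weights.unsqueeze(-1) * frame_feats).sum(dim=1)      # (batch, feat_dim)
        return context, weights


if __name__ == "__main__":
    # Toy usage: 2 videos, 26 sampled frames, 2048-d CNN features, 512-d decoder state.
    attn = FrameAttention(feat_dim=2048, hidden_dim=512)
    feats = torch.randn(2, 26, 2048)
    state = torch.randn(2, 512)
    context, weights = attn(feats, state)
    print(context.shape, weights.shape)  # torch.Size([2, 2048]) torch.Size([2, 26])
```

In the full model described by the abstract, such an attended visual context would be combined with the memory network's temporal representation, and the part-of-speech-based feature selection would decide which of the two feature streams to favor before predicting each word; those components are omitted from this sketch.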