M3: Multimodal Memory Modelling for Video Captioning

被引:107
|
作者
Wang, Junbo [1 ,3 ]
Wang, Wei [1 ,3 ]
Huang, Yan [1 ,3 ]
Wang, Liang [1 ,2 ,3 ]
Tan, Tieniu [1 ,2 ,3 ]
机构
[1] Ctr Res Intelligent Percept & Comp CRIPAC, NLPR, Beijing, Peoples R China
[2] Chinese Acad Sci CASIA, Inst Automat, CEBSIT, Beijing, Peoples R China
[3] UCAS, Beijing, Peoples R China
基金
中国国家自然科学基金;
关键词
D O I
10.1109/CVPR.2018.00784
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Video captioning which automatically translates video clips into natural language sentences is a very important task in computer vision. By virtue of recent deep learning technologies, video captioning has made great progress. However learning an effective mapping from the visual sequence space to the language space is still a challenging problem due to the long-term multimodal dependency modelling and semantic misalignment. Inspired by the facts that memory modelling poses potential advantages to long-term sequential problems [35] and working memory is the key factor of visual attention [33], we propose a Multimodal Memory Model (M3) to describe videos, which builds a visual and textual shared memory to model the long-term visual-textual dependency and further guide visual attention on described visual targets to solve visual-textual alignments. Specifically, similar to [10], the proposed M3 attaches an external memory to store and retrieve both visual and textual contents by interacting with video and sentence with multiple read and write operations. To evaluate the proposed model, we perform experiments on two public datasets: MSVD and MSR-VT. The experimental results demonstrate that our method outperforms most of the stateof- the-art methods in terms of BLEU and METEOR.
引用
收藏
页码:7512 / 7520
页数:9
相关论文
共 50 条
  • [1] Sequential Memory Modelling for Video Captioning
    Puttaraja
    Nayaka, Chidambara
    Manikesh
    Sharma, Nitin
    Anand, Kumar M.
    [J]. 2022 IEEE 19TH INDIA COUNCIL INTERNATIONAL CONFERENCE, INDICON, 2022,
  • [2] Hierarchical Memory Modelling for Video Captioning
    Wang, Junbo
    Wang, Wei
    Huang, Yan
    Wang, Liang
    Tan, Tieniu
    [J]. PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018, : 63 - 71
  • [3] Multimodal architecture for video captioning with memory networks and an attention mechanism
    Li, Wei
    Guo, Dashan
    Fang, Xiangzhong
    [J]. PATTERN RECOGNITION LETTERS, 2018, 105 : 23 - 29
  • [4] Multirate Multimodal Video Captioning
    Yang, Ziwei
    Xu, Youjiang
    Wang, Huiyun
    Wang, Bo
    Han, Yahong
    [J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1877 - 1882
  • [5] Deep multimodal embedding for video captioning
    Jin Young Lee
    [J]. Multimedia Tools and Applications, 2019, 78 : 31793 - 31805
  • [6] Multimodal Feature Learning for Video Captioning
    Lee, Sujin
    Kim, Incheol
    [J]. MATHEMATICAL PROBLEMS IN ENGINEERING, 2018, 2018
  • [7] Multimodal Pretraining for Dense Video Captioning
    Huang, Gabriel
    Pang, Bo
    Zhu, Zhenhai
    Rivera, Clara
    Soricut, Radu
    [J]. 1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (AACL-IJCNLP 2020), 2020, : 470 - 490
  • [8] Deep multimodal embedding for video captioning
    Lee, Jin Young
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (22) : 31793 - 31805
  • [9] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
    Sun, Liang
    Li, Bing
    Yuan, Chunfeng
    Zha, Zhengjun
    Hu, Weiming
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1300 - 1305
  • [10] Video Captioning with Guidance of Multimodal Latent Topics
    Chen, Shizhe
    Chen, Jia
    Jin, Qin
    Hauptmann, Alexander
    [J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1838 - 1846