Hierarchical Memory Decoder for Visual Narrating

Cited by: 10
Authors
Wu, Aming [1, 2]
Han, Yahong [1, 2, 3]
Zhao, Zhou [4 ]
Yang, Yi [5 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300350, Peoples R China
[2] Tianjin Univ, Tianjin Key Lab Machine Learning, Tianjin 300350, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[4] Zhejiang Univ, Coll Comp Sci, Hangzhou 310007, Peoples R China
[5] Univ Technol Sydney, Sch Comp Sci, Sydney, NSW 2007, Australia
Keywords
Decoding; Visualization; Videos; Task analysis; Computer architecture; Electronic mail; Semantics; Visual narrating; multi-modal fusion; hierarchical memory decoder; video captioning; visual storytelling; STREAM;
DOI
10.1109/TCSVT.2020.3020877
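Chinese Library Classification
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline classification code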
0808 ; 0809 ;
Abstract
Visual narrating focuses on generating semantic descriptions that summarize the visual content of images or videos, e.g., visual captioning and visual storytelling. The challenge lies mainly in designing a decoder that generates accurate descriptions matching the visual content. Recent work often employs a recurrent neural network (RNN), e.g., Long Short-Term Memory (LSTM), as the decoder. However, an RNN is prone to diluting long-term information, which weakens its ability to capture long-term dependencies. Recent work has shown that a memory network (MemNet) has the advantage of storing long-term information. As a decoder, however, it has not been well exploited for visual narrating, partly because of the difficulty of multi-modal sequential decoding with MemNet. In this article, we devise a novel memory decoder for visual narrating. Concretely, to obtain a better multi-modal representation, we first design a new multi-modal fusion method to fully merge visual and lexical information. Then, based on the fusion result, we construct a MemNet-based decoder consisting of multiple memory layers. In each layer, we employ a memory set to store previous decoding information and use an attention mechanism to adaptively select the information related to the current output. We also employ a memory set to store the decoding output of each memory layer at the current time step, again using an attention mechanism to select the related information. Thus, this decoder alleviates the dilution of long-term information. Moreover, the hierarchical architecture leverages the latent information of each layer, which helps generate accurate descriptions. Experimental results on two visual narrating tasks, i.e., video captioning and visual storytelling, show that our decoder obtains superior results and outperforms conventional RNN-based decoders.
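The two attention steps described in the abstract (one over stored past decoding states within each layer, one over the per-layer outputs at the current time step) can be illustrated with a toy sketch. This is not the authors' implementation: the dot-product scoring, the residual combination, and the names `attend` and `decode_step` are assumptions chosen only to make the idea concrete, and the multi-modal fusion step is replaced by a plain input vector.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memory):
    # Dot-product attention (an assumed scoring function): weight each
    # memory slot by its similarity to the query, return the weighted sum.
    scores = [sum(q * v for q, v in zip(query, slot)) for slot in memory]
    weights = softmax(scores)
    read = [sum(w * slot[i] for w, slot in zip(weights, memory))
            for i in range(len(query))]
    return read, weights

def decode_step(fused_input, layer_memories):
    # One time step through a stack of memory layers (hypothetical structure).
    layer_outputs = []
    h = fused_input
    for memory in layer_memories:
        if memory:
            # Select information from previous time steps stored in this layer.
            read, _ = attend(h, memory)
            h = [a + b for a, b in zip(h, read)]  # residual sum is an assumption
        memory.append(h)          # store this step's state for later steps
        layer_outputs.append(h)
    # Second attention: select among the outputs of all layers at this step.
    out, layer_weights = attend(h, layer_outputs)
    return out, layer_weights

# Toy run: two memory layers, three time steps of 2-dimensional inputs.
memories = [[], []]
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    out, layer_weights = decode_step(x, memories)
```

Because every past state stays in the memory set and is read through attention, nothing is overwritten by a recurrent update, which is the intuition behind the claim that such a decoder alleviates dilution of long-term information.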
Pages: 2438 - 2449
Page count: 12
Related papers
50 in total
  • [31] Narrating the American West: New Forms of Historical Memory
    Spurgeon, Sara
    WESTERN AMERICAN LITERATURE, 2009, 44 (01) : 79 - 80
  • [32] Narrating Historical Injustice: Political Responsibility and the Politics of Memory
    Temin, David Myer
    Dahl, Adam
    POLITICAL RESEARCH QUARTERLY, 2017, 70 (04) : 905 - 917
  • [33] Why Narrating Changes Memory: A Contribution to an Integrative Model of Memory and Narrative Processes
    Smorti, Andrea
    Fioretti, Chiara
    Integrative Psychological and Behavioral Science, 2016, 50 : 296 - 319
  • [34] Unsupervised Story Comprehension with Hierarchical Encoder-Decoder
    Wang, Bingning
    Yao, Ting
    Zhang, Qi
    Xu, Jingfang
    Liu, Kang
    Tian, Zhixing
    Zhao, Jun
    PROCEEDINGS OF THE 2019 ACM SIGIR INTERNATIONAL CONFERENCE ON THEORY OF INFORMATION RETRIEVAL (ICTIR'19), 2019, : 92 - 99
  • [35] Visual Target Sequence Prediction via Hierarchical Temporal Memory Implemented on the iCub Robot
    Kirtay, Murat
    Falotico, Egidio
    Ambrosano, Alessandro
    Albanese, Ugo
    Vannucci, Lorenzo
    Laschi, Cecilia
    BIOMIMETIC AND BIOHYBRID SYSTEMS, LIVING MACHINES 2016, 2016, 9793 : 119 - 130
  • [36] Hierarchical Bayesian measurement models for continuous reproduction of visual features from working memory
    Oberauer, Klaus
    Stoneking, Colin
    Wabersich, Dominik
    Lin, Hsuan-Yu
    JOURNAL OF VISION, 2017, 17 (05):
  • [37] Hierarchical organization in visual working memory: From global ensemble to individual object structure
    Nie, Qi-Yang
    Mueller, Hermann J.
    Conci, Markus
    COGNITION, 2017, 159 : 85 - 96
  • [38] Memory optimization of MAP turbo decoder algorithms
    Schurgers, C
    Catthoor, F
    Engels, M
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2001, 9 (02) : 305 - 312
  • [39] An MPEG decoder with embedded compression for memory reduction
    de With, PHN
    Frencken, PH
    van der Schaar-Mitrea, M
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 1998, 44 (03) : 545 - 555
  • [40] Optimum LDPC Decoder: A Memory Architecture Problem
    Amador, Erick
    Pacalet, Renaud
    Rezard, Vincent
    DAC: 2009 46TH ACM/IEEE DESIGN AUTOMATION CONFERENCE, VOLS 1 AND 2, 2009, : 891 - +