Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents

Cited by: 0
Authors
Wang, Bo [1]
Xu, Youjiang [1]
Han, Yahong [1]
Hong, Richang [2]
Affiliations
[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China
[2] Hefei Univ Technol, Sch Comp & Informat, Hefei, Anhui, Peoples R China
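Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract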
Movies provide us with a wealth of visual content as well as engaging stories. Existing methods have shown that understanding movie stories through visual content alone is still a hard problem. In this paper, for answering questions about movies, we put forward a Layered Memory Network (LMN) that represents frame-level and clip-level movie content by the Static Word Memory module and the Dynamic Subtitle Memory module, respectively. Specifically, we first extract words and sentences from the training movie subtitles. Then the hierarchically formed movie representations, which are learned from LMN, encode not only the correspondence between words and visual content inside frames, but also the temporal alignment between sentences and frames inside movie clips. We also extend our LMN model into three variant frameworks to illustrate its extensibility. We conduct extensive experiments on the MovieQA dataset. With only visual content as input, LMN with frame-level representation obtains a large performance improvement. When incorporating subtitles into LMN to form the clip-level representation, we achieve state-of-the-art performance on the online evaluation task of 'Video+Subtitles'. These results demonstrate that the proposed LMN framework is effective and that the hierarchically formed movie representations have good potential for movie question answering.
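
The abstract gives no equations, so the following is only a minimal sketch of the layered idea it describes, not the authors' implementation. It assumes that both memory modules can be approximated by plain dot-product attention: frames attend over word embeddings to form frame-level representations, and subtitle sentences attend over those frames to form clip-level representations. All function names (`attend`, `frame_level`, `clip_level`, `pick_answer`), the residual sum, and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, memory):
    # Weighted sum of memory rows; weights come from similarity to the query.
    weights = softmax([dot(query, row) for row in memory])
    dim = len(memory[0])
    return [sum(w * row[i] for w, row in zip(weights, memory)) for i in range(dim)]

def frame_level(frame_feats, word_memory):
    # Assumed stand-in for a "static word memory": each frame feature is
    # re-expressed as a mixture of word embeddings taken from subtitles.
    return [attend(f, word_memory) for f in frame_feats]

def clip_level(frame_reprs, subtitle_memory):
    # Assumed stand-in for a "dynamic subtitle memory": each sentence gathers
    # the frames it aligns with and adds them to its own embedding.
    return [
        [s_i + a_i for s_i, a_i in zip(s, attend(s, frame_reprs))]
        for s in subtitle_memory
    ]

def pick_answer(clip_reprs, question, candidates):
    # The question attends over the clip memory; candidates are scored
    # against the fused context + question vector.
    context = attend(question, clip_reprs)
    fused = [c + q for c, q in zip(context, question)]
    scores = [dot(fused, cand) for cand in candidates]
    return scores.index(max(scores))

# Toy 2-dimensional data (hypothetical values, for illustration only).
word_memory = [[1.0, 0.0], [0.0, 1.0]]
frames = [[5.0, 0.0], [0.0, 5.0]]
subtitles = [[1.0, 0.0], [0.0, 1.0]]
question = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.0]]

frame_reprs = frame_level(frames, word_memory)
clip_reprs = clip_level(frame_reprs, subtitles)
best = pick_answer(clip_reprs, question, candidates)
```

In this toy setup the question vector points along the first dimension, so the candidate aligned with that dimension is selected. A real system would use learned embeddings and trained projection layers rather than raw dot products.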
Pages: 7380-7387
Page count: 8