Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents

Cited by: 0

Authors:

Wang, Bo [1]

Xu, Youjiang [1]

Han, Yahong [1]

Hong, Richang [2]

Affiliations:

[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China

[2] Hefei Univ Technol, Sch Comp & Informat, Hefei, Anhui, Peoples R China

Source:

THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2018

Keywords:

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Movies offer a wealth of visual content as well as engaging stories. Existing methods have shown that understanding movie stories from visual content alone is still a hard problem. In this paper, for answering questions about movies, we put forward a Layered Memory Network (LMN) that represents frame-level and clip-level movie content with the Static Word Memory module and the Dynamic Subtitle Memory module, respectively. Specifically, we first extract words and sentences from the training movie subtitles. The hierarchically formed movie representations learned by LMN then encode not only the correspondence between words and visual content inside frames, but also the temporal alignment between sentences and frames inside movie clips. We also extend our LMN model into three variant frameworks to demonstrate its extensibility. We conduct extensive experiments on the MovieQA dataset. With only visual content as input, LMN with the frame-level representation obtains a large performance improvement. When subtitles are incorporated into LMN to form the clip-level representation, we achieve state-of-the-art performance on the online evaluation task of 'Video+Subtitles'. These results demonstrate that the proposed LMN framework is effective and that the hierarchically formed movie representations have good potential for movie question answering.
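The abstract does not spell out how the two memory modules are read, so the following is only a minimal sketch of a generic soft-attention memory read, stacked in two layers to mirror the frame-level and clip-level idea. It is not the authors' implementation: the function names (`softmax`, `dot`, `memory_read`), the dot-product scoring, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def memory_read(query, memory):
    """Soft attention read (assumed form, not taken from the paper).

    Scores each memory slot against the query with a dot product,
    normalizes the scores with softmax, and returns the weighted sum
    of the slots together with the attention weights.
    """
    weights = softmax([dot(query, slot) for slot in memory])
    dim = len(memory[0])
    output = [sum(w * slot[i] for w, slot in zip(weights, memory))
              for i in range(dim)]
    return output, weights


# Hypothetical toy data: three "word" slots and two "subtitle" slots.
word_memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
subtitle_memory = [[0.5, 0.5], [1.0, 0.0]]
frame_feature = [2.0, 0.0]

# Layer 1: re-express a frame feature as a mixture of word slots.
frame_repr, word_weights = memory_read(frame_feature, word_memory)

# Layer 2: use the frame-level result to read from the subtitle slots.
clip_repr, subtitle_weights = memory_read(frame_repr, subtitle_memory)
```

In this sketch the second read takes the output of the first as its query, which is one simple way to layer two memories; the actual LMN modules may combine visual features, words, and subtitle sentences differently.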


Pages: 7380-7387

Page count: 8
