Structured Two-Stream Attention Network for Video Question Answering

Cited by: 0
Authors
Gao, Lianli [1 ,2 ]
Zeng, Pengpeng [1 ,2 ]
Song, Jingkuan [1 ,2 ]
Li, Yuan-Fang [3 ]
Liu, Wu [4 ]
Mei, Tao [4 ]
Shen, Heng Tao [1 ,2 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Chengdu, Sichuan, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu, Sichuan, Peoples R China
[3] Monash Univ, Clayton, Vic, Australia
[4] JD AI Res, Chengdu, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
To date, visual question answering (VQA) (i.e., image QA and video QA) is still a holy grail in vision and language understanding, especially for video QA. Compared with image QA, which focuses primarily on understanding the associations between image region-level details and corresponding questions, video QA requires a model to jointly reason across both the spatial and long-range temporal structures of a video as well as text to provide an accurate answer. In this paper, we specifically tackle the problem of video QA by proposing a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features. Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text. Finally, the structured two-stream fusion component incorporates different segments of the query- and video-aware context representation and infers the answers. Experiments on the large-scale video QA dataset TGIF-QA show that our proposed method significantly surpasses the best counterpart (i.e., with one representation for the video input) by 13.0%, 13.5%, 11.0% and 0.3 for the Action, Trans., FrameQA and Count tasks. It also outperforms the best competitor (i.e., with two representations) on the Action, Trans., and FrameQA tasks by 4.1%, 4.7%, and 5.1%.
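The two-stream attention step described in the abstract, i.e., a visual stream that localizes important video segments and a text stream that focuses on relevant question words, can be illustrated with a toy sketch. This is not the authors' implementation; the cross-modal scoring, the max-pooling of scores, and the concatenation fusion are simplifying assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_stream_attention(video_feats, text_feats):
    """Toy two-stream attention over pre-extracted features.

    video_feats: (num_segments, d) segment-level video features
    text_feats:  (num_words, d)    word-level question features
    Returns a fused (2d,) context vector.
    """
    d = video_feats.shape[1]
    # Cross-modal affinity between every segment and every word.
    scores = video_feats @ text_feats.T / np.sqrt(d)  # (segments, words)

    # Visual stream: question-aware weights over video segments,
    # down-weighting background segments with low affinity to any word.
    vis_weights = softmax(scores.max(axis=1))          # (segments,)
    vis_context = vis_weights @ video_feats            # (d,)

    # Text stream: video-aware weights over question words.
    txt_weights = softmax(scores.max(axis=0))          # (words,)
    txt_context = txt_weights @ text_feats             # (d,)

    # Fusion by concatenation (the paper uses a learned fusion component).
    return np.concatenate([vis_context, txt_context])  # (2d,)
```

In the paper the attention and fusion components are learned jointly and operate over the structured segment representation; this sketch only shows the data flow of attending to each modality conditioned on the other.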
Pages: 6391 / 6398
Page count: 8
Related Papers
50 in total
  • [1] Compositional Attention Networks With Two-Stream Fusion for Video Question Answering
    Yu, Ting
    Yu, Jun
    Yu, Zhou
    Tao, Dacheng
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 1204 - 1218
  • [3] Two-Stream Heterogeneous Graph Network with Dynamic Interactive Learning for Video Question Answering
    Peng, Min
    Shao, Xiaohu
    Shi, Yu
    Zhou, Xiangdong
    [J]. 2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [4] Two-Stream Attention Network for Pain Recognition from Video Sequences
    Thiam, Patrick
    Kestler, Hans A.
    Schwenker, Friedhelm
    [J]. SENSORS, 2020, 20 (03)
  • [5] Progressive Graph Attention Network for Video Question Answering
    Peng, Liang
    Yang, Shuangji
    Bin, Yi
    Wang, Guoqing
    [J]. PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2871 - 2879
  • [6] Hybrid-Attention Enhanced Two-Stream Fusion Network for Video Venue Prediction
    Zhang, Yanchao
    Min, Weiqing
    Nie, Liqiang
    Jiang, Shuqiang
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 2917 - 2929
  • [7] Two-Stream Video Classification with Cross-Modality Attention
    Chi, Lu
    Tian, Guiyu
    Mu, Yadong
    Tian, Qi
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 4511 - 4520
  • [8] TWO-STREAM HYBRID ATTENTION NETWORK FOR MULTIMODAL CLASSIFICATION
    Chen, Qipin
    Shi, Zhenyu
    Zuo, Zhen
    Fu, Jinmiao
    Sun, Yi
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 359 - 363
  • [9] Two-stream Graph Attention Convolutional for Video Action Recognition
    Zhang, Deyuan
    Gao, Hongwei
    Dai, Hailong
    Shi, Xiangbin
    [J]. 2021 IEEE 15TH INTERNATIONAL CONFERENCE ON BIG DATA SCIENCE AND ENGINEERING (BIGDATASE 2021), 2021, : 23 - 27
  • [10] Frame Augmented Alternating Attention Network for Video Question Answering
    Zhang, Wenqiao
    Tang, Siliang
    Cao, Yanpeng
    Pu, Shiliang
    Wu, Fei
    Zhuang, Yueting
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (04) : 1032 - 1041