Action-Centric Relation Transformer Network for Video Question Answering

Cited by: 29
Authors
Zhang, Jipeng [1 ]
Shao, Jie [1 ,2 ]
Cao, Rui [3 ]
Gao, Lianli [1 ]
Xu, Xing [1 ]
Shen, Heng Tao [1 ,2 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Ctr Future Media, Chengdu 611731, Peoples R China
[2] Sichuan Artificial Intelligence Res Inst, Yibin 644000, Peoples R China
[3] Singapore Management Univ, Sch Informat Syst, Singapore 178902, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Visualization; Cognition; Task analysis; Knowledge discovery; Proposals; Encoding; Video question answering; video representation; temporal action detection; multi-modal reasoning; relation reasoning; ATTENTION;
DOI
10.1109/TCSVT.2020.3048440
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Discipline codes
0808; 0809;
Abstract
Video question answering (VideoQA) has emerged as a popular research topic in recent years. Enormous efforts have been devoted to developing more effective fusion strategies and better intra-modal feature preparation. To explore these issues further, we identify two key problems. (1) Existing works rarely incorporate the action of interest into the video representation, and many datasets provide insufficient annotations of where the action of interest occurs. However, questions in VideoQA are usually action-centric. (2) Frame-to-frame relations, which can provide useful temporal attributes (e.g., state transition, action counting), have received little attention. Based on these observations, we propose an action-centric relation transformer network (ACRTransformer) for VideoQA and make two significant improvements. (1) We explicitly consider the action recognition problem and present a visual feature encoding technique, action-based encoding (ABE), to emphasize the frames with high actionness probabilities (the probability that the frame contains actions). (2) We better exploit the interplay between temporal frames using a relation transformer network (RTransformer). Experiments on popular VideoQA benchmark datasets clearly demonstrate the superiority of our model over previous state-of-the-art models. Code is available at https://github.com/op-multimodal/ACRTransformer.
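As a rough illustration of the ABE idea described in the abstract (emphasizing frames with high actionness probabilities), the following is a minimal sketch; it is not the authors' implementation, and the function name, softmax weighting, and toy inputs are assumptions:

```python
import numpy as np

def action_based_encoding(frame_feats, actionness, temperature=1.0):
    # Hypothetical sketch: convert per-frame actionness probabilities into
    # softmax attention weights and re-weight the frame features, so that
    # frames likely to contain actions dominate the video representation.
    logits = np.asarray(actionness, dtype=np.float64) / temperature
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    weighted = weights[:, None] * np.asarray(frame_feats, dtype=np.float64)
    return weighted, weights

# Toy example: 3 frames with 4-dim features; frame 1 has the highest actionness.
feats = np.arange(12, dtype=np.float64).reshape(3, 4)
weighted_feats, frame_weights = action_based_encoding(feats, [0.1, 0.9, 0.2])
```

In the actual model, such action-weighted frame features would then feed the relation transformer, which models pairwise frame-to-frame interactions.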
Pages: 63-74 (12 pages)