Action-Centric Relation Transformer Network for Video Question Answering

Cited by: 29
Authors
Zhang, Jipeng [1]
Shao, Jie [1,2]
Cao, Rui [3]
Gao, Lianli [1]
Xu, Xing [1]
Shen, Heng Tao [1,2]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Ctr Future Media, Chengdu 611731, Peoples R China
[2] Sichuan Artificial Intelligence Res Inst, Yibin 644000, Peoples R China
[3] Singapore Management Univ, Sch Informat Syst, Singapore 178902, Singapore
Funding
National Natural Science Foundation of China
Keywords
Feature extraction; Visualization; Cognition; Task analysis; Knowledge discovery; Proposals; Encoding; Video question answering; video representation; temporal action detection; multi-modal reasoning; relation reasoning; ATTENTION;
DOI
10.1109/TCSVT.2020.3048440
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics & Communication Technology]
Discipline Codes
0808; 0809
Abstract
Video question answering (VideoQA) has emerged as a popular research topic in recent years. Enormous efforts have been devoted to developing more effective fusion strategies and better intra-modal feature preparation. To explore these issues further, we identify two key problems. (1) Current works take almost no account of the action of interest when building video representations, and many datasets provide insufficient annotation of where the action of interest occurs, even though questions in VideoQA are usually action-centric. (2) Frame-to-frame relations, which can provide useful temporal attributes (e.g., state transitions, action counting), have received little research attention. Based on these observations, we propose an action-centric relation transformer network (ACRTransformer) for VideoQA and make two significant improvements. (1) We explicitly consider the action recognition problem and present a visual feature encoding technique, action-based encoding (ABE), to emphasize frames with high actionness probabilities (the probability that a frame contains an action). (2) We better exploit the interplay between temporal frames using a relation transformer network (RTransformer). Experiments on popular VideoQA benchmark datasets clearly establish our superiority over previous state-of-the-art models. Code is available at https://github.com/op-multimodal/ACRTransformer.
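To make the two components named in the abstract concrete, here is a minimal PyTorch sketch: frame features are re-weighted by a predicted actionness probability (the ABE idea), then passed through a self-attention encoder as a stand-in for the relation transformer. All module names, shapes, and the `actionness_head` are hypothetical illustrations of the described mechanism, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of action-based encoding plus relation modeling, under
# assumed feature shapes. Hypothetical code, not the paper's official model.
import torch
import torch.nn as nn


class ActionBasedEncoding(nn.Module):
    """Re-weight frame features by predicted actionness probabilities.

    `frame_feats`: (batch, num_frames, feat_dim) per-frame visual features.
    The actionness head is a hypothetical stand-in for the temporal action
    detector the paper builds on.
    """

    def __init__(self, feat_dim: int):
        super().__init__()
        self.actionness_head = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_frames, 1): probability that each frame contains an action
        p_action = self.actionness_head(frame_feats)
        # Emphasize high-actionness frames; the residual 1.0 keeps
        # low-actionness frames from being suppressed entirely.
        return frame_feats * (1.0 + p_action)


class RelationEncoder(nn.Module):
    """Stand-in for the relation transformer: self-attention lets every
    frame attend to every other frame, capturing frame-to-frame relations
    such as state transitions."""

    def __init__(self, feat_dim: int, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)


if __name__ == "__main__":
    feats = torch.randn(2, 32, 512)   # 2 clips, 32 frames, 512-d features
    abe = ActionBasedEncoding(512)
    rel = RelationEncoder(512)
    out = rel(abe(feats))             # (2, 32, 512) relation-aware features
    print(out.shape)
```

The weighting step is deliberately multiplicative rather than a hard selection, so gradients still flow to frames the actionness head scores low; how the real model combines the actionness scores with the features may differ.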
Pages: 63-74
Page count: 12