Action-Centric Relation Transformer Network for Video Question Answering

Cited by: 29
Authors
Zhang, Jipeng [1 ]
Shao, Jie [1 ,2 ]
Cao, Rui [3 ]
Gao, Lianli [1 ]
Xu, Xing [1 ]
Shen, Heng Tao [1 ,2 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Ctr Future Media, Chengdu 611731, Peoples R China
[2] Sichuan Artificial Intelligence Res Inst, Yibin 644000, Peoples R China
[3] Singapore Management Univ, Sch Informat Syst, Singapore 178902, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Visualization; Cognition; Task analysis; Knowledge discovery; Proposals; Encoding; Video question answering; video representation; temporal action detection; multi-modal reasoning; relation reasoning; ATTENTION;
DOI
10.1109/TCSVT.2020.3048440
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Subject Classification Code
0808; 0809;
Abstract
Video question answering (VideoQA) has emerged as a popular research topic in recent years. Enormous effort has been devoted to developing more effective fusion strategies and better intra-modal feature preparation. To explore these issues further, we identify two key problems. (1) Current works take almost no account of the action of interest when building video representations, and many datasets provide insufficient labels indicating where the action of interest occurs. However, questions in VideoQA are usually action-centric. (2) Frame-to-frame relations, which can provide useful temporal attributes (e.g., state transition, action counting), have received little research attention. Based on these observations, we propose an action-centric relation transformer network (ACRTransformer) for VideoQA and make two significant improvements. (1) We explicitly consider the action recognition problem and present a visual feature encoding technique, action-based encoding (ABE), to emphasize the frames with high actionness probabilities (the probability that a frame contains actions). (2) We better exploit the interplay between temporal frames using a relation transformer network (RTransformer). Experiments on popular VideoQA benchmark datasets clearly establish our superiority over previous state-of-the-art models. Code can be found at https://github.com/op-multimodal/ACRTransformer.
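The core idea of action-based encoding described above can be sketched as re-weighting per-frame features by their actionness probabilities, so that frames likely to contain actions dominate the video representation. The following is a minimal illustrative sketch, not the authors' exact formulation: the function name and the normalized-weighting scheme are assumptions for illustration.

```python
import numpy as np

def action_based_encoding(frame_features, actionness_probs):
    """Sketch of action-based encoding (ABE): emphasize frames with
    high actionness probabilities in the video representation.

    frame_features: (T, D) array of per-frame visual features.
    actionness_probs: (T,) array of actionness scores in [0, 1].
    Returns a (T, D) array of re-weighted frame features.
    """
    # Normalize actionness scores into attention-style weights over frames.
    weights = actionness_probs / (actionness_probs.sum() + 1e-8)
    # Scale each frame's feature vector by its weight (broadcast over D).
    return frame_features * weights[:, None]

# Toy example: 4 frames with 3-dim features; frame 2 has high actionness.
feats = np.ones((4, 3))
probs = np.array([0.1, 0.1, 0.9, 0.1])
encoded = action_based_encoding(feats, probs)
```

In this sketch, the high-actionness frame ends up with a proportionally larger contribution to any downstream pooled video representation; the paper's actual ABE module operates on temporal action proposals rather than a simple per-frame rescaling.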
Pages: 63-74
Number of pages: 12