Action-Centric Relation Transformer Network for Video Question Answering

Cited by: 29
Authors
Zhang, Jipeng [1 ]
Shao, Jie [1 ,2 ]
Cao, Rui [3 ]
Gao, Lianli [1 ]
Xu, Xing [1 ]
Shen, Heng Tao [1 ,2 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Ctr Future Media, Chengdu 611731, Peoples R China
[2] Sichuan Artificial Intelligence Res Inst, Yibin 644000, Peoples R China
[3] Singapore Management Univ, Sch Informat Syst, Singapore 178902, Singapore
Funding
National Natural Science Foundation of China
Keywords
Feature extraction; Visualization; Cognition; Task analysis; Knowledge discovery; Proposals; Encoding; Video question answering; video representation; temporal action detection; multi-modal reasoning; relation reasoning; attention
DOI
10.1109/TCSVT.2020.3048440
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline Classification Codes
0808; 0809
Abstract
Video question answering (VideoQA) has emerged as a popular research topic in recent years. Enormous efforts have been devoted to developing more effective fusion strategies and better intra-modal feature preparation. To explore these issues further, we identify two key problems. (1) Current works take little account of the action of interest when building video representations, and many datasets provide insufficient annotation of where the action of interest occurs; yet questions in VideoQA are usually action-centric. (2) Frame-to-frame relations, which can capture useful temporal attributes (e.g., state transitions, action counting), remain under-explored. Based on these observations, we propose an action-centric relation transformer network (ACRTransformer) for VideoQA and make two significant improvements. (1) We explicitly consider the action recognition problem and present a visual feature encoding technique, action-based encoding (ABE), to emphasize the frames with high actionness probabilities (the probability that a frame contains actions). (2) We better exploit the interplay between temporal frames using a relation transformer network (RTransformer). Experiments on popular VideoQA benchmark datasets clearly establish our superiority over previous state-of-the-art models. Code can be found at https://github.com/op-multimodal/ACRTransformer.
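To make the two components concrete, here is a minimal PyTorch sketch of the general idea described in the abstract. It is an illustration under assumptions, not the authors' implementation: the class names ActionBasedEncoding and RelationEncoder, the linear actionness head, and all dimensions are hypothetical; the actual code is in the GitHub repository linked above.

    import torch
    import torch.nn as nn

    class ActionBasedEncoding(nn.Module):
        # Hypothetical sketch of ABE: re-weight per-frame features by a
        # predicted actionness probability so that frames likely to
        # contain actions dominate the video representation.
        def __init__(self, feat_dim):
            super().__init__()
            self.actionness = nn.Linear(feat_dim, 1)  # assumed scoring head

        def forward(self, frames):  # frames: (batch, T, feat_dim)
            p = torch.sigmoid(self.actionness(frames))  # (batch, T, 1)
            return frames * p  # emphasize high-actionness frames

    class RelationEncoder(nn.Module):
        # Hypothetical stand-in for RTransformer: self-attention over the
        # temporal axis lets every frame attend to every other frame,
        # modeling frame-to-frame relations such as state transitions.
        def __init__(self, feat_dim, n_heads=8, n_layers=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(
                d_model=feat_dim, nhead=n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

        def forward(self, frames):  # (batch, T, feat_dim)
            return self.encoder(frames)

    # Usage with dummy features: batch of 4 clips, 32 frames, 512-d each.
    abe = ActionBasedEncoding(512)
    rel = RelationEncoder(512)
    video = torch.randn(4, 32, 512)
    out = rel(abe(video))  # (4, 32, 512) relation-aware frame features

The point the sketch captures is the ordering: frames are first re-weighted by actionness, and only then does self-attention model pairwise frame relations.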
Pages: 63-74
Number of pages: 12