Action-Centric Relation Transformer Network for Video Question Answering

Cited by: 29

Authors:

Zhang, Jipeng ^{[1]}

Shao, Jie ^{[1,2]}

Cao, Rui ^{[3]}

Gao, Lianli ^{[1]}

Xu, Xing ^{[1]}

Shen, Heng Tao ^{[1,2]}

Affiliations:

[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Ctr Future Media, Chengdu 611731, Peoples R China

[2] Sichuan Artificial Intelligence Res Inst, Yibin 644000, Peoples R China

[3] Singapore Management Univ, Sch Informat Syst, Singapore 178902, Singapore

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2022 / Vol. 32 / Issue 01

Funding:

National Natural Science Foundation of China;

Keywords:

Feature extraction; Visualization; Cognition; Task analysis; Knowledge discovery; Proposals; Encoding; Video question answering; video representation; temporal action detection; multi-modal reasoning; relation reasoning; ATTENTION;

DOI:

10.1109/TCSVT.2020.3048440

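Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronics and communication technology];

Discipline classification code: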

0808 ; 0809 ;

Abstract:

Video question answering (VideoQA) has emerged as a popular research topic in recent years. Enormous efforts have been devoted to developing more effective fusion strategies and better intra-modal feature preparation. To explore these issues further, we identify two key problems. (1) Current works rarely incorporate the action of interest into the video representation, and many datasets lack sufficient annotations of where the action of interest occurs, even though questions in VideoQA are usually action-centric. (2) Frame-to-frame relations, which can provide useful temporal attributes (e.g., state transition, action counting), have received little research attention. Based on these observations, we propose an action-centric relation transformer network (ACRTransformer) for VideoQA and make two significant improvements. (1) We explicitly consider the action recognition problem and present a visual feature encoding technique, action-based encoding (ABE), to emphasize the frames with high actionness probabilities (the probability that the frame contains an action). (2) We better exploit the interplay between temporal frames using a relation transformer network (RTransformer). Experiments on popular VideoQA benchmark datasets show that the proposed model outperforms previous state-of-the-art models. Code is available at https://github.com/op-multimodal/ACRTransformer.
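
The abstract's idea of emphasizing frames with high actionness probabilities can be illustrated with a small sketch. This is not the authors' ABE implementation (for that, see the repository linked in the abstract). It only assumes that each frame has a feature vector and a scalar actionness probability, and it pools the frames with weights proportional to those probabilities. The function name `actionness_weighted_pool` and the uniform fallback are hypothetical choices made for this example.

```python
from typing import List


def actionness_weighted_pool(
    frame_features: List[List[float]],
    actionness: List[float],
) -> List[float]:
    """Pool per-frame features into one vector, weighting each frame
    by its actionness probability (illustrative only, not the paper's ABE)."""
    if not frame_features or len(frame_features) != len(actionness):
        raise ValueError("need one actionness score per frame, and at least one frame")

    total = sum(actionness)
    if total > 0:
        # Normalize the probabilities so the weights sum to 1.
        weights = [p / total for p in actionness]
    else:
        # Assumption for this sketch: fall back to uniform weights
        # when no frame shows any actionness.
        weights = [1.0 / len(actionness)] * len(actionness)

    dim = len(frame_features[0])
    pooled = [0.0] * dim
    for w, feat in zip(weights, frame_features):
        for i, value in enumerate(feat):
            pooled[i] += w * value
    return pooled


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.0, 1.0]]   # two frames, 2-D features
    scores = [0.75, 0.25]                 # first frame is more action-like
    print(actionness_weighted_pool(features, scores))
```

In this sketch the frame with the higher actionness score contributes more to the pooled vector, which is the general effect the abstract describes. How the actual model computes and applies actionness is not specified in this record.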


Pages: 63-74

Page count: 12


[1] Video Graph Transformer for Video Question Answering
Xiao, Junbin
Zhou, Pan
Chua, Tat-Seng
Yan, Shuicheng
COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 39 - 58
[2] Video-Context Aligned Transformer for Video Question Answering
Zong, Linlin
Wan, Jiahui
Zhang, Xianchao
Liu, Xinyue
Liang, Wenxin
Xu, Bo
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024, : 19795 - 19803
[3] Embedding VLAD in Transformer for Video Question Answering
Guo D.
Yao S.-T.
Wang H.
Wang M.
Jisuanji Xuebao/Chinese Journal of Computers, 2023, 46 (04): : 671 - 689
[4] Multi-interaction Network with Object Relation for Video Question Answering
Jin, Weike
Zhao, Zhou
Gu, Mao
Yu, Jun
Xiao, Jun
Zhuang, Yueting
PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1193 - 1201
[5] Contrastive Video Question Answering via Video Graph Transformer
Xiao, Junbin
Zhou, Pan
Yao, Angela
Li, Yicong
Hong, Richang
Yan, Shuicheng
Chua, Tat-Seng
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (11) : 13265 - 13280
[6] Redundancy-aware Transformer for Video Question Answering
Li, Yicong
Yang, Xun
Zhang, An
Feng, Chun
Wang, Xiang
Chua, Tat-Seng
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3172 - 3180
[7] Object-Centric Representation Learning for Video Question Answering
Long Hoang Dang
Thao Minh Le
Vuong Le
Truyen Tran
2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
[8] Complementary spatiotemporal network for video question answering
Li, Xinrui
Wu, Aming
Han, Yahong
MULTIMEDIA SYSTEMS, 2022, 28 (01) : 161 - 169
[9] ATM: Action Temporality Modeling for Video Question Answering
Chen, Junwen
Zhu, Jie
Kong, Yu
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4886 - 4895
