Action-Centric Relation Transformer Network for Video Question Answering

Cited by: 29
Authors
Zhang, Jipeng [1 ]
Shao, Jie [1 ,2 ]
Cao, Rui [3 ]
Gao, Lianli [1 ]
Xu, Xing [1 ]
Shen, Heng Tao [1 ,2 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Ctr Future Media, Chengdu 611731, Peoples R China
[2] Sichuan Artificial Intelligence Res Inst, Yibin 644000, Peoples R China
[3] Singapore Management Univ, Sch Informat Syst, Singapore 178902, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Visualization; Cognition; Task analysis; Knowledge discovery; Proposals; Encoding; Video question answering; video representation; temporal action detection; multi-modal reasoning; relation reasoning; ATTENTION;
DOI
10.1109/TCSVT.2020.3048440
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Discipline codes
0808; 0809;
Abstract
Video question answering (VideoQA) has emerged as a popular research topic in recent years. Enormous efforts have been devoted to developing more effective fusion strategies and better intra-modal feature preparation. To explore these issues further, we identify two key problems. (1) Existing works rarely incorporate the action of interest into the video representation, and many datasets provide insufficient annotations of where the action of interest occurs. However, questions in VideoQA are usually action-centric. (2) Frame-to-frame relations, which can provide useful temporal attributes (e.g., state transition, action counting), have received little attention. Based on these observations, we propose an action-centric relation transformer network (ACRTransformer) for VideoQA and make two significant improvements. (1) We explicitly consider the action recognition problem and present a visual feature encoding technique, action-based encoding (ABE), to emphasize the frames with high actionness probabilities (the probability that the frame contains actions). (2) We better exploit the interplay between temporal frames using a relation transformer network (RTransformer). Experiments on popular VideoQA benchmark datasets clearly demonstrate the superiority of our model over previous state-of-the-art models. Code is available at https://github.com/op-multimodal/ACRTransformer.
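As a rough illustration of the ABE idea described in the abstract (emphasizing frames with high actionness probabilities), the following is a minimal sketch; it is not the authors' implementation, and the function name, softmax weighting, and toy inputs are assumptions:

```python
import numpy as np

def action_based_encoding(frame_feats, actionness, temperature=1.0):
    # Hypothetical sketch: convert per-frame actionness probabilities into
    # softmax attention weights and re-weight the frame features, so that
    # frames likely to contain actions dominate the video representation.
    logits = np.asarray(actionness, dtype=np.float64) / temperature
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    weighted = weights[:, None] * np.asarray(frame_feats, dtype=np.float64)
    return weighted, weights

# Toy example: 3 frames with 4-dim features; frame 1 has the highest actionness.
feats = np.arange(12, dtype=np.float64).reshape(3, 4)
weighted_feats, frame_weights = action_based_encoding(feats, [0.1, 0.9, 0.2])
```

In the actual model, such action-weighted frame features would then feed the relation transformer, which models pairwise frame-to-frame interactions.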
Pages: 63-74 (12 pages)