Spatial attention based visual semantic learning for action recognition in still images

Cited by: 12

Authors:

Zheng, Yunpeng [1,2]

Zheng, Xiangtao [1]

Lu, Xiaoqiang [1]

Wu, Siyuan [1]

Affiliations:

[1] Chinese Acad Sci, Xian Inst Opt & Precis Mech, Key Lab Spectral Imaging Technol CAS, Xian 710119, Shaanxi, Peoples R China

[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China

Source:

NEUROCOMPUTING | 2020 / Vol. 413

Funding:

National Natural Science Foundation of China; National Key Research and Development Program of China;

Keywords:

Still image-based action recognition; Spatial attention; Semantic parts; Deep learning; MODEL;

DOI:

10.1016/j.neucom.2020.07.016

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Visual semantic parts play crucial roles in still image-based action recognition. A majority of existing methods require additional manual annotations, such as human bounding boxes and predefined body parts, besides action labels to learn action-related visual semantic parts. However, producing these manual annotations is time-consuming and labor-intensive. Moreover, not all manual annotations are effective when recognizing a specific action. Some of them can be irrelevant or even misleading. To address these limitations, this paper proposes a multi-stage deep learning method called Spatial Attention based Action Mask Networks (SAAM-Nets). The proposed method does not need any additional annotations besides action labels to obtain action-specific visual semantic parts. Instead, we propose a spatial attention layer injected in a convolutional neural network to create a specific action mask for each image with only action labels. Moreover, based on the action mask, we propose a region selection strategy to generate a semantic bounding box containing action-specific semantic parts. Furthermore, to effectively combine the information of the whole scene and the semantic box, two feature attention layers are adopted to obtain more discriminative representations. Experiments on four benchmark datasets have demonstrated that the proposed method can achieve promising performance compared with state-of-the-art methods. (C) 2020 Elsevier B.V. All rights reserved.
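The abstract does not spell out the region selection strategy. As a rough illustration only of how a bounding box can be derived from a spatial mask, the sketch below thresholds a toy mask relative to its peak value and returns the tightest box around the surviving cells. The helper name `mask_to_box`, the `rel_threshold` parameter, and the thresholding rule itself are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive indices


def mask_to_box(mask: List[List[float]], rel_threshold: float = 0.5) -> Optional[Box]:
    """Return the tightest box around mask cells >= rel_threshold * max(mask).

    Hypothetical helper: the thresholding rule is an assumption made for
    illustration, not the region selection strategy from the paper.
    """
    peak = max((v for row in mask for v in row), default=0.0)
    if peak <= 0.0:
        return None  # empty or all-zero mask: nothing to select
    cutoff = rel_threshold * peak
    # Row indices that contain at least one cell at or above the cutoff.
    rows = [r for r, row in enumerate(mask) if any(v >= cutoff for v in row)]
    # Column indices of every cell at or above the cutoff.
    cols = [c for row in mask for c, v in enumerate(row) if v >= cutoff]
    return (min(rows), min(cols), max(rows), max(cols))


if __name__ == "__main__":
    toy_mask = [
        [0.0, 0.1, 0.0, 0.0],
        [0.0, 0.8, 0.9, 0.1],
        [0.1, 0.7, 1.0, 0.0],
        [0.0, 0.0, 0.2, 0.0],
    ]
    print(mask_to_box(toy_mask))  # (1, 1, 2, 2)
```

In a real pipeline the mask would come from the attention layer's output and the box would be rescaled to image coordinates before cropping; both steps are omitted here.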


Pages: 383-396

Number of pages: 14

50 entries in total

[31] Audio-Visual Event Localization by Learning Spatial and Semantic Co-Attention
Xue, Cheng
Zhong, Xionghu
Cai, Minjie
Chen, Hao
Wang, Wenwu
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 418 - 429
[32] Symbolic control of visual attention: Semantic constraints on the spatial distribution of attention
Bradley S. Gibson
Matthias Scheutz
Gregory J. Davis
[J]. Attention, Perception, & Psychophysics, 2009, 71 : 363 - 374
[33] ESS: Learning Event-Based Semantic Segmentation from Still Images
Sun, Zhaoning
Messikommer, Nico
Gehrig, Daniel
Scaramuzza, Davide
[J]. COMPUTER VISION, ECCV 2022, PT XXXIV, 2022, 13694 : 341 - 357
[34] Symbolic control of visual attention: Semantic constraints on the spatial distribution of attention
Gibson, Bradley S.
Scheutz, Matthias
Davis, Gregory J.
[J]. ATTENTION PERCEPTION & PSYCHOPHYSICS, 2009, 71 (02) : 363 - 374
[35] The role of spatial attention in visual object recognition
Shyi, GCW
Cheng, SK
[J]. INTERNATIONAL JOURNAL OF PSYCHOLOGY, 1996, 31 (3-4) : 4841 - 4841
[36] Context Enhancement Methodology for Action Recognition in Still Images
He, Jiarong
Wu, Wei
Li, Yuxing
[J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT I, 2023, 14254 : 112 - 122
[37] Temporal Hallucinating for Action Recognition with Few Still Images
Wang, Yali
Zhou, Lei
Qiao, Yu
[J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 5314 - 5322
[38] Action Recognition in Still Images With Minimum Annotation Efforts
Zhang, Yu
Cheng, Li
Wu, Jianxin
Cai, Jianfei
Do, Minh N.
Lu, Jiangbo
[J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2016, 25 (11) : 5479 - 5490
[39] Spatial-Temporal Attention for Action Recognition
Sun, Dengdi
Wu, Hanqing
Ding, Zhuanlian
Luo, Bin
Tang, Jin
[J]. ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT I, 2018, 11164 : 854 - 864
[40] Loss Guided Activation for Action Recognition in Still Images
Liu, Lu
Tan, Robby T.
You, Shaodi
[J]. COMPUTER VISION - ACCV 2018, PT V, 2019, 11365 : 152 - 167
