Swin-Fusion: Swin-Transformer with Feature Fusion for Human Action Recognition

Times cited: 10

Authors:

Chen, Tiansheng [1]

Mo, Lingfei [1]

Affiliations:

[1] Southeast Univ, Sch Instrument Sci & Engn, Nanjing 210096, Jiangsu, Peoples R China

Source:

NEURAL PROCESSING LETTERS | 2023 / Vol. 55 / No. 08

Funding:

National Natural Science Foundation of China;

Keywords:

Action recognition; Swin-Transformer; Feature pyramid; Image classification; NETWORK;

DOI:

10.1007/s11063-023-11367-1

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Human action recognition based on still images is one of the most challenging computer vision tasks. In the past decade, convolutional neural networks (CNNs) have developed rapidly and achieved good performance in still-image-based human action recognition. However, because CNNs lack long-range perception ability, it is difficult for them to form a global structural understanding of human behavior and of the overall relationship between the behavior and its environment. Recently, transformer-based models have made a splash in computer vision, even achieving state-of-the-art (SOTA) results in several vision tasks. We explore the transformer's capability in still-image-based human action recognition and add a simple but effective feature fusion module to the Swin-Transformer model. More specifically, we propose a new transformer-based model for behavioral feature extraction that uses a pre-trained Swin-Transformer as the backbone network. Swin-Transformer's distinctive hierarchical structure, combined with the feature fusion module, is used to extract and fuse multi-scale behavioral information. Extensive experiments were conducted on five still-image-based human action recognition datasets: Li's action dataset, the Stanford-40 dataset, the PPMI-24 dataset, the AUC-V1 dataset, and the AUC-V2 dataset. The results indicate that, by sharing and reusing feature maps of different scales at multiple stages, our proposed Swin-Fusion model achieves better behavior recognition than previously proposed improved CNN-based models, without modifying the original backbone training method and while increasing training resources by only 1.6%. The code and models will be available at https://github.com/cts4444/Swin-Fusion.
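The abstract describes taking feature maps from Swin-Transformer's four hierarchical stages and fusing them across scales, but it does not spell out the fusion module itself. The sketch below therefore assumes a generic feature-pyramid (FPN-style) top-down fusion: each stage is projected to a common channel width with a per-pixel linear map (a 1x1 convolution), and coarser maps are upsampled by nearest neighbour and added into finer ones. It is a toy, pure-Python illustration, not the authors' Swin-Fusion module; the function names, the fused width `d`, and the random weights are hypothetical, and the 3x3 smoothing convolutions of a full FPN are omitted. The stage shapes follow the standard Swin-T configuration (224x224 input, patch size 4, embedding dimension 96).

```python
"""Toy sketch of FPN-style top-down fusion over Swin-like multi-stage features.

ASSUMPTION: generic feature-pyramid fusion for illustration only, not the
Swin-Fusion authors' module. Feature maps are nested lists indexed [H][W][C].
"""
import random
from typing import List

FeatureMap = List[List[List[float]]]


def swin_stage_shapes(img_size: int = 224, embed_dim: int = 96, patch: int = 4):
    """Return (H, W, C) per Swin stage: resolution halves and channels double."""
    shapes = []
    side, dim = img_size // patch, embed_dim
    for _ in range(4):
        shapes.append((side, side, dim))
        side, dim = side // 2, dim * 2
    return shapes


def project(fmap: FeatureMap, weight: List[List[float]]) -> FeatureMap:
    """1x1 'convolution': apply the same C_in x C_out matrix at every pixel."""
    c_out = len(weight[0])
    return [[[sum(px[i] * weight[i][j] for i in range(len(px))) for j in range(c_out)]
             for px in row] for row in fmap]


def upsample2x(fmap: FeatureMap) -> FeatureMap:
    """Nearest-neighbour upsampling by a factor of 2 in both spatial dims."""
    out = []
    for row in fmap:
        wide = [px[:] for px in row for _ in range(2)]  # duplicate each pixel
        out.append(wide)
        out.append([px[:] for px in wide])  # duplicate the widened row
    return out


def add(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Element-wise sum of two maps with identical shape."""
    return [[[x + y for x, y in zip(pa, pb)] for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def top_down_fuse(stages: List[FeatureMap],
                  weights: List[List[List[float]]]) -> List[FeatureMap]:
    """Project every stage to a common width, then add upsampled coarser maps
    into finer ones, from the deepest stage back to the shallowest."""
    laterals = [project(f, w) for f, w in zip(stages, weights)]
    fused = [laterals[-1]]  # deepest level is just its projection
    for lat in reversed(laterals[:-1]):
        fused.insert(0, add(lat, upsample2x(fused[0])))
    return fused


def random_map(h: int, w: int, c: int, rng: random.Random) -> FeatureMap:
    """Hypothetical random features standing in for real backbone outputs."""
    return [[[rng.random() for _ in range(c)] for _ in range(w)] for _ in range(h)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(swin_stage_shapes())
    # Toy pyramid scaled down for speed: 8x8x4 -> 4x4x8 -> 2x2x16 -> 1x1x32.
    shapes = [(8, 8, 4), (4, 4, 8), (2, 2, 16), (1, 1, 32)]
    d = 6  # hypothetical common fused width
    stages = [random_map(h, w, c, rng) for h, w, c in shapes]
    weights = [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(c)]
               for _, _, c in shapes]
    fused = top_down_fuse(stages, weights)
    print([(len(f), len(f[0]), len(f[0][0])) for f in fused])
```

Running it prints the four Swin-T stage shapes, from (56, 56, 96) down to (7, 7, 768), followed by the fused toy pyramid, in which every level keeps its own resolution but shares the same width `d`. Element-wise addition injects coarser context without widening the maps; concatenation would be an alternative at the cost of more channels.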

Pages: 11109-11130

Page count: 22
