Integrated crossing pooling of representation learning for Vision Transformer

Cited by: 0
Authors
Xu, Libo [1 ]
Li, Xingsen [2 ]
Huang, Zhenrui [1 ]
Sun, Yucheng [3 ]
Wang, Jiagong [1 ]
Affiliations
[1] NingboTech Univ, Ningbo, Peoples R China
[2] Guangdong Univ Technol, Guangzhou, Peoples R China
[3] China E Port Data Ctr, Ningbo Branch, Ningbo, Peoples R China
Keywords
vision transformer; ViT; pooling method; class token
DOI
10.1145/3498851.3499004
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
In recent years, transformer architectures such as ViT have been widely adopted in the field of computer vision. In the ViT model, a learnable class token is prepended to the token sequence. The output of the class token after the whole transformer encoder is taken as the final representation vector, which is then passed through a multi-layer perceptron (MLP) head to obtain the classification prediction. The class token can be seen as an aggregation of the information in all other tokens, but we argue that global pooling over the tokens can aggregate this information more effectively and intuitively. In this paper, we propose a new pooling method, called cross pooling, to replace the class token in obtaining the representation vector of the input image; it extracts better features and effectively improves model performance without increasing the computational cost. Through extensive experiments, we demonstrate that cross pooling achieves significant improvements over the original class token and over existing global pooling methods such as average pooling or maximum pooling.
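The abstract contrasts the class-token readout with global pooling over the encoder's token outputs. The sketch below illustrates only those two baselines on toy data; the paper's actual cross-pooling operator is not specified in this record, and the array shapes (197 tokens, 64-dimensional embeddings) are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

# Toy stand-in for transformer encoder output: (batch, num_tokens, dim).
# Token 0 plays the role of the ViT class token; tokens 1..196 are patch tokens.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 197, 64))

# Class-token readout (original ViT): the representation is token 0's output.
cls_repr = tokens[:, 0, :]                # shape (2, 64)

# Global average pooling over the patch tokens, one of the baselines compared.
avg_repr = tokens[:, 1:, :].mean(axis=1)  # shape (2, 64)

# Global maximum pooling over the patch tokens, the other baseline.
max_repr = tokens[:, 1:, :].max(axis=1)   # shape (2, 64)
```

Either representation vector would then be fed to the MLP classification head; cross pooling, per the abstract, replaces these readouts with a pooling scheme that the authors report outperforms all three.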
Pages: 491-496 (6 pages)