ViT-PGC: vision transformer for pedestrian gender classification on small-size dataset

Cited by: 3
Authors
Abbas, Farhat [1 ]
Yasmin, Mussarat [1 ]
Fayyaz, Muhammad [2 ]
Asim, Usman [3 ]
Affiliations
[1] COMSATS Univ Islamabad, Dept Comp Sci, Wah Campus, Wah Cantt 47040, Pakistan
[2] FAST Natl Univ Comp & Emerging Sci NUCES, Dept Comp Sci, Chiniot Faisalabad Campus, Chiniot, Punjab, Pakistan
[3] DeltaX, 3F,24,Namdaemun Ro 9 Gil, Seoul, South Korea
Keywords
Vision transformer; LSA and SPT; Deep CNN models; SS datasets; Pedestrian gender classification; CONVOLUTIONAL NEURAL-NETWORK; RECOGNITION;
DOI
10.1007/s10044-023-01196-2
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Code
081104; 0812; 0835; 1405;
Abstract
Pedestrian gender classification (PGC) is a key task in full-body pedestrian image analysis and has become important in applications such as content-based image retrieval, visual surveillance, smart cities, and demographic data collection. Over the last decade, convolutional neural networks (CNNs) have emerged as a powerful and reliable choice for vision tasks such as object classification, recognition, and detection. However, a CNN's limited local receptive field prevents it from learning global context. A vision transformer (ViT) is a better alternative because its self-attention mechanism attends to every patch of the input image. In this work, a ViT model built on two generic and effective modules, locality self-attention (LSA) and shifted patch tokenization (SPT), is explored for the PGC task. With these modules, the ViT can learn from scratch even on small-size (SS) datasets and overcome the lack of locality inductive bias. Extensive experimentation shows that the proposed ViT model produces better overall and mean accuracies, confirming that it outperforms state-of-the-art (SOTA) PGC methods.
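For readers unfamiliar with the two modules named in the abstract, a minimal NumPy sketch may help. It is a simplified illustration, not the authors' implementation: SPT here uses cyclic shifts (`np.roll`) in place of zero-padded shifts, and the function names and the tiny input sizes are my own choices.

```python
import numpy as np

def shifted_patch_tokenization(img, patch=4, shift=2):
    """Shifted Patch Tokenization (SPT), sketched: concatenate the image
    with four diagonally shifted copies along the channel axis, then split
    the result into non-overlapping flattened patch tokens. This widens the
    receptive field of each token, compensating for ViT's weak locality bias
    on small datasets. (Cyclic shift used here as a simplification.)"""
    H, W, C = img.shape
    shifts = [(shift, shift), (shift, -shift), (-shift, shift), (-shift, -shift)]
    views = [img] + [np.roll(img, s, axis=(0, 1)) for s in shifts]
    x = np.concatenate(views, axis=-1)                  # (H, W, 5C)
    tokens = x.reshape(H // patch, patch, W // patch, patch, 5 * C)
    tokens = tokens.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 5 * C)
    return tokens                                       # (num_patches, patch^2 * 5C)

def locality_self_attention(q, k, v, tau):
    """Locality Self-Attention (LSA), sketched: a learnable temperature tau
    replaces the fixed sqrt(d) scaling, and the diagonal of the score matrix
    is masked so a token cannot trivially attend to itself, sharpening the
    attention distribution over the other tokens."""
    scores = q @ k.T / tau
    np.fill_diagonal(scores, -np.inf)                   # diagonal masking
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                  # row-wise softmax
    return w @ v
```

On an 8x8x3 toy image with 4x4 patches, SPT yields 4 tokens of dimension 4*4*15 = 240; in a full model these tokens would be linearly projected before entering the LSA-equipped transformer blocks.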
Pages: 1805-1819 (15 pages)
Related papers
50 records in total
  • [21] AnisotropicBreast-ViT: Breast Cancer Classification in Ultrasound Images Using Anisotropic Filtering and Vision Transformer
    Diniz, Joao Otavio Bandeira
    Ribeiro, Neilson P.
    Dias, Domingos A., Jr.
    da Cruz, Luana B.
    da Silva, Giovanni L. F.
    Gomes, Daniel L., Jr.
    de Paiva, Anselmo C.
    Silva, Aristofanes C.
    INTELLIGENT SYSTEMS, BRACIS 2024, PT III, 2025, 15414 : 95 - 109
  • [22] ViT-DualAtt: An efficient pornographic image classification method based on Vision Transformer with dual attention
    Cai, Zengyu
    Xu, Liusen
    Zhang, Jianwei
    Feng, Yuan
    Zhu, Liang
    Liu, Fangmei
    ELECTRONIC RESEARCH ARCHIVE, 2024, 32 (12): : 6698 - 6716
  • [23] SI-ViT: Shuffle instance-based Vision Transformer for pancreatic cancer ROSE image classification
    Zhang, Tianyi
    Feng, Youdan
    Zhao, Yu
    Lei, Yanli
    Ying, Nan
    Song, Fan
    He, Yufang
    Yan, Zhiling
    Feng, Yunlu
    Yang, Aiming
    Zhang, Guanglei
    COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, 2024, 244
  • [24] Effects of dataset curation on body condition score (BCS) determination with a vision transformer (ViT) applied to RGB+depth images
    Winkler, Zachary
    Boucheron, Laura E.
    Utsumi, Santiago
    Nyamuryekung'e, Shelemia
    Mcintosh, Matthew
    Estell, Richard E.
    SMART AGRICULTURAL TECHNOLOGY, 2024, 8
  • [25] White Blood Cell Classification: Convolutional Neural Network (CNN) and Vision Transformer (ViT) under Medical Microscope
    Abou Ali, Mohamad
    Dornaika, Fadi
    Arganda-Carreras, Ignacio
    ALGORITHMS, 2023, 16 (11)
  • [26] ViT-P: Classification of Genitourinary Syndrome of Menopause From OCT Images Based on Vision Transformer Models
    Wang, Haoran
    Ji, Yanju
    Song, Kaiwen
    Sun, Mingyang
    Lv, Peitong
    Zhang, Tianyu
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2021, 70
  • [27] Enhancing Fashion Classification with Vision Transformer (ViT) and Developing Recommendation Fashion Systems Using DINOVA2
    Abd Alaziz, Hadeer M.
    Elmannai, Hela
    Saleh, Hager
    Hadjouni, Myriam
    Anter, Ahmed M.
    Koura, Abdelrahim
    Kayed, Mohammed
    ELECTRONICS, 2023, 12 (20)
  • [28] GNViT- An enhanced image-based groundnut pest classification using Vision Transformer (ViT) model
    Venkatasaichandrakanth, P.
    Iyapparaja, M.
    PLOS ONE, 2024, 19 (03):
  • [29] Patient teacher can impart locality to improve lightweight vision transformer on small dataset
    Ling, Jun
    Zhang, Xuan
    Du, Fei
    Li, Linyu
    Shang, Weiyi
    Gao, Chen
    Li, Tong
    PATTERN RECOGNITION, 2025, 157
  • [30] The DeepFish computer vision dataset for fish instance segmentation, classification, and size estimation
    Garcia-d'Urso, Nahuel
    Galan-Cuenca, Alejandro
    Perez-Sanchez, Paula
    Climent-Perez, Pau
    Fuster-Guillo, Andres
    Azorin-Lopez, Jorge
    Saval-Calvo, Marcelo
    Guillen-Nieto, Juan Eduardo
    Soler-Capdepon, Gabriel
    SCIENTIFIC DATA, 2022, 9 (01)