Rectify representation bias in vision-language models for long-tailed recognition

Cited by: 4
Authors
Li, Bo [1 ]
Yao, Yongqiang [2 ]
Tan, Jingru [3 ]
Gong, Ruihao [2 ]
Lu, Jianwei [4 ]
Luo, Ye [1 ]
Affiliations
[1] Tongji Univ, 4800 Caoan Rd, Shanghai 201804, Peoples R China
[2] Sensetime Res, 1900 Hongmei Rd, Shanghai 201103, Peoples R China
[3] Cent South Univ, 932 South Lushan Rd, Changsha 410083, Hunan, Peoples R China
[4] Shanghai Univ Tradit Chinese Med, 530 Lingling Rd, Shanghai 201203, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Long-tailed recognition; Vision-language model; Representation bias; SMOTE;
DOI
10.1016/j.neunet.2024.106134
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Natural data typically exhibits a long-tailed distribution, presenting great challenges for recognition tasks. Due to the extreme scarcity of training instances, tail classes often show inferior performance. In this paper, we investigate the problem within the popular vision-language (VL) framework and find that the performance bottleneck mainly arises from recognition confusion between tail classes and their highly correlated head classes. Building upon this observation, and unlike previous research that primarily emphasizes class frequency in addressing long-tailed issues, we take a novel perspective by incorporating a crucial additional factor, namely class correlation. Specifically, we model the representation learning procedure for each sample as two parts, i.e., a special part that learns the unique properties of its own class and a common part that learns characteristics shared among classes. Through analysis, we discover that the learning of the common representation is easily biased toward head classes. Because of this bias, the network may adopt the biased common representation as its classification criterion, rather than prioritizing the crucial information encapsulated within the specific representation, ultimately leading to recognition confusion. To solve the problem, based on the VL framework, we introduce a rectification contrastive term (ReCT) to rectify the representation bias according to semantic hints and training status. Extensive experiments on three widely-used long-tailed datasets demonstrate the effectiveness of ReCT. On iNaturalist2018, it achieves an overall accuracy of 75.4%, surpassing the baseline by 3.6 points with a ResNet-50 visual backbone.
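The abstract's core idea, a contrastive term whose per-class weighting counteracts the head-class bias in the shared representation, can be illustrated with a generic class-weighted supervised contrastive loss. This is a minimal sketch, not the paper's actual ReCT formulation: the `class_weights` argument is a hypothetical stand-in for the rectification factors the authors derive from semantic hints and training status.

```python
import numpy as np

def weighted_contrastive_loss(features, labels, class_weights, temperature=0.1):
    """Class-weighted supervised contrastive loss (illustrative sketch only).

    features:      (n, d) array of sample embeddings
    labels:        (n,) integer class labels
    class_weights: (num_classes,) per-class rectification weights (assumed
                   given; in ReCT these would come from semantic hints and
                   training status, which this sketch does not model)
    """
    # Work in cosine-similarity space
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T / temperature
    np.fill_diagonal(sim, -1e9)                       # exclude self-pairs

    # Positive pairs: same label, excluding self
    pos = labels[None, :] == labels[:, None]
    np.fill_diagonal(pos, False)

    # Numerically stable row-wise log-softmax over all non-self candidates
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))

    # Mean log-probability of positives for each anchor that has any
    pos_counts = pos.sum(axis=1)
    valid = pos_counts > 0
    mean_pos = (log_prob * pos).sum(axis=1)[valid] / pos_counts[valid]

    # Up-weight anchors from classes the weighting scheme wants to rectify
    w = class_weights[labels[valid]]
    return -(w * mean_pos).sum() / w.sum()
```

Giving tail-class anchors larger weights makes their pull on the shared (common) representation comparable to that of frequent head-class anchors, which is the spirit of the rectification described in the abstract.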
Pages: 10