Pyramidal Cross-Modal Transformer with Sustained Visual Guidance for Multi-Label Image Classification

Cited by: 0
Authors
Li, Zhuohua [1 ,2 ]
Wang, Ruyun [1 ,2 ]
Zhu, Fuqing [1 ,2 ]
Han, Jizhong [1 ]
Hu, Songlin [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China
Keywords
Pyramidal Transformer; Sustained Visual Guidance; Multi-label image classification;
DOI
10.1145/3652583.3658005
CLC Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Multi-label image classification (MLIC) poses a formidable challenge: each image contains multiple objects, which makes comprehensively deciphering the visual content notably complex. Discriminating between multiple objects requires establishing robust visual-label dependencies. Previous methods attempt to formulate cross-modal interaction or one-shot co-occurrence relationship guidance. However, these methods not only exhibit limitations when handling occluded or blurry objects but also fail to fully leverage the diverse hierarchical properties of visual features for sustainably guiding the learning of label dependencies. To sustainably establish hierarchical visual-label dependencies, this paper introduces a Pyramidal Cross-Modal Transformer (PCMT) framework for MLIC. Specifically, the pyramidal visual guidance layer parses the visual features into a multi-resolution pyramid structure, allowing the updated visual information to provide sustained guidance for label semantics; this surpasses the conventional pre-processing of co-occurrence relationships. In addition, the hybrid modal interaction layer is proposed to mitigate the semantic disparities between visual and label information with modal-blended indiscriminate attention, replacing vanilla self-attention. Several combination blocks, each consisting of these two layers, are embedded within the encoder-decoder structure to facilitate the exploration of fine-grained visual-label dependencies. Extensive experiments on two widely used benchmarks, MS-COCO and PASCAL VOC 2007, consistently demonstrate that PCMT achieves state-of-the-art results.
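The record does not include the authors' code; purely as an illustrative sketch of the two layers the abstract describes, the PyTorch snippet below assembles one possible combination block: label embeddings attend to a multi-resolution pyramid of pooled visual tokens (sustained visual guidance), then a joint attention pass runs over the blended visual-label token sequence (modal-blended attention in place of per-modality self-attention). All class and parameter names here (PyramidalVisualGuidance, HybridModalInteraction, PCMTBlock, pyramid_scales) are hypothetical and do not reproduce the paper's actual implementation.

```python
# Illustrative sketch only; names and design details are assumptions,
# not the authors' released PCMT implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidalVisualGuidance(nn.Module):
    """Pools visual tokens into a multi-resolution pyramid and lets label
    embeddings attend to every pyramid level (sustained visual guidance)."""

    def __init__(self, dim, num_heads=8, pyramid_scales=(1, 2, 4)):
        super().__init__()
        self.pyramid_scales = pyramid_scales
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, label_tokens, visual_tokens):
        # visual_tokens: (B, H*W, C) flattened spatial features, assumed square grid
        B, N, C = visual_tokens.shape
        side = int(N ** 0.5)
        fmap = visual_tokens.transpose(1, 2).reshape(B, C, side, side)
        # Build the pyramid by pooling to progressively coarser resolutions.
        pyramid = []
        for s in self.pyramid_scales:
            pooled = F.adaptive_avg_pool2d(fmap, output_size=max(side // s, 1))
            pyramid.append(pooled.flatten(2).transpose(1, 2))  # (B, n_s, C)
        guidance = torch.cat(pyramid, dim=1)
        # Label queries attend to all pyramid levels at once.
        attended, _ = self.cross_attn(label_tokens, guidance, guidance)
        return self.norm(label_tokens + attended)


class HybridModalInteraction(nn.Module):
    """Joint ('modal-blended') attention over the concatenation of visual and
    label tokens, instead of separate per-modality self-attention."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, label_tokens, visual_tokens):
        blended = torch.cat([visual_tokens, label_tokens], dim=1)
        mixed, _ = self.attn(blended, blended, blended)
        blended = self.norm(blended + mixed)
        n_vis = visual_tokens.shape[1]
        # Return updated label tokens and updated visual tokens.
        return blended[:, n_vis:], blended[:, :n_vis]


class PCMTBlock(nn.Module):
    """One combination block: pyramidal guidance followed by hybrid interaction."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.guidance = PyramidalVisualGuidance(dim, num_heads)
        self.interaction = HybridModalInteraction(dim, num_heads)

    def forward(self, label_tokens, visual_tokens):
        label_tokens = self.guidance(label_tokens, visual_tokens)
        label_tokens, visual_tokens = self.interaction(label_tokens, visual_tokens)
        return label_tokens, visual_tokens


if __name__ == "__main__":
    B, num_labels, dim = 2, 80, 256           # e.g. 80 MS-COCO categories
    visual = torch.randn(B, 14 * 14, dim)     # flattened 14x14 backbone feature map
    labels = torch.randn(B, num_labels, dim)  # learnable label embeddings
    label_out, visual_out = PCMTBlock(dim)(labels, visual)
    print(label_out.shape)                    # torch.Size([2, 80, 256])
```

In this sketch, several such blocks would be stacked inside an encoder-decoder to refine the label embeddings before a per-label classification head; that stacking and the training objective are omitted here.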
Pages: 740-748
Number of pages: 9
Related Articles
50 records in total
  • [1] Cross-modal fusion for multi-label image classification with attention mechanism
    Wang, Yangtao
    Xie, Yanzhao
    Zeng, Jiangfeng
    Wang, Hanpin
    Fan, Lisheng
    Song, Yufan
    [J]. COMPUTERS & ELECTRICAL ENGINEERING, 2022, 101
  • [2] Cross-modal multi-label image classification modeling and recognition based on nonlinear
    Yuan, Shuping
    Chen, Yang
    Ye, Chengqiong
    Bhatt, Mohammed Wasim
    Saradeshmukh, Mhalasakant
    Hossain, Md Shamim
    [J]. NONLINEAR ENGINEERING - MODELING AND APPLICATION, 2023, 12 (01):
  • [3] Label-Guided Cross-Modal Attention Network for Multi-Label Aerial Image Classification
    Chen, Ying
    Zhang, Ding
    Han, Tao
    Meng, Xiaoliang
    Gao, Mianxin
    Wang, Teng
    [J]. IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21 : 1 - 5
  • [4] Multi-Label Cross-modal Retrieval
    Ranjan, Viresh
    Rasiwasia, Nikhil
    Jawahar, C. V.
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 4094 - 4102
  • [5] Label graph learning for multi-label image recognition with cross-modal fusion
    Xie, Yanzhao
    Wang, Yangtao
    Liu, Yu
    Zhou, Ke
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (18) : 25363 - 25381
  • [6] Multi-Label Weighted Contrastive Cross-Modal Hashing
    Yi, Zeqian
    Zhu, Xinghui
    Wu, Runbing
    Zou, Zhuoyang
    Liu, Yi
    Zhu, Lei
    [J]. APPLIED SCIENCES-BASEL, 2024, 14 (01):
  • [7] A Cross-Modal View to Utilize Label Semantics for Enhancing Student Network in Multi-label Classification
    Qin, Yuzhuo
    Liu, Hengwei
    Gu, Xiaodong
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT I, 2023, 14254 : 14 - 25
  • [8] Cross-modality semantic guidance for multi-label image classification
    Huang, Jun
    Wang, Dian
    Hong, Xudong
    Qu, Xiwen
    Xue, Wei
    [J]. INTELLIGENT DATA ANALYSIS, 2024, 28 (03) : 633 - 646