Image-text multimodal classification via cross-attention contextual transformer with modality-collaborative learning

Times cited: 0
Authors
Shi, Qianyao [1 ]
Xu, Wanru
Miao, Zhenjiang [2 ]
Affiliations
[1] Beijing Jiaotong Univ, Informat & Commun Engn, Beijing, Peoples R China
[2] Beijing Jiaotong Univ, Media Comp Ctr, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation;
Keywords
multimodal classification; cross-attention; contextual transformer; modality-collaborative;
DOI
10.1117/1.JEI.33.4.043042
CLC classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline codes
0808; 0809;
Abstract
We are now surrounded by data from many modalities, such as text, images, audio, and video. This multimodal data carries rich information, but it also raises a new challenge: how can it be used effectively for accurate classification? Multimodal classification aims to classify data drawn from different modalities, yet because those modalities differ in their characteristics and structures, fusing them effectively remains difficult. To address this, we propose a cross-attention contextual transformer with modality-collaborative learning for multimodal classification (CACT-MCL-MMC) that better integrates information across modalities. On the one hand, existing multimodal fusion methods overlook intra- and inter-modality relationships and leave information within each modality unexploited, which results in unsatisfactory classification performance. To remedy this insufficient interaction between modalities, we use a cross-attention contextual transformer to capture the contextual relationships within and among modalities and improve the representativeness of the model. On the other hand, modalities differ in information quality, and some may carry misleading or ambiguous content; treating every modality equally then introduces modality perceptual noise and degrades multimodal classification performance. We therefore use modality-collaborative learning to filter misleading information, mitigate the quality gap between modalities, align each modality with the high-quality and informative ones, strengthen the unimodal representations, and obtain a more discriminative multimodal fusion. Comparative experiments on two benchmark image-text classification datasets, CrisisMMD and UPMC Food-101, show that the proposed model outperforms other classification methods, including state-of-the-art (SOTA) multimodal approaches. Ablation experiments verify the effectiveness of the cross-attention module, the multimodal contextual attention network, and modality-collaborative learning. Hyper-parameter validation experiments show that different fusion calculation methods lead to different results, and the most effective feature-tensor calculation method is identified. Qualitative experiments show that, compared with the original model, the proposed model identifies the expected results in the vast majority of cases. The code is available at https://github.com/KobeBryant8-24-MVP/CACT-MCL-MMC. CrisisMMD is available at https://dataverse.mpisws.org/dataverse/icwsm18, and UPMC Food-101 is available at https://visiir.isir.upmc.fr/. (c) 2024 SPIE and IS&T
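The authors' full implementation is in the linked GitHub repository; the minimal PyTorch sketch below only illustrates the general idea described in the abstract, namely bidirectional cross-attention between text and image token features followed by a learned per-modality weighting that stands in for modality-collaborative quality gating. The class name CrossAttentionFusion, the mean pooling, the gating scheme, and all dimensions are illustrative assumptions rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative sketch: fuse text and image token features with
    bidirectional cross-attention, then weight the two modalities with a
    learned gate (a rough stand-in for modality-collaborative weighting)."""

    def __init__(self, dim=768, heads=8, num_classes=2):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, Lt, D) token embeddings, e.g. from a text encoder
        # image_feats: (B, Li, D) patch/region embeddings, e.g. from a vision encoder
        t_ctx, _ = self.txt2img(text_feats, image_feats, image_feats)  # text attends to image
        i_ctx, _ = self.img2txt(image_feats, text_feats, text_feats)   # image attends to text
        t_vec = t_ctx.mean(dim=1)  # pooled image-aware text representation
        i_vec = i_ctx.mean(dim=1)  # pooled text-aware image representation
        w = self.gate(torch.cat([t_vec, i_vec], dim=-1))  # per-modality quality weights (B, 2)
        fused = w[:, :1] * t_vec + w[:, 1:] * i_vec       # weighted multimodal fusion
        return self.classifier(fused)

# Usage with random tensors standing in for encoder outputs
model = CrossAttentionFusion(dim=768, heads=8, num_classes=2)
logits = model(torch.randn(4, 32, 768), torch.randn(4, 49, 768))
print(logits.shape)  # torch.Size([4, 2])
```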
Pages: 23