Image-text multimodal classification via cross-attention contextual transformer with modality-collaborative learning

Times cited: 0
Authors
Shi, Qianyao [1 ]
Xu, Wanru
Miao, Zhenjiang [2 ]
Affiliations
[1] Beijing Jiaotong Univ, Informat & Commun Engn, Beijing, Peoples R China
[2] Beijing Jiaotong Univ, Media Comp Ctr, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation;
Keywords
multimodal classification; cross-attention; contextual transformer; modality-collaborative;
DOI
10.1117/1.JEI.33.4.043042
CLC classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline codes
0808; 0809;
Abstract
We are now surrounded by data from many modalities, such as text, images, audio, and video. This multimodal data carries rich information, but it also raises a new challenge: how can it be used effectively for accurate classification? Multimodal classification aims to classify data drawn from different modalities, yet because those modalities differ in their characteristics and structures, fusing them effectively remains difficult. To address this, we propose a cross-attention contextual transformer with modality-collaborative learning for multimodal classification (CACT-MCL-MMC) that better integrates information across modalities. On the one hand, existing multimodal fusion methods overlook intra- and inter-modality relationships and leave information within each modality unexploited, which results in unsatisfactory classification performance. To remedy this insufficient interaction between modalities, we use a cross-attention contextual transformer to capture the contextual relationships within and among modalities and improve the representativeness of the model. On the other hand, modalities differ in information quality, and some may carry misleading or ambiguous content; treating every modality equally then introduces modality perceptual noise and degrades multimodal classification performance. We therefore use modality-collaborative learning to filter misleading information, mitigate the quality gap between modalities, align each modality with the high-quality and informative ones, strengthen the unimodal representations, and obtain a more discriminative multimodal fusion. Comparative experiments on two benchmark image-text classification datasets, CrisisMMD and UPMC Food-101, show that the proposed model outperforms other classification methods, including state-of-the-art (SOTA) multimodal approaches. Ablation experiments verify the effectiveness of the cross-attention module, the multimodal contextual attention network, and modality-collaborative learning. Hyper-parameter validation experiments show that different fusion calculation methods lead to different results, and the most effective feature-tensor calculation method is identified. Qualitative experiments show that, compared with the original model, the proposed model identifies the expected results in the vast majority of cases. The code is available at https://github.com/KobeBryant8-24-MVP/CACT-MCL-MMC. CrisisMMD is available at https://dataverse.mpisws.org/dataverse/icwsm18, and UPMC Food-101 is available at https://visiir.isir.upmc.fr/. (c) 2024 SPIE and IS&T
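The authors' full implementation is in the linked GitHub repository; the minimal PyTorch sketch below only illustrates the general idea described in the abstract, namely bidirectional cross-attention between text and image token features followed by a learned per-modality weighting that stands in for modality-collaborative quality gating. The class name CrossAttentionFusion, the mean pooling, the gating scheme, and all dimensions are illustrative assumptions rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative sketch: fuse text and image token features with
    bidirectional cross-attention, then weight the two modalities with a
    learned gate (a rough stand-in for modality-collaborative weighting)."""

    def __init__(self, dim=768, heads=8, num_classes=2):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, Lt, D) token embeddings, e.g. from a text encoder
        # image_feats: (B, Li, D) patch/region embeddings, e.g. from a vision encoder
        t_ctx, _ = self.txt2img(text_feats, image_feats, image_feats)  # text attends to image
        i_ctx, _ = self.img2txt(image_feats, text_feats, text_feats)   # image attends to text
        t_vec = t_ctx.mean(dim=1)  # pooled image-aware text representation
        i_vec = i_ctx.mean(dim=1)  # pooled text-aware image representation
        w = self.gate(torch.cat([t_vec, i_vec], dim=-1))  # per-modality quality weights (B, 2)
        fused = w[:, :1] * t_vec + w[:, 1:] * i_vec       # weighted multimodal fusion
        return self.classifier(fused)

# Usage with random tensors standing in for encoder outputs
model = CrossAttentionFusion(dim=768, heads=8, num_classes=2)
logits = model(torch.randn(4, 32, 768), torch.randn(4, 49, 768))
print(logits.shape)  # torch.Size([4, 2])
```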
Pages: 23