Cross-modal fusion for multi-label image classification with attention mechanism

Cited by: 16
Authors
Wang, Yangtao [1]
Xie, Yanzhao [2]
Zeng, Jiangfeng [3]
Wang, Hanpin [1]
Fan, Lisheng [1]
Song, Yufan [4]
Affiliations
[1] Guangzhou Univ, Sch Comp Sci & Cyber Engn, Guangzhou, Peoples R China
[2] Huazhong Univ Sci & Technol, Wuhan Natl Lab Optoelect, Wuhan, Peoples R China
[3] Cent China Normal Univ, Sch Informat Management, Wuhan, Peoples R China
[4] Nanjing Univ Posts & Telecommun, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Graph convolution network; Attention mechanism; Cross-modal fusion; Multi-label image classification;
DOI
10.1016/j.compeleceng.2022.108002
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
For multi-label image classification, existing studies either rely on an inefficient multi-step training workflow that uses an attention mechanism to explore the (local) relationships between image target regions and their corresponding labels, or model the (global) label dependencies via a graph convolution network (GCN) but fail to efficiently fuse the image features with the label word vectors. To address these problems, we develop Cross-modal Fusion for Multi-label Image Classification with attention mechanism (termed CFMIC), which combines an attention mechanism and GCN to capture the local and global label dependencies simultaneously in an end-to-end manner. CFMIC contains three key modules: (1) a feature extraction module with an attention mechanism, which helps generate an accurate feature for each input image by focusing on the relationships between image labels and image target regions; (2) a label co-occurrence embedding learning module, which uses GCN to learn the relationships between different objects and generate the label co-occurrence embeddings; and (3) a cross-modal fusion module with Multi-modal Factorized Bilinear pooling (termed MFB), which efficiently fuses the above image features and label co-occurrence embeddings. Extensive experiments on MS-COCO and VOC2007 verify that CFMIC greatly improves convergence efficiency and produces better classification results than state-of-the-art approaches.
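As a rough illustration of two building blocks named in the abstract, the following NumPy sketch shows a single graph convolution layer over a label graph and MFB-style fusion of one image feature with per-label embeddings. It is not the authors' implementation: the function names, shapes, the symmetric adjacency normalization, and the choice to fuse the image feature with every label embedding are all assumptions made for the example. The MFB steps follow the commonly described recipe (project both inputs, multiply element-wise, sum-pool over windows of size k, then apply power and L2 normalization).

```python
import numpy as np


def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (assumed variant)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(A_norm, H, W):
    """One graph convolution: ReLU(A_norm @ H @ W).

    A_norm: (C, C) normalized label co-occurrence graph (hypothetical input)
    H:      (C, d_in) label word vectors or previous-layer embeddings
    W:      (d_in, d_out) learnable weights
    """
    return np.maximum(A_norm @ H @ W, 0.0)


def mfb_fuse(x, Y, U, V, k):
    """MFB-style fusion of one image feature with C label embeddings.

    x: (d_img,)       image feature
    Y: (C, d_lab)     label co-occurrence embeddings
    U: (d_img, k * o) projection for the image feature
    V: (d_lab, k * o) projection for the label embeddings
    Returns an array of shape (C, o).
    """
    joint = (x @ U)[None, :] * (Y @ V)        # element-wise product, (C, k*o)
    C = Y.shape[0]
    z = joint.reshape(C, -1, k).sum(axis=2)   # sum pooling, window size k
    z = np.sign(z) * np.sqrt(np.abs(z))       # power normalization
    norm = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.maximum(norm, 1e-12)        # L2 normalization per label


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, d_word, d_lab, d_img, k, o = 5, 8, 6, 10, 3, 4
    A = (rng.random((C, C)) > 0.5).astype(float)
    A = np.maximum(A, A.T)                    # make the toy graph symmetric
    np.fill_diagonal(A, 0.0)
    labels = gcn_layer(normalize_adjacency(A),
                       rng.standard_normal((C, d_word)),
                       rng.standard_normal((d_word, d_lab)))
    fused = mfb_fuse(rng.standard_normal(d_img), labels,
                     rng.standard_normal((d_img, k * o)),
                     rng.standard_normal((d_lab, k * o)), k)
    print(fused.shape)
```

In a full model, the fused per-label vectors would typically feed a classifier that produces one score per label, and the attention-based feature extractor would supply the image feature; both are omitted here.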
Pages: 12
Related papers
50 items in total
  • [21] Cross-modal attention for multi-modal image registration
    Song, Xinrui
    Chao, Hanqing
    Xu, Xuanang
    Guo, Hengtao
    Xu, Sheng
    Turkbey, Baris
    Wood, Bradford J.
    Sanford, Thomas
    Wang, Ge
    Yan, Pingkun
    MEDICAL IMAGE ANALYSIS, 2022, 82
  • [22] Multi-label Thoracic Disease Image Classification with Cross-Attention Networks
    Ma, Congbo
    Wang, Hu
    Hoi, Steven C. H.
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2019, PT VI, 2019, 11769 : 730 - 738
  • [23] Multi-Label Image Classification by Feature Attention Network
    Yan, Zheng
    Liu, Weiwei
    Wen, Shiping
    Yang, Yin
    IEEE ACCESS, 2019, 7 : 98005 - 98013
  • [24] Multi-label modality enhanced attention based self-supervised deep cross-modal hashing
    Zou, Xitao
    Wu, Song
    Zhang, Nian
    Bakker, Erwin M.
    Knowledge-Based Systems, 2022, 239
  • [26] Cross-modal image fusion guided by subjective visual attention
    Fang, Aiqing
    Zhao, Xinbo
    Zhang, Yanning
    NEUROCOMPUTING, 2020, 414 : 333 - 345
  • [27] Multi-label adversarial fine-grained cross-modal retrieval
    Sun, Chunpu
    Zhang, Huaxiang
    Liu, Li
    Liu, Dongmei
    Wang, Lin
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2023, 117
  • [28] Deep Noisy Multi-label Learning for Robust Cross-Modal Retrieval
    Pu, Ruitao
    Peng, Dezhong
    Hua, Fujun
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 304 - 317
  • [29] DEEP PAIRWISE RANKING WITH MULTI-LABEL INFORMATION FOR CROSS-MODAL RETRIEVAL
    Jian, Yangwo
    Xiao, Jing
    Cao, Yang
    Khan, Asad
    Zhu, Jia
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1810 - 1815
  • [30] Multi-label semantics preserving based deep cross-modal hashing
    Zou, Xitao
    Wang, Xinzhi
    Bakker, Erwin M.
    Wu, Song
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2021, 93