Cross-modal fusion for multi-label image classification with attention mechanism

Cited by: 16
Authors
Wang, Yangtao [1]
Xie, Yanzhao [2]
Zeng, Jiangfeng [3]
Wang, Hanpin [1]
Fan, Lisheng [1]
Song, Yufan [4]
Affiliations
[1] Guangzhou Univ, Sch Comp Sci & Cyber Engn, Guangzhou, Peoples R China
[2] Huazhong Univ Sci & Technol, Wuhan Natl Lab Optoelect, Wuhan, Peoples R China
[3] Cent China Normal Univ, Sch Informat Management, Wuhan, Peoples R China
[4] Nanjing Univ Posts & Telecommun, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Graph convolution network; Attention mechanism; Cross-modal fusion; Multi-label image classification;
DOI
10.1016/j.compeleceng.2022.108002
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
For multi-label image classification, existing studies either rely on an inefficient multi-step training workflow that uses an attention mechanism to explore the (local) relationships between image target regions and their corresponding labels, or model the (global) label dependencies via a graph convolution network (GCN) but fail to efficiently fuse the image features with the label word vectors. To address these problems, we develop Cross-modal Fusion for Multi-label Image Classification with attention mechanism (termed CFMIC), which combines an attention mechanism and a GCN to capture the local and global label dependencies simultaneously in an end-to-end manner. CFMIC contains three key modules: (1) a feature extraction module with an attention mechanism, which generates an accurate feature for each input image by focusing on the relationships between image labels and image target regions; (2) a label co-occurrence embedding learning module, which uses a GCN to learn the relationships between different objects and generate the label co-occurrence embeddings; and (3) a cross-modal fusion module with Multi-modal Factorized Bilinear pooling (termed MFB), which efficiently fuses the above image features and label co-occurrence embeddings. Extensive experiments on MS-COCO and VOC2007 verify that CFMIC greatly improves convergence efficiency and produces better classification results than state-of-the-art approaches.
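The three modules named in the abstract can be pictured with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the softmax attention pooling, the symmetric adjacency normalization, the single GCN layer, and every name, dimension, and random weight below are hypothetical choices. Only the overall shape (attention-weighted image feature, GCN over a label graph, MFB fusion, one score per label) follows the abstract.

```python
import numpy as np

def attention_pool(regions, query):
    """Assumed attention: softmax-weighted average of region features."""
    scores = regions @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ regions                       # (d_img,)

def normalize_adjacency(A):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 (an assumed choice)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, H, W):
    """One graph convolution over the label graph, followed by ReLU."""
    return np.maximum(A_norm @ H @ W, 0.0)

def mfb_fuse(x, Y, U, V, k):
    """MFB: project both modalities, multiply element-wise, sum-pool over
    groups of k factors, then apply power and L2 normalization.
    x: (d_img,), Y: (C, d_lab), U: (d_img, k*o), V: (d_lab, k*o) -> (C, o)."""
    joint = (x @ U)[None, :] * (Y @ V)             # (C, k*o)
    z = joint.reshape(Y.shape[0], -1, k).sum(axis=2)
    z = np.sign(z) * np.sqrt(np.abs(z))            # power normalization
    norm = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.maximum(norm, 1e-12)             # L2 normalization

# Toy sizes and random weights, all hypothetical.
rng = np.random.default_rng(0)
C, n_regions, d_img, d_word, d_hid, k, o = 4, 6, 8, 5, 7, 3, 10
regions = rng.normal(size=(n_regions, d_img))      # stand-in CNN region features
query = rng.normal(size=d_img)                     # stand-in attention query
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)          # toy label co-occurrence graph
word_vecs = rng.normal(size=(C, d_word))           # stand-in label word vectors
W = rng.normal(size=(d_word, d_hid))
U = rng.normal(size=(d_img, k * o))
V = rng.normal(size=(d_hid, k * o))
w_out = rng.normal(size=o)

image_feat = attention_pool(regions, query)                      # module (1)
label_emb = gcn_layer(normalize_adjacency(A), word_vecs, W)      # module (2)
fused = mfb_fuse(image_feat, label_emb, U, V, k)                 # module (3)
probs = 1.0 / (1.0 + np.exp(-(fused @ w_out)))                   # one probability per label
```

In this sketch each label gets its own fused vector, so the final sigmoid yields independent per-label probabilities, which is what a multi-label setting requires. In a trained model the weights would be learned end to end rather than drawn at random.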
Pages: 12
Related papers
50 in total
  • [1] Cross-modal fusion for multi-label image classification with attention mechanism
    Wang, Yangtao
    Xie, Yanzhao
    Zeng, Jiangfeng
    Wang, Hanpin
    Fan, Lisheng
    Song, Yufan
    Computers and Electrical Engineering, 2022, 101
  • [2] Label-Guided Cross-Modal Attention Network for Multi-Label Aerial Image Classification
    Chen, Ying
    Zhang, Ding
    Han, Tao
    Meng, Xiaoliang
    Gao, Mianxin
    Wang, Teng
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21 : 1 - 5
  • [3] Multi-Scale Cross-Modal Spatial Attention Fusion for Multi-label Image Recognition
    Li, Junbing
    Zhang, Changqing
    Wang, Xueman
    Du, Ling
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2020, PT I, 2020, 12396 : 736 - 747
  • [4] Label graph learning for multi-label image recognition with cross-modal fusion
    Yanzhao Xie
    Yangtao Wang
    Yu Liu
    Ke Zhou
    Multimedia Tools and Applications, 2022, 81 : 25363 - 25381
  • [5] Label graph learning for multi-label image recognition with cross-modal fusion
    Xie, Yanzhao
    Wang, Yangtao
    Liu, Yu
    Zhou, Ke
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (18) : 25363 - 25381
  • [6] Cross-modal multi-label image classification modeling and recognition based on nonlinear
    Yuan, Shuping
    Chen, Yang
    Ye, Chengqiong
    Bhatt, Mohammed Wasim
    Saradeshmukh, Mhalasakant
    Hossain, Md Shamim
    NONLINEAR ENGINEERING - MODELING AND APPLICATION, 2023, 12 (01):
  • [7] Multi-Label Cross-modal Retrieval
    Ranjan, Viresh
    Rasiwasia, Nikhil
    Jawahar, C. V.
    2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 4094 - 4102
  • [8] Pyramidal Cross-Modal Transformer with Sustained Visual Guidance for Multi-Label Image Classification
    Li, Zhuohua
    Wang, Ruyun
    Zhu, Fuqing
    Han, Jizhong
    Hu, Songlin
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 740 - 748
  • [9] Multi-modal bilinear fusion with hybrid attention mechanism for multi-label skin lesion classification
    Wei, Yun
    Ji, Lin
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (24) : 65221 - 65247
  • [10] Fast Graph Convolution Network Based Multi-label Image Recognition via Cross-modal Fusion
    Wang, Yangtao
    Xie, Yanzhao
    Liu, Yu
    Zhou, Ke
    Li, Xiaocui
    CIKM '20: PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, 2020, : 1575 - 1584