Cross-modal fusion for multi-label image classification with attention mechanism

Cited by: 16
|
Authors
Wang, Yangtao [1 ]
Xie, Yanzhao [2 ]
Zeng, Jiangfeng [3 ]
Wang, Hanpin [1 ]
Fan, Lisheng [1 ]
Song, Yufan [4 ]
Affiliations
[1] Guangzhou Univ, Sch Comp Sci & Cyber Engn, Guangzhou, Peoples R China
[2] Huazhong Univ Sci & Technol, Wuhan Natl Lab Optoelect, Wuhan, Peoples R China
[3] Cent China Normal Univ, Sch Informat Management, Wuhan, Peoples R China
[4] Nanjing Univ Posts & Telecommun, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Graph convolution network; Attention mechanism; Cross-modal fusion; Multi-label image classification;
DOI
10.1016/j.compeleceng.2022.108002
CLC number
TP3 [Computing Technology, Computer Technology];
Discipline code
0812;
Abstract
For multi-label image classification, existing studies either rely on an inefficient multi-step training workflow to explore the (local) relationships between image target regions and their corresponding labels with an attention mechanism, or model the (global) label dependencies via a graph convolution network (GCN) but fail to efficiently fuse the resulting image features and label word vectors. To address these problems, we develop Cross-modal Fusion for Multi-label Image Classification with attention mechanism (termed CFMIC), which combines an attention mechanism and a GCN to capture the local and global label dependencies simultaneously in an end-to-end manner. CFMIC mainly contains three key modules: (1) a feature extraction module with an attention mechanism, which generates an accurate feature for each input image by focusing on the relationships between image labels and image target regions; (2) a label co-occurrence embedding learning module, which utilizes a GCN to learn the relationships between different objects and generate the label co-occurrence embeddings; and (3) a cross-modal fusion module with Multi-modal Factorized Bilinear pooling (termed MFB), which efficiently fuses the above image features and label co-occurrence embeddings. Extensive experiments on MS-COCO and VOC2007 verify that CFMIC greatly improves convergence efficiency and produces better classification results than state-of-the-art approaches.
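The abstract describes three modules (attention-based feature extraction, GCN-based label co-occurrence embedding, and MFB fusion). The following PyTorch sketch shows one way such a pipeline could be wired together; the backbone choice, layer dimensions, attention formulation, and all names (GCNLayer, MFBFusion, CFMICSketch) are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the three CFMIC modules outlined in the abstract.
# All sizes and module names below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph-convolution layer over the label co-occurrence graph."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):
        # adj: (L, L) normalized label co-occurrence matrix, x: (L, in_dim)
        return F.leaky_relu(adj @ self.weight(x))


class MFBFusion(nn.Module):
    """Multi-modal Factorized Bilinear pooling of image and label features."""
    def __init__(self, img_dim, lbl_dim, factor=5, out_dim=1024):
        super().__init__()
        self.factor, self.out_dim = factor, out_dim
        self.proj_img = nn.Linear(img_dim, factor * out_dim)
        self.proj_lbl = nn.Linear(lbl_dim, factor * out_dim)

    def forward(self, img_feat, lbl_feat):
        # Element-wise product of the two projections, sum-pooled over the factor dim.
        joint = self.proj_img(img_feat) * self.proj_lbl(lbl_feat)
        joint = joint.view(*joint.shape[:-1], self.out_dim, self.factor).sum(dim=-1)
        # Power and L2 normalization, the usual post-processing for MFB.
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)
        return F.normalize(joint, dim=-1)


class CFMICSketch(nn.Module):
    def __init__(self, backbone, num_labels, word_dim=300, img_dim=2048):
        super().__init__()
        self.backbone = backbone                      # CNN returning a (B, img_dim, H, W) map
        self.attn = nn.Linear(img_dim, num_labels)    # region-to-label attention scores
        self.gcn1 = GCNLayer(word_dim, 1024)
        self.gcn2 = GCNLayer(1024, img_dim)
        self.fusion = MFBFusion(img_dim, img_dim)
        self.classifier = nn.Linear(self.fusion.out_dim, 1)  # one logit per fused label feature

    def forward(self, images, label_word_vecs, adj):
        # (1) Attention over spatial regions yields label-specific image features.
        fmap = self.backbone(images)                        # (B, C, H, W)
        regions = fmap.flatten(2).transpose(1, 2)           # (B, HW, C)
        weights = torch.softmax(self.attn(regions), dim=1)  # (B, HW, L)
        img_feat = weights.transpose(1, 2) @ regions        # (B, L, C)
        # (2) Label co-occurrence embeddings from two stacked GCN layers.
        lbl_emb = self.gcn2(self.gcn1(label_word_vecs, adj), adj)        # (L, C)
        lbl_emb = lbl_emb.unsqueeze(0).expand(img_feat.size(0), -1, -1)  # (B, L, C)
        # (3) MFB fusion of the two modalities, then one logit per label.
        fused = self.fusion(img_feat, lbl_emb)              # (B, L, out_dim)
        return self.classifier(fused).squeeze(-1)           # (B, L) label logits
```

A plausible usage, under the same assumptions, would pass GloVe word vectors for the labels and a normalized co-occurrence matrix as `adj`, and train the whole model end-to-end with a multi-label binary cross-entropy loss.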
Pages: 12