Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval

Cited by: 14
Authors
Mafla, Andres [1]
Dey, Sounak [1]
Biten, Ali Furkan [1]
Gomez, Lluis [1]
Karatzas, Dimosthenis [1]
Affiliations
[1] UAB, Comp Vis Ctr, Barcelona, Spain
DOI
10.1109/WACV48630.2021.00407
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Scene text instances found in natural images carry explicit semantic information that can provide important cues for solving a wide array of computer vision problems. In this paper, we focus on leveraging multi-modal content, in the form of visual and textual cues, to tackle the task of fine-grained image classification and retrieval. First, we obtain the text instances from images by employing a text reading system. Then, we combine textual features with salient image regions to exploit the complementary information carried by the two sources. Specifically, we employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image. By obtaining an enhanced set of visual and textual features, the proposed model greatly outperforms the previous state of the art in two different tasks, fine-grained classification and image retrieval, on the ConText [23] and Drink Bottle [4] datasets.
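The abstract does not specify how the graph is built, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It treats salient-region features and text-instance features as nodes of one graph, projects both into a shared space, and applies a single graph-convolution step. The feature dimensions, the dot-product affinity with row-wise softmax, the residual connection, and all names (`reasoning_layer`, `w_vis`, `w_txt`, `w_gcn`) are assumptions made for illustration; the random inputs stand in for real detector and text-reader outputs.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax, so each node's edge weights sum to 1."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def random_matrix(rows, cols, rng, scale=0.1):
    """Gaussian-initialized matrix (placeholder for learned weights or features)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def reasoning_layer(visual, textual, w_vis, w_txt, w_gcn):
    """One graph-convolution step over visual-region and text-instance nodes."""
    # Project both modalities into a shared space and stack them as graph nodes.
    nodes = matmul(visual, w_vis) + matmul(textual, w_txt)
    # Dense affinity (assumed): pairwise dot products, normalized per node.
    transposed = [list(col) for col in zip(*nodes)]
    adjacency = softmax_rows(matmul(nodes, transposed))
    # Aggregate neighbors, transform, apply ReLU, and add a residual connection.
    messages = matmul(matmul(adjacency, nodes), w_gcn)
    enhanced = [[h + max(0.0, m) for h, m in zip(h_row, m_row)]
                for h_row, m_row in zip(nodes, messages)]
    return enhanced, adjacency

rng = random.Random(0)
visual = random_matrix(3, 8, rng, 1.0)   # 3 hypothetical region features, dim 8
textual = random_matrix(2, 6, rng, 1.0)  # 2 hypothetical text embeddings, dim 6
w_vis = random_matrix(8, 4, rng)         # visual projection to shared dim 4
w_txt = random_matrix(6, 4, rng)         # textual projection to shared dim 4
w_gcn = random_matrix(4, 4, rng)         # graph-convolution weight
enhanced, adjacency = reasoning_layer(visual, textual, w_vis, w_txt, w_gcn)
print(len(enhanced), len(enhanced[0]))
```

Each row of `enhanced` is a node feature that now mixes information from every other region and text node, which is one plausible reading of "relationship-enhanced features". A classifier or retrieval head would then pool these rows; that stage is omitted here.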
Pages: 4022-4032
Page count: 11