Dual-graph hierarchical interaction network for referring image segmentation

Cited by: 1
Authors
Shi, Zhaofeng [1]
Wu, Qingbo [1]
Li, Hongliang [1]
Meng, Fanman [1]
Ngan, King Ngi [1]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Informat & Commun Engn, Chengdu 611731, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Referring image segmentation; Graph reasoning; Hierarchical interaction; BLIND QUALITY ASSESSMENT; MOVEMENT; HEAD;
DOI
10.1016/j.displa.2023.102575
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Referring Image Segmentation (RIS) aims to extract the object or stuff from an image according to a given natural language expression. As a representative multi-modal reasoning task, the main challenge of RIS lies in accurately understanding and aligning two types of heterogeneous data (i.e., image and text). Existing methods typically perform this task through implicit cross-modal fusion of visual and linguistic features. Because these features are extracted separately by different encoders and have distinct latent representation structures, such fusion struggles to capture accurate image-text alignment information. In this paper, we propose a Dual-Graph Hierarchical Interaction Network (DGHIN) to facilitate explicit and comprehensive alignment between the image and text data. First, two graphs are built separately for the initial visual and linguistic features extracted with different pre-trained encoders. By means of graph reasoning, we obtain a unified representation structure for the two modalities that captures the intra-modal entities and their contexts, where each projected node incorporates long-range dependencies into the latent representation. Then, the Hierarchical Interaction Module (HIM) is applied to the visual and linguistic graphs to extract comprehensive inter-modal correlations at the entity level and the graph level, which not only captures the corresponding keywords and visual patches but also draws the whole sentence closer to the image region with the consistent context in the latent space. Extensive experiments on RefCOCO, RefCOCO+, G-Ref, and ReferIt demonstrate that the proposed DGHIN outperforms many state-of-the-art methods. Code is available at https://github.com/ZhaofengSHI/referring-DGHIN.
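The pipeline outlined in the abstract (project each modality onto a graph, reason over the graph, then interact at the entity level and the graph level) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: all function names, the random projection weights standing in for learned parameters, the node counts, and the use of softmax attention and cosine similarity are assumptions made purely for illustration. The released code at the URL above is the authoritative reference.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project_to_graph(features, num_nodes, rng):
    # Softly assign N input features (N, D) to K graph nodes (K, D).
    # The random matrix is a stand-in for learned projection weights (assumption).
    d = features.shape[1]
    w_assign = rng.standard_normal((d, num_nodes)) / np.sqrt(d)
    assign = softmax(features @ w_assign, axis=0)  # (N, K), columns sum to 1
    return assign.T @ features                     # (K, D)

def graph_reasoning(nodes):
    # One round of message passing over a similarity-based adjacency,
    # so each node mixes in context from every other node.
    d = nodes.shape[1]
    adj = softmax(nodes @ nodes.T / np.sqrt(d), axis=-1)  # (K, K), rows sum to 1
    return nodes + adj @ nodes                            # residual update

def entity_level_interaction(vis_nodes, lang_nodes):
    # Each visual node attends over the linguistic nodes (hypothetical form
    # of the "keyword <-> visual patch" correspondence).
    d = vis_nodes.shape[1]
    attn = softmax(vis_nodes @ lang_nodes.T / np.sqrt(d), axis=-1)  # (Kv, Kl)
    return vis_nodes + attn @ lang_nodes, attn

def graph_level_similarity(vis_nodes, lang_nodes):
    # Cosine similarity between mean-pooled graphs (hypothetical form of the
    # "whole sentence <-> image region" alignment).
    v = vis_nodes.mean(axis=0)
    l = lang_nodes.mean(axis=0)
    return float(v @ l / (np.linalg.norm(v) * np.linalg.norm(l) + 1e-8))

rng = np.random.default_rng(0)
visual_feats = rng.standard_normal((64, 32))  # e.g. 8x8 patches, 32-dim (made up)
text_feats = rng.standard_normal((10, 32))    # e.g. 10 word tokens, 32-dim (made up)

vis_graph = graph_reasoning(project_to_graph(visual_feats, 8, rng))
lang_graph = graph_reasoning(project_to_graph(text_feats, 4, rng))
fused, attn = entity_level_interaction(vis_graph, lang_graph)
score = graph_level_similarity(fused, lang_graph)
```

In a trained model the graph-level score would typically feed a loss that pulls matching sentence and region representations together, while the fused visual nodes would be projected back onto the feature map for mask prediction. Both of these steps are omitted here.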
Pages: 12