Vision-Aware Language Reasoning for Referring Image Segmentation

Authors
Fayou Xu
Bing Luo
Chao Zhang
Li Xu
Mingxing Pu
Bo Li
Affiliations
[1] Xihua University, School of Computer and Software Engineering
[2] Sichuan Police College, Key Laboratory of Intelligent Policing
[3] Xihua University, School of Science
Source
Neural Processing Letters | 2023 / Vol. 55
Keywords
Referring image segmentation; Vision and language; Explainable language-structure reasoning
DOI
Not available
Abstract
Referring image segmentation is a multimodal task that aims to segment the object indicated by a natural-language expression from the paired image. However, the diversity of language annotations tends to cause semantic ambiguity, which makes the semantic representation produced by language feature encoding imprecise. Existing methods do not correct the language encoding module, so semantic errors in the language features cannot be repaired in later stages, resulting in semantic deviation. To this end, we propose a vision-aware language reasoning model. Intuitively, the segmentation result can be used to guide the reconstruction of language features, which can be expressed as a tree-structured recursive process. Specifically, we design a language reasoning encoding module and a mask loopback optimization module to optimize the language encoding tree. The feature weights of the tree nodes are learned through backpropagation. In traditional attention modules, matching local words against visual regions easily introduces noisy regions. To overcome this, we use global language prior information to compute the importance of different words and then weight the visual region features accordingly, which is embodied as a language-aware vision attention module. Experimental results on four benchmark datasets show that the proposed method improves performance.
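The attention idea in the abstract (a global language prior scores each word, and the resulting word-weighted language vector then weights visual region features) can be sketched roughly as follows. This is only an illustrative sketch under assumed choices, not the authors' implementation: the function name, the mean-pooled sentence vector, the scaled dot-product scoring, and the sigmoid gate are all assumptions that the abstract does not specify.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def language_aware_vision_attention(words, sentence, regions):
    """Hypothetical sketch of language-aware vision attention.

    words:    (T, d) per-word language features
    sentence: (d,)   global language prior (assumed: a pooled sentence vector)
    regions:  (N, d) visual region features
    """
    d = words.shape[1]
    # Importance of each word, scored against the global language prior.
    word_importance = softmax(words @ sentence / np.sqrt(d))   # (T,)
    # Importance-weighted summary of the expression.
    query = word_importance @ words                            # (d,)
    # Gate each visual region by its similarity to that summary (assumed sigmoid gate).
    region_scores = regions @ query / np.sqrt(d)               # (N,)
    region_weights = 1.0 / (1.0 + np.exp(-region_scores))      # (N,), values in (0, 1)
    weighted_regions = regions * region_weights[:, None]       # (N, d)
    return word_importance, region_weights, weighted_regions

# Toy usage with random features: 5 words, 12 regions, feature size 8.
rng = np.random.default_rng(0)
words = rng.normal(size=(5, 8))
sentence = words.mean(axis=0)   # assumed global prior: mean of word features
regions = rng.normal(size=(12, 8))
importance, weights, weighted = language_aware_vision_attention(words, sentence, regions)
```

Scoring words against a sentence-level vector, rather than against individual regions, is one way to reflect the abstract's point that purely local word-region matching tends to pull in noisy regions.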
Pages: 11313-11331
Page count: 18
Related papers
50 in total
  • [21] SATR: Semantics-Aware Triadic Refinement network for referring image segmentation
    Xie, Jialong
    Liu, Jin
    Wang, Guoxiang
    Zhou, Fengyu
    KNOWLEDGE-BASED SYSTEMS, 2024, 284
  • [22] Cross-modal transformer with language query for referring image segmentation
    Zhang, Wenjing
    Tan, Quange
    Li, Pengxin
    Zhang, Qi
    Wang, Rong
    NEUROCOMPUTING, 2023, 536 : 191 - 205
  • [23] Vision-aware target recognition toward autonomous robot by Kinect sensors
    Chang, Qiuxiang
    Xiong, Zhenkai
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2020, 84 (84)
  • [24] Vision-aware air-ground cooperative target localization for UAV and UGV
    Liu, Daqian
    Bao, Weidong
    Zhu, Xiaomin
    Fei, Bowen
    Xiao, Zhenliang
    Men, Tong
    AEROSPACE SCIENCE AND TECHNOLOGY, 2022, 124
  • [25] Hierarchical collaboration for referring image segmentation
    Zhang, Wei
    Cheng, Zesen
    Chen, Jie
    Gao, Wen
    Neurocomputing, 2025, 613
  • [26] Toward Robust Referring Image Segmentation
    Wu, Jianzong
    Li, Xiangtai
    Li, Xia
    Ding, Henghui
    Tong, Yunhai
    Tao, Dacheng
    IEEE Transactions on Image Processing, 2024, 33 : 1782 - 1794
  • [28] Token-word mixer meets object-aware transformer for referring image segmentation
    Zhang, Zhenliang
    Teng, Zhu
    Fan, Jack
    Zhang, Baopeng
    Fan, Jianping
    PATTERN RECOGNITION, 2024, 155
  • [29] LViT: Language Meets Vision Transformer in Medical Image Segmentation
    Li, Zihan
    Li, Yunxiang
    Li, Qingde
    Wang, Puyang
    Guo, Dazhou
    Lu, Le
    Jin, Dakai
    Zhang, You
    Hong, Qingqi
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2024, 43 (01) : 96 - 107
  • [30] Video Object Segmentation with Language Referring Expressions
    Khoreva, Anna
    Rohrbach, Anna
    Schiele, Bernt
    COMPUTER VISION - ACCV 2018, PT IV, 2019, 11364 : 123 - 141