Vision-Aware Language Reasoning for Referring Image Segmentation

Authors
Fayou Xu
Bing Luo
Chao Zhang
Li Xu
Mingxing Pu
Bo Li
Affiliations
[1] Xihua University, School of Computer and Software Engineering
[2] Sichuan Police College, Key Laboratory of Intelligent Policing
[3] Xihua University, School of Science
Source
Neural Processing Letters | 2023, Vol. 55, No. 8
Keywords
Referring image segmentation; Vision and language; Explainable language-structure reasoning
DOI
Not available
Abstract
Referring image segmentation is a multimodal task that aims to segment the object indicated by a natural-language expression from the image paired with that expression. However, the diversity of language annotations tends to cause semantic ambiguity, which makes the semantic representation produced by language feature encoding imprecise. Existing methods do not correct the language encoding module, so semantic errors in the language features cannot be fixed in subsequent stages and lead to semantic deviation. To this end, we propose a vision-aware language reasoning model. Intuitively, the segmentation result can be used to guide the reconstruction of language features, which can be expressed as a tree-structured recursive process. Specifically, we design a language reasoning encoding module and a mask loopback optimization module to optimize the language encoding tree; the feature weights of the tree nodes are learned through backpropagation. In traditional attention modules, matching local words against visual regions easily introduces noisy regions. To overcome this problem, we use global language prior information to compute the importance of different words and use it to further weight the visual region features, which is embodied as a language-aware vision attention module. Experimental results on four benchmark datasets show that the proposed method improves performance.
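The abstract gives no formulas for the language-aware vision attention module, so the following is only a minimal sketch of the general idea: a global language summary scores each word, and the resulting word weights are used to reweight visual region features. Mean pooling for the global feature, scaled dot-product scoring, the sigmoid gate, and all function names (`word_importance`, `language_aware_attention`) are assumptions made for illustration, not the authors' implementation.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_importance(words):
    """Score each word against a global sentence feature.

    Assumption: the global language prior is the mean of the word features.
    """
    d = len(words[0])
    global_feat = [sum(w[i] for w in words) / len(words) for i in range(d)]
    scores = [dot(w, global_feat) / math.sqrt(d) for w in words]
    return softmax(scores)


def language_aware_attention(words, regions):
    """Reweight visual region features using importance-weighted language.

    Assumption: each region is gated by a sigmoid of its similarity to the
    importance-weighted sum of word features.
    """
    d = len(words[0])
    weights = word_importance(words)
    summary = [sum(a * w[i] for a, w in zip(weights, words)) for i in range(d)]
    gates = [
        1.0 / (1.0 + math.exp(-dot(r, summary) / math.sqrt(d)))
        for r in regions
    ]
    weighted = [[g * x for x in r] for g, r in zip(gates, regions)]
    return weights, gates, weighted


# Toy data: three word features and two region features in a 2-D space.
words = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0]]
weights, gates, weighted = language_aware_attention(words, regions)
```

In this toy input, two of the three words point along the first axis, so the first region, which lies on that axis, receives the larger gate. A learned model would replace the fixed pooling and scoring with trainable projections.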
Pages: 11313-11331
Page count: 18