Recurrent Multimodal Interaction for Referring Image Segmentation

Cited by: 103
Authors
Liu, Chenxi [1]
Lin, Zhe [2]
Shen, Xiaohui [2]
Yang, Jimei [2]
Lu, Xin [2]
Yuille, Alan [1]
Affiliations
[1] Johns Hopkins Univ, Baltimore, MD 21218 USA
[2] Adobe Res, San Jose, CA USA
DOI
10.1109/ICCV.2017.143
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405
Abstract
In this paper we are interested in the problem of image segmentation given natural language descriptions, i.e., referring expressions. Existing works tackle this problem by first modeling images and sentences independently and then segmenting images by combining these two types of representations. We argue that learning word-to-image interaction is more native in the sense of jointly modeling two modalities for the image segmentation task, and we propose a convolutional multimodal LSTM to encode the sequential interactions between individual words, visual information, and spatial information. We show that our proposed model outperforms the baseline model on benchmark datasets. In addition, we analyze the intermediate output of the proposed multimodal LSTM approach and empirically explain how this approach enforces a more effective word-to-image interaction.
Pages: 1280-1289
Page count: 10
Related papers
50 in total
  • [1] Structured Multimodal Fusion Network for Referring Image Segmentation
    Xue, Mingcheng
    Liu, Yu
    Xu, Kaiping
    Zhang, Haiyang
    Yu, Chengyang
    [J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2022, 2022 : 36 - 47
  • [2] Referring Image Segmentation via Recurrent Refinement Networks
    Li, Ruiyu
    Li, Kaican
    Kuo, Yi-Chun
    Shu, Michelle
    Qi, Xiaojuan
    Shen, Xiaoyong
    Jia, Jiaya
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018 : 5745 - 5753
  • [3] Bilateral Knowledge Interaction Network for Referring Image Segmentation
    Ding, Haixin
    Zhang, Shengchuan
    Wu, Qiong
    Yu, Songlin
    Hu, Jie
    Cao, Liujuan
    Ji, Rongrong
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 2966 - 2977
  • [4] Cross-Modal Recurrent Semantic Comprehension for Referring Image Segmentation
    Shang, Chao
    Li, Hongliang
    Qiu, Heqian
    Wu, Qingbo
    Meng, Fanman
    Zhao, Taijin
    Ngan, King Ngi
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (07) : 3229 - 3242
  • [5] Dual-graph hierarchical interaction network for referring image segmentation
    Shi, Zhaofeng
    Wu, Qingbo
    Li, Hongliang
    Meng, Fanman
    Ngan, King Ngi
    [J]. DISPLAYS, 2023, 80
  • [6] Hierarchical collaboration for referring image segmentation
    Zhang, Wei
    Cheng, Zesen
    Chen, Jie
    Gao, Wen
    [J]. NEUROCOMPUTING, 2025, 613
  • [7] Toward Robust Referring Image Segmentation
    Wu, Jianzong
    Li, Xiangtai
    Li, Xia
    Ding, Henghui
    Tong, Yunhai
    Tao, Dacheng
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1782 - 1794
  • [9] Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation
    Liu, Chang
    Ding, Henghui
    Zhang, Yulun
    Jiang, Xudong
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 3054 - 3065
  • [10] Decoupling Multimodal Transformers for Referring Video Object Segmentation
    Gao, Mingqi
    Yang, Jinyu
    Han, Jungong
    Lu, Ke
    Zheng, Feng
    Montana, Giovanni
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 4518 - 4528