Toward Robust Referring Image Segmentation

被引:6
|
作者
Wu, Jianzong [1 ]
Li, Xiangtai [1 ]
Li, Xia [2 ]
Ding, Henghui [3 ]
Tong, Yunhai [1 ]
Tao, Dacheng [4 ,5 ]
机构
[1] Peking Univ, Sch Intelligence Sci & Technol, Natl Key Lab Gen Artificial Intelligence, Beijing 100871, Peoples R China
[2] Swiss Fed Inst Technol, Dept Comp Sci, CH-8092 Zurich, Switzerland
[3] Swiss Fed Inst Technol, Dept Informat Technol & Elect Engn, CH-8092 Zurich, Switzerland
[4] Univ Sydney, Camperdown, NSW 2050, Australia
[5] Nanyang Technol Univ, Sch Comp Sci & Engn SCSE, Singapore 639798, Singapore
关键词
Computer vision; image segmentation; natural language processing;
D O I
10.1109/TIP.2024.3371348
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Referring Image Segmentation (RIS) is a fundamental vision-language task that outputs object masks based on text descriptions. Many works have achieved considerable progress for RIS, including different fusion method designs. In this work, we explore an essential question, "What if the text description is wrong or misleading?" For example, the described objects are not in the image. We term such a sentence as a negative sentence. However, existing solutions for RIS cannot handle such a setting. To this end, we propose a new formulation of RIS, named Robust Referring Image Segmentation (R-RIS). It considers the negative sentence inputs besides the regular positive text inputs. To facilitate this new task, we create three R-RIS datasets by augmenting existing RIS datasets with negative sentences and propose new metrics to evaluate both types of inputs in a unified manner. Furthermore, we propose a new transformer-based model, called RefSegformer, with a token-based vision and language fusion module. Our design can be easily extended to our R-RIS setting by adding extra blank tokens. Our proposed RefSegformer achieves state-of-the-art results on both RIS and R-RIS datasets, establishing a solid baseline for both settings. Our project page is at https://github.com/jianzongwu/robust-ref-seg.
引用
收藏
页码:1782 / 1794
页数:13
相关论文
共 50 条
  • [1] Toward Robust Referring Image Segmentation
    Wu, Jianzong
    Li, Xiangtai
    Li, Xia
    Ding, Henghui
    Tong, Yunhai
    Tao, Dacheng
    IEEE Transactions on Image Processing, 2024, 33 : 1782 - 1794
  • [2] Toward Complex-query Referring Image Segmentation: A Novel Benchmark
    Ji, Wei
    Li, Li
    Fei, Hao
    Liu, Xiangyan
    Yang, Xun
    Li, Juncheng
    Zimmermann, Roger
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 21 (01)
  • [3] Hierarchical collaboration for referring image segmentation
    Zhang, Wei
    Cheng, Zesen
    Chen, Jie
    Gao, Wen
    NEUROCOMPUTING, 2025, 613
  • [4] Mask Grounding for Referring Image Segmentation
    Chng, Yong Xien
    Zheng, Henry
    Han, Yizeng
    Qiu, Xuchong
    Huang, Gao
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 26563 - 26573
  • [5] RRSIS: Referring Remote Sensing Image Segmentation
    Yuan, Zhenghang
    Mou, Lichao
    Hua, Yuansheng
    Zhu, Xiao Xiang
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 1 - 12
  • [6] Referring Image Segmentation Using Text Supervision
    Liu, Fang
    Liu, Yuhao
    Kong, Yuqiu
    Xu, Ke
    Zhang, Lihe
    Yin, Baocai
    Hancke, Gerhard
    Lau, Rynson
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 22067 - 22077
  • [7] Image Segmentation With Language Referring Expression and Comprehension
    Sun, Jiaxing
    Li, Yujie
    Cai, Jintong
    Lu, Huimin
    Serikawa, Seiichi
    IEEE SENSORS JOURNAL, 2022, 22 (18) : 17406 - 17413
  • [8] Distillation and Supplementation of Features for Referring Image Segmentation
    Tan, Zeyu
    Xu, Dahong
    Li, Xi
    Liu, Hong
    IEEE ACCESS, 2024, 12 : 171269 - 171279
  • [9] Recurrent Multimodal Interaction for Referring Image Segmentation
    Liu, Chenxi
    Lin, Zhe
    Shen, Xiaohui
    Yang, Jimei
    Lu, Xin
    Yuille, Alan
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1280 - 1289
  • [10] Contrastive Grouping with Transformer for Referring Image Segmentation
    Tang, Jiajin
    Zheng, Ge
    Shi, Cheng
    Yang, Sibei
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23570 - 23580