Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations

Cited by: 9
Authors
Vibashan, V. S. [1 ]
Yu, Ning [2 ]
Xing, Chen [2 ]
Qin, Can [3 ]
Gao, Mingfei [2 ]
Niebles, Juan Carlos [2 ]
Patel, Vishal M. [1 ]
Xu, Ran [2 ]
Affiliations
[1] Johns Hopkins Univ, Baltimore, MD 21218 USA
[2] Salesforce Res, Hong Kong, Peoples R China
[3] Northeastern Univ, Boston, MA 02115 USA
DOI
10.1109/CVPR52729.2023.02254
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing instance segmentation models learn task-specific information using manual mask annotations from base (training) categories. These mask annotations require tremendous human effort, limiting the scalability to annotate novel (new) categories. To alleviate this problem, Open-Vocabulary (OV) methods leverage large-scale image-caption pairs and vision-language models to learn novel categories. In summary, an OV method learns task-specific information using strong supervision from base annotations and novel category information using weak supervision from image-caption pairs. This difference between strong and weak supervision leads to overfitting on base categories, resulting in poor generalization towards novel categories. In this work, we overcome this issue by learning both base and novel categories from pseudo-mask annotations generated by the vision-language model in a weakly supervised manner using our proposed Mask-free OVIS pipeline. Our method automatically generates pseudo-mask annotations by leveraging the localization ability of a pre-trained vision-language model for objects present in image-caption pairs. The generated pseudo-mask annotations are then used to supervise an instance segmentation model, freeing the entire pipeline from any labour-expensive instance-level annotations and overfitting. Our extensive experiments show that our method trained with just pseudo-masks significantly improves the mAP scores on the MS-COCO dataset and OpenImages dataset compared to the recent state-of-the-art methods trained with manual masks. Code and models are available at https://vibashan.github.io/ovis-web/.
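The abstract says pseudo-masks come from the localization ability of a vision-language model but does not spell out the steps. As a rough illustration only, the sketch below assumes the model has already produced a 2D activation map for one caption word and turns it into a binary pseudo-mask plus a bounding box by min-max normalization and a fixed threshold. The function name `pseudo_mask_from_activation`, the threshold value, and the toy map are all hypothetical and are not taken from the paper.

```python
def pseudo_mask_from_activation(activation, threshold=0.5):
    """Binarize a 2D activation map (list of rows) into a pseudo-mask.

    Assumption for illustration: min-max normalization followed by a
    fixed threshold. Returns (mask, box) where box is
    (x_min, y_min, x_max, y_max), or None if nothing is selected.
    """
    flat = [v for row in activation for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat map localizes nothing.
        return [[False] * len(row) for row in activation], None
    mask = [[(v - lo) / (hi - lo) >= threshold for v in row]
            for row in activation]
    coords = [(x, y) for y, row in enumerate(mask)
              for x, on in enumerate(row) if on]
    if not coords:
        return mask, None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return mask, (min(xs), min(ys), max(xs), max(ys))


# Toy activation map: strong response in a 2x2 block, weak noise elsewhere.
toy_map = [
    [0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.9],
    [0.0, 0.0, 0.9, 0.9],
    [0.0, 0.0, 0.0, 0.0],
]
mask, box = pseudo_mask_from_activation(toy_map)
print(box)
```

In a pipeline of the kind the abstract describes, masks obtained this way would stand in for manual annotations when training the segmentation model; how the actual method refines them is not stated in this record.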
Pages: 23539 - 23549
Page count: 11