CLUE: Contrastive language-guided learning for referring video object segmentation

Authors
Gao, Qiqi [1]
Zhong, Wanjun [2]
Li, Jie [1]
Zhao, Tiejun [1]
Affiliations
[1] Harbin Inst Technol, 92 Xida St, Harbin 150001, Heilongjiang, Peoples R China
[2] Sun Yat Sen Univ, 135 Xingangxi Rd, Guangzhou 510275, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Video object segmentation; Multi-modal; Contrastive learning; Deep learning;
DOI
10.1016/j.patrec.2023.12.017
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Referring video object segmentation (R-VOS), the task of segmenting the object described by a natural language query from the video frames, has become increasingly critical with recent advances in multi-modal understanding. Existing approaches are mainly vision-dominant in both representation learning and decision making, and are less sensitive to fine-grained clues in the text description. To address this, we propose a language-guided contrastive learning and data augmentation framework that makes the model more sensitive to the fine-grained textual clues (i.e., color, location, subject) that relate closely to the video information. By substituting key information in the original sentences and paraphrasing them with a text-based generation model, our approach conducts contrastive learning by automatically building diverse and fluent contrastive samples. We further enhance the multi-modal alignment with a sparse attention mechanism, which finds the most relevant video information by optimal transport. Experiments on a large-scale R-VOS benchmark show that our method significantly improves on strong Transformer-based baselines, and further analysis demonstrates that our model is better at identifying textual semantics.
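The idea of building contrastive samples by substituting key information in a query can be illustrated with a minimal sketch. This record does not give the paper's substitution rules or loss function, so everything below is an assumption made for illustration: the `COLOR_WORDS` vocabulary, the helper names `substitute_color` and `info_nce`, the use of cosine similarity, and the choice of an InfoNCE-style loss with a temperature of 0.1. The paraphrasing step and the optimal-transport attention are not modeled.

```python
import math

# Hypothetical attribute vocabulary (assumption; not taken from the paper).
COLOR_WORDS = ["red", "blue", "green", "white", "black"]


def substitute_color(query):
    """Build negative queries by swapping each color word for every other color."""
    tokens = query.split()
    negatives = []
    for i, tok in enumerate(tokens):
        if tok in COLOR_WORDS:
            for other in COLOR_WORDS:
                if other != tok:
                    swapped = tokens[:i] + [other] + tokens[i + 1:]
                    negatives.append(" ".join(swapped))
    return negatives


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(video, pos_text, neg_texts, temperature=0.1):
    """InfoNCE-style loss: low when the video embedding is closer to the
    original query embedding than to the substituted (negative) ones."""
    logits = [cosine(video, pos_text) / temperature]
    logits += [cosine(video, neg) / temperature for neg in neg_texts]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


if __name__ == "__main__":
    print(substitute_color("a red car on the left"))
    # Toy 2-D embeddings: the video matches the positive text exactly.
    print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
```

In this toy setup the loss is near zero when the video embedding aligns with the original query and grows when it aligns with a substituted query, which is the behavior a contrastive objective of this kind is meant to encourage.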
Pages: 115-121
Number of pages: 7