Spectrum-guided Multi-granularity Referring Video Object Segmentation

Cited by: 2
Authors
Miao, Bo [1]
Bennamoun, Mohammed [1]
Gao, Yongsheng [2]
Mian, Ajmal [1]
Affiliations
[1] Univ Western Australia, Perth, WA, Australia
[2] Griffith Univ, Griffith, NSW, Australia
Funding
Australian Research Council
DOI
10.1109/ICCV51070.2023.00091
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Current referring video object segmentation (R-VOS) techniques extract conditional kernels from encoded (low-resolution) vision-language features and use them to segment the decoded high-resolution features. We find that this causes significant feature drift, which the segmentation kernels struggle to perceive during the forward computation and which degrades their segmentation ability. To address the drift problem, we propose a Spectrum-guided Multi-granularity (SgMg) approach, which performs direct segmentation on the encoded features and employs visual details to further optimize the masks. In addition, we propose Spectrum-guided Cross-modal Fusion (SCF) to perform intra-frame global interactions in the spectral domain for effective multimodal representation. Finally, we extend SgMg to perform multi-object R-VOS, a new paradigm that enables simultaneous segmentation of multiple referred objects in a video. This makes R-VOS not only faster but also more practical. Extensive experiments show that SgMg achieves state-of-the-art performance on four video benchmark datasets, outperforming the nearest competitor by 2.8 percentage points on Ref-YouTube-VOS. Our extended SgMg enables multi-object R-VOS and runs about 3x faster while maintaining satisfactory performance. Code is available at https://github.com/bo-miao/SgMg.
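The abstract's phrase "global interactions in the spectral domain" can be illustrated with a small sketch. It is not the authors' SCF module: it works on a 1D toy signal rather than 2D frame features, it omits the language conditioning entirely, and the filter gains are set by hand rather than learned. The helper names `dft`, `spectral_filter`, and `low_pass_gains` are hypothetical. The only point shown is that scaling frequencies elementwise lets every output position depend on every input position.

```python
import cmath


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a list of numbers."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        # The inverse transform carries the 1/n normalisation.
        out.append(total / n if inverse else total)
    return out


def spectral_filter(features, gains):
    """Scale each frequency of `features` by the matching entry of `gains`."""
    spectrum = dft(features)
    filtered = [s * g for s, g in zip(spectrum, gains)]
    # The gains used here are symmetric, so the result is real up to rounding.
    return [value.real for value in dft(filtered, inverse=True)]


def low_pass_gains(n, cutoff):
    """Hand-set symmetric filter (an assumption): keep frequencies near DC."""
    return [1.0 if min(k, n - k) <= cutoff else 0.0 for k in range(n)]


# A single active position (an impulse at index 3) in an 8-element feature row.
features = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mixed = spectral_filter(features, low_pass_gains(len(features), cutoff=1))
print([round(v, 3) for v in mixed])
```

After filtering, the single impulse has spread to all eight positions, which is the sense in which a per-frequency operation acts globally, in contrast to a small spatial convolution that only touches neighbouring positions.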
Pages: 920-930
Page count: 11