Multi-Attention Network for Compressed Video Referring Object Segmentation

Cited by: 13
Authors
Chen, Weidong [1]
Hong, Dexiang [1]
Qi, Yuankai [2]
Han, Zhenjun [1]
Wang, Shuhui [3]
Qing, Laiyun [1]
Huang, Qingming [1, 3]
Li, Guorong [1]
Affiliations
[1] Univ Chinese Acad Sci, Beijing, Peoples R China
[2] Univ Adelaide, Adelaide, SA, Australia
[3] Chinese Acad Sci, ICT, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Compressed Video Understanding; Vision and Language; Dual-path; Dual-attention; Multi-modal Transformer;
DOI
10.1145/3503161.3547761
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Referring video object segmentation aims to segment the object referred to by a given language expression. Existing works typically require the compressed video bitstream to be decoded into RGB frames before segmentation, which increases computation and storage requirements and ultimately slows down inference. This may hamper their application in real-world scenarios with limited computing resources, such as autonomous cars and drones. To alleviate this problem, we explore the referring object segmentation task on compressed videos, namely on the original video data flow. Besides the inherent difficulty of the video referring object segmentation task itself, obtaining discriminative representations from compressed video is also challenging. To address this problem, we propose a multi-attention network that consists of a dual-path dual-attention module and a query-based cross-modal Transformer module. Specifically, the dual-path dual-attention module is designed to extract effective representations from compressed data in three modalities: I-frame, Motion Vector, and Residual. The query-based cross-modal Transformer first models the correlation between the linguistic and visual modalities; the fused multi-modality features then guide object queries to generate a content-aware dynamic kernel and to predict the final segmentation masks. Unlike previous works, we propose to learn just one kernel, which removes the complicated post mask-matching procedure of existing methods. Extensive experiments on three challenging datasets yield promising results and show the effectiveness of our method compared against several state-of-the-art methods proposed for processing RGB data. Source code is available at: https://github.com/DexiangHong/MANet.
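The single-kernel idea in the abstract can be illustrated with a minimal sketch. It assumes that one object query is projected to one 1x1 kernel, and that this kernel is applied to every frame's feature map to produce mask logits, so no mask-matching step is needed afterwards. All names, shapes, and the linear projection below are hypothetical and chosen for illustration; this is not the code from the MANet repository.

```python
# Illustrative sketch only; names and shapes are assumptions, not the MANet code.

def dynamic_kernel(query, projection):
    """Map one object query (length D) to one 1x1 kernel (length C).

    projection is a hypothetical D x C weight matrix.
    """
    channels = len(projection[0])
    return [
        sum(q * row[c] for q, row in zip(query, projection))
        for c in range(channels)
    ]


def segment_frames(kernel, frames):
    """Apply the same kernel to every frame's C x H x W feature map.

    Returns one H x W boolean mask per frame (logit > 0 means foreground).
    """
    masks = []
    for features in frames:
        height, width = len(features[0]), len(features[0][0])
        mask = [
            [
                sum(k * channel[y][x] for k, channel in zip(kernel, features)) > 0
                for x in range(width)
            ]
            for y in range(height)
        ]
        masks.append(mask)
    return masks


query = [1.0, 2.0]
projection = [[1.0, 0.0], [0.0, -1.0]]
kernel = dynamic_kernel(query, projection)  # [1.0, -2.0]

frame = [
    [[3.0, 0.0], [1.0, 5.0]],  # channel 0
    [[1.0, 1.0], [1.0, 2.0]],  # channel 1
]
masks = segment_frames(kernel, [frame, frame])
print(masks[0])  # [[True, False], [False, True]]
```

Because only one kernel is produced per expression, each frame yields exactly one candidate mask, which is why no step for selecting among multiple predicted masks appears in the sketch.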
Pages: 4416-4425
Page count: 10