Multi-Attention Network for Compressed Video Referring Object Segmentation

Cited by: 13
Authors
Chen, Weidong [1 ]
Hong, Dexiang [1 ]
Qi, Yuankai [2 ]
Han, Zhenjun [1 ]
Wang, Shuhui [3 ]
Qing, Laiyun [1 ]
Huang, Qingming [1 ,3 ]
Li, Guorong [1 ]
Affiliations
[1] Univ Chinese Acad Sci, Beijing, Peoples R China
[2] Univ Adelaide, Adelaide, SA, Australia
[3] Chinese Acad Sci, ICT, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Compressed Video Understanding; Vision and Language; Dual-path; Dual-attention; Multi-modal Transformer;
DOI
10.1145/3503161.3547761
CLC Classification Number
TP39 [Computer applications];
Discipline Codes
081203; 0835;
Abstract
Referring video object segmentation aims to segment the object referred to by a given language expression. Existing works typically require the compressed video bitstream to be decoded into RGB frames before segmentation, which increases computation and storage requirements and ultimately slows inference. This hampers applications in real-world, resource-limited scenarios such as autonomous cars and drones. To alleviate this problem, we explore the referring object segmentation task on compressed videos, i.e., directly on the original video data flow. Beyond the inherent difficulty of referring video object segmentation itself, obtaining discriminative representations from compressed video is also challenging. To address this, we propose a multi-attention network consisting of a dual-path dual-attention module and a query-based cross-modal Transformer module. Specifically, the dual-path dual-attention module is designed to extract effective representations from compressed data in three modalities, i.e., I-frame, Motion Vector, and Residual. The query-based cross-modal Transformer first models the correlation between the linguistic and visual modalities; the fused multi-modal features then guide object queries to generate a content-aware dynamic kernel and to predict the final segmentation masks. Different from previous works, we propose to learn just one kernel, which removes the complicated post-hoc mask-matching procedure of existing methods. Extensive experiments on three challenging datasets demonstrate the effectiveness of our method compared with several state-of-the-art methods designed for RGB data. Source code is available at: https://github.com/DexiangHong/MANet.
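The single-kernel idea in the abstract can be sketched roughly as follows. This is a toy NumPy illustration, not the paper's implementation: the shapes, the single-head attention, the additive fusion, and the sigmoid scoring are all simplifying assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, text, d):
    """Single-head cross-attention: visual tokens attend to language tokens.
    visual: (N, d) flattened frame features; text: (L, d) word embeddings."""
    scores = visual @ text.T / np.sqrt(d)   # (N, L) visual-word similarity
    return softmax(scores, axis=-1) @ text  # (N, d) language-aware features

def predict_mask(visual, text, query):
    """Generate ONE content-aware dynamic kernel from a single object query
    and apply it (like a 1x1 conv) over the fused features to score pixels."""
    n, d = visual.shape
    fused = visual + cross_modal_attention(visual, text, d)  # (N, d) fusion
    attn = softmax(query @ fused.T / np.sqrt(d))             # (N,) query attn
    kernel = attn @ fused                                    # (d,) dynamic kernel
    logits = fused @ kernel                                  # (N,) pixel scores
    return 1.0 / (1.0 + np.exp(-logits))                     # mask probabilities

rng = np.random.default_rng(0)
mask = predict_mask(rng.normal(size=(16, 8)),  # 4x4 frame, 8-dim features
                    rng.normal(size=(5, 8)),   # 5-word expression
                    rng.normal(size=(8,)))     # single object query
print(mask.shape)
```

Because only one query and one kernel are used, exactly one mask comes out, which is why no post-hoc mask-matching step is needed.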
Pages: 4416 - 4425
Number of pages: 10
Related Papers
50 records in total
  • [1] Multi-Attention Network for Unsupervised Video Object Segmentation
    Zhang, Guifang
    Wong, Hon-Cheng
    Lo, Sio-Long
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2021, 28 : 71 - 75
  • [2] MAIN: Multi-Attention Instance Network for video segmentation
    Alcazar, Juan Leon
    Bravo, Maria A.
    Jeanneret, Guillaume
    Thabet, Ali K.
    Brox, Thomas
    Arbelaez, Pablo
    Ghanem, Bernard
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2021, 210
  • [3] Multi-Video-Object Segmentation based on SOFM Network for Compressed Video Sequences
    Fu Wenxiu
    Wang Lei
    Wang Xu
    [J]. ICNC 2008: FOURTH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, VOL 3, PROCEEDINGS, 2008, : 255 - 259
  • [4] Multi-attention embedded network for salient object detection
    Wei He
    Chen Pan
    Wenlong Xu
    Ning Zhang
    [J]. Soft Computing, 2021, 25 : 13053 - 13067
  • [5] Hybrid multi-attention transformer for robust video object detection
    Moorthy, Sathishkumar
    K.S., Sachin Sakthi
    Arthanari, Sathiyamoorthi
    Jeong, Jae Hoon
    Joo, Young Hoon
    [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2025, 139
  • [6] Polyp Segmentation Network Combined With Multi-Attention Mechanism
    Jia, Lixin
    Hu, Yibiao
    Jin, Yan
    Xue, Zhizhong
    Jiang, Zhiwei
    Zheng, Qiufu
    [J]. JISUANJI FUZHU SHEJI YU TUXINGXUE XUEBAO/JOURNAL OF COMPUTER-AIDED DESIGN AND COMPUTER GRAPHICS, 2023, 35 (03): 463 - 473
  • [7] MACNet: Multi-Attention and Context Network for Polyp Segmentation
    Hao, Xiuzhen
    Pan, Haiwei
    Zhang, Kejia
    Chen, Chunling
    Bian, Xiaofei
    He, Shuning
    [J]. WEB AND BIG DATA, PT II, APWEB-WAIM 2022, 2023, 13422 : 369 - 384
  • [8] Multi-Attention Convolutional Neural Network for Video Deblurring
    Zhang, Xiaoqin
    Wang, Tao
    Jiang, Runhua
    Zhao, Li
    Xu, Yuewang
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (04) : 1986 - 1997
  • [9] Video Object Segmentation with Referring Expressions
    Khoreva, Anna
    Rohrbach, Anna
    Schiele, Bernt
    [J]. COMPUTER VISION - ECCV 2018 WORKSHOPS, PT IV, 2019, 11132 : 7 - 12