Learning Depth Signal Guided Mixed Transformer for High-Performance Unsupervised Video Object Segmentation

Cited by: 0
Authors
Su T.-K. [1 ,2 ]
Song H.-H. [1 ,2 ]
Fan J.-Q. [3 ]
Zhang K.-H. [1 ,2 ]
Affiliations
[1] Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing University of Information Science and Technology, Nanjing, Jiangsu
[2] Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing, Jiangsu
[3] College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu
Funding
National Natural Science Foundation of China
Keywords
depth estimation; mixed attention; multimodality; robust features; unsupervised video object segmentation; mixed transformer
DOI
10.12263/DZXB.20221162
Abstract
Existing unsupervised video object segmentation methods usually employ optical flow as a motion cue to improve model performance. However, optical flow estimation frequently involves errors and introduces substantial noise, especially for static objects or objects with complicated motion interference. Two-stream networks easily overfit to this noise, which severely degrades the segmentation model. To alleviate this, we propose a novel mixed transformer for unsupervised video object segmentation, which efficiently fuses data of different modalities by introducing depth signals to learn more robust feature representations and reduce the model's overfitting to noise. Specifically, the video frame, optical flow, and depth map are cropped into sets of fixed-size patches and concatenated to compose a triplet as the transformer input. A linear layer followed by a position-encoding layer is applied to the triplet, producing the features to be encoded. These features are then integrated by a novel mixed attention module, which obtains a global receptive field and lets the features of the different modalities interact sufficiently, enhancing the global semantic features and improving the anti-interference ability of the model. A local-non-local semantic enhancement module is further developed to perceive refined target edges by introducing the inductive bias of local semantic information into the supplementary learning of non-local semantic features. In this way, the target region is further refined while the anti-interference capability of the model is improved. Finally, the enhanced features are fed into the transformer decoder to produce the predicted segmentation mask. Extensive experiments on four standard challenging benchmarks demonstrate that the proposed method achieves favorable performance against state-of-the-art methods. © 2023 Chinese Institute of Electronics. All rights reserved.
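To make the triplet encoding and mixed-attention fusion described in the abstract more concrete, the following is a minimal PyTorch sketch. All class names (PatchEmbed, MixedAttentionBlock, TripletEncoder), the embedding dimension, the patch size, and the use of standard multi-head self-attention over the concatenated modality tokens are illustrative assumptions; the sketch omits the local-non-local semantic enhancement module and the decoder, and does not reproduce the authors' implementation.

```python
# Minimal sketch (assumed design) of the triplet-input mixed-attention encoder:
# frame, optical flow, and depth map are patchified, linearly embedded,
# position-encoded, and fused by joint self-attention over all modality tokens.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split one modality into fixed-size patches and linearly project each patch."""
    def __init__(self, in_ch, embed_dim=256, patch=16):
        super().__init__()
        # A strided convolution is equivalent to patchify + linear projection.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) patch tokens


class MixedAttentionBlock(nn.Module):
    """Joint self-attention over the concatenated tokens of all modalities,
    giving every token a global receptive field across frame, flow, and depth."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, tokens):                 # tokens: (B, 3N, D)
        h = self.norm1(tokens)
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
        return tokens + self.mlp(self.norm2(tokens))


class TripletEncoder(nn.Module):
    """Embed frame / flow / depth patches, add positional encoding, and run a
    stack of mixed-attention blocks over the concatenated triplet."""
    def __init__(self, dim=256, depth=4, patch=16, img_size=224):
        super().__init__()
        n = (img_size // patch) ** 2
        self.embed_rgb = PatchEmbed(3, dim, patch)
        self.embed_flow = PatchEmbed(3, dim, patch)   # flow rendered as a 3-channel map
        self.embed_depth = PatchEmbed(1, dim, patch)
        self.pos = nn.Parameter(torch.zeros(1, 3 * n, dim))
        self.blocks = nn.ModuleList(MixedAttentionBlock(dim) for _ in range(depth))

    def forward(self, frame, flow, depth_map):
        tokens = torch.cat([self.embed_rgb(frame),
                            self.embed_flow(flow),
                            self.embed_depth(depth_map)], dim=1) + self.pos
        for blk in self.blocks:
            tokens = blk(tokens)
        return tokens                          # handed to a decoder to predict the mask


if __name__ == "__main__":
    enc = TripletEncoder()
    f = torch.randn(1, 3, 224, 224)            # video frame
    of = torch.randn(1, 3, 224, 224)            # optical flow visualisation
    d = torch.randn(1, 1, 224, 224)            # monocular depth map
    print(enc(f, of, d).shape)                  # torch.Size([1, 588, 256])
```

In this sketch, fusion happens implicitly because the three modalities share one token sequence, so every attention layer can mix information across frame, flow, and depth; the paper's mixed attention module may organize this interaction differently.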
Pages: 1388-1395
Number of pages: 7