Learning Depth Signal Guided Mixed Transformer for High-Performance Unsupervised Video Object Segmentation

Cited by: 0
Authors
Su T.-K. [1 ,2 ]
Song H.-H. [1 ,2 ]
Fan J.-Q. [3 ]
Zhang K.-H. [1 ,2 ]
Affiliations
[1] Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing University of Information Science and Technology, Nanjing, Jiangsu
[2] Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing, Jiangsu
[3] College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu
Source
Acta Electronica Sinica
Funding
National Natural Science Foundation of China
Keywords
depth estimation; mixed attention; multimodality; robust features; unsupervised video object segmentation; mixed transformer
DOI
10.12263/DZXB.20221162
Abstract
Existing unsupervised video object segmentation methods usually employ optical flow as a motion cue to improve model performance. However, optical flow estimation frequently contains errors that introduce substantial noise, especially for static objects or objects subject to complicated motion interference, and two-stream networks easily overfit to this noise, which severely degrades the segmentation model. To alleviate this, we propose a novel mixed transformer for unsupervised video object segmentation that efficiently fuses data from different modalities: by introducing depth signals, it learns more robust feature representations and reduces overfitting to noise. Specifically, the video frame, optical flow, and depth map are cropped into sets of fixed-size patches and concatenated into a triplet that serves as the transformer input. A linear layer followed by a position-encoding layer is applied to the triplet, producing the features to be encoded. These features are then integrated by a novel mixed attention module, which obtains a global receptive field and lets the features of the different modalities interact sufficiently, enhancing the global semantic features and improving the anti-interference ability of the model. A local-non-local semantic enhancement module is further developed to perceive refined target edges by introducing the inductive bias of local semantic information into the supplementary learning of non-local semantic features; in this way, the target region is refined while the anti-interference capability of the model is further improved. Finally, the enhanced features are fed into the transformer decoder to produce the predicted segmentation mask. Extensive experiments on four standard challenging benchmarks demonstrate that the proposed method performs favorably against state-of-the-art methods. © 2023 Chinese Institute of Electronics. All rights reserved.
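The pipeline described in the abstract (tri-modal patch embedding of frame, optical flow, and depth, followed by mixed attention over the concatenated tokens) can be summarized in a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names TriModalPatchEmbed and MixedAttentionBlock, all dimensions, the shared patch projection, and treating flow and depth as 3-channel images are hypothetical, and the paper's mixed attention and local-non-local semantic enhancement modules are more elaborate than a single self-attention block.

import torch
import torch.nn as nn

class TriModalPatchEmbed(nn.Module):
    # Crops frame, optical flow, and depth map into fixed-size patches,
    # projects each patch with a linear (strided-conv) layer, and adds a
    # learnable position encoding over the concatenated triplet.
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos = nn.Parameter(torch.zeros(1, 3 * num_patches, dim))

    def forward(self, frame, flow, depth):
        # Each input: (B, 3, H, W) -> (B, N, dim); flow and depth are assumed
        # to be rendered as 3-channel images (an assumption, not from the paper).
        tokens = [self.proj(x).flatten(2).transpose(1, 2) for x in (frame, flow, depth)]
        return torch.cat(tokens, dim=1) + self.pos  # (B, 3N, dim) triplet sequence

class MixedAttentionBlock(nn.Module):
    # Global self-attention over the concatenated modality tokens: every token
    # attends across all three modalities, giving a global receptive field and
    # cross-modal feature interaction.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out  # residual connection

if __name__ == "__main__":
    embed = TriModalPatchEmbed()
    block = MixedAttentionBlock()
    frame = flow = depth = torch.randn(1, 3, 224, 224)
    feats = block(embed(frame, flow, depth))
    print(feats.shape)  # torch.Size([1, 588, 256]); 588 = 3 * (224 // 16) ** 2

In the full model, features like these would pass through the local-non-local semantic enhancement stage and a transformer decoder to produce the segmentation mask; those stages are omitted from this sketch.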
Pages: 1388-1395
Page count: 7