Multi-Scale Segmentation of Episodic Video Instance through Polarized Self-Attention Manipulation

Authors
Huang Y. [1]
He Z.-F. [1]
Yang H.-K. [1]
Zhao C.-R. [1]
Zhang Y.-H. [1]
Affiliation
[1] Faculty of Mechanical and Electrical Engineering, Kunming University of Science and Technology, Kunming
Funding
National Natural Science Foundation of China
Keywords
Polarized self-attention manipulation; PSAM-Net; Scale scaling; Spatial positioning branch; Topological deformation; Video instance segmentation;
DOI
10.11897/SP.J.1016.2022.02605
Abstract
Video instance segmentation (VIS) is a key technology for the vision systems of intelligent robots; robots equipped with VIS algorithms can accurately perform highly complex tasks such as target tracking and obstacle avoidance. When a robot moves autonomously through a scene, the images it acquires during environment perception are easily affected by its own speed, the shooting angle, the distance to the target, and the target's relative speed, so captured moving targets commonly exhibit random variations such as topological deformation and scale scaling. As a result, the discriminative feature representations that common models learn for the same target instance across adjacent frames of a video are generally diverse and uncertain. Existing VIS models mostly emphasize temporal interaction, such as inter-frame mask propagation or multi-scale feature tracking, and neglect deep semantic parsing of topological instances and multi-scale contour discrimination of targets; this seriously limits both effective attention to high-level fine-grained features and accurate localization of low-level spatial information. To address these issues, this paper proposes PSAM-Net, a multi-scale video instance segmentation model based on polarized self-attention manipulation. First, to establish positional correlations between arbitrary non-linear spaces and dependencies among orthogonal channels, we propose a single-stage and a cascaded polarized self-attention manipulation mechanism, each embedded in the residual network after every residual block in an optimal form.
These measures help overcome the dispersed regression distribution of fine-grained features in deep feature maps and strengthen the model's ability to focus on key regions, thereby accomplishing deep semantic parsing of topological instances. Second, a multi-scale spatial location model of multi-granularity spatial information is established; it compensates for the loss of low-level spatial location and the indistinct instance edge information caused by the top-down feature flow of the feature pyramid network. This better meets the requirements of target location detection and contour segmentation for foreground objects at different scales. Finally, we construct an episodic video dataset of animal instances extracted from YouTube-VIS and conduct extensive experiments to verify the competitive performance of our method. Compared with the YolactEdge benchmark model, PSAM-Net improves on several measures in comprehensive testing: average detection accuracy increases by 6.08% to reach 44.06%, and average segmentation accuracy increases by 8.87% to reach 44.41%. In addition, PSAM-Net processes 80 video frames per second, well above the real-time requirement for video instance segmentation. The model thus achieves real-time, high-precision segmentation of video instances and provides a theoretical basis and reference for the autonomous environment perception of intelligent mobile robots. © 2022, Science Press. All rights reserved.
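The reported figures can be unpacked with a little arithmetic. The sketch below assumes the stated increases are absolute differences in percentage points; the abstract does not say whether they are absolute or relative, so the implied YolactEdge baselines are an inference, not reported results. The function names are hypothetical and chosen only for illustration.

```python
def implied_baseline(reported_pct, gain_points):
    """Baseline score, ASSUMING the gain is an absolute difference in percentage points."""
    return round(reported_pct - gain_points, 2)


def frame_budget_ms(fps):
    """Average time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# Figures taken from the abstract: reported score and stated increase.
det_baseline = implied_baseline(44.06, 6.08)  # detection
seg_baseline = implied_baseline(44.41, 8.87)  # segmentation
budget = frame_budget_ms(80)                  # 80 frames per second

print(f"implied detection baseline:    {det_baseline}%")
print(f"implied segmentation baseline: {seg_baseline}%")
print(f"per-frame budget at 80 fps:    {budget} ms")
```

Under that assumption the baseline would sit at roughly 37.98% for detection and 35.54% for segmentation, and 80 frames per second corresponds to an average of 12.5 ms per frame.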
Pages: 2605-2618
Page count: 13
Related papers
19 in total
  • [1] Chen Jia, Chen Ya-Song, Li Wei-Hao, et al., Application and prospect of deep learning in video object segmentation, Chinese Journal of Computers, 44, 3, pp. 609-631, (2021)
  • [2] He Kai-Ming, Gkioxari G, Dollar P, et al., Mask R-CNN, Proceedings of the IEEE International Conference on Computer Vision, pp. 2961-2969, (2017)
  • [3] Ren Shao-Qing, He Kai-Ming, Girshick R, et al., Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 6, pp. 1137-1149, (2017)
  • [4] Huang Zhao-Jin, Huang Li-Chao, Gong Yong-Chao, et al., Mask scoring R-CNN, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6402-6411, (2019)
  • [5] Ke Lei, Tai Yu-Wing, Tang Chi-Keung, Deep occlusion-aware instance segmentation with overlapping BiLayers, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4019-4028, (2021)
  • [6] Bolya D, Zhou C, Xiao F, et al., YOLACT: Real-time instance segmentation, Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9157-9166, (2019)
  • [7] Xie En-Ze, Sun Pei-Ze, Song Xiao-Ge, et al., PolarMask: Single shot instance segmentation with polar representation, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12193-12202, (2020)
  • [8] Tian Zhi, Shen Chun-Hua, Chen Hao, et al., FCOS: Fully convolutional one-stage object detection, Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627-9636, (2019)
  • [9] Liu Hua-Jun, Liu Fu-Qiang, Fan Xin-Yi, et al., Polarized self-attention: Towards high-quality pixel-wise regression, (2021)
  • [10] Lin Tsung-Yi, Dollar P, Girshick R, et al., Feature pyramid networks for object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 936-944, (2017)